Category Archives: Case of the

GetType() is a Scalability Issue

TLDR; Avoid SafeHandles and Native\Managed transitions in scale-up applications (e.g. web servers, middle tiers), such as calls to GetType(). Additionally, you won’t see these types of performance issues until your application has high loads (i.e. when the Garbage Collector kicks in).

All the CPUs!

Recently, the Entity Framework team was investigating a customer’s performance issue, namely that they couldn’t get 100% CPU usage using EF on multiple threads, no matter how many threads they were using. The EF team had verified that the customer was using threads correctly (avoiding shared state, especially DbContext objects) and that the application was not being limited by IO (the test was run against a local SQL Server using a small enough data set that the results would be cached). They gathered some profiles from the customer, and saw that a large amount of time was spent waiting on Critical Sections in System.Data.SqlClient, so they enlisted my help.

As it turned out, the Critical Section was a red herring – a side effect of the way that we implemented synchronous network calls for MARS. However, there was another event that a lot of threads were waiting on which appeared to be reset when Garbage Collection started and set when it completed. Why were these threads waiting for the GC to complete? Because they wanted to disable being preempted by the GC and you can’t do that while the GC is running.

Clearing the table while the customers are still eating

In a managed application the Garbage Collector deals with memory management, however it needs to be able to do this while the program is running in a thread-safe manner. The typical solution to this is to simply have the GC pause all managed threads in the application while it collects and then resume them afterwards. The key in that above sentence is “managed thread”, there are times in a managed threads life that it has to delve into native code, such as when using PInvoke or using Managed C++ to call into Native C++ (as is the case for SqlClient). In this situation the thread needs to tell the runtime that it is switching to being a native thread, and a side effect of this is that the thread can no longer be preempted by the GC.

There is another situation where we don’t want the GC to interrupt us, and that is during a call to SafeHandle.ReleaseHandle(). The documentation for this method states “The garbage collector guarantees… that the method will not be interrupted while it is in progress”, and one of the things that can interrupt it is the GC itself.

Getting to GetType()

Walking further up the call stack, one of the methods that was disabling GC preemption were network read\write calls in SqlClient. However, the default packet size is 8Kb, so that should be enough data to keep the CPU busy between calls – we also reuse buffers, so there isn’t much stress on the GC. So these calls weren’t likely to be the issue.

The other call that kept appearing was GetType() – this was a surprise since I didn’t see why this method needed to disable the GC. As it turns out, every time GetType() is called it creates and disposes a SafeTypeNameParserHandle, which inherits from SafeHandle. Furthermore, there a number of calls to native methods and memory pinning. Put together, this thread puts a lot of stress on the GC and keeps disabling\enabling GC preemption. Even worse, as the customer added more threads, the GC was stressed more, and so the threads had to wait longer for the GC to finish.

Getting past GetType()

When checking type compatibility, there is no need to call GetType() – the as and is keywords are much faster (roughly 20x and 5x respectively, for single-threaded performance), they also take care of inheritance and interfaces. There are only two reasons why you may need to use GetType(): either you are checking that an object exactly matches a given type, or to see if an object has a particular method. For the first of these scenarios, why does it matter that an object is a specific type, or a sub-class? I honestly can’t think of a scenario where it would truly matter, unless you don’t trust the derived class to conform correctly to the contracts set up by the parent, which is a very unusual situation to be in. As for checking the existence of a method on an object, using reflection is probably a bad idea since the same named method doesn’t necessarily have the same purpose, consider Stream.Read vs DbDataReader.Read vs Interlocked.Read; or even more confusing Type.GetType vs Type.GetType – it is a much better practice to cast (via the as keyword) to the interface or class that has the method that you are looking for and use the casted object.

Worst case scenario, if you really can’t avoid it, then at least try to cache and reuse the result of GetType.

(As a side note, if you’re ever interested in seeing how the .Net Framework works, you can download and browse through the Reference Source)

Tagged , , , , ,

When in doubt, try the other port

First off – sorry for not posting in 6 months… I was somewhat preoccupied with finishing up my honors and starting my new job at Microsoft. As you may have noticed by my spelling, I’m now in the US and have settled in quite well. But enough about me, on to the blog post!

So, last Tuesday was Patch Tuesday and, lo and behold, my laptop couldn’t connect to Exchange, and my phone hadn’t synced since 3am that morning. Since I was foolish enough to have my server set to automatically install updates (which, by default, happens at 3am) I immediately knew what was wrong. Logging into the server confirmed my suspicions – the server had installed updates and now the Exchange Information Store service was stuck in the ‘starting’ state. I hit reset on the server, blocked my ears as the fans revved to maximum speed while the machine booted and waited. And waited. And waited. Something had gone horribly wrong – the hard drive light had stopped, but I couldn’t remote in; I tried plugging in my monitor – but it simply said “Mode not compatible: 74.9Hz”, most likely because the server hadn’t booted with the monitor connected. I hit the reset button and waited again, this time watching the boot sequence. Two things caused alarm bells to ring in my head – firstly the RAID status was “Verify” not “Optimal” and, secondly, Windows start up never got past “Applying computer settings” (with the “Donut of Doom” spinning ominously to the left of it).

My first reaction was to restore from the last backup – I had nightly backups going and Server 2K8 R2 makes restoring from backups (even onto ‘bare metal’) extremely easy. I rolled back to the last backup but, to my horror, Windows was still stuck at “Applying computer settings”. I tried backups that were even older, but to no avail – the server wouldn’t boot. By this time the fact that the RAID status was still “Verify” had gone from a curiosity in my mind to a possible cause – so I broke the array and attempted to restore onto a single disc, with that failing I recreated the array and tried again. Still nothing. At this point I was pretty desperate, I couldn’t remote in or physically log in, so I pulled out the debugging tools.

First up was the old faithful, the most useful tool that ships with Windows and is severely underrated and probably under used – Event Viewer. Even though the server hadn’t booted properly, and my laptop wasn’t on the same domain as the server, Event Viewer still managed to connect to the server. What I saw in the logs (past all of the spam that P3SS generates – I really need to fix that…) is that the MSExchange ADAccess was having 2102 and 2114 events, indicating that it couldn’t find the AD server – which was especially weird since the AD server WAS the Exchange server… But things began to make more sense now.

The “Applying computer settings” part of Windows start up is when a large majority of services running in Windows start. If these services hang then the machine gets ‘stuck’ in this screen. Except that these services should never hang because the Service Monitor is supposed to kill them after 30 seconds. But how do you kill a service that is in the process of “stopping” and refuses to respond? Exchange not finding the AD service meant that either AD or DNS was not running properly. Since I had to provide domain credentials to Event Viewer in order to log into to the server, AD must have been ok – therefore the DNS server had borked. And the easiest way to unbork a DNS server? Yank the network cable.

I pulled out the network cable from the server, and nothing happened. Damn. Perhaps the DNS services was still trying to bind to the static IP of the disconnected network card? So I plugged the cable into the secondary network card (which had a dynamic IP) and… the server finally booted! Logging in I found my theory confirmed again – the DNS service and a number of  Exchange services that were set to “Automatic” had not started. I started the services, and everything was running as smooth as butter – so I reset the server. And it got stuck again. A quick switch of network ports and kicking off some services, and we were back in business.

So there you have it – when in doubt, try the other network port!

Addendum: Binging “DSC_E_NO_SUITABLE_CDC” has resulted in a few things to try, including enabling IPv6 (which I don’t really want to do, as it tends to break Outlook Anywhere) and adding the Exchange Server to the “Domain Admins” group… I’ll try some of these over the weekend and let you all know how it goes!

Update: It looks like Travis Wright on the TechNet forums had the correct answer – if you had IPv6 enabled when you installed Exchange 2010, it needs to remain enabled. (In hindsight, this makes sense, as it may have been possible that the DNS service was trying to bind to a non-existent IPv6 address and that the Exchange AD Topology service may have been looking for AD on the IPv6 loopback address, and connecting the cable to the secondary card worked because that secondary card still had IPv6 enabled)

Tagged , ,