Tag Archives: connection pool

Connection Resiliency in ADO.Net

Despite “reliability” being one of the core features of TCP, network connections are typically unreliable, especially when being routed via the internet. This is even more of a problem with cloud computing, since most connections do route via the internet, and backend machines can go down at any moment for any reason. With this in mind, we have been making changes to ADO.Net to adjust from its client\server origins (with always-up enterprise hardware) to the cloud world (with commodity hardware).

Connection Pool Resiliency

In .Net 4.0 Reliability Update 1 we added a new feature referred to as “Connection Pool Resiliency” (or, as I liked to call it, “Check Connections in the Pool”). This feature would ensure that connections being retrieved from the connection pool were still alive before handing them back to the caller. We achieved this by asking TCP what the state of the connection was and, if TCP reported the connection to be disconnected, then we would create a new physical connection and return that to the caller.

Ensuring that only live connections were returned from the pool covered most low-traffic scenarios, since connections typically spend most of their time in the pool rather than being used. However, it provides little assistance for high-traffic scenarios (where connections do not remain in the pool for very long) or for code where connections are held open for a long time (which we strongly recommend against – but is inevitable in applications like SQL Server Management Studio).

One thing to be aware of for this type of resiliency is that we can only detect failures in the physical connection – if the connection’s state has become corrupt (e.g. commands always time out), then we will not be able to detect it and so we can not recover the connection. (In situations like these you can either abandon the pool by changing your connection string, or take the nuclear option and call ClearPool)

Idle Connection Resiliency

ADO.Net 4.5.1 combined with Azure SQL Database or SQL Server 2014 introduced a new feature to provide resiliency for more scenarios: Idle Connection Resiliency. With this feature, we can recover connections even if they were opened at the time that they failed. One of the difficulties with doing this is that SQL connections (unlike, say, a HTTP connection) have state – you may have changed the database you were connected to, or the collation that you were using, or the level of transaction isolation. Either way, when the connection is recovered, we need to ensure that it resumes whatever state it had previously (i.e. it should be completely transparent that you are now using a new physical connection). To do this we introduced a new TDS token called SESSIONSTATE (to allow the server to hand the client its current state as a blob) and a new Feature Extension called SESSIONRECOVERY (to resume the state on a recovered connection).

Let’s assume that you’ve attempted to execute a command on a dead connection. The process to recover it would look like this:

  1. Execute* is called, check if the connection is alive (using a similar mechanism as the Connection Pool Resiliency)
  2. If the connection is dead, then check if it is recoverable from both client and server perspective
  3. If recoverable, dispose the old connection and attempt to open a new connection with the state of the old connection – we will retry a given number of times and sleep between each retry for a given time
  4. Connection is recovered, resume the normal process

There are a few things to note from this. Firstly, the entire process is async – even if you are executing synchronously, we run the process as sync-over-async. Secondly, there are a number of situations where we can not recover the connection. From the client side, the connection must be idle – for most connections this is guaranteed since we perform the check just before execution, but for MARS connections, it is possible that there is another open session (it doesn’t have to be actively in use by another thread – just being open is enough). From the server side there are a number of situations where the state will not fit, or can not be represented in the SESSIONSTATE blob, for instance a temporary table.

Retry Logic

Even though we continue to improve ADO.Net’s ability to handle unreliable connections, this does not eliminate the need to have your own retry logic in your code – our connection recoveries are best effort and can only detect a limited set of faults. If you want a Microsoft supported implementation of generic retry logic, then I would recommend the Transient Fault Handing Application Block.

Bonus chatter: .Net 4.5.1 is now being pushed out via Windows Update, or you can grab it straight from the Download Center (if you’re running Windows 8.1 – and why wouldn’t you be? – then you already have 4.5.1).

I also added used a few links above to the revamped .Net Reference Source site – this contains the managed source code for most of .Net (including System.Data.dll).

Tagged , , , , , , , ,

Issue 14

(Rather than just reiterating the new features in ADO.NET that we announced for //Build/, I figured that I’d do a series of posts covering various features in depth – although this first "feature" shipped a bit earlier than the 4.5 Developer Preview)

What’s in a fix

If you remember last month’s Patch Tuesday, the first Reliability Update for .Net 4.0 was release, including a bug fix for System.Data.dll. However, those of you who read the support article would have been greeted by this cryptic message:


Issue 14
Consider the following scenario

  • You use the .NET Framework Data Provider for SQL Server (SqlClient) to connect to an instance of Microsoft SQL Azure or of Microsoft SQL Server.
  • An established connection is removed from the connection pool.
  • The first request is sent to the server.

In this scenario, an instance of SqlException is encountered, and you receive the following error message:
A transport-level error has occurred when sending the request to the server.


So given that description, can you tell what the original bug was, or what we fixed?
No? Neither can I – and I wrote the fix…

Historical Perspective

To explain "Issue 14", we’ve first got to look back at the history of ADO.NET, back to the 2.0 (or, possibly, 1.0) days. In the original design of the Connection Pool it was decided that, if there was a catastrophic failure of a connection, then the entire pool should be cleared. This was a reasonable assumption, since being unable to communicate with the server typically means that either the server is down (or restarted), the client network connection has died or failover has occurred – in any of these circumstances, it is unlikely that any other connection in the pool had survived.

Fast forward to today, and some of the original assumptions of the connection pool are no longer valid. Due to the increased popularity of cloud computing and connected devices, connections to SQL Servers are might not be going over ultra-fast and ultra-reliable links inside a data center. Instead, they may be going over the internet, which means that they are unreliable and could drop at any time. This, combined with SQL Azure’s policy of dropping connections that have been idle for over 30 minutes, meant that we could have one dead connection in the pool, but the rest would be ok.

Check Connections

So now we’re connecting over unreliable connections and still clearing the pool when it’s possible that only one of the connection had died. On top of that, we don’t know the connection is dead until someone tries to execute a command on the connection (and then gets an error, despite the fact that they had just opened the connection). So what to do?

Firstly, we are now checking the state of the connection when we remove it from the connection pool and, if its dead, giving you a new connection. This greatly decreases the likelihood that you will be trying to execute on a bad connection (although its still possible, as we are relying on Windows to know about the state of the underlying TCP connection, and since there is a race condition between us checking and you executing on the connection).

Secondly, we no longer clear the pool when there is a fatal error on a connection – so we’re no longer dropping (hopefully) good connections just because one connection is bad. Conversely, since we are checking connections before using them, we are still responsive to events like failover or network disconnects.

Best Practices

If you read the last section carefully, you would have noticed one of the caveats of this feature: "there is a race condition between us checking and you executing on the connection", which leads to my first recommendation:

Follow the Open-Execute-Close pattern

Every time you need to interact with SQL Server, you should open a fresh connection, execute your command (and deal with the data reader if you open one) and then close the connection (since SqlConnection, SqlCommand and SqlDataReader are all implement IDisposable, the best way to do this is with a ‘using’ statement). If you need to expose the data reader to a higher level API, and don’t want to cache the data in the data reader, then you should wrap the connection, command and reader inside a new class that implements IDisposable and return that instead.

My second recommendation relates back to my previous post on connection strings:

Use our connection pool

Despite the connection pooling code being rather old, it is extremely fast, reliable and it works. Opening connections from the pool and returning them afterwards is incredibly quick, especially when compared to opening a fresh connection or executing a command on a connection. Additionally we have the ability to introduce features like this which custom pooling code can’t.

Finally, this improvement is no replacement for proper retry code.

Improvements in 4.5

In the .Net 4.5 Developer Preview, we’ve made this feature more scalable, especially for high-throughput application servers where connections do not sit idle in the pool for very long.

As a final note, if you haven’t already upgraded to .Net 4.5, then you should make sure that you’ve installed the 4.0 Reliability Update.

Tagged , , , , , ,