Wednesday, November 16, 2005

Sticky Stale connections

My first encounter with the stale connections was when we first launched an interactive J2EE application on the Websphere Application Server 3.5.6 on a SUN Solaris box. The application frequently received StaleConnectionException (com.ibm.ejs.cm.portability.StaleConnectionException), an exception that became a potential showstopper for us. In spite of frantic googling and research and changes varying from implementation of retry logic in the application to changing the connection pool settings to changing the zparams values on the Db2 server, the sticky issue continued to haunt us. Frustrated with the results, we finally had to do away with the connection pooling support of WAS and obtain the connections directly from the Driver Manager instead. Now as we migrate the application from WAS 3.5.6 to WAS 5.1.1, we were again challenged by the same issue, but this time we finally seem to have got a solution. What follows is an account of the steps we took to combat this "sticky" exception.

Well before I forget, the package structure for the exception has changed since WAS 4, and it is now qualified as com.ibm.websphere.ce.cm.StaleConnectionException.
The first thing we did was to google on StaleConnectionException. Sure enough, there is enough documentation on the net on the possible causes of this exception. And any of these could cause the managed connection objects in the connection pool to become "stale". Any statement executed on this stale connection would cause the driver implementation to throw a SQLException with some SQL state and error code. The WAS implementation maps a set of these error codes to a sub class of the SQLException - StaleConnectionException - and throws this exception to the requesting application. What it also does is that it clears the connection pool off all the unusable connection objects from the datasource in question. So ideally, the application should receive a good connection object the next time it requests for one. This prompted us to implement a retry logic something like this:
do {
try {
get connection object;
execute statement;
...
break;
}catch(StaleConnectionException se) {
retryCount++;
}
}while(retryCount < MaxTtries);
Unfortunately, this did not help. Thinking that the pool implementation might be taking some time to refresh the connection pool, we tried putting delay between the tries... but to no avail.
Meanwhile, we were in touch with contacts from other projects. We got all the data access code reviewed to confirm that we were closing all the connection objects, and closing all the cursurs and statements explicitly. Thus there was no chance of spurious connection objects becoming stale.

We also tried changing the driver implementation from the type 2 db2java to type2/4 UDB driver but the exception continued to recur.

Sometime in the middle of all this, we also tinkered with a few connection pool parameters but without success. If only we had been a little more careful [:(] 'coz we ultimately found the solution in one of those parameters! Well, before that... by now we were almost certain that there was something going awry in the WAS connection pool implementation but influenced by the statement

Recovering from stale connections is a joint effort between the application
server run time and the application developer.
in the IBM infocenter for WAS 5.1 http://www-306.ibm.com/software/webservers/appserv/was/library/library51.html we implemented a solution to redirect the users to an error page with a friendly error message asking them to retry the operation. Just when we had given up the hope and were thinking that this was the best we could do, we stumbled upon a link that suggested that applications running behind a firewall might be affected by connection timeout setting at the firewall. This could inturn surface as StaleConnectionException for the application. They also suggested that the "unused connection timeout" - one of the connection pool parameters - be set to a value smaller than the firewall setting. It was then that we realized that the connections might be getting closed lower in the network stack, WAS runtime at the application layer being unaware of it. Please note that the pool implementation does not do any check on the validity of the managed connection object before returning one to the requesting application. We set the connection timeout value to 300 seconds and sure enough... we have not seen the exception post this change. Note that the unused connection timeout value should still be greater than the reap time (default 180 sec.) for it to make sense.

One more suggestion or rather a best practice to add to this "list" is to localize the data access as much as possible. While the DAO pattern helps to a certain extent, what it does not suggest is the localization of the code that uses the connection object. Have a single place where you open the database connection, make the backend call, and close the connection. (Avoid replicating the same in every data access method for instance).

Causes for the stale connections are many. Although we found our solution in the connection pool parameter setting, it's equally likely that some other application would find theirs in connection objects being left open in the code. What I am heading at is that if you are a victim of StaleConnectionException, you might want to try all the above options... and this might not even be an exhaustive list!