MySQL can generate transient errors when running queries. These can be due to an overloaded MySQL server or to database deadlocks (when using InnoDB).

In many cases, simply rerunning the query would make it succeed, but Drupal currently has no handling for this.

The attached patch attempts to re-execute a failed query up to N times until it succeeds, depending on the error that was thrown. It works for MyISAM as well as InnoDB.

The patch is based on information from http://openquery.com/blog/error-handling-for-mysql-applications and applies cleanly to Drupal 6.12.
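
For illustration, the idea is roughly the following (a sketch only, not the attached patch; the function name, retry count and error-code list are illustrative):

    // Re-issue a query a few times when MySQL reports a transient error.
    // Illustrative only: function name, retry count and code list are made up.
    function _db_query_with_retry($query, $max_retries = 3) {
      $transient = array(1205, 1213);   // lock wait timeout, deadlock
      for ($attempt = 0; $attempt < $max_retries; $attempt++) {
        $result = mysql_query($query);
        if ($result !== FALSE) {
          return $result;
        }
        if (!in_array(mysql_errno(), $transient)) {
          // Not a transient error: stop retrying, fall back to normal handling.
          break;
        }
      }
      return FALSE;
    }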


Comments

cafuego’s picture

Actually, having this kind of error handling integrated into D7 would be great too.

cafuego’s picture

Version: 6.x-dev » 7.x-dev

I've had a look at the D7 database code, but I can't easily get my head around it.

Still, it would be great to have this error handling go into D7 before it is released, so I'm changing the version tag on this to 7.x.

Damien Tournoud’s picture

Highly transactional systems, like Drupal + MySQL, should never try to reissue transactions. Doing so can only provoke meltdowns (on the front-end server, or on the database server, depending on the root cause of the initial error).

The only error codes I think are worth discussing are: 2006 (CR_SERVER_GONE_ERROR) and 2013 (CR_SERVER_LOST). Those two can be triggered by scripts running longer than the wait_timeout, which can easily happen on cron (while fetching new RSS feeds, for example). Those generate the infamous "MySQL server has gone away" errors.
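
If those were to be handled at all, it would have to look something like this sketch (_db_reconnect() is a hypothetical helper; core has no such call and no auto-reconnect):

    // Hypothetical sketch, inside the mysql driver, after a query fails.
    $errno = mysql_errno();
    if ($errno == 2006 || $errno == 2013) {
      // CR_SERVER_GONE_ERROR / CR_SERVER_LOST: typically a long cron run that
      // exceeded wait_timeout. Re-open the connection and retry the query once.
      _db_reconnect();   // Hypothetical helper; core has no auto-reconnect.
      $result = mysql_query($query);
    }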

arjenlentz’s picture

Damien, Drupal is not highly transactional, it merely issues a lot of queries. Not the same thing at all ;-)
But the issue is that we're talking about transient errors, i.e. non-fatal error situations; the specific intent of those errors is that the app appropriately re-issues the query or transaction (a db server can't magically re-run an entire transaction as params could change, so it's something the app needs to handle). In any case, they are not "errors" at all, in the usual sense of the word.
Regarding them as fatal errors is, strictly speaking, a bug.

With any transactional engine (or other RDBMS, including PostgreSQL, Oracle, etc) a deadlock, for instance, is a perfectly normal and *transient* error, and an app should re-run the statement/transaction at least once.

Lock wait timeouts should also be handled; the key there is that the default innodb_lock_wait_timeout is 50 seconds, which is way too long. But that setting is not directly a Drupal-code concern: if a user is affected by the 50-second wait, re-running the transaction/query (which will then succeed) at least comes up with a real result rather than a failure. And on a properly configured production server, the user won't notice at all, exactly because the handling is correct.

I should also note in this respect that these things are really not MySQL specific; the misconceptions about error handling are mainly unfortunate bad development habits from the old "MyISAM non-transactional way of doing things". But even when using the MyISAM storage engine, the "proper" workflow for these cases works. For more detail on that background, more possible "errors" to handle, and appropriate ways of handling them, please see:

http://openquery.com/blog/error-handling-for-mysql-applications
and
http://openquery.com/blog/migrating-mysql-myisam-apps-to-the-innodb-stor...

arjenlentz’s picture

Code-wise, I have the following suggestions.

For the transient errors
- there doesn't need to be a delay before re-running the query.
- if inside an explicit transaction, issue a ROLLBACK before restarting.
- only retry once - perhaps twice.

For
case 11: // Resource temporarily unavailable (EAGAIN)
case 1040: // Too many connections
I'd agree that rerunning on these errors can cause serious trouble; those need to be reported through the backend instead. A rough sketch of the transient-error handling along those lines follows below.
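
Something like this inside the Drupal 6 mysql driver (a sketch only; $active_db is the driver's connection resource, and $in_explicit_transaction is an assumed flag, since core doesn't track that):

    // Sketch: retry once on a transient error, with no delay, issuing a
    // ROLLBACK first if we are inside an explicit transaction.
    $transient = array(1205, 1213);   // lock wait timeout, deadlock
    $result = mysql_query($query, $active_db);
    if ($result === FALSE && in_array(mysql_errno($active_db), $transient)) {
      if ($in_explicit_transaction) {
        mysql_query('ROLLBACK', $active_db);
      }
      $result = mysql_query($query, $active_db);
    }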

Damien Tournoud’s picture

@arjenlentz: if your front-end server receives 200 requests per second, blocking a request for 50ms to reissue a query means that you are blocking 10 requests. Drupal doesn't issue a lot of queries, but like all web applications, it is highly transactional. In that context, it is critical never to block.

I don't know about the lock-related error codes (1205, 1213, 1614), because I never saw those happening on production, but clearly retrying on 11 and 1040 (EAGAIN, Too many connections) is a very bad idea.

arjenlentz’s picture

@Damien: okidoki, then I suggest cafuego simply applies my code suggestions, and we'll be right!
For the transients no delay before reissuing is necessary, and I agree with you on 11 and 1040.
Agree?

Crell’s picture

I'll be perfectly honest and say that I've never seen one of those errors happen to me in production. Just how common are they? I'm a little concerned about adding complexity for an extreme edge case.

arjenlentz’s picture

They happen, naturally more under load, and also when components use multi-statement transactions. That's no reason not to handle them; it's just a fact and perfectly normal. All good db apps have this logic built in.

It's not an extreme edge case. It's just that MyISAM, through its design, is deadlock free. This has made app coders lazy, in a way - in reality they're just unaware that all other db servers and MySQL engines need to deal with it.

cafuego’s picture

@Crell: I have deadlocks happen on Drupal (and other applications) from time to time, even when load is not particularly high. It just depends on what the site users are doing.

You'll never hit these particular errors if you're using MyISAM though.

I'll rework the patch as per @arjenlentz's suggestions.

cafuego’s picture

Version: 7.x-dev » 6.x-dev
Status: Active » Needs review
FileSize
4.35 KB

I've updated the patch so it no longer checks errors 11 and 1040, and also added the transient error check to the mysqli connector.

I'm running it on my Drupal sites and it seems to be fine.

cafuego’s picture

Ugh. I got the mysqli_query() call back to front. More coffee needed. Updated patch (tested *and* working) attached.

mikeytown2’s picture

re-roll of #12.

Damien Tournoud’s picture

Version: 6.x-dev » 8.x-dev
Status: Needs review » Needs work

If you are coming here because of DEADLOCK issues on InnoDB, you are on the wrong issue. See #937284: DEADLOCK errors on MergeQuery INSERT due to InnoDB gap locking when condition in SELECT ... FOR UPDATE results in 0 rows for the cause and potential solutions.

Bumping this to 8.x since it needs to be dealt with there.

I'm not sold on retrying, except in cases where we are 100% certain it is safe to do so (we already do that in SQLite in a particular case). We should only retry once, and then give up. Looking at the current patch:

+      case 1205:    // Lock wait timeout exceeded; try restarting transaction
+      case 1213:    // Deadlock found when trying to get lock; try restarting transaction

^ Those look decently safe.

+      case 1614:    // Transaction branch was rolled back: deadlock was detected

^ I doubt retrying will be idempotent (the error says that a transaction branch was rolled back, after all), so we should not retry.

+      case 2013:    // Lost connection to MySQL server during query

^ We don't auto-reconnect, so there is no point in retrying on error 2013.

mikeytown2’s picture

D6 patch for Pressflow.

mikeytown2’s picture

Fixed a bug with the last patch

mikeytown2’s picture

Here is how to catch these errors in D7. The question is where this would go, and how to restart the transaction:
https://api.drupal.org/api/drupal/7/search/%3A%3Aexecute

    catch (PDOException $e) {
      // Try restarting the transaction.
      if ($e->errorInfo[1] == '1205') {
        // Lock wait timeout exceeded; try restarting transaction.
      }
      elseif ($e->errorInfo[1] == '1213') {
        // Deadlock found when trying to get lock; try restarting transaction.
      }
      else {
        throw $e;
      }
    }
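
For illustration, one hypothetical way to complete that catch block, retrying the statement a single time (the 'is_retry' option and the placement inside DatabaseConnection::query() are assumptions, and this still doesn't answer how to restart a whole transaction):

    catch (PDOException $e) {
      $transient = array(1205, 1213);
      // Retry once on a lock wait timeout or deadlock, then give up.
      if (empty($options['is_retry']) && in_array($e->errorInfo[1], $transient)) {
        $options['is_retry'] = TRUE;
        return $this->query($query, $args, $options);
      }
      throw $e;
    }
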
Damien Tournoud’s picture

Component: mysql database » ajax system
Issue summary: View changes
Status: Needs work » Closed (won't fix)

Contrary to what I said in #14, it seems like it is only safe to retry the last query on error 1205 (Lock wait timeout exceeded). All the other error codes are returned after the transaction has been rolled back. Retrying the transaction here would mean retrying the whole processing, and we don't know how to do that in Drupal.

On the other hand, error 1205 occurs after a configurable timeout, which is 50s by default. Retrying anything after having waited for 50s is just a no-go for a transactional application like Drupal.

This makes this whole issue a won't fix.