In the last couple of days I've had my connection count to MySQL go above a set threshold of 35, sometimes peaking at over 150. When this happens, SHOW PROCESSLIST and SHOW ENGINE INNODB STATUS are run so I can investigate what is causing the issue. After looking it over, it appears that TRUNCATE cache_field is the culprit. Looking over the options, it appears that TRUNCATE should not be run on a live table. This means we need to rename the table before the truncate happens.
http://bugs.mysql.com/bug.php?id=61935#c368453
http://stackoverflow.com/questions/17762361/truncate-table-taking-very-l...
http://dba.stackexchange.com/questions/20600/what-can-cause-truncate-tab...
http://whatwebwhat.com/en/page/466/truncating-big-mysql-innodb-tables
Comment | File | Size | Author
---|---|---|---
#46 | Query-61.txt | 17.52 KB | shardesty
#46 | show_process_list_2.txt | 9.38 KB | shardesty
#46 | show_process_list_1.txt | 19.44 KB | shardesty
#46 | query_2.txt | 1.44 KB | shardesty
#46 | query_1.txt | 1.19 KB | shardesty
Comments
Comment #1
mikeytown2 CreditAttribution: mikeytown2 commented
Comment #2
mikeytown2 CreditAttribution: mikeytown2 commented
The plot thickens... this means that AUTOCOMMIT will not help.
TruncateQuery::__toString
Comment #3
mikeytown2 CreditAttribution: mikeytown2 commented
Created a patch that does this.
This way the DDL statement will no longer lock other queries, and it's valid syntax: http://dev.mysql.com/doc/refman/5.0/en/rename-table.html
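The patch itself isn't inlined in this thread, but the rename-based approach can be sketched as follows. The `__truncated` suffix and the bin name are illustrative, not necessarily what the patch uses; multi-table RENAME is atomic per the linked manual page.

```sql
-- Atomically swap the live bin with a pre-created empty copy, then truncate
-- the old data off the hot path (table names are illustrative).
RENAME TABLE cache_field TO cache_field__tmp,
             cache_field__truncated TO cache_field,
             cache_field__tmp TO cache_field__truncated;

-- The slow part now runs against a table nothing is selecting from:
TRUNCATE TABLE cache_field__truncated;
```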
Comment #4
mikeytown2 CreditAttribution: mikeytown2 commented
Comment #5
danblack CreditAttribution: danblack commented
#2229013 is related, and as you can see in comment #6 there, I've done a delete by create time so that it doesn't block new inserts. Accesses of rows being deleted will, AFAIK, get a deadlock error, which is caught by the cache module and ignored.
The problem with renaming is that you've taken your entire cache offline for that time (despite it being quicker than deleting all rows). More recent MySQL versions effectively implement TRUNCATE TABLE the way you've done it in SQL.
Comment #6
mikeytown2 CreditAttribution: mikeytown2 commented
I'm having issues where a truncate takes over 30 seconds on MySQL 5.6, and I then get over 150 connections to the database, all waiting on metadata for the cache_* table, in a short amount of time. I do agree that since the rename is atomic, MySQL will lock here, temporarily preventing access to that table while the rename is happening; I believe it will not error out and will merely wait for the rename to finish (from what I've seen on our MySQL server). Drupal does not use any foreign keys on the cache tables, so luckily that is one less thing to worry about here. I'm also not dropping any tables, as that is slow.
There are some disadvantages when comparing DELETE FROM to TRUNCATE according to http://dba.stackexchange.com/questions/20600/what-can-cause-truncate-tab..., but I think you addressed this in #2229013: PDOException: SQLSTATE[HY000]: General error: 1030 Got error -1 from storage engine: TRUNCATE {cache_field} ; Array ( ) in cache by using the created column. The main issue I see with using created is accurate timekeeping in a multi-webhead environment and time resolution. There is the possibility of stale data persisting in the cache. Some of these concerns could be mitigated by using time() instead of REQUEST_TIME. Out-of-sync system clocks in a multi-webhead environment are where the created-based delete isn't as good as this patch, from my point of view. There is also a slight chance of old data not getting cleared because it is "<" instead of "<=".
Comment #7
danblack CreditAttribution: danblack commented
Yes, I believe I've addressed the DELETE FROM by adding the date criteria.
In addition to the previous time-based concerns mentioned above, there is the overhead of indexes, #165061.
How about adding a serial column on the cache table and then doing SELECT serialid FROM cache ORDER BY serialid DESC LIMIT 1; DELETE FROM cache WHERE serialid < {selected_id}? It mitigates the time issues and metadata locks, and allows new cache entries to be created after this. (I'll wait until #2224847 is committed before doing a patch on this approach.)
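Assuming `serialid` is an auto-increment column, the proposal amounts to something like the following (table and column names taken from the comment, the rest illustrative):

```sql
-- Capture the current high-water mark, then delete only up to it; rows
-- inserted after the SELECT get a larger serialid and are left alone.
SELECT serialid INTO @selected_id
FROM cache ORDER BY serialid DESC LIMIT 1;

DELETE FROM cache WHERE serialid < @selected_id;
```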
Comment #8
mikeytown2 CreditAttribution: mikeytown2 commented
I like what you're trying to do in #2224847: Automatically shorten cid's in Cache\DatabaseBackend and PhpBackend. Do you think using a serialid (assuming an auto-increment column) would introduce new issues, like hitting the 32-bit limit?
ps https://drupal.org/filter/tips#filter-filter_autop
Comment #9
danblack CreditAttribution: danblack commented
serial:big perhaps :-)
Postgres has a CYCLE option. MySQL will fail once the limit is reached; however, it can be reset with an ALTER TABLE statement, though that will need a test to see what happens with existing data. Once this happens, the logic of the previous optimisation falls down quickly, and you end up with:
Comment #10
mikeytown2 CreditAttribution: mikeytown2 commented
Truncating the cache on cron if the maximum serialid is within 1 billion of the column's maximum value should take care of that issue.
Comment #11
danblack CreditAttribution: danblack commented
Time options
Given the availability of alternate, higher-performing caching options, should we even worry about the index insert time for created?
I guess we need to assess all of these: https://api.drupal.org/api/drupal/includes%21cache.inc/function/calls/ca...
Looking - most seem to be admin, cron or test related.
Autoincrement overflow handling
> Truncate cache on cron if max is within 1 billion of the max value should take care of that issue.
Or catch the exception on reaching it and process it:
renumber, or better in a locking sense:
Comment #12
mikeytown2 CreditAttribution: mikeytown2 commented
I think your approach (delete where serialid) will work out well for D8. I think it's too risky to do in D7, though; people access the cache_* tables directly instead of using the API, so doing something more conservative would be ideal (my vote for D7 is the rename trick).
Catching the exception when auto-increment overflow happens could cause a major headache, as that table cannot be written to and will keep throwing exceptions. I do like the update-cache-id trick, though; that should work instead of truncate. I think we should do both: run a check on cron in order to be proactive, and if cron stops working we will still catch the exception and deal with it.
Comment #13
danblack CreditAttribution: danblack commented
> I think your approach (delete where serialid) will work out well for D8.
Works for me.
> I think it's too risky to do in D7 though; people access the cache_* tables directly instead of using the API
crazy.
> , thus doing something more conservative would be ideal (my vote is for D7 is the rename trick).
Works for me. Looked at your patch. It seems fine, with one fairly significant exception: CREATE TABLE ... LIKE isn't available in SQLite.
> Catching the exception when auto increment overflow happens could cause a major headache as that table can not be written to and will keep throwing exceptions.
for at least the duration of a SELECT COUNT; ALTER TABLE
fair call though.
> . I do like the update cache id trick though; that should work instead of truncate. I think we should do both; run a check on cron in order to be proactive and if cron stops working then we will still catch the exception and deal with it.
How about, instead of cron, we grab last_insert_id in a cache set call?
* throw a warning after $early_limit
* perform the renumber on some low-probability request after $high_limit
* provide an API / UI to run the process?
Comment #14
danblack CreditAttribution: danblack commented
I've done part one (assuming my cache-hash patch gets accepted); it covers a basic delete.
I wanted to get the structure in place before we had to deal with database migration scripts (which would be horribly blocking for anyone running an early 8.x).
I'm still up for doing a renumber function. As a bonus, because cid/cidhash is no longer a unique key, the retry in doSet can probably go away too.
Comment #16
danblack CreditAttribution: danblack commented
Locking proof of concept.
By doing a DELETE FROM cache WHERE id <= max_id we don't block new inserts.
session1
As this isn't committed, the locks are still active, as we can see from the record locks in session 2, which lock the pages of the primary key. We can also see that inserts in session 2 aren't blocked by the delete.
session2
ps. the warnings are from the double use of NOW() on an int field.
(This means I need to get transaction isolation level support done, otherwise it would only avoid blocking with < max.)
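The session transcripts aren't reproduced in this excerpt; a minimal sketch of what the proof of concept exercises, with table and column names assumed from the description:

```sql
-- session 1: range-bounded flush; only rows that already exist are locked
START TRANSACTION;
SELECT MAX(id) INTO @max_id FROM cache;
DELETE FROM cache WHERE id <= @max_id;
-- (transaction deliberately left open to observe the record locks)

-- session 2: this insert receives an id greater than @max_id, so it is
-- not blocked by session 1's uncommitted delete
INSERT INTO cache (cid, data, created)
VALUES ('example:cid', 'payload', UNIX_TIMESTAMP(NOW()));
```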
Comment #17
mikeytown2 CreditAttribution: mikeytown2 commented
Moving this issue to D8 and setting it as a beta blocker, as the upgrade path would suck to write. Creating a new tag called "database cache" and going to try to tag all the DB cache issues.
Comment #18
xjm commented
Only core maintainers should set the beta blocker tag, and only critical issues should be beta-blocking; did you speak to catch about this issue?
Comment #19
catch commented
If TRUNCATE isn't safe, we should change the implementation in the MySQL driver, not work around it in the cache backend.
Since this will need a backport to 7.x, which will require an update, I don't see why it should block an 8.x beta. The update can drop and recreate all the cache tables if necessary, rather than trying to actually update them.
Comment #20
mikeytown2 CreditAttribution: mikeytown2 commented
More results from testing the rename-table truncate trick.
TRUNCATE isn't quick; it takes around 150ms to run on an empty table due to the use of DROP + CREATE internally. The rename trick does take care of the deadlock issue on cache tables, which is nice; but deadlocks can still happen if the request takes longer than 30 seconds, due to the semaphore expiring. After adding a new field to a node type, the request took over 30 seconds, and I got a deadlock & process list emailed to me (happens when the number of active DB connections hits a certain limit). The process list looked good; nothing was over 1 second, but the deadlock was on 'field_info:field_map' even with this patch: #2193149: Deadlocks occur in cache_field table. This makes me think that the truncate on the cache__truncated table is causing the slowdown.
Running the TRUNCATE via a register_shutdown_function would be better, but this leads me to the conclusion that being able to TRUNCATE in the background would be ideal.
Not having to issue a truncate at all would be nice as well, but this syntax doesn't seem to work
Comment #21
Damien Tournoud CreditAttribution: Damien Tournoud commented
I don't understand where the discussion is going here. The fact that the database cache is transactional is a feature, not a bug; we should not try to work around it.
Why not simply do a DELETE unconditionally, instead of trying to do a TRUNCATE if outside a transaction and a DELETE otherwise?
Comment #22
danblack CreditAttribution: danblack commented
> I don't understand where the discussion is going here. The fact that the database cache is transactional is a feature, not a bug, we should not try to workaround it.
Yep. We agree.
> Why not simply doing a DELETE unconditionally, instead of trying to do a TRUNCATE if outside a transaction, a DELETE otherwise?
An unconditional DELETE in repeatable-read will lock all inserts and take a while. A read-committed level will delete new cache entries. A range-based delete will only lock the records that exist and still allow new cache entries to be inserted. #2238203: DatabaseCacheBackend/CacheBackendInterface get method should have timeout should stop existing records in the process of being deleted from undergoing a long lock timeout (I'm working on test cases for that at the moment).
Comment #23
Damien Tournoud CreditAttribution: Damien Tournoud commented
> Why not simply doing a DELETE unconditionally, instead of trying to do a TRUNCATE if outside a transaction, a DELETE otherwise?
We don't want repeatable-read, but that's another question. I guess what you are saying is only true for statement-based replication, in which InnoDB will add a gap lock?
From the perspective of Drupal 8 (which only has tag-based mass deletion), what's actually desirable is to delete all the entries matching the tag *and* prevent new entries from being created.
From the perspective of Drupal 7, it seems desirable to delete any entries matching the wildcard *and* prevent new entries matching the wildcard from being created. Of course, that last one is not possible with standard SQL tools, but it's what sounds desirable from a transactional perspective.
So, as a second best, locking the whole table doesn't sound too bad from this perspective.
Comment #24
mikeytown2 CreditAttribution: mikeytown2 commented
Locking the table is an awful idea when you have a lot of activity on a site; that's the reason for using InnoDB: row-based locks instead of table-based ones.
I'll work on a patch Monday that does the truncate of the spare cache table in a shutdown request, and I'll make sure the spare table is empty before promoting it to be the cache bin. Putting the truncate of the spare table inside a non-blocking request would be ideal; similar to this issue: #1599622: Run poormanscron via a non-blocking HTTP request
Comment #25
Damien Tournoud CreditAttribution: Damien Tournoud commented
@mikeytown2:
The burden of proof is on you. If you have a full wildcard flush of a cache bin, it's desirable that it blocks further inserts into the table until the transaction is committed. See #23.
If it's a performance problem, the solution is to not do such full wildcard flushing...
That will make the cache non-transactional, which is a horrible idea.
Drupal barely supports concurrency properly right now. The only thing that saves us from horrible inconsistencies is that the default cache implementation is transactional. Start using memcache as the field cache, for example, and everything falls apart. See #1679344: Race condition in node_save() when not using DB for cache_field.
I think in many ways you are looking at the wrong end of the problem. Look at why/where those cache flushings are done and fix them.
Comment #26
mikeytown2 CreditAttribution: mikeytown2 commented
Patch for D7 is attached.
The rename will lock the table, preventing anything from happening; timing-wise, the lock will be a much smaller window than before.
If we want to get very tricky, this code can be used to roll back a TRUNCATE thanks to the rename: simply swap the tables again and presto, the TRUNCATE has been undone. Transactions when using TRUNCATE are thus possible in theory, but this would most likely be a bad idea.
In order to do this, a master cache ID for that cache bin would be needed, from my understanding. It would allow entries to be created, but they won't be used in the future because the master cache ID won't match (thus having a similar effect). In AdvAgg 7.x I do something similar; I call it a global counter. Once the counter gets incremented, old processes still running with old data will write that info with the old counter. New processes with the new counter value will then write new data. The next request comes in using the new counter. Doing anything less than some sort of global counter will not give the desired outcome, if I'm understanding this correctly.
The advantage of using a global counter is that clearing out old cache data can be done on cron; using old data if new data isn't there is possible as well; and alt cache backends should be happy too. The main disadvantage is duplicate cache data.
Going down the global counter route would take a lot of R&D to do correctly. If the cache tags had a version number in them, and a hash of the cache tags was used as some sort of global counter, that might be ideal; I don't know if it is, though. This would require a lot of testing.
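A rough illustration of the global-counter idea described above. All table and column names here are hypothetical (this is not the AdvAgg implementation, just the shape of it):

```sql
-- "Clearing" the bin is just bumping its counter; no cache rows are touched.
UPDATE cache_counter SET value = value + 1 WHERE bin = 'cache_field';

-- Reads treat rows written under an older counter value as cache misses.
SELECT c.data
FROM cache_field c
JOIN cache_counter g ON g.bin = 'cache_field'
WHERE c.cid = 'field_info:field_map'
  AND c.counter = g.value;

-- Stale rows can then be purged lazily, e.g. on cron.
DELETE FROM cache_field
WHERE counter < (SELECT value FROM cache_counter WHERE bin = 'cache_field');
```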
Comment #27
mikeytown2 CreditAttribution: mikeytown2 commented
Comment #28
mikeytown2 CreditAttribution: mikeytown2 commented
Passes tests, moving back to D8.
Comment #29
Fabianx CreditAttribution: Fabianx commented
Btw. TRUNCATE itself is implemented that way (rename, create like, drop, rename) and had bugs before MySQL 5.5 (still buggy in 5.1):
http://bugs.mysql.com/bug.php?id=42643
Comment #30
mikeytown2 CreditAttribution: mikeytown2 commented
I'm using MySQL 5.6, and the patch in #26 has helped us with deadlocks.
From [8 Jul 2010 1:38] Paul Dubois
This makes the cache table a bottleneck when TRUNCATE TABLE is issued, due to the exclusive metadata lock.
Comment #31
Fabianx CreditAttribution: Fabianx commented
I thought about the whole patch again and it totally makes sense.
It is pretty simple to get into a performance situation with this during a revert of field_instances. In SHOW PROCESSLIST you see TRUNCATE cache_field and TRUNCATE cache_token each running for 1-2 seconds.
So it takes up to 2s to execute the truncate. The reason is the global metadata lock, because there are new SELECTs going on during the TRUNCATE.
This patch and the rename trick solve it because:
The 2nd TRUNCATE, on the table__truncated copy, will be _fast_, as nothing is trying to select from it (besides other processes doing a count query in the same function); because it is fast, this should still improve things very much.
And we know that consistency is not a problem with the cache here, so it is okay if INSERTs already scheduled fail.
The only thing I don't get is the !inTransaction() part.
Would that not mean that the rename is called but the __truncated table might not exist yet,
if the first to hit this code path were in a transaction? (unlikely, but possible)
Because I think this double-buffers the table, right?
EDIT:
I think it should then look like:
EDIT 2:
What I outlined above is basically single-threaded, which means I had no concurrency issues, which means the TRUNCATE was running in the background anyway. That might mean it has no effect on run-time performance. I'm going to check that.
Comment #32
mikeytown2 CreditAttribution: mikeytown2 commented
Feel free to create a patch with modified inTransaction() logic. I was trying to make sure the code path was the same. The toString method is where we have the delete or the rename query; thus the execute method was patched differently as well.
Comment #33
Fabianx CreditAttribution: Fabianx commented
I see now; we also had query cache lock problems.
A co-worker also pointed out that SELECT * FROM table LIMIT 1 should return no rows when empty and something when not, and that COUNT can be a performance bottleneck by itself.
JFYI
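The co-worker's point, roughly: an existence probe stops at the first row, whereas COUNT(*) on InnoDB scans an index. A sketch, with an assumed bin name:

```sql
-- Returns 1 if the bin has any rows, 0 if it is empty; unlike
-- SELECT COUNT(*), this stops as soon as one row is found.
SELECT EXISTS (SELECT 1 FROM cache_field LIMIT 1) AS has_rows;
```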
Comment #34
Anybody commented
The patch from #26 helped us a lot. How can we get some more eyes on this? I'm looking forward to seeing this patch in a future D7/D8 release! Great work!
Comment #35
shardesty CreditAttribution: shardesty commented
Not totally sure if this is the right place for this, but this seemed the most relevant of all the related threads:
About three weeks ago, we began to experience seemingly random site crashes. Looking through the logs, we saw PDOException errors and realized we had a database lockup. Kicking the database server restored the site but did not, of course, eliminate the problem. Over the next couple of weeks, we installed a number of patches to Drupal and made some changes to our database setup. The number of PDOException errors dramatically decreased, but we still had intermittent site failures.
We did everything on this list: "Fixes for MySQL Deadlocks in D7" (https://groups.drupal.org/node/415883) and also applied the patch from this thread: https://www.drupal.org/node/930876.
* Use READ COMMITTED for MySQL transactions
* Do not use InnoDB for the semaphore table, use Memory
* Deadlocks occur in cache_field table (patch to modules/field/field.info.class.inc and modules/field/field.info.inc)
* Use asynchronous MySQLi connection to avoid RECORD LOCKS (Deadlocks) on cache tables (patch to includes/cache.inc and includes/database/mysql/database.inc)
* PDOException: SQLSTATE[08004] [1040] Too many connections in lock_may_be_available() (patch to lock.inc)
* Waiting for table metadata lock on cache_field table (patch to includes/database/query.inc)
Having eliminated most of the other PDOExceptions we were able to see that the exact sequence of events that led to the site going down was the generation of some PDOException and *then* clearing the cache.
That last patch, to the query.inc file, is intended to help fix the issue of the website locking up in the middle of a cache clear. But applying that patch did not solve our issue. Clearing the cache alone does not drop the site. However, we have another (related or not, no idea) issue where you cannot unblock a user using the admin UI. Attempting to unblock a user throws a PDOException, and if you clear the cache after that, the site goes down. (If you use drush first, it works. If you attempt it in the UI first, it won't work in the UI or in drush.)
Here are the two errors, drush and from the dlog:
***
root@CNS0904:/var/www# drush uublk dsherman
WD user: PDOException: SQLSTATE[HY000]: General error: 1205 Lock wait [error]
timeout exceeded; try restarting transaction: DELETE FROM
{cache_entity_user}
WHERE (cid IN (:db_condition_placeholder_0)) ; Array
(
[:db_condition_placeholder_0] => 1252
)
in cache_clear_all() (line 165 of /var/www/includes/cache.inc).
WD php: PDOException: SQLSTATE[HY000]: General error: 1205 Lock wait [error]
timeout exceeded; try restarting transaction: DELETE FROM
{cache_entity_user}
WHERE (cid IN (:db_condition_placeholder_0)) ; Array
(
[:db_condition_placeholder_0] => 1252
)
in cache_clear_all() (line 165 of /var/www/includes/cache.inc).
PDOException: SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction: DELETE FROM {cache_entity_user}
WHERE (cid IN (:db_condition_placeholder_0)) ; Array
(
[:db_condition_placeholder_0] => 1252
)
in cache_clear_all() (line 165 of /var/www/includes/cache.inc).
Drush command terminated abnormally due to an unrecoverable error. [error]
***
Type: php
Date: Monday, November 3, 2014 - 6:45am
User: Anonymous (not verified)
Location: http://default/index.php
Referrer:
Message: PDOException: SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction: DELETE FROM {cache_entity_user}
WHERE (cid IN (:db_condition_placeholder_0)) ; Array ( [:db_condition_placeholder_0] => 1252 ) in cache_clear_all() (line 165 of /var/www/includes/cache.inc).
Severity: error
Hostname: 127.0.0.1
***
Once the site went down after those errors and a clear cache, I restarted the database server and the site came back up. Went to drush and was able to unblock the user successfully.
So:
* Need to know why using the UI to unblock a user throws that error. Searching on the error leads to all the fixes listed above.
* I think I have an idea of why the database locks up when we clear cache, but why have all the previous fixes and patches not prevented it from going down in this case?
* And what changed three weeks ago such that PDOExceptions suddenly started locking up the database and killing the site? Our logs show random PDOExceptions as far back as we've looked, but we weren't dropping the site.
Other relevant information:
* We updated to 7.31 on the 10/16, and saw the first site crash on the 21st. (That could be because no one clicked the clear cache button, possibly.)
* We installed the SQL Injection vulnerability patch (not the whole 7.32 update, just the patch), also on the 16th.
* The production site is load-balanced, two servers synced with unison. PDO errors were mostly on the primary but all the cache errors were on the secondary.
* We have been completely unable to replicate the error on any of our development servers. (Not load balanced, but otherwise pretty close to production.)
* Site sits on Ubuntu 14.04 lts, and the mysql server is version: 5.5.40-0ubuntu0.14.04.1 (Ubuntu)
Any guidance would be greatly appreciated.
Comment #36
mikeytown2 CreditAttribution: mikeytown2 commented
@shardesty
The patch in this issue currently works for when a cache clear is used to wipe the whole table. In your case this query
DELETE FROM cache_entity_user WHERE cid IN (1252)
is only targeting a single row. The error that the PDO driver throws is "Lock wait timeout exceeded; try restarting transaction". The lock wait timeout referred to here is usually the innodb_lock_wait_timeout variable; its default value is 50 seconds. If you have 2 different threads trying to operate on the same row, this InnoDB timeout can get triggered. In my async patch (#2336521-35: Use asynchronous MySQLi connection to avoid RECORD LOCKS (Deadlocks) on cache tables) I change this timeout to 2 seconds instead of the default 50 via SET SESSION innodb_lock_wait_timeout = 2. You can try adjusting this value from its default; that should help with the database locking up. Be careful, though: the 2 seconds I use is there just for the cache, i.e. data that isn't the end of the world if it doesn't get written. Adjusting innodb_lock_wait_timeout so the timeouts happen sooner has the potential to negatively affect normal (non-cache-bin) queries.
If you can trigger this database stall fairly easily, can you trigger it and then run SHOW ENGINE INNODB STATUS via phpMyAdmin or some other MySQL tool, and put the output of it here? Be sure to capture it while the database is locked up, otherwise it won't have the info that is needed to look deeper into this issue. If you could also run SHOW PROCESSLIST, that would be helpful as well; but the INNODB STATUS is the most important one to get.
In order to improve calls to cache_clear_all() I would want to create an alt cache backend (#2361183: Database cache master plan) so async deletes can happen. I've taken patching core to its limits, and an alt database cache backend seems like the best way forward.
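The knob and diagnostics mentioned above, collected in one place. The 2-second value is from the async patch and is deliberately aggressive; it applies only to the current connection:

```sql
-- Fail fast on row-lock waits for this session's cache queries:
SET SESSION innodb_lock_wait_timeout = 2;

-- What to capture while the database is stalled:
SHOW ENGINE INNODB STATUS;  -- transactions, held and waited-on locks
SHOW PROCESSLIST;           -- threads stuck 'Waiting for table metadata lock'
```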
Comment #37
shardesty CreditAttribution: shardesty commented
@mikeytown2
Thanks for replying so quickly. We're talking to our client to arrange bringing the site down for a few minutes to gather the information you requested. That will happen sometime later today, and we'll post our findings as soon as we have them.
Comment #38
shardesty CreditAttribution: shardesty commented
@mikeytown2
We captured the information you asked about, attached.
Comment #39
mikeytown2 CreditAttribution: mikeytown2 commented
Thanks for the info :)
This is the thread that is trying to do something (clear the cache_entity_user table)
This is the thread that has locked up the table; it appears to be just sitting there doing nothing, as no query is listed. That means this thread is inside a transaction or took an exclusive lock on a table. If you want to free up your database, kill this DB connection.
The thread ID is 188, so KILL 188 should bring your database back from the dead; yes, this is a "sleeping" thread. Note that in this stackoverflow issue they have the timeout being exceeded; in your case you actually want the timeout to happen so the zombie DB connection gets killed. You can try to fix your case by adjusting the wait_timeout variable. SET SESSION wait_timeout = 90 is what I use in the async patch; it means that if a MySQL thread has been idle for over 90 seconds, the server will close the connection and a "Warning: MySQL server has gone away" error will appear. With async connections I have it set up to auto-reconnect if this happens, and no fatal error is displayed (think about a long-running cron job where 90 seconds might not be enough time).
All of this usually points towards an issue in the application (Drupal in our case), where some rows/tables get a lock, that lock doesn't get released, and the DB connection never closes.
Do you have a query log of all queries made by thread 188? That would be very helpful.
Comment #40
shardesty CreditAttribution: shardesty commented
@mikeytown2, thanks
Just to clarify in my own head - the first transaction is coming from your truncate patch, correct? Also, are you saying that this thread 188 is a sort of zombie thread that has been causing our deadlocks all along, or just in this particular case? (And by extension, is killing this thread the solution, or just treating a symptom?)
We've been rebooting the mysql server whenever it deadlocks - this brings our site back, and I would have thought that it would have killed any threads, because reboot. Of course, I know next to nothing about mysql.
Comment #41
mikeytown2 CreditAttribution: mikeytown2 commented
The locking transaction is coming from thread 188 (thread IDs get reset on MySQL server restarts). The query that ran on that connection somehow got a lock on the cache_entity_user table and never released it. Then the truncate query comes along (thread 597) and wants to clear the cache, but it can't do so because of the lock on the table from thread 188. This results in a metadata lock, and a bad one, because thread 188 isn't really doing anything yet it holds a database lock; thus you have to wait for that connection to time out.
In order to manually clear this issue without too much hassle (deciphering INNODB STATUS output is hard), SHOW PROCESSLIST is helpful. Find a thread that has been sleeping for about as long as, or longer than, the query that's been running the longest with a metadata lock, and kill it. The thread will look like it's doing nothing (sleeping). Most of the time you'll kill the correct thread and things start to work again. Every now and then the kill won't do it, so find another sleeping thread that is about as old as the thread with the metadata lock and kill that one as well. This should release the lock, and then things work as normal.
Comment #42
Fabianx CreditAttribution: Fabianx commented
Wow, thanks mikeytown2 for a _lot_ of background information here.
Comment #43
shardesty CreditAttribution: shardesty commented
mikeytown2, thanks for all the information so far. So, we now have a manual option for restoring database function by killing the thread, so we don't have to reboot the database server every time we get a deadlock. However, manually performing that is not a long-term plan, especially as we've asked our clients to refrain from clicking the clear-cache button or modifying user accounts until this is resolved.
We're planning on triggering a deadlock intentionally again tomorrow morning, with the query logs turned on. Is there any other information that we should gather that would be helpful in diagnosing the actual cause of the deadlocks?
You've mentioned adjusting the innodb_lock_wait_timeout to 2 seconds, and the wait_timeout variable to 90 - which of these do you recommend, and why? If the problem is, as you suggest, in Drupal, what can we do to stop the user modify / cache clear / deadlock situation?
Comment #44
mikeytown2 CreditAttribution: mikeytown2 commented
Adjusting the wait_timeout should, in theory, help this issue auto-resolve. The biggest issue I see with setting a low wait_timeout value is cron. Looking at drupal_cron_run(), I think the best option is to use hook_cron_queue_info() to re-adjust the wait_timeout value to something like 300 seconds, and then in your settings.php file set the wait_timeout value to 60 seconds.
In terms of the query log: get the thread ID for every query in the log, as well as the output from SHOW ENGINE INNODB STATUS while the site is locked.
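The suggested wait_timeout split, expressed as the session statements involved. The values 60 and 300 are the examples from the comment; wiring each SET to settings.php versus a cron queue hook is the Drupal side of it:

```sql
-- Normal web requests: idle connections are reaped after a minute.
SET SESSION wait_timeout = 60;

-- Re-issued at the start of a long cron run so the connection survives.
SET SESSION wait_timeout = 300;
```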
Comment #45
mikeytown2 CreditAttribution: mikeytown2 commented
I have an experimental query that might shed some light on the issue while it's happening.
Don't run this in phpMyAdmin if you have it with frames (the default), as it will then query all of the tables in the information_schema database and your MySQL server will be slow or crash. If phpMyAdmin shows you a list of tables on the left side, then you shouldn't run the above query in it. If you do want to run it with phpMyAdmin, be sure to do it outside of a frame; /tbl_sql.php?db=information_schema&table=innodb_trx would be a good URL for a starting point.
Here is another query; the same rules about phpMyAdmin apply to it as well.
This query should show all processes that have been waiting for the same amount of time (+-10 seconds) as your longest-running 'Waiting for table metadata lock' query.
Inspiration for the queries comes from http://illumineconsulting.blogspot.com/2013/10/digging-into-mysql-innodb...
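The queries themselves were posted as attachments; as a rough sketch of the approach (my reconstruction, not the exact query, assuming MySQL 5.5+ with InnoDB), the starting point is joining the open-transaction list to the processlist so a connection that is idle but still holds a transaction stands out:

```sql
-- List open InnoDB transactions alongside their connection state. A zombie
-- shows up as command = 'Sleep' with a large trx_age_seconds.
SELECT t.trx_mysql_thread_id AS thread_id,
       t.trx_state,
       t.trx_started,
       TIMESTAMPDIFF(SECOND, t.trx_started, NOW()) AS trx_age_seconds,
       p.command,
       p.time,
       p.info AS current_query
FROM information_schema.INNODB_TRX t
JOIN information_schema.PROCESSLIST p
  ON p.id = t.trx_mysql_thread_id
ORDER BY t.trx_started;
```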
Comment #46
shardesty CreditAttribution: shardesty commented
@mikeytown2:
This morning we took down the site again and gathered yet more info, attached. (The two query files are the results of your queries in the previous post. The two process list files are from during the deadlock, before and after killing thread 61. Query-61 is an excerpt from the query log for thread 61. If you need more of the query log, just say so; we have over 500MB for the five minutes we had query logging turned on.) Also, we watched closely what was happening in phpmyadmin as the site went down, so we knew before we saw the results of your custom queries that thread 61 was the initial problem. We can take the site down - cause a deadlock in the database - by clicking "save" on a user edit page and then clearing the cache. This morning, as soon as I clicked save, thread 61 appeared and immediately went to sleep / null:
61 dbuser xxx.xx.xxx.xxx:38286 database Sleep 448 NULL
We waited about 30 seconds, then clicked the clear cache button. Within a couple seconds, thread 112 collided with thread 61.
112 dbuser xxx.xx.xxx.xxx:38292 database Query 393 Waiting for table metadata lock RENAME TABLE cache_entity_user TO cache_entity_use...
And then we had the big pile up of threads "Waiting for table metadata lock " right after it:
We waited until everything was fully locked up, and then killed thread 61. We saw the results in "show_process_list_2" for a few seconds, and then it all cleared up and the site was back to normal.
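For reference, the manual remediation described here is a single statement once the blocking connection id is known (61 in this test):

```sql
-- Kills connection 61, rolling back its open transaction and releasing
-- the metadata lock so the queued RENAME/TRUNCATE and everything piled
-- up behind it can proceed.
KILL 61;
```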
Next, we put in place the changes to settings.php and created a new module, and installed/enabled that. Ran the test again with results that were functionally identical to the above, just with different thread ID numbers.
Looking back over the last couple weeks, the only things of note that happened before the site started going down were:
* Updating core from 7.28 to 7.31 (five days previous)
* Ubuntu updates to libmysqlclient18 mysql-client-5.5 mysql-client-core-5.5 mysql-common mysql-server mysql-server-5.5 mysql-server-core-5.5 (five days previous)
* Ubuntu updates to libssl1.0.0 openssl (one day previous)
I haven't seen any issues related to being unable to save changes to users. So maybe I need to post that as a new bug report - but the problem certainly feels related to all these other PDOException type issues.
Comment #47
mikeytown2 CreditAttribution: mikeytown2 commented
Looks like a TRANSACTION was started but never finished. Thus when the cache was cleared in another thread, the cache_entity_user table was locked, and for some reason thread 61 just sat and did nothing for the rest of the time.
Sounds like the wait_timeout idea didn't help... You waited at least 60 seconds for the sleeping thread before killing the test, right? Just trying to rule things out.
Comment #48
mikeytown2 CreditAttribution: mikeytown2 commented
Looks like the first query from #45 was able to pinpoint the zombie id. The second one had too small a range to catch it. Good to know in the future. Thanks for your work with this.
The heart of the issue is database queries to the cache tables being executed inside a transaction and then that thread stalling for an unknown reason. If you're already using this patch in prod (#2336521-35: Use asynchronous MySQLi connection to avoid RECORD LOCKS (Deadlocks) on cache tables), then it looks like the only solution at this point is to find out why the transaction is not getting committed in user_save(). I will be working on an alternative database backend in order to avoid the transaction issue, but that's not going to help you today.
Comment #49
mikeytown2 CreditAttribution: mikeytown2 commented
Looking into this more, I think there is an issue inside field_attach_update('user', $account); or one of the hooks it uses. Looking at where the queries died points to it happening in there.
Comment #50
shardesty CreditAttribution: shardesty commented
We waited almost 400 seconds in the second test, after we installed the patched settings.php and module; we wanted to be sure we gave it a chance to work.
Glad we could help testing your queries, just wish we didn't have the opportunity. We'll look more into field_attach_update, and thanks once more for your help with this.
Comment #51
mikeytown2 CreditAttribution: mikeytown2 commented
Looks like timeouts by default do not roll back the transaction: http://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_inn...
The advice here says that the wait_timeout should roll it back http://stackoverflow.com/questions/9936699/mysql-rollback-on-transaction...
Eh... transactions in mysql are a pain.
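For anyone checking their own server: innodb_lock_wait_timeout only rolls back the timed-out statement by default, not the surrounding transaction, and the variable that changes that behavior can only be set at server startup. A quick way to inspect it:

```sql
-- OFF by default; when OFF, a lock wait timeout aborts only the current
-- statement and the transaction keeps its locks.
SHOW VARIABLES LIKE 'innodb_rollback_on_timeout';

-- Turning it ON requires a restart, e.g. in my.cnf:
--   [mysqld]
--   innodb_rollback_on_timeout = ON
```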
Comment #52
mikeytown2 CreditAttribution: mikeytown2 commented
Adjusted the query so it only shows queries that hold a row or table lock. Also added a couple more columns from the processlist table.
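The adjusted query was posted as an attachment; a hedged sketch of the idea (my reconstruction, using the MySQL 5.5/5.6 information_schema tables, which were replaced by performance_schema.data_locks in 8.0):

```sql
-- INNODB_LOCKS lists locks that are being requested or are blocking
-- another transaction, which is exactly the contended set we care about.
SELECT t.trx_mysql_thread_id AS thread_id,
       l.lock_type,            -- 'RECORD' (row) or 'TABLE'
       l.lock_table,
       p.user, p.host, p.db,
       p.command, p.time, p.state,
       p.info AS current_query
FROM information_schema.INNODB_LOCKS l
JOIN information_schema.INNODB_TRX t
  ON t.trx_id = l.lock_trx_id
JOIN information_schema.PROCESSLIST p
  ON p.id = t.trx_mysql_thread_id;
```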
Comment #53
Fabianx CreditAttribution: Fabianx commented
I would strongly suggest not using the entitycache module with the DB cache backend.
Do you have the option to have at least that bin in memcache or redis?
Comment #54
mikeytown2 CreditAttribution: mikeytown2 commented
@shardesty
One more question, what version of PHP are you using?
Comment #55
mikeytown2 CreditAttribution: mikeytown2 commented
Found a post that talks about auto-killing hanging transactions. It uses a stored procedure that gets called every X seconds. Looks like an interesting idea.
http://mysqlblog.fivefarmers.com/2012/08/28/identifying-and-killing-bloc...
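The core of that idea can be sketched as a selection query (the 30-second threshold and the Sleep filter are illustrative, not taken from the post; the post wraps the logic in a stored procedure driven by the event scheduler):

```sql
-- Generate KILL statements for connections that are asleep but still
-- have an InnoDB transaction open past the threshold. Review the output
-- before executing it.
SELECT CONCAT('KILL ', t.trx_mysql_thread_id, ';') AS kill_statement
FROM information_schema.INNODB_TRX t
JOIN information_schema.PROCESSLIST p
  ON p.id = t.trx_mysql_thread_id
WHERE p.command = 'Sleep'
  AND TIMESTAMPDIFF(SECOND, t.trx_started, NOW()) > 30;
```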
Comment #56
Marko B CreditAttribution: Marko B commented
I see in #48 something very similar to what I am having now in D7, described in more detail here:
http://drupal.stackexchange.com/questions/137979/user-import-from-other-...
maybe it is somehow related so my problem could bring some light to it.
Comment #57
mikeytown2 CreditAttribution: mikeytown2 commented
@Marko B
Check out slides 74-106 from my DIY Drupal 7 Performance presentation; you should be able to get it working correctly.
Comment #58
Marko B CreditAttribution: Marko B commented
Thanx Mikey. I'll just add that my issue was with something else: I had a cURL connection that was slow and it made the site time out. Sorry for this.
Comment #59
mikeytown2 CreditAttribution: mikeytown2 commented
I just encountered this while editing field info. The detection query works very well; it identified the stuck query and I was able to kill it without issue.
I've worked on reducing the query down to the parts I care about. More refinements might come later.
I've been working on a new module, Asynchronous Prefetch Database Query Cache, which is being built to avoid issues like this one. It incorporates this patch, but the module is not production ready.
Comment #60
Berdir
Somewhat related: #2347867: Race conditions with lock/cache using non-database storage - add a non-transactional database connection
Comment #61
shardesty CreditAttribution: shardesty commented
Belated update:
@Fabianx: We uninstalled the entitycache module and attempted to recreate the error. The error occurred anyway. We inherited this site from another company, and the cache modules, like many other aspects of the site, were apparently poorly thought out. The site is now consolidated on the default Drupal cache, and all the various contrib cache modules are disabled.
@mikeytown2: PHP version is:
PHP 5.5.9-1ubuntu4.5 (cli) (built: Oct 29 2014 11:59:10)
Zend Engine v2.5.0 with Zend OPcache v7.0.3
We waited until our weekly maintenance window to do some more testing, seeing as the site stayed up as long as no one clicked the save user button. What we found was strange: previously, we had been completely unable to replicate the error in any of our other environments, despite our best efforts to make those other environments match production as closely as possible.
The one difference that was most difficult to replicate was the fact that our production environment is load-balanced, with two servers hitting the same database, and files replicated between the two servers with unison. Outside users would hit one of the two servers randomly, but all administrative functions went through just one, the primary (server 9). We'd checked, and the pdo errors all occurred on server 9, which you'd expect since that's where the user save and cache clear would originate.
So, we rewired things a bit so that the secondary server (server 10) could be used for Drupal admin as well. Since all the files and settings are constantly synced from 9 to 10, we expected that when we hit save user and then cleared the cache, we'd get the database lockup. But we didn't. Hitting the same database, but from server 10, we could not recreate the database lockup at all. So far as we can determine, all settings - php, mysql, apache, ubuntu, drupal, whatever - were identical between the two servers.
As best we can now figure, there is some obscure difference, some setting, on server 9 that uniquely causes the error condition. Some research showed that server 9 was built by a different individual than server 10, but aside from that, we have no clue as to what the difference might be.
Having isolated the issue to one server - we made the decision to just blow the server away, and rebuild using the same procedure as we used for server 10, then restore the site. Since that time, we have not experienced any issues with PDO errors, or anything else. Our db performance is significantly better, though, thanks to all the tuning patches from mikeytown2.
Thanks to everyone for all their help with this issue - though it seems at this point that we'll not have any answer for why, exactly, the error was happening. The various patches that we applied were essential, though, in narrowing down the issue - without clearing away many of the errors we were seeing through the successive patches and without the sql scripts that allowed us to watch what was happening on the sql server in real time, it would have been much harder if not impossible to see that the issue was coming from that one server.
Comment #62
SocialNicheGuru CreditAttribution: SocialNicheGuru commented
With the patch from #26, I am running into a number of these __truncate tables not being defined in the database for Drupal 7:
cache__truncate
cache_views__truncate
cache_variable__truncate
cache_rules__truncate
cache_libraries__truncate
cache_field__truncate
If you uninstall a module that has a __truncate table, the __truncate table does not get removed. Shouldn't it be?
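For context, these helper tables come from the rename-then-truncate pattern the patch uses so that TRUNCATE never runs against the live table. A sketch of the idea (the exact mechanics of the patch may differ; table names follow the __truncate convention listed above):

```sql
-- Atomically swap a standing empty copy in for the live table, then
-- truncate the old data outside the hot path. RENAME TABLE is atomic
-- across all tables listed in one statement.
RENAME TABLE cache_field            TO cache_field__tmp,
             cache_field__truncate  TO cache_field,
             cache_field__tmp       TO cache_field__truncate;
TRUNCATE TABLE cache_field__truncate;

-- Manual cleanup for a helper table left behind after its module is
-- uninstalled:
DROP TABLE IF EXISTS cache_field__truncate;
```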
Comment #63
morgantocker CreditAttribution: morgantocker commented
So I'm a little late to this issue. There are some known issues w.r.t. MySQL:
- 5.6 and above has a 'fast' DROP TABLE (but not truncate)
- 5.7 and above has a 'fast' TRUNCATE
See this MySQL bug for more details. I am simplifying 'fast' a little, but essentially with a large buffer pool MySQL can take some time to remove the deleted table's pages from the buffer pool. The new approach is to do this lazily.
Drupal supports MySQL 5.5+ so that technically means neither statement is safe for large buffer pools.
Comment #65
daffie CreditAttribution: daffie commented
I think this problem will be fixed in #2684359: The truncate method is does different things on different databases and depending if in a transation.
Comment #77
catch
Drupal requires MySQL 5.7 now, so I believe this is resolved by https://bugs.mysql.com/bug.php?id=68184
Closing as outdated.
#2684359: The truncate method is does different things on different databases and depending if in a transation is still open for some of the other issues.