#802446: Cache stampede protection: drupal_render() and page cache uses lock_wait() for render cache misses, which is a great idea. However it's quite likely that a specific render fragment could be built and cached in less than one second, to account for cases like that, I think we should replace with sleep(1) in lock_wait() with usleep(100000); - so that it can return with 100ms granularity if that's possible.
Comment | File | Size | Author |
---|---|---|---|
#30 | drupal-lock_wait-802856-30-D6.patch | 3.24 KB | mikeytown2 |
#21 | lock_wait_802856.patch | 2.21 KB | catch |
#17 | lock_wait_802856.patch | 2.21 KB | catch |
#16 | lock_wait_802856.patch | 2.21 KB | catch |
#12 | 802856_lock_wait.patch | 1 KB | catch |
Comments
Comment #1
catchUpdated patch sets $delay * 10 so that the $delay argument behaves the same.
The result of this is ten select queries per second where there would have been one, but see the performance improvement when combined with #802446: Cache stampede protection: drupal_render() and page cache - in that example, processes made to wait ended up returning in 250ms instead of 1050ms without this patch, or 400ms without stampede protection.
Since lock.inc is pluggable this change wouldn't have to go into core, but I think it's a fair trade-off to make. We could also make the polling time a variable so it could be configured without a custom lock.inc -which would be a very simple patch too, but not sure if we need that level of control here or not compared to using a non-SQL backend for it.
Comment #2
chx CreditAttribution: chx commentedSee, ten SELECTs... not that much, eh. Indexed anyways. And noone with a sizeable traffic will use SQL locks, no, that'd be crazy.
Comment #3
catchThis is better done once the poll decay in the other issue is in.
Comment #4
Anonymous (not verified) CreditAttribution: Anonymous commentedchx: While reading through other lock-related threads I can't seem to find any decisions that non-sql storage for locks (apc, memcached, ...) are configurable. Am I missing something?
Comment #5
Anonymous (not verified) CreditAttribution: Anonymous commentedsubscribe
Comment #6
catchThat other patch has been RTBC for months now, but this is still a big issue. I'm bumping this to major since the variable_init() lock is the equivalent of a global
writeread lock across the whole of a site, which means it's very easy to suddenly have a tonne of processes waiting one or more seconds on a variable cache miss, which happens a lot.New patch does both the usleep(100000); and the decay polling.
Comment #7
pillarsdotnet CreditAttribution: pillarsdotnet commentedRe-rolled against d7 head after applying patch from #802446: Cache stampede protection: drupal_render() and page cache
Comment #8
pillarsdotnet CreditAttribution: pillarsdotnet commentedCombined with #802446: Cache stampede protection: drupal_render() and page cache.
Comment #9
catchI split this out from #802446: Cache stampede protection: drupal_render() and page cache so it could get in separate from adding more locking. Re-attaching the patch from #6.
Comment #10
paul.lovvik CreditAttribution: paul.lovvik commentedComment #11
catchJust thought of something - we should double the wait time until we reach two seconds, but not continue to double it after that. One query every two seconds is nothing.
So this will now look like:
100ms (100)
200ms (300)
400ms (700)
800ms (1500)
1600ms (3100)
2000ms (5100)
2000ms (7100)
etc.
Comment #12
catchPaul pointed out in irc that min(2 etc.) was going to mean two usec all the time, not a maximum of two seconds. Re-rolled to correct that.
Comment #13
paul.lovvik CreditAttribution: paul.lovvik commentedComment #14
catchMeant to bump this to major in #7 but never actually did it. This is the scenario:
1. variable_initialize() acquires a lock
2. Several other requests hit immediately afterwards and have to lock_wait()
3. Your site stops serving requests altogether for a while because your apache processes are maxed out while PHP is sleeping.
While people can install other options, a lot of sites won't, and fixing the core implementation will encourage people to look at how best to handle this in the alternatives.
Comment #15
webchickSorry to be a pain, but could we add a few more comments around this code? I'm not sure the comments adequately reflect the knowledge in this issue for new people coming into that and trying to figure out how it works.
Comment #16
catchNo that's fair enough, I've updated the comments with more detail.
While writing the comments, I started feeling even more paranoid.
If using normal page caching, the variables cache is loaded when serving a cached page. If there's a lock on the variables table, then requests that normally take 20-30ms to serve in total, would be waiting for at least 100ms (and building and saving the variables cache should take less than 100ms anyway). Also, even if a lock is acquired and released in say 200ms, a lot of requests could start waiting 150ms or 175ms into this process - and they won't release until say 275ms.
Let's say there's even half a second between a cache miss and the last process to start lock waiting to exit from lock_wait() (and it could be more), if you serve 100 requests / second and have max clients set to 25, you could already start to run into trouble with apache itself getting backed up with requests. Something like css/js aggregation that's calling variable_set() multiple times from multiple different pages is going to prolong this process - so some could wait across other processes acquiring the same lock and take much longer to exit than it does to build the cache once.
So this patch also reduces the minimum wait time to 25ms, and instead of doubling, adds 25ms each wait instead.
So it'll look like:
poll time / total time spent waiting so far
25 (25)
50 (75)
75 (150)
100 (250)
125 (375)
150 (525)
175 (700)
200 (900)
225 (1125)
250 (1375)
275 (1650) (10 queries)
300 (1950)
325 (2275)
350 (2625)
375 (3000)
400 (3400)
425 (3825)
450 (4275)
475 (4750)
500 (5250)
500 (5750)...
This still means that a process that needs to wait for one second, will issue less db queries than the original patch waiting 100ms every time. Drupal 6 would sometimes issue several hundred queries to build one page with no waiting between them, so this is still a tiny number of queries comparatively, and it's still possible to use memcache-lock.inc too to eliminate them altogether.
Comment #17
catchgrr, had to make that change /after/ updating the comments.
Comment #18
pwolanin CreditAttribution: pwolanin commentedThe reason we used sleep originally in the lock API (as targetd for Drupal6) was that windoze PHP didn't not support usleep().
However, checking the PHP docs I see: "5.0.0 This function now works on Windows systems. "
So this patch should be fine for Druapl 7, but the backport to Drupal6 might require some extra work.
Comment #19
catch#17: lock_wait_802856.patch queued for re-testing.
Comment #20
pwolanin CreditAttribution: pwolanin commentedminor, but min() can be called directly with the values - any reason to pass the array in?
Comment #21
catchHmm, no reason at all. Attached patch which just changes the min() call.
Comment #22
moshe weitzman CreditAttribution: moshe weitzman commentedcomments and code and justification look good to me.
Comment #23
pillarsdotnet CreditAttribution: pillarsdotnet commentedEDIT: removed comment; posted to wrong issue.
Comment #24
pillarsdotnet CreditAttribution: pillarsdotnet commentedEDIT: previous comment posted to wrong issue; restoring status.
Comment #25
DanPir CreditAttribution: DanPir commentedsubscribe
Comment #26
aspilicious CreditAttribution: aspilicious commentedNees to be applied to D8 first
Comment #27
catchTagging.
Comment #28
Dries CreditAttribution: Dries commentedCommitted to 7.x and 8.x. Thanks for your hard work, catch.
Comment #29
catchCan be applied to 6 but cnw due to the windows issue pwolanin outlined.
Comment #30
mikeytown2 CreditAttribution: mikeytown2 commentedPatch for D6. If the server is running on windows with php < 5, they can wait a full second... heck they should probably wait 2. Version checking and OS checking code from: check_plain() & file_directory_temp().
Comment #31
Anonymous (not verified) CreditAttribution: Anonymous commentedI'm going to guess this is a won't fix for D6, since D6 commits focus on security fixes at this point.
Comment #32
Anonymous (not verified) CreditAttribution: Anonymous commentedChanging issue status to reflect that it was fixed in 7.x.