#802446: Cache stampede protection: drupal_render() and page cache uses lock_wait() for render cache misses, which is a great idea. However it's quite likely that a specific render fragment could be built and cached in less than one second, to account for cases like that, I think we should replace with sleep(1) in lock_wait() with usleep(100000); - so that it can return with 100ms granularity if that's possible.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

catch’s picture

FileSize
1.06 KB

Updated patch sets $delay * 10 so that the $delay argument behaves the same.

The result of this is ten select queries per second where there would have been one, but see the performance improvement when combined with #802446: Cache stampede protection: drupal_render() and page cache - in that example, processes made to wait ended up returning in 250ms instead of 1050ms without this patch, or 400ms without stampede protection.

Since lock.inc is pluggable this change wouldn't have to go into core, but I think it's a fair trade-off to make. We could also make the polling time a variable so it could be configured without a custom lock.inc -which would be a very simple patch too, but not sure if we need that level of control here or not compared to using a non-SQL backend for it.

chx’s picture

Status: Needs review » Reviewed & tested by the community

See, ten SELECTs... not that much, eh. Indexed anyways. And noone with a sizeable traffic will use SQL locks, no, that'd be crazy.

catch’s picture

Status: Reviewed & tested by the community » Needs work

This is better done once the poll decay in the other issue is in.

Anonymous’s picture

chx: While reading through other lock-related threads I can't seem to find any decisions that non-sql storage for locks (apc, memcached, ...) are configurable. Am I missing something?

Anonymous’s picture

subscribe

catch’s picture

Status: Needs work » Needs review
Issue tags: +Performance, +scalability
FileSize
984 bytes

That other patch has been RTBC for months now, but this is still a big issue. I'm bumping this to major since the variable_init() lock is the equivalent of a global writeread lock across the whole of a site, which means it's very easy to suddenly have a tonne of processes waiting one or more seconds on a variable cache miss, which happens a lot.

New patch does both the usleep(100000); and the decay polling.

pillarsdotnet’s picture

FileSize
847 bytes

Re-rolled against d7 head after applying patch from #802446: Cache stampede protection: drupal_render() and page cache

pillarsdotnet’s picture

catch’s picture

FileSize
984 bytes

I split this out from #802446: Cache stampede protection: drupal_render() and page cache so it could get in separate from adding more locking. Re-attaching the patch from #6.

paul.lovvik’s picture

Status: Needs review » Reviewed & tested by the community
catch’s picture

Status: Reviewed & tested by the community » Needs review
FileSize
1023 bytes

Just thought of something - we should double the wait time until we reach two seconds, but not continue to double it after that. One query every two seconds is nothing.

So this will now look like:

100ms (100)
200ms (300)
400ms (700)
800ms (1500)
1600ms (3100)
2000ms (5100)
2000ms (7100)
etc.

catch’s picture

FileSize
1 KB

Paul pointed out in irc that min(2 etc.) was going to mean two usec all the time, not a maximum of two seconds. Re-rolled to correct that.

paul.lovvik’s picture

Status: Needs review » Reviewed & tested by the community
catch’s picture

Meant to bump this to major in #7 but never actually did it. This is the scenario:

1. variable_initialize() acquires a lock
2. Several other requests hit immediately afterwards and have to lock_wait()
3. Your site stops serving requests altogether for a while because your apache processes are maxed out while PHP is sleeping.

While people can install other options, a lot of sites won't, and fixing the core implementation will encourage people to look at how best to handle this in the alternatives.

webchick’s picture

Status: Reviewed & tested by the community » Needs work

Sorry to be a pain, but could we add a few more comments around this code? I'm not sure the comments adequately reflect the knowledge in this issue for new people coming into that and trying to figure out how it works.

catch’s picture

Priority: Normal » Major
Status: Needs work » Needs review
FileSize
2.21 KB

No that's fair enough, I've updated the comments with more detail.

While writing the comments, I started feeling even more paranoid.

If using normal page caching, the variables cache is loaded when serving a cached page. If there's a lock on the variables table, then requests that normally take 20-30ms to serve in total, would be waiting for at least 100ms (and building and saving the variables cache should take less than 100ms anyway). Also, even if a lock is acquired and released in say 200ms, a lot of requests could start waiting 150ms or 175ms into this process - and they won't release until say 275ms.

Let's say there's even half a second between a cache miss and the last process to start lock waiting to exit from lock_wait() (and it could be more), if you serve 100 requests / second and have max clients set to 25, you could already start to run into trouble with apache itself getting backed up with requests. Something like css/js aggregation that's calling variable_set() multiple times from multiple different pages is going to prolong this process - so some could wait across other processes acquiring the same lock and take much longer to exit than it does to build the cache once.

So this patch also reduces the minimum wait time to 25ms, and instead of doubling, adds 25ms each wait instead.

So it'll look like:

poll time / total time spent waiting so far

25 (25)
50 (75)
75 (150)
100 (250)
125 (375)
150 (525)
175 (700)
200 (900)
225 (1125)
250 (1375)
275 (1650) (10 queries)
300 (1950)
325 (2275)
350 (2625)
375 (3000)
400 (3400)
425 (3825)
450 (4275)
475 (4750)
500 (5250)
500 (5750)...

This still means that a process that needs to wait for one second, will issue less db queries than the original patch waiting 100ms every time. Drupal 6 would sometimes issue several hundred queries to build one page with no waiting between them, so this is still a tiny number of queries comparatively, and it's still possible to use memcache-lock.inc too to eliminate them altogether.

catch’s picture

FileSize
2.21 KB

grr, had to make that change /after/ updating the comments.

pwolanin’s picture

The reason we used sleep originally in the lock API (as targetd for Drupal6) was that windoze PHP didn't not support usleep().

However, checking the PHP docs I see: "5.0.0 This function now works on Windows systems. "

So this patch should be fine for Druapl 7, but the backport to Drupal6 might require some extra work.

catch’s picture

#17: lock_wait_802856.patch queued for re-testing.

pwolanin’s picture

minor, but min() can be called directly with the values - any reason to pass the array in?

catch’s picture

FileSize
2.21 KB

Hmm, no reason at all. Attached patch which just changes the min() call.

moshe weitzman’s picture

Status: Needs review » Reviewed & tested by the community

comments and code and justification look good to me.

pillarsdotnet’s picture

Status: Reviewed & tested by the community » Needs work

EDIT: removed comment; posted to wrong issue.

pillarsdotnet’s picture

Status: Needs work » Reviewed & tested by the community

EDIT: previous comment posted to wrong issue; restoring status.

DanPir’s picture

subscribe

aspilicious’s picture

Version: 7.x-dev » 8.x-dev

Nees to be applied to D8 first

catch’s picture

Issue tags: +Needs backport to D7

Tagging.

Dries’s picture

Status: Reviewed & tested by the community » Fixed

Committed to 7.x and 8.x. Thanks for your hard work, catch.

catch’s picture

Version: 8.x-dev » 6.x-dev
Status: Fixed » Needs work

Can be applied to 6 but cnw due to the windows issue pwolanin outlined.

mikeytown2’s picture

Status: Needs work » Needs review
Issue tags: -Needs backport to D7 +Needs backport to D6
FileSize
3.24 KB

Patch for D6. If the server is running on windows with php < 5, they can wait a full second... heck they should probably wait 2. Version checking and OS checking code from: check_plain() & file_directory_temp().

Anonymous’s picture

Status: Needs review » Closed (won't fix)

I'm going to guess this is a won't fix for D6, since D6 commits focus on security fixes at this point.

Anonymous’s picture

Version: 6.x-dev » 7.x-dev
Status: Closed (won't fix) » Fixed

Changing issue status to reflect that it was fixed in 7.x.

Status: Fixed » Closed (fixed)
Issue tags: -Performance, -Needs backport to D6, -scalability

Automatically closed -- issue fixed for 2 weeks with no activity.