The attached simple patch uses the lock system to ensure that only one request rebuilds a given cache item at a time.
Note that the first line of the patch is just cleanup: $cid can't be empty, or we would have left this function a few lines earlier.
If we agree on this approach, I will submit similar patches for the block and page caches (others?).
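The approach can be sketched roughly as follows against D7's lock API. This is an illustrative sketch only: cache_rebuild_item() and the 'cache:' lock-name prefix are hypothetical stand-ins, not names from the actual patch.

```php
<?php
// Sketch of the stampede protection described above: only one request
// rebuilds a given cache item; the rest wait, then re-read the cache.
function cache_get_with_lock($cid, $bin = 'cache') {
  if ($cache = cache_get($cid, $bin)) {
    return $cache->data;
  }
  if (lock_acquire('cache:' . $cid)) {
    // We won the lock: rebuild the item and store it for everyone else.
    $data = cache_rebuild_item($cid);
    cache_set($cid, $data, $bin);
    lock_release('cache:' . $cid);
    return $data;
  }
  // Another request is already rebuilding; wait for it, then read its result.
  lock_wait('cache:' . $cid);
  if ($cache = cache_get($cid, $bin)) {
    return $cache->data;
  }
  // The lock expired or the rebuild failed; fall back to building it ourselves.
  return cache_rebuild_item($cid);
}
```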
Comment | File | Size | Author |
---|---|---|---|
#143 | cache_stampede-802446-143.patch | 6.54 KB | pillarsdotnet |
#141 | cache_stampede-802446-141.patch | 5.29 KB | pillarsdotnet |
#139 | 802446.stampede_0.patch | 4.71 KB | catch |
#136 | 802446.stampede.patch | 5.5 KB | pillarsdotnet |
#134 | 802446.stampede.patch | 5.99 KB | Anonymous (not verified) |
Comments
Comment #1
moshe weitzman commented: I was not handling the 'wait' condition, where subsequent requests don't get the lock.
Comment #2
moshe weitzman commented: Now uses lock_wait() instead of sleep().
Comment #3
catch commented: This looks great to me. I have a D6 site that slows to a crawl whenever content is posted; the locking framework is designed for exactly these cases. My only concern is that it would be nice if lock_wait() checked for the lock release at a granularity of less than a second (say 100ms), but that's not for this patch to deal with.
Comment #4
catch commented: Looked at this again; there's nothing not to like here. RTBC.
Comment #5
moshe weitzman commented: This patch adds a lock for the page cache using the same pattern.
I'm stopping short of the block cache for now, as that has a slightly different flow. Until we add a lock for blocks, it is simple enough for developers who are hurting to convert expensive blocks to use drupal_render() #cache instead, via hook_block_view_alter().
Comment #6
moshe weitzman commented:
Comment #8
moshe weitzman commented: Reroll; changed header() calls to drupal_add_http_header().
Comment #9
catch commented: Do we need the header call for WAIT? It's indented one place too many.
Also, the lock name isn't per page? Locking all page-cache generation across the site seems not ideal to me; we have the requested path by this point already.
Comment #10
moshe weitzman commented: Fixed the lock name for the page cache. That was a silly late change.
I am keeping the WAIT header, as it provides good information about why a cached page was slower than usual (i.e. it had to wait).
I reverted back to header() from drupal_add_http_header(), since the latter could cause our hit/miss/wait header to get cached along with the page, which is confusing.
Comment #11
catch commented: Looks great. Fine with keeping WAIT. Just waiting for the testbot now.
Comment #12
Dries commented: Any idea how we can test the performance impact and correctness of this patch?
Comment #13
moshe weitzman commented: Those are hard to test. I tested correctness by adding a very long sleep after the page cache lock is acquired; I then launched a new anonymous request and verified that this new request went into a holding pattern while the first request "built" the page ... I'm not sure how to test for a performance regression. What I'm intending is a reliability improvement: the site stays up during a busy cache flush.
I asked a couple of lock experts to review this patch. Their support should ameliorate these concerns.
Comment #14
chx commented: This is scary, very very scary. Compared to where this lock framework was used before (the menu rebuild), this is a very frequent operation. I have no idea where we are with deadlocks or, in general, how large the lock overhead is. Does anyone have insight?
Comment #15
moshe weitzman commented: Please, no FUD. That's what I call using phrases like "very, very scary" without presenting any basis.
Requests that are waiting for the cache to be built check once a second until the lock is released. Once it is released, they cache_get() the item they need. Checking once per second means one query to the semaphore table for core's lock.inc, and even faster checks for APC, memcache, etc. See lock_wait().
We have well-known stampede bugs. This patch addresses them with no API change.
Comment #16
catch commented: Lock waiting has very little overhead. The overhead I'd be more concerned about is in http://api.drupal.org/api/function/lock_acquire/7
However, with database caching, cache_set() can be very expensive, let alone building the cached data, so there's a trade-off for stability. Since lock.inc is pluggable, it ought to be possible to reduce the lock_acquire() and lock_wait() overhead to pretty much nothing, at which point you get protection against the site going down (a very real risk at the moment) without any performance impact.
I will see if I can get a basic test case up with ab at some point to verify some of the above.
Comment #17
catch commented: OK, here's a test case:
1. Enable page caching for anonymous users.
2. Apply the attached patch to cache.inc - it just logs that we wrote to the cache.
3. Truncate cache_page.
4. ab -c5 -n50
Without locking:
With locking:
So, the locking works.
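The recipe above can be written out as shell commands. This is an illustrative sketch only: it assumes a local D7 site at http://d7.7 backed by a MySQL database named drupal (names are placeholders), and ab from Apache's httpd tools.

```shell
# Empty the page cache so every concurrent request starts with a miss.
mysql drupal -e "TRUNCATE cache_page;"

# Fire 50 requests at the front page, 5 at a time.
ab -c5 -n50 http://d7.7/

# Without the lock patch, the cache.inc logging hack reports several
# cache writes per flush (a stampede). With it, exactly one write.
```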
Then, to test performance impact, I did the following:
1. ab -c1 -n50 http://d7.7 (acquires the lock but doesn't need the stampede protection - this would be the worst case from a performance standpoint)
HEAD:
Patch:
In this case the patched version was faster, so there's no measurable impact from acquiring the lock and then not using it. A better test case would hit a wider spread of paths, still with a high cache-hit rate, but this is fine as a sanity check.
2. ab -c5 -n50 (stampede protection kicks in)
HEAD:
Patch:
As you can see, in the latter example we lose requests per second, but that's not the interesting part.
Without the patch, there are these long requests:
Each of those is a full page build plus cache_set(), which takes about 200-400ms, with all of that time spent generating the content and writing to the cache. Note this was on HEAD with only one node on the front page.
With the patch, we instead get:
In this case, the 152ms request builds the content and writes it to the cache; the other four requests sit around doing nothing for a second, and only 50ms is spent actually doing anything, since sleep(1) uses no CPU for that second.
So, while raw performance is a little slower, the webserver has done one fifth of the work for four out of five cache misses. On a real site, where a page cache miss might mean a full second to build and write the cache, we'd see no speed impact either.
It's also possible to make lock_wait() more granular, especially with a non-SQL lock.inc but even with the default one; see #802856: Make lock_wait() wait less, which removes wait times for users who are put into the holding pattern on a cache miss. With both patches applied, I ran the same ab test.
This time, the longest request is twice as fast as without the patch: there's no contention for either CPU or database writes while building the cache, and once the lock is released, requests that were made to wait still get the page from cache and return it well within the time they would otherwise have spent building and writing the cache themselves. We might want to leave that more aggressive lock_wait() to a pluggable include, but it shows that this approach can improve overall stability while having zero raw performance cost in terms of user experience.
Comment #18
catch commented: Forgot to attach the hack to cache.inc - this includes both patches as well.
Comment #19
chx commented: I asked others to do research into lock.inc and was told that my request is FUD. lock_acquire() does a database write. That's not much overhead, but it's there. However, it's pluggable, so be it.
There can be no deadlock as far as I can see.
What IS possible, however, is that you recurse so many times that PHP dies. We need recursion protection in there.
Edit: that's only necessary if catch decimates the wait time. If he raises $delay = 30 to $delay = 300, changing the behaviour to the same thing just more fine-grained, then such protection is not needed.
Comment #20
chx commented: OK, so let's go with it, in the hopes the other patch gets in too.
Comment #21
dhthwy commented: To go along with catch's ab benchmark:
I made Drupal clear the cache on every other page request, so I get a HIT/MISS scenario. I couldn't get it to produce a WAIT no matter how much I bombarded it. I'm sure it works, though; I just couldn't create the right conditions for it to fire.
Locking stuff average (MISS): 0.0143499279022158
Other stuff inside (WHEN CACHE MISS) _drupal_bootstrap_page_cache() average: 0.1099051904678
Minimum lock stuff time: 0.00162601470947
Minimum other stuff time: 0.0664360523224
Maximum lock stuff time: 0.150331020355
Maximum other stuff time: 0.228527784348
"Locking stuff average (MISS)" means the elapsed time from the comment "// Cache miss. Avoid a stampede." to the point where it sets the "X-Drupal-Cache: MISS" header.
"Other stuff inside (WHEN CACHE MISS) _drupal_bootstrap_page_cache()" means the elapsed time from the beginning of the function to the point where it sets the "X-Drupal-Cache: MISS" header.
The conclusion is that the locking takes very little time compared to the rest of the function. No opcode caches were in use, by the way.
Arggg, I messed up somewhere writing this down; I'll have to correct these times. Edit: those times should be right now. I've got a Perl app that makes 100 page requests to Drupal; half of them (when they miss) insert those elapsed times into the header. I've run it several times and they're pretty consistent.
Comment #22
lostchord commented: Just out of curiosity... have you fully researched distributed lock management here? Is the "lock system" being referred to a "distributed lock manager"?
If you have a cache shared across a number of distinct processors - a web server farm - then you will need this level of synchronisation, I think. There are all sorts of interesting race conditions and failure modes that need to be considered.
This is not easy. I had discussions with some of the VMS developers who did the first-ever DLM, and it took that amazing development team quite a few iterations to get it right.
cheers
Comment #23
catch commented: PHP is single-threaded; one process can only ever request or hold a lock on one thing at a time, so the only situation where you could have a deadlock is when a lock is acquired and the process then dies with a fatal error before releasing it. In that case the lock will eventually expire and be released. If there are other potential race conditions in the locking framework, they deserve separate issues against lock.inc. The original issue is #251792: Implement a locking framework for long operations.
@dhthwy: to get a WAIT, you'll need to run your Perl script in two processes (or just use ab, which handles that for you).
Comment #24
lostchord commented: @catch
Does this apply to PHP running on separate servers where the database is shared between them?
Comment #25
dhthwy commented: catch, I ran a bunch of Perl procs. Still couldn't get it... hehe. Maybe ab+perl. I did it in Perl so I could read the header stuff I injected from Drupal.
But it's not worth me trying anyway, since you've already tested that and the locking doesn't appear to have a noticeable impact on performance.
Comment #26
Dries commented: Instead of locking and waiting, I'd prefer to come up with a write-through cache. In that scenario, other requests continue to use the old data until it has been replaced in the cache. That requires zero waiting from other requests. Have we considered that?
Comment #27
Damien Tournoud commented: Re: the polling-delay question: the standard database-based lock.inc implementation is nothing more than a cheap baseline. There are several ways to do better, without polling, even with a database-based implementation, but that requires additional privileges (locking, etc.) that most shared hosts cannot provide. The best way to avoid replacing a cache stampede with a lock stampede would probably be to implement decay polling in lock.inc by default.
That said, let's study how much it takes to implement a semi-write-through strategy. (Complete write-through is not possible, because there are many cases where the data itself has completely disappeared, and we have to lock in that case anyway.)
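Decay polling here means backing off the sleep interval between lock checks, so waiters neither hammer the semaphore table nor oversleep once the lock is released. A rough sketch of the idea, using D7's lock_may_be_available(); lock_wait_decay() and the specific delays are hypothetical, not the committed code:

```php
<?php
// Sketch of decay polling: start with short sleeps and double them,
// capped, until the lock is released or we time out.
function lock_wait_decay($name, $max_seconds = 30) {
  $delay = 25000;  // Initial wait of 25ms, in microseconds.
  $elapsed = 0;
  while ($elapsed < $max_seconds * 1000000) {
    if (lock_may_be_available($name)) {
      return FALSE;  // Lock released; the caller can retry cache_get().
    }
    // Lock still held: sleep, then increase the delay (the "decay").
    usleep($delay);
    $elapsed += $delay;
    $delay = min($delay * 2, 500000);  // Cap the back-off at 500ms.
  }
  return TRUE;  // Timed out while waiting.
}
```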
Comment #28
catch commented: This is for absolute cache misses - whenever cache_get() returns FALSE - not for when there is existing, stale data in the cache that needs to be updated. They're two different issues.
At the moment, core's cache.inc returns FALSE for the update case too, so this patch would stop a stampede in that case as well; with a write-through, you'd have cache_get() return the (stale) item instead, so this is entirely compatible with that if we added it in a separate patch.
For reference:
http://tag1consulting.com/patches/pagecache_lock
http://groups.drupal.org/node/59318
http://drupal.org/node/295738
These deal with stale cached data, but not empty data. If we could get that into core too it'd be great, but it doesn't solve all cases.
Comment #29
catch commented: Cross-posted. I still think this belongs where it is, and write-through belongs in cache.inc, so leaving this at CNR rather than CNW.
Comment #30
Damien Tournoud commented: Here, let's at minimum implement decay polling in lock.inc.
Comment #31
catch commented: There's a D6 patch for write-through on #762978: Prevent cache stamped for page cache table; I just bumped that issue to D7.
Here's the same patch with decay polling. I didn't change the minimum wait from 1s; that should stay a separate issue.
Comment #32
catch commented: Brain no worky. Less broken patch.
Comment #33
catch commented: Minus one more stupid error, and with some improvements to the lock.inc change via Damien in IRC.
Comment #34
jriedel commented: @lostchord
Being an old VMS internals guy...
We are not talking about a DLM in this patch. The locking is done by inserting a row in the database, not by the webservers on many different nodes. If your database is clustered, then it is up to the database to handle the locking required to make transactions work correctly.
In VMS I didn't care about the DLM as long as $ENQ/$DEQ calls worked the same on a single node or in a cluster.
Comment #35
moshe weitzman commented: I'm happy with the latest patch. I think we have a lot of experts on this issue now.
Comment #36
lostchord commented: @jriedel
Ditto the old-VMS-internals bit... I'm much happier in VAX assembler and a relative newbie to PHP :-) I think the critical part of your reply is this:
I suspect this will probably not be the case in most 'simple' replication scenarios.
I'm assuming that, as far as locking is concerned, we are dealing with this part of the API, and that the intent is to provide a mutex capability rather than a generalised resource-synchronisation capability. In this context I've got a couple of questions:
I have some other questions about how the lock_* stuff behaves, but I'll dump those in a separate post when I've dug a bit more. Finding the API documentation and code is relatively easy; finding out what the code was actually intended to do (as opposed to what you can see it doing), and the requirements it's based on, is proving a lot harder.
cheers
Comment #37
Dries commented: Is 'building cache' proper English?
Why do we have to call this function recursively? That seems odd, given that we might execute SQL queries and whatnot. For example, the IP blocking code would be executed twice.
Comment #38
catch commented: It looks like the lock protection should go in drupal_page_get_cache() rather than here. Moshe, any reason not to do that?
Comment #39
moshe weitzman commented: This patch avoids duplicate drupal_block_denied() calls.
I looked into avoiding recursion altogether, but the way I did it assumed the cache takes less than x seconds to build, and I can't think of any sensible hard-coded value there. It's simpler, IMO, to add a static as I have in this patch.
Comment #40
moshe weitzman commented: Ditched the new static variable and instead moved the lock_acquire() code into drupal_page_get_cache(), as suggested by catch. This is a touch simpler. We still use recursion, since we don't know how long a page will take to build; it's plausible that it will take more than one 30-second wait period. There is minimal performance cost to the recursion.
Comment #41
moshe weitzman commented: Fixed the code-comment language per Dries' request in #37. No functionality change.
Comment #42
catch commented: I'm wondering if, instead of recursion, we could just do a while (lock_wait()); I will try to roll a patch for that tomorrow. If for some reason that doesn't work out, this is RTBC.
Comment #43
catch commented: I had a look at the while() loop, but it was more verbose than the recursion (iirc; can't believe I left this hanging for a week, sorry :( ). Since the recursion is otherwise completely harmless, and only happens on a cache miss that has failed to acquire a lock anyway, marking this back to RTBC.
Comment #44
catch commented: lock_initialize() moved to _drupal_bootstrap_variables(); removed the call from the patch. No other changes.
Comment #45
moshe weitzman commented: FYI, catch's last change removed a tiny bit of code that was no longer needed due to the variable stampede patch ... wait for green before commit.
Comment #46
moshe weitzman commented: #44: render_cache.patch queued for re-testing.
Comment #47
catch commented: beejeebus noticed a very nasty typo in IRC: no function called _drupal_page_get_cache() exists.
Comment #48
Anonymous (not verified) commented: Discussed an alternative locking implementation with catch. Here's an example implementation that uses do { } while() instead of recursion.
In general, I'd favour retrying n times and then bailing (whether that's a 50x header or whatever the code around the lock decides) over using recursion.
They'll be the same 99.99% of the time, until the locking goes sideways for whatever reason; then the recursion will just cause the processes to loop until death.
In this case, IMO, we can just proceed rather than exit, but we could do something different when $tries == 10 if we wanted to.
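A minimal sketch of the retry-then-proceed pattern being described. This is illustrative only: rebuild_item(), the lock name, and the retry count of 10 are assumptions standing in for the real page/render cache code paths.

```php
<?php
// Sketch of bounded lock retries instead of unbounded recursion.
$tries = 0;
$data = NULL;
do {
  if (lock_acquire('cache_rebuild:' . $cid, 30)) {
    // We hold the lock: rebuild the item and cache it for everyone else.
    $data = rebuild_item($cid);
    cache_set($cid, $data);
    lock_release('cache_rebuild:' . $cid);
    break;
  }
  // Someone else is rebuilding; wait for them, then re-check the cache.
  lock_wait('cache_rebuild:' . $cid);
  if ($cache = cache_get($cid)) {
    $data = $cache->data;
    break;
  }
} while (++$tries < 10);

if ($data === NULL) {
  // Bail out: proceed without the lock rather than looping forever.
  $data = rebuild_item($cid);
}
```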
Comment #49
moshe weitzman commented: I'm fine with the proposed alternative. If you write it, please post it here. In the meantime, this is back to RTBC and is much better than no stampede protection.
Comment #50
Anonymous (not verified) commented: OK, here's a patch based on #48. I'm OK if people decide they just want to get #49 in; I'll reroll in a follow-up issue.
This patch will attempt to get the locks multiple times, then continue to render the page or elements even if lock acquisition fails. Both caching and locking are pluggable, so the site may still be able to serve the request when this happens.
Comment #51
moshe weitzman commented: Did a code and logic review; this is a slight improvement, so back to RTBC.
Comment #53
catch commented: Why does this add the lock initialization back in? That's done in _drupal_bootstrap_variables() now, so it shouldn't be necessary to repeat.
Comment #54
catch commented: Well, except in the case where you set $conf['page_cache_without_database'] = TRUE; then variable_initialize() gets skipped...
Comment #55
Anonymous (not verified) commented: OK, after discussion with catch in #drupal, here's another go.
- The lock_acquire() call looked wrong to me after I read lock_acquire() ;-) We should definitely hold the lock for longer than 1 second, otherwise this stampede protection won't work for non-cached pages that take longer than that to generate.
- Lock initialisation checks $conf['page_cache_without_database'], per #53 and #54.
- Don't let _drupal_bootstrap_page_cache() invoke the bootstrap and exit hooks if $conf['page_cache_without_database'] is TRUE, because it's not safe to do so.
Comment #56
moshe weitzman commented: Nice catch. Back to RTBC.
Comment #57
Anonymous (not verified) commented: catch noticed that I'd used 'initialise' instead of 'initialize'. New patch uses Uhmerican speling, no other changes.
Comment #58
Anonymous (not verified) commented: Ahem.
Comment #60
Anonymous (not verified) commented: One more time.
Comment #61
catch commented: Back to RTBC again.
Comment #62
Anonymous (not verified) commented:
Comment #63
sun commented: Trailing white-space.
Function names should have trailing ().
Comment #64
sun commented: #60: render.lock_.patch queued for re-testing.
Comment #65
catch commented: The comment issues were already in HEAD, but cleaned them up anyway.
Comment #66
Anonymous (not verified) commented: Thanks sun and catch for keeping this alive. Now, how do we get DriesChick to weigh in?
Comment #67
moshe weitzman commented: *Still* RTBC.
Comment #68
Damien Tournoud commented: I do not understand the use of this loop:
Something as simple as:
feels like more than enough.
Comment #69
webchick commented: Sounds like this still needs discussion, based on Damien's last comment.
Comment #70
Anonymous (not verified) commented: I'm OK with Damien's idea. The most important thing is to avoid the broken "hey, let's just recurse until the heat death of the universe, or the PHP process's death, whichever comes first".
Whether we go with #68 or not, I agree we should shorten the wait period. Unless someone says otherwise, I'll post a patch based on #68 later today.
Comment #71
catch commented: I'd want more than 2 seconds for a wait; Drupal.org itself used to take around 30 seconds to load /tracker. However, no loop seems fine here.
Comment #72
Anonymous (not verified) commented: OK, the attached patch addresses #68 and #71.
Comment #73
Anonymous (not verified) commented: Whoops, needs review.
Comment #74
catch commented: Looks fine to me. lock_wait() bails out as soon as possible anyway, so I don't see a reason to set anything but the default there.
Comment #75
moshe weitzman commented: It looks like there are some small differences in the lock code between the page cache and the render cache. Is that deliberate?
Comment #76
Anonymous (not verified) commented: Yes :-)
In the page cache, we want to send a cache-miss header if we acquire the lock.
In the render cache, we want to wait for less time, because we're only dealing with bits of a page.
Comment #77
moshe weitzman commented: IMO it would be helpful to send a cache-miss header for the render cache too; obviously the CID has to be sent as well. If others think that's overkill, then OK.
Comment #78
catch commented: lock_wait() exits as soon as it can anyway; I don't see any reason to specify 5 seconds. If I render-cache something that takes 15 seconds to build (a huge page callback, someone changing a 2-second view into a 20-second one by adding an extra sort, a block on every page that has to make a load of network requests to a slow third-party service; plenty of possibilities), then that's precisely what this patch should add a little bit of protection against.
Comment #79
Anonymous (not verified) commented: OK, the new patch addresses #78. I've left the render-cache headers out; I think that's overkill, but I'll add them back if either catch or DamZ votes they should go in.
Comment #81
Anonymous (not verified) commented: #79: lock.patch queued for re-testing.
Comment #83
moshe weitzman commented: Hmm, the bot is not happy. Anyone?
Comment #84
moshe weitzman commented: #79: lock.patch queued for re-testing.
Comment #85
moshe weitzman commented: I am running run-tests.sh successfully with this patch. Neither I nor justinrendell can reproduce this bot result. Does anyone know how to resolve "Failed to run tests: failed during invocation of run-tests.sh."?
Comment #87
carlos8f commented: Subscribing. The test failure is odd.
Comment #88
Damien Tournoud commented: It seems that #79 makes "Forum functionality (ForumTestCase)" time out. The failure seems legit.
Comment #89
carlos8f commented: Yeah, a timeout is likely. The PIFR bot can't distinguish infinite loops from other errors yet; usually it's reported as a fatal error, but here we're getting a run-tests.sh error. In any case, the forum tests may hold some clues.
Comment #90
Anonymous (not verified) commented: Thanks for the investigating; I'll pick this up and see if I can figure out the failure in the forum test.
Comment #91
Anonymous (not verified) commented: The test failures are probably coming from http://drupal.org/node/886152#comment-3624434 - once that lands I'll reroll and retest.
Comment #92
carlos8f commented: #886152: All fields are hidden after the administrator newly configures a view mode to not use 'default' display has landed.
Comment #93
Anonymous (not verified) commented: Alllllrighty, updated patch for HEAD, no other changes. Let's see what the bot thinks now.
Comment #95
Anonymous (not verified) commented: Oh, blurgh. I still can't reproduce that error; I can successfully run the tests on my machine via run-tests.sh.
I'll chase rfay/boombatower/damz to see if I can get access to a bot to figure this out.
Comment #96
carlos8f commented: @justinrandell, when you ran run-tests.sh, what concurrency did you set? I have a feeling the tests only fail when a higher concurrency is set (the production test bots have concurrency 4+, I think). Concurrency would probably trigger lock_wait(), etc., so that might be the key to reproducing the failure.
I've only skimmed the code in the patch, but I am interested in what changed between #72 and #79 to cause the failure (interdiff, please?). Either HEAD broke the patch, or #79 added something that broke it. Hmm.
Comment #97
carlos8f commented: Also, I'm interested in a unit test being written for this (read #12 and #13; yes, it's difficult, but I think we should try): first because this touches critical bootstrap code in a major way, and second to help debug the run-tests.sh failure, which, if it is due to concurrency, would point to a legitimate bug.
Comment #98
carlos8f commented: Well, actually, run-tests.sh concurrency in theory shouldn't trigger lock_wait(), since each test method gets its own database, and its own set of locks.
And although lock.inc has some rudimentary tests in core, apparently there is no way to actually test concurrent requests with our framework. Maybe we'll need some out-of-the-box thinking on this one.
Comment #99
Anonymous (not verified) commented: OK, let's see what the bot thinks of the patch at #72, rerolled only to remove fuzz. If this passes, then we have a smaller target.
Comment #100
carlos8f commented: Wow! Now we can isolate the change that causes the failure.
It looks like the only change is using '5' as the timeout/delay. Interesting.
Comment #101
carlos8f commented: I think this is the problem:
The lock is supposed to be released in http://api.drupal.org/api/function/drupal_page_footer/7 (via drupal_page_set_cache(), etc.), but it's possible that in some cases that code is not run and the lock is not properly released. In that case, with the timeout being 30 (the default), PHP times out waiting for a lock that was never released. It might be that we need to call lock_release() in a shutdown function to ensure it is called whenever exit() is called, to catch cases like drupal_goto().
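The shutdown-function idea could look roughly like this. A sketch only, assuming PHP 5.3+ closures; the lock name and surrounding flow are illustrative, not the actual patch code:

```php
<?php
// Sketch: make sure the page-cache lock is released even when the
// request ends early via exit(), e.g. from drupal_goto(). Fatal errors
// can still bypass shutdown handlers, in which case the lock simply
// has to expire on its own.
if (lock_acquire('page_cache:' . $cache_key, 30)) {
  register_shutdown_function(function () use ($cache_key) {
    // Releasing a lock that was already released is harmless.
    lock_release('page_cache:' . $cache_key);
  });
  // ... build the page, then drupal_page_set_cache() as usual ...
}
```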
Comment #102
carlos8f commented: OK, taking http://api.drupal.org/api/function/lock_release_all/7 into consideration, all locks should be released on shutdown. Nevertheless, it appears that our lock is not being released. There's some bug here that we don't grok yet.
Comment #103
Anonymous (not verified) commented: OK, here's a wild stab in the dark to clear the locks on initialization. Let's see what the bot thinks.
Comment #104
Anonymous (not verified) commented: Oh boy, that's a bad idea. Please disregard #103.
Comment #106
carlos8f commented: Another perplexing part is that PHP is not supposed to time out due to sleeping (on Linux, anyway)! It still seems like there is a direct correlation between the lock_wait() timeout being 30 seconds in the failed patch and PHP having a global (default) timeout of 30. The passing patch uses 5 as the lock_wait() timeout, and I think it's safe to say it's using the full 5 seconds and continuing silently.
Created #956948: Locking framework ungrokably fails testing to investigate the test failures.
Comment #107
carlos8f commented: Facepalm.
Output:
@chx, can we get a phpwtf on this?
The attached patch is #93 with proper parentheses.
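The thread elides the offending line, but the class of bug being fixed is PHP operator precedence: `&&` binds more tightly than `=`. A hypothetical illustration (not the actual patch code; $use_page_cache is a made-up condition):

```php
<?php
// Buggy: parsed as $cache = (cache_get($cid) && $use_page_cache),
// so $cache ends up as TRUE/FALSE instead of the cache object.
if ($cache = cache_get($cid) && $use_page_cache) {
  // $cache is a boolean here, so $cache->data is broken.
}

// Fixed: the parentheses force the assignment to happen first, so
// $cache holds the object returned by cache_get() (or FALSE).
if (($cache = cache_get($cid)) && $use_page_cache) {
  $data = $cache->data;
}
```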
Comment #108
carlos8f commented: Ignore this one.
Comment #109
Damien Tournoud commented: Basic boolean logic.
Comment #110
sun commented: common.inc looks correct to me; bootstrap.inc does not.
Comment #111
carlos8f commented: @Damien, did you see the patch in #107? #109 is incomplete; it doesn't fix
I believe the patch in #107 is more complete.
And also,
is fine by itself, but #109 changes it. It's only when we add the && that operator-precedence problems happen. I actually prefer to leave out the parens if it's a simple expression like the one above.
Comment #112
Damien Tournoud commented: #109 was a cross-post. We just came to the same conclusion.
Comment #113
carlos8f commented: Uh oh. #107 vs. #109:
The only notable difference is that #107 fixes the page-cache operator precedence, and then we get page-cache test failures. This is exactly the opposite of what to expect: #107 should pass, and #109 should fail!
Regardless, we have moved past the run-tests.sh error by fixing this operator bug. That is at least encouraging, but still, a timeout of 30 vs. 5 shouldn't cause the bot to crash like that. I'll try to expose the bug in #956948: Locking framework ungrokably fails testing (it looks like the bot is caught in an infinite loop there).
Comment #114
carlos8f commented: This simple test shows that in #109 the page-cache logic should be totally broken, and that #107, which contains the parens, should be correct. Yet the bot prefers #109. WTF is going on here? :P
Comment #115
carlos8f commented: In #956948: Locking framework ungrokably fails testing, I've isolated the run-tests.sh failure to the operator-precedence bug. We still need to figure out why the page-cache tests are failing in #107, though.
Comment #116
carlos8f commented: To fix the page-cache tests, all I needed to do was move header('X-Drupal-Cache: MISS') out of the lock_acquire() block and to the bottom of the function. The header should be sent regardless of whether locking succeeded.
Comment #117
moshe weitzman CreditAttribution: moshe weitzman commented
carlos8f gets the bulldog award for getting the tests to pass. Great work.
Comment #118
Anonymous (not verified) CreditAttribution: Anonymous commented
nice work! i think this is RTBC as well.
Comment #119
carlos8f CreditAttribution: carlos8f commented
#116: 802446-116-lock-page-render-caches.patch queued for re-testing.
Comment #121
carlos8f CreditAttribution: carlos8f commented
Hmm, more fallout from #808560: Node comment statistics are only partially exposed in node_load() and are missing last_comment_uid. Fortunately that was just fixed!
Comment #122
carlos8f CreditAttribution: carlos8f commented
#116: 802446-116-lock-page-render-caches.patch queued for re-testing.
Comment #124
carlos8f CreditAttribution: carlos8f commented
#116: 802446-116-lock-page-render-caches.patch queued for re-testing.
Comment #125
carlos8f CreditAttribution: carlos8f commented
Still RTBC; test fails were not related.
Comment #126
carlos8f CreditAttribution: carlos8f commented
#116: 802446-116-lock-page-render-caches.patch queued for re-testing.
Comment #127
carlos8f CreditAttribution: carlos8f commented
I talked with @David Strauss today about this patch; he noted that when using Varnish, page cache stampedes are already taken care of. That could cause some unnecessary locking, although Pressflow D7 may introduce an opt-out of core page caching (an "External" cache mode), which would defer everything to Varnish instead. Render cache stampede protection is still a nice improvement, though.
There could be cases where fatal errors or uncaught exceptions leave the lock hanging for the duration (lock_release_all() doesn't run in that case). That's a tricky issue; in the case of fatals it's impossible to release the lock, and it will just have to expire. In the case of exceptions, it might be possible to have a global try { } catch { release locks, re-throw exception }. That is another issue, though. I think the current patch is OK to commit.
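That "global try/catch" idea could look roughly like this (a sketch with assumed names standing in for Drupal's lock API, in Python rather than PHP): release any held locks before re-throwing, so an uncaught exception can't leave a lock hanging until it expires.

```python
# Minimal stand-ins for the lock framework (assumed names).
held_locks = set()

def lock_acquire(name):
    held_locks.add(name)

def lock_release_all():
    held_locks.clear()

def handle_request():
    lock_acquire('cache_page:front')
    raise RuntimeError('uncaught exception mid-request')

try:
    try:
        handle_request()
    except Exception:
        lock_release_all()  # release locks, then re-throw
        raise
except RuntimeError:
    pass  # top-level error handler

print(held_locks)  # set(): no lock left hanging
```

As catch notes in the next comment, Drupal already runs lock_release_all() in a shutdown function, which covers much of this in practice.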
Comment #128
catch
lock_release_all() happens in a shutdown function. I haven't personally tested it, but I think others did at the time, and it was pretty resilient to some deliberately triggered fatal errors.
Unless I'm mistaken, while Varnish has a 'grace' setting which allows for stampede protection against expired entries, it doesn't have any protection against stampedes in a cold-start situation (although maybe it's possible to do that with a custom VCL). So this is a slightly different issue. FWIW, I think you can do roughly the equivalent of Pressflow D6 external page caching with vanilla D7 already; see #627496: Document using a null cache for page caching with reverse proxies.
Comment #129
webchick
I tried to look at this tonight, but not only is it way over my head, it's also got Dries's fingerprints all over it. I do also get very nervous about sentences like "If we agree on this approach, I will submit similar patches for block and page caches (others?)." Tons of follow-up patches are not something we have time for at this point. But the trade-off might be worth it, dunno.
Leaving for The Spikey Haired One to make the call.
Comment #130
moshe weitzman CreditAttribution: moshe weitzman commented
@webchick - no worries, that's an ancient comment. The page cache protection is now in the patch, and the block cache isn't getting stampede protection this release.
@dries - please consider that this has been mostly rtbc since may. thanks.
Comment #131
yonailo CreditAttribution: yonailo commented
subscribing
Comment #132
carlos8f CreditAttribution: carlos8f commented
#116: 802446-116-lock-page-render-caches.patch queued for re-testing.
Comment #133
carlos8f CreditAttribution: carlos8f commented
#116: 802446-116-lock-page-render-caches.patch queued for re-testing.
Comment #134
Anonymous (not verified) CreditAttribution: Anonymous commented
Updated #116 to get rid of fuzz; no code changes.
Comment #135
achton
Subscribing.
Comment #136
pillarsdotnet CreditAttribution: pillarsdotnet commented
De-fuzzed again...
Comment #137
mgifford
Subscribe.
Comment #138
mgifford
Subscribe.
Comment #139
catch
I keep thinking about this patch and have concerns about it now. These are similar to the concerns around the locking in variable_initialize(), and #802856: Make lock_wait() wait less is designed to deal with part of this, but I'm not sure if it's enough.
Say you have something that takes 300ms to build on a cache miss, and that is rendered on every page (for example a block in the footer).
Without locking, you'll get a cache stampede - every request does that 300ms of processing until one sets the cache (and all the ones that missed will try to write back to cache).
With locking, only one request tries to build the item, so you save the apache/database CPU time from that 300ms.
However, between acquiring the lock and writing to the cache you have a 300+ms window where every other process hitting the site has to wait.
If they start waiting right at the beginning of the cache miss, then the request will take 700ms longer than it would. If they start to wait right at the end, then it'll be 1300ms after the initial lock_acquire() before those requests are able to continue.
During this time CPU load on the server is going to be low (apart from lock_may_be_available(), which is minimal), but it runs the risk of exceeding MaxClients, since you'll have all these sleeping processes waiting around. What we're effectively doing in those situations is adding a global read lock across the entire site (even with multiple webservers and database servers there'd still be a single read lock).
And the current patch would make this even more pronounced than the current variable patch, because the polling in lock_wait() doubles each time.
For now, I'm just removing the changes to lock_wait() from here, but I think we need to try to come up with a uniform approach to this if possible. For example, we could lock cache writes but not reads, or we could call lock_wait($name, 1) by default in these examples so there's less chance of stacking up Apache processes, then go ahead and build the items anyway, while still not writing back to the cache (or checking the cache before writing back to it).
But there are complex trade-offs here that need a bit more discussion.
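The "wait briefly, then build anyway without writing back" trade-off sketched above could look like this (a Python stand-in for the PHP flow, with helper names assumed): wait at most about a second for the lock; if the rebuild still isn't done, build the item locally but skip the cache write, so requests don't stack up behind one slow rebuild.

```python
import threading
import time

rebuild_lock = threading.Lock()

def rebuild():
    time.sleep(0.01)  # stand-in for the expensive ~300ms build
    return 'built'

def cache_get_or_build(cache, cid):
    if cid in cache:
        return cache[cid]
    if rebuild_lock.acquire(blocking=False):
        try:
            cache[cid] = rebuild()  # we own the rebuild: write back
            return cache[cid]
        finally:
            rebuild_lock.release()
    # Roughly lock_wait($name, 1): poll briefly, re-checking the cache.
    deadline = time.monotonic() + 1.0
    while time.monotonic() < deadline:
        if cid in cache:
            return cache[cid]
        time.sleep(0.025)
    return rebuild()  # gave up waiting: build, but don't write back

cache = {}
print(cache_get_or_build(cache, 'footer_block'))  # 'built'
```

The key property is that a waiter that times out degrades to the no-locking behaviour (duplicate work) instead of holding an Apache slot indefinitely.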
Comment #140
catch
Back to CNR.
Comment #141
pillarsdotnet CreditAttribution: pillarsdotnet commented
I'm getting multiple errors like the following:
This happens in a call to lock_acquire() added by the patch in #139. The problem is that if $conf['page_cache_without_database'] is set, then drupal_page_get_cache() gets called before the database system has been initialized. The #139 patch loads lock.inc but does not load database.inc. Calling lock_acquire() therefore results in the above error.
The attached patch loads both database.inc and lock.inc IF:
cache_get($cid, 'cache_page') returns FALSE, AND
variable_get('page_cache_without_database') returns TRUE.
Should there be a test that generates a page load of modified content with $conf['page_cache_without_database'] set? If so, should that test be in core?
Comment #142
pillarsdotnet CreditAttribution: pillarsdotnet commented
Wow. Testbot is reporting failure to check out the Drupal repository. I wish someone with access to qa.drupal.org would update its status page when stuff like that happens.
I'd post a comment to the qa.drupal.org status page, but my account has apparently been banned from that server.
Comment #143
pillarsdotnet CreditAttribution: pillarsdotnet commented
Fixed silly typo.
Comment #145
pillarsdotnet CreditAttribution: pillarsdotnet commented
#143: cache_stampede-802446-143.patch queued for re-testing.
Comment #146
DanPir CreditAttribution: DanPir commented
subscribe
Comment #147
catch
Just a note that we added stampede protection for empty caches to the 6.x-1.x memcache backend in #962422: Empty cache stampede protection. This (as well as the 'grace' mode for expired items) is configurable in settings.php and disabled by default (since we want to ensure people use memcache-lock.inc with it). The patch still needs to be ported to the 7.x branch.
I am starting to think that cache backends are a better place for this to live, and we should reserve locking in core for where there are actual consistency issues (such as menu_rebuild(), arguably variable caching). A lock_acquire() with memcache is extremely cheap, but this is not going to be the case with the SQL version we ship with, and the cost of that happens whether there is a stampede or not.
I'm not changing status, as I'm interested to see what other people think about this, but since I was a strong supporter of this patch originally, I wanted to confirm that I've changed my mind on it now. If custom/contrib code is building an extremely expensive item and wants to make sure that only one process tries to do this, then it makes sense to do it there, but for the render cache some things may be quite cheap, or unique to each page (or even user), and not that likely to cause a stampede.
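Backend-level stampede protection along the lines described could be sketched like this (class and method names assumed, in Python rather than the memcache module's actual PHP): on an expired entry, the first caller to grab the rebuild lock is reported a miss and rebuilds; everyone else is served the stale copy in the meantime.

```python
import threading
import time

class StampedeProtectedCache:
    def __init__(self):
        self.data = {}  # cid -> (value, expires_at)
        self.rebuild_lock = threading.Lock()

    def get(self, cid):
        entry = self.data.get(cid)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() < expires_at:
            return value  # fresh hit
        if self.rebuild_lock.acquire(blocking=False):
            return None   # this caller is elected to rebuild
        return value      # expired, but someone else is rebuilding

    def set(self, cid, value, ttl=60):
        self.data[cid] = (value, time.monotonic() + ttl)
        if self.rebuild_lock.locked():
            self.rebuild_lock.release()

cache = StampedeProtectedCache()
cache.set('block:footer', 'old', ttl=0)  # already expired
print(cache.get('block:footer'))  # None: first caller takes the rebuild
print(cache.get('block:footer'))  # 'old': others get the stale copy
cache.set('block:footer', 'new')
print(cache.get('block:footer'))  # 'new'
```

This illustrates why the cost profile differs from core's SQL lock backend: callers that lose the election pay only a failed non-blocking acquire, not a polling wait.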
Comment #148
Anonymous (not verified) CreditAttribution: Anonymous commented
I'm ok with #147.
My main concern with this was to stop the recursive locking pattern.
Comment #149
pillarsdotnet CreditAttribution: pillarsdotnet commented
#143: cache_stampede-802446-143.patch queued for re-testing.
Comment #150
catch
Adding the backport tag; this is a task.
Comment #151
xjm
Tagging issues not yet using the summary template.
Comment #152
moshe weitzman CreditAttribution: moshe weitzman commented
In order to evaluate whether we should punt this to contrib, it would be helpful to see how complex that pattern will be for contrib to implement. Any examples out there? Maybe we could add a property to the 'cache' definition which signals to core that we are doing an expensive operation and locking is wanted. Something like #lock => TRUE.
Comment #153
catch
@Moshe - well, #147/memcache has generic stampede protection for all cache entries using the lock framework, so if it's simple enough at that level, it should be OK elsewhere.
I'm going to put this down to normal priority.
I like the idea of #lock => TRUE, tried to build something like that into DrupalCacheArray.
Comment #154
moshe weitzman CreditAttribution: moshe weitzman commented
OK, I think the only todo left might be the #lock => TRUE bit. We could start a new issue for that.
Comment #155
thedavidmeister CreditAttribution: thedavidmeister commented
#143: cache_stampede-802446-143.patch queued for re-testing.
Comment #158
YesCT CreditAttribution: YesCT commented
Putting back the lost tag for needs issue summary update (instructions).
Comment #159
Fabianx CreditAttribution: Fabianx as a volunteer commented
Too bad this never got in... Hopefully we can find a better (parallel) solution for D8...
E.g., serve placeholder data while waiting for the lock, then try again later...
Comment #160
mgifford
Unassigning stale issue. Hopefully someone else will pursue this.
Comment #163
xjm
I don't think this is in a state where it needs committer feedback (six years later), since the proposed resolution is not applicable for D8. It would need to be updated with a new proposal.
(Also we don't use "Needs committer feedback" anymore for D8 core since there are three distinct committer roles.)