The attached simple patch uses the lock system to ensure that only one request rebuilds a given cache item at a time.
Note that the first line of the patch is just cleanup: $cid can't be empty, or we would have left this function a few lines earlier.
If we agree on this approach, I will submit similar patches for the block and page caches (others?).
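The approach can be sketched roughly as follows against D7's lock API. This is an illustrative sketch only: cache_rebuild_item() and the 'cache:' lock-name prefix are hypothetical stand-ins, not names from the actual patch.

```php
<?php
// Sketch of the stampede protection described above: only one request
// rebuilds a given cache item; the rest wait, then re-read the cache.
function cache_get_with_lock($cid, $bin = 'cache') {
  if ($cache = cache_get($cid, $bin)) {
    return $cache->data;
  }
  if (lock_acquire('cache:' . $cid)) {
    // We won the lock: rebuild the item and store it for everyone else.
    $data = cache_rebuild_item($cid);
    cache_set($cid, $data, $bin);
    lock_release('cache:' . $cid);
    return $data;
  }
  // Another request is already rebuilding; wait for it, then read its result.
  lock_wait('cache:' . $cid);
  if ($cache = cache_get($cid, $bin)) {
    return $cache->data;
  }
  // The lock expired or the rebuild failed; fall back to building it ourselves.
  return cache_rebuild_item($cid);
}
```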
Comment | File | Size | Author |
---|---|---|---|
#143 | cache_stampede-802446-143.patch | 6.54 KB | pillarsdotnet |
#141 | cache_stampede-802446-141.patch | 5.29 KB | pillarsdotnet |
#139 | 802446.stampede_0.patch | 4.71 KB | catch |
#136 | 802446.stampede.patch | 5.5 KB | pillarsdotnet |
#134 | 802446.stampede.patch | 5.99 KB | Anonymous (not verified) |
Comments
Comment #1
moshe weitzman commented: I was not handling the 'wait' condition, where subsequent requests don't get the lock.
Comment #2
moshe weitzman commented: Now uses lock_wait() instead of sleep().
Comment #3
catch commented: This looks great to me. I have a D6 site that slows to a crawl whenever content is posted; the locking framework is designed for exactly these cases. My only concern is that it would be nice if lock_wait() checked for the lock release at a granularity of less than a second (say 100ms), but that's not for this patch to deal with.
Comment #4
catch commented: Looked at this again; there's nothing not to like here. RTBC.
Comment #5
moshe weitzman commented: This patch adds a lock for the page cache using the same pattern.
I'm stopping short of the block cache for now, as that has a slightly different flow. Until we add a lock for blocks, it is simple enough for developers who are hurting to convert expensive blocks to use drupal_render() #cache instead, via hook_block_view_alter().
Comment #6
moshe weitzman commented:
Comment #8
moshe weitzman commented: Reroll; changed header() calls to drupal_add_http_header().
Comment #9
catch commented: Do we need the header call for WAIT? It's indented one place too many.
Also, the lock name isn't per page? Locking all page-cache generation across the site seems not ideal to me; we have the requested path by this point already.
Comment #10
moshe weitzman commented: Fixed the lock name for the page cache. That was a silly late change.
I am keeping the WAIT header, as it provides good information about why a cached page was slower than usual (i.e. it had to wait).
I reverted back to header() from drupal_add_http_header(), since the latter could cause our hit/miss/wait header to get cached along with the page, which is confusing.
Comment #11
catch commented: Looks great. Fine with keeping WAIT. Just waiting for the testbot now.
Comment #12
Dries commented: Any idea how we can test the performance impact and correctness of this patch?
Comment #13
moshe weitzman commented: Those are hard to test. I tested correctness by adding a very long sleep after the page cache lock is acquired; I then launched a new anonymous request and verified that this new request went into a holding pattern while the first request "built" the page ... I'm not sure how to test for a performance regression. What I'm intending is a reliability improvement: the site stays up during a busy cache flush.
I asked a couple of lock experts to review this patch. Their support should ameliorate these concerns.
Comment #14
chx commented: This is scary, very very scary. Compared to where this lock framework was used before (the menu rebuild), this is a very frequent operation. I have no idea where we are with deadlocks or, in general, how large the lock overhead is. Does anyone have insight?
Comment #15
moshe weitzman commented: Please, no FUD. That's what I call using phrases like "very, very scary" without presenting any basis.
Requests that are waiting for the cache to be built check once a second until the lock is released. Once it is released, they cache_get() the item they need. Checking once per second means one query to the semaphore table for core's lock.inc, and even faster checks for APC, memcache, etc. See lock_wait().
We have well-known stampede bugs. This patch addresses them with no API change.
Comment #16
catch commented: Lock waiting has very little overhead. The overhead I'd be more concerned about is in http://api.drupal.org/api/function/lock_acquire/7
However, with database caching, cache_set() can be very expensive, let alone building the cached data, so there's a trade-off for stability. Since lock.inc is pluggable, it ought to be possible to reduce the lock_acquire() and lock_wait() overhead to pretty much nothing, at which point you get protection against the site going down (a very real risk at the moment) without any performance impact.
I will see if I can get a basic test case up with ab at some point to verify some of the above.
Comment #17
catch commented: OK, here's a test case:
1. Enable page caching for anonymous users.
2. Apply the attached patch to cache.inc - it just logs that we wrote to the cache.
3. Truncate cache_page.
4. ab -c5 -n50
Without locking:
With locking:
So, the locking works.
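The recipe above can be written out as shell commands. This is an illustrative sketch only: it assumes a local D7 site at http://d7.7 backed by a MySQL database named drupal (names are placeholders), and ab from Apache's httpd tools.

```shell
# Empty the page cache so every concurrent request starts with a miss.
mysql drupal -e "TRUNCATE cache_page;"

# Fire 50 requests at the front page, 5 at a time.
ab -c5 -n50 http://d7.7/

# Without the lock patch, the cache.inc logging hack reports several
# cache writes per flush (a stampede). With it, exactly one write.
```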
Then, to test performance impact, I did the following:
1. ab -c1 -n50 http://d7.7 (acquires the lock but doesn't need the stampede protection - this would be the worst case from a performance standpoint)
HEAD:
Patch:
In this case the patched version was faster, so there's no measurable impact from acquiring the lock and then not using it. A better test case would hit a wider spread of paths, still with a high cache-hit rate, but this is fine as a sanity check.
2. ab -c5 -n50 (stampede protection kicks in)
HEAD:
Patch:
As you can see, in the latter example we lose requests per second, but that's not the interesting part.
Without the patch, there are these long requests:
Each of those is a full page build plus cache_set(), which takes about 200-400ms, with all of that time spent generating the content and writing to the cache. Note this was on HEAD with only one node on the front page.
With the patch, we instead get:
In this case, the 152ms request builds the content and writes it to the cache; the other four requests sit around doing nothing for a second, and only 50ms is spent actually doing anything, since sleep(1) uses no CPU for that second.
So, while raw performance is a little slower, the webserver has done one fifth of the work for four out of five cache misses. On a real site, where a page cache miss might mean a full second to build and write the cache, we'd see no speed impact either.
It's also possible to make lock_wait() more granular, especially with a non-SQL lock.inc but even with the default one; see #802856: Make lock_wait() wait less, which removes wait times for users who are put into the holding pattern on a cache miss. With both patches applied, I ran the same ab test.
This time, the longest request is twice as fast as without the patch: there's no contention for either CPU or database writes while building the cache, and once the lock is released, requests that were made to wait still get the page from cache and return it well within the time they would otherwise have spent building and writing the cache themselves. We might want to leave that more aggressive lock_wait() to a pluggable include, but it shows that this approach can improve overall stability while having zero raw performance cost in terms of user experience.
Comment #18
catch commented: Forgot to attach the hack to cache.inc - this includes both patches as well.
Comment #19
chx commented: I asked others to do research into lock.inc and was told that my request is FUD. lock_acquire() does a database write. That's not much overhead, but it's there. However, it's pluggable, so be it.
There can be no deadlock as far as I can see.
What IS possible, however, is that you recurse so many times that PHP dies. We need recursion protection in there.
Edit: that's only necessary if catch decimates the wait time. If he raises $delay = 30 to $delay = 300, changing the behaviour to the same thing just more fine-grained, then such protection is not needed.
Comment #20
chx commented: OK, so let's go with it, in the hopes the other patch gets in too.
Comment #21
dhthwy commented: To go along with catch's ab benchmark:
I made Drupal clear the cache on every other page request, so I get a HIT/MISS scenario. I couldn't get it to produce a WAIT no matter how much I bombarded it. I'm sure it works, though; I just couldn't create the right conditions for it to fire.
Locking stuff average (MISS): 0.0143499279022158
Other stuff inside (WHEN CACHE MISS) _drupal_bootstrap_page_cache() average: 0.1099051904678
Minimum lock stuff time: 0.00162601470947
Minimum other stuff time: 0.0664360523224
Maximum lock stuff time: 0.150331020355
Maximum other stuff time: 0.228527784348
"Locking stuff average (MISS)" means the elapsed time from the comment "// Cache miss. Avoid a stampede." to the point where it sets the "X-Drupal-Cache: MISS" header.
"Other stuff inside (WHEN CACHE MISS) _drupal_bootstrap_page_cache()" means the elapsed time from the beginning of the function to the point where it sets the "X-Drupal-Cache: MISS" header.
The conclusion is that the locking takes very little time compared to the rest of the function. No opcode caches were in use, by the way.
Arggg, I messed up somewhere writing this down; I'll have to correct these times. Edit: those times should be right now. I've got a Perl app that makes 100 page requests to Drupal; half of them (when they miss) insert those elapsed times into the header. I've run it several times and they're pretty consistent.
Comment #22
lostchord commented: Just out of curiosity... have you fully researched distributed lock management here? Is the "lock system" being referred to a "distributed lock manager"?
If you have a cache shared across a number of distinct processors - a web server farm - then you will need this level of synchronisation, I think. There are all sorts of interesting race conditions and failure modes that need to be considered.
This is not easy. I had discussions with some of the VMS developers who did the first-ever DLM, and it took that amazing development team quite a few iterations to get it right.
cheers
Comment #23
catch commented: PHP is single-threaded; one process can only ever request or hold a lock on one thing at a time, so the only situation where you could have a deadlock is when a lock is acquired and the process then dies with a fatal error before releasing it. In that case the lock will eventually expire and be released. If there are other potential race conditions in the locking framework, they deserve separate issues against lock.inc. The original issue is #251792: Implement a locking framework for long operations.
@dhthwy: to get a WAIT, you'll need to run your Perl script in two processes (or just use ab, which handles that for you).
Comment #24
lostchord commented: @catch
Does this apply to PHP running on separate servers where the database is shared between them?
Comment #25
dhthwy commented: catch, I ran a bunch of Perl procs. Still couldn't get it... hehe. Maybe ab+perl. I did it in Perl so I could read the header stuff I injected from Drupal.
But it's not worth me trying anyway, since you've already tested that and the locking doesn't appear to have a noticeable impact on performance.
Comment #26
Dries commented: Instead of locking and waiting, I'd prefer to come up with a write-through cache. In that scenario, other requests continue to use the old data until it has been replaced in the cache. That requires zero waiting from other requests. Have we considered that?
Comment #27
Damien Tournoud commented: Re: the polling-delay question: the standard database-based lock.inc implementation is nothing more than a cheap baseline. There are several ways to do better, without polling, even with a database-based implementation, but that requires additional privileges (locking, etc.) that most shared hosts cannot provide. The best way to avoid replacing a cache stampede with a lock stampede would probably be to implement decay polling in lock.inc by default.
That said, let's study how much it takes to implement a semi-write-through strategy. (Complete write-through is not possible, because there are many cases where the data itself has completely disappeared, and we have to lock in that case anyway.)
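Decay polling here means backing off the sleep interval between lock checks, so waiters neither hammer the semaphore table nor oversleep once the lock is released. A rough sketch of the idea, using D7's lock_may_be_available(); lock_wait_decay() and the specific delays are hypothetical, not the committed code:

```php
<?php
// Sketch of decay polling: start with short sleeps and double them,
// capped, until the lock is released or we time out.
function lock_wait_decay($name, $max_seconds = 30) {
  $delay = 25000;  // Initial wait of 25ms, in microseconds.
  $elapsed = 0;
  while ($elapsed < $max_seconds * 1000000) {
    if (lock_may_be_available($name)) {
      return FALSE;  // Lock released; the caller can retry cache_get().
    }
    // Lock still held: sleep, then increase the delay (the "decay").
    usleep($delay);
    $elapsed += $delay;
    $delay = min($delay * 2, 500000);  // Cap the back-off at 500ms.
  }
  return TRUE;  // Timed out while waiting.
}
```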
Comment #28
catch commented: This is for absolute cache misses - whenever cache_get() returns FALSE - not for when there is existing, stale data in the cache that needs to be updated. They're two different issues.
At the moment, core's cache.inc returns FALSE for the update case too, so this patch would stop a stampede in that case as well; with a write-through, you'd have cache_get() return the (stale) item instead, so this is entirely compatible with that if we added it in a separate patch.
For reference:
http://tag1consulting.com/patches/pagecache_lock
http://groups.drupal.org/node/59318
http://drupal.org/node/295738
These deal with stale cached data, but not empty data. If we could get that into core too it'd be great, but it doesn't solve all cases.
Comment #29
catch commented: Cross-posted. I still think this belongs where it is, and write-through belongs in cache.inc, so leaving this at CNR rather than CNW.
Comment #30
Damien Tournoud commented: Here, let's at minimum implement decay polling in lock.inc.
Comment #31
catch commented: There's a D6 patch for write-through on #762978: Prevent cache stamped for page cache table; I just bumped that issue to D7.
Here's the same patch with decay polling. I didn't change the minimum wait from 1s; that should stay a separate issue.
Comment #32
catch commented: Brain no worky. Less broken patch.
Comment #33
catch commented: Minus one more stupid error, and with some improvements to the lock.inc change via Damien in IRC.
Comment #34
jriedel commented: @lostchord
Being an old VMS internals guy...
We are not talking about a DLM in this patch. The locking is done by inserting a row in the database, not by the webservers on many different nodes. If your database is clustered, then it is up to the database to handle the locking required to make transactions work correctly.
In VMS I didn't care about the DLM as long as $ENQ/$DEQ calls worked the same on a single node or in a cluster.
Comment #35
moshe weitzman commented: I'm happy with the latest patch. I think we have a lot of experts on this issue now.
Comment #36
lostchord commented: @jriedel
Ditto the old-VMS-internals bit... I'm much happier in VAX assembler and a relative newbie to PHP :-) I think the critical part of your reply is this:
I suspect this will probably not be the case in most 'simple' replication scenarios.
I'm assuming that, as far as locking is concerned, we are dealing with this part of the API, and that the intent is to provide a mutex capability rather than a generalised resource-synchronisation capability. In this context I've got a couple of questions:
I have some other questions about how the lock_* stuff behaves, but I'll dump those in a separate post when I've dug a bit more. Finding the API documentation and code is relatively easy; finding out what the code was actually intended to do (as opposed to what you can see it doing), and the requirements it's based on, is proving a lot harder.
cheers
Comment #37
Dries commented: Is 'building cache' proper English?
Why do we have to call this function recursively? That seems odd, given that we might execute SQL queries and whatnot. For example, the IP blocking code would be executed twice.
Comment #38
catch commented: It looks like the lock protection should go in drupal_page_get_cache() rather than here. Moshe, any reason not to do that?
Comment #39
moshe weitzman commented: This patch avoids duplicate drupal_block_denied() calls.
I looked into avoiding recursion altogether, but the way I did it assumed the cache takes less than x seconds to build, and I can't think of any sensible hard-coded value there. It's simpler, IMO, to add a static as I have in this patch.
Comment #40
moshe weitzman commented: Ditched the new static variable and instead moved the lock_acquire() code into drupal_page_get_cache(), as suggested by catch. This is a touch simpler. We still use recursion, since we don't know how long a page will take to build; it's plausible that it will take more than one 30-second wait period. There is minimal performance cost to the recursion.
Comment #41
moshe weitzman commented: Fixed the code-comment language per Dries' request in #37. No functionality change.
Comment #42
catch commented: I'm wondering if, instead of recursion, we could just do a while (lock_wait()); I will try to roll a patch for that tomorrow. If for some reason that doesn't work out, this is RTBC.
Comment #43
catch commented: I had a look at the while() loop, but it was more verbose than the recursion (iirc; can't believe I left this hanging for a week, sorry :( ). Since the recursion is otherwise completely harmless, and only happens on a cache miss that has failed to acquire a lock anyway, marking this back to RTBC.
Comment #44
catch commented: lock_initialize() moved to _drupal_bootstrap_variables(); removed the call from the patch. No other changes.
Comment #45
moshe weitzman commented: FYI, catch's last change removed a tiny bit of code that was no longer needed due to the variable stampede patch ... wait for green before commit.
Comment #46
moshe weitzman commented: #44: render_cache.patch queued for re-testing.
Comment #47
catch commented: beejeebus noticed a very nasty typo in IRC: no function called _drupal_page_get_cache() exists.
Comment #48
Anonymous (not verified) commented: Discussed an alternative locking implementation with catch. Here's an example implementation that uses do { } while() instead of recursion.
In general, I'd favour retrying n times and then bailing (whether that's a 50x header or whatever the code around the lock decides) over using recursion.
They'll be the same 99.99% of the time, until the locking goes sideways for whatever reason; then the recursion will just cause the processes to loop until death.
In this case, IMO, we can just proceed rather than exit, but we could do something different when $tries == 10 if we wanted to.
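A minimal sketch of the retry-then-proceed pattern being described. This is illustrative only: rebuild_item(), the lock name, and the retry count of 10 are assumptions standing in for the real page/render cache code paths.

```php
<?php
// Sketch of bounded lock retries instead of unbounded recursion.
$tries = 0;
$data = NULL;
do {
  if (lock_acquire('cache_rebuild:' . $cid, 30)) {
    // We hold the lock: rebuild the item and cache it for everyone else.
    $data = rebuild_item($cid);
    cache_set($cid, $data);
    lock_release('cache_rebuild:' . $cid);
    break;
  }
  // Someone else is rebuilding; wait for them, then re-check the cache.
  lock_wait('cache_rebuild:' . $cid);
  if ($cache = cache_get($cid)) {
    $data = $cache->data;
    break;
  }
} while (++$tries < 10);

if ($data === NULL) {
  // Bail out: proceed without the lock rather than looping forever.
  $data = rebuild_item($cid);
}
```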
Comment #49
moshe weitzman commented: I'm fine with the proposed alternative. If you write it, please post it here. In the meantime, this is back to RTBC and is much better than no stampede protection.
Comment #50
Anonymous (not verified) commented: OK, here's a patch based on #48. I'm OK if people decide they just want to get #49 in; I'll reroll in a follow-up issue.
This patch will attempt to get the locks multiple times, then continue to render the page or elements even if lock acquisition fails. Both caching and locking are pluggable, so the site may still be able to serve the request when this happens.
Comment #51
moshe weitzman commented: Did a code and logic review; this is a slight improvement, so back to RTBC.
Comment #53
catch commented: Why does this add the lock initialization back in? That's done in _drupal_bootstrap_variables() now, so it shouldn't be necessary to repeat.
Comment #54
catch commented: Well, except in the case where you set $conf['page_cache_without_database'] = TRUE; then variable_initialize() gets skipped...
Comment #55
Anonymous (not verified) commented: OK, after discussion with catch in #drupal, here's another go.
- The lock_acquire() call looked wrong to me after I read lock_acquire() ;-) We should definitely hold the lock for longer than 1 second, otherwise this stampede protection won't work for non-cached pages that take longer than that to generate.
- Lock initialisation checks $conf['page_cache_without_database'], per #53 and #54.
- Don't let _drupal_bootstrap_page_cache() invoke the bootstrap and exit hooks if $conf['page_cache_without_database'] is TRUE, because it's not safe to do so.
Comment #56
moshe weitzman commented: Nice catch. Back to RTBC.
Comment #57
Anonymous (not verified) commented: catch noticed that I'd used 'initialise' instead of 'initialize'. New patch uses Uhmerican speling, no other changes.
Comment #58
Anonymous (not verified) commented: Ahem.
Comment #60
Anonymous (not verified) commented: One more time.
Comment #61
catch commented: Back to RTBC again.
Comment #62
Anonymous (not verified) commented:
Comment #63
sun commented: Trailing white-space.
Function names should have trailing ().
Comment #64
sun commented: #60: render.lock_.patch queued for re-testing.
Comment #65
catch commented: The comment issues were already in HEAD, but cleaned them up anyway.
Comment #66
Anonymous (not verified) commented: Thanks sun and catch for keeping this alive. Now, how do we get DriesChick to weigh in?
Comment #67
moshe weitzman commented: *Still* RTBC.
Comment #68
Damien Tournoud commented: I do not understand the use of this loop:
Something as simple as:
feels like more than enough.
Comment #69
webchick commented: Sounds like this still needs discussion, based on Damien's last comment.
Comment #70
Anonymous (not verified) commented: I'm OK with Damien's idea. The most important thing is to avoid the broken "hey, let's just recurse until the heat death of the universe, or the PHP process's death, whichever comes first".
Whether we go with #68 or not, I agree we should shorten the wait period. Unless someone says otherwise, I'll post a patch based on #68 later today.
Comment #71
catch commented: I'd want more than 2 seconds for a wait; Drupal.org itself used to take around 30 seconds to load /tracker. However, no loop seems fine here.
Comment #72
Anonymous (not verified) commented: OK, the attached patch addresses #68 and #71.
Comment #73
Anonymous (not verified) commented: Whoops, needs review.
Comment #74
catch commented: Looks fine to me. lock_wait() bails out as soon as possible anyway, so I don't see a reason to set anything but the default there.
Comment #75
moshe weitzman commented: It looks like there are some small differences in the lock code between the page cache and the render cache. Is that deliberate?
Comment #76
Anonymous (not verified) commented: Yes :-)
In the page cache, we want to send a cache-miss header if we acquire the lock.
In the render cache, we want to wait for less time, because we're only dealing with bits of a page.
Comment #77
moshe weitzman commented: IMO it would be helpful to send a cache-miss header for the render cache too; obviously the CID has to be sent as well. If others think that's overkill, then OK.
Comment #78
catch commented: lock_wait() exits as soon as it can anyway; I don't see any reason to specify 5 seconds. If I render-cache something that takes 15 seconds to build (a huge page callback, someone changing a 2-second view into a 20-second one by adding an extra sort, a block on every page that has to make a load of network requests to a slow third-party service; plenty of possibilities), then that's precisely what this patch should add a little bit of protection against.
Comment #79
Anonymous (not verified) commented: OK, the new patch addresses #78. I've left the render-cache headers out; I think that's overkill, but I'll add them back if either catch or DamZ votes they should go in.
Comment #81
Anonymous (not verified) commented: #79: lock.patch queued for re-testing.
Comment #83
moshe weitzman commented: Hmm, the bot is not happy. Anyone?
Comment #84
moshe weitzman commented: #79: lock.patch queued for re-testing.
Comment #85
moshe weitzman commented: I am running run-tests.sh successfully with this patch. Neither I nor justinrendell can reproduce this bot result. Does anyone know how to resolve "Failed to run tests: failed during invocation of run-tests.sh."?
Comment #87
carlos8f commented: Subscribing. The test failure is odd.
Comment #88
Damien Tournoud commented: It seems that #79 makes "Forum functionality (ForumTestCase)" time out. The failure seems legit.
Comment #89
carlos8f commented: Yeah, a timeout is likely. The PIFR bot can't distinguish infinite loops from other errors yet; usually it's reported as a fatal error, but here we're getting a run-tests.sh error. In any case, the forum tests may hold some clues.
Comment #90
Anonymous (not verified) commented: Thanks for the investigating; I'll pick this up and see if I can figure out the failure in the forum test.
Comment #91
Anonymous (not verified) commented: The test failures are probably coming from http://drupal.org/node/886152#comment-3624434 - once that lands I'll reroll and retest.
Comment #92
carlos8f commented: #886152: All fields are hidden after the administrator newly configures a view mode to not use 'default' display has landed.
Comment #93
Anonymous (not verified) commented: Alllllrighty, updated patch for HEAD, no other changes. Let's see what the bot thinks now.
Comment #95
Anonymous (not verified) commented: Oh, blurgh. I still can't reproduce that error; I can successfully run the tests on my machine via run-tests.sh.
I'll chase rfay/boombatower/damz to see if I can get access to a bot to figure this out.
Comment #96
carlos8f commented: @justinrandell, when you ran run-tests.sh, what concurrency did you set? I have a feeling the tests only fail when a higher concurrency is set (the production test bots have concurrency 4+, I think). Concurrency would probably trigger lock_wait(), etc., so that might be the key to reproducing the failure.
I've only skimmed the code in the patch, but I am interested in what changed between #72 and #79 to cause the failure (interdiff, please?). Either HEAD broke the patch, or #79 added something that broke it. Hmm.
Comment #97
carlos8f commented: Also, I'm interested in a unit test being written for this (read #12 and #13; yes, it's difficult, but I think we should try): first because this touches critical bootstrap code in a major way, and second to help debug the run-tests.sh failure, which, if it is due to concurrency, would point to a legitimate bug.
Comment #98
carlos8f commented: Well, actually, run-tests.sh concurrency in theory shouldn't trigger lock_wait(), since each test method gets its own database, and its own set of locks.
And although lock.inc has some rudimentary tests in core, apparently there is no way to actually test concurrent requests with our framework. Maybe we'll need some out-of-the-box thinking on this one.
Comment #99
Anonymous (not verified) commented: OK, let's see what the bot thinks of the patch at #72, rerolled only to remove fuzz. If this passes, then we have a smaller target.
Comment #100
carlos8f commented: Wow! Now we can isolate the change that causes the failure.
It looks like the only change is using '5' as the timeout/delay. Interesting.
Comment #101
carlos8f commented: I think this is the problem:
The lock is supposed to be released in http://api.drupal.org/api/function/drupal_page_footer/7 (via drupal_page_set_cache(), etc.), but it's possible that in some cases that code is not run and the lock is not properly released. In that case, with the timeout being 30 (the default), PHP times out waiting for a lock that was never released. It might be that we need to call lock_release() in a shutdown function to ensure it is called whenever exit() is called, to catch cases like drupal_goto().
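The shutdown-function idea could look roughly like this. A sketch only, assuming PHP 5.3+ closures; the lock name and surrounding flow are illustrative, not the actual patch code:

```php
<?php
// Sketch: make sure the page-cache lock is released even when the
// request ends early via exit(), e.g. from drupal_goto(). Fatal errors
// can still bypass shutdown handlers, in which case the lock simply
// has to expire on its own.
if (lock_acquire('page_cache:' . $cache_key, 30)) {
  register_shutdown_function(function () use ($cache_key) {
    // Releasing a lock that was already released is harmless.
    lock_release('page_cache:' . $cache_key);
  });
  // ... build the page, then drupal_page_set_cache() as usual ...
}
```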
Comment #102
carlos8f commented: OK, taking http://api.drupal.org/api/function/lock_release_all/7 into consideration, all locks should be released on shutdown. Nevertheless, it appears that our lock is not being released. There's some bug here that we don't grok yet.
Comment #103
Anonymous (not verified) commented: OK, here's a wild stab in the dark to clear the locks on initialization. Let's see what the bot thinks.
Comment #104
Anonymous (not verified) commented: Oh boy, that's a bad idea. Please disregard #103.
Comment #106
carlos8f commented: Another perplexing part is that PHP is not supposed to time out due to sleeping (on Linux, anyway)! It still seems like there is a direct correlation between the lock_wait() timeout being 30 seconds in the failed patch and PHP having a global (default) timeout of 30. The passing patch uses 5 as the lock_wait() timeout, and I think it's safe to say it's using the full 5 seconds and continuing silently.
Created #956948: Locking framework ungrokably fails testing to investigate the test failures.
Comment #107
carlos8f commented: Facepalm.
Output:
@chx, can we get a phpwtf on this?
The attached patch is #93 with proper parentheses.
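The thread elides the offending line, but the class of bug being fixed is PHP operator precedence: `&&` binds more tightly than `=`. A hypothetical illustration (not the actual patch code; $use_page_cache is a made-up condition):

```php
<?php
// Buggy: parsed as $cache = (cache_get($cid) && $use_page_cache),
// so $cache ends up as TRUE/FALSE instead of the cache object.
if ($cache = cache_get($cid) && $use_page_cache) {
  // $cache is a boolean here, so $cache->data is broken.
}

// Fixed: the parentheses force the assignment to happen first, so
// $cache holds the object returned by cache_get() (or FALSE).
if (($cache = cache_get($cid)) && $use_page_cache) {
  $data = $cache->data;
}
```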
Comment #108
carlos8f commented: Ignore this one.
Comment #109
Damien Tournoud commented: Basic boolean logic.
Comment #110
sun commented: common.inc looks correct to me; bootstrap.inc does not.
Comment #111
carlos8f commented: @Damien, did you see the patch in #107? #109 is incomplete; it doesn't fix
I believe the patch in #107 is more complete.
And also,
is fine by itself, but #109 changes it. It's only when we add the && that operator-precedence problems happen. I actually prefer to leave out the parens if it's a simple expression like the one above.
Comment #112
Damien Tournoud commented: #109 was a cross-post. We just came to the same conclusion.
Comment #113
carlos8f commented: Uh oh. #107 vs. #109:
The only notable difference is that #107 fixes the page-cache operator precedence, and then we get page-cache test failures. This is exactly the opposite of what to expect: #107 should pass, and #109 should fail!
Regardless, we have moved past the run-tests.sh error by fixing this operator bug. That is at least encouraging, but still, a timeout of 30 vs. 5 shouldn't cause the bot to crash like that. I'll try to expose the bug in #956948: Locking framework ungrokably fails testing (it looks like the bot is caught in an infinite loop there).
Comment #114
carlos8f commented: This simple test shows that in #109 the page-cache logic should be totally broken, and that #107, which contains the parens, should be correct. Yet the bot prefers #109. WTF is going on here? :P
Comment #115
carlos8f commented: In #956948: Locking framework ungrokably fails testing, I've isolated the run-tests.sh failure to the operator-precedence bug. We still need to figure out why the page-cache tests are failing in #107, though.
Comment #116
carlos8f commented: To fix the page-cache tests, all I needed to do was move header('X-Drupal-Cache: MISS') out of the lock_acquire() block and to the bottom of the function. The header should be sent regardless of whether locking succeeded.
Comment #117
moshe weitzman CreditAttribution: moshe weitzman commented
carlos8f gets the bulldog award for getting the tests to pass. Great work.
Comment #118
Anonymous (not verified) CreditAttribution: Anonymous commented
nice work! i think this is RTBC as well.
Comment #119
carlos8f CreditAttribution: carlos8f commented
#116: 802446-116-lock-page-render-caches.patch queued for re-testing.
Comment #121
carlos8f CreditAttribution: carlos8f commented
Hmm, more fallout from #808560: Node comment statistics are only partially exposed in node_load() and are missing last_comment_uid. Fortunately that was just fixed!
Comment #122
carlos8f CreditAttribution: carlos8f commented
#116: 802446-116-lock-page-render-caches.patch queued for re-testing.
Comment #124
carlos8f CreditAttribution: carlos8f commented
#116: 802446-116-lock-page-render-caches.patch queued for re-testing.
Comment #125
carlos8f CreditAttribution: carlos8f commented
Still RTBC; test fails were not related.
Comment #126
carlos8f CreditAttribution: carlos8f commented
#116: 802446-116-lock-page-render-caches.patch queued for re-testing.
Comment #127
carlos8f CreditAttribution: carlos8f commented
I talked with @David Strauss today about this patch; he noted that when using Varnish, page cache stampedes are already taken care of. That could cause some unnecessary locking, although Pressflow D7 may introduce an opt-out of core page caching (an "External" cache mode), which would defer everything to Varnish instead. Render cache stampede protection is still a nice improvement, though.
There could be cases where fatal errors or uncaught exceptions leave the lock hanging for the duration (lock_release_all() doesn't run in that case). That's a tricky issue; in the case of fatals it's impossible to release the lock, and it will just have to expire. In the case of exceptions, it might be possible to have a global try { } catch { release locks, re-throw exception }. That is another issue, though. I think the current patch is OK to commit.
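That "global try/catch" idea could look roughly like this (a sketch with assumed names standing in for Drupal's lock API, in Python rather than PHP): release any held locks before re-throwing, so an uncaught exception can't leave a lock hanging until it expires.

```python
# Minimal stand-ins for the lock framework (assumed names).
held_locks = set()

def lock_acquire(name):
    held_locks.add(name)

def lock_release_all():
    held_locks.clear()

def handle_request():
    lock_acquire('cache_page:front')
    raise RuntimeError('uncaught exception mid-request')

try:
    try:
        handle_request()
    except Exception:
        lock_release_all()  # release locks, then re-throw
        raise
except RuntimeError:
    pass  # top-level error handler

print(held_locks)  # set(): no lock left hanging
```

As catch notes in the next comment, Drupal already runs lock_release_all() in a shutdown function, which covers much of this in practice.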
Comment #128
catch
lock_release_all() happens in a shutdown function. I haven't personally tested it, but I think others did at the time, and it was pretty resilient to some deliberately triggered fatal errors.
Unless I'm mistaken, while Varnish has a 'grace' setting which allows for stampede protection against expired entries, it doesn't have any protection against stampedes in a cold-start situation (although maybe it's possible to do that with a custom VCL). So this is a slightly different issue. FWIW, I think you can do roughly the equivalent of Pressflow D6 external page caching with vanilla D7 already; see #627496: Document using a null cache for page caching with reverse proxies.
Comment #129
webchick
I tried to look at this tonight, but not only is it way over my head, it's also got Dries's fingerprints all over it. I do also get very nervous about sentences like "If we agree on this approach, I will submit similar patches for block and page caches (others?)." Tons of follow-up patches are not something we have time for at this point. But the trade-off might be worth it, dunno.
Leaving for The Spikey Haired One to make the call.
Comment #130
moshe weitzman CreditAttribution: moshe weitzman commented
@webchick - no worries, that's an ancient comment. The page cache protection is now in the patch, and the block cache isn't getting stampede protection this release.
@dries - please consider that this has been mostly rtbc since may. thanks.
Comment #131
yonailo CreditAttribution: yonailo commented
subscribing
Comment #132
carlos8f CreditAttribution: carlos8f commented
#116: 802446-116-lock-page-render-caches.patch queued for re-testing.
Comment #133
carlos8f CreditAttribution: carlos8f commented
#116: 802446-116-lock-page-render-caches.patch queued for re-testing.
Comment #134
Anonymous (not verified) CreditAttribution: Anonymous commented
Updated #116 to get rid of fuzz; no code changes.
Comment #135
achton
Subscribing.
Comment #136
pillarsdotnet CreditAttribution: pillarsdotnet commented
De-fuzzed again...
Comment #137
mgifford
Subscribe.
Comment #138
mgifford
Subscribe.
Comment #139
catch
I keep thinking about this patch and have concerns about it now. These are similar to the concerns around the locking in variable_initialize(), and #802856: Make lock_wait() wait less is designed to deal with part of this, but I'm not sure if it's enough.
Say you have something that takes 300ms to build on a cache miss, and that is rendered on every page (for example a block in the footer).
Without locking, you'll get a cache stampede - every request does that 300ms of processing until one sets the cache (and all the ones that missed will try to write back to cache).
With locking, only one request tries to build the item, so you save the apache/database CPU time from that 300ms.
However, between acquiring the lock and writing to the cache you have a 300+ms window where every other process hitting the site has to wait.
If they start waiting right at the beginning of the cache miss, then the request will take 700ms longer than it would. If they start to wait right at the end, then it'll be 1300ms after the initial lock_acquire() before those requests are able to continue.
During this time CPU load on the server is going to be low (apart from lock_may_be_available(), which is minimal), but it runs the risk of exceeding MaxClients, since you'll have all these sleeping processes waiting around. What we're effectively doing in those situations is adding a global read lock across the entire site (even with multiple webservers and database servers there'd still be a single read lock).
And the current patch would make this even more pronounced than the current variable patch, because the polling in lock_wait() doubles each time.
For now, I'm just removing the changes to lock_wait() from here, but I think we need to try to come up with a uniform approach to this if possible. For example, we could lock cache writes but not reads, or we could call lock_wait($name, 1) by default in these examples so there's less chance of stacking up Apache processes, then go ahead and build the items anyway, while still not writing back to the cache (or checking the cache before writing back to it).
But there are complex trade-offs here that need a bit more discussion.
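The "wait briefly, then build anyway without writing back" trade-off sketched above could look like this (a Python stand-in for the PHP flow, with helper names assumed): wait at most about a second for the lock; if the rebuild still isn't done, build the item locally but skip the cache write, so requests don't stack up behind one slow rebuild.

```python
import threading
import time

rebuild_lock = threading.Lock()

def rebuild():
    time.sleep(0.01)  # stand-in for the expensive ~300ms build
    return 'built'

def cache_get_or_build(cache, cid):
    if cid in cache:
        return cache[cid]
    if rebuild_lock.acquire(blocking=False):
        try:
            cache[cid] = rebuild()  # we own the rebuild: write back
            return cache[cid]
        finally:
            rebuild_lock.release()
    # Roughly lock_wait($name, 1): poll briefly, re-checking the cache.
    deadline = time.monotonic() + 1.0
    while time.monotonic() < deadline:
        if cid in cache:
            return cache[cid]
        time.sleep(0.025)
    return rebuild()  # gave up waiting: build, but don't write back

cache = {}
print(cache_get_or_build(cache, 'footer_block'))  # 'built'
```

The key property is that a waiter that times out degrades to the no-locking behaviour (duplicate work) instead of holding an Apache slot indefinitely.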
Comment #140
catch
Back to CNR.
Comment #141
pillarsdotnet CreditAttribution: pillarsdotnet commented
I'm getting multiple errors like the following:
This happens in a call to lock_acquire() added by the patch in #139. The problem is that if $conf['page_cache_without_database'] is set, then drupal_page_get_cache() gets called before the database system has been initialized. The #139 patch loads lock.inc but does not load database.inc. Calling lock_acquire() therefore results in the above error.
The attached patch loads both database.inc and lock.inc IF:
cache_get($cid, 'cache_page') returns FALSE, AND
variable_get('page_cache_without_database') returns TRUE.
Should there be a test that generates a page load of modified content with $conf['page_cache_without_database'] set? If so, should that test be in core?
Comment #142
pillarsdotnet CreditAttribution: pillarsdotnet commented
Wow. Testbot is reporting failure to check out the Drupal repository. I wish someone with access to qa.drupal.org would update its status page when stuff like that happens.
I'd post a comment to the qa.drupal.org status page, but my account has apparently been banned from that server.
Comment #143
pillarsdotnet CreditAttribution: pillarsdotnet commented
Fixed silly typo.
Comment #145
pillarsdotnet CreditAttribution: pillarsdotnet commented
#143: cache_stampede-802446-143.patch queued for re-testing.
Comment #146
DanPir CreditAttribution: DanPir commented
subscribe
Comment #147
catch
Just a note that we added stampede protection for empty caches to the 6.x-1.x memcache backend in #962422: Empty cache stampede protection. This (as well as the 'grace' mode for expired items) is configurable in settings.php and disabled by default (since we want to ensure people use memcache-lock.inc with it). The patch still needs to be ported to the 7.x branch.
I am starting to think that cache backends are a better place for this to live, and we should reserve locking in core for where there are actual consistency issues (such as menu_rebuild(), arguably variable caching). A lock_acquire() with memcache is extremely cheap, but this is not going to be the case with the SQL version we ship with, and the cost of that happens whether there is a stampede or not.
I'm not changing status, as I'm interested to see what other people think about this, but since I was a strong supporter of this patch originally, I wanted to confirm that I've changed my mind on it now. If custom/contrib code is building an extremely expensive item and wants to make sure that only one process tries to do this, then it makes sense to do it there, but for the render cache some things may be quite cheap, or unique to each page (or even user), and not that likely to cause a stampede.
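Backend-level stampede protection along the lines described could be sketched like this (class and method names assumed, in Python rather than the memcache module's actual PHP): on an expired entry, the first caller to grab the rebuild lock is reported a miss and rebuilds; everyone else is served the stale copy in the meantime.

```python
import threading
import time

class StampedeProtectedCache:
    def __init__(self):
        self.data = {}  # cid -> (value, expires_at)
        self.rebuild_lock = threading.Lock()

    def get(self, cid):
        entry = self.data.get(cid)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() < expires_at:
            return value  # fresh hit
        if self.rebuild_lock.acquire(blocking=False):
            return None   # this caller is elected to rebuild
        return value      # expired, but someone else is rebuilding

    def set(self, cid, value, ttl=60):
        self.data[cid] = (value, time.monotonic() + ttl)
        if self.rebuild_lock.locked():
            self.rebuild_lock.release()

cache = StampedeProtectedCache()
cache.set('block:footer', 'old', ttl=0)  # already expired
print(cache.get('block:footer'))  # None: first caller takes the rebuild
print(cache.get('block:footer'))  # 'old': others get the stale copy
cache.set('block:footer', 'new')
print(cache.get('block:footer'))  # 'new'
```

This illustrates why the cost profile differs from core's SQL lock backend: callers that lose the election pay only a failed non-blocking acquire, not a polling wait.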
Comment #148
Anonymous (not verified) CreditAttribution: Anonymous commented
I'm ok with #147.
My main concern with this was to stop the recursive locking pattern.
Comment #149
pillarsdotnet CreditAttribution: pillarsdotnet commented
#143: cache_stampede-802446-143.patch queued for re-testing.
Comment #150
catch
Adding the backport tag; this is a task.
Comment #151
xjm
Tagging issues not yet using the summary template.
Comment #152
moshe weitzman CreditAttribution: moshe weitzman commented
In order to evaluate whether we should punt this to contrib, it would be helpful to see how complex that pattern will be for contrib to implement. Any examples out there? Maybe we could add a property to the 'cache' definition which signals to core that we are doing an expensive operation and locking is wanted. Something like #lock => TRUE.
Comment #153
catch
@Moshe - well, #147/memcache has generic stampede protection for all cache entries using the lock framework, so if it's simple enough at that level, it should be OK elsewhere.
I'm going to put this down to normal priority.
I like the idea of #lock => TRUE, tried to build something like that into DrupalCacheArray.
Comment #154
moshe weitzman CreditAttribution: moshe weitzman commented
OK, I think the only todo left might be the #lock => TRUE bit. We could start a new issue for that.
Comment #155
thedavidmeister CreditAttribution: thedavidmeister commented
#143: cache_stampede-802446-143.patch queued for re-testing.
Comment #158
YesCT CreditAttribution: YesCT commented
Putting back the lost tag for needs issue summary update (instructions).
Comment #159
Fabianx CreditAttribution: Fabianx as a volunteer commented
Too bad this never got in... Hopefully we can find a better (parallel) solution for D8...
E.g., serve placeholder data while waiting for the lock, then try again later...
Comment #160
mgifford
Unassigning stale issue. Hopefully someone else will pursue this.
Comment #163
xjm
I don't think this is in a state where it needs committer feedback (six years later), since the proposed resolution is not applicable for D8. It would need to be updated with a new proposal.
(Also we don't use "Needs committer feedback" anymore for D8 core since there are three distinct committer roles.)