The attached simple patch uses the lock system to ensure that only one request rebuilds a given cache item at a time.

Note that the first line of the patch is just cleanup. $cid can't be empty, or we would have left this function a few lines earlier.

If we agree on this approach, I will submit similar patches for block and page caches (others?).
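For reference, the locking approach described above can be sketched in miniature like this. This is a self-contained illustration, not the patch itself: the in-memory arrays and every name with a _stub suffix stand in for Drupal's cache_* and lock_* APIs, and the cache ID is made up.

```php
<?php
// Self-contained sketch of the stampede-protection pattern: the first
// request to miss acquires a lock and rebuilds; concurrent requests see
// the lock and read the freshly built item instead of rebuilding too.
$cache = [];
$locks = [];
$builds = 0;

function cache_get_stub($cid) { global $cache; return $cache[$cid] ?? FALSE; }
function cache_set_stub($cid, $data) { global $cache; $cache[$cid] = $data; }
function lock_acquire_stub($name) {
  global $locks;
  if (isset($locks[$name])) {
    return FALSE;
  }
  return $locks[$name] = TRUE;
}
function lock_release_stub($name) { global $locks; unset($locks[$name]); }

function get_or_build($cid, callable $builder) {
  global $builds;
  if (($data = cache_get_stub($cid)) !== FALSE) {
    return $data;
  }
  if (lock_acquire_stub('rebuild:' . $cid)) {
    // We won the lock: rebuild the item exactly once.
    $builds++;
    $data = $builder();
    cache_set_stub($cid, $data);
    lock_release_stub('rebuild:' . $cid);
    return $data;
  }
  // Another request holds the lock; real code would lock_wait(), then
  // re-run cache_get() to pick up the freshly built item.
  return cache_get_stub($cid);
}

get_or_build('front_page', fn() => 'rendered HTML');
get_or_build('front_page', fn() => 'rendered HTML');
```

With two sequential requests, only the first one runs the builder; the second is served from cache.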

Comment  File  Size  Author
#143  cache_stampede-802446-143.patch  6.54 KB  pillarsdotnet
#141  cache_stampede-802446-141.patch  5.29 KB  pillarsdotnet
#139  802446.stampede_0.patch  4.71 KB  catch
#136  802446.stampede.patch  5.5 KB  pillarsdotnet
#134  802446.stampede.patch  5.99 KB  Anonymous (not verified)
#116  802446-116-lock-page-render-caches.patch  5.99 KB  carlos8f
#109  802446-lock-render-proper.patch  5.79 KB  Damien Tournoud
#107  facepalm.patch  6.02 KB  carlos8f
#103  lock.20.patch  6.45 KB  Anonymous (not verified)
#99  lock.802446.72.patch  6.02 KB  Anonymous (not verified)
#93  lock.patch  6.01 KB  Anonymous (not verified)
#79  lock.patch  6.01 KB  Anonymous (not verified)
#72  lock.patch  6.02 KB  Anonymous (not verified)
#65  render.lock__3.patch  6.41 KB  catch
#60  render.lock_.patch  6.4 KB  Anonymous (not verified)
#58  render.lock_.patch  12.81 KB  Anonymous (not verified)
#57  render.lock_.patch  6.42 KB  Anonymous (not verified)
#55  render.lock_.patch  6.42 KB  Anonymous (not verified)
#50  render.lock_.patch  4.46 KB  Anonymous (not verified)
#47  render_cache.patch  3.92 KB  catch
#44  render_cache.patch  3.93 KB  catch
#41  rendercache.diff  3.55 KB  moshe weitzman
#40  rendercache.diff  3.52 KB  moshe weitzman
#39  rendercache.diff  4.17 KB  moshe weitzman
#33  stampede.patch  3.82 KB  catch
#32  stampede.patch  3.8 KB  catch
#31  stampede.patch  3.8 KB  catch
#18  hack.txt  4.36 KB  catch
#10  render_cache.diff  2.52 KB  moshe weitzman
#8  render_cache.diff  2.99 KB  moshe weitzman
#5  render_cache.diff  2.5 KB  moshe weitzman
#2  render_cache.diff  1.37 KB  moshe weitzman
#1  render_cache.diff  1.27 KB  moshe weitzman
render_cache.diff  1.11 KB  moshe weitzman

Comments

moshe weitzman’s picture

FileSize
1.27 KB

I was not handling the 'wait' condition where subsequent requests don't get the lock.

moshe weitzman’s picture

Status: Active » Needs review
FileSize
1.37 KB

Now uses lock_wait() instead of sleep.

catch’s picture

This looks great to me. I have a D6 site which slows to a crawl whenever content is posted; the locking framework is designed for exactly cases like this. My only concern is that it'd be nice if lock_wait() checked for the lock release at a granularity of less than a second (say 100ms), but that's not for this patch to deal with.

catch’s picture

Status: Needs review » Reviewed & tested by the community

Looked at this again, there's nothing not to like here, RTBC.

moshe weitzman’s picture

Status: Reviewed & tested by the community » Needs review
FileSize
2.5 KB

This patch adds a lock for the page cache using the same pattern.

I'm stopping short of the block cache for now, as that has a slightly different flow. Until we add a lock for blocks, it is simple enough for developers who are hurting to convert expensive blocks to use drupal_render() #cache instead, via hook_block_view_alter().

moshe weitzman’s picture

Title: Cache stampede protection: drupal_render() » Cache stampede protection: drupal_render() and page cache

Status: Needs review » Needs work

The last submitted patch, render_cache.diff, failed testing.

moshe weitzman’s picture

Status: Needs work » Needs review
FileSize
2.99 KB

Reroll. Changed header() calls to drupal_add_http_header()

catch’s picture

Status: Needs review » Needs work
+++ includes/bootstrap.inc
@@ -2076,7 +2076,20 @@ function _drupal_bootstrap_page_cache() {
+        // Some other request is building cache. Wait, then re-run this function.
+         drupal_add_http_header('X-Drupal-Cache', 'WAIT');

Do we need the header call for WAIT? Indented one too many places.

Additionally, the lock name isn't per page? Locking all page cache generation across the site seems less than ideal to me; we have the requested path by this point already.


moshe weitzman’s picture

Status: Needs work » Needs review
FileSize
2.52 KB

Fixed lock name for page cache. That was a silly late change.

I am keeping the WAIT header as that provides good information about why a cached page was slower than usual (i.e. it had to wait).

I reverted back to header() from drupal_add_http_header() since the latter could cause our hit/miss/wait header to get cached along with the page which is confusing.

catch’s picture

Issue tags: +Performance, +scalability

Looks great. Fine with keeping WAIT. Just waiting for testbot now.

Dries’s picture

Any idea how we can test the performance impact and correctness of this patch?

moshe weitzman’s picture

It is hard to test those. I tested correctness by adding a very long sleep after the page cache lock is acquired, then launching a new anonymous request and verifying that this new request went into a holding pattern while the first request "built" the page. Not sure how to test for a performance regression. What I'm intending is a reliability improvement: the site stays up during a busy cache flush.

I asked a couple lock experts to review this patch. Their support should ameliorate these concerns.

chx’s picture

This is scary, very very scary. Compared to where this lock framework has been used so far (menu rebuilds), this is a very, very frequent operation. I have no idea where we are with deadlocks or, in general, how large the lock overhead is. Does anyone have insight?

moshe weitzman’s picture

Please, no FUD. That's what I call using phrases like "very, very scary" without presenting any basis.

Requests that are waiting for the cache to be built check once per second until the lock is released. Once it is released, they cache_get() the item they need. Checking once per second means one query to the semaphore table for core's lock.inc, and even faster checks for APC, memcache, etc. See lock_wait().

We have well known stampede bugs. This patch addresses them with no API change.
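A miniature sketch of that waiting behaviour may help. The in-memory $held_locks array stands in for the semaphore table, and the _sketch names are hypothetical, not core functions:

```php
<?php
// Sketch of lock_wait()-style polling: one cheap existence check per
// second until the lock disappears or the delay runs out.
$held_locks = [];

function lock_may_be_available_sketch($name) {
  global $held_locks;
  return !isset($held_locks[$name]);
}

function lock_wait_sketch($name, $delay = 3) {
  while ($delay--) {
    sleep(1);
    if (lock_may_be_available_sketch($name)) {
      // Lock released: the caller now re-runs cache_get() for the item.
      return FALSE;
    }
  }
  // Timed out with the lock still held.
  return TRUE;
}
```

With nobody holding the lock, the function returns FALSE after a single one-second check; a holder that never releases makes it return TRUE after $delay seconds.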

catch’s picture

Lock waiting has very little overhead. The overhead I'd be more concerned about would be in http://api.drupal.org/api/function/lock_acquire/7

However - with database caching, cache_set() can be very expensive, let alone building the cached data, so there's trade off on stability. Since lock.inc is pluggable, it ought to be possible to reduce the lock_acquire() and lock_wait() overhead to pretty much nothing, at which point you have the protection against the site going down (which is a very real risk at the moment), without any performance impact.

I will see if I get a basic test case up with ab at some point to verify some of the above.

catch’s picture

OK here's a test case:

1. Enable page caching for anonymous users.
2. apply attached patch to cache.inc - just log that we wrote to cache.
3. truncate cache_page
4. ab -c5 -n50

without locking:

cache_page
cache_page
cache_page
cache_page
cache_page

with locking:

cache_page

So, the locking works.

Then, to test performance impact, I did the following:

1. ab -c1 -n50 http://d7.7 (acquired lock but we didn't need the stampede protection - this would be the worst case from a performance standpoint)

HEAD:

Concurrency Level:      1
Time taken for tests:   0.557 seconds
Complete requests:      50
Failed requests:        0
Write errors:           0
Total transferred:      346351 bytes
HTML transferred:       322200 bytes
Requests per second:    89.76 [#/sec] (mean)
Time per request:       11.141 [ms] (mean)
Time per request:       11.141 [ms] (mean, across all concurrent requests)
Transfer rate:          607.19 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:     7   11  22.1      8     164
Waiting:        7   11  22.0      7     163
Total:          7   11  22.1      8     164

Patch:

Concurrency Level:      1
Time taken for tests:   0.402 seconds
Complete requests:      50
Failed requests:        0
Write errors:           0
Total transferred:      346351 bytes
HTML transferred:       322200 bytes
Requests per second:    124.27 [#/sec] (mean)
Time per request:       8.047 [ms] (mean)
Time per request:       8.047 [ms] (mean, across all concurrent requests)
Transfer rate:          840.63 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:     5    8  20.2      5     148
Waiting:        4    8  20.1      5     147
Total:          5    8  20.2      5     148

In this case the patched version was faster, so there's no measurable impact from acquiring the lock then not using it. A better test case would hit a higher spread of paths but still with a high cache hit rate, but this is fine from a sanity check standpoint.

2. ab -c5 -n50 (stampede protection kicks in)

HEAD:

Concurrency Level:      5
Time taken for tests:   0.403 seconds
Complete requests:      50
Failed requests:        0
Write errors:           0
Total transferred:      346355 bytes
HTML transferred:       322200 bytes
Requests per second:    124.13 [#/sec] (mean)
Time per request:       40.282 [ms] (mean)
Time per request:       8.056 [ms] (mean, across all concurrent requests)
Transfer rate:          839.68 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   4.6      0      17
Processing:     5   39  94.3      7     386
Waiting:        5   38  93.6      7     385
Total:          5   40  98.7      7     403

Percentage of the requests served within a certain time (ms)
  50%      7
  66%     10
  75%     12
  80%     15
  90%    216
  95%    346
  98%    403
  99%    403
 100%    403 (longest request)

Patch:

Concurrency Level:      5
Time taken for tests:   1.050 seconds
Complete requests:      50
Failed requests:        0
Write errors:           0
Total transferred:      346351 bytes
HTML transferred:       322200 bytes
Requests per second:    47.62 [#/sec] (mean)
Time per request:       104.991 [ms] (mean)
Time per request:       20.998 [ms] (mean, across all concurrent requests)
Transfer rate:          322.15 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    2   6.3      0      24
Processing:     5   90 277.7      6    1030
Waiting:        5   90 277.3      6    1026
Total:          5   92 284.0      6    1050

Percentage of the requests served within a certain time (ms)
  50%      6
  66%      7
  75%      7
  80%      7
  90%    152
  95%   1042
  98%   1050
  99%   1050
 100%   1050 (longest request)

As you can see, in the latter example, we lose requests per second, but that's not the interesting thing.

Without the patch, there are these long requests:

  90%    216
  95%    346
  98%    403
  99%    403

Each of those is a full page build and cache_set() - which takes about 200-400ms, of which all the time is spent generating the content and writing to cache. Note this was on HEAD with only one node on the front page.

With the patch, we instead get

  90%    152
  95%   1042
  98%   1050
  99%   1050
 100%   1050

In this case, the 152ms request builds the content and writes to cache; the other four requests sit around doing nothing for a second, and only 50ms is spent actually doing anything, because sleep(1) uses zero CPU for the second.

So, while raw performance is a little slower, the webserver has done 1/5th the work for four out of five cache misses. On a real site where a page cache miss might mean 1s to build and write the cache, we'd see no speed impact either.

It's also possible, especially with a non-SQL lock.inc, but even with it, to make lock_wait() more granular - see #802856: Make lock_wait() wait less - removing wait times for users who are put into the holding pattern on the cache miss. With both patches applied, I ran the same ab.

Concurrency Level:      5
Time taken for tests:   0.363 seconds
Complete requests:      50
Failed requests:        0
Write errors:           0
Total transferred:      346351 bytes
HTML transferred:       322200 bytes
Requests per second:    137.56 [#/sec] (mean)
Time per request:       36.347 [ms] (mean)
Time per request:       7.269 [ms] (mean, across all concurrent requests)
Transfer rate:          930.56 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   4.9      0      20
Processing:     5   34  63.8     11     240
Waiting:        5   33  62.7     10     231
Total:          5   36  67.9     11     247

Percentage of the requests served within a certain time (ms)
  50%     11
  66%     16
  75%     22
  80%     24
  90%    198
  95%    240
  98%    247
  99%    247
 100%    247 (longest request)

This time, the longest request is twice as fast as without the patch - there's no contention for either CPU or database writes on building the cache, and once the lock is released, requests which are made to wait still get from cache and return the page far within the time which would have otherwise been spent building and writing to cache themselves. We might want to leave that more aggressive lock_wait() to a pluggable include, but it shows that this approach can both improve overall stability and have zero raw performance cost in terms of user experience.

catch’s picture

FileSize
4.36 KB

Forgot to attach the hack to cache.inc - this includes both patches as well.

chx’s picture

Status: Needs review » Needs work

I asked others to do research into lock.inc, which was answered with the claim that my request is FUD. lock_acquire() does a database write. That's not much overhead, but it's there. However, it's pluggable, so be it.

There can be no deadlock as far as I can see.

What IS possible, however, is that you recurse so many times that PHP dies. We need recursion protection in there.

Edit: that's only necessary if catch decimates the wait time. If he raises $delay = 30 to $delay = 300, keeping the behaviour the same but more fine-grained, then such protection is not needed.

chx’s picture

Status: Needs work » Reviewed & tested by the community

OK, so let's go with it in the hopes the other patch gets in too.

dhthwy’s picture

To go along with catch's ab benchmark.

I made it so Drupal clears the cache on every other page request, so I get a HIT/MISS scenario. I couldn't get it to produce a WAIT no matter how much I bombarded it. I'm sure it works, though; I just couldn't create the right conditions for it to fire.

Locking stuff average (MISS): 0.0143499279022158
Other stuff inside (WHEN CACHE MISS) _drupal_bootstrap_page_cache() average: 0.1099051904678

Minimum lock stuff time: 0.00162601470947
Minimum other stuff time: 0.0664360523224

Maximum lock stuff time: 0.150331020355
Maximum other stuff time: 0.228527784348

"Locking stuff average (MISS)" means elapsed time from comment "// Cache miss. Avoid a stampede." to the point it sets the header to "X-Drupal-Cache: MISS".

"Other stuff inside (WHEN CACHE MISS) _drupal_bootstrap_page_cache()" means elapsed time from beginning of function to the point it sets the header to "X-Drupal-Cache: MISS".

Conclusion is that it looks like the locking stuff takes very little time compared to the rest of function. No opcode cachers being used btw.

Arggg I messed up somewhere writing this down.. I'll have to correct these times.

OK, those times should be right now. I've got a Perl app that makes 100 page requests to Drupal; half of them (when they miss) insert those elapsed times into the header. I've run it several times and the numbers are pretty consistent.

lostchord’s picture

Just out of curiosity... you have fully researched distributed lock management here? Is the "lock system" being referred to a "distributed lock manager"?

If you have a cache shared across a number of distinct processors - a web server farm - then you will need this level of synchronisation I think. There are all sorts of interesting race conditions and failure modes that need to be considered.

This is not easy. I had discussions with some of the VMS developers who did the first ever DLM and it took this amazing development team quite a few iterations to get it right.

cheers

catch’s picture

PHP is single threaded - one process can only ever request or secure a lock on one thing at a time, so the only situation where you could have a deadlock is where a lock is acquired then the process dies with a fatal error before releasing the lock - in that case the lock will expire and be released eventually. If there are other potential race conditions in the locking framework, then those deserve separate issues against lock.inc. The original issue is here #251792: Implement a locking framework for long operations.

@dhthwy - to get a WAIT, you'll need to run your perl script in two processes (or just use ab which handles that for you).

lostchord’s picture

@catch

This applies to PHP running on separate servers where the database is shared between the servers?

dhthwy’s picture

catch, I ran a bunch of Perl procs and still couldn't get it... hehe. Maybe ab + Perl. I did it in Perl so I could get the header stuff I injected from Drupal.
But it's not worth me trying any further anyway, since you've already tested that and the locking doesn't appear to have a noticeable impact on performance.

Dries’s picture

Instead of locking and waiting, I'd prefer to come up with a write-through cache. In such a scenario, other requests continue to use the old data until it is replaced in the cache. That requires zero waiting from other requests. Have we considered that?
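A rough illustration of the write-through idea, assuming an in-memory store and an 'expire' field (all names here are hypothetical, not core's cache API): if an item is merely stale and someone else is already rebuilding it, serve the old copy with zero waiting; only a truly empty cache forces a build.

```php
<?php
// Sketch of a semi-write-through strategy: stale data keeps being served
// while one request rebuilds. The store, lock array, and wt_* names are
// stand-ins invented for this example.
$store = [];
$locks = [];

function wt_lock_acquire($name) {
  global $locks;
  if (isset($locks[$name])) {
    return FALSE;
  }
  return $locks[$name] = TRUE;
}

function wt_get($cid, callable $builder, $ttl = 60) {
  global $store, $locks;
  $item = $store[$cid] ?? NULL;
  if ($item && $item['expire'] > time()) {
    return $item['data'];                  // Fresh hit.
  }
  if ($item && !wt_lock_acquire('rebuild:' . $cid)) {
    return $item['data'];                  // Stale, but zero waiting.
  }
  // Either nothing is cached at all, or we won the rebuild lock.
  $data = $builder();
  $store[$cid] = ['data' => $data, 'expire' => time() + $ttl];
  unset($locks['rebuild:' . $cid]);
  return $data;
}
```

Note the limitation Damien raises below still applies: when no old copy exists at all, there is nothing stale to serve, so the empty-cache case must build (and, in real code, lock) anyway.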

Damien Tournoud’s picture

Status: Reviewed & tested by the community » Needs work

Re: the polling delay question: the standard, database-based lock.inc implementation is nothing more than a cheap baseline. There are several ways to do better, without polling, even with a database-based implementation, but that requires additional privileges (locking, etc.) that most shared hosts cannot provide.

The best way to avoid replacing a cache stampede with a lock stampede would probably be to implement decay polling in lock.inc by default.

That said, let's study how much it takes to implement a semi-write-through strategy. (Complete write-through is not possible, because there are a lot of cases where the data itself has completely disappeared, and we have to lock in that case anyway.)

catch’s picture

Status: Needs work » Reviewed & tested by the community

This is for absolute cache misses - whenever cache_get() returns FALSE, not for when there is existing, stale, data in the cache which needs to be updated. They're two different issues.

At the moment core's cache.inc returns FALSE for the update case too, so this patch would stop a stampede in that case; but with a write-through, cache_get() would return the stale item instead, so this is entirely compatible with that if we added it in a separate patch.

For reference:

http://tag1consulting.com/patches/pagecache_lock
http://groups.drupal.org/node/59318
http://drupal.org/node/295738

These deal with stale cached data, but not with empty data. If we could get that into core too it'd be great, but it doesn't solve all cases.

catch’s picture

Status: Reviewed & tested by the community » Needs review

Cross-posted. I still think this belongs where it is, and write-through belongs in cache.inc, so leaving at CNR rather than CNW.

Damien Tournoud’s picture

Status: Needs review » Needs work

In here, let's at the minimum implement decay polling in lock.inc.

catch’s picture

Status: Needs work » Needs review
FileSize
3.8 KB

There's a D6 patch for write through on #762978: Prevent cache stamped for page cache table, I just bumped that issue to D7.

Here's the same patch with decay polling. I didn't change the minimum wait from 1s; that should stay in a separate issue.
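For readers unfamiliar with the term, "decay polling" here means starting with very short sleeps and backing off. A sketch under assumed intervals (25ms doubling toward a 500ms cap, not the patch's actual values; the in-memory lock check is a stand-in for the real backend):

```php
<?php
// Hypothetical decay-polling lock_wait(): a waiter notices a released
// lock within milliseconds instead of a full second, without hammering
// the lock backend once the wait drags on.
$held_locks = [];

function lock_still_held($name) {
  global $held_locks;
  return isset($held_locks[$name]);
}

function lock_wait_decay($name, $delay = 30) {
  $sleep = 25000;                      // Start at 25ms (microseconds).
  $waited = 0;
  while ($waited < $delay * 1000000) {
    usleep($sleep);
    $waited += $sleep;
    if (!lock_still_held($name)) {
      return FALSE;                    // Released within the delay.
    }
    $sleep = min($sleep * 2, 500000);  // Decay toward 500ms checks.
  }
  return TRUE;                         // Still held after $delay seconds.
}
```

The return convention mirrors lock_wait(): FALSE means the lock went away and the caller can re-check the cache; TRUE means the wait timed out.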

catch’s picture

FileSize
3.8 KB

Brain no worky. Less broken patch.

catch’s picture

FileSize
3.82 KB

Minus one more stupid error, and with some improvements to the lock.inc change via Damien in irc.

jriedel’s picture

@lostchord

Being an old VMS internals guy...

We are not talking about DLM in this patch. The locking is being done by inserting a row in the database, not by the webservers on many different nodes. If your database is clustered, then it is up to the database to worry about the locking required to make transactions work correctly.

In VMS I didn't care about DLM as long as $ENQ/$DEQ calls worked the same on single node or in a cluster.

moshe weitzman’s picture

Status: Needs review » Reviewed & tested by the community

I'm happy with latest patch. I think we have a lot of experts in this issue now.

lostchord’s picture

@jriedel

Ditto the old VMS internals bit... I'm much happier in VAX assembler and a relative newbie to PHP :-) I think the critical part in your reply is this:

If your database is clustered, then it is up to the database to worry about the locking required to make transactions work correctly.

I suspect this will probably not be the case in most 'simple' replication scenarios.

I'm assuming that as far as locking is concerned we are dealing with this part of the API and the intent is to provide a mutex capability rather than a generalised resource synchronisation capability. In this context I've got a couple of questions:

  • Would it be true to say that if I'm sharing tables between sites using common prefixes then the semaphore table would have to be shared as well? (Actually I've just noticed that this form of sharing seems to have been discouraged as of March 4th...that was an attractive feature).
  • The integrity of the system relies on independent module developers choosing non-conflicting lock names. Wouldn't it be better to force modules to register lock names and detect conflicts before they potentially cause trouble?

I have some other questions regarding how the lock_* stuff behaves, but I'll dump those in a separate post when I've dug a bit more. Finding API documentation and code is relatively easy; finding out what the code was actually intended to do (as opposed to what you can see it doing), and the requirements it's based on, is proving a lot harder.

cheers

Dries’s picture

+++ includes/bootstrap.inc	20 May 2010 12:21:53 -0000
@@ -2076,7 +2076,20 @@ function _drupal_bootstrap_page_cache() 
+        // Some other request is building cache. Wait, then re-run this function.

Is 'building cache' proper English?

+++ includes/bootstrap.inc	20 May 2010 12:21:53 -0000
@@ -2076,7 +2076,20 @@ function _drupal_bootstrap_page_cache() 
+        return _drupal_bootstrap_page_cache();

Why do we have to call this function recursively? That seems odd given that we might execute SQL queries and whatnot. For example, the IP blocking code would be executed twice.

catch’s picture

Status: Reviewed & tested by the community » Needs work

It looks like the lock protection should go in drupal_page_get_cache() rather than here. Moshe, any reason not to do that?

moshe weitzman’s picture

Status: Needs work » Needs review
FileSize
4.17 KB

This patch avoids duplicate drupal_block_denied() calls.

I looked into avoiding recursion altogether, but the way I did it assumed that the cache takes less than x seconds to build, and I can't think of any sensible hard-coded value there. It's simpler, IMO, to add a static as I have in this patch.

moshe weitzman’s picture

FileSize
3.52 KB

Ditched the new static variable and instead moved the lock_acquire() code into drupal_page_get_cache() as suggested by catch. This is a touch simpler. We still use recursion, as we don't know how long a page will take to build; it's plausible that it will take more than one 30-second wait period. There is minimal performance cost to the recursion.

moshe weitzman’s picture

FileSize
3.55 KB

Fixed the code comment language per Dries' request in #37. No functionality change.

catch’s picture

I'm wondering if instead of recursion we could just do a while (lock_wait()); I will try to roll a patch for that tomorrow. If for some reason it doesn't work out, this is RTBC.

catch’s picture

Status: Needs review » Reviewed & tested by the community

I had a look at the while() loop but it was more verbose than recursion (iirc, can't believe I left this hanging for a week, sorry :( ). Since the recursion is otherwise completely harmless, and only happens on a cache miss that's failed to acquire a lock anyway, marking this back to RTBC.

catch’s picture

Status: Reviewed & tested by the community » Needs review
FileSize
3.93 KB

lock_initialize() moved to _drupal_bootstrap_variables(), removed the call from the patch, no other changes.

moshe weitzman’s picture

Status: Needs review » Reviewed & tested by the community

FYI, catch's last change removed a tiny bit of code that was no longer needed due to the variable stampede patch... wait for green before commit.

moshe weitzman’s picture

#44: render_cache.patch queued for re-testing.

catch’s picture

Status: Reviewed & tested by the community » Needs review
FileSize
3.92 KB

beejeebus noticed a very nasty typo in irc. No function is called _drupal_page_get_cache().

Anonymous’s picture

discussed an alternative locking implementation with catch. here's an example implementation that uses do { } while() instead of recursion.

in general, i'd favour retrying n times, then bailing (whether that's a 50x header or whatever the code around the lock decides) over using recursion.

they'll be the same 99.99% of the time, but when the locking goes sideways for whatever reason, the recursion will just cause the processes to loop until death.

function drupal_page_get_cache($check_only = FALSE) {
  global $base_root;
  static $cache_hit = FALSE;

  if ($check_only) {
    return $cache_hit;
  }

  if (drupal_page_is_cacheable()) {
    $name = $base_root . request_uri();
    $tries = 0;
    $lock_acquired = FALSE;
    require_once DRUPAL_ROOT . '/' . variable_get('lock_inc', 'includes/lock.inc');
    lock_initialize();
    do {
      if (!$cache = cache_get($name, 'cache_page')) {
        if (!$lock_acquired = lock_acquire($name, 1)) {
          lock_wait($name);
        }
        else {
          // Proceed with the page build; cache its result, then release the lock.
          header('X-Drupal-Cache: MISS');
        }
      }
    } while ($cache === FALSE && $lock_acquired === FALSE && ++$tries < 10);

    if ($cache !== FALSE) {
      $cache_hit = TRUE;
    }
    return $cache;
  }
}

in this case, IMO we can just proceed rather than exit, but we could do something different when $tries == 10 if we wanted to.

moshe weitzman’s picture

Status: Needs review » Reviewed & tested by the community

I'm fine with the proposed alternative. If you write it, please post it here. In the meanwhile, this is back to RTBC, and it is much better than no stampede protection.

Anonymous’s picture

Status: Reviewed & tested by the community » Needs review
FileSize
4.46 KB

ok, here's a patch based on #48. i'm ok if people decide they just want to get #49 in; i'll reroll in a follow-up issue.

this patch will attempt to get the locks multiple times, then continue to render the page or elements even if lock acquisition fails. both caching and locking are pluggable, so the site may still be able to serve the request when this happens.

moshe weitzman’s picture

Status: Needs review » Reviewed & tested by the community

Did a code and logic review and this is a slight improvement so back to RTBC.


catch’s picture

Why does this add the lock initialization back in? That's done in _drupal_bootstrap_variables() now so shouldn't be necessary to repeat.

catch’s picture

Well except in the case where you set $conf['page_cache_without_database'] = TRUE, then variable_initialize() gets skipped...

Anonymous’s picture

Status: Reviewed & tested by the community » Needs review
FileSize
6.42 KB

ok, after discussion with catch in #drupal, here's another go.

- the lock_acquire() call looked wrong to me after i read lock_acquire() ;-) we should definitely hold the lock for longer than one second, otherwise this stampede protection won't work for non-cached pages that take longer than that to generate

- lock initialisation checks $conf['page_cache_without_database'] as per #53 and #54

- don't let _drupal_bootstrap_page_cache() invoke the bootstrap and exit hooks if $conf['page_cache_without_database'] is TRUE, because it's not safe to do so

moshe weitzman’s picture

Status: Needs review » Reviewed & tested by the community

Nice catch. Back to RTBC.

Anonymous’s picture

FileSize
6.42 KB

catch noticed that i'd used 'initialise' instead of 'initialize'. new patch uses Uhmerican speling, no other changes.

Anonymous’s picture

FileSize
12.81 KB

ahem.

Status: Reviewed & tested by the community » Needs work

The last submitted patch, render.lock_.patch, failed testing.

Anonymous’s picture

Status: Needs work » Needs review
FileSize
6.4 KB

one more time.

catch’s picture

Status: Needs review » Reviewed & tested by the community

Back to RTBC again.

Anonymous’s picture

Issue tags: +Needs committer feedback
sun’s picture

+++ includes/bootstrap.inc	13 Jul 2010 07:13:37 -0000
@@ -870,11 +870,35 @@ function drupal_page_get_cache($check_on
+    // the database, drupal_page_get_cache() could get called before the lock ¶

+++ includes/common.inc	13 Jul 2010 07:13:40 -0000
@@ -5235,7 +5236,17 @@ function drupal_render_cache_get($elemen
+  ¶

Trailing white-space.

+++ includes/bootstrap.inc	13 Jul 2010 07:13:37 -0000
@@ -2127,23 +2155,18 @@ function _drupal_bootstrap_page_cache() 
+      // If the skipping of bootstrap hooks is not enforced, call hook_boot.
...
+      // If the skipping of bootstrap hooks is not enforced, call hook_exit.

Function names should have trailing ().


sun’s picture

#60: render.lock_.patch queued for re-testing.

catch’s picture

FileSize
6.41 KB

Comment issues were already in HEAD, but cleaned up anyway.

Anonymous’s picture

thanks sun and catch for keeping this alive. now how do we get DriesChick to weigh in?

moshe weitzman’s picture

Priority: Normal » Major

*Still* RTBC.

Damien Tournoud’s picture

+ // Try to get either the cached page or a lock five times. If that fails,
+ // continue with page execution, because the site may still be able to
+ // serve the request.

I do not understand the use of this loop:

  • It's the job of lock_wait() to loop (and it already does)
  • The default $delay of lock_wait() is 30s, so this patch would make Drupal wait for up to 150s before proceeding to non-cached page rendering (!!!!)

Something as simple as:

$cache = cache_get($base_root . request_uri(), 'cache_page');
if (!$cache && !$lock_acquired = lock_acquire($name)) {
  lock_wait($name, 2);
  $cache = cache_get($base_root . request_uri(), 'cache_page');
  if (!$cache) {
    $lock_acquired = lock_acquire($name);
  }
}

Feels like more than enough.

webchick’s picture

Status: Reviewed & tested by the community » Needs review

Sounds like this still needs discussion based on Damien's last comment.

Anonymous’s picture

Status: Needs review » Needs work

i'm ok with Damien's idea. the most important thing is to avoid the broken "hey, let's just recurse until the heat death of the universe, or the php process death, whichever comes first".

whether we go with #68 or not, i agree we should shorten the wait period. unless someone says otherwise, i'll post a patch based on #68 later today.

catch’s picture

I'd want more than 2 seconds for a wait; Drupal.org itself used to take around 30 seconds to load /tracker. However, no loop seems fine here.

Anonymous’s picture

FileSize
6.02 KB

ok, attached patch addresses #68 and #71.

Anonymous’s picture

Status: Needs work » Needs review

woops, needs review.

catch’s picture

Looks fine to me. lock_wait() bails out as soon as possible anyway, so I don't see a reason to set anything but the default there.

moshe weitzman’s picture

It looks like there are some small differences in the lock code between page cache and render cache. Is that deliberate?

Anonymous’s picture

yes :-)

in the page cache, we want to send a cache miss header if we acquire the lock.

in the render cache, we want to wait for less time, because we're only dealing with bits of a page.

moshe weitzman’s picture

IMO it is helpful to send a cache miss header for the render cache too; obviously the CID has to be sent as well. If others think that’s overkill, then OK.

catch’s picture

lock_wait() exits as soon as it can anyway, I don't see any reason to specify 5 seconds. If I render cache something that takes 15 seconds to build (a huge page callback, someone changes a 2 second view to a 20 second one by adding an extra sort on it, a block on every page that has to make a load of network requests to a slow third party service, plenty of possibilities), then that's precisely what this patch should add a little bit of protection against.

Anonymous’s picture

FileSize
6.01 KB

ok, new patch addresses #78. i've left the render-cache headers out, i think that's overkill, but i'll add it back if either catch or DamZ votes it should go in.

Status: Needs review » Needs work
Issue tags: -Performance, -scalability, -Needs committer feedback

The last submitted patch, lock.patch, failed testing.

Anonymous’s picture

Status: Needs work » Needs review

#79: lock.patch queued for re-testing.

Status: Needs review » Needs work
Issue tags: +Performance, +scalability, +Needs committer feedback

The last submitted patch, lock.patch, failed testing.

moshe weitzman’s picture

Hmm. bot is not happy. Anyone?

moshe weitzman’s picture

Status: Needs work » Needs review

#79: lock.patch queued for re-testing.

moshe weitzman’s picture

I am running run-tests.sh successfully with this patch. Neither justinrandell nor I can reproduce this bot result. Anyone know how to resolve "Failed to run tests: failed during invocation of run-tests.sh."?

Status: Needs review » Needs work

The last submitted patch, lock.patch, failed testing.

carlos8f’s picture

Subscribing. Test failure is odd.

Damien Tournoud’s picture

It seems that #79 makes "Forum functionality (ForumTestCase)" timeout. The failure seems legit.

carlos8f’s picture

Yeah, a timeout is likely. The PIFR bot can't distinguish infinite loops from other errors yet. Usually it's reported as a fatal error, but here we're getting a run-tests.sh error. In any case, the forum tests may have some clues.

Anonymous’s picture

thanks for the investigation, i'll pick this up and see if i can figure out the failure in the forum test.

Anonymous’s picture

the test fails are probably coming from http://drupal.org/node/886152#comment-3624434 - once that lands i'll reroll and retest.

Anonymous’s picture

Status: Needs work » Needs review
FileSize
6.01 KB

alllllrighty, updated patch for HEAD, no other changes, let's see what the bot thinks now.

Status: Needs review » Needs work

The last submitted patch, lock.patch, failed testing.

Anonymous’s picture

oh blurgh. i still can't reproduce that error. i can successfully run the tests on my machine via run-tests.sh.

i'll chase rfay/boombatower/damz to see if i can get access to a bot to figure this out.

carlos8f’s picture

@justinrandell, when you ran run-tests.sh, what concurrency did you set? I have a feeling the tests only fail when a higher concurrency is set (the production test bots have concurrency 4+, I think). Concurrency would probably trigger lock_wait(), etc. so that might be the key to reproducing the failure.

I've only skimmed through the code in the patch, but I am interested in what changed between #72 and #79 to cause the failure (interdiff please?). Either HEAD broke the patch, or #79 added something that broke it. Hmm.

carlos8f’s picture

Issue tags: +Needs tests

Also, I'm interested in a unit test being written for this (read #12 - #13; yes, it's difficult, but I think we should try), for two reasons: this touches critical bootstrap code in a major way, and it would help debug the run-tests.sh failure, which, if it is due to concurrency, would point to a legitimate bug.

carlos8f’s picture

Well, actually run-tests.sh concurrency in theory shouldn't trigger lock_wait(), since each test method gets its own database, and its own set of locks.

And although lock.inc has some rudimentary tests in core, apparently there is no way to actually test concurrent requests with our framework. Maybe we'll need some thinking out-of-the-box on this one.

Anonymous’s picture

Status: Needs work » Needs review
FileSize
6.02 KB

ok, let's see what the bot thinks of the patch at #72, rerolled only to remove fuzz. if this passes, then we have a smaller target.

carlos8f’s picture

Wow! Now we can isolate the change that causes failure.

interdiff lock.802446.72.patch lock_13.patch 
diff -u includes/common.inc includes/common.inc
--- includes/common.inc	29 Oct 2010 03:16:31 -0000
+++ includes/common.inc	28 Oct 2010 23:23:55 -0000
@@ -5574,10 +5574,10 @@
   }
   $bin = isset($elements['#cache']['bin']) ? $elements['#cache']['bin'] : 'cache';
 
-  if (!$cache = cache_get($cid, $bin) && !lock_acquire($cid, 5)) {
-    lock_wait($cid, 5);
+  if (!$cache = cache_get($cid, $bin) && !lock_acquire($cid)) {
+    lock_wait($cid);
     if (!$cache = cache_get($cid, $bin)) {
-      lock_acquire($cid, 5);
+      lock_acquire($cid);
     }
   }

Looks like the only change is using '5' as the timeout/delay. Interesting.

carlos8f’s picture

I think this is the problem:

The lock is supposed to be released in http://api.drupal.org/api/function/drupal_page_footer/7 (via drupal_page_set_cache(), etc.), but it's possible that in some cases that code does not run and the lock is never properly released. In that case, with the timeout at the default of 30, PHP times out waiting for a lock that was never released. It might be that we need to call lock_release() in a shutdown function to ensure it runs whenever exit() is called, to catch cases like drupal_goto().

carlos8f’s picture

OK, taking http://api.drupal.org/api/function/lock_release_all/7 into consideration, all locks should be released on shutdown. Nevertheless, it appears that our lock is not being released. There's some bug here that we don't grok yet.

Anonymous’s picture

FileSize
6.45 KB

ok, here's a wild stab in the dark to clear the locks on initialization. let's see what the bot thinks.

Anonymous’s picture

oh boy, that's a bad idea. please disregard #103.

Status: Needs review » Needs work

The last submitted patch, lock.20.patch, failed testing.

carlos8f’s picture

Status: Needs work » Needs review

Another perplexing part is that PHP is not supposed to time out due to sleeping! (on Linux anyway). It still seems like there is a direct correlation between the lock_wait() timeout being 30 seconds in the failed patch, and PHP having a global (default) timeout of 30. The passing patch uses 5 as lock_wait() timeout, and I think it's safe to say that it's using the full 5 seconds and continuing silently.

Created #956948: Locking framework ungrokably fails testing to investigate test failures.

carlos8f’s picture

FileSize
6.02 KB

Facepalm.


header('Content-Type: text/plain; charset=utf-8');

function foo() {
  return FALSE;
}

function bar() {
  return TRUE;
}

if ($cache = foo() && bar()) {
  print "A is wrong!\n";
}

if (!$cache = foo() && !bar()) {
  print "B is wrong!\n";
}

if (!($cache = foo()) && !bar()) {
  print "C is wrong!\n";
}

Output:

B is wrong!

@chx, can we get a phpwtf on this?

Attached patch is #93 with proper parentheses.

interdiff lock_13.patch facepalm.patch 
diff -u includes/bootstrap.inc includes/bootstrap.inc
--- includes/bootstrap.inc	28 Oct 2010 23:23:51 -0000
+++ includes/bootstrap.inc	28 Oct 2010 23:23:51 -0000
@@ -869,7 +869,7 @@
     }
 
     $cid = $base_root . request_uri();
-    if (!$cache = cache_get($cid, 'cache_page') && !lock_acquire($cid)) {
+    if (!($cache = cache_get($cid, 'cache_page')) && !lock_acquire($cid)) {
       lock_wait($cid);
       if (!$cache = cache_get($cid, 'cache_page')) {
         if (lock_acquire($cid)) {
diff -u includes/common.inc includes/common.inc
--- includes/common.inc	28 Oct 2010 23:23:55 -0000
+++ includes/common.inc	28 Oct 2010 23:23:55 -0000
@@ -5574,7 +5574,7 @@
   }
   $bin = isset($elements['#cache']['bin']) ? $elements['#cache']['bin'] : 'cache';
 
-  if (!$cache = cache_get($cid, $bin) && !lock_acquire($cid)) {
+  if (!($cache = cache_get($cid, $bin)) && !lock_acquire($cid)) {
     lock_wait($cid);
     if (!$cache = cache_get($cid, $bin)) {
       lock_acquire($cid);
carlos8f’s picture

ignore this one

Damien Tournoud’s picture

Basic boolean logic.

sun’s picture

+++ includes/bootstrap.inc
@@ -860,11 +860,29 @@ function drupal_page_get_cache($check_only = FALSE) {
+    if (!$cache = cache_get($cid, 'cache_page') && !lock_acquire($cid)) {
...
+      if (!$cache = cache_get($cid, 'cache_page')) {

+++ includes/common.inc
@@ -5573,7 +5574,14 @@ function drupal_render_cache_get($elements) {
+  if (!($cache = cache_get($cid, $bin)) && !lock_acquire($cid)) {
...
+    if (!($cache = cache_get($cid, $bin))) {

common.inc looks correct to me, bootstrap.inc not.

carlos8f’s picture

@Damien, did you see the patch in #107? #109 is incomplete, it doesn't fix

if (!$cache = cache_get($cid, 'cache_page') && !lock_acquire($cid)) {

I believe the patch in #107 is more complete.

And also,

if (!$cache = cache_get($cid, $bin)) {

is fine by itself, but #109 changes it. It's only when we add the && that problems happen with operator precedence; I actually prefer to leave out the parens for a simple expression like the above.

Damien Tournoud’s picture

#109 was a crosspost. We just came to the same conclusion.

carlos8f’s picture

Oh oh. #107 vs. #109:

interdiff facepalm.patch 802446-lock-render-proper.patch
diff -u includes/bootstrap.inc includes/bootstrap.inc
--- includes/bootstrap.inc	28 Oct 2010 23:23:51 -0000
+++ includes/bootstrap.inc
@@ -869,7 +869,7 @@
     }
 
     $cid = $base_root . request_uri();
-    if (!($cache = cache_get($cid, 'cache_page')) && !lock_acquire($cid)) {
+    if (!$cache = cache_get($cid, 'cache_page') && !lock_acquire($cid)) {
       lock_wait($cid);
       if (!$cache = cache_get($cid, 'cache_page')) {
         if (lock_acquire($cid)) {
diff -u includes/common.inc includes/common.inc
--- includes/common.inc	28 Oct 2010 23:23:55 -0000
+++ includes/common.inc
@@ -5576,7 +5576,7 @@
 
   if (!($cache = cache_get($cid, $bin)) && !lock_acquire($cid)) {
     lock_wait($cid);
-    if (!$cache = cache_get($cid, $bin)) {
+    if (!($cache = cache_get($cid, $bin))) {
       lock_acquire($cid);
     }
   }

The only notable difference is that #107 fixes the page cache operator precedence. And then we get page cache test fails. This is exactly the opposite of what we'd expect: #107 should pass, #109 should fail!

Regardless, we have moved past the run-tests.sh error by fixing this operator bug. That is at least encouraging, but still, a timeout of 30 vs. 5 shouldn't cause the bot to crash like that. I'll try to expose the bug in #956948: Locking framework ungrokably fails testing (looks like the bot is caught in an infinite loop there).

carlos8f’s picture

header('Content-Type: text/plain; charset=utf-8');

function foo() {
  print "foo\n";
  return FALSE;
}

function bar() {
  print "bar\n";
  return TRUE;
}

print "testing B\n";
if (!$cache = foo() && !bar()) {
  print "B is wrong!\n";
}

print "testing C\n";
if (!($cache = foo()) && !bar()) {
  print "C is wrong!\n";
}

print "testing E\n";
if (!$cache = bar() && !bar()) {
  print "E is wrong!\n";
}

print "testing F\n";
if (!($cache = bar()) && !bar()) {
  print "F is wrong!\n";
}
testing B
foo
B is wrong!
testing C
foo
bar
testing E
bar
bar
E is wrong!
testing F
bar

This simple test shows that in #109 the page cache logic should be totally broken, and #107, which contains the paren's, should be correct. Yet the bot prefers #109. WTF is going on here? :P

carlos8f’s picture

Status: Needs review » Needs work

In #956948: Locking framework ungrokably fails testing I've isolated the run-tests.sh failure to the operator precedence bug. We still need to figure out why page cache tests are failing in #107 though.

carlos8f’s picture

Status: Needs work » Needs review
FileSize
5.99 KB

To fix the page cache tests, all I needed to do was move header('X-Drupal-Cache: MISS') out of the lock_acquire() block and to the bottom of the function. The header should get sent regardless of whether locking succeeded.

moshe weitzman’s picture

Status: Needs review » Reviewed & tested by the community

carlos8f gets the bulldog award for getting the tests to pass. great work.

Anonymous’s picture

nice work! i think this is RTBC as well.

carlos8f’s picture

Issue tags: -Performance, -Needs tests, -scalability, -Needs committer feedback

Status: Reviewed & tested by the community » Needs work
Issue tags: +Performance, +Needs tests, +scalability, +Needs committer feedback

The last submitted patch, 802446-116-lock-page-render-caches.patch, failed testing.

carlos8f’s picture

Status: Needs work » Reviewed & tested by the community
carlos8f’s picture

Issue tags: -Performance, -Needs tests, -scalability, -Needs committer feedback

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 802446-116-lock-page-render-caches.patch, failed testing.

carlos8f’s picture

Status: Needs work » Needs review
Issue tags: +Performance, +Needs tests, +scalability, +Needs committer feedback
carlos8f’s picture

Status: Needs review » Reviewed & tested by the community

Still RTBC; test fails were not related.

carlos8f’s picture

carlos8f’s picture

I talked with @David Strauss today about this patch; he noted that when using varnish, page cache stampedes are already taken care of. That could cause some unnecessary locking, although Pressflow d7 may introduce an opt-out of core page caching ("External" cache mode), which would defer everything to varnish instead. Render cache stampede protection is still a nice improvement though.

There could be cases where fatal errors or uncaught exceptions leave the lock hanging for the duration (lock_release_all() doesn't run in that case). That's a tricky issue; in the case of fatals it's impossible to release the lock and it will just have to expire. In the case of exceptions, it might be possible to have a global try { } catch { release locks, re-throw exception }. That is another issue though. I think the current patch is OK to commit.
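The "global try/catch that releases locks and re-throws" idea could look roughly like this (a Python sketch with hypothetical lock helpers; fatal errors, as noted, still bypass any such guard):

```python
# Hypothetical lock bookkeeping; not Drupal's lock API.
held_locks = set()

def lock_acquire(name):
    held_locks.add(name)

def lock_release_all():
    held_locks.clear()

def handle_request(dispatch):
    """On an uncaught exception, release any held locks before re-raising,
    so other requests aren't left waiting until the locks expire."""
    try:
        return dispatch()
    except Exception:
        lock_release_all()
        raise

def build_page():
    lock_acquire("cache_page:/front")
    raise RuntimeError("boom")  # simulated uncaught exception mid-build
```

The exception still propagates to the caller; the guard only guarantees the locks are gone before it does.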

catch’s picture

lock_release_all() happens in a shutdown function. I haven't personally tested it but I think others did at the time and this was pretty resilient to some deliberately triggered fatal errors.

Unless I'm mistaken, while varnish has a 'grace' setting which allows for stampede protection against expired entries, it doesn't have any protection against stampedes in a cold start situation (although maybe it's possible to do that with a custom vcl). So this is a slightly different issue. fwiw I think you can do roughly the equivalent to pressflow D6 external page caching with vanilla D7 already, see #627496: Document using a null cache for page caching with reverse proxies.

webchick’s picture

I tried to look at this tonight, but not only is it way over my head, it's also got Dries's fingerprints all over it. I do also get very nervous about sentences like "If we agree on this approach, I will submit similar patches for block and page caches (others?)." Tons of follow-up patches are not something we have time for at this point. But the trade-off might be worth it, dunno.

Leaving for The Spikey Haired One to make the call.

moshe weitzman’s picture

@webchick - no worries, that's an ancient comment. The page cache protection is now in the patch and the block cache isn't getting stampede protection this release.

@dries - please consider that this has been mostly rtbc since may. thanks.

yonailo’s picture

subscribing

carlos8f’s picture

Issue tags: -Performance, -Needs tests, -scalability, -Needs committer feedback
carlos8f’s picture

Issue tags: +Performance, +Needs tests, +scalability, +Needs committer feedback
Anonymous’s picture

FileSize
5.99 KB

updated #116 to get rid of fuzz, no code changes.

achton’s picture

Subscribing.

pillarsdotnet’s picture

FileSize
5.5 KB

De-fuzzed again...

mgifford’s picture

Subscribe.

catch’s picture

FileSize
4.71 KB

I keep thinking about this patch and have concerns about it now. These are similar to the concerns around the locking in variable_initialize(), and #802856: Make lock_wait() wait less is designed to deal with part of this, but I'm not sure if it's enough.

Say you have something that takes 300ms to build on a cache miss, and that is rendered on every page (for example a block in the footer).

Without locking, you'll get a cache stampede - every request does that 300ms of processing until one sets the cache (and all the ones that missed will try to write back to cache).

With locking, only one request tries to build the item, so you save the apache/database CPU time from that 300ms.

However between acquiring the lock and writing to cache you have a 300+ms window where every other process hitting the site has to wait.

If they start waiting right at the beginning of the cache miss, then the request will take 700ms longer than it would. If they start to wait right at the end, then it'll be 1300ms after the initial lock_acquire() before those requests are able to continue.

During this time CPU load on the server is going to be low (apart from lock_may_be_available() which is minimal), but it runs the risk of exceeding MaxClients - since you'll have all these sleeping processes waiting around. What we're effectively doing in those situations is adding a global read lock across the entire site (even with multiple webservers and database servers there'd still be a single read lock).

And the current patch would make this even more pronounced than the current variable patch, because the polling in lock_wait() doubles each time.

For now, I'm just removing the changes to lock_wait() from here, but I think we need to try to come up with a uniform approach to this if possible. For example we could lock cache writes but not reads, or we could lock_wait($name, 1); by default in these examples so there's less chance of stacking up apache processes, then go ahead and build the items anyway, while still not writing back to cache (or checking the cache before writing back to it).

But there are complex trade-offs here that need a bit more discussion.
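One of those trade-offs (wait briefly, then build anyway without writing back) might be sketched like this, with hypothetical in-memory cache/lock helpers rather than Drupal's API:

```python
import threading
import time

# Hypothetical stand-ins for the cache and lock subsystems.
cache = {}
locks = {}

def try_lock(name):
    return locks.setdefault(name, threading.Lock()).acquire(blocking=False)

def release(name):
    locks[name].release()

def fetch_with_bounded_wait(cid, build, max_wait=1.0):
    """Never sleep longer than max_wait: a slow build degrades into some
    duplicated work instead of a site-wide read lock."""
    if (item := cache.get(cid)) is not None:
        return item
    if try_lock(cid):
        try:
            item = build()
            cache[cid] = item      # only the lock holder writes back
        finally:
            release(cid)
        return item
    # Lock is held elsewhere: poll the cache for at most max_wait seconds...
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        if (item := cache.get(cid)) is not None:
            return item
        time.sleep(0.01)
    # ...then give up and build anyway, without writing back, so sleeping
    # processes never pile up behind a single slow builder.
    return build()
```

Under this policy the worst case for a waiter is max_wait plus its own build time, rather than the full lock_wait() timeout.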

catch’s picture

Status: Reviewed & tested by the community » Needs review

Back to CNR.

pillarsdotnet’s picture

I'm getting multiple errors like the following:

PHP Fatal error: Call to undefined function db_insert() in /var/www/includes/lock.inc on line 125

This happens in a call to lock_acquire() added by the patch in #139.

The problem is that if $conf['page_cache_without_database'] is set, then drupal_page_get_cache() gets called before the database system has been initialized. The #139 patch loads lock.inc but does not load database.inc, so calling lock_acquire() results in the above error.

The attached patch loads both database.inc and lock.inc IF:

  • cache_get($cid, 'cache_page') returns FALSE,   AND
  • variable_get('page_cache_without_database') returns TRUE

Should there be a test that generates a page load of modified content with $conf['page_cache_without_database'] set? If so, should that test be in core?

pillarsdotnet’s picture

Wow. Testbot is reporting failure to check out the drupal repository. I wish someone with access to qa.drupal.org would update its status page when stuff like that happens.

I'd post a comment to the qa.drupal.org status page, but my account has apparently been banned from that server.

pillarsdotnet’s picture

Fixed silly typo.

Status: Needs review » Needs work
Issue tags: -Performance, -Needs tests, -scalability, -Needs committer feedback

The last submitted patch, cache_stampede-802446-143.patch, failed testing.

pillarsdotnet’s picture

Status: Needs work » Needs review
Issue tags: +Performance, +Needs tests, +scalability, +Needs committer feedback

#143: cache_stampede-802446-143.patch queued for re-testing.

DanPir’s picture

subscribe

catch’s picture

Version: 7.x-dev » 8.x-dev

Just a note that we added stampede protection for empty caches to the 6.x-1.x memcache backend in #962422: Empty cache stampede protection. This (as well as the 'grace' mode for expired items) is configurable in settings.php, disabled by default (since we want to ensure people use memcache-lock.inc with it). The patch still needs to be ported to the 7.x branch.

I am starting to think that cache backends are a better place for this to live, and we should reserve locking in core for where there are actual consistency issues (such as menu_rebuild(), arguably variable caching). A lock_acquire() with memcache is extremely cheap, but this is not going to be the case with the SQL version we ship with, and the cost of that happens whether there is a stampede or not.

Not changing status, interested to see what other people think about this, but since I was a strong supporter of this patch originally, I wanted to confirm that I've changed my mind on it now. If custom/contrib code is building an extremely expensive item and wants to make sure that only one process tries to do this, then it makes sense to do it there - but for the render cache some things may be quite cheap, or unique to each page (even user) and not that likely to cause a stampede.

Anonymous’s picture

I'm ok with #147.

My main concern with this was to stop the recursive locking pattern.

pillarsdotnet’s picture

#143: cache_stampede-802446-143.patch queued for re-testing.

catch’s picture

Category: bug » task
Issue tags: +Needs backport to D7

Adding backport tag, this is a task.

xjm’s picture

Tagging issues not yet using summary template.

moshe weitzman’s picture

In order to evaluate whether we should punt to contrib, it would be helpful to see how complex that pattern will be for contrib to implement. Any examples out there? Maybe we could add a property to the 'cache' definition which signals to core that we are doing an expensive operation and locking is wanted. Something like #lock => TRUE.

catch’s picture

Priority: Major » Normal

@Moshe - well #147/memcache has generic stampede protection for all cache entries using the lock framework, so if it's simple enough at that level, it should be OK elsewhere.

I'm going to put this down to normal priority.

I like the idea of #lock => TRUE, tried to build something like that into DrupalCacheArray.

moshe weitzman’s picture

OK, I think the only todo left might be the #lock = TRUE bit. We could start a new issue for that.

thedavidmeister’s picture

Issue tags: -Performance, -Needs tests, -Needs issue summary update, -scalability, -Needs committer feedback, -Needs backport to D7

#143: cache_stampede-802446-143.patch queued for re-testing.

Status: Needs review » Needs work

The last submitted patch, cache_stampede-802446-143.patch, failed testing.

YesCT’s picture

Issue summary: View changes
Issue tags: +Needs issue summary update

putting back lost tag for needs issue summary update (instructions)

Fabianx’s picture

Too bad this never got in ... Hopefully we can find a better (parallel) solution for D8 ...

E.g. serve placeholder data while waiting for the lock, then try again later ...

mgifford’s picture

Assigned: moshe weitzman » Unassigned

Unassigning stale issue. Hopefully someone else will pursue this.

Version: 8.0.x-dev » 8.1.x-dev

Drupal 8.0.6 was released on April 6 and is the final bugfix release for the Drupal 8.0.x series. Drupal 8.0.x will not receive any further development aside from security fixes. Drupal 8.1.0-rc1 is now available and sites should prepare to update to 8.1.0.

Bug reports should be targeted against the 8.1.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.9 was released on September 7 and is the final bugfix release for the Drupal 8.1.x series. Drupal 8.1.x will not receive any further development aside from security fixes. Drupal 8.2.0-rc1 is now available and sites should prepare to upgrade to 8.2.0.

Bug reports should be targeted against the 8.2.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

xjm’s picture

Issue tags: -Needs committer feedback

I don't think this is in a state where it needs committer feedback (six years later) since the proposed resolution is not applicable for D8. It would need to be updated with a new proposal.

(Also we don't use "Needs committer feedback" anymore for D8 core since there are three distinct committer roles.)

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.6 was released on February 1, 2017 and is the final full bugfix release for the Drupal 8.2.x series. Drupal 8.2.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.3.0 on April 5, 2017. (Drupal 8.3.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.3.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.6 was released on August 2, 2017 and is the final full bugfix release for the Drupal 8.3.x series. Drupal 8.3.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.4.0 on October 4, 2017. (Drupal 8.4.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.4.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.4 was released on January 3, 2018 and is the final full bugfix release for the Drupal 8.4.x series. Drupal 8.4.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.5.0 on March 7, 2018. (Drupal 8.5.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.5.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.6 was released on August 1, 2018 and is the final bugfix release for the Drupal 8.5.x series. Drupal 8.5.x will not receive any further development aside from security fixes. Sites should prepare to update to 8.6.0 on September 5, 2018. (Drupal 8.6.0-rc1 is available for testing.)

Bug reports should be targeted against the 8.6.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.6.x-dev » 8.8.x-dev

Drupal 8.6.x will not receive any further development aside from security fixes. Bug reports should be targeted against the 8.8.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.9.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 8.8.x-dev » 8.9.x-dev

Drupal 8.8.7 was released on June 3, 2020 and is the final full bugfix release for the Drupal 8.8.x series. Drupal 8.8.x will not receive any further development aside from security fixes. Sites should prepare to update to Drupal 8.9.0 or Drupal 9.0.0 for ongoing support.

Bug reports should be targeted against the 8.9.x-dev branch from now on, and new development or disruptive changes should be targeted against the 9.1.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 8.9.x-dev » 9.2.x-dev

Drupal 8 is end-of-life as of November 17, 2021. There will not be further changes made to Drupal 8. Bugfixes are now made to the 9.3.x and higher branches only. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.2.x-dev » 9.3.x-dev

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.15 was released on June 1st, 2022 and is the final full bugfix release for the Drupal 9.3.x series. Drupal 9.3.x will not receive any further development aside from security fixes. Drupal 9 bug reports should be targeted for the 9.4.x-dev branch from now on, and new development or disruptive changes should be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.4.x-dev » 9.5.x-dev

Drupal 9.4.9 was released on December 7, 2022 and is the final full bugfix release for the Drupal 9.4.x series. Drupal 9.4.x will not receive any further development aside from security fixes. Drupal 9 bug reports should be targeted for the 9.5.x-dev branch from now on, and new development or disruptive changes should be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.5.x-dev » 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.