Problem/Motivation

There's a lock in D7's variable_initialize() which is often unnecessary and can cause performance problems under high load / concurrency, especially if the variable system is being abused / frequently updated.

Steps to reproduce

https://git.drupalcode.org/project/drupal/-/blob/7.74/includes/bootstrap...

Proposed resolution

#3 has a patch from @Damien Tournoud which avoids waiting on a lock in the event of a cache miss, and simply loads variables from the db. This is an improvement.

We're also including an opt-out via a variable, e.g.:

$conf['variable_initialize_wait_for_lock'] = TRUE;

...which will revert to the existing behaviour, in case any sites need to do so.

Remaining tasks

Commit the patch.

User interface changes

None.

API changes

Addition of the variable in default.settings.php

Data model changes

None.

Release notes snippet

Can be taken from notes above.

See also #125 by @David Rothstein, #177 from @Steven Jones and #187 from @Fabianx


Original Issue Summary

Original title was "variable_set() should rebuild the variable cache, not just clear it"

Much of the discussion before e.g. #125 was around alternative but related problems / solutions.

Problem

Variable caching has a number of inter-related issues.

Memory and cache object size
All variables are loaded from the {variables} table regardless of whether they're in use or not, and some older Drupal sites can have over 2,000 variables in the table since many dynamic variables (and etc.) are not properly tracked between upgrades and module uninstalls. This leads to a large cache item, and since it is assigned to a global, also high memory usage in the critical path. The cache item size ranges from around 100k on a brand new site to 1.4mb or more on a site with lots of contrib modules that has been around for a while.

Locking
When variable_set() is called (which some modules and even core do a lot, the variable cache is cleared entirely. It then has to be rebuilt in full by the next request. In Drupal 7 we added a lock so that only one request does this at a time, which prevents a stampede on the variable cache itself, but can also mean processes hanging around waiting until the single process that acquired the lock is done - since it is a global read lock on every single process that hits the Drupal install.

If you have an operation that involves a large number of variable_set() - for example rebuilding css and js aggregation with a large number of bundles - #886488: [D7] Add stampede protection for css and js aggregation, or a failing drupal_http_request() due to a third party service going down, #959620: drupal_http_request() can lead to variable_set() stampede, then this happens:

* Process 1 calls variable_set(), clears the cache.
* Process 2 acquires a lock, starts to rebuild the cache.
* Processes 3-200 all get a cache miss and fail to acquire the lock - since this is variable_initialize() that's for every single request to the site - cached pages, imagecache generation, you name it.
* Process 2 frees the lock after 100ms or so.
* Process 4 acquires the lock since it wants to do a variable_set() again.
* process 5-200 are still waiting.
* Despite load being low, apache runs out of spare threads, connections get backed up etc.

Proposed solution
Since the variable cache is a single array that includes lots of stuff that is rarely used, it is a good candidate for DrupalCacheArray. This has the following benefits:

- variables that are stale in the database or only requested very rarely don't get put into the cache item.
- there is no longer a need for a global read lock on the site (however we add a write lock for writes to reduce race condition windows - something that does not exist in core now).
- The variable cache is converted to write-through - so there is no need to do a full rebuild each time there is a cache clear.
- Because DrupalCacheArray builds up the cache object over time, the first few requests can issue a dozen or three small database queries instead of waiting for a massive db_query() and unserialization of hundreds or thousands of items before they're able to complete, which should reduce performance spikes on cold starts (with the trade off of some extra writes to the cache as it gets built).

CommentFileSizeAuthor
#190 973436-189.patch2.01 KBmcdruid
#172 interdiff_171_to_172.txt1.4 KBjoseph.olstad
#172 D7-variable_set_variable_get-973436-172.patch2.97 KBjoseph.olstad
#172 D7-tests_only-973436-172.patch1.82 KBjoseph.olstad
#171 D7-tests_only-973436-171.patch1.81 KBjoseph.olstad
#171 interdiff_169_to_171.txt1.4 KBjoseph.olstad
#171 D7-variable_set_variable_get-973436-171.patch2.96 KBjoseph.olstad
#169 D7-variable_set_variable_get-973436-168.patch2.96 KBjoseph.olstad
#166 tests_only-973436-166.patch1.81 KBjoseph.olstad
#165 tests_only-973436.patch1.51 KBjoseph.olstad
#156 973436-156.patch1.97 KBmcdruid
#156 interdiff-973436-136-156.txt1.78 KBmcdruid
#136 973436-non-bothering-variable-cache-136.patch1.15 KBjoseph.olstad
#113 drupal-n973436-113.patch3.84 KBDamienMcKenna
#111 drupal-n973436-111.patch3.8 KBDamienMcKenna
#110 drupal-n2258313-theme.txt1.18 KBDamienMcKenna
#101 drupal8.variable-cache.101.patch7.98 KBsun
#97 drupal8.variable-cache.96.patch7.48 KBsun
#94 variable_cache.patch6.21 KBcatch
#92 variable_cache.patch17.68 KBcatch
#90 variable_cache.patch5.04 KBcatch
#82 973436-82-variable-cache.patch5.95 KBAnonymous (not verified)
#75 variable_cache_0.patch5.98 KBcatch
#72 variable_set-973436-72.patch3.87 KBkarschsp
#71 variable_set-973436-71.patch3.87 KBkarschsp
#66 variable_set-973436-67.patch3.87 KBkarschsp
#65 variable_set-973436-65.patch3.87 KBkarschsp
#63 variable_set-973436-63-d7.patch4.39 KBpillarsdotnet
#62 variable_set-973436-62-d7.patch24.76 KBpillarsdotnet
#62 variable_set-973436-62.patch4.41 KBpillarsdotnet
#58 variable_cache_locking_hell-D6.patch5.81 KBdsobon
#57 variable_cache_locking_hell-D6.patch3.06 KBdsobon
#43 variable-cache-rebuild-976436-43-D6.patch3.04 KBkentr
#41 973436.variable.patch4.05 KBAnonymous (not verified)
#40 variable_cache_rebuild_976436.patch4.06 KBcatch
#38 variable-cache-rebuild.patch3.91 KBAnonymous (not verified)
#31 973436-31-variable-cache-rebuild.patch5.01 KBcarlos8f
#24 973436-24-variable-cache-rebuild.patch4.99 KBcarlos8f
#20 variables.patch4.99 KBAnonymous (not verified)
#14 variables.patch4.24 KBAnonymous (not verified)
#12 973436-variable_cache.patch3.64 KBcatch
#6 973436-variable-set.patch2.74 KBcatch
#4 variables.patch2.15 KBcatch
#3 973436-non-bothering-variable-cache.patch1.34 KBDamien Tournoud
variable_set_writethrough.patch1 KBcatch
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Status: Needs review » Needs work

The last submitted patch, variable_set_writethrough.patch, failed testing.

Damien Tournoud’s picture

As an alternative strategy, we could also just load the whole variable table while a cache is being rebuilt.

Damien Tournoud’s picture

Status: Needs work » Needs review
FileSize
1.34 KB

Implementation of #2.

catch’s picture

FileSize
2.15 KB

Hmm this is a tricky one, let's sum up the options:

- D6 - complete stampede, every thread that gets a cache miss races to set the cache item first, with db caching this can cause duplicate key errors due to the merge query.
- HEAD - no stampede, but variable_set() or a cache clear causes a global read lock for every single process hitting the server.
- My patch, adds a global /write/ lock to variable_set(), but leaves the read lock unchanged for cold start situations.
- Damien's patch, removes the global lock except for actually setting the cache, so there'll still be a stampede on cache misses, but only the first process actually attempts to set the cache entry, leaves writes unchanged.

In general the issue here is less when there's a single cache miss, but more when there's several variable_set() happening in sequence (css/js rebuilding, saving a really long admin form, badly behaving contrib module - of which there are dozens that mis-use variable_set() - here's a recent example #954642: Simplemenu resets the variable-cache for every request.

Weighing all of these up, I'd be very tempted to combine both mine and Damien's patches. The result of this would be:

No duplicate key error when there's a cache miss stampede, no stampede at all on variable_set(), the actual process that triggers the variable_set() will take a bit longer.

Attached a patch that does this. Didn't review the install failure yet.

Status: Needs review » Needs work

The last submitted patch, variables.patch, failed testing.

catch’s picture

Status: Needs work » Needs review
Issue tags: +Performance, +scalability
FileSize
2.74 KB

Fixed up the install failures.

Also removed the recursion and went for a similar approach to http://drupal.org/node/802446 - if we can't acquire the lock, lock_wait() once then continue to go ahead once that's done, only doing the write through if we're able to acquire the lock, otherwise clearing the cache as usual.

I'm not 100% convinced by this yet myself but uploading for review.

catch’s picture

Two comments from chx via irc, leaving at CNR for the overall approach:

1. The lock_initialize() in variable_set() looks weird, I pointed out it's only needed in the installer - chx says: so we should move it there.
2. the db query, cache_set(), lock_release could potentially be moved to a helper function shared betweeen variable_initialize() and variable_set()

These are both fair enough but I don't have time to debug the installer this evening.

Also chx doesn't want any more locking added to core, but I pointed out in this case, the lock only exists to prevent an (albeit extremely short) race condition between db_query() and cache_set(), and having the write through here should mean there's much less chance of the lock code in variable_initialize() ever running, so there should be less locking overall :p

nnewton’s picture

RE point 1: The require_once and lock_init in variable_set.
I agree that the lock_init looks weird, but more importantly introducing a require_once into variable_set may kill performance on servers that don't have the apc require_once override enabled. This will introduce _alot_ of realpath calls to a very hot path to find the true path of the lock file, no?

Anonymous’s picture

Status: Needs review » Needs work

seems we have the same race in variable_del().

a process hitting a cold cache could retrieve the variables from permanent storage just before the delete, then lose the cache delete/set race, and set the cache with out of date data.

here's an attempt at a wrapper function around the variable cache that both variable_del and variable_set use.

can't make a patch because the wifi on this plane doesn't do cvs.

/**
 * Load the persistent variable table.
 *
 * The variable table is composed of values that have been saved in the table
 * with variable_set() as well as those explicitly specified in the configuration
 * file.
 */
function variable_initialize($conf = array()) {
  // NOTE: caching the variables improves performance by 20% when serving
  // cached pages.
  if ($cached = cache_get('variables', 'cache_bootstrap')) {
    $variables = $cached->data;
  }
  else {
    // Signal to variable_reset_cache() that this is a read-only operation.
    // In this case, if no lock can be acquired, the variable values are
    // returned without writing to the persistent cache to avoid a race
    // between reading and writing to the cache.
    $variables = variable_reset_cache(FALSE);
  }

  foreach ($conf as $name => $value) {
    $variables[$name] = $value;
  }

  return $variables;
}

/**
 * Sets a persistent variable.
 *
 * Case-sensitivity of the variable_* functions depends on the database
 * collation used. To avoid problems, always use lower case for persistent
 * variable names.
 *
 * @param $name
 *   The name of the variable to set.
 * @param $value
 *   The value to set. This can be any PHP data type; these functions take care
 *   of serialization as necessary.
 *
 * @see variable_del()
 * @see variable_get()
 */
function variable_set($name, $value) {
  global $conf;
  $conf[$name] = $value;

  db_merge('variable')->key(array('name' => $name))->fields(array('value' => serialize($value)))->execute();
  variable_reset_cache();
}

/**
 * Reset the variable cache safely.
 *
 * @param bool $variable_write
 */
function variable_reset_cache($variable_write = TRUE) {
  $lock_name = __FUNCTION__;
  if (!$lock_acquired = lock_acquire($lock_name, 1) && $variable_write) {
    // When we've made a change via variable_set() or variable_del(), we try
    // harder to get the lock.
    lock_wait($lock_name);
    $lock_acquired = lock_acquire($lock_name, 1);
  }
  $variables = array_map('unserialize', db_query('SELECT name, value FROM {variable}')->fetchAllKeyed());
  if ($lock_acquired) {
    cache_set('variables', $variables, 'cache_bootstrap');
    lock_release($lock_name);
  }
  else if ($variable_write) {
    // When we've made a change via variable_set() or variable_del() and failed
    // to rebuild the cache, we need to clear it to ensure that subsequent
    // requests pick up the changes.
    cache_clear_all('variables', 'cache_bootstrap');
  }
  return $variables;
}

/**
 * Unsets a persistent variable.
 *
 * Case-sensitivity of the variable_* functions depends on the database
 * collation used. To avoid problems, always use lower case for persistent
 * variable names.
 *
 * @param $name
 *   The name of the variable to undefine.
 *
 * @see variable_get()
 * @see variable_set()
 */
function variable_del($name) {
  global $conf;

  db_delete('variable')
    ->condition('name', $name)
    ->execute();

  unset($conf[$name]);

  variable_reset_cache();
}
Anonymous’s picture

as a D8 aside, now that we're trying to fix the race here (which exists in D6 as well), i wonder at what point we want to ditch the "hey, lets read it all in one go pattern". as long as we stick to that, we make the race-window much, much bigger than it ought to be.

we can't lock at the scope of the change, instead we have to lock at the scope of the whole variable table. it kinda reminds me of those poor applications that get stuck with MyIsam on mysql. oh, wait ;-)

i wonder if in D8 we couldn't move to something more like the old-registry caching system, where we let the persistent variable cache build up over time.

carlos8f’s picture

subscribe

catch’s picture

Status: Needs work » Needs review
FileSize
3.64 KB

Here's Justin's change rolled into a patch, we should definitely do the same thing for variable_del(). Untested so let's see what the bot thinks.

Opened #987768: [PP-1] Optimize variable caching for lazy cache building, that's a great idea.

Status: Needs review » Needs work

The last submitted patch, 973436-variable_cache.patch, failed testing.

Anonymous’s picture

FileSize
4.24 KB

thanks for rolling the patch catch. attached patch adds variable_get() back in ;-)

Anonymous’s picture

Status: Needs work » Needs review

status fail.

Status: Needs review » Needs work
Issue tags: -Performance, -scalability

The last submitted patch, variables.patch, failed testing.

catch’s picture

Status: Needs work » Needs review

whoopsie. Not sure why your patch failed too, sent for retesting.

Status: Needs review » Needs work
Issue tags: +Performance, +scalability

The last submitted patch, variables.patch, failed testing.

yonailo’s picture

subscribing

Anonymous’s picture

FileSize
4.99 KB

added a first stab at initializing the lock system in the installer as suggested by chx via catch.

Anonymous’s picture

Status: Needs work » Needs review

blurgh.

catch’s picture

Issue tags: +Needs backport to D6

Installer change is good, rest of the patch looks good too. I'm too close to this to RTBC though.

Jeremy’s picture

Subscribing.

carlos8f’s picture

I found a logic error that would make threads wait even if the lock was acquired:

+++ includes/bootstrap.inc	3 Dec 2010 16:26:05 -0000
@@ -803,12 +794,37 @@ function variable_get($name, $default = 
+  if (!$lock_acquired = lock_acquire($lock_name, 1) && $variable_write) {
+    // When we've made a change via variable_set() or variable_del(), we try
+    // harder to get the lock.

This is an operator precedence bug reminiscent of #802446: Cache stampede protection: drupal_render() and page cache, which evaluates as $lock_acquired = (lock_acquire() && $variable_write). Obviously not what is intended :)

Rerolled to fix that and a style issue: "else if" should be "elseif".

Otherwise the method and implementation look sound. This patch may have real performance consequences though, since the overhead for variable_set() is much more. But the trade-off is better for concurrent reads after a set, as the OP makes clear.

Also, from the OP:

other requests to the site while this is going on don't know anything about it - they'll always get a cache hit regardless.

The patch keeps this contract pretty well. The main improvement is cache_set() replacing a previous cache_set() with newer data, instead of removing it entirely. cache_clear_all() is left as a last resort during concurrent writes.

David_Rothstein’s picture

 function install_drupal($settings = array()) {
   global $install_state;
+
+  // Initialize the locking system.
+  require_once DRUPAL_ROOT . '/includes/lock.inc';
+  lock_initialize();

Why is this code in install_drupal()?

I think it should be in install_begin_request(), somewhere closer to where the rest of the files are loaded and the rest of the initialization is performed.

carlos8f’s picture

@David wouldn't putting lock_init in install_begin_request() be less friendly to systems that use install_drupal() as an API, such as drush?

Anonymous’s picture

holy crap, that precedence things always bites me, thanks carlos8f. i'm not sure i can RTBC this, but it feels close.

David_Rothstein’s picture

@carlos8f: install_drupal() calls install_begin_request(), so everyone will get it. It's an issue of code organization, really - all the other systems are loaded/initialized from there, so why should this one be different?

carlos8f’s picture

Oh, it would've helped to look at the code :P

Yes, variable_initialize() moving to install_begin_request() makes a lot of sense. lock.inc as a pluggable system should be loaded just after DRUPAL_BOOTSTRAP_CONFIGURATION, ideally.

Status: Needs review » Needs work

The last submitted patch, 973436-24-variable-cache-rebuild.patch, failed testing.

carlos8f’s picture

Status: Needs work » Needs review
FileSize
5.01 KB

Rerolled to move lock initialization to install_begin_request().

Anonymous’s picture

looks fine to me. i think this is ready, but i don't think i should RTBC it.

catch’s picture

Status: Needs review » Reviewed & tested by the community

Looks ready to me as well. Looking at the patch now it's pretty far from my original one, and I haven't made any actual changes to the logic here since #6, so I'm going to go ahead and mark it RTBC.

chx pointed out (in irc, don't think it made it into a comment) that this is the only place in core where a cache clear triggers a rebuild after the write, instead of just clearing on the next request, in case Dries or webchick also questions this, here's some (more) justification for that:

- variable_initialize() happens very, very early in the bootstrap, and it's needed for both cached and uncached page requests (unless you manually tweak settings.php variables or use varnish).

- having no locking leads to stampedes and PHP errors with database caching, and could lead to performance issues on high traffic sites since it's quite expensive to build, and a very large cache array to set.

- having the locking just by itself, puts us in the situation of having a global read lock and hanging apache processes for every single request (to both cached and uncached pages) across a site - this is very easy to reproduce with Jeremy's debug patches above.

- since we load the entire variable table into cache at one go and can't afford stale cache entries, there is no option apart from either a regression to the stampede, or a global lock somewhere

- therefore the patch avoids both stampedes and hanging processes, by moving the locking and cache rebuilding to variable_set() and variable_del(). In the case of a cold start situation, we prevent the original stampede by not writing to the variable cache with more than one thread at a time on reads too, others read from the table but don't try to write back to cache. This situation will be much rarer since the majority of variable cache clears are from del/set anyway.

- we still don't have any way to test the locking framework with simpletest, however there is less locking, and further away from the critical path with this patch, than there currently is in HEAD.

carlos8f’s picture

this is very easy to reproduce with Jeremy's debug patches above.

I think you mean Jeremy's debug patches in #886488: [D7] Add stampede protection for css and js aggregation? :)

We have these three performance/stampede-protection patches left in the queue:

#802446: Cache stampede protection: drupal_render() and page cache
#886488: [D7] Add stampede protection for css and js aggregation
#973436: Overzealous locking in variable_initialize()

Although they look good from a logic and code review standpoint, I am a little nervous at the prospect of these getting committed just before release, and having very little real-life testing to indicate that they actually improve concurrency and minimize load. I emphasize real-life because it's hard to test the organic nature of traffic and concurrency in a controlled test. If we can conduct a test either on a real traffic-heavy site, or in a controlled test that can test for locking/multiple threads across multiple URLs, I would be more confident. Any ideas on how we could go about that testing?

carlos8f’s picture

sun’s picture

Status: Reviewed & tested by the community » Needs work
+++ includes/bootstrap.inc	5 Dec 2010 00:27:17 -0000
@@ -725,43 +725,6 @@ function drupal_get_filename($type, $nam
@@ -786,6 +749,34 @@ function variable_get($name, $default = 

@@ -786,6 +749,34 @@ function variable_get($name, $default = 
+function variable_initialize($conf = array()) {

I agree that this function should be moved closer to the corresponding functions, but why is _initialize() moved below _get()?

+++ includes/bootstrap.inc	5 Dec 2010 00:27:17 -0000
@@ -803,12 +794,37 @@ function variable_get($name, $default = 
+ * @param bool $variable_write

Missing phpDoc description.

+++ includes/bootstrap.inc	5 Dec 2010 00:27:17 -0000
@@ -803,12 +794,37 @@ function variable_get($name, $default = 
+function variable_reset_cache($variable_write = TRUE) {

Why is this added below _set()? Should be directly below _initialize().

Powered by Dreditor.

sun’s picture

Sorry, please scratch the last of the review issues (removed it), I'm tired.

Anonymous’s picture

Status: Needs work » Needs review
FileSize
3.91 KB

updated as per #36.

sun’s picture

+++ includes/bootstrap.inc	15 Dec 2010 16:16:44 -0000
@@ -762,6 +753,34 @@ function variable_initialize($conf = arr
+ * @param bool $variable_write
+ *   Flag to indicate if we are resetting the cache after a write operation.

Without reading the actual code, it wasn't entirely clear to me what the effect of this parameter is.

Also, the default value should be documented, and I especially wondered why it's TRUE and not FALSE by default, since many other *_reset() functions in core solely flush a cache - they do not write the cache.

Powered by Dreditor.

catch’s picture

Updated the phpdoc for this - any better?

Anonymous’s picture

FileSize
4.05 KB

updated #40 to fix fuzz only.

@catch - i think that phpdoc is better, thanks.

fgm’s picture

Further analysis at #249185: Concurrency problem with variable caching leading to cache inconsistency, of which this issue appears to be a duplicate.

kentr’s picture

Here's a first run at a D6 port.

I was uncertain how best to apply the changes in the install_begin_request() section of #41, so they are not yet included.

Tested with basic drush vset / vget / vdel, and simple drush cron runs.

----
Edited to correct comment number reference. Changed "#42" to "#41".

pillarsdotnet’s picture

The patch in #41 changes install_begin_request() to initialize the locking system before the database system is loaded. But the locking system depends on the database system, because lock_acquire() calls db_insert(). Can we guarantee that this will not result in PHP Fatal error: Call to undefined function db_insert()?

DanPir’s picture

subscribe

basicmagic.net’s picture

subscribe

Lars Toomre’s picture

Issue tags: +Needs backport to D7

This will need to be patched in D8 first and then back-ported to both D7 and D6 after it is integrated into D8. Hence, moving adjusting the version and issue tags.

As I read through this issue, I was wondering whether for performance issues it might make sense to allow variable_set and variable_del functions to be called with a second format.

The second format would be a keyed associative array where the key was the variable to be set and the value would be what it should be set to. In both functions then we could use an is_array() check to silently handle strings as they are now or to set an array of values in one go. This would save the relatively expensive cache rebuild until most, if not all, of the variables are set for a module/entity/plug-in etc.

This seems like relatively low hanging fruit from performance and DX perspective. Does this idea make sense?

Lars Toomre’s picture

Version: 7.x-dev » 8.x-dev

Meant to change the version number too!

catch’s picture

Cross posting with #987768: [PP-1] Optimize variable caching. These are two different approaches to the same issue, the other issue is more comprehensive but also a bigger change - so I think it makes sense for these to co-exist.

kentr’s picture

This will need to be patched in D8 first and then back-ported to both D7 and D6 after it is integrated into D8.

I don't understand the reasoning for this. Is this Drupal policy? In my view, fixing it in the current version is a higher priority than fixing it in a future version.

It's a pretty serious bug for anyone who wants to run cron frequently with a lot of processing and can bring their server to a halt until they figure out the cause and build a custom cron.

pillarsdotnet’s picture

@kentr -- Yes, this is Drupal policy. If you want to discuss the policy, please read all of the following thread first: #1050616: Figure out backport workflow from Drupal 8 to Drupal 7

kentr’s picture

@pillarsdotnet: Thanks. I will read that thread.

Independent of the backport policy, I believe this bug to be critical. Inaccurate variable values can wreak all sorts of havoc.

RobLoach’s picture

What if instead of a full variable cache reset each time a variable was set, we set a &drupal_static() flag and have the cache reset once the page is done rendering....

variable_set('mystuff45', 19);
variable_set('mystuf545f', 12);
variable_set('mystuff54', 50);
variable_set('mystuff534', 72);
variable_set('mystuff2', 24);
variable_del('mystuff2');
variable_set('mystuff3', 66);

Do we really need to reset the cache seven times here?

catch’s picture

If we do that we make potential for race conditions considerably longer - another process could set mystuff43 to something else (very likely with css/js aggregate hashes since the same array is built differently on different requests), and overwrite.

The cache rebuilding is much cheaper on #987768: [PP-1] Optimize variable caching since it avoids having to query the entire {variable} table each time.

crea’s picture

Subscribing

xjm’s picture

Tagging issues not yet using summary template.

dsobon’s picture

A programmer with 20 years experience recently told me:

locking is like a toilet - only one person at a time can do a sh!t.

Patch included for pressflow-6.22.104 (may need one small change for D6).

Patch implements:
* read-modify-write concept.
* write {b,}locking for variable_{set,del}().
* variable_init() is called just before modifying $conf[].
* checks if db_insert() is defined.
* modify lock_wait() to sleep _after_ checking lock, not before, and to sleep 250 mS not 1000 mS.
* does _not_ implement read locking (that needs to be implemented at the module level for time-heavy processing, and use lock_wait() & lock_acquire() - the problem is _not_ with variable_get()).

dsobon’s picture

And yes. I made a mistake in the patch! But while I was at it, I found / resolved 2 more "problems".

Updated for:

* remove function_exists() - that was not causing the warning. db_result() being called from lock_acquire() on a disappeared lock was causing the problem. "drush @site cc all" is one case to trigger the following:

'all' cache was cleared                                                                        [success]
WD php: Warning: mysqli_num_rows() expects parameter 1 to be mysqli_result, boolean given in   [error]
db_result() (line 200 of
/var/aegir/platforms/test-pressflow-6.22.104/includes/database.mysqli.inc).

* fix lock_acquire() to check db_query() with db_affected_rows(), instead of db_result().
* rewrote cache_get() since it was if-nesting hell. Unreadable core code is never good.
* cache_get() now supports checking "expire" on an individual basis, providing cache_lifetime is set (which appears to be designed as a way to globally enable never-expire caching). cache->expire was only ever used to check if CACHE_PERMANENT or not. Otherwise cache_lifetime was used as the "expire" time as part of deleting expired cache entries.

carlos8f’s picture

@dsobon, Pressflow patches should be submitted to https://github.com/pressflow/6. This is a D8 issue now :)

carlos8f’s picture

Status: Needs review » Needs work
Issue tags: +Performance, +Needs backport to D6, +Needs issue summary update, +scalability, +Needs backport to D7

The last submitted patch, 973436.variable.patch, failed testing.

pillarsdotnet’s picture

Status: Needs work » Needs review
FileSize
4.41 KB
24.76 KB

Re-roll of #41 for D7 and D8.

The pressflow patch does not translate well to vanilla D6, so I'll leave that bit to someone else.

pillarsdotnet’s picture

Let's try that D7 patch again...

xjm’s picture

Status: Needs review » Needs work
Issue tags: +Novice

Note that the Drupal 8.x patch will need to be rerolled, because the core directory structure for Drupal 8 has now changed. (For more information, see #22336: Move all core Drupal files under a /core folder to improve usability and upgrades). When the patch has been rerolled, please set the issue back to "Needs Review."

Tagging as novice for the task of rerolling the Drupal 8.x patch.

If you need help rerolling this patch, you can come to core office hours or ask in #drupal-gitsupport on IRC.

karschsp’s picture

Status: Needs work » Needs review
FileSize
3.87 KB

Here's a re-roll of the 8.x patch in #62 for the /core move.

karschsp’s picture

Fixing whitespace issues in #65

Status: Needs review » Needs work
Issue tags: -Performance, -Needs backport to D6, -Needs issue summary update, -Novice, -scalability, -Needs backport to D7

The last submitted patch, variable_set-973436-67.patch, failed testing.

karschsp’s picture

Status: Needs work » Needs review

#66: variable_set-973436-67.patch queued for re-testing.

Status: Needs review » Needs work
Issue tags: +Performance, +Needs backport to D6, +Needs issue summary update, +Novice, +scalability, +Needs backport to D7

The last submitted patch, variable_set-973436-67.patch, failed testing.

xjm’s picture

+++ b/core/includes/install.core.incundefined
@@ -240,6 +240,9 @@ function install_begin_request(&$install_state) {
+  // Initialize the locking system.
+  require_once DRUPAL_ROOT . '/' . variable_get('lock_inc', 'includes/lock.inc');

Need to correct this for core/.

karschsp’s picture

Status: Needs work » Needs review
FileSize
3.87 KB

Re-roll which fixes the issue identified by @xjm in #70.

karschsp’s picture

Had my "/"s in the wrong place.

catch’s picture

Just a reminder there's a patch at #987768: [PP-1] Optimize variable caching that does write though caching but also tries to optimize some other things (like not loading variables that aren't in use etc.). If that ones goes in, we would not need this patch, but if this goes in, we could still go for that patch, so not marking duplicate but it is very close (and I realise I opened both issues within a couple of months of each other, doh).

Peter Bowey’s picture

Refer #58 http://drupal.org/node/973436#comment-5125648

I appreciate the work / code offered by dsobon @ #58

I have applied dsobon's code to Pressflow 6.22 - 104,
and it works for a solution to this issue. Pressflow lags somewhat on the Support
+ Google scale, so I am *grateful* to find the solution here; as will others.

As is mentioned in both these related articles;
See: http://drupal.org/node/987768 and http://drupal.org/node/973436 (this page)
this is a problem for D6/D7 and D8. (that logically means Pressflow too)

catch’s picture

FileSize
5.98 KB

I'd hoped this would get fixed quickly and then the more thorough overhaul happen afterwards in #987768: [PP-1] Optimize variable caching, but here we are a few months later and they're both open, so I've marked the other issue as duplicate and am uploading the latest patch from there to here.

This needs a new issue summary, will try to do that soon as well.

catch’s picture

Added issue summary.

donquixote’s picture

What exactly will this mean on the persistence level?
What will be stored in the database?
In D6, there is the variables table with one row per serialized variable, and then the cache table, with one row for the serialized contents of the variables table altogether. The cache thing can only be loaded all-at-once, while from the variables table we can load the variables one-at-a-time, or at least with a big SELECT and then fetched row-by-row. The same for write operations.

How are we going to change this?
Will there be a cache that caches only a part of the variables table?

(I hope the summary will tell)

catch’s picture

The summary has this information already, also see http://api.drupal.org/api/drupal/includes--bootstrap.inc/class/DrupalCac...

There is no schema change here, only the cache strategy is different.

Status: Needs review » Needs work

The last submitted patch, variable_cache_0.patch, failed testing.

donquixote’s picture

@catch (#78),
so this means our only "cache" for variables in the db is still the big thing that stores all variables in one big serialized nested array?
How does this help us then? We still load this big thing on every request, in variable_initialize() ?

Anonymous’s picture

re #80 - i haven't looked at the patch for a while, but i think it builds the 'big list of variables we load on each requeast' over time.

contrast this to now, where we just load the entire variables table, cache it. then load the whole lot from cache.

Anonymous’s picture

Status: Needs work » Needs review
FileSize
5.95 KB

here's a version of the patch at #75 that should apply, no code changes.

pounard’s picture

In variable_get():

  $cache = &drupal_static('variables');
  if (isset($cache) && isset($cache[$name])) {
    $conf[$name] = $cache[$name];
    return $conf[$name];
  }

Is that really necessary? In variable_initialize() all content of the cache object is copied into $conf, and its internals have been emptied. It should never pass the isset($cache[$name]) here; Or am I missing something?

In variable_set() (and also variable_set()):

  $cache = &drupal_static('variables');
  if (is_object($cache)) {
    // Write through to the variable cache.
    $replace = array();
    if ($cached = cache('bootstrap')->get('variables')) {
      $cached->data[$name] = $value;
      cache('bootstrap')->set('variables', $cached->data);
      $replace = $cached->data;
    }
    $cache->replaceStorage($replace);
  }
  else {
    cache('bootstrap')->delete('variables');
  }

When does the $cache variable could not be initialized? If anything is using variable_set() in pre-boostrap, it's a real software bug, it should not happen. So my guess is that this is_object() test should be removed.

To ensure a correct behavior, we could use a new function such as _variable_get_cache() instead of direct call to drupal_static() that would create on the fly the new instance instead, maybe? In all cases, the object is created at bootstrap anyway.

My guess is that a empty() method could be done over the cache object instead of using cache('something')->delete('something'): this has a huge cons: the cid and bin appears only once in the full code, where you instanciate the object itself.

catch’s picture

In variable_initialize() all content of the cache object is copied into $conf, and its internals have been emptied. It should never pass the isset($cache[$name]) here; Or am I missing something?

A bit. We copy the variables that are known about by DrupalCacheArray into $conf (since there is still code around in contrib that directly mucks about with $GLOBALS['conf'] to tweak things during requests, so we can't actually replace that with the object safely). However DrupalCacheArray won't know about variables it hasn't seen yet, hence the fallback which will only happen on a cache miss.

When does the $cache variable could not be initialized? If anything is using variable_set() in pre-boostrap, it's a real software bug, it should not happen. So my guess is that this is_object() test should be removed.

The installer uses it even before the cache tables are available iirc - one of the first actions of the installer is variable_set(). We might be able to hack the installer up though instead of baking this in, it's a long time since I wrote this so I can't remember 100% if that condition happens anywhere else.

pounard’s picture

We copy the variables that are known about by DrupalCacheArray into $conf (since there is still code around in contrib that directly mucks about with $GLOBALS['conf'] to tweak things during requests, so we can't actually replace that with the object safely).

I understood that, it's OK it keeps backward compat.

However DrupalCacheArray won't know about variables it hasn't seen yet, hence the fallback which will only happen on a cache miss.

I see this use cases here, tell me if I'm missing one:
I call variable_get() and I get a variable into $conf...

  • ... and this variable comes from an override, so it's OK that the cache object doesn't know about it!
  • ... and this variable comes from the cache object, so it has been overrided at bootstrap time thanks to variable_initialize() and it's already into $conf
  • ... and this variable does not exists in either, but no need to check the $cache since it already gave all it's got.
  • ... and the variable has been modified or set just before using variable_set(), but variable_set() already updated both $conf and the object.
  • ... and the variable was into $conf but someone modified the global by hand: case in which we don't want to call the cache object.

So the isset() call will never go to TRUE, except if someone modified the object internals without going through variable_set(), which is a real bug out there, not in the variable system.

EDIT: Ho I understood, someone may have set a variable in a parallel thread: considering the fact that the PHP has not a long run, does it really worth the shot of reading it? Or am I still missing something in the DrupalCacheArray resolve/persist algorithm.

The DrupalCacheArray seems to suffer many problems, but I might be wrong here, see this:
It seems to be the extending class or the component who use it to have the responsability of calling the persist() method, which happens with the VariableCache::resolveCacheMiss() and VariableCache::replaceStorage() implementations: those two calls can be used in parallel threads, thus one saving its set of persisting keys as one big cache entry, the second one overrides and does the same, but they may have used totally different keys (and may have not loaded the cache from the other). This will bump to the next hits that may do exactly the same because not all variables has been saved (the second save will have overriden the first), and those after, etc... until we have luck and one and only one thread happens to run with ALL the useful variables being used.

catch’s picture

... and this variable does not exists in either, but no need to check the $cache since it already gave all it's got.

This is the step that's wrong.

DrupalCacheArray at any one point (and in turn $conf once this patch lands), only contains the variables that have been requested via variable_get() since the last time the cache was cleared. This is different to now where $conf contains the entire contents of the variables table.

So if a variable is not in $conf or $cache, it might be in the database still (a D6 site I looked at last week had 3,500 entries in the variable table, this is the most I've ever seen, but more than 1,500 is quite common). So in that case, the $cache object will need to look it up, return it, and cache it at the end of the request so it's there for next time.

That way, instead of loading a cache entry with on average between 500-2000 variables into $conf on every request, it should be closer to 100-600 or similar (obviously will depend on the modules, the site, how much activity on admin pages, how long since the last cache clear etc.).

pounard’s picture

Yes, you are right! That was were I was wrong, indeed. That's OK I guess then. Still my notes about the _set() and _del() functions remains.

catch’s picture

I missed this bit first time around:

those two calls can be used in parallel threads, thus one saving its set of persisting keys as one big cache entry, the second one overrides and does the same, but they may have used totally different keys (and may have not loaded the cache from the other). This will bump to the next hits that may do exactly the same because not all variables has been saved (the second save will have overriden the first), and those after, etc... until we have luck and one and only one thread happens to run with ALL the useful variables being used.

That's what this is for:


    if (!$lock || lock_acquire($lock_name)) {
      if ($cached = cache($bin)->get($cid)) {
        $data = $cached->data + $data;
      }
      cache($bin)->set($cid, $data);
      if ($lock) {
        lock_release($lock_name);
      }
    }
  }

So the default is to write lock the variables before writing to cache, during which the very freshest cache data is read from the database, added to, then written back again. That way there's no chance of cache ping pong with different sets of data from different processes. Some might throw their hard work away, but it's always strictly an additive process.

pounard’s picture

OK, didn't see that bit, nice.

catch’s picture

FileSize
5.04 KB

Re-rolled this for PSR-0, no other changes.

Status: Needs review » Needs work

The last submitted patch, variable_cache.patch, failed testing.

catch’s picture

Status: Needs work » Needs review
FileSize
17.68 KB

grrr, and with the new file this time.

Status: Needs review » Needs work

The last submitted patch, variable_cache.patch, failed testing.

catch’s picture

FileSize
6.21 KB

Not looked at the failures yet, but this moves some logic out of variable_del() and into VariableCache::offsetUnset()

tim.plunkett’s picture

Category: task » bug

The patch in #92 fixes the bug from #1536262: Entering a site name when installing Drupal 8 has no effect until caches are cleared, so I closed that, and bumped this. We should add a test case here for the site title as well.

Also I think the patch in #94 was missing that new file again.

xjm’s picture

Issue tags: +Needs tests
sun’s picture

Status: Needs work » Needs review
Issue tags: -Needs tests
FileSize
7.48 KB

Re-rolled against HEAD. Improved some documentation.

Also adjusted the replaceStorage() method, as it didn't persist the new/replaced storage keys. When replacing the existing storage (which was persisted) I think the new/replaced storage should be fully persisted, too. Doesn't really matter for the implementation code in this patch, as it replaces with an empty array.

I'm entirely not happy about the

drupal_static_reset('variables');

that seems to be required for the batch test.

sun’s picture

+++ b/core/includes/bootstrap.inc
@@ -874,9 +867,22 @@ function variable_set($name, $value) {
+    // Write through to the variable cache.
+    $replace = array();
+    if ($cached = cache('bootstrap')->get('variables')) {
+      $cached->data[$name] = $value;
+      cache('bootstrap')->set('variables', $cached->data);
+      $replace = $cached->data;
+    }
+    $cache->replaceStorage($replace);

Why don't we call into VariableCache::__destruct() (or a public VariableCache::set()) instead of this manual fiddling?

xjm’s picture

Issue tags: +Needs tests

Status: Needs review » Needs work

The last submitted patch, drupal8.variable-cache.96.patch, failed testing.

sun’s picture

Status: Needs work » Needs review
FileSize
7.98 KB

I'm actually not sure whether I fully understand the envisioned concept behind this patch, so here's one more version incorporating the suggestion in #98 and reverting the replaceStorage() change from #97. I don't have the impression I'm contributing something useful here, so going to step out of this issue for now.

The replaceStorage() idea in variable_initialize() really needs much more elaborate documentation on why that is done. The idea looks fairly complex and special to me and I wonder whether it is not going to result in unexpected issues down the line.

EDIT: It also looks like the lock validation in variable_initialize() on read is lost with this patch.

Status: Needs review » Needs work

The last submitted patch, drupal8.variable-cache.101.patch, failed testing.

David_Rothstein’s picture

Category: bug » task
Fabianx’s picture

Fabianx’s picture

Assigned: Unassigned » Fabianx

Maybe I get to work on that :)

YesCT’s picture

@Fabianx still interested?

fgm’s picture

Does it still make any sense in the wake of CMI ?

amateescu’s picture

Version: 8.x-dev » 7.x-dev
Issue tags: -Needs backport to D7

I think the D8 equivalent task for this is #1880766: Preload configuration objects.

amateescu’s picture

Issue summary: View changes

Updating summary.

rwohleb’s picture

Assigned: Fabianx » Unassigned

With everyone focused on the D8 beta, issues like this have fallen to the side of the road. I'm excited by the D8 work being done in #2228261: Add a local, PhpStorage-based cache backend and #2248767: Use fast, local cache back-end (APCu, if available) for low-write caches (bootstrap, discovery, and config), but before I can even dream of those being back-ported to D7, it seems we need this issue committed first. I'm wondering if anyone has made any advancements on this front but has yet to post about it.

I'm reassigning to none since there has been no activity for two years.

DamienMcKenna’s picture

Status: Needs work » Needs review
FileSize
1.18 KB

A reroll of #61, which was the last D7 version of the patch.

DamienMcKenna’s picture

FileSize
3.8 KB

No, sorry, this is the patch file.

Status: Needs review » Needs work

The last submitted patch, 111: drupal-n973436-111.patch, failed testing.

DamienMcKenna’s picture

FileSize
3.84 KB

Whoops, there was a syntax error in the patch. Fixed.

DamienMcKenna’s picture

Status: Needs work » Needs review
Fabianx’s picture

Status: Needs review » Reviewed & tested by the community

RTBC - Looks great to me.

I am not sure what we wanted to test here, maybe Stefan or David will know.

fgm’s picture

Not sure yet: does it cover the case outlined in https://www.drupal.org/node/249185#comment-3312162 under heavy load ? It seems it tries to, but not sure it succeeds.

Fabianx’s picture

Yes, that case is covered.

It tries to acquire the lock, but unless writing variables it never waits.

DamienMcKenna’s picture

Seconding Fabianx's request in #115 for feedback on what tests might be useful to confirm this doesn't cause regressions. Thanks.

stefan.r’s picture

Issue tags: +Drupal 7.60 target
Steven Jones’s picture

The patch in #113 misses the partial querying of the variable table that was added early in D8 in #987768: [PP-1] Optimize variable caching, so if you do multiple variable_sets in one page request you are going to do multiple rebuilds. This is as noted in #53, and catch addressed that concern in #54 by essentially saying that the work in #987768 was done, which it isn't in D7, right?
(I appreciate that the variable system has gone away in D8)

So, is that a real issue, or is it okay that page requests might rebuild the variable cache over and over, albeit in a non-blocking way for other requests.

Fabianx’s picture

#120: While what catch has written is true, there is one 'catch' to it (pun intended):

- Writes are way slower than reads usually and read-before write is generally cheap

So the only thing we save is the 'read', but we increase conflicts that way (e.g. ping-pong play of two or more threads).

Therefore I stick to this being RTBC - even if we don't use the VariableCache().

Though we might still consider to use the CacheArray (cache collector) approach for variables - for the reason that it means that only active variables have to be loaded from the cache and for several sites, variable table size is so BIG, but e.g. only 30% is actively used on most page requests.

I still think #113 is a pretty safe compromise to improve it more without radically changing it.

catch’s picture

I think it's worth committing #113, then if people want to re-open #987768: [PP-1] Optimize variable caching for CacheArray they should. The variable cache conversion to CacheArray was harder than any of the other ones in 7.x due to the differences between variable_get() returns in early and late bootstrap, it might also conflict with drush (although that needs testing).

pounard’s picture

Please don't CacheArray the variables, even if only 30% are in use on each page, it's just too critical during runtime to setup anything that would be more complex than it currently is. Current variable handling is fine, and we don't have cache size problems with it (as long as you don't use 200 modules).

EDIT: +1 for catch

Fabianx’s picture

Assigned: Unassigned » David_Rothstein
Issue tags: -Needs tests

This is base system and 7.60 target and hence an issue for David to look at.

I am not seeing the need for further tests though.

David_Rothstein’s picture

Assigned: David_Rothstein » Unassigned
Status: Reviewed & tested by the community » Needs review

It took me a while to get to this, but I did some testing now and as #120 said this really does slow down variable_set() a lot.

The difference is somewhat noticeable when submitting a form like the one on admin/config/people/accounts that does 30+ variable_set() calls. And that's with a default core install (with only around 60-70 variables in the table); I haven't tested but I wonder if the slowdown would be a lot worse on a site that has many variables?

So if the tradeoff is between slowing down an admin operation that lots of sites do vs. solving an end-user performance bottleneck that happens only on extremely high traffic sites, it is not clear to me how to weigh that tradeoff.

I got to thinking, though, why does this patch even need to make changes to variable_set() and variable_del() at all? The performance bottleneck is due to the lock_wait() during variable initialization, so why not just fix that? Then I noticed that that's exactly what Damien Tournoud's original patch in #3 did :) #4 says there would still be a "stampede" with that patch but I'm not sure what that refers to since it is designed to avoid a stampede on the cache writes... is the stampede just that many processes will try to call lock_acquire() and read directly from the {variable} table simultaneously? Isn't that pretty minor?

The patch in #113 has benefits and drawbacks, but the patch in #3 seems like it only has benefits compared to the current code in core - it gets rid of the pointless lock_wait() and that's it. So why not go with #3?

torgosPizza’s picture

donquixote’s picture

It seems the "Proposed solution" in the issue summary talks only about Drupal 8.
Could someone who followed this issue update the issue summary and add a section "Proposed solution(s) for Drupal 7" ?

joelpittet’s picture

I'm seeing deadlocks on the variable_reset_cache key in semaphore in with #113 applied. Going to try #3 for a while.

fgm’s picture

Lots of deadlocks on semaphore on D8.4 too, even on single web head, with as little as 3 simultaneous clients (looping Drush commands).

nareshp’s picture

Hi All,

Our environment is on 7.59 with memcache_storage module (7.x-1.4), we have been frequently getting Lock wait time out errors.

Patch https://www.drupal.org/files/issues/drupal-n973436-113.patch in #113 seems already there in 7.59 and this issue is tagged to 7.60 release.

Could some one help me what else is planned on 7.60 release to fix the issue.

Below is full stack trace of error.

Error message
PDOException: Uncaught exception 'PDOException' with message 'SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction' in /var/www/html/includes/database/database.inc:2227

Stack trace
in PDOStatement::execute called at /var/www/html/includes/database/database.inc (2227)
in DatabaseStatementBase::execute called at /var/www/html/includes/database/database.inc (697)
in DatabaseConnection::query called at /var/www/html/includes/database/query.inc (1177)
in UpdateQuery::execute called at /var/www/html/includes/database/query.inc (1643)
in MergeQuery::execute called at /var/www/html/sites/all/modules/contrib/memcache_storage/memcache_storage.inc (321)
in MemcacheStorage::variableSet called at /var/www/html/sites/all/modules/contrib/memcache_storage/memcache_storage.inc (148)
in MemcacheStorage::clear called at /var/www/html/includes/cache.inc (173)
in cache_clear_all called at /var/www/html/includes/theme.inc (354)
in drupal_theme_rebuild called at /var/www/html/includes/common.inc (7664)
in drupal_flush_all_caches called at /var/www/html/modules/system/system.admin.inc (1781)
in system_clear_cache_submit called at /var/www/html/includes/form.inc (1520)
in form_execute_handlers called at /var/www/html/includes/form.inc (904)
in drupal_process_form called at /var/www/html/includes/form.inc (386)
in drupal_build_form called at /var/www/html/includes/form.inc (131)
in drupal_get_form called at ? (?)
in call_user_func_array called at /var/www/html/includes/menu.inc (527)
in menu_execute_active_handler called at /var/www/html/index.php (21)
joseph.olstad’s picture

Issue tags: -Drupal 7.60 target +Drupal 7.61 target

Bumping to 7.61. This didn't make it into 7.60.

joseph.olstad’s picture

joseph.olstad’s picture

joseph.olstad’s picture

Issue tags: -Drupal 7.64 target

Actually, I'm going to disagree with this patch, I don't think variable_set should rebuild the variable cache, that would mean the variable cache would be rebuild multiple times if you had to set several variables in a row.
Maybe someone else can explain how this would improve performance? Seems like it would possibly degrade performance.

joseph.olstad’s picture

Status: Needs review » Reviewed & tested by the community
Issue tags: +Drupal 7.64 target

After re-reading through again this entire thread, points brought up by DavidRothstein at #125 and reports by JoelPittet #128 and fgm #129 , patch #3 provides the best solution and has no reported regressions. Comment #130 is erroneous, patch 113 is NOT in 7.59.

Patch #3 works without causing deadlocks on the semaphore table.

Go with patch #3

joseph.olstad’s picture

patch #3 needed a reroll.

creds to the developer who wrote patch #3 , as this is code verbatim

***EDITTED*** see next comment

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 136: 973436-non-bothering-variable-cache-136.patch, failed testing. View results

joseph.olstad’s picture

deleted my own comment for now.

MustangGB’s picture

Issue tags: -Drupal 7.64 target +Drupal 7.69 target
MustangGB’s picture

Issue tags: -Drupal 7.69 target +Drupal 7.70 target
joseph.olstad’s picture

Priority: Major » Critical
Status: Needs work » Reviewed & tested by the community
Issue tags: -Drupal 7.70 target +Drupal 7.73 target

Lets get this in dev asap, this issue just cost me another weekend (thanks to the patch I have some of my sunday left), I stumbled on this again for a client that had an unpatched environment, this could have been resolved 10 years ago thanks to Damien Tournauds patch, to avoid repeating myself please see see comment #135 and also #136 fixes stale variable_get cache that seems to occur when variable_get is hit within 1 second of another request pulling the same variable_get. Patch resolves stale cache even for simultaneous request.

gdaw’s picture

RTBC +1

dsutter’s picture

RTBC +1

joel_osc’s picture

Would be good to get this committed - thank-you! +1 RBTC

joseph.olstad’s picture

Issue tags: -Drupal 7.73 target +Drupal 7.74 target
joseph.olstad’s picture

@FabianX seeing as you already approved this patch 4 years ago can you please slam dunk this one for us? I have mentioned this issue to
@McDruid however he is currently very busy with the db api changes revolving around MySQL 8.

#973436-115: Overzealous locking in variable_initialize()

mcdruid’s picture

Can this change have an opt-in/opt-out variable that could be set in $conf in settings.php?

joseph.olstad’s picture

@mcdruid, thanks for having a look, I checked and D8 /D9 does not have an opt-in/opt-out variable for settings get.

The fix for this bug ensures that the variable_get returns the correct value instead of the stale cached value. The cache for the value is rebuilt when the value is updated.

I realize that variable_get doesn't exist as it was in D7 however settings get exists in D9. The fix was hopefully brought to settings get also. With that said, D8/D9 does not have an opt-in / opt-out for stale cached settings values.

joseph.olstad’s picture

@mcdruid, (@catch, @FabianX) and others have already agreed that this is a good fix (change) to make.

@catch recommends that once we put this in we can re-open this one:
#987768: [PP-1] Optimize variable caching

as for an opt-in, I have a half dozen client sites that need this fix I would not want to have to 'opt-in' to get the fix.

The performance penalty here is very small as a tradeoff for accuracy.

Again, I've spent a lot of time debugging multiple different sites in different scenarios where I discovered this patch was needed. 2 years after I found another site that needs this fix. There's already 148 comments on this issue and very little pushback and no reported regressions.

joseph.olstad’s picture

@mcdruid, if an opt-in option is what it takes to finally get this in, then yes I can do that. Is this what you recommend? if so, let me know I'll roll a patch for this.

mcdruid’s picture

I haven't had a chance to discuss this with @Fabianx yet, but I am starting to think that we should consider an opt-out variable for any changes that could possibly have adverse effects on some sites.

The change in #3 / #136 does look mostly harmless to me (and @David_Rothstein's approval in #125 is reassuring).

However participation in discussion of D7 issues on d.o (or lack thereof) is such that I don't want to assume a few people saying "this works on tens of sites for me" means something is safe for the (we think) hundreds of thousands of sites that are a "silent majority". Giving those sites an escape hatch if they are negatively impacted by a change in core seems a good idea to me. Others may have a different point of view, which I'm happy to hear.

So yes, I'd like to see this change have a simple opt-out option if it doesn't make the code horribly complicated.

joseph.olstad’s picture

@mcdruid, I understand your hesitation on this as a lot of weight on your shoulders but I had another look at this patch just now (patch 136) and it seems that we're actually simplifying the code quite a bit. I just cannot see how this change (improvement imho) could cause a regression. The code we're to replace is problematic and I just cannot see a need for it.

variable_get in this case just needs to get a value, and the code change simplifies the logic for this ensuring it's not getting a stale value.

I'm hoping that @FabianX can re-affirm what he stated in 121 is also true for patch 136.

So for now I'll hold off refactoring for an 'opt-in' with the hopes that @FabianX can review.

After 10 years no one here has reported a single regression. The empirical evidence is very strong and there is extensive test coverage exists for variable_get already included in core.

Thanks,

joseph.olstad’s picture

As a quick followup, I just did a grep in the 7.x branch , there's several pages of results of variable_get being used in our automated tests. We've got good coverage , the fix here is for concurrency which not even D9 tests are currently covering so I wouldn't expect us to have to write concurrency tests for this (although it would be nice).

grep variable_get * -R| grep test

My confidence level in patch 136 is extremely high, 99.9999%

with that said, I'm hoping that @FabianX can weigh in soon.

pounard’s picture

@joseph.olstad confidence burn into flame as soon as you experience weird database concurrency problems :) I know what I'm talking about, we actually did write 2 different versions of variable_get, variable_set and variable_initialize patches for different projects, depending on how the database behave, and how the site was being used. I think there's not global win-win solution for this problem. None of those patches did work for the project they weren't written for, but all did solve real life database concurrency or performance problems.

joseph.olstad’s picture

@pounard, the last time you reviewed this issue was 9 years ago, have you reviewed the newest patch (#136) ? It is vastly simplified (improved) logic.

mcdruid’s picture

I am not a fan of locking with "infinite patience"; looking at this has given me flashbacks to #2406863-146: stampede protection should be flexible ("Cache rebuild lock hit" watchdog message). There are definitely cases where waiting for a lock can be worse for performance than just getting on with it (which often means rebuilding a cache).

I really like the approach in #3 (re-rolled in #136) and it looks to me like in the vast majority of cases it would be an improvement.

However, I can think of some edge cases where it may cause problems. For example (this may seem a bit convoluted / silly but hopefully illustrates the point) imagine a site is (ab)using a variable as a counter. Bad practice, but we've all seen similar / worse things I'm sure.

Without the locking to rebuild the cache I think it's possible that a site will sometimes have a thread which is in the process of incrementing that counter (but has not yet committed the change to the db), while another thread gets a cache miss, loads variables straight from the db and begins the process of incrementing the counter itself, but starting from a lower value than should be the case. Throw the "cache consistent" problem into the mix (whereby a site might be using e.g. memcache which is unaware of database transactions) and I think it's definitely plausible that this change could exacerbate - if not cause - race conditions on some sites.

So I'm inclined to say we should only make this change if we provide an opt-out for any sites on which it causes problems.

Here's a patch which I think does that in a fairly lightweight way.

I'm not convinced about the comment / explanation in default.settings.php yet, but I'm more concerned about reviewing the logic initially.

In a very unscientific test using ab while running a variable_set / cache clear in a loop to try and roughly simulate some churn in the variables / caches, the patch was very marginally faster without the opt-out set. In other words, re-introducing the locking made Drupal slower.

Would appreciate any reviews of the logic especially.

joseph.olstad’s picture

@mcdruid, funny you mention (ab)using a variable_get variable as a counter, this is precisely what I've seen clients do (frequently) (and even lots of experienced long time site builders that are still active did) and 136 is the fix for that. In fact, this is the very reason why my client needs this fix, it was easier to patch core variable_get than it was to rewrite their custom code with an autoincrementer field from the db.

And I was looking at refactoring their code but it was far too much work. I am guessing that lots of shops have done, I've patched core and put a monitoring service on to see if there were any concurrency incidents since then, my monitoring software (a cron job checks every 2 hours for bad data , if found sends me and my client an email warning), has not found a single glitch since patching.

I have a cron job monitoring the server that had the issue and since putting in patch 136 the issue has not resurfaced. Yet they still handle high loads. Prior to the patch 136 they would have concurrency issues once or twice per month where a stale value was returned, since then it has been never a stale value always the correct value. Not one single incident as long as 136 is applied (patch 3 reroll).

joseph.olstad’s picture

the concurrency test for this is have to be within 1000 milliseconds, two requests for the same variable, one request sets and gets the variable, the next sets it and gets it also, without patching the second request incorrectly gets the stale cached value of the first request.

With patch 136 (reroll of 3), the second request gets the correct value (not the stale value), an incremented value as expected.

you can run more concurrency tests by scripting additional concurrent requests that increment and get the variable to prove that patch 136 works.

so instead of 2 concurrent requests, make 10 or 20. Patch 136 will do a lot better than without.

I think if I have time I will try to set up a test case scenario for you to PROVE without a doubt that patch 136 is FAR AND AWAY superior to variable_get that is currently in core.

pounard’s picture

@joseph.olstad I have to agree that #136 seems much simpler, and avoid the terrible wait recursive loop, which is far better in my opinion, it'll work fine (much better than actual code anyway) on all sites that don't use variables for dynamic data.

joseph.olstad’s picture

ok here is the code that exposes the stale cached values when hit twice within 1000 milliseconds from two distinct requests.

function mycompany_next_case() {
  $case_no = variable_get('mycompany_next_case', 10001);
  variable_set('mycompany_next_case', $case_no + 1);

  return $case_no;
}

without patch, stale cached value is returned on second request within 1000 ms of the first request.

WITH patch 136 the correct value is returned

This is a very straight forward and simple use case, very easy to prove that core behavior is undocumented (except for here) and unexpected.

The expected and correct behavior is reliable only by applying patch 136

Without a patch you may get the correct value, other times you may not.

With patch 136 we always get the expected value.

Does this help clarify why patch 136 is needed? This is also a code simplification and an elegant one.

mcdruid’s picture

Unless I'm missing something, #156 does the same thing as #136 unless this is in settings.php

$conf['variable_initialize_wait_for_lock'] = TRUE;

There's the additional cost (compared to #136) of one function call in the event of a cache miss + a failure to get the lock.

The advantage is that any sites which encounter a problems because of the behaviour change have an easy way to opt out.

Have I got the logic wrong, or is the argument against the added complexity of the opt out?

joseph.olstad’s picture

I haven't tested the opt-out patch, I just cannot see a reason for opting out of this fix.
I'm on the phone with the client that was affected by this and they are wondering why this would not get fixed?

***EDIT***
This caused my client huge amounts of grief and several people were notified of the issue it led to a serious security and privacy issue as two people got the same id and the information cross contaminated. Once the issue became known, we found out this spanned several months if not years unnoticed.
***EDIT***

joseph.olstad’s picture

Its quite demoralizing to have to fix the issue several times on several servers over a span of several years and realize that you forgot to apply the patch.

joseph.olstad’s picture

@mcdruid , ya the opt-out seems like added complexity to me. I'll maybe take a stab at writing an automated test to prove the 136 fix using the code I provided in #160.

I am wondering if a simpletest just calling that function 3 or 4 times in a row will prove the failure or if it really has to be a seperate request doing so... maybe this weekend I'll run a few more tests to see if I can write a fails test and prove without any doubt the need for 136.

joseph.olstad’s picture

Tests only patch.

joseph.olstad’s picture

another one, hoping both of these fail.

The last submitted patch, 165: tests_only-973436.patch, failed testing. View results

Status: Needs review » Needs work

The last submitted patch, 166: tests_only-973436-166.patch, failed testing. View results

joseph.olstad’s picture

Status: Needs work » Needs review
FileSize
2.96 KB

add patch

tests only is also the interdiff

Status: Needs review » Needs work

The last submitted patch, 169: D7-variable_set_variable_get-973436-168.patch, failed testing. View results

joseph.olstad’s picture

Status: Needs work » Needs review

Sticking with patch 136 , ignore 172

pounard’s picture

It's a 10 years old bug, it needs to be fixed, if @joseph.olstad patch works and proves having no side effects, I'm in favor of commiting it. There is no drupal dev in the world maintaining complex sites on the long run that never experienced problems with variables. 10 years reapplying patches every upgrade is really a maintenance burden we all want to avoid.

joseph.olstad’s picture

Just to clarify, 136 is basically a straight up reroll of Damien Tournouds patch 3 from 10 years ago, it's brilliant. I changed nothing except for rerolling it. It has stood up the test of time.

joseph.olstad’s picture

As for opt-in, opt-out, one of my clients has over 180 Drupal 7 sites and I have several other clients with lots of sites. Opting-in is a lot of tedious work in this case. Sometimes I go back to maintain sites that have been worked on by others and not correctly patched. Would prefer the one fix for all.

Steven Jones’s picture

@mcdruid / @joseph.olstad The variable system isn't a Software Transactional Memory system. There's a giant race condition baked in that I think comments #156 to #173 that dance around. None of these patches solve that race condition, at all (and that's fine):

  1. Thread A and B load variables, either from the cache or from the variables table. They both have variable: 'foo' == 123.
  2. Both do some work, then thread A increments the variable and calls variable set: 'foo' = 124 in the DB
  3. Thread B then completes its work and then increments the variable and sets it, but because it loaded the original value of 'foo' as 123, it also sets 'foo' = 124 in the DB.

If we're trying to count something, then ideally we'd have ended up with 'foo' == 125, but it's 124 in the DB.

db_next_id() exists if you need a source of strictly increasing integers. If you want to count something, then 'UPDATE table SET col = col + 1' is going to be thread safe on an ACID compliant DB.

@joseph.olstad I assume the sites you describe are essentially setting a variable on a pretty high % of page requests, and are 'high traffic', i.e. more than one current request. In that case, running stock Drupal core, there would likely be a decent number of page requests waiting for the lock to be available, and then as soon as it is, one of them discovers that the cache needs rebuilding and acquires the lock and rebuilds the cache. But then from what you describe it's often the case that that request might invalidate the cache. Then one of the threads waiting for the lock will eventually get it, and rebuild the cache.
As acquiring the lock isn't a FIFO queue, it's simply random, it's entirely possible that a number of the threads will keep waiting on the lock essentially forever.

The patch in #136 gets rid of your threads waiting around, which is indeed a very good thing, but unless your code is doing some magic on the variables table there's still the race condition there, no?

@mcdruid I assume that in #156 what you're doing is essentially de-risking the release in the case that there's some site using variables in a weird way, you can drop something in the release notes and tell people to read them and that'll sort it. In the same way that we hope this fix is going to fix @joseph.olstad's problem, it might cause someone else problems. The variable system is such a fundamental thing that changing the loading behavior could make for weird edge cases where it breaks. I applaud the effort to provide an easy 'opt out' for people who for whatever reason, do not want this change.
Maybe someone is running some system where selecting and unseriazling all those variables has a significantly higher cost than waiting and getting them out of cache and so maybe they'd opt-out of the 'just carry on approach' as that would actually be worse for them. These things are hard to predict!

So, as you asked in #156: the logic seems sounds to me, it is essentially the patch in #136 by default, and the existing behavior with the opt out enabled. The additional function call is basically negligible I'd say, since it really is a single call to another function that then accesses some global memory. If adding that sort of thing is a significant performance penalty, then your Drupal site would have a much bigger problems to solve!

@joseph.olstad I assume you'd be happy with either of #136 or #156 being commited, maybe now that I've given a possible reason someone might want to opt out?

mcdruid’s picture

@Steven Jones thanks for a great summary.

Your description of my approach aiming to "de-risk the release" is spot on; my hope was that #156 is the same as #136 by default but with an opt-out for any sites that want to keep the existing behaviour (for whatever reason). Thanks for your review of the logic.

I was once asked to provide examples of "something inconceivable that might happen" and had to point out that by definition it's pretty hard to conceive of something inconceivable ;) As you mention, edge cases where this change may make things worse might be hard to imagine, but I am uncomfortable forcing it on all D7 sites without providing an opt-out.

Are we RTBC on #156?

Also, the patch(es) we're now considering don't really reflect the issue title here or the current IS, which I'll try to address if/when this is committed.

joseph.olstad’s picture

Ok on 156

I think 156 is overly cautious but it is an improvement.

In all of my 10 years on Drupal extensively with over 250 different apps on drupal including one for an international public govt celebrity starts with "J" running high volume spikes on a clustered mysql backend I have very rarely maybe once or twice in 10 years seen a lock issue and it was caused by scripting or very poor implementation unrelated to the variable table that we were able to resolve.

variable get should rebuild cache when set, it's logical to me in terms of caching. similar to expiring a page cache when a node is updated.

With that said I appreciate the 156 opt in which config is better than patching.

I'm ok with either 136 or 156.

Thanks for your time on this.

Steven Jones’s picture

Status: Needs review » Reviewed & tested by the community

So I think we are RTBC on #156 with the caveat that you might want to improve the comment on commit?

mcdruid’s picture

Issue tags: +Pending Drupal 7 commit

Great, thank you.

Yes, any suggested tweak to the comments for default.settings.php welcome - I'll have a think.

joseph.olstad’s picture

@mcdruid, I just re-read 156, this is perfect thanks!

so if I understand the patch correctly you're providing an option to use the previous logic but the default will be to use the new logic.

great! looking forward to this.

izmeez’s picture

While I think I understand the purpose of the patch I am finding the comment in the default.settings.php a bit unclear.

+/**
+ * Opt out of variable_initialize() locking optimization.
+ *
+ * After lengthy discussion in https://www.drupal.org/node/973436 a change was
+ * made in variable_initialize() in order to avoid excessive waiting under
+ * certain conditions. Set this variable to TRUE in order to opt out of this
+ * optimization and revert to the original behaviour.
+ */

Isn't the purpose of the patch to use a lock to prevent arriving at the wrong result and the concern is that under certain conditions it may result in excessive waits in which case one may want to opt out of this optimization?

If this is the case the wording may need to be changed to something like:

+ * After lengthy discussion in https://www.drupal.org/node/973436 a change was
+ * made in variable_initialize() that includes locking to ensure the correct
+ * result is returned. This may result in excessive waiting in certain
+ * conditions. To avoid excessive waiting set this variable to TRUE to opt out
+ * of this optimization and revert to the original behaviour.
+ */
joseph.olstad’s picture

I just spent 30 minutes looking at this patch again closely and figuring out why it works so well (136).

I think its important to mention the crux of this fix and why we don't need an opt-out (but I'm not against it I just think it's not necessary).

The crux is this, with the fix, the cache_set function (a write to the db) is ONLY called if a lock is successfully made, notice the 1 second lock, this is consistent with what I observed on unpatched systems getting stale values if two requests within 1000 milliseconds, so with the new logic there won't be a stampede ever and always correct uncached value on a cache miss due to the fact that only the successful 1 second lock does the write to cache. If someone else (another request) has a lock then there will not be a cache_set the uncached result is instead retrieved directly from the db (a read).

with the patch 136/156 this cache miss logic is only invoked when the cache is missed (patch 156 slightly tougher to read and understand but looks ok). If the cache is missed without a lock there's a read straight from the db, something that Drupal does all the time anyway. If a lock is successfully made, a write to the cache is performed.

Voilà les raisons que je n'ai aucun souci avec ce patch!

as for the wording of the settings.default.php the original wording from 156 is sufficient. We don't know if anyone will actually need this but if we (in the unlikely event) get support requests about this we can quickly point to the option that allows this to be opted-out.

Steven Jones’s picture

@izmeez no :)

At this very moment Drupal core does use a lock, but this lock is only to prevent multiple threads doing the exact same work: to take the variables from the variable table and put them into the cache system.
Having them as one giant lump in the cache system 'improves performance by 20% when serving cached pages' according to a comment in variable_initialize.

The patch in #156 changes that so that if the variables are not in the cache system, a single thread still does the work and pops them all into the cache system, but while that thread is doing that, all of the other threads are not prevented from continuing, they simply read the variables from the table (possibly incurring a 20% performance penalty.)

However, what patch #156 also allows is configuration so that someone can opt-out of that behavior, if for example as mentioned in #177:

someone is running some system where selecting and unseriazling all those variables has a significantly higher cost than waiting and getting them out of cache

Then maybe they'd actually be looking at a huge performance issue. For example, maybe their cache system is a very, very fast but their backend storage database is very, very slow and can't handle lots and lots of connections. In that case, they would almost certainly want to preserve the current behavior and block other threads rather than overloading their backend database.

What they are opt-ing out of is the 'carry on with selecting the variables from the database table directly in the event of a cache miss and another thread is going set the cache in a moment' behavior.
Drupal will still be using locks to ensure that only one thread bothers to do the cache set/write.

@joseph.olstad I think that the variable system is probably best thought of as an 'eventually consistent' system. It's riddled with race conditions, and you may or may not get some depending on the state of transactions during the page request and configured cache backends etc. I think we're on the same page, but I'll just say again that the variables system not a Software Transactional Memory system nor does it use locks to ensure that things like 'incrementing' a variable are guaranteed to work as you may expect. If you need those things then #177 mentions some alternatives. No amount of fiddling with the patches in #136 or #156 is going to change that.

@mcdruid How about:

After lengthy discussion in https://www.drupal.org/node/973436 a change was made in variable_initialize() in order to avoid excessive waiting under certain conditions. More specifically: if a variable is changed or the variable cache otherwise cleared, multiple concurrent request threads would have coordinated using a lock so that a single thread would rebuild the variable cache and then all the other threads block/wait until that cache entry was available. This was changed so that by default the other threads would read the variables directly from the variables table too while the cache was being rebuilt.

On most sites this change should have very little negative impact, as variables are changed relatively infrequently and the additional cost of multiple threads briefly doing the same work should be small. However, depending on the exact site configuration and database and cache components used there's a small possibility that this additional cost would be high enough to outweigh the benefits of allowing multiple threads to continue in parallel.

Set this variable to TRUE in order to opt out of this optimization removing the coordinated waiting and revert to the original behaviour.

Maybe it would be clearer to actually make it an opt-in to the previous 'blocking' behaviour?

joseph.olstad’s picture

The opt out is to read the stale value from the locked thread rather than read the correct current value from the variables table directly. Both are reads, the value in the variable table is correct it is the cached value that sometimes ends up stale.

Both 136 and unpatched core correctly prevent stampeding cache table writes only the thread which acquires the 1 second lock is able to write to cache.

What we are fixing is to bypass cache and read the correct value with db query. There is no risk of a stampede in either case because neither approach results in more writes.

I am not sure why someone would want to opt out of a table read in this case but that is their prerogative.

The use case I outlined is using variable set increment value and get to get the new count. variable_set is working it is variable_get that we are patching.

The issue summary is somewhat misleading.

Fabianx’s picture

Category: Task » Bug report

This is a critical bug for some and not circumventable easily.

So let's classify as a bug.

The bug is endless rebuilding of the variables cache.

I am okay with opt-in (as it fixes a bug) and this is approved after issue summary update.

There are three different sites:

a) Using DB purely -> not an issue as the write incurred by the "lock()" is worse performance than the "select()" anyway prompting some to remove the lock all-together (and for those site that indeed would be best)

b) Using Memcache / redis and high performance -> Hopefully have done their homework to have as less variable_set calls as possible, it could stampede for them in theory, but there are other things loaded from the DB (like roles) that hopefully they should cope well. Worst case they know what they are doing and can opt-out again.

c) Abusing the variable system with high traffic -> Lot's of variable_set for setting state(). For those the issue is solved now.

So all that don't care as much win and those that specifically tune can measure and opt-out if needed.

joseph.olstad’s picture

mcdruid’s picture

Title: variable_set() should rebuild the variable cache, not just clear it » Overzealous locking in variable_initialize()
Issue summary: View changes

Updating IS.

mcdruid’s picture

FileSize
2.01 KB

#156 need a reroll - no interdiff, as the problem was just other variables have been committed to default.settings.php

I'm actually fairly happy with the minimal comment in default.settings.php as it is; @Steven Jones provides much more detail in #185 but I'm inclined to leave that for the CR / release notes and keep the comments in (default.)settings.php simple.

joseph.olstad’s picture

Looks good!

  • mcdruid committed 9211220 on 7.x
    Issue #973436 by catch, joseph.olstad, beejeebus, karschsp,...
mcdruid’s picture

Status: Reviewed & tested by the community » Fixed
Issue tags: -Pending Drupal 7 commit

Thanks everyone!

joseph.olstad’s picture

Excellent! my patch anxiety levels have dropped significantly!

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.