Problem/Motivation

With a non-DB backed cache, updating nodes can result in stale data being in the cache due to a race condition.

This is because when an entity is saved we open a new DB transaction, during which we do all the updates to the data tables, and we clear the relevant items in the cache tables, at the end of the saving process the transaction is committed and all of those operations happen as one atomic piece.

If however, your cache backend is not the DB, then the cache flush will happen immediately, allowing a window for another operation to come in and cause the flushed cache items to be rebuilt from the non-updated DB source data, because the transaction hasn't been committed yet.

Proposed resolution

Many options. I think that we're still at the stage of getting to consensus on the issue.

Remaining tasks

Obtain consensus on a sensible way to resolve this, and then code that up.

I'm not sure that this is testable? Can we test race conditions?

User interface changes

None.

API changes

None at the moment.

Original report by @slashrsm

There can be a race condition when using some other cache back-end than DB (Memcached in my case). Data will be saved in DB at the end of function since node_save() uses db_transaction(). Field cache is, on the other hand, cleared at field_attach_update().

If there comes a request to the node being saved just after the field cache was cleared and before the transaction was commited we end up with a outdated data in field cache.

This will happen only if some other cache back-end than DB is used. Field cache will be cleared when transaction is commited, so there should be no problem in this case.

Related issues:
#1658672: It seems as sometimes empty and old nodes are served by memcache
#1677830: Node not saved when hook_node_insert() or hook_node_update() are invoked

And some more background about this issue:
http://janezurevc.name/strange-behaviour-race-condition-during-nodesave

Comments

marcingy’s picture

Status: Active » Closed (won't fix)

This is surely something for memcache to fix rather than core - memcache could register a shutdown function or it could add items to queue to be cleared etc. Basically as this is a contrib issue the contrib module needs to (and can) provide a solution.

I have actually been bitten by this issue and fixed it via a queue mechanism which works prefectly or so it seems. We didn't track down the cause so have not contributed a patch back because we thought it was something unique to us. I will try and create a couple of alternative patches for memcache this weekend that can get round the issue.

slashrsm’s picture

I can agree with this, but I'd still like to have this documented in core. Issue: #1677830: Node not saved when hook_node_insert() or hook_node_update() are invoked

msonnabaum’s picture

Priority: Normal » Major
Status: Closed (won't fix) » Active

This is absolutely not contrib's responsiblity, this is a core bug. A cache backend shouldn't have any knowledge of possible race conditions in core.

Part of the problem here is when we decided to stop using the revision id to key the field cache in http://drupal.org/node/456488.

Now we have the same cache id for all revisions, which does not work when MySQL is returning a stale vid because the update transaction hasn't been committed yet.

Not quite sure what to do to fix this yet, but it needs discussion.

bjaspan’s picture

I haven't worked on Fields or core in a long time, but I agree this does sound like a core bug.

Offhand idea for a fix: The problem is that a revision-specific cache entry is being created but the read is revision-agnostic, assuming it is going to be for the latest version. We went to revision-agnostic cache entries to avoid caching data for old revisions that will rarely be used. Since there is no way to create an atomic transaction with an external database system, perhaps the caching strategy should be changed: Go back to using the entity and revision id as key, but when clearing the field cache before an update, always clear it for all revisions of the entity. e.g.

field_attach_update: cache_clear_all('field:node:[nid]:*', 'cache_field') // clear cache for all revisions
field_attach_load: cache_get(field:node:[nid]:[vid]), cache_set(field:node:[nid]:[vid], $data) // get and set cache for the current revision

Of course, this assumes we can do a prefix cache_clear_all(). I don't know/remember if the cache api allows that.

FYI, I am unlikely to see updates to this issue unless someone pings me directly.

bjaspan’s picture

More thoughts:

If there is no prefix cache clear, perhaps field_attach_update() could clear the cache entry for the *previous* revision before saving the new one. Of course, this requires that saves be strictly serialized, again requiring a transaction involving the external cache backend. I guess if there is a transaction that updates the entity's table in SQL first, then we know they are serialized.

Another option is a background process that clears all field cache entries for non-current revisions. That's going to be painful at large scale, though, just to figure out what the entries to clear are.

msonnabaum’s picture

Status: Active » Needs review
StatusFileSize
new1.14 KB

Totally untested, but maybe something like this could work.

Anonymous’s picture

the race is made worse by an external cache, but is not at all confined to it.

even with db-backed cache, any process that misses cache and reads from the db while an update is happening in another process can write stale info back to the cache just after the update process commits a cache clear.

isn't the safest way to do this from a race-condition-p.o.v. to not write to the cache from a read operation? instead, write through to cache on insert / update? that makes the 'ZOMG my cache went away' case scarier, so i can see that not being popular...

Status: Needs review » Needs work

The last submitted patch, fields-cache-race-condition-1679344-6.patch, failed testing.

catch’s picture

Version: 7.x-dev » 8.x-dev
Issue tags: +Needs backport to D7

isn't the safest way to do this from a race-condition-p.o.v. to not write to the cache from a read operation? instead, write through to cache on insert / update? that makes the 'ZOMG my cache went away' case scarier, so i can see that not being popular...

It would be nice to write through to the cache instead of blowing it away, but I don't think we can reliably do that at the moment since we cache hook output (i.e. hook_field_load()), so we'd still have to do the full entity load before writing back anyway. Also we'd still need a fall back for reads since there's evictions, general cache clears etc. so it'd still leave a race condition window open.

What about using a lock for this? Lock at the start of entity_save() and if any process gets a cache miss and can't acquire the lock it doesn't try to write back.

damien tournoud’s picture

Because we don't have multi-stage commit capabilities, the only way to make this atomic would be to:

  • Add a version column to the base entity table, that gets incremented every time the entity is saved (right now, we do have entities without revision support and we do allow an entity to be saved without a new revision created, which are both mistakes, but we are not going to fix that in this issue);
  • Save the cache keyed by version

That way:

(1) While the transaction has not been committed, other requests see the old version from both the database and the cache
(2) If the transaction is rolled back, everyone still see the old version from both the database and the cache
(3) When the transaction gets committed, everyone immediately see the new version from both the database and the cache

damien tournoud’s picture

(And the nice thing about mandating entities to get a version field is that we can easily implement optimistic locking on top of it.)

gielfeldt’s picture

This is issue actually applies to all cache operations performed inside transactions, not just for entities or fields. The default REPEATABLE-READ isolation level in MySQL also only adds to the problem.

even with db-backed cache, any process that misses cache and reads from the db while an update is happening in another process can write stale info back to the cache just after the update process commits a cache clear.

This will only be a problem if the update is not atomic (including the cache update), i.e. if the update is not using transactions.

This is surely something for memcache to fix rather than core - memcache could register a shutdown function or it could add items to queue to be cleared etc. Basically as this is a contrib issue the contrib module needs to (and can) provide a solution.

A little cache wrapper I made, trying to address the general issue/root cause in "user-space":
http://drupal.org/sandbox/gielfeldt/1946668

damien tournoud’s picture

@gielfeldt: REPEATABLE-READ is "stronger" then READ-COMMITTED, so you must be mistaken somewhere.

gielfeldt’s picture

It's precisely because it's stronger that it's a problem. Even if we manage to postpone cache operations until after a transaction has been committed, there's no way to absolutely ensure that synchronization between memcache and the DB because of the consistent snapshot mechanism. I'll try to draw it up tomorrow when I get the time.

gielfeldt’s picture

+------------------------------------+--------------------------------------------------------------------------+
|             Request A              |                                Request B                                 |
+------------------------------------+--------------------------------------------------------------------------+
| Start transaction                  | Start transaction                                                        |
| update some data for some entity A | select something from db (consistent snapshot begins!)                   |
| Commit transaction                 | time goes by                                                             |
| clear cache for that entity A      | time goes by                                                             |
|                                    | check cache for entity A                                                 |
|                                    | cache not found!                                                         |
|                                    | load entity A (loads old data! consistent snapshot began before commit!) |
|                                    | populate cache (with old data!)                                          |
+------------------------------------+--------------------------------------------------------------------------+

EDIT: Each row should be viewed as a discrete point in time.

dcam’s picture

http://drupal.org/node/1427826 contains instructions for updating the issue summary with the summary template.

The summary may need to be updated with information from comments.

pwaterz’s picture

We are having this issue happen all the time in d7 with memcache and node save. The only work around that I have come up with this point is to add this to the settings.php

$conf['cache_class_cache_field'] = 'DrupalDatabaseCache';

Firemyst’s picture

We had a similar problem, so we tried to replace memcache with Redis, but the issue was not resolved. This is not a memcache specific issue.

gielfeldt’s picture

True, it is not memcache specific. This is issue applies to any setup that is not using the database as the cache backend.

pwaterz’s picture

Yes Gielfeldt. Firemyst, my proposed work around will work for redis as well.

gielfeldt’s picture

my proposed work around will work for redis as well

You could say that. Your work around is to not use redis :-)

Also, just doing it for fields is not enough. The problem is much more fundamental and can show up whenever a transaction is used. Custom module using node-hooks and performing their own cache operations inside these. Or the EntityCache module for example, is also prone to this error.

My consistent cache wrapper was never completed, nor have I had the time to do so :-(. Instead, I've include the transactional safe cache wrapper from AutoSlave, which is working and solves the underlying problem. There are 2 requirements though. A core patch and configuring the database to use isolation level READ-COMMITTED. Perhaps some inspiration from this module can be used to fix Drupal in regards to this issue.

xjm’s picture

Issue summary: View changes
Issue tags: +affects drupal.org
xjm’s picture

Component: node.module » node system

(Merging "node system" and "node.module" components for 8.x; disregard.)

catch’s picture

I still think we can use a lock for this, and Damien's answer in #10 didn't sufficiently explain why we can't.

Trying to version key entity cache entries runs into a chicken and egg problem. When you request the entity, you don't know the version, so you'd need to query the database to get the version, before you can request the item from cache. That reduces the performance benefit of having the caching in the first place - doubles the queries for a start.

If we added a lock, it'd look like this:

Entity system acquires a lock based on the entity type + ID, would mean waiting if the lock is already acquired per the usual pattern, release the lock at the end of the save operation. Same with deletes as well.

Cache misses can try to acquire the lock before writing, and just not bother trying to write to the cache if the acquire fails.

Even if it's not perfect it'd be a lot more robust than what we have now, and there's only extra database queries on writes.

steven jones’s picture

Assigned: Unassigned » steven jones
Issue tags: +Needs issue summary update

Taking a look at this.

steven jones’s picture

Assigned: steven jones » Unassigned
Issue summary: View changes
Issue tags: -Needs issue summary update

Sorry, really don't understand D8 enough at this point to make a more meaningful contribution.

drumm’s picture

This issue is the root cause of the brief Drupal.org outage earlier today.

Because of this, Drupal.org must run with a DB-backed field cache. We occasionally get deadlocks when waiting for MySQL to finish clearing the field cache. MySQL isn't exactly designed to be a cache backend. Switching to another cache backend would avoid the problem for Drupal.org.

mikeytown2’s picture

@drumm
#2222635-26: Waiting for table metadata lock on cache_field table
#2193149-3: Deadlocks occur in cache_field table.
We are using both of these patches in production; it's a 99% logged in site.

drumm’s picture

Those look like okay mitigations. We'll take a look at them for Drupal.org. It would still be best to resolve this issue, so we can use MemCache as a field cache backend.

mikeytown2’s picture

Totally agree, but sometimes you need some duct tape. With this being D8 & "needs work" without a viable patch it might be a little while before this hits D7.

jackbravo’s picture

Do we have steps to reproduce this issue in a normal install?

pwaterz’s picture

I agree with catch in comment #24.

xjm’s picture

Since Drupal.org is running Drupal 7, it's okay to also submit a Drupal 7 patch for this, and then we can forward port it to Drupal 8 here in the issue as well. If needed, we can presumably use the D7 patch for Drupal.org until it's fixed in HEAD.

gielfeldt’s picture

A way to fix this could be to make an event when a transaction finishes, so that cache backends may react on this; i.e, they can buffer cache writes if a transaction is in progress and then actually perform them on transaction finish. (Note, the event must occur after the actual commit).

It would be a minor fix to core, but would require some extra logic in the non-db cache backends.

gielfeldt’s picture

Version: 8.x-dev » 7.x-dev
StatusFileSize
new4.87 KB

Here's a proposal for D7.

The patch is the same as bundled with: https://www.drupal.org/sandbox/gielfeldt/1946668

The patch merely provides a new cache interface for contrib modules to implement.

gielfeldt’s picture

Status: Needs work » Needs review
fabianx’s picture

I like this approach a _lot_. This puts the burden on contrib/ but contrib can then do something about the transactions and maybe even get locks themselves.

RTBC + 1 from me for this approach, but no real RTBC as we could still just acquire a lock for the issue at hand here.

drumm’s picture

Looks good!

An implementation for memcache would be great both for Drupal.org and as an example for reviewing this patch.

gielfeldt’s picture

StatusFileSize
new4.76 KB

While the patch provides the possibility to address this issue, it does require a bit of logic in the cache backend to handle it properly. The memcache project could look at the cache consistent for inspiration, but perhaps it could be implemented in simpler way than I did there?

If this paradigm is followed, then each backend would have to implement the transaction handling, and we might end up with a lot of duplicate code across projects. What to do? Could core provide some helper functions perhaps?

I've attached an updated version of the patch, just in case.

catch’s picture

I'd rather do the lock here since that doesn't require a 7.x API change, then open a separate issue to discuss the transaction support for cache backends.

fabianx’s picture

It seems like d.org could _try_ using consistent cache module with memcache. I had a quick look through and the ConsistentCacheLookup class looked pretty good to me using the intelligent buffer.

drumm’s picture

I filed #2316057: Move field cache to memcache for deploying a solution for Drupal.org. We'll have to hack core and go with what works. In the meantime, it would be great to have a solution in core, so we don't have too many core hacks pile up.

David_Rothstein’s picture

An API change (i.e. breaking the cache API) would not be good for 7.x here, but the latest patch looks like only an API addition... which I think is fine if it turns out to be the best way to fix this technically.

fabianx’s picture

#10:

I think something like that is possible and there is two ways to do this:

a) Add a timestamp or hash of timestamp to the cache ID and add some garbage collector for old cache entries.


function field_get_cid($id, $entity_type, $entity, $entity_info) {
  $cid = "field:$entity_type:$id";
  if (!empty($entity_info['entity modified property']))) {
    $modified_property = $entity_info['entity modified property'];
    $modified = $entity->${modifieed_property};
    $cid .= ':' . $modified;
  }
  
  return $cid;
}

field_update_cache:


// ...

  if ($entity_info['field cache']) {
    $cid = "field:$entity_type:$id";
    if (!empty($entity_info['entity modified property']))) {
      $modified_property = $entity_info['entity modified property'];
      $modified = $entity->original->${modified_property};
      $cid .= ':' . $modified;
   }
   // Cache clear all is only needed for old data
   cache_clear_all($cid, 'cache_field');
  }

That effectively implements serialized approach, but uses timestamps instead and leads at least to the new entity being loaded always populate the correct timestamp, in the race condition case the old data is just left over.

b) Use cache re-validation instead of invalidation

// Remove the cache clear all

Use the modified property and store the cache like:

$cache->data == array(
  'field_data' => $data,
  'revalidation' => array(
    'modified' => $modified,
  )
);

After a cache_get ensure the cache is still matching the right timestamp. This is better, because less clearing, therefore less DB operations, and just a race on cache_set with the same data.

In core only users don't have modified timestamps (@see entity_modified contrib module) so this should work well for the primary use case of nodes and being purely internal is probably not an API change.

After reviewing a lot of the current invalidation strategies and trying out the hash based strategy in render_cache, I found invalidation to be the best strategy and here we have a simple key to know if the entity was invalidated.

Edit: And YES! vid should be added as well at least to the revalidation as proposed already by msonnabaum in #3 as that will fix the issue for revisionable entities with revisions enabled obviously.

mikeytown2’s picture

I'm working on a MySQL specific cache backend and I want to make sure I have this logic correct. The cache db that I'm working on uses READ-UNCOMMITTED which means that any write (Insert, Update, Delete) inside of a transaction will be seen by it.

Heart of the issue from my understanding:
node_save() calls field_attach_update() which then calls cache_clear_all("field:$entity_type:$id", 'cache_field');. Another request hits node_load() at the perfect time so that when DrupalDefaultEntityController::load() hits field_attach_load() the cache is empty but the node/field transaction has not been committed so the old data gets saved to the cache.

Current check

    // Special case for https://www.drupal.org/node/1679344
    // Revisit this once a better workaround has been found.
    // By using READ-UNCOMMITTED this might not be needed.
    if (   $this->bin === 'cache_field'
        && is_string($cid)
        && strpos($cid, 'field:') !== FALSE
        && substr_count($cid. ':') >= 2
        && Database::getConnection()->inTransaction()
        ) {
      // Use default core logic for this cache clear.
      return parent::clear($cid, $wildcard);
    }

I think I can do the cache_clear in this situation outside of a transaction due to this connection using READ-UNCOMMITTED. Is my understanding of the situation correct or did I miss something?

As for a solution, having a trigger when the transaction finishes seems like a good general approach; buffer all writes to the cache until the transaction is done.

gielfeldt’s picture

Heart of the issue from my understanding:
node_save() calls field_attach_update() which then calls cache_clear_all("field:$entity_type:$id", 'cache_field');. Another request hits node_load() at the perfect time so that when DrupalDefaultEntityController::load() hits field_attach_load() the cache is empty but the node/field transaction has not been committed so the old data gets saved to the cache.

This is correct. However, the issue is not isolated to node_save() as such, but is present everywhere a cache-write operation happens inside a transaction.

Using a database cache backend with READ-UNCOMMITTED is equivalent of using e.g. memcache wrt the problem stated in this thread.

I'm not sure what the code you posted is supposed to do though?

As for a solution, having a trigger when the transaction finishes seems like a good general approach; buffer all writes to the cache until the transaction is done.

This is exactly what Cache Consistent does.

deviantintegral’s picture

StatusFileSize
new856 bytes

I've been dealing with a related issue, though it's not specific to the field cache. We have a site with many scheduled revisions to content, and often the revision change is just a change in $node->status, so we end up serving nodes marked as unpublished from the cache.

Is there a reason we're preserving $transaction all the way through to the end of node_save()? If not, I think we can unset $transaction before we call resetCache(), ensuring the that there's no race between parallel requests and rebuilding the entity cache.

Here's a patch - but I don't think this supersedes other patches here, but perhaps might be useful in addition.

damienmckenna’s picture

FYI the d.o maintenance team worked around this by using the Cache Consistency module.

drumm’s picture

Yep, it is working well. We are still using "Method #1 (alternate)" from https://www.drupal.org/project/cache_consistent.

damienmckenna’s picture

@drumm: Thanks for mentioning that, I figured it'd be useful to note here so if someone else found the issue they'd be aware of the workaround.

fabianx’s picture

#47: Surprisingly simple, but a different issue, because the field cache still is affected in the same way. Can you open up another issue?

We might have _that_ problem in D8, too still.

However we could chain through from entity to clear the field_cache, too. Not ideal, but possible.

tim.plunkett’s picture

Version: 7.x-dev » 8.0.x-dev
Status: Needs review » Postponed (maintainer needs more info)

Is this still relevant in D8? If not, move it back to D7 and remove the backport tag.

fabianx’s picture

Status: Postponed (maintainer needs more info) » Active

Yes, it is.

joelpittet’s picture

There was a couple pretty nasty bugs in cache_consistent that have been fixed it seems in the -dev release. I'm testing them out and seem to be resolved. Just wanted to mention since a few of you likely use it.

One where child menu items were disappearing sporatically, which seems to be resolved. Another where taxonomy terms returned all terms in a multiple get as the same entity by accident and threw a fatal error in my case when saving multiple tags on an article.

#2469925: Child menu items disappear
#2533274: Loading multiple taxonomy terms ConsistentCacheBase::matchKey keeps returning the same values for each entity

Status: Active » Needs review

Status: Needs review » Needs work

The last submitted patch, 47: 1679344.47-unset-transaction.patch, failed testing.

ehegedus’s picture

Issue tags: +Needs reroll

The patch from #39 does not apply to Drupal core 7.40, needs reroll.

ehegedus’s picture

Issue tags: -Needs reroll
StatusFileSize
new4.77 KB

Re-rolled against Drupal core 7.40:

  • Resolved conflict in the _cache_get_object() function's documentation.
ehegedus’s picture

Re-rolled against Drupal core 7.40:

  • Same conflict as above.

Note: Didn't rename because the cache_consistent module ships with these two patches.

torgospizza’s picture

Status: Needs work » Needs review

Setting to Needs Review as there is a new patch.

The last submitted patch, 58: drupal-7.28-transactional-cache-1679344-58.patch, failed testing.

Status: Needs review » Needs work

The last submitted patch, 59: drupal-7.36-transactional-cache-1679344-59.patch, failed testing.

joelpittet’s picture

Version: 8.0.x-dev » 7.x-dev
Status: Needs work » Needs review

The last submitted patch, 58: drupal-7.28-transactional-cache-1679344-58.patch, failed testing.

Status: Needs review » Needs work

The last submitted patch, 59: drupal-7.36-transactional-cache-1679344-59.patch, failed testing.

gielfeldt’s picture

StatusFileSize
new4.74 KB

Re-rolled against dev.

gielfeldt’s picture

Status: Needs work » Needs review
joelpittet’s picture

Version: 7.x-dev » 8.0.x-dev

Back to 8.0.x as I changed it for testing and forgot.

jgrubb’s picture

Status: Needs review » Reviewed & tested by the community

This is incredibly confusing since there are patches for 7.x flying around in this issue that's marked as being for 8.x now. I can confirm that the patch in #66 works in production against 7.41 though.

catch’s picture

Status: Reviewed & tested by the community » Needs work

This needs a patch that's commitable to 8.x to be RTBC for 8.x.

gielfeldt’s picture

Status: Needs work » Needs review
StatusFileSize
new6.81 KB

Here's a patch for D8.

It fires events when transactions are started/committed/rolled-back. No integration from the DB-layer to the Cache Backend at all, so a bit cleaner than the D7 version.

Not sure though, if this is the "proper" way to do it.

EDIT: probably also needs some testing ...

Status: Needs review » Needs work

The last submitted patch, 71: drupal-transaction-events-1679344-71-8.0.x.patch, failed testing.

gielfeldt’s picture

Any test-savvy folks, who can tell why the tests are failing?

drumm’s picture

https://dispatcher.drupalci.org/job/default/34960/consoleFull has the full output from the test. Something about how event_dispatcher seems like it isn't working properly, or needs more scaffolding in the tests.

gielfeldt’s picture

Thx.

I can't really make out, if it's because the tests need to setup an event_dispatcher, or if it's better/necessary to inject a dispatcher service into the database-connection instead of using \Drupal::service('event_dispatcher').

... or both.

Any thoughts/ideas/solutions anyone?

jgrubb’s picture

Is there a 7.x issue for this? How can I help get the patch in #66 rolled into core?

catch’s picture

Version: 8.0.x-dev » 8.1.x-dev

I'd probably do this like the following:

- don't add anything to the database layer (neither cache backend nor events)
- add a new class that handles creation and rolling back transactions as a wrapper
- cache backends are already tagged services, so we can loop over them, check a new interface and call a method - no need for the extra indirection of an event.
- implement that interface for the backends that need it in core.

Since this is a new API, it needs to be committed to 8.1.x rather than 8.0.x, but could then be directly backported to 7.x once it's in.

gielfeldt’s picture

Ok. But are we certain that we'll catch all transactions correctly via the current transaction class design? Or are you thinking we should refactor this? The commit and rollback actions, for example, is currently spread out over multiple places.

catch’s picture

Issue tags: +Needs tests

But are we certain that we'll catch all transactions correctly via the current transaction class design?

No. I think this behaviour will have to be opt-in, but it's mostly an issue for entities, and we can opt-in those pretty much all at once. We could also deprecate using the current transaction handling in general when it's working for the main cases, to encourage people to convert custom code that might have the same issue.

Or are you thinking we should refactor this?

Possibly we'd have to get the new class to handle transaction commits - them being implicit in __destruct() currently will make it hard to do what I suggested in #78.

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.0-beta1 was released on March 2, 2016, which means new developments and disruptive changes should now be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

gielfeldt’s picture

Are we sure we don't want an event system on the database layer? If a cache backend is only triggered directly by transaction events, then what happens if a cache backend is instantiated _after_ a transaction has been started? Wouldn't it be better to decouple it completely, like in the patch?

damienmckenna’s picture

FYI the patch in #66 still commits correctly against the latest 7.x-dev.

damienmckenna’s picture

Status: Needs work » Needs review
StatusFileSize
new6.6 KB

This is a reroll of #71 against 8.2.x-dev.

Status: Needs review » Needs work

The last submitted patch, 83: drupal-n1679344-83-8.2.x.patch, failed testing.

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.0-beta1 was released on August 3, 2016, which means new developments and disruptive changes should now be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

gielfeldt’s picture

StatusFileSize
new6.27 KB

This is a reroll of #83 against 8.2.x-dev.

Also changed the transaction event to pass the database connection object instead of just the transaction depth.

gielfeldt’s picture

Status: Needs work » Needs review

Status: Needs review » Needs work

The last submitted patch, 86: race_condition_in-1679344-86.patch, failed testing.

gielfeldt’s picture

Status: Needs work » Needs review
StatusFileSize
new12.08 KB

Fixed existing test cases by injecting event dispatcher into database object, and added tests for the database events.

gielfeldt’s picture

I just realized that the event dispatcher isn't injected into the database, because it's not available at that time :-(

gielfeldt’s picture

StatusFileSize
new12.67 KB

There. Not sure if this is kosher, or if the container and event_dispatcher service needs to be initialized earlier somehow.

Crell’s picture

Status: Needs review » Needs work

The database system should not have any dependency on the container, even indirectly. That's begging for circular dependency nightmares.

I also really don't like making the database depend on the event system. If anything, I want to eliminate the one hook that it has in it to make it more decoupled, not less. Adding another dependency to what is *almost* Component-eligible is a no-go in my book.

Even if we absolutely must give it an event dispatcher (which I do not yet believe is the case), then any awareness whatsoever that it's a container-specific dispatcher is off the table. It should be purely EventDispatcherInterface from Symfony, and absolutely not ever touch \Drupal, ever. This patch is filled with technical debt. And that's before we consider the performance implications of adding multiple event calls to every transaction.

Really, trying to solve this at the database connection level is the wrong place to begin with.

gielfeldt’s picture

The database system should not have any dependency on the container, even indirectly. That's begging for circular dependency nightmares.

You're probably right about this.

I also really don't like making the database depend on the event system. If anything, I want to eliminate the one hook that it has in it to make it more decoupled, not less. Adding another dependency to what is *almost* Component-eligible is a no-go in my book.

The way I see it, in order to solve this problem, the cache layer must be aware of the transactions. Having the database layer depend on the cache layer, as previously suggested in this thread, only makes the coupling tighter imo. Plus it also adds the caveat of only working for already instantiated cache backends when the transaction occurs.

Even if we absolutely must give it an event dispatcher (which I do not yet believe is the case), then any awareness whatsoever that it's a container-specific dispatcher is off the table. It should be purely EventDispatcherInterface from Symfony, and absolutely not ever touch \Drupal, ever. This patch is filled with technical debt. And that's before we consider the performance implications of adding multiple event calls to every transaction.

Again, relying on the container is probably not the best way.
To which technical debt are you referring, besides the container dependency?
Regarding performance, will this really be a problem, considering it only happens on transactions? How often do transactions really occur? Sites are usually read-heavy. There have been discussions about using locks instead, but I don't believe this will be better performance-wise, nor implementation-wise for consumers of cache. The crux of the matter is that the cache layer needs to know when a transaction occurs.

Really, trying to solve this at the database connection level is the wrong place to begin with.

How do you suggest we go about this problem then? This issue is over 4 years old, still unresolved. The other solutions in this thread either targets only node/field save using locks or unique-id-galore, the latter which might address the transaction-problem, but indirectly and would probably require lots of refactoring, since this would essentially apply to every table. I still believe using locks for this would cause more harm than good (lockups, timeouts).

I'm not hell-bent on this particular solution, but I haven't found/seen anything else that solves it.

Crell’s picture

First, sorry if I came off harsher than intended. I've been fighting to keep extra "convenience" dependencies out of the DB layer for years, only sometimes successfully. :-( (Drupal has a tendency to add dependencies over time until they create a knot of mess we have to detangle, which again takes years. See also: Drupal 8.)

The problem here isn't that the cache system needs to be aware of a database transaction. Rather, it's that we should assume that we *cannot rely on the particular backend being used* by either the cache system or the entity system. Vis, we should be doing something at a higher level that doesn't intrinsically rely on an SQL transaction. Recall that it's possible to run Drupal 8 with its entity system backed by MongoDB. That would break this solution, too.

Based on the description in the summary (I didn't reread the whole thread), it seems like the issue is that the cache is getting cleared/written within the transaction in the first place. That is, the order of actions described in the summary is:

1) Start Transaction
2) Clear cache
3) Write new node data
4) Close Transaction

The problem being that another thread could come in and write a new cache between steps 2 and 4, causing stale data.

To which I ask, why are we even clearing the cache at that point? The cache clear should happen if and only if the entire node save process was successful. That is, if the transaction didn't get rolled back at any point. And then it can get refilled afterward by another process as normal. That is:

1) Star Transaction
2) Write new nod data
3) Close Transaction
4) Clear cache

(And replace points 1-3 with something that has no transactions if you're on a non-SQL entity backend.) In that case, there is still a stale data window, but that stale data window is much smaller; it's also a stale read window, not write window. Viz, there's only a few milliseconds in which the previously stale data would be read before it's cleared and then gets rebuilt with new data. That's not perfect, but it's certainly better than a few millisecond window in which stale data can get written and then cached for the next hour. :-)

Am I misunderstanding the problem space? Because the above seems like it would work. Basically, cache clear should not be part of the entity save process, but a post-effect of a successful entity save process.

swentel’s picture

Am I misunderstanding the problem space? Because the above seems like it would work. Basically, cache clear should not be part of the entity save process, but a post-effect of a successful entity save process.

Makes very much sense to me. Remove the cache_clear_all call in field_attach_update and clear it in field_entity_update(). That should get rid of the racing condition (at least in D7).

Also in #53, Fanbian mentions this is still a problem for D8, but at least cache_field doesn't exist at all anymore, so maybe (at least for the cache_field) probably, this is D7 material only ?

gielfeldt’s picture

@Crell

It's okay. I didn't make this patch for the sake of convenience, but because it actually poses a real problem.

You did not misunderstand the problem space, and I agree that an entire refactoring of the cache usage concept might be the best approach, in order to avoid coupling to the database all together. However this may be easier than it sounds (there a alot of caveats, see below), and is most likely a candidate for Drupal 9+. But what in the meantime?

Regarding post-process for clearing cache instead of inside entity_save (which will cause a problem, described below), how do we determine when an entity save process has been successful?

Also, another problem is nested transactions. Consider the following:

node_save:
start transaction
end transaction
clear cache (or post process like you mentioned)

So far, so good. But:

node_save:
start transaction
invoke field hooks
field clear cache (these are the ones we need to eliminate then)
crazy custom code that clears cache (the design pattern when coding for Drupal needs to be very clear, that you may not tamper with the cache here).
end transaction
clear cache

And another problem:
custom-bulk-node-save:
start transaction
foreach a bunch of nodes
node_save:
start transaction
invoke field hooks
field clear cache (these are the ones we need to eliminate then)
crazy custom code that clears cache (the design pattern when coding for Drupal needs to be very clear, that you may not tamper with the cache here).
end transaction
clear cache (problem here, since node_save does not know about the outer transaction, and is it at all possible to address this using coding standards/guidelines/patterns?)
end transaction

Even if no entity_save's are involved, how do contrib-modules handle the "quirk" that cache may not be cleared while inside a transaction?

An isolated node_save() may be relatively easy to fix regarding this, but not system-wide I think.

Crell’s picture

As implementation, from a quick glance at the code I'm suggesting adding a cache clear call to EntityStorageBase::save():

  public function save(EntityInterface $entity) {
    // Track if this entity is new.
    $is_new = $entity->isNew();

    // Execute presave logic and invoke the related hooks.
    $id = $this->doPreSave($entity);

    // Perform the save and reset the static cache for the changed entity.
    $return = $this->doSave($id, $entity);

    // Execute post save logic and invoke the related hooks.
    $this->doPostSave($entity, !$is_new);

    return $return;
  }

The cache clear call belongs right before the return statement, nowhere else. If something clears the cache prior to that, oh well, we burn some CPU but that's fine. We're already clearing the static cache in memory in doPostSave(), actually, so maybe the persistent cache clear should go there as well.

If you're persisting an entity and not going through that method, then you're violating the API and we don't care about you. I don't know how "crazy code goes here" affects that problem.

That leaves the multi-save case. The issue here is if you're saving 10 nodes and node 6 breaks for some reason, it gets rolled back... Do you roll back nodes 1-5 as well? A wrapping transaction would imply yes. OK, so what we need to do there is check if the whole set failed, and if it failed, clear the cache for all of the impacted nodes. They would all have been individually cleared while updating, so in this case what we're doing is clearing out any caches created by another process accessing node 3 (and therefore recaching it) with the new data; that one request would get the new cache, but the cache would be cleared again after the transaction is rolled back and then go back to the previous node when next accessed. Not ideal, but works.

Or, perhaps better, there's enough layers of indirection in the save process already that we could probably do something like this (pseudocode):

function save($entity) {
  saveForReal($entity);
  clear_cache($entity);
}

function saveMultiple(array $entities) {
  foreach ($entities as $entity) {
    save($entity);
  }

  array_map('clear_cache', $entities);
}

Of course, we'd need to figure out where in saveMultiple() to do the transaction, given that it's backend-agnostic. But... thinking on that, do we even support that now? If someone's doing completely custom code and exposing a transaction at that high a level, then they're coupling their code to an SQL database outside of the entity API anyway. Which means... not our problem.

Edit: how do we determine when an entity save process has been successful?

Thinking on it... do we need to? If an entity save is successful, we want to clear the cache. If an entity save is not successful, we don't need to clear the cache... but does it hurt anything if it is anyway? Remember, we're not doing a complete cache wipe anymore, just clearing that one node. If we can do better than the greedy approach we certainly should (if entity API gives us the necessary feedback, which it may not, in which case that's a bug to fix), but if we can't... there's no data loss, just slightly more cache misses. Vis, any time you save an entity, successful or not, it clears the cache for that entity, and only that entity.

fabianx’s picture

The discussion is a good point.

The problem however is an invalidation one, not an ordering or transaction one. (which is also why my proposed revalidate approaches would work as they fix _invalidation_).

There are only two hard problems in ... anyway ;).

The problem is _not_ that the field cache gets cleared, but that the field cache does not get cleared _again_ after the node has been saved.

This is where our system makes the implicit assumption that the DB cache backend runs under the same transaction.

So the solution for D8 is to add the node's cache tag to the field cache entry (or wherever we store that).

And in D7 we and everyone else doing crazy things need to clear the field cache after the node has been saved and the transaction been committed, which always works.

Of course the best solution is to track cache clears during the transactions, but if core used a separate and possibly async DB connection for the cache backend, we would encounter these cases in core as well.

--

And if the code iterates over that and that many nodes it also needs to invalidate all those node caches again and yes that is a little bit bad.

--

And then we end up with gielfeldt's solution again, as we really need to invalidate when pushing the transaction not before.

Basically any cache item touched is invalid (if we don't record all cache things like gielfeldt's solution does).

--

HOWEVER

I think for D8 this is not a problem anymore, because:

a) field cache is gone in favor of whole entity render caching
b) We do most invalidation via cache tags now with the only exception being cleanup of expired items and cc all (clear whole bin)

And cache tag or even fast chained invalidation is a DB operation, which always runs in the same transaction.

So it should not be a problem - though no one really so far looked at fast_chained / cache tags + transactions, yet.

For cache_tags we increment a value, which should be fine according to http://stackoverflow.com/questions/3821468/sql-atomic-increment-and-lock... - though it needs more investigation regarding different isolation levels.

For chained_fast we use a timestamp and the update order is undefined, if there is a long transaction with an early update and a shorter one we theoretically could overwrite the timestamp that was set before.

Also our locks that are set during a transaction are uhm ... non-existant as they don't do anything ...

Ugh ...

That is indeed quite some work for Drupal 9.

But overall it seems we need to record events during transactions regarding caches and cache related operations and play them back, including safety checking and locking ...

e.g. if we recorded cache_tags invalidations during the transaction, then incrementing outside all transactions, then we are good in any case as the sum matches the invalidations and at it is always higher.

In the same way if we did write the chained fast timestamp at the end of all transactions, then we could check that we never write a lower value than before.

So with D8 we have different problems, but its not solved :/.

Also shows that the timestamps of chained_fast are worse than e.g. an atomic counter (if its truly atomic and transaction-independent).

lcache (by David Strauss) might also be an idea to record cache operations in a table, then play them back. That should be unproblematic related to transactions, too.

--

TL;DR: Transactions and application caches combined are a problem ...

--

Edit:

- Under the default isolation level this is all less of a problem as all those rows are locked by the transaction itself, but it means we have less control over what happens and I really need to test this some more:

* lock() => semaphore table, accesses the row, sees active transaction => sleeps
* chained_fast() => cache table, timestamp, accesses the row, sees active transaction => sleeps
* cache tags => cachetags table, accesses the row, sees active transaction => sleeps

So as long as transactions are short, it works fine. If not it can stall your site potentially a lot.

e.g. one long running transaction in theory could make the site unavailable for further writes if chained_fast cannot write the timestamp data again.

But still it means for D8 it is much less of a problem as we have cache invalidation in the same transaction again.

However I think the approach of being able to run in other isolation levels and make services aware of transactions is still worthwhile.

However for D8 I think this should be a generic event and not isolated to cache backends. (if that is already the case, ignore).

For D7 it could just be a hook_invocation() instead.

Then contrib is happy and core as well.

gielfeldt’s picture

@Crell

What about custom code like this? How do we enforce it, if we are to ban it?

function mymodule_entity_update($entity) {
  // Do some stuff.
  // ...
  // Clear custom cache
  \Drupal::cache()->clear('mycacheentry:' . $entity->id());
}

In the above, mycacheentry:%id% would be cleared prematurely in a non-transactionally-synced cache backend.

It may also be harder to take advantage of existing code, like so:

$nodes = load_a_bunch_of_nodes();
foreach ($nodes as $node) {
  // The following function may perform a taxonomy_save() on terms related to 
  // the node.
  some_custom_module_that_processes_node($node);
}

In the above code we, as a consumer of some_custom_module_that_processes_node(), do not necessarily know what entites are saved. Not sure how we fix this, short of refactoring the entire in/re-validation system as mentioned.

If we are to make this an event system that's not container-dependent, I'm not really sure how to go about. Hope someone can pitch in.

--------

@Fabianx

The problem however is an invalidation one, not an ordering or transaction one. (which is also why my proposed revalidate approaches would work as they fix _invalidation_).

True, however the current implementation of cache invalidation in D7/D8 "conflicts" with the use of transactions in certain situations.

And in D7 we and everyone else doing crazy things need to clear the field cache after the node has been saved and the transaction been committed, which always works.

_IF_ you know when the outermost transaction ends :-)

lcache (by David Strauss) might also be an idea to record cache operations in a table, then play them back. That should be unproblematic related to transactions, too.

This is somewhat what Cache Consistent does, except it doesn't store the cache operations in a table, but just in-memory, since a transaction never spans requests.
Also, I cannot find anything on lcache by David Strauss? Have I become this old and bad ad Googling?

TL;DR: Transactions and application caches combined are a problem ...

And code that starts a transaction and calls e.g. core code which may save entities unbeknownst to the consumer.

Also our locks that are set during a transaction are uhm ... non-existant as they don't do anything ...

Not true, they still work. However at this point, the database may timeout if another tries to acquire the lock while the transaction is still in progress. In any case here, it may be a very good idea to use another connection for the lock table. I think there's another issue for this somewhere? (also, use read-committed to avoid gap-locks, etc.)

-------

In short: The patch in this issue only addresses the problem with notifying of transactions. Cache Consistent actually does the recording and playback. I have on purpose not submitted Cache Consistent as a core patch, since it's quite massive, and some people would most likely have very strong feelings regarding that implementation :-)

Also, the D7 patch actually notifies the cache layer directly about transactions (like @catch, I think, suggested). I can't exactly remember why I did this, but I think there were some issues using the hook_system. (early page cache, and so on).

Crell’s picture

And code that starts a transaction and calls e.g. core code which may save entities unbeknownst to the consumer.

See, here I think is the crux of the issue. Transactions are specifically an SQL thing (at least the transactions we're talking about). We have no backend-agnostic transaction system except the lock API, which isn't really the same thing.

As of Drupal 8, if you're coding to a specific backend (ie, SQL), then odds are you're doing it wrong. The database should only be accessed inside a particular subsystem (cache, entity, lock, keyvalue, etc.). If you're starting a transaction from, say, a Controller, then you're already off the supported path and we don't need to support that anyway.

If you do need to do something SQL-specific in a swappable service, then you can be reasonably confident that you are the top-level transaction, I'd think. And you can do your own rollback logic there if necessary, including not touching caches until you're sure it's done. Basically, cache invalidation belongs in the finally {} block.

We also, admittedly, lack a "Unit of Work" concept in Entity API for cases where you need to update multiple entities together, or not. The way to fix that is to add a UnitOfWork / Batch update (not batch API, but multi-update) concept to Entity API, which can then have an appropriate transactional behavior per-backend. On SQL it would be a DB transaction, on something else it would be... I don't know, ask those maintainers. :-)

In general, though, this is one of the downsides of abstracting away all storage. If we want people to be able to move cache, keyvalue, entities, etc. to three different backends, then there's basically no opportunity to build atomic cross-datapoll operations. At least not without a great deal of extra engineering that would be inherently imperfect anyway.

It's a trade-off either way. D8's made the trade-off of flexibility at the expense of guaranteed atomicity. That's not good or bad without context. The best we can do at this point is to handle the common cases (post-operation cache clearing, etc.) and document the caveats of the rest.

jedihe’s picture

I'm deciding which fix/remediation to apply for nodes being hit by this. I see cache_consistent is being used successfully by d.o but would also like to know if anybody has used #47 (committing transaction before ->resetCache() is called, by @deviantintegral) and how well it works for mitigating this problem for nodes. Thx!

mkalkbrenner’s picture

In some production environments we recognized strange 503 errors without any reason.
After a lot of debugging I found the cause and wrote the patch #2833100: dblog fails to log INNODB_LOG_FILE_SIZE errors. With that patch applied the cause is visible in the logs:

Drupal\Core\Database\DatabaseExceptionWrapper: SQLSTATE[42000]: Syntax error or access violation: 1118 The size of BLOB/TEXT data inserted in one transaction is greater than 10% of redo log size. Increase the redo log size using innodb_log_file_size.: INSERT INTO {cache_render} (cid, expire, created, tags, checksum, data, serialized) VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3, :db_insert_placeholder_4, :db_insert_placeholder_5, :db_insert_placeholder_6) ON DUPLICATE KEY UPDATE cid = VALUES(cid), expire = VALUES(expire), created = VALUES(created), tags = VALUES(tags), checksum = VALUES(checksum), data = VALUES(data), serialized = VALUES(serialized);

As you can see, the cache write causes the transaction to not be committed at all!

tstoeckler’s picture

So you are saying that you think it is problematic to include the cache write in the transaction in the first place?

berdir’s picture

That's a render cache write. Why would that even happen when saving an entity? Only thing I can imagine is some custom code rendering something in a presave or similar hook? Are you sure that happens as part of saving?

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.0-alpha1 will be released the week of January 30, 2017, which means new developments and disruptive changes should now be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

berdir’s picture

We discussed this issue a while ago in an entity major issue triage meeting and I didn't do my homework of commenting on the issue.

Basically, we are not sure if this is still a problem in D8, especially after #2753675: loadUnchanged should use the persistent cache to load the unchanged entity if possible instead of invalidating it because we're doing a lot less cache reads and writes when saving an entity, there there's less risk for race conditions.

It would be great if people who have seen this in a the past could check if they can still reproduce this in 8.3. If we don't get any such reports, we might decide to change this to a normal bug report.

I assume it's still major for Drupal 7, and given that a fix would look comletely different there anyway, it might make sense to open a separate 7.x issue that could still be major as the situation there hasn't changed.

nwom’s picture

#66 still applies cleanly to the 7.x. Thank you!

nithinkolekar’s picture

One of my d7 project I faced the race condition for node save using services module and solved that by adding random delay for each node_save(hackish way).
is anybody tested if patch also works for service module's node creation?

penyaskito’s picture

I still see this in the Lingotek module when translations are tried to be saved concurrently. Still figuring out if the problem is the module itself or not.

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.0-alpha1 will be released the week of July 31, 2017, which means new developments and disruptive changes should now be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

acbramley’s picture

I've just encountered what seems like an issue that may be related to this. We just deployed Drupal 8.4 to a testing environment that has memcache (using the memcache_storage module) and hit this error on a node save:

Drupal\Core\Entity\EntityStorageException: SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry 'node:5876' for key 'PRIMARY': INSERT INTO {cachetags} (invalidations, tag) VALUES...

This did not show up on dev environments without memcache installed.

The cache bins that we use memcache for are:
menu, toolbar, render, and dynamic_page_cache.

I've not seen this issue before upgrading to 8.4 though

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.0-alpha1 will be released the week of January 17, 2018, which means new developments and disruptive changes should now be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

badrange’s picture

I wonder:
- Will this problem ever be fixed for Drupal 7?

I just stumbled into this recently. The cache_consistent module with the patch for 7.41 seems to work on my local vagrant box, but I really don't feel like messing with a legacy production system with thousands of users.

- Will this problem ever be fixed for Drupal 8?

Hopefully before I ever run into it so that I don't get even more grey hair from age old Drupal bugs.

dtv_rb’s picture

Why is there still no official fix for this?

Drupal 8 websites with high traffic (like ours) are forced to use the development version of a several year old module (cache_consistent) with an equally old core patch to work around this bug.

Version: 8.6.x-dev » 8.7.x-dev

Drupal 8.6.0-alpha1 will be released the week of July 16, 2018, which means new developments and disruptive changes should now be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.7.x-dev » 8.8.x-dev

Drupal 8.7.0-alpha1 will be released the week of March 11, 2019, which means new developments and disruptive changes should now be targeted against the 8.8.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

badrange’s picture

I am no longer working on any Drupal projects, so I'm unfollowing this. So long and thanks for all the grey hair, Drupal.

m4olivei’s picture

Working with a client on a D8 site (8.6.15) with memcache, seeing syptoms akin to the issue in this description, eg. editor saves a node, form returns and all the field values in the form are set to old data.

Wondering if others are seeing this behavior still on more recent D8? There was a suggestion that it's no longer an issue in #106. However, following along, and comparing with recent D8 code, a node save will arrive at \Drupal\Core\Entity\Sql\SqlContentEntityStorage::save, which begins a DB transaction. Before that transaction falls out of scope, there is a cache reset that happens (\Drupal\Core\Entity\ContentEntityStorageBase::resetCache) that hits memcache. There is also a cache tag invalidation that happens. Cache tag invalidation is done on the DB, so that seems like it wouldn't be a concern.

Trying to prove out if it's still an issue in more recent D8. Seems like it could be given the above.

berdir’s picture

See #2966607: Invalidating 'node_list' and other broad cache tags early in a transaction severely increases lock wait time and probability of deadlock and the related redis issue #3018203: Support delayed cache tag invalidation, that fixed similar problems for us reliably. Memcache will need a corresponding patch just like redis (note that the patch there doesn't reflect the most recent patches in the core issue, which will allow to further simplify the cache tag implementation and avoid duplication).

m4olivei’s picture

Thanks @Berdir! I was pretty lost at that time, and you gave me new inspiration.

This issue and the ones you linked to are really hard to follow when you're starting with minimal knowledge of DB transactions, MySQL isolation levels, memcache, and cache tag invalidation. Over the past couple of days I've been working on reliably reproducing this race condition that leads to a cache consistency problem. Here is what I came up with. Hope this helps others. Just seeing this mystical issue is huge for me, and gives me hope to knock this out using one of the suggested solutions.

OK, here's the steps to reproduce I came up with:

My setup

  • Drupal 8.6.15
  • memcached PHP extension
  • memcache Drupal module setup and configured
  • xdebug

Steps to reproduce

  1. Create a new node, make it super simple, just a title will suffice for the test. Save it.
  2. Edit the node. We’ll call this authenticated session, Session A, because we're going to open another next.
  3. Open a separate authenticated sessoin (different browser, or Incognito Chrome). We’ll call this Session B.
  4. Open the node edit form for the same node in Session B.
  5. Fire up xdebug for Session A, and set a breakpoint just before the transaction for the node save commits to the database, here: https://git.drupalcode.org/project/drupal/blob/8.6.15/core/lib/Drupal/Co...
  6. In Session A, change the node title, and Save the node. Xdebug should stop at the breakpoint set above.
  7. NOTE: At this point, you can query the database for the node in question, and you’ll see the old node title still (in the node_field_data table). This is because the DB transaction hasn’t been commited yet. Note also that at this point, memcache has been invalidated for this node, this is key.
  8. In Session B, reload the edit form for the node. You should see the old node title. This is expected, since the transaction from the node save triggered by Session A hasn’t been committed yet.
  9. Allow Xdebug to play through the rest of Session A’s request.
  10. At this point you can query the DB and you’ll see the new node data in the data tables (eg. node_field_data).
  11. Go back to Session A in the browser, and edit the node again. Note that the form has the old data in the form!
  12. Reload the form in Sessoin B and note that the old data is still showing there as well.
  13. At this point, if you rebuild the Drupal cache, and then reload the node edit form, the new data will display.

The xdebug breakpoint makes the race condition really obvious. In practice it could be a long running process on node save, eg. reaching out to an external service as has been suggested here. For us we have Apple News module enabled, which can take a long time on node save. Regardless, there is a window, illustrated by the xdebug breakpoint here, where stale data can be written to the cache backend by a parallel process when you use an external cache backend like redis or memcache.

Now I have some small hope of resolving this using one of the suggested solutions :). Will report back as I make progress.

m4olivei’s picture

Adding some finer points to what I last wrote:

  • The breakpoint for Session A can actually be set sooner in the request, right after this line of code, the vulerable window begins: https://git.drupalcode.org/project/drupal/blob/8.6.15/core/lib/Drupal/Co...
  • The relevant cache item is the values:node:1154455 item in the entity cache.
  • Session B does not have to be authenticated, or a request for the node form. It can be any request for that node that doesn’t come back from one of the cache layers, eg. Drupal builds the request and builds that entity cache item. IOW any request that sets the entity cache item for the node in question.

I've also been looking at #2966607: Invalidating 'node_list' and other broad cache tags early in a transaction severely increases lock wait time and probability of deadlock as a potential solution. Not sure I totally understand why yet, but that patch doesn't solve the issue here. I'm finding that the entity cache item is still served even when the cache tag invalidations are done after the transaction is committed. You still end up with the behavior of an editor editing a node, only for it to reload with stale data.

I suspect this is because memcache needs a patch like @Berdir mentoned exists for redis, eg. #3018203: Support delayed cache tag invalidation. Will look into that next.

m4olivei’s picture

I suspect this is because memcache needs a patch like @Berdir mentoned exists for redis, eg. #3018203: Support delayed cache tag invalidation. Will look into that next.

This turned out to be not true. The issue here is that the entity cache item (values:node:1154455) that is causing the stale data issue, doesn't have any specific cache tag associated with it that is invalidated after a node save. The cache tags it gets saved with is: entity_field_info, memcache:entity and node_values. None of which are cleared on node save, which leaves us with the problem at hand. It stands to reason therefore that adding node:1154455 to the list of cache tags for the entity cache item would fix the race condition. That way, with the patch in #2966607: Invalidating 'node_list' and other broad cache tags early in a transaction severely increases lock wait time and probability of deadlock, when the node:1154455 cache tag is invalidated after the transaction commits, it includes clearing out cache like values:node:1154455.

m4olivei’s picture

Assigned: Unassigned » m4olivei

Will work on a patch to add the entity cache tags to the entity values cache item.

m4olivei’s picture

Assigned: m4olivei » Unassigned
Status: Needs work » Needs review
StatusFileSize
new751 bytes

Here is a patch. This assumes that you are also using the patch in #2966607: Invalidating 'node_list' and other broad cache tags early in a transaction severely increases lock wait time and probability of deadlock. Specifically I am testing against https://www.drupal.org/project/drupal/issues/2966607#comment-13253493.

It's a totally different approach from prior patches. So far, testing the two patches together is working to resolve the stated steps to reproduce in #120. Not sure if this approach has unwanted sideeffects. It seems like if would be pretty innocent, I'm essentially adding a cache tag for the individual entity to an item that would otherwise be manually cleared by key during entity save. This ensures that when your using a alternative cache backend (eg. memcache, redis) the logic in #2966607: Invalidating 'node_list' and other broad cache tags early in a transaction severely increases lock wait time and probability of deadlock kicks in to fix the race condition stated here.

berdir’s picture

> This turned out to be not true.

But it is. The core issue is about cache tags, but it adds the mechanism to do whatever you want at the end of a transaction and the redis cache patch uses it to also delay direct cache item invalidation. We had exactly the problem that you describe on several large sites and the core + redis patch reliably solved it. And yes, memcache will need a corresponding implementation.

We don't want to add the cache tag like this, because it is a 1:1 relationship between the cache tag and the cache key, we already know how to clear those cache items directly and do so. This adds extra overhead on every cache get because we also always need to check that the cache tags are not invalidated.

Status: Needs review » Needs work

The last submitted patch, 124: race_condition_cache_tags-1679344-124.patch, failed testing. View results

m4olivei’s picture

The core issue is about cache tags, but it adds the mechanism to do whatever you want at the end of a transaction and the redis cache patch uses it to also delay direct cache item invalidation.

Ahh, I see that now. I realize now that when I was looking at that redis patch I was focusing on the cache tag pieces and not seeing that direct deletion piece. That makes sense. In which, case, ignore my patch! :). I'll look into doing the same in memcache. It looks like this ticket was filed to work on the implementation for #2966607: Invalidating 'node_list' and other broad cache tags early in a transaction severely increases lock wait time and probability of deadlock:

#2996615: Transaction support for cache (tags) invalidation

Suppose that's a good place to work in.

We don't want to add the cache tag like this, because it is a 1:1 relationship between the cache tag and the cache key, we already know how to clear those cache items directly and do so. This adds extra overhead on every cache get because we also always need to check that the cache tags are not invalidated.

Thanks for pointing that out. I was obviously missing this in my understanding.

Version: 8.8.x-dev » 8.9.x-dev

Drupal 8.8.0-alpha1 will be released the week of October 14th, 2019, which means new developments and disruptive changes should now be targeted against the 8.9.x-dev branch. (Any changes to 8.9.x will also be committed to 9.0.x in preparation for Drupal 9’s release, but some changes like significant feature additions will be deferred to 9.1.x.). For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

m4olivei’s picture

Reporting back to close the loop on my previous comments.

We deployed the patch in #2966607-127: Invalidating 'node_list' and other broad cache tags early in a transaction severely increases lock wait time and probability of deadlock to 8.7.7 in production last week together with the corresponding patch to memcache in #2996615: Transaction support for cache (tags) invalidation and we've seen our issue dissapear!

In my estimation the combination of #2966607: Invalidating 'node_list' and other broad cache tags early in a transaction severely increases lock wait time and probability of deadlock and the implementation of it for redis or memcache resolves this issue for those two backends.

Thanks for all your work on this and help to me personally @Berdir!

Version: 8.9.x-dev » 9.1.x-dev

Drupal 8.9.0-beta1 was released on March 20, 2020. 8.9.x is the final, long-term support (LTS) minor release of Drupal 8, which means new developments and disruptive changes should now be targeted against the 9.1.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 9.1.x-dev » 9.2.x-dev

Drupal 9.1.0-alpha1 will be released the week of October 19, 2020, which means new developments and disruptive changes should now be targeted for the 9.2.x-dev branch. For more information see the Drupal 9 minor version schedule and the Allowed changes during the Drupal 9 release cycle.

Version: 9.2.x-dev » 9.3.x-dev

Drupal 9.2.0-alpha1 will be released the week of May 3, 2021, which means new developments and disruptive changes should now be targeted for the 9.3.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.0-rc1 was released on November 26, 2021, which means new developments and disruptive changes should now be targeted for the 9.4.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

matroskeen’s picture

Issue tags: +Bug Smash Initiative

Accidentally, I stumbled upon this issue, and based on the last comments it feels like it's not relevant anymore?... (at least for Drupal 9)
Can anyone check and confirm? Otherwise, we'll probably need to update the issue summary.

ruiadr’s picture

Hello,

I had same problem with my configuration, I use redis for cache, but some tables in database was still in use (cachetags and cache_container).
May be I followed the wrong way reading the documentation (I m novice with redis), but I found this article who explains how configure correctly your app to use redis for all your cache: https://community.platform.sh/t/how-to-use-redis-for-caching-a-drupal-8-...

In my opinion, if all your cache is not in redis (a large part in redis, a small part in database for example), during the node save, services are not executed in the right order because they don't have the same response time (redis is faster than db).

In my case, it seems to solve my problems, i had PDOException (There is already an active transaction) and at the same time Drupal\Core\Http\Exception\CacheableAccessDeniedHttpException (The 'xxxx' permission is required.).