Problem/Motivation
With a non-DB backed cache, updating nodes can result in stale data being in the cache due to a race condition.
This is because when an entity is saved we open a new DB transaction, during which we do all the updates to the data tables, and we clear the relevant items in the cache tables, at the end of the saving process the transaction is committed and all of those operations happen as one atomic piece.
If however, your cache backend is not the DB, then the cache flush will happen immediately, allowing a window for another operation to come in and cause the flushed cache items to be rebuilt from the non-updated DB source data, because the transaction hasn't been committed yet.
Proposed resolution
Many options. I think that we're still at the stage of getting to consensus on the issue.
Remaining tasks
Obtain consensus on a sensible way to resolve this, and then code that up.
I'm not sure that this is testable? Can we test race conditions?
User interface changes
None.
API changes
None at the moment.
Original report by @slashrsm
There can be a race condition when using some other cache back-end than DB (Memcached in my case). Data will be saved in DB at the end of function since node_save() uses db_transaction(). Field cache is, on the other hand, cleared at field_attach_update().
If there comes a request to the node being saved just after the field cache was cleared and before the transaction was commited we end up with a outdated data in field cache.
This will happen only if some other cache back-end than DB is used. Field cache will be cleared when transaction is commited, so there should be no problem in this case.
Related issues:
#1658672: It seems as sometimes empty and old nodes are served by memcache
#1677830: Node not saved when hook_node_insert() or hook_node_update() are invoked
And some more background about this issue:
http://janezurevc.name/strange-behaviour-race-condition-during-nodesave
| Comment | File | Size | Author |
|---|---|---|---|
| #124 | race_condition_cache_tags-1679344-124.patch | 751 bytes | m4olivei |
| #91 | race_condition_in-1679344-91.patch | 12.67 KB | gielfeldt |
| #89 | race_condition_in-1679344-89.patch | 12.08 KB | gielfeldt |
| #86 | race_condition_in-1679344-86.patch | 6.27 KB | gielfeldt |
| #83 | drupal-n1679344-83-8.2.x.patch | 6.6 KB | damienmckenna |
Comments
Comment #1
marcingy commentedThis is surely something for memcache to fix rather than core - memcache could register a shutdown function or it could add items to queue to be cleared etc. Basically as this is a contrib issue the contrib module needs to (and can) provide a solution.
I have actually been bitten by this issue and fixed it via a queue mechanism which works prefectly or so it seems. We didn't track down the cause so have not contributed a patch back because we thought it was something unique to us. I will try and create a couple of alternative patches for memcache this weekend that can get round the issue.
Comment #2
slashrsm commentedI can agree with this, but I'd still like to have this documented in core. Issue: #1677830: Node not saved when hook_node_insert() or hook_node_update() are invoked
Comment #3
msonnabaum commentedThis is absolutely not contrib's responsiblity, this is a core bug. A cache backend shouldn't have any knowledge of possible race conditions in core.
Part of the problem here is when we decided to stop using the revision id to key the field cache in http://drupal.org/node/456488.
Now we have the same cache id for all revisions, which does not work when MySQL is returning a stale vid because the update transaction hasn't been committed yet.
Not quite sure what to do to fix this yet, but it needs discussion.
Comment #4
bjaspan commentedI haven't worked on Fields or core in a long time, but I agree this does sound like a core bug.
Offhand idea for a fix: The problem is that a revision-specific cache entry is being created but the read is revision-agnostic, assuming it is going to be for the latest version. We went to revision-agnostic cache entries to avoid caching data for old revisions that will rarely be used. Since there is no way to create an atomic transaction with an external database system, perhaps the caching strategy should be changed: Go back to using the entity and revision id as key, but when clearing the field cache before an update, always clear it for all revisions of the entity. e.g.
Of course, this assumes we can do a prefix cache_clear_all(). I don't know/remember if the cache api allows that.
FYI, I am unlikely to see updates to this issue unless someone pings me directly.
Comment #5
bjaspan commentedMore thoughts:
If there is no prefix cache clear, perhaps field_attach_update() could clear the cache entry for the *previous* revision before saving the new one. Of course, this requires that saves be strictly serialized, again requiring a transaction involving the external cache backend. I guess if there is a transaction that updates the entity's table in SQL first, then we know they are serialized.
Another option is a background process that clears all field cache entries for non-current revisions. That's going to be painful at large scale, though, just to figure out what the entries to clear are.
Comment #6
msonnabaum commentedTotally untested, but maybe something like this could work.
Comment #7
Anonymous (not verified) commentedthe race is made worse by an external cache, but is not at all confined to it.
even with db-backed cache, any process that misses cache and reads from the db while an update is happening in another process can write stale info back to the cache just after the update process commits a cache clear.
isn't the safest way to do this from a race-condition-p.o.v. to not write to the cache from a read operation? instead, write through to cache on insert / update? that makes the 'ZOMG my cache went away' case scarier, so i can see that not being popular...
Comment #9
catchIt would be nice to write through to the cache instead of blowing it away, but I don't think we can reliably do that at the moment since we cache hook output (i.e. hook_field_load()), so we'd still have to do the full entity load before writing back anyway. Also we'd still need a fall back for reads since there's evictions, general cache clears etc. so it'd still leave a race condition window open.
What about using a lock for this? Lock at the start of entity_save() and if any process gets a cache miss and can't acquire the lock it doesn't try to write back.
Comment #10
damien tournoud commentedBecause we don't have multi-stage commit capabilities, the only way to make this atomic would be to:
That way:
(1) While the transaction has not been committed, other requests see the old version from both the database and the cache
(2) If the transaction is rolled back, everyone still see the old version from both the database and the cache
(3) When the transaction gets committed, everyone immediately see the new version from both the database and the cache
Comment #11
damien tournoud commented(And the nice thing about mandating entities to get a version field is that we can easily implement optimistic locking on top of it.)
Comment #12
gielfeldt commentedThis is issue actually applies to all cache operations performed inside transactions, not just for entities or fields. The default REPEATABLE-READ isolation level in MySQL also only adds to the problem.
This will only be a problem if the update is not atomic (including the cache update), i.e. if the update is not using transactions.
A little cache wrapper I made, trying to address the general issue/root cause in "user-space":
http://drupal.org/sandbox/gielfeldt/1946668
Comment #13
damien tournoud commented@gielfeldt: REPEATABLE-READ is "stronger" then READ-COMMITTED, so you must be mistaken somewhere.
Comment #14
gielfeldt commentedIt's precisely because it's stronger that it's a problem. Even if we manage to postpone cache operations until after a transaction has been committed, there's no way to absolutely ensure that synchronization between memcache and the DB because of the consistent snapshot mechanism. I'll try to draw it up tomorrow when I get the time.
Comment #15
gielfeldt commentedEDIT: Each row should be viewed as a discrete point in time.
Comment #16
dcam commentedhttp://drupal.org/node/1427826 contains instructions for updating the issue summary with the summary template.
The summary may need to be updated with information from comments.
Comment #17
pwaterz commentedWe are having this issue happen all the time in d7 with memcache and node save. The only work around that I have come up with this point is to add this to the settings.php
$conf['cache_class_cache_field'] = 'DrupalDatabaseCache';
Comment #18
Firemyst commentedWe had a similar problem, so we tried to replace memcache with Redis, but the issue was not resolved. This is not a memcache specific issue.
Comment #19
gielfeldt commentedTrue, it is not memcache specific. This is issue applies to any setup that is not using the database as the cache backend.
Comment #20
pwaterz commentedYes Gielfeldt. Firemyst, my proposed work around will work for redis as well.
Comment #21
gielfeldt commentedYou could say that. Your work around is to not use redis :-)
Also, just doing it for fields is not enough. The problem is much more fundamental and can show up whenever a transaction is used. Custom module using node-hooks and performing their own cache operations inside these. Or the EntityCache module for example, is also prone to this error.
My consistent cache wrapper was never completed, nor have I had the time to do so :-(. Instead, I've include the transactional safe cache wrapper from AutoSlave, which is working and solves the underlying problem. There are 2 requirements though. A core patch and configuring the database to use isolation level READ-COMMITTED. Perhaps some inspiration from this module can be used to fix Drupal in regards to this issue.
Comment #22
xjmComment #23
xjm(Merging "node system" and "node.module" components for 8.x; disregard.)
Comment #24
catchI still think we can use a lock for this, and Damien's answer in #10 didn't sufficiently explain why we can't.
Trying to version key entity cache entries runs into a chicken and egg problem. When you request the entity, you don't know the version, so you'd need to query the database to get the version, before you can request the item from cache. That reduces the performance benefit of having the caching in the first place - doubles the queries for a start.
If we added a lock, it'd look like this:
Entity system acquires a lock based on the entity type + ID, would mean waiting if the lock is already acquired per the usual pattern, release the lock at the end of the save operation. Same with deletes as well.
Cache misses can try to acquire the lock before writing, and just not bother trying to write to the cache if the acquire fails.
Even if it's not perfect it'd be a lot more robust than what we have now, and there's only extra database queries on writes.
Comment #25
steven jones commentedTaking a look at this.
Comment #26
steven jones commentedSorry, really don't understand D8 enough at this point to make a more meaningful contribution.
Comment #27
drummThis issue is the root cause of the brief Drupal.org outage earlier today.
Because of this, Drupal.org must run with a DB-backed field cache. We occasionally get deadlocks when waiting for MySQL to finish clearing the field cache. MySQL isn't exactly designed to be a cache backend. Switching to another cache backend would avoid the problem for Drupal.org.
Comment #28
mikeytown2 commented@drumm
#2222635-26: Waiting for table metadata lock on cache_field table
#2193149-3: Deadlocks occur in cache_field table.
We are using both of these patches in production; it's a 99% logged in site.
Comment #29
drummThose look like okay mitigations. We'll take a look at them for Drupal.org. It would still be best to resolve this issue, so we can use MemCache as a field cache backend.
Comment #30
mikeytown2 commentedTotally agree, but sometimes you need some duct tape. With this being D8 & "needs work" without a viable patch it might be a little while before this hits D7.
Comment #31
jackbravo commentedDo we have steps to reproduce this issue in a normal install?
Comment #32
pwaterz commentedI agree with catch in comment #24.
Comment #33
xjmSince Drupal.org is running Drupal 7, it's okay to also submit a Drupal 7 patch for this, and then we can forward port it to Drupal 8 here in the issue as well. If needed, we can presumably use the D7 patch for Drupal.org until it's fixed in HEAD.
Comment #34
gielfeldt commentedA way to fix this could be to make an event when a transaction finishes, so that cache backends may react on this; i.e, they can buffer cache writes if a transaction is in progress and then actually perform them on transaction finish. (Note, the event must occur after the actual commit).
It would be a minor fix to core, but would require some extra logic in the non-db cache backends.
Comment #35
gielfeldt commentedHere's a proposal for D7.
The patch is the same as bundled with: https://www.drupal.org/sandbox/gielfeldt/1946668
The patch merely provides a new cache interface for contrib modules to implement.
Comment #36
gielfeldt commentedComment #37
fabianx commentedI like this approach a _lot_. This puts the burden on contrib/ but contrib can then do something about the transactions and maybe even get locks themselves.
RTBC + 1 from me for this approach, but no real RTBC as we could still just acquire a lock for the issue at hand here.
Comment #38
drummLooks good!
An implementation for memcache would be great both for Drupal.org and as an example for reviewing this patch.
Comment #39
gielfeldt commentedWhile the patch provides the possibility to address this issue, it does require a bit of logic in the cache backend to handle it properly. The memcache project could look at the cache consistent for inspiration, but perhaps it could be implemented in simpler way than I did there?
If this paradigm is followed, then each backend would have to implement the transaction handling, and we might end up with a lot of duplicate code across projects. What to do? Could core provide some helper functions perhaps?
I've attached an updated version of the patch, just in case.
Comment #40
catchI'd rather do the lock here since that doesn't require a 7.x API change, then open a separate issue to discuss the transaction support for cache backends.
Comment #41
fabianx commentedIt seems like d.org could _try_ using consistent cache module with memcache. I had a quick look through and the ConsistentCacheLookup class looked pretty good to me using the intelligent buffer.
Comment #42
drummI filed #2316057: Move field cache to memcache for deploying a solution for Drupal.org. We'll have to hack core and go with what works. In the meantime, it would be great to have a solution in core, so we don't have too many core hacks pile up.
Comment #43
David_Rothstein commentedAn API change (i.e. breaking the cache API) would not be good for 7.x here, but the latest patch looks like only an API addition... which I think is fine if it turns out to be the best way to fix this technically.
Comment #44
fabianx commented#10:
I think something like that is possible and there is two ways to do this:
a) Add a timestamp or hash of timestamp to the cache ID and add some garbage collector for old cache entries.
field_update_cache:
That effectively implements serialized approach, but uses timestamps instead and leads at least to the new entity being loaded always populate the correct timestamp, in the race condition case the old data is just left over.
b) Use cache re-validation instead of invalidation
// Remove the cache clear all
Use the modified property and store the cache like:
After a cache_get ensure the cache is still matching the right timestamp. This is better, because less clearing, therefore less DB operations, and just a race on cache_set with the same data.
In core only users don't have modified timestamps (@see entity_modified contrib module) so this should work well for the primary use case of nodes and being purely internal is probably not an API change.
After reviewing a lot of the current invalidation strategies and trying out the hash based strategy in render_cache, I found invalidation to be the best strategy and here we have a simple key to know if the entity was invalidated.
Edit: And YES! vid should be added as well at least to the revalidation as proposed already by msonnabaum in #3 as that will fix the issue for revisionable entities with revisions enabled obviously.
Comment #45
mikeytown2 commentedI'm working on a MySQL specific cache backend and I want to make sure I have this logic correct. The cache db that I'm working on uses READ-UNCOMMITTED which means that any write (Insert, Update, Delete) inside of a transaction will be seen by it.
Heart of the issue from my understanding:
node_save() calls field_attach_update() which then calls
cache_clear_all("field:$entity_type:$id", 'cache_field');. Another request hits node_load() at the perfect time so that when DrupalDefaultEntityController::load() hits field_attach_load() the cache is empty but the node/field transaction has not been committed so the old data gets saved to the cache.Current check
I think I can do the cache_clear in this situation outside of a transaction due to this connection using READ-UNCOMMITTED. Is my understanding of the situation correct or did I miss something?
As for a solution, having a trigger when the transaction finishes seems like a good general approach; buffer all writes to the cache until the transaction is done.
Comment #46
gielfeldt commentedThis is correct. However, the issue is not isolated to node_save() as such, but is present everywhere a cache-write operation happens inside a transaction.
Using a database cache backend with READ-UNCOMMITTED is equivalent of using e.g. memcache wrt the problem stated in this thread.
I'm not sure what the code you posted is supposed to do though?
This is exactly what Cache Consistent does.
Comment #47
deviantintegral commentedI've been dealing with a related issue, though it's not specific to the field cache. We have a site with many scheduled revisions to content, and often the revision change is just a change in $node->status, so we end up serving nodes marked as unpublished from the cache.
Is there a reason we're preserving $transaction all the way through to the end of node_save()? If not, I think we can unset $transaction before we call resetCache(), ensuring the that there's no race between parallel requests and rebuilding the entity cache.
Here's a patch - but I don't think this supersedes other patches here, but perhaps might be useful in addition.
Comment #48
damienmckennaFYI the d.o maintenance team worked around this by using the Cache Consistency module.
Comment #49
drummYep, it is working well. We are still using "Method #1 (alternate)" from https://www.drupal.org/project/cache_consistent.
Comment #50
damienmckenna@drumm: Thanks for mentioning that, I figured it'd be useful to note here so if someone else found the issue they'd be aware of the workaround.
Comment #51
fabianx commented#47: Surprisingly simple, but a different issue, because the field cache still is affected in the same way. Can you open up another issue?
We might have _that_ problem in D8, too still.
However we could chain through from entity to clear the field_cache, too. Not ideal, but possible.
Comment #52
tim.plunkettIs this still relevant in D8? If not, move it back to D7 and remove the backport tag.
Comment #53
fabianx commentedYes, it is.
Comment #54
joelpittetThere was a couple pretty nasty bugs in cache_consistent that have been fixed it seems in the -dev release. I'm testing them out and seem to be resolved. Just wanted to mention since a few of you likely use it.
One where child menu items were disappearing sporatically, which seems to be resolved. Another where taxonomy terms returned all terms in a multiple get as the same entity by accident and threw a fatal error in my case when saving multiple tags on an article.
#2469925: Child menu items disappear
#2533274: Loading multiple taxonomy terms ConsistentCacheBase::matchKey keeps returning the same values for each entity
Comment #57
ehegedus commentedThe patch from #39 does not apply to Drupal core 7.40, needs reroll.
Comment #58
ehegedus commentedRe-rolled against Drupal core 7.40:
Comment #59
ehegedus commentedRe-rolled against Drupal core 7.40:
Note: Didn't rename because the cache_consistent module ships with these two patches.
Comment #60
torgospizzaSetting to Needs Review as there is a new patch.
Comment #63
joelpittetComment #66
gielfeldt commentedRe-rolled against dev.
Comment #67
gielfeldt commentedComment #68
joelpittetBack to 8.0.x as I changed it for testing and forgot.
Comment #69
jgrubb commentedThis is incredibly confusing since there are patches for 7.x flying around in this issue that's marked as being for 8.x now. I can confirm that the patch in #66 works in production against 7.41 though.
Comment #70
catchThis needs a patch that's commitable to 8.x to be RTBC for 8.x.
Comment #71
gielfeldt commentedHere's a patch for D8.
It fires events when transactions are started/committed/rolled-back. No integration from the DB-layer to the Cache Backend at all, so a bit cleaner than the D7 version.
Not sure though, if this is the "proper" way to do it.
EDIT: probably also needs some testing ...
Comment #73
gielfeldt commentedAny test-savvy folks, who can tell why the tests are failing?
Comment #74
drummhttps://dispatcher.drupalci.org/job/default/34960/consoleFull has the full output from the test. Something about how
event_dispatcherseems like it isn't working properly, or needs more scaffolding in the tests.Comment #75
gielfeldt commentedThx.
I can't really make out, if it's because the tests need to setup an event_dispatcher, or if it's better/necessary to inject a dispatcher service into the database-connection instead of using \Drupal::service('event_dispatcher').
... or both.
Any thoughts/ideas/solutions anyone?
Comment #76
jgrubb commentedIs there a 7.x issue for this? How can I help get the patch in #66 rolled into core?
Comment #77
catchI'd probably do this like the following:
- don't add anything to the database layer (neither cache backend nor events)
- add a new class that handles creation and rolling back transactions as a wrapper
- cache backends are already tagged services, so we can loop over them, check a new interface and call a method - no need for the extra indirection of an event.
- implement that interface for the backends that need it in core.
Since this is a new API, it needs to be committed to 8.1.x rather than 8.0.x, but could then be directly backported to 7.x once it's in.
Comment #78
gielfeldt commentedOk. But are we certain that we'll catch all transactions correctly via the current transaction class design? Or are you thinking we should refactor this? The commit and rollback actions, for example, is currently spread out over multiple places.
Comment #79
catchNo. I think this behaviour will have to be opt-in, but it's mostly an issue for entities, and we can opt-in those pretty much all at once. We could also deprecate using the current transaction handling in general when it's working for the main cases, to encourage people to convert custom code that might have the same issue.
Possibly we'd have to get the new class to handle transaction commits - them being implicit in __destruct() currently will make it hard to do what I suggested in #78.
Comment #81
gielfeldt commentedAre we sure we don't want an event system on the database layer? If a cache backend is only triggered directly by transaction events, then what happens if a cache backend is instantiated _after_ a transaction has been started? Wouldn't it be better to decouple it completely, like in the patch?
Comment #82
damienmckennaFYI the patch in #66 still commits correctly against the latest 7.x-dev.
Comment #83
damienmckennaThis is a reroll of #71 against 8.2.x-dev.
Comment #86
gielfeldt commentedThis is a reroll of #83 against 8.2.x-dev.
Also changed the transaction event to pass the database connection object instead of just the transaction depth.
Comment #87
gielfeldt commentedComment #89
gielfeldt commentedFixed existing test cases by injecting event dispatcher into database object, and added tests for the database events.
Comment #90
gielfeldt commentedI just realized that the event dispatcher isn't injected into the database, because it's not available at that time :-(
Comment #91
gielfeldt commentedThere. Not sure if this is kosher, or if the container and event_dispatcher service needs to be initialized earlier somehow.
Comment #92
Crell commentedThe database system should not have any dependency on the container, even indirectly. That's begging for circular dependency nightmares.
I also really don't like making the database depend on the event system. If anything, I want to eliminate the one hook that it has in it to make it more decoupled, not less. Adding another dependency to what is *almost* Component-eligible is a no-go in my book.
Even if we absolutely must give it an event dispatcher (which I do not yet believe is the case), then any awareness whatsoever that it's a container-specific dispatcher is off the table. It should be purely EventDispatcherInterface from Symfony, and absolutely not ever touch \Drupal, ever. This patch is filled with technical debt. And that's before we consider the performance implications of adding multiple event calls to every transaction.
Really, trying to solve this at the database connection level is the wrong place to begin with.
Comment #93
gielfeldt commentedYou're probably right about this.
The way I see it, in order to solve this problem, the cache layer must be aware of the transactions. Having the database layer depend on the cache layer, as previously suggested in this thread, only makes the coupling tighter imo. Plus it also adds the caveat of only working for already instantiated cache backends when the transaction occurs.
Again, relying on the container is probably not the best way.
To which technical debt are you referring, besides the container dependency?
Regarding performance, will this really be a problem, considering it only happens on transactions? How often do transactions really occur? Sites are usually read-heavy. There have been discussions about using locks instead, but I don't believe this will be better performance-wise, nor implementation-wise for consumers of cache. The crux of the matter is that the cache layer needs to know when a transaction occurs.
How do you suggest we go about this problem then? This issue is over 4 years old, still unresolved. The other solutions in this thread either targets only node/field save using locks or unique-id-galore, the latter which might address the transaction-problem, but indirectly and would probably require lots of refactoring, since this would essentially apply to every table. I still believe using locks for this would cause more harm than good (lockups, timeouts).
I'm not hell-bent on this particular solution, but I haven't found/seen anything else that solves it.
Comment #94
Crell commentedFirst, sorry if I came off harsher than intended. I've been fighting to keep extra "convenience" dependencies out of the DB layer for years, only sometimes successfully. :-( (Drupal has a tendency to add dependencies over time until they create a knot of mess we have to detangle, which again takes years. See also: Drupal 8.)
The problem here isn't that the cache system needs to be aware of a database transaction. Rather, it's that we should assume that we *cannot rely on the particular backend being used* by either the cache system or the entity system. Vis, we should be doing something at a higher level that doesn't intrinsically rely on an SQL transaction. Recall that it's possible to run Drupal 8 with its entity system backed by MongoDB. That would break this solution, too.
Based on the description in the summary (I didn't reread the whole thread), it seems like the issue is that the cache is getting cleared/written within the transaction in the first place. That is, the order of actions described in the summary is:
1) Start Transaction
2) Clear cache
3) Write new node data
4) Close Transaction
The problem being that another thread could come in and write a new cache between steps 2 and 4, causing stale data.
To which I ask, why are we even clearing the cache at that point? The cache clear should happen if and only if the entire node save process was successful. That is, if the transaction didn't get rolled back at any point. And then it can get refilled afterward by another process as normal. That is:
1) Star Transaction
2) Write new nod data
3) Close Transaction
4) Clear cache
(And replace points 1-3 with something that has no transactions if you're on a non-SQL entity backend.) In that case, there is still a stale data window, but that stale data window is much smaller; it's also a stale read window, not write window. Viz, there's only a few milliseconds in which the previously stale data would be read before it's cleared and then gets rebuilt with new data. That's not perfect, but it's certainly better than a few millisecond window in which stale data can get written and then cached for the next hour. :-)
Am I misunderstanding the problem space? Because the above seems like it would work. Basically, cache clear should not be part of the entity save process, but a post-effect of a successful entity save process.
Comment #95
swentel commentedMakes very much sense to me. Remove the cache_clear_all call in field_attach_update and clear it in field_entity_update(). That should get rid of the racing condition (at least in D7).
Also in #53, Fanbian mentions this is still a problem for D8, but at least cache_field doesn't exist at all anymore, so maybe (at least for the cache_field) probably, this is D7 material only ?
Comment #96
gielfeldt commented@Crell
It's okay. I didn't make this patch for the sake of convenience, but because it actually poses a real problem.
You did not misunderstand the problem space, and I agree that an entire refactoring of the cache usage concept might be the best approach, in order to avoid coupling to the database all together. However this may be easier than it sounds (there a alot of caveats, see below), and is most likely a candidate for Drupal 9+. But what in the meantime?
Regarding post-process for clearing cache instead of inside entity_save (which will cause a problem, described below), how do we determine when an entity save process has been successful?
Also, another problem is nested transactions. Consider the following:
node_save:
start transaction
end transaction
clear cache (or post process like you mentioned)
So far, so good. But:
node_save:
start transaction
invoke field hooks
field clear cache (these are the ones we need to eliminate then)
crazy custom code that clears cache (the design pattern when coding for Drupal needs to be very clear, that you may not tamper with the cache here).
end transaction
clear cache
And another problem:
custom-bulk-node-save:
start transaction
foreach a bunch of nodes
node_save:
start transaction
invoke field hooks
field clear cache (these are the ones we need to eliminate then)
crazy custom code that clears cache (the design pattern when coding for Drupal needs to be very clear, that you may not tamper with the cache here).
end transaction
clear cache (problem here, since node_save does not know about the outer transaction, and is it at all possible to address this using coding standards/guidelines/patterns?)
end transaction
Even if no entity_save's are involved, how do contrib-modules handle the "quirk" that cache may not be cleared while inside a transaction?
An isolated node_save() may be relatively easy to fix regarding this, but not system-wide I think.
Comment #97
Crell commentedAs implementation, from a quick glance at the code I'm suggesting adding a cache clear call to EntityStorageBase::save():
The cache clear call belongs right before the return statement, nowhere else. If something clears the cache prior to that, oh well, we burn some CPU but that's fine. We're already clearing the static cache in memory in doPostSave(), actually, so maybe the persistent cache clear should go there as well.
If you're persisting an entity and not going through that method, then you're violating the API and we don't care about you. I don't know how "crazy code goes here" affects that problem.
That leaves the multi-save case. The issue here is if you're saving 10 nodes and node 6 breaks for some reason, it gets rolled back... Do you roll back nodes 1-5 as well? A wrapping transaction would imply yes. OK, so what we need to do there is check if the whole set failed, and if it failed, clear the cache for all of the impacted nodes. They would all have been individually cleared while updating, so in this case what we're doing is clearing out any caches created by another process accessing node 3 (and therefore recaching it) with the new data; that one request would get the new cache, but the cache would be cleared again after the transaction is rolled back and then go back to the previous node when next accessed. Not ideal, but works.
Or, perhaps better, there's enough layers of indirection in the save process already that we could probably do something like this (pseudocode):
Of course, we'd need to figure out where in saveMultiple() to do the transaction, given that it's backend-agnostic. But... thinking on that, do we even support that now? If someone's doing completely custom code and exposing a transaction at that high a level, then they're coupling their code to an SQL database outside of the entity API anyway. Which means... not our problem.
Edit:
Thinking on it... do we need to? If an entity save is successful, we want to clear the cache. If an entity save is not successful, we don't need to clear the cache... but does it hurt anything if it is anyway? Remember, we're not doing a complete cache wipe anymore, just clearing that one node. If we can do better than the greedy approach we certainly should (if entity API gives us the necessary feedback, which it may not, in which case that's a bug to fix), but if we can't... there's no data loss, just slightly more cache misses. Vis, any time you save an entity, successful or not, it clears the cache for that entity, and only that entity.
Comment #98
fabianx commentedThe discussion is a good point.
The problem however is an invalidation one, not an ordering or transaction one. (which is also why my proposed revalidate approaches would work as they fix _invalidation_).
There are only two hard problems in ... anyway ;).
The problem is _not_ that the field cache gets cleared, but that the field cache does not get cleared _again_ after the node has been saved.
This is where our system makes the implicit assumption that the DB cache backend runs under the same transaction.
So the solution for D8 is to add the node's cache tag to the field cache entry (or wherever we store that).
And in D7 we and everyone else doing crazy things need to clear the field cache after the node has been saved and the transaction been committed, which always works.
Of course the best solution is to track cache clears during the transactions, but if core used a separate and possibly async DB connection for the cache backend, we would encounter these cases in core as well.
--
And if the code iterates over that and that many nodes it also needs to invalidate all those node caches again and yes that is a little bit bad.
--
And then we end up with gielfeldt's solution again, as we really need to invalidate when pushing the transaction not before.
Basically any cache item touched is invalid (if we don't record all cache things like gielfeldt's solution does).
--
HOWEVER
I think for D8 this is not a problem anymore, because:
a) field cache is gone in favor of whole entity render caching
b) We do most invalidation via cache tags now with the only exception being cleanup of expired items and cc all (clear whole bin)
And cache tag or even fast chained invalidation is a DB operation, which always runs in the same transaction.
So it should not be a problem - though no one really so far looked at fast_chained / cache tags + transactions, yet.
For cache_tags we increment a value, which should be fine according to http://stackoverflow.com/questions/3821468/sql-atomic-increment-and-lock... - though it needs more investigation regarding different isolation levels.
For chained_fast we use a timestamp and the update order is undefined, if there is a long transaction with an early update and a shorter one we theoretically could overwrite the timestamp that was set before.
Also our locks that are set during a transaction are uhm ... non-existant as they don't do anything ...
Ugh ...
That is indeed quite some work for Drupal 9.
But overall it seems we need to record events during transactions regarding caches and cache related operations and play them back, including safety checking and locking ...
e.g. if we recorded cache_tags invalidations during the transaction, then incrementing outside all transactions, then we are good in any case as the sum matches the invalidations and at it is always higher.
In the same way if we did write the chained fast timestamp at the end of all transactions, then we could check that we never write a lower value than before.
So with D8 we have different problems, but its not solved :/.
Also shows that the timestamps of chained_fast are worse than e.g. an atomic counter (if its truly atomic and transaction-independent).
lcache (by David Strauss) might also be an idea to record cache operations in a table, then play them back. That should be unproblematic related to transactions, too.
--
TL;DR: Transactions and application caches combined are a problem ...
--
Edit:
- Under the default isolation level this is all less of a problem as all those rows are locked by the transaction itself, but it means we have less control over what happens and I really need to test this some more:
* lock() => semaphore table, accesses the row, sees active transaction => sleeps
* chained_fast() => cache table, timestamp, accesses the row, sees active transaction => sleeps
* cache tags => cachetags table, accesses the row, sees active transaction => sleeps
So as long as transactions are short, it works fine. If not it can stall your site potentially a lot.
e.g. one long running transaction in theory could make the site unavailable for further writes if chained_fast cannot write the timestamp data again.
But still it means for D8 it is much less of a problem as we have cache invalidation in the same transaction again.
However I think the approach of being able to run in other isolation levels and make services aware of transactions is still worthwhile.
However for D8 I think this should be a generic event and not isolated to cache backends. (if that is already the case, ignore).
For D7 it could just be a hook_invocation() instead.
Then contrib is happy and core as well.
Comment #99
gielfeldt commented@Crell
What about custom code like this? How do we enforce it, if we are to ban it?
In the above, mycacheentry:%id% would be cleared prematurely in a non-transactionally-synced cache backend.
It may also be harder to take advantage of existing code, like so:
In the above code we, as a consumer of some_custom_module_that_processes_node(), do not necessarily know what entites are saved. Not sure how we fix this, short of refactoring the entire in/re-validation system as mentioned.
If we are to make this an event system that's not container-dependent, I'm not really sure how to go about. Hope someone can pitch in.
--------
@Fabianx
True, however the current implementation of cache invalidation in D7/D8 "conflicts" with the use of transactions in certain situations.
_IF_ you know when the outermost transaction ends :-)
This is somewhat what Cache Consistent does, except it doesn't store the cache operations in a table, but just in-memory, since a transaction never spans requests.
Also, I cannot find anything on lcache by David Strauss? Have I become this old and bad ad Googling?
And code that starts a transaction and calls e.g. core code which may save entities unbeknownst to the consumer.
Not true, they still work. However at this point, the database may timeout if another tries to acquire the lock while the transaction is still in progress. In any case here, it may be a very good idea to use another connection for the lock table. I think there's another issue for this somewhere? (also, use read-committed to avoid gap-locks, etc.)
-------
In short: The patch in this issue only addresses the problem with notifying of transactions. Cache Consistent actually does the recording and playback. I have on purpose not submitted Cache Consistent as a core patch, since it's quite massive, and some people would most likely have very strong feelings regarding that implementation :-)
Also, the D7 patch actually notifies the cache layer directly about transactions (like @catch, I think, suggested). I can't exactly remember why I did this, but I think there were some issues using the hook_system. (early page cache, and so on).
Comment #100
Crell commentedSee, here I think is the crux of the issue. Transactions are specifically an SQL thing (at least the transactions we're talking about). We have no backend-agnostic transaction system except the lock API, which isn't really the same thing.
As of Drupal 8, if you're coding to a specific backend (ie, SQL), then odds are you're doing it wrong. The database should only be accessed inside a particular subsystem (cache, entity, lock, keyvalue, etc.). If you're starting a transaction from, say, a Controller, then you're already off the supported path and we don't need to support that anyway.
If you do need to do something SQL-specific in a swappable service, then you can be reasonably confident that you are the top-level transaction, I'd think. And you can do your own rollback logic there if necessary, including not touching caches until you're sure it's done. Basically, cache invalidation belongs in the finally {} block.
We also, admittedly, lack a "Unit of Work" concept in Entity API for cases where you need to update multiple entities together, or not. The way to fix that is to add a UnitOfWork / Batch update (not batch API, but multi-update) concept to Entity API, which can then have an appropriate transactional behavior per-backend. On SQL it would be a DB transaction, on something else it would be... I don't know, ask those maintainers. :-)
In general, though, this is one of the downsides of abstracting away all storage. If we want people to be able to move cache, keyvalue, entities, etc. to three different backends, then there's basically no opportunity to build atomic cross-datapoll operations. At least not without a great deal of extra engineering that would be inherently imperfect anyway.
It's a trade-off either way. D8's made the trade-off of flexibility at the expense of guaranteed atomicity. That's not good or bad without context. The best we can do at this point is to handle the common cases (post-operation cache clearing, etc.) and document the caveats of the rest.
Comment #101
jedihe commentedI'm deciding which fix/remediation to apply for nodes being hit by this. I see cache_consistent is being used successfully by d.o but would also like to know if anybody has used #47 (committing transaction before ->resetCache() is called, by @deviantintegral) and how well it works for mitigating this problem for nodes. Thx!
Comment #102
mkalkbrennerIn some production environments we recognized strange 503 errors without any reason.
After a lot of debugging I found the cause and wrote the patch #2833100: dblog fails to log INNODB_LOG_FILE_SIZE errors. With that patch applied the cause is visible in the logs:
As you can see, the cache write causes the transaction to not be committed at all!
Comment #103
tstoecklerSo you are saying that you think it is problematic to include the cache write in the transaction in the first place?
Comment #104
berdirThat's a render cache write. Why would that even happen when saving an entity? Only thing I can imagine is some custom code rendering something in a presave or similar hook? Are you sure that happens as part of saving?
Comment #106
berdirWe discussed this issue a while ago in an entity major issue triage meeting and I didn't do my homework of commenting on the issue.
Basically, we are not sure if this is still a problem in D8, especially after #2753675: loadUnchanged should use the persistent cache to load the unchanged entity if possible instead of invalidating it because we're doing a lot less cache reads and writes when saving an entity, there there's less risk for race conditions.
It would be great if people who have seen this in a the past could check if they can still reproduce this in 8.3. If we don't get any such reports, we might decide to change this to a normal bug report.
I assume it's still major for Drupal 7, and given that a fix would look comletely different there anyway, it might make sense to open a separate 7.x issue that could still be major as the situation there hasn't changed.
Comment #107
nwom commented#66 still applies cleanly to the 7.x. Thank you!
Comment #108
nithinkolekar commentedOne of my d7 project I faced the race condition for node save using services module and solved that by adding random delay for each node_save(hackish way).
is anybody tested if patch also works for service module's node creation?
Comment #109
penyaskitoI still see this in the Lingotek module when translations are tried to be saved concurrently. Still figuring out if the problem is the module itself or not.
Comment #111
acbramley commentedI've just encountered what seems like an issue that may be related to this. We just deployed Drupal 8.4 to a testing environment that has memcache (using the memcache_storage module) and hit this error on a node save:
This did not show up on dev environments without memcache installed.
The cache bins that we use memcache for are:
menu, toolbar, render, and dynamic_page_cache.
I've not seen this issue before upgrading to 8.4 though
Comment #113
badrange commentedI wonder:
- Will this problem ever be fixed for Drupal 7?
I just stumbled into this recently. The cache_consistent module with the patch for 7.41 seems to work on my local vagrant box, but I really don't feel like messing with a legacy production system with thousands of users.
- Will this problem ever be fixed for Drupal 8?
Hopefully before I ever run into it so that I don't get even more grey hair from age old Drupal bugs.
Comment #114
dtv_rb commentedWhy is there still no official fix for this?
Drupal 8 websites with high traffic (like ours) are forced to use the development version of a several year old module (cache_consistent) with an equally old core patch to work around this bug.
Comment #117
badrange commentedI am no longer working on any Drupal projects, so I'm unfollowing this. So long and thanks for all the grey hair, Drupal.
Comment #118
m4oliveiWorking with a client on a D8 site (8.6.15) with memcache, seeing syptoms akin to the issue in this description, eg. editor saves a node, form returns and all the field values in the form are set to old data.
Wondering if others are seeing this behavior still on more recent D8? There was a suggestion that it's no longer an issue in #106. However, following along, and comparing with recent D8 code, a node save will arrive at
\Drupal\Core\Entity\Sql\SqlContentEntityStorage::save, which begins a DB transaction. Before that transaction falls out of scope, there is a cache reset that happens (\Drupal\Core\Entity\ContentEntityStorageBase::resetCache) that hits memcache. There is also a cache tag invalidation that happens. Cache tag invalidation is done on the DB, so that seems like it wouldn't be a concern.Trying to prove out if it's still an issue in more recent D8. Seems like it could be given the above.
Comment #119
berdirSee #2966607: Invalidating 'node_list' and other broad cache tags early in a transaction severely increases lock wait time and probability of deadlock and the related redis issue #3018203: Support delayed cache tag invalidation, that fixed similar problems for us reliably. Memcache will need a corresponding patch just like redis (note that the patch there doesn't reflect the most recent patches in the core issue, which will allow to further simplify the cache tag implementation and avoid duplication).
Comment #120
m4oliveiThanks @Berdir! I was pretty lost at that time, and you gave me new inspiration.
This issue and the ones you linked to are really hard to follow when you're starting with minimal knowledge of DB transactions, MySQL isolation levels, memcache, and cache tag invalidation. Over the past couple of days I've been working on reliably reproducing this race condition that leads to a cache consistency problem. Here is what I came up with. Hope this helps others. Just seeing this mystical issue is huge for me, and gives me hope to knock this out using one of the suggested solutions.
OK, here's the steps to reproduce I came up with:
My setup
Steps to reproduce
node_field_datatable). This is because the DB transaction hasn’t been commited yet. Note also that at this point, memcache has been invalidated for this node, this is key.node_field_data).The xdebug breakpoint makes the race condition really obvious. In practice it could be a long running process on node save, eg. reaching out to an external service as has been suggested here. For us we have Apple News module enabled, which can take a long time on node save. Regardless, there is a window, illustrated by the xdebug breakpoint here, where stale data can be written to the cache backend by a parallel process when you use an external cache backend like redis or memcache.
Now I have some small hope of resolving this using one of the suggested solutions :). Will report back as I make progress.
Comment #121
m4oliveiAdding some finer points to what I last wrote:
values:node:1154455item in the entity cache.I've also been looking at #2966607: Invalidating 'node_list' and other broad cache tags early in a transaction severely increases lock wait time and probability of deadlock as a potential solution. Not sure I totally understand why yet, but that patch doesn't solve the issue here. I'm finding that the entity cache item is still served even when the cache tag invalidations are done after the transaction is committed. You still end up with the behavior of an editor editing a node, only for it to reload with stale data.
I suspect this is because memcache needs a patch like @Berdir mentoned exists for redis, eg. #3018203: Support delayed cache tag invalidation. Will look into that next.
Comment #122
m4oliveiThis turned out to be not true. The issue here is that the entity cache item (
values:node:1154455) that is causing the stale data issue, doesn't have any specific cache tag associated with it that is invalidated after a node save. The cache tags it gets saved with is:entity_field_info,memcache:entityandnode_values. None of which are cleared on node save, which leaves us with the problem at hand. It stands to reason therefore that addingnode:1154455to the list of cache tags for the entity cache item would fix the race condition. That way, with the patch in #2966607: Invalidating 'node_list' and other broad cache tags early in a transaction severely increases lock wait time and probability of deadlock, when thenode:1154455cache tag is invalidated after the transaction commits, it includes clearing out cache likevalues:node:1154455.Comment #123
m4oliveiWill work on a patch to add the entity cache tags to the entity values cache item.
Comment #124
m4oliveiHere is a patch. This assumes that you are also using the patch in #2966607: Invalidating 'node_list' and other broad cache tags early in a transaction severely increases lock wait time and probability of deadlock. Specifically I am testing against https://www.drupal.org/project/drupal/issues/2966607#comment-13253493.
It's a totally different approach from prior patches. So far, testing the two patches together is working to resolve the stated steps to reproduce in #120. Not sure if this approach has unwanted sideeffects. It seems like if would be pretty innocent, I'm essentially adding a cache tag for the individual entity to an item that would otherwise be manually cleared by key during entity save. This ensures that when your using a alternative cache backend (eg. memcache, redis) the logic in #2966607: Invalidating 'node_list' and other broad cache tags early in a transaction severely increases lock wait time and probability of deadlock kicks in to fix the race condition stated here.
Comment #125
berdir> This turned out to be not true.
But it is. The core issue is about cache tags, but it adds the mechanism to do whatever you want at the end of a transaction and the redis cache patch uses it to also delay direct cache item invalidation. We had exactly the problem that you describe on several large sites and the core + redis patch reliably solved it. And yes, memcache will need a corresponding implementation.
We don't want to add the cache tag like this, because it is a 1:1 relationship between the cache tag and the cache key, we already know how to clear those cache items directly and do so. This adds extra overhead on every cache get because we also always need to check that the cache tags are not invalidated.
Comment #127
m4oliveiAhh, I see that now. I realize now that when I was looking at that redis patch I was focusing on the cache tag pieces and not seeing that direct deletion piece. That makes sense. In which, case, ignore my patch! :). I'll look into doing the same in memcache. It looks like this ticket was filed to work on the implementation for #2966607: Invalidating 'node_list' and other broad cache tags early in a transaction severely increases lock wait time and probability of deadlock:
#2996615: Transaction support for cache (tags) invalidation
Suppose that's a good place to work in.
Thanks for pointing that out. I was obviously missing this in my understanding.
Comment #129
m4oliveiReporting back to close the loop on my previous comments.
We deployed the patch in #2966607-127: Invalidating 'node_list' and other broad cache tags early in a transaction severely increases lock wait time and probability of deadlock to 8.7.7 in production last week together with the corresponding patch to memcache in #2996615: Transaction support for cache (tags) invalidation and we've seen our issue dissapear!
In my estimation the combination of #2966607: Invalidating 'node_list' and other broad cache tags early in a transaction severely increases lock wait time and probability of deadlock and the implementation of it for redis or memcache resolves this issue for those two backends.
Thanks for all your work on this and help to me personally @Berdir!
Comment #134
matroskeenAccidentally, I stumbled upon this issue, and based on the last comments it feels like it's not relevant anymore?... (at least for Drupal 9)
Can anyone check and confirm? Otherwise, we'll probably need to update the issue summary.
Comment #135
matroskeenI asked in #bugsmash Slack channel where @catch confirmed it can be closed as a duplicate of #2996615: Transaction support for cache (tags) invalidation and #2966607: Invalidating 'node_list' and other broad cache tags early in a transaction severely increases lock wait time and probability of deadlock.
Thank you everyone! Another D9 Major issue is done!
Comment #136
ruiadr commentedHello,
I had same problem with my configuration, I use redis for cache, but some tables in database was still in use (cachetags and cache_container).
May be I followed the wrong way reading the documentation (I m novice with redis), but I found this article who explains how configure correctly your app to use redis for all your cache: https://community.platform.sh/t/how-to-use-redis-for-caching-a-drupal-8-...
In my opinion, if all your cache is not in redis (a large part in redis, a small part in database for example), during the node save, services are not executed in the right order because they don't have the same response time (redis is faster than db).
In my case, it seems to solve my problems, i had PDOException (There is already an active transaction) and at the same time Drupal\Core\Http\Exception\CacheableAccessDeniedHttpException (The 'xxxx' permission is required.).