Problem/Motivation
Since upgrading to 10.4.1 I have seen a user warning about overwriting cache redirects:
User warning: Trying to overwrite a cache redirect with one that has nothing in common, old one at address "languages:language_interface, theme, user.permissions, route" was pointing to "user", new one points to "user.node_grants:view". in Drupal\Core\Cache\VariationCache->set() (line 138 of /app/web/core/lib/Drupal/Core/Cache/VariationCache.php).
Debugging on 11.x I was able to reproduce it. It is coming from the local tasks for nodes under certain conditions after the entity status is changed.
It appears that when the user has the view own unpublished content permission and the node is not published then the node access handler returns access with the user.permissions context added by the checkViewAccess method.
When the node is published checkViewAccess returns NULL and so the parent call continues down to where $access_result = $this->grantStorage->access($node, $operation, $account); is called. At this point if there is an implementation of hook_node_grants then NodeGrantDatabaseStorage adds the user.node_grants:view context.
This difference seems to cause errors for variation cache works and when it renders the tabs after the status change produces the error.
> User warning: Trying to overwrite a cache redirect for "entity_view:block:stark_local_tasks:[languages:language_interface]=en:[route]=entity.node.canonical02d270e527ee2db38f14edbd305a4fea048b224ab212dd8a734fefc49e4a04ab:[theme]=stark:[user.permissions]=863125a601a1ef531f97c3f043764864a62066d84f76547712f32db5e29fa5eb" with one that has nothing in common, old one at address "languages:language_interface, theme, user.permissions, route" was pointing to "user.roles:authenticated, user", new one points to "user.node_grants:view". in Drupal\Core\Cache\VariationCache->set() (line 138 of core/lib/Drupal/Core/Cache/VariationCache.php).
Steps to reproduce
Refer to the automated test in the MR for minimal steps to reproduce. Summary of the steps below.
Setup:
- Standard install of 11.x
- Enable node_access_test_empty test module (or any module that implements node grants)
- Create page content type - set default to not published
- Ensure local tasks block is placed on the page
- Create role with the following permissions: 'create page content', 'edit any page content', 'view own unpublished content'
- Create an user with the above role
Flow:
- As a user with the above role create a new page node (they must be the author)
- View the node and ensure the local tabs are rendered
- As admin or other user with appropriate permissions publish the node
- As the original node author, view the published node again and you get the cache redirect error message
Proposed resolution
Remaining tasks
User interface changes
Introduced terminology
API changes
Data model changes
Release notes snippet
| Comment | File | Size | Author |
|---|---|---|---|
| #53 | 3504114-53.patch | 5.81 KB | leo liao |
Issue fork drupal-3504114
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #4
cilefen commentedComment #5
ericgsmith commentedThank you for moving the issue. Unfortunately I don't have the knowledge to say this is a node issue and not a cache issue?
I have had trouble really digging into the variation cache as I think I can't find documentation on some of the basic terms like cache redirect - and I'm confused why the redirects which appear to relate to this warning don't contain cache tags. There is no doubt a good reason, but I don't know enough about that to say that this is a node issue.
It seems like the cache has a dependency on a node - and that node changes so the cache is correctly invalidated, but whatever these redirects are seem to hang around due to the absence of tags. Now the node has changed that brings about a scenario where the context is then different - and that triggers an error.
The test WIP I added was just to highlight the difference, not to say that the difference is unexpected.
Also - as far as I understood the cache context docs, user.node_grants cache context should specify that the item is uncachable - so I'm futher confused how we end up getting cache redirects errors for something that should not be cached.
Comment #7
ericgsmith commentedAlright, gave this another crack and for whatever reason I need to repeat the unpublish dance to generate the error, and now have the error generating in CI - https://git.drupalcode.org/issue/drupal-3504114/-/jobs/4579941#L1064
Comment #8
ericgsmith commentedComment #10
catchThis was previously looked at in #3278759: Access cacheability is not correct when "view own unpublished content" is in use but the test clearly fails so we must have missed something there. Renamed the test class.
Comment #11
kristiaanvandeneyndeAt a quick glance, the NodeAccessControlhandler definitely adds the node as a cacheable dependency. So if the status changes, anything that was set in there should get invalidated and the error you report should not occur. I'll see if I can reproduce this locally. Might only be tomorrow or next week, though.
LocalTasksBlock::build() seems to take care with its cacheable metadata, though...
Comment #12
catchI pushed a commit that makes the test go green - it is not the correct fix, but I ran out of time and that's where I got to debugging.
Comment #13
kristiaanvandeneyndeManaged to reproduce following steps in IS.
On my latest branch where blocks are placeholdered, it shows the following two errors:
With both having the same stacktrace of:
So that definitely narrows it down to the olivero_secondary_local_tasks and olivero_primary_local_tasks blocks having the wrong cacheable metadata. Both use the local_tasks_block block plugin.
Comment #14
kristiaanvandeneyndeWhen the node is unpublished,
$links['tabs']['entity.node.canonical']['#access']contains:When it's published:
Which confirms the IS. And that probably does indeed stem from the node access checks.
Comment #15
kristiaanvandeneyndeWhen we get to NACH::checkAccess()'s last return statement, the access result only has 'user.node_grants:view' cache context on it. This is because the node is published and checkViewAccess() does nothing. But it does also have the node:1 cache tag. So if the node gets unpublished, the entire cache entry for said node should vanish.
I'm thinking this is a VariationCache bug where it still finds the old redirects.
This comment is actually not true:
In this scenario, we have the following chain:
The first redirect never changes no matter how many times we save the node. And because we don't tag redirects it means it will always lead to the second redirect where (if the first visit was unpublished) we have the following contexts:
Now, the 'user.roles:authenticated' and 'user' cache context should only ever be added if the node is unpublished. But because we never flushed the redirects, the cache GET when the node gets published finds the still accurate first redirect, which leads to the stale second redirect and that's where shit hits the fan.
Oh dear.
Comment #16
kristiaanvandeneyndeKeep in mind that the bug only exists because we have no entity status cache context.
Basically the node access check outcome doesn't depend on the entity status, it varies by it. So if in NACH::checkViewAccess() we could do the following:
Then we would be in the clear because the second redirect would be moved to the third and in its place would be a new redirect which only redirects to said node's status.
Now the question is whether such a cache context would make sense, because it could require us to load an entity that's not on the route. Here it is, but for argument's sake... And given how cache contexts need to be fast I'm not sure that's such a great idea.
So the second best thing would be to add cache tags to redirects, but then they might get invalidated all the time, killing the VC performance gain. Because, by the time we get to VC, we have the cache tags of the entire thing and we can't know at what point they became relevant. So we might be tagging the first redirect with node:1 even though the status of said node will never affect it.
The whole point of not flushing redirects is because everything that could lead to a redirect must be representable by a cache context. In the case of NACH it's not. Oof.
Comment #17
kristiaanvandeneyndeOh and before anyone panics: There is no immediate risk here.
The hardening in VC will complain about the redirect, but it will then store the new redirect just fine. The risk is that the goal was to turn the warning into an exception and then we would be in trouble, so we can't do that until we figure this one out.
Comment #18
catchI'm wondering if we can have a node status cache context that always returns TRUE (or whatever), but it adds the node ID as the cache tag (possibly without loading it as an optimisation) so that the redirect gets tagged when it's present.
Comment #19
kristiaanvandeneyndeFor this particular case, what we have is this:
['entity_view', 'block', $entity->id()]So this leads to a top-level redirect with just the required cache contexts, pointing to 'route' and whatever else the local tasks of the very first page after a cache rebuild specifies. As we visit more pages, the cacheable metadata of the local tasks will vary but always contain route and as such we get to a second redirect through self-healing seen in #15.
So now we have for cache keys
['entity_view', 'block', $entity->id()]:['languages:language_interface', 'theme', 'user.permissions']- This is the case for ALL local task blocks['route']- This comes from a massive amount of different routes, not per se node relatedNow imagine we hit the node page first and tag both the first and second tier redirect with 'node:1'. We then visit a few more pages, build more redirects and warm some caches and node 1 gets saved. Boom, the local tasks of all those pages cannot find their redirect path to their cache entry any more and have to calculate their contents anew.
Trying to be smart about this during self-healing is also impossible because you cannot know at what point the tag became relevant. Set it too early and you clear too many caches, set it too late and you might have a security risk on your hands.
Now imagine we did have a cache context to represent an entity status. Then we would have the following:
['languages:language_interface', 'theme', 'user.permissions']['route']- ANY REDIRECT AFTER THIS POINT IS FOR THE SPECIFIC NODE ON THE ROUTE ONLYThis is fine, because if we subsequently visit the node 2 page, we would only find the first two redirects and then resolving the route cache context would lead to a miss, at which point we can store a redirect for 'entity_status:node|2'.
Finally, there's still a very real chance the final page would be tagged with the node cache tag, but if that one gets invalidated, the redirect chain remains intact as it is not affected by any change on the node whatsoever. Status 0 will always point to one CacheRedirect and status 1 to the other. That will never, ever change.
So I'm starting to really like the prospect of an EntityStatusCacheContext for this scenario. tagging redirects is a no-go and flushing them all whenever the end of the chain is invalidated is just enormously wasteful.
Comment #20
kristiaanvandeneyndeYou mean a cache context that only serves a purpose to "hint" to VC that it should add some cache tags to the redirect where it's present? Now that would be an awesome alternative. Would have to be its own kind of cache context, though. We should not mix that behavior with regular cache contexts.
Example with cache context "entity_status:node|1"
There's still the question of what happens if it's present on a node page when all caches are warm. Because at that point we'd try to write a redirect with the following added:
If this gets optimized on a second visit to, say, a taxonomy page then self-healing would break it up and store the route redirect. That would erase the node stuff and next time we visit said node we'd store the redirect without 'route' and tag that with the node tag.
HOLY SHIT THAT WOULD WORK. That's awesome.
We could even remove the "special" cache contexts from the redirect if we want, but it might be better to keep them for debugging purposes. We could then go as far as to allowing these cache contexts to not have a value at all, only return a set of cache tags.
The only downside to this approach is that we always flush the final bits of the redirect chain (and therefore the local tasks for said node) whenever the node is saved. If we were to use a "true" entity_status cache context, we could keep both versions of said local tasks indefinitely.
But I'm not sure the extra cache lookup introduced by that is worth it. Keep in mind every CacheRedirect is an extra cache GET. Then again, having to recalculate local tasks and run route access checks every time a node is saved isn't great either.
Comment #21
catchI don't think it needs to be special as such, we already have a getCacheableMetadata() method on CacheContextInterface, the hard thing will be naming. But yeah we could potentially add a new interface that indicates that the ::getContext() method is inert/null op to make them special. Like 'MetadataOnlyCacheContextInterface' or something.
The other thing I'm wondering also though is does it really hurt to add the
'user.roles:authenticated'cache context when it's not strictly needed - it is extremely low cardinality, just anon and auth. However apart from my initial quick fix MR approach above, I couldn't find the right place to actually add that and still have the tests pass yet.Comment #22
mxr576I am still catching up with this thread, but...
IIRC @kristiaanvandeneynde you have suggested against doing something similar in #2628870-69: Access result caching per user(.permissions) does not check for correct user -- and fun fact, the code that I copied the comment is available in a public repo since then: https://git.drupalcode.org/project/view_usernames_node_author/-/blob/1.x...
Comment #24
catchThe cacheable-metadata-only redirect might be useful, but also I think we can fix this in the node access handler.
https://git.drupalcode.org/project/drupal/-/merge_requests/11395/diffs#n... is viable I think.
https://git.drupalcode.org/project/drupal/-/merge_requests/11402/diffs was an early start at an alternative approach that is quite broken.
Comment #25
kristiaanvandeneyndeI'm not sure this would work. There's a bug in the MR and the intended solution leads to several cache inefficiencies.
NACH::checkViewAccess() currently only checks if you can view your own unpublished nodes. It also properly adds any cacheable metadata to the CacheableMetadata object that's passed into it and said metadata is merged in ::checkAccess() wherever we return.
So not returning early, still running the grants check and then returning the grants check means we killed off access for viewing your own unpublished node. As we do not return said access check anymore. It also means that the following lines do nothing.
Now, this can be fixed. So let's move on to the changes in ::viewAccess() instead and see what's changed.
Now 1 soft collides with 2 for someone who can't view own unpublished and grant implementations exist. If we visit a published first and then visit the same node when unpublished, we will overwrite the CacheRedirect we had for 'node_grants:view' with a direct cache entry because it varies by nothing extra. Vice versa if we visit unpublished first and then publish the node, we write a cache redirect and then the cache entry.
So in principle VC will work fine, it's just slightly inefficient to kill off the redirect path when the status changes or to follow a redirect path that we know will lead to an invalidated item. No big deal, I suppose.
1 and 3 will still lead to a VC error for anonymous people with the permission while grant implementations exist because, depending on the order, we will either try to overwrite a redirect for 'node_grants:view' with 'user.roles:authenticated' or vice versa. Anonymous people shouldn't get the permission, but still...
1 and 4 will no longer lead to a VC error for authenticated users with the permission when grants are implemented, because we will always have a redirect for 'node_grants:view' regardless of node status. However, because the node status is still not reflected in the redirect chain, we will see the same inefficiencies we saw between 1 and 2.
Same story when grants are not implemented, but here we won't see 'node_grants:view' at all and it's merely the difference between nothing extra and some contexts added by checkViewAccess() so it's also the same as the behavior between 1 and 2.
TL;DR: We still have errors in some ill advised configurations and we definitely still have inefficiencies across the board. The essence remains that NACH is varying by node status but we have no context in core to reflect that.
If we don't want to support such cache contexts, then the problem becomes a VC problem because now we need to be able to add cache tags to cache redirects and we get to the possible suggestion @catch made.
After sleeping on it, it does not seem like a VC issue after all because ::checkViewAccess() does add a cache context whenever it varies by something, only it fails to do so for the status check. That is the root of the issue.
Moving back to node subsystem for that reason. If the solution is to add a status cache context, we can do that in a different issue which is for the cache subsystem.
Comment #26
kristiaanvandeneyndeYou could erase the VC error coming from 1 vs 3 by moving the node grants cache context higher up in checkViewAccess(). That could be a stopgap solution, but we really need to look into an entity status cache context instead. Visiting node pages after changes to said node will now face several extra queries that we ideally should not even run.
Comment #27
kristiaanvandeneyndeComments #25 and #26 were for MR 11402. Checking 11395 in a few.
Comment #29
kristiaanvandeneyndeOkay everything I wrote holds true, just the bug I reported in #25 is not present in the other MR. Everything else still fits as the grants cache context is added too late for the stopgap to work.
Comment #30
kristiaanvandeneyndeEdit: This comment was written when the idea is that pseudo-contexts would never actually get written. Check later comments for a working solution where they do get written and with good DX.
Given this some more thought and I don't see "hinting" for what to tag a redirect with working.
Say we have a scenario where we have three critical decision points:
Ideally, we want the node cache tag to be added to the redirect for B or X, but how do we know where to add it?
We could come to a cold cache with AB(N:1), AXY(N:1) or AXZ(N:1) and that would be one big fat redirect pointed to by the initial cache contexts (). So the chain would, for example, be:
<INIT> --> AB --> <CACHED DATA>Now invalidating A whenever the node changes would be silly, but for the sake of arguing let's say we write the tag at the last redirect. If we now change the status, the whole chain is flushed, but no real damage is done. However, if we visit a second node page that triggers XY, the redirect path would be rewritten to:
<INIT> --> A --> XY --> <CACHED DATA>At this point, tagging the last redirect would give us what we want. Only XY and the cached data get flushed when the node status changes and we keep redirect A. But if we now visit a third node that triggers XZ, we get:
<INIT> --> A --> X --> Z --> <CACHED DATA>Now we're in a pickle again. Because we need to tag X with the node cache tag, but how do we know that? When we get to the cache we only see AXZ(N:1), we do not know which context was added in which order.
So hinting what to tag a redirect with would only work if we had a chronological order of when cache contexts were added and I still think it would be really poor DX to expect people to provide pseudo-contexts whenever the vary by something that is not part of the global context.
But that would require a complete rewrite of our cacheable metadata add and merge methods.
So moving on to another option: Source detection.
If we could tell NACH where the node that's being passed in is coming from, we could add the cache context for whatever global source we used to get the node. In this case it could be route.args:node or something. An example of this is the route.group cache context in Group, but a more generic version of it.
But that would require some sort of service where we can pass in an object and get a source in return. And if the node was actually loaded at random and passed into NACH, it would still break because we'd have no cache context to represent the status check.
So that left us with one final solution: A cache context to represent status checks. I am really not a fan of this either because it's just plain dirty. The node in question may be from a global source, but we don't know that. Falling back to loading the node without context is just asking for performance issues and poor cache hit ratio.
This was also touched on in the issue mentioned in #22.: To add insult to injury, we had a notion of "current user" vs "any user" being discussed there and that could lead to poor cacheability all around.
Now, I may have got a solution but I'll write that down in a separate comment. This whole comment was to outline why we're stuck.
Comment #31
kristiaanvandeneyndeOkay so moving on from my previous comment, I just thought of something that could make pseudo cache contexts work.
Imagine a value object PseudoCacheContext that implements CacheableDependencyTrait but:
Now in VariationCache, when we detect such a context it will tag the the next chain item after the context with said metadata.
Now we could take the example of my previous comment and tell VariationCache exactly which redirect needs to be tagged. So no matter what redirect the pseudo-context gets added to, that's where we flag the next redirect for receiving the node cache tag. So from the example above, no matter if the chain looks like:
<INIT> --> AXZpseudo --> <CACHED DATA><INIT> --> ABpseudo --> <CACHED DATA><INIT> --> A --> pseudo --> XZ --> <CACHED DATA><INIT> --> A --> pseudo --> X --> Y --> <CACHED DATA><INIT> --> A --> pseudo --> X --> Z --> <CACHED DATA><INIT> --> A --> pseudo --> B --> <CACHED DATA>We would always tag the one after the pseudo context with the node status.
The offending code in NACH would then become:
This feels like it could work with minimal effort on changing VariationCache. The only drawback is that the pseudo-context will in some cases introduce an extra redirect, but that would be the case anyway if we had a proper "node status cache context".
Comment #32
kristiaanvandeneyndeThe upside of the above is that the DX isn't nearly as bad.
It's easy to document PseudoCacheContext as:
"When you need to vary based on a cacheable dependency's property but you have no cache context to represent that, perhaps because you don't know where the dependency came from. In that case, add the dependency to a PseudoCacheContext and pass in the PseudoCacheContext as a cacheable dependency of your return value."
But with clearer language than what I could come up with just now.
Comment #33
catchCould this be called
CacheableDependencyCacheContext- a cache context that varies by its cache dependencies?If so I have one more question:
Is there a forseeable situation where someone might want to make a similar cache context that also returns something? If so we could add CacheableDependencyCacheContextInterface and look for that in VariationCache, but otherwise just add the new object as described in #31.
Comment #34
kristiaanvandeneyndeSure, we can name it whatever makes most sense.
But it's important to state it does not vary by the dependency, it's just that we have a dependency that causes variations which we cannot represent with a regular cache context so we need to be able to invalidate a part of the redirect chain to reflect the dependency.
I don't see a use case where the cache context would return any value, really. It will essentially serve as a substitute for a redirect chain link that we cannot otherwise declare. By virtue of that it does not make sense to have it return anything other than a fixed value (like 1).
Edit: But as I start working on this I can always see what the cost of a more elaborate check is, so we're future-proof in case the situation does change. I just can't imagine a scenario like that right now.
Comment #35
catchHmm this is true, it doesn't vary by what the dependency is, it just varies to the extent that the cache context is present (or not), and then adds invalidation on top. Maybe naming is still fine with that caveat though.
Comment #36
kristiaanvandeneyndeJust added the cache context and value object for review. Still need to implement in VC and the node access control handler as proof-of-concept.
Comment #37
catchLooking at this again and it feels almost impossible to explain depending on a cacheable dependency vs. varying by the property of a dependency.
Just to check I extracted one of the previous quick fixes here, which compensates for the short circuit when the node is unpublished, to see if it passes the new test coverage. I'm not 100% sure we should even be bypassing the node grants system when access is granted via view own unpublished.
Comment #39
kristiaanvandeneyndeI'd say the following:
More details and edge cases to be added, but that would be the gist of it.
So take the status field for example:
If the node was retrieved from e.g. the route and we know that, we can vary by route. If we don't know where it came from, we use a pseudo cache context.
Comment #40
kristiaanvandeneyndeThe MR you opened still has the following going for it:
It also means we check the grants cache context where it doesn't apply. But that's only for one out of many configurations of status vs owner vs permissions. So I'd be able to live with that as collateral damage.
However, this is the first time we encountered this bug and we cannot know whether we might encounter it more. We may not always have the luxury of adding an unrelated cache context just to fix cache redirect collisions. So a proper solution is still needed to be safe and that is what I was trying to come up with.
Comment #41
catchWe could maybe split my 'quick fix' MR out to a side issue, try to get the committed (with the test coverage) and then continue here?
Comment #42
kristiaanvandeneyndeSure why not
Comment #43
catchSpun out the quick fix to #3516477: Avoid cache redirect error when using 'view own unpublished content' permission alongside node grants.
Comment #45
bingol@ciandt.com commentedComment #46
bingol@ciandt.com commentedComment #47
bingol@ciandt.com commentedComment #48
acbramley commentedHi @bingol@ciandt.com, contribution to this issue should be done in the MR, not patches. It would be good to understand your changes as well in the form of a comment.
Comment #49
kristiaanvandeneynde#3516477: Avoid cache redirect error when using 'view own unpublished content' permission alongside node grants landed, so the MR here can now be rewritten to remove the workaround committed there.
Comment #50
acbramley commentedMerged in 11.x, the test was reverted to what was committed in the other issue and the only difference in NodeAccessControlHandler was an additional comment which I've kept.
Comment #51
acbramley commentedactually this is probably still NW as we'd need changes to the test to cover the redirect error.
Comment #52
kristiaanvandeneyndeYes, the approach to be tried here now is the one that introduces the notion of a pseudo cache context.
Comment #53
leo liao commentedThis is a patch for MR, supported in 11.2