Problem/Motivation
If people get a render array or cacheable code flow where they can add to a parent and in their child specify cache contexts, those will accumulate and VariationCache will work by virtue of cache redirects.
However, I've recently seen core code where parent cache contexts were left out and/or manually optimized and by happy accident this hasn't caused any issues yet, but if a longstanding bugfix (#2973356: Cacheability information from route access checker access results are ignored by dynamic_page_cache) gets in we might all of a sudden see bug reports and perhaps even security issues.
We should make VariationCache more defensive by checking for all previously set cache contexts every step of a redirect chain. If one is missing, we should immediately throw an exception and bail out. This should catch a lot of insecurely written code that doesn't properly set the right cache contexts by letting people know there is something wrong with their code.
Steps to reproduce
Try the following render array:
// This is the starting point.
$build['parent'] = [
'#markup' => $user->hasPermission('foo') ? 'You can foo!' : 'You cannot foo',
'#cache' => ['contexts' => ['user.permissions']],
];
// This is in a place where the original build can be altered.
if ($user->hasPermission('foo')) {
$build['parent']['child']['#cache']['contexts'] = ['user'];
$build['parent']['child']['#markup'] = $user->id() === 13 ? 'You are 13!' : 'You are not 13';
}
This is a horrible example, but it works fine. Now imagine if in that alter step they tried to be clever and said: "Oh hey, user and user.permissions fold into user, so let's do that here already".
if ($user->hasPermission('foo')) {
// Trying to be smart here by optimizing parent.
$build['parent']['#cache']['contexts'] = ['user'];
$build['parent']['child']['#markup'] = $user->id() === 13 ? 'You are 13!' : 'You are not 13';
}
The above would cause a redirect to be written for 'user' a user with permission 'foo' came first and for 'user.permissions' if anyone else came first. Whereas the actual redirects should be: user.permissions for everyone + an extra redirect beyond the original one to user for users who have the 'foo' permission.
VariationCache would choke here because it tries to heal ['user'] vs ['user.permissions'] but they have nothing in common. ['user'] and ['user', 'user.permissions'] does and would properly apply self-healing and create the second redirect.
A more practical example can be found in UserAccessControlHandler::checkAccess() for the view operation. I have yet to file a bug report for that, but it applies the flawed logic above where it doesn't always add the user.permissions cache context even though it should.
if ($account->hasPermission('access user profiles') && $entity->isActive()) {
return AccessResult::allowed()->cachePerPermissions()->addCacheableDependency($entity);
}
// AT THIS POINT EVERYTHING SHOULD VARY BY USER.PERMISSIONS, YET DOESN'T.
// Users can view own profiles at all times.
elseif ($account->id() == $entity->id()) {
return AccessResult::allowed()->cachePerUser();
}
// AT THIS POINT EVERYTHING SHOULD ALSO VARY BY USER, YET DOESN'T.
else {
return AccessResultNeutral::neutral("The 'access user profiles' permission is required and the user must be active.")->cachePerPermissions()->addCacheableDependency($entity);
}
As explained in the issue where I found this, this could lead to a person with the same permissions as the profile owner to lock the owner out of their profile if said non-owner came first and got an access denied which then gets cached per permissions. If this access check were to trigger for the owner first and then for a non-owner, we'd get the broken redirects described above.
Proposed resolution
Add a sanity check to VariationCache to enforce that any previous cache contexts up until the point where we want to store something are present. This point can be right after the initial get, but equally after a few redirects. That entire chain must be represented by any subsequent CacheRedirect that is being stored.
Remaining tasks
Fix VariationCache and write tests
User interface changes
/
API changes
Broken code will now get an exception, which is good.
Data model changes
/
Release notes snippet
/
| Comment | File | Size | Author |
|---|
Issue fork drupal-3452181
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
- 3452181-variationcache-needs-to
changes, plain diff MR !8271
Comments
Comment #2
kristiaanvandeneyndeMarked as major because we have a few sleeper bugs in core that this might uncover.
Comment #4
kristiaanvandeneyndeMR needs work, first item in the chain is always the stored item or a redirect, so we can't have this "we already checked initially" stuff in there. Working on it.
Comment #5
kristiaanvandeneyndeOkay tests have thorough coverage now. Let's see if this breaks core and why.
Comment #6
kristiaanvandeneyndeOof, I was afraid we'd see this many test fails. Let's start digging, I suppose.
Comment #7
kristiaanvandeneyndeCommentNonNodeTest message was:
Trying to set a redirect at an address that already had a redirect with nothing in common, old redirect at address "languages:language_interface, theme, user.permissions" was pointing to "url.path.parent, url.path.is_front", new redirect at same address points to "route".Sucks how browser tests obscure these, so far I always put the following in these tests so I'd happily learn about a better way of debugging this:
Comment #8
kristiaanvandeneyndeOkay so figured out where these are coming from: CommentBreadcrumbBuilder::build() and PathBasedBreadcrumbBuilder::build(). It seems like either one or the other fires, but not both, leading to the exception we introduced here as the same cache ID now triggers a completely different redirect.
Comment #9
kristiaanvandeneyndeYup, from CommentBreadcrumbBuilder:
This conditional applies return value is never represented by the 'route' cache context when the result is FALSE.
I'm sure we'll find plenty similar core bugs while combing over the test fails here.
Comment #10
kristiaanvandeneyndeFound an issue that reports this, moving on.
if I change BreadcrumbManager to always add the route cache context, the test goes green.
Comment #11
kristiaanvandeneyndeSeems like the other bug is with jsonapi:
The data it was trying to set:
Will try to debug that one further when I have time, but that might be Thursday at the earliest.
Comment #12
kristiaanvandeneyndeThese are coming from JsonApiRequestValidator::validateQueryParams() and EntityResource:buildWrappedResponse() in that order. So probably also some check not making it into the cache contexts.
Comment #13
mxr576Could this be also supported/enforced by PHPStan (Drupal) too? (cc @mglaman)
Of course there are things that can be only detected in runtime, but some of these could be also identified via static code analyses...
Comment #14
joachim commentedI think I've figured out (some of) the jsonapi module failures -- will work more on them tomorrow.
Comment #15
joachim commentedI got as far as Drupal\jsonapi\Access\EntityAccessChecker::checkEntityAccess() where we have this:
with no user.roles context.
Comment #16
kristiaanvandeneyndeCurrent findings:
Then, the first time we try to store the return value of JsonApiRequestValidator::validateQueryParams(), leading to the creation of a second CacheRedirect, pointing to 'url.query_args'
But the second time, we try to store the return value of EntityResource::buildWrappedResponse(), which adds 'url.query_args:fields' and 'url.query_args:include' and somewhere along the line 'user.permissions' also makes it in there. Probably an access check, but have to figure that one out.
So the redirect at address
response:[request_format]=api_json:[route]=jsonapi.entity_view_display--entity_view_display.individual_SOME_HASH_1:[url.site]=http://webused to point at 'url.query_args', but now points to those new cache contexts leading to the final cache ID of:response:[request_format]=api_json:[route]=jsonapi.entity_view_display--entity_view_display.individual_SOME_HASH_1:[url.query_args:fields]=:[url.query_args:include]=:[url.site]=http://web:[user.permissions]=_SOME_HASH_2This means that whatever code comes after JsonApiDocumentTopLevelNormalizer::normalize() is currently inconsistent in setting cache contexts.
Comment #17
kristiaanvandeneyndeFixed it. JsonApiRequestValidator::validateQueryParams() only sets url.query_args when validation fails, but obviously it also needs to set that when validation succeeds. Because if the query args change, so can the validity.
Comment #18
kristiaanvandeneyndeOkay so this shows that there's only two detectable bugs in core right now:
CommentBreadcrumbBuilder::applies(), I "fixed" this the wrong way here to make tests go green. This caused StandardPerformanceTest to fail as there are now 2 more cache gets, which makes sense because we added a cache context. It also made phpunit tests report an unknown failure.
JsonApiRequestValidator::validateQueryParams(), I fixed this properly here but it needs a dedicated issue. Will create that one next.
There might be more core bugs, such as the one we just fixed in #3452426: Insufficient cacheability information bubbled up by UserAccessControlHandler, but we don't always have tests that are extensive enough for this VC fix to throw exceptions there. I.e.: They are flying under the radar and this exception will over time throw an exception on sites that have warmed their caches enough to run into the buggy code.
Comment #19
mxr576@kristiaanvandeneynde some tests are still failing as I can see, needs work? :-/
https://git.drupalcode.org/issue/drupal-3452181/-/pipelines/197002/test_...
Comment #20
kristiaanvandeneyndeThose fails stem from my "fix" for breadcrumbs. The actual fails caused by the VC changes are all green now. If we split off the breadcrumb and jsonapi fix everything will be green.
Comment #21
needs-review-queue-bot commentedThe Needs Review Queue Bot tested this issue. It no longer applies to Drupal core. Therefore, this issue status is now "Needs work".
This does not mean that the patch necessarily needs to be re-rolled or the MR rebased. Read the Issue Summary, the issue tags and the latest discussion here to determine what needs to be done.
Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.
Comment #22
kristiaanvandeneyndeI know it doesn't apply, that's the whole point :)
Comment #23
kristiaanvandeneyndeComment #24
kristiaanvandeneyndeBlocking issues have been fixed, simplifying MR and seeing what testbot thinks. Might add a little bit more hardening after tests go green.
Comment #25
kristiaanvandeneyndeAll green, now to add a bit more hardening. May change my mind but let's see what this catches.
Comment #26
kristiaanvandeneyndeAll green!
Ideally I add another test case for that last bit I just added. Don't have time right now, but hopefully over the next week or so. Shouldn't take too long but this week is really busy.
Comment #27
mxr576\o/
\o/
Comment #29
smustgrave commentedLoosely been trying to keep up with one.
Ran the test-only feature https://git.drupalcode.org/issue/drupal-3452181/-/jobs/2411247 and appears to have coverage.
Reading the threads I believe all feedback/questions have been addressed
Applied some super nitpicky void returns directly.
Believe this could be ready so going to mark it.
Comment #30
catchOne question on the MR - not sure about the 'redirect address' language in the exception.
Comment #31
kristiaanvandeneyndeI am working on the final comment and reverted the extra hardening locally. Just making sure we have ample test coverage.
Comment #32
kristiaanvandeneyndeAt this point I'm content with the new tests. We could always add more and more to prove things work, but right now we have extra test coverage for when things don't work. Which is what this issue was about.
Now final concern is how do we proceed? I think it was Alex Pott who raised the question that we cannot be throwing exceptions for things we previously allowed people to do wrong. The whole reason this MR started with exceptions was to catch the places in core that were broken and we've since fixed those in the spin-off issues.
Yet @mxr576 has an issue lined up which depends on this one so we can easily tell if his DPC changes open up even more cans of worms. So he would definitely benefit from this MR introducing the hardening as exceptions.
We could set a flag somewhere and log an error when its one value and throw an exception when its the other, but then the VariationCache service would need to get a logger injected only for us to remove it again in a next major version. So it would instantly be deprecated.
Any thoughts?
Comment #33
bbralaI'd be all for logging first then exceptions in 12 (or 13 if the totality of work is done to late).
Logging errors will make contrib and sites fix it eventually. I might even prefer it to stay logged on upgrades and be exceptions on new installs in a future maybe actually. Although not sure how much we've done that for core.
Comment #34
kristiaanvandeneyndeThe thing is we need tests to fail if they uncover incorrect (or dangerous) cache context use. I'm not sure we can achieve that for all types of tests if we switch to error logging.
Existing vs new sites shouldn't be a cut-off point as, if there is a bug in core, it would mean people couldn't install new Drupal websites whereas existing ones would continue working.
Comment #35
catchE_USER_WARNINGwill cause any test to fail afaik. PHPUnit 10 makes it harder to assert warnings compared to exceptions, which is a bit annoying, but otherwise it should work fine. In general I think it's a good idea when introducing a new exception to use E_USER_WARNING first, unless the exception is replacing a fatal error or similar.Comment #36
mxr576We have three environments where we want to handle this differently:
In my opinion:
In that case, it might work, but I’m not sure it fully addresses the issue: The VariationCache still needs a logger dependency, whether it's a NullLogger or a real logger, to support the "lenient" mode based on configuration. We could use setter injection to keep the constructor cleaner, though I generally prefer optional dependencies to be included in the constructor and represented with a null object when not required. This approach makes dependencies more explicit and easier to manage.
How about introducing a new service parameter, like
variation_cache.cache_manipulation.reporting_mode, with three different options?Based on this value, a different logger could be injected into the VariationCache service. Exception throwing could also be managed by a logger, although that isn't strictly necessary. This approach would allow for flexible configuration while keeping the handling of cache manipulation issues consistent and clear.
Comment #37
mxr576We cannot foresee how long we would need to support different favor of lenient modes, but following my idea, in D12 we could change the default mode to "exception" in development environments in
development.services.ymlin a hope that we can start throwing exceptions only in D13.Comment #38
kristiaanvandeneyndeWhile sometimes I would agree with a multi-tiered approach, here I'm not a fan. I'd like to keep the VariationCache as lightweight as possible and introducing all these options seems to complicate matters even more for a piece of core that few people understand. If we can trigger_error() with E_USER_WARNING and make tests fail on those, we're already in the sweet spot IMO: Tests will fail, but sites will not.
Then, in a next major we should turn the trigger_error into exceptions because we really want to prevent people from setting incorrect cache contexts. It would stop a lot of potential security issues dead in its tracks before they can even get a chance to make it into core/contrib.
Comment #39
mxr576@kristiaanvandeneynde I see the value in your suggestion and basically what we would lose in contrast with my suggestion that a service level switch between "appear in logs or not", but in other layers, E_USER_WARNING-s can be silenced. (Unless we hit: #1267246: Drupal overriding error_reporting setting in php.ini ;S)
So you have got my vote on
E_USER_WARNING.Comment #40
kristiaanvandeneyndeLooks like we can no longer test for E_USER_ERROR. It's still there in PhpUnit 10 (even though it claims otherwise), but will be removed.
So now what? :/
Comment #41
kristiaanvandeneyndeOkay, did some digging and following #3417650: [meta] Replace calls to ::expectError*() and ::expectWarning*() it seems we actually changed errors into exceptions in other places in core. Which means that, there too, existing code could start breaking.
With that in mind, would it be fine to just commit the MR as is (with the exception) and deal with the fallout as people stumble upon broken/unsafe cache context assignment? Keep in mind that a site which violates the thing we're detecting here will have broken caching and potentially even a security issue, so maybe we have to break loudly to detect these flaws ASAP?
Comment #42
mxr576:-(
To be honest, the results are mixed. While I understand and even sympathize with the idea of failing with an exception—especially when security is involved— even if this approach could lead to more work for us during the next minor release. My main problem that these exceptions will only surface at runtime.
Changed to logging:
Changed to throwing a exception:
So in production, while there may be an underlying issue that has existed for months, if it hasn’t been discovered yet, it shouldn’t be exposed to end users.
Unless the issue is easy to detect, you have extensive automated test coverage, or exceptionally thorough manual testers, there’s a risk that upgrading to a new Drupal minor/major version with this change could result in issues only being caught in Production—despite significant testing efforts beforehand.
Comment #43
mxr576Who could make the explicit call on exception or not? @alexpott? @catch?
Comment #44
mxr576This could work as an alternative workaround for the removed support of "PHPUnit 10 no longer converts E_* to exceptions, therefore E_* can no longer be expected.". When error handler is not restored before tearDown(), PHPUnit rightfully complains about that.
Comment #45
catchPHPUnit has dropped support for expecting errors, it hasn't stopped failing on E_USER_WARNING, so the dropped support only makes it a pain to test the warning itself, not to get a test failure on an unexpected warning. It's still possible to test for warnings by swapping out the error handler mid-test too iirc. edit: crossposed with #44 but yes exactly that.
Comment #46
kristiaanvandeneyndeOkay, let's do #44. Seems like a really nice temporary solution until we can switch to exceptions.
Comment #47
andypostbtw
E_USER_ERRORis deprecated, see related #3465827: Stop passing E_USER_ERROR to trigger_error() on PHP 8.4Comment #48
kristiaanvandeneyndeThe E_USER_ERROR is only temporary, though. Will be removed in Drupal 12.0.0 and then we'll throw a proper exception. Your issue seems to also have a MR targeted at D12, so I suppose we're fine? I adjusted the MR, will push in a second after creating a follow-up to point to.
Comment #49
kristiaanvandeneyndeUpdated the MR, follow-up here: #3468921: Convert trigger_error in VariationCache into a LogicException
Comment #50
andypostUsing
E_USER_WARNINGlooks ok here but beware usingE_USER_ERRORit should be deprecates in 10.4 or at least silenced to work on PHP 8.4Comment #51
kristiaanvandeneyndeYeah I switched to E_USER_WARNING to be sure.
Comment #52
mxr576Wait... it just came to my mind, how the current version with trigger_error() will help with catching any unwanted side-effects in #2973356: Cacheability information from route access checker access results are ignored by dynamic_page_cache, since as we identified, trigger_error() only makes PHPUnit tests fail with a custom error handler...
Comment #53
mxr576... I tried to summarize what the latest version of MR means for Drupal core and contrib and this was the time when I realized that neither core tests (?) nor contrib tests are going to fail automatically when this prevention mechanism gets triggered inside VariationCache.
Am I missing something?
Comment #54
kristiaanvandeneyndeFrom #45:
Comment #55
mxr576So we double checked the actual behavior with @kristiaanvandeneynde and here a quick summary of our Slack conversation.
FTR: For the results below I have removed
expectException...and the custom error handler from the test and replaced it with a simpleself::assertTrue(1 === 1);.PHPUnit CLI fails with a non-zero exit code that should make both Drupal Core tests on Gitlab CI and contrib modules fail.
PHPStorm's PHPUnit reports the same result, the only thing that concerns me that the test execution is flagged as successful, which could lead to DX issues... but that could be a PHPStorm issue/misconfiguration. (See attached screenshot)
Comment #56
mxr576Based on the previous comment, moving this to RTBC
Comment #57
kristiaanvandeneyndeAlso this no longer depends on other parts of core being fixed, so review bot can return.
Comment #58
mxr576...also this issue should not be related because
E_USER_WARNINGis in use. See #50 #51Comment #59
catchOne question on the MR - just moving an additional explanatory comment from the test to the place we actually trigger the error. Other than this, it looks good to me though.
Comment #60
kristiaanvandeneyndeMoved from VariationCacheTest to VariationCache and added the following to it to make it even more clear:
Comment #63
catchComments look much better!
Committed/pushed to 11.x and cherry-picked to 10.4.x, thanks!
Comment #65
kristiaanvandeneyndeAwesome, thanks!
Comment #66
quietone commentedThere are failures on 10.4.x in the following tests and those test pass locally when this change is reverted.
Comment #67
kristiaanvandeneyndeI have already addressed this on Drupal Slack. I even fixed something in there: #3472014: BookBreadcrumbBuilder needs to always set the route.book_navigation cache context
The point is that the tests now fail if your cache contexts are broken or insecure. This was the whole point of this issue. While we shied away from straight up breaking existing sites by throwing exceptions, the compromise was that we would be breaking tests instead to show people something was wrong.
Tests broke in Book, I had a look and fixed it. There may be more to fix, but that is really a bug in Book and not core.
Comment #68
quietone commented@kristiaanvandeneynde, thanks for fixing the Book contrib module. Is there an issue to make a similar fix for 10.4.x?
Changing status because the daily tests are failing on 10.4.x.
Comment #69
smustgrave commentedOpened #3472592: Fix Book breadcrumb cacheability will have an MR up in a few seconds.
Comment #70
wim leersVery good hardening, I'm very glad to have you watching over us as a guardian, @kristiaanvandeneynde! 🙏👏 This feels like an oversight of me, @FabianX and others who landed the "cache redirect" concept originally in Drupal core a ~decade ago.
Thanks for improving the DX with such clear messages 👍👏 Impressive that we can now automatically detect errors in the logic!
Comment #71
kristiaanvandeneyndeThanks for the kind words @wim leers, much appreciated.
Keep in mind the detection only works in tests as far as the caches are properly warmed. It will over time catch everything on a live site, especially when things are about to go wrong. The goal is to catch as many of these logic errors now so we can fix them and then throw exceptions in the next major.