Problem/Motivation
Various test modules use the state API to track things between the tested site and the test, after #2575105: Use cache collector for state this creates race conditions, in combination with #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration - because the state cache item is written as a post response task.
The race condition is as follows:
1. drupalGet(), state cache miss.
2. Test class, calls State:set(), which calls invalidate, but there's nothing to invalidate yet
3. request from #1 is still running end of request tasks, we get to updateCache() in CacheCollector, there was no cache at the beginning of the request, and there's still no cache now, so it just writes to the cache.
4. Test class calls drupalGet(), and this now gets a cache without whatever happened in #2.
State has an explicit race condition prevention mechanism when a request begins with a cache item, writes to it at the end of the request, but it's been upated in-between. However, it doesn't handle the case when there is a cache miss at the beginning of the request and the cache is invalidated again before it tries to write.
This case would be extremely unlikely on a real site - you usually either have an empty cache after an invalidation, in which case it's safe to write, or you have a non-empty cache before an invalidation, in which case we already handle it, but the 'clean' environment of the test means we're reproducing it in a fraction of test runs.
We can prevent this by writing a cache item from State::set(), and in CacheCollector, not writing to the cache, if it was empty at the beginning of the request and has been written before the end of the request. Most CacheCollector implementations don't need this protection because they are caching discovery and similar - which can be invalidated, but not individually written to.
Steps to reproduce
Proposed resolution
For test randomness, either using TestWaitTerminate in these tests, or switching from state to key/value would prevent the random test failures.
However, it's also possible to add race condition protection to CacheCollector + state.MR 7336 attempts to do this and is the MR to review.
Remaining tasks
User interface changes
API changes
Data model changes
Release notes snippet
Issue fork drupal-3438424
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #6
catchI ran WorkflowUiTest 100 times and got 3-4 fails.
https://git.drupalcode.org/project/drupal/-/jobs/1244616
Then with the fix, I ran it 500 times and got 0 fails:
https://git.drupalcode.org/project/drupal/-/jobs/1244933
https://git.drupalcode.org/project/drupal/-/jobs/1248600
https://git.drupalcode.org/project/drupal/-/jobs/1248679
https://git.drupalcode.org/project/drupal/-/jobs/1248737
https://git.drupalcode.org/project/drupal/-/jobs/1248755
Comment #7
catchComment #8
catchI need to back out the test changes here, they're just for the repeated test run job so that we only run one test method 100 times instead of the entire test, but would be great to get reviews - and leaving it in for now in case we want to re-run that job even more times.
Comment #9
longwaveLooks good to me - the fixes make sense and the comments definitely help explain what's going on. As was discussed in Slack this was originally a concern with the cache collector but never was an issue in practice until tests started hitting cold caches occasionally in this way. For me the evidence in #6 is enough to commit this, so NW to remove the test changes.
Comment #10
catchYeah we covered the warm (but invalid) cache situation with the cache collector, which is the main case, but cold + invalid just never came up or seemed realistic. Additionally you would never hit this with a discovery cache (which is most cache collector use cases), because it's unlikely to get invalidated twice in the same request or anything, needs to be extremely high write. I'm not sure a site would even run into this with state because fully cold cache then immediate write situation seems unlikely there too, but we're clearly hitting it in tests.
Rebased to drop the test modifications.
Comment #11
longwaveMR!7336 looks great.
Comment #12
catchStarted off as a task but definitely a bugfix now.
Comment #13
catchHad one more look over this and realised the comment in State::set() was slightly out of date and could be clearer. Updated that but leaving RTBC since it's comment-only.
Comment #16
alexpottThe latest changes have the MR failing tests... see https://git.drupalcode.org/project/drupal/-/merge_requests/7336/pipelines
Comment #17
catchManaged to commit to the wrong branch... dropped from the MR now.
Comment #18
alexpottStill failing tests...
core/tests/Drupal/Tests/Core/Cache/CacheCollectorTest.php
core/profiles/standard/tests/src/FunctionalJavascript/StandardPerformanceTest.php
I think having the MR link in the issue messes up d.o so it puts the failing test link in the wrong place... I've removed that from the issue summary so hopefully this is easier to see.
Comment #19
catchOK now I can see the test failures, that's helpful.
The unit test failure is very handy, I wasn't sure how to test the race condition here, but turns out we already had a unit test for a cache item being set in the middle of the request. I've split that one method into two, so it now tests both the warm cache situation (already covered but the test wasn't explicitly testing this) and cold cache situation (what we're fixing here and what the test coverage was actually testing but with different expectations).
Getting about 90% chromedriver crashes when running functional tests locally, so having to hit and hope for the updates there :/
Comment #20
catchPerformance tests are green now - it's one extra state cache set + the related key/value queries collected during that request, not surprising since we're being more conservative about writing to the cache on completely cold starts now.
Comment #21
alexpottGiven catch's changes are to a test to fix and improved comments going to rtbc and commit. Hopefully this will reduce some of the random fails due to the state cache collector from occurring.
Comment #22
alexpottCommitted 3a3e618 and pushed to 11.x. Thanks!
Committed af46c48 and pushed to 10.3.x. Thanks!