Problem/Motivation

Various test modules use the state API to track things between the tested site and the test, after #2575105: Use cache collector for state this creates race conditions, in combination with #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration - because the state cache item is written as a post response task.

The race condition is as follows:

1. drupalGet(), state cache miss.
2. Test class, calls State:set(), which calls invalidate, but there's nothing to invalidate yet
3. request from #1 is still running end of request tasks, we get to updateCache() in CacheCollector, there was no cache at the beginning of the request, and there's still no cache now, so it just writes to the cache.
4. Test class calls drupalGet(), and this now gets a cache without whatever happened in #2.

State has an explicit race condition prevention mechanism when a request begins with a cache item, writes to it at the end of the request, but it's been upated in-between. However, it doesn't handle the case when there is a cache miss at the beginning of the request and the cache is invalidated again before it tries to write.

This case would be extremely unlikely on a real site - you usually either have an empty cache after an invalidation, in which case it's safe to write, or you have a non-empty cache before an invalidation, in which case we already handle it, but the 'clean' environment of the test means we're reproducing it in a fraction of test runs.

We can prevent this by writing a cache item from State::set(), and in CacheCollector, not writing to the cache, if it was empty at the beginning of the request and has been written before the end of the request. Most CacheCollector implementations don't need this protection because they are caching discovery and similar - which can be invalidated, but not individually written to.

Steps to reproduce

Proposed resolution

For test randomness, either using TestWaitTerminate in these tests, or switching from state to key/value would prevent the random test failures.

However, it's also possible to add race condition protection to CacheCollector + state.MR 7336 attempts to do this and is the MR to review.

Remaining tasks

User interface changes

API changes

Data model changes

Release notes snippet

Issue fork drupal-3438424

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Comments

catch created an issue. See original summary.

Berdir made their first commit to this issue’s fork.

catch’s picture

Title: [random test failures] Use key/value directly instead of state in test modules » [random test failures] Race condition in state when individual keys are set with an empty cache
catch’s picture

Issue summary: View changes
Status: Active » Needs review

I need to back out the test changes here, they're just for the repeated test run job so that we only run one test method 100 times instead of the entire test, but would be great to get reviews - and leaving it in for now in case we want to re-run that job even more times.

longwave’s picture

Status: Needs review » Needs work

Looks good to me - the fixes make sense and the comments definitely help explain what's going on. As was discussed in Slack this was originally a concern with the cache collector but never was an issue in practice until tests started hitting cold caches occasionally in this way. For me the evidence in #6 is enough to commit this, so NW to remove the test changes.

catch’s picture

Status: Needs work » Needs review

Yeah we covered the warm (but invalid) cache situation with the cache collector, which is the main case, but cold + invalid just never came up or seemed realistic. Additionally you would never hit this with a discovery cache (which is most cache collector use cases), because it's unlikely to get invalidated twice in the same request or anything, needs to be extremely high write. I'm not sure a site would even run into this with state because fully cold cache then immediate write situation seems unlikely there too, but we're clearly hitting it in tests.

Rebased to drop the test modifications.

longwave’s picture

Status: Needs review » Reviewed & tested by the community

MR!7336 looks great.

catch’s picture

Category: Task » Bug report

Started off as a task but definitely a bugfix now.

catch’s picture

Had one more look over this and realised the comment in State::set() was slightly out of date and could be clearer. Updated that but leaving RTBC since it's comment-only.

alexpott changed the visibility of the branch 3438424-random-test-failures to hidden.

alexpott changed the visibility of the branch 3438424-usleep to hidden.

alexpott’s picture

Status: Reviewed & tested by the community » Needs work

The latest changes have the MR failing tests... see https://git.drupalcode.org/project/drupal/-/merge_requests/7336/pipelines

catch’s picture

Status: Needs work » Reviewed & tested by the community

Managed to commit to the wrong branch... dropped from the MR now.

alexpott’s picture

Issue summary: View changes
Status: Reviewed & tested by the community » Needs work

Still failing tests...

core/tests/Drupal/Tests/Core/Cache/CacheCollectorTest.php
core/profiles/standard/tests/src/FunctionalJavascript/StandardPerformanceTest.php

I think having the MR link in the issue messes up d.o so it puts the failing test link in the wrong place... I've removed that from the issue summary so hopefully this is easier to see.

catch’s picture

OK now I can see the test failures, that's helpful.

The unit test failure is very handy, I wasn't sure how to test the race condition here, but turns out we already had a unit test for a cache item being set in the middle of the request. I've split that one method into two, so it now tests both the warm cache situation (already covered but the test wasn't explicitly testing this) and cold cache situation (what we're fixing here and what the test coverage was actually testing but with different expectations).

Getting about 90% chromedriver crashes when running functional tests locally, so having to hit and hope for the updates there :/

catch’s picture

Status: Needs work » Needs review

Performance tests are green now - it's one extra state cache set + the related key/value queries collected during that request, not surprising since we're being more conservative about writing to the cache on completely cold starts now.

alexpott’s picture

Status: Needs review » Reviewed & tested by the community

Given catch's changes are to a test to fix and improved comments going to rtbc and commit. Hopefully this will reduce some of the random fails due to the state cache collector from occurring.

alexpott’s picture

Version: 11.x-dev » 10.3.x-dev
Status: Reviewed & tested by the community » Fixed

Committed 3a3e618 and pushed to 11.x. Thanks!
Committed af46c48 and pushed to 10.3.x. Thanks!

  • alexpott committed af46c48e on 10.3.x
    Issue #3438424 by catch, Berdir, alexpott, longwave: [random test...

  • alexpott committed 3a3e6186 on 11.x
    Issue #3438424 by catch, Berdir, alexpott, longwave: [random test...

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.