Problem/Motivation
Random fail in PathWorkspacesTest:
PHPUnit 9.6.8 by Sebastian Bergmann and contributors.
Testing Drupal\Tests\workspaces\Functional\PathWorkspacesTest
F.. 3 / 3 (100%)
Time: 00:42.107, Memory: 4.00 MB
There was 1 failure:
1) Drupal\Tests\workspaces\Functional\PathWorkspacesTest::testPathAliases
Failed asserting that a boolean is not empty.
/var/www/html/vendor/phpunit/phpunit/src/Framework/Constraint/Constraint.php:122
/var/www/html/vendor/phpunit/phpunit/src/Framework/Constraint/Constraint.php:55
/var/www/html/core/modules/workspaces/tests/src/Functional/PathWorkspacesTest.php:109
/var/www/html/vendor/phpunit/phpunit/src/Framework/TestResult.php:728
FAILURES!
Tests: 3, Assertions: 126, Failures: 1.
Only affects 11.x but across PHP versions and database drivers.
https://www.drupal.org/pift-ci-job/2719595
https://www.drupal.org/pift-ci-job/2718910
https://www.drupal.org/pift-ci-job/2719593
https://www.drupal.org/pift-ci-job/2719016
https://www.drupal.org/pift-ci-job/2719804
Discussed briefly in Slack, #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration is possibly a suspect issue.
Steps to reproduce
?
Proposed resolution
Remaining tasks
User interface changes
API changes
Data model changes
Release notes snippet
| Comment | File | Size | Author |
|---|---|---|---|
| #19 | 50x_functional_path_tests_18+MR-4425.patch | 3.27 KB | znerol |
| #18 | 50x_functional_path_tests_17+MR-4425.patch | 3.27 KB | znerol |
| #12 | 50x_functional_path_tests-3295790-reverted.patch | 22.04 KB | spokje |
| #9 | 50x_functional_path_tests.patch | 1.94 KB | spokje |
Issue fork drupal-3375584
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #2
longwaveRunning the failing test 100x to see if this triggers it.
Comment #3
longwaveComment #4
znerol commentedComment #5
longwaveTrying again with 1500x.
Comment #6
longwave1500 passes suggests it is somehow an interaction with another test. Trying all functional tests 50x at @Spokje's suggestion.
Comment #7
spokjeI was trying to run all tests from
@group path, butcore/scripts/run-tests.shdoesn't seem to be handle a "simple" thing like--group?I see support for module, class, directory but not the standard
phpunit --group foo?Comment #8
longwaveThe default argument is a group or comma-separated list of groups, so you just do
core/scripts/run-tests.sh path. That is why the DrupalCI argument istestgroupsas well, it's a hack that you can use it to pass other switches.Comment #9
spokjeAh, crap.
I knew that once, many moons ago.
Thanks @longwave
I think here's you're (rather big) fail-canary-in-a-coal-mine to start with.
Comment #10
znerol commentedIs there any chance that the test generates a node with
id() != 1under some circumstances?I do not understand any of the workspaces stuff. But from skimming the test it looks like the assertions aren't really consistent. In the first part there is a
drupalGetwhich constructs the path using$node->id()while in the assertions following it looks like the path is hard-coded topreload-paths:/node/1.Comment #11
longwaveIt shouldn't be able to, as each test is supposed to be independent, and should run the same each time, unless we explicitly say that something is random.
Comment #12
spokjen=50, but it seems #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration is a very likely suspect.
This patch was brought to you by the Coal Mine Canary Liberation Front
Comment #13
znerol commentedThat hunk maybe? In that case we'd need a better way to wait for termination to have terminated in the child site.
Comment #14
spokjeYes please!
For me personally
sleep(1);is a very undeterministic way of determining something is finished.Comment #15
longwaveI think you might be right. In PathAliasTest for example we had to add:
In PathWorkspacesTest the failure is also looking for this cache key:
So the easy fix is to add
sleep(1);to PathWorkspacesTest, but is this the symptom of something worse? Has #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration introduced the possibility of cache coherency bugs or race conditions, if we need to sleep in tests?Comment #16
znerol commentedSymfony docs about kernel termination state that
and further (probably needs an update now):
Side note: I guess the symfony docs need an update after upstream upstream PR 46931.
After reading the docs, I feel that the
terminateevent should be used for stuff which also could be performed in a background task (a queue or cron). Not sure whether the usage of the terminate event in Drupal core and contrib falls into this category.On the other hand, storing a cache doesn't actually affect the current request. And Drupal needs to cope with concurrent requests / concurrent cache rebuilds anyway. No matter whether concurrent requests come from one user agent or from many.
Comment #18
znerol commentedChecking MR together with patch from #9
Comment #19
znerol commentedMoving around sleeps.
Comment #20
andypostThe test fails mostly every time in #3374223: Fix deprecated overloaded function usage in PHP 8.3
Comment #21
znerol commentedFiled a follow-up #3375959: Add a way to delay executions in test runner until terminate event completed in the child site
Comment #22
znerol commentedThe sqlite test in #19 failed with General error: 5 database is locked. Is that a result of running tests concurrently?
Comment #23
longwaveSQLite is prone to database locks, it is not designed for the sorts of heavy loads and high concurrency that we have in test runs. I think that is OK to ignore here, I triggered a retest to see if it goes away.
Comment #24
catchYeah I think the way we use it is fine and the problems it's causing are test-specific. #3375959: Add a way to delay executions in test runner until terminate event completed in the child site is a good idea.
Comment #25
spokjeNow I hate to be _that_ guy, but hey, I _am_ that guy:
The normal routine to prove a random failure is fixed is to run the failing patch and the patch with the fix at the same time, whilst the latter has to have ~8000 - 10.000 failure free runs to prove it's credibility.
Seeing that the current 50x run takes around 10 minutes and that concurrent tests are, at least in my experience, getting wonky/throwing unpredictable unrelated randoms around the 1 hour mark, I think we need a 300x run-patch, which we then trigger 8000/300 ~ 27 times.
The best way for that is using the "Custom parameters" option when creating a test-run.
After that first time, it's "just" a matter of browser-back for 27x and pressing the submit with the same chosen options (Those tend to stick after the 2nd time browser-back in my, far too big, experience).
We could also try to bring the number of different testable classes in the "canary"/failing patch down, so we can have more runs of that.
But that's very likely to take (much) more time then blindly monkey-bashing the browser-back button.
Also: Big yay! for #3375959: Add a way to delay executions in test runner until terminate event completed in the child site
Comment #27
longwaveDiscussed this with @Spokje in Slack and @znerol in person at Drupal Dev Days. This is affecting contributions at DDD as otherwise green patches are hitting it, so I am making the pragmatic call to commit this from NR. The fix appears to be the correct one given that we needed similar sleeps elsewhere, and this should unblock developers here; it won't make anything worse if it is the wrong fix, and we have the followup being actively worked on to handle this all in a better way.
Committed and pushed 2dad249ac0 to 11.x. Thanks!
Comment #28
catch+1 on the commit from NR, I guess this can be a posthumous RTBC.
@Spokje that has turned out to be very important on issues where we're unskipping skipped random failures, but I think for 'quick fix' issues like this where the issue is actively failing, it's a bit less of a concern since the commit is a lot less likely to re-introduce a random failure even if it doesn't 100% fix one.