[random test failure] Random failure in PathWorkspacesTest [#3375584]

Problem/Motivation

Random fail in PathWorkspacesTest:

PHPUnit 9.6.8 by Sebastian Bergmann and contributors.

Testing Drupal\Tests\workspaces\Functional\PathWorkspacesTest
F..                                                                 3 / 3 (100%)

Time: 00:42.107, Memory: 4.00 MB

There was 1 failure:

1) Drupal\Tests\workspaces\Functional\PathWorkspacesTest::testPathAliases
Failed asserting that a boolean is not empty.

/var/www/html/vendor/phpunit/phpunit/src/Framework/Constraint/Constraint.php:122
/var/www/html/vendor/phpunit/phpunit/src/Framework/Constraint/Constraint.php:55
/var/www/html/core/modules/workspaces/tests/src/Functional/PathWorkspacesTest.php:109
/var/www/html/vendor/phpunit/phpunit/src/Framework/TestResult.php:728

FAILURES!
Tests: 3, Assertions: 126, Failures: 1.

Only affects 11.x, but occurs across PHP versions and database drivers.

https://www.drupal.org/pift-ci-job/2719595
https://www.drupal.org/pift-ci-job/2718910
https://www.drupal.org/pift-ci-job/2719593
https://www.drupal.org/pift-ci-job/2719016
https://www.drupal.org/pift-ci-job/2719804

Discussed briefly in Slack; #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration is a possible suspect.

Steps to reproduce

Proposed resolution

Remaining tasks

User interface changes

API changes

Data model changes

Release notes snippet

Comment	File	Size	Author
#19	50x_functional_path_tests_18+MR-4425.patch	3.27 KB	znerol
#18	50x_functional_path_tests_17+MR-4425.patch	3.27 KB	znerol
#12	50x_functional_path_tests-3295790-reverted.patch	22.04 KB	spokje
#9	50x_functional_path_tests.patch	1.94 KB	spokje
#6	PHPUnit-Functional-50x.patch	1.67 KB	longwave
#5	PathWorkspacesTest-1500x.patch	2.04 KB	longwave
#2	PathWorkspacesTest-100x.patch	2.04 KB	longwave

Issue fork drupal-3375584

3375584-random-test-failure changes, plain diff MR !4425

Comments

Comment #1

19 July 2023 at 13:33

longwave created an issue. See original summary.

Comment #2

longwave commented 19 July 2023 at 14:18

Status: Active » Needs review

Status	File	Size
new	PathWorkspacesTest-100x.patch	2.04 KB

Running the failing test 100x to see if this triggers it.

Comment #3

longwave commented 19 July 2023 at 14:18

Issue tags: +ddd2023

Comment #4

znerol commented 19 July 2023 at 21:24

Issue summary: View changes

Comment #5

longwave

Status	File	Size
new	PathWorkspacesTest-1500x.patch	2.04 KB

Comment #6

longwave commented 20 July 2023 at 12:26

Status	File	Size
new	PHPUnit-Functional-50x.patch	1.67 KB

1500 passes suggests it is somehow an interaction with another test. Trying all functional tests 50x at @Spokje's suggestion.

Comment #7

spokje commented 20 July 2023 at 12:35

I was trying to run all tests from @group path, but core/scripts/run-tests.sh doesn't seem to handle a "simple" thing like --group?

I see support for module, class, directory but not the standard phpunit --group foo?

Comment #8

longwave commented 20 July 2023 at 12:41

The default argument is a group or comma-separated list of groups, so you just run core/scripts/run-tests.sh path. That is also why the DrupalCI argument is called testgroups; it is only a hack that it can also be used to pass other switches.

Comment #9

spokje commented 20 July 2023 at 12:56

Status: Needs review » Active

Status	File	Size
new	50x_functional_path_tests.patch	1.94 KB

3 files were hidden/shown/deleted

Status	File	Size
hidden	PathWorkspacesTest-100x.patch	2.04 KB
hidden	PathWorkspacesTest-1500x.patch	2.04 KB
hidden	PHPUnit-Functional-50x.patch	1.67 KB

Ah, crap.

I knew that once, many moons ago.
Thanks @longwave

I think here's your (rather big) fail-canary-in-a-coal-mine to start with.

Comment #10

znerol commented 20 July 2023 at 13:23

Is there any chance that the test generates a node with id() != 1 under some circumstances?

I do not understand any of the workspaces stuff. But from skimming the test it looks like the assertions aren't really consistent. In the first part there is a drupalGet which constructs the path using $node->id() while in the assertions following it looks like the path is hard-coded to preload-paths:/node/1.

  /**
   * Tests path aliases with workspaces.
   */
  public function testPathAliases() {
    // Create a published node in Live, without an alias.
    $node = $this->drupalCreateNode([
      'type' => 'article',
      'status' => TRUE,
    ]);

[...]

    $edit = [
      'path[0][alias]' => '/' . $this->randomMachineName(),
    ];
    $this->drupalGet('node/' . $node->id() . '/edit');
    $this->submitForm($edit, 'Save');

[...]

    // Publish the workspace and check that the alias can be accessed in Live.
    $stage->publish();
    $this->assertAccessiblePaths([$path]);
    $this->assertNotEmpty(\Drupal::cache('data')->get('preload-paths:/node/1'));
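
One way to remove that inconsistency would be to derive the cache ID from the node rather than hard-coding it. A minimal sketch, assuming the cache ID is always the prefix preload-paths: followed by the node's system path, as in the assertion quoted above; preloadPathsCacheId() is a hypothetical helper for illustration, not an existing Drupal API:

```php
<?php

/**
 * Builds the preloaded-paths cache ID for a node (hypothetical helper).
 *
 * Assumption: the ID has the form "preload-paths:/node/<nid>", matching the
 * hard-coded value in the assertion quoted above.
 */
function preloadPathsCacheId(int $nodeId): string {
  return 'preload-paths:/node/' . $nodeId;
}
```

In the test this could replace the literal, for example \Drupal::cache('data')->get(preloadPathsCacheId((int) $node->id())).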

Comment #11

longwave commented 20 July 2023 at 13:31

It shouldn't be able to, as each test is supposed to be independent, and should run the same each time, unless we explicitly say that something is random.

Comment #12

spokje commented 20 July 2023 at 13:52

Status	File	Size
new	50x_functional_path_tests-3295790-reverted.patch	22.04 KB

n=50, but it seems #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration is a very likely suspect.

This patch was brought to you by the Coal Mine Canary Liberation Front

Comment #13

znerol commented 20 July 2023 at 14:03

That hunk maybe? In that case we'd need a better way to wait for the terminate phase to have completed in the child site.

    // Cacheable normalizations are written after the response is flushed to
    // the client; give the server a chance to complete this work.
    sleep(1);

Comment #14

spokje commented 20 July 2023 at 14:07

Yes please!

For me personally, sleep(1); is a very nondeterministic way of determining that something has finished.
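
A polling helper is one possible alternative to a fixed sleep(1). The sketch below is an illustration under assumptions, not the fix that was committed here: waitUntil() and its parameters are hypothetical names, and it only bounds the wait instead of making the test fully deterministic.

```php
<?php

/**
 * Polls a condition until it holds or a timeout elapses (hypothetical helper).
 *
 * @param callable $condition
 *   Callback returning a truthy value once the awaited state is reached.
 * @param float $timeoutSeconds
 *   Maximum time to keep polling.
 * @param int $intervalMicroseconds
 *   Pause between two checks.
 *
 * @return bool
 *   TRUE if the condition held before the timeout, FALSE otherwise.
 */
function waitUntil(callable $condition, float $timeoutSeconds = 5.0, int $intervalMicroseconds = 100000): bool {
  $deadline = microtime(TRUE) + $timeoutSeconds;
  do {
    if ($condition()) {
      return TRUE;
    }
    usleep($intervalMicroseconds);
  } while (microtime(TRUE) < $deadline);
  // Check one last time after the deadline has passed.
  return (bool) $condition();
}
```

In a functional test this might be used as $this->assertTrue(waitUntil(fn () => \Drupal::cache('data')->get($cid) !== FALSE)); where $cid is the expected cache ID. The test then returns as soon as the entry exists and fails only after the timeout.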

Comment #15

longwave commented 20 July 2023 at 14:11

I think you might be right. In PathAliasTest for example we had to add:

+    // The \Drupal\path_alias\AliasWhitelist service performs cache clears after
+    // Drupal has flushed the response to the client; wait for this to finish.
+    sleep(1);
     $this->assertNotEmpty(\Drupal::cache('data')->get('preload-paths:' . $edit['path[0][value]']), 'Cache entry was created.');

In PathWorkspacesTest the failure is also looking for this cache key:

    $this->assertNotEmpty(\Drupal::cache('data')->get('preload-paths:/node/1'));

So the easy fix is to add sleep(1); to PathWorkspacesTest, but is this the symptom of something worse? Has #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration introduced the possibility of cache coherency bugs or race conditions, if we need to sleep in tests?

Comment #16

znerol commented 20 July 2023 at 15:09

Symfony docs about kernel termination state that

you will trigger the kernel.terminate event where you can perform certain actions that you may have delayed in order to return the response as quickly as possible to the client (e.g. sending emails).

and further (probably needs an update now):

at the moment, only the PHP FPM server API is able to send a response to the client while the server's PHP process still performs some tasks. With all other server APIs, listeners to kernel.terminate are still executed, but the response is not sent to the client until they are all completed.

Side note: I guess the Symfony docs need an update after upstream PR 46931.

After reading the docs, I feel that the terminate event should be used for stuff which also could be performed in a background task (a queue or cron). Not sure whether the usage of the terminate event in Drupal core and contrib falls into this category.

On the other hand, storing a cache doesn't actually affect the current request. And Drupal needs to cope with concurrent requests / concurrent cache rebuilds anyway. No matter whether concurrent requests come from one user agent or from many.

Comment #17

20 July 2023 at 15:13

znerol opened merge request !4425

Comment #18

znerol commented 20 July 2023 at 15:16

Status	File	Size
new	50x_functional_path_tests_17+MR-4425.patch	3.27 KB

Checking the MR together with the patch from #9.

Comment #19

znerol commented 20 July 2023 at 19:43

Status	File	Size
new	50x_functional_path_tests_18+MR-4425.patch	3.27 KB

Moving around sleeps.

Comment #20

andypost commented 20 July 2023 at 19:47

The test fails almost every time in #3374223: Fix deprecated overloaded function usage in PHP 8.3

Comment #21

znerol commented 20 July 2023 at 20:58

Filed a follow-up #3375959: Add a way to delay executions in test runner until terminate event completed in the child site

Comment #22

znerol commented 21 July 2023 at 09:01

Status: Active » Needs review

The sqlite test in #19 failed with General error: 5 database is locked. Is that a result of running tests concurrently?

Comment #23

longwave commented 21 July 2023 at 09:51

SQLite is prone to database locks; it is not designed for the sorts of heavy loads and high concurrency that we have in test runs. I think that is OK to ignore here; I triggered a retest to see if it goes away.

Comment #24

catch commented 21 July 2023 at 09:55

On the other hand, storing a cache doesn't actually affect the current request. And Drupal needs to cope with concurrent requests / concurrent cache rebuilds anyway. No matter whether concurrent requests come from one user agent or from many.

Yeah I think the way we use it is fine and the problems it's causing are test-specific. #3375959: Add a way to delay executions in test runner until terminate event completed in the child site is a good idea.

Comment #25

spokje commented 21 July 2023 at 11:02

Now I hate to be _that_ guy, but hey, I _am_ that guy:

The normal routine to prove a random failure is fixed is to run the failing patch and the patch with the fix at the same time, while the latter needs ~8,000 to 10,000 failure-free runs to prove its credibility.

Seeing that the current 50x run takes around 10 minutes and that concurrent tests are, at least in my experience, getting wonky/throwing unpredictable unrelated randoms around the 1 hour mark, I think we need a 300x run-patch, which we then trigger 8000/300 ~ 27 times.
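
The arithmetic behind those numbers can be sketched as follows. This assumes that job duration scales linearly with the repeat count, which is an assumption and not something measured in this issue; both function names are hypothetical.

```php
<?php

/**
 * Number of test jobs needed to reach a target number of runs.
 */
function retriggersNeeded(int $targetRuns, int $runsPerJob): int {
  return (int) ceil($targetRuns / $runsPerJob);
}

/**
 * Estimated job duration, assuming linear scaling from a 50x job at 10 minutes.
 */
function minutesPerJob(int $runsPerJob, int $baselineRuns = 50, float $baselineMinutes = 10.0): float {
  return $runsPerJob / $baselineRuns * $baselineMinutes;
}
```

Under that assumption a 300x job takes about 60 minutes, which is the point where concurrent runs reportedly start to get unreliable, and reaching 8,000 runs needs 27 such jobs.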

The best way for that is using the "Custom parameters" option when creating a test-run.
After that first time, it's "just" a matter of browser-back for 27x and pressing the submit with the same chosen options (Those tend to stick after the 2nd time browser-back in my, far too big, experience).

We could also try to bring the number of different testable classes in the "canary"/failing patch down, so we can have more runs of that.
But that's very likely to take (much) more time than blindly monkey-bashing the browser-back button.

Also: Big yay! for #3375959: Add a way to delay executions in test runner until terminate event completed in the child site

Comment #26

21 July 2023 at 12:36

longwave committed 2dad249a on 11.x

Issue #3375584 by znerol, longwave, Spokje: [random test failure] Random...

Comment #27

longwave commented 21 July 2023 at 12:36

Status: Needs review » Fixed

Discussed this with @Spokje in Slack and @znerol in person at Drupal Dev Days. This is affecting contributions at DDD as otherwise green patches are hitting it, so I am making the pragmatic call to commit this from NR. The fix appears to be the correct one given that we needed similar sleeps elsewhere, and this should unblock developers here; it won't make anything worse if it is the wrong fix, and we have the followup being actively worked on to handle this all in a better way.

Committed and pushed 2dad249ac0 to 11.x. Thanks!

Comment #28

catch commented 21 July 2023 at 13:07

+1 on the commit from NR, I guess this can be a posthumous RTBC.

@Spokje that has turned out to be very important on issues where we're unskipping skipped random failures, but I think for 'quick fix' issues like this where the issue is actively failing, it's a bit less of a concern since the commit is a lot less likely to re-introduce a random failure even if it doesn't 100% fix one.

Comment #29

4 August 2023 at 13:09

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.
