Add a way to delay executions in test runner until terminate event completed in the child site [#3375959]

Problem/Motivation

Several calls to sleep() were introduced to mitigate race conditions in #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration, and in #3375584: [random test failure] Random failure in PathWorkspacesTest. The affected browser tests have the following pattern in common:

1. An alias path is created during some request in the child site.
2. The response is generated and evaluated by the test runner.
3. After that, the following events can occur in any order:
   - An assertion against the alias path cache is run in the test runner.
   - The alias cache is rebuilt during execution of the terminate event in the child site.

Steps to reproduce

This is hard to reproduce reliably. See the patches in #3375584: [random test failure] Random failure in PathWorkspacesTest for some ideas.

Proposed resolution

1. Introduce a middleware that runs at a low enough priority to act on the final response.
2. From within the middleware, acquire a lock that is automatically released by a shutdown callback. Add a header to the response to indicate to the test runner that the lock exists.
3. Add a mechanism to wait for that lock in the existing test HTTP client middleware (if the response header is present).
4. Add a state flag that can be used to enable the middleware from within the test runner.
5. Introduce a test trait to simplify all of the above.
6. Remove the sleep() calls from the test cases and use the new test trait instead.
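
The ordering that the steps above are meant to guarantee can be illustrated with a small single-process simulation. This is only a sketch and not the code from the merge request: SharedLock, ChildRequest, runnerMayAssert() and the lock name are hypothetical stand-ins, and the header name is borrowed from the discussion in this issue.

```php
<?php
// Single-process simulation. All class, function and lock names are
// hypothetical; nothing here is a Drupal API.

/** Stand-in for a lock backend shared by the child site and the test runner. */
final class SharedLock {
  private array $held = [];

  public function acquire(string $name): bool {
    if (isset($this->held[$name])) {
      return false;
    }
    $this->held[$name] = true;
    return true;
  }

  public function release(string $name): void {
    unset($this->held[$name]);
  }

  public function isHeld(string $name): bool {
    return isset($this->held[$name]);
  }
}

/** Stand-in for one request handled by the child site. */
final class ChildRequest {
  /** @var callable[] */
  private array $shutdownCallbacks = [];

  public function __construct(private SharedLock $lock, private bool $waitEnabled) {}

  /** Middleware step: take the lock and flag the response with a header. */
  public function handle(): array {
    $headers = [];
    if ($this->waitEnabled && $this->lock->acquire('test_wait_terminate')) {
      $this->shutdownCallbacks[] = fn () => $this->lock->release('test_wait_terminate');
      $headers['X-Drupal-Test-Wait-Terminate'] = '1';
    }
    return $headers;
  }

  /** Terminate event, followed by the shutdown callbacks. */
  public function finish(array &$log): void {
    $log[] = 'child: terminate event done';
    foreach ($this->shutdownCallbacks as $callback) {
      $callback();
    }
  }
}

/** Test-runner side: assertions are only safe once the lock is gone. */
function runnerMayAssert(array $headers, SharedLock $lock): bool {
  return !isset($headers['X-Drupal-Test-Wait-Terminate'])
    || !$lock->isHeld('test_wait_terminate');
}
```

In a real test run the child site and the test runner are separate processes, so the runner would block on the lock instead of checking it once. The simulation only shows why the header plus the lock is enough to order the two events.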

Remaining tasks

User interface changes

API changes

Data model changes

Release notes snippet

Comment	File	Size	Author
#25	50x_functional_path_tests+MR-4461-25.patch	15.34 KB	znerol
#21	50x_functional_path_tests+MR-4461-21.patch	14.5 KB	znerol
#20	50x_functional_path_tests+MR-4461-20.patch	14.37 KB	znerol
#18	50x_functional_path_tests+MR-4461-17.patch	14.47 KB	znerol
#13	50x_functional_path_tests+MR-4434-13.patch	16.66 KB	znerol
#12	50x_functional_path_tests+MR-4434-12.patch	15.8 KB	znerol
#11	50x_functional_path_tests+MR-4434-11.patch	15.59 KB	znerol
#10	50x_functional_path_tests+MR-4434-10.patch	15.56 KB	znerol
#8	50x_functional_path_tests+MR-4434.patch	6.29 KB	znerol

Issue fork drupal-3375959

3375959-terminate-wait-lock

changes, plain diff MR !4461

3375959-add-a-way

changes, plain diff MR !4434

Comments

Comment #1

20 July 2023 at 20:57

znerol created an issue. See original summary.

Comment #2

znerol commented 20 July 2023 at 20:59

Issue summary:

View changes

Comment #3

znerol commented 20 July 2023 at 21:00

Issue tags:

+ddd2023

Comment #4

andypost

he/him

Russian

commented 20 July 2023 at 21:02

Comment #5

znerol commented 21 July 2023 at 09:16

Assigned:

Unassigned

» znerol

Comment #6

21 July 2023 at 10:13

znerol opened merge request !4434

Comment #7

mondrake

🇮🇹

commented 21 July 2023 at 10:14

Priority:

Normal

» Major

I've started to see test failures in my experimental database driver DruDbal, due to database locks, ever since the parent issue was committed.

Bumping to major, even though I think critical would be appropriate if the testing framework stability decreased.

Comment #8

znerol commented 21 July 2023 at 10:18

Status	File	Size
new	50x_functional_path_tests+MR-4434.patch	6.29 KB

Comment #9

znerol commented 21 July 2023 at 10:54

Assigned:	znerol	» Unassigned
Status:	Active	» Needs review

Comment #10

znerol commented 21 July 2023 at 11:29

Status	File	Size
new	50x_functional_path_tests+MR-4434-10.patch	15.56 KB

Remove calls to sleep() added in #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration.

Comment #11

znerol commented 21 July 2023 at 11:50

Status	File	Size
new	50x_functional_path_tests+MR-4434-11.patch	15.59 KB

More test groups (path,jsonapi,language,locale).

Comment #12

znerol commented 21 July 2023 at 12:24

Status	File	Size
new	50x_functional_path_tests+MR-4434-12.patch	15.8 KB

Looks like setWaitForTerminate() needs to run earlier in ConfigurableLanguageManagerTest and LocaleLocaleLookupTest.

Comment #13

znerol commented 22 July 2023 at 07:39

Status	File	Size
new	50x_functional_path_tests+MR-4434-13.patch	16.66 KB

Remove calls to sleep() added in #3375584: [random test failure] Random failure in PathWorkspacesTest.

Comment #14

bradjones1

English

commented 23 July 2023 at 01:44

Primary author of the changes in #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration, here.

From a quick read of this issue it appears that the goal is to disable post-response processing, by way of essentially undoing the fixes from the other issue, namely setting the content-length header. This might work, however it seems a bit hacky to me and, most importantly, not entirely obvious. For someone who comes along later and wants to maintain or debug this code, it's not clear why the content-length header is related. Even I had to go back and do some issue archaeology (that issue was open for almost a year, so my memory was a bit fuzzy and this stuff is pretty esoteric.)

For my part I would rather see a more explicit short-circuit of this functionality, or (better yet) improve the underlying race conditions and flakiness while allowing the underlying functionality to work as expected, so we are actually testing the system as it's expected to operate. There was a lot of discussion in the parent issue about this kind of thing, and it was generally considered to be an improvement in test coverage because we're no longer testing/depending on broken functionality.

I don't love the additions of the sleep() calls, but we had little choice because as you rightly point out it's hard to signal to and from the test runner and the site that's being called. I think this is where the improvement should happen - I'm sure there is a more elegant solution to this aspect of this puzzle.

Comment #15

catch

he/him

English

commented 23 July 2023 at 17:36

> From a quick read of this issue it appears that the goal is to disable post-response processing, by way of essentially undoing the fixes from the other issue, namely setting the content-length header. This might work, however it seems a bit hacky to me and, most importantly, not entirely obvious.

This is true, but I think it applies equally to sleep(1).

> For my part I would rather see a more explicit short-circuit of this functionality, or (better yet) improve the underlying race conditions and flakiness while allowing the underlying functionality to work as expected, so we are actually testing the system as it's expected to operate.

One possible approach that crossed my mind was a test-only post-response task that writes something to state or key/value when it's finished and tries to run last of the post-response tasks. We could then poll for that in the parent process, and delete it again as soon as it's found so that it's ready for the next time, like a ::waitForResponseTasksToFinish() method. This would allow the post-response tasks to actually run post-response while having something hopefully more reliable than sleep() to ensure they have.
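
A minimal sketch of the polling idea, assuming an array as a stand-in for state or key/value storage. The function body, the key name and the $tick callback are hypothetical; only the method name comes from the comment above.

```php
<?php
// Hypothetical helper. The array-backed $store and the key name stand in for
// Drupal's state or key/value storage.

/**
 * Polls the store until the "post-response tasks done" marker shows up.
 *
 * @param array $store
 *   Shared storage written to by the child site.
 * @param callable $tick
 *   Called between polls. A real test would sleep briefly here.
 * @param int $maxPolls
 *   Upper bound on the number of polls before giving up.
 * @param string $key
 *   Name of the marker written by the test-only post-response task.
 *
 * @return bool
 *   TRUE if the marker was found (and deleted), FALSE on timeout.
 */
function waitForResponseTasksToFinish(
  array &$store,
  callable $tick,
  int $maxPolls = 100,
  string $key = 'test.post_response_done'
): bool {
  for ($i = 0; $i < $maxPolls; $i++) {
    if (!empty($store[$key])) {
      // Delete the marker so that the next request starts from a clean slate.
      unset($store[$key]);
      return true;
    }
    $tick($store);
  }
  return false;
}
```

The marker is removed as soon as it is seen, so a stale value from a previous request cannot satisfy the next wait.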

Comment #16

znerol commented 24 July 2023 at 07:43

We could try to use flock() for that purpose. If the state flag is set, the child site creates a temporary file and flock()s it (using LOCK_EX). The path to the file is communicated to the parent site in a response header (X-Drupal-Test-Wait-Terminate: /tmp/xxxx.lock).

The test runner's HTTP client (most probably stuff in UIHelperTrait) examines the response headers and extracts the path from X-Drupal-Test-Wait-Terminate. If it exists, it tries to set an exclusive lock on that file (which blocks until the lock in the child site is released).

Some notes:

In the best case, the child site doesn't need to clean up the lock (since it should also be released by fclose(), or when the stream is garbage collected).
The test runner assumes that the child already terminated if it fails to open the file or fails to lock it.
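
Roughly, the two sides could look like the sketch below. The function names are hypothetical and the header handling is left out; only flock(), fopen(), fclose() and tempnam() are real PHP functions.

```php
<?php
// Sketch only. childAcquireTerminateLock() and runnerWaitForTerminate() are
// hypothetical names, not part of any patch in this issue.

/**
 * Child site: creates a lock file and takes an exclusive lock on it.
 *
 * @return array
 *   The lock file path and the open handle that owns the lock. The caller
 *   has to keep the handle alive, because closing it releases the lock.
 */
function childAcquireTerminateLock(): array {
  $path = tempnam(sys_get_temp_dir(), 'wait');
  $handle = $path !== false ? fopen($path, 'c') : false;
  if ($handle === false || !flock($handle, LOCK_EX)) {
    throw new RuntimeException('Unable to create the terminate lock.');
  }
  return [$path, $handle];
}

/**
 * Test runner: blocks until the lock on $path can be taken.
 *
 * @return bool
 *   TRUE if the lock was obtained, FALSE if the file could not be opened or
 *   locked, in which case the child is assumed to have terminated already.
 */
function runnerWaitForTerminate(?string $path): bool {
  if ($path === null || !is_file($path)) {
    return false;
  }
  $handle = @fopen($path, 'r+');
  if ($handle === false) {
    return false;
  }
  // Blocks for as long as another handle holds the exclusive lock.
  $locked = flock($handle, LOCK_EX);
  if ($locked) {
    flock($handle, LOCK_UN);
  }
  fclose($handle);
  return $locked;
}
```

In the child site the handle would be kept in a long-lived variable until shutdown. In the test runner, the path would come from the X-Drupal-Test-Wait-Terminate response header.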

Comment #17

24 July 2023 at 10:14

znerol opened merge request !4461

Comment #18

znerol commented 24 July 2023 at 10:26

Status	File	Size
new	50x_functional_path_tests+MR-4461-17.patch	14.47 KB

Opened an alternative MR !4461 implementing the ideas in #16 using the Symfony Lock component. Attached is a patch which executes the affected test groups 50 times in a row.

Comment #19

24 July 2023 at 11:12

Status:

Needs review

» Needs work

The last submitted patch, 18: 50x_functional_path_tests+MR-4461-17.patch, failed testing. View results

Comment #20

znerol commented 24 July 2023 at 11:59

Status:

Needs work

» Needs review

Status	File	Size
new	50x_functional_path_tests+MR-4461-20.patch	14.37 KB

4 files were hidden/shown/deleted

Status	File	Size
hidden	50x_functional_path_tests+MR-4434-11.patch	15.59 KB
hidden	50x_functional_path_tests+MR-4434-12.patch	15.8 KB
hidden	50x_functional_path_tests+MR-4434-13.patch	16.66 KB
hidden	50x_functional_path_tests+MR-4461-17.patch	14.47 KB

Remove X-Drupal-Wait-Terminate response header and only attempt to wait for termination if container has state service.

Comment #21

znerol commented 24 July 2023 at 12:27

Status	File	Size
new	50x_functional_path_tests+MR-4461-21.patch	14.5 KB

Retain a reference to the lock, otherwise it will be released prematurely.

Comment #22

catch

he/him

English

commented 24 July 2023 at 12:46

+++ b/core/lib/Drupal/Core/CoreServiceProvider.php
@@ -142,6 +143,12 @@ protected function registerTest(ContainerBuilder $container) {
       ->addTag('http_client_middleware');
+    // Removes Content-Length header added in FinishResponseSubscriber if
+    // required by the test runner.
+    $container

This is outdated now. New approach looks very encouraging to me.

Comment #23

znerol commented 24 July 2023 at 13:01

> This is outdated now.

Yes. And since we are not accessing the Request/Response object anymore it doesn't need to go into a stack middleware. Not sure if there is a better location for the code though.

Comment #24

24 July 2023 at 13:12

Status:

Needs review

» Needs work

The last submitted patch, 21: 50x_functional_path_tests+MR-4461-21.patch, failed testing. View results

Comment #25

znerol commented 24 July 2023 at 20:23

Status:

Needs work

» Needs review

Status	File	Size
new	50x_functional_path_tests+MR-4461-25.patch	15.34 KB

Switching to \Drupal::lock(). We know that this is working in other cases, so let's just use that.

Comment #26

znerol commented 26 July 2023 at 06:47

Issue summary:

View changes

4 files were hidden/shown/deleted

Status	File	Size
hidden	50x_functional_path_tests+MR-4434.patch	6.29 KB
hidden	50x_functional_path_tests+MR-4434-10.patch	15.56 KB
hidden	50x_functional_path_tests+MR-4461-20.patch	14.37 KB
hidden	50x_functional_path_tests+MR-4461-21.patch	14.5 KB

Comment #27

catch

he/him

English

commented 26 July 2023 at 16:04

Just using lock seems both simpler and more reliable. Haven't done a line by line review yet but also nothing stuck out either, so +1.

Comment #28

bradjones1

English

commented 26 July 2023 at 16:14

I did a once-over of the revised patch and I'm a bit confused as to where and how it's actually waiting for termination of the post-response work. From my read of the earlier comments I figured that there would be, say, a service added that would run after all other destructable services that would release the lock, but it appears this is being done within the test HTTP client? Isn't the whole problem here that the client side doesn't know how long the post-response work takes, and has to be polling some state that is asynchronous from the request->response cycle? I could definitely be reading this wrong as I am not an expert on locks (I always get turned around when implementing them) but I'm suspicious that this might not be signaling the end of work so much as adding a few ticks, and as a result the timing works out?

Comment #29

catch

he/him

English

commented 26 July 2023 at 16:39

@bradjones1 I think we need a comment as a reminder, but it's fine:

DatabaseLockBackend for example does this:

  /**
   * Constructs a new DatabaseLockBackend.
   *
   * @param \Drupal\Core\Database\Connection $database
   *   The database connection.
   */
  public function __construct(Connection $database) {
    // __destruct() is causing problems with garbage collections, register a
    // shutdown function instead.
    drupal_register_shutdown_function([$this, 'releaseAll']);
    $this->database = $database;
  }

So if a lock is acquired at the beginning of the request, releaseAll() will run at the end of the request, after all post-response tasks have run, and clear out every lock that was acquired.

It's mostly a safety catch so that a request can't hold a lock even after it's finished (say if there's a fatal error before the lock can be released) but happens to do exactly what we need here.

The patch is using the default values for ::acquire() and ::wait(), which means the lock is acquired for 30 seconds. ::wait() intelligently handles polling while it's waiting (i.e. it starts with milliseconds and ends up at hundreds of milliseconds), so that works for making sure we wait the minimum time necessary.
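
The growing poll interval can be sketched as follows. This is an illustration and not Drupal's lock code: the function name, the injectable sleeper, the 25 ms starting interval and the 500 ms cap are assumptions made for the example.

```php
<?php
// Hypothetical helper illustrating a poll loop with a growing interval.

/**
 * Waits until $mayBeAvailable() reports that the lock is free.
 *
 * @param callable $mayBeAvailable
 *   Returns TRUE once the lock is no longer held.
 * @param int $timeoutSeconds
 *   Maximum time to wait.
 * @param callable|null $sleep
 *   Receives the interval in microseconds. Defaults to usleep().
 *
 * @return bool
 *   TRUE if the lock became available, FALSE on timeout.
 */
function waitForLock(callable $mayBeAvailable, int $timeoutSeconds = 30, ?callable $sleep = null): bool {
  $sleep ??= 'usleep';
  $remaining = $timeoutSeconds * 1000000;
  // Assumed values: start at 25 ms and grow by 25 ms up to 500 ms.
  $interval = 25000;
  while ($remaining > 0) {
    $sleep($interval);
    $remaining -= $interval;
    if ($mayBeAvailable()) {
      return true;
    }
    // Never sleep past the deadline.
    $interval = min(500000, $interval + 25000, max($remaining, 1));
  }
  return false;
}
```

Short waits stay cheap because the first polls happen within milliseconds, while a long-running terminate phase does not cause a tight polling loop.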

Comment #30

26 July 2023 at 18:31

znerol closed merge request !4434

Comment #31

bradjones1

English

commented 26 July 2023 at 19:03

Ahhh, yeah, OK. That makes sense and appreciate the clarification. I think a comment that states how it works is helpful here for maintainability. Thanks.

Comment #32

spokje

Dutch

commented 6 August 2023 at 08:04

Status:

Needs review

» Reviewed & tested by the community

This seems ready for core committers' review, and is IMHO a much more deterministic way than whacking sleep(1) into tests.

Comment #33

6 August 2023 at 09:25

catch committed 1debb398 on 11.x

Issue #3375959 by znerol, catch, bradjones1: Add a way to delay...

Comment #34

catch

he/him

English

commented 6 August 2023 at 09:45

Status:

Reviewed & tested by the community

» Fixed

Committed 1debb39 and pushed to 11.x. Thanks!

Comment #35

20 August 2023 at 09:49

Status:

Fixed

» Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

Comment #36

27 September 2023 at 05:18

znerol closed merge request !4461
