Problem/Motivation

In #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration and #3375584: [random test failure] Random failure in PathWorkspacesTest several calls to sleep() were introduced in order to mitigate race conditions. The affected browser tests have the following pattern in common:

  1. An alias path is created during some request in the child site.
  2. The response is generated and evaluated by the test runner.
  3. After that the following events can occur in any order:
    • An assertion against the alias path cache is run in the test runner
    • The alias cache is rebuilt during execution of the terminate event in the child site

Steps to reproduce

This is hard, look at the patches in #3375584: [random test failure] Random failure in PathWorkspacesTest for some ideas.

Proposed resolution

  • Introduce a middleware which is running at a low enough priority to act on the final response
  • From within the middleware, acquire a lock which is automatically released using a shutdown callback. Add a header to the response in order to indicate to the test runner that the lock exists.
  • Add a mechanism to wait for that lock in the existing test http client middleware (if the response header is present)
  • Add a state flag which we can use to enable the middleware from within the test runner
  • Introduce a test trait in order to simplify all that
  • Remove the sleep() from the test cases and instead use the new test trait

Remaining tasks

User interface changes

API changes

Data model changes

Release notes snippet

Issue fork drupal-3375959

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Comments

znerol created an issue. See original summary.

znerol’s picture

Issue summary: View changes
znerol’s picture

Issue tags: +ddd2023
andypost’s picture

znerol’s picture

Assigned: Unassigned » znerol

mondrake’s picture

Priority: Normal » Major

I've started to see test failures in my experimental database driver DruDbal, due to database locks, since when the parent issue was committed.

Bumping to major, even though I think critical would be appropriate if the testing framework stability decreased.

znerol’s picture

StatusFileSize
new6.29 KB
znerol’s picture

Assigned: znerol » Unassigned
Status: Active » Needs review
znerol’s picture

StatusFileSize
new15.59 KB

More test groups (path,jsonapi,language,locale).

znerol’s picture

StatusFileSize
new15.8 KB

Looks like setWaitForTerminate() needs to run earlier in ConfigurableLanguageManagerTest and LocaleLocaleLookupTest.

znerol’s picture

bradjones1’s picture

Primary author of the changes in #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration, here.

From a quick read of this issue it appears that the goal is to disable post-response processing, by way of essentially undoing the fixes from the other issue, namely setting the content-length header. This might work, however it seems a bit hacky to me and, most importantly, not entirely obvious. For someone who comes along later and wants to maintain or debug this code, it's not clear why the content-length header is related. Even I had to go back and do some issue archaeology (that issue was open for almost a year, so my memory was a bit fuzzy and this stuff is pretty esoteric.)

For my part I would rather see a more explicit short-circuit of this functionality, or (better yet) improve the underlying race conditions and flakiness while allowing the underlying functionality work as expected, so we are actually testing the system as it's expected to operate. There was a lot of discussion in the parent issue about this kind of thing, and it was generally considered to be an improvement in test coverage because we're no longer testing/depending on broken functionality.

I don't love the additions of the sleep() calls, but we had little choice because as you rightly point out it's hard to signal to and from the test runner and the site that's being called. I think this is where the improvement should happen - I'm sure there is a more elegant solution to this aspect of this puzzle.

catch’s picture

From a quick read of this issue it appears that the goal is to disable post-response processing, by way of essentially undoing the fixes from the other issue, namely setting the content-length header. This might work, however it seems a bit hacky to me and, most importantly, not entirely obvious.

This is true, but I think it applies equally to sleep(1).

For my part I would rather see a more explicit short-circuit of this functionality, or (better yet) improve the underlying race conditions and flakiness while allowing the underlying functionality work as expected, so we are actually testing the system as it's expected to operate.

One possible approach that crossed my mind was a test-only post response task, that writes something to state or key/value when it's finished and tries to run last of the post response tasks. We could then poll for that in the parent process, and delete it again as soon as its found ready for the next time, like a ::waitForResponseTasksToFinish() method. This would allow the post response tasks to run actually post-response while having something hopefully more reliable than sleep() to ensure they have.

znerol’s picture

We could try to use flock() for that purpose. If the state flag is set, the child site creates a temporary file and flocks() it (using LOCK_EX). The path to the file is communicated to the parent site in a response header (X-Drupal-Test-Wait-Terminate: /tmp/xxxx.lock).

The test runners HTTP client (most probably stuff in UIHelperTrait) examines the response headers and extracts the path from X-Drupal-Test-Wait-Terminate. If it exists, it tries to set an exclusive lock on that file (which blocks until the lock in the child site is released).

Some notes:

  1. In the best case, the child site doesn't need to clean up the lock (since it should be released also by fclose(), or when stream is garbage collected.
  2. The test runner assumes that the child already terminated if it fails to open the file or fails to lock it.

znerol’s picture

StatusFileSize
new14.47 KB

Opened an alternative MR 4461 implementing the ideas in #16 using the symfony lock component. Attached is a patch which executes the affected test groups 50 times in a row.

Status: Needs review » Needs work

The last submitted patch, 18: 50x_functional_path_tests+MR-4461-17.patch, failed testing. View results

znerol’s picture

Status: Needs work » Needs review
StatusFileSize
new14.37 KB

Remove X-Drupal-Wait-Terminate response header and only attempt to wait for termination if container has state service.

znerol’s picture

StatusFileSize
new14.5 KB

Retain a reference to the lock, otherwise it will be released prematurely.

catch’s picture

+++ b/core/lib/Drupal/Core/CoreServiceProvider.php
@@ -142,6 +143,12 @@ protected function registerTest(ContainerBuilder $container) {
       ->addTag('http_client_middleware');
+    // Removes Content-Length header added in FinishResponseSubscriber if
+    // required by the test runner.
+    $container

This is outdated now. New approach looks very encouraging to me.

znerol’s picture

This is outdated now.

Yes. And since we are not accessing the Request/Response object anymore it doesn't need to go into a stack middleware. Not sure if there is a better location for the code though.

Status: Needs review » Needs work

The last submitted patch, 21: 50x_functional_path_tests+MR-4461-21.patch, failed testing. View results

znerol’s picture

Status: Needs work » Needs review
StatusFileSize
new15.34 KB

Switching to \Drupal:lock(). We know that this is working in other cases, so let's just use that.

znerol’s picture

Issue summary: View changes
catch’s picture

Just using lock seems both simpler and more reliable. Haven't done a line by line review yet but also nothing stuck out either, so +1.

bradjones1’s picture

I did a once-over of the revised patch and I'm a bit confused as to where and how it's actually waiting for termination of the post-response work. From my read of the earlier comments I figured that there would be, say, a service added that would run after all other destructable services that would release the lock, but it appears this is being done within the test HTTP client? Isn't the whole problem here that the client side doesn't know how long the post-response work takes, and has to be polling some state that is asynchronous from the request->response cycle? I could definitely be reading this wrong as I am not an expert on locks (I always get turned around when implementing them) but I'm suspicious that this might not be signaling the end of work so much as adding a few ticks, and as a result the timing works out?

catch’s picture

@bradjones I think we need a comment as a reminder, but it's fine:

DatabaseLockBackend for example does this:

  /**
   * Constructs a new DatabaseLockBackend.
   *
   * @param \Drupal\Core\Database\Connection $database
   *   The database connection.
   */
  public function __construct(Connection $database) {
    // __destruct() is causing problems with garbage collections, register a
    // shutdown function instead.
    drupal_register_shutdown_function([$this, 'releaseAll']);
    $this->database = $database;
  }

So acquiring a lock at the beginning of the request, this will then run at the end of the request after all post response tasks have run, clearing out every lock that was acquired.

It's mostly a safety catch so that a request can't hold a lock even after it's finished (say if there's a fatal error before the lock can be released) but happens to do exactly what we need here.

The patch is using the default values for ::acquire() and ::wait() which means it's acquired for 30 seconds. Lock:wait() intelligently handles polling while it's waiting (i.e. it starts with millseconds and ends up at hundreds of milliseconds), so that works for making sure we wait the minimum time necessary.

bradjones1’s picture

Ahhh, yeah, OK. That makes sense and appreciate the clarification. I think a comment that states how it works is helpful here for maintainability. Thanks.

spokje’s picture

Status: Needs review » Reviewed & tested by the community

This seems ready for core comitters review, and is IMHO a much more deterministic way then whacking sleep(1) into tests.

  • catch committed 1debb398 on 11.x
    Issue #3375959 by znerol, catch, bradjones1: Add a way to delay...
catch’s picture

Status: Reviewed & tested by the community » Fixed

Committed 1debb39 and pushed to 11.x. Thanks!

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.