Problem/Motivation
In #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration and #3375584: [random test failure] Random failure in PathWorkspacesTest several calls to sleep() were introduced in order to mitigate race conditions. The affected browser tests have the following pattern in common:
- An alias path is created during some request in the child site.
- The response is generated and evaluated by the test runner.
- After that the following events can occur in any order:
- An assertion against the alias path cache is run in the test runner
- The alias cache is rebuilt during execution of the terminate event in the child site
Steps to reproduce
This is hard, look at the patches in #3375584: [random test failure] Random failure in PathWorkspacesTest for some ideas.
Proposed resolution
- Introduce a middleware which is running at a low enough priority to act on the final response
- From within the middleware, acquire a lock which is automatically released using a shutdown callback. Add a header to the response in order to indicate to the test runner that the lock exists.
- Add a mechanism to wait for that lock in the existing test http client middleware (if the response header is present)
- Add a state flag which we can use to enable the middleware from within the test runner
- Introduce a test trait in order to simplify all that
- Remove the
sleep()from the test cases and instead use the new test trait
Remaining tasks
User interface changes
API changes
Data model changes
Release notes snippet
| Comment | File | Size | Author |
|---|---|---|---|
| #25 | 50x_functional_path_tests+MR-4461-25.patch | 15.34 KB | znerol |
Issue fork drupal-3375959
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
- 3375959-terminate-wait-lock
changes, plain diff MR !4461
- 3375959-add-a-way
changes, plain diff MR !4434
Comments
Comment #2
znerol commentedComment #3
znerol commentedComment #4
andypostComment #5
znerol commentedComment #7
mondrakeI've started to see test failures in my experimental database driver DruDbal, due to database locks, since when the parent issue was committed.
Bumping to major, even though I think critical would be appropriate if the testing framework stability decreased.
Comment #8
znerol commentedComment #9
znerol commentedComment #10
znerol commentedRemove calls to sleep() added in #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration.
Comment #11
znerol commentedMore test groups (
path,jsonapi,language,locale).Comment #12
znerol commentedLooks like
setWaitForTerminate()needs to run earlier inConfigurableLanguageManagerTestandLocaleLocaleLookupTest.Comment #13
znerol commentedRemove calls to sleep() added in #3375584: [random test failure] Random failure in PathWorkspacesTest.
Comment #14
bradjones1Primary author of the changes in #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration, here.
From a quick read of this issue it appears that the goal is to disable post-response processing, by way of essentially undoing the fixes from the other issue, namely setting the content-length header. This might work, however it seems a bit hacky to me and, most importantly, not entirely obvious. For someone who comes along later and wants to maintain or debug this code, it's not clear why the content-length header is related. Even I had to go back and do some issue archaeology (that issue was open for almost a year, so my memory was a bit fuzzy and this stuff is pretty esoteric.)
For my part I would rather see a more explicit short-circuit of this functionality, or (better yet) improve the underlying race conditions and flakiness while allowing the underlying functionality work as expected, so we are actually testing the system as it's expected to operate. There was a lot of discussion in the parent issue about this kind of thing, and it was generally considered to be an improvement in test coverage because we're no longer testing/depending on broken functionality.
I don't love the additions of the sleep() calls, but we had little choice because as you rightly point out it's hard to signal to and from the test runner and the site that's being called. I think this is where the improvement should happen - I'm sure there is a more elegant solution to this aspect of this puzzle.
Comment #15
catchThis is true, but I think it applies equally to sleep(1).
One possible approach that crossed my mind was a test-only post response task, that writes something to state or key/value when it's finished and tries to run last of the post response tasks. We could then poll for that in the parent process, and delete it again as soon as its found ready for the next time, like a ::waitForResponseTasksToFinish() method. This would allow the post response tasks to run actually post-response while having something hopefully more reliable than sleep() to ensure they have.
Comment #16
znerol commentedWe could try to use flock() for that purpose. If the state flag is set, the child site creates a temporary file and
flocks()it (usingLOCK_EX). The path to the file is communicated to the parent site in a response header (X-Drupal-Test-Wait-Terminate: /tmp/xxxx.lock).The test runners HTTP client (most probably stuff in
UIHelperTrait) examines the response headers and extracts the path fromX-Drupal-Test-Wait-Terminate. If it exists, it tries to set an exclusive lock on that file (which blocks until the lock in the child site is released).Some notes:
Comment #18
znerol commentedOpened an alternative MR 4461 implementing the ideas in #16 using the symfony lock component. Attached is a patch which executes the affected test groups 50 times in a row.
Comment #20
znerol commentedRemove X-Drupal-Wait-Terminate response header and only attempt to wait for termination if container has state service.
Comment #21
znerol commentedRetain a reference to the lock, otherwise it will be released prematurely.
Comment #22
catchThis is outdated now. New approach looks very encouraging to me.
Comment #23
znerol commentedYes. And since we are not accessing the
Request/Responseobject anymore it doesn't need to go into a stack middleware. Not sure if there is a better location for the code though.Comment #25
znerol commentedSwitching to
\Drupal:lock(). We know that this is working in other cases, so let's just use that.Comment #26
znerol commentedComment #27
catchJust using lock seems both simpler and more reliable. Haven't done a line by line review yet but also nothing stuck out either, so +1.
Comment #28
bradjones1I did a once-over of the revised patch and I'm a bit confused as to where and how it's actually waiting for termination of the post-response work. From my read of the earlier comments I figured that there would be, say, a service added that would run after all other destructable services that would release the lock, but it appears this is being done within the test HTTP client? Isn't the whole problem here that the client side doesn't know how long the post-response work takes, and has to be polling some state that is asynchronous from the request->response cycle? I could definitely be reading this wrong as I am not an expert on locks (I always get turned around when implementing them) but I'm suspicious that this might not be signaling the end of work so much as adding a few ticks, and as a result the timing works out?
Comment #29
catch@bradjones I think we need a comment as a reminder, but it's fine:
DatabaseLockBackend for example does this:
So acquiring a lock at the beginning of the request, this will then run at the end of the request after all post response tasks have run, clearing out every lock that was acquired.
It's mostly a safety catch so that a request can't hold a lock even after it's finished (say if there's a fatal error before the lock can be released) but happens to do exactly what we need here.
The patch is using the default values for ::acquire() and ::wait() which means it's acquired for 30 seconds. Lock:wait() intelligently handles polling while it's waiting (i.e. it starts with millseconds and ends up at hundreds of milliseconds), so that works for making sure we wait the minimum time necessary.
Comment #31
bradjones1Ahhh, yeah, OK. That makes sense and appreciate the clarification. I think a comment that states how it works is helpful here for maintainability. Thanks.
Comment #32
spokjeThis seems ready for core comitters review, and is IMHO a much more deterministic way then whacking
sleep(1)into tests.Comment #34
catchCommitted 1debb39 and pushed to 11.x. Thanks!