Support for Drupal 7 is ending on 5 January 2025—it’s time to migrate to Drupal 10! Learn about the many benefits of Drupal 10 and find migration tools in our resource center.
Problem/Motivation
Test failed today in Drupal\BuildTests\Framework\BuildTestBase::instantiateServer
in the layout_builder module:
https://www.drupal.org/pift-ci-job/2065061
In case that goes away, the error says:
Build.Drupal\BuildTests\Framework\Tests\BuildTestTest
✗
Drupal\BuildTests\Framework\Tests\BuildTestTest
exception: [Other] Line 0 of sites/default/files/simpletest/phpunit-2.xml:
PHPUnit Test failed to complete; Error: PHPUnit 9.5.4 by Sebastian Bergmann and contributors.
Warning: Your XML configuration validates against a deprecated schema.
Suggestion: Migrate your XML configuration using "--migrate-configuration"!
Testing Drupal\BuildTests\Framework\Tests\BuildTestTest
...E. 5 / 5 (100%)
Time: 00:15.392, Memory: 14.50 MB
There was 1 error:
1) Drupal\BuildTests\Framework\Tests\BuildTestTest::testPortMany
RuntimeException: Unable to start the web server.
ERROR OUTPUT:
/var/www/html/core/tests/Drupal/BuildTests/Framework/BuildTestBase.php:435
/var/www/html/core/tests/Drupal/BuildTests/Framework/Tests/BuildTestTest.php:145
/var/www/html/vendor/phpunit/phpunit/src/Framework/TestResult.php:722
/var/www/html/vendor/phpunit/phpunit/src/Framework/TestSuite.php:677
/var/www/html/vendor/phpunit/phpunit/src/TextUI/TestRunner.php:667
/var/www/html/vendor/phpunit/phpunit/src/TextUI/Command.php:143
/var/www/html/vendor/phpunit/phpunit/src/TextUI/Command.php:96
ERRORS!
Tests: 5, Assertions: 18, Errors: 1.
Seems a rare one, have found mentions of it in:
#1464244-55: Rewrite as URL adding equals sign to end of url.
#2350711-41: Update module fails on installation
Steps to reproduce
Proposed resolution
Remaining tasks
User interface changes
API changes
Data model changes
Release notes snippet
Comment | File | Size | Author |
---|---|---|---|
#40 | 3214565-40.patch | 1.06 KB | longwave |
Issue fork drupal-3214565
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
- 3214565-random-test-failure changes, plain diff MR !687
Comments
Comment #2
LendudeLooking at
\Drupal\BuildTests\Framework\BuildTestBase::instantiateServer
, the test waits for the server to be up for 1 second (1000 iterations of waiting 1000 microseconds).So the obvious thing to try would be to wait a little longer? Maybe 2 seconds?
Comment #3
SpokjeCorrected c/p fail in IS.
Comment #4
Spokje@lendude:
I would expect that to be 1000 x 1000ms = 16.6666667 minutes, but that's insanely long.
Looking at the CI log:
It takes around 7 seconds for
Drupal\BuildTests\Framework\Tests\BuildTestTest
to give up hope and throw the exception after the loop:Anyway: Since this wait-loop returns with an answer when the port is no longer available (
$ps
) as soon as possible, I think a longer wait is OK here.Created an MR for that.
Not really sure how to test/check this, the test-failure is (fairly) rare, found only 2 other documented occurrences so far. (See links in IS).
Running with the theory that this only occurs when TestBot is really busy.
Comment #6
LendudeIt's in μs not ms (that threw me too the first time I looked at it :))
Comment #7
SpokjeLife's too short to worry about μs IMHO, but you're absolutely right :P
Comment #8
SpokjeHit another occurrence: https://www.drupal.org/pift-ci-job/2069592
Comment #9
longwaveTo me the number of iterations is too high and the wait is far too short. It feels like something like 200 iterations and waiting 10 ms / 10000 microseconds between each one would be plenty while avoiding hammering the port to check if it's open yet.
Comment #10
SpokjeAnother bites the dust, this time in the
9.3.x-dev
test-on-commit: https://www.drupal.org/pift-ci-job/2069801Comment #11
SpokjeThus spoketh @longwave in #9
Can't argue with that!
Updated MR.
Comment #12
longwaveAssuming this is the problem, hopefully these changes will make it go away.
Comment #13
alexpottThis test seems flawed. The more build tests we have the more likely it is to fail. The gap between finding an available port and then instantiating a server leaves a gap. This introduces a race which is common source for random fails. I'm not sure that waiting for longer will reduce random fails. I think we need to implement a locking mechanism for port numbers.
But before we do anything we need to improve the reporting I think we should change the exception to:
Because it seems as though there is nothing in the error output that is telling us what is going on so maybe there is something in the regular output.
Comment #14
SpokjeNeeds followup
tag for the suggested locking mechanism. (Not creating one right now since the improved Exception output might give us a clue to do something else instead).Comment #15
SpokjeComment #16
longwaveIt's possible the exit code might also tell us something if there is no output? Wonder if we should just call assertCommandExitCode() instead which would print both output streams and the exit code.
edit: oh, but there might not be an exit code either, if it successfully started in the background?
Comment #17
SpokjeThus spoketh @longwave in #16.
Me like!
Changed MR accordingly
Comment #18
longwaveThis will improve the error reporting so let's see how this gets on.
@alexpott there is already a locking mechanism that is supposed to cover multiple processes in
BuildTestBase::findAvailablePort()
, what more do we need?Comment #19
SpokjeAnother one: https://www.drupal.org/pift-ci-job/2073236
Comment #20
alexpott@Spokje @longwave I'm not convinced that replacing the exception with assertCommandExitCode() is correct. The process is being started in the background and will not have exited yet. Can we change back to a version that adds the normal output to the exception as before.
Comment #21
alexpott@longwave ah yeah I see we have a locking mechanism... so it's not a race or we've implemented it incorrectly. FWIW it seems I implementing the locking - lolz - #3086179: Fix spurious failures attempting to allocate a web server in the build tests
Comment #22
SpokjeSure, back to RuntimeException with errorOutput.
Comment #23
SpokjeRemoved
Needs followup
tag accordingly.Comment #24
longwaveI still think the exit code might be useful, for example if it's crashing due to signal 11 or something like that there might be no output.
Comment #25
alexpottThen let's add that to the exception - the problem with using the assert is that we'll only get the output if the exit status is not 0 and that's an assumption. And the point here is that regardless of the exit status we need to throw an exception.
Comment #26
longwaveLet's see if we can trigger this with repeat. This might not work if it's a concurrency issue with other tests.
Comment #28
Spokje> Let's see if we can trigger this with repeat. This might not work if it's a concurrency issue with other tests.
My money is on the concurrency, so no dice with the repeat-test, but would love to be wrong.
Updated the MR to include the exit-code.
EDIT: Right...beaten by 1 minute and a wrong prediction. Not much info sadly :/
Comment #29
alexpottLet's get more info... maybe it is always the same port....
Comment #30
alexpottMoar info...
Comment #31
SpokjeIf it ain't consistently broke: Break it consistently! 😈
Comment #34
longwaveLet's see what happens if we wait a bit longer and don't retry as quickly.
Comment #35
longwaveA lucky pass, or is this the fix? Let's try that again.
Comment #36
alexpott@longwave might be worth upping the repeats to 200... looking good so far... today we learn that sometimes it takes longer than 1 second to start a PHP development webserver on Drupal CI.
Comment #37
longwaveI realised the sleep and check are the wrong way round, the way we currently do it you always need at least one retry, and there is no point waiting after the last retry, so I swapped those over.
I also wanted to know how long we actually need - ie. whether we ever come close to 2 seconds - so I've temporarily made it fail with statistics if it needs more than half a second.
Comment #39
longwaveSo in an extreme case here we needed 19 iterations or 1.9 seconds, therefore I suggest bumping the max time up to at least 3 seconds if not 5 seconds to avoid running into this again any time soon. This is only the *maximum* time, and as we can see in most cases it takes far less than a second.
Comment #40
longwave50 iterations of 100ms each, I also kept the improved error messaging on failure in case we see any other weirdness from this.
Comment #41
Spokje(Slightly off-topic) I like this insight into the minds of The Big Brains, kinda like watching a live debug stream (which I never watched, no clue if it even exists). Putting it into the d.o queue format makes it easy to hop by every now and then and see where we stand.
Ponders turning this into a big-selling YouTube channel and finally become filthy rich...
Comment #43
Spokje- Patch makes TestBot happily green
- Moving
usleep
to go after thecheckPortIsAvailable($port)
check is explained in #37 in a way that even I understand.- The increase of
usleep
-value and decrease of the loop-value is test-proven in #39.- Improved logging in RuntimeException message per as (initially) requested by @alexpott in #13.
RTBC for me.
Comment #44
alexpottCommitted and pushed 2c709ec12f to 9.3.x and fbcaae7696 to 9.2.x. Thanks!
I'll backport to 9.1.x is a release manager agrees.