[random test failure] Random fail in BuildTestTest::testPortMany [#3214565]

Problem/Motivation

Test failed today in Drupal\BuildTests\Framework\BuildTestBase::instantiateServer in the layout_builder module:
https://www.drupal.org/pift-ci-job/2065061

In case that goes away, the error says:

	Build.Drupal\BuildTests\Framework\Tests\BuildTestTest
	✗	
Drupal\BuildTests\Framework\Tests\BuildTestTest

exception: [Other] Line 0 of sites/default/files/simpletest/phpunit-2.xml:
PHPUnit Test failed to complete; Error: PHPUnit 9.5.4 by Sebastian Bergmann and contributors.

Warning:       Your XML configuration validates against a deprecated schema.
Suggestion:    Migrate your XML configuration using "--migrate-configuration"!

Testing Drupal\BuildTests\Framework\Tests\BuildTestTest
...E.                                                               5 / 5 (100%)

Time: 00:15.392, Memory: 14.50 MB

There was 1 error:

1) Drupal\BuildTests\Framework\Tests\BuildTestTest::testPortMany
RuntimeException: Unable to start the web server.
ERROR OUTPUT:


/var/www/html/core/tests/Drupal/BuildTests/Framework/BuildTestBase.php:435
/var/www/html/core/tests/Drupal/BuildTests/Framework/Tests/BuildTestTest.php:145
/var/www/html/vendor/phpunit/phpunit/src/Framework/TestResult.php:722
/var/www/html/vendor/phpunit/phpunit/src/Framework/TestSuite.php:677
/var/www/html/vendor/phpunit/phpunit/src/TextUI/TestRunner.php:667
/var/www/html/vendor/phpunit/phpunit/src/TextUI/Command.php:143
/var/www/html/vendor/phpunit/phpunit/src/TextUI/Command.php:96

ERRORS!
Tests: 5, Assertions: 18, Errors: 1.

Seems a rare one, have found mentions of it in:
#1464244-55: Rewrite as URL adding equals sign to end of url.
#2350711-41: Update module fails on installation

Steps to reproduce

Proposed resolution

Remaining tasks

User interface changes

API changes

Data model changes

Release notes snippet

Comment	File	Size	Author
#40	3214565-40.patch	1.06 KB	longwave
#40	9.3.x: PHP 8.0 & MySQL 5.7 28,548 pass
#37	3214565-37.patch	2.9 KB	longwave
#37	9.3.x: PHP 8.0 & MySQL 5.7 5 pass, 1 fail
#37	3214565-37.patch	2.86 KB	longwave
#37	9.3.x: PHP 8.0 & MySQL 5.7 CI aborted
#34	3214565-34.patch	2.86 KB	longwave
#34	9.3.x: PHP 8.0 & MySQL 5.7 5 pass
#30	3214565-30-repeat-50.patch	2.58 KB	alexpott
#30	9.3.x: PHP 8.0 & MySQL 5.7 5 pass, 2 fail
#29	3214565-29-repeat-50.patch	2.55 KB	alexpott
#29	9.3.x: PHP 8.0 & MySQL 5.7 5 pass, 2 fail
#26	3214565-26-repeat-100.patch	2.52 KB	longwave
#26	9.3.x: PHP 8.0 & MySQL 5.7 5 pass, 2 fail

Issue fork drupal-3214565

Show commands

Start within a Git clone of the project using the version control instructions.

Add & fetch this issue fork’s repository

Or, if you do not have SSH keys set up on git.drupalcode.org:

Add & fetch this issue fork’s repository

1 hidden branch

3214565-random-test-failure

changes, plain diff MR !687

Check out this branch

About issue forks

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Comment #1

18 May 2021 at 13:21

Spokje created an issue. See original summary.

Comment #2

Lendude

Dutch

Amsterdam

CreditAttribution: Lendude at iO commented 19 May 2021 at 06:20

Looking at \Drupal\BuildTests\Framework\BuildTestBase::instantiateServer, the test waits for the server to be up for 1 second (1000 iterations of waiting 1000 microseconds).

So the obvious thing to try would be to wait a little longer? Maybe 2 seconds?

Comment #3

Spokje

Dutch

CreditAttribution: Spokje commented 19 May 2021 at 07:44

Issue summary:

View changes

Corrected c/p fail in IS.

Comment #4

Spokje

Dutch

CreditAttribution: Spokje commented 19 May 2021 at 08:02

Status:

Active

» Needs review

@lendude:

Looking at \Drupal\BuildTests\Framework\BuildTestBase::instantiateServer, the test waits for the server to be up for 1 second (1000 iterations of waiting 1000 microseconds).

I would expect that to be 1000 x 1000ms = 16.6666667 minutes, but that's insanely long.

Looking at the CI log:

00:09:37.605 Drupal\BuildTests\Framework\Tests\ExternalCommandRequirement   6 passes                                      
00:09:37.605 Drupal\BuildTests\Composer\ComposerValidateTest               34 passes                                      
00:09:45.656 Drupal\BuildTests\Framework\Tests\BuildTestTest                0 passes             1 exceptions             
00:09:52.985 FATAL Drupal\BuildTests\Framework\Tests\BuildTestTest: test runner returned a non-zero error code (2).
00:09:53.031 Drupal\BuildTests\Framework\Tests\BuildTestTest                0 passes   1 fails                            
00:09:53.031 Drupal\BuildTests\Framework\Tests\HtRouterTest                 1 passes                                      
00:10:03.120 Drupal\BuildTests\Composer\Template\ComposerProjectTemplates   3 passes

It takes around 7 seconds for Drupal\BuildTests\Framework\Tests\BuildTestTest to give up hope and throw the exception after the loop:

    // Wait until the web server has started. It is started if the port is no
    // longer available.
    for ($i = 0; $i < 1000; $i++) {
      if (!$this->checkPortIsAvailable($port)) {
        return $ps;
      }
      usleep(1000);
    }
    throw new \RuntimeException(sprintf("Unable to start the web server.\nERROR OUTPUT:\n%s", $ps->getErrorOutput()));

So the obvious thing to try would be to wait a little longer? Maybe 2 seconds?

Anyway: Since this wait-loop returns with an answer when the port is no longer available ($ps) as soon as possible, I think a longer wait is OK here.

Created an MR for that.

Not really sure how to test/check this, the test-failure is (fairly) rare, found only 2 other documented occurrences so far. (See links in IS).

Running with the theory that this only occurs when TestBot is really busy.

Comment #5

19 May 2021 at 08:03

Spokje opened merge request !687

Comment #6

Lendude

Dutch

Amsterdam

CreditAttribution: Lendude at iO commented 19 May 2021 at 08:21

I would expect that to be 1000 x 1000ms = 16.6666667 minutes

It's in μs not ms (that threw me too the first time I looked at it :))

Comment #7

Spokje

Dutch

CreditAttribution: Spokje commented 19 May 2021 at 08:27

It's in μs not ms (that threw me too the first time I looked at it :))

Life's too short to worry about μs IMHO, but you're absolutely right :P

Comment #8

Spokje

Dutch

CreditAttribution: Spokje commented 22 May 2021 at 07:01

Hit another occurrence: https://www.drupal.org/pift-ci-job/2069592

Comment #9

longwave

he/him

English

CreditAttribution: longwave at Full Fat Things commented 22 May 2021 at 10:44

To me the number of iterations is too high and the wait is far too short. It feels like something like 200 iterations and waiting 10 ms / 10000 microseconds between each one would be plenty while avoiding hammering the port to check if it's open yet.

Comment #10

Spokje

Dutch

CreditAttribution: Spokje commented 23 May 2021 at 06:49

Another bites the dust, this time in the 9.3.x-dev test-on-commit: https://www.drupal.org/pift-ci-job/2069801

Comment #11

Spokje

Dutch

CreditAttribution: Spokje commented 23 May 2021 at 07:23

To me the number of iterations is too high and the wait is far too short. It feels like something like 200 iterations and waiting 10 ms / 10000 microseconds between each one would be plenty while avoiding hammering the port to check if it's open yet.

Thus spoketh @longwave in #9

Can't argue with that!
Updated MR.

Comment #12

longwave

he/him

English

CreditAttribution: longwave at Full Fat Things commented 23 May 2021 at 10:10

Status:

Needs review

» Reviewed & tested by the community

Assuming this is the problem, hopefully these changes will make it go away.

Comment #13

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott at Thunder commented 24 May 2021 at 09:33

Status:

Reviewed & tested by the community

» Needs work

This test seems flawed. The more build tests we have the more likely it is to fail. The gap between finding an available port and then instantiating a server leaves a gap. This introduces a race which is common source for random fails. I'm not sure that waiting for longer will reduce random fails. I think we need to implement a locking mechanism for port numbers.

But before we do anything we need to improve the reporting I think we should change the exception to:

throw new \RuntimeException(sprintf("Unable to start the web server.\nOUTPUT:\n%s\n\nERROR OUTPUT:\n%s", $ps->getOutput(), $ps->getErrorOutput()));

Because it seems as though there is nothing in the error output that is telling us what is going on so maybe there is something in the regular output.

Comment #14

Spokje

Dutch

CreditAttribution: Spokje commented 24 May 2021 at 09:55

Status:	Needs work	» Needs review
Issue tags:		+Needs followup

Added Needs followup tag for the suggested locking mechanism. (Not creating one right now since the improved Exception output might give us a clue to do something else instead).
Reverted changes made until now.
Updated MR with the suggested improved Exception output.

Comment #15

Spokje

Dutch

CreditAttribution: Spokje commented 24 May 2021 at 09:59

Issue tags:

+Bug Smash Initiative

Comment #16

longwave

he/him

English

CreditAttribution: longwave at Full Fat Things commented 24 May 2021 at 10:56

It's possible the exit code might also tell us something if there is no output? Wonder if we should just call assertCommandExitCode() instead which would print both output streams and the exit code.

edit: oh, but there might not be an exit code either, if it successfully started in the background?

Comment #17

Spokje

Dutch

CreditAttribution: Spokje commented 24 May 2021 at 11:08

It's possible the exit code might also tell us something if there is no output? Wonder if we should just call assertCommandExitCode() instead which would print both output streams and the exit code.

Thus spoketh @longwave in #16.

Me like!
Changed MR accordingly

Comment #18

longwave

he/him

English

CreditAttribution: longwave at Full Fat Things commented 26 May 2021 at 08:27

Status:

Needs review

» Reviewed & tested by the community

This will improve the error reporting so let's see how this gets on.

@alexpott there is already a locking mechanism that is supposed to cover multiple processes in BuildTestBase::findAvailablePort(), what more do we need?

Comment #19

Spokje

Dutch

CreditAttribution: Spokje commented 26 May 2021 at 13:25

Another one: https://www.drupal.org/pift-ci-job/2073236

Comment #20

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott at Thunder commented 27 May 2021 at 08:01

Status:

Reviewed & tested by the community

» Needs work

@Spokje @longwave I'm not convinced that replacing the exception with assertCommandExitCode() is correct. The process is being started in the background and will not have exited yet. Can we change back to a version that adds the normal output to the exception as before.

Comment #21

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott at Thunder commented 27 May 2021 at 08:07

@longwave ah yeah I see we have a locking mechanism... so it's not a race or we've implemented it incorrectly. FWIW it seems I implementing the locking - lolz - #3086179: Fix spurious failures attempting to allocate a web server in the build tests

Comment #22

Spokje

Dutch

CreditAttribution: Spokje commented 27 May 2021 at 08:17

Status:

Needs work

» Needs review

Sure, back to RuntimeException with errorOutput.

Comment #23

Spokje

Dutch

CreditAttribution: Spokje commented 27 May 2021 at 08:29

Issue tags:

-Needs followup

Added Needs followup tag for the suggested locking mechanism. (Not creating one right now since the improved Exception output might give us a clue to do something else instead).

@alexpott there is already a locking mechanism that is supposed to cover multiple processes in BuildTestBase::findAvailablePort(), what more do we need?

@longwave ah yeah I see we have a locking mechanism... so it's not a race or we've implemented it incorrectly.

Removed Needs followup tag accordingly.

Comment #24

longwave

he/him

English

CreditAttribution: longwave at Full Fat Things commented 27 May 2021 at 11:23

I still think the exit code might be useful, for example if it's crashing due to signal 11 or something like that there might be no output.

Comment #25

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott at Thunder commented 27 May 2021 at 11:25

Then let's add that to the exception - the problem with using the assert is that we'll only get the output if the exit status is not 0 and that's an assumption. And the point here is that regardless of the exit status we need to throw an exception.

Comment #26

longwave

he/him

English

CreditAttribution: longwave at Full Fat Things commented 27 May 2021 at 11:38

File	Size
3214565-26-repeat-100.patch	2.52 KB
9.3.x: PHP 8.0 & MySQL 5.7 5 pass, 2 fail

Let's see if we can trigger this with repeat. This might not work if it's a concurrency issue with other tests.

Comment #27

27 May 2021 at 11:50

Status:

Needs review

» Needs work

The last submitted patch, 26: 3214565-26-repeat-100.patch, failed testing. View results

Comment #28

Spokje

Dutch

CreditAttribution: Spokje commented 27 May 2021 at 11:53

Status:

Needs work

» Needs review

> Let's see if we can trigger this with repeat. This might not work if it's a concurrency issue with other tests.

My money is on the concurrency, so no dice with the repeat-test, but would love to be wrong.

Updated the MR to include the exit-code.

EDIT: Right...beaten by 1 minute and a wrong prediction. Not much info sadly :/


There was 1 error:

1) Drupal\BuildTests\Framework\Tests\BuildTestTest::testPortMany
RuntimeException: Unable to start the web server.
CODE: 0
OUTPUT:


ERROR OUTPUT:


/var/www/html/core/tests/Drupal/BuildTests/Framework/BuildTestBase.php:435
/var/www/html/core/tests/Drupal/BuildTests/Framework/Tests/BuildTestTest.php:145
/var/www/html/vendor/phpunit/phpunit/src/Framework/TestResult.php:722
/var/www/html/vendor/phpunit/phpunit/src/Framework/TestSuite.php:677
/var/www/html/vendor/phpunit/phpunit/src/TextUI/TestRunner.php:667
/var/www/html/vendor/phpunit/phpunit/src/TextUI/Command.php:143
/var/www/html/vendor/phpunit/phpunit/src/TextUI/Command.php:96

Comment #29

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott at Thunder commented 27 May 2021 at 12:37

File	Size
3214565-29-repeat-50.patch	2.55 KB
9.3.x: PHP 8.0 & MySQL 5.7 5 pass, 2 fail

Let's get more info... maybe it is always the same port....

Comment #30

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott at Thunder commented 27 May 2021 at 12:39

File	Size
3214565-30-repeat-50.patch	2.58 KB
9.3.x: PHP 8.0 & MySQL 5.7 5 pass, 2 fail

Moar info...

Comment #31

Spokje

Dutch

CreditAttribution: Spokje commented 27 May 2021 at 12:43

If it ain't consistently broke: Break it consistently! 😈

Comment #32

27 May 2021 at 12:44

The last submitted patch, 29: 3214565-29-repeat-50.patch, failed testing. View results

Comment #33

27 May 2021 at 12:47

Status:

Needs review

» Needs work

The last submitted patch, 30: 3214565-30-repeat-50.patch, failed testing. View results

Comment #34

longwave

he/him

English

CreditAttribution: longwave at Full Fat Things commented 27 May 2021 at 13:16

Status:

Needs work

» Needs review

File	Size
3214565-34.patch	2.86 KB
9.3.x: PHP 8.0 & MySQL 5.7 5 pass

Let's see what happens if we wait a bit longer and don't retry as quickly.

Comment #35

longwave

he/him

English

CreditAttribution: longwave at Full Fat Things commented 27 May 2021 at 13:30

A lucky pass, or is this the fix? Let's try that again.

Comment #36

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott at Thunder commented 27 May 2021 at 13:36

@longwave might be worth upping the repeats to 200... looking good so far... today we learn that sometimes it takes longer than 1 second to start a PHP development webserver on Drupal CI.

Comment #37

longwave

he/him

English

CreditAttribution: longwave at Full Fat Things commented 27 May 2021 at 14:52

File	Size
3214565-37.patch	2.86 KB
9.3.x: PHP 8.0 & MySQL 5.7 CI aborted
3214565-37.patch	2.9 KB
9.3.x: PHP 8.0 & MySQL 5.7 5 pass, 1 fail

I realised the sleep and check are the wrong way round, the way we currently do it you always need at least one retry, and there is no point waiting after the last retry, so I swapped those over.

I also wanted to know how long we actually need - ie. whether we ever come close to 2 seconds - so I've temporarily made it fail with statistics if it needs more than half a second.

Comment #38

27 May 2021 at 15:09

Status:

Needs review

» Needs work

The last submitted patch, 37: 3214565-37.patch, failed testing. View results

Comment #39

longwave

he/him

English

CreditAttribution: longwave at Full Fat Things commented 27 May 2021 at 15:32

$ curl -s https://www.drupal.org/pift-ci-job/2074455|grep "Failed asserting"|sort -k4 -n|uniq -c
     29 Failed asserting that 5 is less than 5.
     69 Failed asserting that 6 is less than 5.
     56 Failed asserting that 7 is less than 5.
     30 Failed asserting that 8 is less than 5.
     19 Failed asserting that 9 is less than 5.
     12 Failed asserting that 10 is less than 5.
      2 Failed asserting that 11 is less than 5.
      3 Failed asserting that 12 is less than 5.
      1 Failed asserting that 18 is less than 5.
      1 Failed asserting that 19 is less than 5.

So in an extreme case here we needed 19 iterations or 1.9 seconds, therefore I suggest bumping the max time up to at least 3 seconds if not 5 seconds to avoid running into this again any time soon. This is only the *maximum* time, and as we can see in most cases it takes far less than a second.

Comment #40

longwave

he/him

English

CreditAttribution: longwave at Full Fat Things commented 27 May 2021 at 15:49

Status:

Needs work

» Needs review

File	Size
3214565-40.patch	1.06 KB
9.3.x: PHP 8.0 & MySQL 5.7 28,548 pass

6 files were hidden/shown/deleted

File	Size
3214565-26-repeat-100.patch	2.52 KB
9.3.x: PHP 8.0 & MySQL 5.7 5 pass, 2 fail
3214565-29-repeat-50.patch	2.55 KB
9.3.x: PHP 8.0 & MySQL 5.7 5 pass, 2 fail
3214565-30-repeat-50.patch	2.58 KB
9.3.x: PHP 8.0 & MySQL 5.7 5 pass, 2 fail
3214565-34.patch	2.86 KB
9.3.x: PHP 8.0 & MySQL 5.7 5 pass
3214565-37.patch	2.86 KB
9.3.x: PHP 8.0 & MySQL 5.7 CI aborted
3214565-37.patch	2.9 KB
9.3.x: PHP 8.0 & MySQL 5.7 5 pass, 1 fail

50 iterations of 100ms each, I also kept the improved error messaging on failure in case we see any other weirdness from this.

Comment #41

Spokje

Dutch

CreditAttribution: Spokje commented 27 May 2021 at 16:17

(Slightly off-topic) I like this insight into the minds of The Big Brains, kinda like watching a live debug stream (which I never watched, no clue if it even exists). Putting it into the d.o queue format makes it easy to hop by every now and then and see where we stand.
Ponders turning this into a big-selling YouTube channel and finally become filthy rich...

Comment #42

27 May 2021 at 16:30

Spokje closed merge request !687

Comment #43

Spokje

Dutch

CreditAttribution: Spokje commented 27 May 2021 at 17:20

Status:

Needs review

» Reviewed & tested by the community

- Patch makes TestBot happily green
- Moving usleep to go after the checkPortIsAvailable($port) check is explained in #37 in a way that even I understand.
- The increase of usleep-value and decrease of the loop-value is test-proven in #39.
- Improved logging in RuntimeException message per as (initially) requested by @alexpott in #13.

RTBC for me.

Comment #44

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott at Thunder commented 27 May 2021 at 20:04

Version:	9.3.x-dev	» 9.2.x-dev
Status:	Reviewed & tested by the community	» Fixed

Committed and pushed 2c709ec12f to 9.3.x and fbcaae7696 to 9.2.x. Thanks!

I'll backport to 9.1.x is a release manager agrees.

Comment #45

27 May 2021 at 20:05

alexpott committed 2c709ec on 9.3.x

Issue #3214565 by Spokje, longwave, alexpott, Lendude: [random test...

Comment #46

27 May 2021 at 20:05

alexpott committed fbcaae7 on 9.2.x

Issue #3214565 by Spokje, longwave, alexpott, Lendude: [random test...

Comment #47

10 June 2021 at 20:09

Status:

Fixed

» Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.