Problem/Motivation

Test failed today in Drupal\BuildTests\Framework\BuildTestBase::instantiateServer in the layout_builder module:
https://www.drupal.org/pift-ci-job/2065061

In case that goes away, the error says:

	Build.Drupal\BuildTests\Framework\Tests\BuildTestTest
	✗	
Drupal\BuildTests\Framework\Tests\BuildTestTest

exception: [Other] Line 0 of sites/default/files/simpletest/phpunit-2.xml:
PHPUnit Test failed to complete; Error: PHPUnit 9.5.4 by Sebastian Bergmann and contributors.

Warning:       Your XML configuration validates against a deprecated schema.
Suggestion:    Migrate your XML configuration using "--migrate-configuration"!

Testing Drupal\BuildTests\Framework\Tests\BuildTestTest
...E.                                                               5 / 5 (100%)

Time: 00:15.392, Memory: 14.50 MB

There was 1 error:

1) Drupal\BuildTests\Framework\Tests\BuildTestTest::testPortMany
RuntimeException: Unable to start the web server.
ERROR OUTPUT:


/var/www/html/core/tests/Drupal/BuildTests/Framework/BuildTestBase.php:435
/var/www/html/core/tests/Drupal/BuildTests/Framework/Tests/BuildTestTest.php:145
/var/www/html/vendor/phpunit/phpunit/src/Framework/TestResult.php:722
/var/www/html/vendor/phpunit/phpunit/src/Framework/TestSuite.php:677
/var/www/html/vendor/phpunit/phpunit/src/TextUI/TestRunner.php:667
/var/www/html/vendor/phpunit/phpunit/src/TextUI/Command.php:143
/var/www/html/vendor/phpunit/phpunit/src/TextUI/Command.php:96

ERRORS!
Tests: 5, Assertions: 18, Errors: 1.

Seems a rare one, have found mentions of it in:
#1464244-55: Rewrite as URL adding equals sign to end of url.
#2350711-41: Update module fails on installation

Steps to reproduce

Proposed resolution

Remaining tasks

User interface changes

API changes

Data model changes

Release notes snippet

Issue fork drupal-3214565

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Spokje created an issue. See original summary.

Lendude’s picture

Looking at \Drupal\BuildTests\Framework\BuildTestBase::instantiateServer, the test waits for the server to be up for 1 second (1000 iterations of waiting 1000 microseconds).

So the obvious thing to try would be to wait a little longer? Maybe 2 seconds?

Spokje’s picture

Issue summary: View changes

Corrected c/p fail in IS.

Spokje’s picture

Status: Active » Needs review

@lendude:

Looking at \Drupal\BuildTests\Framework\BuildTestBase::instantiateServer, the test waits for the server to be up for 1 second (1000 iterations of waiting 1000 microseconds).

I would expect that to be 1000 x 1000ms = 16.6666667 minutes, but that's insanely long.

Looking at the CI log:

00:09:37.605 Drupal\BuildTests\Framework\Tests\ExternalCommandRequirement   6 passes                                      
00:09:37.605 Drupal\BuildTests\Composer\ComposerValidateTest               34 passes                                      
00:09:45.656 Drupal\BuildTests\Framework\Tests\BuildTestTest                0 passes             1 exceptions             
00:09:52.985 FATAL Drupal\BuildTests\Framework\Tests\BuildTestTest: test runner returned a non-zero error code (2).
00:09:53.031 Drupal\BuildTests\Framework\Tests\BuildTestTest                0 passes   1 fails                            
00:09:53.031 Drupal\BuildTests\Framework\Tests\HtRouterTest                 1 passes                                      
00:10:03.120 Drupal\BuildTests\Composer\Template\ComposerProjectTemplates   3 passes        

It takes around 7 seconds for Drupal\BuildTests\Framework\Tests\BuildTestTest to give up hope and throw the exception after the loop:

    // Wait until the web server has started. It is started if the port is no
    // longer available.
    for ($i = 0; $i < 1000; $i++) {
      if (!$this->checkPortIsAvailable($port)) {
        return $ps;
      }
      usleep(1000);
    }
    throw new \RuntimeException(sprintf("Unable to start the web server.\nERROR OUTPUT:\n%s", $ps->getErrorOutput()));
So the obvious thing to try would be to wait a little longer? Maybe 2 seconds?

Anyway: Since this wait-loop returns with an answer when the port is no longer available ($ps) as soon as possible, I think a longer wait is OK here.

Created an MR for that.

Not really sure how to test/check this, the test-failure is (fairly) rare, found only 2 other documented occurrences so far. (See links in IS).

Running with the theory that this only occurs when TestBot is really busy.

Lendude’s picture

I would expect that to be 1000 x 1000ms = 16.6666667 minutes

It's in μs not ms (that threw me too the first time I looked at it :))

Spokje’s picture

It's in μs not ms (that threw me too the first time I looked at it :))

Life's too short to worry about μs IMHO, but you're absolutely right :P

Spokje’s picture

longwave’s picture

To me the number of iterations is too high and the wait is far too short. It feels like something like 200 iterations and waiting 10 ms / 10000 microseconds between each one would be plenty while avoiding hammering the port to check if it's open yet.

Spokje’s picture

Another bites the dust, this time in the 9.3.x-dev test-on-commit: https://www.drupal.org/pift-ci-job/2069801

Spokje’s picture

To me the number of iterations is too high and the wait is far too short. It feels like something like 200 iterations and waiting 10 ms / 10000 microseconds between each one would be plenty while avoiding hammering the port to check if it's open yet.

Thus spoketh @longwave in #9

Can't argue with that!
Updated MR.

longwave’s picture

Status: Needs review » Reviewed & tested by the community

Assuming this is the problem, hopefully these changes will make it go away.

alexpott’s picture

Status: Reviewed & tested by the community » Needs work

This test seems flawed. The more build tests we have the more likely it is to fail. The gap between finding an available port and then instantiating a server leaves a gap. This introduces a race which is common source for random fails. I'm not sure that waiting for longer will reduce random fails. I think we need to implement a locking mechanism for port numbers.

But before we do anything we need to improve the reporting I think we should change the exception to:

throw new \RuntimeException(sprintf("Unable to start the web server.\nOUTPUT:\n%s\n\nERROR OUTPUT:\n%s", $ps->getOutput(), $ps->getErrorOutput()));

Because it seems as though there is nothing in the error output that is telling us what is going on so maybe there is something in the regular output.

Spokje’s picture

Status: Needs work » Needs review
Issue tags: +Needs followup
  • Added Needs followup tag for the suggested locking mechanism. (Not creating one right now since the improved Exception output might give us a clue to do something else instead).
  • Reverted changes made until now.
  • Updated MR with the suggested improved Exception output.
Spokje’s picture

Issue tags: +Bug Smash Initiative
longwave’s picture

It's possible the exit code might also tell us something if there is no output? Wonder if we should just call assertCommandExitCode() instead which would print both output streams and the exit code.

edit: oh, but there might not be an exit code either, if it successfully started in the background?

Spokje’s picture

It's possible the exit code might also tell us something if there is no output? Wonder if we should just call assertCommandExitCode() instead which would print both output streams and the exit code.

Thus spoketh @longwave in #16.

Me like!
Changed MR accordingly

longwave’s picture

Status: Needs review » Reviewed & tested by the community

This will improve the error reporting so let's see how this gets on.

@alexpott there is already a locking mechanism that is supposed to cover multiple processes in BuildTestBase::findAvailablePort(), what more do we need?

Spokje’s picture

alexpott’s picture

Status: Reviewed & tested by the community » Needs work

@Spokje @longwave I'm not convinced that replacing the exception with assertCommandExitCode() is correct. The process is being started in the background and will not have exited yet. Can we change back to a version that adds the normal output to the exception as before.

alexpott’s picture

@longwave ah yeah I see we have a locking mechanism... so it's not a race or we've implemented it incorrectly. FWIW it seems I implementing the locking - lolz - #3086179: Fix spurious failures attempting to allocate a web server in the build tests

Spokje’s picture

Status: Needs work » Needs review

Sure, back to RuntimeException with errorOutput.

Spokje’s picture

Issue tags: -Needs followup

Added Needs followup tag for the suggested locking mechanism. (Not creating one right now since the improved Exception output might give us a clue to do something else instead).

@alexpott there is already a locking mechanism that is supposed to cover multiple processes in BuildTestBase::findAvailablePort(), what more do we need?

@longwave ah yeah I see we have a locking mechanism... so it's not a race or we've implemented it incorrectly.

Removed Needs followup tag accordingly.

longwave’s picture

I still think the exit code might be useful, for example if it's crashing due to signal 11 or something like that there might be no output.

alexpott’s picture

Then let's add that to the exception - the problem with using the assert is that we'll only get the output if the exit status is not 0 and that's an assumption. And the point here is that regardless of the exit status we need to throw an exception.

longwave’s picture

Let's see if we can trigger this with repeat. This might not work if it's a concurrency issue with other tests.

Status: Needs review » Needs work

The last submitted patch, 26: 3214565-26-repeat-100.patch, failed testing. View results

Spokje’s picture

Status: Needs work » Needs review

> Let's see if we can trigger this with repeat. This might not work if it's a concurrency issue with other tests.

My money is on the concurrency, so no dice with the repeat-test, but would love to be wrong.

Updated the MR to include the exit-code.

EDIT: Right...beaten by 1 minute and a wrong prediction. Not much info sadly :/


There was 1 error:

1) Drupal\BuildTests\Framework\Tests\BuildTestTest::testPortMany
RuntimeException: Unable to start the web server.
CODE: 0
OUTPUT:


ERROR OUTPUT:


/var/www/html/core/tests/Drupal/BuildTests/Framework/BuildTestBase.php:435
/var/www/html/core/tests/Drupal/BuildTests/Framework/Tests/BuildTestTest.php:145
/var/www/html/vendor/phpunit/phpunit/src/Framework/TestResult.php:722
/var/www/html/vendor/phpunit/phpunit/src/Framework/TestSuite.php:677
/var/www/html/vendor/phpunit/phpunit/src/TextUI/TestRunner.php:667
/var/www/html/vendor/phpunit/phpunit/src/TextUI/Command.php:143
/var/www/html/vendor/phpunit/phpunit/src/TextUI/Command.php:96
alexpott’s picture

Let's get more info... maybe it is always the same port....

alexpott’s picture

Spokje’s picture

If it ain't consistently broke: Break it consistently! 😈

The last submitted patch, 29: 3214565-29-repeat-50.patch, failed testing. View results

Status: Needs review » Needs work

The last submitted patch, 30: 3214565-30-repeat-50.patch, failed testing. View results

longwave’s picture

Status: Needs work » Needs review
FileSize
2.86 KB

Let's see what happens if we wait a bit longer and don't retry as quickly.

longwave’s picture

A lucky pass, or is this the fix? Let's try that again.

alexpott’s picture

@longwave might be worth upping the repeats to 200... looking good so far... today we learn that sometimes it takes longer than 1 second to start a PHP development webserver on Drupal CI.

longwave’s picture

I realised the sleep and check are the wrong way round, the way we currently do it you always need at least one retry, and there is no point waiting after the last retry, so I swapped those over.

I also wanted to know how long we actually need - ie. whether we ever come close to 2 seconds - so I've temporarily made it fail with statistics if it needs more than half a second.

Status: Needs review » Needs work

The last submitted patch, 37: 3214565-37.patch, failed testing. View results

longwave’s picture

$ curl -s https://www.drupal.org/pift-ci-job/2074455|grep "Failed asserting"|sort -k4 -n|uniq -c
     29 Failed asserting that 5 is less than 5.
     69 Failed asserting that 6 is less than 5.
     56 Failed asserting that 7 is less than 5.
     30 Failed asserting that 8 is less than 5.
     19 Failed asserting that 9 is less than 5.
     12 Failed asserting that 10 is less than 5.
      2 Failed asserting that 11 is less than 5.
      3 Failed asserting that 12 is less than 5.
      1 Failed asserting that 18 is less than 5.
      1 Failed asserting that 19 is less than 5.

So in an extreme case here we needed 19 iterations or 1.9 seconds, therefore I suggest bumping the max time up to at least 3 seconds if not 5 seconds to avoid running into this again any time soon. This is only the *maximum* time, and as we can see in most cases it takes far less than a second.

longwave’s picture

50 iterations of 100ms each, I also kept the improved error messaging on failure in case we see any other weirdness from this.

Spokje’s picture

(Slightly off-topic) I like this insight into the minds of The Big Brains, kinda like watching a live debug stream (which I never watched, no clue if it even exists). Putting it into the d.o queue format makes it easy to hop by every now and then and see where we stand.
Ponders turning this into a big-selling YouTube channel and finally become filthy rich...

Spokje’s picture

Status: Needs review » Reviewed & tested by the community

- Patch makes TestBot happily green
- Moving usleep to go after the checkPortIsAvailable($port) check is explained in #37 in a way that even I understand.
- The increase of usleep-value and decrease of the loop-value is test-proven in #39.
- Improved logging in RuntimeException message per as (initially) requested by @alexpott in #13.

RTBC for me.

alexpott’s picture

Version: 9.3.x-dev » 9.2.x-dev
Status: Reviewed & tested by the community » Fixed

Committed and pushed 2c709ec12f to 9.3.x and fbcaae7696 to 9.2.x. Thanks!

I'll backport to 9.1.x is a release manager agrees.

  • alexpott committed 2c709ec on 9.3.x
    Issue #3214565 by Spokje, longwave, alexpott, Lendude: [random test...

  • alexpott committed fbcaae7 on 9.2.x
    Issue #3214565 by Spokje, longwave, alexpott, Lendude: [random test...

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.