Starting to see a new random fail in Drupal\Tests\Core\Command\QuickStartTest

https://dispatcher.drupalci.org/job/drupal_patches/57302/testReport/juni...
https://dispatcher.drupalci.org/job/drupal_patches/57504/testReport/juni...
https://dispatcher.drupalci.org/job/drupal_patches/58062/testReport/juni...
https://dispatcher.drupalci.org/job/drupal_patches/58466/testReport/juni...
https://dispatcher.drupalci.org/job/drupal_patches/58522/testReport/juni...
https://dispatcher.drupalci.org/job/drupal_patches/59593/testReport/juni...

It seems as if the test is trying to access the local server before its actually up?

Its not super frequent, but often enough that I've gotten a couple of reports asking if this was the patch or drupalCI causing the fail.

Additional info

This is what the terminal output looks like when it is completely up:

Drupal development server started: <http://127.0.0.1:8888>
This server is not meant for production use.
One time login url: <http://127.0.0.1:8888/user/reset/1/1636322519/UGLY_HASH/login>
Press Ctrl-C to quit the Drupal development server.

Actual code is waiting for a text which matches this regexp pattern: 127.0.0.1:(\d+) This kind of text appears twice in the output:
1. LINE: Drupal development server started
2. LINE: One time login url

(Have you noticed that the dots are not escaped in the IP address? It should be like this: 127\.0\.0\.1:(\d+))

When it breaks out from the while loop the "One time login url" is not there yet. (at least no all the time) But the code tries to find it. Like this: $this->assertStringContainsString("127.0.0.1:$port/user/reset/1/", $server_process->getOutput());

That is why it randomly fails. Very likely it depends on how busy the server is.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Mixologic created an issue. See original summary.

Mixologic’s picture

Anonymous’s picture

Maybe instead of:

// Give the server a couple of seconds to be ready.
sleep(2);

Will be better send guzzle request until code === 500. Though I'm not sure. Let's see if we can reproduce it on 2000 runs.

Anonymous’s picture

It seems there is some leak of resources:

21:02:55 ERROR: Connection was broken: java.io.IOException: Unexpected termination of the channel
21:02:55 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:73)
21:02:55 Caused by: java.io.EOFException
21:02:55 	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2335)
21:02:55 	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2804)
21:02:55 	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:802)
21:02:55 	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
21:02:55 	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
21:02:55 	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
21:02:55 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:59)

I'm trying to cancel testing, but it starts anew: https://dispatcher.drupalci.org/job/drupal_patches/59628/console

Edit:
Re-testing also finished with FAILURE.

21:22:32 Sun May 27 21:22:32 2018 (29931): Fatal Error Insufficient shared memory!
...
Anonymous’s picture

Alarm! It is necessary to somehow stop #3.

Also 300 runs enough for an fails. Let's also check APCu with TTL=1 (favorite among resource thieves).

Mixologic’s picture

2000 tests all starting a local php webserver would probably hit some sort of failure to clean up issue in php.

But in any case theres plenty of evidence in the console logs indicative of this test Failing for sure, so that definitely succeed in showing that the failure condition is reproducable.

Anonymous’s picture

With TTL = 1 all work as always!) Now the patch, which should clearly show that without changing TTL, 300 tests fail (near the end).

Anonymous’s picture

#7 fails, but this does not seem to be the cases from the IS. Let's check without wait server ready.

Anonymous’s picture

borisson_’s picture

I don't understand why the usleep is different to the normal sleep? The guzzle request (#3) sounds like a possible solution for this problem.

Mixologic’s picture

usleep(200000) is only .2 seconds

Not sure why shortening the sleep time helps.

Anonymous’s picture

Because it does not help :)

#3 - #9 I just trying to reproduce the random fails from IS cases. And I did not succeed :(

#3, #7, #8 falling due to other reasons (or the relationship is not obvious).

I agree that the wait via guzzle request will be better than magical sleep(2). But not sure that it helps to fix the IS random fails.

It may be even more reliable to make a more global wait, like in this patch?


Also, the QuickStartTest has stable fail with #2966607: Invalidating 'node_list' and other broad cache tags early in a transaction severely increases lock wait time and probability of deadlock. This patch does not help with it too :(

Status: Needs review » Needs work

The last submitted patch, 12: x1-QuickStartTest-2975644-12.patch, failed testing. View results

Anonymous’s picture

Status: Needs work » Needs review
FileSize
6.51 KB
712 bytes

#13: My bad with check defined constant. But why CI aborted, if $process->setTimeout(500);?

Version: 8.6.x-dev » 8.7.x-dev

Drupal 8.6.0-alpha1 will be released the week of July 16, 2018, which means new developments and disruptive changes should now be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.7.x-dev » 8.8.x-dev

Drupal 8.7.0-alpha1 will be released the week of March 11, 2019, which means new developments and disruptive changes should now be targeted against the 8.8.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Mile23’s picture

All the links to the spurious fails are now 404, so we can't see them. I'm curious if this is still an issue.

Mile23’s picture

Ugh. It just so happens this is the only unit test in the Command group, so...

Edit:

Same deal as above. There was a panic and a bunch of failed tests, and then the test build started over again. I cancelled it because it would probably do the same thing again and again.

In this case, we're running the maximum number of concurrent run-tests processes. Each process is per test class. Each test class runs a process per test function. Each test function forks at least one process.

So it's no surprise that we run out of processes. This is more of a stress test on DrupalCI than a test of quick start.

I didn't see any fails until after the panic, though, so I'm not sure what's the deal here.

I was looking at this because QuickStartTest has an isolation issue. It says require_once for core/includes/boostrap.inc, but it does this in setUp(), which means it's outside the isolation of the separate process. This causes fails in other tests when you run them under PHPUnit.

Mile23’s picture

Some cleanup on the test, reduced to concurrency: 16.

Edit:

Same thing:

08:36:57 Drupal\Tests\Core\Command\QuickStartTest                       5 passes                                      
08:36:57 Drupal\Tests\Core\Command\QuickStartTest                       5 passes                                      
08:36:57 Drupal\Tests\Core\Command\QuickStartTest                       0 passes             1 exceptions             
08:37:16 Drupal\Tests\Core\Command\QuickStartTest                       0 passes             1 exceptions             
08:37:16 FATAL Drupal\Tests\Core\Command\QuickStartTest: test runner returned a non-zero error code (2).
08:37:17 Drupal\Tests\Core\Command\QuickStartTest                       0 passes   1 fails                            
08:37:17 FATAL Drupal\Tests\Core\Command\QuickStartTest: test runner returned a non-zero error code (2).
08:37:17 Drupal\Tests\Core\Command\QuickStartTest                       0 passes   1 fails                            
08:37:17 FATAL: command execution failed
08:37:21 java.io.EOFException
08:37:21 	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2335)
08:37:21 	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2804)
08:37:21 	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:802)
08:37:21 	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
08:37:21 	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
08:37:21 	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
08:37:21 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
08:37:21 Caused: java.io.IOException: Unexpected termination of the channel
08:37:21 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
08:37:21 Caused: java.io.IOException: Backing channel 'Testrunner (sir-399gb73q)' is disconnected.
08:37:21 	at hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:214)
08:37:21 	at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:283)
08:37:21 	at com.sun.proxy.$Proxy67.isAlive(Unknown Source)
08:37:21 	at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:1137)
08:37:21 	at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:1129)
08:37:21 	at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:155)
08:37:21 	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:109)
08:37:21 	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66)
08:37:21 	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
08:37:21 	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:744)
08:37:21 	at hudson.model.Build$BuildExecution.build(Build.java:206)
08:37:21 	at hudson.model.Build$BuildExecution.doRun(Build.java:163)
08:37:21 	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:504)
08:37:21 	at hudson.model.Run.execute(Run.java:1798)
08:37:21 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
08:37:21 	at hudson.model.ResourceController.execute(ResourceController.java:97)
08:37:21 	at hudson.model.Executor.run(Executor.java:429)
08:37:21 FATAL: Unable to delete script file /tmp/jenkins6596459406383322495.sh
08:37:21 java.io.EOFException
08:37:21 	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2335)
08:37:21 	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2804)
08:37:21 	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:802)
08:37:21 	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
08:37:21 	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
08:37:21 	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
08:37:21 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
08:37:21 Caused: java.io.IOException: Unexpected termination of the channel
08:37:21 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
08:37:21 Caused: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on Testrunner (sir-399gb73q) failed. The channel is closing down or has closed down
08:37:21 	at hudson.remoting.Channel.call(Channel.java:948)
08:37:21 	at hudson.FilePath.act(FilePath.java:1036)
08:37:21 	at hudson.FilePath.act(FilePath.java:1025)
08:37:21 	at hudson.FilePath.delete(FilePath.java:1511)
08:37:21 	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:123)
08:37:21 	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66)
08:37:21 	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
08:37:21 	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:744)
08:37:21 	at hudson.model.Build$BuildExecution.build(Build.java:206)
08:37:21 	at hudson.model.Build$BuildExecution.doRun(Build.java:163)
08:37:21 	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:504)
08:37:21 	at hudson.model.Run.execute(Run.java:1798)
08:37:21 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
08:37:21 	at hudson.model.ResourceController.execute(ResourceController.java:97)
08:37:21 	at hudson.model.Executor.run(Executor.java:429)
08:37:21 Build step 'Execute shell' marked build as failure
08:37:21 ERROR: Step ‘Archive the artifacts’ failed: no workspace for drupal_patches #99324
08:37:21 Checking console output
08:37:21 ERROR: Step ‘Publish JUnit test result report’ failed: no workspace for drupal_patches #99324
08:37:21 Finished: FAILURE

Version: 8.8.x-dev » 8.9.x-dev

Drupal 8.8.0-alpha1 will be released the week of October 14th, 2019, which means new developments and disruptive changes should now be targeted against the 8.9.x-dev branch. (Any changes to 8.9.x will also be committed to 9.0.x in preparation for Drupal 9’s release, but some changes like significant feature additions will be deferred to 9.1.x.). For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

neclimdul’s picture

Version: 8.9.x-dev » 9.1.x-dev
FileSize
743 bytes

I'm not sure about all the java errors but I think the problem might be the reverse of what might be expected and shortening the wait loop doesn't help. In fact, it might make the race condition built into the test worse.

This is the output of starting the standalone server:

www-data@055ba299c5ad:/app$ php ./core/scripts/drupal server --suppress-login
Drupal development server started: 
This server is not meant for production use.
One time login url: 
Press Ctrl-C to quit the Drupal development server.

and this is the code I found to fail randomly.

    while ($server_process->isRunning()) {
      if (preg_match('/127.0.0.1:(\d+)/', $server_process->getOutput(), $match)) {
        $port = $match[1];
        break;
      }
      // Wait for more output.
      sleep(1);
    }
    $this->assertEquals('', $server_process->getErrorOutput());
    $this->assertStringContainsString("127.0.0.1:$port/user/reset/1/", $server_process->getOutput());

The problem is that the "wait till server is up" logic only works if the timing is _just_ right or the server is up because it drops out matching the first message coming out of the script. But then the test immediately asserts the second message. There is no guarantee the other process has made it far enough to output the reset link though.

The fix should be pretty straight forward though, wait till we get the final boot message out before continuing with the assertions.

Side note, I found this happened a lot when running tests with the phpunit. I assume something about the way they run gives back the process to phpunit faster triggering the error more often. Possibly why some of the other things mentioned triggered the error more often as well.

Version: 9.1.x-dev » 9.2.x-dev

Drupal 9.1.0-alpha1 will be released the week of October 19, 2020, which means new developments and disruptive changes should now be targeted for the 9.2.x-dev branch. For more information see the Drupal 9 minor version schedule and the Allowed changes during the Drupal 9 release cycle.

Version: 9.2.x-dev » 9.3.x-dev

Drupal 9.2.0-alpha1 will be released the week of May 3, 2021, which means new developments and disruptive changes should now be targeted for the 9.3.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Sweetchuck’s picture

Issue summary: View changes
Sweetchuck’s picture

Sweetchuck’s picture

Issue summary: View changes

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.0-rc1 was released on November 26, 2021, which means new developments and disruptive changes should now be targeted for the 9.4.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

neclimdul’s picture

Status: Needs review » Closed (duplicate)

updated version with some other fixes in #3265291: QuickStartTest: The waiting is the hardest part...

xjm’s picture

Sweetchuck’s picture

@xjm Why did you open another issue for the same problem?

In my experience it happens very often when there is an open issue for years (in this case ~4 years) an somebody opens and new issue and closes the old one as duplicated.