Last night, 8.0.x HEAD failed on bot 2813 with many failures like:

* Drupal\action\Tests\ActionUninstallTest (0 pass(es), 1 fail(s), and 6 exception(s))
- [fail] [Completion check] "The test did not complete due to a fatal error." in ActionUninstallTest.php on line 30 of Drupal\action\Tests\ActionUninstallTest->testActionUninstall().
- [exception] [Warning] "copy(/var/lib/drupaltestbot/sites/default/files/checkout/sites/simpletest/376941/settings.php): failed to open stream: No such file or directory" copy('/var/lib/drupaltestbot/sites/default/files/checkout/sites/default/default.settings.php', '/var/lib/drupaltestbot/sites/default/files/checkout/sites/simpletest/376941/settings.php')

This sounds like an issue with the bot, but it has since passed other patches on contrib projects, so I tried to retest the branch as I normally do at:
https://qa.drupal.org/pifr/test/833633/tools
Instead of re-queueing the test, I get the status message:

Unable to re-queue test due to a pending cancellation request. Test will be automatically re-queued upon execution of the cancellation request.

I also used to have an option to cancel testing in the same UI; that option is no longer there.

The test queue is also completely empty, which seems highly suspect:
https://qa.drupal.org/pifr/queued

Filing as critical since this pretty much brings core development to a standstill.

Comments

xjm

I'm going to try pushing a commit and see if that unsticks things.

xjm

Priority: Critical » Major

Pushing the commit requeued the branch on bot 2948 (phew), so I'm downgrading to major but leaving this open until the test run completes. If it happens again and we don't have a patch handy to commit and a committer online to commit it, things could be stuck for more than a couple of hours. It would be great to know what happened.

alexpott

I saw this behaviour as well. Before @xjm pushed a commit I couldn't get core to retest.

xjm

Priority: Major » Critical

This is happening again today. HEAD failed with messages like:

GET http://ec2-52-5-227-253.compute-1.amazonaws.com/checkout/admin/config/se... returned 0 (0 bytes).

And I am prevented from retesting the branch.

xjm

Priority: Critical » Major

@alexpott pushed a commit again and that unstuck things again. isntall also reset the non-active bots.

Keeping this open again in case it recurs.

alexpott

I just successfully pressed retest.

jthorson

The code, as written, assumes that the testbot server will always be available and pinging back to qa.d.o regularly.

Because qa.d.o doesn't initiate any communication of its own, we depend on testbot-initiated communications to signal a testbot to cancel a test (hacked into the response side of a testbot-initiated status check).

Since the move over to AWS, we have testbots which will suddenly 'disappear' based on spot instance pricing ... If someone attempts to 'cancel' a test which is in progress on one of these testbots, qa.d.o waits for the next time that testbot calls home to signal a cancellation of that test. Because the testbot never calls home, the process gets hung up.
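To make the failure mode concrete, here is a minimal Python sketch of the behaviour described above. It is a toy model, not the actual PIFR code: the class, method names, and data structures are all hypothetical, and the test and bot numbers are borrowed from this issue purely for illustration.

```python
class QaServer:
    """Toy model of qa.d.o: it never contacts a testbot, it only answers check-ins."""

    def __init__(self):
        self.running = {}            # test_id -> bot_id currently running it
        self.pending_cancel = set()  # test_ids waiting for their bot to call home

    def request_cancel(self, test_id):
        # The cancellation can only be delivered on the bot's next check-in.
        self.pending_cancel.add(test_id)

    def request_retest(self, test_id):
        # A retest is refused while a cancellation is still waiting for delivery.
        if test_id in self.pending_cancel:
            return "Unable to re-queue test due to a pending cancellation request."
        return "re-queued"

    def heartbeat(self, bot_id):
        """Called by a testbot; the response carries any cancellations for it."""
        to_cancel = [t for t, b in self.running.items()
                     if b == bot_id and t in self.pending_cancel]
        for t in to_cancel:
            self.pending_cancel.discard(t)
            del self.running[t]
        return to_cancel


server = QaServer()
server.running[833633] = 2813     # illustrative pairing of test and bot
server.request_cancel(833633)
# If bot 2813 was a spot instance that has been terminated, heartbeat(2813)
# is never called, so the pending entry is never cleared.
blocked = server.request_retest(833633)
```

In this model the only code path that removes an entry from `pending_cancel` is `heartbeat()`, which is why a testbot that disappears leaves the test stuck.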

Right now, the manual workaround for this scenario is to update the test in the database, or delete the associated Drupal variable on the server (pifr_cancelled_tests? pifr_pending_cancellations? Something like that.)

Probably the best resolution is another routine on qa.d.o that periodically clears any tests which were added to the pending cancellation variable but are still there after a few minutes, since the testbots should (in theory) be calling back home with a status heartbeat every minute or so. A qa.d.o clear_cancelled_tests Jenkins job would be beneficial for administrators as well.
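A rough Python sketch of such a cleanup routine follows. It assumes the pending cancellations are stored as a mapping from test ID to the time the cancellation was requested. The real storage format on qa.d.o may differ, and the function name and five-minute threshold are hypothetical.

```python
import time

STALE_AFTER_SECONDS = 5 * 60  # "a few minutes"; hypothetical threshold


def clear_stale_cancellations(pending, now=None, max_age=STALE_AFTER_SECONDS):
    """Split pending cancellations into (kept, cleared).

    `pending` maps test_id -> Unix timestamp at which the cancellation was
    requested (an assumed format). Entries older than `max_age` seconds are
    treated as undeliverable, because a live testbot should have checked in
    long before then.
    """
    now = time.time() if now is None else now
    kept, cleared = {}, []
    for test_id, requested_at in pending.items():
        if now - requested_at > max_age:
            cleared.append(test_id)
        else:
            kept[test_id] = requested_at
    return kept, cleared


# Example with a fixed clock: test 1 was requested 1000 s ago, test 2 100 s ago.
kept, cleared = clear_stale_cancellations({1: 0, 2: 900}, now=1000)
```

Run on a schedule (for example from the Jenkins job suggested above), the cleared test IDs could then be re-queued automatically, so nobody would need to edit the database by hand.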

xjm

This is happening again now; HEAD just failed with:

FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] Setup environment: failed to create checkout database.

on bot 2928.

I tried to request a retest but got the same pending-cancellation message as before.

xjm

More detail on what happened prior to the error:

12797198	Result retrieved by project client.	17 min 48 sec ago
12797183	Result received from test client #2928.	18 min 39 sec ago
12797178	Requested by test client #2928.	18 min 46 sec ago
12797173	Test reset by client request.	18 min 46 sec ago
12797168	Requested by test client #2928.	18 min 50 sec ago
12797098	Test request received.	39 min 49 sec ago

xjm

Per @jthorson:

fix is "drush vdel pifr_server_cancel_tests" on qa.d.o

Mixologic

Status: Active » Closed (outdated)