Last night, 8.0.x HEAD failed on bot 2813 with many fails like:
* Drupal\action\Tests\ActionUninstallTest (0 pass(es), 1 fail(s), and 6 exception(s))
- [fail] [Completion check] "The test did not complete due to a fatal error." in ActionUninstallTest.php on line 30 of Drupal\action\Tests\ActionUninstallTest->testActionUninstall().
- [exception] [Warning] "copy(/var/lib/drupaltestbot/sites/default/files/checkout/sites/simpletest/376941/settings.php): failed to open stream: No such file or directory" copy('/var/lib/drupaltestbot/sites/default/files/checkout/sites/default/default.settings.php', '/var/lib/drupaltestbot/sites/default/files/checkout/sites/simpletest/376941/settings.php')
This sounds like an issue with the bot, but it has passed other patches since on contrib projects, so I tried to retest the branch as I normally do at:
https://qa.drupal.org/pifr/test/833633/tools
When I do, I get the status message:
Unable to re-queue test due to a pending cancellation request. Test will be automatically re-queued upon execution of the cancellation request.
I also used to have an option to cancel testing in the same UI; that option is not there.
The test queue is also completely empty, which seems highly suspect:
https://qa.drupal.org/pifr/queued
Filing as critical since this pretty much brings core development to a standstill.
Comments
Comment #1
xjm
I'm going to try pushing a commit and see if that unsticks things.
Comment #2
xjm
Pushing the commit requeued the branch on bot 2948 (phew), so downgrading to major. Leaving this open until the test run completes, though: if it happens again and we don't have a patch handy to commit and a committer online to commit it, things could be stuck for more than a couple of hours. Would be great to know what happened.
Comment #3
alexpott
I saw this behaviour as well. Before @xjm pushed a commit I couldn't get core to retest.
Comment #4
xjm
This is happening again today. HEAD failed with messages like:
And I am prevented from retesting the branch.
Comment #5
xjm
@alexpott pushed a commit again and that unstuck things again. @isntall also reset the non-active bots.
Keeping this open again in case it recurs.
Comment #6
alexpott
I just successfully pressed retest.
Comment #7
jthorson commented
The code, as written, was based on an assumption that the testbot server would always be available and pinging back to qa.d.o regularly.
Because qa.d.o doesn't initiate any communication of its own, we depend on testbot-initiated communications to signal a testbot to cancel a test (hacked into the response side of a testbot-initiated status check).
Since the move over to AWS, we have testbots which will suddenly 'disappear' based on spot instance pricing ... If someone attempts to 'cancel' a test which is in progress on one of these testbots, qa.d.o waits for the next time that testbot calls home to signal a cancellation of that test. Because the testbot never calls home, the process gets hung up.
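As a rough illustration of that flow, here is a toy Python model. It is not the actual PIFR code: the class, method, and attribute names are all made up for the sketch, and only the status message is taken from this issue.

```python
# Hypothetical model of the cancellation flow; none of these names come from PIFR.
class Server:
    def __init__(self):
        self.running = {}                    # test id -> bot id running it
        self.pending_cancellations = set()   # test ids waiting for a bot check-in

    def request_cancel(self, test_id):
        # The server never contacts a bot itself, so it can only record the request.
        self.pending_cancellations.add(test_id)

    def request_retest(self, test_id):
        # Retests are refused while a cancellation is still pending.
        if test_id in self.pending_cancellations:
            return "Unable to re-queue test due to a pending cancellation request."
        return "queued"

    def handle_status_check(self, bot_id):
        # Runs only when a bot phones home; the cancel rides on the response.
        to_cancel = {t for t in self.pending_cancellations
                     if self.running.get(t) == bot_id}
        self.pending_cancellations -= to_cancel
        for t in to_cancel:
            del self.running[t]
        return sorted(to_cancel)
```

If the bot running the test disappears, `handle_status_check` is never called for it, so the test stays in `pending_cancellations` and every retest request keeps getting refused.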
Right now, the manual workaround for this scenario is to update the test in the database or delete the associated Drupal variable on the server (pifr_cancelled_tests? pifr_pending_cancellations? Something like that).
What would probably be the best resolution is another routine on qa.d.o that periodically clears any tests that were added to the pending cancellation variable but are still there after a few minutes, since the testbots should (in theory) be calling back home with a status heartbeat every minute or so. A qa.d.o clear_cancelled_tests Jenkins job would be beneficial for administrators as well.
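A minimal sketch of what such a cleanup routine could look like, in Python for illustration only. It assumes the pending cancellations are stored as a mapping of test id to the time the cancel was requested, which is not necessarily how qa.d.o stores them; the function name and the five-minute threshold are likewise assumptions.

```python
import time

# Assumption: "a few minutes" is taken to mean five minutes here.
STALE_AFTER_SECONDS = 5 * 60


def clear_stale_cancellations(pending, now=None, max_age=STALE_AFTER_SECONDS):
    """Split pending cancellations into those still fresh and those gone stale.

    `pending` maps test id -> timestamp (seconds) of the cancel request.
    Returns (remaining, cleared_ids); the input mapping is not modified.
    """
    now = time.time() if now is None else now
    cleared = sorted(t for t, ts in pending.items() if now - ts > max_age)
    remaining = {t: ts for t, ts in pending.items() if now - ts <= max_age}
    return remaining, cleared
```

Run periodically (from cron or a Jenkins job, say), this would drop any cancellation request that no testbot has picked up within the threshold, so the test could be re-queued again.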
Comment #8
alexpott
#2477583: Unexpected timeouts on testbot runs has just caused this.
Comment #9
xjm
This is happening again now; HEAD just failed with:
on bot 2928.
I tried to request a retest but got this same error.
Comment #10
xjm
More detail on what happened prior to the error:
Comment #11
xjm
Per @jthorson:
Comment #12
Mixologic