Turns out #22336: Move all core Drupal files under a /core folder to improve usability and upgrades doesn't just fail to test correctly, it actually consistently destroys (temporarily) any testbot it's run on. It fills up the /tmpfs database space, which is normally about 42% full.
This is the largest patch ever tested by the testbots, but I'm not exactly sure how it overruns the tmpfs. It's not just this patch, either; I have to clean up a wrecked testbot occasionally.
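To watch for this condition, the partition usage can be checked with `df`. A minimal sketch (the `/tmpfs` mount point is from this thread; the function name is hypothetical):

```shell
#!/bin/sh
# Report how full a filesystem is, as an integer percentage (no % sign).
# On the testbots, /tmpfs holds the MySQL data and normally sits around 42%.
fs_usage_pct() {
  df -P "${1:-/tmpfs}" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

fs_usage_pct /
```

Run periodically (e.g. from cron), this would flag a runaway test run before MySQL starts losing connections.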
It probably also tells us something about #1156976: 0 tests passing is not green - it's not just a display problem, it's most likely a massive testbot failure like this one.
If you look at the test log, the run still gets through thousands of (successful) tests before hitting:
Additional uncaught exception thrown while handling exception.
PDOException: SQLSTATE[HY000]: General error: 2013 Lost connection to MySQL server during query: TRUNCATE {cache_field} ; Array ( ) in cache_clear_all() (line 169 of /var/lib/drupaltestbot/sites/default/files/checkout/core/includes/cache.inc).
I had thought this was max_allowed_packet, but since that has already been adjusted, and since we see the database partition (/tmpfs) being overrun, that may not be the issue.
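For reference, the current max_allowed_packet can be checked on the testbot's MySQL like so (connection details are assumptions; adjust user/socket for the actual setup):

```shell
# Query the running server's setting.
mysql -e "SHOW VARIABLES LIKE 'max_allowed_packet';"

# A larger value is set in my.cnf under the [mysqld] section, e.g.:
#   [mysqld]
#   max_allowed_packet = 64M
```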
Questions:
1. Does simpletest clean up test tables when an exception like this is hit?
2. How should we properly debug this?
| Comment | File | Size | Author |
|---|---|---|---|
| #5 | pifr.run-clean-before-testing-1192680-5.patch | 1.15 KB | jthorson |
Comments
Comment #1
rfay commented:

Some further learning:
* I increased the tmpfs partition and then #22336: Move all core Drupal files under a /core folder to improve usability and upgrades was able to complete, but it still used nearly all of the 1GB available to it.
* When I fixed the most blatant error (a failure to load a particular include, which happened in many, many places), the partition usage stabilized and we didn't need nearly as much space.
Comment #2
rfay commented:

This has been happening less often.
And the recovery is easier than what I had been doing. Delete enough in /tmpfs/mysql that mysql can run. Then start it again. Drop the drupaltestbotmysql database and recreate it. All good.
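The recovery steps above can be sketched roughly as follows (paths and database name are from this thread; the exact files to delete and the init-script invocation are assumptions and will vary by setup):

```shell
# 1. Free enough space on the tmpfs partition that mysqld can start again.
#    Deleting the test database's data directory is one option (assumption:
#    the leftover test tables live under the database's own directory).
rm -rf /tmpfs/mysql/drupaltestbotmysql

# 2. Restart MySQL (sysvinit-style; use your distro's service manager).
/etc/init.d/mysql start

# 3. Drop and recreate the testbot database.
mysql -e "DROP DATABASE IF EXISTS drupaltestbotmysql; CREATE DATABASE drupaltestbotmysql;"
```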
Comment #3
jthorson commented:

Just confirming this happens on failed local testbot runs as well ... running run-tests.sh with the --clean option often enough seems to prevent it.
This would likely affect performance, but can we have PIFR execute run-tests.sh --clean before kicking off a new test run, which should clear out the leftover tables and help keep /tmpfs clean?
Comment #4
rfay commented:

I don't see why we shouldn't do that. Sounds like a good idea to me.
Compared to running a full suite of simpletests, all other performance issues pale in comparison.
Comment #5
jthorson commented:

This should do it ... executes "run-tests.sh --php PIFR_CLIENT_PHP --clean" before kicking off the initial run-tests.sh call.
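In other words, the client would effectively run something like the following before each test pass (a sketch only: the PHP binary path and checkout directory are assumptions, with the checkout path taken from the error log earlier in this issue):

```shell
# PIFR_CLIENT_PHP: the PHP binary the testbot client is configured to use.
PIFR_CLIENT_PHP=/usr/bin/php

# Run the clean step from the Drupal checkout, dropping leftover
# simpletest tables before the real test run starts.
cd /var/lib/drupaltestbot/sites/default/files/checkout
"$PIFR_CLIENT_PHP" ./scripts/run-tests.sh --php "$PIFR_CLIENT_PHP" --clean
```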
Comment #6
jthorson commented.

Comment #7

rfay commented:

Committed: 1c8c3ac
Thanks!
Comment #8
jthorson commented:

When re-certifying client tests for scratchtestbot (to enable it for testing), the certify test failed on the --clean step.
No result was returned to qa.scratch.
Comment #9
jthorson commented:

The failure caused the testbot to hang until the max-test-length timeout. Executing the following woke it back up ...
Now the above failure may actually have been expected ... the certification process runs 'failed' tests and checks responses to see if the expected failure occurs. However, once the 'Review complete' line hits the log, I expected the testbot to give up on that test ... not halt everything until cron wakes it up 90 minutes later. May need to add a variable_del() somewhere to wake up the testbot after this type of failure.
While possibly a coincidence (because I happened to requeue the test and reset the timer after the 'fail' test and before the 'pass') ... I commented out the code added in the above patch, and the testbot began testing on the next 'per minute' cron run.
In any case, this requires more testing. I may set up both scratchtestbot and a local testbot to re-certify tomorrow (one with the code and one without) to see if this was actually the issue.
Comment #10
jthorson commented:

Behaviour in #9 confirmed ... deleting the variables triggered the next test to start. So it appears that when the --clean step fails, the next test is not being triggered.
However, on the next test (without the 'failure' patch), the 'clean' step worked fine ... so there isn't an issue with the actual code added in this patch; it's more an issue of how the failure case freezes testing for some period of time. The next question is whether this 'freeze' clears itself over time or requires manual intervention (i.e. the variable_del()).
Going to kick off new certifications for my local testbot and scratchtestbot, and come back in the morning to see if they completed without assistance - it's possible I'm simply not patient enough tonight.
Comment #11
jthorson commented:

Both confirmation tests passed, with the --clean code enabled.
Manually clearing the 'freeze' causes the certification test to 'fail', even though all tests pass.
Setting back to fixed, and will investigate the delay as a new issue.