It turns out that #22336: Move all core Drupal files under a /core folder to improve usability and upgrades doesn't just fail to test correctly; it consistently (if temporarily) destroys any testbot it's run on. It fills up the /tmpfs database partition, which is normally only about 42% full.

This is the biggest patch ever tested by the testbots, but I'm not exactly sure how it ruins the tmpfs. It's not just this patch that does it, either; I have to clean up a random testbot occasionally.

It probably also tells us something about #1156976: 0 tests passing is not green - it's not just a display problem, it's most likely a massive testbot failure like this one.

If you look at the test log, it still gets through thousands of (successful) tests before hitting:

Additional uncaught exception thrown while handling exception.
PDOException: SQLSTATE[HY000]: General error: 2013 Lost connection to MySQL server during query: TRUNCATE {cache_field} ; Array ( ) in cache_clear_all() (line 169 of /var/lib/drupaltestbot/sites/default/files/checkout/core/includes/cache.inc).

I had thought this was max_allowed_packet, but since that's adjusted, and since we see the database partition (/tmpfs) being overrun, it seems that may not be the issue.

Questions:
1. Does simpletest clean up test tables when an exception like this is hit?
2. How should we properly debug this?


Comments

rfay’s picture

Some further learning:

* I increased the tmpfs partition and then #22336: Move all core Drupal files under a /core folder to improve usability and upgrades was able to complete, but it still used nearly all of the 1GB available to it.
* When I fixed the most blatant error (a failure to load a particular include, which happened in many, many places), partition usage stabilized and we needed far less space.

rfay’s picture

This has been happening less.

And recovery is easier than what I had been doing: delete enough in /tmpfs/mysql that MySQL can run, start it again, then drop the drupaltestbotmysql database and recreate it. All good.
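For reference, the recovery steps above could be sketched as a shell helper. This is a hypothetical sketch, not the actual procedure run on the bots: the database name and /tmpfs path come from this thread, but exactly which files are safe to delete under /tmpfs/mysql will vary per install. Set DRY_RUN=1 to print the commands instead of executing them.

```shell
# Hypothetical recovery helper for a wedged testbot; DB_NAME and the
# /tmpfs path are taken from this thread. DRY_RUN=1 prints instead of runs.
DB_NAME=${DB_NAME:-drupaltestbotmysql}

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

recover_testbot() {
  # Free enough space under /tmpfs/mysql that mysqld can start again.
  run sudo rm -rf "/tmpfs/mysql/${DB_NAME}"
  run sudo service mysql start
  # Drop the test database and recreate it empty.
  run mysql -e "DROP DATABASE IF EXISTS ${DB_NAME}; CREATE DATABASE ${DB_NAME};"
}

# recover_testbot                # uncomment on a real testbot
# DRY_RUN=1 recover_testbot      # or preview the commands first
```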

jthorson’s picture

Just confirming this happens on failed local testbot runs as well ... running run-tests.sh with the --clean option often enough seems to prevent it.

This would likely affect performance, but can we have PIFR execute run-tests.sh --clean before kicking off a new test run? That should clear out the leftover tables and help keep /tmpfs clean.

rfay’s picture

I don't see why we shouldn't do that. Sounds like a good idea to me.

All other performance issues pale in comparison to running a full suite of simpletests.

jthorson’s picture

This should do it ... executes "run-tests.sh --php PIFR_CLIENT_PHP --clean" before kicking off the initial run-tests.sh call.
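For reference, the resulting invocation (as it appears verbatim in the certification log later in this thread) looks like the following. The checkout path and PHP binary are testbot-specific; this is just a sketch of what PIFR ends up running, not the patch itself.

```shell
# The pre-test cleanup command PIFR runs, per the log later in this thread.
# CHECKOUT and the PHP path are testbot-specific assumptions.
CHECKOUT=${CHECKOUT:-/var/lib/drupaltestbot/sites/default/files/checkout}
CLEAN_CMD="/usr/bin/php ./scripts/run-tests.sh --php /usr/bin/php --clean"

# (cd "$CHECKOUT" && $CLEAN_CMD)   # uncomment on a real testbot
```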

jthorson’s picture

Status: Active » Needs review
rfay’s picture

Status: Needs review » Fixed

Committed: 1c8c3ac

Thanks!

jthorson’s picture

Status: Fixed » Needs work

When re-certifying client tests for scratchtestbot (to enable it for testing), the certify test failed on the --clean step.

[21:37:11] Command [git apply -v -p1 /var/lib/drupaltestbot/sites/default/files/review/review.patch 2>&1] succeeded
  Duration: 0 seconds
  Directory: [/var/lib/drupaltestbot/sites/default/files/checkout]
  Completion status: [0]
  Output: [Checking patch scripts/run-tests.sh...
Applied patch scripts/run-tests.sh cleanly.].
[21:37:11] Applied [review.patch] to:
 > scripts/run-tests.sh
[21:37:11] Invoking operation [syntax]...
[21:37:11] Checked [0 of 0] relevant file(s) for syntax.
[21:37:11] Invoking operation [install]...
[21:37:22] Invoking operation [review]...
[04:37:26] Command [/usr/bin/php ./scripts/run-tests.sh --php /usr/bin/php --clean 2>&1] failed
  Duration 0 seconds
  Directory [/var/lib/drupaltestbot/sites/default/files/checkout]
  Status [255]
 Output: [PHP Fatal error:  Call to undefined function break_me() in /var/lib/drupaltestbot/sites/default/files/checkout/scripts/run-tests.sh on line 7].
[04:37:26] Encountered error on [review], details:
array (
  '@reason' => 'failed during invocation of run-tests.sh --clean',
)
[04:37:26] Review complete. test_id=154109 result code=8 details=Array
(
    [@reason] => failed during invocation of run-tests.sh --clean
)

[04:37:26] Static variables reset.

No result was returned to qa.scratch.

jthorson’s picture

The failure caused the testbot to hang until the max-test-length timeout. Executing the following woke it back up ...

drush php-eval 'variable_del("pifr_client_test_start");'
drush php-eval 'variable_del("pifr_client_test");'
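A hypothetical wake-up helper wrapping the two drush calls above, in case this needs to be done repeatedly; point DRUSH at your drush binary (or at `echo` to dry-run).

```shell
# Sketch of an "unstick" helper for a hung testbot, wrapping the two
# variable_del calls from this comment. DRUSH is an assumption.
DRUSH=${DRUSH:-drush}

unstick_testbot() {
  "$DRUSH" php-eval 'variable_del("pifr_client_test_start");'
  "$DRUSH" php-eval 'variable_del("pifr_client_test");'
}

# unstick_testbot   # uncomment on a stuck testbot
```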

Now the above failure may actually have been expected ... the certification process runs 'failed' tests and checks responses to see if the expected failure occurs. However, once the 'Review complete' line hits the log, I expected the testbot to give up on that test ... not halt everything until cron wakes it up 90 minutes later. May need to add a variable_del() somewhere to wake up the testbot after this type of failure.

While possibly a coincidence (because I happened to requeue the test and reset the timer after the 'fail' test and before the 'pass') ... I commented out the code added in the above patch, and the testbot began testing on the next 'per minute' cron run.

In any case, this requires more testing. May set up both scratchtestbot and a local testbot to re-certify tomorrow (one with the code and one without) to see if this was actually the issue.

jthorson’s picture

Behaviour in #9 confirmed ... deleting the variables triggered the next test to start. So it appears that when the --clean step fails, the next test is not being triggered.

However, on the next test (without the 'failure' patch), the 'clean' seems to work fine ... so there isn't an issue with the actual code added in this patch - it's more an issue of how the failure case freezes the testing for some period of time. The next question is whether this 'freeze' clears itself over time, or requires manual intervention (i.e. the variable_del).

Going to kick off new certifications for my local testbot and scratchtestbot, and come back in the morning to see if they completed without assistance - it's possible I'm simply not patient enough tonight.

jthorson’s picture

Status: Needs work » Fixed

Both confirmation tests passed, with the --clean code enabled.

Manually clearing the 'freeze' causes the certification test to 'fail', even though all tests pass.

Setting back to fixed, and will investigate the delay as a new issue.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.