It turns out that #22336: Move all core Drupal files under a /core folder to improve usability and upgrades doesn't just fail to test correctly; it consistently (if temporarily) destroys any testbot it's run on. It fills up the /tmpfs database partition, which is normally only about 42% full.

This is the biggest patch ever tested by the testbots, but I'm not exactly sure how it ruins the tmpfs. It's not just this patch that does it, either; I have to clean up a random testbot occasionally.

It probably also tells us something about #1156976: 0 tests passing is not green - it's not just a display problem, it's most likely a massive testbot failure like this one.

If you look at the test log, it still gets through thousands of (successful) tests before hitting:

Additional uncaught exception thrown while handling exception.
PDOException: SQLSTATE[HY000]: General error: 2013 Lost connection to MySQL server during query: TRUNCATE {cache_field} ; Array ( ) in cache_clear_all() (line 169 of /var/lib/drupaltestbot/sites/default/files/checkout/core/includes/cache.inc).

I had thought this was max_allowed_packet, but since that's adjusted, and since we see the database partition (/tmpfs) being overrun, it seems that may not be the issue.

Questions:
1. Does simpletest clean up test tables when an exception like this is hit?
2. How should we properly debug this?


Comments

rfay’s picture

Some further learning:

* I increased the tmpfs partition and then #22336: Move all core Drupal files under a /core folder to improve usability and upgrades was able to complete, but it still used nearly all of the 1GB available to it.
* When I fixed the most blatant error (a failure to load a particular include, which happened in many, many places), partition usage stabilized and we needed far less space.

rfay’s picture

This has been happening less.

And recovery is easier than what I had been doing: delete enough in /tmpfs/mysql that MySQL can run, start it again, then drop the drupaltestbotmysql database and recreate it. All good.
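For reference, the recovery steps above could be sketched as a shell helper. This is a hypothetical sketch, not the actual procedure run on the bots: the database name and /tmpfs path come from this thread, but exactly which files are safe to delete under /tmpfs/mysql will vary per install. Set DRY_RUN=1 to print the commands instead of executing them.

```shell
# Hypothetical recovery helper for a wedged testbot; DB_NAME and the
# /tmpfs path are taken from this thread. DRY_RUN=1 prints instead of runs.
DB_NAME=${DB_NAME:-drupaltestbotmysql}

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

recover_testbot() {
  # Free enough space under /tmpfs/mysql that mysqld can start again.
  run sudo rm -rf "/tmpfs/mysql/${DB_NAME}"
  run sudo service mysql start
  # Drop the test database and recreate it empty.
  run mysql -e "DROP DATABASE IF EXISTS ${DB_NAME}; CREATE DATABASE ${DB_NAME};"
}

# recover_testbot                # uncomment on a real testbot
# DRY_RUN=1 recover_testbot      # or preview the commands first
```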

jthorson’s picture

Just confirming this happens on failed local testbot runs as well ... running run-tests.sh with the --clean option often enough seems to prevent it.

This would likely affect performance, but can we have PIFR execute run-tests.sh --clean before kicking off a new test run? That should clear out the leftover tables and help keep /tmpfs clean.

rfay’s picture

I don't see why we shouldn't do that. Sounds like a good idea to me.

All other performance issues pale in comparison to running a full suite of simpletests.

jthorson’s picture

This should do it ... executes "run-tests.sh --php PIFR_CLIENT_PHP --clean" before kicking off the initial run-tests.sh call.
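For reference, the resulting invocation (as it appears verbatim in the certification log later in this thread) looks like the following. The checkout path and PHP binary are testbot-specific; this is just a sketch of what PIFR ends up running, not the patch itself.

```shell
# The pre-test cleanup command PIFR runs, per the log later in this thread.
# CHECKOUT and the PHP path are testbot-specific assumptions.
CHECKOUT=${CHECKOUT:-/var/lib/drupaltestbot/sites/default/files/checkout}
CLEAN_CMD="/usr/bin/php ./scripts/run-tests.sh --php /usr/bin/php --clean"

# (cd "$CHECKOUT" && $CLEAN_CMD)   # uncomment on a real testbot
```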

jthorson’s picture

Status: Active » Needs review
rfay’s picture

Status: Needs review » Fixed

Committed: 1c8c3ac

Thanks!

jthorson’s picture

Status: Fixed » Needs work

When re-certifying client tests for scratchtestbot (to enable it for testing), the certify test failed on the --clean step.

[21:37:11] Command [git apply -v -p1 /var/lib/drupaltestbot/sites/default/files/review/review.patch 2>&1] succeeded
  Duration: 0 seconds
  Directory: [/var/lib/drupaltestbot/sites/default/files/checkout]
  Completion status: [0]
  Output: [Checking patch scripts/run-tests.sh...
Applied patch scripts/run-tests.sh cleanly.].
[21:37:11] Applied [review.patch] to:
 > scripts/run-tests.sh
[21:37:11] Invoking operation [syntax]...
[21:37:11] Checked [0 of 0] relevant file(s) for syntax.
[21:37:11] Invoking operation [install]...
[21:37:22] Invoking operation [review]...
[04:37:26] Command [/usr/bin/php ./scripts/run-tests.sh --php /usr/bin/php --clean 2>&1] failed
  Duration 0 seconds
  Directory [/var/lib/drupaltestbot/sites/default/files/checkout]
  Status [255]
 Output: [PHP Fatal error:  Call to undefined function break_me() in /var/lib/drupaltestbot/sites/default/files/checkout/scripts/run-tests.sh on line 7].
[04:37:26] Encountered error on [review], details:
array (
  '@reason' => 'failed during invocation of run-tests.sh --clean',
)
[04:37:26] Review complete. test_id=154109 result code=8 details=Array
(
    [@reason] => failed during invocation of run-tests.sh --clean
)

[04:37:26] Static variables reset.

No result was returned to qa.scratch.

jthorson’s picture

The failure caused the testbot to hang until the max-test-length timeout. Executing the following woke it back up ...

drush php-eval 'variable_del("pifr_client_test_start");'
drush php-eval 'variable_del("pifr_client_test");'
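A hypothetical wake-up helper wrapping the two drush calls above, in case this needs to be done repeatedly; point DRUSH at your drush binary (or at `echo` to dry-run).

```shell
# Sketch of an "unstick" helper for a hung testbot, wrapping the two
# variable_del calls from this comment. DRUSH is an assumption.
DRUSH=${DRUSH:-drush}

unstick_testbot() {
  "$DRUSH" php-eval 'variable_del("pifr_client_test_start");'
  "$DRUSH" php-eval 'variable_del("pifr_client_test");'
}

# unstick_testbot   # uncomment on a stuck testbot
```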

Now the above failure may actually have been expected ... the certification process runs 'failed' tests and checks responses to see if the expected failure occurs. However, once the 'Review complete' line hits the log, I expected the testbot to give up on that test ... not halt everything until cron wakes it up 90 minutes later. May need to add a variable_del() somewhere to wake up the testbot after this type of failure.

While possibly a coincidence (because I happened to requeue the test and reset the timer after the 'fail' test and before the 'pass') ... I commented out the code added in the above patch, and the testbot began testing on the next 'per minute' cron run.

In any case, this requires more testing. May set up both scratchtestbot and a local testbot to re-certify tomorrow (one with the code and one without) to see if this was actually the issue.

jthorson’s picture

Behaviour in #9 confirmed ... deleting the variables triggered the next test to start. So it appears that when the --clean step fails, the next test is not being triggered.

However, on the next test (without the 'failure' patch), the 'clean' seems to work fine ... so there isn't an issue with the actual code added in this patch - it's more an issue of how the failure case freezes the testing for some period of time. The next question is whether this 'freeze' clears itself over time, or requires manual intervention (i.e. the variable_del).

Going to kick off new certifications for my local testbot and scratchtestbot, and come back in the morning to see if they completed without assistance - it's possible I'm simply not patient enough tonight.

jthorson’s picture

Status: Needs work » Fixed

Both confirmation tests passed, with the --clean code enabled.

Manually clearing the 'freeze' causes the certification test to 'fail', even though all tests pass.

Setting back to fixed, and will investigate the delay as a new issue.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.