Summary
It appears that certain scenarios can result in an empty (failed) result being saved for a test by PIFR_server, even though the testbot has successfully passed the test. When this occurs, the test is reported as FAILED: [[SimpleTest]]: [MySQL] Failed to run tests: @reason; [[Coder]]: [Code review] .
In the reported case, it appeared that this may perhaps have been related to multiple testbots picking up the same test, as a couple of Test not assigned to client watchdog logs were found for the test in question.
The original report also mentioned that the MySQL and Coder tabs being swapped, which can happen based on which environment is loaded first in the $environments array while building the page.
Original Report:
http://qa.drupal.org/8.x-status
8.x branch tests were failing this morning with message:
FAILED: [[SimpleTest]]: [MySQL] Failed to run tests: @reason; [[Coder]]: [Code review] .
There were no failures in actual test assertions.
aspilicious noticed the MySQL and Coder tabs were also mysteriously swapped.
| Comment | File | Size | Author |
|---|---|---|---|
| #31 | 1361468-31.patch | 724 bytes | jthorson |
| #27 | pifr_what-were-you-thinking_1361468-27.patch | 649 bytes | jthorson |
| #26 | pifr_avoid-dupe-pifr-env-status-entries-1361468-26.patch | 3.21 KB | jthorson |
| #22 | pifr_modify-results-for-all-env-check_1361468-22.patch | 638 bytes | jthorson |
Comments
Comment #1
xjmLog entries:
http://qa.drupal.org/admin/reports/event/4239363 for the mysql tab log
http://qa.drupal.org/admin/reports/event/4239293 for the coder tab log
Comment #2
xjmHm, maybe I misread the message -- there's a semicolon in the middle, so I guess the second part is just the normal message stub from coder. So the tests just aren't running for some other reason?
Comment #3
xjmBranch tests passed again a couple hours later:
http://qa.drupal.org/admin/reports/event/4239603
So downgrading to normal.
Comment #4
jthorson commentedLooking at the logs, it appears this test ran on multiple testbots simultaneously, and tripped a watchdog error when the second testbot reported results:
pifr_server 12/05/2011 - 08:01:57 Test not assigned to client: (t: 16628, c: 699)Normally, PIFR handles this fine, but since we don't know what causes the multiple testbot scenario, I couldn't test how PIFR_Server responds when it's a Coder test which gets crossed up.
In any case, both tests ran successfully, and the branch failure was a false positive due to PIFR_Server somehow getting confused with the duplicate tests and improperly parsing the results.
There could be an issue with the 'advisory' pass/fail calculation in this scenario, but we'll have to keep an eye out to try and figure exactly what situation causes the failure ... I suspect it's the order that the test is picked up and returned from the multiple testbots, but even the testing by multiple testbots is a race condition itself.
Comment #4.0
jthorson commentedUpdated issue summary.
Comment #5
jthorson commentedTitle change.
Comment #6
jthorson commentedAnother example:
https://www.evernote.com/shard/s15/sh/16200e89-a70d-42e9-a7de-e3126a988d...
Comment #7
longwaveJust noticed this on Ubercart's 7.x-3.x branch test at http://qa.drupal.org/pifr/test/132134
The cause here seems to be two testbots picking up the same test:
The MySQL and Code review tabs are in the expected order in this case.
Comment #8
rfaySorry can't help myself. Just changing title.
Comment #9
longwaveActually, the duplicate test requests seem to be a red herring. What appears to be happening is the results are being parsed in the wrong order, so the code review results are being parsed as MySQL results (which will fail for most contrib as Coder Tough Love reports false positives) and vice versa.
As an example, the last failing Ubercart branch test ended in MySQL failing with "617 pass(es), 44 fail(s), and 899 exception(es)" and Code review passing with "1,816 minor(s), 0 critical(s), and 0 normal(s)" - the numbers are correct, but for the opposite environments.
This can be corrected by forcing enough retests until the bots pick up the results in the right order, although this is somewhat tedious!
Comment #10
longwaveSpotted at http://qa.drupal.org/pifr/status while retesting:
Client 659
Enabled - Reviewing
Ubercart - 7.x-3.x
Environment: 2
Client 664
Enabled - Reviewing
Ubercart - 7.x-3.x
Environment:
Note that "environment" is blank for the second one; usually this is 1 or 2.
Comment #11
longwaveShould this be moved to the PIFR queue, or is this the correct place to report this?
Comment #12
jthorson commentedEither queue is fine ... we'll triage as needed.
I don't believe the test results as displayed on the qa.d.o page can really be trusted when this happens ... I suspect you're actually being shown the result details from the *previous* test, and not the most recent one. This is why you end up with Coder results in the MySQL tab and vice versa.
Comment #13
longwaveYou can check this out right now at http://qa.drupal.org/pifr/test/132134 (I'm fed up of forcing retests for today at least)
The test status, event log and review log timestamps all seem to match up. Expanding the review log seems to imply that the tests are being parsed the wrong way round.
Also, on the MySQL tab (with the coder review results!) is there any reason why some of the non-pass items are expandable while others aren't?
Comment #14
jthorson commentedThe testbots have a maximum # of assertions that they will return ... but with Coder, the number of issues flagged often exceeds this value. The expanable items are those that fell under the max assertions threshold, and the non-expandable fell above.
Comment #15
jthorson commentedOkay ... here's an explanation.
Occasionally (and I don't yet know why), the same test gets farmed out to multiple testbots. This means that the same environment may be under test by multiple testbots simultaneously. However, because the database keys on test_id + environment_id, it only has a single record ... in other words, it's only aware of one testbot's tests.
When a test is received back from a testbot, and the server goes to record the results, control passes to pifr_server_test_result(). The key to the failures is the following code:
Because you might have multiple test running for the same environment, due to multiple testbots getting the same test (thought this shouldn't happen!), you can end up with two coder results returned before you receive your simpletest results (which take much longer to run).
It appears that you then satisfy the count($results) == count($environments) check, at which point qa.d.o marks the whole test as complete ... even though your simpletest tests are still running.
Then, when the simpletest results are returned ... they get rejected because the test is not actually queued anymore. Because it's not queued, the results don't get saved to the db.
Now I haven't looked into why you still sometimes get simpletest results on the test page, or crossed up tabs ... but I suspect that, since the results don't get saved to the db, the assertions for the previous run of that environment still exist in the database. Because the environment keys were reset when the test was farmed out to the testbots, the various tables are no longer in synch - so you can get old Coder results in a mysql tab, or old Mysql results under a Coder tab.
It doesn't mean the results have been swapped ... it means that you probably aren't looking at current results.
Comment #16
jthorson commentedNext time we see this, lets do a db query and see what was actually there.
Select result_id, test_id, environment_id, code, details from pifr_result where test_id = [test_id]Comment #17
xjmIt happened again this morning.
webchick ran:
select result_id, test_id, environment_id, code, details from pifr_result where test_id = 16628Got:
Table 'drupal.pifr_result' doesn't existPresumably it's in a different DB that QA uses? Can we get that DB documented here so we can ask the right question next time? :)
Edit: I requeued the branch because patch tests are getting really backed up, between this and a separate issue last night with a broken commit.
Comment #18
webchickYeah, I'm happy to run queries like this, but when I do:
ssh webchick@qa.drupal.org
I get:
Permission denied (publickey,gssapi-with-mic).
If someone is able to take care of that, I can help. :) My key is on util.
Comment #19
rfayThis is qa's database, which is not accessible on util. You can log into www2 and go to /var/www/qa.drupal.org/htdocs and run drush cli.
Here's what I got just now. But unfortunately I had already hit the retest button.
Comment #20
jthorson commentedLooking at the logs for Drupal 8.x today (which failed with this symptom MULTIPLE times), it appears that there is a 'Test reset by client request' event for each one of the failures.
If we see more of this issue, please check for this in the current test cycle on the event log ... if this line appears every time that a test responds this way, there's a good chance we've ran into the same race condition as what I suspect was causing #1347582: Testbot reconfirmation fails: failed to retrieve [.patch] from [qa.drupal.org]. (Namely, secondary requests arriving during the window between an xmlrpc request and response).
Comment #21
jthorson commentedHmm ... and when we see 'Environment: ' with no value on the status page, the following might help shed some light on things:
And as before, once the test completes and comes back as failed=>@reason:
select result_id, test_id, environment_id, code, details from pifr_result wehre test_id = $test_id;Comment #22
jthorson commentedThis patch doesn't address the root problem, but it should help with one scenario (where two coder responses come in before the simpletest result, thus prematurely satisfying the
count($results) == count($environments)check ... don't believe it should break anything else.Comment #23
jthorson commentedMoving to PIFR
Comment #24
jthorson commentedPatch in 22 committed to 6.x-2.x (dbc57ef).
Comment #25
jthorson commentedWith the patch in #22, when the environment gets out of sync, qa.d.o now returns results as soon as all 'known' environments have results.
Unfortunately, if the 'out of sync' environment is related to the simpletest test (i.e. the entry with no environment is the MySql environment), then the results are returned as soon as Coder has completed ... which tells me that the environment sync issue causes an incomplete $environments array when the check applied in the above patch is executed.
Just noting this here, as it's another starting point for further troubleshooting.
Comment #26
jthorson commentedThis patch should prevent the test from being sent to the testbot when a testbot requests the same test twice, resulting in the following appearing in watchdog:
User warning: Duplicate entry '219798-1' for key 'PRIMARY' query: INSERT INTO pifr_environment_status (environment_id, test_id, client_id) VALUES (1, 219798, 699) in _db_query() (line 147 of /var/www/qa.drupal.org/htdocs/includes/database.mysqli.inc).or when two testbots call next() at the same time, and end up getting fed the same test; which corrupts the entries in pifr_environment_status().
Comment #27
jthorson commentedUm ... Re: #22 ...
... yeah. :(
Comment #28
jthorson commented#27 committed to 6.x-2.x (a70d793). #26 still needs review.
Comment #29
jthorson commented#26 committed to 6.x-2.x (commit 5aa600f). Will roll 6.x-2.8-rc3 with this fix, and deploy as soon as my deploy access is granted ... and if things stay stable for a week, will tag as 6.x-2.8 final.
Comment #30
jthorson commentedBatting a solid .000 this weekend ... :(
Comment #31
jthorson commentedBugfix patch to fix extra '$' included in the patch in #26.
Comment #32
jthorson commentedCommitted to 6.x-2.8-rc4.
Comment #33.0
(not verified) commentedUpdating Issue Summary, as this looks like a generic PIFR issue rather than a 8.x specific one.