Summary
It appears that certain scenarios can result in PIFR_server saving an empty (failed) result for a test, even though the testbot has successfully run the test. When this occurs, the test is reported as FAILED: [[SimpleTest]]: [MySQL] Failed to run tests: @reason; [[Coder]]: [Code review] .

In the reported case, this may have been related to multiple testbots picking up the same test, as a couple of "Test not assigned to client" watchdog logs were found for the test in question.

The original report also mentioned that the MySQL and Coder tabs were swapped, which can happen based on which environment is loaded first in the $environments array while building the page.

Original Report:
http://qa.drupal.org/8.x-status
8.x branch tests were failing this morning with message:
FAILED: [[SimpleTest]]: [MySQL] Failed to run tests: @reason; [[Coder]]: [Code review] .
There were no failures in actual test assertions.

aspilicious noticed the MySQL and Coder tabs were also mysteriously swapped.

Comments

xjm’s picture

Title: 8.x branch tests failing on coder advisory review » 8.x branch tests failing

Hm, maybe I misread the message -- there's a semicolon in the middle, so I guess the second part is just the normal message stub from coder. So the tests just aren't running for some other reason?

xjm’s picture

Priority: Critical » Normal

Branch tests passed again a couple hours later:
http://qa.drupal.org/admin/reports/event/4239603

So downgrading to normal.

jthorson’s picture

Looking at the logs, it appears this test ran on multiple testbots simultaneously, and tripped a watchdog error when the second testbot reported results:

pifr_server 12/05/2011 - 08:01:57 Test not assigned to client: (t: 16628, c: 699)

Normally, PIFR handles this fine, but since we don't know what causes the multiple testbot scenario, I couldn't test how PIFR_Server responds when it's a Coder test which gets crossed up.

In any case, both tests ran successfully, and the branch failure was a false positive due to PIFR_Server somehow getting confused with the duplicate tests and improperly parsing the results.

There could be an issue with the 'advisory' pass/fail calculation in this scenario, but we'll have to keep an eye out to figure out exactly what situation causes the failure ... I suspect it's the order in which the test is picked up and returned from the multiple testbots, but even the testing by multiple testbots is a race condition itself.

jthorson’s picture

Project: Drupal.org Testbots » Project Issue File Review
Issue summary: View changes

Updated issue summary.

jthorson’s picture

Title: 8.x branch tests failing » PIFR reports failed test result with no details, even though testbots have ran successfully
Project: Project Issue File Review » Drupal.org Testbots

Title change.

longwave’s picture

Just noticed this on Ubercart's 7.x-3.x branch test at http://qa.drupal.org/pifr/test/132134

The cause here seems to be two testbots picking up the same test:

2,684,143 	Result retrieved by project client. 	1 hour 58 min ago
2,684,133 	Result retrieved by project client. 	1 hour 58 min ago
2,684,128 	Result received from test client #664. 	1 hour 58 min ago
2,684,123 	Result received from test client #659. 	1 hour 59 min ago
2,684,118 	Requested by test client #664. 	1 hour 59 min ago
2,684,113 	Requested by test client #659. 	1 hour 59 min ago
2,684,108 	Test reset by client request. 	1 hour 59 min ago
2,684,103 	Requested by test client #659. 	2 hours 45 sec ago
2,684,093 	Requested by test client #699. 	2 hours 1 min ago
2,684,083 	Test request received. 	2 hours 1 min ago 

The MySQL and Code review tabs are in the expected order in this case.

rfay’s picture

Title: PIFR reports failed test result with no details, even though testbots have ran successfully » PIFR reports failed test result with no details, even though testbots have run successfully

Sorry can't help myself. Just changing title.

longwave’s picture

Actually, the duplicate test requests seem to be a red herring. What appears to be happening is that the results are being parsed in the wrong order, so the code review results are being parsed as MySQL results (which will fail for most contrib modules, as Coder Tough Love reports false positives) and vice versa.

As an example, the last failing Ubercart branch test ended in MySQL failing with "617 pass(es), 44 fail(s), and 899 exception(s)" and Code review passing with "1,816 minor(s), 0 critical(s), and 0 normal(s)" - the numbers are correct, but for the opposite environments.

This can be corrected by forcing enough retests until the bots pick up the results in the right order, although this is somewhat tedious!

longwave’s picture

Spotted at http://qa.drupal.org/pifr/status while retesting:

Client 659
Enabled - Reviewing
Ubercart - 7.x-3.x
Environment: 2

Client 664
Enabled - Reviewing
Ubercart - 7.x-3.x
Environment:

Note that "environment" is blank for the second one; usually this is 1 or 2.

longwave’s picture

Should this be moved to the PIFR queue, or is this the correct place to report this?

jthorson’s picture

Either queue is fine ... we'll triage as needed.

I don't believe the test results as displayed on the qa.d.o page can really be trusted when this happens ... I suspect you're actually being shown the result details from the *previous* test, and not the most recent one. This is why you end up with Coder results in the MySQL tab and vice versa.

longwave’s picture

You can check this out right now at http://qa.drupal.org/pifr/test/132134 (I'm fed up of forcing retests for today at least)

The test status, event log and review log timestamps all seem to match up. Expanding the review log seems to imply that the tests are being parsed the wrong way round.

Also, on the MySQL tab (with the coder review results!) is there any reason why some of the non-pass items are expandable while others aren't?

jthorson’s picture

The testbots have a maximum number of assertions that they will return ... but with Coder, the number of issues flagged often exceeds this value. The expandable items are those that fell under the max assertions threshold, and the non-expandable ones fell above it.
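
The threshold behaviour described above can be illustrated with a small standalone sketch (the cap value and array layout are invented for illustration; the real limit lives in the testbot code):

```php
<?php
// Hypothetical illustration of capping returned assertions: items under the
// cap keep their detail (and are expandable on qa.d.o); items over the cap
// come back as bare, non-expandable summaries.
$max_assertions = 3;  // invented value for illustration
$flagged = ['issue 1', 'issue 2', 'issue 3', 'issue 4', 'issue 5'];

$returned = [];
foreach ($flagged as $i => $message) {
  $returned[] = [
    'message' => $message,
    // Only assertions under the threshold carry expandable detail.
    'expandable' => $i < $max_assertions,
  ];
}
$expandable_count = count(array_filter(array_column($returned, 'expandable')));
```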

jthorson’s picture

Okay ... here's an explanation.

Occasionally (and I don't yet know why), the same test gets farmed out to multiple testbots. This means that the same environment may be under test by multiple testbots simultaneously. However, because the database keys on test_id + environment_id, it only has a single record ... in other words, it's only aware of one testbot's tests.

When a test is received back from a testbot, and the server goes to record the results, control passes to pifr_server_test_result(). The key to the failures is the following code:

  if ($test['status'] == PIFR_SERVER_TEST_STATUS_QUEUED) {
    // If this result constitutes the final result (all environments reviewed)
    // then mark the test as complete.
    if (count($results) == count($environments)) {
      $test['status'] = PIFR_SERVER_TEST_STATUS_RESULT;
    }

Because you might have multiple tests running for the same environment, due to multiple testbots getting the same test (though this shouldn't happen!), you can end up with two Coder results returned before you receive your simpletest results (which take much longer to run).

It appears that you then satisfy the count($results) == count($environments) check, at which point qa.d.o marks the whole test as complete ... even though your simpletest tests are still running.

Then, when the simpletest results are returned ... they get rejected because the test is not actually queued anymore. Because it's not queued, the results don't get saved to the db.

Now I haven't looked into why you still sometimes get simpletest results on the test page, or crossed-up tabs ... but I suspect that, since the results don't get saved to the db, the assertions for the previous run of that environment still exist in the database. Because the environment keys were reset when the test was farmed out to the testbots, the various tables are no longer in sync - so you can get old Coder results in a MySQL tab, or old MySQL results under a Coder tab.

It doesn't mean the results have been swapped ... it means that you probably aren't looking at current results.
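
The premature-completion scenario described above can be sketched in a few lines of standalone PHP (variable names invented; this is not actual PIFR code, and keying by environment is just one possible mitigation, not necessarily what the committed patch does):

```php
<?php
// Hypothetical illustration of why the completion check can pass
// prematurely when duplicate results arrive for one environment.

// Two environments are expected: 1 = MySQL (simpletest), 2 = Coder.
$environments = [1 => 'mysql', 2 => 'coder'];

// Two testbots both ran the Coder environment, so two Coder results come
// back before the (slower) simpletest result arrives.
$results = [
  ['environment_id' => 2, 'client_id' => 659],  // Coder result from bot A
  ['environment_id' => 2, 'client_id' => 664],  // Coder result from bot B
];

// The raw count check is satisfied even though MySQL has not reported.
$premature = count($results) == count($environments);  // true

// Keying received results by environment_id avoids the false completion.
$by_environment = [];
foreach ($results as $result) {
  $by_environment[$result['environment_id']] = $result;
}
$complete = count($by_environment) == count($environments);  // false
```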

jthorson’s picture

Next time we see this, let's do a db query and see what was actually there.

Select result_id, test_id, environment_id, code, details from pifr_result where test_id = [test_id]

xjm’s picture

It happened again this morning.

webchick ran:
select result_id, test_id, environment_id, code, details from pifr_result where test_id = 16628

Got:
Table 'drupal.pifr_result' doesn't exist

Presumably it's in a different DB that QA uses? Can we get that DB documented here so we can ask the right question next time? :)

Edit: I requeued the branch because patch tests are getting really backed up, between this and a separate issue last night with a broken commit.

webchick’s picture

Yeah, I'm happy to run queries like this, but when I do:

ssh webchick@qa.drupal.org

I get:
Permission denied (publickey,gssapi-with-mic).

If someone is able to take care of that, I can help. :) My key is on util.

rfay’s picture

This is qa's database, which is not accessible on util. You can log into www2 and go to /var/www/qa.drupal.org/htdocs and run drush cli.

Here's what I got just now. But unfortunately I had already hit the retest button.

select result_id, test_id, environment_id, code, details from pifr_result where test_id = 16628;
+-----------+---------+----------------+------+---------------------------------------------------------------------+
| result_id | test_id | environment_id | code | details                                                             |
+-----------+---------+----------------+------+---------------------------------------------------------------------+
|    483413 |   16628 |              1 |    8 | a:3:{s:5:"@pass";i:908;s:5:"@fail";i:134;s:10:"@exception";i:2922;} | 
+-----------+---------+----------------+------+---------------------------------------------------------------------+
1 row in set (0.00 sec)

jthorson’s picture

Looking at the logs for Drupal 8.x today (which failed with this symptom MULTIPLE times), it appears that there is a 'Test reset by client request' event for each one of the failures.

If we see more of this issue, please check for this in the current test cycle on the event log ... if this line appears every time a test responds this way, there's a good chance we've run into the same race condition I suspect was causing #1347582: Testbot reconfirmation fails: failed to retrieve [.patch] from [qa.drupal.org]. (Namely, secondary requests arriving during the window between an xmlrpc request and response.)

jthorson’s picture

Hmm ... and when we see 'Environment: ' with no value on the status page, the following might help shed some light on things:

Select * from pifr_test where test_id = $test_id;
Select * from pifr_test_environment where test_id = $test_id;
Select * from pifr_environment_status where test_id = $test_id;

And as before, once the test completes and comes back as failed=>@reason:

select result_id, test_id, environment_id, code, details from pifr_result where test_id = $test_id;

jthorson’s picture

Status: Active » Needs review
File (new): 638 bytes

This patch doesn't address the root problem, but it should help with one scenario (where two coder responses come in before the simpletest result, thus prematurely satisfying the count($results) == count($environments) check) ... I don't believe it should break anything else.

jthorson’s picture

Project: Drupal.org Testbots » Project Issue File Review
Version: » 6.x-2.x-dev

Moving to PIFR

jthorson’s picture

Patch in 22 committed to 6.x-2.x (dbc57ef).

jthorson’s picture

With the patch in #22, when the environment gets out of sync, qa.d.o now returns results as soon as all 'known' environments have results.

Unfortunately, if the 'out of sync' environment is related to the simpletest test (i.e. the entry with no environment is the MySQL environment), then the results are returned as soon as Coder has completed ... which tells me that the environment sync issue causes an incomplete $environments array when the check applied in the above patch is executed.

Just noting this here, as it's another starting point for further troubleshooting.

jthorson’s picture

This patch should prevent the test from being sent to the testbot when a testbot requests the same test twice, resulting in the following appearing in watchdog:

User warning: Duplicate entry '219798-1' for key 'PRIMARY' query: INSERT INTO pifr_environment_status (environment_id, test_id, client_id) VALUES (1, 219798, 699) in _db_query() (line 147 of /var/www/qa.drupal.org/htdocs/includes/database.mysqli.inc).

or when two testbots call next() at the same time and end up getting fed the same test, which corrupts the entries in pifr_environment_status.
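
A minimal sketch of the kind of guard described above (function and key names invented; the real patch works against the pifr_environment_status table, whose primary key is the test_id-environment_id pair seen in the '219798-1' watchdog message):

```php
<?php
// Hypothetical sketch: refuse to hand a test out again once a client
// already holds that test+environment combination.
function assign_test_to_client(array &$assignments, $test_id, $environment_id, $client_id) {
  $key = $test_id . '-' . $environment_id;
  if (isset($assignments[$key])) {
    // A testbot already holds this test+environment; do not send it again.
    return FALSE;
  }
  $assignments[$key] = $client_id;
  return TRUE;
}

$assignments = [];
$first = assign_test_to_client($assignments, 219798, 1, 699);   // TRUE
$second = assign_test_to_client($assignments, 219798, 1, 664);  // FALSE
```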

jthorson’s picture

File (new): 649 bytes

Um ... Re: #22 ...

... yeah. :(

jthorson’s picture

#27 committed to 6.x-2.x (a70d793). #26 still needs review.

jthorson’s picture

Status: Needs review » Fixed

#26 committed to 6.x-2.x (commit 5aa600f). Will roll 6.x-2.8-rc3 with this fix, and deploy as soon as my deploy access is granted ... and if things stay stable for a week, will tag as 6.x-2.8 final.

jthorson’s picture

Status: Fixed » Needs work

Batting a solid .000 this weekend ... :(

jthorson’s picture

File (new): 724 bytes

Bugfix patch to fix extra '$' included in the patch in #26.

jthorson’s picture

Status: Needs work » Fixed

Committed to 6.x-2.8-rc4.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Anonymous’s picture

Issue summary: View changes

Updating issue summary, as this looks like a generic PIFR issue rather than an 8.x-specific one.