PIFR reports failed test result with no details, even though testbots have run successfully [#1361468]

Comment	File	Size	Author
#31	1361468-31.patch	724 bytes	jthorson
#27	pifr_what-were-you-thinking_1361468-27.patch	649 bytes	jthorson
#26	pifr_avoid-dupe-pifr-env-status-entries-1361468-26.patch	3.21 KB	jthorson
#22	pifr_modify-results-for-all-env-check_1361468-22.patch	638 bytes	jthorson

Comment #1

she/her

English

commented 5 December 2011 at 14:40

Log entries:
http://qa.drupal.org/admin/reports/event/4239363 for the mysql tab log
http://qa.drupal.org/admin/reports/event/4239293 for the coder tab log

Log in or register to post comments

Comment #2

xjm

she/her

English

commented 5 December 2011 at 14:52

Title:

8.x branch tests failing on coder advisory review

» 8.x branch tests failing

Hm, maybe I misread the message -- there's a semicolon in the middle, so I guess the second part is just the normal message stub from coder. So the tests just aren't running for some other reason?

Log in or register to post comments

Comment #3

xjm

she/her

English

commented 5 December 2011 at 15:42

Priority:

Critical

» Normal

Branch tests passed again a couple hours later:
http://qa.drupal.org/admin/reports/event/4239603

So downgrading to normal.

Log in or register to post comments

Comment #4

jthorson commented 5 December 2011 at 16:09

Looking at the logs, it appears this test ran on multiple testbots simultaneously, and tripped a watchdog error when the second testbot reported results:

pifr_server 12/05/2011 - 08:01:57 Test not assigned to client: (t: 16628, c: 699)

Normally, PIFR handles this fine, but since we don't know what causes the multiple testbot scenario, I couldn't test how PIFR_Server responds when it's a Coder test which gets crossed up.

In any case, both tests ran successfully, and the branch failure was a false positive due to PIFR_Server somehow getting confused with the duplicate tests and improperly parsing the results.

There could be an issue with the 'advisory' pass/fail calculation in this scenario, but we'll have to keep an eye out to try and figure exactly what situation causes the failure ... I suspect it's the order that the test is picked up and returned from the multiple testbots, but even the testing by multiple testbots is a race condition itself.

Log in or register to post comments

Comment #4.0

jthorson commented 5 December 2011 at 16:09

Project:	Drupal.org Testbots	» Project Issue File Review
Issue summary:	View changes

Updated issue summary.

Log in or register to post comments

Comment #5

jthorson commented 6 December 2011 at 02:35

Title:	8.x branch tests failing	» PIFR reports failed test result with no details, even though testbots have ran successfully
Project:	Project Issue File Review	» Drupal.org Testbots

Title change.

Log in or register to post comments

Comment #6

jthorson commented 16 December 2011 at 21:20

Another example:

https://www.evernote.com/shard/s15/sh/16200e89-a70d-42e9-a7de-e3126a988d...

Log in or register to post comments

Comment #7

longwave

he/him

English

UK

commented 21 December 2011 at 00:30

Just noticed this on Ubercart's 7.x-3.x branch test at http://qa.drupal.org/pifr/test/132134

The cause here seems to be two testbots picking up the same test:

2,684,143 	Result retrieved by project client. 	1 hour 58 min ago
2,684,133 	Result retrieved by project client. 	1 hour 58 min ago
2,684,128 	Result received from test client #664. 	1 hour 58 min ago
2,684,123 	Result received from test client #659. 	1 hour 59 min ago
2,684,118 	Requested by test client #664. 	1 hour 59 min ago
2,684,113 	Requested by test client #659. 	1 hour 59 min ago
2,684,108 	Test reset by client request. 	1 hour 59 min ago
2,684,103 	Requested by test client #659. 	2 hours 45 sec ago
2,684,093 	Requested by test client #699. 	2 hours 1 min ago
2,684,083 	Test request received. 	2 hours 1 min ago

The MySQL and Code review tabs are in the expected order in this case.

Log in or register to post comments

Comment #8

rfay

he,him

English

Palisade, CO, USA

commented 21 December 2011 at 01:12

Title:

PIFR reports failed test result with no details, even though testbots have ran successfully

» PIFR reports failed test result with no details, even though testbots have run successfully

Sorry can't help myself. Just changing title.

Log in or register to post comments

Comment #9

longwave

he/him

English

UK

commented 6 January 2012 at 01:23

Actually, the duplicate test requests seem to be a red herring. What appears to be happening is the results are being parsed in the wrong order, so the code review results are being parsed as MySQL results (which will fail for most contrib as Coder Tough Love reports false positives) and vice versa.

As an example, the last failing Ubercart branch test ended in MySQL failing with "617 pass(es), 44 fail(s), and 899 exception(es)" and Code review passing with "1,816 minor(s), 0 critical(s), and 0 normal(s)" - the numbers are correct, but for the opposite environments.

This can be corrected by forcing enough retests until the bots pick up the results in the right order, although this is somewhat tedious!

Log in or register to post comments

Comment #10

longwave

he/him

English

UK

commented 6 January 2012 at 01:25

Spotted at http://qa.drupal.org/pifr/status while retesting:

Client 659
Enabled - Reviewing
Ubercart - 7.x-3.x
Environment: 2

Client 664
Enabled - Reviewing
Ubercart - 7.x-3.x
Environment:

Note that "environment" is blank for the second one; usually this is 1 or 2.

Log in or register to post comments

Comment #11

longwave

he/him

English

UK

commented 6 January 2012 at 01:30

Should this be moved to the PIFR queue, or is this the correct place to report this?

Log in or register to post comments

Comment #12

jthorson commented 6 January 2012 at 02:37

Either queue is fine ... we'll triage as needed.

I don't believe the test results as displayed on the qa.d.o page can really be trusted when this happens ... I suspect you're actually being shown the result details from the *previous* test, and not the most recent one. This is why you end up with Coder results in the MySQL tab and vice versa.

Log in or register to post comments

Comment #13

longwave

he/him

English

UK

commented 6 January 2012 at 03:05

You can check this out right now at http://qa.drupal.org/pifr/test/132134 (I'm fed up of forcing retests for today at least)

The test status, event log and review log timestamps all seem to match up. Expanding the review log seems to imply that the tests are being parsed the wrong way round.

Also, on the MySQL tab (with the coder review results!) is there any reason why some of the non-pass items are expandable while others aren't?

Log in or register to post comments

Comment #14

jthorson commented 6 January 2012 at 03:45

The testbots have a maximum # of assertions that they will return ... but with Coder, the number of issues flagged often exceeds this value. The expanable items are those that fell under the max assertions threshold, and the non-expandable fell above.

Log in or register to post comments

Comment #15

jthorson commented 20 January 2012 at 05:09

Okay ... here's an explanation.

Occasionally (and I don't yet know why), the same test gets farmed out to multiple testbots. This means that the same environment may be under test by multiple testbots simultaneously. However, because the database keys on test_id + environment_id, it only has a single record ... in other words, it's only aware of one testbot's tests.

When a test is received back from a testbot, and the server goes to record the results, control passes to pifr_server_test_result(). The key to the failures is the following code:

  if ($test['status'] == PIFR_SERVER_TEST_STATUS_QUEUED) {
    // If this result constitutes the final result (all environments reviewed)
    // then mark the test as complete.
    if (count($results) == count($environments)) {
      $test['status'] = PIFR_SERVER_TEST_STATUS_RESULT;
    }

Because you might have multiple test running for the same environment, due to multiple testbots getting the same test (thought this shouldn't happen!), you can end up with two coder results returned before you receive your simpletest results (which take much longer to run).

It appears that you then satisfy the count($results) == count($environments) check, at which point qa.d.o marks the whole test as complete ... even though your simpletest tests are still running.

Then, when the simpletest results are returned ... they get rejected because the test is not actually queued anymore. Because it's not queued, the results don't get saved to the db.

Now I haven't looked into why you still sometimes get simpletest results on the test page, or crossed up tabs ... but I suspect that, since the results don't get saved to the db, the assertions for the previous run of that environment still exist in the database. Because the environment keys were reset when the test was farmed out to the testbots, the various tables are no longer in synch - so you can get old Coder results in a mysql tab, or old Mysql results under a Coder tab.

It doesn't mean the results have been swapped ... it means that you probably aren't looking at current results.

Log in or register to post comments

Comment #16

jthorson commented 20 January 2012 at 05:45

Next time we see this, lets do a db query and see what was actually there.

Select result_id, test_id, environment_id, code, details from pifr_result where test_id = [test_id]

Log in or register to post comments

Comment #17

xjm

she/her

English

commented 20 January 2012 at 13:36

It happened again this morning.

webchick ran:
select result_id, test_id, environment_id, code, details from pifr_result where test_id = 16628

Got:
Table 'drupal.pifr_result' doesn't exist

Presumably it's in a different DB that QA uses? Can we get that DB documented here so we can ask the right question next time? :)

Edit: I requeued the branch because patch tests are getting really backed up, between this and a separate issue last night with a broken commit.

Log in or register to post comments

Comment #18

webchick

she/they

English

Vancouver 🇨🇦

commented 20 January 2012 at 13:42

Yeah, I'm happy to run queries like this, but when I do:

ssh webchick@qa.drupal.org

I get:
Permission denied (publickey,gssapi-with-mic).

If someone is able to take care of that, I can help. :) My key is on util.

Log in or register to post comments

Comment #19

rfay

he,him

English

Palisade, CO, USA

commented 20 January 2012 at 16:04

This is qa's database, which is not accessible on util. You can log into www2 and go to /var/www/qa.drupal.org/htdocs and run drush cli.

Here's what I got just now. But unfortunately I had already hit the retest button.

select result_id, test_id, environment_id, code, details from pifr_result where test_id = 16628;
+-----------+---------+----------------+------+---------------------------------------------------------------------+
| result_id | test_id | environment_id | code | details                                                             |
+-----------+---------+----------------+------+---------------------------------------------------------------------+
|    483413 |   16628 |              1 |    8 | a:3:{s:5:"@pass";i:908;s:5:"@fail";i:134;s:10:"@exception";i:2922;} | 
+-----------+---------+----------------+------+---------------------------------------------------------------------+
1 row in set (0.00 sec)

Log in or register to post comments

Comment #20

jthorson commented 22 January 2012 at 05:05

Looking at the logs for Drupal 8.x today (which failed with this symptom MULTIPLE times), it appears that there is a 'Test reset by client request' event for each one of the failures.

If we see more of this issue, please check for this in the current test cycle on the event log ... if this line appears every time that a test responds this way, there's a good chance we've ran into the same race condition as what I suspect was causing #1347582: Testbot reconfirmation fails: failed to retrieve [.patch] from [qa.drupal.org]. (Namely, secondary requests arriving during the window between an xmlrpc request and response).

Log in or register to post comments

Comment #21

jthorson commented 22 January 2012 at 05:51

Hmm ... and when we see 'Environment: ' with no value on the status page, the following might help shed some light on things:

Select * from pifr_test where test_id = $test_id;
Select * from pifr_test_environment where test_id = $test_id;
Select * from pifr_environment_status where test_id = $test_id;

And as before, once the test completes and comes back as failed=>@reason:

select result_id, test_id, environment_id, code, details from pifr_result wehre test_id = $test_id;

Log in or register to post comments

Comment #22

jthorson commented 22 January 2012 at 06:27

Status:

Active

» Needs review

Status	File	Size
new	pifr_modify-results-for-all-env-check_1361468-22.patch	638 bytes

This patch doesn't address the root problem, but it should help with one scenario (where two coder responses come in before the simpletest result, thus prematurely satisfying the count($results) == count($environments) check ... don't believe it should break anything else.

Log in or register to post comments

Comment #23

jthorson commented 3 February 2012 at 02:52

Project:	Drupal.org Testbots	» Project Issue File Review
Version:		» 6.x-2.x-dev

Moving to PIFR

Log in or register to post comments

Comment #24

jthorson commented 3 February 2012 at 03:14

Patch in 22 committed to 6.x-2.x (dbc57ef).

Log in or register to post comments

Comment #25

jthorson commented 3 February 2012 at 14:27

With the patch in #22, when the environment gets out of sync, qa.d.o now returns results as soon as all 'known' environments have results.

Unfortunately, if the 'out of sync' environment is related to the simpletest test (i.e. the entry with no environment is the MySql environment), then the results are returned as soon as Coder has completed ... which tells me that the environment sync issue causes an incomplete $environments array when the check applied in the above patch is executed.

Just noting this here, as it's another starting point for further troubleshooting.

Log in or register to post comments

Comment #26

jthorson commented 4 February 2012 at 03:10

Status	File	Size
new	pifr_avoid-dupe-pifr-env-status-entries-1361468-26.patch	3.21 KB

This patch should prevent the test from being sent to the testbot when a testbot requests the same test twice, resulting in the following appearing in watchdog:

User warning: Duplicate entry '219798-1' for key 'PRIMARY' query: INSERT INTO pifr_environment_status (environment_id, test_id, client_id) VALUES (1, 219798, 699) in _db_query() (line 147 of /var/www/qa.drupal.org/htdocs/includes/database.mysqli.inc).

or when two testbots call next() at the same time, and end up getting fed the same test; which corrupts the entries in pifr_environment_status().

Log in or register to post comments

Comment #27

jthorson commented 4 February 2012 at 18:59

Status	File	Size
new	pifr_what-were-you-thinking_1361468-27.patch	649 bytes

Um ... Re: #22 ...

... yeah. :(

Log in or register to post comments

Comment #28

jthorson commented 4 February 2012 at 19:02

#27 committed to 6.x-2.x (a70d793). #26 still needs review.

Log in or register to post comments

Comment #29

jthorson commented 25 February 2012 at 04:40

Status:

Needs review

» Fixed

#26 committed to 6.x-2.x (commit 5aa600f). Will roll 6.x-2.8-rc3 with this fix, and deploy as soon as my deploy access is granted ... and if things stay stable for a week, will tag as 6.x-2.8 final.