testbot not reporting results back to issue queues, no Re-test link [#1848714]

Problem/Motivation

Testbots are not reporting the results of completed tests back to the issue queue.
For a recent example, see #1804688-61: Download and import interface translations

Proposed resolution

No resolution is known yet.

The workaround is to go back to the test request after about an hour,
click on view details,
check the results there,
and manually comment in the issue, updating the issue status as appropriate (for example, to "needs work" if the test failed).

No workaround is known for "retest".
Possible options: contact one of the testbot people in IRC to ask for a retest,
or upload the patch again.

Remaining tasks

make sure this is the correct project/issue queue
make sure this is the correct component
add other remaining tasks

User interface changes

none expected.

API changes

none expected.

Comments

Comment #1

jthorson commented 23 November 2012 at 15:26

Did some troubleshooting last night. The communications channel between qa.d.o and d.o is working fine for some of the data exchanges, but the one for reporting results back is returning 'Server Internal Error' (along with 2-5 watchdog error logs per minute). The cause is one of the items in the test array, which is missing a piece of data that is mandatory for the processing. Not sure how this test got in without this parameter, but it causes the entire response to fail.

I tweaked a variable to bypass this test, which got rid of the 'Server Internal Error' issue ... but caused this particular XML-RPC function to simply time out.

Plan on finding whatever time I can today (while at the 'day-job') to chase things down with the infra team.

Comment #2

jthorson commented 23 November 2012 at 16:56

Just recording some info I may need to clean up later ...

last good test (original pift_retrieve_last): 1353602796
dupe test_id: 390448
outstanding tests: 109
last test_id: 391218

first fixed timestamp: 1353647198

Comment #3

jthorson commented 23 November 2012 at 17:23

Priority:	Normal	» Major
Status:	Active	» Fixed

Issue was initially caused by an errant test result record that found its way into the pifr_results database ... test 390448 ended up with two pifr_results entries, one for simpletest and one with environment = 0. The PIFR retrieve result code couldn't deal with this environment = 0 (which is completely invalid to start with), resulting in PHP warnings while processing it and the 'Server Internal Error' response being sent back to PIFT.

Once this extra record was deleted, the backlog of tests waiting to send results back to PIFT was 120 tests long. PIFR attempts to process these in batches no larger than 100; but even with this the processing was exceeding the default 30 second timeout in drupal_http_request() (which is called by xmlrpc()). By shrinking the batch down to about 70 tests, the processing was able to complete successfully and unblock the logjam.

Unfortunately, with the current codebase, I have no way of forcing specific tests (i.e. the other 120) to try and update ... the code simply updates every record after the last 'successful' update received by PIFT.
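For readers unfamiliar with this "everything since the last successful update" pattern, here is a rough illustration in Python. It is only a sketch of the behavior described in this comment, not the actual PIFR/PIFT PHP code; the function names, the `last_updated` field, and the `batch_max` parameter are all hypothetical. It shows how a capped batch drains a backlog over several round trips, and why a smaller cap means less work per request.

```python
def results_since(results, last_timestamp, batch_max):
    """Return up to batch_max results updated after last_timestamp, oldest first.

    Hypothetical stand-in for the server-side "get since" lookup.
    """
    pending = sorted(
        (r for r in results if r["last_updated"] > last_timestamp),
        key=lambda r: r["last_updated"],
    )
    return pending[:batch_max]


def drain_backlog(results, last_timestamp, batch_max):
    """Fetch batches until nothing is pending; return the size of each batch.

    The cursor only moves forward to the newest record of each batch, so there
    is no way to ask for one specific test again. This sketch assumes the
    timestamps are unique; ties at a batch boundary would be skipped.
    """
    sizes = []
    while True:
        batch = results_since(results, last_timestamp, batch_max)
        if not batch:
            return sizes
        sizes.append(len(batch))
        last_timestamp = batch[-1]["last_updated"]


# A made-up backlog of 120 results with timestamps 1..120.
backlog = [{"test_id": i, "last_updated": i} for i in range(1, 121)]
print(drain_backlog(backlog, 0, 100))  # [100, 20]
print(drain_backlog(backlog, 0, 70))   # [70, 50]
```

With a cap of 70, the same 120 results go out as two smaller requests instead of one request of 100, each of which has a better chance of finishing inside the HTTP timeout.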

Resulting recommendations:

- #1848800: Decrease the PIFR_SERVER_BATCH_MAX_COUNT constant from 100 down to 50
- #1848806: Modify pifr_server_test_get_since (or downstream functions) to screen out invalid test results with environment = 0
- #1848814: Do not allow PIFR to write a result with environment = 0 to the pifr_result database table
- #1848820: Create a way to manually force a qa.d.o -> d.o result update and cancel queued test.
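As a rough illustration of the two environment = 0 recommendations (#1848806 and #1848814), the checks could look something like the Python sketch below. This is not PIFR code: the function names, the dictionary layout, and the use of environment id 1 for the simpletest record are assumptions made for the example only.

```python
INVALID_ENVIRONMENT = 0  # environment = 0 is never a valid value


def is_valid_result(result):
    """A result is usable only if it carries a non-zero environment id."""
    return result.get("environment", INVALID_ENVIRONMENT) != INVALID_ENVIRONMENT


def screen_results(results):
    """Read side: drop invalid records before building the response."""
    return [r for r in results if is_valid_result(r)]


def write_result(table, result):
    """Write side: refuse to store a record with environment = 0."""
    if not is_valid_result(result):
        raise ValueError("refusing to store a result with environment = 0")
    table.append(result)


# Test 390448 with its two records; environment id 1 is a made-up value.
rows = [
    {"test_id": 390448, "environment": 1},
    {"test_id": 390448, "environment": 0},
]
print(screen_results(rows))  # [{'test_id': 390448, 'environment': 1}]
```

Screening on read keeps one bad record from failing a whole batch, while rejecting on write stops such a record from being stored in the first place.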