Posting here from #2008626: Tests on testbots complete, but then start over, cycling endlessly.

Symptom:
Large testbot result sets are not successfully communicated back to qa.d.o, which causes the test to cycle continuously on the testbot and eventually causes an apache segfault. Originally this affected *all* D8 tests, until we removed a debug() call in core that was responsible for about 10k exceptions in the result set. That was a band-aid fix which works for 'clean' D8 tests, but it does not address the root problem: tests for broken patches with a lot of exceptions are still tripping the HTTP 413 response, and I've had to clean up at least 2 segfaults per bot in the last 16 hours.

Root cause:
I've managed to isolate the headers from a failed results POST from a testbot to qa.d.o. The HTTP 413 is being returned by nginx. This may explain why it was not showing up in the qa.d.o apache logs: if the proxy is blocking the POST, the request never reaches qa.d.o.

The result set in question was approx 4MB in size, with only 5k exceptions (and I've seen tests with 3-4 times that in the past), so we're going to need to set a fairly high value in order to prevent this from biting us again in the future. However, nnewton said he had already increased the client_max_body_size, so I'm not sure whether that change isn't taking effect or nginx is relaying the 413 response from something further downstream.
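
As a rough sizing aid for that "fairly high value", here is a back-of-the-envelope sketch in Python. It assumes payload size scales linearly with the exception count, which has not been verified, and it uses only the approximate figures quoted above (about 4MB for about 5k exceptions). The function name is illustrative.

```python
# Rough sizing sketch; the inputs are the approximate figures from this report.
observed_bytes = 4 * 1024 * 1024   # "approx 4MB" result set
observed_exceptions = 5_000        # "only 5k exceptions"

# Assumption: exceptions dominate the payload and size grows linearly with them.
bytes_per_exception = observed_bytes / observed_exceptions


def estimated_payload_mib(exceptions: int) -> float:
    """Linear estimate of the POST body size in MiB for a given exception count."""
    return exceptions * bytes_per_exception / (1024 * 1024)


# Result sets with 3-4 times as many exceptions have been seen in the past.
for multiplier in (3, 4):
    count = observed_exceptions * multiplier
    print(f"{count} exceptions -> ~{estimated_payload_mib(count):.0f} MiB")
```

Under those assumptions the largest result sets seen so far would land somewhere around 12 to 16 MiB, so any limit would need headroom beyond that.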

Comments

jthorson’s picture

Request:

[request] => POST /xmlrpc.php HTTP/1.0
  Host: qa.drupal.org
  User-Agent: Drupal (+http://drupal.org/)
  Content-Length: 4118522
  Content-Type: text/xml

<?xml version="1.0"?>
<methodCall>
<methodName>pifr.result</methodName>
<params>
...

Response Data:

[data] => <html>
<head><title>413 Request Entity Too Large</title></head>
<body bgcolor="white">
<center><h1>413 Request Entity Too Large</h1></center>
<hr><center>nginx/1.4.1</center>
</body>
</html>

Response Metadata:

    [protocol] => HTTP/1.1
    [status_message] => Request Entity Too Large
    [headers] => Array
        (
            [Server] => nginx/1.4.1
            [Date] => Sat, 01 Jun 2013 18:27:11 GMT
            [Content-Type] => text/html
            [Content-Length] => 198
            [Connection] => close
        )

    [error] => Request Entity Too Large
    [code] => 413

chx’s picture

Category: support » bug
Priority: Major » Critical

Jeremy is way too kind with his issue priorities.

rfay’s picture

nnewton suggests adding a hosts entry like this to the testbots:

140.211.10.17 qa.drupal.org

(140.211.10.17 is actually www2.drupal.org)

which might make requests to qa come in behind nginx.

drumm’s picture

Is it possible to compress the test results? I'm guessing they would gzip really well. I doubt you can compress the entire POST data and have xmlrpc.php recognize it. You would have to include it in just enough XMLRPC wrapping to get the string back on the server and extract the data. Might be a better long-term solution than hard-coding IPs and maybe even faster for bandwidth.
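
A minimal sketch of that idea, written in Python purely for illustration (the real testbot and qa.d.o code is Drupal/PHP). The result text is gzipped and carried as a single base64 value inside an otherwise ordinary XML-RPC call. The method name `pifr.result_gz` and both helper names are hypothetical, not an existing API.

```python
import gzip
import xmlrpc.client


def pack_results(results_xml: str) -> str:
    """Gzip the result text and wrap it in an XML-RPC call as a base64 value."""
    compressed = gzip.compress(results_xml.encode("utf-8"))
    payload = xmlrpc.client.Binary(compressed)  # serialized as <base64>...</base64>
    # "pifr.result_gz" is a hypothetical method name used only for this sketch.
    return xmlrpc.client.dumps((payload,), methodname="pifr.result_gz")


def unpack_results(request_body: str) -> str:
    """Server side: pull the base64 value out of the call and decompress it."""
    params, _method = xmlrpc.client.loads(request_body)
    return gzip.decompress(params[0].data).decode("utf-8")


# Synthetic, highly repetitive sample; real result sets will compress less well.
sample = "<exception>Undefined index: foo in bar.php line 12</exception>\n" * 5000
body = pack_results(sample)
print(len(sample), "->", len(body))
assert unpack_results(body) == sample
```

Base64 adds roughly a third on top of the compressed size, so the gain depends on how repetitive the exception text really is.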

jthorson’s picture

Status: Active » Fixed

#3 got us running smoothly again ... we'll pursue the results compression in another issue.

rfay’s picture

Priority: Critical » Major
Status: Fixed » Active

This really isn't fixed, it's hacked. I'd like to see us get an appropriate way to manage it.

We need to not be blocked on size when going through nginx, or something. The current hack in testbot /etc/hosts is nothing more than a very short-term approach.

jthorson’s picture

Sorry ... I forgot the crosslink. The long term solution should be #2010482: Compress test results before sending over to qa.d.o.

drumm’s picture

Title: Nginx blocking Testbot communications with qa.d.o » Nginx blocking large requests, including testbot communications and file uploads
Component: qa.drupal.org » Webserver

This is affecting file uploads too. Specifically, session submissions for DrupalCon Prague, which are due this week.

Component: Webserver » Servers
jthorson’s picture

Issue summary: View changes

This is resolved from a testbot perspective ... I'd close it, if not for Neil's comment regarding file uploads; I'll leave it open for his comments on whether this is still an issue.

drumm’s picture

Status: Active » Fixed

Fixed as far as I know.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.