https://qa.drupal.org/pifr/test/1032863 is a three hour test run on bot 2943. No idea why since the previous run was the expected 30 mins https://qa.drupal.org/pifr/test/1033103 and the changes in #2272685-23: Incorrect handling of invalid cache items in apcu backend are very similar to the changes in previous patch on the issue which completed quickly.

Comments

alexpott’s picture

Also given that we are using bots that complete in 30 mins I think we should be terminating any runs that take longer than an hour. I think the existing testbot infra has a way of doing this.

alexpott’s picture

Issue summary: View changes

The next run on 2943 is already up to 40 mins - https://qa.drupal.org/pifr/test/1033243

xjm’s picture

Another very long test run underway in #2204697: Move getConfigPrefix() to ConfigEntityTypeInterface:
https://qa.drupal.org/pifr/test/1035983
It's already up to an hour. @alexpott will post specifics he found logging into the bot.

xjm’s picture

Title: Unexpected 3 hour testbot run » Unexpected timeouts on testbot runs
Related issues: +#2478797: Reduce testbot time limit from 180 minutes to 60 minutes

Note that the 3h limit was reduced to 1h in #2478797: Reduce testbot time limit from 180 minutes to 60 minutes, so retitling accordingly.

alexpott’s picture

So the 1+ hr test for #2204697: Move getConfigPrefix() to ConfigEntityTypeInterface was followed up by a re-test on the same bot - that took 31.5 minutes. How odd. See https://qa.drupal.org/pifr/test/1035983

Result retrieved by project client.	12 min 32 sec ago
Result received from test client #2853.	13 min 14 sec ago
Requested by test client #2853.	41 min 39 sec ago
Test reset by client request.	41 min 39 sec ago
Requested by test client #2853.	1 hour 38 min ago
alexpott’s picture

Also I don't know why the test was reset by client request before the hour was reached.

jthorson’s picture

I'd expect the testbot watchdog log should provide some insight as to why it reset ... but here's some pure speculation:

One thing that I have seen from time to time is a race condition window which would trick a PIFR testbot into kicking off two simultaneous 'run-tests.sh' test runs on the same server. I was able to refactor the code to significantly minimize this window, but not close it completely.

When this happens, you end up with two simultaneous run_tests.sh executions, at maximum concurrency, on the same bot. However, from a testbot perspective (based on the contents of the database), only the second request exists; and the bot is not aware of the first test process. However, when the first process completes, and the code related to that particular process returns it's result, the bot can't correlate it with any expected test run and therefore resets everything to clean it's environment.

Because of the extra load, the duration of these tests increases; and seeing this at the 55 minute mark on a 30 minute test would make sense as one *potential* explanation for the observed behavior.

alexpott’s picture

re #7 I managed to log on the server and look at the processes running. I don't think I saw two lots of tests running - i think I saw the 14 run-tests processes that I expected.

alexpott’s picture

This has just happened on a branch test. https://qa.drupal.org/8.0.x-status. Looks like only one test was terminated...

Drupal\views_ui\Tests\ViewsUITourTest                         46 passes
Drupal\views\Tests\Plugin\DisplayTest                        154 passes                            3 messages
Drupal\views_ui\Tests\OverrideDisplaysTest                   150 passes
Drupal\views\Tests\Plugin\PagerTest                          138 passes
Drupal\views_ui\Tests\PreviewTest                            170 passes
Drupal\views_ui\Tests\ViewEditTest                           135 passes                            1 messages
Terminated].
[12:09:05] Encountered error on [review], details:
array (
  '@reason' => 'failed during invocation of run-tests.sh',
)
[12:09:05] Review complete. test_id=833633 result code=8 details=Array
(
    [@reason] => failed during invocation of run-tests.sh
)
alexpott’s picture

More info

ID	Message	Timestamp
12790293	Result retrieved by project client.	1 hour 27 min ago
12790278	Result received from test client #2938.	1 hour 28 min ago
12789898	Requested by test client #2938.	2 hours 28 min ago
12789888	Test request received.	2 hours 29 min ago
12787488	Result received from test client #2938.	9 hours 14 min ago
12787483	Result retrieved by project client.	9 hours 14 min ago
12787408	Requested by test client #2938.	9 hours 43 min ago
12787403	Test request received.	9 hours 43 min ago

So the bot had the usual 30 min run 9 hours ago but then 2 hours ago had a run that exceeded the hour.

Mixologic’s picture

Super sad. I *think* we're hitting a bug in APC.

On the testbot that was dragging on slowly (one that was taking a very long time to complete, > 1 hour) we were seeing that the apc opcode cache was not storing any cached opcodes, thereby rendering it useless.

The apache logs show that one of the apache children died with a segfault:

[Tue May 05 06:25:02 2015] [warn] RSA server certificate CommonName (CN) `drupaltestbot' does NOT match server name!?
[Tue May 05 06:25:02 2015] [notice] Apache/2.2.22 (Debian) PHP/5.4.40-1~dotdeb+wheezy.1 mod_ssl/2.2.22 OpenSSL/1.0.1e configured -- resuming normal operations
[Tue May 05 07:03:16 2015] [notice] child pid 87863 exit signal Segmentation fault (11)
[Tue May 05 08:28:22 2015] [error] [client 172.30.2.177] client denied by server configuration: /var/lib/drupaltestbot/checkout/core/modules/system/tests/fixtures/HtaccessTest/access_test.engine
[Tue May 05 08:28:22 2015] [error] [client 172.30.2.177] client denied by server configuration: /var/lib/drupaltestbot/checkout/core/modules/system/tests/fixtures/HtaccessTest/access_test.inc
[Tue May 05 08:28:22 2015] [error] [client 172.30.2.177] client denied by server configuration: /var/lib/drupaltestbot/checkout/core/modules/system/tests/fixtures/HtaccessTest/access_test.install
[Tue May 05 08:28:22 2015] [error] [client 172.30.2.177] client denied by server configuration: /var/lib/drupaltestbot/checkout/core/modules/system/tests/fixtures/HtaccessTest/access_test.make

......

Which may be causing us to run into some sort of unreleased lock scenario as detailed by this APC bug:

https://bugs.php.net/bug.php?id=69048

If this is true, then I have no idea what we might be able to do to mitigate this issue.

alexpott’s picture

Priority: Major » Critical

I've just bumped the timeout on all testbots to 90 since we don't seem to be able to complete tests within the 60 minute limit atm.

alexpott’s picture

Also I've reduced the concurrency from 15 to 12. The current amazon instances have 16 cores - only leaving 1 free seems like it's not enough.

alexpott’s picture

90 minutes does not appear to be enough - bumping to 120

xjm’s picture

@alexpott eventually ended up bumping the timeouts to 150. So not quite 180. :P

alexpott’s picture

So this was only on the bad old bots. The new west coast bots have a timeout 60 again and they are all completing well within that time (around 30 mins).

jthorson’s picture

Just curious ... have we been attempting any apache reloads on these bots? The last time we were running into segfaults, an apache reload seemed to perk things back up.

Mixologic’s picture

Status: Active » Closed (outdated)