Unexpected timeouts on testbot runs [#2477583]

Comment #1

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott at Chapter Three commented 24 April 2015 at 21:37

Also given that we are using bots that complete in 30 mins I think we should be terminating any runs that take longer than an hour. I think the existing testbot infra has a way of doing this.

Log in or register to post comments

Comment #2

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott at Chapter Three commented 24 April 2015 at 21:38

Issue summary:

View changes

The next run on 2943 is already up to 40 mins - https://qa.drupal.org/pifr/test/1033243

Log in or register to post comments

Comment #3

xjm

she/her

English

CreditAttribution: xjm at Acquia commented 1 May 2015 at 14:42

Another very long test run underway in #2204697: Move getConfigPrefix() to ConfigEntityTypeInterface:
https://qa.drupal.org/pifr/test/1035983
It's already up to an hour. @alexpott will post specifics he found logging into the bot.

Log in or register to post comments

Comment #4

xjm

she/her

English

CreditAttribution: xjm at Acquia commented 1 May 2015 at 14:52

Title:	Unexpected 3 hour testbot run	» Unexpected timeouts on testbot runs
Related issues:		+#2478797: Reduce testbot time limit from 180 minutes to 60 minutes

Note that the 3h limit was reduced to 1h in #2478797: Reduce testbot time limit from 180 minutes to 60 minutes, so retitling accordingly.

Log in or register to post comments

Comment #6

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott at Chapter Three commented 1 May 2015 at 15:21

Also I don't know why the test was reset by client request before the hour was reached.

Log in or register to post comments

Comment #7

jthorson CreditAttribution: jthorson commented 1 May 2015 at 16:13

I'd expect the testbot watchdog log should provide some insight as to why it reset ... but here's some pure speculation:

One thing that I have seen from time to time is a race condition window which would trick a PIFR testbot into kicking off two simultaneous 'run-tests.sh' test runs on the same server. I was able to refactor the code to significantly minimize this window, but not close it completely.

When this happens, you end up with two simultaneous run_tests.sh executions, at maximum concurrency, on the same bot. However, from a testbot perspective (based on the contents of the database), only the second request exists; and the bot is not aware of the first test process. However, when the first process completes, and the code related to that particular process returns it's result, the bot can't correlate it with any expected test run and therefore resets everything to clean it's environment.

Because of the extra load, the duration of these tests increases; and seeing this at the 55 minute mark on a 30 minute test would make sense as one *potential* explanation for the observed behavior.

Log in or register to post comments

Comment #8

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott at Chapter Three commented 1 May 2015 at 16:19

re #7 I managed to log on the server and look at the processes running. I don't think I saw two lots of tests running - i think I saw the 14 run-tests processes that I expected.

Log in or register to post comments

Comment #9

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott at Chapter Three commented 4 May 2015 at 20:38

This has just happened on a branch test. https://qa.drupal.org/8.0.x-status. Looks like only one test was terminated...

Drupal\views_ui\Tests\ViewsUITourTest                         46 passes
Drupal\views\Tests\Plugin\DisplayTest                        154 passes                            3 messages
Drupal\views_ui\Tests\OverrideDisplaysTest                   150 passes
Drupal\views\Tests\Plugin\PagerTest                          138 passes
Drupal\views_ui\Tests\PreviewTest                            170 passes
Drupal\views_ui\Tests\ViewEditTest                           135 passes                            1 messages
Terminated].
[12:09:05] Encountered error on [review], details:
array (
  '@reason' => 'failed during invocation of run-tests.sh',
)
[12:09:05] Review complete. test_id=833633 result code=8 details=Array
(
    [@reason] => failed during invocation of run-tests.sh
)

Log in or register to post comments

Comment #10

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott at Chapter Three commented 4 May 2015 at 20:41

More info

ID	Message	Timestamp
12790293	Result retrieved by project client.	1 hour 27 min ago
12790278	Result received from test client #2938.	1 hour 28 min ago
12789898	Requested by test client #2938.	2 hours 28 min ago
12789888	Test request received.	2 hours 29 min ago
12787488	Result received from test client #2938.	9 hours 14 min ago
12787483	Result retrieved by project client.	9 hours 14 min ago
12787408	Requested by test client #2938.	9 hours 43 min ago
12787403	Test request received.	9 hours 43 min ago

So the bot had the usual 30 min run 9 hours ago but then 2 hours ago had a run that exceeded the hour.

Log in or register to post comments

Comment #11

Mixologic

He/Him/His

English

Portland, OR

CreditAttribution: Mixologic at Drupal Association commented 5 May 2015 at 23:40

Super sad. I *think* we're hitting a bug in APC.

On the testbot that was dragging on slowly (one that was taking a very long time to complete, > 1 hour) we were seeing that the apc opcode cache was not storing any cached opcodes, thereby rendering it useless.

The apache logs show that one of the apache children died with a segfault:

[Tue May 05 06:25:02 2015] [warn] RSA server certificate CommonName (CN) `drupaltestbot' does NOT match server name!?
[Tue May 05 06:25:02 2015] [notice] Apache/2.2.22 (Debian) PHP/5.4.40-1~dotdeb+wheezy.1 mod_ssl/2.2.22 OpenSSL/1.0.1e configured -- resuming normal operations
[Tue May 05 07:03:16 2015] [notice] child pid 87863 exit signal Segmentation fault (11)
[Tue May 05 08:28:22 2015] [error] [client 172.30.2.177] client denied by server configuration: /var/lib/drupaltestbot/checkout/core/modules/system/tests/fixtures/HtaccessTest/access_test.engine
[Tue May 05 08:28:22 2015] [error] [client 172.30.2.177] client denied by server configuration: /var/lib/drupaltestbot/checkout/core/modules/system/tests/fixtures/HtaccessTest/access_test.inc
[Tue May 05 08:28:22 2015] [error] [client 172.30.2.177] client denied by server configuration: /var/lib/drupaltestbot/checkout/core/modules/system/tests/fixtures/HtaccessTest/access_test.install
[Tue May 05 08:28:22 2015] [error] [client 172.30.2.177] client denied by server configuration: /var/lib/drupaltestbot/checkout/core/modules/system/tests/fixtures/HtaccessTest/access_test.make

......

Which may be causing us to run into some sort of unreleased lock scenario as detailed by this APC bug:

https://bugs.php.net/bug.php?id=69048

If this is true, then I have no idea what we might be able to do to mitigate this issue.

Log in or register to post comments

Comment #12

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott at Chapter Three commented 6 May 2015 at 13:03

Priority:

Major

» Critical

I've just bumped the timeout on all testbots to 90 since we don't seem to be able to complete tests within the 60 minute limit atm.

Log in or register to post comments

Comment #13

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott at Chapter Three commented 6 May 2015 at 13:09

Also I've reduced the concurrency from 15 to 12. The current amazon instances have 16 cores - only leaving 1 free seems like it's not enough.

Log in or register to post comments

Comment #14

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott at Chapter Three commented 6 May 2015 at 13:48

90 minutes does not appear to be enough - bumping to 120

Log in or register to post comments

Comment #15

xjm

she/her

English

CreditAttribution: xjm at Acquia commented 6 May 2015 at 21:17

@alexpott eventually ended up bumping the timeouts to 150. So not quite 180. :P

Log in or register to post comments

Comment #16

alexpott

he/they

English

🇪🇺🌍

CreditAttribution: alexpott at Chapter Three commented 6 May 2015 at 21:50

So this was only on the bad old bots. The new west coast bots have a timeout 60 again and they are all completing well within that time (around 30 mins).

Log in or register to post comments

Comment #17

jthorson CreditAttribution: jthorson commented 8 May 2015 at 15:34

Just curious ... have we been attempting any apache reloads on these bots? The last time we were running into segfaults, an apache reload seemed to perk things back up.