Support for Drupal 7 is ending on 5 January 2025—it’s time to migrate to Drupal 10! Learn about the many benefits of Drupal 10 and find migration tools in our resource center.
https://qa.drupal.org/pifr/test/1032863 is a three hour test run on bot 2943. No idea why since the previous run was the expected 30 mins https://qa.drupal.org/pifr/test/1033103 and the changes in #2272685-23: Incorrect handling of invalid cache items in apcu backend are very similar to the changes in previous patch on the issue which completed quickly.
Comments
Comment #1
alexpottAlso given that we are using bots that complete in 30 mins I think we should be terminating any runs that take longer than an hour. I think the existing testbot infra has a way of doing this.
Comment #2
alexpottThe next run on 2943 is already up to 40 mins - https://qa.drupal.org/pifr/test/1033243
Comment #3
xjmAnother very long test run underway in #2204697: Move getConfigPrefix() to ConfigEntityTypeInterface:
https://qa.drupal.org/pifr/test/1035983
It's already up to an hour. @alexpott will post specifics he found logging into the bot.
Comment #4
xjmNote that the 3h limit was reduced to 1h in #2478797: Reduce testbot time limit from 180 minutes to 60 minutes, so retitling accordingly.
Comment #5
alexpottSo the 1+ hr test for #2204697: Move getConfigPrefix() to ConfigEntityTypeInterface was followed up by a re-test on the same bot - that took 31.5 minutes. How odd. See https://qa.drupal.org/pifr/test/1035983
Comment #6
alexpottAlso I don't know why the test was reset by client request before the hour was reached.
Comment #7
jthorson CreditAttribution: jthorson commentedI'd expect the testbot watchdog log should provide some insight as to why it reset ... but here's some pure speculation:
One thing that I have seen from time to time is a race condition window which would trick a PIFR testbot into kicking off two simultaneous 'run-tests.sh' test runs on the same server. I was able to refactor the code to significantly minimize this window, but not close it completely.
When this happens, you end up with two simultaneous run_tests.sh executions, at maximum concurrency, on the same bot. However, from a testbot perspective (based on the contents of the database), only the second request exists; and the bot is not aware of the first test process. However, when the first process completes, and the code related to that particular process returns it's result, the bot can't correlate it with any expected test run and therefore resets everything to clean it's environment.
Because of the extra load, the duration of these tests increases; and seeing this at the 55 minute mark on a 30 minute test would make sense as one *potential* explanation for the observed behavior.
Comment #8
alexpottre #7 I managed to log on the server and look at the processes running. I don't think I saw two lots of tests running - i think I saw the 14 run-tests processes that I expected.
Comment #9
alexpottThis has just happened on a branch test. https://qa.drupal.org/8.0.x-status. Looks like only one test was terminated...
Comment #10
alexpottMore info
So the bot had the usual 30 min run 9 hours ago but then 2 hours ago had a run that exceeded the hour.
Comment #11
MixologicSuper sad. I *think* we're hitting a bug in APC.
On the testbot that was dragging on slowly (one that was taking a very long time to complete, > 1 hour) we were seeing that the apc opcode cache was not storing any cached opcodes, thereby rendering it useless.
The apache logs show that one of the apache children died with a segfault:
Which may be causing us to run into some sort of unreleased lock scenario as detailed by this APC bug:
https://bugs.php.net/bug.php?id=69048
If this is true, then I have no idea what we might be able to do to mitigate this issue.
Comment #12
alexpottI've just bumped the timeout on all testbots to 90 since we don't seem to be able to complete tests within the 60 minute limit atm.
Comment #13
alexpottAlso I've reduced the concurrency from 15 to 12. The current amazon instances have 16 cores - only leaving 1 free seems like it's not enough.
Comment #14
alexpott90 minutes does not appear to be enough - bumping to 120
Comment #15
xjm@alexpott eventually ended up bumping the timeouts to 150. So not quite 180. :P
Comment #16
alexpottSo this was only on the bad old bots. The new west coast bots have a timeout 60 again and they are all completing well within that time (around 30 mins).
Comment #17
jthorson CreditAttribution: jthorson commentedJust curious ... have we been attempting any apache reloads on these bots? The last time we were running into segfaults, an apache reload seemed to perk things back up.
Comment #18
Mixologic