Summary
- Multiple testbots were taking more than 60 mins to complete 8.0.x test suite runs, and therefore timing out at an apparent 60 min limit, causing HEAD to fail and no patches to be tested.
- It appears other testbots are still completing runs in about 30 minutes, so it doesn't appear to only be a problem with (e.g.) the core test suite.
- xjm disabled the bots that were taking 60 mins on qa.drupal.org:
3083 Disabled Test http://ec2-52-25-236-68.us-west-2.compute.amazonaws.com/
3088 Disabled Test http://ec2-54-172-185-1.compute-1.amazonaws.com/
3093 Disabled Test http://ec2-52-6-45-117.compute-1.amazonaws.com/
3098 Disabled Test http://ec2-54-172-173-148.compute-1.amazonaws.com/
- Examining miscellaneous test results for these bots for the past several days, it appears that many patches on them previously were taking just under 60 mins, meaning they would have just barely escaped the 60 min timeout. This was the case even for patches that made no code changes to core, at least as far back as June 1.
- It is unknown whether it is only a problem with the hardware or configuration of these bots, or if there was a change in core that led to the problem.
- Once the above bots were disabled, HEAD tested successfully in 30 mins on bot 3063.
Bot to watch: 3078 could not be disabled. It may also have 60-minute test runs and cause all this to kick up again if it picks up HEAD.
Original summary
HEAD failed at least three times on the same bot (3098) with non-failures. Earlier today (per the IRC 8.0.x fail factoid) it got stuck for 3h with no failure so Berdir retested it. Then it failed three more times on that bot this evening:
[failed] Drupal core - 8.0.x on Wed, 06/03/2015 - 22:07:24
Drupal core - 8.0.x fail: https://qa.drupal.org/8.0.x-status
Overall Summary
==============================================
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] Failed to run tests: failed during invocation of run-tests.sh.
Individual Environment Summaries
==============================================
-- [[SimpleTest]]: [PHP 5.4 MySQL] --
* No relevant data
[failed] Drupal core - 8.0.x on Wed, 06/03/2015 - 23:22:24
Drupal core - 8.0.x fail: https://qa.drupal.org/8.0.x-status
Overall Summary
==============================================
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] Failed to run tests: failed during invocation of run-tests.sh.
Individual Environment Summaries
==============================================
-- [[SimpleTest]]: [PHP 5.4 MySQL] --
* No relevant data
[failed] Drupal core - 8.0.x on Thu, 06/04/2015 - 00:39:07
Drupal core - 8.0.x fail: https://qa.drupal.org/8.0.x-status
Overall Summary
==============================================
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] Failed to run tests: failed during invocation of run-tests.sh.
Individual Environment Summaries
==============================================
-- [[SimpleTest]]: [PHP 5.4 MySQL] --
* No relevant data
I attempted to mitigate this by disabling the bot since I can run tests locally through run-tests.sh without any problems. HEAD has now been picked up by a different bot.
An hour later, the test failed on a different bot (#3088) after 60 mins -- apparently due to a timeout. (The testbot log says "terminated" repeatedly at the end of test results.)
Comments
Comment #1
xjm
Also, I wondered whether the test runs were timing out again, but those aren't quite 90 mins apart. (Unless we reduced the limit to 60 mins again, plus overhead?)
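For reference, the gaps between the three failure reports quoted in the original summary can be checked with a few lines of Python. This is only a sketch: the timestamps are copied from those reports (read as month/day/year), and the helper name `gaps_in_minutes` is made up for illustration.

```python
from datetime import datetime

# Failure timestamps copied from the three qa.drupal.org reports in the summary.
FAILURES = [
    "06/03/2015 22:07:24",
    "06/03/2015 23:22:24",
    "06/04/2015 00:39:07",
]


def gaps_in_minutes(stamps):
    """Return the gap in minutes between each consecutive pair of timestamps."""
    times = [datetime.strptime(s, "%m/%d/%Y %H:%M:%S") for s in stamps]
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]


for gap in gaps_in_minutes(FAILURES):
    print(f"{gap:.1f} min")
```

Both gaps come out at roughly 75 to 77 minutes, which would fit a 60 minute limit plus setup overhead better than a 90 minute limit.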
Comment #3
xjm
Okay, it looks like the timeout is actually 60 mins now and it's timing out. This time on bot 3088:
Exactly one hour apart.
The end of the test log says:
Elevating to critical.
Comment #5
xjm
Edit: updating comment to avoid anything misleading, just facts.
Comment #7
xjm
I've tried pinging multiple people and no one is available to respond, so I've disabled all bots that were taking 60 mins (those being 3088, 3093, and 3098) and left enabled the ones that were passing tests quickly as of today. This way maybe HEAD will be picked up by a different bot. I have no idea if this is actually the clients themselves or something in core; I don't have access to investigate that more closely.
Comment #8
xjm
It passed in 30 min on 3063:
So it appears something may be pretty wrong with the build or specs of those other bots.
Comment #9
xjm
Comment #10
xjm
One concern -- 3078, which was another 60-min bot according to its logs, is currently "testing client" and I can't disable it. I'm not sure if it will become automatically enabled when the confirmation tests are complete, but it says the "client confirmation process" was started over 7h ago.
Comment #11
wim leers
This has been significantly slowing down D8 progress.
Comment #12
basic commented
I want to leave an update here outside of IRC to document a few issues:
AWS appears to be running out of capacity of the c4.4xlarge and c4.8xlarge instances we run.
We stopped running the c4.8xl instances because the spot prices were jumping above the $9/hr maximum allowable spot price (amazon limits the max spot price we can request, so that other on-demand and reserved instance customers take priority), and we reverted back to a larger number of c4.4xl bots because their pricing history seemed more stable.
These are not small instances: a c4.4xlarge is the second most expensive compute optimized AWS instance -- we have thrown hardware at the issue by moving test runs to AWS from the OSL SuperCell cluster after DrupalCon Amsterdam.
The issue now is the stability of the bots, which has been an ongoing problem since the move to AWS. What would be most helpful would be tracking down the most stably priced EC2 instances that can run the tests within our time requirements. That was the goal of moving to the slightly smaller c4.4xlarge instances, but they appear to be giving us continued grief with spiking spot prices, and the only other options may or may not be as fast or as stable as what we have now. To combat the spiking prices, some bots were spun up as c3.4xl -- 2nd place top of the line instances from a previous generation -- and apparently these bots are taking > 60 minutes to run tests, so they are possibly the reason for the latest HEAD issues.
If we moved back to the OSL supercell we'd be looking at 2-3 hour test runs, so that's not really an option given our time requirements. OSL offered us a faster Power7+ IBM compute cluster to use, but in testing it was less stable than AWS.
We do have hardware (a $10,000 server) that is roughly equivalent to a c4.8xlarge instance, but it needs to be rebuilt and deployed -- and this only solves the issue for a single test runner. Maybe the older cc2.8xlarge instances would be a good balance: more stable pricing and less demand, so the spot prices don't hit their maximum.
Comment #13
effulgentsia commented
According to http://aws.amazon.com/ec2/purchasing-options/dedicated-instances/, the on-demand price of c4.8xlarge is $1.856/hr. Is there something that would allow us to launch that when the spot price is higher? Or does Amazon have some way of closing that loophole?
Comment #14
basic commented
The issue with spot instances is that they exist at the bottom of the tiers -- they are only available when there are compute resources available that are not being used by reserved or on-demand instances.
When there's a big spike in a region for reserved/on-demand instances, the spot price seems to hit the maximum, killing off any running spot instances to make room for the on-demand / reserved instances. I believe this is by design, as spot pricing is pennies compared to on-demand and reserved.
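To illustrate the behaviour described above, here is a minimal sketch of when a spot instance would be killed off. It assumes a simplified model in which an instance is terminated as soon as the spot price rises above the maximum bid. The function name `first_eviction` and the sample prices are hypothetical and not real AWS data; only the $9/hr cap comes from this thread.

```python
def first_eviction(price_history, max_bid):
    """Return the index of the first sample whose spot price exceeds max_bid.

    Returns None if the price never rises above the bid.
    """
    for index, price in enumerate(price_history):
        if price > max_bid:
            return index
    return None


# Hypothetical hourly spot prices in USD (illustration only, not AWS data).
sample_prices = [0.45, 0.52, 0.61, 9.50, 0.48]

# $9/hr is the maximum allowable spot price mentioned earlier in this thread.
print(first_eviction(sample_prices, max_bid=9.00))
```

In this model, bidding higher only delays the eviction; once the price reaches the cap, every spot bot in that pool goes down at the same time.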
Comment #15
basic commented
The issue here is that we don't dynamically spin up instances right now. qa.drupal.org expects the testbots to be long running instances that never go down.
Comment #16
basic commented
I've launched 3 cc2.8xlarge bots in us-west-2c. These appear to be stable price-wise (demand is likely lower for last generation instances), and appear to finish tests in a reasonable amount of time with the 32 CPU cores available on the instances.
I've disabled 2833 and 3078 as both were taking > 60 minutes to run multiple tests.
https://qa.drupal.org/pifr/test/1063783 is an example of a test running on a cc2.8xlarge bot. I will continue keeping an eye on this issue and the qa.drupal.org status page to swap out bots taking > 45 minutes to run tests with cc2.8xlarge instances.
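The "swap out bots taking > 45 minutes" rule could be sketched roughly as follows. This is only an illustration: the function name `bots_to_replace`, the bot labels, and the durations are all hypothetical, and it says nothing about how qa.drupal.org actually exposes run times.

```python
def bots_to_replace(durations_by_bot, limit_minutes=45):
    """Return the sorted bot labels that have at least one run over the limit."""
    return sorted(
        bot
        for bot, runs in durations_by_bot.items()
        if any(minutes > limit_minutes for minutes in runs)
    )


# Hypothetical run durations in minutes per bot (made up for illustration).
sample_runs = {
    "bot-a": [24, 26, 25],
    "bot-b": [44, 58],
    "bot-c": [61, 60],
}

print(bots_to_replace(sample_runs))
```

Flagging on a single slow run is the strictest choice; requiring several slow runs in a row would avoid replacing a bot over one unusually heavy patch.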
Comment #17
basic commented
I've disabled the remaining c*.4xlarge instances, replaced with 6 cc2.8xlarge instances.
I ran a core 8.0.x test on the new bots with a decent ~25 minute test run:
Hopefully this helps stabilize things a bit. Bumping priority down until this becomes a problem again.
Comment #18
akalata commented
#16 says that bot 3078 was disabled, but I've got two issues with patches that are failing with
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] Failed to run tests: failed during invocation of run-tests.sh --clean.
These are:
Comment #19
xjm
@akalata, that was only bot 3078 at the time; the actual machines change. Which is not at all confusing. ;)
Looks like the disk is full on the test runs you linked. So a separate problem in this case. If your test passes locally and the disk full problem recurs on that bot we'll file a separate issue from this one, which is specifically about test runs timing out.
Comment #20
xjm
Here's that issue: #2505687: Bot #3078 disabled (disk full errors)
Comment #21
xjm
8.0.x HEAD just took over an hour and timed out in the same way on the current bot #3073:
3073 Enabled Test http://ec2-52-1-73-29.compute-1.amazonaws.com/ 97a6fc988b528073a06b1c5fd77607a8
Comment #22
Mixologic
Alrighty. I got to the bottom of the failures that were causing tests to slow down. AWS provisions the bots with a limited amount of IOPS. We get a certain amount of 'credit' at the start, and then we burn through it and the bots drop to a minimum amount of IO - about 60 IOPS - which causes all the tests to drag drag drag.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html#...
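As a rough worked example of the credit behaviour described above: if a volume starts with a bucket of I/O credits, earns them back at its baseline rate, and spends them at its burst rate, the time until it is throttled is the bucket size divided by the net drain. The sketch below assumes an initial bucket of 5.4 million credits and a 3,000 IOPS burst rate, which are figures recalled from the AWS EBS documentation and should be checked against the page linked above. Only the 60 IOPS floor comes from this thread, and the function name is made up.

```python
def seconds_until_throttled(credits, burst_iops, baseline_iops):
    """Seconds a volume can sustain burst_iops before its credit bucket is empty.

    Credits drain at burst_iops and refill at baseline_iops, so the net
    drain per second is the difference between the two.
    """
    net_drain = burst_iops - baseline_iops
    if net_drain <= 0:
        # The workload never exceeds the baseline, so it is never throttled.
        return float("inf")
    return credits / net_drain


# Assumed figures (see the lead-in): 5.4 million credits, 3,000 IOPS burst.
# The 60 IOPS baseline is the floor reported in this thread.
seconds = seconds_until_throttled(5_400_000, burst_iops=3000, baseline_iops=60)
print(f"{seconds / 60:.1f} minutes of full-speed IO")
```

Under these assumptions a bot doing sustained heavy IO would exhaust its credits in about half an hour and then run at the floor, which would be consistent with runs slowing down some time after a bot is launched.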
So, this morning I killed off the bots and reprovisioned them with 'Provisioned IOPS' - five of the bots are at 150 IOPS, with a 90 minute timeout, and one of the bots (3088) I set to 600 IOPS (the max for the hard drive size), with a 300 minute timeout.
Instead of raising our floor, it seems to have just set the bar there and they seem to be throttled right out of the gates, which is disappointing. As of this moment we've got about 4 tests that have already reached over an hour. (actually I just checked and one of them completed Test run duration: 58 min 12 sec).
So, the tests seem to be finishing in about an hour right now, but there are many things we can do to mitigate that - bigger drives + more provisioned IOPS is possible. The good news is they won't get any slower, and we can speed these up from here.
Additionally we were able to mitigate a different issue that was being caused by bad patches by boosting tmpfs, so we should have far fewer bots that just crash.
Comment #23
wim leers
Why did we not hit this IOPS problem for many months before this? E.g. in Q4 2014 and Q1 2015 we had 24-minute testbot runs consistently, without noticeable testbot problems.
What changed?
Comment #24
basic commented
The c4 series we were running has a minimum throughput that increases with the larger instances. We would eventually hit this limit with them, though I'm guessing the minimum throughput would help keep them from completely falling over performance-wise. That said, we may have been hitting this, but we were also hitting spot price increases, full tmpfs, and various other things that have been fixed up.
What changed recently is using an older cc2 generation bot (see #16 above), which doesn't have a minimum throughput, and which has also been running much more stably -- until ~2 weeks ago we were relaunching testbots at least once a week. I don't remember the exact number, but since Q4 2014 we've relaunched somewhere between 300-500 testbots (this is a manual process, with hard costs in time, debug, waiting for spot requests, and linking to qa.drupal.org). Obviously this is not sustainable.
@isntall is working on moving the checkout directory to tmpfs as well, to avoid the SSD IOPS limits altogether. This should get the test runs back to a consistent level.
Comment #25
Mixologic