Right now, https://qa.drupal.org/ shows 31 queued issues, waiting for 3 hours+. This has been happening roughly weekly, and it generally happens overnight for Portland while DA people are sleeping, so Europe suffers with this until the end of their work day, which is introducing major delays into the work of folks trying to get D8 done. According to Wim/Gábor, this is a relatively recent development (started in the past couple of months).

I raised this in IRC the other day with basic. He said that folks in general are trying to focus on DrupalCI instead, and while I whole-heartedly agree with this direction, based on #2456147: [meta] DrupalCI D8 Accelerate sprint hit-list I don't see DrupalCI getting into production any earlier than DrupalCon LA, if that. Which means we're stuck with another couple of months of Europeans frequently losing half a working day to testbot slowness.

So please, could someone from the DA investigate why this keeps happening? Spot instances https://aws.amazon.com/ec2/purchasing-options should be able to help with this, in theory. I feel like I've read something about us changing from those due to pricing changes.

Can we use some D8 Accelerate funds to help with testbot availability in some way? For example, to "bid" on a higher price and keep them online longer? (note: I have no idea what I'm talking about, trying to channel others ;))

Comments

webchick’s picture

[3/23/15, 10:01:55 AM] Wim Leers: Looks like we're down to *one* functioning testbot
[3/23/15, 10:01:59 AM] Wim Leers: We're now >5 hours behind

joshuami’s picture

I've asked opdavies to do a testbot check at the start of his working day. He can spin up additional testbots if we see an issue. We are using spot instances, but we don't have automated spin-up of new instances in the current structure.

isntall is also going to keep an eye on possible causes.

isntall’s picture

Assigned: Unassigned » isntall
isntall’s picture

So the biggest problem is coverage. The DA spans from Wales to Oregon, which, although not small,
still leaves quite a large swath of the earth without someone watching.

We use c4.8xlarge machines, which can churn through the queue pretty quickly. The spot instance
market has been fluctuating quite a bit as of late, from $0.25USD an hour to $8USD+. We can raise
our normal cap, which is $1.41USD, to something higher, but $8USD per machine per hour gets
really expensive really quickly.
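
To put rough numbers on that, here is a minimal back-of-the-envelope sketch. It assumes a fleet of 5 bots (the minimum mentioned further down) running around the clock at a flat hourly price. Real spot billing varies hour to hour, so treat this as a worst-case estimate; the function name is made up for illustration.

```python
def daily_cost(hourly_price_usd, bots=5, hours=24):
    """Worst-case daily spend if every bot runs all day at one flat hourly price."""
    return hourly_price_usd * bots * hours


# Prices quoted in this comment: recent low, current cap, recent spike.
for label, price in [("low", 0.25), ("cap", 1.41), ("spike", 8.00)]:
    print(f"{label:>5}: ${price:.2f}/h -> ${daily_cost(price):,.2f}/day")
```

Under those assumptions the fleet costs about $30 a day at the low end, about $169 a day at the current cap, and about $960 a day at $8USD an hour.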

Automated spin up and spin down is a key feature of the DrupalCI project; retrofitting it onto
the current setup is pretty much unfeasible and would distract from the DrupalCI efforts.

I think the bigger issue is that the tests can and do kill the testbots, which qa.d.o can't really
detect. We can't tell something is amiss until the test has been running for an hour or the
testbot hasn't gotten a job in an hour. Since I started tending to the testbots in October, I've
seen tests do everything from disabling the built-in Drupal site to making the AWS EC2 instance
completely unresponsive, and that shouldn't really be possible.
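
The one-hour rule described above could be sketched roughly like this. This is only an illustration of the heuristic, not the actual qa.d.o code; the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta

# One hour, per the rule of thumb described above.
THRESHOLD = timedelta(hours=1)


def suspect_reason(now, test_started_at, last_job_at):
    """Return why a bot looks stuck, or None if it looks healthy.

    test_started_at: when the bot's current test began, or None if it is idle.
    last_job_at: when the bot last received a job.
    """
    if test_started_at is not None:
        if now - test_started_at >= THRESHOLD:
            return "test running for an hour or more"
        return None
    if now - last_job_at >= THRESHOLD:
        return "no job received in an hour"
    return None
```

Note that a check like this can only flag a dead bot a full hour after it stops responding, which is part of why the queue backs up before anyone notices.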

As I see it we can do the following things to mitigate the issue:
- Up the minimum number of testbots we have on hand; currently we aim for 5 PHP5.4 bots
- Up the maximum spot instance price; currently it is $1.41USD
- Find people with the expertise and willingness to help with qa.d.o in a timezone that the DA
doesn't cover, like Australia

webchick’s picture

#2467621: Tests not being added to the testbot queue happened once again during EU hours, but no DA activity yet. :(

#4 implies there's monitoring in place to flag when either a test run takes > 1 hour or is queued for > 1 hour, which sounds great. Did this problem somehow not get caught by that monitoring?

wim leers’s picture

To clarify: it's been going on for ~6 hours or so… right during EU business hours. All testbots are idling.

jthorson’s picture

This is a different failure ... corrupt node (or something) causing a malformed entity exception on the queuing side. Seldom easy to troubleshoot, and possibly something the team has not seen before (I've seen it fewer than half a dozen times).

The monitoring wouldn't pick this up, as it's on the qa.d.o side of things ... and to qa.d.o, the queue appears empty.

I'll be jumping on it in about 20 minutes if it hasn't been resolved yet by that time.

Mixologic’s picture

Status: Active » Closed (outdated)