It'd be nice to re-introduce parallelism into our queue. Some background: the original queue, also referred to as the "cron" or "ghetto" queue, runs at most once a minute and triggers a configurable number of tasks. Mass platform migrations could be problematic when run in parallel, due to large numbers of backups piling up and eventually saturating I/O, thus crashing MySQL. Largely to support faster mass platform migrations while running tasks strictly sequentially, we moved to a queue daemon. We've since resolved many of the bottleneck issues by running the task daemon both niced and ioniced. However, we never re-implemented the ability to run tasks in parallel with the daemon. As the daemon is now the recommended standard, we should look to optimize this, imo.

Comments

ergonlogic created an issue.

Steven Jones’s picture

We'd need to make sure that queue item claiming works correctly; sometimes with old Aegir, both the queue runner and the ghetto queue could try to run the exact same task at the same time. But assuming that worked:

Would an approach be to just run more than one queue runner at a time, and somehow handle that at the system level?

ergonlogic’s picture

I think we already only select queue items that are in a 'queued' state, which get updated to 'processing' once they're "claimed", but we'll need to check.
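For reference, here's a minimal sketch of what an atomic claim could look like using Drupal 7's database API (hypothetical code, not necessarily what hosting does today): flip the status from 'queued' to 'processing' in a single UPDATE, and treat the task as claimed only if a row actually changed.

function hosting_task_claim($task_nid) {
  // Atomically flip the status; only one runner can win this UPDATE
  // because the condition excludes rows already in 'processing'.
  $claimed = db_update('hosting_task')
    ->fields(array('task_status' => HOSTING_TASK_PROCESSING))
    ->condition('nid', $task_nid)
    ->condition('task_status', HOSTING_TASK_QUEUED)
    ->execute();
  // db_update() returns the number of rows matched, so a truthy value
  // means we own this task and no other runner can claim it.
  return (bool) $claimed;
}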

I hadn't thought to run multiple queue daemon processes. It seems to me that such an approach would be more likely to lead to the kinds of issues you mention. I had thought to simply have the queue daemon trigger up to X queued tasks.

If we wanted to really optimize things, we might consider assigning a "score" to tasks representing their resource consumption. Then we might be able to, say, only run one backup or clone at a time (which consume lots of I/O), but multiple verify tasks (which have relatively little performance impact).
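A hypothetical sketch of that scoring idea (the function names, weights, and budget are invented for illustration; the task_status/task_type columns and HOSTING_TASK_PROCESSING constant are real):

// Hypothetical per-task-type "cost" weights: heavy I/O tasks score
// higher, so fewer of them fit in the budget at once.
function hosting_task_cost($task_type) {
  $costs = array(
    'backup' => 4,
    'clone' => 4,
    'migrate' => 3,
    'verify' => 1,
  );
  return isset($costs[$task_type]) ? $costs[$task_type] : 2;
}

// Hypothetical check: sum the cost of every task currently in the
// 'processing' state and see whether this task still fits the budget.
function hosting_task_can_dispatch($task_type, $budget = 4) {
  $running_cost = 0;
  $types = db_query("SELECT t.task_type FROM {hosting_task} t WHERE t.task_status = :status", array(
    ':status' => HOSTING_TASK_PROCESSING,
  ))->fetchCol();
  foreach ($types as $type) {
    $running_cost += hosting_task_cost($type);
  }
  return ($running_cost + hosting_task_cost($task_type)) <= $budget;
}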

Steven Jones’s picture

Had a look at this, and specifically at the code. We have a global lock, so running multiple daemons won't work at the moment. Alternatively, we could change our invocation of the task so that we fork instead of running in the same process; then the task daemon would run as many tasks as it could find, as fast as it wants.

Obviously we wouldn't want hundreds of tasks to run all at the same time, but we already have hosting_task_count_running() that we can use to see how many tasks are actually running, and to ensure we stay within the concurrency bounds.

Then I think the only thing that wouldn't work is the post-task delay, because we wouldn't actually know when the task has finished. We could just add that as a hidden option to the hosting-task Drush command, and send it along as the task is dispatched.
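Roughly, the dispatch loop could then look something like this (a sketch only: _hosting_get_new_tasks() and the variable names are invented, though hosting_task_count_running() and the hosting-task Drush command are real):

function hosting_queued_dispatch() {
  $concurrency = variable_get('hosting_queued_concurrency', 1);
  // Only dispatch as many tasks as there are free slots.
  $slots = $concurrency - hosting_task_count_running();
  if ($slots <= 0) {
    return;
  }
  // _hosting_get_new_tasks() is a hypothetical helper returning up to
  // $slots task node IDs in the 'queued' state.
  foreach (_hosting_get_new_tasks($slots) as $task_nid) {
    // Fork a separate Drush process rather than running the task
    // in-process; pass the post-task delay along, since the dispatcher
    // no longer knows when the task finishes.
    $command = sprintf('drush @hostmaster hosting-task %d --post-task-delay=%d > /dev/null 2>&1 &', $task_nid, variable_get('hosting_queued_post_task_delay', 0));
    exec($command);
  }
}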

I don't think we'd lose out on anything else really, so...why not!

Steven Jones’s picture

Status: Active » Needs work
FileSize
6 KB

Here's a completely untested initial stab at this. Will pick this up tomorrow again.

ergonlogic’s picture

Neat! FYI, I've started some refactoring of Skynet, a re-implementation of the task queue in Python: https://github.com/ergonlogic/hosting_skynet/tree/master. While I haven't added concurrency yet, I've begun in #2714065: Make the task queue a service. It's not complete yet, but both parallel task execution and heartbeats are planned features.

Steven Jones’s picture

Status: Needs work » Needs review
FileSize
7.96 KB

Have tested it now :)

First time through and it ran all my tasks. My server was not happy about installing 8 heavy Drupal sites at once :)

The queue daemon code changes were fine; it was actually this code:

function hosting_task_count_running() {
  return db_query("SELECT COUNT(t.nid) FROM {node} n INNER JOIN {hosting_task} t ON n.vid = t.vid WHERE type = !type AND t.task_status = !t.task_status", array(
    '!type' => 'task',
    '!t.task_status' => HOSTING_TASK_PROCESSING,
  ))->fetchField();
}

Note the invalid PDO placeholders in the db_query there.

So, fixed this function up in the latest patch (along with some others that were nearby) and it works!
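Presumably the fixed-up version looks something like the following, with valid PDO-style :name placeholders (a sketch; the actual patch may differ in detail):

function hosting_task_count_running() {
  // Same query as before, but with valid Drupal 7 placeholders (:name)
  // in place of the old !name tokens.
  return db_query("SELECT COUNT(t.nid) FROM {node} n INNER JOIN {hosting_task} t ON n.vid = t.vid WHERE n.type = :type AND t.task_status = :status", array(
    ':type' => 'task',
    ':status' => HOSTING_TASK_PROCESSING,
  ))->fetchField();
}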

I set the concurrency to 2 and loaded up a lot of tasks into the queue, and the queue daemon dutifully ensured that there were two tasks running until the queue was processed.
The post-task delay also works, but it looks a bit different now: instead of a random period where nothing appears to happen, the task appears to still be executing, even though it's basically done.

So, over to someone else to review please!

ergonlogic’s picture

The code looks solid, though I haven't tested it yet. It doesn't look like this changes the API at all; it just enhances it with new capabilities. Correct? As such, we shouldn't need to make this wait until Aegir 4.x.

That said, I think we should wait for more feedback from others running it in production before merging this. In particular, I think we'd want to confirm that forking the process with the default concurrency of 1 operates pretty much identically to the current behaviour.

Also, --post-task-delay could perhaps use a bit more documentation. Presumably, it's in place to avoid race conditions, or the like. Could the conditions in which we'd want to use it be explained further?

  • helmo committed 43415b2 on 7.x-3.x
    Issue #2714049 by Steven Jones: Pre-commit code style improvements
    
helmo’s picture

The patch no longer applied, so here's a re-roll. I've committed the included code style improvements (that's also what broke the patch).

helmo’s picture

Running with concurrency=2 seems to work as expected.

I did notice that with concurrency=1 I had multiple forked queued children trying to start the same task.
The locking mechanism is probably saving the day, but "This task is already running, use --force" should not normally be seen.

Another thing to consider is whether we want multiple tasks to run in parallel on the same site... I wouldn't want a backup and a git pull task running together.

The config form should probably mention that a restart of queued is needed for changes to take effect.

Steven Jones’s picture

Status: Needs review » Needs work

Thanks for the review, and re-roll etc.

I did notice that with concurrency=1 I had multiple forked queued children trying to start the same task.
The locking mechanism is probably saving the day, but "This task is already running, use --force" should not normally be seen.

So I suppose that with this patch, it'll get an item to run from the queue, fire off the new Drush command to execute it, and then immediately loop back around, fetch the same item from the queue, and fire off the same Drush command again.

The dispatcher here isn't actually keeping track of its child processes, but is just looking at the global queue state to see if it should start executing some more tasks.

We could solve this in a rubbish way by sleeping for a second after spinning up child processes, to give them a chance to claim their items from the queue, so that when we check how many processes are running, we actually see them.
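In code, that rubbish-but-simple workaround would amount to something like this in the dispatch loop (hypothetical; $dispatched is the number of children just spawned):

// Give freshly forked children a second to claim their queue items
// before we loop back and call hosting_task_count_running() again.
if ($dispatched > 0) {
  sleep(1);
}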

Another thing to consider is whether we want multiple tasks to run in parallel on the same site... I wouldn't want a backup and a git pull task running together.

Yeah, that's a good point. Maybe it would be enough to group the select query for new tasks by rid, so that we only ever get one task per site to run.
This should probably be fixed globally, since this will be a problem for the old cron-based dispatcher too.
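A rough sketch of that grouped select (the query itself is illustrative, though the rid and task_status columns are real):

// Fetch at most one queued task per site (rid), oldest first, so tasks
// for the same site never run in parallel.
$nids = db_query_range("SELECT MIN(t.nid) AS nid FROM {hosting_task} t WHERE t.task_status = :status GROUP BY t.rid ORDER BY MIN(t.nid) ASC", 0, $slots, array(
  ':status' => HOSTING_TASK_QUEUED,
))->fetchCol();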

The config form should probably mention that a restart of queued is needed for changes to take effect.

So... I think we should maybe introduce a flag to restart the daemon: when we save the config form, we set a flag somewhere that the daemon picks up on and restarts itself. Then we don't need a nasty message, and the other options on that page will also take effect as soon as possible.
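A minimal sketch of that flag (names invented), assuming the daemon runs under a supervisor that restarts it on exit:

// On the settings form submit, request a restart.
function hosting_queued_settings_form_submit($form, &$form_state) {
  variable_set('hosting_queued_restart_needed', TRUE);
}

// In the daemon's main loop: a long-running process caches Drupal
// variables in memory, so read the flag straight from the database.
$flag = db_query("SELECT value FROM {variable} WHERE name = :name", array(
  ':name' => 'hosting_queued_restart_needed',
))->fetchField();
if ($flag && unserialize($flag)) {
  // Clear the flag and exit; the supervisor (init script, systemd,
  // etc.) brings the daemon back up with the new settings.
  variable_del('hosting_queued_restart_needed');
  exit(0);
}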

Steven Jones’s picture

So just a massive word of caution: I experienced some serious issues with running tasks in parallel. There are subtle race conditions everywhere.

Things like Apache being restarted by another process in between a config file being written and the files referenced by that config being written, causing Apache to output a notice and Aegir to log it.
