It'd be nice to re-introduce parallelism into our queue. Some background: the original queue, alternately referred to as the "cron" or "ghetto" queue will run at most once a minute, and triggers a configurable number of tasks. Mass platform migrations could be problematic when run in parallel, due to a large numbers of backups piling up, and eventually saturating I/O, thus crashing MySQL. Largely to support faster mass platform migrations when running strictly sequential tasks, we moved to a queue daemon. We've resolved many of the bottleneck issues by running the task daemon both niced and ioniced. However, we never re-implemented the ability to run tasks in parallel with the daemon. As this is now the recommended standard, we should look to optimize this, imo.
| Comment | File | Size | Author |
|---|---|---|---|
| #12 | allow_parallelism_in-2714049-10.patch | 5.74 KB | helmo |
| #8 | hosting-2714049-parallel_queues.patch | 7.96 KB | steven jones |
Comments
Comment #2
steven jones commentedWe'd need to make sure the queue item claiming worked correctly, sometimes with old Aegir, both the queue runner and the ghetto queue could try and run the exactly same task at the same time. But assuming that worked:
Would an approach just be a run more than one queue runner at a time? And somehow handle that at the system level?
Comment #3
ergonlogicI think we already only select queue items that are in a 'queued' state, which gets updated to 'processing' once it's "claimed", but we'll need to check.
I hadn't thought to run multiple queue daemon processes. It seems to me that such an approach would be more likely to lead to the kinds of issues you mention. I had thought to simply have the queue daemon trigger up to X queued tasks.
If we wanted to really optimize things, we might want to consider assigning a "score" to tasks, that'd represent their resource consumption. Then we might be able to, say, only run one backup or clone at a time (which consume lots of I/O), but multiple verify tasks (which have relatively little performance impact).
Comment #4
colanComment #5
steven jones commentedHad a look at this and specifically the code. So we have a global lock, so running multiple daemons won't work at the moment. Alternatively we could change our invocation of the task so that we fork instead of running in the same process, then the task daemon would actually run as many tasks as it could find as fast as it wants to.
Obviously we wouldn't want hundreds of tasks to be run all at the same time, but we already have
hosting_task_count_runningthat we can use to see how many tasks are actually running, and ensure we keep within the concurrency bounds.Then I think the only thing that wouldn't work is the 'post task' delay, because we wouldn't actually know when the task has finished. We could just add that as a hidden option to the
hosting-taskDrush command, and then send it along as it's dispatched.I don't think we'd lose out on anything else really, so...why not!
Comment #6
steven jones commentedHere's a completely untested initial stab at this. Will pick this up tomorrow again.
Comment #7
ergonlogicNeat! FYI, I've started re-factoring Skynet somewhat, a re-implementation of the task queue in Python: https://github.com/ergonlogic/hosting_skynet/tree/master. While I haven't added consurrency yet, I've begun: #2714065: Make the task queue a service. Not complete yet, but both parallel task execution and heartbeats are planned features.
Comment #8
steven jones commentedHave tested it now :)
First time through and it ran all my tasks. My server was not happy about installing 8 heavy Drupal sites at once :)
Queue daemon code changes were fine, it was actually this code:
Note the invalid PDO placeholders in the db_query there.
So, fixed this function up in the latest patch (along with some others that were nearby) and it works!
I set the concurrency to 2, and loaded up a lot of tasks into the queue, and the queue daemon dutifully ensured that there were two tasks running until the queue was processed.
The post task delay also works, but look at bit different now, instead of there just being a random state of nothing happening, the task appears to still be executing, even though it's basically done.
So, over to someone else to review please!
Comment #9
ergonlogicThe code looks solid, though I haven't tested it yet. It doesn't look like this changes the API any; just enhances it with new capabilities. Correct? As such, we shouldn't need to have this wait until Aegir 4.x.
That said, I think we should wait to have more feedback from others running it in production before merging this. In particular, I think we'd want to confirm that forking the process with the default concurrency of 1 operates pretty much identically to the current behaviour.
Also,
--post-task-delaycould perhaps use a bit more documentation. Presumably, it's in place to avoid race conditions, or the like. Could the conditions in which we'd want to use it be explained further?Comment #11
helmo commentedThe patch no longer applied .. here's a re-roll. I've committed the included code style improvements. (that's also where it broke the patch)
Comment #12
helmo commentedComment #13
helmo commentedRunning with concurrency=2 seems to work as expected.
I did notice that when concurrency=1 I did have multiple forked queued child's trying to start the same task.
The locking mechanism is probably saving the day but
This task is already running, use --forceshould not be seen normally.Another thing to consider is if we want multiple tasks to run in parallel on the same site... I wouldn't want a backup and git pull task to run together.
The config form should probably mention that a restart of the queued is needed for changed to take effect.
Comment #14
steven jones commentedThanks for the review, and re-roll etc.
So I suppose with this patch, it'll get an item to be run from the queue, fire off the new Drush command to execute that and then immediately loop back around and fetch the same item from the queue and fire off the same Drush command.
The dispatcher here isn't actually keeping track of its child processes, but is just looking at the global queue state to see if it should start executing some more tasks.
We could solve this in a rubbish way by sleeping for a second if we have spun up child processes to give them chance to claim the item from the queue so that when we check we see how many running processes there are, we see them.
Yeah, that's a good point, maybe it would be enough to group the select query for new tasks by
rid, so that we only ever get one task per item to run.This should probably be fixed globally, since this will be a problem for the old cron based dispatcher too.
So...I think we should maybe introduce a flag to restart the daemon, so that when we save the config form, we set a flag somewhere that the daemon picks up on and restarts, then we don't need a nasty message, and the other options on that page will also be updated asap.
Comment #15
helmo commentedAlso keep an eye on #2672530: Adopt Python queue daemon replacement
Comment #16
colanComment #17
steven jones commentedSo just a massive word of caution, I experienced some serious issues with running tasks in parallel. There are subtle race conditions everywhere.
Things like Apache being restarted by another process in between config files being written and files referenced by the config being written causing Apache to output a notice and Aegir to log it.
Comment #18
colan#2875766: Tasks are not single threading - getting deadlocks for tasks just came in.