It'd be nice to re-introduce parallelism into our queue. Some background: the original queue, also referred to as the "cron" or "ghetto" queue, runs at most once a minute and triggers a configurable number of tasks. Mass platform migrations could be problematic when run in parallel, due to large numbers of backups piling up and eventually saturating I/O, thus crashing MySQL. Largely to support faster mass platform migrations while running tasks strictly sequentially, we moved to a queue daemon. We've since resolved many of the bottleneck issues by running the task daemon both niced and ioniced. However, we never re-implemented the ability to run tasks in parallel with the daemon. As the daemon is now the recommended standard, we should look to optimize this, imo.

Comments

ergonlogic created an issue.

Steven Jones’s picture

We'd need to make sure that queue item claiming works correctly; sometimes with old Aegir, both the queue runner and the ghetto queue could try to run the exact same task at the same time. But assuming that worked:

Would an approach be to just run more than one queue runner at a time, and somehow handle that at the system level?

ergonlogic’s picture

I think we already only select queue items that are in a 'queued' state, which get updated to 'processing' once they're "claimed", but we'll need to check.
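For reference, here's a minimal sketch of what an atomic claim could look like using Drupal 7's database API (hypothetical code, not necessarily what hosting does today): flip the status from 'queued' to 'processing' in a single UPDATE, and treat the task as claimed only if a row actually changed.

function hosting_task_claim($task_nid) {
  // Atomically flip the status; only one runner can win this UPDATE
  // because the condition excludes rows already in 'processing'.
  $claimed = db_update('hosting_task')
    ->fields(array('task_status' => HOSTING_TASK_PROCESSING))
    ->condition('nid', $task_nid)
    ->condition('task_status', HOSTING_TASK_QUEUED)
    ->execute();
  // db_update() returns the number of rows matched, so a truthy value
  // means we own this task and no other runner can claim it.
  return (bool) $claimed;
}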

I hadn't thought to run multiple queue daemon processes. It seems to me that such an approach would be more likely to lead to the kinds of issues you mention. I had thought to simply have the queue daemon trigger up to X queued tasks.

If we wanted to really optimize things, we might consider assigning a "score" to tasks representing their resource consumption. Then we might be able to, say, only run one backup or clone at a time (which consume lots of I/O), but multiple verify tasks (which have relatively little performance impact).
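A hypothetical sketch of that scoring idea (the function names, weights, and budget are invented for illustration; the task_status/task_type columns and HOSTING_TASK_PROCESSING constant are real):

// Hypothetical per-task-type "cost" weights: heavy I/O tasks score
// higher, so fewer of them fit in the budget at once.
function hosting_task_cost($task_type) {
  $costs = array(
    'backup' => 4,
    'clone' => 4,
    'migrate' => 3,
    'verify' => 1,
  );
  return isset($costs[$task_type]) ? $costs[$task_type] : 2;
}

// Hypothetical check: sum the cost of every task currently in the
// 'processing' state and see whether this task still fits the budget.
function hosting_task_can_dispatch($task_type, $budget = 4) {
  $running_cost = 0;
  $types = db_query("SELECT t.task_type FROM {hosting_task} t WHERE t.task_status = :status", array(
    ':status' => HOSTING_TASK_PROCESSING,
  ))->fetchCol();
  foreach ($types as $type) {
    $running_cost += hosting_task_cost($type);
  }
  return ($running_cost + hosting_task_cost($task_type)) <= $budget;
}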

Steven Jones’s picture

Had a look at this, and specifically at the code. We have a global lock, so running multiple daemons won't work at the moment. Alternatively, we could change our invocation of the task so that we fork instead of running in the same process; then the task daemon would run as many tasks as it could find, as fast as it wants.

Obviously we wouldn't want hundreds of tasks to run all at the same time, but we already have hosting_task_count_running() that we can use to see how many tasks are actually running, and to ensure we stay within the concurrency bounds.

Then I think the only thing that wouldn't work is the post-task delay, because we wouldn't actually know when the task has finished. We could just add that as a hidden option to the hosting-task Drush command, and send it along as the task is dispatched.
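Roughly, the dispatch loop could then look something like this (a sketch only: _hosting_get_new_tasks() and the variable names are invented, though hosting_task_count_running() and the hosting-task Drush command are real):

function hosting_queued_dispatch() {
  $concurrency = variable_get('hosting_queued_concurrency', 1);
  // Only dispatch as many tasks as there are free slots.
  $slots = $concurrency - hosting_task_count_running();
  if ($slots <= 0) {
    return;
  }
  // _hosting_get_new_tasks() is a hypothetical helper returning up to
  // $slots task node IDs in the 'queued' state.
  foreach (_hosting_get_new_tasks($slots) as $task_nid) {
    // Fork a separate Drush process rather than running the task
    // in-process; pass the post-task delay along, since the dispatcher
    // no longer knows when the task finishes.
    $command = sprintf('drush @hostmaster hosting-task %d --post-task-delay=%d > /dev/null 2>&1 &', $task_nid, variable_get('hosting_queued_post_task_delay', 0));
    exec($command);
  }
}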

I don't think we'd lose out on anything else really, so...why not!

Steven Jones’s picture

Status: Active » Needs work
FileSize
6 KB

Here's a completely untested initial stab at this. Will pick this up tomorrow again.

ergonlogic’s picture

Neat! FYI, I've started some refactoring of Skynet, a re-implementation of the task queue in Python: https://github.com/ergonlogic/hosting_skynet/tree/master. While I haven't added concurrency yet, I've begun in #2714065: Make the task queue a service. It's not complete yet, but both parallel task execution and heartbeats are planned features.

Steven Jones’s picture

Status: Needs work » Needs review
FileSize
7.96 KB

Have tested it now :)

First time through and it ran all my tasks. My server was not happy about installing 8 heavy Drupal sites at once :)

The queue daemon code changes were fine; it was actually this code:

function hosting_task_count_running() {
  return db_query("SELECT COUNT(t.nid) FROM {node} n INNER JOIN {hosting_task} t ON n.vid = t.vid WHERE type = !type AND t.task_status = !t.task_status", array(
    '!type' => 'task',
    '!t.task_status' => HOSTING_TASK_PROCESSING,
  ))->fetchField();
}

Note the invalid PDO placeholders in the db_query there.

So, fixed this function up in the latest patch (along with some others that were nearby) and it works!
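Presumably the fixed-up version looks something like the following, with valid PDO-style :name placeholders (a sketch; the actual patch may differ in detail):

function hosting_task_count_running() {
  // Same query as before, but with valid Drupal 7 placeholders (:name)
  // in place of the old !name tokens.
  return db_query("SELECT COUNT(t.nid) FROM {node} n INNER JOIN {hosting_task} t ON n.vid = t.vid WHERE n.type = :type AND t.task_status = :status", array(
    ':type' => 'task',
    ':status' => HOSTING_TASK_PROCESSING,
  ))->fetchField();
}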

I set the concurrency to 2 and loaded up a lot of tasks into the queue, and the queue daemon dutifully ensured that there were two tasks running until the queue was processed.
The post-task delay also works, but it looks a bit different now: instead of a random period where nothing appears to happen, the task appears to still be executing, even though it's basically done.

So, over to someone else to review please!

ergonlogic’s picture

The code looks solid, though I haven't tested it yet. It doesn't look like this changes the API at all; it just enhances it with new capabilities. Correct? As such, we shouldn't need to make this wait until Aegir 4.x.

That said, I think we should wait for more feedback from others running it in production before merging this. In particular, I think we'd want to confirm that forking the process with the default concurrency of 1 operates pretty much identically to the current behaviour.

Also, --post-task-delay could perhaps use a bit more documentation. Presumably, it's in place to avoid race conditions, or the like. Could the conditions in which we'd want to use it be explained further?

  • helmo committed 43415b2 on 7.x-3.x
    Issue #2714049 by Steven Jones: Pre-commit code style improvements
    
helmo’s picture

The patch no longer applied, so here's a re-roll. I've committed the included code style improvements (that's also what broke the patch).

helmo’s picture

Running with concurrency=2 seems to work as expected.

I did notice that with concurrency=1 I had multiple forked queued children trying to start the same task.
The locking mechanism is probably saving the day, but "This task is already running, use --force" should not normally be seen.

Another thing to consider is whether we want multiple tasks to run in parallel on the same site... I wouldn't want a backup and a git pull task running together.

The config form should probably mention that a restart of queued is needed for changes to take effect.

Steven Jones’s picture

Status: Needs review » Needs work

Thanks for the review, and re-roll etc.

I did notice that with concurrency=1 I had multiple forked queued children trying to start the same task.
The locking mechanism is probably saving the day, but "This task is already running, use --force" should not normally be seen.

So I suppose that with this patch, it'll get an item to run from the queue, fire off the new Drush command to execute it, and then immediately loop back around, fetch the same item from the queue, and fire off the same Drush command again.

The dispatcher here isn't actually keeping track of its child processes, but is just looking at the global queue state to see if it should start executing some more tasks.

We could solve this in a rubbish way by sleeping for a second after spinning up child processes, to give them a chance to claim their items from the queue, so that when we check how many processes are running, we actually see them.
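In code, that rubbish-but-simple workaround would amount to something like this in the dispatch loop (hypothetical; $dispatched is the number of children just spawned):

// Give freshly forked children a second to claim their queue items
// before we loop back and call hosting_task_count_running() again.
if ($dispatched > 0) {
  sleep(1);
}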

Another thing to consider is whether we want multiple tasks to run in parallel on the same site... I wouldn't want a backup and a git pull task running together.

Yeah, that's a good point. Maybe it would be enough to group the select query for new tasks by rid, so that we only ever get one task per site to run.
This should probably be fixed globally, since this will be a problem for the old cron-based dispatcher too.
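A rough sketch of that grouped select (the query itself is illustrative, though the rid and task_status columns are real):

// Fetch at most one queued task per site (rid), oldest first, so tasks
// for the same site never run in parallel.
$nids = db_query_range("SELECT MIN(t.nid) AS nid FROM {hosting_task} t WHERE t.task_status = :status GROUP BY t.rid ORDER BY MIN(t.nid) ASC", 0, $slots, array(
  ':status' => HOSTING_TASK_QUEUED,
))->fetchCol();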

The config form should probably mention that a restart of queued is needed for changes to take effect.

So... I think we should maybe introduce a flag to restart the daemon: when we save the config form, we set a flag somewhere that the daemon picks up on and restarts itself. Then we don't need a nasty message, and the other options on that page will also take effect as soon as possible.
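A minimal sketch of that flag (names invented), assuming the daemon runs under a supervisor that restarts it on exit:

// On the settings form submit, request a restart.
function hosting_queued_settings_form_submit($form, &$form_state) {
  variable_set('hosting_queued_restart_needed', TRUE);
}

// In the daemon's main loop: a long-running process caches Drupal
// variables in memory, so read the flag straight from the database.
$flag = db_query("SELECT value FROM {variable} WHERE name = :name", array(
  ':name' => 'hosting_queued_restart_needed',
))->fetchField();
if ($flag && unserialize($flag)) {
  // Clear the flag and exit; the supervisor (init script, systemd,
  // etc.) brings the daemon back up with the new settings.
  variable_del('hosting_queued_restart_needed');
  exit(0);
}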

Steven Jones’s picture

So just a massive word of caution: I experienced some serious issues with running tasks in parallel. There are subtle race conditions everywhere.

Things like Apache being restarted by another process in between a config file being written and the files referenced by that config being written, causing Apache to output a notice and Aegir to log it.
