When the server load gets so high that dispatch cannot finish within the cron run interval, the cron job will bring down the server in flames as it attempts bootstrap after bootstrap of the Drupal base install. We should have locking to keep this from happening.

This probably belongs in drush (since Drupal is already bootstrapped by the time we hit the hosting dispatch code), but since our use case is (again) so specific, I will leave this here for now.

Comments

anarcat’s picture

Another fun thing that can happen is that the multiple tasks launched by the dispatcher all decide to run apachectl graceful, which takes non-negligible time, especially under load and/or with a lot of vhosts. We need a better way to reload apache...

anarcat’s picture

A LOCK TABLES ... READ should probably be acquired on the queue so that only one dispatcher runs at a time. This seems like the most elegant solution. Of course, if we still fire tasks in parallel, then we will probably still have the problem of task buildup, so I would suggest that the dispatcher (or at least the task dispatcher) should *stop* running in parallel.

adrian’s picture

Making the tasks sequential isn't the solution. It simply increases the time it takes for things to get done.

I think we could make the default number of concurrent processes smaller, which you could then tune based on your available hardware, but making it sequential will make large workloads take _forever_.

An example here would be skwashd using aegir to install 2000 sites with a fairly complex install profile. He was able to do all this in about 24 hours because it could install more than one site at a time.

The dispatcher is coded to be very light in that it spawns the item and removes it from the queue immediately. We are talking about 20 really simple SQL queries, which I would assume could be done in under a minute.

anarcat’s picture

Even if the dispatcher is light, it still has to bootstrap Drupal to get the SQL credentials, which does mean significant load.

Furthermore, if it just fires and forgets tasks, it can't keep track of which tasks started and finished, so tasks could just build up.

There are two degenerate cases:

1. the dispatcher overload: the first dispatcher has not finished loading when a new dispatcher starts, which also bootstraps Drupal; those cron jobs stack up and take up all resources
2. the task (or cron job) overload: the dispatcher runs normally, but starts too many tasks at a time, which stack up and take up all the resources.

Of course, case 2 can degenerate into case 1 for maximum overload.

Case 1 may be solved by a lock on the table, but I'm not sure: to take that lock it needs SQL, and to get to SQL, it needs to bootstrap Drupal, which is the main issue here. Maybe a filesystem-level lock would be better here instead.
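To illustrate the filesystem-level lock idea, here is a minimal sketch in Python (the project itself is PHP, and the lock path and function name here are hypothetical): the key point is that the lock is taken before any Drupal bootstrap, so a second dispatcher exits almost for free.

```python
import fcntl
import os
import sys
import tempfile

# Hypothetical lock path; a real deployment would use something like
# /var/run/hosting-dispatch.lock rather than a temp directory.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "hosting-dispatch.lock")

def try_dispatch_lock(path=LOCK_PATH):
    """Take an exclusive, non-blocking lock on the lock file; return the
    open file object on success, or None if another dispatcher holds it."""
    fd = open(path, "w")
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd  # keep this object alive; the lock is released on close
    except BlockingIOError:
        fd.close()
        return None

lock = try_dispatch_lock()
if lock is None:
    sys.exit(0)  # another dispatcher is still running: bail out cheaply
# ...only now bootstrap Drupal and dispatch the queue...
```

Because the check is non-blocking, overlapping cron runs never queue up behind each other; they simply exit, and the next cron interval tries again.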

Case 2 will *not* be solved by a lock on the table. We *could* examine the state of the queue to see how many tasks are currently running and fire only the tasks we're allowed to run. So say there are 3 tasks still running and we're allowed to run 5; we would fire up the next two tasks.
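To make that counting concrete, here is a small Python sketch (the data shape and the MAX_CONCURRENT name are made up for illustration) of firing only as many tasks as the remaining concurrency budget allows:

```python
MAX_CONCURRENT = 5  # assumed site-tunable limit

def slots_available(queue, max_concurrent=MAX_CONCURRENT):
    """How many new tasks may start, given the ones still PROCESSING."""
    running = sum(1 for task in queue if task["status"] == "PROCESSING")
    return max(0, max_concurrent - running)

def dispatch(queue):
    """Fire queued tasks only up to the remaining budget."""
    budget = slots_available(queue)
    for task in [t for t in queue if t["status"] == "QUEUED"][:budget]:
        task["status"] = "PROCESSING"  # a real dispatcher would spawn it here
```

With 3 tasks still running against a limit of 5, slots_available() returns 2 and only the next two queued tasks are fired.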

I would rather, however, have a simple, non-blocking filesystem-level lock *and* a SQL-level lock (for when we do multi-frontends). The filesystem-level lock will keep case 1 from happening, and the SQL-level lock will ensure frontend consistency.

We could also keep running the tasks in parallel *but* wait for them to finish before quitting the dispatcher. That way we easily control the number of parallel tasks running (which fixes #2). Service reloads could also be offloaded to the dispatcher, which would restart apache/bind/whatever only once per dispatch run (instead of once per task right now).
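A sketch of that shape, again in Python for illustration (the command lists and the needs_reload flag are hypothetical): tasks run in parallel, but the dispatcher blocks until they all finish, and the service reload happens once at the end.

```python
import subprocess

def run_dispatch(commands, needs_reload=False):
    """Spawn all tasks in parallel, but only return once every one of
    them has finished; reload services once per run, not once per task."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    for p in procs:
        p.wait()  # blocking here is what bounds the number of dispatchers
    if needs_reload:
        # one graceful restart per dispatch run instead of one per task
        subprocess.run(["apachectl", "graceful"], check=True)
```

Combined with the non-blocking lock above, this bounds total concurrency: at most one dispatcher, and at most as many tasks as that dispatcher chose to spawn.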

anarcat’s picture

Issue tags: +self-hosting

Since this relates to self-hosting, I'm tagging this accordingly, see #454312: self-provisionning support.

anarcat’s picture

Title: dispatching lock » dispatcher locking
Status: Active » Fixed

I have made it so the dispatcher respects the "hosting_queue_NAME_running" variable (where NAME is the name of the queue): if it's set, it will avoid running the queue, so this variable can be considered a semaphore, to some extent (variable_set()/variable_get() isn't an atomic locking mechanism).
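For contrast with that check-then-set variable, an atomic acquisition can be sketched in Python (the file paths and function names are hypothetical, not part of the patch): with O_CREAT|O_EXCL the kernel guarantees that only one of two racing dispatchers succeeds, whereas two variable_get()/variable_set() pairs can interleave and both "win".

```python
import os

def acquire(path):
    """Atomically create the lock file; return False if it already
    exists. Unlike read-then-write on a flag, two racing callers
    cannot both succeed."""
    try:
        os.close(os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY))
        return True
    except FileExistsError:
        return False

def release(path):
    """Drop the lock by removing the file."""
    os.unlink(path)
```

The variable-based semaphore is still a useful guard for the common case; the atomic variant only matters when two dispatchers start close enough together to race.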

(We could use 6.16+'s lock.inc for this instead, but I'm not sure we want to depend on 6.16 just yet.)

Here's that first patch:

http://git.aegirproject.org/?p=hostmaster.git;a=commitdiff;h=6632d0a6faf...

Then I also made sure no more than N tasks are running in parallel, by checking already running tasks:

http://git.aegirproject.org/?p=hostmaster.git;a=commitdiff;h=406a7a0789a...

Finally, I made it so that the "PROCESSING" status of a task is set only when the task actually starts, instead of when the dispatch starts it, because the task may fail to start for unrelated reasons (the load check in provision_init() is a frequent example):

http://git.aegirproject.org/?p=hostmaster.git;a=commitdiff;h=9ab840f4b99...

anarcat’s picture

Status: Fixed » Closed (fixed)
Issue tags: -self-hosting

Automatically closed -- issue fixed for 2 weeks with no activity.

  • Commit aecf1bb on 6.x-2.x, 7.x-3.x, dev-ssl-ip-allocation-refactor, dev-sni, dev-helmo-3.x by anarcat:
    #500362 - don't run queue if it's already running
    
    this implements a...
