When the server load gets so high that dispatch cannot finish within the cron run interval, the cron job will bring down the server in flames as it attempts bootstrap after bootstrap of the Drupal base install. We should have locking to keep this from happening.

This probably belongs in drush (since Drupal is already bootstrapped by the time we hit the hosting dispatch code), but since our use case is (again) so specific, I will leave this here for now.

Comments

anarcat’s picture

Another fun thing that can happen is that the multiple tasks launched by the dispatcher all decide to run apachectl graceful, which takes non-negligible time, especially under load and/or with a lot of vhosts. We need a better way to reload apache...

anarcat’s picture

A LOCK TABLES ... READ should probably be acquired on the queue so that only one dispatcher runs at a time. This seems like the most elegant solution. Of course, if we still fire tasks in parallel, then we will probably still have the problem of task buildup, so I would suggest that the dispatcher (or at least the task dispatcher) should *stop* running in parallel.

adrian’s picture

Making the tasks sequential isn't the solution. It simply increases the time it takes for things to get done.

I think we could make the default number of concurrent processes smaller, which you could then tune based on your available hardware, but making it sequential will make large workloads take _forever_.

An example here would be skwashd using aegir to install 2000 sites with a fairly complex install profile. He was able to do all this in about 24 hours because it could install more than one site at a time.

The dispatcher is coded to be very light in that it spawns the item and removes it from the queue immediately. We are talking about 20 really simple SQL queries, which I would assume could be done in under a minute.

anarcat’s picture

Even if the dispatcher is light, it still has to bootstrap Drupal to get the SQL credentials, which does mean significant load.

Furthermore, if it just fires and forgets tasks, it can't keep track of which tasks started and finished, so tasks could just build up.

There are two degenerate cases:

1. the dispatcher overload: the first dispatcher has not finished loading when a new dispatcher starts, which also bootstraps Drupal; those cron jobs stack up and take up all resources
2. the task (or cron job) overload: the dispatcher runs normally, but starts too many tasks at a time, which stack up and take up all the resources.

Of course, case 2 can degenerate into case 1 for maximum overload.

Case 1 may be solved by a lock on the table, but I'm not sure: to take that lock it needs SQL, and to get to SQL, it needs to bootstrap Drupal, which is the main issue here. Maybe a filesystem-level lock would be better here instead.
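To illustrate the filesystem-level lock idea, here is a minimal sketch in Python (the project itself is PHP, and the lock path and function name here are hypothetical): the key point is that the lock is taken before any Drupal bootstrap, so a second dispatcher exits almost for free.

```python
import fcntl
import os
import sys
import tempfile

# Hypothetical lock path; a real deployment would use something like
# /var/run/hosting-dispatch.lock rather than a temp directory.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "hosting-dispatch.lock")

def try_dispatch_lock(path=LOCK_PATH):
    """Take an exclusive, non-blocking lock on the lock file; return the
    open file object on success, or None if another dispatcher holds it."""
    fd = open(path, "w")
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd  # keep this object alive; the lock is released on close
    except BlockingIOError:
        fd.close()
        return None

lock = try_dispatch_lock()
if lock is None:
    sys.exit(0)  # another dispatcher is still running: bail out cheaply
# ...only now bootstrap Drupal and dispatch the queue...
```

Because the check is non-blocking, overlapping cron runs never queue up behind each other; they simply exit, and the next cron interval tries again.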

Case 2 will *not* be solved by a lock on the table. We *could* examine the state of the queue to see how many tasks are currently running and fire only the tasks we're allowed to run. So say there are 3 tasks still running and we're allowed to run 5; we would fire up the next two tasks.
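To make that counting concrete, here is a small Python sketch (the data shape and the MAX_CONCURRENT name are made up for illustration) of firing only as many tasks as the remaining concurrency budget allows:

```python
MAX_CONCURRENT = 5  # assumed site-tunable limit

def slots_available(queue, max_concurrent=MAX_CONCURRENT):
    """How many new tasks may start, given the ones still PROCESSING."""
    running = sum(1 for task in queue if task["status"] == "PROCESSING")
    return max(0, max_concurrent - running)

def dispatch(queue):
    """Fire queued tasks only up to the remaining budget."""
    budget = slots_available(queue)
    for task in [t for t in queue if t["status"] == "QUEUED"][:budget]:
        task["status"] = "PROCESSING"  # a real dispatcher would spawn it here
```

With 3 tasks still running against a limit of 5, slots_available() returns 2 and only the next two queued tasks are fired.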

I would rather, however, have a simple, non-blocking filesystem-level lock *and* a SQL-level lock (for when we do multi-frontends). The filesystem-level lock will keep case 1 from happening, and the SQL-level lock will ensure frontend consistency.

We could also keep running the tasks in parallel *but* wait for them to finish before quitting the dispatcher. That way we easily control the number of parallel tasks running (which fixes #2). Service reloads could also be offloaded to the dispatcher, which would restart apache/bind/whatever only once per dispatch run (instead of once per task right now).
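A sketch of that shape, again in Python for illustration (the command lists and the needs_reload flag are hypothetical): tasks run in parallel, but the dispatcher blocks until they all finish, and the service reload happens once at the end.

```python
import subprocess

def run_dispatch(commands, needs_reload=False):
    """Spawn all tasks in parallel, but only return once every one of
    them has finished; reload services once per run, not once per task."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    for p in procs:
        p.wait()  # blocking here is what bounds the number of dispatchers
    if needs_reload:
        # one graceful restart per dispatch run instead of one per task
        subprocess.run(["apachectl", "graceful"], check=True)
```

Combined with the non-blocking lock above, this bounds total concurrency: at most one dispatcher, and at most as many tasks as that dispatcher chose to spawn.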

anarcat’s picture

Issue tags: +self-hosting

Since this relates to self-hosting, I'm tagging this accordingly, see #454312: self-provisionning support.

anarcat’s picture

Title: dispatching lock » dispatcher locking
Status: Active » Fixed

I have made it so the dispatcher respects the "hosting_queue_NAME_running" variable (where NAME is the name of the queue): if it's set, it will avoid running the queue, so this variable can be considered a semaphore, to some extent (variable_set()/variable_get() isn't an atomic locking mechanism).
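For contrast with that check-then-set variable, an atomic acquisition can be sketched in Python (the file paths and function names are hypothetical, not part of the patch): with O_CREAT|O_EXCL the kernel guarantees that only one of two racing dispatchers succeeds, whereas two variable_get()/variable_set() pairs can interleave and both "win".

```python
import os

def acquire(path):
    """Atomically create the lock file; return False if it already
    exists. Unlike read-then-write on a flag, two racing callers
    cannot both succeed."""
    try:
        os.close(os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY))
        return True
    except FileExistsError:
        return False

def release(path):
    """Drop the lock by removing the file."""
    os.unlink(path)
```

The variable-based semaphore is still a useful guard for the common case; the atomic variant only matters when two dispatchers start close enough together to race.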

(We could use 6.16+'s lock.inc for this instead, but I'm not sure we want to depend on 6.16 just yet.)

Here's that first patch:

http://git.aegirproject.org/?p=hostmaster.git;a=commitdiff;h=6632d0a6faf...

Then I also made sure no more than N tasks are running in parallel, by checking already running tasks:

http://git.aegirproject.org/?p=hostmaster.git;a=commitdiff;h=406a7a0789a...

Finally, I made it so that the "PROCESSING" status of a task is set only when the task actually starts, instead of when the dispatch starts it, because the task may fail to start for unrelated reasons (the load check in provision_init() is a frequent example):

http://git.aegirproject.org/?p=hostmaster.git;a=commitdiff;h=9ab840f4b99...

anarcat’s picture

Status: Fixed » Closed (fixed)
Issue tags: -self-hosting

Automatically closed -- issue fixed for 2 weeks with no activity.

  • Commit aecf1bb on 6.x-2.x, 7.x-3.x, dev-ssl-ip-allocation-refactor, dev-sni, dev-helmo-3.x by anarcat:
    #500362 - don't run queue if it's already running
    
    this implements a...
