So that was fun. After discovering that cron jobs had been stuck since our latest upgrade (see #695236: cron never runs), I was able to run them using hosting-cron. Somehow, it managed to almost crash my server with a fork-bomb-like attack.

top - 17:12:22 up 69 days, 23:16,  6 users,  load average: 129.00, 111.08, 57.22

Load eventually went up to 170 as Apache and drush were competing for resources on the server. I ended up killing all php processes and restarting apache, and the machine recovered.

Comments

anarcat’s picture

Status: Active » Needs review

I'm now using this patch in production:

diff --git a/modules/hosting/cron/hosting_cron.module b/modules/hosting/cron/hosting_cron.module
index d4b0ce4..c973d73 100755
--- a/modules/hosting/cron/hosting_cron.module
+++ b/modules/hosting/cron/hosting_cron.module
@@ -19,7 +19,6 @@ function hosting_cron_hosting_queues() {
 function hosting_cron_queue($count) {
   $sites = hosting_cron_get_sites($count);
 
-  drush_set_context('DRUSH_LOG_CALLBACK', '_hosting_watchdog');
   foreach ($sites as $site) {
     $platform = node_load($site->platform);
     $web_server = node_load($platform->web_server);
@@ -34,7 +33,7 @@ function hosting_cron_queue($count) {
       $username = $web_server->script_user;
     }
 
-    drush_backend_fork("cron", $data, $drush_path, $hostname, $username);
+    drush_backend_invoke("cron", $data, $drush_path, $hostname, $username);
 
     $site->revision = false;
     $site->no_verify = true; // do not generate verify task

It makes cron a little more ... polite.
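
For readers without the drush source at hand, the effect of this change can be illustrated roughly as follows. This is only a sketch under assumptions: `run_forked()` and `run_invoked()` are hypothetical stand-ins written with plain PHP `exec()` on a Unix shell, not drush's actual implementation of `drush_backend_fork()` or `drush_backend_invoke()`. Both return the time spent launching the commands.

```php
<?php
// Hypothetical stand-ins for illustration only; this is not drush code.

// Start every command in the background and return without waiting.
function run_forked(array $commands): float {
  $start = microtime(true);
  foreach ($commands as $command) {
    // Output must be redirected, otherwise exec() waits for the command.
    exec($command . ' > /dev/null 2>&1 &');
  }
  return microtime(true) - $start;
}

// Run the commands one at a time, waiting for each one to finish.
function run_invoked(array $commands): float {
  $start = microtime(true);
  foreach ($commands as $command) {
    exec($command . ' > /dev/null 2>&1');
  }
  return microtime(true) - $start;
}
```

With the forked variant the loop finishes almost immediately and every command runs at the same time, which is what piles up load when each command is a full site cron. The invoked variant never has more than one command running.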

anarcat’s picture

Status: Needs review » Fixed

fixed in r859c05368bc7

anarcat’s picture

Status: Fixed » Needs work

That's not enough: my server died again. What I noticed first was the extreme load on the server:

top - 14:02:58 up 74 days, 20:07,  7 users,  load average: 130.59, 87.13, 59.31

Now, that can be partly explained by the load of the Apache server, but the problem is that once load is too high, everything degenerates. Here's a process list dump of the aegir user:

 9956 ?        Zs     0:00 [php] <defunct>
 9971 ?        Zs     0:00 [php] <defunct>
 9975 ?        S      0:02 php /srv/aegir/drush/drush.php --items=10 --quiet --root=/srv/aegir/hostmaster-0.4-alpha3+HEAD --uri=aegir.koumbit.net hosting-cron --backend
10129 ?        S      0:02 php /srv/aegir/drush/drush.php --items=10 --quiet --root=/srv/aegir/hostmaster-0.4-alpha3+HEAD --uri=aegir.koumbit.net hosting-cron --backend
10540 ?        Zs     0:02 [php] <defunct>
10560 ?        Zs     0:02 [php] <defunct>
10562 ?        Zs     0:02 [php] <defunct>
10579 ?        Zs     0:02 [php] <defunct>
10581 ?        Zs     0:02 [php] <defunct>
10611 ?        Zs     0:02 [php] <defunct>
10612 ?        Zs     0:02 [php] <defunct>
10615 ?        Zs     0:03 [php] <defunct>
10629 ?        Zs     0:01 [php] <defunct>
10774 ?        Zs     0:00 [php] <defunct>
10784 ?        Zs     0:01 [php] <defunct>
10799 ?        Zs     0:02 [php] <defunct>
10805 ?        S      0:01 php /srv/aegir/drush/drush.php --items=10 --quiet --root=/srv/aegir/hostmaster-0.4-alpha3+HEAD --uri=aegir.koumbit.net hosting-cron --backend
10808 ?        S      0:01 php /srv/aegir/drush/drush.php --items=10 --quiet --root=/srv/aegir/hostmaster-0.4-alpha3+HEAD --uri=aegir.koumbit.net hosting-cron --backend
10809 ?        S      0:02 php /srv/aegir/drush/drush.php --items=10 --quiet --root=/srv/aegir/hostmaster-0.4-alpha3+HEAD --uri=aegir.koumbit.net hosting-cron --backend
10821 ?        S      0:01 php /srv/aegir/drush/drush.php --items=10 --quiet --root=/srv/aegir/hostmaster-0.4-alpha3+HEAD --uri=aegir.koumbit.net hosting-cron --backend
10828 ?        S      0:01 php /srv/aegir/drush/drush.php --items=10 --quiet --root=/srv/aegir/hostmaster-0.4-alpha3+HEAD --uri=aegir.koumbit.net hosting-cron --backend
10835 ?        S      0:02 php /srv/aegir/drush/drush.php --items=10 --quiet --root=/srv/aegir/hostmaster-0.4-alpha3+HEAD --uri=aegir.koumbit.net hosting-cron --backend
10838 ?        S      0:01 php /srv/aegir/drush/drush.php --items=10 --quiet --root=/srv/aegir/hostmaster-0.4-alpha3+HEAD --uri=aegir.koumbit.net hosting-cron --backend
10851 ?        S      0:01 php /srv/aegir/drush/drush.php --items=10 --quiet --root=/srv/aegir/hostmaster-0.4-alpha3+HEAD --uri=aegir.koumbit.net hosting-cron --backend
10856 ?        S      0:02 php /srv/aegir/drush/drush.php --items=10 --quiet --root=/srv/aegir/hostmaster-0.4-alpha3+HEAD --uri=aegir.koumbit.net hosting-cron --backend
10862 ?        S      0:01 php /srv/aegir/drush/drush.php --items=10 --quiet --root=/srv/aegir/hostmaster-0.4-alpha3+HEAD --uri=aegir.koumbit.net hosting-cron --backend
10882 ?        Zs     0:01 [php] <defunct>
10908 ?        S      0:01 php /srv/aegir/drush/drush.php --items=10 --quiet --root=/srv/aegir/hostmaster-0.4-alpha3+HEAD --uri=aegir.koumbit.net hosting-cron --backend
10916 ?        S      0:01 php /srv/aegir/drush/drush.php --items=10 --quiet --root=/srv/aegir/hostmaster-0.4-alpha3+HEAD --uri=aegir.koumbit.net hosting-cron --backend
10927 ?        D      0:00 php /srv/aegir/drush/drush.php --uri=http://foo.test.koumbit.net --root=/var/aegir/drupal-6.14-TNI cron --backend
10939 ?        D      0:00 php /srv/aegir/drush/drush.php --uri=http://foo.test.koumbit.net --root=/var/aegir/drupal-6.14-TNI cron --backend
10940 ?        D      0:00 php /srv/aegir/drush/drush.php --uri=http://foo.test.koumbit.net --root=/var/aegir/drupal-6.14-TNI cron --backend
10958 ?        D      0:00 php /srv/aegir/drush/drush.php --uri=http://foo.test.koumbit.net --root=/var/aegir/drupal-6.14-TNI cron --backend
10960 ?        Ds     0:00 php /usr/bin/drush hosting-dispatch --root=/srv/aegir/hostmaster-0.4-alpha3+HEAD --uri=aegir.koumbit.net
10961 ?        Ds     0:00 php /usr/bin/drush hosting-dispatch --root=/srv/aegir/hostmaster-0.4-alpha3+HEAD --uri=aegir.koumbit.net
10962 ?        Ds     0:00 php /usr/bin/drush hosting-dispatch --root=/srv/aegir/hostmaster-0.4-alpha3+HEAD --uri=aegir.koumbit.net
10963 ?        D      0:00 php /srv/aegir/drush/drush.php --uri=http://dev.foo.com --root=/var/aegir/drupal-6.15-1.0-prod cron --backend
10964 ?        D      0:00 php /srv/aegir/drush/drush.php --uri=http://dev.foo.com --root=/var/aegir/drupal-6.15-1.0-prod cron --backend
10967 ?        D      0:00 php /srv/aegir/drush/drush.php --items=10 --quiet --root=/srv/aegir/hostmaster-0.4-alpha3+HEAD --uri=aegir.koumbit.net hosting-cron --backend
10970 ?        D      0:00 php /srv/aegir/drush/drush.php --items=5 --quiet --root=/srv/aegir/hostmaster-0.4-alpha3+HEAD --uri=aegir.koumbit.net hosting-tasks --backend
10982 ?        D      0:00 php /srv/aegir/drush/drush.php --uri=http://dev.foo.com --root=/var/aegir/drupal-6.15-1.0-prod cron --backend
10984 ?        D      0:00 php /srv/aegir/drush/drush.php --uri=http://dev.foo.com --root=/var/aegir/drupal-6.15-1.0-prod cron --backend
10985 ?        D      0:00 php /srv/aegir/drush/drush.php --uri=http://dev.foo.com --root=/var/aegir/drupal-6.15-1.0-prod cron --backend
10986 ?        D      0:00 php /srv/aegir/drush/drush.php --uri=http://dev.foo.com --root=/var/aegir/drupal-6.15-1.0-prod cron --backend
10990 ?        D      0:00 php /srv/aegir/drush/drush.php --uri=http://dev.foo.com --root=/var/aegir/drupal-6.15-1.0-prod cron --backend
10992 ?        D      0:00 php /srv/aegir/drush/drush.php --uri=http://dev.foo.com --root=/var/aegir/drupal-6.15-1.0-prod cron --backend
10994 ?        D      0:00 php /srv/aegir/drush/drush.php --uri=http://dev.foo.com --root=/var/aegir/drupal-6.15-1.0-prod cron --backend
10996 ?        D      0:00 php /srv/aegir/drush/drush.php --uri=http://dev.foo.com --root=/var/aegir/drupal-6.15-1.0-prod cron --backend
10998 ?        Ds     0:00 /bin/bash -c php '/usr/bin/drush' hosting-dispatch --root='/srv/aegir/hostmaster-0.4-alpha3+HEAD' --uri='aegir.koumbit.net'
10999 ?        Ds     0:00 php /usr/bin/drush hosting-dispatch --root=/srv/aegir/hostmaster-0.4-alpha3+HEAD --uri=aegir.koumbit.net
11001 ?        Ds     0:00 [php]
11007 ?        Ds     0:00 php /usr/bin/drush hosting-dispatch --root=/srv/aegir/hostmaster-0.4-alpha3+HEAD --uri=aegir.koumbit.net

I think we could probably use a locking mechanism for cron. Note that foo.test.koumbit.net is a particularly heavy site, so having multiple cron jobs like this is just insane.
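
One possible shape for such a lock, as a minimal sketch and not what Aegir does: take a non-blocking `flock()` on a per-site lock file and skip the run if another process already holds it. The function name `run_cron_exclusively()` and the lock file location are assumptions made up for this example.

```php
<?php
// Hypothetical helper: runs $job only if no other run holds the lock for $site.
// Returns true if the job ran, false if it was skipped.
function run_cron_exclusively(string $site, callable $job, ?string $lockDir = null): bool {
  $lockDir = $lockDir ?? sys_get_temp_dir();
  $path = $lockDir . '/cron-' . md5($site) . '.lock';

  // Mode 'c' creates the file if needed and does not truncate it.
  $handle = fopen($path, 'c');
  if ($handle === false) {
    return false;
  }

  // LOCK_NB: give up at once instead of queueing behind the current holder.
  if (!flock($handle, LOCK_EX | LOCK_NB)) {
    fclose($handle);
    return false;
  }

  try {
    $job();
  } finally {
    flock($handle, LOCK_UN);
    fclose($handle);
  }
  return true;
}
```

Skipping instead of waiting is a deliberate choice here: for a heavy site, a second cron that queues up behind the first one only delays the pile-up.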

anarcat’s picture

Priority: Critical » Normal

#623920: check load before dispatching new tasks/crons has implemented load controls that may help with this. Now we just need to figure out why there are duplicate cron jobs being started here...

adrian’s picture

Either one or the other, not both #366387: cronjob queue not aggressive enough

SocialNicheGuru’s picture

This patch seems to work for me.
I need to verify that Aegir cron activates cron for my sites. I will report back.

If a site cron is not completed for whatever reason, does that mean that the Aegir cron will also stop? If that is the case, I would like to know which site cron halted or ended in an error. It would be nice to have it displayed on the hostmaster > admin > queue tab.

omega8cc’s picture

There are two separate Aegir cron queues, one for tasks and one for sites. Each uses its own semaphore, and both can run into the locked-semaphore issue. Currently I'm using some simple fixes to keep the tasks/sites cron queues from growing too big and to avoid high server load:

1. This patch makes the site crons run in sequence instead of starting them all at once: http://drupal.org/node/931550#comment-3533854

2. This patch auto-releases old (over 1h) locked semaphores for sites/tasks: http://drupal.org/node/931550#comment-3762510
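
The auto-release idea in the second item can be sketched like this. It is not the linked patch: the `$store` array stands in for wherever the semaphores are actually kept, and the function names are hypothetical. Only the one-hour limit comes from the description above.

```php
<?php
// Hypothetical sketch: $store maps a queue name to the time its semaphore was taken.

// Try to take the semaphore for $queue; returns true if this run may proceed.
function acquire_semaphore(array &$store, string $queue, int $now, int $maxAge = 3600): bool {
  $lockedAt = $store[$queue] ?? 0;
  $isLocked = $lockedAt > 0;
  $isStale = $isLocked && ($now - $lockedAt) > $maxAge;

  if ($isLocked && !$isStale) {
    return false; // another run holds a fresh semaphore
  }

  // Free, or left behind by an interrupted run: (re)claim it.
  $store[$queue] = $now;
  return true;
}

// Give the semaphore back once the queue run is done.
function release_semaphore(array &$store, string $queue): void {
  unset($store[$queue]);
}
```

A run that is interrupted never calls `release_semaphore()`, so without the age check the queue would stay locked until someone clears it by hand.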

Also, Barracuda and Octopus use a shell script wrapper to run the Aegir system cron jobs. It makes sure we don't start any Aegir crons (tasks/sites) when system load is too high (the built-in Aegir load check is too slow to avoid issues), or while the system is performing auto-healing, backups, database integrity checks, etc., which could cause Aegir crons to fail or generate high system load.
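
The load gate itself is simple. The wrapper described above is a shell script; the sketch below shows the same idea in PHP to match the other code in this thread. `load_allows_dispatch()` and its threshold are hypothetical, while `sys_getloadavg()` is the standard PHP function that returns the 1, 5 and 15 minute load averages.

```php
<?php
// Hypothetical gate: only start new crons while the 1-minute load is below $maxLoad.
// $loadavg can be passed in for testing; by default the live values are read.
function load_allows_dispatch(float $maxLoad, ?array $loadavg = null): bool {
  $loadavg = $loadavg ?? sys_getloadavg();
  if ($loadavg === false) {
    // Load could not be read: be conservative and start nothing.
    return false;
  }
  return $loadavg[0] < $maxLoad;
}
```

Checking right before each launch, and not once per queue run, matters because the load can climb quickly once the first few crons are running.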

[EDIT] Furthermore, my default setting for the tasks queue is to allow only one (1) task every 5 seconds. This helps avoid many issues with race conditions when a user queues a few tasks on the same site and/or on a few sites using a heavy distro and/or a large "files" directory. I believe we should use this value by default.

anarcat’s picture

Status: Needs work » Fixed

Hi!

#931550: when hosting-cron is interrupted, it keeps its semaphore in the database (reboot problem) was committed.

I am not sure what #931550: when hosting-cron is interrupted, it keeps its semaphore in the database (reboot problem) accomplishes: is pausing 3 seconds really helping? How about 5 or 1? It feels really arbitrary to me. Furthermore, looking at the code, it seems to me that the tasks *are* run in sequence:

    if (variable_get('hosting_cron_use_backend', TRUE)) {
      provision_backend_invoke($site_name, "cron");
    }
    else {
      $cmd = sprintf("wget -O - -q %s  > /dev/null", escapeshellarg(_hosting_site_url($site) . '/cron.php'));
      drush_shell_exec($cmd);
    }

No parallel execution here...

As for the tasks queue, it seems like a separate issue to me: we're talking about an evil hosting-cron here.

Marking this as fixed: I haven't seen cron kill my server recently, although mass migrates... but that's a different issue!

omega8cc’s picture

The 3s sleep makes the sequence "real" and safer. Without it, the crons are started in such quick succession that they are in fact "parallel".
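
As a rough sketch of the pause being discussed, and not the committed code: run the jobs one after another and sleep between them. `run_in_sequence()` is a hypothetical name, and the pause length is left as a parameter since the right value is exactly what is under debate here.

```php
<?php
// Hypothetical sketch: run each job in turn, pausing between consecutive jobs.
// Returns the start time of each job so the spacing can be inspected.
function run_in_sequence(array $jobs, float $pauseSeconds): array {
  $jobs = array_values($jobs);
  $last = count($jobs) - 1;
  $started = [];

  foreach ($jobs as $i => $job) {
    $started[] = microtime(true);
    $job();
    // No pause after the final job.
    if ($i < $last && $pauseSeconds > 0) {
      usleep((int) round($pauseSeconds * 1000000));
    }
  }
  return $started;
}
```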

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

SocialNicheGuru’s picture

Status: Closed (fixed) » Active

Hi.

Is there a way to reset each semaphore manually?

Also, I just installed aegir-rc1 and neither of the patches in #7 was part of the hostmaster install.

anarcat’s picture

Status: Active » Closed (fixed)

  • Commit 3143247 on 6.x-2.x, 7.x-3.x, dev-ssl-ip-allocation-refactor, dev-sni, dev-helmo-3.x authored by anarcat:
    #695244 - don't run cron jobs in parallel but in serie
    
    since the...
