Aegir should check the server's condition before dispatching new tasks/crons, because it is easy to kill the server using just a few heavy distros with very busy crons, with a batch migrate, or with a mass re-verify of sites/platforms on Aegir upgrade.

Before trying to implement this in Aegir itself as part of future enhancements (where it would be useful to monitor the resources used by sites/platforms, such as estimated CPU and RAM usage based on requests/minute, site size (numbers of users/nodes/comments), bandwidth, etc.), it is also possible to introduce simple server load control before running the Aegir cron.

Every minute we can run a simple shell script from cron to check the current load and decide whether we can run the Aegir dispatcher safely.

Example:

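#!/bin/sh

# Raise the priority of this script (a negative nice value requires root).
renice -9 $$

control()
{
  # 1-minute load average multiplied by 100, so the shell can compare integers.
  NOW_LOAD=`awk '{print $1*100}' /proc/loadavg`
  # Maximum allowed load, also multiplied by 100 (200 means a load of 2.00).
  CTL_LOAD=200
  if [ "$NOW_LOAD" -lt "$CTL_LOAD" ]; then
    echo "load is $NOW_LOAD while maxload is $CTL_LOAD"
    echo "... now doing CTL..."
    su - aegir -c "sh /home/aegir/aegir.sh"
  else
    echo "load is $NOW_LOAD while maxload is $CTL_LOAD"
    echo "...we have to wait..."
  fi
}
control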

The /home/aegir/aegir.sh script can contain just:

#!/bin/sh

/path/to/php '/path/to/drush/drush.php' hosting dispatch --root='/path/to/aegir/domain' --uri=http://aegir.domain
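
The recipe above multiplies the load average by 100 because the shell's `[ ... -lt ... ]` test only compares integers, while /proc/loadavg reports values such as 1.57. A minimal sketch of that comparison, taking the load line as an argument so it can be tried without touching the real system (the function names `scaled_load` and `load_below` are made up for this illustration):

```sh
#!/bin/sh

# Hypothetical helper: turn a load-average line (same layout as
# /proc/loadavg) into an integer scaled by 100, e.g. "1.57 ..." -> 157.
# Adding 0.5 before truncating rounds to the nearest integer, which
# avoids off-by-one results from floating-point error.
scaled_load() {
    printf '%s\n' "$1" | awk '{ printf "%d\n", $1 * 100 + 0.5 }'
}

# Hypothetical helper: succeed (exit status 0) when the scaled load is
# strictly below the scaled limit, e.g. 200 for a load of 2.00.
load_below() {
    [ "$(scaled_load "$1")" -lt "$2" ]
}

# Sample line in /proc/loadavg format; the values are invented.
if load_below "1.57 1.20 0.98 2/345 6789" 200; then
    echo "safe to dispatch"
else
    echo "we have to wait"
fi
```

With the sample line the scaled load is 157, which is below 200, so the first branch runs.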

Comments

omega8cc’s picture

BTW: we can also run the dispatcher every 15 seconds, either as a crash test or to avoid waiting too long for a safe load :)

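#!/bin/sh

# Raise the priority of this script (a negative nice value requires root).
renice -9 $$

control()
{
  # 1-minute load average multiplied by 100, so the shell can compare integers.
  NOW_LOAD=`awk '{print $1*100}' /proc/loadavg`
  # Maximum allowed load, also multiplied by 100 (200 means a load of 2.00).
  CTL_LOAD=200
  if [ "$NOW_LOAD" -lt "$CTL_LOAD" ]; then
    echo "load is $NOW_LOAD while maxload is $CTL_LOAD"
    echo "... now doing CTL..."
    su - aegir -c "sh /home/aegir/aegir.sh"
  else
    echo "load is $NOW_LOAD while maxload is $CTL_LOAD"
    echo "...we have to wait..."
  fi
}
# Check the load three times within the one-minute cron interval.
control
sleep 15
control
sleep 15
control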
adrian’s picture

we can't depend on the /proc filesystem, and we can't depend on bash.

we need php functions to test these things if we are to test them at all, and i don't object to putting a switch into the dispatcher to not fire if the system load is too heavy.

anarcat’s picture

Assigned: Unassigned » anarcat

I'll attack this, I'm tired of aegir finishing off my struggling servers. I found this:

http://ca2.php.net/manual/fr/function.sys-getloadavg.php

anarcat’s picture

Status: Active » Fixed

This required a little restructuring. I wanted to abort in a drush_init() so that we bootstrap as little as possible. So I had to move the hosting-dispatch command from a callback to a regular command (r8543f5255148).

The fix for this itself is in rf39e00906f0f. I added two functions: provision_count_cpus() and provision_load_critical().

Both should work on all platforms except Windows, but provision_count_cpus() currently returns FALSE on anything other than Linux. Extensions can easily be written for other platforms; unfortunately, there's no way to tell the number of CPUs from within PHP reliably, so we need hacks to figure that out. So basically, provision_count_cpus() is Linux-only.
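
To illustrate the kind of Linux-only hack this refers to, here is a rough shell sketch that counts the "processor" entries in /proc/cpuinfo and reports failure when that is not possible. It is not the provision code: the function name `count_cpus`, its optional file argument, and the choice of /proc/cpuinfo as the source are all assumptions made for the example.

```sh
#!/bin/sh

# Hypothetical helper (not provision_count_cpus() itself): print the
# number of CPUs found in a cpuinfo-style file, or print nothing and
# return 1 when the number cannot be determined.
# The optional argument defaults to /proc/cpuinfo, which is Linux-only.
count_cpus() {
    cpuinfo=${1:-/proc/cpuinfo}
    [ -r "$cpuinfo" ] || return 1
    # grep -c prints 0 and exits non-zero when nothing matches.
    n=$(grep -c '^processor' "$cpuinfo" || true)
    [ "$n" -gt 0 ] || return 1
    echo "$n"
}

if cpus=$(count_cpus); then
    echo "CPUs: $cpus"
else
    echo "CPU count unknown"
fi
```

On a system without a readable /proc/cpuinfo the function fails, which plays the same role as the FALSE return described above.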

provision_load_critical(), which decides what "critical" means, uses the number of CPUs to figure out the load. That part is platform-agnostic (except windows).

The number of CPUs is important to make some sense of the load and evaluate if it's critical or not. A load of 4 on a 4 CPU system is not uncommon and shouldn't be too much of a problem. A load of 4 on a single CPU is bad and the machine feels unresponsive.

If we don't know the number of CPUs, we assume that a load of 10 (magic number!) is critical. Otherwise, we assume a load of 5 processes per CPU (e.g. 5 for a single CPU, 20 for a 4-CPU machine) is critical.
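
The rule above can be sketched in shell as follows. This only illustrates the thresholds described in this comment and is not the actual provision_load_critical() code; the function name `load_critical`, its arguments, and the decision to treat only loads strictly above the limit as critical are assumptions.

```sh
#!/bin/sh

# Hypothetical helper: load_critical LOAD [CPUS]
# Succeeds (exit status 0) when LOAD is above the critical limit:
#   5 per CPU when the CPU count is known, 10 otherwise.
load_critical() {
    load=$1
    cpus=${2:-}
    if [ -n "$cpus" ] && [ "$cpus" -gt 0 ] 2>/dev/null; then
        limit=$((cpus * 5))
    else
        # Unknown CPU count: fall back to the "magic number".
        limit=10
    fi
    # The load is a decimal number, so let awk do the comparison.
    awk -v l="$load" -v m="$limit" 'BEGIN { exit !(l + 0 > m + 0) }'
}

# A load of 12.3 with an unknown CPU count is above the fallback limit.
if load_critical 12.3; then
    echo "critical, not dispatching"
else
    echo "ok"
fi
```

For example, `load_critical 12.3 4` fails (the limit is 20 on a 4-CPU machine), while `load_critical 6 1` succeeds (the limit is 5 on a single CPU).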

When load is critical, drush just doesn't run, which should help recovery during critical situations (as opposed to now, where Aegir aggravates the problem by spawning more drush bootstrap sequences, see #695244: hosting-cron killed my server for a few good examples).

So basically, I think this answers the spec: it works everywhere, because we have sane defaults if the platform-specific stuff cannot be figured out. We get the load in a platform-agnostic manner.

Testing would be appreciated, but I'm running this in production already.

anarcat’s picture

Ah, and another thing: I have implemented the controls in provision, which means they affect any drush command not defining a callback. This may not be desirable: maybe we want to apply the controls only to hosting-dispatch. If that's the case, we'd just need to move provision_drush_init() and the code for CPU and load detection to hosting.

I do think however, that it's useful to have such controls at the backend level, especially in the context of multiple server support: who cares about the load on the master server if the slave that's supposed to run the task is hosed?

anarcat’s picture

Title: Aegir should check its environment condition before dispatching new tasks/crons + simple recipe » check load before dispatching new tasks/crons

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.