So that was fun: after discovering that cron jobs had been stuck since our latest upgrade (see #695236: cron never runs), I was able to run cron jobs using hosting-cron. Somehow, it managed to completely crash (well, almost crash) my server with a fork-bomb-like attack.
top - 17:12:22 up 69 days, 23:16, 6 users, load average: 129.00, 111.08, 57.22
Load eventually went up to 170 as Apache and drush competed for resources on the server. I ended up killing all PHP processes and restarting Apache, and the machine recovered.
Comments
Comment #1
anarcat commented:
I'm now using this patch in production:
It makes cron a little more ... polite.
Comment #2
anarcat commented:
Fixed in r859c05368bc7.
Comment #3
anarcat commented:
That's not enough: my server died again. What I noticed first was the extreme load on the server:
Now, that can be partly explained by the load of the Apache server, but the problem is once load is too high, everything degenerates. Here's a processlist dump of the aegir user:
I think we could probably use a locking mechanism for cron. Note that foo.test.koumbit.net is a particularly heavy site, so having multiple cron jobs like this is just insane.
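To make the locking idea concrete, here is a minimal Python sketch. It is not Aegir's code: `cron_lock` and the lock path are hypothetical names, and it assumes a Unix system where `fcntl.flock` is available. A second run that finds the lock taken gets `False` and can exit instead of piling up.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def cron_lock(path):
    """Yield True if this run holds the exclusive lock, False otherwise.

    `path` is a hypothetical lock file, e.g. one per site or one per server.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        try:
            # LOCK_NB: fail immediately instead of waiting for the other run.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False  # another cron run is still active
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A caller would wrap the cron run in `with cron_lock(path) as acquired:` and skip the run when `acquired` is false. Because the kernel drops a `flock` when the holding process exits, a lock of this kind does not outlive a crashed or killed run.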
Comment #4
anarcat commented:
#623920: check load before dispatching new tasks/crons has implemented load controls that may help with this. Now we just need to figure out why there are duplicate cron jobs being started here...
Comment #5
adrian commented:
Either one or the other, not both: #366387: cronjob queue not aggressive enough
Comment #6
SocialNicheGuru commented:
This patch seems to work for me.
I need to verify that Aegir cron activates cron for my sites. I will report back.
If a site cron does not complete for whatever reason, does that mean that the Aegir cron will also stop? If that is the case, I would like to know which site cron halted or ended in an error. It would be nice to have it displayed on the hostmaster > admin > queue tab.
Comment #7
omega8cc commented:
There are two separate Aegir cron queues, one for tasks and one for sites. Each uses its own semaphore, and both can run into the locked-semaphore issue. Currently I'm using some simple fixes to avoid both an oversized tasks/sites cron queue and high server load:
1. This patch makes the site crons run in sequence instead of starting them all at once: http://drupal.org/node/931550#comment-3533854
2. This patch auto-releases old (over 1h) locked semaphores for sites/tasks: http://drupal.org/node/931550#comment-3762510
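The auto-release rule in the second patch can be sketched roughly as follows. This is an illustration in Python, not the patch itself: the `store` dict stands in for wherever the semaphore timestamp is actually kept, and `try_acquire` and `release` are hypothetical names. Only the one-hour threshold comes from the description above.

```python
import time

STALE_AFTER = 3600  # seconds; the "over 1h" threshold from the patch description


def try_acquire(store, name, now=None, stale_after=STALE_AFTER):
    """Try to take semaphore `name`; return True on success.

    `store` maps a semaphore name to the time it was locked. A lock at least
    `stale_after` seconds old is assumed to belong to a dead run and is
    taken over.
    """
    now = time.time() if now is None else now
    locked_at = store.get(name)
    if locked_at is not None and now - locked_at < stale_after:
        return False  # a recent run still holds the semaphore
    store[name] = now
    return True


def release(store, name):
    """Drop the semaphore when the run finishes normally."""
    store.pop(name, None)
```

With a real database the check and the update would have to happen atomically; the dict version skips that concern for brevity.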
Also, Barracuda and Octopus use a shell script wrapper to run the Aegir system cron jobs. It makes sure we do not start any Aegir crons (tasks/sites) when the system load is too high (the built-in Aegir load check is too slow to avoid issues), or while the system is performing auto-healing, backups, database integrity checks, etc., any of which could cause Aegir crons to fail or generate high system load.
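A load gate of that kind might look like the following Python sketch. It is not the Barracuda or Octopus wrapper (which is a shell script); the function names and the `max_load` threshold are assumptions, and only the load check is shown, not the backup or integrity-check guards.

```python
import os
import subprocess


def load_ok(max_load, loadavg=os.getloadavg):
    """Return True if the 1-minute load average is below `max_load`."""
    one_minute = loadavg()[0]
    return one_minute < max_load


def run_if_idle(command, max_load, loadavg=os.getloadavg, run=subprocess.run):
    """Run `command` only when the system load is acceptable.

    Returns None when the run is skipped, otherwise whatever `run` returns.
    `loadavg` and `run` are injectable so the gate can be tested without
    touching the real system.
    """
    if not load_ok(max_load, loadavg):
        return None  # too busy: skip this round and let the next one retry
    return run(command, check=False)
```

A system crontab entry could then call something like `run_if_idle(["drush", "@hostmaster", "hosting-cron"], max_load=4.0)`, where both the exact drush invocation and the threshold are placeholders to adapt.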
[EDIT] Furthermore, my default setting for the tasks queue is to allow only one (1) task every 5 seconds. This helps to avoid many issues with race conditions when a user queues a few tasks on the same site, or on a few sites using a heavy distro and/or a large "files" directory. I believe we should use this value by default.
Comment #8
anarcat commented:
Hi!
#931550: when hosting-cron is interrupted, it keeps its semaphore in the database (reboot problem) was committed.
I am not sure what #931550: when hosting-cron is interrupted, it keeps its semaphore in the database (reboot problem) accomplishes: is pausing 3 seconds really helping? How about 5 or 1? It feels really arbitrary to me. Furthermore, looking at the code, it seems to me that the tasks *are* run in sequence:
No parallel execution here...
As for the tasks queue, it seems like a separate issue to me: we're talking about an evil hosting-cron here.
Marking this as fixed: I haven't seen cron kill my server recently, although mass migrates... but that's a different issue!
Comment #9
omega8cc commented:
The 3s sleep makes the sequence "real" and safer. Without it, the crons are started in such quick succession that they are in fact "parallel".
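The difference can be sketched in Python as below. This is an illustration only, not the hosting-cron code: `run_crons_in_sequence` is a hypothetical helper, and the 3-second default simply mirrors the value discussed here.

```python
import subprocess
import time


def run_crons_in_sequence(commands, pause=3.0, run=subprocess.run, sleep=time.sleep):
    """Run each command in turn, pausing `pause` seconds between launches.

    With the default `run` (subprocess.run) each command finishes before the
    next one starts. If `run` is swapped for a launcher that returns at once,
    the pause is the only thing spacing the jobs out.
    """
    results = []
    for index, command in enumerate(commands):
        if index > 0:
            sleep(pause)  # breathing room before the next site cron
        results.append(run(command, check=False))
    return results
```

Whether 3 seconds is the right value depends on how the crons are launched: a blocking launcher already serializes them, while a backgrounding one relies entirely on the pause.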
Comment #11
SocialNicheGuru commented:
Hi.
Is there a way to reset each semaphore manually?
Also, I just installed aegir-rc1 and neither of the patches in #7 was part of the hostmaster install.
Comment #12
anarcat commented:
Please keep fixed issues fixed. Maybe you are referring to #931550: when hosting-cron is interrupted, it keeps its semaphore in the database (reboot problem)?