First, let me state that this is unrelated to #2150787: Queue daemon fails to restart itself, despite the nearly identical titles. I was also experiencing the issue described there, where the queue daemon fails to restart on PEAR-based installs, but I resolved that with the included patch.

Periodically, and seemingly at random, I discover that the Hosting queue daemon is not running after waiting a few minutes for an Aegir task to kick off. My first sign that the daemon is probably down is at /admin/hosting/queued, where the "Last started:" entry is older than the 30-minute restart interval I have set. I then confirm that it is not running by checking the status of the hosting-queued service from the command line of the server. I can start the service from the command line and it always starts successfully. It will run fine for hours, days, or even weeks at a time and then, without explanation or any error that I can find, it will fail to restart during one of the regularly scheduled daemon restarts.

I know I can write a quick cron job or something to start the daemon if it isn't already running, but isn't that the point of what the daemon already does? I am mostly hoping to hear from anyone who has experienced a similar issue and how they solved it, or from anyone with ideas on where on the system to look for a log message or something similar that would help me determine why it is failing to restart. I am running RHEL. The instantaneous task execution advantage of using hosting-queued over the cron-based task runner is quickly lost when my coworkers and I are unable to get tasks to execute, so this is obviously something we are very interested in fixing. I am tempted to just go back to the cron runner, but I feel like there is something deeper going on here that isn't working the way it should.

Comments

jwestcu created an issue. See original summary.

helmo’s picture

The 7.x version has some improvements on this, but it's also not perfect.

One reason I've seen is a database server that is down. Even a few seconds of downtime during an upgrade ... gives a fatal PHP error which is not caught.
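To illustrate the general idea of surviving a brief outage rather than dying on the first failure, here is a minimal sketch in Python. It is not the hosting-queued code (which is PHP), and the names `run_with_retry`, `attempts`, and `delay` are hypothetical, chosen only for this example.

```python
import time


def run_with_retry(task, attempts=5, delay=2.0, sleep=time.sleep):
    """Call task(); on failure, wait and retry instead of giving up at once.

    Hypothetical helper for illustration only. A real daemon would catch
    just its database connection errors, not every Exception.
    """
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise  # still failing after the last attempt: give up
            sleep(delay * attempt)  # linear backoff: 2s, 4s, 6s, ...
```

With the defaults above, an outage would have to outlast four waits (2 + 4 + 6 + 8 seconds) before the error propagates, which would cover a short restart of the database server.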

jpwester’s picture

Yes, I suspect that might be my issue as well.

Without abandoning the daemon altogether, there are two options as I see it. The first would obviously be to write the cron job I mentioned to check the status of the daemon and start it if it isn't running. The other option is likely a little more complicated. At least in my setup, the aegir user on our servers doesn't have the ability to start/stop the service; I have to use my own non-aegir user with sudo for that. In theory, I suppose I could get our sysadmins to allow the aegir user to start/stop the service and then use a hook or something to check the status of the service upon the creation of a task. I haven't looked too closely at the code, but knowing Drupal, I'm sure there's a way to do that. :)
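For reference, the first option could look roughly like the sketch below. It assumes the init script follows the usual convention that `service hosting-queued status` exits with 0 only when the daemon is running; that has not been verified against the Aegir init script. The function names are made up for this example, and the script would have to run as a user that is allowed to start the service.

```python
import subprocess

SERVICE = "hosting-queued"  # service name as used in this thread


def is_running(service, run=subprocess.run):
    # Assumption: "service <name> status" exits 0 only when the daemon is up.
    result = run(["service", service, "status"], capture_output=True)
    return result.returncode == 0


def ensure_running(service, run=subprocess.run):
    """Start the service if it is down.

    Returns True if a start was attempted, False if it was already running.
    The `run` parameter exists only so the command runner can be swapped out.
    """
    if is_running(service, run):
        return False
    run(["service", service, "start"], check=True)
    return True
```

Calling `ensure_running(SERVICE)` from a script that cron runs every few minutes would cover the case where a scheduled restart fails, at the cost of tasks waiting until the next cron run.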

jpwester’s picture

I meant to add some sort of question to my last post. Do these two options sound realistic or am I missing something? Has anybody else done anything similar in the past? I just don't want to get too deep in this and find out I was way off base. Thanks!

helmo’s picture

There has been talk about using something else to replace our own PHP code: a more standalone, mature queue thing ... but that's long term. ergonlogic has been experimenting with this, though.

jpwester’s picture

Interesting, I'd be curious to see what comes of that for sure!

What I ended up doing was getting our server guys to allow the apache user to manage the service, and I wrote a tiny module that checks the status of the service and starts it if necessary on any node insert or update (hook_nodeapi). This was the best way I came up with to fire up the service upon task creation in Aegir.

ergonlogic’s picture

Status: Active » Closed (outdated)

The 6.x-2.x branch will go EOL along with Drupal this week. So I'm closing this issue. If it remains a confirmed issue in 7.x-3.x, feel free to re-open, or better yet, create a new issue referencing this one.

As for the queue daemon, see: #2672530: Adopt Python queue daemon replacement