After running the apt-get upgrade command, aegir3-hostmaster gets stuck halfway through being configured.

At first it appeared to be stuck at the message "Platforms path /var/aegir/platforms is writable."

Thanks to a helpful person on IRC, I was able to get additional debugging information. See the attached file.

Now I see it is actually stuck on

Executing: mysql --defaults-extra-file=/tmp/drush_07o4Pu --database=exampledatabase_0 --host=localhost --port=3306 --silent < /tmp/drush_ixjEKl

We were also able to determine that the mysql connection is open, but idle.

| 409 | exampledatabase_0 | localhost | exampledatabase_0 | Sleep | 1 | | NULL | 0.000 |
| 1168 | exampledatabase_0 | localhost | exampledatabase_0 | Sleep | 0 | | NULL | 0.000 |
| 1211 | exampledatabase_0 | localhost | exampledatabase_0 | Sleep | 1 | | NULL | 0.000 |

I am using Debian Jessie and all other pending upgrades have processed successfully.

Any assistance is highly appreciated since this has caused our server to become unavailable.

Workaround

Check whether there is a task in the Aegir task queue. Either make sure it has finished before upgrading, or remove it.
When viewing the task node, use the 'Edit' tab to find the node ID, then visit example.com/node/1234/delete (substituting that node ID) to get rid of it.
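
The workaround can be sketched as follows. This is a minimal illustration, not Aegir code: the task records are a hypothetical in-memory stand-in, the field names `nid` and `status` are assumptions, and -1 is the "processing" status reported later in this thread.

```python
# Hypothetical stand-in for task nodes. Aegir keeps these in its database;
# the field names used here are illustrative assumptions only.
PROCESSING = -1  # status the thread reports for tasks stuck "processing"


def delete_urls(tasks, base_url):
    """Return the /node/<nid>/delete URL for every task still processing."""
    return [
        f"{base_url}/node/{task['nid']}/delete"
        for task in tasks
        if task["status"] == PROCESSING
    ]


tasks = [
    {"nid": 1234, "status": PROCESSING},
    {"nid": 1240, "status": 1},  # any other status is treated as not running
]
for url in delete_urls(tasks, "http://example.com"):
    print(url)
```

Each printed URL corresponds to one task node that would need to be finished or deleted before the upgrade is started.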

Comments

g33kg1rl created an issue. See original summary.

g33kg1rl’s picture

An strace on the process showed this output repeating over and over:

poll([{fd=6, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
sendto(6, "p\0\0\0\3SELECT COUNT(t.vid) FROM ho"..., 116, 0, NULL, 0) = 116
recvfrom(6, "\1\0\0\1\1\"\0\0\2\3def\0\0\0\fCOUNT(t.vid)\0\f?"..., 16384, 0, NULL, NULL) = 67
poll([{fd=6, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
sendto(6, "x\0\0\0\3SELECT count(t.nid) FROM no"..., 124, 0, NULL, 0) = 124
recvfrom(6, "\1\0\0\1\1\"\0\0\2\3def\0\0\0\fcount(t.nid)\0\f?"..., 16384, 0, NULL, NULL) = 67
write(2, "\0DRUSH_BACKEND:{\"type\":\"message\""..., 186) = 186
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0},

g33kg1rl’s picture

I checked what was in the /tmp/drush_ixjEKl file and found SHOW TABLES;

g33kg1rl’s picture

Status: Active » Closed (fixed)

OK I figured out the solution. I had to do a manual upgrade as per this documentation --> http://docs.aegirproject.org/en/latest/install/upgrade/#upgrades-with-de...

colan’s picture

In my case, I couldn't get debugging turned on as neither of these added any extra output:

  • env DPKG_DEBUG=yes sudo apt full-upgrade
  • env DPKG_DEBUG=developer sudo apt full-upgrade

I had a task (of which I was unaware) that was still running (status -1). It was causing a wait state in drush_hosting_pause_validate(). Because debugging messages couldn't be turned on, I couldn't see this:

Waiting for the task queue to be processed and tasks to complete.

Once I deleted the task, all was well, and apt could continue.

Thanks to helmo for getting me on the right track!
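
A minimal sketch of the wait state described above, assuming the validate step simply polls until no task is running. The function name, the `count_running` callback and the `max_polls` parameter are hypothetical; the one-second interval is inferred from the `nanosleep({1, 0}` call in the strace output earlier in the thread.

```python
import time


def wait_for_tasks(count_running, poll_seconds=1.0, max_polls=None,
                   sleep=time.sleep):
    """Poll until count_running() returns 0.

    Returns True once no tasks are running, or False if max_polls is
    reached first. With max_polls=None, a task that never leaves the
    running state keeps this loop waiting forever, which models the hang
    reported in this issue.
    """
    polls = 0
    while count_running() > 0:
        if max_polls is not None and polls >= max_polls:
            return False
        sleep(poll_seconds)
        polls += 1
    return True


# A task stuck in the running state never lets the count reach zero.
stuck = lambda: 1
print(wait_for_tasks(stuck, max_polls=3, sleep=lambda s: None))  # False
```

Deleting the stuck task corresponds to `count_running` returning 0, at which point the loop exits and the upgrade can continue.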

matthewgann’s picture

I can confirm that I had multiple tasks stuck at processing (-1) that were causing the update to stall, as well as a few other issues. I created a view to list those tasks and removed them. The upgrade then finished processing.

Thanks @colan and @helmo

g33kg1rl’s picture

I have been running into this issue every time I try to upgrade. I can confirm that creating the view to see the processing tasks and removing them will allow the upgrade to finish.

helmo’s picture

Issue summary: View changes
helmo’s picture

Version: 7.x-3.6 » 7.x-3.x-dev
Status: Closed (fixed) » Needs review
FileSize
589 bytes

I'd like to address the root cause here ... the upgrade process is waiting until all tasks in the queue are finished.

Waiting for running tasks seems like a good thing; let's address the visibility of such messages in #2861696: Extra output via debugging messages cannot be enabled on Debian package upgrades

I propose to NOT wait for tasks that are queued... this patch does just that.

Jon Pugh’s picture

I'm now experiencing this while trying desperately to get the upgrade test working in Travis for devshop: https://travis-ci.org/opendevshop/devshop/jobs/224067830

Funny, I spent a lot of time struggling to figure out why it was hanging before finding this issue.

"I'd like to address the root cause here ... the upgrade process is waiting until all tasks in the queue are finished."

Well that explains why I can't make it work in the upgrade test, because it has to run as a single process, all the way through to the behat tests. There's no separate queue runner at all in docker, so...

I'll try out this patch in that devmaster test!

Jon Pugh’s picture

Component: Debian package » Code

It's not just Debian. This is also happening in the hostmaster-migrate command.

Jon Pugh’s picture

Ok, I have a better question: What is the point of this command?

/**
 * Drush command to pause the Aegir frontend queues.
 *
 * This is really just deleting our code from the crontab.
 */
function drush_hosting_pause($url) {
  // Wipe out cron entry.
  _hosting_setup_cron($add = FALSE);
}

All this command does is remove the cron entry. I don't see why this specific drush command should wait for tasks to finish processing. It doesn't stop Hosting Queued, and it doesn't stop tasks from being run manually with drush.

In fact, I can see this being a problem, in that if there are tasks stuck in processing, cron will just keep on running and will never be turned off, which might trigger new tasks... which will keep the command waiting even longer!

I propose we remove this validate hook completely and see what happens.

Jon Pugh’s picture

Sadly, after testing I have found that on hostmaster-migrate, the drush hosting-pause command is running from the old codebase.

So we are essentially stuck with this problem until after the next release.

I guess that means we need some kind of manual intervention when upgrading?

Still looking into this...

helmo’s picture

We could extend the workaround offered here in the summary to suggest applying a small patch to the previous platform.

But yes ... hosting-pause might be from a time when we had no queue daemon. Removing it seems very tempting.

helmo’s picture

In the Debian package upgrade we call 'service hosting-queued stop' from debian/aegir3-hostmaster.postinst, so both cron and queued would be disabled during such an upgrade.
I think there are certainly things that can go wrong when you run regular tasks during a hostmaster-migrate.

It might be nice to let hosting-pause also block the queued though.

helmo’s picture

Here's an UNTESTED patch. It adds a time-out to the query for running tasks, now set to 3600 seconds (1 hour).

helmo’s picture

new patch ... < should be >

Jon Pugh’s picture

So it only loads tasks that were started in the last hour?

helmo credited viashimo.

helmo’s picture

Status: Needs review » Fixed

I increased the time-out to 8 hours.

helmo’s picture

@Jon Pugh: It's not loading them, just checking for any running tasks. And it now ignores a task that has the running status but was started more than 8 hours ago.
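
The check described above can be sketched like this. It is an illustration of the idea only, not the committed patch: the record layout, the field names `status` and `started`, and the function name are hypothetical, and -1 is the running status mentioned earlier in the thread.

```python
from datetime import datetime, timedelta

PROCESSING = -1                   # running status, as reported in this thread
STALE_AFTER = timedelta(hours=8)  # time-out described in the comment above


def count_blocking_tasks(tasks, now):
    """Count running tasks, ignoring any started more than 8 hours ago."""
    cutoff = now - STALE_AFTER
    return sum(
        1
        for task in tasks
        if task["status"] == PROCESSING and task["started"] > cutoff
    )


now = datetime(2017, 4, 20, 12, 0)
tasks = [
    {"status": PROCESSING, "started": datetime(2017, 4, 20, 11, 30)},  # recent
    {"status": PROCESSING, "started": datetime(2017, 4, 19, 12, 0)},   # stale
    {"status": 1, "started": datetime(2017, 4, 20, 11, 45)},           # not running
]
print(count_blocking_tasks(tasks, now))  # 1
```

Only the recently started running task is counted, so a task left in the running state for longer than the time-out no longer blocks the upgrade.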

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

milovan’s picture

I can confirm that the patch from #18 works on the latest version. It solved the issue; please commit it if you haven't already.