Problem/Motivation

When a queue worker encounters a problem, it has no way of notifying the queue runner that has called it.

In particular, if the problem is of a nature that will affect the whole queue, it means that the queue runner is likely to pointlessly waste time calling the worker over and over again.

For example, this can happen for a queue worker that is calling a remote server, and finds the server is down, or isn't responding correctly.

Proposed resolution

#2021933: Catch exceptions from queue workers (followup: tests don't work) has added the ability for a queue worker to throw an exception, and for queue runners to catch this and keep going.

Therefore, we need a different class of exception to signal 'I'm broken, and I have reason to believe my whole queue is broken too'.

Remaining tasks

Write a change record.

API changes

callback_queue_worker() may now throw an exception of class SuspendQueueException.

Original report by [username]

Consider this code from drupal_cron_run:

foreach ($queues as $queue_name => $info) {
  $function = $info['worker callback'];
  $end = time() + (isset($info['time']) ? $info['time'] : 15);
  $queue = DrupalQueue::get($queue_name);
  while (time() < $end && ($item = $queue->claimItem())) {
    $function($item->data);
    $queue->deleteItem($item);
  }
}

Note that the return value of the 'worker callback' function is never checked. It could (and should) be used to indicate whether or not to delete the item from the queue (see the subsequent line).

Consider a business rule that requires that only certain items in the queue should be processed now. As a concrete example, let's say that the item has a 'send_date' and the 'worker callback' checks this date before sending an email. If the function could return FALSE to indicate not to delete the item and TRUE to indicate that the item should be deleted, then a whole host of issues could be avoided. As it is, the 'worker callback' function has to re-queue the item, but this then results in an infinite loop.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

fgm’s picture

Version: 7.9 » 7.x-dev

This is basically one of the cases for which the the queue:releaseItem() method was included in Queue API. A better behavior would be:

 while (time() < $end && ($item = $queue->claimItem())) {
    if ($function($item->data)) {
      $queue->deleteItem($item);
    }
    else {
      $queue->releaseItem($item);
    }
  }

However, since D7 has not been making any use of the return of worker callbacks since 7.0, it is probably not a good idea to change the API that way, as it could break the behavior of existing sites.

A better approach would probably be to use an exception. Since these are not handled currently, no worker code is likely to be using them, and it could go like:

class QueueWorkerException extends Exception;
while (time() < $end && ($item = $queue->claimItem())) {
  try {
    $function($item->data);
    $queue->deleteItem($item);
  }
  catch (QueueWorkerException $e) {
    // Probably log the event here.
    $queue->releaseItem($item);
  }
superspring’s picture

Version: 7.x-dev » 8.x-dev
Status: Active » Needs review
FileSize
2.28 KB

Here is a patch which implements what is talked about above.

chx’s picture

Status: Needs review » Needs work
Issue tags: +Needs tests

I like this patch a lot, it's going to be superb useful for the new batch API but it needs a test. And nos is a typo, it needs to be noes :)

superspring’s picture

Status: Needs work » Needs review
FileSize
2.28 KB

Same patch with 'noes' spelt correctly.

I've temporarily put it to needs review for the test bot.

superspring’s picture

Spell check again

Status: Needs review » Needs work

The last submitted patch, skip_cron_item-1524550-4.patch, failed testing.

superspring’s picture

Status: Needs work » Needs review
FileSize
4.25 KB

Added a simple test to prove it's functionality.

chx’s picture

There was no second queue before. Why did that appear?

Status: Needs review » Needs work

The last submitted patch, skip_cron_item-1524550-5.patch, failed testing.

superspring’s picture

Status: Needs work » Needs review
FileSize
7.06 KB

The ordering of items being added _back_ to the queue after they've been tried is important. The whole idea fails if an item is claimed, tested, unclaimed and then the same item repeats. The second queue ensures that any items that has been claimed will be added back to the end of the original queue.

Here is another patch with more commenting, a speed up and more testing code.

superspring’s picture

This patch is an improvement to the System queue to remove the need of having a 'second queue' as per @chx and @fiasco's reviews.

superspring’s picture

This is a new patch using this issue's (#1832818: Allow a queue item to be postponed) patch for the queue's ordering guarantee.
Otherwise no changes.

chx’s picture

This will be rtbc a) once the other patch is in b) the superflous releaseItem is removed.

superspring’s picture

As per @chx's review.

Status: Needs review » Needs work

The last submitted patch, skip_cron_item-1524550-14.patch, failed testing.

superspring’s picture

Status: Needs work » Needs review
FileSize
13.33 KB

Same patch as above with diff fix.

kerasai’s picture

Issue tags: -Needs tests

#16: skip_cron_item-1524550-16.patch queued for re-testing.

Status: Needs review » Needs work
Issue tags: +Needs tests

The last submitted patch, skip_cron_item-1524550-16.patch, failed testing.

rbunch’s picture

In progress... Drupalcon sprint, May 24, 2013

subson’s picture

Status: Needs work » Needs review
FileSize
10.77 KB

re-rolling the new patch.

@rbunch - I was working on this issue, re-rolled the new patch. Sorry I forgot to assign it to myself before starting on it.

Status: Needs review » Needs work

The last submitted patch, skip_cron_item-1524550-19.patch, failed testing.

subson’s picture

Status: Needs work » Needs review

trying to run tests again.

subson’s picture

Issue tags: -Needs tests

#20: skip_cron_item-1524550-19.patch queued for re-testing.

Status: Needs review » Needs work
Issue tags: +Needs tests

The last submitted patch, skip_cron_item-1524550-19.patch, failed testing.

socketwench’s picture

Status: Needs work » Needs review
FileSize
10.77 KB

Reroll.

Status: Needs review » Needs work

The last submitted patch, skip_cron_item-1524550-25.patch, failed testing.

star-szr’s picture

Issue tags: +Needs reroll

This needs another reroll.

xjm’s picture

Assigned: Unassigned » chx
Category: feature » bug
Priority: Major » Normal

So, from the summary, it sounds like this might be a bug (as @colinafoley pointed out on IRC). Assigning to chx to get his confirmation on that point.

If it is a bug, the next step in this issue is to add a failing test for the bug that will pass when combined with a fix: https://drupal.org/contributor-tasks/write-tests

Since this issue depends on #1832818: Allow a queue item to be postponed, it would be good to also provide a patch named whatever-do-not-test.patch alongside the full one that just shows the changes this issue will add on top of that one. We'll want that issue to go in first, but we have more work we can do here in the meanwhile.

Crell’s picture

chx’s picture

Assigned: chx » xjm

This functionality didn't exist before -- although it should have been -- but it's definitely an API addition. Not sure whether "it should've been" makes it a bug or a feature request. I will let xjm decide.

chx’s picture

Title: drupal_cron_run should respect the return value of the 'worker callback' » queue cron workers can't signal an uncompleted job
Assigned: xjm » Unassigned

And by that title it's a bug although one that can only be solved by an API addition; however since the exception idea it's not an API change but an addition so it's a go.

xjm’s picture

Issue tags: +API addition

Works for me.

Crell’s picture

By #31, do you mean using an exception to signal "worker fail, try again later" makes this not an API break? (I hadn't intended that with the other issue, but I'm fine with it as that's how update hooks work now.)

David Hernández’s picture

Status: Needs work » Needs review
FileSize
10.84 KB

Quick reroll. Updated some deprecated functions (variable_get/set) and fixed some minor coding standard errors. Maybe I left some deprecated functions.

Status: Needs review » Needs work

The last submitted patch, skip_cron_item-1524550-34.patch, failed testing.

David Hernández’s picture

Status: Needs work » Needs review
FileSize
883 bytes
10.85 KB

Fixed info.yml file

Status: Needs review » Needs work

The last submitted patch, skip_cron_item-1524550-36.patch, failed testing.

David Hernández’s picture

Status: Needs work » Needs review
FileSize
8.66 KB

Fixing 3 out of 4 tests was easy. But I've been trying to resolve the last one for a couple hours. Unfortunately, my time is over and I have to leave the DrupalCon Prague sprint. I've tracked the issue to the common_test_cron_exception_helper. Looks like the function common_test_cron_exception_helper_callback is not being called and I wasn't able to find why.

I've attached my last version, so maybe someone can fix it for me.

Marked needs review to show the improvements and have the test report.

Status: Needs review » Needs work

The last submitted patch, skip_cron_item-1524550-38.patch, failed testing.

marthinal’s picture

Status: Needs work » Needs review
FileSize
3.46 KB
11.8 KB

1) Each time we do this :

while (time() < $end && ($item = $queue->claimItem())) {

We call rows with expire as 0. I made a few tests manually and always when I call moveToTheEnd() method the other rows seem never accessed.

I have fixed the problem by adding the value 0 to expire once the while loop is ended. But I really don't know the best way...

2) About QueueTest, if we apply the patch, we add different weight per row. So the current test should verify the situation with the weight, right?

3) If we increase the value of the weight each time we'll have very high values in the case we always have an exception. I don't know if this could be a problem.

Hope it helps.

joachim’s picture

Issue summary: View changes
Status: Needs review » Needs work

#2021933: Catch exceptions from queue workers (followup: tests don't work) is now in.

Either this needs a reroll, or it could be considered a duplicate, as it's now possible to signal that a job wasn't completed by throwing an exception.

joachim’s picture

One thing that might be interesting as an addition to just the worker throwing an exception would be to have the worker able to signal that the *whole queue* is currently fubar and should be skipped.

Example: queued items are data to be fetched or pushed to a remote site. If one worker fails because the remote site is down there is no point in cron trying to work on more items from this queue.

If the worker callback is capable of knowing that this is indeed the reason for the failure, it could signal to system_cron() that the current queue should be skipped completely on this cron run.

Crell’s picture

The obvious way to do that would be to allow a special exception to be thrown to indicate the entire queue should be paused. Any other exception just requeues that one item and on we go to the next one.

How the queue system would mark an entire queue as "paused", I have no idea. :-) But that would be the obvious API for workers.

joachim’s picture

I think this could be easily done in https://api.drupal.org/api/drupal/core!lib!Drupal!Core!Cron.php/function... at least:

    foreach ($queues as $queue_name => $info) {
      if (isset($info['cron'])) {
        $callback = $info['worker callback'];
        $end = time() + (isset($info['cron']['time']) ? $info['cron']['time'] : 15);
        $queue = $this->queueFactory->get($queue_name);
        while (time() < $end && ($item = $queue->claimItem())) {
          try {
            call_user_func_array($callback, array($item->data));
            $queue->deleteItem($item);
          }
          catch (Exception $e) {
            // In case of exception log it and leave the item in the queue
            // to be processed again later.
            watchdog_exception('cron', $e);
          }
          catch (\Drupal\Core\Queue\SuspendQueue $e) {
            // If the worker indicates there is a problem with the whole queue,
            // skip it.
            watchdog_exception('cron', $e);

            continue 2;
          }
        }
      }
    }
joachim’s picture

Title: queue cron workers can't signal an uncompleted job » queue cron workers can't signal a broken queue
Status: Needs work » Needs review
FileSize
1.68 KB

Here's a patch that does that.

Crell’s picture

+++ b/core/lib/Drupal/Core/Cron.php
@@ -142,6 +143,14 @@ public function run() {
+          catch (SuspendQueueException $e) {

The more specific catch statement must come first; as is, the first Exception reference will catch SuspendQueueException as well and this block will never be called.

This will bypass that queue for this cron run. Is that sufficient? Probably. For a waiting queue, though (which is what we ought to be using rather than a polling queue), we'd need something more robust. Probably the exception should indicate how long the queue should be paused for. That may be follow-up material, though.

joachim’s picture

> The more specific catch statement must come first; as is, the first Exception reference will catch SuspendQueueException as well and this block will never be called.

Ok, will reroll.

I'll see if I have time to add a test.

What would be the best way for the queue worker to indicate to the test case how many times it was been called? All I can think of is setting a system variable each time it's called.

joachim’s picture

Fixes the exception catching. Not had time to tests for this I'm afraid.

joachim’s picture

Issue tags: -Needs reroll

Depending on status of #2208649: document queue worker callback, this will either need to update the documentation that issue adds, or reroll the patch at that issue.

Crell’s picture

joachim: Both issues look good to me. To avoid rerolling too much, please fold them both into one patch here and I can RTBC it. :-) (As is I'd RTBC both, but...)

joachim’s picture

Thanks for the review.
Though I think I'd rather keep them separate, as the docs issue will need backporting to 7.
Plus I'm not sure it would actually save us a reroll -- I'd reroll now, rather than reroll whichever of the two patches doesn't get in first, if you see what I mean.

Crell’s picture

Status: Needs review » Reviewed & tested by the community

Eh, whatever. I defer to the maintainers on the logistics.

joachim’s picture

Status: Reviewed & tested by the community » Needs work

The patch at #2208649: document queue worker callback won the race, therefore I'll reroll this with the documentation changes that it incurs.

joachim’s picture

Status: Needs work » Needs review
FileSize
3.02 KB

Added mention of this to the callback docs.

Crell’s picture

Status: Needs review » Reviewed & tested by the community

And we're back.

catch’s picture

Status: Reviewed & tested by the community » Needs work
Issue tags: +Needs change record

The new exception should be documented in a draft change notice for drush/waiting_queue and other queue runners so they can update to use it correctly.

Could also do with a (short) issue summary update since the original use case presented doesn't really match what we've ended up with.

joachim’s picture

Issue summary: View changes

Updated issue summary.

joachim’s picture

joachim’s picture

Status: Needs work » Needs review
Crell’s picture

Status: Needs review » Reviewed & tested by the community
Issue tags: -Needs change record

I tweaked the change notice a little. Back to RTBC.

joachim’s picture

> It is up to the runner to determine when it is safe to try that queue again.

I'm not sure a runner would have the means to do that!

Crell’s picture

In most cases it's probably some sort of timeout-retry. That's effectively what the cron runner does, ie, try again on the next cron run.

alexpott’s picture

Status: Reviewed & tested by the community » Needs work

As of #3 we need a test - but we don't appear to have one.

joachim’s picture

Status: Needs work » Needs review
FileSize
5.92 KB

Added a test.

Status: Needs review » Needs work
joachim’s picture

Status: Needs work » Needs review
FileSize
5.95 KB

Whoops.

Status: Needs review » Needs work
joachim’s picture

Ah I get it:

    // Run cron; the worker for this queue should process as far as the crashing
    // item.
    $this->cronRun();

    // Only one item should have been processed.
    $this->assertEqual($queue->numberOfItems(), 2, 'Failing queue stopped processing at the failing item.');

    // Check the items remaining in the queue.
    $item = $queue->claimItem();
    $this->assertEqual($item->data, 'crash', 'Failing item remains in the queue.');
    $item = $queue->claimItem();
    $this->assertEqual($item->data, 'ignored', 'Item beyond the failing item remains in the queue.');

The queue item that threw the exception is still claimed, hence the test code can't claim it.

I don't see an API for releasing ALL of a queue's leases.

Should we:

a) have cron run release the item that throws the exception, which would also fix the test
b) change the test to inspect the queue's DB table directly rather than go via the API?

I'm really not sure which is best. Any thoughts?

Crell’s picture

I think A is best. In practice it shouldn't change the production-time effect for most queues, but seems more robust. Ie, if we know that an item is unprocessable we should explicitly say as much, since we're releasing the whole queue in this case anyway.

Also, coupling the tests to the DB implementation seems like a horrible idea, as this is testing the queue system, not the DB implementation thereof.

joachim’s picture

Status: Needs work » Needs review
FileSize
6.25 KB

Done.

Fixed the docs for the new exception class too, which didn't have a proper 1 line summary.

Crell’s picture

Status: Needs review » Reviewed & tested by the community

Seems reasonable to me.

alexpott’s picture

Status: Reviewed & tested by the community » Fixed

Committed 082bf59 and pushed to 8.x. Thanks!

  • Commit 082bf59 on 8.x by alexpott:
    Issue #1524550 by superspring, joachim, David Hernández, marthinal,...

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.