Problem/Motivation
When a queue worker encounters a problem, it has no way of notifying the queue runner that has called it.
In particular, if the problem is of a nature that will affect the whole queue, it means that the queue runner is likely to pointlessly waste time calling the worker over and over again.
For example, this can happen for a queue worker that is calling a remote server, and finds the server is down, or isn't responding correctly.
Proposed resolution
#2021933: Catch exceptions from queue workers (followup: tests don't work) has added the ability for a queue worker to throw an exception, and for queue runners to catch this and keep going.
Therefore, we need a different class of exception to signal 'I'm broken, and I have reason to believe my whole queue is broken too'.
Remaining tasks
Write a change record.
API changes
callback_queue_worker() may now throw an exception of class SuspendQueueException.
Original report by [username]
Consider this code from drupal_cron_run:
foreach ($queues as $queue_name => $info) {
$function = $info['worker callback'];
$end = time() + (isset($info['time']) ? $info['time'] : 15);
$queue = DrupalQueue::get($queue_name);
while (time() < $end && ($item = $queue->claimItem())) {
$function($item->data);
$queue->deleteItem($item);
}
}
Note that the return value of the 'worker callback' function is never checked. It could (and should) be used to indicate whether or not to delete the item from the queue (see the subsequent line).
Consider a business rule that requires that only certain items in the queue should be processed now. As a concrete example, let's say that the item has a 'send_date' and the 'worker callback' checks this date before sending an email. If the function could return FALSE to indicate not to delete the item and TRUE to indicate that the item should be deleted, then a whole host of issues could be avoided. As it is, the 'worker callback' function has to re-queue the item, but this then results in an infinite loop.
Comments
Comment #1
fgmThis is basically one of the cases for which the the queue:releaseItem() method was included in Queue API. A better behavior would be:
However, since D7 has not been making any use of the return of worker callbacks since 7.0, it is probably not a good idea to change the API that way, as it could break the behavior of existing sites.
A better approach would probably be to use an exception. Since these are not handled currently, no worker code is likely to be using them, and it could go like:
Comment #2
superspring CreditAttribution: superspring commentedHere is a patch which implements what is talked about above.
Comment #3
chx CreditAttribution: chx commentedI like this patch a lot, it's going to be superb useful for the new batch API but it needs a test. And nos is a typo, it needs to be noes :)
Comment #4
superspring CreditAttribution: superspring commentedSame patch with 'noes' spelt correctly.
I've temporarily put it to needs review for the test bot.
Comment #5
superspring CreditAttribution: superspring commentedSpell check again
Comment #7
superspring CreditAttribution: superspring commentedAdded a simple test to prove it's functionality.
Comment #8
chx CreditAttribution: chx commentedThere was no second queue before. Why did that appear?
Comment #10
superspring CreditAttribution: superspring commentedThe ordering of items being added _back_ to the queue after they've been tried is important. The whole idea fails if an item is claimed, tested, unclaimed and then the same item repeats. The second queue ensures that any items that has been claimed will be added back to the end of the original queue.
Here is another patch with more commenting, a speed up and more testing code.
Comment #11
superspring CreditAttribution: superspring commentedThis patch is an improvement to the System queue to remove the need of having a 'second queue' as per @chx and @fiasco's reviews.
Comment #12
superspring CreditAttribution: superspring commentedThis is a new patch using this issue's (#1832818: Allow a queue item to be postponed) patch for the queue's ordering guarantee.
Otherwise no changes.
Comment #13
chx CreditAttribution: chx commentedThis will be rtbc a) once the other patch is in b) the superflous releaseItem is removed.
Comment #14
superspring CreditAttribution: superspring commentedAs per @chx's review.
Comment #16
superspring CreditAttribution: superspring commentedSame patch as above with diff fix.
Comment #17
kerasai CreditAttribution: kerasai commented#16: skip_cron_item-1524550-16.patch queued for re-testing.
Comment #19
rbunch CreditAttribution: rbunch commentedIn progress... Drupalcon sprint, May 24, 2013
Comment #20
subson CreditAttribution: subson commentedre-rolling the new patch.
@rbunch - I was working on this issue, re-rolled the new patch. Sorry I forgot to assign it to myself before starting on it.
Comment #22
subson CreditAttribution: subson commentedtrying to run tests again.
Comment #23
subson CreditAttribution: subson commented#20: skip_cron_item-1524550-19.patch queued for re-testing.
Comment #25
socketwench CreditAttribution: socketwench commentedReroll.
Comment #27
star-szrThis needs another reroll.
Comment #28
xjmSo, from the summary, it sounds like this might be a bug (as @colinafoley pointed out on IRC). Assigning to chx to get his confirmation on that point.
If it is a bug, the next step in this issue is to add a failing test for the bug that will pass when combined with a fix: https://drupal.org/contributor-tasks/write-tests
Since this issue depends on #1832818: Allow a queue item to be postponed, it would be good to also provide a patch named
whatever-do-not-test.patch
alongside the full one that just shows the changes this issue will add on top of that one. We'll want that issue to go in first, but we have more work we can do here in the meanwhile.Comment #29
Crell CreditAttribution: Crell commentedAlso related: #2021933: Catch exceptions from queue workers (followup: tests don't work)
Comment #30
chx CreditAttribution: chx commentedThis functionality didn't exist before -- although it should have been -- but it's definitely an API addition. Not sure whether "it should've been" makes it a bug or a feature request. I will let xjm decide.
Comment #31
chx CreditAttribution: chx commentedAnd by that title it's a bug although one that can only be solved by an API addition; however since the exception idea it's not an API change but an addition so it's a go.
Comment #32
xjmWorks for me.
Comment #33
Crell CreditAttribution: Crell commentedBy #31, do you mean using an exception to signal "worker fail, try again later" makes this not an API break? (I hadn't intended that with the other issue, but I'm fine with it as that's how update hooks work now.)
Comment #34
David Hernández CreditAttribution: David Hernández commentedQuick reroll. Updated some deprecated functions (variable_get/set) and fixed some minor coding standard errors. Maybe I left some deprecated functions.
Comment #36
David Hernández CreditAttribution: David Hernández commentedFixed info.yml file
Comment #38
David Hernández CreditAttribution: David Hernández commentedFixing 3 out of 4 tests was easy. But I've been trying to resolve the last one for a couple hours. Unfortunately, my time is over and I have to leave the DrupalCon Prague sprint. I've tracked the issue to the common_test_cron_exception_helper. Looks like the function common_test_cron_exception_helper_callback is not being called and I wasn't able to find why.
I've attached my last version, so maybe someone can fix it for me.
Marked needs review to show the improvements and have the test report.
Comment #40
marthinal CreditAttribution: marthinal commented1) Each time we do this :
while (time() < $end && ($item = $queue->claimItem())) {
We call rows with expire as 0. I made a few tests manually and always when I call moveToTheEnd() method the other rows seem never accessed.
I have fixed the problem by adding the value 0 to expire once the while loop is ended. But I really don't know the best way...
2) About QueueTest, if we apply the patch, we add different weight per row. So the current test should verify the situation with the weight, right?
3) If we increase the value of the weight each time we'll have very high values in the case we always have an exception. I don't know if this could be a problem.
Hope it helps.
Comment #41
joachim CreditAttribution: joachim commented#2021933: Catch exceptions from queue workers (followup: tests don't work) is now in.
Either this needs a reroll, or it could be considered a duplicate, as it's now possible to signal that a job wasn't completed by throwing an exception.
Comment #42
joachim CreditAttribution: joachim commentedOne thing that might be interesting as an addition to just the worker throwing an exception would be to have the worker able to signal that the *whole queue* is currently fubar and should be skipped.
Example: queued items are data to be fetched or pushed to a remote site. If one worker fails because the remote site is down there is no point in cron trying to work on more items from this queue.
If the worker callback is capable of knowing that this is indeed the reason for the failure, it could signal to system_cron() that the current queue should be skipped completely on this cron run.
Comment #43
Crell CreditAttribution: Crell commentedThe obvious way to do that would be to allow a special exception to be thrown to indicate the entire queue should be paused. Any other exception just requeues that one item and on we go to the next one.
How the queue system would mark an entire queue as "paused", I have no idea. :-) But that would be the obvious API for workers.
Comment #44
joachim CreditAttribution: joachim commentedI think this could be easily done in https://api.drupal.org/api/drupal/core!lib!Drupal!Core!Cron.php/function... at least:
Comment #45
joachim CreditAttribution: joachim commentedHere's a patch that does that.
Comment #46
Crell CreditAttribution: Crell commentedThe more specific catch statement must come first; as is, the first Exception reference will catch SuspendQueueException as well and this block will never be called.
This will bypass that queue for this cron run. Is that sufficient? Probably. For a waiting queue, though (which is what we ought to be using rather than a polling queue), we'd need something more robust. Probably the exception should indicate how long the queue should be paused for. That may be follow-up material, though.
Comment #47
joachim CreditAttribution: joachim commented> The more specific catch statement must come first; as is, the first Exception reference will catch SuspendQueueException as well and this block will never be called.
Ok, will reroll.
I'll see if I have time to add a test.
What would be the best way for the queue worker to indicate to the test case how many times it was been called? All I can think of is setting a system variable each time it's called.
Comment #48
joachim CreditAttribution: joachim commentedFixes the exception catching. Not had time to tests for this I'm afraid.
Comment #49
joachim CreditAttribution: joachim commentedDepending on status of #2208649: document queue worker callback, this will either need to update the documentation that issue adds, or reroll the patch at that issue.
Comment #50
Crell CreditAttribution: Crell commentedjoachim: Both issues look good to me. To avoid rerolling too much, please fold them both into one patch here and I can RTBC it. :-) (As is I'd RTBC both, but...)
Comment #51
joachim CreditAttribution: joachim commentedThanks for the review.
Though I think I'd rather keep them separate, as the docs issue will need backporting to 7.
Plus I'm not sure it would actually save us a reroll -- I'd reroll now, rather than reroll whichever of the two patches doesn't get in first, if you see what I mean.
Comment #52
Crell CreditAttribution: Crell commentedEh, whatever. I defer to the maintainers on the logistics.
Comment #53
joachim CreditAttribution: joachim commentedThe patch at #2208649: document queue worker callback won the race, therefore I'll reroll this with the documentation changes that it incurs.
Comment #54
joachim CreditAttribution: joachim commentedAdded mention of this to the callback docs.
Comment #55
Crell CreditAttribution: Crell commentedAnd we're back.
Comment #56
catchThe new exception should be documented in a draft change notice for drush/waiting_queue and other queue runners so they can update to use it correctly.
Could also do with a (short) issue summary update since the original use case presented doesn't really match what we've ended up with.
Comment #57
joachim CreditAttribution: joachim commentedUpdated issue summary.
Comment #58
joachim CreditAttribution: joachim commentedChange notice: https://drupal.org/node/2214873
Comment #59
joachim CreditAttribution: joachim commentedComment #60
Crell CreditAttribution: Crell commentedI tweaked the change notice a little. Back to RTBC.
Comment #61
joachim CreditAttribution: joachim commented> It is up to the runner to determine when it is safe to try that queue again.
I'm not sure a runner would have the means to do that!
Comment #62
Crell CreditAttribution: Crell commentedIn most cases it's probably some sort of timeout-retry. That's effectively what the cron runner does, ie, try again on the next cron run.
Comment #63
alexpottAs of #3 we need a test - but we don't appear to have one.
Comment #64
joachim CreditAttribution: joachim commentedAdded a test.
Comment #66
joachim CreditAttribution: joachim commentedWhoops.
Comment #68
joachim CreditAttribution: joachim commentedAh I get it:
The queue item that threw the exception is still claimed, hence the test code can't claim it.
I don't see an API for releasing ALL of a queue's leases.
Should we:
a) have cron run release the item that throws the exception, which would also fix the test
b) change the test to inspect the queue's DB table directly rather than go via the API?
I'm really not sure which is best. Any thoughts?
Comment #69
Crell CreditAttribution: Crell commentedI think A is best. In practice it shouldn't change the production-time effect for most queues, but seems more robust. Ie, if we know that an item is unprocessable we should explicitly say as much, since we're releasing the whole queue in this case anyway.
Also, coupling the tests to the DB implementation seems like a horrible idea, as this is testing the queue system, not the DB implementation thereof.
Comment #70
joachim CreditAttribution: joachim commentedDone.
Fixed the docs for the new exception class too, which didn't have a proper 1 line summary.
Comment #71
Crell CreditAttribution: Crell commentedSeems reasonable to me.
Comment #72
alexpottCommitted 082bf59 and pushed to 8.x. Thanks!