Problem/Motivation
Items in the queue cannot be reclaimed or get incorrect lease times leading to unexpected behaviour with expiration and claims.
There are three errors:
- The system cron only resets the expired items of the Database Queue. Memory queue items can never be reclaimed, because their expire value is never set to 0.
- The cron time and the lease time are incorrectly using the same value defined in the annotation of the queue worker.
- The lease time for items in cron-based queues are incorrectly set to 1 second. Fixed: #3230541: Queue items only reserved by cron for 1 second
Further details available in #25.
Proposed resolution
1. Let queues handle expiration of items themselves.
2. Fix the lease time for cron based queues.
3. Allow queueworkers to set specific lease times as well as cron times.
API changes
Added cron lease time in the QueueWorker annotation definition.
| Comment | File | Size | Author |
|---|---|---|---|
| #81 | 2893933-queue_lease_time-11.x-81.patch | 15.19 KB | gun_dose |
| #76 | 2893933-75-rerolled-for-10-3.patch | 15.26 KB | chri5tia |
| #75 | 2893933-queue_lease_time-11.x-75.patch | 15.2 KB | bapplejax |
| #72 | cron-missing-time-3396207.patch | 729 bytes | nick.murza |
| #67 | reroll_diff_62-67.txt | 4.09 KB | jastraat |
Issue fork drupal-2893933
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #2
kmoll commentedHere is a patch to update this.
Comment #3
kmoll commentedComment #4
kmoll commentedComment #7
m4oliveiHere's a failing test patch without the fix to show the problem more clearly.
Comment #8
kmoll commentedthanks @m4olivei. I would love to get some visibility on this. At least a discussion. I don't like the fact that we need to rely on garbage collection, which runs in a separate process, in order for an expired item to be fetched again. IMO it would be simple to alter the query to get expired items. Maybe we can add the test to the above patch to show that it is working when the patch is applied. Let me know if you have time to update that. If not I can. Thanks!
Comment #9
m4oliveiYep, I'm on it. Just waiting to make sure that the last patch failed the way I intended. I've got the complete patch with your fix and my test ready to go.
Comment #10
kmoll commentedAwesome! thanks.
Comment #12
m4oliveiAttaching patch from #2 including test from #7 (and renaming the patch to be correct :/ ).
Comment #13
m4oliveiHumm, oops, I didn't completely understand the test class when I wrote that test. The code I added was specific to the database queue, but that code also gets run in testing the memory queue.
Taking a look..
Comment #14
m4oliveiOK, I removed the DB query for expire in the test, since that makes no sense for the memory queue. The test is still valid the way it is, that was just an extra assertion. I found after doing this that the memory queue exhibits the same issue wherein it would not claim an item unless it's expiry was exactly 0. That fix is included as well.
Comment #15
m4oliveiOne thing to be careful about. Once you fix this, another problem rears:
#2883819: Cron does not use lease time correctly when claiming item
That issue means that cron queue processing currently ignores any lease time that you give it, erroneously using a constant of 1 second lease time for any queue item claim. So even though lease times are properly checked now, items expire far sooner than typically expected.
We probably need both to make sure we don't break things temporarily. Not totally sure though.
Comment #16
mustanggb commentedI noticed this in D7.
Wondering if this was done deliberately so that looped
claimItem()'s don't run forever and this is why they require asystem_cron()execution to reset expired back to 0.Perhaps a better option would be to keep the current setup, where
claimItem()only pulls fromexpired = 0, and instead add aSystemQueue::resetLeases()function that does the same thing as thesystem_cron()code and can be called before a loopedclaimItem()(or in the loop, if you're being careful), which will allow expired items to be claimed, but prevent infinite loops.Comment #17
manuel garcia commentedComment #18
manuel garcia commentedUpdating IS and title to reflect the current scope which also includes the Memory queue.
Comment #19
manuel garcia commentedComment #20
deviantintegral commentedI found the original issue at #391340: Job queue API, and I don't see any comments specifically about why expiration clearing happens in cron. comment #98 did bring it up as a performance concern.
Something to note is that this is how all expiries work in Drupal. For example, caches are only cleared on cron and may be continued to be used past their expiry time (this comes up with the page cache). I'm guessing that's the source of this in the first place.
I think a good place to start would just be some improved documentation - I've been tripped up on this, I've seen other developers have the same confusions, and it's not how developers typically think of expired item handling.
One nice side effect of the current handling is that it encourages site owners to run cron more often. I've seen sites where cron has been needlessly disabled, or only run 2-3 times a day, due to incorrect perceptions about performance. Being able to say "the floor of queue expiries and page caches is your cron interval" is a strong argument in favour of running cron with reasonable intervals.
Luckily, I don't think expired handling is really part of the "public" queue API. Even though nearly all Drupal sites use database queues, there's no contract that cron has anything to do with a given queue implementation. IOW, I could see this potentially being committed to 8.5.x.
Comment #21
deviantintegral commented+ $queue1->claimItem(-3600);This feels like a bit of hack, given that it violates the interface docs on claimItem(). It would be reasonable for a queue implementation to require only positive lease times and throw an \InvalidArgumentException otherwise. I think it would be OK if the test manually inserted a row.
Should this patch remove the system_cron implementation too? That may give us hints if we're breaking any other expectations by altering the claimItem() behaviour.
Comment #23
deviantintegral commentedComment #24
nuezComment #25
nuezI think it's a good idea to incorporate two other issues in this issue:
Some observations:
$lease_time, which is the argument of the Queue's::claimItem(). Note thattime()is used to get the actual time and not the time at the start of the request.cron = {time = x}is defined in the annotation of your QueueWorker plugin, the lease time is erroneously set to 1 second as mentioned by m4olivei. If the annotation is not set, then the default of 30 seconds is applied in both the database queue and the memory queue.$lease_time = isset($info['cron']['time']) ?: NULL;3. Currently the time indicated in the annotation of the QueueWorker plugin is incorrectly used for both lease time and cron execution time.
I agree that resetting lease times in the system_cron() is not only an akward place for it (it took me a while to find it), it also seems to be faulty:
It currently doesn't release items in the Memory queue, since the MemoryQueue doesn't implement the QueueGarbageCollectionInterface.
Expired but claimed items in the memory queue are never released it seems.
My patch builds on the patch in #41, also fixing the 1 second lease time and the incorrect use of the cron time as lease time. It also adds a kernel test on a cron queue execution level to see if leases correctly expire.
Comment #27
deviantintegral commentedA straight reroll against 8.8.x to run tests against while I review.
Comment #28
deviantintegral commentedIdentical to the above patch with some minor CS fixes.
A few questions.
Comment #29
nuezAdding a patch to reroll this against the lastest 8.7.x branch too, as the patch in #25 doesn't seem to apply anymore.
It's been a while, but:
I would not move the constants to QueueInterface, as the constants are specific to CRON executing the queue, and doesn't apply generically to the QueueInterface.
You could consider using an hour instead of 30 seconds, I don't really have an opinion about that, other than that the 3600 seconds in QueueInterface is actually never used in Core as every implementation of that class overrides that value to become either 30 seconds (Database and Memory Queue) or 0 (BatchMemory).
Generally you would want to motivate people to build short-lasting queue jobs, creating jobs that take long generally means you should cut them up into smaller jobs. If you do need a long lease time, you can add the annotation. So 30 seconds is still in line with what's currently being used by core.
I need to think about that. My first thought would be that this happens in the DatabaseQueue which implements the ReliableQueueInterface, which maintains the order of the tasks being executed, if I'm not mistaken, so I'm not sure if the scenario that you're imagining is technically likely to happen.
Comment #31
nuezReroll for the 8.8.x branch
Comment #32
nuezComment #34
manuel garcia commentedThanks for the reroll @nuez
Just a bit of work here, with this
QueueTestshould come back green.Comment #39
dieterholvoet commentedRerolled for the 9.2.x branch and opened a MR.
Comment #40
dieterholvoet commentedIn case anyone needs it, a reroll for the 9.1.x branch.
Comment #43
codebymikey commentedFixed the broken tests on the 9.3.x reroll - https://git.drupalcode.org/project/drupal/-/merge_requests/254/diffs
Further review required. I've tested the patch, and it seems to be functional.
If accepted, I guess a change notice will be required for the new
lease_timeattribute as well as the introduction of theCronInterface::DEFAULT_QUEUE_CRON_TIMEandCronInterface::DEFAULT_QUEUE_CRON_LEASE_TIMEconstants which can be used by other Cron implementations.Comment #45
dieterholvoet commentedI rebased the branch against 9.4.x.
Comment #46
tibezh commentedAccording to the #3222251: [November 8, 2021] Replace all isset constructs with the null coalescing operator and #2909370: Fix 'Drupal.Commenting.VariableComment.IncorrectVarType' coding standard issues, we cannot apply a patch from the #40 comment on the 9.3.x and 9.4.x branches.
An actual patch is attached to the comment.
Comment #47
tibezh commentedFixed code quality
Comment #50
codebymikey commentedRerolled for 9.4, 9.5 reroll requires updating the target of the issue fork (which can only be done by the original creator @DieterHolvoet)
Comment #51
codebymikey commentedUpdated the TEST-ONLY patch to remove reference to non-existent
QueueWorkerManagerInterface::DEFAULT_QUEUE_CRON_LEASE_TIMEconstant.Comment #52
codebymikey commentedOne thing to note is whether when specifying the default
lease_time, we should ensure that it's always larger than or equal to the currenttime.i.e. If a custom queue
timeof 100 seconds is specified, and thelease_timeis unspecified, thelease_timeshould probably default to 100/115 seconds rather than the default 30 seconds?Comment #53
codebymikey commentedAddress broken test which doesn't pass the static queue worker definition through the
QueueWorkerManagerdefinition processor.Comment #55
mile23Comment #56
_utsavsharma commentedTried to fix #53 for 10.1.x.
Please review.
Comment #57
jastraat commentedRe-rolling #53 for Drupal 9.5.x
Comment #58
jastraat commentedRe-rolling again for 9.5.x
Comment #60
jastraat commentedCreated an issue for the unrelated, random, intermittent errors from ckeditor5: https://www.drupal.org/project/drupal/issues/3332456
Comment #61
jastraat commentedComment #62
jastraat commentedRe-rolling for 10.1.x
Comment #63
mile23Adding Needs subsystem maintainer review, and tagging @neclimdul who's listed for queue. We don't have subsystem maintainers for cron, and we don't have a queue component to tag the issue.
Patch applies. :-)
I'd suggest that we add
@seetags from all the functions/variables we're changing toDrupal\Core\Queue\QueueInterface::claimItem()because it has an explanation of what a lease is. If there's some other, better explanation, we should@seethat instead.I'd RTBC for this other than the above. We're using the 9.5 patch in #58 in production, for queue/cron issues we're having in DKAN.
Comment #65
jastraat commentedRe-roll for 11.x that also adds @see comments as suggested above.
Comment #67
jastraat commentedAdjusting 11.x tests in new CronSuspendQueueDelayTest.php
Comment #68
mile23Reroll for 11.x looks good, my concerns from #63 about @see are addressed.
Setting RTBC.
Hopefully we'll be able to backport this to 10.0 and/or 9.5.
I want to give a special tangential shout-out to #3068764: Refactor cron queue processing into its own class because I just spent a bunch of time trying to figure out a bespoke queue implementation that should have been an easy subclass of a generic queue handler. Tangent over.
Comment #69
neclimdulHaven't made it all the way through this patch because I got hung up on this. I don't understand _why_ the change is needed. In addition if we read above the update the second block of code is part of we see the code is using the
expire = 0part of the condition to ensure only one process can claim an item. The current code will always succeed in updating removing this protection.Comment #70
mile23Doing some triage.
Comment #71
mile23Comment #72
nick.murza commentedpatch kill the issue, but need to find the core of error.
Comment #73
intrafusionThe 10.1.x patch from #62 doesn't apply to 10.2.x but the 11.x patch from #67 does, testing whether this still fixes the issue for me
Comment #74
swirtPatch #67 seems to no longer apply cleanly to 10.3
Comment #75
bapplejax commentedThis re-roll patch will cleanly apply the patch referenced in #67 to 10.3.x.
The main issue being QueueWorker being un-commented out in the two files listed:
core/modules/system/tests/modules/cron_queue_test/src/Plugin/QueueWorker/CronQueueTestDatabaseDelayException.php
and
core/modules/system/tests/modules/cron_queue_test/src/Plugin/QueueWorker/CronQueueTestLeaseTime.php
So the patch was looking for commented-out code where it has since been applied in 10.3.
Comment #76
chri5tia commentedRerolled patch #75 for Drupal 10.3.6
Comment #77
moshe weitzman commentedSeems like a serious issue. I'd like to move this to the queue component because claimItem() bugs affect non-cron queues as well. Alas I dont think that component exists.
Comment #78
neclimdul"bugs affect non-cron queues as well"
Do you have an example? I've not run into this issue and there's conceptual explanation but not real world examples.
Looks like the subsystem tag was added for me and I added the review so removing the tag.
Comment #79
gun_dose commentedI have just updated #75 patch to be compatible with 11.2.x branch. There was small change in comments, so composer couldn't find a place to apply chunk
Comment #80
gun_dose commentedComment #81
gun_dose commentedSorry, I have uploaded a wrong file. This one is correct.