Motivation

This is a trimmed down version of issue #2893933.

The reason for this new issue is to focus purely on the bug outlined in issue #2893933 which pertains to the Drupal Cron service's queue processor (without changing queue specific behavior).

Problem

Queue items are only ever leased by cron for 1 second (which can easily result in two separate cron processes attempting to process the same queue item at once).

Proposed resolution

Queue items should be leased for the time specified in the cron-time annotation for a given queue. This will ensure a queue item is leased for at least as long as a queue is being processed.

Steps to reproduce

  1. Create a queue worker with a process item method which takes a long time to run (see below for an example queue worker).
  2. Create a single queue item.
  3. Start multiple cron processes; allow a few seconds between starting each (if you have drush installed you can run something like: drush php-eval "Drupal::queue('q_test_queueworker_test')->createItem('test');").
  4. Observe how cron hangs on the same queue item.

Testing Steps

  1. Apply the patch for this issue.
  2. Create a queue worker with a process item method which takes a long time to run (see below for an example queue worker).
  3. Create a single queue item.
  4. Start multiple cron processes; allow a few seconds between starting each (if you have drush installed you can run something like: drush php-eval "Drupal::queue('q_test_queueworker_test')->createItem('test');").
  5. Observe how only the first cron process picks up the single queue item.

Example Queue Worker

<?php

namespace Drupal\slow_queue\Plugin\QueueWorker;

use Drupal\Core\Queue\QueueWorkerBase;

/**
 * Sleep for 5 minutes and then log a message to watchdog.
 *
 * @QueueWorker(
 *   id = "slow_queue",
 *   title = @Translation("This queue worker processes items slowly and logs a message in watchdog when it's finished"),
 *   cron = {
 *     "time" = 60
 *   }
 * )
 */
class SlowQueueWorker extends QueueWorkerBase {

  /**
   * {@inheritdoc}
   */
  public function processItem($item) {
    \Drupal::logger('slow_queue')->notice('Starting: ' . $data);
    sleep(60);
    \Drupal::logger('slow_queue')->notice('Done: ' . $item);
  }

}

Issue fork drupal-3230541

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

cliddell created an issue. See original summary.

cliddell’s picture

Issue summary: View changes

cliddell’s picture

Issue summary: View changes
cliddell’s picture

Status: Active » Needs review
cliddell’s picture

Title: Fix broken behavior in the Drupal Cron service's queue processor » Fix queue items not being leased by cron for proper time segments
cliddell’s picture

Here's a patch for 9.1.x - 9.3.x in case anyone needs it. Please see the above MR for review (unless you'd prefer looking through a patch instead :) ).

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.0-rc1 was released on November 26, 2021, which means new developments and disruptive changes should now be targeted for the 9.4.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

jday’s picture

Status: Needs review » Needs work

The last submitted patch, 9: 3230541-8.patch, failed testing. View results

cliddell’s picture

cliddell’s picture

Status: Needs work » Needs review
cliddell’s picture

Issue summary: View changes
joshf’s picture

I tested this using a long-running queueworker similar to the one @cliddell posted in the description. Before the patch, starting two simultaneous cron runs with a single long-running queue item would cause the item to be processed twice. After the patch, the item is correctly only processed once.

Please note that this ONLY happens when you process the queue via cron; running the queueworker manually via drush queue:run (or, presumably, via other non-cron methods) does not trigger the faulty behavior.

The patch code looks good as well, and includes tests. +1 RTBC

jday’s picture

+1 for this patch

It keeps the cron process from attempting to run a cron job already in progress. Without it, the same cron job gets picked up multiple times.

Thank you @cliddell

SamLerner’s picture

I verified this patch works as well. I can't trigger the same queue runner with multiple cron runs after applying.

This would be a great addition to core!

cliddell’s picture

Issue summary: View changes
Status: Needs review » Reviewed & tested by the community
cliddell’s picture

Status: Reviewed & tested by the community » Needs review

The above commits are just to rebase the feature branch off 9.4.x and fix a small typo in a docblock.

I've moved the ticket back to "Needs Review" so someone can review it against 9.4.x.

cmlara’s picture

Status: Needs review » Needs work

Needs work as listed this is more of a feature than a bugfix at the moment with some possible scope creep.

The cron time and the lease time are incorrectly using the same value defined in the annotation of the queue worker.

There actually is not anything wrong with this IMHO. When claiming an item the lease_time is the max time you may hold onto it before someone else may claim it.

For the first item claimed in the run it could legitimately be held the entire time of the queue worker and no other tasks are permitted to run. If the task finishes early the QueueWorker releases the lock and no harm is done.

I also wonder if this might have been done to avoid fast failures from causing a claim loop, if instead the claim time was guaranteed to last the entire time of the QueueWorker run there is no chance of trying to claim the same queue item twice in the same execution preventing a loop of failures, however that is speculation on my part at the moment without digging deeper into the code and history.

cliddell’s picture

Title: Fix queue items not being leased by cron for proper time segments » Queue items only leased by cron for 1 second
Issue summary: View changes
Status: Needs work » Needs review

Hi @cmlara. Thanks for the feedback!

1.

Needs work as listed this is more of a feature than a bugfix at the moment...

I think the summary for this issue may have been a bit misleading.

The cron time and lease time are not actually being set to the same value. $cron_time is being set to the value of the queue worker cron time annotation, while $lease_time is being set to isset($info['cron']['time']) ?: NULL;, or TRUE since $info['cron']['time'] is always set, and later coerced to 1.

Sorry for any confusion which may have stemmed from this. I've removed the erroneous statement from the problem summary and reworded the title for clarity.

2. I don't think the changes in this issue should be considered a feature since it doesn't look like the $lease_time variable has ever been set properly, and I'm unsure of the original authors intention from reading the code.

Cron time is used to define how long to continue picking up new items. I can imagine a scenario where defining a separate queue item lease time would have an advantage over just using a queue's "cron time". For example, if someone had a queue worker which might process an item for up to 5 minutes, but didn't want to continue picking up new items after 3 minutes.

Again, I don't really see this as a new feature since this behavior has never worked properly, and I think this bug should be fixed in a way which makes the most sense, which IMHO would be allowing developers to specify the lease time for queues.

3.

...with some possible scope creep.

Fair enough.

The scope creep in question was to fix a couple of deprecations in the DatabaseQueue class. Since that file is being modified for no other reason, and these changes could distract away from the primary purpose of this issue, I've reverted the changes in question.

4.

When claiming an item the lease_time is the max time you may hold onto it before someone else may claim it... I also wonder if this might have been done to avoid fast failures from causing a claim loop...

It does seem that lease time was intended to prevent multiple cron processes from picking up the same queue item, however there are other checks in place to prevent a queue from picking up the same item in a loop unintentionally.

The DatabaseQueue and MemoryQueue queues will only process items with a lease_time of 0 (see the ::claimItem() method for each class). Lease time is set to 0 for all items with expired leases at the beginning of a cron run.

The Batch and BatchMemory queues both ignore the lease time provided in order to allow, per the ::claimItem() method documentation, "item[s] to be claimed repeatedly until [they are] deleted". If lease_time were set to 0, it wouldn't affect the claim behavior of the queues any differently than if it were set to 5000 or more.

As such, the lease_time variable appears to only be useful for preventing two cron processes from picking up the same item.

cliddell’s picture

Title: Queue items only leased by cron for 1 second » Queue items only reserved by cron for 1 second
cliddell’s picture

Issue summary: View changes
cmlara’s picture

Status: Needs review » Needs work

The cron time and lease time are not actually being set to the same value. $cron_time is being set to the value of the queue worker cron time annotation, while $lease_time is being set to isset($info['cron']['time']) ?: NULL;, or TRUE since $info['cron']['time'] is always set, and later coerced to 1.

The difference between isset($info['cron']['time']) ?: NULL; and isset($info['cron']['time']) ? $info['cron']['time'] : NULL; OR $info['cron']['time'] ?: NULL; when written is what stops this from working with the current code (the NULL is an issue as well and should be sure to pickup a default value somewhere). This is basically an "incorrect value assignment" bug, we generally do not need to add new config options to fix an assignment fault.

It does seem that lease time was intended to prevent multiple cron processes from picking up the same queue item, however there are other checks in place to prevent a queue from picking up the same item in a loop unintentionally.

The DatabaseQueue and MemoryQueue queues will only process items with a lease_time of 0 (see the ::claimItem() method for each class). Lease time is set to 0 for all items with expired leases at the beginning of a cron run.

Ah you are correct about this for the default core queues. However this will not hold true for other Queue backends such as RabbitMQ that can implement actual timing on the locks.

The Database and Memory queue are very simple backends intended to work for many cases, being forced to implement features in unique ways since they are not full queue systems.

I can imagine a scenario where defining a separate queue item lease time would have an advantage over just using a queue's "cron time". For example, if someone had a queue worker which might process an item for up to 5 minutes, but didn't want to continue picking up new items after 3 minutes.

I could see this, though that becomes a feature request not a bugfix as there is currently no provision to provide a lease time nor any sign to indicate that was intended to be allowed for the QueueWorker.

I believe if you decrease this patch to just the one line of Cron.php to correctly calculate the lease_time and add any necessary tests to catch this issue you will have a better chance of this being more rapidly adopted into core. After this is adopted you could targeted the feature of adding lease_time as a config option.

cliddell’s picture

Issue summary: View changes
Status: Needs work » Needs review

I believe if you decrease this patch to just the one line of Cron.php to correctly calculate the lease_time and add any necessary tests to catch this issue you will have a better chance of this being more rapidly adopted into core. After this is adopted you could targeted the feature of adding lease_time as a config option.

Ok, you're probably right. I can see the new annotation slowing down adoption, and I don't want to slow down a potential Core fix.

I still don't like the idea of using the existing cron-time setting for queue item lease times, but it's better than the existing behavior of queue items only ever being leased for 1 second.

I've trimmed down the issue MR like you suggested, and I've created a follow up issue (#3274931) for adding a lease_time annotation.

Thanks again for the feedback!

cmlara’s picture

This looks good to me. Logic takes care to use a default value or to pull in the cron run time.

The build failure appears to be an issue in DrupalCi (error about task run twice) I saw no errors reported from any of the actual tests.

I think this may still need a fail test? Leaving as NR for second opinion on that.

cliddell’s picture

Status: Needs review » Needs work

Transitioning back to Needs Work until I'm able to add tests.

yogeshmpawar made their first commit to this issue’s fork.

yogeshmpawar’s picture

Status: Needs work » Needs review
cliddell’s picture

Issue tags: +Bug Smash Initiative
Ghost of Drupal Past’s picture

+ $lease_time = $info['cron']['time'] ?? self::DEFAULT_QUEUE_CRON_TIME;

The ?? part is not necessary because

  protected function processQueues() {
    // Grab the defined cron queues.
    foreach ($this->queueManager->getDefinitions() as $queue_name => $info) {
      if (isset($info['cron'])) {

and if (isset($info['cron'])) { then time is set by the queue manager:

  public function processDefinition(&$definition, $plugin_id) {
    parent::processDefinition($definition, $plugin_id);

    // Assign a default time if a cron is specified.
    if (isset($definition['cron'])) {
      $definition['cron'] += [
        'time' => 15,
      ];
    }
  }

Indeed, this caused the bug to always manifest because the isset() was guaranteed to be TRUE.

OTOH

$end = time() + $info['cron']['time']

If we are patching anyways , then shouldn't this be

$end = \Drupal::time()->getCurrentTime() + $info['cron']['time']

And ?? isn't needed again.

neclimdul’s picture

Status: Needs review » Needs work

Chx is right, and that check is always wrong and it is odd. It was added by #2705809: Queue garbage collection is not correctly run on cron and there was some churn around testing claimItems so... maybe something weird was happening with tests back then and breaking this "fixed" it? Hard to peer pack that long but its definitely broken.

On the general approach, I think moving the value to a constant is fine, but we need to align the processing code.

As far as the null coalescing, again I think chx is right that it shouldn't be needed. But if we want to be consistent with the safeguard in the run length code maybe document that it is just safeguards for bad definitions and @see the processing.

Side note, that safeguard isn't all that great because it protects against null but not 0. Don't put 0 in your cron time or you're likely to have a bad time because that while loop is going to be very short 🤣.

neclimdul’s picture

Oh, also I'm not too keen on all the time changes.

1) I don't know how its currently being wrangled but those time changes are going to conflict with #2902895: [meta][no patch] Replace uses of REQUEST_TIME and time() with time service.
2) I spent a good couple minutes hunting the actual fix before finding it. Thought this was just massaging time calls and adding a test at first which didn't make sense.
3) There's a pretty good argument the global calls are the wrong way to make this change and this isn't that issue. My argument would be that QueueDatabaseFactory is responsible for constructing and injecting dependencies into DatabaseQueue, of which time service is one of them. Since the factory lives on the container it should also get the service injected and pass it along.
4) Also it should include chx's suggested time() change.

cliddell’s picture

Status: Needs work » Needs review

Hi Chx, neclimdul. Thanks for the feedback!

As far as the null coalescing, again I think chx is right that it shouldn't be needed. But if we want to be consistent with the safeguard in the run length code maybe document that it is just safeguards for bad definitions and @see the processing.

Side note, that safeguard isn't all that great because it protects against null but not 0. Don't put 0 in your cron time or you're likely to have a bad time because that while loop is going to be very short 🤣.

Yeah, agreed. I've updated the MR as follows:

1. Removed the unnecessary null coalesce usage in the cron service.
2. Moved the default reservation time constant to the QueueWorkerManager.
3. Updated the safeguard to ensure zero and negative values were replaced with the constant.
4. Improved documentation around the safe-guard.

Oh, also I'm not too keen on all the time changes.

The reason for the time changes was to ensure the request time would use the mocked value set in CronQueueTest::setUp() (since there's no way to mock a global constant).

I'm open to suggestions on how best to handle this. It doesn't seem like using a global call in the DatabaseQueue would cause disruption. Also, it looks like that's the approach being used in #3112295: Replace REQUEST_TIME in rest of OO code (except for tests).

neclimdul’s picture

Ah I see, that's correct. So, technically this is OO code and with the factory it does have access to the container so that wouldn't be the correct approach. But also looks like that's the approach already being used in this class so I'll just need to follow up to fix it in general. No action here unless you want to make me an issue so i don't forget 🤣

I think the only missing I can think of is maybe we should test the safeguard since that's part of what we're fixing? Add a worker with a negative and/or 0 time and make sure the code fixes it and runs correctly.

Thanks!

alison’s picture

Issue tags: +GiftofOpenSource

[Mentoring] + adding GiftofOpenSource tag

mradcliffe’s picture

Status: Needs review » Needs work

It looks like there is still a code review item to address based on @neclimdul's comment #34 so flipping to Needs work.

cliddell’s picture

@neclimdul Awesome! I've created a follow up issue (#3277996: Inject time service into DatabaseQueue; remove dependency on global container) to handle injecting the time service.

I'll begin working on a test for the safeguard (hopefully sometime today, fingers-crossed, no promises 😅).

cliddell’s picture

Status: Needs work » Needs review
neclimdul’s picture

Status: Needs review » Reviewed & tested by the community

New test makes sense.

alexpott’s picture

Version: 9.4.x-dev » 9.3.x-dev
Status: Reviewed & tested by the community » Fixed

Committed and pushed 0fc50fa51d to 10.0.x and 7a35abe96e to 9.5.x and 652e1faf8a to 9.4.x and d218a382b9 to 9.3.x. Thanks!

  • alexpott committed 0fc50fa on 10.0.x
    Issue #3230541 by cliddell, jday, yogeshmpawar, neclimdul, cmlara,...

  • alexpott committed 7a35abe on 9.5.x
    Issue #3230541 by cliddell, jday, yogeshmpawar, neclimdul, cmlara,...

  • alexpott committed 652e1fa on 9.4.x
    Issue #3230541 by cliddell, jday, yogeshmpawar, neclimdul, cmlara,...

  • alexpott committed d218a38 on 9.3.x
    Issue #3230541 by cliddell, jday, yogeshmpawar, neclimdul, cmlara,...

  • alexpott committed 6572d37 on 10.0.x
    Revert "Issue #3230541 by cliddell, jday, yogeshmpawar, neclimdul,...

  • alexpott committed 5d068a4 on 9.5.x
    Revert "Issue #3230541 by cliddell, jday, yogeshmpawar, neclimdul,...

  • alexpott committed 97ebbd6 on 9.4.x
    Revert "Issue #3230541 by cliddell, jday, yogeshmpawar, neclimdul,...

  • alexpott committed 889dc15 on 9.3.x
    Revert "Issue #3230541 by cliddell, jday, yogeshmpawar, neclimdul,...
alexpott’s picture

I committed the patch and not the MR. And funnily enough the MR is much better. Whoops. Committing the MR now...

Committed and pushed 97d9a57d4e to 10.0.x and c88351cfa0 to 9.5.x and a8650ce147 to 9.4.x and a6bf9113b5 to 9.3.x. Thanks!

  • alexpott committed 367742e on 10.0.x
    Issue #3230541 by cliddell, jday, yogeshmpawar, neclimdul, cmlara,...

  • alexpott committed c88351c on 9.5.x
    Issue #3230541 by cliddell, jday, yogeshmpawar, neclimdul, cmlara,...

  • alexpott committed a8650ce on 9.4.x
    Issue #3230541 by cliddell, jday, yogeshmpawar, neclimdul, cmlara,...

  • alexpott committed a6bf911 on 9.3.x
    Issue #3230541 by cliddell, jday, yogeshmpawar, neclimdul, cmlara,...

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.