Use rpoplpush or brpoplpush for reliable/safe queue processing [#1374000]

http://redis.io/commands/rpoplpush
http://redis.io/commands/brpoplpush

It looks like the current implementation is vulnerable to a worker crash, and/or you are implementing almost the same thing via separate rPop and zAdd commands?

However, using zAdd seems like it could cause a logic error if the same job is in the queue twice: ZAdd will update the score of a member that already exists.
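
To illustrate the ZADD concern: a sorted set holds each member only once, so adding an identical serialized job a second time just overwrites its score. The following is a minimal sketch that models the sorted set as a plain Ruby Hash instead of talking to a Redis server; the `zadd` helper and the job string are hypothetical.

```ruby
# In-memory stand-in for a Redis sorted set: member => score.
# Like ZADD, adding an existing member updates its score and
# does not create a second entry.
def zadd(zset, score, member)
  is_new = !zset.key?(member)
  zset[member] = score
  is_new ? 1 : 0 # ZADD reports how many members were newly added
end

claimed = {}
job = 'send-email:42' # hypothetical serialized job

zadd(claimed, 1000, job) # first claim
zadd(claimed, 1060, job) # an identical job is claimed again

puts claimed.size # a single entry remains
puts claimed[job] # and it carries the score of the second claim
```

If two identical jobs are claimed, only one claim is tracked, so one of them could be lost if its worker crashes.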

Comment	File	Size	Author
#15	1374000-15.patch	9.52 KB	pwolanin
#14	1374000-14.patch	6.48 KB	pwolanin
#13	1374000-13.patch	5.94 KB	pwolanin
#9	1374000-9.patch	2.8 KB	msonnabaum
#6	1374000-6.patch	6.9 KB	pwolanin
#5	1374000-5.patch	6.56 KB	pwolanin
#4	1374000-4.patch	6.56 KB	pwolanin

Comments

Comment #1

msonnabaum commented 17 December 2011 at 21:19

I was using rpoplpush originally, but switched to the separate zadd so I could support claimed item expiration using scores.

I'm very much open to a different technique here that'll move us back to rpoplpush but still support expiration. I'm also open to the argument that expiration is not worth implementing here, but it is part of the interface.

Comment #2

pwolanin commented 18 December 2011 at 19:50

Yeah, looking at the Redis API, it's hard to see how to really do this right. Basically you want some supervisor process to be able to look at the backup queue and see what's "timed out".

The redis docs say: "Another process (that we call Helper), can monitor the backup list to check for timed out entries to re-push against the main queue."

But I don't see how you can easily determine that something timed out. If you added a timestamp to the object when you added it to the queue you don't know how long it took to get claimed.

A scheme that could work is this:

Create a new backup queue every second (or 10 seconds, or something easy) and name it with the timestamp, e.g. backup:1324237030

Assuming a new claimed queue per 10 seconds, we just make some minimal changes to the code like:


function __construct($name) {
  $this->claimed = 'drupal:queue:' . $name . ':claimed:' . round(REQUEST_TIME, -1);

  ...
}

You can then use the KEYS command (http://redis.io/commands/keys) to find all lists matching the claimed pattern (as you already have in public function expire()), and split out the timestamp to find any that are expired.

A quick test with redis locally shows that the KEYS command skips empty lists (or maybe they are effectively deleted by being empty), so you do not even have to delete them if all workers completed.
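
The timestamp-splitting step could look roughly like the sketch below. It is written in Ruby and works on a plain array of key names. In practice those names would come from KEYS, and the helper names `expired_claimed_keys` and `bucket` are made up for illustration.

```ruby
# Return the claimed-list keys whose timestamp bucket is older than the lease.
# Key format follows the proposal above: "<prefix>:claimed:<unix timestamp>".
def expired_claimed_keys(keys, now, lease_time)
  keys.select do |key|
    stamp = key[/:claimed:(\d+)\z/, 1]
    stamp && stamp.to_i + lease_time <= now
  end
end

# Bucket a timestamp to the nearest 10 seconds, like round(REQUEST_TIME, -1).
def bucket(time)
  time.round(-1)
end

keys = [
  'drupal:queue:mail:claimed:1324237030',
  'drupal:queue:mail:claimed:1324237060'
]

# With a 30 second lease, only the first bucket has expired at this time.
p expired_claimed_keys(keys, 1324237065, 30)
```

Because the bucket is rounded to the nearest 10 seconds, an item's real claim time can differ from the bucket name by a few seconds, so a lease check done this way is only approximate.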

Comment #3

pwolanin commented 18 December 2011 at 20:13

Here's an even better idea.

Why not just store a simple string ID for each item, and store the serialized value in a Redis Hash?
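
A rough sketch of that layout, using in-memory Ruby structures in place of Redis lists and a Redis Hash. The `IdQueue` class and its method names are hypothetical and only loosely follow the queue interface; the comments name the Redis command each step stands for.

```ruby
# In-memory stand-ins: two lists of job IDs and one hash of payloads.
class IdQueue
  attr_reader :avail, :claimed, :items

  def initialize
    @avail = []   # IDs waiting to be claimed
    @claimed = [] # IDs currently being worked on
    @items = {}   # ID => serialized payload (the Redis Hash)
    @next_id = 0
  end

  def create_item(payload)
    id = (@next_id += 1).to_s
    @items[id] = payload # HSET
    @avail.unshift(id)   # LPUSH
    id
  end

  # Stands for RPOPLPUSH avail claimed, followed by HGET items id.
  def claim_item
    id = @avail.pop
    return nil if id.nil?
    @claimed.unshift(id)
    [id, @items[id]]
  end

  # Only the ID is needed to finish a job: LREM claimed, then HDEL items.
  def delete_item(id)
    idx = @claimed.rindex(id)
    @claimed.delete_at(idx) if idx
    @items.delete(id)
  end
end

queue = IdQueue.new
queue.create_item('payload A')
queue.create_item('payload A') # identical payloads still get distinct IDs
id, payload = queue.claim_item
queue.delete_item(id)
```

Since only short IDs move between the lists, two jobs with identical payloads stay distinct, and removal never depends on the payload's serialized form.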

Comment #4

pwolanin commented 18 December 2011 at 21:09

Status: Active » Needs review

Status	File	Size
new	1374000-4.patch	6.56 KB

Here's a totally untested rewrite using a hash to store the actual job data, and using lists as queues just of job Ids.

Comment #5

pwolanin commented 18 December 2011 at 21:11

Status	File	Size
new	1374000-5.patch	6.56 KB

oops - that last one had a syntax error.

Comment #6

pwolanin commented 18 December 2011 at 21:16

Status	File	Size
new	1374000-6.patch	6.9 KB

Fix expiration logic (maybe).

There is still highly duplicated code between claimItem() and claimItemBlocking() that could be eliminated.

It would be interesting to see how much (if any) faster this is for jobs with large amounts of data, since you avoid moving the data between the avail and claimed lists.

As written, a lease time of < 10 seconds rounds up to 10 seconds, so maybe that should be configurable.

The core implementation has a default lease time of 30 sec, so not sure why it's 3600 sec here

http://api.drupal.org/api/drupal/modules--system--system.queue.inc/funct...

Comment #7

pwolanin commented 18 December 2011 at 23:06

Marc points me to this discussion initiated by chx:

http://stackoverflow.com/questions/7625101/redis-queue-with-claim-expire

The basic algorithm one might take away from that is: after using rpoplpush to move a queued item from "available" to "claimed", you set a key derived from the ID of the queued item, with an expiration value, e.g.

$qid = $this->queue->rpoplpush($this->avail, $this->claimed);
$this->queue->setex($this->avail . '_lock:' . $qid, $lease_time, time());

And then to expire and recycle the items in the claimed list you basically iterate over the claimed list and test whether the lock key still exists. If not, add it back to available.

A downside here is that this algorithm does not use the atomic rpoplpush() the way the patch above does for moving items back to available.

Comment #8

rb2k commented 19 December 2011 at 14:04

I think the initial idea with rpoplpush was a good approach.
To realize a simple queue in redis that can be used to resubmit crashed jobs I'd try something like this:

- 1 list "up_for_grabs"
- 1 list "being_worked_on"
- auto expiring locks

Note that this assumes we don't need fancy features like filtering out duplicate jobs.
I'll try to put this down as ruby-like pseudo code.

A worker trying to grab a job would do something like this:

timeout = 3600
# Move the job off the queue so nobody else tries to claim it
job = RPOPLPUSH(up_for_grabs, being_worked_on)
# Set a lock that expires after `timeout` seconds (SETEX key seconds value).
# The value tells us when that job will time out. This can be arbitrary though
SETEX('lock:' + job, timeout, Time.now + timeout)
# Our application logic
do_work(job)
# Remove the finished item from the being_worked_on list
LREM(being_worked_on, -1, job)
# Delete the item's lock. If it crashes here, the expire will take care of it
DEL('lock:' + job)

And every now and then, we could just grab our list and check that all jobs that are in there actually have a lock.
If we find any jobs that DON'T have a lock, this means it expired and our worker probably crashed.
In this case we would resubmit.

This would be the pseudo code for that:

loop do
  items = LRANGE(being_worked_on, 0, -1)
  items.each do |job|
    if !(EXISTS("lock:" + job))
      puts "We found a job that didn't have a lock, resubmitting: #{job}"
      LREM(being_worked_on, -1, job)
      LPUSH(up_for_grabs, job)
    end
  end
  sleep 60
end
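
The claim and resubmit steps above can be tried out without a Redis server using the sketch below. `FakeRedis` is a hypothetical in-memory stand-in that implements only the handful of commands used here. Unlike a real client, its `setex` and `exists?` take the current time as an argument so that lock expiry can be simulated.

```ruby
# Minimal in-memory stand-in for the few Redis commands used above.
# This is NOT a Redis client; `now` is passed in explicitly.
class FakeRedis
  def initialize
    @lists = Hash.new { |hash, key| hash[key] = [] }
    @expires = {} # lock key => absolute expiry time
  end

  def lpush(key, value)
    @lists[key].unshift(value)
  end

  def rpoplpush(src, dst)
    value = @lists[src].pop
    @lists[dst].unshift(value) unless value.nil?
    value
  end

  def lrange_all(key)
    @lists[key].dup
  end

  # LREM key -1 value: remove one occurrence, searching from the tail.
  def lrem_last(key, value)
    idx = @lists[key].rindex(value)
    return 0 if idx.nil?
    @lists[key].delete_at(idx)
    1
  end

  def setex(key, ttl, now)
    @expires[key] = now + ttl
  end

  def exists?(key, now)
    @expires.key?(key) && @expires[key] > now
  end
end

# Worker side: move the job to the working list and take a lock on it.
def claim(redis, now, timeout)
  job = redis.rpoplpush('up_for_grabs', 'being_worked_on')
  redis.setex("lock:#{job}", timeout, now) if job
  job
end

# Helper side: requeue every job whose lock has expired.
def recycle(redis, now)
  redis.lrange_all('being_worked_on').each_with_object([]) do |job, requeued|
    next if redis.exists?("lock:#{job}", now)
    redis.lrem_last('being_worked_on', job)
    redis.lpush('up_for_grabs', job)
    requeued << job
  end
end

redis = FakeRedis.new
redis.lpush('up_for_grabs', 'job-1')
claim(redis, 0, 3600)  # a worker claims job-1 and then crashes
p recycle(redis, 60)   # lock still held, nothing is requeued
p recycle(redis, 3601) # lock expired, job-1 goes back to up_for_grabs
```

As in the pseudo code, the LREM and LPUSH in the helper are two separate steps, so a crash between them would drop the job. A real implementation would need to handle that case.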

Comment #9

msonnabaum commented 19 December 2011 at 00:34

Status	File	Size
new	1374000-9.patch	2.8 KB

Here's an implementation of more or less what Marc described.

Seems pretty simple and tests are passing with it.

Comment #10

pwolanin commented 19 December 2011 at 03:22

Per my above patch I think you should also avoid moving the actual job data between queues, but rather just move the IDs.

Currently you rely on the serialized item being the same as the original item, which I would consider very fragile, and which could remove another job that serializes to the same value: lrem($this->claimed, $item, -1)

Comment #11

msonnabaum commented 19 December 2011 at 03:33

I include the qid in the serialized item, so there shouldn't ever be duplicates. Also, -1 should only remove one item, so worst case it removes one of the duplicates, not all.

And now that the data moving is happening in a single operation within redis (rpoplpush), I'm not that concerned with moving the data along with the id. I'm assuming your concern was performance, but let me know if there's something else I'm missing there.

Comment #12

pwolanin commented 19 December 2011 at 03:36

2 concerns:

first is the performance of potentially moving large serialized objects around.

second is (as before) that the client may e.g. alter the $item in some way so that serialized value doesn't match the thing in the queue. In this case, the same job might get re-run in an infinite loop.
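
The second concern can be shown with a small sketch. Here a Ruby Array stands in for the claimed list and JSON stands in for PHP's serialize(); both are assumptions made for illustration, as are the `lrem_last` helper and the item's fields.

```ruby
require 'json'

# LREM key -1 value, modelled on an Array: removes one exact match,
# searching from the tail, and returns the number of removed entries.
def lrem_last(list, value)
  idx = list.rindex(value)
  return 0 if idx.nil?
  list.delete_at(idx)
  1
end

claimed = []
item = { 'qid' => 7, 'data' => { 'to' => 'a@example.com' } }
claimed.unshift(JSON.generate(item)) # value stored at claim time

item['data']['attempts'] = 1 # the worker alters the item while processing

removed = lrem_last(claimed, JSON.generate(item))
puts removed      # nothing is removed: the new serialization does not match
puts claimed.size # the stale entry stays in the claimed list
```

The stale entry would later look like an abandoned job and be requeued, which is how the same job could end up being run again and again.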

Comment #13

pwolanin commented 19 December 2011 at 04:12

Status	File	Size
new	1374000-13.patch	5.94 KB

Here's another patch. I haven't actually gotten the driver set up, so it's not tested, but it uses the expiring lease algorithm and keeps the actual job data in a hash so only the job IDs are moving around.

Comment #14

pwolanin commented 1 January 2012 at 19:33

Status	File	Size
new	1374000-14.patch	6.48 KB

Revised patch which passes the unit tests.

Comment #15

pwolanin commented 1 January 2012 at 21:24

Status	File	Size
new	1374000-15.patch	9.52 KB

Added an expireAll() method to be called from hook_cron() and corresponding test assertions.

Also, renames $this->queue to $this->redis, for better code clarity.

Comment #16

msonnabaum commented 3 January 2012 at 00:11

Status: Needs review » Fixed

#15 Looks good to me. Committed.

Comment #17

17 January 2012 at 00:12

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.
