lock_acquire might be checking the wrong thing when extending a lock lease [#2379883]

It looks to me as though lock_acquire does the wrong check on http://cgit.drupalcode.org/memcache/tree/memcache-lock-code.inc#n42

 elseif ($result = dmemcache_get($name, 'semaphore') && isset($locks[$name]) && $locks[$name] == _lock_id())

The patch would be super-easy, but what I'm really looking for is confirmation that my hunch is correct.

I think that last check on the line should be "$locks[$name] == $result", as the current "$locks[$name] == _lock_id()" is surely always TRUE, isn't it? What we want to be checking on this line is that it is our process that has the lock, that is, that the $result coming out of memcache for a given name equals the lock ID that this process has for that lock name.

There is a clue that this might be wrong in that $result is otherwise never used within this function. I think the intention probably was to use $result in this comparison.

Setting to major, as I believe there are significant implications of getting this wrong (though you would only ever see problems if a lock was allowed to expire, which is somewhat less common in practice).

Comment	File	Size	Author
#7	shared-lock.php_.txt	516 bytes	Jeremy
#2	memcache-2379883-lock-extend.patch	684 bytes	Alan Evans

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Comment #1

Alan Evans CreditAttribution: Alan Evans commented 21 November 2014 at 23:21

PoC code:

For process 1:

$lock = lock_acquire('test', 10);
var_dump($lock);
sleep(15);
echo "Lock should be gone\n";
sleep(10);
$lock = lock_acquire('test');
var_dump($lock);
sleep(30);

Process 2:

$lock = lock_acquire('test');
var_dump($lock);
sleep(30);

In order to test, run the process 1 script. As soon as you get "lock should be gone" output, start the second one in a separate terminal. You would hope that because process 2 claimed the lock after it expired in process 1, process 1 would not be able to extend its lock, but what happens is process 1 thinks it got the lock:

Process 1:

bool(true)
Lock should be gone
bool(true)

Process 2:

bool(true)

Comment #2

Alan Evans CreditAttribution: Alan Evans commented 21 November 2014 at 23:25

Status:

Active

» Needs review

File	Size
memcache-2379883-lock-extend.patch	684 bytes

The attached patch makes it so that the first process cannot extend a lock which has been claimed by another process, and so process 1 test code output looks like this:

bool(true)
Lock should be gone
bool(false)

I added a couple of extra parentheses for clarity.

Comment #3

Alan Evans CreditAttribution: Alan Evans commented 21 November 2014 at 23:32

Note also: as with #2379897: lock_release does not check for lock_id : it's actually a forced lock break, the behaviour with this patch more closely emulates the core SQL lock.inc behaviour in that it does not allow a process to extend a lock that was already claimed by another process. This is the output from process 1 when running with core lock.inc (I added the "DB lock" output into includes/lock.inc to prove it was being executed):

DB lock
bool(true)
Lock should be gone
DB lock
bool(false)

Comment #4

Alan Evans CreditAttribution: Alan Evans commented 24 November 2014 at 11:31

Or course, there's also the possibility that I'm just reading this wrong and the unused $result is a red herring. If the purpose of that is simply to establish that there is a non-FALSE result (in which case, $result can be removed altogether) from that get() call AND the check at the end is simply to establish "I believe I claimed the lock earlier in this process" before extending the lock, then the only problem that remains is the race condition between the get() and the set(). But then, there is already the isset() earlier in the line to do that, so the last check doesn't help.

However it comes out, this could use a comment to ensure that the intention is obvious.

Many lock implementations don't attempt to extend an existing lock, and so don't have to deal with this issue. The problem with memcache in this case is that we can't impose multiple conditions on what we're attempting to set, which core's lock.inc can, due to MySQL. I'm wondering if there's a slightly more elaborate memcache-based data structure that would allow this to work, using multiple keys then, or whether we'd need a "micro-mutex" (can't think of a better word) lock around the non-atomic sections (anywhere that has a separate get and set operation). That micro-mutex would have no lock ID at all so that it's fully shared between all processes, and would live only microseconds in order to protect the attempt to get/set the actual lock we're interested in. This would be needed to do proper checked release calls (see #2379897: lock_release does not check for lock_id : it's actually a forced lock break) as well as extending the lock as discussed here.

Comment #5

Alan Evans CreditAttribution: Alan Evans commented 26 November 2014 at 23:24

I suspect that this may be the reason this test fails: http://cgit.drupalcode.org/memcache/tree/tests/memcache-lock.test#n68

Comment #6

24 December 2014 at 10:53

Jeremy committed 6037ca1 on 7.x-1.x authored by Alan Evans

Issue #2379883 by Alan Evans, Jeremy: fix broken logic and order of...

Comment #7

Jeremy CreditAttribution: Jeremy commented 24 December 2014 at 10:55

Version:	7.x-1.x-dev	» 6.x-1.x-dev
Status:	Needs review	» Patch (to be ported)

File	Size
shared-lock.php_.txt	516 bytes

@Alan Evans nice find, you are correct. Thanks for the patch and testing.

The logic was broken (we were renewing based on a local cache instead of the real state of the lock), and there was an order of precedence error causing $result to not be set correctly. Committed fix:
http://cgit.drupalcode.org/memcache/commit/?id=6037ca1

For testing I wrote a simple looping script that grabs and renews a lock. Run as many copies as you like in multiple terminals doing something like drush scr shared-lock.php to confirm that only one process can hold the lock at a time. Random timing causes different instances to get the lock at different times.

Yes, there's a tiny race window still in the lock renewal logic, but I think that's a minor issue compared to what is already fixed. We could open up a followup ticket to address that.

This fix needs to be backported to 6.x-1.x.