our repository locking is very clumsy, as is our general behavior when handling failures during repository synchronization. this really needs to be improved.
i've taken a first step by adding a VersioncontrolSynchronizationException class, which i want us to begin throwing whenever something unexpected happens during synchronization. we may also want to add another column to {versioncontrol_repositories} that is used as a sort of "failures flag" field - a bitmask where 0 means healthy, and anything else means some manner of problem has been detected on the repository. if we do that, then we can begin catching errors during synchronization (or, say, repeated lock collisions) and flag that field.
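To make the "failures flag" idea concrete, here is a minimal sketch of how such a bitmask column could work. The constant names and helper functions are illustrative only — nothing here is in the module yet; only the proposed semantics (0 means healthy, any set bit means a detected problem) come from the issue summary.

```php
<?php
// Hypothetical flag bits for the proposed column on {versioncontrol_repositories}.
// Names are made up for illustration.
const VERSIONCONTROL_SYNC_OK             = 0x0; // healthy
const VERSIONCONTROL_SYNC_ERROR          = 0x1; // unexpected failure during sync
const VERSIONCONTROL_SYNC_LOCK_COLLISION = 0x2; // repeated lock collisions

// Raise a flag without clearing any bits already set.
function versioncontrol_flag_failure($current_flags, $flag) {
  return $current_flags | $flag;
}

// A repository is healthy only when no bits are set.
function versioncontrol_repository_is_healthy($flags) {
  return $flags === VERSIONCONTROL_SYNC_OK;
}

$flags = VERSIONCONTROL_SYNC_OK;
$flags = versioncontrol_flag_failure($flags, VERSIONCONTROL_SYNC_LOCK_COLLISION);
$flags = versioncontrol_flag_failure($flags, VERSIONCONTROL_SYNC_ERROR);
var_dump(versioncontrol_repository_is_healthy($flags)); // bool(false)
var_dump(($flags & VERSIONCONTROL_SYNC_LOCK_COLLISION) !== 0); // bool(true)
```

Because each error kind gets its own bit, distinct problems detected on separate sync attempts accumulate instead of overwriting each other.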
in any case, this is a meta issue so we can discuss other possibilities here, then fan out other issues for them.
| Comment | File | Size | Author |
|---|---|---|---|
| #11 | sync-fails-1642326-11.patch | 15.02 KB | marvil07 |
| #11 | interdiff.txt | 2.84 KB | marvil07 |
| #10 | sync-fails-1642326-10.patch | 14.81 KB | marvil07 |
| #10 | interdiff.txt | 5 KB | marvil07 |
| #7 | 0001-Issue-1642326-Improve-repository-synchronization-fai.patch | 14.89 KB | marvil07 |
Comments
Comment #1
sdboyer commented:

a ton of progress has been made towards this on the sync-tinkering branch. two more exceptions (VersioncontrolNeedsFullSynchronizationException among them), and a ton of flow control added in VersioncontrolRepository::sync() and VersioncontrolRepository::syncEvent(). those two methods are now expected to be the main entry point for pretty much all synchronization. see the commit diff for the bulk of those changes.

Comment #2
marvil07 commented:

related: #1237764: Create custom exceptions (almost a duplicate?)
Comment #3
sdboyer commented:

definitely related, but not so much a duplicate. that issue refers to all of our exceptions, while this one is concerned with exceptions thrown by a particular type of plugin.
Comment #4
sdboyer commented:

ok so this is definitely a blocker, because right now we indicate success on a sync attempt when we actually failed because the repo was locked (by another process). what HAS been improved is that we almost never have erroneous locks left around...so we can actually trust the lock value. here's what we need to do:

- have VersioncontrolRepositoryHistorySynchronizerInterface::sync*() return a success value and use that to populate $log->successful instead of just manually setting it.
- a VersioncontrolSynchronizationConcurrencyException should work - and a special catch path in the sync*() methods to deal with that exception.
- … VersioncontrolRepositoryHistorySynchronizerInterface methods.

Comment #5
marvil07 commented:

@sdboyer: What do you think about this patch?
Not sure if we need VersioncontrolSynchronizationConcurrencyException; we already have VersioncontrolLockedRepositoryException.

And about the errors field: I am not sure how we could record several errors at the same time if we use exceptions to communicate (only one exception is thrown per attempt). I'm still in favor of having the field, but unless a backend overrides VersioncontrolRepository::sync*() it's not possible to have more than one error, and maybe that's fine :-)
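One way to reconcile "only one exception per attempt" with a multi-bit errors field is to map each exception class to a flag bit and OR it into the field when caught, so distinct error kinds still accumulate across attempts. This is a sketch under assumed names — only VersioncontrolSynchronizationException and VersioncontrolLockedRepositoryException are mentioned in the thread; the constants and the mapping function are hypothetical.

```php
<?php
// Exception classes from the thread; the inheritance shown here is assumed.
class VersioncontrolSynchronizationException extends Exception {}
class VersioncontrolLockedRepositoryException extends VersioncontrolSynchronizationException {}

// Hypothetical flag bits for the proposed errors field.
const VC_SYNC_FLAG_ERROR  = 0x1;
const VC_SYNC_FLAG_LOCKED = 0x2;

// Map a caught synchronization exception to its flag bit.
function vc_sync_flag_for(Exception $e) {
  if ($e instanceof VersioncontrolLockedRepositoryException) {
    return VC_SYNC_FLAG_LOCKED;
  }
  return VC_SYNC_FLAG_ERROR;
}

// Even though each sync attempt surfaces at most one exception, ORing the
// corresponding bit into the field accumulates distinct errors over time.
$errors = 0;
foreach ([new VersioncontrolLockedRepositoryException(),
          new VersioncontrolSynchronizationException()] as $e) {
  $errors |= vc_sync_flag_for($e);
}
var_dump($errors); // int(3): both the locked bit and the generic error bit
```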
Comment #6
marvil07 commented:

Uhm, I was reviewing this, but it still needs work; will update soon. e.g. when the repository is locked, the right message is not stored in the event sync log table.
Comment #7
marvil07 commented:

Copying the commit message:
Naturally this needs a new method implementation and a change to the generateCodeArrivalEvent() method in every backend. I'm adding the patch for vc_git here for convenience.
As for the solution, I saw two alternatives:
- Use a global flag on the repository table to indicate it needs a full sync the next time.
- Re-enqueue a job to sync the same event again.
I found the second more natural, and that's what is implemented.
@sdboyer: What do you think about this? I would like to know your opinion.
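The re-enqueue alternative can be sketched as a queue worker that catches the locked-repository exception and puts the same event payload back on the queue for a later retry. The worker function, queue representation, and callback wiring below are all illustrative assumptions, not the patch's actual code.

```php
<?php
// Exception name from the thread; everything else here is a made-up sketch.
class VersioncontrolLockedRepositoryException extends Exception {}

// Process one event; on a lock collision, re-enqueue the same event.
function vc_sync_event_worker($event, array &$queue, callable $sync) {
  try {
    $sync($event);
    return TRUE;
  }
  catch (VersioncontrolLockedRepositoryException $e) {
    // Put the very same event payload back for a later retry.
    $queue[] = $event;
    return FALSE;
  }
}

$queue = [['repo_id' => 1, 'event' => 'code_arrival']];
$attempts = 0;
$sync = function ($event) use (&$attempts) {
  // Simulate: first attempt hits a locked repository; second succeeds.
  if ($attempts++ === 0) {
    throw new VersioncontrolLockedRepositoryException();
  }
};
while ($queue) {
  $event = array_shift($queue);
  vc_sync_event_worker($event, $queue, $sync);
}
var_dump($attempts); // int(2): one failed attempt, one successful retry
```

Note that, as written, nothing stops a permanently locked repository from being re-enqueued forever — exactly the spin concern raised in the next comment.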
Comment #9
sdboyer commented:

the architecture here makes sense, i'm +1 to that. being able to effectively simulate the event for later reruns makes total sense. i'm not gonna be able to go over it line-by-line to verify that the …
Let's make this a constant on the interface.
this worries me. could we check and see if there's a standard approach to rejecting a queued item, or something? there's the possibility, particularly on d.o, that if a repo is locked then we just spin on it endlessly. and on d.o...one locked repo could end up instantly locking every worker as the job repeatedly recreates itself. it wouldn't prevent jobs from being processed, but it would peg cpu or db on production.
bottom line is, in order to automatically retry, we'll need to set up some kind of escape hatch.
Comment #10
marvil07 commented:

Based on the code from drupal_cron_run(), I think there is no standard way to deal with re-added items in a queue; the main system by design does not know about them (it just removes the item from the queue after the worker callback is called). So I have just added a check on the number of times the event has been processed, based on the versioncontrol_sync_log table entries and a new global variable, versioncontrol_reposync_max_syncevent_retries.
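The escape hatch described here — capping retries by counting prior sync-log entries for the event — could look roughly like the following. The default value, helper names, and in-memory log representation are assumptions; only the table name (versioncontrol_sync_log) and the variable name (versioncontrol_reposync_max_syncevent_retries) come from the comment.

```php
<?php
// Assumed default for the versioncontrol_reposync_max_syncevent_retries variable.
const VC_REPOSYNC_MAX_SYNCEVENT_RETRIES_DEFAULT = 3;

// Stand-in for counting rows in the versioncontrol_sync_log table
// that belong to a given event.
function vc_sync_log_attempt_count(array $log, $event_id) {
  return count(array_filter($log, function ($entry) use ($event_id) {
    return $entry['event_id'] === $event_id;
  }));
}

// Decide whether a failed event should be re-enqueued: only retry while the
// number of recorded attempts is below the configured maximum.
function vc_sync_should_retry(array $log, $event_id,
                              $max = VC_REPOSYNC_MAX_SYNCEVENT_RETRIES_DEFAULT) {
  return vc_sync_log_attempt_count($log, $event_id) < $max;
}

$log = [
  ['event_id' => 7], ['event_id' => 7], ['event_id' => 7],
];
var_dump(vc_sync_should_retry($log, 7)); // bool(false): cap reached, stop spinning
var_dump(vc_sync_should_retry($log, 8)); // bool(true): first attempt, retry allowed
```

This bounds the worst case sdboyer described: a permanently locked repository can occupy a worker at most a fixed number of times instead of recreating its job forever.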
Adding this here for the bot to take a look while I test it manually.

Also an interdiff for easier review; in summary: …
Comment #11
marvil07 commented:

After trying this locally, making some minor changes (nothing conceptual, just minor bug fixes), and some manual testing, I think this is ready.
Comment #12
marvil07 commented:

Added to 7.x-1.x; the D6 backport is now pending.
Comment #13
marvil07 commented:

Simple cherry-pick. BTW, the follow-ups were also added to vc_git on D6 and D7.