our repository locking is very clumsy, as is our behavior in general when it comes to handling failures during repository synchronization. this really needs to be improved.

i've taken a first step by adding a VersioncontrolSynchronizationException class, which i want us to begin throwing whenever something unexpected happens during synchronization. we may also want to add another column to {versioncontrol_repositories} that is used as a sort of "failures flag" field - a bitmask where 0 means healthy, and anything else means some manner of problem has been detected on the repository. if we do that, then we can begin catching errors during synchronization (or, say, repeated lock collisions) and flag that field.

in any case, this is a meta issue so we can discuss other possibilities here, then fan out other issues for them.

Comments

sdboyer’s picture

a ton of progress has been made towards this on the sync-tinkering branch. two more exceptions (VersioncontrolNeedsFullSynchronizationException and VersioncontrolNeedsFullSynchronizationException), a ton of flow control added in VersioncontrolRepository::sync() and VersioncontrolRepository::syncEvent(). those two methods are now expected to be the main entry point for pretty much all synchronization. commit diff for the bulk of those changes.

marvil07’s picture

related #1237764: Create custom exceptions (almost duplicate?)

sdboyer’s picture

definitely related, but not a duplicate so much. that issue refers to all of our exceptions, while this one is concerned with exceptions thrown by a particular type of plugin.

sdboyer’s picture

ok so this is definitely a blocker, because right now we indicate success on a sync attempt when we actually failed b/c the repo was locked (by another process). what HAS been improved is that we almost never have erroneous locks left around...so we can actually trust the lock value. here's what we need to do:

  • capture the boolean return from VersioncontrolRepositoryHistorySynchronizerInterface::sync*() and use that to populate $log->successful instead of just manually setting it.
  • introduce a new type of exception to indicate a concurrency/locking collision - VersioncontrolSynchronizationConcurrencyException should work - and a special catch path in the sync*() methods to deal with that exception.
  • add @throws documentation to VersioncontrolRepositoryHistorySynchronizerInterface methods.
  • introduce a new constant to represent this type of error, and record it as a flag on the $log->error bitfield.
marvil07’s picture

Assigned: Unassigned » sdboyer
Status: Active » Needs review
StatusFileSize
new10.31 KB

@sdboyer: What do you think about this patch?

Not sure if we need VersioncontrolSynchronizationConcurrencyException, we already have VersioncontrolLockedRepositoryException.

And about errors field, not sure how we could catch many errors at the same time, if we use exceptions to communicate(so only one is used). I'm still in favor of having the field, but unless backend override the VersioncontrolRepository::sync*() it's not possible to have more than one error, and maybe that's fine :-)

marvil07’s picture

Assigned: sdboyer » marvil07
Status: Needs review » Needs work

Uhm, I was reviewing this, but it still needs work, will update soon. i.e. when repository is locked is not storing the right message on event sync log table.

marvil07’s picture

Assigned: marvil07 » sdboyer
Priority: Normal » Major
Status: Needs work » Needs review
StatusFileSize
new2.64 KB
new5.14 KB
new14.89 KB

Copying the commit message:

Issue #1642326: Recover from locked exception on repository syncEvent().

It is possible and necessary to do a recovery when syncing from an
event, otherwise the queue item is completely ignored.
To do that the queue item is en-queued again when the repository is
found locked.

Two main changes to be able to do that:

- A new abstract method is added to repository class,
generateRawData($event), which converts an event into the raw string
that the vcs backend hook script produces.
- Now, the repository generateCodeArrivalEvent() method can possibly
also receive a $data['elid'], which means the event was already
processed before, so a simple event load is expected in that case.

Naturally this needs a new method implementation and a change on the generateCodeArrivalEvent() method for all backends. I'm adding the patch for vc_git here for convenience.

On the other side, about the solution, I saw two alternatives:
- Use a global flag on repository table to indicate it needs a fullsync the next time.
- Re-enqueue a job to do sync of the same event again.

I found the second more natural, and that's what is implemented.

@sdboyer: What do you think about this? I would like to know your opinion.

Status: Needs review » Needs work

The last submitted patch, 0001-Issue-1642326-Improve-repository-synchronization-fai.patch, failed testing.

sdboyer’s picture

Status: Needs work » Needs review

the architecture here makes sense, i'm +1 to that. being able to effectively simulate the event for later reruns makes total sense. i'm not gonna be able to go over it line-by-line to verify that the

+++ b/versioncontrol.module
@@ -72,6 +72,13 @@ define('VERSIONCONTROL_ITEM_DIRECTORY_DELETED', 4);
+define('VERSIONCONTROL_REPOSYNC_ERROR_LOCKED', 0x01);

Let's make this a constant on the interface.

+++ b/includes/VersioncontrolRepository.php
@@ -249,29 +258,47 @@ abstract class VersioncontrolRepository implements VersioncontrolEntityInterface
+      $queue = DrupalQueue::get('versioncontrol_reposync', TRUE);
+      $queue->createItem($payload);

this worries me. could we check and see if there's a standard approach to rejecting a queued item, or something? there's the possibility, particularly on d.o, that if a repo is locked then we just spin on it endlessly. and on d.o...one locked repo could end up instantly locking every worker as the job repeatedly recreates itself. it wouldn't prevent jobs from being processed, but it would peg cpu or db on production.

bottom line is, in order to automatically retry, we'll need to set up some kind of escape hatch.

marvil07’s picture

Assigned: sdboyer » marvil07
StatusFileSize
new5 KB
new14.81 KB

bottom line is, in order to automatically retry, we'll need to set up some kind of escape hatch.

Based on the code from drupal_cron_run()

  foreach ($queues as $queue_name => $info) {
    $function = $info['worker callback'];
    $end = time() + (isset($info['time']) ? $info['time'] : 15);
    $queue = DrupalQueue::get($queue_name);
    while (time() < $end && ($item = $queue->claimItem())) {
      $function($item->data);
      $queue->deleteItem($item);
    }
  }

I think there is no standard way to deal re-added items to a queue, main system by design does not know about it(it just remove the item from queue after the worker callback is called), so I have just added a check for the number of times the event has been processed base on the versioncontrol_sync_log table entries and a new global variable versioncontrol_reposync_max_syncevent_retries.

Adding this here for bot to take a look, while I test it manually.

Also an interdiff for easier review, in summary:

  • Move the locked syncronization error constant to the sync interface.
  • Only enqueue the resync again on locked repository exception if we are under a limit of retries.
  • Implement the new repository generateRawData() method on the test backend.
marvil07’s picture

StatusFileSize
new2.84 KB
new15.02 KB

After trying this locally and making some minor changes(not really on ideas but on minor bugs) and some manual testing, I think this is ready.

marvil07’s picture

Version: 7.x-1.x-dev » 6.x-2.x-dev
Status: Needs review » Needs work
Issue tags: +Needs backport to D6

Added to 7.x-1.x, now D6 backport is pending

marvil07’s picture

Status: Needs work » Fixed
Issue tags: -Needs backport to D6

Simple cherry-pick. BTW also added on d6 and d7 at vc_git the follow ups.

Automatically closed -- issue fixed for 2 weeks with no activity.

  • Commit 94d12da on 7.x-1.x, drush-vc-sync-unlock by marvil07:
    Issue #1642326: Improve repository synchronization failure behaviour...

  • Commit 94d12da on 7.x-1.x by marvil07:
    Issue #1642326: Improve repository synchronization failure behaviour...