Revisit syncEvent() logic on main repository sync plugin [#1546110]

Related Issues

This is a child task and a blocker of #1568176: [meta] Get latest D6 versioncontrol code deployed

Problem/Motivation

VersioncontrolGitRepositoryHistorySynchronizerDefault::syncEvent() and some general logic about git event handling is not working correctly and needs some revision/fixing.

Proposed resolution

Iterative review/change the code until:

It stores the information about events coherently.
syncEvent() has better performance than syncFull() for the default git history syncronizer plugin.

Remaining tasks

~~Change logic to store the information about git events coherently.~~ Done.
~~Remove commits data member from ref changes.~~ Done.
~~Move any exec callback to repsync plugin.~~ Done.
~~Fix event tests.~~ Done.
~~Add reftype to git event data pk.~~ Done.
~~Use new constants for ref types(instead of label constants), so we can extend them in the future(i.e. notes)~~ Done.
~~Make Git token integration tests pass again.~~
~~Make Git event data integrity tests pass again.~~
~~Re-test for performance.~~
~~Update: Make a port for 6.x-2.x(i.e. remember change the hook_update_N)~~

User interface changes

None.

API changes

Many, but since we only have one history syncronizer plugin for git now, and we are changing it here too, nothing affecting external code.

Original report by sdboyer

I never finished properly reviewing/testing the logic cvangysel put in VersioncontrolGitRepositoryHistorySynchronizerDefault::syncEvent() before we merged generic-reposync back in at Denver, and there are some outstanding issues. At minimum, I've noticed that the logic is currently set to not parse commits if more than 5 came in on a push - this was probably the result of a misunderstanding, as while the commits absolutely need to be parsed into the DB, there's a limit on the number we should explicitly record on the event object itself.

There may well be other problems with it - need to run some more tests.

Comment	File	Size	Author
#5	git-repo-sync-event-take-1.patch	10.46 KB	marvil07

Comments

Comment #1

sdboyer commented 24 April 2012 at 23:06

Issue tags:

+git, +blocking versioncontrol 6.x-2.x deployment, +drupal.org D7

tagging

Comment #2

sdboyer commented 25 April 2012 at 00:03

Title:

Revisit syncEvent() logic on main

» Revisit syncEvent() logic on main repository sync plugin

fixing title

Comment #3

marvil07 commented 25 April 2012 at 17:27

Version:

6.x-2.x-dev

» 7.x-1.x-dev

First on D7, then backport.

Comment #4

marvil07 commented 17 May 2012 at 02:09

Assigned:	Unassigned	» marvil07
Priority:	Normal	» Major

I was supposed to create another issue about git ref changes object commits data member inclusion or not, but going a little deeper into this, I end up here.

We definitely have some code to rewrite on the default git reposync plugin.

I will be reporting about what I find.

Comment #5

marvil07 commented 17 May 2012 at 10:55

Status:

Active

» Needs work

Status	File	Size
new	git-repo-sync-event-take-1.patch	10.46 KB

The following patch make event logic work by storing git commit hashes(instead of VersioncontrolOperation objects) on the versioncontrol_git_event_data.commits field from generateCodeArrivalEvent(). Not sure yet if we want that, or instead load those commits dynamically after at syncEvent() based on versioncontrol_git_event_data old_sha1 and new_sha1 fields (the problem with that approach is to not be able to replicate the information from the table in future if some commits get removed from the repository).

In the other side I have noticed it's syncEvent() is now slower than syncFull(), so we definitely want to improve it more(it should be faster, that's the point).

Comment #6

senpai commented 18 May 2012 at 20:03

Issue tags:

+sprint 2

Tagging for Sprint 2.

Comment #7

eliza411 commented 18 May 2012 at 20:39

Issue tags:

+sprint 3

Tagging for Sprint 3.

Comment #8

sdboyer commented 25 May 2012 at 19:54

basic smoke testing indicates this mostly working in the git6 environment. woohoo! commits make it into the db, are properly associated with users, and entries are made in the versioncontrol_event_log and versioncontrol_git_event_data tables. and they're done not just for normal pushes, but for tag & branch creation/deletion, even with no new commits pushed. we also seem to handle the case of tag reassignment fairly elegantly. excellent.

couple problems - we're still inserting the full array of commit hashes into the 'commits' text field, and not concatenating it to 5 when it's more than that, as we'd discussed. we're also still inserting ALL the hashes when a new branch is pushed...which is the biggest and riskiest case. like, i just saw 700 hashes show up when i pushed a new branch to versioncontrol. the approach there should be that we insert none of the hashes at all, leaving that array empty. we may want to entertain doing a bit of custom logic in order to determine the branch from which the new branch most likely descends, but that's quite optional and can be done later.

we also need to add a {versioncontrol_git_event_data}.commit_count field that records the total number of commits pushed to any given ref. yes, that's derivable info, but it's also totally nonvolatile and thus worth 'caching' it as the lookup is nontrivial to do during a normal request.

what tests did you run that indicated this is slower than syncFull()? and do you have any intuitions about why?

Comment #9

marvil07 commented 27 May 2012 at 06:49

After some discussion, we agree to remove the commits field on ref changes, and also from versioncontrol_git_event_data table, I have just pushed one commit about it(see 3fc73c9bf3e91486605efec0515f6dddde92fd7b on generic-reposync-follow-up topic branch).

Still missing:

Move any exec callback to repsync plugin(i.e. rm them from generateCodeArrivalEvent)
Re-test for performance. btw sdboyer: I end up mentioning it's slower from a simple comparison of times between fetch button(syncFull) and parsing from git post-receive hook(syncEvent), specially for new branches, but maybe it has changed with the last commit. Anyway I will try a comparison again when I had removed the callbacks.
Update: Make a port for 6.x-2.x(i.e. remember change the hook_update_N)

Comment #10

marvil07 commented 28 May 2012 at 06:15

I've added two more changes to the topic branch:

e440551 Add one ref commits data member missing change.
3b6f353 Move fast-forward detection to syncEvent method.

Comment #11

sdboyer commented 30 May 2012 at 03:18

sounds good. i'll have a look at these changes as soon as i can, probably thursday.

Comment #13

marvil07 commented 31 May 2012 at 21:37

I have just duplicated #1497392: Fix reposync related tests in favor of this issue, and I have an uncompleted patch for the git event tests I will be posting here soon.

Comment #14

marvil07 commented 1 June 2012 at 07:31

Sadly I still do not have the tests ready. In the other side I have identified two more pending corrections:

add reftype to git event data pk. Git allows to have the same string as name of a branch and a label, so events should support that too(i.e. push them both)
use new constants for ref types(instead of label constants), so we can extend them in the future(i.e. notes)

Comment #14.0

marvil07 commented 5 June 2012 at 19:13

Issue summary:

View changes

Convert description into issue summary.

Comment #14.1

marvil07 commented 5 June 2012 at 19:14

Issue summary:

View changes

minor removal of unnecessary comment.

Comment #15

marvil07 commented 5 June 2012 at 19:22

New code on the topic branch:

96f53a2 Add another ref commits data member missing change.
d9d5558 Avoid using refnames as event refs indexes.
3b931a5 Remove drupal_write_record from event update.
7f84e9a Update git event tests to use new code. <- git event test working!
439be26 Identation fix

I have just updated the issue to have a full issue summary with details of what's left instead of keep posting items on comments about it.

Comment #16

senpai commented 11 June 2012 at 19:37

Issue tags:

+sprint 4

Tagging for Sprint 4.

Comment #17

marvil07 commented 14 June 2012 at 19:00

Two more changes on the topic branch:

e104163 Added reftype as part of the {versioncontrol_git_event_data} PK.
ac578e7 Use new constants to refer event ref types.

Comment #17.0

marvil07 commented 14 June 2012 at 19:00

Issue summary:

View changes

minor: use real html listings

Comment #18

sdboyer commented 14 June 2012 at 20:46

great. i'm gonna dig in on this tonight, too, see what we can get done before scrum tomorrow.

Comment #18.0

sdboyer commented 14 June 2012 at 20:46

Issue summary:

View changes

Formatting pending list and updating it to reflect last comment.

Comment #19

marvil07 commented 14 June 2012 at 21:01

A have added missing test to fix on the TODO list:

Currently passing:
- Git data integrity tests
- Git event tests
Needs to be fixed:
- Git token integration tests
- Git event data integrity tests.

Those not-passing tests probably indicate some code still needs a little tweaking(or maybe the tests itself).

Comment #20

marvil07 commented 15 June 2012 at 11:43

Two more commits on the topic branch:

db7b464 Check existing labels before adding them to operations at default reposync plugin.
This fixes a common error on both failing tests.

703acd2 Port token support to D7.
Token support was broken because we forgot to port it. It seems to be working fine, i.e.:

$r = versioncontrol_repository_load($repository_id);
$events = $r->loadEvents(array($elid));
print token_replace('[event-git:refschanges]', array('VersioncontrolEvent' => $es[$elid]));

But for some reason not inside the token tests.

Comment #21

sdboyer commented 15 June 2012 at 14:50

ok, with those latest changes pulled, the token tests are now passing, great. now i've just got four failures on the git event data integrity. will investigate those.

Comment #22

marvil07 commented 15 June 2012 at 19:36

ok, with those latest changes pulled, the token tests are now passing, great.

That's interesting, I have the token tests not passing locally here.

Comment #22.0

marvil07 commented 15 June 2012 at 19:36

Issue summary:

View changes

Add tests to fix on TODO

Comment #22.1

senpai commented 15 June 2012 at 20:09

Issue summary:

View changes

This is a child task and a blockerof #1568176: [meta] Get latest D6 versioncontrol code deployed

Comment #23

sdboyer commented 17 June 2012 at 00:18

i've fixed a number of problems. the first and most glaring one was that our tests, when simulating pushes, did not actually reflect what those pushes would/should have done in the repository. as a result, our history synchronizer could get quite confused, because it would receive a bunch of information about the state of the repository that was inconsistent with the data it directly fetched from the repository during the course of normal operation.

from there, some new problems emerged, primarily that the handling of branch deletions was kinda entirely broken. the primary issue is that we don't do the garbage collection we should - we don't clean out old commits that are no longer associated with any branch. Technically, this should be easy - it should just be the difference between:

      elseif ($ref->eventDeletedMe() && empty($label_repo) && ($label = $label_db)) {
        $label->delete();
      }

and

      elseif ($ref->eventDeletedMe() && empty($label_repo) && ($label = $label_db)) {
        $label->delete(array('nested' => TRUE));
      }

because we included support for this exact sort of thing in the base API itself. however, this is one of those places where i don't expect the base API to scale very well, as its approach to performing this delete is to load ALL of the commits on a branch, then iterate over them and check to see if they're only attached to the branch that's being deleted. that's an unacceptably large number, and i fully expect memory exhaustion to occur as a result. so basically, that's another todo for us, and i've created an issue - #1642310: Optimize logic for branch deletion with nested option.

from there, we still had the problem that the subsequent branch/commits reconciliation logic simply didn't check to see if the branch being worked on had been deleted (if it had, it could be skipped). however, that led me to another problem - VersioncontrolGitRepositoryHistorySynchronizerDefault::getCommitInterval() is wrong:

  protected function getCommitInterval($start, $end) {
    if ($start == GIT_NULL_REV) {
      $range = $end;
    }
    elseif ($end == GIT_NULL_REV) {
      $range = "$start..";
    }
    else {
      $range = "$start..$end";
    }

    $command = "rev-list --reverse $range";

the idea here is basically to be a passthru to git rev-list , and if the null rev is passed for either start or end, it's simply dropped. dropping it is the problem, though - without the second argument, the .. notation assumes HEAD. which means the range being reported is the difference between the first SHA and whatever HEAD happens to be. which is nothing like what we're looking for there. so, i've switched out that second if statement with a new Exception i commushed to vcapi - VersioncontrolSynchronizationException. i'd like us to start using this exception - a lot - in order to make it clearer when and where our failures are occurring during synchronization. i've also opened an issue for that - #1642326: [meta] Improve repo locking & behavior on synchronization failures.

with that cleared up, we still have another potential memory exhaustion point here, for much the same reason that there was an issue with the deletion logic: a new branch being passed through that commit interval logic will get back ALL commits, all the way back to root. for core, loading all of those up only to discover that a scant few out of tens of thousands are actually new, is *not* an acceptable approach. so i'm finishing off a commit that introduces a heuristic trick with git merge-base. instead of going all the way back to the root, we merge-base against HEAD in order to determine a shared commit that is likely to be a good shared base, then generate the interval against that. we could pick any branch, but in mos repositories, the branch set as HEAD will be the most likely one from which branches are created, and so is likely to have the smallest interval. we could engage in more inspection to definitively determine which branch is closest, but it's not worth delaying this patch for that logic.

there are two cases i can think of in which this won't entirely work:

if the push also updated the bare repo's HEAD, and the new branch diverges from HEAD at one of those new commits being pushed up, then we'll miss a couple commits in that sequence. however, this shouldn't be an issue (at least in the immediate term) because the later passthru which processes the HEAD ref will catch those missing commits. it may, however, become an issue later when we attempt to do weighting, as processing those commits out of sequence could be problematic.
since we are so shitty about letting people set HEAD on d.o, it's quite possible that HEAD may not exist. this is eminently fixable, but it does need resolution, which we should do in a follow-up. it'd be resolved by adding the more complex nearest-parent logic.

Comment #24

sdboyer commented 17 June 2012 at 05:47

ok, ran out of patience on this, time for sleep. there are still several failing tests, and i *cannot* figure out why. it's seeming like it's refusing to insert new branches into the db that really should be there.

we're pretty close...though i am again reminded just how not an exhaustive job we have done of planning out all the various situations this code path has to handle. it's visible in the spaghetti in that code, as well as in the lack of meta-level structure & organization in our tests.

it makes me very uncomfortable.

Comment #25

sdboyer commented 18 June 2012 at 16:03

...and that approach creates a new problem, because we end up not creating the commit<->branch mapping in {versioncontrol_operation_labels}. ugh. it's also sticky because the merge-base i added returns something non-useful if the tip of the new branch == HEAD, or any ancestor of HEAD.

so, i'm still tinkering. i really want to separate & fully comment this logic, though, as it's very difficult to follow everything that's going on.

Comment #26

sdboyer commented 18 June 2012 at 20:27

Version:	7.x-1.x-dev	» 6.x-2.x-dev
Status:	Needs work	» Patch (to be ported)

OK, i removed the merge-base logic and just let it go as-is. after some more fixes to the tests, everything is passing.

which does not mean all is well. we still have potential memory-busting problems in large repositories. and, of course, the longstanding problem of a swelling {versioncontrol_operation_labels} table.

i can't wait until we can switch to some system that's more reliant on fetching data directly from the repository at runtime.

the token tests pass, at least for me, so that's close enough to fixed for me. i've merged the branch back in, and am now working on the backport to 6.x.

Comment #27

sdboyer commented 18 June 2012 at 21:10

oh, a note - i was meaning to try an array_chunk() approach to cutting down the number of commit objects we load simultaneously in order to decrease the likelihood of the memory-busting problem. if it works like it should, it could be a good enough solution that we don't really need to look at anything else. basically, we just array_chunk() the results of the rev-list that comes back from getCommitInterval(), then wrap the remaining logic in a foreach over the chunks. i think everything inside there is isolated from needing overall state, so the chunking should be transparent.

Comment #28

marvil07 commented 19 June 2012 at 04:54

Assigned:

marvil07

» sdboyer

All test passing! \o/ (except token, but that's irrelevant now ;-))
Great steps taken on the last commits, thanks!

About {versioncontrol_operation_labels}, yes, that's definitely a bottleneck, but I do not see a change for a direct repository access near the future of the module. Anyway why we would like a VCS abstraction? Mayeb it worths to remember that we are now optimizing the sync's, but that's not so usual as the read's, which I guess is our main target.

About chunking the hash array before loading/update, it makes sense, and sounds like we should use a variable to define the chunks size(or how do you define "high memory", we are using drupal anyway :-p).

I'm assigning this to you as your comment on 26.
AFAIR the only commit you should not add to 6.x-1.x is a3a791bbd7a17bed83b9c1c2e67f9e53d2621cad, and we need to change some that are touching hook_update_N() functions: 3fc73c9bf3e91486605efec0515f6dddde92fd7b and e1041632b9a4112f66791b501379baab4fad71f2.

Comment #29

sdboyer commented 19 June 2012 at 11:50

Assigned:

sdboyer

» marvil07

yep, i skipped that commit, and a couple others.

6.x-1.x backport is looking pretty good, though we're getting some weird errors on the $event->ref->refname being set to something bizarre and too long. tough to tell just yet what's causing it, as the code is really pretty nearly identical between 6 and 7.

Comment #30

sdboyer commented 19 June 2012 at 13:52

Assigned:

marvil07

» sdboyer

whoops, xpost

Comment #31

sdboyer commented 19 June 2012 at 17:44

Status:

Patch (to be ported)

» Fixed

About {versioncontrol_operation_labels}, yes, that's definitely a bottleneck, but I do not see a change for a direct repository access near the future of the module. Anyway why we would like a VCS abstraction? Mayeb it worths to remember that we are now optimizing the sync's, but that's not so usual as the read's, which I guess is our main target.

there are already 2 million rows in that table on live d.o, and it's just gonna get worse. a lot worse. it is currently the primary db-level weak link - at read or write time. write because index generation gets nasty, and read for the obvious reasons.

on a separate note, though, i've finished the backport (now that we've sorted out yet more weird issues with testing), pushed all of the changes up, and am now testing on git6. woot woot.

Comment #31.0

sdboyer commented 19 June 2012 at 17:45

Issue summary:

View changes

Forgot to add a space.

Comment #32

sdboyer commented 20 June 2012 at 09:04

bit of a note, i guess, not sure where else to put it...i've made sync-tinkering branches in both vc_git and vcapi. there are just a *massive* number of additional improvements in there. they're quite near finished, but more importantly, are not the sort of changes that would block us up anyway. after a little more work (~30 minutes) to VersioncontrolRepository::syncEvent(), we'll be ready to test on git6 again, and i expect it'll go more smoothly than it did before (these changes solve one of the problems i was having on git6).

all these changes will need-forward-porting to d7, but it should be trivial cherry-picking.

Comment #33

4 July 2012 at 09:11

Automatically closed -- issue fixed for 2 weeks with no activity.

Comment #33.0

(not verified) commented 4 July 2012 at 09:11

Issue summary:

View changes

striking out remaining tasks

Comment #34

26 March 2014 at 15:02

Commit 0ba5d3a on 7.x-1.x, fix-invalid-default-branches, fullsync-memory by marvil07:
```
Issue #1546110: Unify initialization of sync* methods for git_default...
```
Commit 738e64d on fix-invalid-default-branches, fullsync-memory by marvil07:
```
Issue #1546110: Reuse repo exec on default reposync plugin.

It makes...
```
Commit 12cfa8e on fix-invalid-default-branches, fullsync-memory by marvil07:
```
Issue #1546110: Change logic related to default reposyn plugin.

-...
```
Commit 3fc73c9 on fix-invalid-default-branches, fullsync-memory by marvil07:
```
Issue #1546110: Remove commits from ref changes.
```
Commit 3b6f353 on fix-invalid-default-branches, fullsync-memory by marvil07:
```
Issue #1546110: Move fast-forward detection to syncEvent method.
```

Commit d9d5558 on fix-invalid-default-branches, fullsync-memory by marvil07:

Issue #1546110: Avoid using refnames as event refs indexes.

Git allows...

Commit 3b931a5 on fix-invalid-default-branches, fullsync-memory by marvil07:

Issue #1546110: Remove drupal_write_record from event update.

Now it's...

Commit 7f84e9a on fix-invalid-default-branches, fullsync-memory by marvil07:
```
Issue #1546110: Update git event tests to use new code.
```
Commit e104163 on fix-invalid-default-branches, fullsync-memory by marvil07:
```
Issue #1546110: Added reftype as part of the {...
```

Commit ac578e7 on fix-invalid-default-branches, fullsync-memory by marvil07:

Issue #1546110: Use new constants to refer event ref types.

Instead of...

Commit e0880ca on fix-invalid-default-branches, fullsync-memory by marvil07:

Issue #1546110: Check existing labels before adding them to operations...

Commit a3a791b on fix-invalid-default-branches, fullsync-memory by marvil07:
```
Issue #1546110: Port token support to D7.
```
Commit e4c2482 on 7.x-1.x, fix-invalid-default-branches, fullsync-memory by sdboyer:
```
Issue #1546110: numerous fixups following the generic-reposync merge....
```
Commit dee034d on 7.x-1.x, fix-invalid-default-branches, fullsync-memory by marvil07:
```
Issue #1546110: Merge remote-tracking branch 'origin/sync-tinkering-d7'...
```