Related Issues

This is a child task and a blocker of #1568176: [meta] Get latest D6 versioncontrol code deployed

Problem/Motivation

VersioncontrolGitRepositoryHistorySynchronizerDefault::syncEvent() and some general logic about git event handling is not working correctly and needs some revision/fixing.

Proposed resolution

Iterative review/change the code until:

  • It stores the information about events coherently.
  • syncEvent() has better performance than syncFull() for the default git history syncronizer plugin.

Remaining tasks

  • Change logic to store the information about git events coherently. Done.
  • Remove commits data member from ref changes. Done.
  • Move any exec callback to repsync plugin. Done.
  • Fix event tests. Done.
  • Add reftype to git event data pk. Done.
  • Use new constants for ref types(instead of label constants), so we can extend them in the future(i.e. notes) Done.
  • Make Git token integration tests pass again.
  • Make Git event data integrity tests pass again.
  • Re-test for performance.
  • Update: Make a port for 6.x-2.x(i.e. remember change the hook_update_N)

User interface changes

None.

API changes

Many, but since we only have one history syncronizer plugin for git now, and we are changing it here too, nothing affecting external code.

Original report by sdboyer

I never finished properly reviewing/testing the logic cvangysel put in VersioncontrolGitRepositoryHistorySynchronizerDefault::syncEvent() before we merged generic-reposync back in at Denver, and there are some outstanding issues. At minimum, I've noticed that the logic is currently set to not parse commits if more than 5 came in on a push - this was probably the result of a misunderstanding, as while the commits absolutely need to be parsed into the DB, there's a limit on the number we should explicitly record on the event object itself.

There may well be other problems with it - need to run some more tests.

CommentFileSizeAuthor
#5 git-repo-sync-event-take-1.patch10.46 KBmarvil07
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

sdboyer’s picture

sdboyer’s picture

Title: Revisit syncEvent() logic on main » Revisit syncEvent() logic on main repository sync plugin

fixing title

marvil07’s picture

Version: 6.x-2.x-dev » 7.x-1.x-dev

First on D7, then backport.

marvil07’s picture

Assigned: Unassigned » marvil07
Priority: Normal » Major

I was supposed to create another issue about git ref changes object commits data member inclusion or not, but going a little deeper into this, I end up here.

We definitely have some code to rewrite on the default git reposync plugin.

I will be reporting about what I find.

marvil07’s picture

Status: Active » Needs work
FileSize
10.46 KB

The following patch make event logic work by storing git commit hashes(instead of VersioncontrolOperation objects) on the versioncontrol_git_event_data.commits field from generateCodeArrivalEvent(). Not sure yet if we want that, or instead load those commits dynamically after at syncEvent() based on versioncontrol_git_event_data old_sha1 and new_sha1 fields (the problem with that approach is to not be able to replicate the information from the table in future if some commits get removed from the repository).

In the other side I have noticed it's syncEvent() is now slower than syncFull(), so we definitely want to improve it more(it should be faster, that's the point).

Senpai’s picture

Issue tags: +sprint 2

Tagging for Sprint 2.

eliza411’s picture

Issue tags: +sprint 3

Tagging for Sprint 3.

sdboyer’s picture

basic smoke testing indicates this mostly working in the git6 environment. woohoo! commits make it into the db, are properly associated with users, and entries are made in the versioncontrol_event_log and versioncontrol_git_event_data tables. and they're done not just for normal pushes, but for tag & branch creation/deletion, even with no new commits pushed. we also seem to handle the case of tag reassignment fairly elegantly. excellent.

couple problems - we're still inserting the full array of commit hashes into the 'commits' text field, and not concatenating it to 5 when it's more than that, as we'd discussed. we're also still inserting ALL the hashes when a new branch is pushed...which is the biggest and riskiest case. like, i just saw 700 hashes show up when i pushed a new branch to versioncontrol. the approach there should be that we insert none of the hashes at all, leaving that array empty. we may want to entertain doing a bit of custom logic in order to determine the branch from which the new branch most likely descends, but that's quite optional and can be done later.

we also need to add a {versioncontrol_git_event_data}.commit_count field that records the total number of commits pushed to any given ref. yes, that's derivable info, but it's also totally nonvolatile and thus worth 'caching' it as the lookup is nontrivial to do during a normal request.

what tests did you run that indicated this is slower than syncFull()? and do you have any intuitions about why?

marvil07’s picture

After some discussion, we agree to remove the commits field on ref changes, and also from versioncontrol_git_event_data table, I have just pushed one commit about it(see 3fc73c9bf3e91486605efec0515f6dddde92fd7b on generic-reposync-follow-up topic branch).

Still missing:

  • Move any exec callback to repsync plugin(i.e. rm them from generateCodeArrivalEvent)
  • Re-test for performance. btw sdboyer: I end up mentioning it's slower from a simple comparison of times between fetch button(syncFull) and parsing from git post-receive hook(syncEvent), specially for new branches, but maybe it has changed with the last commit. Anyway I will try a comparison again when I had removed the callbacks.
  • Update: Make a port for 6.x-2.x(i.e. remember change the hook_update_N)
marvil07’s picture

I've added two more changes to the topic branch:

  • e440551 Add one ref commits data member missing change.
  • 3b6f353 Move fast-forward detection to syncEvent method.
sdboyer’s picture

sounds good. i'll have a look at these changes as soon as i can, probably thursday.

marvil07’s picture

I have just duplicated #1497392: Fix reposync related tests in favor of this issue, and I have an uncompleted patch for the git event tests I will be posting here soon.

marvil07’s picture

Sadly I still do not have the tests ready. In the other side I have identified two more pending corrections:

  • add reftype to git event data pk. Git allows to have the same string as name of a branch and a label, so events should support that too(i.e. push them both)
  • use new constants for ref types(instead of label constants), so we can extend them in the future(i.e. notes)
marvil07’s picture

Issue summary: View changes

Convert description into issue summary.

marvil07’s picture

Issue summary: View changes

minor removal of unnecessary comment.

marvil07’s picture

New code on the topic branch:

  • 96f53a2 Add another ref commits data member missing change.
  • d9d5558 Avoid using refnames as event refs indexes.
  • 3b931a5 Remove drupal_write_record from event update.
  • 7f84e9a Update git event tests to use new code. <- git event test working!
  • 439be26 Identation fix

I have just updated the issue to have a full issue summary with details of what's left instead of keep posting items on comments about it.

Senpai’s picture

Issue tags: +sprint 4

Tagging for Sprint 4.

marvil07’s picture

Two more changes on the topic branch:

  • e104163 Added reftype as part of the {versioncontrol_git_event_data} PK.
  • ac578e7 Use new constants to refer event ref types.
marvil07’s picture

Issue summary: View changes

minor: use real html listings

sdboyer’s picture

great. i'm gonna dig in on this tonight, too, see what we can get done before scrum tomorrow.

sdboyer’s picture

Issue summary: View changes

Formatting pending list and updating it to reflect last comment.

marvil07’s picture

A have added missing test to fix on the TODO list:

  • Currently passing:
    • Git data integrity tests
    • Git event tests
  • Needs to be fixed:
    • Git token integration tests
    • Git event data integrity tests.

Those not-passing tests probably indicate some code still needs a little tweaking(or maybe the tests itself).

marvil07’s picture

Two more commits on the topic branch:

  • db7b464 Check existing labels before adding them to operations at default reposync plugin.
    This fixes a common error on both failing tests.
  • 703acd2 Port token support to D7.
    Token support was broken because we forgot to port it. It seems to be working fine, i.e.:
    $r = versioncontrol_repository_load($repository_id);
    $events = $r->loadEvents(array($elid));
    print token_replace('[event-git:refschanges]', array('VersioncontrolEvent' => $es[$elid]));
    

    But for some reason not inside the token tests.

sdboyer’s picture

ok, with those latest changes pulled, the token tests are now passing, great. now i've just got four failures on the git event data integrity. will investigate those.

marvil07’s picture

ok, with those latest changes pulled, the token tests are now passing, great.

That's interesting, I have the token tests not passing locally here.

marvil07’s picture

Issue summary: View changes

Add tests to fix on TODO

Senpai’s picture

Issue summary: View changes
sdboyer’s picture

i've fixed a number of problems. the first and most glaring one was that our tests, when simulating pushes, did not actually reflect what those pushes would/should have done in the repository. as a result, our history synchronizer could get quite confused, because it would receive a bunch of information about the state of the repository that was inconsistent with the data it directly fetched from the repository during the course of normal operation.

from there, some new problems emerged, primarily that the handling of branch deletions was kinda entirely broken. the primary issue is that we don't do the garbage collection we should - we don't clean out old commits that are no longer associated with any branch. Technically, this should be easy - it should just be the difference between:

      elseif ($ref->eventDeletedMe() && empty($label_repo) && ($label = $label_db)) {
        $label->delete();
      }

and

      elseif ($ref->eventDeletedMe() && empty($label_repo) && ($label = $label_db)) {
        $label->delete(array('nested' => TRUE));
      }

because we included support for this exact sort of thing in the base API itself. however, this is one of those places where i don't expect the base API to scale very well, as its approach to performing this delete is to load ALL of the commits on a branch, then iterate over them and check to see if they're only attached to the branch that's being deleted. that's an unacceptably large number, and i fully expect memory exhaustion to occur as a result. so basically, that's another todo for us, and i've created an issue - #1642310: Optimize logic for branch deletion with nested option.

from there, we still had the problem that the subsequent branch/commits reconciliation logic simply didn't check to see if the branch being worked on had been deleted (if it had, it could be skipped). however, that led me to another problem - VersioncontrolGitRepositoryHistorySynchronizerDefault::getCommitInterval() is wrong:

  protected function getCommitInterval($start, $end) {
    if ($start == GIT_NULL_REV) {
      $range = $end;
    }
    elseif ($end == GIT_NULL_REV) {
      $range = "$start..";
    }
    else {
      $range = "$start..$end";
    }

    $command = "rev-list --reverse $range";

the idea here is basically to be a passthru to git rev-list , and if the null rev is passed for either start or end, it's simply dropped. dropping it is the problem, though - without the second argument, the .. notation assumes HEAD. which means the range being reported is the difference between the first SHA and whatever HEAD happens to be. which is nothing like what we're looking for there. so, i've switched out that second if statement with a new Exception i commushed to vcapi - VersioncontrolSynchronizationException. i'd like us to start using this exception - a lot - in order to make it clearer when and where our failures are occurring during synchronization. i've also opened an issue for that - #1642326: [meta] Improve repo locking & behavior on synchronization failures.

with that cleared up, we still have another potential memory exhaustion point here, for much the same reason that there was an issue with the deletion logic: a new branch being passed through that commit interval logic will get back ALL commits, all the way back to root. for core, loading all of those up only to discover that a scant few out of tens of thousands are actually new, is *not* an acceptable approach. so i'm finishing off a commit that introduces a heuristic trick with git merge-base. instead of going all the way back to the root, we merge-base against HEAD in order to determine a shared commit that is likely to be a good shared base, then generate the interval against that. we could pick any branch, but in mos repositories, the branch set as HEAD will be the most likely one from which branches are created, and so is likely to have the smallest interval. we could engage in more inspection to definitively determine which branch is closest, but it's not worth delaying this patch for that logic.

there are two cases i can think of in which this won't entirely work:

  • if the push also updated the bare repo's HEAD, and the new branch diverges from HEAD at one of those new commits being pushed up, then we'll miss a couple commits in that sequence. however, this shouldn't be an issue (at least in the immediate term) because the later passthru which processes the HEAD ref will catch those missing commits. it may, however, become an issue later when we attempt to do weighting, as processing those commits out of sequence could be problematic.
  • since we are so shitty about letting people set HEAD on d.o, it's quite possible that HEAD may not exist. this is eminently fixable, but it does need resolution, which we should do in a follow-up. it'd be resolved by adding the more complex nearest-parent logic.
sdboyer’s picture

ok, ran out of patience on this, time for sleep. there are still several failing tests, and i *cannot* figure out why. it's seeming like it's refusing to insert new branches into the db that really should be there.

we're pretty close...though i am again reminded just how not an exhaustive job we have done of planning out all the various situations this code path has to handle. it's visible in the spaghetti in that code, as well as in the lack of meta-level structure & organization in our tests.

it makes me very uncomfortable.

sdboyer’s picture

...and that approach creates a new problem, because we end up not creating the commit<->branch mapping in {versioncontrol_operation_labels}. ugh. it's also sticky because the merge-base i added returns something non-useful if the tip of the new branch == HEAD, or any ancestor of HEAD.

so, i'm still tinkering. i really want to separate & fully comment this logic, though, as it's very difficult to follow everything that's going on.

sdboyer’s picture

Version: 7.x-1.x-dev » 6.x-2.x-dev
Status: Needs work » Patch (to be ported)

OK, i removed the merge-base logic and just let it go as-is. after some more fixes to the tests, everything is passing.

which does not mean all is well. we still have potential memory-busting problems in large repositories. and, of course, the longstanding problem of a swelling {versioncontrol_operation_labels} table.

i can't wait until we can switch to some system that's more reliant on fetching data directly from the repository at runtime.

the token tests pass, at least for me, so that's close enough to fixed for me. i've merged the branch back in, and am now working on the backport to 6.x.

sdboyer’s picture

oh, a note - i was meaning to try an array_chunk() approach to cutting down the number of commit objects we load simultaneously in order to decrease the likelihood of the memory-busting problem. if it works like it should, it could be a good enough solution that we don't really need to look at anything else. basically, we just array_chunk() the results of the rev-list that comes back from getCommitInterval(), then wrap the remaining logic in a foreach over the chunks. i think everything inside there is isolated from needing overall state, so the chunking should be transparent.

marvil07’s picture

Assigned: marvil07 » sdboyer

All test passing! \o/ (except token, but that's irrelevant now ;-))
Great steps taken on the last commits, thanks!

About {versioncontrol_operation_labels}, yes, that's definitely a bottleneck, but I do not see a change for a direct repository access near the future of the module. Anyway why we would like a VCS abstraction? Mayeb it worths to remember that we are now optimizing the sync's, but that's not so usual as the read's, which I guess is our main target.

About chunking the hash array before loading/update, it makes sense, and sounds like we should use a variable to define the chunks size(or how do you define "high memory", we are using drupal anyway :-p).

I'm assigning this to you as your comment on 26.
AFAIR the only commit you should not add to 6.x-1.x is a3a791bbd7a17bed83b9c1c2e67f9e53d2621cad, and we need to change some that are touching hook_update_N() functions: 3fc73c9bf3e91486605efec0515f6dddde92fd7b and e1041632b9a4112f66791b501379baab4fad71f2.

sdboyer’s picture

Assigned: sdboyer » marvil07

yep, i skipped that commit, and a couple others.

6.x-1.x backport is looking pretty good, though we're getting some weird errors on the $event->ref->refname being set to something bizarre and too long. tough to tell just yet what's causing it, as the code is really pretty nearly identical between 6 and 7.

sdboyer’s picture

Assigned: marvil07 » sdboyer

whoops, xpost

sdboyer’s picture

Status: Patch (to be ported) » Fixed

About {versioncontrol_operation_labels}, yes, that's definitely a bottleneck, but I do not see a change for a direct repository access near the future of the module. Anyway why we would like a VCS abstraction? Mayeb it worths to remember that we are now optimizing the sync's, but that's not so usual as the read's, which I guess is our main target.

there are already 2 million rows in that table on live d.o, and it's just gonna get worse. a lot worse. it is currently the primary db-level weak link - at read or write time. write because index generation gets nasty, and read for the obvious reasons.

on a separate note, though, i've finished the backport (now that we've sorted out yet more weird issues with testing), pushed all of the changes up, and am now testing on git6. woot woot.

sdboyer’s picture

Issue summary: View changes

Forgot to add a space.

sdboyer’s picture

bit of a note, i guess, not sure where else to put it...i've made sync-tinkering branches in both vc_git and vcapi. there are just a *massive* number of additional improvements in there. they're quite near finished, but more importantly, are not the sort of changes that would block us up anyway. after a little more work (~30 minutes) to VersioncontrolRepository::syncEvent(), we'll be ready to test on git6 again, and i expect it'll go more smoothly than it did before (these changes solve one of the problems i was having on git6).

all these changes will need-forward-porting to d7, but it should be trivial cherry-picking.

Automatically closed -- issue fixed for 2 weeks with no activity.

Anonymous’s picture

Issue summary: View changes

striking out remaining tasks

  • Commit 0ba5d3a on 7.x-1.x, fix-invalid-default-branches, fullsync-memory by marvil07:
    Issue #1546110: Unify initialization of sync* methods for git_default...
  • Commit 738e64d on fix-invalid-default-branches, fullsync-memory by marvil07:
    Issue #1546110: Reuse repo exec on default reposync plugin.
    
    It makes...
  • Commit 12cfa8e on fix-invalid-default-branches, fullsync-memory by marvil07:
    Issue #1546110: Change logic related to default reposyn plugin.
    
    -...
  • Commit 3fc73c9 on fix-invalid-default-branches, fullsync-memory by marvil07:
    Issue #1546110: Remove commits from ref changes.
    
    
  • Commit 3b6f353 on fix-invalid-default-branches, fullsync-memory by marvil07:
    Issue #1546110: Move fast-forward detection to syncEvent method.
    
    
  • Commit d9d5558 on fix-invalid-default-branches, fullsync-memory by marvil07:
    Issue #1546110: Avoid using refnames as event refs indexes.
    
    Git allows...
  • Commit 3b931a5 on fix-invalid-default-branches, fullsync-memory by marvil07:
    Issue #1546110: Remove drupal_write_record from event update.
    
    Now it's...
  • Commit 7f84e9a on fix-invalid-default-branches, fullsync-memory by marvil07:
    Issue #1546110: Update git event tests to use new code.
    
    
  • Commit e104163 on fix-invalid-default-branches, fullsync-memory by marvil07:
    Issue #1546110: Added reftype as part of the {...
  • Commit ac578e7 on fix-invalid-default-branches, fullsync-memory by marvil07:
    Issue #1546110: Use new constants to refer event ref types.
    
    Instead of...
  • Commit e0880ca on fix-invalid-default-branches, fullsync-memory by marvil07:
    Issue #1546110: Check existing labels before adding them to operations...
  • Commit a3a791b on fix-invalid-default-branches, fullsync-memory by marvil07:
    Issue #1546110: Port token support to D7.
    
    
  • Commit e4c2482 on 7.x-1.x, fix-invalid-default-branches, fullsync-memory by sdboyer:
    Issue #1546110: numerous fixups following the generic-reposync merge....
  • Commit dee034d on 7.x-1.x, fix-invalid-default-branches, fullsync-memory by marvil07:
    Issue #1546110: Merge remote-tracking branch 'origin/sync-tinkering-d7'...