Related Issues
This is a child task and a blocker of #1568176: [meta] Get latest D6 versioncontrol code deployed
Problem/Motivation
VersioncontrolGitRepositoryHistorySynchronizerDefault::syncEvent() and some general logic about git event handling is not working correctly and needs some revision/fixing.
Proposed resolution
Iterative review/change the code until:
- It stores the information about events coherently.
- syncEvent() has better performance than syncFull() for the default git history syncronizer plugin.
Remaining tasks
Change logic to store the information about git events coherently.Done.Remove commits data member from ref changes.Done.Move any exec callback to repsync plugin.Done.Fix event tests.Done.Add reftype to git event data pk.Done.Use new constants for ref types(instead of label constants), so we can extend them in the future(i.e. notes)Done.Make Git token integration tests pass again.Make Git event data integrity tests pass again.Re-test for performance.Update: Make a port for 6.x-2.x(i.e. remember change the hook_update_N)
User interface changes
None.
API changes
Many, but since we only have one history syncronizer plugin for git now, and we are changing it here too, nothing affecting external code.
Original report by sdboyer
I never finished properly reviewing/testing the logic cvangysel put in VersioncontrolGitRepositoryHistorySynchronizerDefault::syncEvent() before we merged generic-reposync back in at Denver, and there are some outstanding issues. At minimum, I've noticed that the logic is currently set to not parse commits if more than 5 came in on a push - this was probably the result of a misunderstanding, as while the commits absolutely need to be parsed into the DB, there's a limit on the number we should explicitly record on the event object itself.
There may well be other problems with it - need to run some more tests.
Comment | File | Size | Author |
---|---|---|---|
#5 | git-repo-sync-event-take-1.patch | 10.46 KB | marvil07 |
Comments
Comment #1
sdboyer CreditAttribution: sdboyer commentedtagging
Comment #2
sdboyer CreditAttribution: sdboyer commentedfixing title
Comment #3
marvil07 CreditAttribution: marvil07 commentedFirst on D7, then backport.
Comment #4
marvil07 CreditAttribution: marvil07 commentedI was supposed to create another issue about git ref changes object commits data member inclusion or not, but going a little deeper into this, I end up here.
We definitely have some code to rewrite on the default git reposync plugin.
I will be reporting about what I find.
Comment #5
marvil07 CreditAttribution: marvil07 commentedThe following patch make event logic work by storing git commit hashes(instead of VersioncontrolOperation objects) on the versioncontrol_git_event_data.commits field from generateCodeArrivalEvent(). Not sure yet if we want that, or instead load those commits dynamically after at syncEvent() based on versioncontrol_git_event_data old_sha1 and new_sha1 fields (the problem with that approach is to not be able to replicate the information from the table in future if some commits get removed from the repository).
In the other side I have noticed it's syncEvent() is now slower than syncFull(), so we definitely want to improve it more(it should be faster, that's the point).
Comment #6
Senpai CreditAttribution: Senpai commentedTagging for Sprint 2.
Comment #7
eliza411 CreditAttribution: eliza411 commentedTagging for Sprint 3.
Comment #8
sdboyer CreditAttribution: sdboyer commentedbasic smoke testing indicates this mostly working in the git6 environment. woohoo! commits make it into the db, are properly associated with users, and entries are made in the
versioncontrol_event_log
andversioncontrol_git_event_data tables
. and they're done not just for normal pushes, but for tag & branch creation/deletion, even with no new commits pushed. we also seem to handle the case of tag reassignment fairly elegantly. excellent.couple problems - we're still inserting the full array of commit hashes into the 'commits' text field, and not concatenating it to 5 when it's more than that, as we'd discussed. we're also still inserting ALL the hashes when a new branch is pushed...which is the biggest and riskiest case. like, i just saw 700 hashes show up when i pushed a new branch to versioncontrol. the approach there should be that we insert none of the hashes at all, leaving that array empty. we may want to entertain doing a bit of custom logic in order to determine the branch from which the new branch most likely descends, but that's quite optional and can be done later.
we also need to add a
{versioncontrol_git_event_data}.commit_count
field that records the total number of commits pushed to any given ref. yes, that's derivable info, but it's also totally nonvolatile and thus worth 'caching' it as the lookup is nontrivial to do during a normal request.what tests did you run that indicated this is slower than syncFull()? and do you have any intuitions about why?
Comment #9
marvil07 CreditAttribution: marvil07 commentedAfter some discussion, we agree to remove the commits field on ref changes, and also from versioncontrol_git_event_data table, I have just pushed one commit about it(see 3fc73c9bf3e91486605efec0515f6dddde92fd7b on generic-reposync-follow-up topic branch).
Still missing:
Comment #10
marvil07 CreditAttribution: marvil07 commentedI've added two more changes to the topic branch:
Comment #11
sdboyer CreditAttribution: sdboyer commentedsounds good. i'll have a look at these changes as soon as i can, probably thursday.
Comment #13
marvil07 CreditAttribution: marvil07 commentedI have just duplicated #1497392: Fix reposync related tests in favor of this issue, and I have an uncompleted patch for the git event tests I will be posting here soon.
Comment #14
marvil07 CreditAttribution: marvil07 commentedSadly I still do not have the tests ready. In the other side I have identified two more pending corrections:
Comment #14.0
marvil07 CreditAttribution: marvil07 commentedConvert description into issue summary.
Comment #14.1
marvil07 CreditAttribution: marvil07 commentedminor removal of unnecessary comment.
Comment #15
marvil07 CreditAttribution: marvil07 commentedNew code on the topic branch:
I have just updated the issue to have a full issue summary with details of what's left instead of keep posting items on comments about it.
Comment #16
Senpai CreditAttribution: Senpai commentedTagging for Sprint 4.
Comment #17
marvil07 CreditAttribution: marvil07 commentedTwo more changes on the topic branch:
Comment #17.0
marvil07 CreditAttribution: marvil07 commentedminor: use real html listings
Comment #18
sdboyer CreditAttribution: sdboyer commentedgreat. i'm gonna dig in on this tonight, too, see what we can get done before scrum tomorrow.
Comment #18.0
sdboyer CreditAttribution: sdboyer commentedFormatting pending list and updating it to reflect last comment.
Comment #19
marvil07 CreditAttribution: marvil07 commentedA have added missing test to fix on the TODO list:
Those not-passing tests probably indicate some code still needs a little tweaking(or maybe the tests itself).
Comment #20
marvil07 CreditAttribution: marvil07 commentedTwo more commits on the topic branch:
This fixes a common error on both failing tests.
Token support was broken because we forgot to port it. It seems to be working fine, i.e.:
But for some reason not inside the token tests.
Comment #21
sdboyer CreditAttribution: sdboyer commentedok, with those latest changes pulled, the token tests are now passing, great. now i've just got four failures on the git event data integrity. will investigate those.
Comment #22
marvil07 CreditAttribution: marvil07 commentedThat's interesting, I have the token tests not passing locally here.
Comment #22.0
marvil07 CreditAttribution: marvil07 commentedAdd tests to fix on TODO
Comment #22.1
Senpai CreditAttribution: Senpai commentedThis is a child task and a blockerof #1568176: [meta] Get latest D6 versioncontrol code deployed
Comment #23
sdboyer CreditAttribution: sdboyer commentedi've fixed a number of problems. the first and most glaring one was that our tests, when simulating pushes, did not actually reflect what those pushes would/should have done in the repository. as a result, our history synchronizer could get quite confused, because it would receive a bunch of information about the state of the repository that was inconsistent with the data it directly fetched from the repository during the course of normal operation.
from there, some new problems emerged, primarily that the handling of branch deletions was kinda entirely broken. the primary issue is that we don't do the garbage collection we should - we don't clean out old commits that are no longer associated with any branch. Technically, this should be easy - it should just be the difference between:
and
because we included support for this exact sort of thing in the base API itself. however, this is one of those places where i don't expect the base API to scale very well, as its approach to performing this delete is to load ALL of the commits on a branch, then iterate over them and check to see if they're only attached to the branch that's being deleted. that's an unacceptably large number, and i fully expect memory exhaustion to occur as a result. so basically, that's another todo for us, and i've created an issue - #1642310: Optimize logic for branch deletion with nested option.
from there, we still had the problem that the subsequent branch/commits reconciliation logic simply didn't check to see if the branch being worked on had been deleted (if it had, it could be skipped). however, that led me to another problem -
VersioncontrolGitRepositoryHistorySynchronizerDefault::getCommitInterval()
is wrong:the idea here is basically to be a passthru to
git rev-list
, and if the null rev is passed for either start or end, it's simply dropped. dropping it is the problem, though - without the second argument, the..
notation assumes HEAD. which means the range being reported is the difference between the first SHA and whatever HEAD happens to be. which is nothing like what we're looking for there. so, i've switched out that second if statement with a new Exception i commushed to vcapi -VersioncontrolSynchronizationException
. i'd like us to start using this exception - a lot - in order to make it clearer when and where our failures are occurring during synchronization. i've also opened an issue for that - #1642326: [meta] Improve repo locking & behavior on synchronization failures.with that cleared up, we still have another potential memory exhaustion point here, for much the same reason that there was an issue with the deletion logic: a new branch being passed through that commit interval logic will get back ALL commits, all the way back to root. for core, loading all of those up only to discover that a scant few out of tens of thousands are actually new, is *not* an acceptable approach. so i'm finishing off a commit that introduces a heuristic trick with
git merge-base
. instead of going all the way back to the root, wemerge-base
against HEAD in order to determine a shared commit that is likely to be a good shared base, then generate the interval against that. we could pick any branch, but in mos repositories, the branch set as HEAD will be the most likely one from which branches are created, and so is likely to have the smallest interval. we could engage in more inspection to definitively determine which branch is closest, but it's not worth delaying this patch for that logic.there are two cases i can think of in which this won't entirely work:
Comment #24
sdboyer CreditAttribution: sdboyer commentedok, ran out of patience on this, time for sleep. there are still several failing tests, and i *cannot* figure out why. it's seeming like it's refusing to insert new branches into the db that really should be there.
we're pretty close...though i am again reminded just how not an exhaustive job we have done of planning out all the various situations this code path has to handle. it's visible in the spaghetti in that code, as well as in the lack of meta-level structure & organization in our tests.
it makes me very uncomfortable.
Comment #25
sdboyer CreditAttribution: sdboyer commented...and that approach creates a new problem, because we end up not creating the commit<->branch mapping in
{versioncontrol_operation_labels}
. ugh. it's also sticky because the merge-base i added returns something non-useful if the tip of the new branch == HEAD, or any ancestor of HEAD.so, i'm still tinkering. i really want to separate & fully comment this logic, though, as it's very difficult to follow everything that's going on.
Comment #26
sdboyer CreditAttribution: sdboyer commentedOK, i removed the merge-base logic and just let it go as-is. after some more fixes to the tests, everything is passing.
which does not mean all is well. we still have potential memory-busting problems in large repositories. and, of course, the longstanding problem of a swelling
{versioncontrol_operation_labels}
table.i can't wait until we can switch to some system that's more reliant on fetching data directly from the repository at runtime.
the token tests pass, at least for me, so that's close enough to fixed for me. i've merged the branch back in, and am now working on the backport to 6.x.
Comment #27
sdboyer CreditAttribution: sdboyer commentedoh, a note - i was meaning to try an
array_chunk()
approach to cutting down the number of commit objects we load simultaneously in order to decrease the likelihood of the memory-busting problem. if it works like it should, it could be a good enough solution that we don't really need to look at anything else. basically, we justarray_chunk()
the results of the rev-list that comes back from getCommitInterval(), then wrap the remaining logic in a foreach over the chunks. i think everything inside there is isolated from needing overall state, so the chunking should be transparent.Comment #28
marvil07 CreditAttribution: marvil07 commentedAll test passing! \o/ (except token, but that's irrelevant now ;-))
Great steps taken on the last commits, thanks!
About
{versioncontrol_operation_labels}
, yes, that's definitely a bottleneck, but I do not see a change for a direct repository access near the future of the module. Anyway why we would like a VCS abstraction? Mayeb it worths to remember that we are now optimizing the sync's, but that's not so usual as the read's, which I guess is our main target.About chunking the hash array before loading/update, it makes sense, and sounds like we should use a variable to define the chunks size(or how do you define "high memory", we are using drupal anyway :-p).
I'm assigning this to you as your comment on 26.
AFAIR the only commit you should not add to 6.x-1.x is a3a791bbd7a17bed83b9c1c2e67f9e53d2621cad, and we need to change some that are touching hook_update_N() functions: 3fc73c9bf3e91486605efec0515f6dddde92fd7b and e1041632b9a4112f66791b501379baab4fad71f2.
Comment #29
sdboyer CreditAttribution: sdboyer commentedyep, i skipped that commit, and a couple others.
6.x-1.x backport is looking pretty good, though we're getting some weird errors on the $event->ref->refname being set to something bizarre and too long. tough to tell just yet what's causing it, as the code is really pretty nearly identical between 6 and 7.
Comment #30
sdboyer CreditAttribution: sdboyer commentedwhoops, xpost
Comment #31
sdboyer CreditAttribution: sdboyer commentedthere are already 2 million rows in that table on live d.o, and it's just gonna get worse. a lot worse. it is currently the primary db-level weak link - at read or write time. write because index generation gets nasty, and read for the obvious reasons.
on a separate note, though, i've finished the backport (now that we've sorted out yet more weird issues with testing), pushed all of the changes up, and am now testing on git6. woot woot.
Comment #31.0
sdboyer CreditAttribution: sdboyer commentedForgot to add a space.
Comment #32
sdboyer CreditAttribution: sdboyer commentedbit of a note, i guess, not sure where else to put it...i've made sync-tinkering branches in both vc_git and vcapi. there are just a *massive* number of additional improvements in there. they're quite near finished, but more importantly, are not the sort of changes that would block us up anyway. after a little more work (~30 minutes) to
VersioncontrolRepository::syncEvent()
, we'll be ready to test on git6 again, and i expect it'll go more smoothly than it did before (these changes solve one of the problems i was having on git6).all these changes will need-forward-porting to d7, but it should be trivial cherry-picking.
Comment #33.0
(not verified) CreditAttribution: commentedstriking out remaining tasks