when we run the next git update, we're going to need to read and write to all eighteen thousand repositories. this really can't be done directly in an update hook, nor should we even load them all at once in a single attempt (probably). we need to load them, then enqueue the job in beanstalkd. this shouldn't be terribly difficult to accomplish in a custom one-off way in drupalorg, but i will need to tinker.
| Comment | File | Size | Author |
|---|---|---|---|---|
| #6 | git-read-branches.sh_.txt | 430 bytes | damien tournoud |
| #6 | git-branch-data.txt.gz | 199.48 KB | damien tournoud |
Comments
Comment #1
dww
Sounds like this might be useful in the future, so if we're building it from scratch, I'd be in favor of building it such that it could be reused if possible. However, if that adds extra complication and delay, please ignore me. ;)
Thanks,
-Derek
Comment #2
damien tournoud commented
Are we talking about reading and writing to the actual Git repositories or to the Drupal objects representing them? If the former, what is required?
Comment #3
eliza411 commented
Tagging
Comment #4
sdboyer commented
@dww - yes, it could and will be useful in the future. for sure. unfortunately figuring out a generic pattern for communicating with them is a little tricky, so we're just gonna one-off it here.
@DamZ - direct repository read, and database write. from the update hook, we need to enqueue one job for each of the 19000+ repositories. that job reads the current symbolic ref in HEAD from disk, then writes it to the db. preferably, we would also verify that the master branch still exists, as it does not for a number of repositories, then pick another of the available branches (it wouldn't be hard to come up with a rough canonicality metric), and set that as the default if master has been deleted, THEN update the db accordingly.
at least, that's what we have to do to be complete about it. if we were to skip the latter step, it would be pretty easy to just grep all `.git/HEAD` files for anything which is not set to master (only a scant few are, core included), write 'master' to the db for the rest, then just let people sort out re-setting their own default branches accordingly. i would prefer not to do that, though, since we'd be knowingly creating an inconsistent data state that the application can't normally create.
ultimately the concern is that simply loading 19000 fully-classed repository objects, then running operations directly on them within the update function, could run OOM. popping up the memory limit might be enough of an answer to that question for us to do it sloppily without enqueueing, but we'll need to figure out just what the memory cost is. and to dww's original point, i don't want to have to artificially increase the memory limit every time we run an operation against all repos.
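For illustration, the per-repository check described above (read the symbolic ref in HEAD, then see whether that branch still exists on disk) could look roughly like the sketch below. It is not the job code that was eventually committed. The function names are made up for this example, and it assumes every branch is stored as a loose file under `refs/heads/`, so a repository with packed refs would be misreported.

```python
from pathlib import Path
from typing import Optional

HEADS_PREFIX = "ref: refs/heads/"


def read_default_branch(git_dir: Path) -> Optional[str]:
    """Return the branch named by HEAD, or None if HEAD is not a branch ref."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith(HEADS_PREFIX):
        return head[len(HEADS_PREFIX):]
    return None  # detached HEAD or something unexpected


def branch_exists(git_dir: Path, branch: str) -> bool:
    """Look for a loose ref file only (assumption: no packed refs)."""
    return (git_dir / "refs" / "heads" / branch).is_file()


def default_branch_is_stale(git_dir: Path) -> bool:
    """True when HEAD names a branch that is no longer present on disk."""
    branch = read_default_branch(git_dir)
    return branch is not None and not branch_exists(git_dir, branch)
```

A job would call `default_branch_is_stale()` first and only go on to pick a replacement branch when it returns True; otherwise it just records the value from `read_default_branch()`.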
Comment #5
damien tournoud commented
I assume this could be done using a standard multi-step update function. But it would be even better just to load this metadata in one go from the filesystem and save it somewhere so that we can run the update path over and over again without needing to access the actual repositories.
Comment #6
damien tournoud commented
I took five minutes to write a small script that does that. Here is the script and the raw HEAD+branches data from our repositories.
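For readers without the attachments, here is a rough Python sketch of the same idea: walk a tree of bare repositories once, record each HEAD and branch list in a flat file, and reload that file later without touching the repositories. It is not the attached shell script. The tab-separated layout and the function names are assumptions made for this example, and only loose refs under `refs/heads/` are read.

```python
import os
from pathlib import Path
from typing import Dict, Iterator, List, Tuple


def scan_repositories(root: Path) -> Iterator[Tuple[str, str, List[str]]]:
    """Yield (repo path relative to root, raw HEAD contents, sorted branch names)."""
    for dirpath, dirnames, filenames in os.walk(root):
        if "HEAD" not in filenames or "refs" not in dirnames:
            continue
        git_dir = Path(dirpath)
        dirnames[:] = []  # this is a repository, do not descend into it
        head = (git_dir / "HEAD").read_text().strip()
        heads_dir = git_dir / "refs" / "heads"
        branches: List[str] = []
        if heads_dir.is_dir():
            branches = sorted(
                p.relative_to(heads_dir).as_posix()
                for p in heads_dir.rglob("*")
                if p.is_file()
            )
        yield git_dir.relative_to(root).as_posix(), head, branches


def write_dump(root: Path, out_path: Path) -> int:
    """Write one line per repository: path <TAB> HEAD <TAB> space-separated branches."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for repo, head, branches in scan_repositories(root):
            out.write("\t".join([repo, head, " ".join(branches)]) + "\n")
            count += 1
    return count


def load_dump(path: Path) -> Dict[str, Tuple[str, List[str]]]:
    """Read the dump back into {repo: (HEAD contents, [branch names])}."""
    data: Dict[str, Tuple[str, List[str]]] = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            repo, head, branches = line.rstrip("\n").split("\t")
            data[repo] = (head, branches.split(" ") if branches else [])
    return data
```

Branch names are joined with spaces because Git does not allow spaces in ref names. Repository paths containing tabs would break this format.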
Comment #7
sdboyer commented
oh cool, that helps. handy that we don't use packed refs (yet), so we can still get away with just reading what's under refs/. we don't need the full branch list, only whether or not the current default branch exists, as vcapi already has all that info, and can also access the info on whether or not a branch has a release associated with it. we need that info to make the smartest decision about what to set as the new default branch. my preferred criteria would be, in decreasing importance, a) that the branch has a release, b) that the branch is for D7, and c) that the branch has the highest available major version number. we've got a nontrivial number of repos to do this for:
just about as many sandboxes to do, though the first criterion doesn't apply there.
we'll still need to enqueue jobs for all those repositories which need to have their default branch updated, but since we'll have a full list of all the repos to hit, we'll be able to single-load them all and enqueue the jobs, so no risk of memory explosion.
the list of repos which have non-master branches as the default is much smaller, and we can just manually map those.
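The three criteria above map naturally onto a sort key compared in order of importance. A minimal sketch follows, assuming branch names follow the `7.x-2.x` pattern (core version, then major version) and that the release flag has already been looked up elsewhere. The `Branch` class and function names are invented for this example and are not vcapi API.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

# Assumed naming convention: "<core>.x-<major>.x", for example "7.x-2.x".
_BRANCH_RE = re.compile(r"^(\d+)\.x-(\d+)\.x$")


@dataclass(frozen=True)
class Branch:
    name: str
    has_release: bool  # hypothetical flag, supplied by the caller


def rank(branch: Branch) -> Tuple[bool, bool, int]:
    """Build a key ordered as: has a release, is for D7, major version number."""
    match = _BRANCH_RE.match(branch.name)
    core = int(match.group(1)) if match else None
    major = int(match.group(2)) if match else -1
    return (branch.has_release, core == 7, major)


def pick_default(branches: Iterable[Branch]) -> Optional[Branch]:
    """Return the best candidate, or None if there are no branches at all."""
    return max(branches, key=rank, default=None)
```

For sandboxes every `has_release` is False, so the first element of the key never decides anything and the choice falls through to the D7 and major-version checks, which matches the note that the first criterion does not apply there.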
Comment #8
sdboyer commented
hard part's done though, thanks Damien - i can take care of the update hook.
Comment #9
sdboyer commented
heh, turns out the mem usage isn't actually that bad - loading all 19k repos at once eats about 90MB. oh well :) then again, that's just loading the repos, not loading all of their branches or anything. doesn't obviate the need for a better strategy on this in the long term, either. some sort of multi-part or batch strategy.
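One possible shape for such a batch strategy, sketched under assumptions: repository ids are cheap to list, while the loaded objects are the expensive part, so only one fixed-size batch of objects is held in memory at a time. `load_many` and `handle` are hypothetical callables standing in for whatever loads repositories and processes or enqueues them. Nothing here is existing versioncontrol API.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def process_in_batches(
    repo_ids: Iterable[T],
    load_many: Callable[[List[T]], List[R]],
    handle: Callable[[R], None],
    size: int = 500,
) -> int:
    """Load and handle repositories one batch at a time; return how many were handled."""
    total = 0
    for batch in chunked(repo_ids, size):
        for repo in load_many(batch):  # only this batch's objects are alive here
            handle(repo)
            total += 1
    return total
```

The batch size of 500 is arbitrary. The measurement above suggests even a single batch of 19k is workable for bare repository objects, so the size mostly matters once branches or other related data are loaded too.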
Comment #10
sdboyer commented
commushed, with a hardcoded path to a local version of that file. once we're ready to deploy, i'll regenerate the file so that it's as up to date as possible then ensure it's in place.
marking postponed so that we don't forget to do that on deploy.
Comment #11
damien tournoud commented
Did you mean to commit the data file too?
Comment #12
sdboyer commented
nope, intentionally left it out. i'm going to generate it just before we start and drop it in the repo then.
Comment #13
sdboyer commented
meh, actually, no real harm putting it in now. commushed the script and a current data dump in there. we can remove them from the repo after this update is done.
Comment #25
drumm
Comment #26
drumm
Looks like the `git-read-branches.sh` script got us through the Git migration.
(The GitLab migration was done with a custom table tracking repositories with un-imported changes, and tricking the versioncontrol repomgr queue into running GitLab project imports.)