We need a strategy for managing all of our git repositories on disk. On staging, everything currently lives under /git, using the project shortname for its dirname (e.g., Views lives at /git/views.git/), and I strongly believe this is _not_ a sustainable approach for the long term. The goal of this post is to lay out the various considerations relevant to how we store the repositories on disk so that they can be considered in all their intersecting glory, and we can come to an agreement on a plan that conforms to the principle of "as simple as possible, but no simpler."

Let's start with the public-facing interface we've guaranteed to users at launch. These are set in stone. For full projects:

  • git@git.drupal.org:project/[project shortname].git # ssh, pubkey only for authed, or RO for anonymous
  • [git username]@git.drupal.org:project/[project shortname].git # ssh, pubkey or password, no anon access allowed
  • http://git.drupal.org/project/[project shortname].git # http, RO
  • git://git.drupal.org/project/[project shortname].git # git, RO

And for sandboxes. At the moment, sandbox shortnames are always the sandbox project nid, but I think it's very important that we roll out a change allowing them proper naming as quickly as possible (I won't get into why here; that's a separate discussion). Anyway:

  • git@git.drupal.org:sandbox/[git username]/[project shortname].git # ssh, pubkey only for authed, or RO for anonymous
  • [git username]@git.drupal.org:sandbox/[git username]/[project shortname].git # ssh, pubkey or password, no anon access allowed
  • http://git.drupal.org/sandbox/[git username]/[project shortname].git # http, RO
  • git://git.drupal.org/sandbox/[git username]/[project shortname].git # git, RO

That's for launch. Per-issue repositories, which may be the single biggest & most important/exciting goal in phase 3, are going to add to the scheme. My current thinking is that they will follow a scheme that looks like this:

  • git@git.drupal.org:issue/[issue nid].git # ssh, pubkey only for authed, or RO for anonymous
  • [git username]@git.drupal.org:issue/[issue nid].git # ssh, pubkey or password, no anon access allowed
  • http://git.drupal.org/issue/[issue nid].git # http, RO
  • git://git.drupal.org/issue/[issue nid].git # git, RO

Next, let's run through a list of the various services we're launching with that need to interact with literal, on-disk repository data:

  • Twisted SSH daemon - repository cloning and pushing means needing to interact with a real repository so git-upload-pack and git-receive-pack can be invoked in the proper place. Needs to resolve to a repo based on project shortname and a git username, and also needs to get auth data based on that same input.
  • Anon clones over git:// - I'm not even sure just how this works TBH (I haven't had to set it up myself yet), but it needs to read real repo data, same as the SSH daemon. Needs to resolve to a repo based on project shortname and sometimes a git username.
  • Anon clones over http:// - it's cloning, once again. Needs to resolve to a repo based on project shortname and sometimes a git username.
  • Repository management workers - the little guys who create and delete repositories as needed. They get their instructions from a beanstalkd queue, which is populated from a hook_nodeapi() implementation that listens for project node type manipulation. They need to know where on disk the repository is/ought to be.
  • Repository log parsing workers - the little guys who read the repository data and turn it into records in the drupal db. They get instructions from a beanstalkd queue, which is populated by the git hook scripts that are run every time a push is received. They're useless unless they can resolve a path to the repository.
  • Packaging scripts - they need repo data to package! And they need to be able to resolve it based on a project shortname. Sandboxes don't need this, of course, as they can't make releases.
  • Repo pruning & trimming - various commands run by git will call git gc, which helps, but we periodically need to repack repositories. It'd be better if we could respond to events that we know cause repositories to get big (large pushes, for example) rather than just lots of stupid iterations that run it on all repos on cron/hudson.
  • Repo disk space management - we need a service that can appraise & manage the size of repositories on disk. It's probably important that this service be able to provide feedback back into the drupal db, so that we can easily figure out whose repositories are growing out of control. Again, this is something that should be managed by enqueued workers that are tripped off as needed as much as possible - dumb cron will be a big IO killer, fast. Note that this is as yet entirely unwritten - I'm not the one to design it well; my approach would be lots of brute-force-y `du -s` calls.
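
Most of the workers above share the same first step: take a job off the queue and resolve it to a repository on disk. As a minimal sketch of what a log-parsing worker's job handling might look like - the payload fields, the in-memory routing table, and the helper names here are all hypothetical, and the real workers are fed by beanstalkd rather than called directly:

```python
import json

# Hypothetical routing table. In practice this lookup would be a drush
# call (or, later, a query against a k/v store) keyed on repo_id.
REPO_PATHS = {
    38282: "/git/repositories/01/38282.git",
}

def resolve_repo_path(repo_id):
    """Resolve a repo_id to its on-disk location via the routing layer."""
    return REPO_PATHS[repo_id]

def handle_log_parse_job(payload):
    """Handle one queued job body (JSON) enqueued by the git hook scripts.

    The hooks only need to enqueue two pieces of data -- repo_id and
    uid -- for the worker to find the repository and attribute the push.
    """
    job = json.loads(payload)
    path = resolve_repo_path(job["repo_id"])
    # The real worker would now run git log/rev-list against `path` and
    # write the results into the drupal db; we just return what we found.
    return path, job["uid"]
```

The point of the sketch is the shape of the payload: nothing but numeric ids crosses the queue, and the path comes from the routing layer, never from the job itself.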

That's what we definitely launch with. There are some other services it's reasonable to anticipate will need to access the repositories as well:

  • Our homegrown repository viewer, which we're hoping to have replace gitweb ASAP. This is even a bit more complicated, since it needs to be run from a webhead, and we can't solve the problem with async job queues like we do in the other cases - it needs to be synchronous, as it needs to retrieve repo data in order to fulfill the request. The full fix is either having a web service that publishes the data it needs (which is still awkward for a few reasons), or an RPC system that is fast enough to be able to run synchronously (which is what github created BERT-RPC for). And both of those services will STILL need to be able to resolve repo locations on disk, more than likely based on project shortname and maybe git username.
  • If we implement an input filter for showing repo data directly in user posts as part of the homegrown repo viewer (which, honestly, I care about 100x more than the homegrown viewer itself - it's what really makes it worthwhile), then we've entered the realm of HTTP requests to d.o making calls to read repository data directly.
  • Damien's "contributions-stable" snapshot repositories - These have one commit in them per bit of release data. We honestly don't have a scheme for these at all right now, but in principle they ought to be able to just have another top-level path entry (like {project,sandbox,issue}/[project shortname].git). But they'll need to be able to find the base repository to generate their data from, too.
  • Per-issue repositories are going to _seriously_ impact the amount of background activity we have going on in repositories. For example, one important feature will be the ability for the issue to report on whether or not the branch(es) in the per-issue repo can merge cleanly against the project from which they're forked. That's an RPC call that reports whether or not branches in a per-issue repository merge cleanly against the parent repo.

Now, some musings on size & growth:

  • We had 2221 new projects in the last year, created by 1029 unique users. That's a bit up from the year before; let's assume that it'd be up more if a) people weren't creating projects on github and b) we weren't still stuck in dev on D7. To be safe, then, let's assume 4000 new projects in the coming year.
  • I have NO idea how many sandbox projects to expect. I know I myself expect to toss a few in, fast and loose, since…why not? Who knows, but let's say the number of people creating sandboxes quintuples (since it's not human-supervised). Let's be really safe, then, and assume 30000. Note that our current, literal-disk-mapping approach to storage will hit a critical problem if we even approach 20k, as we'll be at the ext3 limit of 32k subdirectories per directory. ext4 raises that to 64k, but that's obviously not a long-term solution either.
  • The core repo is (currently) 55MB, Views is 15MB. The total size of all the repos on disk is around 2G. So the actual size on disk isn't much of a concern - but it will be with per-issue repositories.
  • There's scattered talk of implementing per-issue repositories somewhere around summertime 2011. Even if it comes a bit later…yeah, we had 107449 new issues filed in the last year. Not all would have repos, of course, but still. My educated guess is that, across all those repos, it's probably fair to expect a number of commits equal to 3-5x the number of patches we see posted on a yearly basis (sorry, I don't have a number for that, not sure how to generate it offhand).
  • Remember that there is a linearly-increasing (cpu & io) cost associated with every repository we have. The more we make cleanup and maintenance jobs evented instead of cron'd, the less this is true, but given everything here, it really strikes me as imprudent to think that util will last indefinitely for managing all of this. I don't even know if it'll last through the year - nnewton has already mentioned to me that he expects we'll need a new one by the end of the year.
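
The subdirectory-limit problem has a standard workaround: bucket repos into fixed subdirectories derived from their id, so no single directory's entry count grows with the total repo count. A hypothetical sketch (the modulo-100 bucketing and base path are made up for illustration - any stable function of the id works, as long as every service computes it the same way; it just needs to land on a shape like the "01/38282.git" example further down this post):

```python
def shard_path(repo_id, base="/git/repositories", buckets=100):
    """Map a repo_id to base/NN/<repo_id>.git.

    With 100 buckets, each bucket directory only holds 1/100th of the
    repos, so we'd need over 3 million repositories before any single
    directory neared ext3's 32k-subdirectory ceiling.
    """
    bucket = "%02d" % (repo_id % buckets)
    return "%s/%s/%d.git" % (base, bucket, repo_id)
```

For example, `shard_path(38282)` yields a path under the `82/` bucket; the important property is that it's computable from the repo_id alone.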

Finally, some miscellaneous considerations about the whole thing:

  • Moving repositories around on disk as a response to actions taken on the d.o website (at least, actions by non-admins) is a really bad idea, and should be avoided if at all possible. Something may go wrong mid-move, and bam! corrupted data. But there's one case we already know of where a literal disk mapping will require exactly that - when sandboxes are promoted to full projects, get a shortname change, and are no longer accessed at the username-namespaced URL.
  • When we implement per-issue repositories, it's going to be _extremely_ important that we take advantage of git's hardlinking & other repository object-sharing techniques in order to reduce repository size. If we don't, every new core issue with a repo will have a 55MB repository attached to it. And hardlinking goodness doesn't work unless the repos are on the same partition, of course.
  • We need, need, NEED to have redundancy & failover for the repos. Seems to me like the most intelligent way to do that is have a load-balancing frontend that receives the various external requests (ssh, git, http) and forwards to a box that actually has the repository data, but can fail over.
  • In our current architecture on staging, EVERYTHING that needs to interact with a repo either has to make a drush call to find out where the repository is located, or requires explicit instructions about that location. The latter case, which we already rely on too much, is poor architecture: it makes the worker a dead end. It can't pass good information along for later tasks, logging, or anything else without either making its own drush call or growing its own logic for extracting metadata from a repo path. Such logic is inherently brittle, because it ties the worker to a consistent repository pathing structure, which makes it harder to improve our architecture later.
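
On the object-sharing point above: git's hardlinking kicks in automatically for local same-filesystem clones, and git also supports explicit object borrowing via the objects/info/alternates file, which is what `git clone --shared` sets up. A minimal sketch of wiring a per-issue fork up to its parent's object store that way (the function and paths are hypothetical; nothing here is our actual tooling):

```python
import os

def share_objects(fork_repo, parent_repo):
    """Make `fork_repo` borrow objects from `parent_repo` by writing
    objects/info/alternates -- the mechanism behind `git clone --shared`.

    Caveats, per the post: the parent's objects must stay reachable from
    the fork's filesystem, which is exactly why sharding has to keep
    related repo clusters together; and the parent must not prune
    objects the fork still depends on.
    """
    info_dir = os.path.join(fork_repo, "objects", "info")
    os.makedirs(info_dir)
    with open(os.path.join(info_dir, "alternates"), "w") as f:
        f.write(os.path.join(parent_repo, "objects") + "\n")
```

With this in place, a fresh per-issue repo for core costs roughly nothing on disk instead of 55MB, until it accumulates its own objects.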

Given all of the size & performance considerations, I believe the prudent approach to be launching with an architecture reflective of an expectation that we'll need to start sharding our repositories across multiple git worker boxes sooner rather than later. And sharding means routing. Even if we don't need to shard, however, the complexity of resolving repositories and communicating between various parts of the system is _already_ difficult without a good, central method for keeping routing information. And it keeps us from having to move repositories on disk when they get their shortnames changed (on promotion from sandbox to project). What I really want is for our various internal infrastructure services/components to talk to each other using ONLY two pieces of data: repository id and user id.

Of course, the routing lookup service will need to accept a few different arguments - project shortname, git username, etc. - and resolve them to these two pieces of information. That way, the public-facing components of the architecture (the ssh daemon & the git:// and http:// clone listeners) can translate the data in client requests into a repo_id and uid, then use those to look up the various additional data from a separate bin in the central service (e.g., repo path, auth data, git server once we need it, etc.). They can then communicate with downstream logic using only the simple, abstracted numeric ids - for example, the ssh daemon sets env variables for the hook scripts so that they can enqueue jobs with enough data for the workers to know which repo to work on.
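
The two-step resolution described above could be sketched like this - a stdlib-only toy where the table contents, field names, and ids are all made up (in reality step 1 and step 2 are drush-backed mysql queries, later maybe a k/v store):

```python
# Step 1: public-facing identifiers -> internal numeric repo_id.
# Keys mirror the public URL scheme: (namespace, shortname, git username).
ROUTES = {
    ("project", "views", None): 1234,          # project/views.git
    ("sandbox", "scratch", "sdboyer"): 5678,   # sandbox/sdboyer/scratch.git
}

# Step 2: repo_id -> everything else internal services need.
REPOS = {
    1234: {"path": "/git/repositories/34/1234.git", "server": "git1"},
    5678: {"path": "/git/repositories/78/5678.git", "server": "git1"},
}

def route(namespace, shortname, username=None):
    """Translate a public clone path into (repo_id, repo record).

    Frontends (ssh daemon, git:// and http:// listeners) do this lookup
    once, then hand nothing but the numeric repo_id (plus uid) to
    everything downstream.
    """
    repo_id = ROUTES[(namespace, shortname, username)]
    return repo_id, REPOS[repo_id]
```

Note what this buys us: a shortname change on sandbox promotion is a one-row update in step 1, with nothing moving on disk.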

Early in the process, we had been talking about using a private Varnish instance to cache this data. In retrospect, that's a little silly. There's really no reason to deal with the problems of cache warmth & expiration when we can have a persistent high-performance key/value store with redundancy that Drupal can synchronously update such that the latest routing (and auth) data is available to internal services in realtime. We could do it with a persistent strain of memcache, or mysql, or redis. We could also stay drush-centric and just use it to stitch together the data we need in a manner that follows the interface I've described above. That's a very good intermediary solution, at least, and it's the one I recommend we follow for launch, as we've already got a fair bit of code working that way and it removes the need to write logic that replicates data available in mysql into a separate data structure. And, working the kinks out in drush will make it easy for us to know what the interface ought to look like if/when we switch over to a high performance k/v store.

So, actually DOING this...we're already largely implemented in drush, and it's a pretty straight line from where we're at to abstracting the location on disk and introducing the routing system. A quick hit-list of what actually needs to be done:

  • {versioncontrol_repositories}.root (what vcapi typically expects to be a path to a repo location on the local filesystem) is currently a wasteland. I need to decide on what data we actually want to store there, and how it'll interact with the routing system. I do expect we can turn it into the canonical on-disk repo location storage, even once we start sharding, if we just munge some extra data in there - e.g., "git1:/git/repositories/01/38282.git", where 'git1' would be the box it lives on.
  • The ssh daemon needs to be refactored to conform to the spec described here, where it first translates the incoming paths then does ALL further internal communication using repo_id/uid (currently it uses project shortname and git username). This is tizzo/chizu/halstead territory.
  • A git http cgi backend needs to be written that can get routing service data and connect appropriately. The backends aren't very complicated in and of themselves, there are lots of examples out there - we just need to integrate it with the routing service. I'm not sure who's best for this, but it's pretty plug-and-chug.
  • The git:// protocol needs to be taught how to talk to the routing service. I haven't the first clue about how to do it, but chizu may.
  • We need a decision from nnewton on where the twisted ssh daemon(s) live vis-a-vis the repos for initial launch at least; it'd be good if we could have a forward-looking discussion about how we do that once we start sharding, too.
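
On the first hit-list item, the munged root format ("git1:/git/repositories/01/38282.git") splits naturally on the first colon. A hypothetical parser, assuming a bare path (no host prefix) means the repo is local - which keeps it backwards-compatible with what's in {versioncontrol_repositories}.root before we shard:

```python
def parse_root(root):
    """Split a munged root like 'git1:/git/repositories/01/38282.git'
    into (server, path). A bare absolute path means the repo is local,
    so single-box setups keep working unchanged.
    """
    head, sep, tail = root.partition(":")
    if sep and not head.startswith("/"):
        return head, tail
    return None, root
```

Consumers then branch on whether `server` is set: run git directly, or dispatch to the box that holds the repo.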

Comments

sdboyer’s picture

Forgot one crucial thing in the big final todo list there - I need someone (nnewton? chizu?) to help figure out the best way to monitor and restrict repository size on disk.

nnewton’s picture

subscribe

pwolanin’s picture

So - in other issues we discussed that the underlying structure for projects would be to have the git repo at a real path determined by the project nid, and that the shortname would be a sym (or hard) link to that.

Given that nids are unique, it seems like implementing such a system would also provide a basis for per-issue repos, since each nid is unique. For per-issue repos, it seems like one could omit creating a repo until a patch is actually uploaded to the issue? Also, I thought with git hard-linking of objects, such repos were going to consume trivial resources?

sdboyer’s picture

@pwolanin - The point of a routing layer is to avoid needing to have the on-disk representation match exactly with the incoming query string. Symlinking is a kinda brute-force approach to solving that problem, and I'd far rather update mysql or a persistent k/v store when something changes than muss about with changing symlinks. That's why I moved away from symlinks as a solution.

nid is actually a poor choice for the final name of the repo on disk, since we may at some point determine that we want to attach multiple repositories to a single nid, and then we'd have a horrible refactoring job before us. Maybe we never make a decision like that, but there's no reason not to future-proof against it when it's easy (and it is). repo_id is obviously gonna still be optimal in that situation. And really, it's more sensible regardless - we have an id that uniquely identifies that repository in our database. Why not use that unique id to name the repo on disk?

Git's hardlinking means that they consume trivial resources...iff they're on the same filesystem. That'll just work initially, but at some point we're going to exceed that which can be readily contained on a single disk, or we'll need to shard, and then we have to start concerning ourselves with keeping related repo clusters together, otherwise a) our disk usage will shoot up and b) there'll be lots of superfluous internal git object traffic between related repos that should have shared object storage, but instead need to be constantly fed the same data.

izmeez’s picture

subscribe

pwolanin’s picture

@sdboyer - having an on-disk representation match the URL seems likely to be more reliable and easier to replicate into a staging version by just copying the disk. Are symlinks really that bad?

Sure, we could use some other ID as the actual directory name.

dww’s picture

I don't feel like I understand all the moving pieces enough to have a strong opinion on the right solution to all these problems. However, one quick note:

Re: "Packaging scripts - they need repo data to package! And they need to be able to resolve it based on a project shortname. Sandboxes don't need this, of course, as they can't make releases."

That's not really true. I'd rather the packaging script just did an anon git archive from whatever public-facing interface we can get working, instead of the packaging script (rather, our d.o-specific plugin for the packaging logic -- thanks #1024942: Add ctools plugin system for site-specific logic for packaging release nodes) having to know specifically about repo locations on disk. Currently it's hard-coding a direct filesystem "checkout" b/c on git-dev itself we couldn't get a public-facing interface working when I was ready to roll out the git-specific packaging. But that's a hack -- long-term this system shouldn't have to know about raw locations on disk at all, it should just be talking to something that does. See #1028950: Fix packaging scripts to checkout from sane/stable Git URLs instead of hard-coding the local filesystem

sdboyer’s picture

Status: Active » Fixed

We settled on this on the call on Monday, I've just neglected to update the issue. Basically:

  • Stick with drush, using it to introduce the framework of a routing layer as we can initially. During performance testing (next week), we'll see if we need to introduce a more performant layer.
  • Continue to use literal filesystem mappings for the moment, so that the routing layer can fail gracefully.
  • Move all inward-facing services towards the standard use of repo_id and uid for communications.

That'll get us to launch. As for the rest of it, these are just issues to keep in mind. So marking fixed, as the approach has indeed been determined.

dww’s picture

Status: Fixed » Active

Projects change ownership a lot. I'm concerned about using the UID for tracking stuff on disk.

In fact, have we considered the possibility of people handing over ownership of sandboxes as they move from experimental to full? It's not just the code itself which is easily cloned via Git. Projects (even sandboxes) have a set of issues, potentially folks who have flagged the project, subscribed, are actually using it (and perhaps via git_deploy they're already reporting usage stats, etc).

dww’s picture

Status: Active » Fixed

x-post, didn't mean to change status. I'd still like an ack on the UID vs. repo_id question. Why use UID at all, if we have a unique ID for all repos?

sdboyer’s picture

@dww - you read something into my intentions for uid that wasn't there, I think. In the eventual, fully-routed system, repo_id is what we use. Maybe it gets stacked into crazy dir structures because we have some other infra constraints, but it's the only real _drupal_ data that we ever have reason to reflect in the filesystem structure. When I emphasize the importance of uids, I'm not talking about using them on the filesystem itself - I'm talking about how, say, the ssh daemon communicates with the git hooks, how the git hooks communicate with the workers: the data they pass back and forth to identify what's happening. How we represent the items on disk is partially predicated on such standardization (so that everything's got the repo_id), but is a separate issue. uids are for use in logging, retrieving perms data, etc.; not in identifying repositories on disk.

dww’s picture

@sdboyer: Perfect. That wasn't 100% clear from the orig post, but glad I just misread you and we're on the same page here. ;)

sdboyer’s picture

Just a note - with some changes I'm working on right now, I'm actually deviating from this a bit. For some of the job queue actions that we know are run by drush workers (and we have no expectation that they will ever be run by anything BUT drush workers), instead of passing a repo_id I'm passing a serialized VersioncontrolGitRepository object. There are some important logic benefits to this that prompted me towards the change; for now, it seems like an acceptable deviation. We'll see in the longer term.

Status: Fixed » Closed (fixed)
Issue tags: -git phase 2, -git sprint 9

Automatically closed -- issue fixed for 2 weeks with no activity.