Per the discussion in #1057386: Provide a way to download the entire git codebase for all projects, we ought to set up an rsync daemon that makes /var/git/repositories publicly available. There's no sensitive information in there, so it can be done safely.
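For illustration, a read-only rsync daemon module for this might look like the sketch below. The module name, uid/gid, and connection limit are assumptions, not a proposed final config:

```ini
# /etc/rsyncd.conf -- hypothetical read-only module exposing the repositories
[git-repositories]
    path = /var/git/repositories
    comment = All project git repositories (read-only mirror)
    read only = yes
    list = yes
    uid = nobody
    gid = nobody
    max connections = 25
```

Clients would then pull with something like `rsync -av rsync://git.drupal.org/git-repositories/ ./repos/` (hostname assumed for the example).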

My preference would probably be to add this to the gitrepobase puppet module.

Comments

marvil07’s picture

Let me quote from #1057386-11: Provide a way to download the entire git codebase for all projects and answer here, where IMO it is relevant:

@danillonunes - yes, someone could write a script to just clone them all down directly. Would be a lot slower than a massive rsync, though, since it's got to set up and tear down the remote connection so many times. But having the whole repository history is really the point (even/especially for searching), so that's optimal already.

There is one consideration for rsyncing git repositories: git data cannot be assumed to be invariant. Operations like git gc rewrite the actual files in the repository, so rsync will see (and re-transfer) all those on-disk changes that git fetch would not.

I naturally understand that the problem with using git fetch is the number of network requests needed to actually update the whole of contrib.

IMO, rsync is a good idea for the first download (though that could also be a ginormous tarball somewhere), but for subsequent updates I think git fetch would do a good job. As noted, the problem is the number of network connections. With anonymous cloning over the git protocol there is no way to make that faster, but over the ssh protocol we could use the ControlMaster feature, which lets you reuse one open network connection. Despite the extra CPU cost it would add on both sides, it could be faster.

The problem with that last idea is that we are not running a real OpenSSH daemon but our own custom one. If there were a way to provide the ControlMaster feature on the ssh git daemon, we might be able to update faster than rsyncing the ginormous contrib directory.
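On the client side, ControlMaster multiplexing with a stock OpenSSH client is just configuration; whether the custom server-side daemon can handle multiplexed channels is exactly the open question here. A sketch of the client config (hostname assumed):

```
# ~/.ssh/config -- client-side connection reuse with stock OpenSSH
Host git.drupal.org
    ControlMaster auto                # reuse an existing master connection if one exists
    ControlPath ~/.ssh/cm-%r@%h:%p    # socket file identifying the shared connection
    ControlPersist 10m                # keep the master open 10 minutes after last use
```

With this in place, every `git fetch` after the first rides the already-open TCP/SSH session instead of performing a fresh handshake.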

nnewton’s picture

As has probably become obvious, I'm not a huge fan of supporting multiple ways of getting to the same data. We currently support git, ssh, and http. For me to really want to add rsync, I'd like a good description of why. I've read through the grepping argument but am not fully understanding its significance. Could someone shed some light on this? Thanks.

-N

webchick’s picture

Grepping the contrib repository is useful in the following circumstances:

- You're about to change an API in Drupal core/a key contributed module, and you want to figure out the overall impact of that change on contributed projects.
- A new type of security vulnerability is discovered, and you want to find out the impact that has on contrib to see if you need to issue multiple security advisories for multiple projects.
- When evaluating whether or not a particular API/feature of the API is useful for possible removal from core/a key contributed module, it's really handy to be able to figure out how people are using it "in the wild."
- When evaluating contributed modules that expose APIs, for possible inclusion into core, it's useful to be able to see what their proliferation is like throughout contrib.

In the past, due to the way CVS stored its data, you could actually rsync down the entire contrib repository and grep it in plaintext (see http://drupal.org/node/277268 for instructions), which was SUPER handy for answering these and other questions. I understand Git's storage mechanism is different, but the desire is still there.

There are hacks, of course. You could write a script to parse the output of http://drupal.org/project/usage and run individual git clone commands for each one. Then another script to update it periodically and update/add new repos. I have a feeling though that after the first 1000 or so projects in that list, your IP would get banned as a spammer robot. :P Let alone if every member of the security/core dev team wanted to do this.
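The kind of script described above might be sketched as follows. The anonymous clone URL scheme and the project names are assumptions for illustration, and actually running this at scale would hit exactly the rate-limiting problem just mentioned:

```python
import pathlib

# Assumed anonymous clone URL scheme; adjust to whatever d.o actually exposes.
GIT_BASE = "git://git.drupal.org/project"

def sync_command(project, dest_dir="contrib"):
    """Return the git command to clone a project, or to fetch if it already exists."""
    dest = pathlib.Path(dest_dir) / project
    if (dest / ".git").exists():
        # Repo already present: incremental update, one connection per project.
        return ["git", "-C", str(dest), "fetch", "--all", "--tags"]
    # First run: full clone.
    return ["git", "clone", f"{GIT_BASE}/{project}.git", str(dest)]

# Example: print the commands instead of running them.
for name in ["views", "ctools", "token"]:
    print(" ".join(sync_command(name)))
```

Each project still costs a separate connection setup and teardown, which is the core inefficiency compared to one big rsync.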

OTOH, lots of people pointing to a public rsync repo seems like that would keep the traffic down significantly, as well as avoid people writing/running custom scripts (some of which are sure to be buggy and double-download projects, etc.) for this. It doesn't solve other problems (like I don't think an rsynced git repo can be grepped like a CVS one can) but it would at least get the information onto peoples' hard drives so they can proceed from there.

Robin Millette’s picture

I'm really looking forward to getting my hands on all the MODULE.api.php files for a few articles I'm currently writing, if that's any incentive.

killes@www.drop.org’s picture

I've been asked by somebody (sorry, my memory...) at DrupalCon if we do provide this. So this is still of interest.

webchick’s picture

Yep.

For now, I'm getting by with https://drupal.org/sandbox/greggles/1481160 but I'm sure that's not remotely nice to D.o's infra.

Mixologic’s picture

Issue summary: View changes
Status: Active » Postponed (maintainer needs more info)

I've been kicking around an idea for a code search tool that I think would satisfy the requirement of being able to get information out of the comprehensive code base.

What other use cases would people have for having a physical copy of all the git history and working directories?

rocketeerbkw’s picture

Status: Postponed (maintainer needs more info) » Active

Unless that tool is available now, being able to download all the code is still an active and useful feature.

Additionally, my use cases involve parsing or manipulating the code, not just searching. Specifically, I wanted an updated, and slightly different, list of most used hooks.

webchick’s picture

I think #7 would work for "how much would contrib be affected if we did change X" (which does come up a lot) but it wouldn't work for scenario described in #8 (which comes up less often, but is still important).

Another approach could be resolving a long-standing feature request (which I can't find atm) to have contrib (or at least top ~100 contrib) on api.drupal.org. Then we could pull stats from #8 via database tables pretty easily.

mlhess’s picture

The older way of doing this was using https://www.drupal.org/sandbox/greggles/1481160

However, there are a few ways to do this now with the API (https://www.drupal.org/api).

You can grab all the projects, and then just check them out from git directly. I don't think it will ever be in the roadmap to provide an rsync service, but I do not speak for the DA.
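Listing the projects via the REST API might look like the sketch below. The endpoint and field name follow the drupal.org API documentation as I understand it, and should be treated as assumptions:

```python
# Assumed endpoint (paginated):
#   https://www.drupal.org/api-d7/node.json?type=project_module&page=N
API_PAGE = "https://www.drupal.org/api-d7/node.json?type=project_module&page={page}"

def machine_names(page_json):
    """Extract project machine names from one decoded API result page."""
    return [
        node["field_project_machine_name"]
        for node in page_json.get("list", [])
        if node.get("field_project_machine_name")
    ]

# Example with a stubbed response (the live API returns pages of ~50 nodes):
sample = {"list": [{"field_project_machine_name": "views"},
                   {"field_project_machine_name": "ctools"}]}
print(machine_names(sample))  # -> ['views', 'ctools']
```

Each name can then be turned into a clone URL and fetched from git directly.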

Mixologic’s picture

I should probably elaborate on my idea, especially since I probably won't have time to implement it anytime soon with all that's on the plate.

The code search that I'm imagining is not a simple full-text search. It would use a tool like Pharborist or PHP-Parser, combined with something like Exuberant Ctags, to parse and tokenize all the code so it could be faceted and loaded into something like Elasticsearch. That way the search itself could provide things like a list of the most-used hooks, because the indexer could/should know that it is tokenizing a hook implementation and tag it as such. It could carry tags for other metadata too, like Drupal major version, test vs. non-test code, etc.

From my perspective, the requirement to have the entire set of repositories available is really about gathering metadata on the code in the ecosystem. If we can build a tool that gives everybody access to that metadata, and lets everybody define what metadata they want out of it, would that be sufficient?

I can see the counterargument that this only addresses the current state of the code and wouldn't necessarily provide *historical* data the way the git histories would, but I'm not sure what use cases we have there.

rocketeerbkw’s picture

You can grab all the projects, and then just check them out from git directly.

Have you tried it? It's not as easy as you make it sound.

It took me a week to get it because the list of modules in https://www.drupal.org/sandbox/greggles/1481160 was out of date and I didn't know there was an API. Then I found out about the API and switched to using it, but it's broken (see #2364755: Query parameters dropped when redirected from /node to /node.json and #2253947: format suffix not added to next-first-last page url's) so it took more time to figure out what's wrong and work around it. It takes ~30 mins to get the list of modules from the API, and another 3 days (and I have a fast connection) to get all the code.

Is it possible? Technically, yes. Currently it's very discouraging if someone is just curious about something and wants to have a look.

If an rsync download isn't the right solution, I understand. But I'm not sure a better option is in place right now.

drumm’s picture

API.Drupal.org could serve as the database for that; it already stores and renders all the code it indexes. That depends on #160470: [meta] Index all of contrib modules at api.drupal.org.

marvil07’s picture

@Mixologic, maybe now is a good time to open a new issue about searching through all of contrib; I mentioned the idea before and suggested an existing project implementing it that we could reuse. Also, I do not think we can identify hooks easily (unless we assume only core hooks). In any case, let's discuss it in another issue.

irinaz’s picture

@marvil07 , is this issue still relevant? If yes, could you give more details? If no, could we close this issue?

fgm’s picture

It is still relevant, first for the reasons outlined by @webchick in #3, but also as a way to provide unfettered access to our public code, encouraging individuals to experiment, probably fail, and maybe sometimes succeed in creating value for the community, instead of leaving a feeling of gatekeeping.

nnewton’s picture

This issue is outdated. With the advent of PR forks, the shared object storage they introduce, and hashed repository storage in GitLab, this is no longer possible. You would need the full GitLab backup to do anything with the on-disk data, and that backup both contains data that cannot be publicly shared and is far too large to reasonably share like this. We can close this issue; any mirroring of the git repos will have to go through the git interface or something GitLab offers.