Active
Project:
Drupal.org infrastructure
Component:
Git
Priority:
Normal
Category:
Task
Assigned:
Unassigned
Reporter:
Created:
9 Aug 2011 at 20:49 UTC
Updated:
2 Jun 2022 at 14:47 UTC
Per the discussion in #1057386: Provide a way to download the entire git codebase for all projects, we ought to set up an rsync daemon that makes /var/git/repositories publicly available. There's no sensitive information in there, so it can be done safely.
My preference would probably be to add this to the gitrepobase puppet module.
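For reference, a minimal sketch of what such a daemon could look like (the module name, uid/gid, and host name are assumptions; only the /var/git/repositories path comes from this issue):

    # /etc/rsyncd.conf -- hypothetical read-only export of the repositories
    uid = nobody
    gid = nogroup
    use chroot = yes
    max connections = 25

    [repositories]
        path = /var/git/repositories
        comment = Drupal.org git repositories (read-only)
        read only = yes
        list = yes

    # client side, e.g.:
    #   rsync -av rsync://git.drupal.org/repositories/ ./repositories/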
Comments
Comment #1
marvil07 commented

Let me cite from #1057386-11: Provide a way to download the entire git codebase for all projects and answer here, where IMO it is relevant:

There is one consideration for rsyncing git repositories: git data cannot be assumed to be invariable. I mean, git gc/git fsck change the actual files in the repository, so please note that rsync will see all those changes that git fetch would not.

I naturally understand that the problem with using git fetch is the number of network requests we need to make to actually get the whole of contrib updated.

IMO, rsync is a good idea for the first download (but it could also be a ginormous tarball somewhere); for subsequent updates, I think git fetch would do a good job. As pointed out, the problem is the number of network connections. Assuming anonymous cloning over the git protocol, there is no way we can make it faster; but over the ssh protocol we could use the ControlMaster feature, which lets you reuse one open network connection. So, despite the extra CPU it would add on both sides, it could be faster.

The problem with that last idea is that we are not running a real openssh daemon but our own custom one. So, if there were a way to provide the ControlMaster feature in the ssh git daemon, we might be able to update faster than rsyncing the ginormous contrib directory.
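For illustration, this is roughly what connection sharing looks like with a stock OpenSSH client (a sketch only; the host name is assumed, and whether drupal.org's custom ssh daemon could honor it is exactly the open question above):

    # ~/.ssh/config -- reuse one network connection across many git fetches
    Host git.drupal.org
        ControlMaster auto
        ControlPath ~/.ssh/cm-%r@%h:%p
        ControlPersist 10m

With this in place, the first git fetch opens the master connection and later fetches multiplex over it instead of paying the TCP/ssh handshake each time.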
Comment #2
nnewton commented

As has probably become obvious, I'm not a huge fan of supporting multiple ways of getting to the same data. We currently support git, ssh, and http. For me to really want to add rsync, I'd like a good description of why. I've read through the grepping argument and am not fully understanding its significance. Could someone shed some light on this? Thanks.
-N
Comment #3
webchick commented

Grepping the contrib repository is useful in the following circumstances:
- You're about to change an API in Drupal core/a key contributed module, and you want to figure out the overall impact of that change on contributed projects.
- A new type of security vulnerability is discovered, and you want to find out the impact that has on contrib to see if you need to issue multiple security advisories for multiple projects.
- When evaluating a particular API/feature of the API for possible removal from core/a key contributed module, it's really handy to be able to figure out how people are using it "in the wild."
- When evaluating contributed modules that expose APIs, for possible inclusion into core, it's useful to be able to see what their proliferation is like throughout contrib.
In the past, due to the way CVS stored its files, you could actually rsync down the entire contrib repository and grep it in plaintext (see http://drupal.org/node/277268 for instructions), which was SUPER handy for answering these and other questions. I understand Git's storage mechanism is different, but the desire is still there.
There are hacks, of course. You could write a script to parse the output of http://drupal.org/project/usage and run an individual git clone command for each project, then another script to update it periodically and add new repos. I have a feeling, though, that after the first 1000 or so projects in that list, your IP would get banned as a spammer robot. :P Let alone if every member of the security/core dev team wanted to do this.

OTOH, lots of people pointing at a public rsync repo seems like it would keep the traffic down significantly, as well as avoid people writing/running custom scripts (some of which are sure to be buggy and double-download projects, etc.) for this. It doesn't solve other problems (like I don't think an rsynced git repo can be grepped the way a CVS one could), but it would at least get the information onto people's hard drives so they can proceed from there.
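For what it's worth, bare repositories don't need to be checked out to be searched; once the tree is mirrored, something along these lines should work (a sketch assuming one bare repository per project under repositories/ and a valid HEAD in each; the search string is just an example):

    # search every mirrored bare repository without creating working trees
    for repo in repositories/*.git; do
        git --git-dir="$repo" grep -n 'drupal_http_request(' HEAD -- '*.module' \
            && echo "  ^ matches in $repo"
    done

It's slower than grepping plaintext the way the CVS mirror allowed, but it answers the same questions without a full checkout.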
Comment #4
Robin Millette commented

I'm really looking forward to getting my hands on all the MODULE.api.php files for a few articles I'm currently writing, if that's any incentive.
Comment #5
killes@www.drop.org commented

I've been asked by somebody (sorry, my memory...) at DrupalCon whether we provide this. So this is still of interest.
Comment #6
webchick commented

Yep.

For now, I'm getting by with https://drupal.org/sandbox/greggles/1481160, but I'm sure that's not remotely nice to D.o's infra.
Comment #7
Mixologic commented

I've been kicking around an idea for a code search tool that I think would satisfy the requirement of being able to get information out of the comprehensive code base.

What other use cases would people have for a physical copy of all the git history and working directories?
Comment #8
rocketeerbkw commented

Unless that tool is available now, being able to download all the code is still an active and useful feature.

Additionally, my use cases involve parsing or manipulating the code, not just searching it. Specifically, I wanted an updated, and slightly different, list of the most-used hooks.
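As an illustration of how rough that is to do by hand today, even a naive pass over full checkouts gives a first approximation (a sketch; it assumes complete working trees under ./contrib, counts only a hand-picked sample of core hooks, and will miscount functions that merely look like hook implementations):

    # naive count of a few well-known hook implementations across all checkouts
    for hook in menu init cron form_alter; do
        n=$(grep -rE "function [a-z0-9_]+_${hook}\(" \
            --include='*.module' --include='*.inc' ./contrib | wc -l)
        printf '%6d hook_%s\n' "$n" "$hook"
    done | sort -rn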
Comment #9
webchick commented

I think #7 would work for "how much would contrib be affected if we did change X" (which does come up a lot), but it wouldn't work for the scenario described in #8 (which comes up less often, but is still important).

Another approach could be resolving a long-standing feature request (which I can't find atm) to have contrib (or at least the top ~100 contrib projects) on api.drupal.org. Then we could pull the stats from #8 out of database tables pretty easily.
Comment #10
mlhess commented

The older way of doing this was using https://www.drupal.org/sandbox/greggles/1481160.

However, there are a few ways to do this now with the API (https://www.drupal.org/api): you can grab the list of all projects and then just check them out from git directly. I don't think providing an rsync service will ever be on the roadmap, but I do not speak for the DA.
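A sketch of that approach, for the curious (the endpoint, the type filter, the field name, and the clone host are my assumptions about drupal.org's REST API, not something specified in this issue):

    # page through the project list and mirror each repository
    page=0
    while : ; do
        names=$(curl -s "https://www.drupal.org/api-d7/node.json?type=project_module&page=${page}" \
            | jq -r '.list[].field_project_machine_name // empty')
        [ -z "$names" ] && break
        for name in $names; do
            if [ -d "${name}.git" ]; then
                git --git-dir="${name}.git" fetch --quiet
            else
                git clone --mirror "https://git.drupalcode.org/project/${name}.git"
            fi
        done
        page=$((page + 1))
    done

As #12 below points out, this is workable but slow, and it hits the API and the git servers with one request per project.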
Comment #11
Mixologic commented

I should probably elaborate on my idea, especially since I probably won't have time to implement it anytime soon with everything that's on the plate.

The code search I'm imagining is not a simple full-text search. It would use a tool like pharborist or PHP-Parser, combined with something like Exuberant Ctags, to parse/manipulate/tokenize all the code so it could be faceted and put into something like elasticsearch. That way the search itself could provide things like lists of most-used hooks, because the search could/should know that it is tokenizing a hook implementation and tag it as such. It could have tags for other metadata like Drupal major version, test vs. non-test code, etc.

From my perspective, the requirement to have the entire set of repositories available is to gather metadata about the code in the ecosystem. If we can build a tool that gives everybody access to that metadata, and lets everybody define what metadata they want out of it, would that be sufficient?

I can see a counter-argument that this only addresses the current state of the code and wouldn't necessarily provide *historical* data the way the git histories would, but I'm not sure what use cases we have there.
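To make the metadata idea a bit more concrete, the extraction layer could start from something as small as this (a sketch using Exuberant Ctags over hypothetical full checkouts in ./contrib; the real tool would tokenize properly with pharborist/PHP-Parser):

    # one record per PHP function definition: name, file, line
    ctags -R -x --languages=php --php-kinds=f ./contrib \
        | awk '{ print $1, $4, $3 }' > functions.idx

    # each record could then be enriched (project, core branch, test vs. non-test)
    # and bulk-loaded into elasticsearch as one document per symbol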
Comment #12
rocketeerbkw commented

Have you tried it? It's not as easy as you make it sound.

It took me a week to get everything, because the list of modules in https://www.drupal.org/sandbox/greggles/1481160 was out of date and I didn't know there was an API. Then I found out about the API and switched to using it, but it's broken (see #2364755: Query parameters dropped when redirected from /node to /node.json and #2253947: format suffix not added to next-first-last page url's), so it took more time to figure out what was wrong and work around it. It takes ~30 minutes to get the list of modules from the API, and another 3 days (and I have a fast connection) to get all the code.
Is it possible? Technically, yes. Currently it's very discouraging if someone is just curious about something and wants to have a look.
If an rsync download isn't the right solution, I understand. But I'm not sure a better option is in place right now.
Comment #13
drumm commented

API.Drupal.org could serve as the database for that; it does store and render all the code it indexes. That depends on getting #160470: [meta] Index all of contrib modules at api.drupal.org done.
Comment #14
marvil07 commented

@Mixologic: maybe now is a good time to open a new issue about searching through all of contrib; I mentioned the idea before and suggested another project implementing it that we could reuse. Also, I do not think we can identify hooks easily (unless we assume only core hooks). In any case, let's discuss that in a separate issue.
Comment #15
irinaz commented

@marvil07, is this issue still relevant? If yes, could you give more details? If no, could we close this issue?
Comment #16
fgm commented

It is still relevant, first for the reasons outlined by @webchick in #3, but also as a way to provide unfettered access to our public code, encouraging individuals to experiment, probably fail, and maybe sometimes succeed in creating value for the community, instead of being left with a feeling of gatekeeping.
Comment #17
nnewton commented

This issue is outdated. With the advent of PR forks, the shared object storage they introduce, and hashed repository storage via GitLab, this is no longer possible. You would need the full GitLab backup to do anything with the on-disk data, and that backup both contains data that cannot be publicly shared and is far too large to reasonably share this way. We can close this issue; any mirroring of the git repos will have to go through the git interface or something GitLab offers.