Many people have reasons to want the entire codebase on drupal.org available to them locally. This issue explores how this could be done in a generic, sustainable way and also provides some currently workable alternatives.

In #52 linclark presented one current solution; rfay has offered a script in #39 and a download tarball in #40 as an alternate solution.

#50 and #51 present paths forward for the d.o infrastructure, and an rsync capability like we used to have with CVS is also apparently underway.

-----
OP by hass:
I'd like to sync/get the full modules tree of all d.o projects in Eclipse. Currently it looks like I need to create a new GIT project in Eclipse for every module/theme. I cannot do this for 3000 projects by hand... How is this possible and what do I need to configure?

If you ask why I need this - well, I often search for code in other projects, and this can only be done locally if the projects have been synced to my box. I do not want to reinvent the wheel every day only because I cannot search other projects' code.

I tried to connect to git.drupal.org/project, but this always fails. I can only connect to git.drupal.org/project/[one project]. Not having the ability to sync all projects is a real showstopper.


Comments

rfay’s picture

Each Drupal project is now its own repository, so there is no way to use directory access as there was with CVS.

I'm not sure what it is you really want to accomplish here. Why would you check out all of contrib?

You might consider a single project with git submodules for each item in contrib, or just checking out each one separately. Eclipse does fine with git repos in subdirectories.

I'm still not sure what you're trying to accomplish, so if you explain that maybe I'll be able to help more.

-Randy

rfay’s picture

Title: Need all projects in a tree in eclipse like it was with CVS » How do I download all projects so they can be browsed?
Category: bug » support

Sorry - I didn't read your post carefully before, but now I understand.

Changing the title, as this doesn't have anything to do with Eclipse.

Sam says he has a good solution and will post it.

One thing we could do would be to make a contrib project that consisted of nothing but git submodules for (all) contrib projects.

Note: Please don't plan to do this until things have settled out with the infrastructure, as we probably shouldn't have anybody downloading the whole repo yet.

sdboyer’s picture

Title: How do I download all projects so they can be browsed? » Need all projects in a tree in eclipse like it was with CVS
Status: Active » Closed (won't fix)

Somewhere back in summer '10 this issue was brought up, iirc by davereid. The short answer is basically, "sorry". There are two fundamental reasons why this simply cannot work the same way:

  1. As rfay pointed out, we don't have one big "contributions" repository anymore. We just don't. We sorta simulate it by virtue of the fact that all projects live in a single top-level namespace, but yeah.
  2. The data storage structure of git is entirely different from CVS. Whereas CVS has all the changes ever written in plaintext to ,v files, git repacks its objects into packfiles, which consist of either zlib-compressed streams of the plain data, or a delta + the object it's delta'd from (if you really care, see http://book.git-scm.com/7_the_packfile.html). There'll be no grepping of those. You can only grep what's been checked out into the work tree - so even then, you're only grepping a single revision.

There is a possible alternative, though, if someone wants to code it. An external script, given a root location, could iterate over all repositories and use git-grep (which has some capability to search git's compressed data) to perform a grep in each repository, and do it across a long history. If someone actually does want to implement it, something like this will be your core operation that you run on each repository:

git rev-list --all | git-cat-file --batch | grep -P 'tree\ [a-z0-9]{40}$' | sed 's/^tree \([a-z0-9]\{40\}\)/\1/g' | xargs git-grep -e git

That will grep every commit from every branch (the scope of the search is controlled by the parameters to git rev-list) of a repository for the word 'git'. Its output looks something like this, using one of the repos I have sitting around. The leading hash is the tree in which the output appears, followed by the path to the file in which it appears:

 4eb4fa50cc787fa3483a7f4e89e52ff087f6585f:gitsupervisor/files/supervisord.conf/util.drupal.org:[program:drupal_git_repo_parsing]
 4eb4fa50cc787fa3483a7f4e89e52ff087f6585f:gitsupervisor/files/supervisord.conf/util.drupal.org:command=/usr/local/sbin/drush/drush.php -r /var/www/git-dev.drupal.org/htdocs --php=/usr/bin/php process-waiting-q
 bd8fc812868ab42dab01a7b878d245e322744d53:drupalgitweb/files/cgi-bin/gitweb-tmp.cgi:our $number_of_git_cmds = 0;
 bd8fc812868ab42dab01a7b878d245e322744d53:drupalgitweb/files/cgi-bin/gitweb-tmp.cgi:# i.e. full URL is "$git_base_url/$project"
 bd8fc812868ab42dab01a7b878d245e322744d53:drupalgitweb/files/cgi-bin/gitweb-tmp.cgi:our @git_base_url_list = grep { $_ ne '' } ("");
 bd8fc812868ab42dab01a7b878d245e322744d53:drupalgitweb/files/cgi-bin/gitweb-tmp.cgi:our @diff_opts = ('-M'); # taken from git_commit

This command is just something I quickly whipped together; I am SURE there are plenty of improvements to make. But the basics are there for someone to pick up if they want. Bottom line, though, until someone actually writes a tool like this and makes it publicly available, I don't see a compelling reason to make a new rsync mirror. Git already has efficient protocols for getting the most recent data, and you're already getting everything when you fetch. So, if someone writes a script like the one I'm describing, I'll lobby for making an rsync mirror of all our (full project, at least) repos available. Until then, though, this is a won't fix, imo - the disks containing the git repos are going to be taxed enough without this extra load.

sdboyer’s picture

Anybody experimenting with that will quickly note that grepping through all commits is probably not worth it, and will generate a lot of redundant data. A realistic implementation will probably almost never use git rev-list --all, but a more fine-tuned list of source commits.

rfay’s picture

Title: Need all projects in a tree in eclipse like it was with CVS » How do I download all projects so they can be browsed?

X-post on the title.

I actually think my approach (a master repo with a set of git submodules) would be far friendlier to the average person. It would provide a populated tree, use git protocols when doing a git submodule update, and doesn't have the level of complexity of looking at every commit.

The only thing required to build this is a way to get a list of projects.
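
A minimal sketch of how that could be assembled, assuming a project_list.txt with one project machine name per line (producing that list being the missing piece):

#!/bin/bash
# Sketch: build a master repo containing nothing but submodules,
# one per contrib project listed in project_list.txt.
mkdir contrib-all && cd contrib-all
git init
while read project; do
  git submodule add "git://git.drupal.org/project/${project}.git" "$project"
done < project_list.txt
git commit -m "Add all contrib projects as submodules"

A clone of that repo would then be populated with git submodule update --init.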

sdboyer’s picture

I'm not encouraging most folks to write that command. Not at all. I'm encouraging one enterprising person to write a script that's based on that command, but presents a nice interface. And then most folks can use that script.

This, I think, is one of those cases where submodules seem like a good idea, but end up being a pain. To have a single repository that does this, we'd have to make it fetch & commit an update to its submodules for every push that's made to _every_ contrib project. Otherwise, when updating the super-repo, projects won't actually be at their latest. Moreover, we'll need to make an arbitrary decision about which branch to follow in the submodule. Which means that our logic when assembling that repository will figure significantly in the results that are generated (by default, and without requiring expensive checkout operations).

What I gave above, which listed all commits in a repo, was just a trivial example. git rev-list is _the_ powerhouse command that enables you to create any topological commit map your heart could possibly desire. If you don't want commit ranges, though, you can easily swap out that first piece and use git rev-parse to search the tip of every branch in the repository:

git rev-parse --branches | git-cat-file --batch | grep -oP '(?<=^tree )[a-z0-9]{40}$' | xargs git-grep -e git

(I fixed the grep to obviate the sed)

Doing it this way means you can even do this on bare repositories - which means a simple rsync is all that'd be needed, and the amount of space required on a local system to perform this is much less. But really, the issue is that needing to actually _check out_ a working tree to do a standard grep on them is an O(crap) operation. Reading the data directly from compressed streams that are selected via an indexing & revwalking system designed to do this sort of thing is an O(less crap) operation. And there's none of that overhead associated with needing to generate fake commits to chase branch changes. Something as intensive as grepping several thousand repositories is not the place to repurpose porcelain. Plus, since we're reading raw repository data directly, there's no need to be concerned that "maybe it's not up to date...can I really trust this search?"
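
As a rough illustration, a wrapper built on that could loop over a directory of bare repositories and run the branch-tip search in each - the directory layout and argument handling here are assumptions, not a spec:

#!/bin/bash
# Sketch: run the branch-tip grep across every bare repo under $1,
# searching for the pattern given as $2.
ROOT="$1"
PATTERN="$2"
for repo in "$ROOT"/*.git; do
  echo "=== $repo ==="
  (cd "$repo" &&
    git rev-parse --branches |
    git cat-file --batch |
    grep -oP '(?<=^tree )[a-z0-9]{40}$' |
    sort -u |
    xargs git grep -e "$PATTERN")
done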

hass’s picture

Status: Closed (won't fix) » Active

Please don't make this showstopper a won't fix. This is a major issue, as it prevents developers from re-using code - all the code in CVS goes into a dark black GIT hole!

I'm fine if I have a chance to rsync branches. It's not always a question of "checking out", and an always up-to-date sync of a branch is not always a must. For example, I'm searching for a way to upgrade a module: in D7 I need to upgrade to hook_form_user_profile_form_alter() in combination with hook_user_presave(), but there is no documentation available yet, or I do not understand it and need more examples. Without the ability to search all projects on my local disk I may never find a solution to upgrade the Google Analytics module. I'm doing file searches with text matches on the whole contrib tree on my local disk. I don't add load on d.o machines in such cases. I'm not syncing daily - often only a full tree sync every 3 months or so. This is very common I guess, and if I need the latest version of one project I simply right click on /drupal-contrib-7--1/modules/google_analytics > Replace with > Latest from Branch DRUPAL-7--1 and I get the latest bits. This is an absolutely good workflow, but GIT sh** makes me create a new project, enter ~10 or more settings, username, passwords and so on to get *one* project downloaded. This is an awfully bad workflow and a waste of time, and it does not give me what I really need (searching all code for something that may help me to get things forward).

Now we start re-inventing the wheel as we have a very big black GIT hole.

I do not like to search in commit messages or patches, but I may not understand all you have said above. I need a synced copy of branches on my local disk for speedy file-based searches - nothing else. I do not need to search in specific releases - only in the latest branches. I'm often using the file search also to see if there are modules that already have features implemented. That stops me from duplicating code, and we already have so many modules with duplicated functions... something is really going in the wrong direction.

I also used the offline sync of all projects on vacation while on a ship (offline for many days). I wouldn't have been able to do further development sometimes without the sync on my box. Sometimes you only need a new idea.

sdboyer’s picture

Category: support » feature
Status: Active » Postponed

It's git, or Git. Never GIT. It's not an acronym.

I'll keep it off won't fix, but the best this can get is postponed. No matter what reasons you provide. As I said, I won't advocate for this until someone writes a script that uses the core searching functionality I've outlined.

I'm not syncing daily - often only a full tree sync every 3 months or so.

That's irrelevant. If we provide a service, the prudent assumption is that it will be used to the fullest extent we make it available.

This is an absolutely good workflow, but GIT sh** makes me create a new project, enter ~10 or more settings, username, passwords and so on to get *one* project downloaded. This is an awfully bad workflow and a waste of time, and it does not give me what I really need (searching all code for something that may help me to get things forward).

You only have to enter that information once, as it's described on your 'Git access' tab. We repeat it in the Git instructions on a per-project basis because, if you ignore/skip what's on the 'Git access' tab and don't set it per-repo, you won't receive attribution for your commits. Once you get all that first-time setup done, it's fundamentally just git clone <repo uri>. Personally, I'll take that over cvs -d:pserver:anonymous:anonymous@cvs.drupal.org:/cvs/drupal-contrib co -d path/to/output -r <branchname> contributions/modules/<modulename>.

I also used the offline sync of all projects on vacation while on a ship (offline for many days). I wouldn't have been able to do further development sometimes without the sync on my box.

I think you need to read a bit more about how Git works. It's a decentralized system. That in itself obviates the need for the rsync'd "sources." What you'd get from an rsync is exactly what you get when you clone/fetch - except rsync is probably *less* efficient.

I've already conceded that there's a legitimate use case here. I myself used to grep my rsync of contrib fairly often. But I've provided a clear path forward. Rallying support for that approach is going to be a lot more effective than venting. Keep in mind that I'm not guaranteeing we'll provide the rsync mirror if that script emerges - while this would work right now with the way our repositories are laid out on disk, that is likely to change in the future as we start accommodating an increasing number of repositories. In that case, though, such a script could be used as an engine for searching repository contents that we eventually present through the d.o web interface.

danillonunes’s picture

It's possible to write a script to parse some list of all d.o projects (like http://git.drupalcode.org/), and then clone/pull each of them. Is there something wrong with this approach that I can't see? (Except that, well, it will take a LOT of time, since Git downloads the complete repository, and not just the last revision like CVS.)
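
A rough sketch of that approach, assuming project_list.txt already holds one project name per line:

#!/bin/bash
# Sketch: clone each project if we don't have it yet, otherwise fetch.
while read project; do
  if [ -d "$project/.git" ]; then
    (cd "$project" && git fetch --all)
  else
    git clone "git://git.drupal.org/project/${project}.git" "$project"
  fi
done < project_list.txt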

rfay’s picture

This also doesn't require d.o infrastructure. A single site could package up all the contrib code, update it from time to time, and make it available via rsync.

Also, since most of us really just want the code for searching, a website (either as part of d.o or not) could provide searching capability.

sdboyer’s picture

@danillonunes - yes, someone could write a script to just clone them all down directly. Would be a lot slower than a massive rsync, though, since it's got to set up and tear down the remote connection so many times. But having the whole repository history is really the point (even/especially for searching), so that's optimal already.

Dave Reid’s picture

Subscribing... I will greatly miss the ability to grep all of contrib.

hass’s picture

Component: Documentation » Git
Priority: Normal » Critical

I think we should hold the Git migration back until there is a solution. This is a heavy blocker. As said - Git becomes a black hole. This is not acceptable, and telling us "we" can write a clone script is not an acceptable solution either. If there is a solution, schedule the Git migration - but not a day before.

Dave Reid’s picture

Component: Git » Documentation
Priority: Critical » Normal

I don't really agree this is a blocker though. It can easily be implemented post-migration.

webchick’s picture

We are absolutely not pushing off the Git migration for this. This is a very advanced use case that maybe 100 people on the entire planet have (and I say this as one of them). I'm sure one of us has enough shell scripting skills and can get something done for the other 99, and if it happens a few days or even weeks after migration, no one will die in the night screaming and bleeding.

webchick’s picture

Issue tags: +git phase 3
hass’s picture

As ~10 people have already subscribed here, I guess there are many more. And the ones who are not yet aware of this limitation are not counted. I'd also like to note that if this script needs to run on client boxes, it requires Windows support. This may need to be said explicitly.

chx’s picture

Priority: Normal » Critical
Status: Postponed » Active

I do not need every commit, nor do I need every revision -- but it's absolutely crucial for both security and core development that we are able to quickly grep all of contrib for a given Drupal version.

sdboyer’s picture

@chx - then let's get that script written.

sdboyer’s picture

To maybe make this easier - if someone could define an interface that this script should fulfill (e.g., the parameters it should take for filtering projects, filtering branches, etc.) then I might be able to string it together after launch. That's the only part I would find really onerous about this - given an interface to conform to, I could pop out such a script pretty quickly.
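
To make concrete what I mean by an interface, here is one hypothetical shape it could take - every name and flag below is an example, not a spec:

# drupal-grep - search local mirrors of d.o repositories (hypothetical)
#
# Usage:
#   drupal-grep [--projects=GLOB] [--branches=tips|releases|all]
#               [--tags=releases|all] [--core=6.x|7.x|8.x] -e REGEX ROOT
#
# Example: search the 7.x branch tips of all projects for a hook:
#   drupal-grep --branches=tips --core=7.x -e 'hook_user_presave' /var/git/mirror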

hass’s picture

Fulltext search and regular expression search both with and without file name wildcards like *.install, foo*.in* and everything else you can think of and not to forget - 100% eclipse integration.

egl’s picture

FileSize
1.41 KB

For the time being, my little stupid fetch-all script. You're free to smash your machine to pieces with it.

The first part reads http://drupal.org/project/modules/index and tries to extract the repositories. This is a really stupid approach and needs some improvements.

The git part uses VersionControl_Git. You'll find it at http://pear.php.net/package/VersionControl_Git. The only obvious things which should be reconsidered in this part are the $orig_dir = getcwd; chdir($directory); chdir($orig_dir); commands.

And of course the script still needs some handling of exceptions.

hass’s picture

Priority: Critical » Normal
Status: Active » Postponed

How many days does it take to sync or re-sync with this script, where is "VersionControl/Git.php", and what about HEAD and branches?

egl’s picture

I've added some information in my original post.

It should fetch all branches and does not modify HEAD, so it is up to you to merge, rebase, or checkout. But if you are sure you want to do it on all sources and overwrite any potential local changes, you can easily add this after fetching.

And sorry, I have neither tested a complete run nor a fetch, the server is still down.

Edit: It ran for half an hour and got 266 modules (using 280M of my hard disk); assuming the first 266 modules are average-sized, a complete clone will take 14h (and need 8G of disk space).

danillonunes’s picture

I made a script using shell script instead of PHP (because I have some troubles with PHP-CLI here) to fetch the project list from http://drupal.org/project/usage using YQL, so it gets the X most popular modules first, where X is a limit defined in the script, and then clones/fetches the repositories from d.org. If anyone is interested, the script is at http://drupal.org/sandbox/danillonunes/1076042.

sdboyer’s picture

@hass -

Fulltext search and regular expression search both with and without file name wildcards like *.install, foo*.in*

I asked for an interface, not a wishlist. An interface means specifying parameters and arguments.

and everything else you can think of

I'm not your servant. The point of my asking for an interface was so that I didn't have to define the spec.

and not to forget - 100% eclipse integration.

lol. Would you like Rainbow Dancing Ponies too? Now you're just expecting me to cater to your personal environment.

That's it, hass, I'm done with you. Never doing anything for you again.

sdboyer’s picture

@egl and @danillonunes - sorry, I'll review your stuff soon - I'm now in the midst of moving & DrupalCon preparations. A quick note, though: I'd rather we base this on a single, mass 'cache' repository such as that described in http://randyfay.com/node/93 or initially implemented in http://drupal.org/sandbox/mfer/1074256; using that approach would mean a single git fetch --all would get all the data we could want to grep, and the only thing we need to do is periodically update its list of remotes.

Also, I'd rather avoid the PEAR library if possible...it's yet-another-wrapper that's a rough duplicate of other libraries we use, and we've avoided it thus far. Now, this is all client-side, so it doesn't really matter (and I care a lot more that it works than that we use this or that library), but if we could stick with straight exec() calls until there's a compelling case for a wrapper, I'd be happier.
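
In outline, the cache-repository approach is just this (names and paths assumed for illustration):

#!/bin/bash
# Sketch: one bare 'cache' repo with every project added as a remote.
git init --bare drupal-cache.git
cd drupal-cache.git
while read project; do
  git remote add "$project" "git://git.drupal.org/project/${project}.git"
done < ../project_list.txt
# From then on, a single command refreshes all the data we'd want to grep:
git fetch --all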

hass’s picture

Looks like you have not understood me. Nobody asked you to re-invent a search interface wheel. Every development tool have search capabilities and the users know them. Asking for parameters goes in a wrong direction. Provide a sync/clone/name it as you want and we are all done and able to use our prefered software. Nobody need to ask for any params/arguments/whatever - except "how can I clone the stuff for offline use on my local box".

sdboyer’s picture

@hass - You obviously do not understand numerous basic realities of git, and how those realities necessitate this problem be solved. And that's fine. This being the internet, it's really silly of me to ask you to keep your ignorance to yourself...but I'm gonna ask anyway:

Please go away. You don't understand what's involved here, and all you're doing is derailing work towards a real solution.

egl’s picture

FileSize
1.41 KB

I'd rather we base this on a single, mass 'cache'

Am I missing the point? Using a base mass cache downloads/fetches all repositories to a local cache to clone from. So you have the history twice (1x in the cache and 1x in working-dir/.git). The PHP script clones all repositories it does not already have locally and fetches the repositories it already has. So you have one complete history and don't need another local cache.

A piece of code you'll really need is an intelligent recursive checkout of branches. You can neither assume that every project has only one branch for each major drupal revision nor that HEAD points to the correct branch. But this is necessary for both approaches.

A feature we'll really need is a stable interface that supplies a machine-readable list of all projects with repository URLs, to avoid the 'defect-by-concept' parsing of HTML pages.

Btw, a pair of () is missing in my first script:

kenorb’s picture

It could be useful to download all projects into one dir.
Sometimes I'm using grep and some custom patterns to check if anybody has done something similar (to not create duplicates), or to find some bad coding, hooks, files, etc.
I tried to run #31, but after installing the PEAR module with:
sudo pear install -f VersionControl_Git
it still can't find the includes.
I'll have to check again later.

rfay’s picture

It's going to be time for us to work on this before long.

I needed this yesterday: "What modules implement the VBO hooks, so I can see how it's done?"

kenorb’s picture

The following command should list all the projects:

curl http://drupal.org/project/usage | perl -nle 'print "http://git.drupal.org/project/$2.git" if /(<a href=.*usage\/(.*)">.*<\/a>)/'

On Mac (where curl is installed by default).

On other systems like Linux/Unix/Mac you can try (install lynx first):

lynx --source http://drupal.org/project/usage | perl -nle 'print "http://git.drupal.org/project/$2.git" if /(<a href=.*usage\/(.*)">.*<\/a>)/'

Add at the end:

| xargs

to print it on one line.

So the full command to download all the projects via git is:

lynx --source http://drupal.org/project/usage | perl -nle 'print "http://git.drupal.org/project/$2.git" if /(<a href=.*usage\/(.*)">.*<\/a>)/' | xargs -L1 -P4 git clone --branch 6.x-1.x

Tested on Mac, but it should work on other shell-based systems as well.
Use whatever branch you need in place of 6.x-1.x.

rfay’s picture

Thanks for the help. I'll need to do this for #102102: Parse project .info files: present module list and dependency information to prime the pump so it doesn't take a billion years, as that code has to examine every branch of every project.

clemens.tolboom’s picture

This issue is waiting for #102102: Parse project .info files: present module list and dependency information to get completed

Powered by Dreditor (triage sandbox) and Triage transitions

(subscribing + some advertising of my triage)

dennis605’s picture

@kenorb:

Hey kenorb,

I was looking for such a script for a long time.
Thanks for this, it's working great.
Regards,
dennis605

Damien Tournoud’s picture

Regardless of how this goes, we *should* make an anonymous rsync available. People will come up with crazy scripts to do that anyway, so we might as well make it gentler on our infrastructure.

rfay’s picture

FileSize
370 bytes

Here's a version of kenorb's #34 that I used for #102102: Parse project .info files: present module list and dependency information. It's certainly not mature, but it does the job.

rfay’s picture

FYI, and to save some trouble, I put a full copy of all repositories as of yesterday at http://randyfay.com/files/drupal_gitrepos/drupal_gitrepos.tgz - it was created using #39, and can be updated with #39.

This is actually all to make it easier to get reviews of #102102: Parse project .info files: present module list and dependency information, which is basically ready to go, IMO, so if you want to get it for that reason, that's good with me :-)

chx’s picture

Title: How do I download all projects so they can be browsed? » all projects can't be grepped
Category: feature » bug
Priority: Normal » Critical
Status: Postponed » Active

This is becoming critical. Crazy scripts aside, the inability to ack-grep contrib for this and that is making core development difficult.

EvanDonovan’s picture

Wanted to link Lin Clark's post on this issue, since it appears she linked back to the issue, but not the reverse: http://lin-clark.com/blog/writing-scripts-clone-contrib-projects-gitdrupalorg. Looks like a pretty basic solution, in terms of functionality, but at least it gets you a copy of everything.

What happened to random CVS stuff that wasn't in a project, btw (like CVS sandboxes)?

chx’s picture

So what stops us from allowing anon rsync?

sdboyer’s picture

We could set up an anon rsync. Here are the problems with it:

  • Everything that's on disk would only be bare repositories - no actual code that can be grepped. You'd have to write some additional local scripts in order to get the repos set up in a way that a recursive grep would work. Which starts bringing us back towards the other, earlier points I made about a joint repo with some magic git commands.
  • If/when we move to an on-disk layout that's abstracted from the repo URI (which we'll likely want or need to do when we add per-issue repos or first-class forks), then the correspondence between on-disk repos and what's on d.o will become less clear, if recognizable at all.

I have no problem with setting up the anon rsync, but I am already 150% loaded with responsibility for git in general, so I would really prefer it if someone else wrote the script that tied things all together. That's why I put the commands up there in the first place - hoping to put enough information out there to make it possible for someone else to step up and help. If someone wants to, I'm happy to help them through the process as they need it.

sdboyer’s picture

chx’s picture

Because I was scolded for making this critical, let me elaborate.

The security team gets a report. It turns out to be a dangerous pattern. Say, you supply user-submitted data to a place where it is not expected. We cannot grep contrib right now to verify which modules do this. This means we can't find security holes. That's critical in my book.

sdboyer’s picture

@EvanDonovan - There were a few useful things outside the main directory structures that we moved into their own repositories, for example docs (http://drupal.org/project/documentation), but we determined that we would leave sandboxes alone, as they're entirely unrestricted in structure and thus impossible to really bring across effectively (wrt branch & tag renaming). Since none of it is "real" code anyway (so the history presumably isn't important), and it's all still available read-only, if anybody wanted to bring something over they could just do it themselves.

@chx - In #21 I asked for someone to define an interface, but it still hasn't happened - hass' rainbow shiny wish list doesn't count. The problem we have to solve is that there is no direct analogy between grepping all of Drupal's CVS and grepping all of Drupal's Git, as I explained back in #2. Consequently, we need to create a new approach, and since I don't personally have a compelling need for this system, basing it off of what I think is useful would be asinine. But since no one else seems to get what I mean by needing to define requirements, let me try to clarify. Please rank the following as "essential," "nice to have," or "wouldn't use it":

  • Grepping all revisions in all projects.
  • Grepping all branch tips in all projects.
  • Grepping only branch tips with valid release naming in all projects.
  • Grepping all tags in all projects.
  • Grepping only tags with valid release naming in all projects.
  • Single script that transparently maintains the whole local git mirror, no muss no fuss. (alternative is a set of instructions on a docs page that you make work for yourself).
  • Sandboxes available for grepping as well as projects.
  • *Any* interface other than a provided CLI script. That includes your own grepping tool, e.g. ack or perl.

Folks besides chx are obviously welcome to weigh in on these rankings.

chx’s picture

greggles’s picture

Ping for update from the spreadsheet in #48?

I have a strong reason to do this nowish across all 6.x and 7.x compatible branches/tags of themes.

That is a common case: "need to grep all [modules|themes] that are compatible with [6.x|7.x|8.x]"

chx’s picture

The poll right now stands at

  1. Grepping only branch tips with valid release naming in all projects. 140
  2. Grepping only tags with valid release naming in all projects. 126
  3. Single script that transparently maintains the whole local git mirror, no muss no fuss. (alternative is a set of instructions on a docs page that you make work for yourself). 122
  4. Grepping all branch tips in all projects. 118

The rest are behind; the next would be at 103 only.

sdboyer’s picture

OK yeah, we can work with that. From #50, all of 1, 2, 3, and 4 are pretty doable. Need to have the rsync mirror, then the wrapper script is only moderately difficult; the core logic is the stuff in my original comments on this thread. The rest of it is just doing some output massaging, since we don't really have to care about allowing grepping of a subset of repos.

kendouglass’s picture

Check out what linclark has done at: http://lin-clark.com/blog/writing-scripts-clone-contrib-projects-gitdrupalorg

What I do is:

  1. download a copy of Drupal
  2. cd into the Drupal folder
  3. create a list of all projects in Drupal:
    #!/bin/bash
    curl http://drupal.org/project/usage | perl -nle 'print "$2" if /(<a href=.*usage\/(.*)">.*<\/a>)/' >project_list.txt

    (taken from http://drupal.org/node/1057386#comment-4768038)

  4. Feed that list to drush:
    drush --no dl `cat project_list.txt`

The beauty of this is that drush downloads all "recommended releases" for the version of Drupal I'm currently in, it sorts them into the appropriate folders in sites/all/..., and I get the extra date and version info in the .info files.

rfay’s picture

Title: all projects can't be grepped » Provide a way to download the entire git codebase for all projects

Updating the title. I added a simple issue summary pointing to some of the open options.

droplet’s picture

Thanks @52.

I only need the D7 dev version. Combining Drush & Git caches, here is my way:

Downloading Projects:

#!/bin/bash
REPOS="/root/drupal_project_git_repos"

mkdir -p $REPOS
cd $REPOS

curl 'http://drupal.org/project/modules/index?project-status=0&drupal_core=103' | perl -nle 'print "$1" if /<span class="field-content"><a href="\/project\/(.+)">.+<\/a><\/span>/' >project_list.txt

drush --no dl --dev --package-handler=git_drupalorg --cache --skip -v `cat project_list.txt`

** added '-v' just to track progress

Updating Git mirror repos:

#!/bin/bash
REPOS="/root/.drush/cache/git/"
repolist=(`ls -1 ${REPOS}`)
for r in ${repolist[@]}
do
  cd ${REPOS}${r} && git fetch --all --prune
  pwd 
done

Updating checked-out projects:

#!/bin/bash
REPOS="/root/drupal_project_git_repos"
repolist=(`ls -1 ${REPOS}`)
for r in ${repolist[@]}
do
  cd ${REPOS}/${r} && git pull
  pwd 
done

Just wrote these scripts and I'm downloading the projects now. Hope D.org doesn't block my IP :)

EDIT:
Or, downloading -dev projects without drush (UNTESTED):

#!/bin/bash
REPOS="/root/drupal_project_git_repos"
REPOS_CACHE="/root/drupal_project_git_repos_cache"
mkdir -p $REPOS $REPOS_CACHE
cd $REPOS

curl 'http://drupal.org/project/modules/index?project-status=0&drupal_core=103' | perl -nle 'print "$1" if /<span class="field-content"><a href="\/project\/(.+)">.+<\/a><\/span>/' >project_list.txt

while read line; do
 echo "$line"
 git clone --mirror git://git.drupal.org/project/${line}.git ${REPOS_CACHE}/${line}.git
 git clone --reference ${REPOS_CACHE}/${line}.git git://git.drupal.org/project/${line}.git ${REPOS}/${line}
 cd ${REPOS}/${line} && git checkout 7.x
done < "project_list.txt"

marvil07’s picture

Based on sdboyer comment at 51:

Need to have the rsync mirror, then the wrapper script is only moderately difficult; the core logic is the stuff in my original comments on this thread.

The first step should be #1244344: Set up anon rsync of git repositories

greggles’s picture

I took the solution in #54 and my access to querying d.o to create a tool that will download all themes and modules using drush to get their recommended dev version.

Since not everyone can query d.o, I put the list of themes/modules and instructions in this sandbox:

http://drupal.org/sandbox/greggles/1481160

I welcome patches to make this easier. Right now on a medium speed internet connection it takes less than an hour to download everything. Not bad!

chx’s picture

We are going to write a script that clones every repo to a temp dir, checks out the latest branches, and then tars up the whole thing, offers it over BitTorrent, and publishes the torrent. Stay tuned.

Robin Millette’s picture

Looking forward to it as I explained in #1244344-4: Set up anon rsync of git repositories. Any which way works for me :-)

marvil07’s picture

After taking a fresh look at this, based on the feedback from the poll, the main use case is actually finding code, so maybe the better way to solve this is to provide a way to search online, so we do not require everybody to download it all and then update it constantly.

A good example of this is Debian Code Search. I'm not sure yet what engine to use to index the code in a useful way, but the code behind that Debian site is available. There is also a video presenting the software by the author, published on YouTube, in case someone is interested.

sdboyer’s picture

we might be able to design some kind of solr-backed strategy focused on branch tips.

helmo’s picture

Something similar to the Debian Code Search would be great!

But a way to download it all is still useful. Be it for some unforeseeable crazy analysis project.

helmo’s picture

What about mounting the git tree read-only on the devwww.drupal.org server? Then I, and other devs, could review it... and rsync myself a copy if needed.

drumm’s picture

For code search, #160470: [meta] Index all of contrib modules at api.drupal.org would get all the code into MySQL tables. We also have #84207: Integrate API full text search into the ApacheSolr setup kicking around. Regardless of what's running it, we would have to be careful about tokenizing and such for useful code search.

drumm’s picture

Issue summary: View changes

Added an issue summary showing current status.

rocketeerbkw’s picture

The perl one-liners on this page (#34, #52, #54) only work when the projects are all listed on one page. The project usage page at https://www.drupal.org/project/usage now has a pager (I assume from the d.o move to Drupal 7), which breaks all the scripts listed so far.

I created a simple script at https://github.com/rocketeerbkw/drupal_function_stats/blob/master/get_li... to deal with the paging. This took a while to run the first time because none of the pages were cached, so I also have the results at https://raw.githubusercontent.com/rocketeerbkw/drupal_function_stats/mas....
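
For reference, a pager-aware variant of those one-liners could look like this - the ?page=N parameter is Drupal's standard pager, and the regex is the old one from #34, so it may need adjusting for the current markup:

#!/bin/bash
# Sketch: walk the usage page's pager until a page yields no projects.
page=0
while true; do
  out=$(curl -s "https://www.drupal.org/project/usage?page=${page}" |
    perl -nle 'print "$2" if /(<a href=.*usage\/(.*)">.*<\/a>)/')
  [ -z "$out" ] && break
  echo "$out"
  page=$((page + 1))
done > project_list.txt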

drumm’s picture

webflo’s picture

The drupal.org API for projects is not complete. In particular, the git URL is not present in the current JSON output. And the git URL is not always the project name, because the URL is case-sensitive. One example is https://www.drupal.org/project/autocache, whose git URL is http://git.drupal.org/project/Autocache.git.

heathdutton’s picture

Here's my way. It's dirty, but it gets you all the published projects - useful for searching for current method usages.
https://github.com/heathdutton/megadrupal

Chi’s picture

For those who are still looking for a way to get the entire Drupal codebase on localhost, I created an NPM module that can download projects from updates.drupal.org (not from git). It does it fairly quickly because files are loaded in parallel. For instance, it takes me about 20 minutes to download the whole D7 codebase (12923 projects).
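
The parallel-download idea itself is easy to sketch in shell. This assumes the release-history XML feed that drush also uses and simply takes the first release listed per project, so treat the details as assumptions rather than how the NPM module actually works:

#!/bin/bash
# Sketch: resolve each project's newest D7 tarball from the
# release-history feed, then fetch up to 15 files in parallel.
while read project; do
  curl -s "https://updates.drupal.org/release-history/${project}/7.x" |
    grep -oP '(?<=<download_link>)[^<]+' | head -1
done < project_list.txt |
  xargs -n 1 -P 15 curl -s -O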

webchick’s picture

Holy crap, that looks very promising. Thanks, Chi!

hass’s picture

You know that 50 requests in parallel is beyond the RFC specification? You may kill a webserver this way. Max 8 per server is allowed, as far as I know. Not everything that is possible is good. I killed webservers with httprl and linkchecker on our first tries. We reduced the limit in HTTPRL to 2 requests per server.

Chi’s picture

Max 8 per server is allowed, as far as I know.

http://stackoverflow.com/questions/985431/max-parallel-http-connections-...
Per that thread, the limit was removed from the specification. The actual value of the limit was 2, and most browsers never respected it. I tried to run the loader with concurrency=1000+. It worked, but did not reduce download time noticeably. I guess the max number of parallel requests is also limited in some other places (OS, router, provider, server, etc.).

hass’s picture

The limit of 2 was enforced by IE, for sure. Others have limits too - but they have increased every 5 years.

The article you linked to shows that all browsers have a limit of max 8!??

You are clearly running an attack against the servers.

greggles’s picture

@hass thank you for your concern for server resources. I believe these files are fronted by a rather robust CDN. We don't want to abuse that CDN but it's also fine to leverage it. One suggestion: you might feel and also express some gratitude to Chi for solving your problem.

I am grateful to Chi for contributing their solution to this problem. Thanks, Chi.

I consider this to be a solved problem. There are multiple solutions that work reasonably well. I don't think we really need anything more from drupal.org itself at this point.

droplet’s picture

Alternative for quick source code searching:
https://github.com/drupalprojects

Chi’s picture

A few more search tools:
http://www.drupalcontrib.org - Drupal 7
http://grep.xnddx.ru - Drupal 8

rocketeerbkw’s picture

Project packages are served via the Fastly CDN; I doubt this would severely impact drupal.org.

It would be polite to reduce the load on the CDN, maybe set the default concurrency to something low.

clemens.tolboom’s picture

We could now document this somewhere in the Developers corner.

Chi’s picture

I agree. 50 seems too big. Just put it down to 15.

jonathan1055’s picture

http://grep.xnddx.ru is excellent. Who runs it, and how can we ask for the source to be refreshed?

As part of our WebTestBase to BrowserTestBase conversions #2735005: Convert all Simpletest web tests to BrowserTestBase (or UnitTestBase/KernelTestBase), we are finding little-used functions which may or may not get added to the AssertLegacyTrait class, for example getAllOptions() in #2907485: Add getAllOptions() to AssertLegacyTrait.
I have just used http://grep.xnddx.ru to find a call to it in a contrib module. But one known call is not shown, and the dates are from 2016. It would be great if that resource could be refreshed somehow. Many developers do not need or want to download an entire copy of the contrib module set simply to use search functionality like this.

Chi’s picture

Who runs it, and how can we ask for the source to be refreshed?

https://www.drupal.org/u/xandeadx

jonathan1055’s picture

Just for info, I asked user xandeadx to refresh their database of 8.x modules - the previous refresh (translating the page from Russian) was:

Modules in database: 2079 
Files in database: 33878 
Last updated: 06/29/2016

They very promptly responded and now the details show:

Modules in database: 4154 
Files in database: 60867 
Last updated: 09/23/2017

Great resource :-)

joseph.olstad’s picture

FileSize
129.58 KB

I updated script #39 (and improved it). Since git moved, I changed it from ssh to https; to be nice, I added a 2-second sleep between projects.

Inside this archive is a July 11th, 2019 list of all contrib projects listed in the 'usage' page. It is not 100% complete; however, it does have a lot.

With this script I git cloned 24,012 projects(!)

over 24,000

apaderno’s picture

Category: Bug report » Task