Problem/Motivation

The move from CVS to Git in 2010 was a major step towards providing a modern hosting platform for Drupal core and contributed projects. Even with git.drupal.org in place, and cgit.drupalcode.org providing straightforward web access to project sources, several shortcomings remain unsolved to this day:

  • Code search: The only way to search cgit.drupalcode.org is via Google, which is lacking at best and requires several hacks to narrow results down to anything resembling a useful list of hits. drupalcontrib.org could be considered an alternative, but it limits its coverage to "Drupal's most popular contributed projects".
  • Metrics, data API: There are some ways to process the information that resides on git.drupal.org and cgit.drupalcode.org, but each involves several parsing steps that are slow and put additional pressure on the drupal.org infrastructure, and/or involve a lot of local processing. For example, to fetch the commit activity for a project, one would either have to check out the project and analyze its commit history locally, or parse the information provided on drupal.org or cgit.drupalcode.org.
  • Blame: git blame is a vital tool to discover the origin of code changes. The same can be achieved by checking out a project and running blame locally, but having blame available via a web frontend like cgit enables developers and supporters alike to quickly review possible causes of discussed problems. The related issue (#2285061: [feature regression] Bring back git blame UI) has been on hold for quite a while, and it doesn't seem like blame will find its way into cgit anytime soon.
  • Performance: Since git.drupal.org is at the center of all Git-related operations, it may perform poorly at times with regard to processing power and bandwidth. Any service that queries git.drupal.org has to be able to deal with delays several times longer than those seen during normal hours.
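
To illustrate the "local processing" route described under Metrics, here is a minimal sketch that counts commits per month for a project that has already been checked out. It assumes the dates were produced with `git log --pretty=format:%ad --date=short`; the function name and the sample data are made up for illustration.

```python
from collections import Counter


def commits_per_month(git_log_dates: str) -> Counter:
    """Count commits per YYYY-MM.

    Expects one date per line in YYYY-MM-DD form, e.g. the output of
    `git log --pretty=format:%ad --date=short` run inside a checkout.
    """
    counts = Counter()
    for line in git_log_dates.splitlines():
        line = line.strip()
        if line:
            counts[line[:7]] += 1  # keep only the YYYY-MM prefix
    return counts


# Made-up sample standing in for real `git log` output.
sample = """2014-06-03
2014-06-01
2014-05-20
"""
activity = commits_per_month(sample)
for month, count in sorted(activity.items()):
    print(month, count)
```

Every consumer of such metrics has to repeat the checkout and this kind of parsing for each project, which is the overhead described above.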

Proposed resolution

Github.com, which advertises itself as a mirror platform for open source projects, addresses each of the problems mentioned above. There have been initiatives in the past to create GitHub mirrors for a larger number of projects (hubdrop.org currently counts 436 projects), and with fairly little effort these can be taken to the next level.
I asked github.com how many repositories would still be within their Terms of Use (mentioning around 34K repositories), and the response I received was:

We don't impose a hard limit on the number of repositories you can create in an account, however we do reserve the right to disable accounts/repositories if their resource consumption begins to impact the service we are offering to our customers. If that is the case we would contact you.

We would certainly exceed the number of repositories of other organizations (the Apache Software Foundation lists around 640 projects), but given the smaller size of a typical Drupal project and the lower request rate, I expect that we would ultimately use fewer of GitHub's resources than e.g. the ASF.
GitHub does impose rate limits on its API, but these are set at 5000 requests per hour for authenticated requests (and 60 for unauthenticated requests).
Repositories can be created via the GitHub API and should have all additional features disabled (first and foremost issues and pull requests) to emphasize their role as mirror repositories. Additional information and a link to the original project page on www.drupal.org would be provided in the repository description.
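
As a rough sketch of what the repository creation request could look like, the snippet below builds the JSON body for GitHub's "create an organization repository" endpoint (`POST /orgs/{org}/repos`). The description wording, the function name and the example project are assumptions; the sketch only switches off issues, wiki and projects, and whether pull requests can be disabled this way would still need to be checked against the API documentation.

```python
import json


def mirror_repo_payload(project_name: str, project_url: str) -> dict:
    """Build the request body for creating a bare-bones mirror repository.

    The description text is an assumption; the keys are standard
    repository creation parameters of the GitHub REST API.
    """
    return {
        "name": project_name,
        "description": f"Read-only mirror. Original project: {project_url}",
        "homepage": project_url,
        "has_issues": False,    # issues stay on drupal.org
        "has_wiki": False,
        "has_projects": False,
    }


payload = mirror_repo_payload("views", "https://www.drupal.org/project/views")
print(json.dumps(payload, indent=2))
```

The body would be sent with an authenticated request, so each creation counts against the 5000 requests per hour mentioned above.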

Even if we assume an import rate of only one project per minute and limit the initial import to the top 2000 projects, the import would finish in roughly 33 hours (under one and a half days), and we would have covered all projects with a reported usage of 1000 and above.
Updates would initially be applied on a schedule based on project activity (e.g. anywhere from once a week to once a month) and could later be tied to post-receive hooks.
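
The import estimate can be checked with a few lines of arithmetic. The numbers are the ones given above (2000 projects, one project per minute, 5000 authenticated API requests per hour); the variable names are only for illustration.

```python
projects = 2000            # top projects to import initially
minutes_per_project = 1    # assumed import rate
api_requests_per_hour = 5000  # authenticated GitHub API limit

total_minutes = projects * minutes_per_project
total_hours = total_minutes / 60
total_days = total_hours / 24

# At one project per minute, 60 projects are imported per hour, so each
# import could spend this many API requests without hitting the limit.
requests_per_project = api_requests_per_hour // 60

print(f"{total_hours:.1f} hours, {total_days:.2f} days")
print(f"up to {requests_per_project} API requests per project")
```

At this pace the API rate limit leaves a comfortable margin, since creating and describing a repository should need only a handful of requests.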

Remaining tasks

  1. Define the steps an import script needs to perform for mirroring a single repository.
  2. Define the steps an import script needs to perform to manage the mirroring of multiple repositories.
  3. Create an unofficial test account on github.com.
  4. Implement a basic version of a mirroring script. The script should take a list of projects as arguments, with each project defining:
    • Project name
    • Project URL
    • Source repository URL
  5. Mirror Drupal core, Views, Entity API, Zen, Commerce Kickstart (any others that would increase variety?)
  6. Provide Github API examples to:
    • perform a simple text search across all mirrored repositories,
    • perform a simple text search on a single mirrored repository,
    • query the combined commit history of all mirrored repositories,
    • query the commit history of a single mirrored repository,
    • query the commit history of all mirrored repositories, limited to 7.x branches (optional; might not be possible, but would provide very interesting metrics)
  7. Discuss the results and plan further steps.
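
As a starting point for tasks 4 and 6, here is a minimal sketch under several assumptions: the organization name `drupal-mirror-test` is hypothetical, the source repository URL is illustrative, and the script only builds the git commands and API URLs rather than executing them. It relies on `git clone --mirror` and `git push --mirror` for the copy, and on GitHub's code search (`/search/code`) and commit listing (`/repos/{owner}/{repo}/commits`) endpoints for the queries.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode

GITHUB_API = "https://api.github.com"
MIRROR_ORG = "drupal-mirror-test"  # hypothetical test account


@dataclass
class Project:
    name: str         # project name
    url: str          # project page on www.drupal.org
    source_repo: str  # source repository URL


def mirror_commands(project: Project, workdir: str = "mirrors") -> list:
    """Return the git commands that would mirror a single project."""
    local = f"{workdir}/{project.name}.git"
    target = f"git@github.com:{MIRROR_ORG}/{project.name}.git"
    return [
        ["git", "clone", "--mirror", project.source_repo, local],
        ["git", f"--git-dir={local}", "push", "--mirror", target],
    ]


def code_search_url(term: str, repo: Optional[str] = None) -> str:
    """Text search in one mirrored repository, or across all of them."""
    scope = f"repo:{MIRROR_ORG}/{repo}" if repo else f"org:{MIRROR_ORG}"
    return f"{GITHUB_API}/search/code?" + urlencode({"q": f"{term} {scope}"})


def commits_url(repo: str, branch: Optional[str] = None) -> str:
    """Commit history of one mirrored repository, optionally for one branch."""
    url = f"{GITHUB_API}/repos/{MIRROR_ORG}/{repo}/commits"
    return url + ("?" + urlencode({"sha": branch}) if branch else "")


views = Project(
    name="views",
    url="https://www.drupal.org/project/views",
    source_repo="https://git.drupal.org/project/views.git",  # illustrative
)
for command in mirror_commands(views):
    print(" ".join(command))
print(code_search_url("hook_menu"))
print(code_search_url("hook_menu", repo="views"))
print(commits_url("views", branch="7.x-3.x"))
```

The commit listing works per repository, so a combined history of all mirrors (with or without a 7.x filter) would presumably mean iterating over the repositories and merging the results, which is where the rate limit becomes relevant.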

User interface changes

None.

API changes

None.

Current project metrics

Some statistics on the projects hosted on drupal.org:

cgit.drupalcode.org

  • Repositories: 34908
  • Sandboxes: 15398
  • Assumed full: 19510

www.drupal.org

By type:

By status:

  • Listed as "Unknown" / "Abandoned": 278
  • Status "Unsupported": 2911
  • Both: 136

By usage:

  • Reported usage of 0: 1832

Comments

ciss’s picture

Issue summary: View changes

Minor correction.

charginghawk’s picture

Re: search, something I've been interested in is regex searches of the codebase, which unfortunately Github doesn't support. I've been looking into ways to do this, and Benjamin Boyter who runs searchcode.com was nice enough to give me some leads:

Short answer is use Russel Cox's re2 Golang implementation,

https://swtch.com/~rsc/regexp/regexp4.html
https://github.com/google/codesearch

You can find an example of this here used to search over the debian code repositories.

http://codesearch.debian.net/

And its code here, https://github.com/Debian/dcs which is pretty much all you would need to get started actually.

If it's good enough for Debian, it's probably good enough for Drupal.

I've also been toying with the idea of an API that accepts regex and then just greps the codebase.

That said, mirroring the repos on Github is a great idea that opens up a lot of possibilities.

darol100’s picture

It's been over 6 months since this issue was opened. Any new updates on this?

mlhess’s picture

Status: Active » Closed (outdated)

Given our move to GitLab this quarter, this issue is outdated.