Updated: Comment #53

Final solution

https://drupal.org/files/releases.tsv is currently generated 4 times a day.

Done by http://localhost:8080/view/D.o/job/releases-stopgap/ executing http://drupalcode.org/project/infrastructure.git/blob/refs/heads/master:... after the update status XML is regenerated
(from #88)

Problem/Motivation

There is not currently an easy way to get data for a comprehensive list of projects on drupal.org via XML or JSON, so that it can be consumed/displayed by other web sites. This is needed for several purposes:
- Module browser - within admin on a Drupal site to discover modules to download - http://drupal.org/project/project_browser [may not need this any more?]
- api.drupal.org - if it wants to host all contrib modules, it will need to be able to discover what they are and what branches they have.
- localize.drupal.org/#53 - needs to get updated information on projects with releases for translating
- security.drupal.org - currently synchs project names, shortnames and uids based on direct query to solr, mentioned by greggles in #34
- attiks/#36 - "my modules" internal PM tool with drupal, for all our own modules and modules that we're patching.
- (#39) http://drushmake.me/ (there are other similar sites I don't have at hand right now)
- (#39) http://drupal.org/project/profiler_builder
- (#42) statistics (http://atendesigngroup.com/blog/drupalorg-project-retention-download-usage) and a game (http://drupal.webstocks.ws/).
- (#42) Druplicon queries [not really planned, just an idea]

Proposed resolution

Make a JSON or XML formatted output containing the following fields (in parens, a list of which projects need which fields):

  • Project name - human-readable (all)
  • Project short name (all)
  • Project NID (localize.drupal.org)
  • Description [would that be the short description?] (module browser)
  • Released version numbers (project browser probably needs latest versions on each major branch; localize.drupal.org needs all released versions; attiks PM tool)
  • Download URLs for released versions (localize.drupal.org, project browser)
  • List of supported branches with branch name (e.g., 6.x-2.x), core compatibility (e.g., 6.x), when last updated with a commit [maybe], status of branch (supported, etc.) (api.drupal.org, attiks PM tool)
  • Creation date (project browser)
  • Last updated date [what does this mean? on the node?] (project browser)
  • Author name (project browser, attiks PM tool)
  • Author UID (security.drupal.org)
  • List of maintainer uids and names (security.drupal.org, attiks PM tool)
  • Number of downloads (project browser, localize.drupal.org)
  • Number of installs [for some use cases, number of installs by branch] (project browser, localize.drupal.org)
  • Category(ies) (project browser)
  • Type [theme, module, etc.] (project browser, api.drupal.org)
  • Maintenance status (project browser, api.drupal.org)
  • Development status (project browser, api.drupal.org)
  • Full project vs. sandbox (project browser, localize.drupal.org, api.drupal.org)
  • Screenshot url (project browser)

Filters that would be useful:

  • Project type (module or theme) (api.drupal.org, project browser, attiks PM tool)
  • Project status fields (sandbox, dev status, maint status) (api.drupal.org, project browser, attiks PM tool)
  • Timestamp for releases [show releases before/after a certain date, limited to tagged (non-dev) releases] (localize.drupal.org)
  • Has the project ever made a (non-dev, tagged) release (api.drupal.org, localize.drupal.org, probably others)
  • UID of author/maintainers (attiks PM tool)

There was also a suggestion for putting information from this list that doesn't change often into a "big" file, and the more rapidly-changing information into a "smaller" file, so that consumers could check the "big" file maybe weekly, and the "smaller" file more often.

There was another suggestion of paging the results, but since this is not meant to be human-readable data, it just makes it more complicated to page it for the machine consumers, so probably not a great idea unless the file is too big.

JSON is the preferred format (easier to deal with and smaller file).

Remaining tasks

Patch in progress.

User interface changes

New URL would exist for downloading this information.

API changes

None.

Original report by Bojhan

Hey,

I am working on a Module browser for Drupal 7, and I would like to get a proof of concept going. For this I need an XML file that lists all the modules with related information.

  • Module name
  • Description
  • Version numbers (with their download links)
  • Creation date
  • Last updated date
  • Author
  • Number of downloads
  • Usage count
  • Category(ies)
  • Type (Theme or Module)
  • Screenshot url

Best regards,

Bojhan


Comments

Damien Tournoud’s picture

The easiest route would be to concatenate the project release XML files. In those, you have:

Module name
Description
Version numbers (with their download links)
Creation date
Last updated date (indirectly via the releases)
Author
Number of downloads (we do not have this information in the database yet)
Usage count
Category(ies)
Type (Theme or Module)
Screenshot url

greggles’s picture

I think we should design the interface first and the xml files later. We can create the xml files if we find out that the module browser needs them.

In the short run, we can create a few dummy files for testing. Would it be OK to just have 5 modules/themes instead of all 5,000?

Bojhan’s picture

greggles: Preferably somewhat more, say around 500; otherwise I would be modeling the interface on unrealistic data. But I am already designing the interface. So, the Project release XML files are not good enough unless they can also list the version numbers with their downloads, since otherwise I can't test the biggest part (installing).

Damien Tournoud’s picture

So, the Project release XML files are not good enough unless they can also list the version numbers with their downloads, since otherwise I can't test the biggest part (installing).

The releases (and their download links) are in the XML files. See for example:

http://updates.drupal.org/release-history/views/6.x

stijn.vanden.brande’s picture

I am working on this module and so far I can get the data that is already present and install a module using authorize.php.

The problem we see now is that the file containing all projects is quite large (> 6 MB), which generates a lot of traffic for little benefit. You can get the file at http://updates.drupal.org/release-history/project-list/all (I suggest using wget rather than opening it in a browser).

To reduce the file size and improve performance, I suggest that we add a new file (page) that contains only the project name, short_name and modified timestamp (last modified) for all projects. This way we can keep traffic to a minimum and only need to update the projects that have been modified. It would also be nice to filter modules by version, similar to how you can currently filter the release history of one project (by adding an extra argument to the URL).

We could then add the other fields that we are missing to the project release history, because these files will be smaller.

The extra fields:
- description
- creation date
- number of downloads
- usage count
- screenshot url

Also if you go to the release history of a project I've noticed that the terms are not unique: the same term sometimes appears up to 4 times for the same project. If we could make the terms unique we could further reduce the file size.
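
The lightweight-index idea from this comment (fetch only short_name and a modified timestamp, then re-fetch just the projects that changed) can be sketched roughly as follows. This is a minimal illustration, not existing drupal.org code: the function name `projects_to_refresh`, the dictionary shapes and the timestamps are all assumptions made for the example.

```python
from typing import Dict, List


def projects_to_refresh(index: Dict[str, int], cache: Dict[str, int]) -> List[str]:
    """Return short names whose remote 'modified' timestamp is newer than ours.

    `index` maps short_name -> modified timestamp from the (hypothetical)
    small index file; `cache` maps short_name -> the timestamp we last stored.
    Projects missing from the cache are always treated as stale.
    """
    return sorted(
        name for name, modified in index.items()
        if cache.get(name, -1) < modified
    )


# Made-up sample data: "views" changed remotely, "token" is new to us.
remote_index = {"views": 1335691380, "cck": 1300000000, "token": 1320000000}
local_cache = {"views": 1330000000, "cck": 1300000000}

stale = projects_to_refresh(remote_index, local_cache)
print(stale)
```

Only the projects returned here would need their full release history re-downloaded; everything else could be served from the local copy.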

stijn.vanden.brande’s picture

The file containing all the projects would be reduced to 670KB (from the original 6.3MB) when we only include the 3 fields mentioned above (title, short name and modified).
When we leave out the title, the file would be further reduced to 481KB. I believe we can do this because the 'short_name' has to be unique (correct me if I'm wrong).

mstrelan’s picture

What if there was a website that parsed this 6MB file and stored it in a database? The website could then expose a SOAP web service that returns relevant results to the module browser.

The big issue I can see is having the resources to host this. Anyone wanna suggest it to Dries?

I wonder where drupalmodules.com gets its data from.

doublejosh’s picture

It would be really nice to include the usage data, or at least a sort order value based on usage.
Basically include this content: http://drupal.org/project/usage

I'd like to build a little Drupal utility that would benefit from sorting by general relevancy.

You can get pretty close to working XML by parsing the HTML, or just with find-and-replace on the usage page. The usage columns just don't have unique identifiers.

doublejosh’s picture

Looks like I'm just a +1 for #5

werfu’s picture

Any progress on that module browser? I'd be interested in doing this for Drupal 6.

doublejosh’s picture

FileSize
235.59 KB

Here's the CSV (txt for upload) I created... a few weeks old, but hopefully helpful to someone.
No version info in this one, but includes a sort order number!

I'm using it to import module nodes for grouped node-references with make templates: http://betterwebshop.com/make
Managing make templates has been a need for a while.

dww’s picture

Title: XML of Drupal Projects » XML list of Projects
Project: Drupal.org site moderators » Project
Version: » 6.x-1.x-dev
Component: Other » Projects

See also #847880: Integrate list of Drupal projects with security.d.o for some related discussion.

There's already one XML list of all projects at http://updates.drupal.org/release-history/project-list/all as pointed out in comment #5.

However, for a variety of reasons, it'd probably be good to provide a smaller file with the bare minimum fields. This will need to be done inside project module itself, so moving to the more appropriate queue.

pwolanin’s picture

For security.d.o we are initially just going to pull the list from the d.o Solr search index - it's quick and easy.

For this Project module list can we consider having it in JSON format as well?

doublejosh’s picture

Great!
When/how can we get at the SOLR index?

pwolanin’s picture

Only sites within the *.drupal.org infra can access the Solr index. There is no intention to make it public, afaik.

doublejosh’s picture

I suppose I didn't mean actually opening up the index, just an XML/CSV/JSON feed of the relevant data, whether the backend is Solr, Views Bonus, a custom output or whatever.

SebCorbin’s picture

Localize also needs an API. I've taken charge of porting l.d.o to D7, and now is a good time to rewrite the drupal.org connector used to retrieve project packages.

Here's what it does at the moment:

  1. Get all translation projects (nid) by their taxonomy term [a]
  2. Get all projects (nid, uri, title, status) that have releases minus [a]
  3. Get all releases (uri, version, timestamp, filehash, filepath, status) that are older than some timestamp
  4. Get the latest usage reporting timestamp
  5. Get usage data (uri, version, count) that is older than a timestamp for releases

That's about it. If it's not very clear, all queries are shown here http://drupalcode.org/project/l10n_server.git/blob/refs/heads/6.x-3.x:/c...

So I would need three main APIs: projects, releases, and usage data.

dww’s picture

- Technically you could get 1 and 2 from the monster XML list of projects we already provide, but that's like a 10 meg file so that's pretty unworkable. That's the main thing we want to improve here.
- Releases you already have at updates.drupal.org.
- Usage is a new can of worms. Out of curiosity, why does l.d.o care about usage data?

doublejosh’s picture

Personally my use-case for getting usage data is for module suggestion / autocomplete type sorting.
Any tools we can make to help new people make good choices means happy site-builders, eventual devs, etc.

SebCorbin’s picture

@dww Fetching info for each release would make thousands of queries, that's unworkable too.

From Gabor: Usage is used to update the modules weight. We list modules by that, eg. http://localize.drupal.org/translate/languages/hr lists projects in decreasing usage order

SebCorbin’s picture

@dww Do you plan to work on it soon? For now I'm porting the connector as is, without the API, but it would be good to have this solved.

doublejosh’s picture

Perhaps then the usage data should be listed in a consumable format as well.
It would be fine to have to create a two-step crawler: grab the project list, then reach out for usage on those above ~500 installs.

dww’s picture

Title: XML list of Projects » Expose list of projects to external services (via JSON, XML, etc)

- If you actually need all the data for all releases, that's a huge amount of data no matter how you slice it. 1000s of small queries (one per project) that succeed are better than a single monstrous query that fails. ;) Especially since you could attempt to be smart about when/how you do those smaller queries and only re-fetch on projects that actually change. See #1417332: updates.drupal.org: Send "ETag" header; receive "If-None-Match" for some ideas on this.

- Duly noted on usage. So, perhaps you just need the summary of usage per-project, instead of the details of usage for each version. That'd simplify things.

- No, I don't have bandwidth to work on this anytime soon, sorry. However, y'all could already start to hash out what data should actually be included in this giant list of projects so that it's both useful and still manageable. There are lots of possible approaches for generating the list, so someone could start to explore implementation, as well. So, don't be blocked on my availability.

Thanks,
-Derek
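
The "only re-fetch projects that actually change" idea from the first point could look roughly like the sketch below, which uses the ETag / If-None-Match exchange named in #1417332. It is a hedged illustration only: whether updates.drupal.org actually sends an ETag header is exactly what that issue proposes, and `ConditionalCache`, `fake_fetch` and the `"v1"` tag are names invented here. The network call is injected so the logic can run without touching a real server.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A response is (HTTP status, ETag header or None, body bytes).
Response = Tuple[int, Optional[str], bytes]
Fetcher = Callable[[str, Dict[str, str]], Response]


class ConditionalCache:
    """Remember the last ETag and body per URL and revalidate with If-None-Match."""

    def __init__(self, fetch: Fetcher) -> None:
        self._fetch = fetch
        self._entries: Dict[str, Tuple[str, bytes]] = {}

    def get(self, url: str) -> bytes:
        headers: Dict[str, str] = {}
        cached = self._entries.get(url)
        if cached is not None:
            headers["If-None-Match"] = cached[0]
        status, etag, body = self._fetch(url, headers)
        if status == 304 and cached is not None:
            # Not modified: reuse the stored body, nothing was re-downloaded.
            return cached[1]
        if etag is not None:
            self._entries[url] = (etag, body)
        return body


# Stand-in for a real HTTP client; it answers 304 when the tag matches.
sent_headers: List[Dict[str, str]] = []


def fake_fetch(url: str, headers: Dict[str, str]) -> Response:
    sent_headers.append(dict(headers))
    if headers.get("If-None-Match") == '"v1"':
        return 304, '"v1"', b""
    return 200, '"v1"', b"<project>views</project>"


cache = ConditionalCache(fake_fetch)
url = "http://updates.drupal.org/release-history/views/6.x"
first = cache.get(url)   # full download
second = cache.get(url)  # revalidated, served from the local copy
```

In a real consumer, `fake_fetch` would be replaced by a function wrapping an HTTP library of your choice; the cache logic itself would stay the same.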

sreynen’s picture

I'm interested in working on implementation for this when we have some consensus on what it should look like.

I've made something similar in the Drupal Webstocks API, which is scraped from the usage page:

http://drupal.webstocks.ws/blog/webstocks-api

It will be better, of course, when Drupal.org provides this data directly, but anyone is welcome to use that meanwhile. A couple suggestions on how it might work on Drupal.org:

* JSON is both smaller and generally quicker to parse than XML, so I think that should be the format.
* Selectively returning only requested data could be a good middle ground between returning everything to everyone or project-specific results. E.g. /project/api/json could default to just a list of NIDs and /project/api/json?info=usage could be NIDs plus usage numbers and /project/api/json?info=usage,downloads could add download numbers.
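
To make the second suggestion concrete, here is a rough sketch of how an `info` query parameter might select optional fields on top of a default list of NIDs. Nothing here is an existing drupal.org endpoint: the `/project/api/json` path comes from the suggestion above, while the mapping in `OPTIONAL_FIELDS`, the field names and the sample numbers are assumptions made for illustration.

```python
from typing import Any, Dict, List
from urllib.parse import parse_qs, urlsplit

# Hypothetical mapping from an "info" keyword to the stored field it exposes.
OPTIONAL_FIELDS = {"usage": "reported_installs", "downloads": "downloads"}


def requested_info(url: str) -> List[str]:
    """Split the comma-separated 'info' query parameter; empty if absent."""
    query = parse_qs(urlsplit(url).query)
    raw = query.get("info", [""])[0]
    return [part for part in raw.split(",") if part]


def render(projects: List[Dict[str, Any]], url: str) -> List[Dict[str, Any]]:
    """Return NIDs by default, plus any optional fields named in the URL."""
    wanted = [OPTIONAL_FIELDS[key] for key in requested_info(url)
              if key in OPTIONAL_FIELDS]
    return [
        {"nid": project["nid"], **{field: project[field] for field in wanted}}
        for project in projects
    ]


# Placeholder data; the nid and counts are not real drupal.org values.
PROJECTS = [{"nid": 1, "reported_installs": 10, "downloads": 20}]

minimal = render(PROJECTS, "/project/api/json")
with_usage = render(PROJECTS, "/project/api/json?info=usage")
with_both = render(PROJECTS, "/project/api/json?info=usage,downloads")
```

Unknown keywords are silently ignored in this sketch; a real service might prefer to reject them with an error.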

doublejosh’s picture

AWE·SO·ME

jhodgdon’s picture

We are hoping sometime soon to get all d.o projects (or maybe just modules?) onto api.drupal.org. In order to do that, we would also need some information from the projects (see API module issue #686312: Grab project list and packages from Drupal database or XML), so dww pointed me to this issue.

The API module will be showing branches, not individual point releases... so the information that the API module would need is what projects exist, what their branches are, and maybe when they were last updated. We can currently get at this information with the following:
a) Download the entire release history file from http://updates.drupal.org/release-history/project-list/all -- this is a big file, as noted above, and we only need a small part of the provided information from each project.
b) For each project in there, go through each core API version that is listed as being available.
c) For each project/API version, download http://updates.drupal.org/release-history/PROJECT/API_VERSION
d) Pick out the supported branches from that (which again is only a small piece of the total information available).

But this is downloading a lot of information that we don't need. All we really need is:
-- Project short name
-- Project long name/title
-- List of the branches; for each: branch name (e.g., 6.x-2.x) and core compatibility (e.g., 6.x), and maybe when last updated.

My suggestion would be an XML file something like this (or JSON equivalent -- either way would be fine):

<project>
  <short_name>views</short_name>
  <title>Views</title>
  <type>module</type>
  <branch>
     <core_compatibility>6.x</core_compatibility>
     <branch-name>6.x-2.x</branch-name>
     <last-updated>2012-04-29 09:23</last-updated>
  </branch>
  ... more branches here ...
</project>
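
For consumers weighing the XML option, a document in the shape suggested above could be read with Python's standard library roughly as follows. This is only a sketch: the element names are taken from the proposal (assuming well-formed, matching tags), the sample holds a single branch, and `parse_project` is a name made up for the example.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict

# One project in the proposed format, with a single branch.
SAMPLE = """
<project>
  <short_name>views</short_name>
  <title>Views</title>
  <type>module</type>
  <branch>
    <core_compatibility>6.x</core_compatibility>
    <branch-name>6.x-2.x</branch-name>
    <last-updated>2012-04-29 09:23</last-updated>
  </branch>
</project>
"""


def parse_project(xml_text: str) -> Dict[str, Any]:
    """Turn one <project> element into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {
        "short_name": root.findtext("short_name"),
        "title": root.findtext("title"),
        "type": root.findtext("type"),
        "branches": [
            {
                "core_compatibility": branch.findtext("core_compatibility"),
                "branch_name": branch.findtext("branch-name"),
                "last_updated": branch.findtext("last-updated"),
            }
            for branch in root.findall("branch")
        ],
    }


parsed = parse_project(SAMPLE)
print(parsed["short_name"], [b["branch_name"] for b in parsed["branches"]])
```
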
jhodgdon’s picture

Issue tags: +api.drupal.org contrib

Adding new issue tag for issues pertaining to getting all contrib modules on api.drupal.org.

jhodgdon’s picture

Does anyone have any comments about the format suggested in #26? Would it provide the information that you need for your purposes? That one is for the API module (api.drupal.org)... And would it be better as XML or JSON?

If there is any consensus on this, I can make a patch.

sreynen’s picture

Here's some stuff I'd like to see added to the data in #26:

  • categories
  • maintenance status
  • development status
  • reported installs
  • downloads

I still prefer JSON, for reasons mentioned in #24.

So here's my suggested format and info:

{
   "projects":[
      {
         "short_name":"views",
         "title":"Views",
         "type":"module",
         "maintenance_status":"Actively maintained",
         "development_status":"Under active development",
         "reported_installs":555557,
         "downloads":2801770,
         "categories":[
            "Views",
            "Content"
         ],
         "branches":[
            {
               "core_compatibility":"6.x",
               "branch_name":"6.x-2.x",
               "last_updated":"2012-04-29 09:23"
            }
         ]
      }
   ]
}
jhodgdon’s picture

+1 on #29. Any other data we should include?

dww’s picture

My only concern is how large a listing including all that data would be for N contrib projects (where N is currently a bazillion, growing by the hour)...

Does it make sense to have all that data for each project node (like we have for issue nodes already) and keep the global list of all projects incredibly lightweight (eg just type, short name and title)?

Otherwise, yeah, that looks like a pretty sane list of attributes.

sreynen’s picture

I'd personally use the extra data if it were available from a single URL, but I wouldn't make 11,000 HTTP requests to get it. I'd prefer 2 separate URLs, one with minimal data and one with full data, or paged results that need something closer to 11 requests than 11,000.

#29 as compact JSON is 340 characters. That only includes one branch, and I would guess the average is slightly higher than that, so let's round up to 400 characters, which is also 400 bytes. http://drupal.org/project/usage currently shows 11206 projects, which would make the full list about 4.2MB. Cutting out the additional data I added cuts the size about in half, but I suspect the difference would be much less after compression.

The text in this is very repetitive, so it should compress very well. I copied the Views example many times into a 30k file and it compressed to 1k. That's obviously more repetition than we'll have with real projects, but not a lot. The attribute names will all repeat, and most attributes have less than 10 possible values. The exceptions are title, short_name, last_updated, reported_installs, and downloads, and all of those will have some repetition in parts (e.g. everything updated this year will have "2012" in the date and many projects have "views" in the title.)

My guess is we're going to end up with around a 500k gzipped response with the data in #29. But that's just a guess. We won't really know how well it compresses until we have realistic data.
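
The back-of-the-envelope numbers above can be reproduced with a short script. This is a sketch under the same assumptions as the estimate (11206 projects at roughly 400 bytes each); repeating the single Views record 100 times mimics the copy-and-compress experiment, so it overstates how well real, more varied data would compress.

```python
import gzip
import json

# The Views record from #29.
views_example = {
    "short_name": "views",
    "title": "Views",
    "type": "module",
    "maintenance_status": "Actively maintained",
    "development_status": "Under active development",
    "reported_installs": 555557,
    "downloads": 2801770,
    "categories": ["Views", "Content"],
    "branches": [
        {
            "core_compatibility": "6.x",
            "branch_name": "6.x-2.x",
            "last_updated": "2012-04-29 09:23",
        }
    ],
}

# Compact JSON: no whitespace between tokens.
compact = json.dumps(views_example, separators=(",", ":"))

PROJECT_COUNT = 11206      # projects listed on the usage page at the time
BYTES_PER_PROJECT = 400    # rounded-up per-project size used in the estimate

estimated_bytes = PROJECT_COUNT * BYTES_PER_PROJECT
estimated_mib = estimated_bytes / (1024 * 1024)

# Repeat the same record to see how repetitive text behaves under gzip.
repeated = ("[" + ",".join([compact] * 100) + "]").encode("utf-8")
compressed = gzip.compress(repeated)

print(len(compact), estimated_bytes, round(estimated_mib, 2))
print(len(repeated), len(compressed))
```

The uncompressed estimate works out to 4,482,400 bytes, in line with the figure quoted above; the compressed size of a real listing can only be measured once realistic data exists.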

I just looked at the original request again, and we're missing some attributes from that list: Description, Author, Creation date, and Screenshot URL. I don't know if that list is still relevant. Author and Creation date seem pretty straightforward. Description would be a big increase in size (and won't compress well). We don't have a screenshot URL field, exactly, only a general image field, which is sometimes used for screenshots, but also used for a variety of other types of images, e.g. logos.

jhodgdon’s picture

Obviously, making 11,000 requests to get the information we need would be less efficient than one request that gets you everything.

So. One (projected) user of this data would be api.drupal.org, which we'd like to get all of the contrib modules onto (and maybe even themes and distribution profiles, but probably just modules). For that use, we would need most of what's in #29, for most contrib projects... we don't need the taxonomy or number of downloads information... well, we might actually want the "status" taxonomy information to filter out projects that are abandoned or unmaintained... Anyway, we need most of #29 for most contrib modules, so having to get a limited list of just titles first and then re-query for the rest would be inefficient.

However, for this purpose, it might be useful to have filters:
- Project type (module, theme, etc. by URL argument)
- Return only full projects (not sandboxes)
- Some kind of status filter that would cut out projects that have never had a release or that have been closed down due to security issues or abandoned.
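
As a rough illustration of how the three filters above might behave, here is a small sketch that applies them to an in-memory list. The record fields `sandbox` and `has_release`, the parameter names and the sample projects (other than views) are hypothetical; an actual implementation would more likely filter in the database query or the view.

```python
from typing import Any, Dict, Iterable, List, Optional


def filter_projects(
    projects: Iterable[Dict[str, Any]],
    project_type: Optional[str] = None,
    full_only: bool = False,
    released_only: bool = False,
) -> List[Dict[str, Any]]:
    """Keep only the projects that pass every requested filter."""
    result = []
    for project in projects:
        if project_type is not None and project["type"] != project_type:
            continue
        if full_only and project["sandbox"]:
            continue
        if released_only and not project["has_release"]:
            continue
        result.append(project)
    return result


# Made-up records covering each case the filters should distinguish.
SAMPLE_PROJECTS = [
    {"short_name": "views", "type": "module", "sandbox": False, "has_release": True},
    {"short_name": "example_theme", "type": "theme", "sandbox": False, "has_release": True},
    {"short_name": "sandbox_thing", "type": "module", "sandbox": True, "has_release": False},
    {"short_name": "never_released", "type": "module", "sandbox": False, "has_release": False},
]

api_candidates = filter_projects(
    SAMPLE_PROJECTS, project_type="module", full_only=True, released_only=True
)
print([project["short_name"] for project in api_candidates])
```
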

How about other projected uses?

greggles’s picture

security.drupal.org would benefit from getting:

title, uid of the maintainers, branches with maintenance status of the branch (supported or not), release tags

jhodgdon’s picture

greggles: does security.drupal.org need this for all projects or just the occasional project?

attiks’s picture

Can we add something like projects/me as well, with the same filters for sandbox, state, Drupal version, ...

Getting a list of maintainers for a project would be nice as well.

Use case: link internal PM tool with drupal, for all our own modules and modules that we're patching. List of maintainers is nice to have so I can see who else has access.

Edit: use case added.

greggles’s picture

@jhodgdon - all projects. We can make multiple requests if necessary. We're currently exploring other ways to transfer the data (e.g. sql dumps).

jhodgdon’s picture

Could anyone commenting about what data they need for this issue *please* clarify which projects you need data for: all? all modules? just a few? and what they plan to use it for. Thanks!

Robin Millette’s picture

http://drushmake.me/ (there are other similar sites I don't have at hand right now) and http://drupal.org/project/profiler_builder are two places where this data would be useful to have.

It might be a good idea to separate the data based on its rate of change. For instance, the nid, project name and authors are mostly static, # of downloads and usage are updated weekly (?), releases (version changes) could happen anytime (daily), etc. I can imagine one big file with everything that changes weekly, and separate files for each project for data that changes more often. Or maybe that's complicating things for no real purpose... mostly for bandwidth conservation.

That's my 2¢ for now :-)

jhodgdon’s picture

RE #39 - do you need *all* of the modules' data, all projects', or just some?

But that idea of "keep most stuff static" is probably a good one. I think the API module would probably need:
- Branch list and last updated stamp - often
- Check for new projects & update project info - less often

Robin Millette’s picture

One big file with all the projects and mostly static info.

One file for each project/module with all the info, updated daily (or more).

sreynen’s picture

The initial use in this thread was a module browser, but I don't know if that's still relevant given the direction of http://drupal.org/project/project_browser. I haven't really been following that.

My own use is varied. In the past I've used this kind of data, scraped from HTML in two ways: statistics (http://atendesigngroup.com/blog/drupalorg-project-retention-download-usage) and a game (http://drupal.webstocks.ws/). In both cases, I'm using data on all projects, and generally more means I can do more.

One use case I can imagine that would benefit from individual project results is Druplicon queries. Many projects already have bot factoids, but for those that don't, this kind of data could make for useful responses. That wouldn't necessarily need individual project results in the API; it could load all projects once and then load individual projects from local storage.

Does anyone have a problem with paging the results? Seems like we have a wide variety of attributes requested and anything we do to limit those attributes is only going to delay the size problem until we have more projects. Paging should scale well with more projects without removing any attributes that would be useful to someone.

jhodgdon’s picture

Paging the results doesn't seem like it would actually do anything useful, since it seems like most consumers of the data need the whole list anyway, and if it's paged they'll just have to build logic to request all of the pages.

Filtering, however, seems like it might be a good idea, since at least some consumers would want to filter out sandboxes, abandoned projects, projects without an x.y release, etc.

sreynen’s picture

I was thinking paging would solve the problem of a single request taking too much time and/or memory on either end. I prefer filters to paging, but that still leaves the possibility of a request with no filters having a very large result. On the consumer side, that's not really an issue, since you get what you ask for. But will that be a problem on the Drupal.org side?

mlhess’s picture

FileSize
4.62 KB

Attached is a feature that provides some data for Drupal.org. This will meet the requirements for security.drupal.org, it does not meet all the requirements listed above. The data is in JSON format. It will require that views_datasource be installed.

It has the fields:
User: Uid Uid
User: Name Name
Node: Nid Nid
Project: Short name Short name
Node: Title Title
Node: Body Body
Project: Demonstration Demonstration
Project: Documentation Documentation
Project: Homepage Homepage
Node: Updated date Updated date

dww’s picture

Including Node: Body Body in this feed seems like suicide for a feed for all 25K project nodes on d.o...

Can we please keep the top-level feed something very small so there's a chance people can actually use it, and then expose all the details on per-project pages (like the issue JSON pages)? Then, people can hit the big list regularly to find new stuff or whatever, and cache the details per project as appropriate for their needs...

dww’s picture

Also, we don't really need a whole other feature/module for this (since it's just providing a single new default view). Can you just export the view directly and include it as a patch for a new file in project/views/default_views? That makes it easier to review, maintain, and use.

Thanks!
-Derek

jhodgdon’s picture

I just went through all the comments on this issue and made an issue summary. It lists the consensus (I hope?) of what fields we actually need for the various proposed consumers of this project (in as much as the people proposing the uses actually listed what data they needed).

Hope that helps in deciding what needs to go into the patch... it might make sense to have a few possible output displays with different fields, since not everyone needs all the information?

jhodgdon’s picture

Issue summary: View changes

Make an issue summary

drumm’s picture

I'm not sure that we want this in project module itself. It feels a bit custom for Drupal.org. Do other project* sites want this?

I think project browser will be a totally separate system, at least that is how it is working now. Let's save project browser ideas for another issue.

dww’s picture

Let's see how custom it gets. ;) But, my thought is if REST and JSON APIs are the wave of the future, and since we're already providing issue JSON, I'm happy for Project itself to provide these kinds of listings...

jhodgdon’s picture

There are several fields identified in the new issue summary that are specific to Project Browser (which may not even need this issue anyway)... So let's get a patch (or just a Views export) that makes a listing with the other fields identified as needed by the other consumers, and then figure out where to put it. I don't think it matters a huge amount exactly where the patch goes, as long as the view is enabled on drupal.org.

treksler’s picture

I need something, anything at all, preferably yesterday. Seriously, release something now and tweak it later. it's been years!
simplytest.me scrapes! the Drupal projects. Isn't that ridiculous.

As for my use case, I want a cck field that lets me choose a Drupal module (project) entity from an autocomplete list. Most likely solution would be to have project entities that I can pull in via entity references. I want workflow around these entities, so I can say which modules have been approved to use on our sites, etc. That is my use case.

I need name, link to project page, list of versions, short description, but basically any info that is on a project page should be accessible via the API, because the project page should be built by accessing and using the data from the same API.. is this not common sense?

so far, from what i have seen here
http://updates.drupal.org/release-history/project-list/all
looks to be the best bet for today… sigh

SebCorbin’s picture

Issue tags: +Localize D7 port

Adding tag, and here are my latest needs for the REST API localize connector

// @TODO Fetch all the projects that have releases in a single call.
Parameter has_release is possible?

// Only sync tagged releases which are at most one day older then our last sync date.
Here we need the already discussed "after timestamp" filtering.
Parameter to only have tagged releases (eg: no -dev releases) is possible?

// @TODO Fetch usage data for each release.
Right now, endpoint http://updates.drupal.org/release-history/admin_menu/7.x does not contain usage data.

SebCorbin’s picture

Issue summary: View changes

Updated issue summary. attics === attiks

jhodgdon’s picture

Version: 6.x-1.x-dev » 7.x-2.x-dev

Thanks SebCorbin. I think most of this information was already captured in the issue summary, but I just updated it with a few additions/notes.

At this point I think we should aim for 7.x-2.x, since I think d.o is going to be out on 7.x sometime in the near future, and development on 6.x is most likely dead.

hass’s picture

Huh? I'm still stuck at project 5.x and there is no working upgrade path to project 6.x and not to speak about 7.x...

Gábor Hojtsy’s picture

@hass: as in any other d.o project, contributions are welcome :)

@jhodgdon: I think it would be vital to have this on the 6.x code ASAP, since when the D7 cutoff happens on drupal.org, our old project syncing code will not function anymore (I believe) due to DB changes. If we only then start to be able to sync with the new D7 db, we'll need time to work out issues, etc. before it finally works again, so project / release syncing will be down for a possibly significant time. So it would be great to have some overlap and be able to build the glue code for this on localize while Drupal.org is still on 6, so the switch is painless.

sreynen’s picture

The issue summary says "Patch in progress" but this isn't assigned to anyone. Is someone working on a patch?

jhodgdon’s picture

Version: 7.x-2.x-dev » 6.x-1.x-dev

OK, move it back to 6.x then. Regarding patch being in progress, I am not sure whether the attachments above are actually patches (such as exported views) or just suggested formats.

hass: this is for deployment in the Project* modules on drupal.org so that drupal.org can export the data. So Project module 5.x is not relevant to us here, whether or not any particular consumer of this data we're proposing to output is running Drupal or not, or any particular version of Drupal, or whether the projects we're exporting have 5.x versions.

hass’s picture

@Gabor: i have posted 5-6 bugfix patches in the queue for review - 2 years ago. No commit. This suxxxxxxxxxxxxxxxxxx!!!

greggles’s picture

Version: 6.x-1.x-dev » 7.x-2.x-dev

7.x-2.x-dev is definitely the right branch for this issue.

Gábor Hojtsy’s picture

@greggles: that means l.d.o project synching will need to be down for an arbitrary length of time (days, weeks?), since it practically cannot turn over to use a project sync solution against the new version obviously until the d.o site switches over to 7 AND rolls this change out :/ Is that the intention?

jhodgdon’s picture

@ Gábor Hojtsy: drupal.org is migrating to 7 within a couple of weeks. Making a solution for 6 at this point is not going to be useful -- by the time we got it deployed we'd need the 7 solution.

Gábor Hojtsy’s picture

@jhodgdon: l.d.o currently uses direct SQL queries on a slave drupal.org db. Once the D7 upgrade happens, this will *not work* because the db structure changes (so we have been told at least). If by your assumption it takes min 2 weeks to roll out this change and we only start it on Drupal 7 (assume after the D7 deployment), then the project sync will be down for 2 weeks min. because the old way will not work anymore.

Gábor Hojtsy’s picture

Issue summary: View changes

Update with latest information from localize.d.o project

greggles’s picture

Thanks to Gabor for taking the time to explain to me in irc a bit more detail here. He hopes this can be done before the d.o D7 deployment so that l.d.o can test against the new code.

If anyone is working on this then they can set the version number to their desired version. They should be aware that there have not been commits nor deployments to 6.x code for drupalorg or project* for a few months now. I appreciate that might be frustrating.

My impression is that nobody is going to work on this before the D7 upgrade and that's why I set it to 7.x-2.x. If someone feels like working on this now that seems great.

jhodgdon’s picture

Priority: Normal » Major

Given that we need something for l.d.o ASAP, I am increasing the priority of this issue. We still need someone to decide they have time to work on it.

dww’s picture

Issue tags: +project, +drupal.org D7

If we *need* this before we launch D7 d.o, we should tag it as such. I'll let drumm or tvn untag it if it's not actually a launch blocker.

Maybe l.d.o functioning properly the day after launch is more important to everyone than the drush integration that #1710850: Deploy RestWS for D7 project issue JSON would be supporting (which is why it got bumped to Drupal.org 7.1 and postponed).

It's not up to me to decide unilaterally, but if this is considered a blocker, I'd happily try to get it done ASAP (although my schedule is insane for the next week, so I might not actually be able to complete it if it turns into a rathole). However, this should be a very easy thing to implement.

---

I still think we want a view like project/[type]/index/json (or perhaps even just a /json arg on the existing /project/[type]/index view) that provides almost no other information than what's there now (human-readable title, project shortname, URL to the project page), with the crucial addition of a last modified timestamp. That view already lets you filter on "API compatibility" (aka core version) and project status (sandbox vs. full), so the JSON listing could, too, via GET parameters. It'd be basically free to include the nid. Given that you could do separate queries for each project type you care about and full vs. sandbox (probably a good idea to keep the size of the JSON down), if we just include what's there now and the nid, that's most/all of what l.d.o actually *needs* based on the list in the summary.

Any client that needs more than that could then query /project/[shortname]/json and get the full project node as JSON. Sure, that'd be a *lot* of queries to initially populate a client, but as I said before, I'd much rather have 1000s of small queries that succeed than have to try to request a ridiculously huge JSON file that has all the data for all projects all at once, a request that will regularly (if not always) fail. If we *do* want to also provide the full project data via JSON, we should obviously do it via #1710850 so that the D7 version can very easily (if not for free) provide the identical JSON output. We don't have to get the issue node JSON working properly as a launch blocker, even if the project node JSON turns out to be.

But I think we just need a new display on an existing view (that's already been ported to D7 and doesn't need the Search API integration), and we're basically done here.

Cheers,
-Derek

Gábor Hojtsy’s picture

@dww: would the last changed timestamp change when a new release is made? I would guess not. We are synching projects as well as (new) releases for projects.

dww’s picture

No, the timestamp wouldn't change because of a new release, and the full project JSON I described at #66 wouldn't help you know about releases, either.

But we've already got the release history XML feeds that Update Manager in core relies on, so why don't you just use those? Sure, it's a bit sucky to parse JSON for one thing and XML for the other, but that doesn't seem like a launch blocker. If you care about that, it could basically be a JSON version of a view like https://drupal.org/node/3060/release right? If you want/need that, please open it as a separate issue, since IMHO it's out of scope in here.
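For a client that does consume the release history XML, a minimal Python sketch of picking out releases newer than a given timestamp might look like the following. The sample document is hypothetical, and the element names (`release`, `version`, `date`) are assumptions about the feed's layout that should be checked against a real response before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. The element names are assumptions about the
# release-history feed and should be verified against a real response.
SAMPLE_XML = """<project>
  <title>Example</title>
  <short_name>example</short_name>
  <releases>
    <release>
      <version>7.x-1.1</version>
      <date>1382216137</date>
    </release>
    <release>
      <version>7.x-1.0</version>
      <date>1380000000</date>
    </release>
  </releases>
</project>
"""


def releases_since(xml_text, since_timestamp):
    """Return version strings of releases dated after since_timestamp."""
    root = ET.fromstring(xml_text)
    found = []
    for release in root.iter("release"):
        version = release.findtext("version")
        date_text = release.findtext("date")
        # Skip entries that lack either field instead of failing.
        if version is None or date_text is None:
            continue
        if int(date_text) > since_timestamp:
            found.append(version)
    return found


print(releases_since(SAMPLE_XML, 1381000000))
```

This still costs one request per project, so it only helps once the list of projects to check is already known.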

Thanks,
-Derek

Gábor Hojtsy’s picture

"Download URLs for released versions (project browser)" did not say localize.drupal.org (adding it now), but what localize.drupal.org does now is it looks at all new projects and all new releases *from a given timestamp onwards*. If there is no way to get the recent releases across projects that means we need to query tens of thousand of projects (all of them) each time for all their releases, and then compare all that data to our stuff. To make those tens of thousands of HTTP requests every day we need to schedule chunks of them for different cronjobs. At least I don't assume doing tens of thousands of HTTP requests + parsing + saving would fit into any reasonable cronjob length. If there would be 5 projects, or at least we would have a list of projects that have new releases since given timestamp, then we would only need to query those. Without that information, sounds like we would need to blindly load everything all the time and build a whole system to schedule those HTTP queries.

Gábor Hojtsy’s picture

Issue summary: View changes

clarifying why s.d.o cares, updating which fields are actually needed for s.d.o

greggles’s picture

There is an rss feed of projects for each version of core already. Can l.d.o use that?

6.x: https://drupal.org/taxonomy/term/87
7.x: https://drupal.org/taxonomy/term/103

Gábor Hojtsy’s picture

We could scrape the HTML in items of that feed, if that is the best API d.o can provide. However, it looks like this feed cannot be paged to further pages (tried adding page=2, etc., did not work), so if there are too many releases of projects between l.d.o's requests for the page, some may run off the one page that is in the feed and those releases would get lost as far as l.d.o goes. (Even if we'd use scraping of that feed.) So I guess then the other option is to scrape the main HTML pages which do have a pager. :) I'd hope we may be able to have a better API.

jhodgdon’s picture

The suggestion in #66 sounds good for things like Project Browser applications.

However, for purposes of "api.drupal.org contrib" and localize.drupal.org, the fundamental thing is not whether the project node has been updated, but whether a new numbered/tagged release has been made (localize) or whether a new commit has been made on a branch (api.drupal.org wants to eventually have all of contrib on there, and it needs to know which projects/branches have new code that it needs to read).

The query in #66 is not useful for either of those purposes, and as Gabor states in #69, surveying all projects to find out if each has a new release will be quite time consuming. Both api.d.o contrib and localize.d.o need an efficient way to query for which projects have updated code since the last query.

I don't think we want to open a new issue for these use cases -- they've been part of the picture here since nearly the beginning, but if you insist I guess we can separate each use case out into a separate issue.

greggles’s picture

As I look at the requirements of this task as described in the current issue summary, my thought is that the file would get created by a jenkins job and then stored until the next creation of the file is complete. I agree that creating the file on demand in response to an http request is unlikely to succeed consistently, if ever.

Doing this as a drush job that creates the file is not dependent on deploying RESTws or anything else.
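Keeping the previous file in place until the new one is complete is commonly done by writing to a temporary file and renaming it over the old one. A minimal Python sketch of that idea follows; the helper name `publish_atomically` and the row layout are hypothetical, and this is not the actual job that runs on drupal.org.

```python
import os
import tempfile


def publish_atomically(path, rows):
    """Write rows as tab-separated lines, then swap the file into place.

    Readers see either the complete old file or the complete new one,
    never a partially written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target
    # for the rename to be atomic, so create it in the same directory.
    # Note: mkstemp creates a file readable only by its owner; a real job
    # serving this over HTTP would likely need to adjust permissions.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
            for row in rows:
                handle.write("\t".join(row) + "\n")
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial file and leave the old one untouched.
        os.unlink(tmp_path)
        raise
```

If generation fails midway, the previously published file stays available to consumers.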

jhodgdon’s picture

A generated file won't directly tell us which projects have updates since the last time we got the file, but I suppose we can then loop through the contents of the file, assuming it has "last commit time" (per branch) and "last release time" (per branch) included, and see if there have been updates.

greggles’s picture

We can see how long the job takes and schedule appropriately. If the file is generated every 6 or 12 hours is that "real time" enough? As it is, the anonymous call from api.d.o or l.d.o is going to get a varnish cached version of the file which is going to be stale by...something more than 15 minutes.

jhodgdon’s picture

I don't think the fact that the file is slightly stale will matter in the least to either api.d.o or localize.d.o, but I'll let someone else speak definitively about localize.

This sounds like a reasonable solution.

Gábor Hojtsy’s picture

6 hour staleness is not really a problem for l.d.o.

dww’s picture

Sorry, I'm not familiar with exactly what l.d.o and api.d.o need and why (and that's not clear from the current summary). I'm just trying to keep this simple and deployable. Thanks for the clarifications. If the addition of two more timestamps (last release, last commit) to the list I'm proposing at #66 would be sufficient, that seems great. Can someone confirm that's true? I think a giant list of all projects with timestamps of when to refresh any other data if you need it (from other sources) is a reasonable API for d.o to be providing. Thoughts?

I'm not sure what I think about generating these files via a drush command instead of a view. Note that #66 doesn't require we deploy REST, either. That's just if we end up doing the full project JSON (which I don't think either l.d.o or api.d.o needs). I suppose the drush command could take arguments to provide the same filters the view was going to have, and dump all the combos into separate static JSON files. However, d.o has a bunch of caching infrastructure already in place, so I'm not sure we care about needing to control how often this data is rebuilt. I could go either way, but it seems easier to just do this via another display in an existing view. However, if someone codes up a drush command implementation instead, that's fine, too. But I don't think a new drush command + jenkins job is necessarily any easier to deploy since I don't believe the new view display would introduce any new dependencies, either.

Thanks!
-Derek

jhodgdon’s picture

OK. For api.drupal.org, we are thinking of trying to host/parse/display "all of contrib".

In order to do that, what we would need is a list of all of the contrib projects with:
- shortname
- type (module, theme, etc.)
- an indication of sandboxes so we can exclude them
- human-readable name
- list of the branches it has (6.x-1.x, etc.)
- the date of the last commit on each branch

We would need to grab that information during cron runs, and it wouldn't matter to us whether it was up-to-the-second correct or had been created earlier and cached as a file.

Then, if a branch on a given project has a commit newer than the code we currently have, we would then need to grab the -dev tarball to get the latest code... although since the tarballs are not necessarily current (they are lagged by up to 12 hours), this could be a problem. Hopefully we can have some way to record which commit we have from the tarball we last grabbed, and compare that to the latest commit in the contrib project list, to see if we need to get the tarball again. Or the date of the commit. Or something... Any ideas?
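The comparison described above can be sketched in Python as follows. The dictionary keys (`shortname`, `type`, `sandbox`, `name`, `branches`) are hypothetical stand-ins for the fields listed, since no actual output format has been agreed on, and timestamps are assumed to be Unix timestamps.

```python
def branches_to_refresh(projects, known_commits):
    """Return (shortname, branch) pairs whose last commit is newer than ours.

    projects: list of dicts in the hypothetical listing format.
    known_commits: dict mapping (shortname, branch) to the last commit
        timestamp we have already processed.
    """
    stale = []
    for project in projects:
        # Sandboxes are excluded, per the requirements listed above.
        if project.get("sandbox"):
            continue
        for branch, last_commit in sorted(project["branches"].items()):
            seen = known_commits.get((project["shortname"], branch), 0)
            if last_commit > seen:
                stale.append((project["shortname"], branch))
    return stale


# Made-up example data, not real projects.
LISTING = [
    {"shortname": "example_module", "type": "module", "sandbox": False,
     "name": "Example module",
     "branches": {"6.x-1.x": 1380000000, "7.x-1.x": 1382000000}},
    {"shortname": "example_sandbox", "type": "module", "sandbox": True,
     "name": "Example sandbox",
     "branches": {"7.x-1.x": 1382500000}},
]
KNOWN = {("example_module", "6.x-1.x"): 1380000000}

print(branches_to_refresh(LISTING, KNOWN))
```

This only identifies which branches changed; it does not address the lag between a commit and the -dev tarball being rebuilt.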

greggles’s picture

In order to do that, what we would need is

I think the summary has a good presentation of the various bits of information and who needs it for roughly what purpose. Can you incorporate comment #79 into that?

jhodgdon’s picture

It is already in the issue summary -- when I made the summary I definitely included all the bits that the API module needs. dww said he couldn't parse out the bits specifically needed for API module from the issue summary, so I separated it out.

Do you want me to make a separate entry in the issue summary for just this?

drumm’s picture

Assigned: Unassigned » drumm
japerry’s picture

There was talk between myself and Drumm yesterday about using the apachesolr views module to provide a service. This could be a good way to provide a public API for searching for modules (project browser), while also giving *.drupal.org projects data they need as well.

However, because of how apachesolr project works on d.o, it's not straightforward. I'm hoping nick_vh can chime in on this issue as well.

drumm’s picture

This is a bit of a tangent, but for completeness, this is what I went over with japerry in person. Our Solr index is getting more normal in D7. I expect there are now useless legacy fields hanging out in our index. The basic metadata, such as project machine name or release version, are now indexed as any other field would be. There are still custom things indexed that don't correspond to fields, such as project usage. The old project node type should be ignored in favor of the project module, theme, etc. node types.

This could overlap with our desire to replace project_solr module with apachesolr-powered Views. That is getting into scope creep for this issue.

drumm’s picture

This is a quick hack. For localize, I'm thinking we can just run something like:

SELECT from_unixtime(n.created) AS created,
       p.uri AS project_machine_name,
       pr.version,
       td.name AS api
FROM node n
INNER JOIN project_release_nodes pr ON pr.nid = n.nid AND pr.rebuild = 0
INNER JOIN project_projects p ON p.nid = pr.pid
INNER JOIN term_data td ON td.tid = pr.version_api_tid
WHERE n.status = 1
ORDER BY n.created DESC
LIMIT 10;
+---------------------+--------------------------+----------------+-----+
| created             | project_machine_name     | version        | api |
+---------------------+--------------------------+----------------+-----+
| 2013-10-27 20:28:44 | forena                   | 7.x-4.0-alpha2 | 7.x |
| 2013-10-27 19:55:41 | newrelic_drush_plugin    | 7.x-1.0        | 7.x |
| 2013-10-27 17:16:19 | imagefield_eps           | 7.x-1.0        | 7.x |
| 2013-10-27 16:42:43 | foxycart                 | 7.x-2.10       | 7.x |
| 2013-10-27 16:22:35 | views_bootstrap          | 7.x-3.0        | 7.x |
| 2013-10-27 16:21:32 | views_bootstrap          | 7.x-2.0        | 7.x |
| 2013-10-27 11:53:51 | async_drupal             | 7.x-1.0-alpha1 | 7.x |
| 2013-10-27 11:53:01 | url_formatter            | 7.x-1.0-alpha2 | 7.x |
| 2013-10-27 11:47:02 | commerce_barcode_scanner | 7.x-1.1        | 7.x |
| 2013-10-27 11:31:54 | commerce_barcode_scanner | 7.x-1.0        | 7.x |
+---------------------+--------------------------+----------------+-----+

And on D7:

mysql> SELECT from_unixtime(n.created) AS created,
       pm.field_project_machine_name_value AS project_machine_name,
       rv.field_release_version_value AS version,
       td.name AS api
FROM node n
INNER JOIN field_data_field_release_project rp ON rp.entity_id = n.nid
INNER JOIN field_data_field_release_build_type rbt ON rbt.entity_id = n.nid AND rbt.field_release_build_type_value = 'static'
INNER JOIN field_data_field_release_version rv ON rv.entity_id = n.nid
INNER JOIN field_data_field_project_machine_name pm ON pm.entity_id = rp.field_release_project_target_id
INNER JOIN field_data_taxonomy_vocabulary_6 ra ON ra.entity_id = n.nid
INNER JOIN taxonomy_term_data td ON td.tid = ra.taxonomy_vocabulary_6_tid
WHERE n.status = 1
ORDER BY n.created DESC
LIMIT 10;
+---------------------+----------------------+---------------+-----+
| created             | project_machine_name | version       | api |
+---------------------+----------------------+---------------+-----+
| 2013-10-19 20:55:37 | mixpanel             | 7.x-1.1       | 7.x |
| 2013-10-19 20:01:29 | menu_html            | 7.x-1.0       | 7.x |
| 2013-10-19 15:29:47 | weather              | 7.x-2.3       | 7.x |
| 2013-10-19 14:43:01 | social_stats         | 7.x-1.0       | 7.x |
| 2013-10-19 12:13:42 | ajax_facets          | 7.x-2.1       | 7.x |
| 2013-10-19 09:46:08 | flag_notify          | 7.x-1.0-rc1   | 7.x |
| 2013-10-19 08:43:35 | art_dialog           | 7.x-1.0-beta2 | 7.x |
| 2013-10-19 07:14:10 | live_css             | 7.x-2.12      | 7.x |
| 2013-10-19 07:08:31 | affiliate            | 6.x-1.1       | 6.x |
| 2013-10-19 07:07:32 | affiliate            | 5.x-2.0       | 5.x |
+---------------------+----------------------+---------------+-----+

This lists all tagged releases (releases, not projects), excluding -dev releases, ordered by release date.

And output that straight to a web-accessible tsv file. Whatever else is needed should be grabbed from existing XML files like http://updates.drupal.org/release-history/dynamic_fieldable_content/7.x.

jhodgdon’s picture

For api.drupal.org, unlike localize.drupal.org, we actually want the last commit on the branch, not the releases.

SebCorbin’s picture

@drumm, so far so good, it's missing usage data but I can work on that later. It's currently using file timestamp and uri but it shouldn't be a problem.

drumm’s picture

Issue tags: -drupal.org D7 +Drupal.org 7.1
drumm’s picture

Assigned: drumm » Unassigned
drumm’s picture

Priority: Major » Normal
Gábor Hojtsy’s picture

@drumm: thanks a lot! Looking at that file, it seems to include all releases made on drupal.org ever in backwards time order. It is 2831678 bytes (~2.7MB). For localize.drupal.org, we'd usually look at the first 100 or at most 200 lines (depending on release frequency). Basically all lines up to the release we already recorded. I assume that the assumption is that downloading 2.7MB "inside the d.o network" is not that big, just wanted to point out that realistically, we would only look at the first couple hundred lines in this 60.000+ line file. I hope to have time for this today or tomorrow, so localize.drupal.org would experience little downtime in terms of syncing.
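Because the file is ordered newest first, a consumer can stop reading as soon as it reaches the release it recorded last. A minimal Python sketch follows, assuming tab-separated rows without a header in the column order of the query above (created, project machine name, version, API); the function name is hypothetical and the sample rows are taken from the query output earlier in the thread.

```python
import io


def new_releases(lines, last_recorded):
    """Collect (project, version) pairs until last_recorded is reached.

    lines: iterable of tab-separated rows, newest release first.
    last_recorded: (project, version) of the newest release already stored.
    """
    fresh = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        key = (fields[1], fields[2])
        if key == last_recorded:
            break  # everything after this row is already known
        fresh.append(key)
    return fresh


SAMPLE = io.StringIO(
    "2013-10-27 20:28:44\tforena\t7.x-4.0-alpha2\t7.x\n"
    "2013-10-27 19:55:41\tnewrelic_drush_plugin\t7.x-1.0\t7.x\n"
    "2013-10-27 17:16:19\timagefield_eps\t7.x-1.0\t7.x\n"
)
print(new_releases(SAMPLE, ("newrelic_drush_plugin", "7.x-1.0")))
```

Iterating over a streamed response line by line in this way keeps memory use independent of the total file size, although the whole file may still be transferred unless the connection is closed early.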

Gábor Hojtsy’s picture

Opened #2124931: Update l.d.o syncing for d.o update of project data to track the localize.drupal.org side. Will just disable the current queries for the update for now.

Gábor Hojtsy’s picture

Issue summary: View changes

Update to mark download links / release info needed for localize.drupal.org

barraponto’s picture

L.D.O #2124931: Update l.d.o syncing for d.o update of project data would benefit a lot from having titles in this tsv file. Can I has it?

If it's still using the script from #88 I've just changed a line in this gist.

drumm’s picture

Committed: http://drupalcode.org/project/infrastructure.git/commit/311211d

Since that is just a project repo on Drupal.org, patches to it are better than a copy in a gist, for any future changes.

drumm’s picture

I still would like to see this issue solved once and for all with a View having JSON output. One of the next steps is making sure we can have usage data in that. Project usage data is stored in custom tables, so we will need custom Views integration for the project_usage module.

Gábor Hojtsy’s picture

Even small edits on that tsv file may have big consequences. The full title made the file go from 2.9MB to 4.2MB. (Not that localize.drupal.org does not need the full title - it will work until download and/or parsing this tsv reaches the memory limit - but this may not be the most size effective way :)

SebCorbin’s picture

Issue summary: View changes

Please, when you add data (columns) to the file, add them at the end :)

Anyway it's getting the release node title right now, here's an updated query (to test it go on staging and run drush sqlc)
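On the consumer side, one way to stay robust when columns are appended at the end is to read only the leading columns by position and ignore anything after them. A minimal Python sketch follows, assuming the first four columns keep the order of the original query (created, project machine name, version, API); the `Release` type and `parse_row` helper are hypothetical.

```python
from collections import namedtuple

# Leading columns, assumed to keep the order of the original query.
Release = namedtuple("Release", ["created", "project", "version", "api"])


def parse_row(line):
    """Parse one tab-separated row, ignoring any trailing extra columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(Release._fields):
        raise ValueError("row has too few columns: %r" % line)
    return Release(*fields[:len(Release._fields)])


# A row with an extra trailing column still parses the same way.
row = parse_row("2013-10-19 20:55:37\tmixpanel\t7.x-1.1\t7.x\tMixpanel\n")
print(row.project, row.version)
```

A parser written this way keeps working when new columns are appended, but breaks if a column is inserted in the middle.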

barraponto’s picture

+++ b/live/release-list.sh
@@ -10,12 +10,13 @@ set -uex
+    np.title AS project_name,

Didn't you just ask for new columns to be added at the end?

barraponto’s picture

drumm’s picture

Committed #99.

skyredwang’s picture

Status: Active » Fixed
barraponto’s picture

Should we open a followup issue for a views-based list of releases?

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

helmo’s picture

Issue summary: View changes

Summary update to have a quick link to the solution.