Problem/Motivation

If we are going to maintain many more (or all) contrib modules and themes on api.drupal.org, we need a simpler way to manage their files. Currently, we have scripts that maintain our projects/branches as Git clones, but these are time-consuming to set up. Instead, it would be easier if the API module (or a sub-module) could get the files from TGZ downloads. This would work in conjunction with
#686312: Grab project list and packages from Drupal database or XML
to make a system that could automatically maintain the project/branch lists from meta-data, and download the necessary files for the API module to parse.

Proposed resolution

A few things would need to change:

a) File update times are not reliable in TGZ files. So, rather than deciding that a given file needs to be reparsed based on its update time, we would need to switch to using a hash or checksum of the file, at least for projects/branches managed via TGZ files.

b) Ideally, if the TGZ had not been updated at all, we could skip checking the hash/checksum of individual files in the branch when doing a branch update, because we would know that nothing needed to be updated.

c) The API module could probably have a couple of new hooks that would ask "Does this project/branch need to be fully checked?" and "Get me the files for this project/branch" (see the sketch after this list). A new sub-module could manage these hooks using TGZ files for certain branches; alternatively, the main module could simply notice that a branch is TGZ-managed and use this method, while handling a regular files branch the old way.

d) Unzipped files would have a time-to-live and could be cleaned up once the API module is done looking at them.
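
A rough sketch of what those hooks could look like, in the style of an api.php file. The hook names, signatures, and the TGZ sub-module behavior described in the comments are hypothetical, not existing API module hooks:

<?php
/**
 * Hypothetical hook: decide whether a project/branch needs a full check.
 *
 * A TGZ sub-module could implement this by comparing the checksum of the
 * downloaded archive against the checksum recorded during the previous run.
 * Returning FALSE would let the API module skip the branch entirely.
 */
function hook_api_branch_needs_update($branch) {
  return TRUE;
}

/**
 * Hypothetical hook: supply the files for a project/branch.
 *
 * A TGZ sub-module could download and extract the archive into a temporary
 * directory and return the paths of the extracted files; a regular "files"
 * branch would keep working the old way.
 */
function hook_api_branch_files($branch) {
  return array();
}
?>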

Remaining tasks

TBD

User interface changes

TBD

API changes

TBD

Data model changes

TBD

Original issue report....

Directly using tgz files downloaded from Drupal.org, or anywhere else, will greatly reduce the setup for each project. This is needed for #686312: Grab project list and packages from Drupal database or XML.

This could either be a new branch type alongside the existing files type, or changes to the files branch type itself. Some code will likely be shared.

Two strategies I can think of are:

  • Extract the archive to the filesystem and parse the extracted files there.
  • Extract the file contents in memory and parse directly from strings.

I like in-memory because it entirely avoids the filesystem, and the permissions problems that come with it. In-memory might take a lot of memory, but we already use a lot of memory on parsing.

Localize.drupal.org uses Archive_Tar to extract to the filesystem: http://drupalcode.org/project/l10n_server.git/blob/refs/heads/7.x-1.x:/c....
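
For reference, the extract-to-filesystem approach with Archive_Tar is roughly this (Drupal 7 core ships the class in modules/system/system.tar.inc; the file paths here are just placeholders):

<?php
// Archive_Tar ships with Drupal 7 core.
require_once DRUPAL_ROOT . '/modules/system/system.tar.inc';

// Open the gzip-compressed archive and extract everything into a
// working directory on the filesystem.
$tar = new Archive_Tar('/tmp/api-7.x-dev.tar.gz', 'gz');
$tar->extract('/tmp/api-7.x-dev');
?>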

Comments

drumm’s picture

Looking a bit closer at Archive_Tar, I realize it seems to require using the filesystem. I still like the idea of not touching the filesystem at all, if it is practical.

jhodgdon’s picture

I'm not sure this is all that feasible. Let's say a given project has a new tgz file. Does that mean we would need to parse all of that project's contents again? Because I'm looking at the api project's tgz for 7.x-1.x-dev, and every single file in there has a date of May 18 at 14:25, whereas I'm certain that not every file in there was really updated for that release. So it doesn't look like we can tell which files actually need parsing. If we parse every file all the time, we'll be way overloaded on parsing...

drumm’s picture

I think the tgz files will have good modification dates for each file when extracted.

jhodgdon’s picture

I don't think so. I just downloaded api-7.x-dev.tar.gz and extracted it using the Linux command-line
tar xvzf api-7.x-dev.tar.gz
Every file says it was last updated on May 18 at 14:25, except LICENSE.txt (which I think is put in by the packager).

jhodgdon’s picture

Issue tags: +api.drupal.org contrib

Adding new issue tag for issues pertaining to getting all contrib modules on api.drupal.org.

jhodgdon’s picture

Status: Active » Postponed

I just tested this again with the latest 7.x tar.gz from the API module, and now all the files say they were last updated June 12 at 14:32, except again LICENSE.txt (dated Sept 17 2011). Definitely a good number of those files were not updated between May 18 and June 12. So this is not going to be a viable way to track updates unless the d.o packaging scripts start putting better dates on the files. I'm postponing this accordingly, since it doesn't seem like this method will work out at all... feel free to reopen if you have a different method for getting proper updated dates.

matglas86’s picture

If the problem is detecting which files changed, couldn't we introduce SHA-1 hashes as a way to verify file content changes? We can use http://php.net/manual/en/function.sha1-file.php. This would probably need a hash field in the api_documentation table.
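
A minimal sketch of that check, assuming the previously recorded hash is available for comparison (the helper name and where the stored hash lives are assumptions):

<?php
/**
 * Hypothetical helper: returns TRUE if a file's contents changed since the
 * hash recorded during the last parse.
 */
function api_file_contents_changed($path, $stored_hash) {
  // sha1_file() reads the file and returns a 40-character hex digest, so
  // unchanged files are recognized even when their mtime is unreliable.
  return sha1_file($path) !== $stored_hash;
}
?>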

It might also help with deleting old documentation. You don't need the dids that way, because the SHA-1 is unique, and it lowers the batch size for node deletion.

jhodgdon’s picture

RE #7 - using a hash to see if a file changed does seem like a good idea.

I don't see how hashes would help with deleting the old documentation though. You *do* need the dids to delete documentation, because you have to delete information in quite a few tables, and also you have to delete the fake nodes that are used to implement comments (the did is the node's nid).

matglas86’s picture

When a file is updated, the hash changes with it, so the old files have the old hash. When the file is parsed, the updated information gets the new hash, while the legacy information can be deleted based on the old hash. Because it's unique, you don't need to have the full array of dids to delete them; this way you can quickly delete the appropriate records. At least that's what I think. Whether that is really a plus in execution is another matter.

jhodgdon’s picture

RE #9, nope! Thanks for the suggestion though, because hashes may help with the file downloads issue.

jhodgdon’s picture

In thinking about this, doing an MD5 or SHA-1 hash of the files will *really* slow down the process of checking to see if files need to be reparsed. Right now it is only checking file times, which is fast (a quick system directory query). If we have to do a file hash, each file in each project will need to be read in and hashed, which will not be as fast. I really don't think it will be feasible... the test it's doing now, just comparing file modified times, is already fairly time-consuming on the Drupal 8.x repo. Multiply that by thousands of projects and I don't think we can afford to add to the time the process takes.
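
One way to quantify the difference is a quick stand-alone benchmark like the one below (plain PHP; the checkout path is a placeholder and the timings are only rough):

<?php
// Rough benchmark: compare stat-based checks against hashing every file.
// Point $dir at a Drupal 8.x checkout.
$dir = '/path/to/drupal-8.x';
$files = new RecursiveIteratorIterator(
  new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
);

// Pass 1: just read the modification times (what the API module does now).
$start = microtime(TRUE);
foreach ($files as $file) {
  $mtime = $file->getMTime();
}
$stat_seconds = microtime(TRUE) - $start;

// Pass 2: hash the contents of every file.
$start = microtime(TRUE);
foreach ($files as $file) {
  $hash = sha1_file($file->getPathname());
}
$hash_seconds = microtime(TRUE) - $start;

printf("mtime check: %.2fs, hash check: %.2fs\n", $stat_seconds, $hash_seconds);
?>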

matglas86’s picture

I'm running a test on this and will provide some info back later. It looks like it slows things down just a little, not in such a major way that it is problematic, I think.

jhodgdon’s picture

Thanks, that will be helpful! If you can test using the Drupal 8.x code base, that would be the worst case.

jhodgdon’s picture

(I mean, use the 7.x version of the API module, but test it on scanning the Drupal 8.x code base.)

jhodgdon’s picture

I took another look at this today:
- The file modified date for every file in the tar.gz and zip files is the date the archive was created (with the exception of LICENSE.txt).
- If you do a git clone in a new directory, the file modified date for every file is the moment that you do the git clone.
- If you do a git checkout to switch branches, the file modified date for every file is the moment you run the command.
- If you do a git clone, and then sometime later do a git pull to get updates, updated/new files will be dated as the moment you run the command.

So my conclusion from this is that:
- Git does not set the file modified times to the last time the file had a commit made. It lets the OS set it to the current time when the file is changed on disk.
- The packaging scripts are most likely doing a git checkout into a clean space, adding the LICENSE.txt file, and then zipping/tar.gzipping it all up. The only alternative to doing it this way would be to maintain directories for each branch of each project, which isn't practical in the least.

So. If we actually want to run the API module against downloaded zip or tgz archives, we would definitely need to do some kind of hash to see if the file contents for files in the archive have changed, rather than relying on last update times as we do now.

The problem is... Let's say we download a new Drupal Core tgz file for 8.x, and we realize that 125 files have changed since the last download (which is certainly possible). If we were using the file system as normal, we would just mark those 125 files as "needs parsing", and then over the next few cron or cron-queue runs, we would get through parsing them. But if the only location of the files is an archive that we've downloaded temporarily, the current parse queue system would not work well, to say the least. I guess maybe we could unzip just the changed files to some temporary storage and set up a parse job that reads the temporary file? It sounds like a pain, but I guess it might be doable. We'd also need to handle the case where the file changes again before we get around to parsing the temporary copy we saved: we would want to queue another job and, if possible, realize when we got to the first job that it was obsolete. Still, should be doable.

It does look like we could use the ArchiveTar class (which is now in Drupal 8.x) to do the extraction. You can construct an ArchiveTar (giving a file path or a URL, I think), and then call the listContent() method to get a list of all the files it contains (not sure of the format, but I'm sure we can figure that out). Then for each file in the archive, we could use extractInString() to get the file contents into a string, check the hash, and if it has changed, save that string as a temporary file and queue a parse job.
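
As a rough illustration of that flow (Drupal 7 ships the same tar class as Archive_Tar in modules/system/system.tar.inc, Drupal 8 as ArchiveTar; the two api_tgz_* helpers and the branch ID are hypothetical):

<?php
require_once DRUPAL_ROOT . '/modules/system/system.tar.inc';

$branch_id = 123;  // Placeholder branch ID.
$tar = new Archive_Tar('/tmp/api-7.x-dev.tar.gz', 'gz');

// listContent() returns one record per archive entry, with keys such as
// 'filename', 'size', 'mtime', and 'typeflag'.
foreach ($tar->listContent() as $entry) {
  // Skip directory entries; typeflag '5' marks a directory.
  if ($entry['typeflag'] == '5') {
    continue;
  }
  // Pull the file contents into a string without extracting to disk.
  $contents = $tar->extractInString($entry['filename']);
  $hash = sha1($contents);
  // Hypothetical helpers: compare against the stored hash, and queue a
  // parse job only when the contents actually changed.
  if (api_tgz_hash_has_changed($entry['filename'], $hash)) {
    api_tgz_queue_parse_job($branch_id, $entry['filename'], $contents);
  }
}
?>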

drumm’s picture

A couple possibilities for saving the source code for queued parsing:

  • Go ahead and save each file's contents to {api_documentation}.code while scanning for changes. Queued jobs would always use that for further parsing. A possible disadvantage is that it would show up in the view source UI a few minutes before the rest of parsing is completed.
  • Include the file's contents as an argument to the queued job. We would have to keep track of the hashes that have been queued instead of timestamps.

jhodgdon’s picture

Interesting ideas...

{api_documentation}.code is currently storing the parsed/formatted code, not the raw source code, so that might be a bit difficult (for instance, in between noticing file A needed parsing and actually doing it, we wouldn't want to have overwritten the already-formatted code that we had).

But we could find another spot in the database. Putting it in the queued job itself (as you suggested) would be the most logical. I think that the data for the job is already a blob anyway.
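
A sketch of that option using the Drupal 7 queue API; the queue name, the data keys, and the helper name (matching the hypothetical helper in the earlier sketch) are all assumptions:

<?php
/**
 * Hypothetical helper: queues a parse job that carries the file contents
 * along with it, so parsing does not depend on the extracted archive (or a
 * temporary copy) still being on disk.
 */
function api_tgz_queue_parse_job($branch_id, $filename, $contents) {
  $queue = DrupalQueue::get('api_parse_tgz_file');
  $queue->createItem(array(
    'branch_id' => $branch_id,
    'filename' => $filename,
    // Storing the hash lets the worker notice that a newer version has been
    // queued since, and treat this item as obsolete.
    'hash' => sha1($contents),
    'contents' => $contents,
  ));
}
?>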

jhodgdon’s picture

Issue summary: View changes
Status: Postponed » Active
Related issues: +#686312: Grab project list and packages from Drupal database or XML

We need to un-postpone this.

New possible idea of an approach:

a) We would probably only want to do this on contrib module branches. For Drupal Core, there are additional things we do besides just a git clone (such as running Composer to get external dependencies), and I don't think these are in the branch TGZ files.

b) For projects where we're using TGZ, when the queues come across a job that needs the files, we could trigger a process that would download the TGZ to a temporary space and extract it.

c) The temporary space could have a time-to-live, so that it could be cleaned up later when it's no longer in use.

d) For branch update jobs, we could skip actually checking the individual files to see what is updated, by looking at a hash or checksum of the TGZ itself. If the TGZ has not changed, nothing in the branch has changed.

e) We would definitely need to switch from using file update times to decide "this file needs reparsing" to using some kind of hash or checksum.

The "trigger a process" could be a hook, and we could have a separate sub-module that manages the TGZ download, extraction, and temporary file cleanups. There could also be a hook that asks "do we even need to process this branch update", which would check the hash on the TGZ.

Adding an issue summary...