Updated: Comment #N

Problem/Motivation

As the structure is now changed on d.o this is a follow up of #2125121: Allow user to specify protocol (http instead of https) with --issue-site

List of changes to make:

  • attachments are now field_issue_files
  • attachments are now just file resources
    • to find contributors we need to fetch either all comments or files?

From our test environment this comes from #2125121-6: Allow user to specify protocol (http instead of https) with --issue-site json.txt file and passed through http://jsbeautifier.org/

Proposed resolution

  1. Write a node fetcher class NodeFetcher which fetch not only the node but the whole tree (comments, files, users)
  2. Make sure we cache stuff
  3. Merge the fetch items into one consistent object (class JsonNode)
  4. Integrate that object JsonNode back into drush_iq

How to test

~/.drush/drush_iq$ ./runtests.sh iqInternalTest.php

Remaining tasks

  • As we now have comments and files separate the same matching algorithm must reapplied in JsonNode.
  • Use drush cache? and invalidate appropriate
  • What should / could we cache?
    1. Node short time periond
    2. Comment: long as we are not interested in changes do in comment
    3. File: forever as these never change.
    4. Taxonomy: ?
    5. ?

User interface changes

API changes

Original report by @username

Comments

clemens.tolboom’s picture

Issue summary: View changes

Added comment example and tiny start of changes to made.

clemens.tolboom’s picture

FWIW I think we should redesign the parsing and processing of the json.

As the comments are not in the response of ie node/1.json any more we will fire a lot of comments/cid.json requests unnecessary?

Can we fix this by processing the HTML page instead?

greg.1.anderson’s picture

In the past, processing the HTML page was tricky, because some old pages with low nids produced different, messier HTML than newer pages. I never looked into why this was; perhaps the D7 drupal.org no longer does this. drush iq-submit already parses the HTML page just to interpret the form elements; it strips out everything between the <form> and </form> to do this. I haven't checked to see if this is still working; odds are good that it is at least slightly broken.

I agree that we can probably separate out comment processing so that it only happens when needed (e.g. when generating attribution); I haven't had a chance to look into the situation yet to see what it would take, but I don't expect it to be too difficult.

greg.1.anderson’s picture

From #2125121: Allow user to specify protocol (http instead of https) with --issue-site:

drush_iq is not yet updated to work with drupal.org, and it appears that drupal.org is not yet emitting json data yet anyway. To test drush_iq, though, you may use the test site, since it -has- been updated to publish json at node/1.json.

To test:

drush iqi --issue-site=http://username:password@project-drupal.redesign.devdrupal.org 2125121

The username and password are both "drupal" (I didn't fill it in because I didn't want to make a valid URL for crawlers to follow...)

clemens.tolboom’s picture

This needs a better issue summary:

- add link to #1710850: Deploy RestWS for D7 project issue JSON
- replace the copy / paste of comment and node into 2 attachments
- formulate the problems when using json.

To start with the latter: requesting this issue through drush_iq would result in 1+5+2 requests:
- node/2127965.json
- comments/cid.json (5 times)
- user/uid (2 times)

My guess is we must not ask for json but use screen scraping instead on just this node/2127965 or create a json view having all data in it.

clemens.tolboom’s picture

Status: Active » Needs review
FileSize
3.6 KB

I've written a small test against http://project-drupal.redesign.devdrupal.org/ test site.

It fetches all issue related jsons (node, comments, users) and pretty print (needs PHP 5.4) these in tests/data/downloads/*

Running it takes between 7 and 10 seconds when having an empty 'cache'. Next run is under 40ms.

~/.drush/drush_iq$ ./runtests.sh iqInternalTest.php
clemens.tolboom’s picture

Issue summary: View changes
FileSize
5.81 KB

Added the downloaded files so no need for this big issue summary.

clemens.tolboom’s picture

I forgot the files ... doh. Each file has an owner so we are good in the needed dataset.

clemens.tolboom’s picture

Issue summary: View changes
clemens.tolboom’s picture

Issue summary: View changes
FileSize
4.66 KB

New patch got a cleanup.

When running through ./runtests.sh iqInternalTest.php it will start downloading

$ find tests | wc -l
     423

items.

This takes ages. Note it does fetch the complete tree of 25 nodes.

Checking for the relation comment - file I have not found. $ less tests/data/download/comment/7858225.json

The test code fetches a node and stuffs all related data in $store.

Shall I start integrating this test code into iq?

clemens.tolboom’s picture

[edit]
I don't think it is hard to find the correlation between a file and the user as the comment and file are submitted around the same timestamp by the same user.

And we know what the comment index is based on it's position in the comments array.
[/edit]

$ less tests/data/download/file/4637597.json
{
    "fid": "4637597",
    "name": "datetime.html_.twig-to-time.html_.twig-2088365-3.patch",
    "mime": "text\/x-diff",
    "size": "247",
    "url": "http:\/\/project-drupal.redesign.devdrupal.org\/files\/datetime.html_.twig-to-time.html_.twig-2088365-3.patch",
    "timestamp": "1379074841",
    "owner": {
        "uri": "http:\/\/project-drupal.redesign.devdrupal.org\/user\/2665733",
        "id": "2665733",
        "resource": "user"
    }
}

Take each file upload correspond to a comment we can relate these by the timestamp?!?

$ grep -r 13790748 tests
tests/data/download/comment/7858225.json:    "created": "1379074862",
tests/data/download/file/4637597.json:    "timestamp": "1379074841",
$ less tests/data/download/comment/7858225.json
{
    "comment_body": {
        "value": "<p>Attached patch for this small change.<\/p>\n",
        "format": "1"
    },
    "cid": "7858225",
    "name": "naveenvalecha",
    "homepage": "",
    "subject": "",
    "url": "http:\/\/project-drupal.redesign.devdrupal.org\/comment\/7858225#comment-7858225",
    "edit_url": "http:\/\/project-drupal.redesign.devdrupal.org\/comment\/edit\/7858225",
    "created": "1379074862",
    "node": {
        "uri": "http:\/\/project-drupal.redesign.devdrupal.org\/node\/2088365",
        "id": "2088365",
        "resource": "node"
    },
    "author": {
        "uri": "http:\/\/project-drupal.redesign.devdrupal.org\/user\/2665733",
        "id": "2665733",
        "resource": "user"
    }
}
clemens.tolboom’s picture

Currently patch needs Drush 6.x

It fails on drush master with
Cannot open file "/Users/clemens/Sites/drupal-devs/drush/tests/drush_testcase.inc".

clemens.tolboom’s picture

Patch tries to fetch from the new endpoint on https://restws-drupal.redesign.devdrupal.org/api-d7 but fails.

I've added some documentation. Needs more love :(

clemens.tolboom’s picture

I needed to add code 403 Forbidden ... not sure how to use that in code :-(

clemens.tolboom’s picture

I now have moved the fetching code out of the test and into a class.

It still needs to start using drush cache and more documentation.

Running against https://restws-drupal.redesign.devdrupal.org fetching 28 got me 1337 files and took 22 minutes :(

markcarver’s picture

This is now deployed on d.o. See: https://www.drupal.org/about-drupalorg/api#restws

clemens.tolboom’s picture

Fixed some drush 7.x / master issues. Running

# ~/.drush/drush_iq$
./runtests.sh iqInternalTest.php

will try to fetch 5 nodes and related data from the real thing.

Integration functions are written but how to integrate that into the logic of drush_iq is hard.

/**
 * Produces a Node fetcher.
 */
function _drush_iq_get_fetcher() {
  static $fetcher;

  if (isset($fetcher)) {
    return $fetcher;
  }

  $fetcher = new NodeJsonFetcher();

  // "http://drupal:drupal@project-drupal.redesign.devdrupal.org";
  $fetcher->setSite("http://www.drupal.org/api-d7");

  // TODO: cache this dir;
  $fetcher->setCacheDir(__DIR__ . '/tests/data/download');
  return $fetcher;
}

/**
 * Fetch all content belonging to a node.
 */
function _drush_iq_fetch_full_node($nid) {
  $fetcher = _drush_iq_get_fetcher();
  return $fetcher->fetchFullNode($nid);
}

/**
 * Fetch the node base json for further usage.
 *
 * @see _drush_iq_fetch_full_node().
 */
function _drush_iq_fetch_node($nid) {
  $fetcher = _drush_iq_get_fetcher();
  return $fetcher->downloadNode($nid);
}

/**
 * Fetch the given resource type
 */
function _drush_iq_fetch_resource_by_type($id, $type = 'node') {
  $fetcher = _drush_iq_get_fetcher();
  return $fetcher->fetchJson("$type/$id");
}

I started with class JsonNode which provide nice wrapper around comments and files by adding the author or owner. That needs more work for other fields.

(i spend to much time on this today so please review the 'test')

clemens.tolboom’s picture

Issue summary: View changes

It seems the new deployment on d.o does not make anything faster. So please report back.

greg.1.anderson’s picture

Sorry that I haven't replied here, but I am actually on vacation right now, and have very limited internet access. First of all, thank you for working for so long on this issue. There are two aspects of this patch that I want to address.

First, it seems that the new API on drupal.org is not well-suited to pre-fetching all of the information about an issue upfront. Attempting to do so takes a really long time, so I think we should lazy-load instead of prefetch. A lot of the information is not needed for most iq commands.

Second, on caching: I think it's good in principal to try to cache as much as possible. The drupal.org api docs certainly request this. However, for iq commands, it is unlikely that you are going to get cache hits very often. If someone does request the same node twice, it's usually because someone added a new patch that they'd like to try, so it seems likely that most of the cache hits that you do get will do more harm than good, as they will prevent the user from seeing new data.

I think that before we can really get much benefit out of caching, we need an extension of the api to allow us to fetch just the last-modified date of an entity, without getting all of the data. Failing that, we could leave just the issue node out of the cache, fetching it every time, and caching everything else.

I am working on a 7.x-2.x branch for iq for changes to support the d7 api. I'll push what I have done as soon as I have it nominally working. The first version(s) will not cache; help with cache wrappers (maybe utilizing the Drush cache functions) would be appreciated.

drumm’s picture

Second, on caching: I think it's good in principal to try to cache as much as possible. The drupal.org api docs certainly request this.

I was sure to add the "whenever possible" to the common-sense guidelines for this sort of situation. Since requests are generally triggered by a person running a command, this won't hit our web servers much more than a person viewing a web page, which is good. We do have the usual Varnish and CDN caching on our side.

greg.1.anderson’s picture

We can cache 'user' and 'file' objects aggressively, as these do not change.

greg.1.anderson’s picture

I have pushed the 7.x-2.x branch; it can be checked out from git. iq-info works, most other stuff is probably still broken (untested).

It would be useful to add caching, or to refactor into a class per @clemens.tolboom's earlier work, but I have not done this.

clemens.tolboom’s picture

@greg.1.anderson you had enough bandwidth to give good feedback :)

My ideas changed regarding the fetcher as ie Gabor could use the code as well. I have to checkout your code too.

greg.1.anderson’s picture

My approach was to modify the existing code first, just to get it working, rather than start with a new class, which might require more rework of the existing code. The adjusted functions in iq.drush.inc could later become class methods; drush_iq_download_info and drush_iq_download_url would be the two analogous functions to your fetcher.

I sketched an idea for caching that involved storing the parsed json data; however, I think that in fact it would probably be better to cache the raw downloaded files instead. This would involve a slight refactoring of where the cache function comes into play, but the idea would be the same. Note that the code rebuilds the requested data structures every time; if the cache is used, it pulls up the raw unaltered source data, and then re-applies any transformations (alter hooks) that are necessary. This allows us to, for example, have a function that lazy-loads just the first patch info, and this will not get in the way of subsequent calls that might want to request info on all of the patches. This should object-ify pretty easily, if you want to do that. I will continue on trying to make the base functionality of drush iq work, as time permits.

greg.1.anderson’s picture

I pushed some more code to the 7.x-2.x branch. iq-info and create-commit-comment are tested and working (except that I cannot determine who an issue is assigned to, and I don't know which comment number a file is associated with). All of the other code has been reviewed and updated, but not tested yet, so there are likely still some problems.

Caching works well, but only supports "cache forever" and "cache never". I have not made any attempt to factor out this code into a separate class that could be re-used by others. Efforts in that direction would be welcome. I will continue to test and fix problems in the current code and push fixes as time permits.

dman’s picture

Whoo Greg!
Thanks so much for finding time to resurrect this, especially given your time that should have been spent enjoying the country!

I'm starting to see some life back in the thing now - was failing for me earlier today, but yes (!) with the latest git pull I'm getting a valid response from iq-info.
Hoping for 'iq-apply-patch' soon - so I can catch up on a load of module housekeeping that's been not happening this year.

Naturally this *needs* drush6 or it also fails. - thanks also for your drush notes from the Auckland presentation Greg - some of the team bounced it to me today.

greg.1.anderson’s picture

I made a couple more commits tonight; iq-apply-patch and iq-reset are now tested and working. The comment number is still wrong for iqa, of course, since this information is not available from drupal.org yet; however, this deficiency should not get in the way of the actual operation of applying patches.

greg.1.anderson’s picture

Most of the iq commands are working new (as of #27), but I am still having some trouble with POSTing to the d7 version of drupal.org. For information on POST policies, please see #2328093: Establish policies for external bots or tools that post or modify content on drupal.org.

greg.1.anderson’s picture

iq-submit is now nominally working. There is a minor problem with multiple file uploads, so run with --interdiff=0 to avoid producing a second attachment. Drush iq now uses Goutte, so run composer install after cloning/downloading, and composer update after pulling.

clemens.tolboom’s picture

As there is no need for this issues patches (and my wasted time) I hide them.

greg.1.anderson’s picture

If you're not tracking here any more, then I'll just mark this issue as fixed. I'm sorry you feel your time was wasted; it wouldn't have to be. There would still be value in factoring this code out into a separate class. I just have limited time, and wanted to put the emphasis on making the code functional first before refactoring.

greg.1.anderson’s picture

Status: Needs review » Fixed
clemens.tolboom’s picture

I'm reworking the code into a drupal module / php component. Not sure were it will land. Depending on #2328093: Establish policies for external bots or tools that post or modify content on drupal.org

greg.1.anderson’s picture

I think that the best place to put it would be in composer / packagist. Then it could be included where needed. Note that drush_iq is already using composer to pull in Goutte / Guzzle. See http://drupalwatchdog.com/volume-3/issue-2/composer-sharing-wider

It seems likely that the policies will be accepted as written. These policies only dictate how bots should act, though, not how they are provided.

dman’s picture

I just want to chime in with a

happy dance

Because, YES, drush-iq has been working for me today!!!!

Thanks so much Greg!

So far success with iq-apply-patch, iq-merge, iq-reset

I had death with iq-submit, somewhere deep in guzzle. Not sure whether it's my install, php versions or something else. Maybe https certs for all I know, I'll look into it I guess.

Note: To clear up all issues, I had to upgrade everything - php itself, composer, drush core again, as well as drush-iq, installed via composer and then "composer update" there too. Something able the composer libs (guzzle again) demanding the newest PHP etc...

As a result, I've finally dedicated a day to closing 20 pending issues of one of my queues. Now for the other 40 projects ....

greg.1.anderson’s picture

Glad to hear that it's working for you! Yes, drush_iq now requires composer and at least Drush 6, although I'm running on master. If it's broken on 6, I could probably fix that. In working on this project, I started thinking of ways that Drush could help out here. It would be pretty easy for Drush to run composer update after downloading a Drush extension with a composer.json file. Drupal modules are still in flux vis-a-vis composer handling, though (see link in #34).

iq-submit was nominally working for me, but this command hasn't gotten a lot of play yet. If you didn't use --interdiff=0, it may have died trying to upload the second patch, which does not work. iq-release does not work.

I have not tried older versions of php; seems all of my systems are running recent versions right now. Do you know which versions of php failed, and which worked? I could at least document the requirements. The php requirements were probably inherited from guzzle / goutte.

dman’s picture

To not mess up my system, I ended up going with
DRUSH_PHP=/Applications/acquia-drupal/php5_4/bin/php
which was PHP 5.4.8 (cli) (built: Nov 5 2012 02:26:57)

I think the PHP that was not working for me may have been 5.3.18 ,

PHP Parse error:  syntax error, unexpected '[' in ~/.drush/drush_iq/vendor/guzzlehttp/guzzle/src/functions.php on line 24

so I recognised that as more modern PHP than I'm used to.
I'd even suspect I confused composer by using a different version of PHP-cli to download the libs vs the one I actually ran the code with, because I can't seem to trigger it now.

FWIW, stability seems here with
* PHP 5.4.8 (Using Acquia dev desktop distro binary and DRUSH_PHP)
* Drush Version : 6.3.0 (via composer into /.composer/vendor/bin/drush)

Next time I try iq-submit, I'll capture the trace, and yeah I didn't do anything special to trigger --interdiff=0 and it was something about Guzzle failing to get a webresourcehandle or ???
But If it persists, I'll open a new issue.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.