D7 migrate included abstract MigrateList and MigrateItem classes to support migration scenarios such as REST APIs where rather than iterating over a monolithic source, you obtain a list of items from one place (endpoint) and retrieve each individual item from another. I'll be working on this support in conjunction with #2429137: Implement basics of wine example in D8 (which demonstrated some examples of this pattern) and #2500955: Port MigrateSourceXML to Drupal 8 migrate source plugin.

Comments

mikeryan created an issue. See original summary.

mikeryan’s picture

Component: Source plugins » Code

See also #2621490: [meta] Maximize consistency with JSON source plugin and #2621492: [meta] Maximize consistency with XML source plugin, which will be the major consumers. migrate_source_json does include some support for the list/item pattern, so I will be looking at if any of it can be generalized and moved into migrate_plus to be shared amongst JSON, XML, and potentially other source plugins.

mikeryan’s picture

Status: Active » Postponed
mikeryan’s picture

#2640508: Refactor fetcher/parser classes to make fetcher the iterator should make it fairly simple to support the list/item pattern as a fetcher - (re)postponing on that issue.

mikeryan’s picture

Component: Code » Plugins
mikeryan’s picture

Version: 8.x-1.x-dev » 8.x-2.x-dev
Status: Postponed » Active

The "make fetcher the iterator" issue probably isn't going to proceed - this should be pursued on its own.

Grayside’s picture

Assigned: Unassigned » Grayside

Going to start digging on this tomorrow.

Grayside’s picture

After thinking about this for awhile, creating a generic solution for this is pretty tricky, and here are a couple issues:

  • Parsing and iterating are currently tangled in the data parser plugins.
  • Parsing and decoding are currently tangled, and feels like a dependency on the Serialization API wrapped with some generator magic would be better

I don't want to spend time digging into those right now without a clearer sense of project goals in the space, so I am working on a change to the JSON parser.

Grayside’s picture

Status: Active » Needs work
FileSize
6.35 KB

Attached is my work in progress, going to resume tomorrow with some live exercises.

Grayside’s picture

Status: Needs work » Needs review
FileSize
6.67 KB

This seems to be working for me.

Approach

I could not see a way to break apart the existing components without wildly expanding scope, so I focused on my JSON use case. Within the JSON parser's fetchNextRow() I saw an opportunity to check on config and use the getSourceData method to do a follow-up request for the item. This seems like a pattern that might work cleanly for other parser plugins, but not in a generic manner.

Configuration Changes

As noted in the docblock for getRemoteItemData(), this introduces a new list configuration parameter for the JSON data parser, which is stuck directly on the top-level of the source plugin config. list can be set to TRUE for a default execution, or it can contain any of the following nested configuration:

  • selector is equivalent in function to the selector property for individual fields and is used to identify which data within the "list" data row to use.
  • remote_item_selector is an opportunity to override item_selector config for where in the retrieved "item" document to get the data.
  • uri_template a very naive URI templating solution that will replace http://example.com/node/{nid} based on data in the list row. If the list row is a scalar and there are no tokens to replace it performs simple concatenation.

New/Refactored Utility Methods

As part of this work, I ended up with two static methods that seem like they don't belong in the Json parser anymore. select() is a consolidation of the xpath-like processing, and inspired me to file #2800295: Use external property access library in lieu of "xpath-like" implementation. getTemplatedUri() is the "naive" templating solution. Seems like something core should provide, but I couldn't find an API for it.

Problems

I noticed the HTTP data fetcher has a tendency to throw all errors as MigrationException. This makes sense for larger source URLs, but when we are making requests for individual item data it's a pain. I investigated the idea of setting a "row" context in which empty responses or ClientException (4xx errors) would throw a MigrateSkipRowException instead to make the migration more robust, but unfortunately these exceptions fall all the way up to Drupal\migrate\MigrateExecution::import() where it stops the migration.

      try {
        $source->next();
      }
      catch (\Exception $e) {
        $this->message->display(
          $this->t('Migration failed with source plugin exception: @e',
            array('@e' => $e->getMessage())), 'error');
        $this->migration->setStatus(MigrationInterface::STATUS_IDLE);
        return MigrationInterface::RESULT_FAILED;
      }

Not sure where we want to try catching a MigrateSkipRowException, or if it should be in this issue.

Potential Follow-ups

  • Skipping Rows instead of Terminating Migration as mentioned in Problems.
  • Building options for retries & exponential backoff into how the HTTP requests are issued. This approach is pounding the source URLs as fast as entities can be processed which is irresponsible and likely to start creating server failures for lower-scale infrastructure.
Grayside’s picture

Assigned: Grayside » Unassigned

Done for now.

mikeryan’s picture

Thanks for this patch - I'm going to need to think on this bit, just want to let you know I'm not ignoring it as I catch up on reviews...

Grayside’s picture

Grayside’s picture

I suspect the current implementation of this or #2640516: Support paging through multiple requests is disrupting rollbacks. Got this second-hand so I do not have a lot of details, but likely related to item counts.

Grayside’s picture

Missed one spot in getSourceData() where $this->itemSelector => $item_selector.

dmouse’s picture

I just update this patch with the 8.x-3.x branch

Status: Needs review » Needs work

The last submitted patch, 16: list_item_pattern-2608610-16.patch, failed testing.

dmouse’s picture

Updated patch

heddn’s picture

Issue tags: +Needs tests

Any chance we could add a unit test or two? I think that would be super helpful to iron out any outstanding bugs to the interface.

Grayside’s picture

We're pretty busy working to finish the project that necessitated this patch, so I doubt I'll be able to add tests in the next few weeks. Is the request a +1 that the basic approach is good?

heddn’s picture

There's enough complexity in what is being developed here I was hoping to see tests. From the deprecated code in http://cgit.drupalcode.org/migrate_source_json/tree/src/Plugin/migrate/s..., there was the concept of a single source and a multi source. And an implementation of JSONapi's recommendations. I feel like the list solution is going to vary widely for different sources. It might make sense to use a reader pattern that lets one easily swap out the list reader for another in an easy manner.

rv0’s picture

@Grayside
Nice work.
I've been trying to use it for a few hours but I'm afraid I'm not fully getting the code.

On the bottom of getSourceData, you select the data using $this->itemSelector, then pass it on to the select() function, which uses the same selector to narrow it down again (which does not work of course, it's already narrowed down to the same selector).
I must be doing something wrong? I'm not sure if I get the whole $this->itemSelector vs $item_selector difference.

Grayside’s picture

getSourceData() is used two different ways. First, it is used to pull the list of resources. Then as the source iterates across that list, it pulls a URL referenced by each of those rows.

Here's some example configuration.

source:
  plugin: deploy_node
  data_fetcher_plugin: http
  data_parser_plugin: json
  // The list key means the list/item pattern is in use.
  list:
    // The URI template uses keys from the list data for the row to form a URL for "secondary request" of the item.
    uri_template: '{path}?_format=json'
    // Acts as the item_selector, but for the item data from the secondary request.
    remote_item_selector: ''
  urls: '/deploy/node/list'
  // Where in the initial list data do we retrieve the rows to be processed?
  item_selector: 0

Keep in mind the configuration key item_selector and method parameter item_selector are not hte same thing. The source config key item_selector becomes $this->itemSelector in the parser.

rv0’s picture

@Grayside
Thank you for your quick reply. I tried all sorts of config options, either I'm missing something (I'm new to D8 and the D8 migrate API) or the code just won't work for my case. Perhaps it will help if I describe the data structure a bit (simplified):

There's a "list" url http://api/properties
returns an array with key "properties" and individual items "id, date"

There's an item url http://api/properties/id where id is a variable id
returns a property detail with key "property" and individual items "id, title, .."

This would leave me with structure like this I guess:

source:
  plugin: url
  data_fetcher_plugin: http
  data_parser_plugin: json
  list:
    selector: id
    uri_template: '{id}'
    remote_item_selector: 'property'
  urls: 'http://api/properties'
  item_selector: 'properties'

I've tried all sorts of variations, but usually end up with undefined index type of errors
Notice: Undefined index: properties in Drupal\migrate_plus\Plugin\migrate_plus\data_parser\Json::select()