Working on #2621490: [meta] Maximize consistency with JSON source plugin and #2621492: [meta] Maximize consistency with XML source plugin, I believe many of the common needs of XML and JSON sources (and down the line, other sources like file directories) can be addressed with common base classes and interfaces:

  • Specifying custom headers on a request
  • Migrating from multiple specific files/URLs
  • Paging through multiple requests ("next" links)
  • Common specification of fields and IDs

migrate_plus seems like the natural place for the common framework, from which specific source plugins in their own modules can inherit.

Comments

mikeryan created an issue. See original summary.

mikeryan’s picture

The list/item stuff may very well be subsumed by this.

mbaynton’s picture

The needs you mention seem oriented towards data acquisition, with a particular bias towards acquisition over HTTP. Basically, this makes migrate more than ETL by adding nontrivial logic before 'extract' even happens. From a feature standpoint it's obviously great, but I wonder if this additional step should really be treated as fully separate, rather than being added to source plugins. (https://www.drupal.org/node/2621492#comment-10605560 refers to it this way.) What about making the source plugins' relationship to this functionality a 'has-a' rather than an 'is-a'? So at a high level, the source plugin has a DataAcquisitionInterface or whatever, which could be a fully-fledged web-spidering multi-source implementation, or just read one file off the local filesystem.
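A rough sketch of that 'has-a' idea in plain PHP (all names below, and the interface shape itself, are illustrative only — this is not actual migrate_plus code):

```php
<?php

// Sketch of the 'has-a' idea: the source plugin owns a data-acquisition
// object instead of inheriting acquisition behavior. Illustrative only.
interface DataAcquisitionInterface {

  /**
   * Returns the raw payload(s) to hand to the extraction step.
   *
   * @return string[]
   *   One raw document body per source URL or file.
   */
  public function acquire(): array;

}

// The simplest possible implementation: read one file off the local
// filesystem. A web-spidering multi-source implementation would satisfy
// the same interface.
class LocalFileAcquisition implements DataAcquisitionInterface {

  protected $path;

  public function __construct($path) {
    $this->path = $path;
  }

  public function acquire(): array {
    return [file_get_contents($this->path)];
  }

}
```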

mikeryan’s picture

I'm not sure I see acquisition as requiring a separate stage from extraction - the two seem intertwined to me. But, yes, I'm thinking that this shared functionality would be represented in an interface, although we would want to share implementation as well, which suggests a base class with default implementations (or, maybe a trait).

mikeryan’s picture

Naming things is hard... Trying to think what this interface and corresponding base class would be named... It's intended for any file-oriented source (as opposed to row-oriented sources like SQL and CSV) and should work equally well for a local filesystem or for resources accessed via a stream wrapper (particularly http:// of course). Regarding "a particular bias towards acquisition over http", the only aspect here specific to HTTP is the ability to specify custom headers (like authentication tokens) - essential for the use case I'm working towards, and obviously @KarenS had use for it since she built it into the JSON plugin.

The "common specification of fields and IDs" (standard means to define them within source plugin configuration) actually may be more broadly useful than just the file-oriented sources I'm thinking of here, so that may be something to be put into a lower-level base class. Then on top of that we add the multiple file and linking support via an interface (yes, maybe DataAcquisitionInterface) and another layer of base class providing default implementations that both XML and JSON can leverage. I think the header support should go into here as well - when headers are provided in the configuration, they would be applied for any HTTP requests - any source of this type should be able to retrieve from HTTP as well as local files.
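As a sketch of what header support might look like in source configuration (the plugin name and the headers key here are assumptions for illustration, not a committed API):

```yaml
# Hypothetical example - the 'headers' key is not a committed API.
source:
  plugin: url
  urls: https://example.com/api/positions?_format=json
  headers:
    Authorization: 'Bearer my-auth-token'
    Accept: 'application/json'
```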

Still thinking through the details...

mikeryan’s picture

Just as an FYI, as a first pass I'm doing this in a 'Url' abstract base class alongside a UrlIterator base class here in migrate_plus, putting all the common functionality for XML and JSON sources there. I have this working with XML (whose Xml and XmlIterator classes are now very thin layers - nearly all XML-specific processing is in the XMLReader class), and I'm still applying the pattern to the JSON plugin. On the 'Url' name - I don't think that's too awful, since local filespecs can be regarded as implicit file:// URLs. Open to other suggestions, though.

The upshot is that configuring XML and JSON plugins will be nearly identical, allowing you to very easily switch between the two.

migrate.migration.wine_role_xml.yml (item_selector and selector values are xpaths):

source:
  plugin: xml
  urls: /migrate_example_advanced_position?_format=xml
  item_selector: /response/position
  fields:
    machine_name:
      label: 'Unique position identifier'
      selector: sourceid
    friendly_name:
      label: 'Position name'
      selector: name
  ids:
    machine_name:
      type: string

migrate.migration.wine_role_json.yml (item_selector is a depth level, selector is a key at that depth):

source:
  plugin: json
  urls: /migrate_example_advanced_position?_format=json
  item_selector: 1 
  fields:
    machine_name:
      label: 'Unique position identifier'
      selector: sourceid
    friendly_name:
      label: 'Position name'
      selector: name
  ids:
    machine_name:
      type: string

Next steps:

  1. Get the JSON plugin working with the current pattern.
  2. Figure out the most appropriate interfaces and potential breakup of the Url/UrlIterator classes.
  3. Commit the changes to all three modules (migrate_plus, migrate_source_json, migrate_source_xml) as a stable platform for the remaining work.
  4. Implement custom headers for HTTP URLs based on the current JSON code.
  5. Implement the list/item pattern.
  6. Implement link following based on the current JSON code.

mbaynton’s picture

@mikeryan thanks for all your updates and discussion - I see them coming in during my workday and have to leave them alone until evenings when I have time for Drupal.

Sorry, is this work (Url, UrlIterator, other things from your first pass) available somewhere I can look at it for a more concrete understanding of the discussion? I don't see recent commits in migrate_plus.

mikeryan’s picture

My plan was to get the JSON side working, then upload the relevant patches here, in migrate_source_xml, and in migrate_source_json. But my other project is heating up and I'm not sure I'll finish the JSON work today, so maybe I should just put up what I've got so far. Or, perhaps more conveniently, push up the respective branches to the projects...

mikeryan’s picture

OK, I've pushed the migrate_plus work to branch refactor_source_plugins, and the migrate_source_xml work to branch refactor.

mikeryan’s picture

OK, got the JSON source plugin basically working and pushed to the 'refactor' branch in migrate_source_json.

So, now that much of the common functionality is in migrate_plus, let's think about how best to package it. Right now there are three key components:

  1. The source plugin itself - Url is the base implementation, with the Xml and Json source plugin implementations doing little beyond specifying default iterator and reader classes. The Xml plugin is handling namespaces (which really should go into the reader plugin), and the Json plugin has link and HTTP header support that will be moved to the base class in a following round. We're not far from having Url be the source plugin itself, with the appropriate reader class simply passed in to handle Xml vs. Json.
  2. An iterator class (UrlIterator, extended as XmlIterator and JsonIterator). This intermediate iterator exists primarily to iterate over multiple source URLs (and would be the place for the link support); it'd be nice to forgo the extra level, either by doing this in the source plugin directly or by moving the handling into a reader base class. Right now the extensions do nothing but create the reader, and only a single constructor argument forces them to differ, so (if the iterator class is even necessary at all) UrlIterator should be usable without extension.
  3. A reader class, which is the meat of the distinct plugins where the actual parsing and delivery of data is done. Right now these are independent classes for Xml and Json, but we certainly should have at least an interface in migrate_plus, and they can probably share some implementation as well. Maybe readers should be plugins themselves?

So, that's the horizontal organization. I'm also thinking in terms of the vertical organization - as I said before, I think handling fields and ids through source configuration is more general than this plugin, so that may belong at a lower-level base class. Actually, what I'm now thinking is that support ought to go into SourcePluginBase - fields() and getIds() would have default implementations which use the configuration, and source plugins would not need to implement them in all cases (and where a plugin had its own implementation it would be a default overridden by the configuration). But, that's not something we'll get into core in the immediate future, so my inclination is to put it here in an intermediate class and prove the concept in contrib before submitting as a core patch.
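The default fields()/getIds() idea might look roughly like this. This is a simplified, standalone sketch with a hypothetical class name - the real class would extend Migrate's SourcePluginBase rather than stand alone:

```php
<?php

// Simplified, standalone sketch of default fields() and getIds()
// implementations driven purely by source configuration. The real class
// would extend SourcePluginBase; this is self-contained for illustration.
class ConfiguredFieldsSketch {

  protected $configuration;

  public function __construct(array $configuration) {
    $this->configuration = $configuration;
  }

  // Map field machine names to labels, straight from the 'fields' config.
  public function fields(): array {
    $fields = [];
    foreach ($this->configuration['fields'] as $name => $info) {
      $fields[$name] = isset($info['label']) ? $info['label'] : $name;
    }
    return $fields;
  }

  // ID definitions come verbatim from the 'ids' config.
  public function getIds(): array {
    return $this->configuration['ids'];
  }

}
```

With the wine_role configuration shown earlier, fields() would yield the machine names mapped to their labels, and a plugin-provided implementation could still act as a default overridden by configuration.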

So, thoughts or suggestions?

mikeryan’s picture

I went ahead and factored the fields/ids support out into an intermediate class SourcePluginExtension - dirt simple!

mikeryan’s picture

Oh, and forgot to mention one other thing - right now the source plugin instance itself gets passed through the iterator into the reader - one task will be to pass only the specific information the reader needs through so we can decouple the classes.

benjy’s picture

I don't understand the motivation to pass source fields in via configuration. The source fields should be calculated in the source via the source data. If the fields are not known because the source doesn't have a database connection or API credentials or whatever, then that's the problem we should be solving. E.g. passing in the required configuration for the source to work.

With this approach, will every migration that uses this source with different fields be responsible for creating the schema?

mikeryan’s picture

The source fields should be calculated in the source via the source data.

Unfortunately, that's not how even the SQL sources work now - they have hardcoded lists of source fields, so the difference this functionality would make if it were in core is only in moving the field definitions from PHP to configuration, which I regard as a significant step forward - it at least makes it possible for front-end tools to manipulate the field configuration and documentation. Ideally, what I'd like to see is SQL sources automatically determining the list of source fields from the query, with any source configuration for fields overriding that to add appropriate labels.

But, aside from that, the primary motivation for putting the field definitions in configuration is to support sources like XML where you have to explicitly define your fields, and better to do that in configuration than in PHP, again because it is an enabler for tools.

benjy’s picture

Well, the fields() method is never used in core currently, and it doesn't calculate the values because I don't think we wanted to solve the issue of getting the database connection at that point. However, if we solved that, core could replace all implementations with something like DESCRIBE table_name;

But, aside from that, the primary motivation for putting the field definitions in configuration is to support sources like XML where you have to explicitly define your fields

If XML needs it, then I think it should be specific to that source, not generic.

Also what I said regarding schema still stands which is probably the biggest downside here.

Finally, the commit you added on your sandbox loops over a local $fields variable rather than $this->fields which I think is wrong.

mikeryan’s picture

If XML needs it, then I think it should be specific to that source, not generic.

The thing is, pretty much every non-database source will need it, so let's make sure they do it consistently by sharing it.

I'm not following your point about schema? I don't see why you'd need to declare a schema for the migration-specific field names here any more than you need to in the process configuration... That does remind me, though, that we do need schema for the configuration being introduced here.

benjy’s picture

Well, the process configuration is a special case because it has type: ignore; source configuration is not. If you look through core, you'll see we always provide schema for source configuration.

If you write a test that saves a migration with these fields in the source configuration, it will fail the config schema checker.

mikeryan’s picture

I had expected to make the fields type: ignore as in the process configuration.

Source plugin fields() is now actually getting used! #2630720: Implement drush migrate-fields-source

mikeryan’s picture

Oh, and I was thinking of a more schema-friendly approach, instead of

  fields:
    machine_name:
      label: 'Unique position identifier'
      selector: sourceid
    friendly_name:
      label: 'Position name'
      selector: name

it could be

  fields:
    -
      field: machine_name
      label: 'Unique position identifier'
      selector: sourceid
    -
      field: friendly_name
      label: 'Position name'
      selector: name

More verbose for the YAML hand-crafter, but in the long run, as we build out the toolset, that becomes less of a concern (and I think it makes it a bit easier to build those tools).
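To make the comparison concrete, a config schema for the sequence form might look something like this (a hypothetical sketch - key names are illustrative and nothing like this was committed; the point is that the fixed per-element keys mean no type: ignore is needed):

```yaml
# Hypothetical schema sketch - not committed code.
migrate_plus.source.url:
  type: migrate_source
  mapping:
    fields:
      type: sequence
      label: 'Source fields'
      sequence:
        - type: mapping
          mapping:
            field:
              type: string
              label: 'Field machine name'
            label:
              type: label
              label: 'Field label'
            selector:
              type: string
              label: 'Field selector'
```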

benjy’s picture

Can you explain why it's easier for tools to have a sequence rather than just straight mappings? From a PHP structure, both are fairly simple to work with?

zxaos’s picture

Tangent:

A while back on IRC I had started writing an implementation of a generic REST import source. In the interim, this separate effort has been started. Is there any useful way to add the plugin variant I have so far? It actually runs, but since I'm new to the D8 version of Migrate, I'm not sure how close it is to your targeted approach.

KarenS’s picture

I like the direction this is going. It seems to account for all the oddities I ran into while trying to develop the JSON source plugin. I was working with several totally different JSON sources so I started to see some of the variations that needed to be allowed for.

With respect to the JSON plugin at least, I'd like to fairly quickly get this new branch committed, but because of the need for the matching branch in Migrate Plus I suppose they'll all have to go in at the same time.

Ironically I also have a possible project with an XML source so I may have a chance to try that out as well.

It seems like the CSV plugin might also use the same base? I can't tell for sure if that's the intention.

I haven't tested the changes yet, hoping to do so soon.

KarenS’s picture

Also I'm open to using the same xpath selector pattern in the JSON plugin as is used in the XML plugin for more consistency. The underlying code could explode and count the values to derive the item depth.
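The derivation suggested above is simple enough to sketch (a hypothetical helper, not actual plugin code):

```php
<?php

// Hypothetical helper: derive the JSON item depth from an xpath-style
// selector by exploding on '/' and counting the non-empty parts, so
// '/response/position' implies a depth of 2.
function selector_to_depth(string $selector): int {
  return count(array_filter(explode('/', $selector), 'strlen'));
}
```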

mikeryan’s picture

I'm back! Was working on another project, and then my laptop suffered filesystem corruption that prevented it from completely booting - it took a few days to get a new laptop and get back into a productive state, so I'll be moving this forward in the coming days. I am so happy I was able to retrieve my unpushed changes here... One thing I should note is that I borked up a rebase (or maybe the early signs of corruption did - yeah, let's go with that;), so the current migrate_plus work is now on branch refactor_source_plugins2, not refactor_source_plugins.

@benjy: Can you explain why it's easier for tools to have a sequence rather than just straight mappings? From a PHP structure, both are fairly simple to work with?

My thinking was that this makes it easier because it would be fully described by schema, as opposed to using type: ignore. Haven't looked at your UI lately, but it felt like you were verging on something close to a general configuration editor, which would benefit from full schema. But, I'll defer to you on whether that's useful or not...

@zxaos: In the interim, this separate effort's been started. Is there any useful way to add the plugin variant I have so far? It actually runs, but since I'm new to the D8 version Migrate I'm not sure how close to your targeted approach it would be.

Have you looked at the patches involved here to see what your approach might add that we're not covering (yet)? I certainly welcome contributions to this patch.

@KarenS: It seems like the CSV plugin might also use the same base? I can't tell for sure if that's the intention.

Yes, I'm thinking it could, haven't had time to look into it specifically though. We'll want to get heddn on board with that.

Also I'm open to using the same xpath selector pattern in the JSON plugin as is used in the XML plugin for more consistency. The underlying code could explode and count the values to derive the item depth.

Yes, I was thinking of that as well. The implementation would be entirely in the JSON plugin, so I see that as a follow-up issue in that queue.

So, reminding myself where I left off - I was working towards eliminating the intermediate iterator class pattern that came from the XML plugin, and making the reader classes plugins. At that point, the XML and JSON modules would just contain reader plugins, and despite my original feeling that source plugins should have their own modules, I'm starting to think we're approaching the point where the layer they implement is thin enough that maybe they should just be in migrate_plus after all. Thoughts?

mikeryan’s picture

I've pushed my latest code up to the refactor_source_plugins2 branch, plus pushed up the corresponding migrate_source_xml changes to the refactor branch over there. Don't look too closely, still very much a work-in-progress - my immediate next steps are:

  1. Make the JSON plugin work with this architecture.
  2. Convert the reader classes to plugins (some framework for that is in place).

benjy’s picture

My thinking was that this makes it easier because it would be fully described by schema, as opposed to using type: ignore. Haven't looked at your UI lately, but it felt like you were verging on something close to a general configuration editor, which would benefit from full schema. But, I'll defer to you on whether that's useful or not...

Yes, the form is entirely built from schema now and it supports all widget types, including AJAX "add another" for sequences - although sequences of maps still need some work, mainly in how we document them in schema.

mikeryan’s picture

All right! I now have the reader plugins for XML and JSON working (at least against migrate_example_advanced), I've pushed it up to branch refactor_source_plugins2 in migrate_plus, as well as the "refactor" branches in migrate_source_xml and migrate_source_json. Moving forward from here, next steps:

  1. Clean up the code, properly document the new stuff, etc. Please don't look too closely at the code yet...:P
  2. Move the client/header stuff from the JSON plugin to the general URL plugin, so XML can send HTTP headers, and JSON can work from local files.

A couple of questions for interested parties (in addition to feedback on the general architecture), need to be resolved before committing the work:

  1. Thoughts on moving the JSON and XML reader plugins into migrate_plus (particularly from the migrate_source_xml and migrate_source_json maintainers, who I would of course welcome as co-maintainers here)? As I said, what would be left in those modules is starting to look a little thin to be in separate modules. There would also be the short-term benefit of being less disruptive to anyone currently using those modules. But I'm not strongly wedded to the idea.
  2. Thoughts on the "reader" name for this plugin? It sounds so generic, but I haven't come up with a better idea yet...
mikeryan’s picture

I've pushed up some cleanup changes - fixing docs, a little renaming here, a little public->protected there...

So, I got a look at the feed I'm ultimately supporting with this work today. I had been particularly interested in the HTTP header support since I understood we would need to pass an authentication token. However, as it turns out, the project has a client class that handles the HTTP request (including authentication) directly. I had already been thinking that, rather than detect HTTP connections and use the Client class only in that case, it might be cleaner to have a local file client, so a client class is always involved (and not specifically HTTP-oriented). In this case, we'll want the migration's client class to wrap the project's client class to deliver the XML or JSON payload to the reader. So, my next step is to make the client more general - and, perhaps, make it a plugin. Looking back at @mbaynton's DataAcquisitionInterface thought, perhaps this would be a data_acquisition plugin, and the current "reader" plugin would be a data_parser plugin?

"Everything is a node" used to be Drupal's mantra, then in D7 "everything is an entity" - now in D8 "everything is a plugin";).

mikeryan’s picture

Making some progress here... At least for the moment, my answer to the naming question is to change the "Client" class to a "data_retriever" plugin type, and the former reader plugins will become data_extractor plugins.

A bigger issue is that the current model for the data_retriever won't work for the XML source plugin as now implemented. That plugin uses the XMLReader PHP class to incrementally parse the XML source, so large files don't absorb all available memory. So, the model where the source plugin invokes the extractor, which calls the retriever to get all the data and then creates an iterator over that data, won't fly. I'm also thinking that the iteration over multiple files logically belongs to the retriever rather than the extractor. All of which suggests the retriever, rather than the extractor, should be the source plugin's iterator.
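For reference, the memory constraint can be illustrated with PHP's XMLReader directly (a minimal standalone sketch, not the actual plugin code):

```php
<?php

// Minimal illustration of the memory constraint: XMLReader streams the
// document node by node, so each item element can be handed off one at a
// time without ever holding the whole payload in memory.
function iterate_items(string $xml, string $item_name): \Generator {
  $reader = new \XMLReader();
  $reader->XML($xml);
  while ($reader->read()) {
    if ($reader->nodeType === \XMLReader::ELEMENT && $reader->localName === $item_name) {
      // readOuterXml() returns only the current element's markup.
      yield $reader->readOuterXml();
    }
  }
  $reader->close();
}
```

This is why a "retrieve everything, then iterate" model doesn't fit XML: the reader itself has to be the thing driving iteration.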

Still thinking through the implications of this...

benjy’s picture

How close are these classes going to be to what core has in the aggregator module? Even if we can't reuse that code maybe we should follow their naming convention of fetcher, parser and processor?

KarenS’s picture

Mike, I'm totally open to moving the JSON source plugin back into Migrate Plus. I only set it up separately because you originally wanted it that way. It feels like work on it needs to be coordinated with work on the base class, and that would be easier if it's all in the same project. You could go ahead and add it to Migrate Plus in your branch, then once it's working and pushed into the main branch I can post a message on the other project that it's been moved back here. For that matter, I can post a message now that it's available in your feature branch and will ultimately be in Migrate Plus.

Glad to see you're working on adding headers to the xml plugin, I've already run into a situation where those will be needed.

KarenS’s picture

Also, if you move it back, you'll run into the problem with tests in submodules again. I added files to the JSON source plugin that make it possible to run tests from its subdirectory. You could copy that to other submodules so at least you can run tests locally on them by navigating to their directory.

You need phpunit.xml.dist in the primary directory and bootstrap.php in the tests directory.
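A hypothetical phpunit.xml.dist along those lines (a sketch only - the actual files added to migrate_source_json may differ):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch of a submodule's phpunit.xml.dist; the testsuite
     name and directory are illustrative. -->
<phpunit bootstrap="tests/bootstrap.php" colors="true">
  <testsuites>
    <testsuite name="submodule">
      <directory>tests/src</directory>
    </testsuite>
  </testsuites>
</phpunit>
```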

mikeryan’s picture

Issue tags: +Needs tests

@benjy: Good suggestion, I'll rename data_retriever to data_fetcher and data_extractor to data_parser to better track Aggregator and Feeds (the "processor" stage, of course, corresponds to the core process and destination plugins).

My long-term dream is to make stuff like fetchers/parsers (not to mention Migrate's plugins) components that migrate/aggregator/feeds/etc. could all build from. Also peace on earth.

@KarenS: Thanks, I'll move the JSON class into migrate_plus. And deal with the test business when it's stable enough to update the tests...

mikeryan’s picture

Component: Code » Source plugins

OK, so I've used the current branch for this work with the custom API in our project that needs it, and I got it working with no further changes. I'm feeling good about the general architecture, so I think it's time to stop doing everything in a mega issue - I'll commit what we have here and open distinct follow-up issues to finish it up. That will make it easier for people to test it, provide feedback, contribute to specific follow-ups, etc. Caveat emptor - there *will* be further refactoring, so once you start working with this don't blindly update the module. You do have a little time to play with it though, I'm going off for the holidays so work on this will resume in 2016.

Slightly off-topic - I don't recall who recently asked me about migrating address fields, which I hadn't previously done in D8, but I did get address field migration working in our case, where the incoming JSON has an Address array. Sample:

  # Destination subfields correspond to columns in the field table, source
  # subfields to subfields of Address in the source JSON.
  'field_event/address_line1': 'Address/Address1'
  'field_event/postal_code': 'Address/PostalCode'
  'field_event/administrative_area': 'Address/State'
  'field_event/locality': 'Address/City'
  # @todo: Translate country name to 2-character country code in prepare_row
  #  'field_event/country_code': 'Address/Country'
  # Country code is required - if no value is provided, the location will
  # silently remain empty (a frustrating sticking point in working this out).
  'field_event/country_code':
    plugin: default_value
    default_value: US

  • mikeryan committed 56f37d5 on 8.x-1.x
    Issue #2623012 by mikeryan, KarenS, mbaynton: Implement interfaces and...

  • mikeryan committed 56f37d5 on 8.x-2.x
    Issue #2623012 by mikeryan, KarenS, mbaynton: Implement interfaces and...
mikeryan’s picture

Component: Source plugins » Plugins
mikeryan’s picture

Status: Active » Closed (fixed)

Not sure why I failed to set Fixed in the first place, but given we've got follow-ups in place, finally closing this.