Working on #2621490: [meta] Maximize consistency with JSON source plugin and #2621492: [meta] Maximize consistency with XML source plugin, I believe many of the common needs of XML and JSON sources (and down the line, other sources like file directories) can be addressed with common base classes and interfaces:
- Specifying custom headers on a request
- Migrating from multiple specific files/URLs
- Paging through multiple requests ("next" links)
- Common specification of fields and IDs
migrate_plus seems like the natural place for the common framework, from which specific source plugins in their own modules can inherit.
Comments
Comment #2
mikeryan: The list/item stuff may very well be subsumed by this.
Comment #3
mbaynton: The needs you mention seem oriented towards data acquisition, with a particular bias towards acquisition over HTTP. Basically this makes migrate more than ETL by adding nontrivial logic before 'extract' even happens. From a feature standpoint it's obviously great, but I wonder whether this additional step should really be treated as fully separate, rather than being added to source plugins. (https://www.drupal.org/node/2621492#comment-10605560 refers to it this way.) What about making the source plugins' relationship to this functionality a 'has-a' rather than an 'is-a'? So at a high level, the source plugin has a DataAcquisitionInterface or whatever, which could be a fully-fledged web-spidering multi-source implementation, or just read one file off the local filesystem.
Comment #4
mikeryan: I'm not sure I see acquisition as requiring a separate stage from extraction - the two seem intertwined to me. But, yes, I'm thinking that this shared functionality would be represented in an interface, although we would want to share implementation as well, which suggests a base class with default implementations (or, maybe, a trait).
Comment #5
mikeryan: Naming things is hard... Trying to think what this interface and corresponding base class would be named... It's intended for any file-oriented source (as opposed to row-oriented sources like SQL and CSV) and should work equally well for a local filesystem or for resources accessed via a stream wrapper (particularly http://, of course). Regarding "a particular bias towards acquisition over http", the only aspect here specific to HTTP is the ability to specify custom headers (like authentication tokens) - essential for the use case I'm working towards, and obviously @KarenS had use for it since she built it into the JSON plugin.
The "common specification of fields and IDs" (standard means to define them within source plugin configuration) actually may be more broadly useful than just the file-oriented sources I'm thinking of here, so that may be something to be put into a lower-level base class. Then on top of that we add the multiple file and linking support via an interface (yes, maybe DataAcquisitionInterface) and another layer of base class providing default implementations that both XML and JSON can leverage. I think the header support should go into here as well - when headers are provided in the configuration, they would be applied for any HTTP requests - any source of this type should be able to retrieve from HTTP as well as local files.
Still thinking through the details...
Comment #6
mikeryan: Just as an FYI, as a first pass I'm doing this in a 'Url' abstract base class alongside a UrlIterator base class here in migrate_plus, putting all the common functionality for XML and JSON sources there. I have this working with XML (whose Xml and XmlIterator classes are now very thin layers; nearly all XML-specific processing is in the XMLReader class), and am still applying the pattern to the JSON plugin. On the 'Url' name - I don't think that's too awful, since local filespecs can be regarded as implicit file:// URLs. Open to other suggestions, though.
The upshot is that configuring XML and JSON plugins will be nearly identical, allowing you to very easily switch between the two.
migrate.migration.wine_role_xml.yml (item_selector and selector values are xpaths):
migrate.migration.wine_role_json.yml (item_selector is a depth level, selector is a key at that depth):
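As a rough sketch of what "nearly identical" could look like - the plugin names, URLs, selectors, and field names below are invented for illustration, not copied from the branches:

```yaml
# Hypothetical XML variant - item_selector and selector are xpaths.
source:
  plugin: xml
  urls: http://example.com/wine_roles.xml
  item_selector: /wines/roles
  fields:
    -
      name: id
      label: 'Role ID'
      selector: sourceid
    -
      name: name
      label: 'Role name'
      selector: name
  ids:
    id:
      type: integer
---
# Hypothetical JSON variant - only the plugin, URL, and selector style change;
# item_selector is a depth level, selector is a key at that depth.
source:
  plugin: json
  urls: http://example.com/wine_roles.json
  item_selector: 1
  fields:
    -
      name: id
      label: 'Role ID'
      selector: sourceid
    -
      name: name
      label: 'Role name'
      selector: name
  ids:
    id:
      type: integer
```

Switching a migration between the two formats would then be largely a matter of changing the plugin name and the selector values.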
Next steps:
Comment #7
mbaynton: @mikeryan, thanks for all your updates and discussion - I see them coming in during my workday and have to leave them alone until evenings when I have time for Drupal.
Sorry, is this work (Url, UrlIterator, other things from your first pass) somewhere I can look at for more concrete understanding of the discussion? I don't see recent commits in migrate_plus.
Comment #8
mikeryan: My plan was to get the JSON side working, then upload the relevant patches here, in migrate_source_xml, and in migrate_source_json. But my other project is heating up, and I'm not sure I'll finish the JSON work today, so maybe I should just put up what I've got so far. Or, perhaps more conveniently, push up the respective branches to the projects...
Comment #9
mikeryan: OK, I've pushed the migrate_plus work to branch refactor_source_plugins, and the migrate_source_xml work to branch refactor.
Comment #10
mikeryan: OK, got the JSON source plugin basically working and pushed to the 'refactor' branch in migrate_source_json.
So, now that much of the common functionality is in migrate_plus, let's think about how best to package it. Right now there are three key components:
So, that's the horizontal organization. I'm also thinking in terms of the vertical organization - as I said before, I think handling fields and IDs through source configuration is more general than this plugin, so that may belong in a lower-level base class. Actually, what I'm now thinking is that support ought to go into SourcePluginBase - fields() and getIds() would have default implementations which use the configuration, and source plugins would not need to implement them in all cases (and where a plugin had its own implementation, it would be a default overridden by the configuration). But, that's not something we'll get into core in the immediate future, so my inclination is to put it here in an intermediate class and prove the concept in contrib before submitting it as a core patch.
So, thoughts or suggestions?
Comment #11
mikeryan: I went ahead and factored the fields/ids support out into an intermediate class SourcePluginExtension - dirt simple!
Comment #12
mikeryan: Oh, and I forgot to mention one other thing - right now the source plugin instance itself gets passed through the iterator into the reader. One task will be to pass through only the specific information the reader needs, so we can decouple the classes.
Comment #13
benjy (CreditAttribution: benjy at PreviousNext) commented: I don't understand the motivation to pass source fields in via configuration. The source fields should be calculated in the source via the source data. If the fields are not known because the source doesn't have a database connection or API credentials or whatever, then that's the problem we should be solving - e.g., passing in the required configuration for the source to work.
With this approach, will every migration that uses this source with different fields be responsible for creating the schema?
Comment #14
mikeryan: Unfortunately, that's not how even the SQL sources work now - they have hardcoded lists of source fields. So the difference this functionality would make, if it were in core, is only in moving the field definitions from PHP to configuration, which I regard as a significant step forward - it at least makes it possible for front-end tools to manipulate the field configuration and documentation. Ideally, what I'd like to see is SQL sources automatically determining the list of source fields from the query, with any source configuration for fields overriding that to add appropriate labels.
But, aside from that, the primary motivation for putting the field definitions in configuration is to support sources like XML where you have to explicitly define your fields, and better to do that in configuration than in PHP, again because it is an enabler for tools.
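For example, an XML-oriented source might declare its fields and IDs entirely in configuration, which a front-end tool could then read and edit. This is a sketch only - the key names, structure, and values are assumptions:

```yaml
source:
  plugin: xml
  urls: http://example.com/products.xml
  item_selector: /products/product
  # Explicit field definitions - unlike SQL, there is no query
  # from which the field list could be derived automatically.
  fields:
    -
      name: sku
      label: 'Product SKU'
      selector: sku
    -
      name: title
      label: 'Product title'
      selector: title
  ids:
    sku:
      type: string
```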
Comment #15
benjy (CreditAttribution: benjy at PreviousNext) commented: Well, the fields() method is never used in core currently, and it doesn't calculate the values because I don't think we wanted to solve the issue of getting the database connection at that point. However, if we solved that, core could replace all implementations with something like
DESCRIBE table_name;
If XML needs it, then I think it should be specific to that source, not generic.
Also what I said regarding schema still stands which is probably the biggest downside here.
Finally, the commit you added on your sandbox loops over a local $fields variable rather than $this->fields, which I think is wrong.
Comment #16
mikeryan: The thing is, pretty much every non-database source will need it, so let's make sure they do it consistently by sharing it.
I'm not following your point about schema? I don't see why you'd need to declare a schema for the migration-specific field names here any more than you need to in the process configuration... That does remind me, though, that we do need schema for the configuration being introduced here.
Comment #17
benjy (CreditAttribution: benjy at PreviousNext) commented: Well, the process configuration is a special one because it has type: ignore; source configuration is not. If you look through core, you'll see we always provide schema for source configuration.
If you write a test that saves a migration with these fields in the source configuration, it will fail the config schema checker.
Comment #18
mikeryan: I had expected to make the fields type: ignore, as in the process configuration.
Source plugin fields() is now actually getting used! #2630720: Implement drush migrate-fields-source
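Treating the fields element like core's process configuration would mean a config schema entry along these lines - the schema key name here is a guess, for illustration only:

```yaml
# Hypothetical config schema entry for the shared URL-based source.
migrate_plus.source.url:
  type: mapping
  label: 'URL-based source'
  mapping:
    fields:
      # Free-form, like core's migrate.process - not validated key by key.
      type: ignore
      label: 'Source field definitions'
```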
Comment #19
mikeryan: Oh, and I was thinking of a more schema-friendly approach: instead of
it could be
More verbose for the YAML hand-crafter, but in the long run as we build out the toolset that becomes less of a concern (and this makes it a bit easier to build those tools I think).
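The contrast under discussion is roughly between a plain mapping and a sequence of field descriptions - the field names below are invented for illustration:

```yaml
# Mapping form - terse, but awkward to describe in config schema:
fields:
  id: 'Role ID'
  name: 'Role name'

# Sequence form - more verbose, but each entry is a uniform map
# that schema (and UI tools) can describe as a sequence:
fields:
  -
    name: id
    label: 'Role ID'
  -
    name: name
    label: 'Role name'
```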
Comment #20
benjy (CreditAttribution: benjy at PreviousNext) commented: Can you explain why it's easier for tools to have a sequence rather than just straight mappings? From a PHP structure, both are fairly simple to work with?
Comment #21
zxaos (CreditAttribution: zxaos at GiantGoat Web Development Inc.) commented: Tangent:
A while back on IRC I had started writing an implementation of a generic REST import source. In the interim, this separate effort has been started. Is there any useful way to add the plugin variant I have so far? It actually runs, but since I'm new to the D8 version of Migrate I'm not sure how close it would be to your targeted approach.
Comment #22
KarenS (CreditAttribution: KarenS at Lullabot) commented: I like the direction this is going. It seems to account for all the oddities I ran into while trying to develop the JSON source plugin. I was working with several totally different JSON sources, so I started to see some of the variations that needed to be allowed for.
With respect to the JSON plugin at least, I'd like to fairly quickly get this new branch committed, but because of the need for the matching branch in Migrate Plus I suppose they'll all have to go in at the same time.
Ironically I also have a possible project with an XML source so I may have a chance to try that out as well.
It seems like the CSV plugin might also use the same base? I can't tell for sure if that's the intention.
I haven't tested the changes yet, hoping to do so soon.
Comment #23
KarenS (CreditAttribution: KarenS at Lullabot) commented: Also, I'm open to using the same xpath selector pattern in the JSON plugin as is used in the XML plugin, for more consistency. The underlying code could explode the selector and count the values to derive the item depth.
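That is, the JSON plugin could accept an xpath-like string and derive the depth from it - something like the following, with illustrative values:

```yaml
# XML plugin - item_selector is an xpath:
item_selector: /wines/roles

# JSON plugin today - item_selector is a depth level:
item_selector: 2

# JSON plugin with the proposed pattern - the same xpath-like string,
# exploded on '/' and counted to yield a depth of 2:
item_selector: /wines/roles
```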
Comment #24
mikeryan: I'm back! Was working on another project, and then my laptop suffered filesystem corruption that prevented it from completely booting - it took a few days to get a new laptop and get back into a productive state, so I'll be moving this forward in the coming days. I am so happy I was able to retrieve my unpushed changes here... One thing I should note is that I borked up a rebase (or maybe the early signs of corruption did - yeah, let's go with that ;), so the current migrate_plus work is now on branch refactor_source_plugins2, not refactor_source_plugins.
My thinking was that this makes it easier because it would be fully described by schema, as opposed to using type: ignore. Haven't looked at your UI lately, but it felt like you were verging on something close to a general configuration editor, which would benefit from full schema. But, I'll defer to you on whether that's useful or not...
Have you looked at the patches involved here to see what your approach might add that we're not covering (yet)? I certainly welcome contributions to this patch.
Yes, I'm thinking it could, haven't had time to look into it specifically though. We'll want to get heddn on board with that.
Yes, I was thinking of that as well. The implementation would be entirely in the JSON plugin, so I see that as a follow-up issue in that queue.
So, reminding myself where I left off - I was working towards eliminating the intermediate iterator class pattern that came from the XML plugin, and making the reader classes plugins. At that point, the XML and JSON modules would just contain reader plugins, and despite my original feeling that source plugins should have their own modules I'm starting to think that we're approaching the point where the layer they're implementing is thin enough that maybe they should just be in migrate_plus after all. Thoughts?
Comment #25
mikeryanI've pushed my latest code up to the refactor_source_plugins2 branch, plus pushed up the corresponding migrate_source_xml changes to the refactor branch over there. Don't look too closely, still very much a work-in-progress - my immediate next steps are:
Comment #26
benjy (CreditAttribution: benjy at PreviousNext) commented: Yes, the form is entirely built from schema now, and it supports all widget types, including AJAX "add another" for sequences. Sequences of maps still need some work, though, mainly in how we document them in schema.
Comment #27
mikeryan: All right! I now have the reader plugins for XML and JSON working (at least against migrate_example_advanced). I've pushed it up to branch refactor_source_plugins2 in migrate_plus, as well as to the "refactor" branches in migrate_source_xml and migrate_source_json. Moving forward from here, next steps:
A couple of questions for interested parties (in addition to feedback on the general architecture), need to be resolved before committing the work:
Comment #28
mikeryan: I've pushed up some cleanup changes - fixing docs, a little renaming here, a little public->protected there...
So, I got a look at the feed I'm ultimately supporting with this work today. I had been particularly interested in the HTTP header support since I understood we would need to pass an authentication token. However, as it turns out, the project has a client class that handles the HTTP request (including authentication) directly. I had already been thinking that, rather than detect HTTP connections and use the Client class only in that case, it might be cleaner to have a local file client, so a client class is always involved (and not specifically HTTP-oriented). In this case, we'll want the migration's client class to wrap the project's client class to deliver the XML or JSON payload to the reader. So, my next step is to make the client more general - and, perhaps, make it a plugin. Looking back at @mbaynton's DataAcquisitionInterface thought, perhaps this would be a data_acquisition plugin, and the current "reader" plugin would be a data_parser plugin?
"Everything is a node" used to be Drupal's mantra, then in D7 "everything is an entity" - now in D8 "everything is a plugin";).
Comment #29
mikeryan: Making some progress here... At least for the moment, my answer to the naming question is to change the "Client" class to a "data_retriever" plugin type, and the former reader plugins will become data_extractor plugins.
A bigger issue is that the current model for the data_retriever won't work for the XML source plugin as now implemented. It uses the XMLReader PHP class to incrementally parse the XML source, so large files won't absorb all available memory. So, the model where the source plugin invokes the extractor, which calls the retriever to get all the data and then creates an iterator over that data, won't fly. I'm also thinking that the iteration over multiple files logically belongs to the retriever rather than the extractor. All of which suggests the retriever rather than the extractor should be the source plugin's iterator.
Still thinking through the implications of this...
Comment #30
benjy (CreditAttribution: benjy at PreviousNext) commented: How close are these classes going to be to what core has in the aggregator module? Even if we can't reuse that code, maybe we should follow their naming convention of fetcher, parser and processor?
Comment #31
KarenS (CreditAttribution: KarenS at Lullabot) commented: Mike, I'm totally open to moving the JSON source plugin back into migrate_plus. I only set it up separately because you originally wanted it that way. It feels like work on it needs to be coordinated with work on the base class, and that would be easier if it's all in the same project. You could go ahead and add it to migrate_plus in your branch; then, once it's working and pushed into the main branch, I can post a message on the other project that it's been moved back here. For that matter, I can post a message now that it's available in your feature branch and will ultimately be in Migrate Plus.
Glad to see you're working on adding headers to the xml plugin, I've already run into a situation where those will be needed.
Comment #32
KarenS (CreditAttribution: KarenS at Lullabot) commented: Also, if you move it back you'll run into the problem with tests in submodules again. I added files to the JSON source plugin that make it possible to run tests from its subdirectory. You could copy that to other submodules so at least you can run tests locally on them by navigating to their directory.
You need phpunit.xml.dist in the primary directory and bootstrap.php in the tests directory.
Comment #33
mikeryan: @benjy: Good suggestion. I'll rename data_retriever to data_fetcher and data_extractor to data_parser to better track Aggregator and Feeds (the "processor" stage, of course, corresponds to the core process and destination plugins).
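With those renames, a source definition would name the two plugin types separately - the keys and values below are a sketch of where the branch is heading, not final:

```yaml
source:
  plugin: url
  # Fetch stage: how the raw data is obtained (HTTP, local file, ...).
  data_fetcher_plugin: http
  headers:
    Authorization: 'Bearer some-token'
  # Parse stage: how the payload is turned into rows (json, xml, ...).
  data_parser_plugin: json
  urls: http://example.com/wines.json
  item_selector: wines
```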
My long-term dream is to make stuff like fetchers/parsers (not to mention Migrate's plugins) components that migrate/aggregator/feeds/etc. could all build from. Also peace on earth.
@KarenS: Thanks, I'll move the JSON class into migrate_plus. And deal with the test business when it's stable enough to update the tests...
Comment #34
mikeryan: OK, so I've used the current branch for this work with the custom API in our project that needs it, and I got it working with no further changes. I'm feeling good about the general architecture, so I think it's time to stop doing everything in a mega issue - I'll commit what we have here and open distinct follow-up issues to finish it up. That will make it easier for people to test it, provide feedback, contribute to specific follow-ups, etc. Caveat emptor - there *will* be further refactoring, so once you start working with this don't blindly update the module. You do have a little time to play with it though; I'm going off for the holidays, so work on this will resume in 2016.
Slightly off-topic - I don't recall who was recently asking me about migrating address fields, which I haven't previously done in D8, but I did get address field migration working in our case, where the incoming JSON has an Address array.
Comment #37
mikeryan
Comment #38
mikeryan: Not sure why I failed to set Fixed in the first place, but given we've got follow-ups in place, finally closing this.
Comment #39
docans (CreditAttribution: docans) commented: Has this issue been fixed? I am experiencing the same issue.