Problem/Motivation
The core migrate system needs a way to import things from CSV files.
The immediate use case is the Umami demo profile, now in core in 8.6.x. The imported content is in CSV files and we have code to read them and create the relevant entries. Having migrate_source_csv in core would allow us to remove that code and use migrate instead.
See #2809635-37: Create experimental installation profile, where @larowlan suggested bringing migrate_source_csv into core.
Other good reasons to have CSV as a migration source in core:
- "I use this constantly. I've seen it forked into the Contenta distro, and a few of my own projects (I just want the source plugin, not another module.) It'll also be vital to Commerce integrations as CSV is a common way to import data from an ERP." @mglaman (#4)
- "CSV format is the first format used by non-technical professionals who need to import data and to generate/edit that data with software they already know (MS Excel, not to name it). It's also the easiest way to export data from many ERPs and other large systems." @GoZ (#5)
- "CSV is simply a really commonly used format and it would make migrate more usable by default for many use cases. One could say: oh, it is easy to parse CSV, but it turns out it is not. On top of that, the module also doesn't load the entire CSV file upfront, but rather reads it line by line. I think this makes it a good candidate for bigger migrations out there, too." @dawehner (#14)
- "This would help the Out of the Box Initiative. Let's do it! Consider this my maintainer +1 :)" @phenaproxima (#18)
Proposed resolution
The migrate_source_csv module has a stable release (2.0), has been downloaded almost 100k times, and comes complete with tests. Contenta CMS has been using this for its migrations, too.
Rename the files/namespaces, and move it in, as-is, to the core migrate module.
Since this is a straight port from contrib, discussions about fetchers, etc. are postponed to follow-ups. See #16, #36 and #37 for some of the highlights of the discussion.
Follow-up: #2962091: Adopt fetchers/parsers logic for use by source csv plugin
Remaining tasks
- Patch that renames files/namespaces and moves the contrib code into core. Done.
- Tests. Done.
- Word-smithing PHPDocs. Done.
- Decide if we're okay with a straight port, or if we need to hash out all the fetcher/parser stuff first, figure out the design, class names, avoid conflicts, possible BC hell, etc. Done. See comments #80 through #83. Punted to #2962091: Adopt fetchers/parsers logic for use by source csv plugin.
- Finish cleaning up the docs and tests.
None.
User interface changes
None.
API changes
Adds a new migrate source plugin called "csv" in migration yml definition files. The class providing the plugin (\Drupal\migrate\Plugin\migrate\source\CSV) is heavily commented with PHPDoc.
Adds a \Drupal\migrate\CSVFileObject class (which extends \SplFileObject).
Existing users of this plugin from contrib are unharmed. The name is the same, but we haven't changed anything about how it works. Contrib users simply disable migrate_source_csv module before upgrading to core 8.X.Y and none of their migration files have to be changed to continue working.
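As a rough illustration, a migration using the new plugin might look like the following YAML fragment (the file path, column names, and destination here are hypothetical, not taken from the patch; `path`, `header_row_count`, and `keys` are the configuration keys discussed in this issue):

```yaml
id: example_articles
label: Import articles from a CSV file
source:
  plugin: csv
  # Path to the CSV file; a stream wrapper URI or a filesystem path.
  path: public://import/articles.csv
  # Number of header rows before the first data row.
  header_row_count: 1
  # Column(s) whose values uniquely identify each source row.
  keys:
    - id
process:
  title: title
destination:
  plugin: 'entity:node'
```

This mirrors how migrations already use the contrib plugin, which is why existing migration files should not need to change.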
Data model changes
None.
Original report by @smaz
In #2809635: Create experimental installation profile, we are using CSV files to provide default content for a demo installation profile of Drupal.
In a review, @larowlan suggested bringing the CSV migration sources into core:
https://www.drupal.org/project/drupal/issues/2809635#comment-12381619
I agree that a CSV migrate source would be a very useful feature for core.
The migrate_source_csv module has a stable release (2.0), has been downloaded almost 100k times (not sure if that includes Composer stats?), and comes complete with tests. Contenta CMS has been using this for its migrations, too.
So would it be ok to look at moving this into Core's migrate module? If so, I'm happy to try and turn the module into a patch.
| Comment | File | Size | Author |
|---|---|---|---|
| #150 | interdiff_147-150.txt | 19.41 KB | heddn |
| #150 | 2931739-150.patch | 33.71 KB | heddn |
Comments
Comment #3
heddn: +1 on this idea.
Comment #4
mglaman: I use this constantly. I've seen it forked into the Contenta distro, and a few of my own projects (I just want the source plugin, not another module.)
It'll also be vital to Commerce integrations as CSV is a common way to import data from an ERP.
Not just +1, but +1,000
Comment #5
goz commented: Totally agreed.
CSV format is the first format used by non-technical professionals who need to import data and to generate/edit that data with software they already know (MS Excel, not to name it). It's also the easiest way to export data from many ERPs and other large systems.
+1,000 for me too
Comment #6
olafkarsten commented: Agreed. +1
Comment #7
eiriksm: +999
Comment #8
heddn: Rather than just +1s, let's also explain what use case in core needs it. Just because something is popular doesn't mean it should go into core. And for migrate, we've tended to keep things in contrib that core itself doesn't need.
I know I didn't do that myself, but don't follow my bad example ;)
Comment #9
eli-t: The immediate use case is the Umami demo profile, now in core in 8.6.x. The imported content is in CSV files and we have code to read them and create the relevant entries. Having migrate_source_csv would allow us to bin that code and use migrate instead.
Comment #10
dercheffe: IMHO this makes a lot of sense, especially if you want to create different content types with different fields (maybe even new modules/successors of legacy modules, but with the same data/content).
I would also vote for these modules in core (they both support CSV as well):
https://www.drupal.org/project/views_data_export
and
https://www.drupal.org/project/feeds
Comment #11
heddn: To get a more practical discussion going, can someone take it upon themselves to roll a first patch converting the existing code base from contrib into something added to migrate.module in core? We'd need the source plugin, the file object, and the tests all moved/renamed. We also need to handle BC, so a rename of the plugin ID could take place. One suggestion is csv_file; perhaps someone else could come up with a better one.
Comment #12
mglaman: I think it should be `file_csv`. The plugin is for a file source which is in the CSV format. This opens up a naming convention which better handles other file formats; for instance, it allows for file_json or file_xml.
Comment #13
heddn: I know it is a bit of bikeshedding, but JSON and XML tend to be URL data sources rather than files, while CSV is almost always a file. It really should be named 'csv', but that name is already used in contrib. I wonder if there is any way to hack things so we support BC and still use the same name. We could try it at least.
Comment #14
dawehner: I think this really makes sense. CSV is simply a really commonly used format and it would make migrate more usable by default for many use cases.
One could say: oh, it is easy to parse CSV, but it turns out it is not. On top of that, the module also doesn't load the entire CSV file upfront, but rather reads it line by line. I think this makes it a good candidate for bigger migrations out there, too.
Comment #15
mglaman: Maybe so for JSON. But many times I have had to deal with systems which FTP files to a server, where they are read and processed as either CSV or XML.
Comment #16
mikeryan: On the last point: the last big project I did involved JSON files (provided as attachments to Jira issues, if you must know).
A larger point, though, is that retrieval of file/stream-oriented sources is best separated from parsing of the retrieved content. migrate_plus provides a source plugin which encapsulates separate fetcher and parser plugins (the fetcher further making use of authentication plugins), and uses this to support XML and JSON. CSV sources would also fit this model naturally, and there is an issue to do just that (#2867090: Merge migrate_source_csv into migrate_plus as a parser plugin). Anticipating that XML and JSON sources will eventually follow CSV into core, it would make sense to me to adopt this design from the beginning.
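For context, the fetcher/parser separation described here looks roughly like the following in a migrate_plus migration definition (a sketch based on migrate_plus's url source plugin; the URL and item selector are made up for illustration):

```yaml
source:
  plugin: url
  # The fetcher retrieves the raw data (here over HTTP).
  data_fetcher_plugin: http
  # The parser interprets the retrieved data (here as JSON).
  data_parser_plugin: json
  urls:
    - 'https://example.com/export/items.json'
  item_selector: /items
```

A CSV parser plugin would slot into the same structure, with the fetcher handling retrieval and the parser handling the format.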
Comment #17
cosmicdreams commented: That makes sense from a classic ETL (Extract Transform Load) perspective, which is sometimes described as ETTL (Extract Transfer Transform Load).
Comment #18
phenaproxima: This would help the Out of the Box Initiative. Let's do it! Consider this my maintainer +1 :)
Comment #19
mglaman: I'm going to roll a patch just to get things kicked off. If it gets rejected because we want fetchers and parsers per #16, then so be it! But a patch will help move this along :)
Comment #20
mglaman: Here we go!
I didn't rename the plugin; I kept it as CSV.
So we could rename it, or keep it the same and allow contrib to override the plugin definition. I have a feeling that will not fly over well, but it is a possibility.
Comment #22
mglaman: Test quirks that didn't fail in contrib but blew up here.
Comment #23
mglaman: Fixes for those two issues.
Comment #25
mglaman: I missed this. The class name does not match the filename.
Comment #26
yogeshmpawar
Comment #27
yogeshmpawar: Added a class name which matches the filename and also added an interdiff.
Comment #29
rakesh.gectcr: Changed the location of the test to Drupal\Tests\migrate\Kernel\source; let's see how the test goes.
Comment #30
mglaman: CSVKernel.
This will fail.
Comment #32
rakesh.gectcr: Let's hit it again... :)
Comment #34
mglaman: CSVKernelTest, not Kernal.
Comment #35
rakesh.gectcr: :) Thanks, Matt...
Comment #36
heddn: I'm not sure how important it is to block things on #16.
I did some quick design work this morning on what we'd need to do to support the fetcher/parser approach, and it is significantly different from what migrate_source_csv does today.
So, do we need to be able to support fetching the CSV from a non-local endpoint, say via an HTTP request? Do we need to be able to support multiple CSV URIs? If the answer is no, then we don't need fetchers/parsers. But I feel like we should have those things.
With that in mind, here's some design:
- A new URI source plugin.
- The parser can define a default fetcher. For CSVs that should be something akin to file_copy, since file objects require a local file.
- The current CSV class code becomes the parser.
- Other possible fetchers are http and file (file doesn't copy; it just uses PHP file streams).
- The parser would mask the presence of the fetcher as much as possible, but would pass config options along to the fetcher.
The fetcher interface in migrate_plus is pretty messy right now and assumes an HTTP resource. I think it should be less restrictive and return an \SplFileObject of some sort. Then if the parser needs to read a whole file into a JSON or XML parser, it can use $file->fread($file->getSize()).
Comment #37
heddn: Here's a summary of a conversation in Slack with @mglaman and @phenaproxima:
phenaproxima: we should have fetchers but not add any extra overhead or complexity from fetchers.
mglaman: having local files is enough for Commerce; we used something outside of Drupal to fetch the files locally
heddn: leave it to something outside of Drupal to retrieve the files locally. Or have contrib build that feature.
So, I'm going to suggest that having fetchers, while nice if easy... isn't critical. It isn't super easy. So let's implement without them and leave it to contrib or follow-ups in core if we find they are super critical. My $.02: fetchers won't become critical, because source csv in contrib became a top 100 module without it and it wasn't a common support request in the issue queue.
Comment #38
heddn: So, in the last few weeks a fair number of patches landed upstream that added test coverage and various fixes. Here's an update for all of them.
Note, I removed the schema yml, because that was only there for migrate plus integration.
The next steps are to review the doxygen and see if things fail on the test bot. Doxygen, because I think it's a little lacking.
Comment #40
heddn: OK, doxygen added. PHPCS stuff fixed, and hopefully the test failures too?
Except for the 80-character column wrapping on the doxygen. I didn't do that yet while we are discussing the language; I find it's really hard to wordsmith when I'm constantly fixing the 80 chars on comments. Let's land on good docs, then do the line wrap once.
Comment #41
heddn: Fixing a copy/paste error.
Comment #42
dww: I love migrate_source_csv. I don't mind installing it from contrib, but if we want to move it into core for core's benefit, that's fine with me.
@heddn asked for comments/wordsmithing on the phpdoc comment for the CSV source class. Here goes:
... "Defaults to 0." ? Don't we usually say what happens when you don't define something?
Also, "until the header row" is wrong. Maybe this is clearer:
" - header_row_count: (optional) The number of rows of header information in the CSV file before the first data row."
Uhh, is that really optional? I remember migrations blowing up spectacularly if you don't define that correctly.
Ditto about "optional". If you don't define this, you have no way to access any of the source values, right? Seems this is required.
Sorry, I didn't realize defining a header_row gives you this automatically! Learn something new every day. We should probably say so right here, not just if you closely read the examples.
Furthermore, the intended input for it is something like this:
So it's not just a "Numeric key 0-based index of the columns". It's an array with numeric keys starting at 0, where the values are column machine name => column human-readable name associations. Not sure of the most concise/elegant way to write that, but it seems crucial to somehow explain that it's expecting a nested array. We see it in the examples a little further down, but we should be clear up here, too.
Furthermore, do the human-readable names *ever* get used anywhere? I haven't seen it, but maybe I haven't triggered the appropriate error condition where they would be visible.
Anyway, how about this?
" - column_names: (optional) Index of the columns in the source CSV file. If the file defines a header row, the values in the header row are used as the machine name of each column. If defined explicitly in the migration configuration, each entry in this array must have a numeric key, starting at 0, and the value should be a mapping of machine name to human-readable name for that column.
Or something. :/ Sorry. This is a bit confusing to document clearly.
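For instance (a hypothetical fragment, per the behavior described above), a file with a header row could still have its machine names overridden explicitly:

```yaml
source:
  plugin: csv
  path: public://import/people.csv
  header_row_count: 1
  keys:
    - id
  # Each entry has a numeric key (the 0-based column index); the value maps
  # the column's machine name to its human-readable name.
  column_names:
    0:
      id: 'Unique identifier'
    1:
      first_name: 'First name'
```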
Defaults to a comma ','.
Side note: Duh! ;)
Okay, you get the idea. Ditto all the other optional values...
Took me a second to realize all the text at the bottom was related to this example, not the next one.
Also: "It defaults to a string/varchar database column." huh?
Oh wow, I've never seen this, either: "Here we define the database schema to be a huge text column." We should fix the docs for "keys" above to talk about this schema stuff, not buried in these examples. I didn't realize that plumbing existed.
Maybe something like this:
Here's a proposal for the next example block:
I removed the bogus header row from the example my_file.csv here, since that would break the example migration, and it's just confusing.
Also, it seems valuable to have the advanced example show something with multiple columns as keys, since that'd be a more realistic example.
Meta: I should have been editing all this in emacs and posting a new patch, not trying to write it here. :/
Okay, I'm going to stop here for now and check back in with @heddn if this is actually helpful...
Comment #43
dww: Sorry, here's #42 as a patch and an interdiff. Much better.
Comment #44
jofitz: Setting to Needs Review to run the testbot.
Comment #46
jofitz: Corrected coding standards errors.
Corrected typo that was causing CSVSourceYieldTest to fail.
Unable to replicate the other two test failures; hoping the other changes might work some magic.
Comment #47
heddn: Improvements to the docs.
Comment #48
heddn: And a few more improvements. We're close enough that I formatted them to 80 chars.
Comment #50
dww: Even more better PHPDocs. ;)
Note: I added a @todo about file_class:
Not sure if we should do that right now as part of this patch, or handle it in a follow-up. But it seems that if we let people declare their own class, we should enforce an interface their class must implement if it's going to work for this.
Comment #54
heddn: Any values explicitly defined in column_names will override...
For now, it really should extend CSVFileObject.
Why this change?
Comment #55
dww: 1. Sure.
2. Sounds good (for now). Thoughts on if we want to provide (and enforce) an interface as part of this patch, or move that to a followup?
3. Because of 80 character wrapping.
Cheers,
-Derek
Comment #56
heddn: Hopefully this fixes up the last few tests. More docs tweaks.
Comment #58
cosmicdreams commented: OK, time for a code standards pass? A couple of things I'm seeing.
Comment #59
heddn: Oops, let's add the interface too. This should be ready for any type of review.
Comment #60
dww: Great, thanks for adding the interface!
Now we need:
a) CSVFileObject should say it implements it.
b) The docs for file_class in CSV.php should reference it.
c) [optional] Something (?) should enforce that whatever you put in file_class implements it. There was talk of the CSV constructor throwing an exception.
Comment #61
dww: I'm going to work on my list from #60, and maybe fix the test failures, too. Stay tuned.
Comment #62
dww: Whoops, somehow I missed that A and C were already done when I skimmed the interdiff. Sorry about that. ;) Edit: Oh, because those changes were not included in the interdiff.
So here's a patch for B.
I briefly looked at the tests, and it's not immediately obvious why they're failing. :/ My local setup isn't currently running tests, so I'll leave that for someone else to sort out.
@heddn: Can you check your editor wrapping settings? I'm working in an emacs window exactly 80 chars wide, and your stuff keeps extending off the screen. That's why the interdiff is larger than it needs to be, since I keep re-fixing the 80 char wrapping. :/ Please either fix your editor's wrapping, or turn it off so it preserves what I'm uploading.
Thanks,
-Derek
Comment #63
dww: P.S. @cosmicdreams: Sorry I missed your comment. I'm done with this for now. I'll assign this to myself if I'm working on it again. Meanwhile, feel free to assign it to yourself if you want to work on something. Thanks.
Comment #65
dww: Ugh, sorry for the noise. In Slack @heddn convinced me my setup was wonky. I double-checked, and lo, my fill-column was actually set to 78.
Here's the fixes from #62, wrapped @ 80, not 78.
Still not sure WTF is causing those tests to fail. I haven't looked. No sense having the bot re-test this.
Comment #66
heddn: Thank goodness for tests. This should fix the final failures. Any chance we can upgrade our doctrine annotation parsing any time soon to get rid of this ugly workaround?
Comment #67
quietone commented: Pretty sure these are sentences, and need capitalization and a full stop.
s/defines/includes/ ?
No need for Simple and Advanced in the next example. Just 'Example with minimal options' says it all.
The description of path is verbatim from above and not adding value, in my opinion. I'd prefer a paragraph format explaining what is happening here.
Advanced can be removed.
As before, an explanation versus description would be better, I think.
Comment #68
rakesh.gectcr
Comment #69
heddn
Comment #70
heddn
jofitz: The patch in #70 does not address #67.2 (this may or may not be intentional).
Comment #72
dww: 67.2 was a review concern, and I don't think it helps improve the meaning of the code comment. A file has to define values in its header row for it to matter and be useful. It's not like you simply include "#header row" in your file; you have to define what it is by naming all your columns. I dunno, we're seriously nitpicking and bike-shedding now. ;)
Generally, I think the paragraph version makes these comments a bit less useful, since it forces people (with short attention spans) to read an entire block of text if all they want to know is why a specific key is defined. I agree we can skip comments where there's nothing added over the original doc comments, but I'd personally rather see these still in the style of #66, not #70.
That said, I don't want to post another patch if it's going to be a pushing match. ;) If everyone else thinks a paragraph is better, I'll let it go.
Cheers,
-Derek
[edited for clarity]
Comment #73
quietone commented: There was quite a bit of effort in getting the source, process, and destination plugins in core commented with examples and explanations, and in a consistent style. Since this is going into core, it too should use the same style. Setting to NW because of that.
Personally, I do prefer lists and I often think in lists. And yet, I do find that having the plain language explanation can be very helpful at times to clarify a point that for some reason I didn't get in the details of the plugin options.
Comment #74
dww: @quietone: Please look at the patch and/or interdiff. It's already in the style you asked for.
The only reason this was "needs work" was a (IMHO unhelpful) change for s/defines/includes/. This needs review, not work.
Comment #75
quietone commented: @dww, sorry, my last comment does seem out of sequence. Anyway, my concerns from #67 are addressed, including 67.2. So that is done.
What is next here? I don't have time to do a full review, but I suggest that the discussion of using a fetcher be put in the IS, because someone is going to ask about it.
Comment #76
mglaman: When I read "fetcher" I think of a remote source, which seems to blow up the scope of this source plugin. The current source plugins do not use a fetcher, correct? Or would the fetcher have a "local" option which just allows sourcing the file from the local disk? A lot of the comments here have echoed the same thought: keep it simple so we can source local CSVs via core alone as quickly as possible.
Comment #77
dww: I'm working on a major IS update. Stay tuned.
Comment #78
dww: First pass. Summarizing the fetcher/parser debate is still to do.
Anyone else want to take a turn with that?
Comment #79
heddn: IS updated to discuss fetchers, and a follow-up (#2962091: Adopt fetchers/parsers logic for use by source csv plugin) opened to continue that discussion.
Back to NR.
Comment #80
dww: Cool, thanks for opening the follow-up. I did a few other minor edits to the summary, and added this as a remaining task:
Generally, I'm fine with getting this in sooner and leaving #2962091: Adopt fetchers/parsers logic for use by source csv plugin postponed for now. However, if we foresee doing the whole fetcher/parser system, "csv" makes a lot more sense as a parser plugin name than it does as a source plugin. Whenever the dust settles on #2962091, are we going to be stuck with some BC hell while we try to rename classes? Or is the vision that the parser plugin namespace will be distinct, and it won't matter, and folks who have "source: plugin: csv" in their migration will continue to be supported somehow? Yes, I understand plans can change once you actually start working on something, but I'd love clarity on why committing this patch, right now, as-is, isn't going to make our lives harder in the future.
Echoing @mglaman in comment #12, perhaps the plugin_id for this should be called "file_csv" for now, instead of parking on "csv" itself?
Regardless, removing 'Needs issue summary update' tag.
Comment #81
heddn: re #80: What @phenaproxima and I had discussed was something that wouldn't be BC hell. By sticking with the same source plugin name and a straight port, folks would just need to disable the contrib source csv module, because the plugin in core has the same one-for-one features, even the same name.
Renaming would cause more issues.
Comment #82
dww: Sorry, you misunderstand my concern in #80. I already summarized #81 in the edit to the summary I did in #78:
What I'm talking about in #80 is that N months from now, once this is in core, people like me that are relying on CSV migrations will have happily seen our contrib-driven CSV migrations working with CSV in core without modification. However, at some point, #2962091: Adopt fetchers/parsers logic for use by source csv plugin is going to introduce a slew of new classes, plugin ids, etc. Someone is going to say "The 'csv' id is all wrong for a source plugin! That's really a parser. Elegance++" People like me are going to say: "screw you, my migrations are already working, please don't re-re-break them!" Fight ensues. ;)
That's the situation I'm trying to mitigate by treading a bit more carefully now...
Maybe we can publicly agree, right here, that we're not going to do that in #2962091, commit "ourselves" to working around the issue that 'csv' is already taken as a source plugin id, and whatever happens in there is going to jump through a few more hoops to keep it working. That's all I'm saying.
Are we cool with that? If so, I'm going to copy this text into #2962091 to make it clear whatever refactoring happens there has to keep this working.
If we're not cool with that, please say so now.
Thanks,
-Derek
Comment #83
heddn: The migrate module is stable. Therefore, once this lands it would have to abide by the BC policy, per https://www.drupal.org/core/d8-bc-policy. So, when the rubber meets the road and we want fetchers, can we do it in a BC manner and keep everyone happy? I'm pretty sure we can.
The fetcher would just need to know what parser to use, and this existing CSV source would simply /also/ become a parser. The fetcher would retrieve the file and put it in the location the csv plugin asks for. The code that makes all this magic happen might be ugly, but csv and users of these new fetchers wouldn't need to see the ugliness.
Comment #84
dww: Great. You're now on record saying so, both here and in #2962091: Adopt fetchers/parsers logic for use by source csv plugin. ;) Updated both summaries accordingly.
No more remaining tasks, other than probably a thorough review and RTBC from someone who hasn't yet put their hands all over this code...
Thanks!
-Derek
Comment #85
heddn: Seeing this is on the roadmap for 8.6 for both migrate and OOB, let's bump to a major.
Comment #86
quietone commented: Thanks for the updates to the IS; very thorough. I haven't done a full review, mostly because I can't RTBC this anyway, since I worked on it in migrate_source_csv.
Comment #87
phenaproxima: It's getting there!
Can we expand this doc comment to explain what constitutes a "CSV file", as defined by this interface? (i.e., it has header row(s) and named columns) If not now, then we need a follow-up to document this better.
Should be "Returns the number of header rows".
"...zero if there are no header rows."
Let's rephrase these similarly.
The doc comments here could also use some fine-tuning and better type hints (string[] rather than array).
I don't think we need this use statement, since CSVFileObject and CSVFileInterface are in the same namespace. Also...should these not be in the Drupal\migrate\Plugin\migrate\source namespace? They are source-specific, after all.
This doxygen isn't very helpful. Can we expand what a "CSV file object" is? Maybe "a wrapper around CSV data in a file"?
numeric index
I wonder if 1 might not be a more sane default here. I imagine that most CSV files have a header row (correct me if I'm wrong), so it's a reasonably safe default assumption.
The type hint should be string[].
How will this prevent NULL arguments from going to SplFileObject?
$this->csvColumns is not a thing.
Man, I wish we could use scalar type hints.
Could we cite a link here that explains what a BOM is, and how one might be sure that it is set correctly?
Is this an absolute path, or a relative one?
The first sentence is kinda obtuse. Can we just say: "An array of column names in the CSV file. Can be an array of strings, or an array of arrays. If declared as an array of arrays..."
I think we need to rephrase this, because it's over-explaining the functionality in a very confusing way. If I'm understanding it correctly, column_names is a simple key-value map where the values are the column name in header row, and the keys are the corresponding source row property names. A quick example (no need for code) would also help.
Can we rephrase? Maybe something like: "Fully qualified (including the full namespace) name of a class that can read the CSV file."
And what does a bitmask value of 15 mean? We should probably link to external documentation for \SplFileObject to explain further.
USA #1? =P
"in local storage" is weird. How about just "a local file"? Also, let's move this to the explanation of the 'path' configuration key.
I think these sentences are both confusing and redundant -- we pretty much explained all of this already.
Why is this a nested array?
"Example of keys" is weird. How about: "In this example, we override the default database schema so that the 'id' column will be a large text column."
Are we using this anywhere?
Or this?
Let's just call getConfiguration() once, after setConfiguration(), and keep re-using its result.
Let's get the path once so we don't have to keep calling getConfiguration(). It'll be easier to read.
Same here. Calling getConfiguration() repeatedly makes the code harder to read.
This needs to be getIds(), not getIDs().
Comment #88
dww: @phenaproxima: Thanks for the excellent and thorough review! I don't have time right now to re-roll this, but here are a few quick replies for whoever does:
Please no to this change:
a) I use/see CSV files all the time without a header row.
b) It defaults to 0 in contrib.
c) If we silently change it and get this wrong for existing migrations, we introduce data loss.
Yes. :( Not sure if this is the right spot, but I've wanted to know this myself, and I see people constantly asking how to make this a relative path to the directory of the module providing a given CSV migration. From what I can gather, here are all the ways people can/do use this:
a) URI like public://import/csv/example.csv and put the CSV file in your public files dir.
b) Absolute path.
c) Relative path from the site root.
d) https://www.drupal.org/project/system_stream_wrapper or #1308152: Add stream wrappers to access .json files in extensions
e) Implement hook_migration_plugins_alter() along the lines of:
No, because that's not what this knob is doing. It's specifying which of your columns define the unique key for the row. Maybe a single column name, or an array of column names that together define the unique key.
Yeah, I don't love this comment or how this knob works. But no, you're not exactly understanding how it works (further proof it all needs work). The keys of this knob are always a numeric index to specify which column in the CSV you're talking about. The values are either a string (the property name to refer to column #X with in the rest of your migration) or a nested array, where the key is the property name and the value is the (never used anywhere?) "human-readable names" of each column.
This knob is required if you don't have a header row, since it's the only way to map column numbers to property names.
If you have a header row, you probably don't want to define this, but if you do, whatever you put in here will overwrite whatever was declared in the header row.
Unless the "human-readable" name thingy is A Thing(tm) somewhere I'm not aware of, I'd love to kill that aspect of the code and enforce that this knob always has the same structure. Flat array of integer keys and string property name values.
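To restate the structure described above as a fragment (hypothetical column names, based on my reading of this comment, not taken verbatim from the patch), showing both accepted value forms:

```yaml
# Flat form: 0-based column index => property name.
column_names:
  0: id
  1: title
---
# Nested form: 0-based column index => {property name: human-readable name}.
column_names:
  0:
    id: 'Unique identifier'
  1:
    title: 'Title of the article'
```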
#20 Yes, let's definitely fix that. It's alphabetically (and morally) last, so it should go to the end of the file. ;)
#23: Because of the confusing and fubar nature of #17.
#25: Not that I know of.
#26: Definitely. It's how we know the migration row key.
Otherwise, yes to everything you said.
Thanks again!
-Derek
Comment #89
phenaproximaHi @dww, thanks for the equally clear and thorough response! I really like the way you explained some of the configuration options, so for whoever works on this patch next...use the comment in #88 to rework some of the documentation :)
Comment #90
dww: Re: #89: Thanks. :)
Re: #87.17 + #88.17
If we change the structure of keys to always be a flat array, what happens to existing migrations that are (pointlessly) defining human-readable descriptions? I'd rather this source plugin didn't barf on them with exceptions, forcing them to rewrite their working migrations. So maybe we have to leave support for the complex nested thing, but somehow flag it as deprecated? If I knew it was going to be caught and nicely handled by default, maybe there's some kind of StopUsingDeprecatedMigrationSourceSettingsException we could throw? ;) Or something?
Short of any of that, I'd be happy for the public-facing docs to pretend this thing must be a flat array, even if the code actually supports both formats.
Comment #91
mikeryan: Yes, it is: the descriptions are returned as the values of the source plugin's fields() method (machine names are the keys) and are displayed by migrate_tools' drush migrate-fields-source command, as well as in the UI, allowing you to document fields with something a little more helpful than "id".
Comment #92
jofitz: Addressed all of @phenaproxima's comments in #87 except for:
9. Based upon @dww's response in #88.
11.
16. Based upon @dww's response in #88.
17. Based upon @dww's response in #88.
23. Based upon @dww's response in #88.
Comment #94
dww: Re: #87.25: As @mikeryan said in #91, that's A Thing(tm). Behold:
So yeah, #92's patch and interdiff are incorrectly removing that, and we need to keep it.
Test failures look like disagreement over the location of the CSVFileObject in the namespace. I agree it could live in \Drupal\migrate\Plugin\migrate\source instead of directly in \Drupal\migrate, but we have to get all parts of the code and tests to agree on that. ;)
Cheers,
-Derek
Comment #95
jofitz: Fixed the test failures.
Comment #96
heddn: All feedback from #87 is now addressed, between the awesome work from dww and Jo Fitzgerald and a small nit I had about keeping variable naming consistent. This is ready again for another round of reviews.
For #87.11, this was needed when we were originally setting the flags in the file constructor. That code was removed but the constructor was just left behind; I've cleaned that up in this latest patch.
Comment #97
heddn: Just noticed this:
Nit: let's order these both by country alpha. Leaving at NR for other feedback.
Comment #98
smaz: Do we need a .htaccess here to prevent direct web access to any CSV files used for testing? We have one in the Umami profile for the default content:
https://cgit.drupalcode.org/drupal/tree/core/profiles/demo_umami/modules...
Comment #99
mglaman: RE #98: What if the server is nginx? The .htaccess won't make a difference at that point. I don't see the benefit, unless there's some history in the thread which added it for Umami.
Comment #100
phenaproxima: Preventing access to the CSV file(s) is not in scope for this issue. It's worth looking into and talking about, at least, but let's open a follow-up issue for that if it's needed.
Comment #101
eli-t: Linking #2941488: Add a .htaccess file to core/profiles/demo_umami/modules/demo_umami_content/default_content/images to prevent direct downloading of images for context, where .htaccess was added for Umami images. Note this was a follow-up issue to the issue that introduced the files being protected.
Comment #102
quietone: Added a follow-up, #2978827: Add a .htaccess file to prevent downloading of csv test files, to decide on adding a .htaccess file.
Comment #103
phenaproxima: Looks good and pretty logical, but I found a number of relatively minor things. Mostly documentation stuff.
Shouldn't this be "...source property names for each data row in the file"?
Can we rephrase this: "If there are multiple header rows, the column names are taken from the final one."
There has been no previous mention of machine names. We need to explain what a machine name is in this context, and how it's used.
I wonder -- given the way this works, do we actually need CsvFileInterface? Why not just create a new class which extends \SplFileObject with the appropriate methods, and enforce that file_class is an instance/descendant of that?
'15' tells me nothing. Are there constants we could use here? Like
Let's add a comment explaining that this will create a single-column text key, using the value from the 'id' column of the CSV file.
Using % as an enclosure is really jarring, and I can't imagine it'd be very common. Let's just use a single quote instead.
Can we add a usage of this character to the example data, so as to illustrate what it's for?
We need a comment explaining what this flag will do.
We already explained that file_flags is a bitmask, so we can remove this sentence.
Why not use file_exists() here?
This should check if $plugin_configuration['keys'] is an empty array.
Nit: There's an extra blank line here.
Since we do this at least twice, how about we instantiate the file reader class in __construct(), then call a method on it to get the path instead?
Doc comment needs to be expanded. Alternatively, if we move the logic from initializeIterator() into the constructor, we can just move all of this method's logic into initializeIterator(), where it makes more sense (IMHO).
The doc comment says that this will default to text, not string. Let's make it consistent.
We can reduce this to foreach ($this->initializeIterator()->getColumnNames() as $column). We don't need to reassign $fields, just return $this->getConfiguration()['fields'] + $fields.
Do we not inherit this from SourcePluginBase() already?
I'm not sure we need this method. Why can't calling code simply use this plugin as an iterator, like all other source plugins? If it's to maintain BC, we should deprecate this right now.
Correct me if I'm wrong, but can't we use ConfigurablePluginTrait for this?
We can use the ::class form here.
Making this final is a good idea!
Comment #104
heddn: I'll see when I can respond, but a quick thought: the second half of this review should result in better code comments, not just blind changes to the code. I'm not able to go into more detail.
Comment #105
smaz: @heddn, ping on your above comment: people are working on migrate + Umami at the moment at DevDays, so we might be able to get some help on this.
Comment #106
heddn: Try to answer what you can. But 103.14, 19, and 21 (as examples) are there for a reason, and we just need to add comments explaining why. I wish my life weren't so busy right now or I'd go into more detail.
Comment #107
heddn: I think I hit most of the above points of feedback. Exceptions are as follows.
103.8: There is already a comment about what this does.
103.18: No, it doesn't implement ConfigurablePluginInterface. That is issue #2937177: Migrate plugin base classes should implement ConfigurablePluginInterface.
103.19: because of #2954413: Add getter for file.
103.20: ConfigurablePluginTrait doesn't exist; maybe you are thinking of the above ^ issue?
Comment #110
phenaproxima: "name(s)" should be "names". If there's one item in that array, so be it; but the (s) is jarring here.
We start talking about a database schema -- but that only applies to ID mapping with the Sql map plugin. Which is the default, of course; but if you don't know that, then this tickles the WTF reflex. So we should probably link to the Sql class documentation for reference.
Also, instead of "defaults to a schema of 'string'", can we just say "By default, each key is stored as a string"?
"names" is duplicated.
"...a numeric key, starting at 0". Also, why the hell would this be an array of arrays? It seems like column_names should be a simple hash (key => value); having it be a nested array is really confusing. So, if we can avoid that, so much the better.
I'm thinking we should remove this option entirely. I can't honestly see too many people needing to extend the file class, or change it. (But it's worth grepping contrib to prove me either right or wrong on that point.) If they do need to extend it, they can also extend this source and implement their own custom logic around that. Also, CSVFileInterface is gone from this patch (and rightfully so), so we should remove all mention of it.
Let's move this @see up to where the other one is.
Ditto.
The constructor isn't called during discovery, so I'm not sure why this is relevant.
Why do we need to cache the file object? Isn't it the whole point of initializeIterator() that we return a fully prepared iterator, and let the calling code deal with it?
This is not a problem with the plugin definition; it's a problem with configuration. I think this should be InvalidArgumentException.
I think that the creation of the file object should be done in a protected method, so that ninjas can override it if needed. This would also allow us to remove the arcane file_class option.
I'm not sure why this method exists; it seems like its contents should be part of initializeIterator().
I think this can all be one line.
It's really weird to me that we're setting $ids[$delta] or $ids['value'] depending on what $value is. Shouldn't $ids always be keyed by the name of the key?
We shouldn't be calling initializeIterator() just to get the column names; it forces us to have $this->file and make initializeIterator() mutate its state, which is awkward. Let's just add a new protected method to derive the column names without needing to touch the iterator.
Wat? We're relying on undocumented configuration properties here ('fields' isn't mentioned in the doc comment).
What is the purpose of this method? I don't see it being used for anything. We should remove it unless there's a good reason to keep it around.
There's more to say here. Isn't $column_names supposed to be a key => value hash?
This is neat, but why do we have it? I don't understand what we're testing here.
Ideally, this file would include data that would exercise the delimiter, enclosure, and escape characters.
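For illustration, here is how those three characters interact during parsing, using Python's csv module as a stand-in for SplFileObject::setCsvControl() (the sample data is invented):

```python
import csv
import io

# Invented fixture: ',' delimiter, "'" enclosure (quote char),
# '\' escape char -- the three settings the test data should exercise.
raw = (
    "id,name,motto\n"
    "1,'Lutz, Mike',escaped \\, comma\n"
)

reader = csv.reader(
    io.StringIO(raw),
    delimiter=",",
    quotechar="'",
    escapechar="\\",
)
header, row = list(reader)

# The enclosure protects the comma inside 'Lutz, Mike'; the escape
# character protects the comma in the unquoted third field.
print(row)  # ['1', 'Lutz, Mike', 'escaped , comma']
```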
Why not import the data as user accounts instead? Given the data we're testing with, that's a more realistic scenario.
Rather than make a bizarre exception here, why not just set the source module to "migrate"?
We can remove this method; I don't think it covers anything useful (it also does not cover __construct() at all). If the 'csv' plugin does not exist, we *will* run into an exception elsewhere in the test.
We should be asserting all of the imported data, not just one record.
Do we really need a base class for this test? It seems like it would be acceptable to merge the two.
According to the source plugin's doc comment, these are the defaults. We should just make them the defaults in CSVFileObject, and remove this line.
We can remove this; it's not really testing anything.
This should be assertSame().
We should use assertCount() here.
This is adorable, but can we choose better property names? These aren't very descriptive.
Comment #111
eiriksm: Working on this
Comment #112
eiriksm: WOW!
So, this was a bit of a rabbit hole. I started by looking into the test failures, and for some reason the plugin was not picked up. Digging deeper into the rabbit hole, I found out the annotation parser did not pick it up. So why did the annotation parser not pick it up? Well, because the function token_get_all(), which the Doctrine TokenParser uses, tries to set the doc comment by using the last known doc comment before the token T_CLASS is found. So it initially gets found, but then, since we changed this:
there is another instance of the T_CLASS token, but without the doc comment (of course).
Not sure if we should report it as a bug to doctrine/annotations as well, but for now: it's not possible to have a plugin discovered by annotation if you also use ::class in the same file :o
So for starters, here my try to fix the test failures from #107. I will look at the comments after I can verify this fixes the failures.
Comment #113
heddn: That is already fixed upstream. We just cannot use that style of syntax until we can upgrade to the newer version of doctrine.
Comment #115
eiriksm: OK, cool.
Here is an updated patch that should fix the test failures for starters. Interdiff from #107, not my own.
Comment #116
eiriksm: Oh snap, I was a bit fast at typing. Here is the patch in .patch format.
Comment #117
dww: Nice work, thanks!
However, there's still a long list of (mostly spot-on) changes from the review in #110 that need to be addressed.
Re: #110.1, this whole first sentence:
could be even simpler:
Re: #110.4: "Also, why the hell would this be an array of arrays?"
And #110.18: "Isn't $column_names supposed to be a key => value hash?"
We went over this at #87.17 and #88, and sadly at #91 @mikeryan confirmed It's A Thing(tm).
Yes, $column_names *should* be a simple hash of numeric keys (index or delta) and string values (the names). There's apparently a "need" for an optional third layer of "description". Perhaps that should be punted to another config knob if it really needs to exist, so column_names doesn't have to be weirdly overloaded like this. column_names vs. column_descriptions or something. Both are simple integer-key-to-string-value hashes. Both are optional so long as your CSV provides a header row and column_names are defined inline. If there's no header row, column_names is required (but simple to understand, unlike now).
+1 to everything else from #110.
Thanks,
-Derek
p.s. Also unassigning, since I don't assume @eiriksm wants to re-roll for all of this. If anyone wants to actively work on this, please assign yourself. Thanks again!
Comment #118
eiriksm: Hi, sorry about that. My intention was to work through the list after fixing the failures, but I got swamped by other things. I will assign myself if I get the time to pick it up in the coming days.
Comment #119
mikelutz
Comment #120
longwave: Tried to move this along a bit from the reviews in #110 and #117.
#110.14: This is the result of 'keys' being either a plain string or a field definition; I tried to add comments to help explain what is going on.
#110.16: Not sure what this undocumented property was useful for, so I removed it.
#110.19/20/21/24/25/26/30: Left these alone for now.
Everything else noted in the reviews should be fixed.
I realise I'm overloading the word "keys" here to mean two different things, but this structure is hard to explain succinctly!
Comment #121
longwave: Whoops, I broke some of the tests entirely; this should be a bit better.
Comment #123
dww: That looks great, thanks!
The comments partly help. But the deeper issue of how weird the code is remains. Maybe it's too much to ask that we actually fix this stuff, given that it all already works in contrib. It does seem like the sort of thing that has to happen to modules as they go into core. If they have any bizarre code, they need to be simplified and cleaned up before being committed.
Those are all test cleanups. Can/should they be moved to a follow-up issue? If so, let's open it as a child issue. If not, we need to fix the tests before this is committed.
Either way, this still needs work. But it's getting very close!
Thanks again,
-Derek
Comment #124
longwave: I agree the tests still need to be fixed here; I just ran out of time.
Dwelling on this overnight, I wonder if the complication around 'keys' is YAGNI for core. CSV is inherently a file made of strings, so what is the use case for the key being something other than a string? Should we simplify this entirely for core and leave it to subclasses if they need anything more complex, much like the removal of the 'file_class' and 'fields' options?
Comment #125
phenaproxima: Speaking as a Migrate maintainer for core, I can confidently say that, as long as we use a different plugin ID, we are absolutely free to disregard any kind of backwards compatibility with contrib.
As far as core is concerned, this is a new feature. Therefore, our priority should be to build a well thought-out and easy to use CSV plugin, not maintain compatibility with contrib. The old plugin can continue to exist in contrib, with all its idiosyncrasies, if people need it.
Comment #126
mikelutz
Comment #127
dww: Re: #125:
Then we would have to change the plugin ID for this, since the patch is currently using the one from contrib, 'csv'. :/
Re: #124:
The string can hold different kinds of values. Some "strings" are actually integers that would be much more happily stored and indexed as ints in the schema, not strings. Some strings are fixed length and relatively short. Some strings are huge. Some are floats (the kind of number, not CSS). ;) Etc.
I think part of the confusion is that the config setting is called "keys", but really it's the means by which this plugin can implement MigrateSourceInterface::getIds(). If you read the PHP doc for that, it starts to make more sense why we need to allow field structures to be defined. There's simply a config shortcut that lets you use a flat array of strings, in which case we use those values as the machine name index for the ids structure and give you a default subarray declaring that id as a string. testGetIdsComplex() is showing it in action.
We could potentially improve both test and doc coverage of cases where numeric keys are in use and could be specified as such. I'm glad we're testing "paragraph"-sized big text columns, but ints (and floats) seem very common in various migration cases, and worth both documenting and testing as part of this. In fact, all of the test CSV sources that use an 'id' are actually ints, and we should probably define/test them that way. Or at least have two tests showing that int ids (e.g. nids, uids, etc.) work when declared as either ints or with the default string schema.
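For illustration, the shorthand vs. the full structure might look like this in a migration's source config (a sketch only; the exact schema keys are what this thread is still debating, and the two alternatives are shown as separate YAML documents):

```yaml
# Shorthand: a flat array of source property names; each key
# gets a default subarray with a schema of type 'string'.
keys:
  - id
---
# Full form: a per-key schema subarray, e.g. declaring a
# numeric id as an integer in the ID map.
keys:
  - id:
      type: integer
```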
Comment #128
dww: Cleaned up the API changes section of the summary, and updated the remaining tasks.
Also, this:
might be more clear as:
(sorry if that wrapping isn't right, I'm doing this comment via browser, not emacs).
Comment #129
mikelutz: This still needs work on tests; I need to go through the rest of #110. This is my attempt to add some sanity to the whole column names thing. I'm content to change variable and config key names around, but code-wise, this makes more sense to me.
The human-readable descriptive names are ONLY for the fields() method in the source; there is absolutely no reason why the file object needs to know about them. Now we have two config items, column_names and column_descriptions, both simple indexed arrays mapping the 0-indexed column to a field key and description. If not present, the header row is used for either/both, and lacking everything you basically get no fields, so I want a little more error handling around that. I rather like requiring either a header row or column names, but if we decide to allow the whole thing to be accessed using just the column index, then we need a little more work around the fields() method to support it.
I also reduced the ids array to a simple array that makes everything a string. If there's strong reason to allow other id types, we can put it back.
Comment #130
dww: Disclaimer: I haven't looked at the patch or the interdiff; I'm only going on your comment. About to head into a meeting, so I don't have time for more right now.
Thanks for column_names vs. column_descriptions!
However, -1 for "I also reduced the ids array to a simple array that makes everything a string. If there's strong reason to allow other id types, we can put it back.". See #127 for why.
I'm a little worried that #129 is moving too many things in too many different directions before we have agreement on what we're trying to accomplish.
Thanks,
-Derek
Comment #131
mikelutz: While that may be true, the purpose of this typing is specifically for the idmap. While the default is certainly a SQL IdMap and a MySQL backend, it's not guaranteed. At the end of the day, the migration executable writes the map using the source values from the row. Unless we are going to typecast based on those key types, or have a lot more error checking around them, at the moment that happens the executable is sending in a string value, because the row is an array of strings. So despite the fact that it may seem like a simple incrementing integer value in a spreadsheet, and it may seem reasonable that it should be stored and indexed in the database as an integer, you are trusting that an unknown idmap object can handle getting a string when you've told it it's getting an integer. The truth is, it probably can, but I'm not a fan of sending a string to do an integer's job. Just because PHP is loosely typed doesn't mean we need to write loosely typed code.
As far as supporting longer strings, I think there's a better case for that, but I'm also not sure it's reasonable. If you need a paragraph-sized identifier, you are probably doing something wrong.
I was thinking about potentially supporting the CSV line number as a migration key if none is specified. I don't think it's an ideal situation, but I could see some use cases where someone might want it.
Comment #132
dww: All of #131 is probably true, but that seems like a fight to have with MigrateSourceInterface::getIds(), not this patch. If that API expects/allows all this, that's the place to have the tests to see what happens with faulty input, not in this particular implementation. We're just making sure the source config allows us to uphold the API we've been given to work with. Contrib has used this particular way, the 'keys' config knob, a large number of existing migrations are written to work with that, and it'd be pretty annoying for us to both squat on the 'csv' namespace *and* break all those migrations by changing it at this point. It's wonky, but we almost fully understand it now and nearly have it documented properly. Most people will never care. If they care, and they go out of their way to define a column as an int, it's their problem if they have dirty source data.
Comment #133
mikelutz: I realize up around #81 @heddn talked about a straight port and keeping the same namespace, but the more I dig into this, the more I agree with #125. These configuration knobs are wonky, and once they are committed, it's going to be very difficult to fix. If we change the plugin name we can use a more intuitive set of configurations, and anybody currently using contrib can just keep doing so until they are ready to update their migrations.
I still vote against allowing non-string ids when we only provide string fields, but I'm far more bothered by the column_names knob. If you aren't using a header row, you should be able to supply a list of column names as a simple array, and you shouldn't have to provide descriptions beyond the field names if you don't want to. Just making that change would break existing migrations and would warrant a name change, but I think it makes the plugin more usable going forward, and it's far better to do it now than to try to do it later.
Comment #134
mikelutz: More interestingly, why is there a test in the migrate module which extends a test class from the migrate_drupal module and enables a test module from the migrate_drupal_ui module, just to test that all source plugins have an annotation they aren't required to have?
The module dependencies for tests should go the other way. Source plugins aren't required to have a source module annotation. Migrations tagged with 'Drupal 6' or 'Drupal 7' are only allowed to use source plugins which have a source module annotation. We shouldn't have to add dummy annotations to source plugins that don't apply to Drupal migrations just to pass tests.
Migrate has been built primarily for d6 and d7 migrations, but its purpose is greater, as this issue demonstrates. If there isn't a source module, there isn't a source module.
Comment #135
heddn: I don't know if this helps with some of the confusion, but around the timeframe of #110 there was a conversation in Slack that never made it into this issue, or so it would appear. In that conversation, @phenaproxima and @heddn agreed not to do a straight port as put forth in #81. So we /will/ have a new name, and this lets us make the API "right" and "sane".
As for the column names thing, we don't need to expose that to folks in the API. For keys, I vote for using the approach we see in migrate_plus and core: call the thing ids and use the same SQL-style declaration of int, string, etc.
Comment #136
phenaproxima: Migrate can freak out and throw exceptions during plugin discovery if source_module isn't set. In retrospect, that behavior should have been entirely confined to Migrate Drupal, but at this point it's too late to go back. In Drupal 9, we can and should scale back the invasiveness of the source_module and destination_module stuff. But for now, we should just set the source_module to 'migrate' and call it a day.
Comment #137
mikelutz: I just did another quick grep. The only place I see the exception raised is in \Drupal\migrate_drupal\MigrationPluginManager::processDefinition(), which throws an exception if migrate_drupal is enabled and there is a migration tagged with Drupal 6 or Drupal 7 that uses a source plugin with no source module defined. The exception declares that the source plugin must define the source module property, but it's really an exception on the migration, which shouldn't be using a source with no module defined. Actually, that should allow the migration to define a source module in its configuration if it wants to, to mimic SourcePluginBase::getSourceModule().
Regardless, it should be punted to another issue, I suppose.
Comment #138
dww: Re: #135:
I don't know what that means. Of course we have to let people define their column names, or they can't write the rest of their migration. Are you proposing we *require* a header row? I'd be very strongly opposed to that. So we have to have a way to declare the column names via the API of how you configure your source plugin.
Maybe you mean we don't have to expose column descriptions to folks? Great. But then shouldn't we have an issue somewhere to deprecate that part of the migrate source plugin API? See comment #91 from @mikeryan.
Regardless, if the party line is that core is going to change namespaces and break existing migrations for this, so be it. In that case, the summary is out of date, and perhaps some of the migrate maintainers who've been discussing this offline could update things to reflect the current "agreement". Namely, the API changes section can no longer claim BC and ease-of-use for existing users of the contrib module, and it should start to spell out the ways that we envision this working differently (even if it's primarily about changing the available config knobs).
Also agreed that #136 and #137 are out of scope for this issue.
Thanks,
-Derek
Comment #139
heddn: Yeah, we need to update the IS.
Here's what I expect the yml looks like:
or alternatively:
This as opposed to the existing requirement:
Then I expect the source module to array_combine or some such on id, first_name, etc. to make the required nested array that migrate needs.
So, no API exposed to site users about a double nested array thing, just a yaml array that the csv plugin converts into the right format.
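Based on that description, a hedged sketch of the simplified YAML being proposed (the original snippets were not preserved in this transcript; the path, field names, and knob names here are illustrative guesses):

```yaml
source:
  plugin: csv
  path: /path/to/people.csv
  ids:
    - id
  column_names:
    - id
    - first_name
    - last_name
```

The plugin would then expand the flat ids list into the nested array that MigrateSourceInterface::getIds() requires, along the lines of ['id' => ['type' => 'string']].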
Comment #140
dww: Right, so column names are required, either via header row or a (simple) array.
Column descriptions are optional, via a second (simple) array config knob. If not defined, we default to reusing the name as the description.
Or are you proposing we simply hide descriptions entirely and not provide any way to declare them? In that case, follow-up issue to deprecate them from the Migrate API entirely?
Comment #141
heddn: For a CSV source, I'm suggesting we hide their existence in the implementation details; it is confusing DX. That isn't to say it has no value, and for other sources it might make sense. So: hide but don't deprecate is my suggestion. If someone needs the descriptions, they should be able to extend the source plugin or keep using contrib.
Comment #142
dww: The nested array stuff is certainly confusing DX.
The flat arrays with separate knobs per #129 isn't bad.
If descriptions vs. machine names in general make for a better UI/UX, and that's what the API itself supports so that various tools can benefit, it seems we should make it easier to give human-readable names than "extend the plugin". I don't see what that has to do with a CSV source. Why does it matter whether "first_name" vs. "First name" appears in my tool output, regardless of whether the name came from a CSV file, a JSONAPI call, a DB query, etc.?
I thought we were going to kill off the contrib version of this once in core. Why maintain two slightly different copies of the same thing?
Comment #143
andypost: Where will the upgrade path from contrib live? If the structure changes, contrib probably needs to release 3.0 with an 8.7 requirement and provide an upgrade path that at least disables itself.
Comment #144
heddn: The tests will fail miserably, so I'm not changing the status. But this rebases things to use league/csv instead of a custom CSV reader. I haven't had a chance to manually test things, but it should be a drop-in replacement for our previous file reader solution; meaning, it should work, but the tests need to be adjusted for the new solution.
Comment #145
heddn: Let's see if this fixes up any of the tests.
Comment #146
heddn: Some self-review here:
The string comment here is unneeded.
All this error checking here and below could use tests.
Missing docs on this.
This could use tests.
Let's get rid of this multi line scenario as we won't be supporting it in core.
Let's rename these standard and non_standard.
Comment #147
heddn
Comment #149
dww: Hrm, I'm rather -1 on a new composer.json dependency for this. That means we have to:
a) Carefully audit the external dependency.
b) Have confidence it's well-maintained, will get security updates, that we can coordinate with those updates when needed, etc.
c) Have confidence it's going to survive as long as this code will want to use it.
d) Have confidence it's not going to have conflicting requirements from what we want (e.g. dropping support for PHP 7.0 before we do, whatever).
Looking at the interdiff, we're not saving many lines of code for all those potential costs. The entire CSVFileObject.php in contrib is 117 lines, with comments and whitespace. Most of that are trivial set/get methods. The only actual code of substance are these:
I know we all love our #ProudlyFoundElsewhere tags, but it doesn't seem like an obvious win in this case.
Am I missing something?
Thanks,
-Derek
Comment #150
heddn: This fixes tests locally.
Comment #151
heddn: Reasons to use it: it supports stream resources, the #1 requested feature for CSV files, which the contrib module doesn't support. https://thephpleague.com/ is a well-respected member of the PHP community.
The downside, as you stated, is lining up release dates. From https://csv.thephpleague.com/:
Comment #152
dww: Okay, duly noted. Thanks.
However, it seems like adding a new feature request on the way into core is the wrong approach. ;) Certainly the use-case-in-core for this (the Umami stuff) doesn't care about streams -- we'll have local files. Stream support seems to open up a lot of potential what-ifs that should be more carefully considered. IMHO, we should either:
a) Keep the core patch simple, small, easy, based on already-proven code, without introducing new dependencies. Add stream support at a later date in a follow-up issue.
or:
b) Try out the new dependency and stream support in contrib for a while and forget about trying to get this into 8.7.0.
Either way, I don't see much hope in adding a new external dependency at this late date in the 8.7.x release cycle. Maybe the framework and release managers will see it differently, but I doubt it.
Comment #153
andypost: I also see no real benefit in the external dependency. Meanwhile, from using this lib I recall issues with encodings that forced us to always override it to properly handle variable encodings.
Comment #154
heddn: There is stream filter support for encoding conversion; I opted not to include it in the initial patch here, since for core's use we don't need to handle encoding conversion. It would be a later follow-up. Having to provide an absolute path for a local CSV file is painful.
The following is required to get the absolute path:
Whereas we could use a stream once #1308152: Add stream wrappers to access .json files in extensions lands, and no plugin alters or cache clearing would be needed.
I'm not particularly fixated on getting this into 8.7 if 8.8 works better. But it would be nice to use a library instead of parsing the CSV ourselves.
Comment #155
heddn: Yes, we can write (and have written) a nifty CSV solution in contrib. But if we bring it into core, it has several latent limitations, and any enhancements to address them would leave us writing our own solution; we'd take those ~100 lines and quickly run into many hundreds. Migrate maintainers have many other priorities, and continuing to roll our own solution for reading CSVs doesn't seem like a good use of that time.
Known limitations:
Until we're more welcome to use third-party PHP libraries, I'm going to mark this closed (won't fix). I don't really want to implement a halfway CSV solution in core that has so many limitations.
Comment #156
mlncn: Too bad; this would make the migration support already in Drupal core really come alive. I thought Drupal had long since moved to embracing well-maintained 3rd-party dependencies!
Comment #157
heddn: I've opened #3042160: Utilize leage/csv for CSV reader to move this over into contrib.