Support for Drupal 7 is ending on 5 January 2025—it’s time to migrate to Drupal 10! Learn about the many benefits of Drupal 10 and find migration tools in our resource center.
It would be great if files could be checked for the correct encoding (UTF-8) before importing them.
If the encoding is wrong it should spit out an error.
The current error in drush is: "Migration failed with source plugin exception: ID has no value"
Issue fork migrate_source_csv-2939248
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #2
bbuchert CreditAttribution: bbuchert as a volunteer commentedComment #3
bbuchert CreditAttribution: bbuchert as a volunteer commentedComment #4
bbuchert CreditAttribution: bbuchert as a volunteer commentedComment #5
heddnPatches are always welcome.
Comment #6
bbuchert CreditAttribution: bbuchert commentedMaybe it makes even more sense to convert all non UTF-8 files directly to UTF-8. I don’t have much experience with reincoding files and how reliable it is.
https://stackoverflow.com/questions/505562/detect-file-encoding-in-php
Comment #7
heddnI don't have much experience either. But if you want to write a test of a file that needs re-encoding, I'll take it and the fix you have written up.
Comment #8
josephdpurcell CreditAttribution: josephdpurcell at Digital Bridge Solutions commentedInvestigation implementation ideas
I've run into this problem. The file on disk is in latin 1 (ISO-8859-1) encoding, but PHP is UTF-8 and the database is UTF-8. So, when a character like a non-breaking space is read from this file and attempted to be inserted into the database, an exception occurs like:
Looking at how
Drupal\migrate_source_csv\CSVFileObject
is implemented, it's using\SplFileObject
to read the file off disk. As https://stackoverflow.com/a/23682549/990642 explains, this means there's no way to encode the data in a different format, e.g. using a stream filter.Possible paths to take
The PHP League has a CSV library that has charset conversion built in, see https://csv.thephpleague.com/9.0/converter/charset/. One option would be to refactor this module to leverage this library.
Another path would be to refactor this module to use fopen so that stream filters are supported, and expose a configuration to allow anyone to add a stream filter (like charset conversion). Benefit here might be to avoid having the module dependent on a third party library.
Commentary
This is a pickle. From what I can see, Migrate API (specifically this module) has no extension points to convert the file encoding before the file is processed.
Does anyone have thoughts on maybe a work around, something I didn't see, or maybe what path sounds best?
Comment #9
heddnSo, CSV migration is all focused on migrating. If something needs to happen prior to reading the file, do that in an out of band process. Maybe the same process that copies the file can also rewrite the file as a pre-step. Rewriting the file doesn't have to happen during the migration. It can happen earlier.
Comment #10
josephdpurcell CreditAttribution: josephdpurcell at Digital Bridge Solutions commentedThe responsibility of reading the file off disk is included in this module. I think its a fair goal to have the responsibility of modifying the data read as out of scope.
It looks like we will have the opportunity to do the encoding transformation elsewhere (before its read by this module).
Back to the original intent of the ticket, I'm not sure it's possible? Because
\SplFileObject
doesn't provide access to the read data (only the parsed data), I don't think the file encoding can be detected. However, I'm not an expert on how encodings work.Comment #11
heddnI've said a few times in public places that making sure the file is in the right format and in the right place seems like a task that could easily be done by something external to processing the CSV file. There's lots of packages on packagist for taking off BOM, reformatting files and moving them around on the file system. Rather than putting all of those things into a parser module, I think I'm going to suggest those things happen externally.
Comment #12
bbuchert CreditAttribution: bbuchert commented@heddn it totally makes sense to convert the file elsewhere. I think what would be helpful is not a conversion but a check if the file was uploaded in the right format. If a more useful error is given a lot of confusion can be avoided. But if that is not a possible I guess it has to stay like it is.
Comment #13
heddnSome database encodings support Latin. Being too prescriptive on the encoding seems a little heavy handed. I like the idea of using file streams. But that would take a bit of work. At the very least, I might have been too hasty in closing as won't fix. Re-opening while we come up with ideas.
Comment #14
heddnThis can now be fixed via league/csv pretty easily. Let's do it!
Comment #15
bbuchert CreditAttribution: bbuchert commentedhttps://csv.thephpleague.com/9.0/interoperability/encoding/
Comment #16
heddnTo be more clear, I was more focused on UTF encoding. Running filters at run time. We'll need to add a new config option to support an array of stream_filters. I didn't add this feature with the initial 3.x branch since I wanted to make sure we did due diligence around naming things right and testing this so can properly demonstrate that (for example) windows-1252 to utf-8 works.
Comment #19
a.dmitriiev CreditAttribution: a.dmitriiev as a volunteer and at 1xINTERNET commentedMaybe the MR could be a starting point?
Comment #20
heddn