It would be great if files could be checked for the correct encoding (UTF-8) before importing them. If the encoding is wrong, a clear error should be shown.

The current error in drush is: "Migration failed with source plugin exception: ID has no value"
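As a sketch of what such a pre-import check could look like, the raw bytes could be validated before the migration runs. The helper name below is hypothetical, not part of migrate_source_csv:

```php
<?php

/**
 * Hypothetical pre-import check: fail fast with a clear message when a
 * source file is not valid UTF-8, instead of a cryptic migration error.
 */
function check_csv_is_utf8(string $path): void {
  $bytes = file_get_contents($path);
  if ($bytes === false) {
    throw new RuntimeException("Could not read $path.");
  }
  if (!mb_check_encoding($bytes, 'UTF-8')) {
    throw new RuntimeException("$path is not valid UTF-8; convert the file before importing it.");
  }
}
```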


Comments

bbuchert created an issue.

bbuchert:

Status: Needs work » Active

bbuchert:

Title: Check encoding of csv file and show error if it is wrong » Check encoding of csv file and show useful error if it is wrong
heddn:

Patches are always welcome.

bbuchert:

Maybe it makes even more sense to convert all non-UTF-8 files directly to UTF-8. I don’t have much experience with re-encoding files and how reliable it is.

// Re-encode $output if it is not valid UTF-8 or does not survive a
// UTF-8 round trip. 'auto' lets mbstring detect the source encoding.
if (!mb_check_encoding($output, 'UTF-8')
    || $output !== mb_convert_encoding(mb_convert_encoding($output, 'UTF-32', 'UTF-8'), 'UTF-8', 'UTF-32')) {
  $output = mb_convert_encoding($output, 'UTF-8', 'auto');
}

// $output is now UTF-8.

https://stackoverflow.com/questions/505562/detect-file-encoding-in-php

heddn:

I don't have much experience either. But if you want to write a test of a file that needs re-encoding, I'll take it and the fix you have written up.

josephdpurcell:

Investigation: implementation ideas

I've run into this problem. The file on disk is in Latin-1 (ISO-8859-1) encoding, but PHP is configured for UTF-8 and the database is UTF-8. So, when a character like a non-breaking space is read from this file and inserted into the database, an exception like the following occurs:

Drupal\Core\Database\DatabaseExceptionWrapper: SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xA0' for column 'address_address_line1' at row 1: INSERT INTO {profile__address}
...

Looking at how Drupal\migrate_source_csv\CSVFileObject is implemented, it uses \SplFileObject to read the file off disk. As https://stackoverflow.com/a/23682549/990642 explains, this means there's no way to re-encode the data into a different charset, e.g. by attaching a stream filter.
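One possible workaround, sketched here under the assumption that the source encoding is already known: \SplFileObject can be pointed at a php://filter path, so a built-in convert.iconv filter transcodes the bytes as the object reads them. The temp file and encodings below are illustrative only:

```php
<?php

// Sketch: \SplFileObject has no stream-filter API, but PHP's php://filter
// wrapper can wrap the path so bytes are transcoded as they are read.
$path = tempnam(sys_get_temp_dir(), 'csv');
// Write a small Latin-1 CSV fixture ("Jörgen" contains a 0xF6 byte).
file_put_contents($path, mb_convert_encoding("id,name\n1,J\u{00F6}rgen\n", 'ISO-8859-1', 'UTF-8'));

$file = new SplFileObject('php://filter/read=convert.iconv.ISO-8859-1.UTF-8/resource=' . $path);
$file->setFlags(SplFileObject::READ_CSV);

$rows = [];
foreach ($file as $row) {
  if (is_array($row) && $row[0] !== null) {  // Skip the trailing blank line.
    $rows[] = $row;                          // Each $row is already valid UTF-8.
  }
}
```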

Possible paths to take

The PHP League has a CSV library that has charset conversion built in, see https://csv.thephpleague.com/9.0/converter/charset/. One option would be to refactor this module to leverage this library.

Another path would be to refactor this module to use fopen so that stream filters are supported, and expose a configuration option that allows anyone to add a stream filter (like charset conversion). The benefit here would be avoiding a dependency on a third-party library.
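A minimal sketch of that idea using only the standard library. The filter name would presumably come from the proposed configuration; the fixture file and encodings here are illustrative:

```php
<?php

// Sketch of the fopen() + stream filter approach: a built-in
// convert.iconv filter transcodes each read before CSV parsing.
$path = tempnam(sys_get_temp_dir(), 'csv');
// Write a small Windows-1252 CSV fixture ("Zürich" contains a 0xFC byte).
file_put_contents($path, mb_convert_encoding("id,city\n1,Z\u{00FC}rich\n", 'Windows-1252', 'UTF-8'));

$handle = fopen($path, 'r');
// In the module, this filter name could come from configuration.
stream_filter_append($handle, 'convert.iconv.Windows-1252/UTF-8', STREAM_FILTER_READ);

$rows = [];
while (($row = fgetcsv($handle)) !== false) {
  $rows[] = $row;  // Rows arrive already converted to UTF-8.
}
fclose($handle);
```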

Commentary

This is a pickle. From what I can see, Migrate API (specifically this module) has no extension points to convert the file encoding before the file is processed.

Does anyone have thoughts on a possible workaround, something I didn't see, or which path sounds best?

heddn:

So, CSV migration is all focused on migrating. If something needs to happen prior to reading the file, do that in an out-of-band process. Maybe the same process that copies the file into place can also rewrite it as a pre-step. Rewriting the file doesn't have to happen during the migration; it can happen earlier.

josephdpurcell:

The responsibility of reading the file off disk is included in this module. I think it's a fair goal to treat modifying the data that is read as out of scope.

It looks like we will have the opportunity to do the encoding transformation elsewhere (before it's read by this module).

Back to the original intent of the ticket, I'm not sure it's possible? Because \SplFileObject doesn't provide access to the raw read data (only the parsed data), I don't think the file encoding can be detected. However, I'm not an expert on how encodings work.

heddn:

Status: Active » Closed (won't fix)

I've said a few times in public places that making sure the file is in the right format and in the right place seems like a task that can easily be done by something external to processing the CSV file. There are lots of packages on Packagist for stripping the BOM, reformatting files, and moving them around on the file system. Rather than putting all of those things into a parser module, I'm going to suggest those things happen externally.

bbuchert:

@heddn it totally makes sense to convert the file elsewhere. I think what would be helpful is not a conversion but a check that the file was uploaded in the right format. A more useful error would avoid a lot of confusion. But if that is not possible, I guess it has to stay like it is.

heddn:

Status: Closed (won't fix) » Active

Some database encodings support Latin-1. Being too prescriptive about the encoding seems a little heavy-handed. I like the idea of using file streams, but that would take a bit of work. At the very least, I may have been too hasty in closing this as won't fix. Re-opening while we come up with ideas.

heddn:

Version: 8.x-2.x-dev » 8.x-3.x-dev

This can now be fixed via league/csv pretty easily. Let's do it!

heddn:

To be more clear, I was focused on UTF encoding and on running filters at read time. We'll need to add a new config option to support an array of stream_filters. I didn't add this feature to the initial 3.x branch because I wanted to make sure we did due diligence around naming things right and testing, so we can properly demonstrate that (for example) Windows-1252 to UTF-8 works.
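A hypothetical shape for such an option in a migration source definition. The key name and values below are illustrative only, not a committed API:

```yaml
source:
  plugin: csv
  path: public://import/people.csv
  ids: [id]
  # Hypothetical: stream filters applied to the read handle, in order.
  stream_filters:
    - 'convert.iconv.windows-1252/utf-8'
```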

a.dmitriiev made their first commit to this issue’s fork.

a.dmitriiev:

Status: Active » Needs review

Maybe the MR could be a starting point?

heddn:

Status: Needs review » Needs work