It would be great if files could be checked for the correct encoding (UTF-8) before importing them. If the encoding is wrong, a clear error should be shown.

The current error in drush is: "Migration failed with source plugin exception: ID has no value"
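As a sketch of what such a pre-import check could look like, the raw bytes could be validated before the migration runs. The helper name below is hypothetical, not part of migrate_source_csv:

```php
<?php

/**
 * Hypothetical pre-import check: fail fast with a clear message when a
 * source file is not valid UTF-8, instead of a cryptic migration error.
 */
function check_csv_is_utf8(string $path): void {
  $bytes = file_get_contents($path);
  if ($bytes === false) {
    throw new RuntimeException("Could not read $path.");
  }
  if (!mb_check_encoding($bytes, 'UTF-8')) {
    throw new RuntimeException("$path is not valid UTF-8; convert the file before importing it.");
  }
}
```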


Comments

bbuchert created an issue.

bbuchert:

Status: Needs work » Active

bbuchert:

Title: Check encoding of csv file and show error if it is wrong » Check encoding of csv file and show useful error if it is wrong
heddn:

Patches are always welcome.

bbuchert:

Maybe it makes even more sense to convert all non-UTF-8 files directly to UTF-8. I don’t have much experience with re-encoding files and how reliable it is.

// Re-encode $output if it is not valid UTF-8 or does not survive a
// UTF-8 round trip. 'auto' lets mbstring detect the source encoding.
if (!mb_check_encoding($output, 'UTF-8')
    || $output !== mb_convert_encoding(mb_convert_encoding($output, 'UTF-32', 'UTF-8'), 'UTF-8', 'UTF-32')) {
  $output = mb_convert_encoding($output, 'UTF-8', 'auto');
}

// $output is now UTF-8.

https://stackoverflow.com/questions/505562/detect-file-encoding-in-php

heddn:

I don't have much experience either. But if you want to write a test of a file that needs re-encoding, I'll take it and the fix you have written up.

josephdpurcell:

Investigation: implementation ideas

I've run into this problem. The file on disk is in Latin-1 (ISO-8859-1) encoding, but PHP is configured for UTF-8 and the database is UTF-8. So, when a character like a non-breaking space is read from this file and inserted into the database, an exception like the following occurs:

Drupal\Core\Database\DatabaseExceptionWrapper: SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xA0' for column 'address_address_line1' at row 1: INSERT INTO {profile__address}
...

Looking at how Drupal\migrate_source_csv\CSVFileObject is implemented, it uses \SplFileObject to read the file off disk. As https://stackoverflow.com/a/23682549/990642 explains, this means there's no way to re-encode the data into a different charset, e.g. by attaching a stream filter.
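One possible workaround, sketched here under the assumption that the source encoding is already known: \SplFileObject can be pointed at a php://filter path, so a built-in convert.iconv filter transcodes the bytes as the object reads them. The temp file and encodings below are illustrative only:

```php
<?php

// Sketch: \SplFileObject has no stream-filter API, but PHP's php://filter
// wrapper can wrap the path so bytes are transcoded as they are read.
$path = tempnam(sys_get_temp_dir(), 'csv');
// Write a small Latin-1 CSV fixture ("Jörgen" contains a 0xF6 byte).
file_put_contents($path, mb_convert_encoding("id,name\n1,J\u{00F6}rgen\n", 'ISO-8859-1', 'UTF-8'));

$file = new SplFileObject('php://filter/read=convert.iconv.ISO-8859-1.UTF-8/resource=' . $path);
$file->setFlags(SplFileObject::READ_CSV);

$rows = [];
foreach ($file as $row) {
  if (is_array($row) && $row[0] !== null) {  // Skip the trailing blank line.
    $rows[] = $row;                          // Each $row is already valid UTF-8.
  }
}
```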

Possible paths to take

The PHP League has a CSV library that has charset conversion built in, see https://csv.thephpleague.com/9.0/converter/charset/. One option would be to refactor this module to leverage this library.

Another path would be to refactor this module to use fopen so that stream filters are supported, and expose a configuration option that allows anyone to add a stream filter (like charset conversion). The benefit here would be avoiding a dependency on a third-party library.
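A minimal sketch of that idea using only the standard library. The filter name would presumably come from the proposed configuration; the fixture file and encodings here are illustrative:

```php
<?php

// Sketch of the fopen() + stream filter approach: a built-in
// convert.iconv filter transcodes each read before CSV parsing.
$path = tempnam(sys_get_temp_dir(), 'csv');
// Write a small Windows-1252 CSV fixture ("Zürich" contains a 0xFC byte).
file_put_contents($path, mb_convert_encoding("id,city\n1,Z\u{00FC}rich\n", 'Windows-1252', 'UTF-8'));

$handle = fopen($path, 'r');
// In the module, this filter name could come from configuration.
stream_filter_append($handle, 'convert.iconv.Windows-1252/UTF-8', STREAM_FILTER_READ);

$rows = [];
while (($row = fgetcsv($handle)) !== false) {
  $rows[] = $row;  // Rows arrive already converted to UTF-8.
}
fclose($handle);
```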

Commentary

This is a pickle. From what I can see, Migrate API (specifically this module) has no extension points to convert the file encoding before the file is processed.

Does anyone have thoughts on a possible workaround, something I didn't see, or which path sounds best?

heddn:

So, CSV migration is all focused on migrating. If something needs to happen prior to reading the file, do that in an out-of-band process. Maybe the same process that copies the file into place can also rewrite it as a pre-step. Rewriting the file doesn't have to happen during the migration; it can happen earlier.

josephdpurcell:

The responsibility of reading the file off disk is included in this module. I think it's a fair goal to treat modifying the data that is read as out of scope.

It looks like we will have the opportunity to do the encoding transformation elsewhere (before it's read by this module).

Back to the original intent of the ticket, I'm not sure it's possible? Because \SplFileObject doesn't provide access to the raw read data (only the parsed data), I don't think the file encoding can be detected. However, I'm not an expert on how encodings work.

heddn:

Status: Active » Closed (won't fix)

I've said a few times in public places that making sure the file is in the right format and in the right place seems like a task that can easily be done by something external to processing the CSV file. There are lots of packages on Packagist for stripping the BOM, reformatting files, and moving them around on the file system. Rather than putting all of those things into a parser module, I'm going to suggest those things happen externally.

bbuchert:

@heddn it totally makes sense to convert the file elsewhere. I think what would be helpful is not a conversion but a check that the file was uploaded in the right format. A more useful error would avoid a lot of confusion. But if that is not possible, I guess it has to stay like it is.

heddn:

Status: Closed (won't fix) » Active

Some database encodings support Latin-1. Being too prescriptive about the encoding seems a little heavy-handed. I like the idea of using file streams, but that would take a bit of work. At the very least, I may have been too hasty in closing this as won't fix. Re-opening while we come up with ideas.

heddn:

Version: 8.x-2.x-dev » 8.x-3.x-dev

This can now be fixed via league/csv pretty easily. Let's do it!

heddn:

To be more clear, I was focused on UTF encoding and on running filters at read time. We'll need to add a new config option to support an array of stream_filters. I didn't add this feature to the initial 3.x branch because I wanted to make sure we did due diligence around naming things right and testing, so we can properly demonstrate that (for example) Windows-1252 to UTF-8 works.
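A hypothetical shape for such an option in a migration source definition. The key name and values below are illustrative only, not a committed API:

```yaml
source:
  plugin: csv
  path: public://import/people.csv
  ids: [id]
  # Hypothetical: stream filters applied to the read handle, in order.
  stream_filters:
    - 'convert.iconv.windows-1252/utf-8'
```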

a.dmitriiev made their first commit to this issue’s fork.

a.dmitriiev:

Status: Active » Needs review

Maybe the MR could be a starting point?

heddn:

Status: Needs review » Needs work