Follow-up to #872624: Charset issues

I have an ISO-8859-15 encoded CSV file with special characters (German umlauts) that the current migrate CSV importer (MigrateSourceCSV) has problems with. I called iconv('ISO-8859-15', 'UTF-8', $column); on each column in prepareRow(), which solved most problems. But there were some special characters next to the separator, with the effect that the separator was not recognized and the migration failed.

There are lots of issues with character encoding and PHP's fgetcsv(), but not many of them in the migrate issue queue. I guess the CSV source isn't used that much with non-UTF-8 encodings.

What could be done in migrate to fix this kind of problem:

* use our own CSV parser, like the Feeds module, although that comes with its own problems: #1369874: Don't roll own CSV parser; use PHP's native one
* use a third-party CSV parser (any suggestions?)
* stick with fgetcsv() and require that the CSV file is UTF-8 encoded (could this be done inside the migration, e.g. by copying the source to tmp and running it through iconv?)

Finally I've chosen the last option and converted the csv-file to UTF-8 using a cron job that runs before the update migration – as it was the easiest way.
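For anyone wanting to do the same, a minimal sketch of such a pre-migration conversion could look like the following. The function name and the file paths are made up for illustration and are not part of the migrate module; the source charset is assumed to be ISO-8859-15, as in my case.

```shell
#!/bin/sh
# Hypothetical helper: convert an ISO-8859-15 CSV to UTF-8 before the migration runs.
# Usage: convert_csv_to_utf8 SOURCE_FILE TARGET_FILE
convert_csv_to_utf8() {
  src=$1
  dest=$2
  # Write to a temporary file first so a failed run never leaves a half-written target.
  tmp="$dest.tmp.$$"
  if iconv -f ISO-8859-15 -t UTF-8 "$src" > "$tmp"; then
    mv "$tmp" "$dest"
  else
    rm -f "$tmp"
    echo "conversion failed: $src" >&2
    return 1
  fi
}
```

A cron entry could then call the function with the real source and target paths shortly before the update migration is scheduled, and the migration would read only the converted file.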

Anyway, I created this issue to inform others about the problem, to start a discussion about how the encoding problem should be solved, and to see if there is some interest in changing the current CSV parser implementation.

Comments

osopolar created an issue.

jfraz commented:

"Finally I've chosen the last option and converted the csv-file to UTF-8 using a cron job that runs before the update migration"

Could you please provide more detail on how this was done? I have a client who uploads their own CSVs, and sometimes errors in their data break the import(s).

kdborg@gmail.com commented:

My steps... so far.

1) I converted my CSVs to UTF-8 using iconv: iconv -f UTF-16LE -t UTF-8 input.csv > output.csv

2) Then I had to remove the BOM (byte order mark): awk '{if(NR==1)sub(/^\xef\xbb\xbf/,"");print}' output.csv > output.nobom.csv

3) Then, when I migrate, most characters are migrated properly. I just need to get the last few going correctly.
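Steps 1 and 2 can probably be combined into a single pipeline. The sketch below is only one way to do it: the function name is hypothetical, and it uses sed with octal escapes instead of awk's \x escapes, on the assumption that not every awk implementation understands the latter.

```shell
# Hypothetical helper combining steps 1 and 2:
# convert UTF-16LE to UTF-8, then drop a leading BOM if one is present.
utf16le_to_utf8_nobom() {
  # The octal escapes spell the three BOM bytes EF BB BF.
  bom=$(printf '\357\273\277')
  # LC_ALL=C makes sed treat the input as plain bytes.
  iconv -f UTF-16LE -t UTF-8 "$1" | LC_ALL=C sed "1s/^$bom//"
}
```

It would be called as utf16le_to_utf8_nobom input.csv > output.nobom.csv. With GNU iconv, passing -f UTF-16 instead of -f UTF-16LE typically lets iconv consume the BOM itself, but I have not checked that on other implementations.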

kdborg@gmail.com commented:

And the reason why I had some characters not translating properly is that in my migration I had used callbacks('utf8_encode') on certain fields.
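PHP's utf8_encode() converts from ISO-8859-1 to UTF-8, so applying it to data that is already UTF-8 encodes it a second time. The effect can be reproduced on the command line with iconv standing in for utf8_encode(); the helper name below is made up for the demonstration.

```shell
# Emit "ü" as it looks when it is already UTF-8 encoded (bytes C3 BC).
already_utf8() { printf '\303\274'; }

# Left alone, the bytes stay c3 bc.
already_utf8 | od -An -tx1

# Treating those bytes as ISO-8859-1 and converting again gives c3 83 c2 bc,
# which is displayed as "Ã¼" instead of "ü".
already_utf8 | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1
```

So once the file itself is converted to UTF-8, the utf8_encode callbacks need to be removed from the field mappings.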