There is a problem reading data from CSV files that contain multibyte characters (tested with Chinese text). I had previously encountered this problem and fixed it by setting the PHP locale to a UTF-8 locale using setlocale(). However, for some reason the problem is back again. Need to investigate further.

Comments

begun’s picture

I just tested it with a CSV file containing mixed Chinese and English, created by the Linux version of Thunderbird 13. Everything appears to work correctly. Need to confirm whether the problem is Windows- or Outlook-specific.

begun’s picture

Tested with a CSV file containing mixed English and Chinese characters, exported from the Windows version of Thunderbird 9.0.1. It produced the following error:

"...Warning: htmlspecialchars(): Invalid multibyte sequence in argument in htmlspecialchars() (line 1572 of /home/bengul/git_repositories/drupal_site/includes/bootstrap.inc). => ..."

With the following backtrace:

10: htmlspecialchars() (Array, 1 element)
9: check_plain() (Array, 2 elements)
8: namecards_import_select_contacts_form() (Array, 2 elements)
7: call_user_func_array() (Array, 1 element)
6: drupal_retrieve_form() (Array, 2 elements)
5: drupal_build_form() (Array, 2 elements)
4: drupal_get_form() (Array, 2 elements)
3: namecards_import_select_contacts() (Array, 2 elements)
2: call_user_func_array() (Array, 1 element)
1: menu_execute_active_handler() (Array, 2 elements)
0: main() (Array, 2 elements)
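
The backtrace ends in htmlspecialchars(), which check_plain() calls with UTF-8 as the expected charset (assuming Drupal 7's check_plain()). A minimal sketch of how that call reacts to bytes that are not valid UTF-8 is shown below. The GBK byte string is only an assumed example of a non-UTF-8 encoding; it is not taken from the failing file.

```php
<?php
// "中文" as UTF-8 bytes, and the same two characters as GBK bytes.
// GBK is only an assumed example of a non-UTF-8 encoding; the thread does
// not establish which encoding the Windows export really used.
$utf8_bytes = "\xE4\xB8\xAD\xE6\x96\x87";
$gbk_bytes = "\xD6\xD0\xCE\xC4";

// Valid UTF-8 passes through unchanged (nothing in it needs escaping).
$ok = htmlspecialchars($utf8_bytes, ENT_QUOTES, 'UTF-8');

// Invalid UTF-8 makes htmlspecialchars() return an empty string.
$bad = htmlspecialchars($gbk_bytes, ENT_QUOTES, 'UTF-8');

var_dump($ok === $utf8_bytes); // bool(true)
var_dump($bad === '');         // bool(true)
```

So a form built from such a file would show empty values wherever the text is not valid UTF-8, in addition to the warning.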

If one continues with the import operation, the following fatal error occurs while processing the batch job:

An AJAX HTTP error occurred. HTTP Result Code: 500 Debugging information follows. Path: /batch?id=10&op=do StatusText: Service unavailable (with message) ResponseText: PDOException: in drupal_write_record() (line 7023 of /home/bengul/git_repositories/drupal_site/includes/common.inc).

I noticed that the commas in the Chinese parts of the CSV appear to be in a different font from those in the English content. I wonder if this could be the cause? Need to investigate further.

begun’s picture

It turns out that CSV files exported from Thunderbird 14 on Windows 7 have charset=iso-8859-1, whereas CSV files created by Thunderbird on Linux have charset=utf-8. So it looks like the problem is with Thunderbird.

begun’s picture

Status: Active » Needs review

Modified the code so that CSV files containing Chinese characters can be imported (see commit 1d39532). The problem is that it still does not allow importing of other Asian languages. Need to look into this further.

begun’s picture

Status: Needs review » Needs work

Still not working properly. It runs into problems with some mixed-language files, and it does not work with Korean.

begun’s picture

Status: Needs work » Fixed

It appears that there is no foolproof way of reliably detecting different encoding types. For this reason I have decided to put the onus on the client side (the user) to provide a correctly encoded file. To that end I have added file validation to confirm that the uploaded file is valid UTF-8 (see commit 0d06c0d).
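
The kind of check described above can be sketched as follows. This is not the code from commit 0d06c0d; the function names are hypothetical, and the sketch assumes the mbstring extension is available. It also shows why guessing is hard: the same bytes that fail the UTF-8 check are accepted as ISO-8859-1, since every byte value is valid in that encoding.

```php
<?php
/**
 * Hypothetical helper: TRUE if $data is well-formed UTF-8.
 */
function example_is_valid_utf8($data) {
  return mb_check_encoding($data, 'UTF-8');
}

/**
 * Hypothetical helper: validates the contents of an uploaded CSV file.
 * Returns FALSE if the file cannot be read or is not valid UTF-8.
 */
function example_validate_csv_file($path) {
  $data = @file_get_contents($path);
  if ($data === FALSE) {
    return FALSE;
  }
  return example_is_valid_utf8($data);
}

// "中文" as UTF-8, and as GBK (an assumed example of a non-UTF-8 encoding).
$utf8_bytes = "\xE4\xB8\xAD\xE6\x96\x87";
$gbk_bytes = "\xD6\xD0\xCE\xC4";

var_dump(example_is_valid_utf8($utf8_bytes)); // bool(true)
var_dump(example_is_valid_utf8($gbk_bytes));  // bool(false)

// The GBK bytes are also "valid" ISO-8859-1, so validity alone cannot
// identify which non-UTF-8 encoding a file really uses.
var_dump(mb_check_encoding($gbk_bytes, 'ISO-8859-1')); // bool(true)
```

Drupal 7 also ships drupal_validate_utf8(), which performs a comparable check without requiring mbstring.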

If the program that produces the CSV file does not produce a valid UTF-8 encoded file (e.g. the Windows build of Mozilla Thunderbird), then the user must convert it. This can usually be done by opening the CSV file in a text editor and re-saving it as UTF-8. After doing this, one should be able to import it.
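
For users who prefer a script to a text editor, the conversion can be sketched in PHP as below. This is only an illustration and not part of the module: the function name is hypothetical, mbstring is assumed to be available, and the caller must already know the file's real source encoding, since it cannot be detected reliably. ISO-8859-1 is used in the example purely because its byte mapping is simple; a Chinese export would need whatever encoding the exporting program actually wrote.

```php
<?php
/**
 * Hypothetical helper: converts $data to UTF-8 from a known source encoding.
 * Data that is already valid UTF-8 is returned unchanged.
 */
function example_to_utf8($data, $source_encoding) {
  if (mb_check_encoding($data, 'UTF-8')) {
    return $data;
  }
  return mb_convert_encoding($data, 'UTF-8', $source_encoding);
}

// "café" in ISO-8859-1: the final byte 0xE9 is not valid UTF-8 on its own.
$latin1 = "caf\xE9";
$converted = example_to_utf8($latin1, 'ISO-8859-1');

var_dump($converted === "caf\xC3\xA9");               // bool(true)
var_dump(mb_check_encoding($converted, 'UTF-8'));     // bool(true)
```

Passing the wrong source encoding still yields valid UTF-8, but with the wrong characters, so the converted file should be checked by eye before importing.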

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.