Closed (fixed)
Project:
Namecards
Version:
7.x-1.x-dev
Component:
Code
Priority:
Major
Category:
Bug report
Assigned:
Unassigned
Reporter:
Created:
24 Feb 2012 at 05:11 UTC
Updated:
6 Aug 2012 at 04:01 UTC
There is a problem when reading data from CSV files containing multibyte characters (tested with Chinese text). I had previously encountered this problem and fixed it by setting the locale in PHP to UTF-8 using setlocale(). However, for some reason the problem has returned. Need to investigate further.
Comments
Comment #1
begun commented: I just tested it with a CSV file containing mixed Chinese and English, created by the Linux version of Thunderbird 13. Everything appears to work correctly. Need to confirm whether the problem is Windows or Outlook specific.
Comment #2
begun commented: Tested with a CSV file containing mixed English and Chinese characters, exported from the Windows version of Thunderbird 9.0.1. It produced the following error:
"...Warning: htmlspecialchars(): Invalid multibyte sequence in argument in htmlspecialchars() (line 1572 of /home/bengul/git_repositories/drupal_site/includes/bootstrap.inc). => ..."
With the following backtrace:
10: htmlspecialchars() (Array, 1 element)
9: check_plain() (Array, 2 elements)
8: namecards_import_select_contacts_form() (Array, 2 elements)
7: call_user_func_array() (Array, 1 element)
6: drupal_retrieve_form() (Array, 2 elements)
5: drupal_build_form() (Array, 2 elements)
4: drupal_get_form() (Array, 2 elements)
3: namecards_import_select_contacts() (Array, 2 elements)
2: call_user_func_array() (Array, 1 element)
1: menu_execute_active_handler() (Array, 2 elements)
0: main() (Array, 2 elements)
If one continues with the import operation, the following fatal error occurs when processing the batch job:
An AJAX HTTP error occurred. HTTP Result Code: 500 Debugging information follows. Path: /batch?id=10&op=do StatusText: Service unavailable (with message) ResponseText: PDOException: in drupal_write_record() (line 7023 of /home/bengul/git_repositories/drupal_site/includes/common.inc).
I noticed that the commas in the Chinese parts of the CSV appear to be a different font to that of the English content. I wonder if this could be the cause? Need to investigate further.
Comment #3
begun commented: It turns out that CSV files exported from Thunderbird 14 on Windows 7 have charset=iso-8859-1, whereas CSV files created by Thunderbird on Linux have charset=utf-8. So it looks like the problem is with Thunderbird.
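A short sketch of why such a file trips up UTF-8 handling. Python is used here purely for illustration; it is not the module's PHP code. It assumes, hypothetically, that the Windows export holds bytes in a legacy Chinese encoding such as GBK; the thread does not say which encoding was actually used. The same bytes decode without error as ISO-8859-1 (every byte value maps to some character there, which may be why the file is reported with that charset), but they are not valid UTF-8.

```python
# Assumption: GBK stands in for whatever legacy encoding the Windows export used.
raw = "中文".encode("gbk")

# ISO-8859-1 decoding never fails, but the result is mojibake, not Chinese.
as_latin1 = raw.decode("iso-8859-1")

# Decoding with the encoding that actually produced the bytes recovers the text.
as_gbk = raw.decode("gbk")

# The same bytes are rejected as UTF-8, comparable to the
# "Invalid multibyte sequence" warning quoted in comment #2.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(as_latin1), repr(as_gbk), utf8_ok)
```

Because the bytes are acceptable under more than one single-byte or double-byte encoding, the byte content alone does not reveal which encoding the author intended.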
Comment #4
begun commented: Modified the code so that CSV files containing Chinese characters can be imported (see commit 1d39532). The problem is that it still does not allow importing of other Asian languages. Need to look into this further.
Comment #5
begun commented: Still not working properly. Runs into problems with some mixed language files. Also not working with Korean.
Comment #6
begun commented: It appears that there is no foolproof way of reliably detecting the encoding of a file. For this reason I have decided to put the onus on the client side/user to provide a correctly encoded file. To that end I have added file validation to confirm that the uploaded file is valid UTF-8 (see commit 0d06c0d).
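The validation idea can be sketched roughly as follows. This is not the code from commit 0d06c0d (the module is written in PHP); it is a minimal Python illustration, and the function name `is_valid_utf8` and the sample rows are made up for the example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical CSV rows: the same text saved in two different encodings.
utf8_row = "中文,Chinese\n".encode("utf-8")
gbk_row = "中文,Chinese\n".encode("gbk")

print(is_valid_utf8(utf8_row))  # accepted
print(is_valid_utf8(gbk_row))   # rejected: not valid UTF-8
```

Rejecting the upload at this point avoids guessing the encoding and keeps invalid byte sequences away from later steps such as output escaping and database writes.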
If the program which produces the CSV file does not produce a valid UTF-8 encoded file (e.g. Mozilla Thunderbird - Windows Build), then the user must convert it. This can usually be done by opening the CSV file in a text editor and then re-saving the file as UTF-8. After doing this one should be able to import it.
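For users who prefer a script to a text editor, the re-saving step might look like the sketch below. It assumes the user already knows the source encoding of the file; `resave_as_utf8`, the file names, and the choice of "gbk" are all hypothetical and only serve the example.

```python
import tempfile
from pathlib import Path


def resave_as_utf8(src: Path, dst: Path, source_encoding: str) -> None:
    """Decode src using source_encoding and write the same text to dst as UTF-8."""
    # Work on raw bytes so that line endings are preserved exactly.
    text = src.read_bytes().decode(source_encoding)
    dst.write_bytes(text.encode("utf-8"))


# Demonstration with a throwaway file; "gbk" is an assumed source encoding.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "contacts.csv"
    converted = Path(tmp) / "contacts-utf8.csv"
    original.write_bytes("姓名,Email\r\n张伟,zhang@example.com\r\n".encode("gbk"))

    resave_as_utf8(original, converted, "gbk")
    converted_text = converted.read_bytes().decode("utf-8")

print(converted_text)
```

If the wrong source encoding is given, the decode step either raises an error or silently produces garbled text, so the converted file should be checked by eye before importing.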