Subj + minor fixes to FeedsSource.inc so that processing EMPTY parser results won't break the batch operation (it was before).

Functions:
* CSV parser configuration page and Importer start screen:
- source file encoding selection
- checkbox to check source encoding
* Parser
- converts encoding of source data
- checks source encoding and if it differs - throwing an exception

Screenshot:

Requires: mbstring PHP extension

Files: 
CommentFileSizeAuthor
#97 feeds-encoding_support_CSV-1428272-97.patch5.16 KBNiremizov
#82 feeds-encoding_support_CSV-1428272-82.patch5.21 KBacouch
PASSED: [[SimpleTest]]: [MySQL] 6,191 pass(es).
[ View ]
#52 feeds-encoding_support_CSV-1428272-52.patch5.07 KBeosrei
PASSED: [[SimpleTest]]: [MySQL] 4,355 pass(es).
[ View ]
#50 feeds-encoding_support_CSV-1428272-50.patch6.43 KBeosrei
PASSED: [[SimpleTest]]: [MySQL] 4,355 pass(es).
[ View ]
#45 feeds-patch.png18.07 KBmsti
#41 feeds-encoding_support_CSV-1428272-41.patch6.89 KBliquidcms
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch feeds-encoding_support_CSV-1428272-41.patch. Unable to apply patch. See the log in the details link for more information.
[ View ]
#37 n4.txt14 bytesspuki
#35 feeds-encoding_support_CSV-1428272-35.patch6.89 KBJerenus
PASSED: [[SimpleTest]]: [MySQL] 3,843 pass(es).
[ View ]
#34 feeds-encoding_support_CSV-1428272-34.patch9.42 KBJerenus
PASSED: [[SimpleTest]]: [MySQL] 3,845 pass(es).
[ View ]
#30 price_win1251.csv_.txt143.25 KBStan Shevchuk
#27 feeds-add-encoding-support-1428272-0.patch6.94 KBOnkelTem
PASSED: [[SimpleTest]]: [MySQL] 3,849 pass(es).
[ View ]
#13 adding_encoding_conversion_support-1428272-12.patch10.19 KBderhasi
FAILED: [[SimpleTest]]: [MySQL] 3,737 pass(es), 51 fail(s), and 2 exception(s).
[ View ]
#6 adding_encoding_conversion_support_2.patch9.83 KBOnkelTem
FAILED: [[SimpleTest]]: [MySQL] 3,737 pass(es), 51 fail(s), and 2 exception(s).
[ View ]
#3 adding_encoding_conversion_support.patch5.4 KBOnkelTem
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch adding_encoding_conversion_support_0.patch. Unable to apply patch. See the log in the details link for more information.
[ View ]
#1 Edit importer: Товары из бухгалтерии | ФОР-Дубна.png86.26 KBOnkelTem
adding_encoding_conversion_support.patch6.26 KBOnkelTem
PASSED: [[SimpleTest]]: [MySQL] 3,845 pass(es).
[ View ]

Comments

OnkelTem’s picture

Issue summary:View changes

asd

OnkelTem’s picture

Issue summary:View changes

asd

OnkelTem’s picture

Issue summary:View changes

asd

OnkelTem’s picture

Issue summary:View changes

asd

OnkelTem’s picture

OnkelTem’s picture

Issue summary:View changes

asd

dmstru’s picture

Hi!

Cool features. I need this - I must check it.

Rus:

Пацаны, ваще ребята!
Молодцы, четко, могёте, умеете!

Пошел тестировать.

с ув., Алексей

OnkelTem’s picture

Issue summary:View changes

asd

OnkelTem’s picture

Issue summary:View changes

Note

OnkelTem’s picture

StatusFileSize
new5.4 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch adding_encoding_conversion_support_0.patch. Unable to apply patch. See the log in the details link for more information.
[ View ]

Supplying patch to 7.x-2.x git version.
Since commerce_feeds dev branch haven't been updated to accomodate severe changes in the feeds 2.x git branch yet (getting error message: Class FeedsCommerceProductProcessor contains 2 abstract methods .... ), I had no chance to test the patch.

OnkelTem’s picture

Issue summary:View changes

asd

dmstru’s picture

Version:7.x-2.0-alpha4» 7.x-2.x-dev
Status:Active» Needs review

Hi your patch not work:

:41: trailing whitespace.

:43: trailing whitespace.

:121: trailing whitespace.
$defaults = $this->configDefaults();
:125: trailing whitespace.
'#collapsible' => TRUE,
error: patch failed: plugins/FeedsCSVParser.inc:17
error: plugins/FeedsCSVParser.inc: patch does not apply

I try to use it in last dev version.

ALEX

dmstru’s picture

Please contact with me via mail or skype awa_77.

OnkelTem’s picture

Version:7.x-2.x-dev» 7.x-2.0-alpha4
StatusFileSize
new9.83 KB
FAILED: [[SimpleTest]]: [MySQL] 3,737 pass(es), 51 fail(s), and 2 exception(s).
[ View ]

Fixed patch to 7.x-2.0-alpha4

* Fixed errors in previous (#0) patch: added missed encoding conversion when a cell allocates more then one line.

* REWORKED parser. This is serious change. I believe the original CSV parser did things in a wrong way, treating an arbitrary double quote as a delimiter. IMO, this violates CSV rules, which states, that a field should be [fully] double quoted, when something uncommon happens in it. For example:

;A string with a "double quote in it;

should definitely break parsing process. The correct variant would be:

;"A string with a ""double quote in it";

But current implementation will accept the former and will feed the fields' value with subsequent lines until next double quote.
In my patch it will throw an exception with corresponding message about it, stopping parsing.

p.s. Moving the issue back to alpha4. This is odd, I know, but I lost in versions.

polom’s picture

Hi,

I have import problems related to encodings (my CSV files are is-latin 1 encoded) and as re-encoding them is not my favorite choice I tried to find solutions.
This patch seems a nice deal but I can't apply it. I have Feeds 7.x-2.0-alpha4 but here is what I get :

$ patch < adding_encoding_conversion_support_2.patch
patching file FeedsSource.inc
Hunk #1 FAILED at 343.
1 out of 1 hunk FAILED -- saving rejects to file FeedsSource.inc.rej
patching file ParserCSV.inc
Hunk #1 FAILED at 74.
Hunk #2 FAILED at 92.
Hunk #3 FAILED at 194.
Hunk #4 FAILED at 232.
Hunk #5 FAILED at 258.
Hunk #6 FAILED at 320.
6 out of 6 hunks FAILED -- saving rejects to file ParserCSV.inc.rej
patching file FeedsCSVParser.inc
Hunk #1 FAILED at 17.
Hunk #2 FAILED at 101.
Hunk #3 FAILED at 145.
Hunk #4 FAILED at 154.
Hunk #5 FAILED at 180.
5 out of 5 hunks FAILED -- saving rejects to file FeedsCSVParser.inc.rej

Am I missing a point here ?

OnkelTem’s picture

Does patch -p1 < patchfile work for you?

emackn’s picture

Status:Needs review» Needs work
polom’s picture

yes it does :

$ patch -p1 < adding_encoding_conversion_support_2.patch
patching file includes/FeedsSource.inc
patching file libraries/ParserCSV.inc
patching file plugins/FeedsCSVParser.inc

Unfortunately my iso-latin encoded CSV file could not be imported and I had to convert it.

OnkelTem’s picture

@polom

Would you send me an example file?
aneganov at gmail d0t com

derhasi’s picture

Status:Needs work» Needs review

The patch works for me.

Cleaned it up for coding styles and renamed it, so the issue is referenced. (@see).

@polom, did you set the "Source file encoding" in the "Settings for CSV Parser"?

derhasi’s picture

StatusFileSize
new10.19 KB
FAILED: [[SimpleTest]]: [MySQL] 3,737 pass(es), 51 fail(s), and 2 exception(s).
[ View ]

*arg*, sorry forgot to attach the patch

emackn’s picture

Version:7.x-2.0-alpha4» 7.x-2.x-dev
Status:Needs review» Needs work

do you have a test for this?

derhasi’s picture

OnkelTem,could you provide a test for that?

OnkelTem’s picture

@derhasi, I would gladly make one, if I were know how to do that :) Really, I never created unit tests before.

Yuri’s picture

there is a patch in the issue summary, and in #13.
Which one to use? I applied #13 and update.php, nothing changed so far, I still have error
SQLSTATE[HY000]: General error: 1366 Incorrect string value

OnkelTem’s picture

Please, provide a copy of example data file which produces the error.

Yuri’s picture

Category:feature» bug

In my case, I use feeds, mailhandler and mail comment modules (latest devs in d7)
The following error messages appear onhttp://www.exampledomain.com/import/mailhandler_comments

Mailbox mail@exampledomain.com was checked and contained 1 messages.
Warning message SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xA0
<...' for column 'comment_body_value' at row 1
Error message SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xA0
<...' for column 'message' at row 1

The drupal error log shows:

PDOException: in field_sql_storage_field_storage_write() (line 448 of /home/gezond/public_html/modules/field/modules/field_sql_storage/field_sql_storage.module).

and the feeds log shows:

SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xA0asdf ...' for column 'message' at row 1
Yuri’s picture

The data imported is a general Gmail message like 'hello' (of which I have set utf8 for outgoing messages, according to http://support.google.com/mail/bin/answer.py?hl=en&answer=22841

OnkelTem’s picture

Are you mails come in CSV format?

OnkelTem’s picture

Status:Needs work» Needs review
xaqrox’s picture

Seems to work really well for me. Don't know the test framework very well either or I'd be right on it. I found a bunch of other issues which I believe this solves, which I closed as duplicate. Hope that's not too presumptuous.
#1605628: Encoding problems while importing from CSV
#1471950: CSV Parser: Check and Convert the Encoding
#1220606: Add support for encoding conversions for any parser
#1319142: csv in ISO 8859-15/EURO

twistor’s picture

Category:bug» feature
OnkelTem’s picture

@twistor
We probably need to split this issue, since #6 is reporting about serious bug in parser.
What do you think?

twistor’s picture

Yes, these are very separate issues.

OnkelTem’s picture

StatusFileSize
new6.94 KB
PASSED: [[SimpleTest]]: [MySQL] 3,849 pass(es).
[ View ]

Separating encoding support from the rest.

OnkelTem’s picture

twistor’s picture

Status:Needs review» Needs work
+++ b/libraries/ParserCSV.incundefined
@@ -324,4 +342,37 @@ class ParserCSV {
+      if (function_exists('mb_convert_encoding')) {

This should use extension_loaded(). Move this check to the top of the function, don't check twice.

+++ b/plugins/FeedsCSVParser.incundefined
@@ -185,6 +191,40 @@ class FeedsCSVParser extends FeedsParser {
+    $form['encoding'] = array(

This whole fieldset should be conditional on the mb library.

+++ b/plugins/FeedsCSVParser.incundefined
@@ -185,6 +191,40 @@ class FeedsCSVParser extends FeedsParser {
+    if (function_exists('mb_list_encodings')) {

Should use extension_loaded().

Could you please provide an example CSV file that needs this functionality?

Stan Shevchuk’s picture

StatusFileSize
new143.25 KB

Hi twistor,

I'm attaching a sample CSV file that requires conversion. Any chance to see the patch commited to dev?

mato’s picture

Hi,
I need encoding conversions from Spanish characters: which one is the patch to use against the current alpha7 release?
Thanks

Dubs’s picture

Thanks so much for your time on this essential module!

Please can this be committed - this would be so useful as I feel pain at the moment every time I have to import Euro characters!

imclean’s picture

To the last 3 posters, see #29. The patch still needs work before it can be committed.

Jerenus’s picture

StatusFileSize
new9.42 KB
PASSED: [[SimpleTest]]: [MySQL] 3,845 pass(es).
[ View ]

Here is the patch based on all the previous. And I made some modifications to let it work.

Jerenus’s picture

StatusFileSize
new6.89 KB
PASSED: [[SimpleTest]]: [MySQL] 3,843 pass(es).
[ View ]

The final one.

pcambra’s picture

Status:Needs work» Needs review
spuki’s picture

StatusFileSize
new14 bytes

It's not working for me. File encoding is Windows-1251, but it was recognized as CP936. So I just commented lines with encoding checking.

<?php
     
//$encode_array = array('ASCII', 'UTF-8', 'GBK', 'GB2312', 'BIG5');
      //$this->encoding = mb_detect_encoding($data, $encode_array);
           
      // Convert encoding if needed
     
if ($this->from_encoding != $this->to_encoding) {
         
$data = mb_convert_encoding($data, $this->to_encoding, $this->from_encoding);
      }
?>
bjcone’s picture

I downloaded the patch and it seems to be working to check the encoding of individual lines. However, I ran into a case where the file I was attempting to import was encoded such that fgets() did not recognize the end of line character and returned the full contents of the file. At this point it is too late for this patch to help.

What I found to resolve this issue and enable feeds to import my ASCII-encoded file was ini_set("auto_detect_line_endings", true). See the bottom of the page at http://www.php.net/manual/en/filesystem.configuration.php#ini.auto-detec... for more information. It seems as though PHP, without this setting, does not separate lines which only end in \r (Mac Carriage Return character).

Hope this is useful to someone else as it took me a while to find.

liquidcms’s picture

i have issue similar to #37. i saved a file from Excel (2007) as .csv. the file has 5 lines, 3 of which have odd non-ASCII dashes. when hitting the detect encoding line in the patch; as spuki states, those 3 lines are detected as CP936.

NOTE. comment in #37 about commenting out those lines doesn't do anything; mb_detect_encoding doesn't modify the data it simply determines the encoding. with or without the array of encoding types listed it still determines these lines are CP936 and the part of the code which then does the encode still runs (that check is simply to make sure we don't re-encode data that is already UTF-8.

so the only issue (and maybe not fixable) is that to convert those dashes in CP936 to UTF-8 the dash is simply removed - this is not a great solution; but without the patch and the encoding conversion Feeds import fails on these characters.

ALSO - as noted in #1140194: SQLSTATE[HY000]: General error: 1366 Incorrect string value for a field with accents if i open the Excel saved CSV into Notepad and then re-save as UTF-8, the dashes import correctly (with or without this patch; although i agree would be nice if Feeds could handle this).

liquidcms’s picture

turns out losing characters when using PHP to convert from CP936 to UTF-8 is a bug in PHP 5.3 and has been fixed in PHP 5.4.0 (https://bugs.php.net/bug.php?id=60306)

liquidcms’s picture

StatusFileSize
new6.89 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch feeds-encoding_support_CSV-1428272-41.patch. Unable to apply patch. See the log in the details link for more information.
[ View ]

btw, patch in #35 does not adhere to drupal coding standards. i think the attached is somewhat closer

Status:Needs review» Needs work

The last submitted patch, feeds-encoding_support_CSV-1428272-41.patch, failed testing.

Jerenus’s picture

Status:Needs work» Needs review
msti’s picture

#35 works for me

Thanks!

msti’s picture

StatusFileSize
new18.07 KB

screenshot

Summit’s picture

Status:Needs review» Reviewed & tested by the community

Hi, for me too! Setting this to RTBC ok?
Greetings, Martijn

meSte’s picture

Just to be clear: is #35 the patch that will be committed?

twistor’s picture

Status:Reviewed & tested by the community» Needs work

As was already pointed out, the patch in #35 breaks Drupal coding standards.

Honestly the to, from, check encoding business seems too complicated. Feeds is complicated enough. Is there a valid reason why we can't just automatically detect the encoding, and use that without any user interaction?

Bobík’s picture

twistor: Because sometimes the automatic detection is not reliable. It's nice to have auto detection, but not only this.

Btw, if you think Feeds is too complicated too, see this project.

eosrei’s picture

Status:Needs work» Needs review
StatusFileSize
new6.43 KB
PASSED: [[SimpleTest]]: [MySQL] 4,355 pass(es).
[ View ]

Here is #35 plus the coding standard changes from #41. Should pass tests. Works great for me to import international characters. Thanks everyone.

eosrei’s picture

Issue summary:View changes

asd

Jerenus’s picture

Are we have the conclusion that the demand of user interaction?

eosrei’s picture

StatusFileSize
new5.07 KB
PASSED: [[SimpleTest]]: [MySQL] 4,355 pass(es).
[ View ]

I've refactored this a bit:

  • UTF-8 conversion is forced to stop the "SQLSTATE[HY000]: General error: 1366 Incorrect string value" error.
  • Removed the optional "Check encoding" checkbox.
  • Made the encoding convert function fail if the encoding detection fails
  • Removed the encoding check function (it seems extraneous?)
  • Removed the exception if fixEncoding() cannot locate mbstring. I left the notice on the import form, maybe should be in install requirements?
  • Made $detected_encoding a local variable.
  • Added additional comments.
  • Reduced the overall amount of code changes.

It needs simpletests and possibly more refactoring to fit better within the current code architecture.

johan2’s picture

Hi, I installed the latest patch 52. I have a csv with terms in a column with a special character (ë). When I import the csv (utf-8) the term is not recognized. The select term field stays empty. I am struggling for a week with this. The only thing that works is to make a fake term without special character and try to change it by deleting the term and try to merge it with the intended taxonomy term with the special character or changing the term name once it is imported.

eosrei’s picture

@johan2 Have you tried any other encoding options? Windows‑1252 worked for my imports.

johan2’s picture

I am working on a power mac with excel 2001. My best setting to save from excel is -windows csv-. With Smultron and Textmate I controlled the file, and it shows no problems. A text is imported fine to a Text-field except when it has to go to a Term-field with special characters then it goes wrong even when changing the "ë" to "ë".
I suspected also the file... is it really utf-8 ? I found more issues concerning this.

johan2’s picture

Component:Feeds Import» Code

It must have something to do with the taxonomy. When I export a Term to csv from the site it will export the "ë" to "ë" . I kow that my excel is not the newest but through text-edit or smultron or textmate the text can be exported to utf-8, even changing the "ë" manually. Only for the Term the issue stays unsolved other text-fields have no problems and import correctly.

johan2’s picture

What also happens is if you adapt the same Term, changing "ë" to "e" in both taxonomy and csv-file, the import can go wrong because something stays in memory. I don't know what it is but flush cache is not enough or updating aliases. Changing abrupt to another Term that was in the taxonomy before works... So once it goes wrong you are started for a lot of worries since you are not sure what to do.

johan2’s picture

I have some more information about what happens. It seems that a great deal of the problem can be caused by the uploading process. When creating new terms is allowed (because otherwise the term stays empty) on import then the csv file creates for the same collumn with the same term several times a new term and in that process it ends up with ? ? ? around each character. So many duplicates are created. I installed the merge module 7-1 and here I noticed this behaviour. Before I had merge 7-2dev and the replace module, I desinstalled them. But also good news is that the special characters are passing.

eosrei’s picture

But also good news is that the special characters are passing.

So, do you mean this feature patch is working correctly?

johan2’s picture

I did some more testing and ended up installing Openoffice. Some of the characters in excel didn't export correctly. This was not obvious to find out. In the Openoffice Calc (spreadsheet) I noticed that some characters had ? around them. So this was also the case in the import with Feeds. Once I deleted these and exported the sheet again to csv the Feeds-import worked. The problem was that these ? were invisible in a texteditor or textmate. To import excel into Openoffice goes through a wizard and the software is free, so for me this is solved ;-)

johan2’s picture

Thanks for the great module. Just imported over 10.000 records with all kinds of data inline: title, number (mixed 1a, 1.1, 2, etc) ,category, subcategory, dates, description, certificate number ... And in not more then a few minutes everything was in my views without problems. So thanks again!!

henkit’s picture

Hi,
Just installed #52, working like a charm now! :) Thanx so much!!!
Henk

Summit’s picture

Status:Needs review» Reviewed & tested by the community

Yep https://drupal.org/node/1428272#comment-7349918 worked!
Setting this to RTBC ok?
greetings, Martijn

franz’s picture

Issue tags:+Needs tests

Thanks for all the work!

I'd love to have a test for it though.

sphankin’s picture

Hi,

I've been getting the SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\x92t message in my logs when feeds is trying to process a mail message from mail handler.

I've worked out that in the text the word don't is stopping it from working properly because of the'.

I've tried implementing the patch as suggested above but that still doesn't seem to work.

Also I'm not sure where the GUI as printscreened above is to change the source file encoding?

Thanks in advance.

msti’s picture

@sphankin The screenshot above shows the import form located at import/name_of_importer
If you dont see the options in the screenshot, the patch is not applied properly. Here is how to apply patches: https://drupal.org/patch/apply

sphankin’s picture

Hi @msti, thanks for the reply. I've followed the patch instructions and I still can't see anything/it isn't fixing the issue. As far as I can tell (looking through the patch) the patch is being applied properly. Would it make a difference if I'm importing an email instead of a CSV file? Thanks.

eosrei’s picture

@spankin: this is only for the CSV parser.

I'll be able to look into writing a simpletest later this month.

sphankin’s picture

@eosrei - thank you, that would be great!

sphankin’s picture

@eosrei - Hi. Any development on the email parser?

Thanks,

Sam

eosrei’s picture

@sphankin: This patch is only for the CSV parser. I don't have time/need to work on the email parser. Please create a separate feature request issue for email functionality. I still hope to be able to implement the needed simpletests for this patch, but it is feature complete as is.

franz’s picture

I used this patch with success on a project. I'm seriously just waiting for tests to commit it.

twistor’s picture

Status:Reviewed & tested by the community» Needs work

What franz said, needs tests.

Rosamunda’s picture

#52 WORKED FOR ME! THANKS!!!

acouch’s picture

@franz or @twistor, could a test just involve adding international characters to one of the csv files in feeds or should it have its own test?

franz’s picture

Ideally, we should have an individual assertion that verifies if the code works well with the encoding. What matters most is to make sure the tests appropriately cover the possibilities and provide an easy output in case it fails. That's my take on it at least.

HeathN’s picture

#52 is the way to go. Thanks for this patch. What is the status on getting this into the next build?

acouch’s picture

If someone can provide a file or files that works only with the new conversion that they would be comfortable being added to the project I can write the assertion.

Aimee Degnan’s picture

#52 worked for me. Thank you for the patch!

Summit’s picture

Hi,
Anyone up for the tests? Than I think this can be committed, right?
Greetings, Martijn

franz’s picture

So all we need is a CSV file with non utf-8 encoding to test the feature. If someone can provide, then acouch writes the assertions and we commit.

acouch’s picture

Status:Needs work» Needs review
StatusFileSize
new5.21 KB
PASSED: [[SimpleTest]]: [MySQL] 6,191 pass(es).
[ View ]

I found that the patch in #52 does't allow the encoding settings to be changed on a per-node basis. I added

$form['encoding']['#default_value'] = isset($source_config['encoding']) ? $source_config['encoding'] : $form['encoding']['#default_value'];

to the sourceForm which fixed this. Didn't get a chance to test this, so setting to 'needs review'. Will reset if the test passes and will still wait for a file to write test encodings.

eosrei’s picture

Status:Needs review» Needs work

This should be "Needs Work" until tests are written. I've got three higher priority projects in front of the project needing this, so I cannot work on it.

eosrei’s picture

Issue summary:View changes

Fixing screenshot

BrightBold’s picture

Patch in #52 solved the problem for me as well. (Sorry I didn't test #82 — I was short on time and didn't need the additional functionality it offered so I went for the earlier one with proven success). Wish I could write tests to help get this committed — it's great!

liquidcms’s picture

somewhat related but maybe needs a new issue:

if the column headings have non standard characters (mine have french characters) then that column is skipped on import.

liquidcms’s picture

hmm..as i mentioned in #39 above:

i was getting the import to crash when it would encounter a non-english character. the path in #52 fixed it from crashing; but it is simply removing those characters. that is certainly not a solution.

liquidcms’s picture

ughh!! at a bit of a loss with this... as i mentioned in #86 (and #39) i have the patch from #52 and my import no longer crashes on FR characters; but they get dropped.

I have looked in to the code a bit more and this is making less sense the more i look.

taking the code from the patch i do this in a devel/php window:

<?php
$data
= 'Date entrée';
$data = mb_convert_encoding($data, 'UTF-8', 'UTF-8');
echo
$data;
?>

and the result is: Date entrée

but, when i use a debugger and step through the same code in the feeds function: fixEncoding()

after running mb_convert_string() the character are lost; exactly as occurs when doing the import.

my guess is this has something to do with the devel/php having some html encoding possibly in the mix that the import function doesn't have; but also confused why so many people seem to be having success with the patch in #52.

is it possibly due to using PHP 5.3 instead of PHP 5.4?

liquidcms’s picture

i admit i do not understand all the aspects of dealing with other character sets, but i ran these tests to show if the issue is PHP version dependent.

running a test file via php on command line (Windows)

i set this as my test file:

<?php
echo mb_convert_encoding('Bibliothèque', 'UTF-8', 'UTF-8');
?>

results are:

C:\Program Files (x86)\nusphere\phped\php54>php -f test.php
Biblioth?que
C:\Program Files (x86)\nusphere\phped\php54>cd ../php53
C:\Program Files (x86)\nusphere\phped\php53>php -f test.php
Bibliothquec
C:\Program Files (x86)\nusphere\phped\php53>

i am sure the ? is just a display issue with the command shell; but it is clear that PHP 5.3 and PHP 5.4 act differently

i'll set up PHP 5.4 for the web server and try the CSV import to be sure.

liquidcms’s picture

fixed. in the fixEncoding function i replaced this line:

$data = mb_convert_encoding($data, $this->to_encoding, $detected_encoding);

with this one:

$data = utf8_encode($data);

eosrei’s picture

@liquidcms utf8_encode() only converts a string encoded in ISO-8859-1 to UTF-8. It doesn't cover any other conversions

mb_convert_encoding() converts between any listed encoding and is multi-byte string capable. See: http://www.php.net/manual/en/function.mb-convert-encoding.php

While this code may need some help, utf8_encode is not the answer. Sorry.

liquidcms’s picture

@eosrei thanks for the info.

i do not doubt what you are saying; but utf8_encode does work. :)

also the code in the patch: mb_detect_encoding($data, $this->from_encoding) returns utf8 for my data (not iso-8859-1)

and mb_convert_encoding certainly does not work (although, as i have said, perhaps it is only busted for < php 5.4).

祥子’s picture

Status:Needs work» Needs review
eosrei’s picture

Issue summary:View changes
Status:Needs review» Needs work

Setting issue status back. No need to test until tests are written.

rcodina’s picture

The patch in comment #82 works for me with these php versions:

5.3.10-1ubuntu3.10
5.2.17

I had no need to use "utf8_encode" instead of "mb_convert_encoding". I think "mb_convert_encoding" works really well here. However, I had to specify the exact charset of my CSV file to get it to work (in CSV parser configuration form). The "auto" option doesn't work for me.

I really think this patch must be put on recommended release ASAP because this is a key feature for a lot of Drupal sites in spanish, french, german, etc

Thank you so much!!!

cgdrupalkwk’s picture

#82 worked great for me.

Niremizov’s picture

#82 , #52 - "Check encoding" checkbox is not optional. This checkbox is badly needed, because there is not guarantee that string encoding would be determined properly with mb_detect_encoding() function.

The problem is that mb_detect_encoding() returns the first character encoding sheme that it determines. What this leads to?
Example: File contains multi language strings and uses Windows-1251 (CP1251) or any other character encoding sheme, that is compatible with ASCII (same codes for English chars). If first char in the processing string is English, then mb_detect_encoding($data, 'Windows-1251') - returns FALSE, because it detects ASCII first.

But mb_check_encoding() - works better in this case.

Niremizov’s picture

Attached patch in addition to #96 comment.

Barryvdh’s picture

Patch #82 does seem to fix my import issues with the latest dev version. Any reason this isn't merged yet?

MegaChriz’s picture

@Barryvdh
Yes, a test should be written for this issue. After that it is probably ready for commit. See also comment #81.