When importing data from Windows systems, particularly CSV files, you often run into UTF-8 encoding issues. These can result in some very cryptic SQL errors and force manual search/replace on the source documents.

Having standard plugins to handle

- ISO-8859-1 to UTF-8 conversion
- Stripping non-UTF-8 characters

would be handy.


Comments

mpdonadio’s picture

Status: Active » Needs review
FileSize
1.83 KB

The attached patch creates two simple plugins.

utf8_encode simply calls utf8_encode() on the data to convert ISO-8859-1 to UTF-8.
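For reference, the conversion the plugin performs can be sketched in plain PHP (the string literals here are illustrative, not from the patch):

```php
<?php
// Sketch of the conversion, assuming ISO-8859-1 (Latin-1) input.
$latin1 = "caf\xE9"; // "café" in ISO-8859-1: 0xE9 is é

// utf8_encode() maps each Latin-1 byte to its UTF-8 sequence
// (note: deprecated as of PHP 8.2):
$utf8 = utf8_encode($latin1);

// The more general equivalent, which also handles other source encodings:
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// é is now the two-byte UTF-8 sequence 0xC3 0xA9.
```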

strip_non_utf8 does a preg_replace() using a regex based on the one from http://stackoverflow.com/a/1401716 to remove non-UTF-8 characters.
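The stripping technique can be illustrated with a simplified version of the same idea: match only well-formed UTF-8 byte sequences and discard everything else. The regex below is a condensed sketch, not the exact pattern from the patch or the Stack Overflow answer (for instance, it does not reject the surrogate range):

```php
<?php
// Keep runs of valid UTF-8 byte sequences; drop everything else.
// No /u modifier: we deliberately operate on raw bytes.
$valid_utf8 = '/(?:[\x00-\x7F]'       // ASCII
  . '|[\xC2-\xDF][\x80-\xBF]'         // 2-byte sequences
  . '|\xE0[\xA0-\xBF][\x80-\xBF]'     // 3-byte, excluding overlongs
  . '|[\xE1-\xEF][\x80-\xBF]{2}'      // remaining 3-byte sequences
  . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'  // 4-byte, excluding overlongs
  . '|[\xF1-\xF4][\x80-\xBF]{3})+/';  // remaining 4-byte sequences

$dirty = "ok\xFFbad\xC3\xA9"; // 0xFF can never appear in valid UTF-8
preg_match_all($valid_utf8, $dirty, $matches);
$clean = implode('', $matches[0]); // "okbad" followed by é (0xC3 0xA9)
```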

Tests will follow when I get a chance to write them.

mpdonadio’s picture

Added tests. Would appreciate a second set of eyes on the test sequences.

Status: Needs review » Needs work

The last submitted patch, 2: feeds_tamper-utf8-2263119-02.patch, failed testing.

mpdonadio’s picture

Status: Needs work » Needs review
FileSize
3.19 KB

Well that was embarrassing.

twistor’s picture

Status: Needs review » Needs work
  1. +++ b/plugins/utf8_encode.inc
    @@ -0,0 +1,26 @@
    +  $field = drupal_convert_to_utf8($field, 'ISO-8859-1');
    

    We should give an option to select the source encoding.

  2. +++ b/tests/feeds_tamper_plugins.test
    @@ -1020,3 +1020,43 @@ class FeedsTamperUniqueTestCase extends FeedsTamperUnitTestCase {
    + * Tests for utf8_estrip_non_utf8ncode.inc
    

    What's dat?

Tests! Sweet! We really should provide either a select list, or a textfield for the encoding method. This looks good.
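The configurable source encoding could look something like this (the callback and settings-key names here are hypothetical, not from any patch in this issue):

```php
<?php
// Hypothetical sketch of a tamper callback with a configurable source
// encoding, defaulting to ISO-8859-1 as the current patch hard-codes.
function example_utf8_encode_callback($field, array $settings) {
  $source = !empty($settings['encoding']) ? $settings['encoding'] : 'ISO-8859-1';
  return mb_convert_encoding($field, 'UTF-8', $source);
}

// Latin-1 input: 0xE9 becomes the UTF-8 sequence 0xC3 0xA9.
$result = example_utf8_encode_callback("caf\xE9", array('encoding' => 'ISO-8859-1'));
```

The select list vs. textfield question is then just UI on top of this: either way, the chosen value ends up as the third argument to the conversion call.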

mpdonadio’s picture

Assigned: Unassigned » mpdonadio

1. What encodings should be in the list? The world has mostly converted to UTF-8, and backend support will vary with what is installed on the server, so I'm not sure what the list should contain. ISO-8859-1, Windows-1251 and Windows-1252, and MacRoman should cover most old documents, plus maybe the other three UTF variants? Coming up with test sequences may be tough.

2. That's the effect of presbyopia on a developer :)

I'll update this week and resubmit.

mpdonadio’s picture

Followup to #1. In my experience, the people who are using Feeds to do imports are typically not developers, and don't really know much about encoding problems. I don't think a textfield would be beneficial here, as they wouldn't know what to put. A few choices in a dropdown would allow them to try a few things when they get the dreaded SQL error during an import.

twistor’s picture

It is difficult, but I can see the next issue being, "Add support for arbitrary encoding."

I think we could do both options though. Provide a select list of common encodings, and provide a text field. Label the textfield as "Advanced" or something. We could even do some select-or-other type stuff with states if you wanted to get fancy, but it's not a priority.

Anyone who has the need for this will have already had things exploding on them, so we should be as gentle and helpful as possible.

justindodge’s picture

This is mostly a duplicate of #1817516: Feeds Tamper plugin for converting charset (which does offer a select list to choose the source encoding, but does not have an open text field). This issue also has some unit tests, which is cool and which the other issue is missing; mpdonadio, maybe you could help out by writing some tests for the issue referenced above (we'd need a few more, since it supports more than just ISO-8859-1).

This issue does include a plugin for 'Stripping non UTF-8 characters', so maybe we could trim this issue down to just being about that?