When importing data from Windows systems, particularly CSV files, you often run into UTF-8 encoding issues. These can result in some very cryptic SQL errors and force manual search/replace on the source documents.

Having standard plugins to handle

- ISO-8859-1 to UTF-8 conversion
- Stripping non-UTF-8 characters

would be handy.


Comments

mpdonadio’s picture

Status: Active » Needs review
FileSize
1.83 KB

The attached patch creates two simple plugins.

utf8_encode simply calls utf8_encode() on the data to convert ISO-8859-1 to UTF-8.
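For reference, the conversion the plugin performs can be sketched in plain PHP (the string literals here are illustrative, not from the patch):

```php
<?php
// Sketch of the conversion, assuming ISO-8859-1 (Latin-1) input.
$latin1 = "caf\xE9"; // "café" in ISO-8859-1: 0xE9 is é

// utf8_encode() maps each Latin-1 byte to its UTF-8 sequence
// (note: deprecated as of PHP 8.2):
$utf8 = utf8_encode($latin1);

// The more general equivalent, which also handles other source encodings:
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// é is now the two-byte UTF-8 sequence 0xC3 0xA9.
```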

strip_non_utf8 does a preg_replace() using a regex based on the one from http://stackoverflow.com/a/1401716 to remove non-UTF-8 characters.
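The stripping technique can be illustrated with a simplified version of the same idea: match only well-formed UTF-8 byte sequences and discard everything else. The regex below is a condensed sketch, not the exact pattern from the patch or the Stack Overflow answer (for instance, it does not reject the surrogate range):

```php
<?php
// Keep runs of valid UTF-8 byte sequences; drop everything else.
// No /u modifier: we deliberately operate on raw bytes.
$valid_utf8 = '/(?:[\x00-\x7F]'       // ASCII
  . '|[\xC2-\xDF][\x80-\xBF]'         // 2-byte sequences
  . '|\xE0[\xA0-\xBF][\x80-\xBF]'     // 3-byte, excluding overlongs
  . '|[\xE1-\xEF][\x80-\xBF]{2}'      // remaining 3-byte sequences
  . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'  // 4-byte, excluding overlongs
  . '|[\xF1-\xF4][\x80-\xBF]{3})+/';  // remaining 4-byte sequences

$dirty = "ok\xFFbad\xC3\xA9"; // 0xFF can never appear in valid UTF-8
preg_match_all($valid_utf8, $dirty, $matches);
$clean = implode('', $matches[0]); // "okbad" followed by é (0xC3 0xA9)
```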

Tests will follow when I get a chance to write them.

mpdonadio’s picture

Added tests. Would appreciate a second set of eyes on the test sequences.

Status: Needs review » Needs work

The last submitted patch, 2: feeds_tamper-utf8-2263119-02.patch, failed testing.

mpdonadio’s picture

Status: Needs work » Needs review
FileSize
3.19 KB

Well that was embarrassing.

twistor’s picture

Status: Needs review » Needs work
  1. +++ b/plugins/utf8_encode.inc
    @@ -0,0 +1,26 @@
    +  $field = drupal_convert_to_utf8($field, 'ISO-8859-1');
    

    We should give an option to select the source encoding.

  2. +++ b/tests/feeds_tamper_plugins.test
    @@ -1020,3 +1020,43 @@ class FeedsTamperUniqueTestCase extends FeedsTamperUnitTestCase {
    + * Tests for utf8_estrip_non_utf8ncode.inc
    

    What's dat?

Tests! Sweet! We really should provide either a select list, or a textfield for the encoding method. This looks good.
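The configurable source encoding could look something like this (the callback and settings-key names here are hypothetical, not from any patch in this issue):

```php
<?php
// Hypothetical sketch of a tamper callback with a configurable source
// encoding, defaulting to ISO-8859-1 as the current patch hard-codes.
function example_utf8_encode_callback($field, array $settings) {
  $source = !empty($settings['encoding']) ? $settings['encoding'] : 'ISO-8859-1';
  return mb_convert_encoding($field, 'UTF-8', $source);
}

// Latin-1 input: 0xE9 becomes the UTF-8 sequence 0xC3 0xA9.
$result = example_utf8_encode_callback("caf\xE9", array('encoding' => 'ISO-8859-1'));
```

The select list vs. textfield question is then just UI on top of this: either way, the chosen value ends up as the third argument to the conversion call.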

mpdonadio’s picture

Assigned: Unassigned » mpdonadio

1. What encodings should be in the list? The world has mostly converted to UTF-8, and backend support will vary with what is installed on the server, so I'm not sure what the list should contain. ISO-8859-1, Windows-1251 and Windows-1252, and MacRoman should cover most old documents, plus maybe the other three UTF variants? Coming up with test sequences may be tough.

2. That's the effect of presbyopia on a developer :)

I'll update this week and resubmit.

mpdonadio’s picture

Followup to #1. In my experience, the people who are using Feeds to do imports are typically not developers, and don't really know much about encoding problems. I don't think a textfield would be beneficial here, as they wouldn't know what to put. A few choices in a dropdown would allow them to try a few things when they get the dreaded SQL error during an import.

twistor’s picture

It is difficult, but I can see the next issue being, "Add support for arbitrary encoding."

I think we could do both options though. Provide a select list of common encodings, and provide a text field. Label the textfield as "Advanced" or something. We could even do some select-or-other type stuff with states if you wanted to get fancy, but it's not a priority.

Anyone who has the need for this will have already had things exploding on them, so we should be as gentle and helpful as possible.

justindodge’s picture

This is mostly a duplicate of #1817516: Feeds Tamper plugin for converting charset (which does offer a select list to choose the source encoding, but does not have an open text field). This issue also has some unit tests, which is cool and which the other issue is missing; mpdonadio, maybe you could help out by writing some tests for the issue referenced above (we'd need a few more, since it supports more than just ISO-8859-1).

This issue does include a plugin for 'Stripping non UTF-8 characters', so maybe we could trim this issue down to just being about that?