Hi,
when I try to import tweets, I sometimes get:
DOException: SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xF0\x9F\x92\x97' for column 'body_value' at row 1
After some goggling, it seems, that this is caused by a "invalid" UTF-8 character, any idea how to get rid of this?!
Best regards,
Tobi
| Comment | File | Size | Author |
|---|---|---|---|
| #22 | feeds_jsonpath_parser-four-byte-1824506-22.patch | 5.11 KB | twistor |
| #15 | feeds_jsonpath_parser-four-byte-1824506-15.patch | 4.37 KB | twistor |
| #5 | FeedsJSONPathParser-invalidcharacter-1824506-5.patch | 1.05 KB | neoxavier |
| #3 | incorrect_string_value-1824506.patch | 617 bytes | Jackinloadup |
Comments
Comment #1
Jackinloadup commentedI got this error yesterday and have spent at least 4-6 hours trying to figure it out.
I am trying to load tweets as well.
Here is the relevant portion of my error
Thinking there might be i18n issues with the normal title field i also tried the id a string as the title. Then the following error came up
Here are the things i have learned
I ran these mysql commands to try an see if there was any database collation that wasn't set correctly
I've tried to make sure mysql is setup correctly and everything looks ok but it still acting odd. See links below.
http://stackoverflow.com/a/3513812
http://drupal.org/node/1140194#comment-4858112
http://stackoverflow.com/questions/2615611/utf-8-database-problem
At this point im going to simply strip out these chars until this issue can be figured out.
Comment #2
Jackinloadup commentedMost likely relevant as well
#1140194: SQLSTATE[HY000]: General error: 1366 Incorrect string value for a field with accents
Comment #3
Jackinloadup commentedHere is a stopgap patch to strip out the issues in questions. This is not a solution and SHOULD NOT be committed.
Comment #4
DerTobi75 commentedHi,
tested the patch, kind of works for me! Anyway a real solution would be great, thanks for the work around :-)
Tobi
Comment #5
neoxavier commentedHello,
I had the same problem as well when trying to import tweets. The culprit is the smiley (emoji) in the tweets.
The reason of this error is
The MYSQL's character set named utf8 uses a maximum of three bytes per character and contains only BMP characters. As of MySQL 5.5.3, the utf8mb4 character set uses a maximum of four bytes per character supports
http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html
Here we have a string with a character (emoji / japanese smileys) that needs 4 bytes: i.e. \xF0\x9F\x98\xA4\.. = "\xf0\x9f\x98\xa4" => 'FACE WITH LOOK OF TRIUMPH'
https://github.com/iamcal/php-emoji/blob/master/emoji.php
Solution proposed
1. Strip down all string that need more than 3 bytes character.
2. Use MySQL 5.5.3 or later and change the column encoding from utf8 to utf8mb4. This encoding allows storage of characters that occupy 4 bytes in UTF-8.
Here is tutorial on how to change from UTF-8 to UTF-8MB4
http://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4
at the moment, I just use a quick patch that remove everything except valid letters (from any language), since upgrading the database will take some time.
The patch attached is a modification of patch in comment #3 above.
a fancy way is to check each bytes, but I am not going to do this now.
Comment #6
andylarks commentedHi All
I had the same problem, and fixed like this:
in -> function twitter_status_save($status)
change:
'text' => $status->text,
to:
'text' => preg_replace('/[^(\x20-\x7F)]*/','', $status->text),
...seems to be working nicely
Comment #7
DerTobi75 commented@andylarks
How did you get this working with Twitters API 1.1?!
Regards,
Tobi
Comment #8
purabdk commentedIt issue happen when character set is not in UTF-8 format.
If you got the error while using the db_insert then you can use the following code
$val = mb_check_encoding($val, 'UTF-8') ? $val : utf8_encode($val);
if it is just matter of some field values.
Comment #9
dimitrov.adrian commented#5 works for me
Comment #10
johnstew commented#6 worked for me. Appears to be pulling in the correct information now. Will monitor this and make sure this is a good fix. Either way should probably have a better solution to this other than just removing those characters.
Comment #11
kiliweb commentedAwsome patch neoxavier !
Patch #5 works perfectly for me, with instagram feeds.
You saved me hours ! Thanks.
Comment #12
jaesperanza commentedBefore doing all suggested, do utf8 conversion using notepad++ did the trick, sheesh! this saves you all the trouble and reading! :)
Comment #13
kebz commented@jaesperanza
- your suggestion (#12) worked for me.
I've been trying to create an article on my site for the past few days and it kept giving me this error:
So when I found this page, I was ready to incorporate the patch ... but fortunately, I read your post and made me review my article. And I found this in it "#P3#" ... I had copied and paste from my instagram post and those # are actually square looking boxes and suppose to be hearts... lol --- I couldn't post the characters because Drupal (here) is giving me an error message too... lol
Anyhoo, all is good again...yay!
Thanks =)
Comment #14
achekulaev commented@neoxavier you rock. Patch #5 works perfectly.
Comment #15
twistor commentedStripping the characters isn't the only option. We can convert them into HTML entities.
This patch adds that option, and strips by default.
The regex in #5 will strip too many things, and requires that pcre is compiled with unicode support.
Comment #17
twistor commentedComment #19
joelpittetAdding related core issues for everybody's reference on whats planned:
#2488180: Support full UTF-8 (emojis, Asian symbols, mathematical symbols) on MySQL and other database drivers when they are configured to allow it or #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols)
Comment #22
twistor commentedComment #24
twistor commentedComment #26
amaisano commentedThe latest version fixes this by stripping out those characters. However, I tried using the checkbox for "Convert four byte characters," but when that is checked, I get the same failures when importing tweets. Is that convert feature fully functional? I expected it to import the feed item, but it wasn't able to and gave me the following error:
SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xF0\x9F\x99\x80-e...' for column 'alias' at row 1Which is the same as the original issue. Let me know if I should open this up as a new bug instead.
So it looks like it IS getting converted to an HTML entity (http://unicode-table.com/en/1F640/) but is still failing to create the node due to this error.
Here's a report of this item error from the log:
Comment #27
twistor commentedPlease create a new issue. I just tested your feed, and it worked fine.
Comment #28
khoomy commented@neoxavier #5 worked for me.
Cheers
Comment #29
jan kellermann commentedPatch did not work for me. We added a new regexp-Rule in tamper with this regexp:
/(([\xE0-\xEF][\x00-\xFF][\x00-\xFF])|([\xF0-\xF4][\x00-\xFF][\x00-\xFF][\x00-\xFF]))/
replace by nothing.
Comment #30
d34dman commentedIn my opinion the module could declare a dependency on strip_utf8mb4, and use
_strip_utf8mb4_for_text_fields("<YOUR STRING HERE>", variable_get('strip_utf8mb4_replace_string', '--'));.The discussion regarding what to strip and how it should be replaced should be done under strip_utf8mb4 module issue queue. This will be beneficial for both community and reduce duplicate effort.
Comment #31
twistor commented@jan kellermann, Can you provide a specific example that fails?
@D34dMan, that's not an option. For one, this module is in bug fix only mode. Two, strip_utf8mb4 does way too much as it just strips 4 byte characters from any field. That would break for people who have converted mysql fields to allow 4 byte characters.
Comment #32
sanjayk commentedI was face same issue. I was used below code in custom module.
function custom_module_name_node_presave($node) {
if ($node->type == 'node_type') {
$node->body[LANGUAGE_NONE][0]['value'] = utf8_encode($node->body[LANGUAGE_NONE][0]['value']);
}
}