Hi,

when I try to import tweets, I sometimes get:

DOException: SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xF0\x9F\x92\x97' for column 'body_value' at row 1

After some goggling, it seems, that this is caused by a "invalid" UTF-8 character, any idea how to get rid of this?!

Best regards,

Tobi

Comments

Jackinloadup’s picture

I got this error yesterday and have spent at least 4-6 hours trying to figure it out.

I am trying to load tweets as well.

Here is the relevant portion of my error

PDOException: SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xF0\x9F\x98\xAD' for column 'title' at row 1: INSERT INTO {node}

Thinking there might be i18n issues with the normal title field i also tried the id a string as the title. Then the following error came up

PDOException: SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xF0\x9F\x98\xAD' for column 'body_value' at row 1: INSERT INTO {field_data_body} 

Here are the things i have learned

I ran these mysql commands to try an see if there was any database collation that wasn't set correctly

  • show full columns from --TABLENAME--;
  • show variables like 'char%';
  • show table status where collation like '%utf8%'
  • show table status where collation like '%latin%'

I've tried to make sure mysql is setup correctly and everything looks ok but it still acting odd. See links below.
http://stackoverflow.com/a/3513812
http://drupal.org/node/1140194#comment-4858112
http://stackoverflow.com/questions/2615611/utf-8-database-problem

At this point im going to simply strip out these chars until this issue can be figured out.

Jackinloadup’s picture

Jackinloadup’s picture

StatusFileSize
new617 bytes

Here is a stopgap patch to strip out the issues in questions. This is not a solution and SHOULD NOT be committed.

DerTobi75’s picture

Hi,

tested the patch, kind of works for me! Anyway a real solution would be great, thanks for the work around :-)

Tobi

neoxavier’s picture

Hello,

I had the same problem as well when trying to import tweets. The culprit is the smiley (emoji) in the tweets.

The reason of this error is

The MYSQL's character set named utf8 uses a maximum of three bytes per character and contains only BMP characters. As of MySQL 5.5.3, the utf8mb4 character set uses a maximum of four bytes per character supports

http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html

Here we have a string with a character (emoji / japanese smileys) that needs 4 bytes: i.e. \xF0\x9F\x98\xA4\.. = "\xf0\x9f\x98\xa4" => 'FACE WITH LOOK OF TRIUMPH'
https://github.com/iamcal/php-emoji/blob/master/emoji.php

Solution proposed

1. Strip down all string that need more than 3 bytes character.

2. Use MySQL 5.5.3 or later and change the column encoding from utf8 to utf8mb4. This encoding allows storage of characters that occupy 4 bytes in UTF-8.

Here is tutorial on how to change from UTF-8 to UTF-8MB4
http://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4

at the moment, I just use a quick patch that remove everything except valid letters (from any language), since upgrading the database will take some time.

The patch attached is a modification of patch in comment #3 above.

a fancy way is to check each bytes, but I am not going to do this now.

andylarks’s picture

Hi All

I had the same problem, and fixed like this:

in -> function twitter_status_save($status)

change:
'text' => $status->text,

to:
'text' => preg_replace('/[^(\x20-\x7F)]*/','', $status->text),

...seems to be working nicely

DerTobi75’s picture

@andylarks

How did you get this working with Twitters API 1.1?!

Regards,

Tobi

purabdk’s picture

It issue happen when character set is not in UTF-8 format.
If you got the error while using the db_insert then you can use the following code

$val = mb_check_encoding($val, 'UTF-8') ? $val : utf8_encode($val);

if it is just matter of some field values.

dimitrov.adrian’s picture

Issue summary: View changes

#5 works for me

johnstew’s picture

#6 worked for me. Appears to be pulling in the correct information now. Will monitor this and make sure this is a good fix. Either way should probably have a better solution to this other than just removing those characters.

kiliweb’s picture

Status: Active » Reviewed & tested by the community

Awsome patch neoxavier !
Patch #5 works perfectly for me, with instagram feeds.

You saved me hours ! Thanks.

jaesperanza’s picture

Before doing all suggested, do utf8 conversion using notepad++ did the trick, sheesh! this saves you all the trouble and reading! :)

kebz’s picture

@jaesperanza
- your suggestion (#12) worked for me.

I've been trying to create an article on my site for the past few days and it kept giving me this error:

PDOException: SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xF0\x9F\x92\x95P3...' for column 'body_value' at row 1: INSERT INTO {field_data_body} (entity_type, entity_id, revision_id, bundle, delta, language, body_value, body_summary, body_format) VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3, :db_insert_placeholder_4, :db_insert_placeholder_5, :db_insert_placeholder_6, :db_insert_placeholder_7, :db_insert_placeholder_8); Array ( [:db_insert_placeholder_0] => node [:db_insert_placeholder_1] => 38 [:db_insert_placeholder_2] => 38 [:db_insert_placeholder_3] => article [:db_insert_placeholder_4] => 0 [:db_insert_placeholder_5] => und [:db_insert_placeholder_6] => "blah blah blah text for article" ~Kebz #P3# [:db_insert_placeholder_7] => [:db_insert_placeholder_8] => full_html ) in field_sql_storage_field_storage_write() (line 494 of .../modules/field/modules/field_sql_storage/field_sql_storage.module).

Backtrace:
PDOStatement->execute(Array) database.inc:2171
DatabaseStatementBase->execute(Array, Array) database.inc:683
DatabaseConnection->query('INSERT INTO {field_data_body} (entity_type, entity_id, revision_id, bundle, delta, language, body_value, body_summary, body_format) VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3, :db_insert_placeholder_4, :db_insert_placeholder_5, :db_insert_placeholder_6, :db_insert_placeholder_7, :db_insert_placeholder_8)', Array, Array) query.inc:36
InsertQuery_mysql->execute() field_sql_storage.module:494
field_sql_storage_field_storage_write('node', Object, 'insert', Array)
call_user_func_array('field_sql_storage_field_storage_write', Array) module.inc:866
module_invoke('field_sql_storage', 'field_storage_write', 'node', Object, 'insert', Array) field.attach.inc:967
field_attach_insert('node', Object) node.module:1177
node_save(Object) node.pages.inc:457
node_form_submit(Array, Array) form.inc:1513
form_execute_handlers('submit', Array, Array) form.inc:903
drupal_process_form('article_node_form', Array, Array) form.inc:385
drupal_build_form('article_node_form', Array) form.inc:130
drupal_get_form('article_node_form', Object) node.pages.inc:78
node_add('article')
call_user_func_array('node_add', Array) menu.inc:517
menu_execute_active_handler() index.php:21

So when I found this page, I was ready to incorporate the patch ... but fortunately, I read your post and made me review my article. And I found this in it "#P3#" ... I had copied and paste from my instagram post and those # are actually square looking boxes and suppose to be hearts... lol --- I couldn't post the characters because Drupal (here) is giving me an error message too... lol

Anyhoo, all is good again...yay!

Thanks =)

achekulaev’s picture

@neoxavier you rock. Patch #5 works perfectly.

twistor’s picture

Status: Reviewed & tested by the community » Needs review
StatusFileSize
new4.37 KB

Stripping the characters isn't the only option. We can convert them into HTML entities.
This patch adds that option, and strips by default.

The regex in #5 will strip too many things, and requires that pcre is compiled with unicode support.

Status: Needs review » Needs work

The last submitted patch, 15: feeds_jsonpath_parser-four-byte-1824506-15.patch, failed testing.

twistor’s picture

Version: 7.x-1.0-beta2 » 7.x-1.x-dev

Status: Needs work » Needs review

The last submitted patch, 5: FeedsJSONPathParser-invalidcharacter-1824506-5.patch, failed testing.

Status: Needs review » Needs work

The last submitted patch, 15: feeds_jsonpath_parser-four-byte-1824506-15.patch, failed testing.

twistor’s picture

Status: Needs work » Needs review
StatusFileSize
new5.11 KB

  • twistor committed 9c1a508 on 7.x-1.x
    Issue #1824506 by twistor, Jackinloadup, neoxavier: When importing...
twistor’s picture

Status: Needs review » Fixed

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

amaisano’s picture

Status: Closed (fixed) » Needs work

The latest version fixes this by stripping out those characters. However, I tried using the checkbox for "Convert four byte characters," but when that is checked, I get the same failures when importing tweets. Is that convert feature fully functional? I expected it to import the feed item, but it wasn't able to and gave me the following error:

SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xF0\x9F\x99\x80-e...' for column 'alias' at row 1

Which is the same as the original issue. Let me know if I should open this up as a new bug instead.

So it looks like it IS getting converted to an HTML entity (http://unicode-table.com/en/1F640/) but is still failing to create the node due to this error.

Here's a report of this item error from the log:

Original item
array(
  'jsonpath_parser:0' => 616668369094361088,
  'jsonpath_parser:1' => '🙀 of emoji in your social strategy? Insights from #ChevyGoesEmoji, #EmojiScience & mo...',
  'jsonpath_parser:2' => '🙀 of emoji in your social strategy? Insights from #ChevyGoesEmoji, #EmojiScience & mo...',
  'jsonpath_parser:3' => 'Thu Jul 02 18:02:50 +0000 2015',
  'jsonpath_parser:5' => 616668369094361088,
)
Entity
(object) array(
  'type' => 'twitter_tweet',
  'changed' => 1436997156,
  'created' => 1436997156,
  'language' => 'und',
  'is_new' => TRUE,
  'status' => 1,
  'promote' => 0,
  'sticky' => 0,
  'uid' => 0,
  'revision' => FALSE,
  'log' => 'Created by FeedsNodeProcessor',
  'title' => '🙀 of emoji in your social strategy? Insights from #ChevyGoesEmoji, #EmojiScience & mo...',
  'timestamp' => 1436997156,
  'uuid' => '9f440983-5f57-431c-a24f-7f735bb18a6e',
  'vuuid' => '73b8e0ed-6e0f-45f1-8f0f-afcf6ef5069d',
  'nid' => '568',
  'comment' => 0,
  'tnid' => 0,
  'translate' => 0,
  'vid' => '638',
)
twistor’s picture

Status: Needs work » Closed (fixed)

Please create a new issue. I just tested your feed, and it worked fine.

khoomy’s picture

@neoxavier #5 worked for me.

Cheers

jan kellermann’s picture

Patch did not work for me. We added a new regexp-Rule in tamper with this regexp:
/(([\xE0-\xEF][\x00-\xFF][\x00-\xFF])|([\xF0-\xF4][\x00-\xFF][\x00-\xFF][\x00-\xFF]))/
replace by nothing.

d34dman’s picture

In my opinion the module could declare a dependency on strip_utf8mb4, and use

_strip_utf8mb4_for_text_fields("<YOUR STRING HERE>", variable_get('strip_utf8mb4_replace_string', '--'));.

The discussion regarding what to strip and how it should be replaced should be done under strip_utf8mb4 module issue queue. This will be beneficial for both community and reduce duplicate effort.

twistor’s picture

@jan kellermann, Can you provide a specific example that fails?

@D34dMan, that's not an option. For one, this module is in bug fix only mode. Two, strip_utf8mb4 does way too much as it just strips 4 byte characters from any field. That would break for people who have converted mysql fields to allow 4 byte characters.

sanjayk’s picture

I was face same issue. I was used below code in custom module.

function custom_module_name_node_presave($node) {
if ($node->type == 'node_type') {
$node->body[LANGUAGE_NONE][0]['value'] = utf8_encode($node->body[LANGUAGE_NONE][0]['value']);
}
}