At least, I think it's emoji?

It didn't cause cron to fail but I did see a dblog that failed I think in this one specific tweet import:

PDOException: SQLSTATE[HY000]: General error: 1366 Incorrect string value:      [error]
'\xF0\x9F\x98\x9D"' for column 'text' at row 1: INSERT INTO {twitter} (twitter_id,
screen_name, created_time, text, source, in_reply_to_status_id, in_reply_to_user_id,
in_reply_to_screen_name, truncated) VALUES (:db_insert_placeholder_0,
:db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3,
:db_insert_placeholder_4, :db_insert_placeholder_5, :db_insert_placeholder_6,
:db_insert_placeholder_7, :db_insert_placeholder_8); Array
(
    [:db_insert_placeholder_0] => 298958759877816320
    [:db_insert_placeholder_1] => cutieyanyan
    [:db_insert_placeholder_2] => 1360112297
    [:db_insert_placeholder_3] => I fell for it too. I deleted it na. LOL RT "@iamshieny:
PAL promo is a hoax. We all fell for it. <EMOJIHERE>"
    [:db_insert_placeholder_4] => Twitter for iPhone
    [:db_insert_placeholder_5] => 298942328834494464
    [:db_insert_placeholder_6] => 260220250
    [:db_insert_placeholder_7] => iamshieny
    [:db_insert_placeholder_8] => 0
)
 in twitter_status_save() (line 100 of
/home/blushama/main_d7/sites/all/modules/twitter/twitter.inc).

The emoji character is in the EMOJIHERE placeholder above. I couldn't paste it here because the text couldn't be saved when I submitted the Issue form.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

radj’s picture

Issue summary: View changes

The emoji character is in the placeholder above. I couldn't paste it here because the text couldn't be saved when I submitted the Issue form.

Drave Robber’s picture

That's not just emoji, D7 has a problem with any astral plane (four-byte) Unicode character. (D6, too - that's why you couldn't post it here)

This might be fixed in core some day but I wouldn't hold my breath yet, see #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols)

In contrib, this is quite a problem for Bot (because it, well, kills the bot). There are some efforts underway to handle it, see #1886646: Fire (Non UTF8 characters) kills bot - maybe there will soon be a solution to copy.

(I was tempted to downgrade this to minor, but then thought - for some sites half of all tweets may contain emoji.)

radj’s picture

Status: Active » Closed (duplicate)

I actually read your reply when you made it. I feel so rude for not replying right away.

Thanks for the reference. I guess this isn't a Twitter specific issue. Closing as duplicate of #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols).

seanbfuller’s picture

I was able to apply a quick fix by replacing the characters as suggested in https://drupal.org/node/1886646#comment-6932134 (linked above). It involved adding this line at the top of twitter_status_save() in twitter.inc:

  preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $status->text);

Since these characters totally break cron, would it be worth applying something similar in the module?

xurizaemon’s picture

Version: 7.x-5.4 » 7.x-5.x-dev
Status: Closed (duplicate) » Needs work

Agreed, this is worth fixing in the module. People are tweeting emoji today, and #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) is for Drupal 8.x. This issue stops all tweets importing, and if I read correctly above, kills cron for modules running after Twitter.

It doesn't matter if half your tweets contain emoji, it only takes one to stop cron cold.

xurizaemon’s picture

Tested above preg_replace() with emoji, I found that it didn't handle the smiley in this status: https://twitter.com/grobot_4/status/369607329643126785 ... Will come back to this - leaving this here as a note for myself.

Suspect we should just use Drupal text functions rather than trying to roll our own unicode exclusions.

PDOException: SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xF0\x9F\x98\x84 L...' for column 'text' at row 1: 
INSERT INTO {twitter} (twitter_id, screen_name, created_time, text, source, in_reply_to_status_id, in_reply_to_user_id, in_reply_to_screen_name, truncated) 
VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3, :db_insert_placeholder_4, :db_insert_placeholder_5, :db_insert_placeholder_6, :db_insert_placeholder_7, :db_insert_placeholder_8); 
Array ( 
  [:db_insert_placeholder_0] => 369607329643126785 
  [:db_insert_placeholder_1] => grobot_4
  [:db_insert_placeholder_2] => 1376956229
  [:db_insert_placeholder_3] => [x] Lol emoji
  [:db_insert_placeholder_4] => <a href="http://tapbots.com/tweetbot" rel="nofollow">Tweetbot for iOS</a>
  [:db_insert_placeholder_5] =>
  [:db_insert_placeholder_6] => 0
  [:db_insert_placeholder_7] =>
  [:db_insert_placeholder_8] => 0
) in twitter_status_save() (line 169 of /path/to/example.org/sites/all/modules/twitter/twitter.inc).

(emoji redacted from above paste because it makes d.o comments cut off)

lex0r’s picture

FileSize
532 bytes

Hi all,
this is a patch for 7.x-5.8 based on #3.

lex0r’s picture

Issue summary: View changes

EMOJIHERE placeholder

lhridley’s picture

Ran into this issue and tried several things to fix, finally found this blog post:

http://magp.ie/2011/01/06/remove-non-utf8-characters-from-string-with-php/

Patched the twitter.inc file as follows, in twitter_status_save():

function twitter_status_save($status) {
  
//reject overly long 2 byte sequences, as well as characters above U+10000 and replace with --
$stripped_string = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'.
 '|[\x00-\x7F][\x80-\xBF]+'.
 '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.
 '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'.
 '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S',
 '--', $status->text );

//reject overly long 3 byte sequences and UTF-16 surrogates and replace with --
$stripped_string = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.'|\xED[\xA0-\xBF][\x80-\xBF]/S','--', $stripped_string );
  
//  dd(iconv("utf-8", "utf-8//ignore", $status->text));
  $row = array(
    'twitter_id' => $status->id,
    'screen_name' => $status->user->screen_name,
    'created_time' => strtotime($status->created_at),
    'text' => $stripped_string,
    'source' => $status->source,
    'in_reply_to_status_id' => ($status->in_reply_to_status_id > 0) ? (string) $status->in_reply_to_status_id : NULL,
    'in_reply_to_user_id' => (int) $status->in_reply_to_user_id,
    'in_reply_to_screen_name' => $status->in_reply_to_screen_name,
    'truncated' => (int) $status->truncated,
  );

  db_merge('twitter')
    ->key(array('twitter_id' => $row['twitter_id']))
    ->fields($row)
    ->execute();
  // Let other modules know that a status has been saved.
  module_invoke_all('twitter_status_save', $status);
}

This replaces any non-UTF8 characters with the string "--". So far it's working for us.

joelpittet’s picture

Version: 7.x-5.x-dev » 7.x-6.x-dev
Status: Needs work » Needs review
FileSize
820 bytes

Here's a quick fix for 6.x dev.

Same patch as #6

nicrodgers’s picture

FileSize
1.5 KB

The fix in #6 (and corresponding patch in #8) worked for me on some statuses, but there were still some that weren't being stripped and triggering the error.

The code in #7 seems to be working on every status I've tried so far.

Here's a patch for 7.x-5.8 based on the code in #7 - thanks @lhridley

nicrodgers’s picture

..and here's a patch for 7.x-6x, thanks @Ihridley

dddave’s picture

Tested the patch for the 5x branch and it seems to be working as advertised. No more screaming logs and the tweet gets imported with the characters replaced by - (or a box?).

Thank you. Would be great if someone else could test it so that we can RTBC it (hopefully for both branches).

[Took the liberty to fix a typo in the comment #10]

joelpittet’s picture

Status: Needs review » Needs work

It may be worth giving more documentation around that regular expression because it's quite a bit more complicated and easy to introduce bugs in all those "or" clauses.

Also by documenting all those cases and exclusions may find any further bugs contained or unexpected exclusions.

In general though it sounds like it covers more cases including the UTF-16 bits so it's looking like a nice improvement thank you for finding that:)

Here are some coding/commenting standards nitpicks that should be addressed as well:

  1. +++ b/twitter.inc
    @@ -104,11 +104,22 @@ function twitter_load_authenticated_accounts() {
    +  //see https://www.drupal.org/node/1910376
    

    should be @see with a space before the @.

  2. +++ b/twitter.inc
    @@ -104,11 +104,22 @@ function twitter_load_authenticated_accounts() {
    +  //reject overly long 2 byte sequences, as well as characters above U+10000 and replace with --
    

    Start with a space and a capital, keep the lines at 80 chcarters and end with a period as per coding standards. @see https://www.drupal.org/node/1354

  3. +++ b/twitter.inc
    @@ -104,11 +104,22 @@ function twitter_load_authenticated_accounts() {
    +  $stripped_string = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'.
    +   '|[\x00-\x7F][\x80-\xBF]+'.
    +   '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.
    +   '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'.
    

    Spaces before the concatenating periods.

  4. +++ b/twitter.inc
    @@ -104,11 +104,22 @@ function twitter_load_authenticated_accounts() {
    +   '--', $status->text );
    

    Remove space after ->text)

  5. +++ b/twitter.inc
    @@ -104,11 +104,22 @@ function twitter_load_authenticated_accounts() {
    +  //reject overly long 3 byte sequences and UTF-16 surrogates and replace with --
    

    Same commenting rules apply.

  6. +++ b/twitter.inc
    @@ -104,11 +104,22 @@ function twitter_load_authenticated_accounts() {
    +  $stripped_string = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.'|\xED[\xA0-\xBF][\x80-\xBF]/S','--', $stripped_string );
    

    Spaces after comma and not before ending parentheses. Spaces before and after concatenating period.

nicrodgers’s picture

Status: Needs work » Needs review
FileSize
1.54 KB

Hi joelpittet, thanks for the feedback. I've updated the changes so they meet the Drupal coding standards, and have re-rolled the patch for 7.x-6.x, attached. I've added in another @ see link to an article where the regular expressions come from. It has lots of useful information in there. Seemingly a lot of open source projects are using the same regexp. I thought it'd be better to link to that article than try and reproduce all the content as comments, but feel free to add more comments in if you think it necessary.

regards
Nic

joelpittet’s picture

Status: Needs review » Reviewed & tested by the community

@nicrodgers thanks, that's good to me:)

Was discussing this last night at the VanDUG meetup and @gapple and I were discussing if it's possible to get MySQL table to accept 4 byte unicode characters. Still not sure but after this is committed a follow-up to explore that would be ideal. #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) Take from that issue but try to apply it to this specific case maybe. Make MySQL version require 5.5+ just for this module... just a thought.

DamienMcKenna’s picture

DamienMcKenna’s picture

Would it be possible to instead replace them with a HTML entity, rather than remove the string entirely? Maybe the php-emoji library could help?

joelpittet’s picture

That sounds like a pretty awesome idea if it works. Something to give a try, thanks for the suggestion @DamienMcKenna

DamienMcKenna’s picture

Status: Reviewed & tested by the community » Needs work

Putting this back to "needs work" as I don't think changing the incoming data in this way is necessarily a good one (maintainers, feel free to disagree).

joelpittet’s picture

@DamienMcKenna any chance we could just put that in a follow up feature request so that we can eliminate this nasty sql error?

DamienMcKenna’s picture

Status: Needs work » Reviewed & tested by the community

Done: #2402153: Use php-emoji library to work around UTF-8 limitations

Note: I don't recommend releasing a version of the module with this patch included, it should be a stepping stone towards using php-emoji to build a proper workaround rather than changing content.

DamienMcKenna’s picture

DamienMcKenna’s picture

Status: Reviewed & tested by the community » Needs review

Triggering the testbot.

helmo’s picture

The testbot somehow postponed the latest patch ... the patch seems to work ok though.

DamienMcKenna’s picture

Patches are not being tested because the existing tests don't pass and it was failing every patch. Please see #2402317: Fix the 7.x-5.x branch tests and #2402321: Fix the 7.x-6.x branch tests.

mikeytown2’s picture

Just a heads up that I had to deal with this issue to a make sure Drupal outputted valid XML. If you test ALL utf8 characters preg_replace will return NULL. I came up with a solution that tries preg_replace first and then if that fails it will use slower int values for each character. Test case is inside the code as well.
http://stackoverflow.com/questions/3466035/how-to-skip-invalid-character...

IT-Cru’s picture

FileSize
1.34 KB

Add re-factored patch for 7.x-5.x based on #9 which have wrong file base.

Status: Needs review » Needs work

The last submitted patch, 26: emoji_fix-1910376-26.patch, failed testing.

DamienMcKenna’s picture

Version: 7.x-6.x-dev » 7.x-5.x-dev
Status: Needs work » Needs review
FileSize
1.45 KB

Rerolled.

mradcliffe’s picture

This isn't the correct approach as it works fine for SQLite and PostgreSQL.

I think instead the patch should catch the PDOException so that the application does not crash for MySQL databases instead of replacing valid text. Or maybe set a variable / static so that the module is aware of the character set constraints of previous versions of MySQL and if the column is not set to utf8mb4.

basvredeling’s picture

I agree with @mradcliffe #29
Either we solve the utf8mb4 compatibility or we check for db compatibility.
A viable but costly workaround would be to base64 encode / decode the text before storing it in the db.
Stripping the incompatible characters from the string is a no-go IMHO.

basvredeling’s picture

Try this patch. It's got a db update and writes and loads twitter text as base64 encoded strings.
To accomodate for the base64 strings (which are longer than regular strings) I had to add the db update to increase the field "text" varchar size.

While it's not an ideal solution, it's a pretty robust workaround for now.

DamienMcKenna’s picture

Status: Needs review » Needs work

@basvredeling: That's definitely one approach to take. Your patch needs to also loop over the existing records and base64 encode them, and it needs an update to hook_schema to set the 'text' field to the same size as what's used in the update.

basvredeling’s picture

Status: Needs work » Needs review
FileSize
3.07 KB

Here you go...

DamienMcKenna’s picture

@basvredeling: Thanks.

Has anyone tested to confirm that the rest of the module works OK with emojis, e.g. are the tweets displayed correctly on the site after importing?

DamienMcKenna’s picture

basvredeling’s picture

Just an example: I applied the patch to this site http://www.wheelhalla.com Tweets on for instance this page: http://www.wheelhalla.com/stage/tour-de-pologne-2015-2-czestochowa-dabro... are working fine with emoticons.
See screen shot of an individual embedded tweet.

tanc’s picture

Using the base64 method also requires that the views field decodes the output on render. Attached is an interdiff and an updated patch based on #33.

basvredeling’s picture

Good catch... perhaps we should also add a TODO at the new code segments to eventually remove the base64 encoding.

DamienMcKenna’s picture

Status: Needs review » Needs work

The update script needs to use batch API so that it doesn't fail for large sites.

DamienMcKenna’s picture

Priority: Normal » Major
DamienMcKenna’s picture

I had another idea - how about using a 'blob' for storing the tweets, instead of a varchar field? Please see #2534206: Support Twitter's new longer DMs.

basvredeling’s picture

Storing into blob would entail an encoding process too: from string to binary using pack(). I don't reckon you are proposing to store the raw twitter strings as binary blob strings directly... right?

If so this is a speed issue. I don't see any further advantages. If binary encoding is quicker than base64 encoding. And/or if blobs are stored and retrieved more speedily from mysql than varchars, we should go for it.

DamienMcKenna’s picture

@basvredeling: Yes, I was thinking of storing the entire tweet in its raw, unprocessed form as a blob. The database has to change anyway to support the new 10,000 char DM limit, so we might as well use this to consider our options.

DamienMcKenna’s picture

basvredeling’s picture

Status: Needs work » Needs review
FileSize
1.3 KB

Okay, so I did some reading on using BLOB as text. It's not a pretty solution imho, and our capability to do sorts (etc) is out the window, but... it should work. And MediaWiki is also storing text articles as blobs, apparently. If no encoding is required the patch can be a lot simpler too. No need for batch api #39 and views #37.

I just made this patch to keep things rolling, I have had no time to test it properly. Please review.

Status: Needs review » Needs work

The last submitted patch, 45: twitter-sql_error_when-1910376-44.patch, failed testing.

basvredeling’s picture

Status: Needs work » Needs review

Nothing wrong with the patch, it just fails testing because of the PHPStorm comment.

DamienMcKenna’s picture

DamienMcKenna’s picture

FileSize
28.86 KB

The change to 'blob' appears to work without any problems.
Emoji!

DamienMcKenna’s picture

Version: 7.x-5.x-dev » 7.x-6.x-dev
FileSize
1.17 KB

Ported to 7.x-6.x.

Status: Needs review » Needs work

The last submitted patch, 50: twitter-n1910376-50.patch, failed testing.

DamienMcKenna’s picture

Status: Needs work » Needs review
FileSize
1.83 KB

Ported to 6.x-5.x.

DamienMcKenna’s picture

FileSize
17.87 KB

And it works ok in 6.x-5.x too:
Emoji in D6!

Status: Needs review » Needs work

The last submitted patch, 52: twitter-n1910376-51-6.x-5.x.patch, failed testing.

  • DamienMcKenna committed ba7c7bd on 6.x-5.x
    Issue #1910376 by lex0r, joelpittet, nicrodgers, IT-Cru, DamienMcKenna,...
DamienMcKenna’s picture

Status: Needs work » Fixed

Committed! Yay! :-) Thanks everyone for your help!

  • DamienMcKenna committed e0000dd on 7.x-5.x
    Issue #1910376 by lex0r, joelpittet, nicrodgers, IT-Cru, DamienMcKenna,...
basvredeling’s picture

Thanks for the commit. Good to see this in.

basvredeling’s picture

Some follow up for those who previously applied this patch and converted their twitter messages to base64 encoding. To revert the process run this snippet:

$tweets = db_select('twitter', 't')
  ->fields('t', array('twitter_id', 'text'))
  ->orderBy('twitter_id', 'ASC')
  ->execute();

foreach ($tweets as $tweet) {
  $tweet->text = base64_decode($tweet->text);
  db_update('twitter')
    ->fields(array('text' => $tweet->text))
    ->condition('twitter_id', $tweet->twitter_id)
    ->execute();
}
DamienMcKenna’s picture

Thanks @basvredeling!

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

basvredeling’s picture

Status: Closed (fixed) » Active

Now that Drupal 7.50 is out and utf8mb4 support is official we should probably revert this patch.
There is an @todo comment in the code of this patch that should be resolved.
Since utf8mb4 support is configurable / optional, we can either add that to the install.txt and module documentation or make the "blob" field optional only if the database doesn't support multibyte utf8.
Since it is currently working we can also just remove the @todo and leave everything as is. But I don't know what impact storing blobs has on performance and size compared to multibyte text.

Any ideas on which route to follow?

mradcliffe’s picture

Even if the driver supports utf8mb4, it is up to the server configuration to determine if it will be used in a particular collation by default, or up to the application to set the collation. The approach should take that into consideration so as not to break older installations of MySQL or MariaDB that did not support utf8mb4 or may not be configured to support utf8mb4.

basvredeling’s picture

@mradcliffe Drupal core 7.50+ has a hook_requirements() implementation that checks utf8mb4 compatibility. We could leverage that to check which storage solution we should use and mark the old solution as deprecated.

Furthermore: we only provided the utf8mb4 workaround for the message text, not for the twitter user name (which can also include multibyte characters).

oadaeh’s picture

Status: Active » Fixed

It is my opinion that the changes that were committed should stay as is. It may be that someone might not have the ability or permission to make changes to their database configuration or schema, and by making utf8mb4 a requirement for this module, it makes it so that they cannot use the module.

Also, this issue should be left closed and a new one opened, if it is indeed determined that the schema changes should be reverted.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.