Support for Drupal 7 is ending on 5 January 2025—it’s time to migrate to Drupal 10! Learn about the many benefits of Drupal 10 and find migration tools in our resource center.
At least, I think it's emoji?
It didn't cause cron to fail but I did see a dblog that failed I think in this one specific tweet import:
PDOException: SQLSTATE[HY000]: General error: 1366 Incorrect string value: [error]
'\xF0\x9F\x98\x9D"' for column 'text' at row 1: INSERT INTO {twitter} (twitter_id,
screen_name, created_time, text, source, in_reply_to_status_id, in_reply_to_user_id,
in_reply_to_screen_name, truncated) VALUES (:db_insert_placeholder_0,
:db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3,
:db_insert_placeholder_4, :db_insert_placeholder_5, :db_insert_placeholder_6,
:db_insert_placeholder_7, :db_insert_placeholder_8); Array
(
[:db_insert_placeholder_0] => 298958759877816320
[:db_insert_placeholder_1] => cutieyanyan
[:db_insert_placeholder_2] => 1360112297
[:db_insert_placeholder_3] => I fell for it too. I deleted it na. LOL RT "@iamshieny:
PAL promo is a hoax. We all fell for it. <EMOJIHERE>"
[:db_insert_placeholder_4] => Twitter for iPhone
[:db_insert_placeholder_5] => 298942328834494464
[:db_insert_placeholder_6] => 260220250
[:db_insert_placeholder_7] => iamshieny
[:db_insert_placeholder_8] => 0
)
in twitter_status_save() (line 100 of
/home/blushama/main_d7/sites/all/modules/twitter/twitter.inc).
The emoji character is in the EMOJIHERE placeholder above. I couldn't paste it here because the text couldn't be saved when I submitted the Issue form.
Comment | File | Size | Author |
---|---|---|---|
#52 | twitter-n1910376-51-6.x-5.x.patch | 1.83 KB | DamienMcKenna |
#50 | twitter-n1910376-50.patch | 1.17 KB | DamienMcKenna |
#48 | twitter-n1910376-48-7.x-5.x.patch | 1.17 KB | DamienMcKenna |
#31 | twitter-sql_error_when-1910376-31.patch | 1.54 KB | basvredeling |
#28 | twitter-n1910376-28.patch | 1.45 KB | DamienMcKenna |
Comments
Comment #0.0
radj CreditAttribution: radj commentedThe emoji character is in the placeholder above. I couldn't paste it here because the text couldn't be saved when I submitted the Issue form.
Comment #1
Drave Robber CreditAttribution: Drave Robber commentedThat's not just emoji, D7 has a problem with any astral plane (four-byte) Unicode character. (D6, too - that's why you couldn't post it here)
This might be fixed in core some day but I wouldn't hold my breath yet, see #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols)
In contrib, this is quite a problem for Bot (because it, well, kills the bot). There are some efforts underway to handle it, see #1886646: Fire (Non UTF8 characters) kills bot - maybe there will soon be a solution to copy.
(I was tempted to downgrade this to minor, but then thought - for some sites half of all tweets may contain emoji.)
Comment #2
radj CreditAttribution: radj commentedI actually read your reply when you made it. I feel so rude for not replying right away.
Thanks for the reference. I guess this isn't a Twitter specific issue. Closing as duplicate of #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols).
Comment #3
seanbfuller CreditAttribution: seanbfuller commentedI was able to apply a quick fix by replacing the characters as suggested in https://drupal.org/node/1886646#comment-6932134 (linked above). It involved adding this line at the top of twitter_status_save() in twitter.inc:
Since these characters totally break cron, would it be worth applying something similar in the module?
Comment #4
xurizaemonAgreed, this is worth fixing in the module. People are tweeting emoji today, and #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) is for Drupal 8.x. This issue stops all tweets importing, and if I read correctly above, kills cron for modules running after Twitter.
It doesn't matter if half your tweets contain emoji, it only takes one to stop cron cold.
Comment #5
xurizaemonTested above preg_replace() with emoji, I found that it didn't handle the smiley in this status: https://twitter.com/grobot_4/status/369607329643126785 ... Will come back to this - leaving this here as a note for myself.
Suspect we should just use Drupal text functions rather than trying to roll our own unicode exclusions.
(emoji redacted from above paste because it makes d.o comments cut off)
Comment #6
lex0r CreditAttribution: lex0r commentedHi all,
this is a patch for 7.x-5.8 based on #3.
Comment #6.0
lex0r CreditAttribution: lex0r commentedEMOJIHERE placeholder
Comment #7
lhridley CreditAttribution: lhridley commentedRan into this issue and tried several things to fix, finally found this blog post:
http://magp.ie/2011/01/06/remove-non-utf8-characters-from-string-with-php/
Patched the twitter.inc file as follows, in twitter_status_save():
This replaces any non-UTF8 characters with the string "--". So far it's working for us.
Comment #8
joelpittetHere's a quick fix for 6.x dev.
Same patch as #6
Comment #9
nicrodgersThe fix in #6 (and corresponding patch in #8) worked for me on some statuses, but there were still some that weren't being stripped and triggering the error.
The code in #7 seems to be working on every status I've tried so far.
Here's a patch for 7.x-5.8 based on the code in #7 - thanks @lhridley
Comment #10
nicrodgers..and here's a patch for 7.x-6x, thanks @Ihridley
Comment #11
dddave CreditAttribution: dddave commentedTested the patch for the 5x branch and it seems to be working as advertised. No more screaming logs and the tweet gets imported with the characters replaced by - (or a box?).
Thank you. Would be great if someone else could test it so that we can RTBC it (hopefully for both branches).
[Took the liberty to fix a typo in the comment #10]
Comment #12
joelpittetIt may be worth giving more documentation around that regular expression because it's quite a bit more complicated and easy to introduce bugs in all those "or" clauses.
Also by documenting all those cases and exclusions may find any further bugs contained or unexpected exclusions.
In general though it sounds like it covers more cases including the UTF-16 bits so it's looking like a nice improvement thank you for finding that:)
Here are some coding/commenting standards nitpicks that should be addressed as well:
should be @see with a space before the @.
Start with a space and a capital, keep the lines at 80 chcarters and end with a period as per coding standards. @see https://www.drupal.org/node/1354
Spaces before the concatenating periods.
Remove space after ->text)
Same commenting rules apply.
Spaces after comma and not before ending parentheses. Spaces before and after concatenating period.
Comment #13
nicrodgersHi joelpittet, thanks for the feedback. I've updated the changes so they meet the Drupal coding standards, and have re-rolled the patch for 7.x-6.x, attached. I've added in another @ see link to an article where the regular expressions come from. It has lots of useful information in there. Seemingly a lot of open source projects are using the same regexp. I thought it'd be better to link to that article than try and reproduce all the content as comments, but feel free to add more comments in if you think it necessary.
regards
Nic
Comment #14
joelpittet@nicrodgers thanks, that's good to me:)
Was discussing this last night at the VanDUG meetup and @gapple and I were discussing if it's possible to get MySQL table to accept 4 byte unicode characters. Still not sure but after this is committed a follow-up to explore that would be ideal. #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) Take from that issue but try to apply it to this specific case maybe. Make MySQL version require 5.5+ just for this module... just a thought.
Comment #15
DamienMcKennaComment #16
DamienMcKennaWould it be possible to instead replace them with a HTML entity, rather than remove the string entirely? Maybe the php-emoji library could help?
Comment #17
joelpittetThat sounds like a pretty awesome idea if it works. Something to give a try, thanks for the suggestion @DamienMcKenna
Comment #18
DamienMcKennaPutting this back to "needs work" as I don't think changing the incoming data in this way is necessarily a good one (maintainers, feel free to disagree).
Comment #19
joelpittet@DamienMcKenna any chance we could just put that in a follow up feature request so that we can eliminate this nasty sql error?
Comment #20
DamienMcKennaDone: #2402153: Use php-emoji library to work around UTF-8 limitations
Note: I don't recommend releasing a version of the module with this patch included, it should be a stepping stone towards using php-emoji to build a proper workaround rather than changing content.
Comment #21
DamienMcKennaComment #22
DamienMcKennaTriggering the testbot.
Comment #23
helmo CreditAttribution: helmo commentedThe testbot somehow postponed the latest patch ... the patch seems to work ok though.
Comment #24
DamienMcKennaPatches are not being tested because the existing tests don't pass and it was failing every patch. Please see #2402317: Fix the 7.x-5.x branch tests and #2402321: Fix the 7.x-6.x branch tests.
Comment #25
mikeytown2 CreditAttribution: mikeytown2 commentedJust a heads up that I had to deal with this issue to a make sure Drupal outputted valid XML. If you test ALL utf8 characters preg_replace will return NULL. I came up with a solution that tries preg_replace first and then if that fails it will use slower int values for each character. Test case is inside the code as well.
http://stackoverflow.com/questions/3466035/how-to-skip-invalid-character...
Comment #26
IT-CruAdd re-factored patch for 7.x-5.x based on #9 which have wrong file base.
Comment #28
DamienMcKennaRerolled.
Comment #29
mradcliffeThis isn't the correct approach as it works fine for SQLite and PostgreSQL.
I think instead the patch should catch the PDOException so that the application does not crash for MySQL databases instead of replacing valid text. Or maybe set a variable / static so that the module is aware of the character set constraints of previous versions of MySQL and if the column is not set to utf8mb4.
Comment #30
basvredelingI agree with @mradcliffe #29
Either we solve the utf8mb4 compatibility or we check for db compatibility.
A viable but costly workaround would be to base64 encode / decode the text before storing it in the db.
Stripping the incompatible characters from the string is a no-go IMHO.
Comment #31
basvredelingTry this patch. It's got a db update and writes and loads twitter text as base64 encoded strings.
To accomodate for the base64 strings (which are longer than regular strings) I had to add the db update to increase the field "text" varchar size.
While it's not an ideal solution, it's a pretty robust workaround for now.
Comment #32
DamienMcKenna@basvredeling: That's definitely one approach to take. Your patch needs to also loop over the existing records and base64 encode them, and it needs an update to hook_schema to set the 'text' field to the same size as what's used in the update.
Comment #33
basvredelingHere you go...
Comment #34
DamienMcKenna@basvredeling: Thanks.
Has anyone tested to confirm that the rest of the module works OK with emojis, e.g. are the tweets displayed correctly on the site after importing?
Comment #35
DamienMcKennaComment #36
basvredelingJust an example: I applied the patch to this site http://www.wheelhalla.com Tweets on for instance this page: http://www.wheelhalla.com/stage/tour-de-pologne-2015-2-czestochowa-dabro... are working fine with emoticons.
See screen shot of an individual embedded tweet.
Comment #37
tancUsing the base64 method also requires that the views field decodes the output on render. Attached is an interdiff and an updated patch based on #33.
Comment #38
basvredelingGood catch... perhaps we should also add a TODO at the new code segments to eventually remove the base64 encoding.
Comment #39
DamienMcKennaThe update script needs to use batch API so that it doesn't fail for large sites.
Comment #40
DamienMcKennaComment #41
DamienMcKennaI had another idea - how about using a 'blob' for storing the tweets, instead of a varchar field? Please see #2534206: Support Twitter's new longer DMs.
Comment #42
basvredelingStoring into blob would entail an encoding process too: from string to binary using
pack()
. I don't reckon you are proposing to store the raw twitter strings as binary blob strings directly... right?If so this is a speed issue. I don't see any further advantages. If binary encoding is quicker than base64 encoding. And/or if blobs are stored and retrieved more speedily from mysql than varchars, we should go for it.
Comment #43
DamienMcKenna@basvredeling: Yes, I was thinking of storing the entire tweet in its raw, unprocessed form as a blob. The database has to change anyway to support the new 10,000 char DM limit, so we might as well use this to consider our options.
Comment #44
DamienMcKennaClosed a duplicate: #2015955: Special characters on twitt make a bug
Comment #45
basvredelingOkay, so I did some reading on using BLOB as text. It's not a pretty solution imho, and our capability to do sorts (etc) is out the window, but... it should work. And MediaWiki is also storing text articles as blobs, apparently. If no encoding is required the patch can be a lot simpler too. No need for batch api #39 and views #37.
I just made this patch to keep things rolling, I have had no time to test it properly. Please review.
Comment #47
basvredelingNothing wrong with the patch, it just fails testing because of the PHPStorm comment.
Comment #48
DamienMcKennaRerolled.
Comment #49
DamienMcKennaThe change to 'blob' appears to work without any problems.
Comment #50
DamienMcKennaPorted to 7.x-6.x.
Comment #52
DamienMcKennaPorted to 6.x-5.x.
Comment #53
DamienMcKennaAnd it works ok in 6.x-5.x too:
Comment #56
DamienMcKennaCommitted! Yay! :-) Thanks everyone for your help!
Comment #58
basvredelingThanks for the commit. Good to see this in.
Comment #59
basvredelingSome follow up for those who previously applied this patch and converted their twitter messages to base64 encoding. To revert the process run this snippet:
Comment #60
DamienMcKennaThanks @basvredeling!
Comment #62
basvredelingNow that Drupal 7.50 is out and utf8mb4 support is official we should probably revert this patch.
There is an @todo comment in the code of this patch that should be resolved.
Since utf8mb4 support is configurable / optional, we can either add that to the install.txt and module documentation or make the "blob" field optional only if the database doesn't support multibyte utf8.
Since it is currently working we can also just remove the @todo and leave everything as is. But I don't know what impact storing blobs has on performance and size compared to multibyte text.
Any ideas on which route to follow?
Comment #63
mradcliffeEven if the driver supports utf8mb4, it is up to the server configuration to determine if it will be used in a particular collation by default, or up to the application to set the collation. The approach should take that into consideration so as not to break older installations of MySQL or MariaDB that did not support utf8mb4 or may not be configured to support utf8mb4.
Comment #64
basvredeling@mradcliffe Drupal core 7.50+ has a hook_requirements() implementation that checks utf8mb4 compatibility. We could leverage that to check which storage solution we should use and mark the old solution as deprecated.
Furthermore: we only provided the utf8mb4 workaround for the message text, not for the twitter user name (which can also include multibyte characters).
Comment #65
oadaeh CreditAttribution: oadaeh at Hook 42 commentedIt is my opinion that the changes that were committed should stay as is. It may be that someone might not have the ability or permission to make changes to their database configuration or schema, and by making utf8mb4 a requirement for this module, it makes it so that they cannot use the module.
Also, this issue should be left closed and a new one opened, if it is indeed determined that the schema changes should be reverted.