The regex used in drupagram_emoji() is too permissive and matches valid two- and three-byte UTF-8 characters. As a result, any text outside the Latin range gets mangled.
The first byte should be restricted to the range F0-F7 rather than C0-F7, and the subsequent match fixed at three bytes.
See http://en.wikipedia.org/wiki/UTF-8#Description for the leading byte values for multi-byte characters.
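To illustrate the difference, here is a minimal Python sketch operating on raw UTF-8 bytes. The `PERMISSIVE` pattern is only a hypothetical stand-in for a regex with a C0-F7 lead byte, not the exact pattern from drupagram_emoji(); `FOUR_BYTE` follows the restriction described above. The names `strip_with` and `sample` are made up for the example.

```python
import re

# Hypothetical stand-in for the overly permissive pattern: any lead byte
# C0-F7 followed by one or more continuation bytes (80-BF).
PERMISSIVE = re.compile(rb"[\xC0-\xF7][\x80-\xBF]+")

# Restricted pattern: lead byte F0-F7 followed by exactly three
# continuation bytes, i.e. only four-byte sequences.
FOUR_BYTE = re.compile(rb"[\xF0-\xF7][\x80-\xBF]{3}")


def strip_with(pattern, text):
    """Encode text as UTF-8, remove every match of pattern, decode again."""
    raw = text.encode("utf-8")
    return pattern.sub(b"", raw).decode("utf-8", errors="replace")


sample = "Привет 😀 café"

# The permissive pattern also removes the Cyrillic letters and the "é".
permissive_result = strip_with(PERMISSIVE, sample)

# The restricted pattern removes only the four-byte emoji.
strict_result = strip_with(FOUR_BYTE, sample)
```

With the restricted pattern the two- and three-byte characters pass through untouched, which is the behaviour the proposed fix is after.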
| Comment | File | Size | Author |
|---|---|---|---|
| #3 | drupagram-emoji_fix-2232243-3.patch | 1007 bytes | q11q11 |
| #1 | drupagram-emoji_fix-2232243-2.patch | 534 bytes | ben.kyriakou |
Comments
Comment #1
ben.kyriakou commented
Patch attached.
In practice this just removes the emoji, and I think a better fix would be to explicitly strip 4-byte characters until emoji support is added with a third-party library. The emoji characters are not compatible across browsers or platforms, since many systems lack the dictionaries corresponding to these code points. Even the Instagram web client lacks cross-compatible support for these characters, and on my system (Firefox on Mac) they display as unknown characters.
See http://apps.timwhitlock.info/emoji/tables/unicode for a good summary of emoji unicode characters.
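As a rough sketch of what "explicitly strip 4-byte characters" could look like, the Python snippet below works on code points rather than bytes: anything above U+FFFF needs four bytes in UTF-8. The function name `strip_four_byte_chars` is hypothetical and this is not the code from either patch.

```python
def strip_four_byte_chars(text):
    """Drop every code point above U+FFFF (four bytes in UTF-8).

    Illustrative helper only; characters in the Basic Multilingual
    Plane (one to three bytes in UTF-8) are kept unchanged.
    """
    return "".join(ch for ch in text if ord(ch) <= 0xFFFF)


caption = "Sunset 🌅 at the beach"
cleaned = strip_four_byte_chars(caption)
```

Note that this removes all supplementary-plane characters, not only emoji, so it is a blunt workaround rather than real emoji support.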
Comment #2
ben.kyriakou commented
Comment #3
q11q11 commented
Addition to patch #1.
The location has a 'name' element, and some users put emoji in it.
We just ran into this problem, so I'm posting a modified patch.
Comment #4
joelpittet
Adding a related core issue for this.
Comment #5
damienmckenna
The Twitter module has the same bug: #1910376: SQL error when importing tweet with emoji
Rather than continuing to change the content, I suggested using php-emoji to provide a solid solution to the problem: #2402153: Use php-emoji library to work around UTF-8 limitations
Comment #6
damienmckenna
The solution is to convert the 'text' fields to 'blob'.
Comment #7
damienmckenna
Comment #8
damienmckenna
Comment #9
damienmckenna
Also needs to be backported to the D6 branch.
Comment #10
daniel korte