The regex used in drupagram_emoji() is too permissive: it also matches valid two- and three-byte UTF-8 characters. As a result, any text other than Latin is mangled.

The first byte should be restricted to the range F0-F7 rather than C0-F7, and the rest of the match fixed at exactly three continuation bytes.

See http://en.wikipedia.org/wiki/UTF-8#Description for the leading byte values for multi-byte characters.
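To illustrate the difference between the two byte ranges, here is a minimal sketch in Python. It is not the module's PHP code, and both patterns are assumptions reconstructed from the description above (a C0-F7 lead byte with an open-ended tail versus an F0-F7 lead byte with exactly three continuation bytes). The names PERMISSIVE, RESTRICTED and strip_bytes are hypothetical.

```python
import re

# Assumed stand-ins for the two patterns described above; the module's
# actual regex is not reproduced here.
PERMISSIVE = re.compile(rb"[\xC0-\xF7][\x80-\xBF]+")    # any multi-byte lead byte
RESTRICTED = re.compile(rb"[\xF0-\xF7][\x80-\xBF]{3}")  # four-byte sequences only


def strip_bytes(pattern, text):
    """Remove every match of a byte-level pattern from the UTF-8 form of text."""
    return pattern.sub(b"", text.encode("utf-8")).decode("utf-8")


# Cyrillic (two-byte), Japanese (three-byte) and one emoji (four-byte).
sample = "Привет 日本語 \U0001F600"

# The permissive pattern removes all non-ASCII text, leaving only the spaces.
print(repr(strip_bytes(PERMISSIVE, sample)))

# The restricted pattern removes only the four-byte emoji.
print(repr(strip_bytes(RESTRICTED, sample)))
```

With the restricted pattern, the Cyrillic and Japanese text survives unchanged because F0-F7 only ever appears as the lead byte of a four-byte sequence in valid UTF-8.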

Comments

ben.kyriakou’s picture

Status: Needs work » Needs review
Attached file: new, 534 bytes

Patch attached.

In practice this patch just removes the emoji. I think a better fix would be to explicitly strip 4-byte characters until emoji support is added with a third-party library. Emoji characters are not cross-compatible between browsers or platforms, since many systems lack the dictionaries corresponding to these code points. Even the Instagram web client lacks cross-compatible support for these characters, and on my system (FF on Mac) they display as unknown characters.

See http://apps.timwhitlock.info/emoji/tables/unicode for a good summary of emoji Unicode characters.
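Explicitly stripping 4-byte characters can also be done at the code point level instead of on raw bytes. The following is a rough sketch in Python, not the patch itself; it relies on the fact that UTF-8 uses four bytes exactly for code points U+10000 through U+10FFFF. The names FOUR_BYTE and strip_four_byte are hypothetical.

```python
import re

# Code points U+10000..U+10FFFF are exactly the ones UTF-8 encodes in four bytes.
FOUR_BYTE = re.compile("[\U00010000-\U0010FFFF]")


def strip_four_byte(text):
    """Drop every character whose UTF-8 encoding is four bytes long."""
    return FOUR_BYTE.sub("", text)


caption = "café 日本 \U0001F600!"
cleaned = strip_four_byte(caption)

# Accented Latin and Japanese characters are kept; only the emoji is dropped.
print(cleaned)
```

Note that this also drops non-emoji characters outside the Basic Multilingual Plane, which is the trade-off of stripping by byte length rather than by an emoji table.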

ben.kyriakou’s picture

Component: API Integration / Library » Code

q11q11’s picture

Attached file: new, 1007 bytes

An addition to the patch in #1.
The location has a 'name' element, and some users put emoji in it.
We just ran into this problem, so I'm posting a modified patch.

joelpittet’s picture

damienmckenna’s picture

The Twitter module has the same bug: #1910376: SQL error when importing tweet with emoji

Rather than continuing to change the content, I suggested using php-emoji to provide a solid solution to the problem: #2402153: Use php-emoji library to work around UTF-8 limitations

damienmckenna’s picture

The solution is to convert the 'text' fields to 'blob'.

damienmckenna’s picture

Status: Needs review » Needs work

damienmckenna’s picture

damienmckenna’s picture

Also needs to be backported to the D6 branch.

daniel korte’s picture