The problem is with the UTF-8 encoding of the character é.
A user submitted a post containing the word cliché, then hit spellcheck.
Not only did it treat cliché as a spelling error (although it did recommend cliche), but the "unchanged" replacement also turned
clichXX (where XX is the two-byte é)
into:
clichOOX (where part of the original two bytes is left, and more are added).
Repeated re-spellchecks quickly formed a line of gibberish.

Selecting cliche to replace cliché only partly fixed the problem.
clicheX (where X is half of the original é)
was the result.
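
For what it's worth, the leftover half-character is what you would expect if the replacement were spliced in using a character count where a byte count is needed. This is only a guess at the cause, since I have not looked at the module's code. A minimal Python sketch of that hypothesis follows; the function names `splice_by_chars` and `splice_by_bytes` are made up for illustration.

```python
# Hypothetical reproduction of the symptom; NOT the module's actual code.

def splice_by_chars(text: bytes, word: str, replacement: str) -> bytes:
    """Buggy splice: treats the character count of `word` as a byte count."""
    start = text.index(word.encode("utf-8"))
    end = start + len(word)  # 6 for "cliché", but the word occupies 7 bytes
    return text[:start] + replacement.encode("utf-8") + text[end:]

def splice_by_bytes(text: bytes, word: str, replacement: str) -> bytes:
    """Correct splice: measures the word in encoded bytes."""
    encoded = word.encode("utf-8")
    start = text.index(encoded)
    end = start + len(encoded)
    return text[:start] + replacement.encode("utf-8") + text[end:]

post = "a cliché here".encode("utf-8")  # é is stored as the bytes 0xC3 0xA9
buggy = splice_by_chars(post, "cliché", "cliche")
fixed = splice_by_bytes(post, "cliché", "cliche")
```

In this sketch `buggy` keeps the stray 0xA9 byte right after "cliche", which matches the clicheX result above, while `fixed` comes out clean.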

I'm wondering if a filter that simply pre-strips all words containing 8-bit characters (any byte with the high bit set, 0x80 to 0xFF) might provide a quick and dirty fix?
Presumably you are already stripping HTML, since I don't seem to run into that in the spellcheck output.
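
A rough sketch of the pre-strip filter I have in mind, in Python rather than the module's own language, and assuming words are separated by plain whitespace (punctuation handling is left out). The name `ascii_only_words` is hypothetical.

```python
from typing import List

def ascii_only_words(data: bytes) -> List[bytes]:
    """Return the whitespace-separated words made up only of 7-bit bytes.

    Any word containing a byte in the range 0x80 to 0xFF is dropped,
    so it never reaches the spellchecker.
    """
    return [word for word in data.split() if all(b < 0x80 for b in word)]

post = "a cliché here".encode("utf-8")
to_check = ascii_only_words(post)
```

With this, cliché is simply never spellchecked: crude, but it would stop the corruption.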

Ideally, I suppose, a filter to turn UTF-8 into ISO-8859-1 would be nice, at least for spellchecking with European dictionaries.
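
As a sketch of that conversion (again in Python, with a made-up function name `utf8_to_latin1`): decode the UTF-8 bytes, then re-encode as ISO-8859-1. Characters that have no ISO-8859-1 equivalent are replaced with "?" here, which is an assumption on my part about what would be acceptable for spellchecking.

```python
def utf8_to_latin1(data: bytes) -> bytes:
    """Convert UTF-8 bytes to ISO-8859-1, replacing unmappable characters with '?'."""
    return data.decode("utf-8").encode("iso-8859-1", errors="replace")

word = utf8_to_latin1("cliché".encode("utf-8"))  # é becomes the single byte 0xE9
```

After conversion each character in the Latin-1 range is exactly one byte, so byte-based word handling and an ISO-8859-1 dictionary would line up again.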

Comments

Steven

I think the easiest fix would be to convert the dictionaries to UTF-8 too.

Eric Scouten

Status: Active » Closed (won't fix)

Spellcheck module no longer maintained.