Can spellcheck be made unicode aware, or at least, blind? [#4506]

The problem is with the UTF-8 encoding of the character é.
A user submitted a post containing the word
cliché, then hit spellcheck.
Not only did it flag cliché as a spelling error (although it did suggest cliche), but choosing the "unchanged" replacement turned
clichXX (where XX is the two-byte é)
into:
clichOOX (where part of the original two bytes is left, and more bytes are added).
Repeated re-spellchecks quickly produced a line of gibberish.

Selecting cliche to replace cliché only partly fixed it. The result was
clicheX (where X is half of the original é).
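
One way this kind of damage can arise, sketched in Python purely for illustration: the word is measured in characters but spliced in bytes. The function replace_by_char_count is hypothetical and is not taken from the spellcheck module; the actual cause in the module may differ.

```python
# Illustration only: a hypothetical byte/character mismatch.
word = "cliché"
raw = word.encode("utf-8")  # é becomes the two bytes 0xC3 0xA9


def replace_by_char_count(raw_bytes: bytes, original: str, replacement: str) -> bytes:
    """Buggy on purpose: counts characters, then slices bytes."""
    n = len(original)  # 6 characters, but the word occupies 7 bytes
    return replacement.encode("ascii") + raw_bytes[n:]


damaged = replace_by_char_count(raw, word, "cliche")
# damaged now ends with a stray 0xA9, the second half of the original é,
# so it is no longer valid UTF-8.
```

The leftover byte matches the "clicheX" symptom described above, under the assumption that lengths were computed in one unit and applied in another.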

I'm wondering if a filter that simply pre-strips all words containing non-ASCII bytes (those with the high bit set, 0x80 to 0xFF) might provide a quick and dirty fix?
Presumably you are already stripping HTML, since I don't seem to run into that in the spellcheck output.
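
A minimal sketch of that pre-strip filter, in Python for illustration. The name split_checkable is made up, and it assumes words are separated by whitespace; the module's real tokenizer is not shown in this issue.

```python
def split_checkable(text_bytes: bytes):
    """Separate whitespace-delimited words into those that are pure
    7-bit ASCII (safe to spellcheck) and those containing any byte
    in the range 0x80 to 0xFF (skipped untouched)."""
    checkable, skipped = [], []
    for token in text_bytes.split():
        if any(b >= 0x80 for b in token):
            skipped.append(token)
        else:
            checkable.append(token)
    return checkable, skipped


checkable, skipped = split_checkable("a cliché here".encode("utf-8"))
```

Words in skipped would never reach the dictionary, so they would be neither flagged nor rewritten. That makes the checker "blind" to them rather than Unicode aware.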

Ideally, I suppose, a filter to turn UTF-8 into ISO-8859-1 would be nice, at least for spellchecking with European dictionaries.
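
The conversion idea could look roughly like this in Python. The helper utf8_to_latin1 is a hypothetical name, and returning None for text that does not fit in ISO-8859-1 is just one possible policy.

```python
from typing import Optional


def utf8_to_latin1(data: bytes) -> Optional[bytes]:
    """Decode UTF-8 and re-encode as ISO-8859-1.
    Returns None when a character has no ISO-8859-1 equivalent."""
    text = data.decode("utf-8")
    try:
        return text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return None


latin1_word = utf8_to_latin1("cliché".encode("utf-8"))
# After spellchecking, convert back before storing the post.
restored = latin1_word.decode("iso-8859-1").encode("utf-8")
```

In ISO-8859-1 the é is a single byte (0xE9), so a byte-oriented checker with a Latin-1 dictionary would see it as one letter. Text outside that character set would still need the skip approach or a UTF-8 dictionary.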

Comments

Comment #1

Steven commented 12 February 2004 at 15:34

I think the easiest fix would be for the dictionaries to be converted to UTF-8 too.

Comment #2

Eric Scouten commented 27 July 2005 at 05:04

Status: Active » Closed (won't fix)

Spellcheck module no longer maintained.
