Right now, the spam module appears to consider only space-separated 1-grams. Thus, something like:

spam_tokenize("为人民服务");

would return an array with a single token containing the entire string. In Chinese at least, this behavior is obviously incorrect, as most Chinese text contains no whitespace characters. As a result, nearly every Chinese message would produce its own unique token, the database would quickly fill up with junk, and the filter would be close to useless.

I'm not sure what the "base-level implementation" corresponding to English 1-grams is, but I'll try to find out. It's probably something along the lines of "each character is a separate token" (为, 人, 民, 服, 务 in the above example) or "each pair of adjacent characters is a separate token" (为人, 人民, 民服, 服务 in the above example).
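
For illustration, a minimal sketch of the first option, assuming UTF-8 input and a PHP build with PCRE Unicode support (spam_tokenize_cjk_chars() is a hypothetical name, not an existing function in the module):

function spam_tokenize_cjk_chars($text) {
  // The 'u' modifier makes preg_split() operate on UTF-8 characters
  // rather than bytes, so each CJK character becomes its own token.
  return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// spam_tokenize_cjk_chars("为人民服务") returns array('为', '人', '民', '服', '务').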

Comments

Wesley Tanaka

http://citeseer.ist.psu.edu/594545.html

"context information does not help in the Chinese data beyond 2-grams. The performance increase 3-4% from 1-gram to 2-gram, but does not increase any more."

"For many Asian languages such as Chinese and Japanese, where word segmentation is a hard problem, our character level CAN Bayes model is well suited for text classification because it avoids the need for word segmentation. For Western languages such as Greek and English, we can work at both the word and character levels. In our experiments, we actually found that the character level models worked slightly better than the word level models in the English 20 Newsgroup data set (89% vs. 88%)."

It sounds like either tokenization method described above would be a reasonable choice, and either would be a dramatic improvement over the current tokenizer for many Asian languages.
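
For comparison, a sketch of the 2-gram variant (each pair of adjacent characters becomes one token), again assuming UTF-8 input and using a hypothetical function name:

function spam_tokenize_cjk_bigrams($text) {
  $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
  $tokens = array();
  // Overlapping pairs of adjacent characters.
  for ($i = 0; $i < count($chars) - 1; $i++) {
    $tokens[] = $chars[$i] . $chars[$i + 1];
  }
  // A one-character message still needs at least one token.
  if (empty($tokens)) {
    $tokens = $chars;
  }
  return $tokens;
}

// spam_tokenize_cjk_bigrams("为人民服务") returns array('为人', '人民', '民服', '服务').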

Jeremy

Status: Active » Postponed

I hope to improve international support in the upcoming 5.x-2.x version of the module. Postponing this issue until then.

Jeremy

Version: 5.x-1.x-dev » 5.x-3.x-dev
Status: Postponed » Active

Re-opening against the 5.x-3.x development branch. Help ensuring that this new version of the module offers better international support would be much appreciated.

Jeremy

Assigned: Unassigned » Jeremy

Assigning this to myself, as I'd like to improve the tokenizer to support other languages.

Jeremy

Assigned: Jeremy » Unassigned
Status: Active » Postponed

I want to see better support for other languages in the tokenizer, but that will not happen until after we have a beta release at the earliest. Unassigning and postponing this issue until then, or until someone comes along with a patch.

killes@www.drop.org

Category: feature » bug
Status: Postponed » Active

Inspiration should come from the core search module, which used to have the same problem.

Not supporting more than a subset of languages is a bug.
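
Real messages often mix CJK and non-CJK text, so whatever the tokenizer does will have to combine both behaviours. A rough sketch of that idea, similar in spirit to how core search splits CJK text into short overlapping character sequences (the character ranges below are only an approximation, and spam_tokenize_mixed() is a hypothetical name, not the actual core or module code):

function spam_tokenize_mixed($text) {
  // Rough approximation of CJK ranges: Hiragana, Katakana, CJK ideographs.
  $cjk = '\x{3040}-\x{30ff}\x{4e00}-\x{9fff}';
  // Split the text into CJK runs (captured) and everything else.
  $pieces = preg_split('/([' . $cjk . ']+)/u', $text, -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
  $tokens = array();
  foreach ($pieces as $piece) {
    if (preg_match('/^[' . $cjk . ']/u', $piece)) {
      // CJK run: overlapping character 2-grams, as in the sketch above.
      $chars = preg_split('//u', $piece, -1, PREG_SPLIT_NO_EMPTY);
      for ($i = 0; $i < count($chars) - 1; $i++) {
        $tokens[] = $chars[$i] . $chars[$i + 1];
      }
      if (count($chars) == 1) {
        $tokens[] = $chars[0];
      }
    }
    else {
      // Everything else keeps the current whitespace-based behaviour.
      $tokens = array_merge($tokens, preg_split('/\s+/u', $piece, -1, PREG_SPLIT_NO_EMPTY));
    }
  }
  return $tokens;
}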

AlexisWilke

Version: 5.x-3.x-dev » 6.x-1.x-dev

Bumping to 6.x since we don't support 5.x anymore.

Thank you.
Alexis

killes@www.drop.org

This is now in the dev version; I'd appreciate some testing.

killes@www.drop.org

Status: Active » Fixed

I've tested a bit myself, so I'm considering this fixed.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.