Right now, the spam module appears to consider only space-separated 1-grams. Thus, something like:

spam_tokenize("为人民服务");

would return an array with a single token containing the entire string. In Chinese at least, this behavior is obviously incorrect, as most Chinese text contains no whitespace characters. As a result, nearly every Chinese message would produce its own unique token, the database would quickly fill up with junk, and the filter would be close to useless.

I'm not sure what the "base-level implementation" corresponding to English 1-grams is, but I'll try to find out. It's probably something along the lines of "each character is a separate token" (为, 人, 民, 服, 务 in the above example) or "each pair of adjacent characters is a separate token" (为人, 人民, 民服, 服务 in the above example).
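
For illustration, a minimal sketch of the first option, assuming UTF-8 input and a PHP build with PCRE Unicode support (spam_tokenize_cjk_chars() is a hypothetical name, not an existing function in the module):

function spam_tokenize_cjk_chars($text) {
  // The 'u' modifier makes preg_split() operate on UTF-8 characters
  // rather than bytes, so each CJK character becomes its own token.
  return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// spam_tokenize_cjk_chars("为人民服务") returns array('为', '人', '民', '服', '务').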

Comments

Wesley Tanaka

http://citeseer.ist.psu.edu/594545.html

"context information does not help in the Chinese data beyond 2-grams. The performance increase 3-4% from 1-gram to 2-gram, but does not increase any more."

"For many Asian languages such as Chinese and Japanese, where word segmentation is a hard problem, our character level CAN Bayes model is well suited for text classification because it avoids the need for word segmentation. For Western languages such as Greek and English, we can work at both the word and character levels. In our experiments, we actually found that the character level models worked slightly better than the word level models in the English 20 Newsgroup data set (89% vs. 88%)."

It sounds like either tokenization method described above would be a reasonable choice, and either would be a dramatic improvement over the current tokenizer for many Asian languages.
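
For comparison, a sketch of the 2-gram variant (each pair of adjacent characters becomes one token), again assuming UTF-8 input and using a hypothetical function name:

function spam_tokenize_cjk_bigrams($text) {
  $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
  $tokens = array();
  // Overlapping pairs of adjacent characters.
  for ($i = 0; $i < count($chars) - 1; $i++) {
    $tokens[] = $chars[$i] . $chars[$i + 1];
  }
  // A one-character message still needs at least one token.
  if (empty($tokens)) {
    $tokens = $chars;
  }
  return $tokens;
}

// spam_tokenize_cjk_bigrams("为人民服务") returns array('为人', '人民', '民服', '服务').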

Jeremy

Status: Active » Postponed

I hope to improve international support in the upcoming 5.x-2.x version of the module. Postponing this issue until then.

Jeremy

Version: 5.x-1.x-dev » 5.x-3.x-dev
Status: Postponed » Active

Re-opening against the 5.x-3.x development branch. Help ensuring that this new version of the module offers better international support would be much appreciated.

Jeremy

Assigned: Unassigned » Jeremy

Assigning this to myself, as I'd like to improve the tokenizer to support other languages.

Jeremy

Assigned: Jeremy » Unassigned
Status: Active » Postponed

I want to see better support for other languages in the tokenizer, but that will not happen until after we have a beta release at the earliest. Unassigning and postponing this issue until then, or until someone comes along with a patch.

killes@www.drop.org

Category: feature » bug
Status: Postponed » Active

Inspiration should come from the core search module, which used to have the same problem.

Not supporting more than a subset of languages is a bug.
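
Real messages often mix CJK and non-CJK text, so whatever the tokenizer does will have to combine both behaviours. A rough sketch of that idea, similar in spirit to how core search splits CJK text into short overlapping character sequences (the character ranges below are only an approximation, and spam_tokenize_mixed() is a hypothetical name, not the actual core or module code):

function spam_tokenize_mixed($text) {
  // Rough approximation of CJK ranges: Hiragana, Katakana, CJK ideographs.
  $cjk = '\x{3040}-\x{30ff}\x{4e00}-\x{9fff}';
  // Split the text into CJK runs (captured) and everything else.
  $pieces = preg_split('/([' . $cjk . ']+)/u', $text, -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
  $tokens = array();
  foreach ($pieces as $piece) {
    if (preg_match('/^[' . $cjk . ']/u', $piece)) {
      // CJK run: overlapping character 2-grams, as in the sketch above.
      $chars = preg_split('//u', $piece, -1, PREG_SPLIT_NO_EMPTY);
      for ($i = 0; $i < count($chars) - 1; $i++) {
        $tokens[] = $chars[$i] . $chars[$i + 1];
      }
      if (count($chars) == 1) {
        $tokens[] = $chars[0];
      }
    }
    else {
      // Everything else keeps the current whitespace-based behaviour.
      $tokens = array_merge($tokens, preg_split('/\s+/u', $piece, -1, PREG_SPLIT_NO_EMPTY));
    }
  }
  return $tokens;
}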

AlexisWilke

Version: 5.x-3.x-dev » 6.x-1.x-dev

Bumping to 6.x since we don't support 5.x anymore.

Thank you.
Alexis

killes@www.drop.org

This is now in the dev version; I'd appreciate some testing.

killes@www.drop.org

Status: Active » Fixed

I've tested a bit myself, so I'm considering this fixed.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.