I've noticed the documented fact that search in Chinese doesn't work
out of the box.

Before submitting this, I searched around this site and found a few
references to this problem. The most up-to-date seemed to be:
http://drupal.org/node/2142
in which Steven suggests using a word-breaking preprocessor
(hook_search_preprocess) on Chinese text to split the text into "words".

The way it's described in 2142, "Chinese doesn't use spaces". But I
think a better characterization is "Chinese doesn't have words". I've
read a research paper (I think on automated word segmentation, by Dekai
Wu) which mentioned that native Chinese speakers themselves disagree
considerably about where to put "word breaks" in Chinese text,
presumably because they don't necessarily think of everything in terms
of "words." I know that it's
never clear to me (as a non-native non-literate sort-of-speaker) which
things might be considered "words". A lot of particles could be words
by themselves, or maybe attached to the verb or noun that they modify.
A lot of nouns are compounds, whose component parts make sense as
words by themselves.

So what's the point? I suggest that:

Out of the box, index all Chinese characters as if they were
separate words.

Here's the reasoning:

  1. This is close to the ideal behavior anyway. A search for 我 should find ALL text that contains that character. A search for 我的 should ideally find all text that contains the string "我的", but an okay second place would be to find all text that contained both 我 and 的. Later, a feature could be added to rank search results based on the order of keywords in the original search query, so that "我的" results came up before results with 我 and 的 in different parts of the page.
  2. It seems that to implement that behavior at the moment, I need to reduce the "minimum word length to index" setting of the search indexer to "1" and add a hook_search_preprocess implementation that does something like add a space before and after every CJK character (see the sketch after this list). That's not ideal for the English content on the site.
  3. What I'd like is to not have to reduce the minimum word length to index to "1." I guess I'd still have to reduce the minimum word length to search for to "1."
  4. This seems like it could be done by building the default Chinese word splitting into the search indexing engine, since you'd want to index all Chinese "words" even though they would all be 1 character long.
  5. Indexing all Chinese characters shouldn't be a big deal. Most sites probably won't have any Chinese characters in their content. For the sites that do, consider that the vast majority of Chinese text draws from a bag of the most commonly used 6,000-10,000 characters. That's a much better soft upper bound than the number of words you might encounter on English-language pages.
  6. It's far better than the current out of the box behavior, which doesn't work at all.
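
For concreteness, here's roughly the kind of preprocessor I have in mind (a sketch only: "mymodule" is a placeholder, the hook signature is what I understand hook_search_preprocess to take, and the character class below only covers the main CJK Unified Ideographs block):

  // Sketch only: put a space before and after every CJK ideograph so that
  // each character gets treated as its own "word". The range is just the
  // CJK Unified Ideographs block; a real version would need more ranges.
  function mymodule_search_preprocess($text) {
    return preg_replace('/([\x{4e00}-\x{9fff}])/u', ' $1 ', $text);
  }

Combined with a minimum indexed word length of 1, that should make every character searchable on its own, which is exactly the workaround I'd like to avoid needing.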

Thoughts?

Comments

Steven’s picture

Optimizing for a specific language is really out of the scope of Drupal. The only reason I added the current CJK handling is that no-one actually seemed to use the preprocessor functionality, and CJK search in 4.6 was /really/ broken. Also, Unicode Han Unification means that there are no Chinese-only characters in general; indexing individual characters would be bad for Japanese for example. And adding special code to deal with both is definitely out of Drupal's scope.

As for current HEAD search being broken, I do not know of any major flaws, and I think I addressed your concerns in that other issue. I still hold that it was your own preprocessor messing things up by adding leading and trailing spaces.

I'll admit I don't know enough about Chinese searching patterns to say anything conclusive about it. For example, your concern about short/individual words really depends on whether such words are actually important. Your example is a personal pronoun and a grammatical particle, I think. Their English counterparts would be considered noise words in most cases. And if Chinese characters tend to be restricted to a smaller set than English words, then I would expect character pairing to result in better/more relevant results, rather than worse.

You also need to consider performance... the current CJK tokenizer is a single regexp pass across the entire item text, so it's relatively low-impact. However, if you want to force all Chinese characters to be included regardless of length, you need to add a hanzi regexp/check for every word. That means hundreds of regexp calls per item that are 100% useless on non-CJK sites. Either that, or add some useless characters to each word to extend its length (but then your index would become larger).
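
To make that concrete, the per-word check would have to look something like this (a sketch only; search_index_word and $minimum_word_size are illustrative stand-ins, not actual search module code):

  // Sketch of the per-word check described above; the function and
  // variable names are stand-ins for whatever the indexer actually does.
  foreach ($words as $word) {
    $is_hanzi = preg_match('/[\x{4e00}-\x{9fff}]/u', $word) > 0;
    if (drupal_strlen($word) >= $minimum_word_size || $is_hanzi) {
      search_index_word($word);
    }
  }

That preg_match runs once per word, whether or not the site has any CJK content at all.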

Wesley Tanaka’s picture

indexing individual characters would be bad for Japanese for example.

Does this hold for Kanji, or just for Hiragana and Katakana? I would imagine that Unicode has ranges defining which CJK characters are "Chinese-originated". But if Kanji really does have words (where Chinese characters do not, really), I guess it wouldn't make sense to try to deal with it in an English-centric Drupal.

Your example is a personal pronoun and a grammatical particle, I think.

Yes, that was a very poor example of single-character words and the different way that Chinese deals with compounding. The examples in the other issue are better.

expect character pairing to result in better/more relevant results, rather than worse.

From what I understand, this only holds if the search string is more than 1 character long and contains characters that are expected to appear next to each other in the text of the webpage. Would a search for "ABXY" find webpages that contained the string "ABCXY" or "ABCDEFGXY"? (i.e. pages in which "BX" did not occur)
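
To spell out what I mean (an illustration only, assuming for the sake of argument that both the indexed text and the search keywords get split into overlapping pairs the same way):

  // Illustration: split a run of characters into overlapping pairs.
  function split_into_pairs($str) {
    $pairs = array();
    for ($i = 0; $i < strlen($str) - 1; $i++) {
      $pairs[] = substr($str, $i, 2);
    }
    return $pairs;
  }
  // "ABCXY" would index as AB, BC, CX, XY
  // "ABXY" would search for AB, BX, XY -- and BX never appears on the
  // page, so the page would not match if every pair has to be found.

That's the case I'm worried about.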

You also need to consider performance... the current CJK tokenizer is a single regexp pass across the entire item text, so it's relatively low-impact. However, if you want to force all Chinese characters to be included regardless of length, you need to add a hanzi regexp/check for every word.

I'm confused about that. Wouldn't something like trim(preg_replace('/([CJKCLASS])/u', ' \\1 ', $text)) as a search preprocessor work? Splitting into single characters seems like it would be less intensive than splitting into overlapping pairs like the current code does.

Wesley Tanaka’s picture

See http://drupal.org/node/39137#comment-57327 for a better example. I tried finding it with Drupal's search engine, but couldn't because I was searching for the Chinese characters in the post... ;)

Wesley Tanaka’s picture

Okay. I have 4.7.0-beta2 now. I changed my "Minimum word length to index" to 1 and my "Minimum word length to search for" to 1, and I turned off "Simple CJK Handling".

The search works slightly better than before, but still has some problems.

For example, I have a node whose title is 1路 (bus #1).

If I search for 路, I expect to get that back as a hit.
If I search for 1路, I'd also expect to get it back as a hit, but neither of those searches returns that node.

Also, it would be nice to be able to search in Chinese without having to turn the minimum word length settings down to 1 for English.

Steven’s picture

I cannot reproduce this. Both with simple CJK handling turned on and off, a node with title 1路 gets found.

Wesley Tanaka’s picture

Here's the node I'm looking for:

http://treehouse.ofb.net/go/en/node/1181

You can see the results of these two searches:

http://treehouse.ofb.net/go/en/search/node/1
(which finds it)
and
http://treehouse.ofb.net/go/en/search/node/%E8%B7%AF
(which does not find it).

Steven’s picture

Right, I found the issue ;). search_expand_cjk() returned '$word' rather than ' $word ' when there was no expanding to be done. So single characters would not be split off as separate words; only multi-character runs would be (with minimum length = 1).

Still, even without this fix, searching for '1路' found the node with title '1路' for me. But now, with word length 1, '路' should find it as well.
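
For reference, the relevant branch of the fix looks roughly like this (a sketch from memory, not the committed patch; no_expanding_needed() is just shorthand for the real length check):

  // Sketch of the fix: when a CJK run does not need to be expanded into
  // overlapping sequences, it still has to come back padded with spaces
  // so that it is split off as its own word.
  function search_expand_cjk($word) {
    if (no_expanding_needed($word)) {  // shorthand for the length check
      return ' '. $word .' ';          // the bug: this used to be  return $word;
    }
    // ... otherwise expand into overlapping sequences as before ...
  }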

By the way:

I'm confused about that. Wouldn't something like trim(preg_replace('/[CJKCLASS]/', ' \\1 ')) as a search preprocessor work? Splitting into single characters seem like they would be less intensive than splitting into overlapping pairs like the current code does.

It is not about preprocessing, but about deciding which words to include in the index. This happens later and is based on word length.
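
Schematically, the pipeline is more like this (names are illustrative, not actual code):

  // Illustrative only: preprocessors run first, then each resulting word
  // is checked against the minimum word length before being indexed.
  $text = run_search_preprocess_hooks($text);
  foreach (explode(' ', $text) as $word) {
    if (drupal_strlen($word) >= $minimum_word_length) {
      add_word_to_index($word);
    }
  }

So a single character produced by a preprocessor still gets dropped unless the minimum indexed word length is lowered to 1.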

Steven’s picture

Status: Active » Fixed
Anonymous’s picture

Status: Fixed » Closed (fixed)