I've noticed the documented fact that search in Chinese doesn't work
out of the box.

Before submitting this, I searched around this site and found a few
references to this problem. The most up-to-date seemed to be:
http://drupal.org/node/2142
in which Steven suggests using a word-breaking preprocessor
(hook_search_preprocess) on Chinese text to split the text into "words".

The way it's described in 2142, "Chinese doesn't use spaces". But I
think a better characterization is "Chinese doesn't have words". I've
read a research paper (I think on automated word segmentation, by Dekai
Wu) which mentioned that native Chinese speakers themselves disagree
considerably about where to put "word breaks" in Chinese text,
presumably because they don't necessarily think of everything in terms
of "words." I know that it's
never clear to me (as a non-native non-literate sort-of-speaker) which
things might be considered "words". A lot of particles could be words
by themselves, or maybe attached to the verb or noun that they modify.
A lot of nouns are compounds, whose component parts make sense as
words by themselves.

So what's the point? I suggest that:

Out of the box, index all Chinese characters as if they were
separate words.

Here's the reasoning:

  1. This is close to the ideal behavior anyway. A search for 我 should find ALL text that contains that character. A search for 我的 should ideally find all text that contains the string "我的", but an okay second place would be to find all text that contained both 我 and 的. Later, a feature could be added to rank search results based on the order of keywords in the original search query, so that "我的" results came up before results with 我 and 的 in different parts of the page.
  2. It seems that to implement that behavior at the moment, I need to reduce the "minimum word length to index" setting of the search indexer to "1" and add a hook_search_preprocess implementation that does something like add a space before and after every CJK character (see the sketch after this list). That's not ideal for the English content on the site.
  3. What I'd like is to not have to reduce the minimum word length to index to "1." I guess I'd still have to reduce the minimum word length to search for to "1."
  4. This seems like it could be done by building the default Chinese word splitting into the search indexing engine, since you'd want to index all Chinese "words" even though they would all be 1 character long.
  5. Indexing all Chinese characters shouldn't be a big deal. Most sites probably won't have any Chinese characters in their content. For the sites that do, consider that the vast majority of Chinese text draws from a bag of the most commonly used 6,000-10,000 characters. That's a much better soft upper bound than the number of words you might encounter on English-language pages.
  6. It's far better than the current out of the box behavior, which doesn't work at all.
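
For concreteness, here's roughly the kind of preprocessor I have in mind (a sketch only: "mymodule" is a placeholder, the hook signature is what I understand hook_search_preprocess to take, and the character class below only covers the main CJK Unified Ideographs block):

  // Sketch only: put a space before and after every CJK ideograph so that
  // each character gets treated as its own "word". The range is just the
  // CJK Unified Ideographs block; a real version would need more ranges.
  function mymodule_search_preprocess($text) {
    return preg_replace('/([\x{4e00}-\x{9fff}])/u', ' $1 ', $text);
  }

Combined with a minimum indexed word length of 1, that should make every character searchable on its own, which is exactly the workaround I'd like to avoid needing.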

Thoughts?

Comments

Steven’s picture

Optimizing for a specific language is really out of the scope of Drupal. The only reason I added the current CJK handling is that no-one actually seemed to use the preprocessor functionality, and CJK search in 4.6 was /really/ broken. Also, Unicode Han Unification means that there are no Chinese-only characters in general; indexing individual characters would be bad for Japanese for example. And adding special code to deal with both is definitely out of Drupal's scope.

As for current HEAD search being broken, I do not know of any major flaws, and I think I addressed your concerns in that other issue. I still hold that it was your own preprocessor messing things up by adding leading and trailing spaces.

I'll admit I don't know enough about Chinese searching patterns to say anything conclusive about it. For example, your concern about short/individual words really depends on whether such words are actually important. Your example is a personal pronoun and a grammatical particle, I think. Their English counterparts would be considered noise words in most cases. And if Chinese characters tend to be restricted to a smaller set than English words, then I would expect character pairing to result in better/more relevant results, rather than worse.

You also need to consider performance... the current CJK tokenizer is a single regexp pass across the entire item text, so it's relatively low-impact. However, if you want to force all Chinese characters to be included regardless of length, you need to add a hanzi regexp/check for every word. That means hundreds of regexp calls per item that are 100% useless on non-CJK sites. Either that, or add some useless characters to each word to extend its length (but then your index would become larger).
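
To make that concrete, the per-word check would have to look something like this (a sketch only; search_index_word and $minimum_word_size are illustrative stand-ins, not actual search module code):

  // Sketch of the per-word check described above; the function and
  // variable names are stand-ins for whatever the indexer actually does.
  foreach ($words as $word) {
    $is_hanzi = preg_match('/[\x{4e00}-\x{9fff}]/u', $word) > 0;
    if (drupal_strlen($word) >= $minimum_word_size || $is_hanzi) {
      search_index_word($word);
    }
  }

That preg_match runs once per word, whether or not the site has any CJK content at all.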

Wesley Tanaka’s picture

indexing individual characters would be bad for Japanese for example.

Does this hold for Kanji, or just for Hiragana and Katakana? I would imagine that Unicode has ranges defining which CJK characters are "Chinese-originated". But if Kanji really does have words (where Chinese characters do not, really), I guess it wouldn't make sense to try to deal with it in an English-centric Drupal.

Your example is a personal pronoun and a grammatical particle, I think.

Yes, that was a very poor example of single-character words and the different way that Chinese deals with compounding. The examples in the other issue are better.

expect character pairing to result in better/more relevant results, rather than worse.

From what I understand, this only holds if the search string is more than 1 character long and contains characters that are expected to appear next to each other in the text of the webpage. Would a search for "ABXY" find webpages that contained the string "ABCXY" or "ABCDEFGXY"? (i.e. pages in which "BX" did not occur)
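
To spell out what I mean (an illustration only, assuming for the sake of argument that both the indexed text and the search keywords get split into overlapping pairs the same way):

  // Illustration: split a run of characters into overlapping pairs.
  function split_into_pairs($str) {
    $pairs = array();
    for ($i = 0; $i < strlen($str) - 1; $i++) {
      $pairs[] = substr($str, $i, 2);
    }
    return $pairs;
  }
  // "ABCXY" would index as AB, BC, CX, XY
  // "ABXY" would search for AB, BX, XY -- and BX never appears on the
  // page, so the page would not match if every pair has to be found.

That's the case I'm worried about.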

You also need to consider performance... the current CJK tokenizer is a single regexp pass across the entire item text, so it's relatively low-impact. However, if you want to force all Chinese characters to be included regardless of length, you need to add a hanzi regexp/check for every word.

I'm confused about that. Wouldn't something like trim(preg_replace('/([CJKCLASS])/u', ' \\1 ', $text)) as a search preprocessor work? Splitting into single characters seems like it would be less intensive than splitting into overlapping pairs like the current code does.

Wesley Tanaka’s picture

See http://drupal.org/node/39137#comment-57327 for a better example. I tried finding it with Drupal's search engine, but couldn't because I was searching for the Chinese characters in the post... ;)

Wesley Tanaka’s picture

Okay. I have 4.7.0-beta2 now. I changed my "Minimum word length to index" to 1 and my "Minimum word length to search for" to 1, and I turned off "Simple CJK Handling".

The search works slightly better than before, but still has some problems.

For example, I have a node whose title is 1路 (bus #1).

If I search for 路, I expect to get that back as a hit.
If I search for 1路, I'd also expect to get it back as a hit, but neither of those searches returns that node.

Also, it would be nice to be able to search in Chinese without having to turn the minimum word length settings down to 1 for English.

Steven’s picture

I cannot reproduce this. Both with simple CJK handling turned on and off, a node with title 1路 gets found.

Wesley Tanaka’s picture

Here's the node I'm looking for:

http://treehouse.ofb.net/go/en/node/1181

You can see the results of these two searches:

http://treehouse.ofb.net/go/en/search/node/1
(which finds it)
and
http://treehouse.ofb.net/go/en/search/node/%E8%B7%AF
(which does not find it).

Steven’s picture

Right, I found the issue ;). search_expand_cjk() returned '$word' rather than ' $word ' when there was no expanding to be done. So single characters would not be split off as separate words; only multi-character runs would be (with minimum length = 1).

Still, even without this fix, searching for '1路' found the node with title '1路' for me. But now, with word length 1, '路' should find it as well.
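
For reference, the relevant branch of the fix looks roughly like this (a sketch from memory, not the committed patch; no_expanding_needed() is just shorthand for the real length check):

  // Sketch of the fix: when a CJK run does not need to be expanded into
  // overlapping sequences, it still has to come back padded with spaces
  // so that it is split off as its own word.
  function search_expand_cjk($word) {
    if (no_expanding_needed($word)) {  // shorthand for the length check
      return ' '. $word .' ';          // the bug: this used to be  return $word;
    }
    // ... otherwise expand into overlapping sequences as before ...
  }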

By the way:

I'm confused about that. Wouldn't something like trim(preg_replace('/[CJKCLASS]/', ' \\1 ')) as a search preprocessor work? Splitting into single characters seem like they would be less intensive than splitting into overlapping pairs like the current code does.

It is not about preprocessing, but about deciding which words to include in the index. This happens later and is based on word length.
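
Schematically, the pipeline is more like this (names are illustrative, not actual code):

  // Illustrative only: preprocessors run first, then each resulting word
  // is checked against the minimum word length before being indexed.
  $text = run_search_preprocess_hooks($text);
  foreach (explode(' ', $text) as $word) {
    if (drupal_strlen($word) >= $minimum_word_length) {
      add_word_to_index($word);
    }
  }

So a single character produced by a preprocessor still gets dropped unless the minimum indexed word length is lowered to 1.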

Steven’s picture

Status: Active » Fixed
Anonymous’s picture

Status: Fixed » Closed (fixed)