(Note: If your computer doesn't properly support Japanese, you'll have to use your imagination some in the following post. Credit for bringing attention to this problem goes to kurupira in the Japan group on g.d.o.)

Katakana is a Japanese phonetic script which is most often used to write foreign loanwords such as パン pan, which means "bread" and comes from the Portuguese word for "bread," and アルバイト arubaito, which means "part-time job" and comes from the German word for "work" (Arbeit). It's also used to spell out names of species, particularly when speaking clinically, and for sound effects and onomatopoeia.

All but one of the Japanese morae (a mora is a distinct sound-part of a syllable) end with a vowel sound. That vowel sound can be extended for another mora to make a similar-sounding (especially to foreign ears) but distinct syllable. In katakana, this is done with the character ー, called the chouonpu (which is distinct from the character 一, the kanji character ichi, meaning "one"). For example, the word "rekoudo," from the English "record" (as in vinyl album), is レコード re-ko-[long vowel]-do, whereas "deito," from the English "date" (a romantic outing), is デート de-[long vowel]-to. Thus, the character is significant when considering the pronunciation, and therefore the spelling, of a word.

However, Drupal's PREG_CLASS_SEARCH_EXCLUDE constant includes this character, causing Drupal to omit it when indexing and searching. Combine that with the fact that, by default, "words" of fewer than three characters aren't accepted as search terms, and we've got a problem. A word like "メール" (meiru, from English "mail" and meaning an email or text message) is split into, and indexed as, "メ" and "ル" (me and ru), which are meaningless by themselves. Searching for "メール" will return an error that your search term must be at least three characters long!

Many searches which are still at least three characters long when chouonpu are not counted will work regardless; however, others will produce unexpected results. For example, if I create a node which contains "ハンバーガーメール" hanbaagaa meiru, index it, and then search for "ハンバール" hanbaaru, that node will be returned!
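The behaviour described above can be reproduced with a toy model. The following Python sketch is not Drupal's code: it assumes the indexer simply replaces every excluded character with a space, splits on whitespace, and ignores words shorter than three characters, and it leaves out everything else Drupal's search does to text. The names `simplify`, `keywords`, and `matches` are hypothetical.

```python
MIN_WORD_LENGTH = 3   # assumed default minimum keyword length
CHOUONPU = "\u30fc"   # ー, the long-vowel mark discussed above


def simplify(text, excluded):
    """Replace every excluded character with a space, then split into words."""
    for char in excluded:
        text = text.replace(char, " ")
    return text.split()


def keywords(text, excluded):
    """Keep only the words long enough to be indexed or searched for."""
    return {w for w in simplify(text, excluded) if len(w) >= MIN_WORD_LENGTH}


def matches(query, body, excluded):
    """Return True/False for a match, or None if the query has no usable keyword."""
    wanted = keywords(query, excluded)
    if not wanted:
        return None  # stands in for the "at least three characters" error
    return wanted <= keywords(body, excluded)


body = "ハンバーガーメール"
buggy = {CHOUONPU}   # chouonpu treated as an excluded character
fixed = set()        # chouonpu treated as part of the word

print(simplify("メール", buggy))            # ['メ', 'ル']
print(matches("メール", body, buggy))       # None: no keyword of 3+ characters
print(matches("ハンバール", body, buggy))   # True: only 'ハンバ' is compared
print(matches("ハンバール", body, fixed))   # False: the whole words differ
```

In this model, both problems disappear as soon as the chouonpu is no longer in the excluded set, which is all the proposed patch changes.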

PREG_CLASS_SEARCH_EXCLUDE is supposed to list punctuation, but ー (U+30FC) is most decidedly not punctuation. In my opinion, this should be considered a bug, and fixed in D6 (and D5?) as well. I was instructed in #drupal to submit the patch for D7 first, but couldn't help attaching the D6 patch too. (Note that, save for a whitespace style difference, the patches are identical.) The search index will need to be rebuilt after applying this patch to ensure proper behavior with nodes submitted before the patch was applied.
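One way to check the "not punctuation" claim is to ask the Unicode character database. A minimal sketch using Python's standard `unicodedata` module (Python is used here for illustration only; Drupal itself is PHP):

```python
import unicodedata

chouonpu = "\u30fc"  # ー
ichi = "\u4e00"      # 一, the kanji for "one"

for char in (chouonpu, ichi):
    print(f"U+{ord(char):04X}", unicodedata.name(char), unicodedata.category(char))

# General categories starting with "P" are punctuation; "L" means letter.
is_punctuation = unicodedata.category(chouonpu).startswith("P")
print(is_punctuation)  # False
```

The chouonpu is reported in category Lm (modifier letter), which is consistent with treating it as part of a word rather than as a word boundary.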


Comments

Garrett Albright’s picture

Version: 6.12 » 7.x-dev

Well, this was a D6 bug report when I started writing it, but a D7 one when I finished. Fixing the "Version" menu.

Dries’s picture

Very entertaining and educational read -- great issue description! For D7, it would be great to have a couple of simple tests for those. We have search tests already so maybe those can be extended. Or, do you think we don't need tests for this?

Garrett Albright’s picture

I think a test or two would be great, but I'm not smart enough to write them yet.

I guess there's no time like the present to learn, though… I'll look into it when I'm off the clock.

Garrett Albright’s picture

FileSize
2.26 KB

Well, I hacked search.test to add a Japanese test to it, but when I go to run the tests, I can't find the Search module's test category on the test listing page thingie. I tried it on a fresh (unhacked) D7 installation too, but I couldn't find it there either. Yes, I enabled the Search module first… I'm not sure what that means, and the folks in #drupal couldn't help me tonight. Help, I'm a clueless testing n00b! What should I do?

I'm not sure how the modifications will work out anyway for people who don't have Japanese support on their computers and/or whose editors can't support UTF-8 files. Their editor might render it as a bunch of puke all over their screen and/or save it as a bunch of puke when they save the file. I tried to research some simple way of encoding Unicode characters as ASCII in PHP strings, but didn't find anything reasonable; I'm hoping the answer isn't using pack() or something equally Perlesque.
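One possible approach, offered as a sketch rather than a recommendation: PHP double-quoted strings accept `\xHH` byte escapes, so a UTF-8 string can be spelled out byte by byte in pure ASCII. The hypothetical helper below is written in Python only to generate such a literal; it is not part of any patch in this issue.

```python
def php_ascii_literal(text):
    """Return a PHP double-quoted literal spelling out text as UTF-8 byte escapes."""
    escaped = "".join(f"\\x{byte:02X}" for byte in text.encode("utf-8"))
    return f'"{escaped}"'


print(php_ascii_literal("ー"))  # "\xE3\x83\xBC"
```

The resulting file survives any editor or transport that mangles non-ASCII text, at the cost of test strings that are much harder to read.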

Here are my doubly-untested hacks to search.test.

EDIT: Upload attachment fail! This file is a patch and should have a .patch extension… Ouch, another n00b mistake.

tobiasb’s picture

Status: Needs review » Needs work

The last submitted patch failed testing.

Garrett Albright’s picture

I gave this another try this weekend. I reinstalled D7 and the search tests seemed to reappear. After some tweaking, I managed to create a semblance of some working tests. This patch combines the tweak in the first patch (unmodified) with some working tests.

If we're going to include Japanese tests, it would probably be good to include some for other languages with non-Roman scripts as well: Korean, Arabic, Cyrillic… But with each one there's the problem I mentioned in my previous post. When I view my own attached patch in the web browser, there are just garbage symbols where the Japanese should be… I don't know what to do about that.

Garrett Albright’s picture

Status: Needs work » Needs review

Marking as "needs review" and bumping because if D7 and D6.14 come out without this bug being fixed… well, one word: Ninja.

Status: Needs review » Needs work

The last submitted patch failed testing.

Dokuro’s picture

Hi Garrett,

Any update on this issue?

catch’s picture

Drupal.org sends all patch files as ISO to the browser. If you manually change the encoding to UTF-8 (View -> Encoding in Firefox), the kana show up fine.
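The effect can be reproduced in a few lines. This sketch assumes that "ISO" here means ISO-8859-1, which the comment does not state explicitly; Python is used for illustration only.

```python
original = "メール"
raw_bytes = original.encode("utf-8")       # what the patch file contains

garbled = raw_bytes.decode("iso-8859-1")   # the same bytes misread as ISO-8859-1
recovered = raw_bytes.decode("utf-8")      # the same bytes read as UTF-8

print(len(original), len(garbled))  # 3 9: each kana turns into three junk characters
print(recovered == original)        # True
```

The file itself is intact in both cases; only the interpretation of its bytes changes.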

edit: there is a very similar bug for Thai vowels #335928: Thai vowels are excluded in search index.

Status: Needs work » Needs review

catch requested that failed test be re-tested.

Garrett Albright’s picture

Dokuro wrote: "Hi Garrett, any update on this issue?"

I'm hoping both patches can still get in, though I'm not quite sure what was broken with the last test. I'm anxiously awaiting the results of the retest, but if something's still broken, I may not be able to work on it until next weekend.

Status: Needs review » Needs work

The last submitted patch failed testing.

Status: Needs work » Needs review

Re-test of the patch from comment #1750476 was requested by @user.

Garrett Albright’s picture

Totally anti-climactic conclusion after waiting all that time. Re-testing.

Status: Needs review » Needs work

The last submitted patch failed testing.

Garrett Albright’s picture

All right, well, I tried installing D7 to do some further work on this, but D7's DB handling seems to be rather broken at this particular point in time and I can't even get to the admin page to enable modules or anything. But just for the record, I'm trying, dammit.

Still hoping this can at least get in for D6. It's a rather serious bug, after all.

Garrett Albright’s picture

FileSize
3.12 KB

Update. Fixed some tests, because it looks like the way those are done has changed since we started. Not entirely sure this fixes everything yet, though. Note that unless you install the patch at #672328: Let's fix some search module tests! first, you're going to get errors when you run the search tests, no matter what.

codycraven’s picture

Status: Needs work » Needs review

Set to needs review so patch will be tested

Status: Needs review » Needs work

The last submitted patch, chouonpu-100101-d7.patch, failed testing.

Status: Needs work » Needs review

Re-test of chouonpu-100101-d7.patch from comment #19 was requested by Garrett Albright.

Status: Needs review » Needs work

The last submitted patch, chouonpu-100101-d7.patch, failed testing.

Garrett Albright’s picture

Status: Needs work » Needs review

Reroll. Thanks, axyjo in #drupal-contribute.

Garrett Albright’s picture

FileSize
3.54 KB

A Varnish freakout ate my attachment. =[

axyjo’s picture

Looks good to me; let's see what the testbot has to say. Patch reviewers, don't make the same mistake I did: make sure that your browser encoding is set to UTF-8.

Dries’s picture

Status: Needs review » Fixed

Committed this to CVS HEAD. Thanks all.

Given that the test bot passed, do we still need #672328: Let's fix some search module tests!?

axyjo’s picture

Version: 7.x-dev » 6.x-dev
Status: Fixed » Patch (to be ported)

Is this something we can backport to D6 since this is a bug?

Garrett Albright’s picture

Dries wrote: "Given that the test bot passed, do we still need #672328: Let's fix some search module tests!?"

I don't know about the test bot, but locally, without that patch, running the Search tests results in failure, whereas everything's green with the patch. Take a quick peek at the patch and I think you'll see why.

I also agree that this should get into D6. The patch in the OP should still be good.

qchan’s picture

Thank you for chouonpu-patch. It's very useful.
I wish it would be introduced into Drupal6 core too.

setvik’s picture

Fantastic news and awesome job!!

+1 for getting this into Drupal 6 as well if possible. It'll be key for growing Drupal in Japan.

axyjo’s picture

Status: Patch (to be ported) » Needs review

Just realized that a patch already exists for D6 in the issue overview.

Garrett Albright’s picture

We need to keep an eye on #604002: Poor search support of some Unicode scripts, which also makes changes to Drupal's search tool with regards to its handling of various scripts.

jhodgdon’s picture

Status: Needs review » Reviewed & tested by the community

It looks like the patch in the original report for D6 http://drupal.org/files/issues/search-chouonpu-D6.patch is the same as the D7 patch that was committed (minus the tests). Since the D7 patch was reviewed/committed, let's go ahead and set the D6 patch to RTBC.

Gábor Hojtsy’s picture

Status: Reviewed & tested by the community » Fixed

Thanks, committed.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.