(Note: If your computer doesn't properly support Japanese, you'll have to use your imagination some in the following post. Credit for bringing attention to this problem goes to kurupira in the Japan group on g.d.o.)
Katakana is a Japanese phonetic script which is most often used to write foreign words such as パン pan, which means "bread" and is from the Portuguese word for "bread," and アルバイト arubaito, which means "part-time job" and is from the German word for "work" (Arbeit). It's also used to spell out names of species, particularly when speaking clinically, and for sound effects and onomatopoeia.
All but one of the Japanese morae (a mora is a distinct sound unit within a syllable) end with a vowel sound. That vowel sound can be extended for another mora to make a similar-sounding (especially to foreign ears) but distinct syllable. In katakana, this is done with the character ー, called the chouonpu (which is distinct from the character 一, the kanji ichi, meaning "one"). For example, the word "rekoudo," from the English "record" (as in vinyl album), is レコード re-ko-[long vowel]-do, whereas "deito," from the English "date" (a romantic outing), is デート de-[long vowel]-to. Thus, the character is significant when considering the pronunciation, and therefore the spelling, of a word.
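To see that the two look-alike characters really are different, it helps to look at their code points. The following is only an illustrative sketch using Python's standard `unicodedata` module; it is not part of Drupal, and the variable names are made up for the example.

```python
import unicodedata

chouonpu = "\u30fc"  # ー, the katakana long vowel mark
ichi = "\u4e00"      # 一, the kanji for "one"

# Print the code point and official Unicode name of each character.
for ch in (chouonpu, ichi):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# They are different code points, so swapping or dropping one changes the word.
print(chouonpu == ichi)

# The long vowel mark counts as a character of its own in each word.
print(len("レコード"), len("デート"))
```

Because the chouonpu is a character in its own right, removing it from レコード leaves a three-character string that is no longer the same word.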
However, Drupal's PREG_CLASS_SEARCH_EXCLUDE constant includes this character, causing Drupal to omit it when indexing and searching. Combine that with the fact that, by default, "words" of fewer than three characters aren't accepted as search terms, and we've got a problem. A word like "メール" (meiru, from English "mail" and meaning an email or text message) is split into, and indexed as, "メ" and "ル" (me and ru), which are meaningless by themselves. Searching for "メール" will return an error that your search term must be at least three characters long!
Many searches which are longer than three characters when chouonpu are not counted will work regardless; however, others will cause unexpected results. For example, if I create a node which contains "ハンバーガーメール" hanbaagaa meiru, index, and then search for "ハンバール" hanbaaru, that node will be returned!
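The behavior described above can be imitated with a toy model. This is a rough sketch in Python, not Drupal's actual search code: the function names `fragments` and `toy_search` are hypothetical, and the model assumes only what the post states (the chouonpu acts as a word boundary, and a query needs at least one "word" of three characters).

```python
import re

MIN_WORD_SIZE = 3  # default minimum word length mentioned in the post


def fragments(text, exclude_chouonpu=True):
    """Split text into 'words', treating excluded characters as boundaries."""
    boundary = r"[\s\u30fc]+" if exclude_chouonpu else r"\s+"
    return [word for word in re.split(boundary, text) if word]


def toy_search(query, document, exclude_chouonpu=True):
    """Return True if every query fragment is an indexed fragment.

    Raises ValueError when no query fragment reaches MIN_WORD_SIZE.
    """
    query_words = fragments(query, exclude_chouonpu)
    if not any(len(word) >= MIN_WORD_SIZE for word in query_words):
        raise ValueError("search term must be at least three characters long")
    indexed = set(fragments(document, exclude_chouonpu))
    return all(word in indexed for word in query_words)


node_text = "ハンバーガーメール"

# With ー excluded, the node is indexed as four meaningless pieces.
print(fragments(node_text))

# The nonsense query matches, because both of its pieces are in the index.
print(toy_search("ハンバール", node_text))

# With ー kept, the same query no longer matches.
print(toy_search("ハンバール", node_text, exclude_chouonpu=False))
```

In this model, `toy_search("メール", node_text)` raises the "at least three characters" error, because the query collapses to two one-character pieces.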
PREG_CLASS_SEARCH_EXCLUDE is supposed to list punctuation, but ー (Unicode 30FC) is most decidedly not punctuation. In my opinion, this should be considered a bug, and fixed in D6 (and D5?) as well. I was instructed in #drupal to submit the patch for D7 first, but couldn't help attaching the D6 patch too. (Note that, save for a whitespace style difference, the patches are identical.) The search index will need to be rebuilt after applying this patch to ensure proper behavior with nodes submitted before the patch was applied.
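The claim that ー is not punctuation can be checked against the Unicode general category of the character. The sketch below only inspects Unicode data through Python's `unicodedata` module; it says nothing about how Drupal builds PREG_CLASS_SEARCH_EXCLUDE, and `is_punctuation` is a helper invented for this example.

```python
import unicodedata


def is_punctuation(ch):
    """True if the character's Unicode general category is a punctuation class."""
    # All punctuation categories (Pc, Pd, Ps, Pe, Pi, Pf, Po) start with "P".
    return unicodedata.category(ch).startswith("P")


# General category of ー (U+30FC).
print(unicodedata.category("\u30fc"))

# ー compared with 、 (U+3001, the ideographic comma).
print(is_punctuation("\u30fc"))
print(is_punctuation("\u3001"))
```

Unicode classifies U+30FC as a modifier letter (category Lm), while the ideographic comma falls in a punctuation category.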
Comment | File | Size | Author |
---|---|---|---|
#25 | chouonpu-100102-d7.patch | 3.54 KB | Garrett Albright |
#19 | chouonpu-100101-d7.patch | 3.12 KB | Garrett Albright |
#7 | search-chouonpu-tests-d7.patch | 3.06 KB | Garrett Albright |
#5 | japanese_search_493770.patch | 2.26 KB | tobiasb |
#4 | japanese.search.test | 2.26 KB | Garrett Albright |
Comments
Comment #1
Garrett Albright commented
Well, this was a D6 bug report when I started writing it, but a D7 one when I finished. Fixing the "Version" menu.
Comment #2
Dries commented
Very entertaining and educational read -- great issue description! For D7, it would be great to have a couple of simple tests for those. We have search tests already so maybe those can be extended. Or, do you think we don't need tests for this?
Comment #3
Garrett Albright commented
I think a test or two would be great, but I'm not smart enough to write them yet.
I guess there's no time like the present to learn, though… I'll look into it when I'm off the clock.
Comment #4
Garrett Albright commented
Well, I hacked search.test to add a Japanese test to it, but when I go to run the tests, I can't find the Search module's test category on the test listing page thingie. I tried it on a fresh (unhacked) D7 installation too, but I couldn't find it there either. Yes, I enabled the Search module first… I'm not sure what that means, and the folks in #drupal couldn't help me tonight. Help, I'm a clueless testing n00b! What should I do?
I'm not sure how the modifications will work out anyway for people who don't have Japanese support on their computers and/or whose editors can't support UTF-8 files. Their editor might render it as a bunch of puke all over their screen and/or save it as a bunch of puke when they save the file. I tried to research some simple way of encoding Unicode characters as ASCII in PHP strings, but didn't find anything reasonable; I'm hoping the answer isn't using pack() or something equally Perlesque.
Here's my doubly-untested hacks to search.test.
EDIT: Upload attachment fail! This file is a patch and should have a .patch extension… Ouch, another n00b mistake.
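On the question of keeping test files ASCII-only: one option is to spell the characters as escapes. The sketch below is in Python purely for illustration, and `to_byte_escapes` is a hypothetical helper. It prints the UTF-8 bytes of a string as `\xHH` escapes, a notation that PHP double-quoted strings also accept, though whether that is the right approach for search.test is an open assumption.

```python
def to_byte_escapes(text):
    """Render the UTF-8 bytes of text as an ASCII-only run of \\xHH escapes."""
    return "".join(f"\\x{byte:02X}" for byte in text.encode("utf-8"))


word = "\u30e1\u30fc\u30eb"  # メール written with ASCII-only code point escapes

# The escaped spelling and the literal spelling are the same string.
print(word == "メール")

# ASCII-only byte escapes that could be pasted into a source file.
print(to_byte_escapes(word))

# Decoding those bytes as UTF-8 gives the original word back.
print(b"\xe3\x83\xa1\xe3\x83\xbc\xe3\x83\xab".decode("utf-8") == word)
```

Either spelling keeps the file readable in editors that cannot handle UTF-8, at the cost of making the test strings harder for a human to check.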
Comment #5
tobiasb commented
Comment #7
Garrett Albright commented
I gave this another try this weekend. I reinstalled D7 and the search tests seemed to reappear. After some tweaking, I managed to create a semblance of some working tests. This patch combines both the tweak in the first patch (unmodified) plus some working tests.
If we're going to include Japanese tests, it would probably be good to include some for other languages with non-Roman scripts as well - Korean, Arabic, Cyrillic… But with each one there's the problem I mentioned in my previous post. When I view my own attached patch in the web browser, there's just garbage symbols where the Japanese should be… I don't know what to do about that.
Comment #8
Garrett Albright commented
Marking as "needs review" and bumping because if D7 and D6.14 come out without this bug being fixed… well, one word: Ninja.
Comment #10
Dokuro commented
Hi Garrett,
Any update on this issue?
Comment #11
catch commented
Drupal.org sends all patch files as ISO to the browser; if you manually change the encoding to UTF-8 (View -> Encoding in Firefox), then the kana shows up fine.
edit: there is a very similar bug for Thai vowels #335928: Thai vowels are excluded in search index.
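The garbage symbols seen when viewing the patch can be reproduced by decoding UTF-8 bytes with the wrong charset. This is a small Python sketch that assumes "ISO" above means ISO-8859-1; the variable names are illustrative only.

```python
original = "メール"

# The patch file on disk is UTF-8.
wire_bytes = original.encode("utf-8")

# A browser told the content is ISO-8859-1 decodes each byte as one character.
garbled = wire_bytes.decode("iso-8859-1")
print(len(original), len(garbled))
print(garbled == original)

# Switching the encoding back to UTF-8 reinterprets the same bytes correctly.
recovered = garbled.encode("iso-8859-1").decode("utf-8")
print(recovered == original)
```

The bytes themselves are never damaged in this scenario; only their interpretation is wrong, which is why changing the browser encoding is enough to fix the display.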
Comment #13
Garrett Albright commented
I'm hoping both patches can still get in, though I'm not quite sure what was broken with the last test. I'm anxiously awaiting the results of the retest, but if something's still broken, I may not be able to work on it until next weekend.
Comment #16
Garrett Albright commented
Totally anti-climactic conclusion after waiting all that time. Re-testing.
Comment #18
Garrett Albright commented
All right, well, I tried installing D7 to do some further work on this, but D7's DB handling seems to be rather broken at this particular point in time and I can't even get to the admin page to enable modules or anything. But just for the record, I'm trying, dammit.
Still hoping this can at least get in for D6. It's a rather serious bug, after all.
Comment #19
Garrett Albright commented
Update. Fixed some tests, because it looks like the way those are done has changed since we started. Not entirely sure this fixes everything yet, though. Note that unless you install the patch at #672328: Let's fix some search module tests! first, you're going to get errors when you run the search tests, no matter what.
Comment #20
codycraven commented
Set to needs review so patch will be tested
Comment #24
Garrett Albright commented
Reroll. Thanks, axyjo in #drupal-contribute.
Comment #25
Garrett Albright commented
A Varnish freakout ate my attachment. =[
Comment #26
axyjo commented
Looks good to me, let's see what the testbot has to say. Patch reviewers, don't make the same mistake I did: make sure that your browser encoding is set to UTF-8.
Comment #27
Dries commented
Committed this to CVS HEAD. Thanks all.
Given that the test bot passed, do we still need #672328: Let's fix some search module tests!?
Comment #28
axyjo commented
Is this something we can backport to D6 since this is a bug?
Comment #29
Garrett Albright commented
I don't know about the test bot, but locally, without that patch, running the Search tests results in failure, whereas everything's green with the patch. Take a quick peek at the patch and I think you'll see why.
I also agree that this should get into D6. The patch in the OP should still be good.
Comment #30
qchan commented
Thank you for the chouonpu patch. It's very useful.
I wish it would be introduced into Drupal 6 core too.
Comment #31
setvik commented
Fantastic news and awesome job!!
+1 for getting this into Drupal 6 as well if possible. It'll be key for growing Drupal in Japan.
Comment #32
axyjo commented
Just realized that a patch already exists for D6 in the issue overview.
Comment #33
Garrett Albright commented
We need to keep an eye on #604002: Poor search support of some Unicode scripts, which also makes changes to Drupal's search tool with regards to its handling of various scripts.
Comment #34
jhodgdon commented
It looks like the patch in the original report for D6 http://drupal.org/files/issues/search-chouonpu-D6.patch is the same as the D7 patch that was committed (minus the tests). Since the D7 patch was reviewed/committed, let's go ahead and set the D6 patch to RTBC.
Comment #35
Gábor Hojtsy commented
Thanks, committed.