I came across a few posts from people who say that they can't search with Unicode characters. I was pretty sure this was not true, so I went searching the queue for any issue that either details the problem or shows where it was fixed. I can't seem to find anything specific about this.
Here is the main forum thread that people seem to be pointing to when they are looking for evidence of it not working. http://drupal.org/node/178840 [#178840]
Perhaps someone more knowledgeable can point to the patch; if not, this will at least put the issue on the radar instead of letting it get lost in the forums.
Comment | File | Size | Author |
---|---|---|---|
#77 | search-pedantic-grammar-D7.patch | 1.89 KB | Garrett Albright |
#76 | search-unicode-d6.patch | 10.74 KB | Garrett Albright |
#75 | search-unicode-d6.patch | 49.89 KB | Garrett Albright |
#73 | search_unicode_tests.patch | 56.99 KB | chx |
#70 | search_unicode_tests.patch | 55.34 KB | chx |
Comments
Comment #1
kaakuu commented
This was confirmed long ago, yet answers or a solution seem to be lacking for something so important to Devanagari/Indic and similar Unicode sites.
Now apparently chx has a solution for this, and I have requested that he post it.
You can see this part of this thread http://drupal.org/node/671566#comment-242514
and since no one is attending to this issue you can post a request to chx for the facts on the solution. Thanks.
Edited: chx has looked into this and asked for some info on this bug (http://drupal.org/node/671566#comment-2426384)
@rcross - the following info is needed, so please help with it.
"crucial information like a) being in the issue b) the Unicode library from the status report page c) PCRE version (from the phpinfo linked from the status report page) d) OS."
Chx also asked to "file a bug report" - so I made this issue http://drupal.org/node/672430
I will also try a fresh install again and post this info as soon as I can.
Comment #2
kaakuu commented
The info on my part:
Drupal 6.15, usual LAMP stack (I tried this on three or four popular web hosts)
Unicode library - PHP Mbstring Extension
PCRE Library Version 7.8 2008-09-05
I have tried just now again with a fresh install of Drupal with the above Unicode lib and PCRE specifications.
I pasted the following Unicode text in a node -
सुदृढ आणि सुजाण बाळाची चाहूल सुदृढ, सशक्त व हुशार मुले ही ज्याप्रमाणे आई वडिलांचा तसेच समाजाला आधार असतात, त्याचप्रमाणे देशाची खरी संपत्ती असतात अशी मुले ही घडवावी लागतात ornage
लठ्ठपणा घालविण्याचे सोपे उपाय डाएटिंग सुरु केल्यानंतर वजन कमी होण्याची गती अपेक्षाकृत जलद असते. नंतर मात्र ही गती मंदावते. त्यामुळे निराश होऊ नये. त्यानंतर मात्र वजन कमी होऊ लागते orange
I indexed my site after making sure my search settings are okay.
Search can find the word orange but cannot find बाळाची
Comment #3
Dave Reid commented
Comment #4
douggreen commented
Damien Tournoud suggests in #218403: Duplicate entry errors in search indexer comment#26 what needs to be done for 7.x:
@kaakuu, does changing search_index to utf8_bin_ci solve the problem?
Comment #5
kaakuu commented
@douggreen, the link says that it solved that issue, but apparently this is a different one.
Changing search_index to utf8_bin as you suggested, or to various similar utf8 combinations, does not solve the problem. More specifically, Drupal throws an error message when asked to search for complex Unicode words in Indic, Devanagari or similar Unicode text.
If there is a working demo that shows changing search_index to utf8_bin_ci solves this issue, it will help us, as we can then try tweaking the various settings further. However, apparently WP and others just do this out of the box without any maneuvers.
To repeat, keep or change search index to utf8__ as suggested or various other
Then, paste this sample Unicode text or any such text with complex words:
सुदृढ आणि सुजाण बाळाची चाहूल सुदृढ, सशक्त व हुशार मुले ही ज्याप्रमाणे आई वडिलांचा तसेच समाजाला आधार असतात, त्याचप्रमाणे देशाची खरी संपत्ती असतात अशी मुले ही घडवावी लागतात orange लठ्ठपणा घालविण्याचे सोपे उपाय डाएटिंग सुरु केल्यानंतर वजन कमी होण्याची गती अपेक्षाकृत जलद असते. नंतर मात्र ही गती मंदावते. त्यामुळे निराश होऊ नये. त्यानंतर मात्र वजन कमी होऊ लागते orange
Now, index your site, run cron, do whatever is needed and search words like त्यामुळे or बाळाची or त्यानंतर
Comment #6
Dave Reid commented
The problem isn't the indexing or database table encoding. Everything works properly. What's going on is the three words at the end of #5 fail search.module's "You must include at least one positive keyword with 3 characters or more." check. I performed several successful searches with words like सशक्त.
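One possible reading of why सशक्त is found while the three words from #5 are rejected, consistent with what #15 later reports: if combining marks are treated as word boundaries, the longer words fall apart into fragments shorter than three characters. The sketch below is Python, not Drupal's PHP, and it assumes (the thread does not confirm this) that each excluded character is replaced by a space before the length check. `fragments_without_marks` and `passes_length_check` are hypothetical helper names, not search.module functions.

```python
import unicodedata

# Minimum keyword length, taken from the "3 characters or more" error message.
MIN_WORD_SIZE = 3


def fragments_without_marks(word):
    """Replace each combining mark (General Category M*) with a space, then split.

    Assumption: this mimics an indexer that treats marks as word boundaries.
    """
    spaced = "".join(
        " " if unicodedata.category(ch).startswith("M") else ch for ch in word
    )
    return spaced.split()


def passes_length_check(word):
    """True if at least one surviving fragment has MIN_WORD_SIZE characters."""
    return any(len(part) >= MIN_WORD_SIZE for part in fragments_without_marks(word))


# Words from the thread, written as escapes so the code points are unambiguous.
SIMPLE_WORD = "\u0938\u0936\u0915\u094d\u0924"  # सशक्त (found in #6)
COMPLEX_WORDS = [
    "\u0924\u094d\u092f\u093e\u092e\u0941\u0933\u0947",  # त्यामुळे
    "\u092c\u093e\u0933\u093e\u091a\u0940",              # बाळाची
    "\u0924\u094d\u092f\u093e\u0928\u0902\u0924\u0930",  # त्यानंतर
]

if __name__ == "__main__":
    for word in [SIMPLE_WORD] + COMPLEX_WORDS:
        print(word, fragments_without_marks(word), passes_length_check(word))
```

Under that assumption सशक्त keeps one three-letter fragment because it contains a single mark near its end, while every fragment of the other three words is one or two letters long.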
Comment #7
kaakuu commented
@Dave When we search for words like त्यामुळे or बाळाची or त्यानंतर, the error message itself is critically erroneous, as the search word already includes "at least one positive keyword with 3 characters or more."
When you search for words like सशक्त you are probably searching for a simple word, which behaves like English words. However, Unicode Devanagari, Indic and similar scripts are actually full of complex words, and finding complex words in search is a critical necessity.
Can you find these words त्यामुळे or बाळाची or त्यानंतर or similar in the search index in the database? I am not sure I can find them there, but this may need more testing than the quick look I had just now.
Let us, for example, change the sample text to
त्यामुळे बाळाची त्यानंतर त्यामुळे बाळाची त्यानंतर त्यामुळे बाळाची त्यानंतर त्यामुळे बाळाची त्यानंतर त्यामुळे बाळाची त्यानंतर त्यामुळे बाळाची त्यानंतर त्यामुळे बाळाची त्यानंतर त्यामुळे बाळाची त्यानंतर
(Those who are testing, please do the test with this text from now on, as it is more representative of actual usage.)
Now, can you please do a re-test and find whether search works for त्यानंतर त्यामुळे ?
This search term has included "at least one positive keyword with 3 characters or more."
Comment #8
chx commented
Is this a duplicate and/or variant of #335928: Thai vowels are excluded in search index? Can someone check which characters are problematic here and whether we exclude them in error?
Comment #9
kaakuu commented
> Is this a duplicate and/or variant
No, as far as I comprehend.
> which characters are problematic
It can be a vowel or a consonant, or any character combination that occurs in a complex word.
I just now set up a demo WordPress site http://unimode.wordpress.com/ and did the steps in #7 above (except that here indexing is automatic and one has to do nothing setup-wise, search happens automatically) - the results are as expected, WP does find the text that contains त्यानंतर त्यामुळे in a single go. Maybe coders can see how WP does this.
Please let us know if any more info is needed.
Comment #10
chx commented
By which characters I meant Unicode codepoints... range of codepoints actually. Then we can peek into search module and compare.
Comment #11
Damien Tournoud commented
This has nothing to do with "Unicode". It's just that we very poorly support some scripts, mostly because of the lack of review and contributions from people actually using them. Let's make that a meta-issue.
Comment #12
chx commented
http://php.net/manual/en/regexp.reference.unicode.php
Comment #13
chx commented
Added test for #7. Hopefully I got it right.
Comment #15
chx commented
Well guys this is interesting and thanks #7 for the interesting search string! I ran
and it turns out
seven of them are M. Five are "Non-spacing mark (Mn)" and two are "Combining spacing mark (Mc)".
Going further http://www.fileformat.info/info/unicode/category/Mc/list.htm we find the vowel marks from several scripts including Devanagari. Same for Mn.
Conclusion: Mc and Mn should not be excluded. Attached patch is the first actual change to search module, it changes \pM to \p{Me}.
Edit: I have repeated the steps in D7 with unpatched search and get the favourite "gimme three" error message because it excludes most of the characters. I applied the patch and the search succeeded.
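The Mn/Mc tally above can be reproduced without PCRE. This is a minimal Python sketch that uses the standard `unicodedata` module in place of the script chx ran (which did not survive in this thread); the assumption is that the string he tested was the search term त्यानंतर त्यामुळे from #7. `category_counts` is a hypothetical helper name.

```python
import unicodedata
from collections import Counter

# The search term from #7, written as escapes so the code points are unambiguous:
# त्यानंतर त्यामुळे
SEARCH_STRING = (
    "\u0924\u094d\u092f\u093e\u0928\u0902\u0924\u0930"
    " "
    "\u0924\u094d\u092f\u093e\u092e\u0941\u0933\u0947"
)


def category_counts(text):
    """Count characters per Unicode General Category (Lo, Mn, Mc, Zs, ...)."""
    return Counter(unicodedata.category(ch) for ch in text)


def mark_count(text):
    """Count characters whose General Category is any kind of mark (M*)."""
    return sum(1 for ch in text if unicodedata.category(ch).startswith("M"))


if __name__ == "__main__":
    for category, count in sorted(category_counts(SEARCH_STRING).items()):
        print(category, count)
    print("marks:", mark_count(SEARCH_STRING))
```

For this string the tally is 9 letters (Lo), 5 non-spacing marks (Mn), 2 spacing marks (Mc) and 1 space (Zs), which agrees with the seven M characters reported above.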
Comment #16
chx commented
Comment #18
chx commented
Let me guess. The bot install fails the requirement check. Could someone else please manually check?
Comment #19
chx commented
Further on, we can do this fix without Unicode properties... just it's a lot easier with properties.
Edit: we probably need to write a script which recompiles \pC|\p{Lm}|\p{Me}|\p{Nl}|\pP|\pS|\pZ into code points. Hopeless manually.
Comment #20
chx commented
Grab http://unicode.org/Public/UNIDATA/UnicodeData.txt and the attachment here, which is a PHP script producing an exclude. I am currently including all letters (including Lm, that's new), Nd, No, Mc, Mn.
Comment #21
chx commented
A much nicer script to generate.
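The attached generator scripts are not reproduced in this thread. As a rough illustration of the idea in #19 and #20 (turning property classes into explicit code point ranges), here is a Python sketch. It is not chx's script: it reads categories from Python's bundled `unicodedata` tables rather than from UnicodeData.txt, so results above ASCII depend on the Unicode version shipped with the interpreter. The excluded set is an assumption taken from the list in #19 with Lm removed as #20 describes, and the function names are hypothetical.

```python
import unicodedata

# Assumed exclusion policy: \pC, \pP, \pS, \pZ plus the subcategories Me and Nl.
EXCLUDED_MAJOR = {"C", "P", "S", "Z"}
EXCLUDED_EXACT = {"Me", "Nl"}


def is_excluded(code_point):
    """True if the code point's General Category falls in the excluded set."""
    category = unicodedata.category(chr(code_point))
    return category[0] in EXCLUDED_MAJOR or category in EXCLUDED_EXACT


def excluded_ranges(start=0x0, end=0xFFFF):
    """Collapse consecutive excluded code points into (low, high) pairs."""
    ranges = []
    run_start = None
    for code_point in range(start, end + 1):
        if is_excluded(code_point):
            if run_start is None:
                run_start = code_point
        elif run_start is not None:
            ranges.append((run_start, code_point - 1))
            run_start = None
    if run_start is not None:
        ranges.append((run_start, end))
    return ranges


def to_char_class(ranges):
    """Format ranges as the body of a PCRE character class using \\x{...} escapes."""
    parts = []
    for low, high in ranges:
        if low == high:
            parts.append(f"\\x{{{low:x}}}")
        else:
            parts.append(f"\\x{{{low:x}}}-\\x{{{high:x}}}")
    return "".join(parts)


if __name__ == "__main__":
    # ASCII only, to keep the output short and independent of the Unicode version.
    print(to_char_class(excluded_ranges(0x0, 0x7F)))
```

Limiting the demo to ASCII keeps it stable; a real generator would cover the whole range it cares about and pin the Unicode data file it reads.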
Comment #22
naxoc commented
I tested the patch from #15 and it did fail the requirements check on install. I edited the implementation of hook_requirements a bit to make the install work. When running the test, it fails on what look like some Japanese characters:
Query matching 'ドルーパル'
and
Query matching 'コーヒー'
Comment #24
chx commented
Now with comments.
Comment #25
chx commented
Symbol is now mostly moved to the index -- Sk however are excluded. The rest were clear for some time now: Letters, Numbers and Marks are included, Other, Punctuation and Separator are excluded. So this is hopefully the last one if the comments are OK.
Comment #27
chx commented
Now, come on. I left out a {} and you blow up? Bah :p
Comment #28
chx commented
Comment #29
kaakuu commented
@chx - thanks a lot for your very detailed insight and work. It would greatly help if you could kindly post a zip or text of the whole search.module in its new form, so that it can be tested in detail.
Comment #30
kaakuu commented
I actually found that (in the existing search.module, not the patch)
* Matches Unicode character classes to exclude from the search index.
seems to be causing the problem. For example, making the code
actually improves the Unicode search by more than 90% to 95%. It still fails on some words, which I cannot consistently reproduce. Probably with chx's patches this should work 100%.
I need a new search.module as text or zip - if that's not entirely impossible, please post it.
I have no idea whether removing all those codes, quite a lot, has any security implications or not.
Comment #31
sun commented
Missing PHPDoc.
s/unicode/Unicode/
"The PCRE library, PHP is linked with, does not support Unicode properties."
s/See:/@see/
s///and/
s/@TODO:/@todo/
Powered by Dreditor.
Comment #32
dmitrig01 commented
@sun - I believe that in the third one (inserting commas), no commas should be inserted, only "that" replaced with "your"
Comment #33
Dave Reid commented
Yeah that suggestion is even odder.
"The PCRE library that PHP is linked with does not support Unicode properties."
Comment #34
chx commented
I have removed the test. We are not here to test PCRE tables for correctness. I have also moved all Symbol characters to the index. This is a matter of preference. For example, including Sc means that you can search separately on "100$" and "100¢" but 100 won't match them. Not including Sc would mean that searching on 100$ finds 100¢ too which smells wrong to me. What do we want?
Comment #35
chx commented
https://twitter.com/CatherineOmega/status/7560974517 makes sense. Moved S to exclude.
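The trade-off discussed in #34 and settled here can be seen with a toy tokenizer. This is a Python sketch under simplifying assumptions, not search.module's logic: letters, numbers and marks are always kept, every other character becomes a word boundary, and a hypothetical `index_symbols` flag decides whether Symbol characters (S*, which includes Sc) count as part of a word.

```python
import unicodedata


def tokenize(text, index_symbols):
    """Split text into index tokens.

    Letters (L*), numbers (N*) and marks (M*) are always kept. Symbols (S*)
    are kept only when index_symbols is True. Everything else is replaced by
    a space, which acts as the word boundary.
    """
    pieces = []
    for ch in text:
        major = unicodedata.category(ch)[0]
        keep = major in "LNM" or (index_symbols and major == "S")
        pieces.append(ch if keep else " ")
    return "".join(pieces).split()


if __name__ == "__main__":
    for flag in (True, False):
        print(flag, tokenize("100$", flag), tokenize("100\u00a2", flag))
```

With symbols indexed, "100$" and "100¢" stay distinct tokens and a search for plain 100 matches neither; with symbols excluded, both collapse to 100, which is the behaviour chosen in this comment.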
Comment #36
sun commented
Missing ().
"Google" needs to be removed, suggested search string should be in quotes to delimit it...
'Please search for "pcre unicode properties [your operating system]" on the net or install PHP from php.net.'
(also note trailing period)
Powered by Dreditor.
Comment #37
chx commented
Added sun's fixes, renamed Mark to more precise Combining mark and added some more explanation from php handbook.
Comment #38
Damien Tournoud commented
This is obviously not perfect (implementing word-splitting properly would require implementing the whole TR#29... and even some fancier machine-learning algorithms), but it is without any doubt an improvement over the current implementation.
Comment #39
webchick commented
Wow. That's an insanely awesome code clean-up. I definitely want this for 7.x. I'm not sure how Gabor will feel about changing requirements 15 point releases in, but I guess we can find out. :)
However, I would really like a version of this patch that includes the test. While today this is implemented in PCRE, tomorrow it might be something else, and since it took us literally like 4 years or so to finally get a well-written bug report that successfully nailed this (thanks for that, kaakuu), I do not feel comfortable without a test that ensures it does not break again.
Comment #40
kaakuu commented
Yay! #39 Webchick - Thanks!
Yes, it does need more testing. Apart from the above sample text, there seem to be at least three more representative sample texts to test, besides checking whether anything else is broken. I did have one or two zero results with a few words, but that cannot be consistently reproduced. I wish I could do more tests, but I won't have time right now (not until the third week of this month, which is past the alpha release date).
Anyway, big thanks to chx for the very analytical insight and the ultimate help in this - it will be a big step forward.
Comment #41
rcross commented
glad to see the power of the issue queue again. amazing how long something can sit in the forums festering, when a simple post on the queue actually gets things accomplished. glad I could bring this to light, but kudos to everyone who did the heavy lifting.
Comment #42
chx commented
Re Drupal 6.x: as said above, we won't change requirements; we will use the ugly regexp. I already posted the script and instructions that compile the nice regexp to an ugly one. Test... HM. OK.
Comment #43
chx commented
@rcross sorry but not a simple post got this rolling but an actual reproducible bug report!
Comment #44
chx commented
With tests. Also included is the script used to generate the UnicodeCharacters.txt file. Uses the Unicode character database linked from above.
Edit: the unichr() function in the generate comes from Moodle which is GPL.
Comment #45
chx commented
Hmmm the patch did not add UnicodeCharacters.txt. I removed chr(0) from the beginning, that placated diff. I am testing \0 separately.
Comment #47
chx commented
Changed the generating script. The previous patch was too small :p
Comment #49
chx commented
Blargh, bah, bah! I have removed all the ASCII control characters hoping that patch won't die on me. I actually tested running patch now, too.
Comment #51
chx commented
Maybe restricting to the BMP helps? (The original regexp only dealt with that, anyways) Note that all these patches just pass fine for me.
Comment #53
chx commented
Further investigation shows that 2502 characters are wrongly classified by PCRE. Stay tuned. 2494 of them are Cn. Hm, I guess I need the Unicode 4.1.0 UnicodeData, maybe from here: http://unicode.org/Public/4.1.0/ucd/UnicodeData.txt
Comment #54
chx commented
Well, guys, PCRE is buggy. Who would have thought? Even rolling back to 4.1.0 found a few characters which are unassigned per PCRE. Also I do not want to fudge around with not knowing which Unicode we are compatible with or not. PCRE 7.0 and 7.5 contained significant fixes / changes to which Unicode is supported. So we are back to a per-codepoint regexp, but way way more precise than the one found in D7 currently.
It must be noted that for the numbers-followed-by-punctuation we still use the PCRE properties and we do not have a test. However, during my testing I found extremely few N and P problems with PCRE so I let it rest.
I *really* hope this passes. The previous tests failed exactly because of the PCRE version mismatches and therefore different behaviour on my computer where I generate and the testbot.
Comment #56
chx commented
Bah, we saw that before, didn't we? I have restarted and forgot to exclude the bottom of the list. Issue summary:
We have excluded too many characters in search.module. We tried writing a much shorter regexp using PCRE properties, but then it turned out that various PHP versions ship with various PCRE versions supporting different Unicode versions and containing bugs in that support. So instead we generate our regexp ourselves. Then, using another script, we generate a UTF-8 text file containing the concatenation of every character in the Unicode 5.2.0 character database above U+001F (leaving out the control characters keeps patch from freaking out). Then a test compares the search_simplify() result for this file with a previously stored search_simplify()'d version of it.
We currently exclude 5321 characters out of the 21829 in the character database.
Comment #57
chx commented
For comparison, here are the beginnings of the current regexp:
compare this to
It's clearly visible that the new regexp is much more fine grained in what to exclude and what is included, however the beginning is quite the same.
Comment #59
chx commented
I dunno. I am out of ideas. I am posting one with only the BMP (ie only up to U+FFFF) but my hopes are quite low at this point. Up until now I understood the problems of the testbot.
Comment #61
chx commented
Well, now we will see where this fails. I generated a file separated by chr(10) characters the parts alternate between includes and excludes. I got 334 passes, 0 fails, and 0 exceptions and we will see what the testbot delivers. And, it still only takes 4 sec on my laptop.
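The alternating file described in this comment can be sketched as follows. This is Python, not the generator attached to the issue, and it rests on stated assumptions: categories come from Python's `unicodedata` (not the Unicode 5.2.0 data file), letters, numbers and marks other than Me and Nl count as word characters, and unassigned, surrogate and private-use code points are skipped. `is_word_char` and `build_alternating_lines` are hypothetical names.

```python
import unicodedata


def is_word_char(ch):
    """Assumed policy: letters, numbers and marks are indexed, except Me and Nl."""
    category = unicodedata.category(ch)
    return category[0] in "LNM" and category not in ("Me", "Nl")


def build_alternating_lines(start=0x20, end=0xFFFF):
    """Group consecutive characters of the same kind into lines.

    The result alternates between a line of word characters and a line of
    boundary characters. Starting at U+0020 keeps chr(10), the separator,
    out of the data itself.
    """
    lines = []
    current = []
    current_kind = None
    for code_point in range(start, end + 1):
        ch = chr(code_point)
        if unicodedata.category(ch) in ("Cn", "Cs", "Co"):
            continue  # unassigned, surrogate or private use: not in the file
        kind = is_word_char(ch)
        if current and kind != current_kind:
            lines.append("".join(current))
            current = []
        current.append(ch)
        current_kind = kind
    if current:
        lines.append("".join(current))
    return lines


if __name__ == "__main__":
    ascii_lines = build_alternating_lines(0x20, 0x7F)
    for line in ascii_lines:
        print(is_word_char(line[0]), repr(line))
```

A test can then walk the lines and expect a simplify function to leave word lines intact and turn boundary lines into whitespace, which localises a failure to one run of characters.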
Comment #63
chx commented
Bot test.
Comment #64
chx commented
With less typos in testBotTellMeWhyDoYouFail
Comment #66
chx commented
Sigh.
Comment #68
Heine commented
Use \x syntax to see whether encoding issues between testbot and d.o. cause this.
Comment #70
chx commented
poor, poor issue.
mb_strtolower mixes up ohm with omega... and other snafus. There are a few Unicode characters whose lowercase form has a different number of UTF-8 bytes, so the above tests using strlen instead of drupal_strlen were doomed to fail. The previous tests were wrong because my machine did not have mbstring compiled in, so my machine generated uppercase characters while the testbot lowercased them, and the identical check failed...
Comment #71
aspilicious commented
It passes!
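The length mismatch described in #70 can be illustrated outside PHP. The Python sketch below only demonstrates the Unicode case mapping involved; it makes no claim about how `mb_strtolower` is implemented. The ohm sign (U+2126) lowercases to Greek small omega (U+03C9), and the two have different UTF-8 byte lengths, so a byte-based length such as PHP's `strlen` changes after lowercasing while the character count does not. `length_report` is a hypothetical helper name.

```python
OHM_SIGN = "\u2126"     # OHM SIGN
KELVIN_SIGN = "\u212a"  # KELVIN SIGN


def length_report(ch):
    """Return (chars before, bytes before, chars after, bytes after) for lower()."""
    lowered = ch.lower()
    return (
        len(ch),
        len(ch.encode("utf-8")),
        len(lowered),
        len(lowered.encode("utf-8")),
    )


if __name__ == "__main__":
    for sign in (OHM_SIGN, KELVIN_SIGN):
        print(f"U+{ord(sign):04X} -> U+{ord(sign.lower()):04X}", length_report(sign))
```

The ohm sign goes from 3 bytes to 2 and the Kelvin sign from 3 bytes to 1 (it lowercases to ASCII k), while both remain a single character throughout, so only a character-based length stays comparable across machines with and without lowercasing.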
Comment #72
Heine commented
ahem.
Comment #73
chx commented
Revert the number-punctuation regexp from properties to code points. At this point, the patch straight applies to D6 too.
Comment #74
webchick commented
Excellent work! Not only do we fix a bug in Drupal core for a few billion people, but we also can file bug reports upstream for PCRE. :D While I was expecting tests that just ran a couple more strings through the existing search tests, chx tells me that these tests are bullet-proof and ensure we get no further regressions in this tweaky, obtuse area of code, which sounds great to me.
Committed to HEAD. Since this was fixed in such a way that it does not require changes to requirements/APIs, also moving down to 6.x for consideration.
Comment #75
Garrett Albright commented
EDIT: Ignore this stupid patch.
Comment #76
Garrett Albright commented
(Well, I'm not sure how I pulled that off, but anyway, here's a reroll with just the search-related stuff.)
D6 patch! Without tests, obviously, but I was able to successfully get results when using the Devanagari text in #5, and, if accepted, this also eliminates the need for a D6 port of #493770: Search incorrectly splits some katakana words (I was able to get expected results using some of the hiragana examples in that issue).
Comment #77
Garrett Albright commented
Patch to fix some niggly grammatical issues introduced in the D7 patch in #73.
Comment #78
kaakuu commented
Would it kindly be possible to post search.module, patched and in its entirety, for D6 and D7 as a text attachment, please?
Comment #79
tstoeckler commented
Since the D7 patch was committed, you can just go to the Drupal project page (http://drupal.org/project/drupal) and download Drupal 7.
Comment #80
jhodgdon commented
Just a note that some of this for D7 will be moved out of the search module and used in truncate_utf8(), if this issue gets fixed:
#768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug)
Also, see #56 above to learn how the Unicode character file was generated for the tests. Note that it ends up being alternate lines of word/boundary characters, which is how the test works (the latest version of the tests for D7 have a lot more comments in them on how they work).
Comment #82
jhodgdon commented
I just took a look at the patch in #77, since this is a D7 patch. As a note, it's not actually suggesting grammar fixes per se -- it's line wrapping, extra spaces, and capitalization.... let's see.
The first two sections are inconsistent:
this fixes General_category -> General_Category in only one of the two spots it appears.... I think we should just leave it as-is. Also, this hunk has moved to unicode.inc and has been reworded, so the other changes suggested here have already been taken care of.
This is a suggestion to capitalize Unicode in search.module, but it's not complete and doesn't apply to the current code. There are two spots where this could be done. But they're in code comments (not docblocks) so I don't think this is very high priority. Let's leave it.
So I guess we can proceed to Drupal 6, and review the patch in #76. Setting status appropriately (will review that patch in a separate comment).
Comment #83
jhodgdon commented
whoops, wrong version.
Comment #84
jhodgdon commented
I cannot get the patch in #76 to apply to the current Drupal 6. We need a new patch.
Comment #85
udvranto commented
subscribing
Comment #86
udvranto commented
I applied the patch manually to 6.20. Still does not work for Bengali characters. Do I need to update the index database?
Comment #87
jhodgdon commented
Yes, after applying the patch, you would definitely need to reindex your site, because this would change how your site is indexed as well as searched.
Comment #88
jhodgdon commented
Someone just reported another example of this for Tamil at #1108194: Drupal unicode search does not work! (closed as duplicate)
Comment #89
jhodgdon commented
Talked with Gabor (the Drupal 6 branch maintainer) and D6 issues are really not being committed unless they're really essential -- we really don't have a test system for Drupal 6 and it's too dangerous. So... putting this back to D7 / fixed.