So, I came across a few posts of people who say that they can't search with unicode characters. I was pretty sure this was not true, and went searching for any issues in the queue which either detail the problem or show where this was fixed. I can't seem to find anything specific about this.

Here is the main forum thread that people seem to be pointing to when they are looking for evidence of it not working. http://drupal.org/node/178840 [#178840]

Perhaps someone more knowledgeable can point to the patch or this will put this issue on the radar at least instead of getting lost in the forums.

Files: 
Comment | File | Size | Author | Test result
#77 | search-pedantic-grammar-D7.patch | 1.89 KB | Garrett Albright | FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch search-pedantic-grammar-D7.patch.
#76 | search-unicode-d6.patch | 10.74 KB | Garrett Albright |
#75 | search-unicode-d6.patch | 49.89 KB | Garrett Albright |
#73 | search_unicode_tests.patch | 56.99 KB | chx | Passed on all environments.
#70 | search_unicode_tests.patch | 55.34 KB | chx | Passed on all environments.
#68 | search_unicode_tests_5.patch | 56.28 KB | Heine | Failed on MySQL 5.0 InnoDB, with: 17,187 pass(es), 8 fail(s), and 0 exception(es).
#66 | search_unicode_tests.patch | 56.27 KB | chx | Failed on MySQL 5.0 InnoDB, with: 17,186 pass(es), 8 fail(s), and 0 exception(es).
#64 | search_unicode_tests.patch | 56.26 KB | chx | Failed on MySQL 5.0 InnoDB, with: 17,265 pass(es), 9 fail(s), and 4 exception(es).
#63 | search_unicode_tests.patch | 56.26 KB | chx | Failed on MySQL 5.0 InnoDB, with: 17,235 pass(es), 6 fail(s), and 7 exception(es).
#61 | search_unicode_tests.patch | 55.08 KB | chx | Failed on MySQL 5.0 InnoDB, with: 17,238 pass(es), 3 fail(s), and 0 exception(es).
#59 | search_unicode_tests.patch | 90.87 KB | chx | Failed on MySQL 5.0 InnoDB, with: 16,893 pass(es), 1 fail(s), and 0 exception(es).
#56 | allchars.txt | 1.31 KB | chx |
#56 | generator.txt | 1.03 KB | chx |
#56 | search_unicode_tests.patch | 135.31 KB | chx | Failed on MySQL 5.0 InnoDB, with: 16,893 pass(es), 3 fail(s), and 0 exception(es).
#54 | generator.txt | 1.04 KB | chx |
#54 | search_unicode_tests.patch | 106.64 KB | chx | Invalid patch format in search_unicode_tests_0_3.patch.
#51 | search_unicode_tests.patch | 86.77 KB | chx | Failed on MySQL 5.0 InnoDB, with: 16,900 pass(es), 3 fail(s), and 0 exception(es).
#49 | search_unicode_tests.patch | 125.11 KB | chx | Failed on MySQL 5.0 InnoDB, with: 16,927 pass(es), 1 fail(s), and 0 exception(es).
#47 | allchars.txt | 1.31 KB | chx |
#47 | search_unicode_tests.patch | 125.04 KB | chx | Invalid patch format in search_unicode_tests_0_0.patch.
#45 | search_unicode_tests.patch | 91.02 KB | chx | Failed on MySQL 5.0 InnoDB, with: 16,903 pass(es), 1 fail(s), and 0 exception(es).
#44 | allchars.txt | 1.28 KB | chx |
#44 | search_unicode_tests.patch | 56.95 KB | chx | Failed on MySQL 5.0 InnoDB, with: 16,925 pass(es), 1 fail(s), and 1 exception(es).
#37 | search_unicode_pwnd.patch | 7.16 KB | chx | Passed on all environments.
#35 | search_unicode_pwnd.patch | 6.99 KB | chx | Passed on all environments.
#34 | search_unicode_pwnd.patch | 6.98 KB | chx | Passed on all environments.
#27 | search_unicode_pwnd.patch | 7.22 KB | chx | Passed on all environments.
#25 | search_unicode_pwnd.patch | 7.21 KB | chx | Failed on MySQL 5.0 InnoDB, with: 16,050 pass(es), 46 fail(s), and 12 exception(es).
#24 | search_unicode_pwnd.patch | 7.07 KB | chx | Passed on all environments.
#22 | 604002.diff | 6.68 KB | naxoc | Failed on MySQL 5.0 InnoDB, with: 16,069 pass(es), 3 fail(s), and 0 exception(es).
#21 | classifier.txt | 968 bytes | chx |
#20 | classifier.txt | 1.2 KB | chx |
#15 | search_php51.patch | 6.11 KB | chx | Failed to install Drupal on MySQL 5.0 InnoDB.
#13 | search_php51.patch | 6.1 KB | chx | Failed to install Drupal on MySQL 5.0 InnoDB.
#12 | search_php51.patch | 5.22 KB | chx | Failed to install Drupal on MySQL 5.0 InnoDB.

Comments

kaakuu’s picture

Title: Search with Unicode characters does not work » Unicode Does Not Work?

This was confirmed long ago. Answers or solutions seem to be lacking for something this important to Devanagari, Indic, and similar Unicode sites.

Now apparently chx has a solution for this, and I have asked him to post it.
You can see this part of the thread: http://drupal.org/node/671566#comment-242514
Since no one is attending to this issue, you can post a request to chx for the details of the solution. Thanks.

Edited: chx has looked into this and asked for some info on this bug (http://drupal.org/node/671566#comment-2426384)
@rcross - the following info is needed, so please help with it.
"crucial information like a) being in the issue b) the Unicode library from the status report page c) PCRE version (from the phpinfo linked from the status report page) d) OS."
Chx also asked to "file a bug report" - so I made this issue http://drupal.org/node/672430

I will also try to make a fresh install again and post these info as soon as I can.

kaakuu’s picture

The info on my part

Drupal 6.15, usual LAMP stack (I tried this on three or four popular web hosts)
Unicode library - PHP Mbstring Extension
PCRE Library Version 7.8 2008-09-05
I have tried just now again with a fresh install of Drupal with the above Unicode lib and PCRE specifications.
I pasted the following Unicode text in a node -
सुदृढ आणि सुजाण बाळाची चाहूल सुदृढ, सशक्त व हुशार मुले ही ज्याप्रमाणे आई वडिलांचा तसेच समाजाला आधार असतात, त्याचप्रमाणे देशाची खरी संपत्ती असतात अशी मुले ही घडवावी लागतात ornage
लठ्ठपणा घालविण्याचे सोपे उपाय डाएटिंग सुरु केल्यानंतर वजन कमी होण्याची गती अपेक्षाकृत जलद असते. नंतर मात्र ही गती मंदावते. त्यामुळे निराश होऊ नये. त्यानंतर मात्र वजन कमी होऊ लागते orange

I indexed my site after making sure my search settings are okay.

Search can find the word orange but cannot find बाळाची

Dave Reid’s picture

Title: Unicode Does Not Work? » Search with Unicode characters does not work
douggreen’s picture

Title: Unicode Does Not Work? » Search with Unicode characters does not work

Damien Tournoud suggests in #218403: Duplicate entry errors in search indexer comment#26 what needs to be done for 7.x:

One way to solve that bug is to set the {search_index}.word column to utf8_bin_ci. I just validated on a test site that it solves the problem.

But this would mean that we would differentiate between different versions of a word (accented/not accented, etc.).

... In fact, collation is not an enemy, it should be our friend. Implementing a collation is difficult work, and moreover language-specific. The one-size-fits-all 'utf8_general_ci' is not optimal, but should work well for most Latin-based languages. Doing language-specific collation and stemming should be in our work plan for D7.

@kaakuu, does changing search_index to utf8_bin_ci solve the problem?

kaakuu’s picture

@douggreen, the link says that it solved that issue, but apparently this is a different one.
Changing search_index to utf8_bin as you suggested, and to various other such utf8 combinations, does not solve the problem. More specifically, Drupal throws an error message when asked to search for complex Unicode words in Indic, Devanagari or similar Unicode text.

If there is a working demo that shows changing search_index to utf8_bin_ci solves this issue, it would help us, as we could then try tweaking the various settings further. However, WP and others apparently just do this out of the box without any maneuvers.

To reproduce, keep search_index as it is, or change it to utf8__ as suggested or to various other collations.
Then paste this sample Unicode text, or any such text with complex words:
सुदृढ आणि सुजाण बाळाची चाहूल सुदृढ, सशक्त व हुशार मुले ही ज्याप्रमाणे आई वडिलांचा तसेच समाजाला आधार असतात, त्याचप्रमाणे देशाची खरी संपत्ती असतात अशी मुले ही घडवावी लागतात orange लठ्ठपणा घालविण्याचे सोपे उपाय डाएटिंग सुरु केल्यानंतर वजन कमी होण्याची गती अपेक्षाकृत जलद असते. नंतर मात्र ही गती मंदावते. त्यामुळे निराश होऊ नये. त्यानंतर मात्र वजन कमी होऊ लागते orange
Now, index your site, run cron, do whatever is needed and search words like त्यामुळे or बाळाची or त्यानंतर

Dave Reid’s picture

The problem isn't the indexing or database table encoding. Everything works properly. What's going on is the three words at the end of #5 fail search.module's "You must include at least one positive keyword with 3 characters or more." check. I performed several successful searches with words like सशक्त.

kaakuu’s picture

@Dave When we search words like त्यामुळे or बाळाची or त्यानंतर the error message itself is critically erroneous as search word has already included "at least one positive keyword with 3 characters or more."

When you search for words like सशक्त you are probably searching for a simple word, which behaves like an English word. However, Devanagari, Indic and similar Unicode scripts are full of complex words, and finding complex words in search is a critical necessity.

Can you find these words त्यामुळे or बाळाची or त्यानंतर or similar in search index in database? I am not sure I find those there but this may need more test than just a quick look I had now.

Let us, for example change the sample text to
त्यामुळे बाळाची त्यानंतर त्यामुळे बाळाची त्यानंतर त्यामुळे बाळाची त्यानंतर त्यामुळे बाळाची त्यानंतर त्यामुळे बाळाची त्यानंतर त्यामुळे बाळाची त्यानंतर त्यामुळे बाळाची त्यानंतर त्यामुळे बाळाची त्यानंतर
(Those who are testing please do the test with this text now on as this is more representative of actual usage)

Now, can you please do a re-test and find whether search works for त्यानंतर त्यामुळे ?
This search term has included "at least one positive keyword with 3 characters or more."

chx’s picture

Is this a duplicate and/or variant of #335928: Thai vowels are excluded in search index? Can someone check which characters are problematic here and whether we exclude them in error?

kaakuu’s picture

> Is this a duplicate and or variant

No, as far as I comprehend.

> which characters are problematic

It can be a vowel or a consonant, any character combination that occurs in a complex word.

I just now set up a demo WordPress site http://unimode.wordpress.com/ and did the steps in #7 above (except that here indexing is automatic and nothing has to be set up) - the results are as expected: WP finds the text that contains त्यानंतर त्यामुळे in a single go. Maybe coders can see how WP does this.

Please let us know if any more info is needed.

chx’s picture

By which characters I meant Unicode codepoints... range of codepoints actually. Then we can peek into search module and compare.

Damien Tournoud’s picture

Title: Search with Unicode characters does not work » [Meta-Issue] Poor search support of some Unicode scripts

This has nothing to do with "Unicode". It's just that we very poorly support some scripts, mostly because of the lack of review and contributions from people actually using them. Let's make that a meta-issue.

chx’s picture

Status: Active » Needs review
FileSize
5.22 KB
Failed to install Drupal on MySQL 5.0 InnoDB. View

http://php.net/manual/en/regexp.reference.unicode.php

Since PHP 4.4.0 and 5.1.0, three additional escape sequences to match generic character types are available when UTF-8 mode is selected.

chx’s picture

FileSize
6.1 KB
Failed to install Drupal on MySQL 5.0 InnoDB. View

Added test for #7. Hopefully I got it right.

Status: Needs review » Needs work

The last submitted patch, search_php51.patch, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
6.11 KB
Failed to install Drupal on MySQL 5.0 InnoDB. View

Well guys this is interesting and thanks #7 for the interesting search string! I ran

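$a = 'त्यानंतर त्यामुळे';
// Collect every character in the Mark (M) category.
preg_match_all('/\pM/u', $a, $matches);
foreach ($matches[0] as $match) {
  // Print the UTF-8 bytes of each mark, one mark per line.
  for ($i = 0; $i < strlen($match); $i++) {
    echo ord($match[$i]) . " ";
  }
  echo "\n";
}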

and it turns out

224 165 141
224 164 190
224 164 130
224 165 141
224 164 190
224 165 129
224 165 135

seven of them are M. Five are "Non-spacing mark (Mn)" and two are "Combining spacing mark (Mc)".

Going further, at http://www.fileformat.info/info/unicode/category/Mc/list.htm we find the vowel marks from several scripts, including Devanagari. Same for Mn.

Conclusion: Mc and Mn should not be excluded. The attached patch is the first actual change to the search module; it changes \pM to \p{Me}.

Edit: I have repeated the steps in D7 with unpatched search and get the favourite "gimme three" error message because it excludes out most of the characters. I applied the patch and the search succeeded.
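
A minimal sketch of that effect, assuming a much simplified stand-in for the indexer (the helpers sketch_index_words() and sketch_chars() are hypothetical and are not the real search_simplify()): treating every \pM character as a boundary chops the two words from #7 into fragments shorter than three characters, while excluding only \p{Me} leaves them whole.

```php
<?php
// Hypothetical helper, not the real search_simplify(): treat every character
// in $class as a word boundary and return the words that remain.
function sketch_index_words($text, $class) {
  $stripped = preg_replace('/[' . $class . ']+/u', ' ', $text);
  return preg_split('/\s+/u', trim($stripped), -1, PREG_SPLIT_NO_EMPTY);
}

// Character count rather than byte count.
function sketch_chars($word) {
  return preg_match_all('/./us', $word);
}

$text = 'त्यानंतर त्यामुळे';

// Old behaviour: all marks (Mn, Mc, Me) are excluded.
$old = sketch_index_words($text, '\pM');
// Patched behaviour: only enclosing marks (Me) are excluded.
$new = sketch_index_words($text, '\p{Me}');

$longest_old = max(array_map('sketch_chars', $old));

echo count($old) . " fragments, longest has " . $longest_old . " character(s)\n";
echo implode(' | ', $new) . "\n";
```

With no fragment reaching three characters, a minimum-length check like the one quoted earlier in this thread would reject the query even though the user typed two long words.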

chx’s picture

Title: [Meta-Issue] Poor search support of some Unicode scripts » Poor search support of some Unicode scripts

Status: Needs review » Needs work

The last submitted patch, search_php51.patch, failed testing.

chx’s picture

Status: Needs work » Needs review

Let me guess. The bot install fails the requirement check. Could someone else please manually check?

chx’s picture

Further on, we can do this fix without Unicode properties... just it's a lot easier with properties.

Edit: we probably need to write a script which recompiles \pC|\p{Lm}|\p{Me}|\p{Nl}|\pP|\pS|\pZ into code points. Hopeless manually.
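
One way such a recompile script could look, as a rough sketch only: it asks the local PCRE about each code point instead of reading UnicodeData.txt, so its output depends on the PCRE version installed. The function names sketch_utf8_chr() and sketch_class_to_ranges() are made up for illustration and are not the attached classifier.txt.

```php
<?php
// Hypothetical helper: encode one code point as UTF-8 without mbstring.
function sketch_utf8_chr($cp) {
  if ($cp < 0x80) {
    return chr($cp);
  }
  if ($cp < 0x800) {
    return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
  }
  if ($cp < 0x10000) {
    return chr(0xE0 | ($cp >> 12)) . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
  }
  return chr(0xF0 | ($cp >> 18)) . chr(0x80 | (($cp >> 12) & 0x3F))
    . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: turn a property expression such as '\pP|\pS' into
// explicit \x{...} ranges for the code points $from..$to.
function sketch_class_to_ranges($class, $from, $to) {
  $out = '';
  $start = NULL;
  // Run one step past $to so that a range still open at the end is flushed.
  for ($cp = $from; $cp <= $to + 1; $cp++) {
    $is_surrogate = ($cp >= 0xD800 && $cp <= 0xDFFF);
    $match = $cp <= $to && !$is_surrogate
      && preg_match('/^(?:' . $class . ')\z/u', sketch_utf8_chr($cp)) === 1;
    if ($match && $start === NULL) {
      $start = $cp;
    }
    elseif (!$match && $start !== NULL) {
      $end = $cp - 1;
      $out .= sprintf('\x{%X}', $start) . ($end > $start ? sprintf('-\x{%X}', $end) : '');
      $start = NULL;
    }
  }
  return $out;
}

// ASCII digits only: \x{30}-\x{39}
echo sketch_class_to_ranges('\p{Nd}', 0x00, 0x7F) . "\n";
// The exclude expression from this comment, restricted to ASCII.
echo sketch_class_to_ranges('\pC|\p{Lm}|\p{Me}|\p{Nl}|\pP|\pS|\pZ', 0x00, 0x7F) . "\n";
```

Passing 0x0000 and 0xFFFF instead would cover the whole BMP, at the cost of one preg_match() call per code point.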

chx’s picture

FileSize
1.2 KB

Grab http://unicode.org/Public/UNIDATA/UnicodeData.txt and the attachment here, which is a PHP script producing an exclude list. I am currently including all letters (including Lm, that's new), Nd, No, Mc, Mn.

chx’s picture

FileSize
968 bytes

A much nicer script to generate.

naxoc’s picture

FileSize
6.68 KB
Failed on MySQL 5.0 InnoDB, with: 16,069 pass(es), 3 fail(s), and 0 exception(es). View

I tested the patch from #15 and it did fail the requirements check on install. I edited the implementation of hook_requirements some to make the install work. When running the test it fails on what looks like some Japanese characters?

Query matching 'ドルーパル'
and
Query matching 'コーヒー'

Status: Needs review » Needs work

The last submitted patch, 604002.diff, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
7.07 KB
Passed on all environments. View

Now with comments.

chx’s picture

FileSize
7.21 KB
Failed on MySQL 5.0 InnoDB, with: 16,050 pass(es), 46 fail(s), and 12 exception(es). View

Symbol is now mostly moved to the index -- Sk however are excluded. The rest were clear for some time now: Letters, Numbers and Marks are included, Other, Punctuation and Separator are excluded. So this is hopefully the last one if the comments are OK.

Status: Needs review » Needs work

The last submitted patch, search_unicode_pwnd.patch, failed testing.

chx’s picture

FileSize
7.22 KB
Passed on all environments. View

Now, come on. I left out a {} and you blow up? Bah :p

chx’s picture

Status: Needs work » Needs review
kaakuu’s picture

At Chx - thanks a lot for your very detailed insight and work. It will greatly help if you can kindly post a zip or text of the whole search.module in its new form. It can be tested in details.

kaakuu’s picture

I actually found that, in the existing search.module (not the patch), the section documented as "Matches Unicode character classes to exclude from the search index." seems to be causing the problem.

For example, making the code

define('PREG_CLASS_SEARCH_EXCLUDE',

'\x{3289}');

/**
 * Matches all 'N' Unicode character classes (numbers)
 */
define('PREG_CLASS_NUMBERS',

'\x{3289}\x{32b1}-\x{32bf}\x{ff10}-\x{ff19}');

/**
 * Matches all 'P' Unicode character classes (punctuation)
 */
define('PREG_CLASS_PUNCTUATION',

'\x{3289}');

/**
 * Matches all CJK characters that are candidates for auto-splitting
 * (Chinese, Japanese, Korean).
 * Contains kana and BMP ideographs.
 */

define('PREG_CLASS_CJK', '\x{3289}');

actually improves the Unicode search by roughly 90% to 95%. It still fails on some words, which I cannot consistently reproduce. Probably with chx's patches this should work 100%.
I need a new search.module as text or zip - if that's not entirely impossible, please post it.

I have no idea whether removing all that code, quite a lot of it, has any security implications.

sun’s picture

+++ modules/search/search.install	2010-01-03 20:33:12 +0000
@@ -6,6 +6,18 @@
+function search_requirements($phase) {

Missing PHPDoc.

+++ modules/search/search.install	2010-01-03 20:33:12 +0000
@@ -6,6 +6,18 @@
+      'title' => $t('PHP PCRE unicode support'),

s/unicode/Unicode/

+++ modules/search/search.install	2010-01-03 20:33:12 +0000
@@ -6,6 +6,18 @@
+      'description' => t('The PCRE library your PHP is linked with does not support Unicode properties.'),

"The PCRE library, PHP is linked with, does not support Unicode properties."

+++ modules/search/search.module	2010-01-03 21:42:57 +0000
@@ -9,78 +9,36 @@
+ * See: http://unicode.org/glossary

s/See:/@see/

+++ modules/search/search.module	2010-01-03 21:42:57 +0000
@@ -9,78 +9,36 @@
+ * The index only contains the following character categories / properties.

s///and/

+++ modules/search/search.module	2010-01-03 21:42:57 +0000
@@ -9,78 +9,36 @@
+ * @TODO: Enhance based on http://unicode.org/reports/tr29/.

s/@TODO:/@todo/

Powered by Dreditor.

dmitrig01’s picture

@sun - I believe that in third one (inserting commas), no commas should be inserted, only "that" replaced with "your"

Dave Reid’s picture

Yeah that suggestion is even odder.
"The PCRE library that PHP is linked with does not support Unicode properties."

chx’s picture

FileSize
6.98 KB
Passed on all environments. View

I have removed the test. We are not here to test PCRE tables for correctness. I have also moved all Symbol characters to the index. This is a matter of preference. For example, including Sc means that you can search separately on "100$" and "100¢" but 100 won't match them. Not including Sc would mean that searching on 100$ finds 100¢ too which smells wrong to me. What do we want?
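
To make the trade-off concrete, here is a small sketch under the assumption that indexing boils down to blanking out excluded characters (sketch_simplify() is a hypothetical stand-in, not the real search_simplify(), and the two exclude lists are simplified):

```php
<?php
// Hypothetical stand-in for the indexer: characters matching $exclude become
// word boundaries and everything else is kept.
function sketch_simplify($text, $exclude) {
  $text = preg_replace('/[' . $exclude . ']+/u', ' ', $text);
  return trim($text);
}

// Option 1: symbols, including currency symbols (Sc), stay in the index.
$sc_indexed = '\pC\pP\pZ';
// Option 2: all symbols, Sc included, are excluded.
$sc_excluded = '\pC\pP\pZ\pS';

foreach (array('100$', '100¢') as $term) {
  echo $term . ' => "' . sketch_simplify($term, $sc_indexed)
    . '" or "' . sketch_simplify($term, $sc_excluded) . "\"\n";
}
```

Under option 1 the two amounts stay distinct terms, so a plain 100 matches neither; under option 2 both collapse to 100 and become indistinguishable.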

chx’s picture

FileSize
6.99 KB
Passed on all environments. View
sun’s picture

+++ modules/search/search.install	2010-01-09 13:46:16 +0000
@@ -7,6 +7,21 @@
+ * Implements hook_requirements.

Missing ().

+++ modules/search/search.install	2010-01-09 13:46:16 +0000
@@ -7,6 +7,21 @@
+      'description' => t('PCRE has not been compiled with Unicode property support. Please Google pcre unicode properties [your operating system] here for more or use PHP from php.net'),

"Google" needs to be removed, suggested search string should be in quotes to delimit it...

'Please search for "pcre unicode properties [your operating system]" on the net or install PHP from php.net.'

(also note trailing period)

Powered by Dreditor.

chx’s picture

FileSize
7.16 KB
Passed on all environments. View

Added sun's fixes, renamed Mark to more precise Combining mark and added some more explanation from php handbook.

Damien Tournoud’s picture

Status: Needs review » Reviewed & tested by the community

This is obviously not perfect (implementing word-splitting properly would require implementing the whole TR#29... and even some fancier machine-learning algorithms), but it is without any doubt an improvement over the current implementation.

webchick’s picture

Status: Reviewed & tested by the community » Needs work

Wow. That's an insanely awesome code clean-up. I definitely want this for 7.x. I'm not sure how Gabor will feel about changing requirements 15 point releases in, but I guess we can find out. :)

However, I would really like a version of this patch that includes the test. While today this is implemented in PCRE, tomorrow it might be something else, and since it took us literally like 4 years or so to finally get a well-written bug report that successfully nailed this (thanks for that, kaakuu), I do not feel comfortable without a test that ensures it does not break again.

kaakuu’s picture

Yay! #39 Webchick - Thanks!

Yes, it does need more tests. Apart from the sample text above, there seem to be at least three more representative sample texts to test, apart from checking whether anything else is broken. I did get zero results for one or two words, but that cannot be consistently reproduced. I wish I could do more tests, but I won't have time right now (until the third week of this month, which is past the alpha release date).
Anyway, big thanks to chx for the very analytical insight and the ultimate help in this - it will be a big step forward.

rcross’s picture

glad to see the power of the issue queue again. amazing how long something can sit in the forums festering, when a simple post on the queue actually gets things accomplished. glad I could bring this to light, but kudos to everyone who did the heavy lifting.

chx’s picture

Re Drupal 6.x: as said above, we won't change requirements, so we will use the ugly regexp. I already posted the script and instructions that compile the nice regexp to an ugly one. Test... HM. OK.

chx’s picture

@rcross sorry but not a simple post got this rolling but an actual reproducible bug report!

chx’s picture

Status: Needs work » Needs review
FileSize
56.95 KB
Failed on MySQL 5.0 InnoDB, with: 16,925 pass(es), 1 fail(s), and 1 exception(es). View
1.28 KB

With tests. Also included is the script used to generate the UnicodeCharacters.txt file. Uses the Unicode character database linked from above.

Edit: the unichr() function in the generator comes from Moodle, which is GPL.

chx’s picture

FileSize
91.02 KB
Failed on MySQL 5.0 InnoDB, with: 16,903 pass(es), 1 fail(s), and 0 exception(es). View

Hmmm the patch did not add UnicodeCharacters.txt. I removed chr(0) from the beginning, that placated diff. I am testing \0 separately.

Status: Needs review » Needs work

The last submitted patch, search_unicode_tests.patch, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
125.04 KB
Invalid patch format in search_unicode_tests_0_0.patch. View
1.31 KB

Changed the generating script. The previous patch was too small :p

Status: Needs review » Needs work

The last submitted patch, search_unicode_tests.patch, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
125.11 KB
Failed on MySQL 5.0 InnoDB, with: 16,927 pass(es), 1 fail(s), and 0 exception(es). View

Blargh, bah, bah! I have removed all the ASCII control characters hoping that patch won't die on me. I actually tested running patch now, too.

Status: Needs review » Needs work

The last submitted patch, search_unicode_tests.patch, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
86.77 KB
Failed on MySQL 5.0 InnoDB, with: 16,900 pass(es), 3 fail(s), and 0 exception(es). View

Maybe restricting to the BMP helps? (The original regexp only dealt with that, anyways) Note that all these patches just pass fine for me.

Status: Needs review » Needs work

The last submitted patch, search_unicode_tests.patch, failed testing.

chx’s picture

Further investigation shows that 2502 characters are wrongly classified by PCRE. Stay tuned. 2494 of them are Cn. Hm, I guess I need the Unicode 4.1.0 UnicodeData, maybe from http://unicode.org/Public/4.1.0/ucd/UnicodeData.txt.

chx’s picture

Status: Needs work » Needs review
FileSize
106.64 KB
Invalid patch format in search_unicode_tests_0_3.patch. View
1.04 KB

Well, guys, PCRE is buggy. Who would have thought? Even rolling back to 4.1.0 found a few characters which are unassigned per PCRE. Also I do not want to fudge around with not knowing which Unicode we are compatible with or not. PCRE 7.0 and 7.5 contained significant fixes / changes to which Unicode is supported. So we are back to a per-codepoint regexp, but way way more precise than the one found in D7 currently.

It must be noted that for the numbers-followed-by-punctuation we still use the PCRE properties and we do not have a test. However, during my testing I found extremely few N and P problems with PCRE so I let it rest.

I *really* hope this passes. The previous tests failed exactly because of the PCRE version mismatches and therefore different behaviour on my computer where I generate and the testbot.

Status: Needs review » Needs work

The last submitted patch, search_unicode_tests.patch, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
135.31 KB
Failed on MySQL 5.0 InnoDB, with: 16,893 pass(es), 3 fail(s), and 0 exception(es). View
1.03 KB
1.31 KB

Bah, we saw that before, didn't we? I have restarted and forgot to exclude the bottom of the list. Issue summary:

We have excluded too many characters in search.module. We tried writing a much shorter regexp using PCRE properties, but it turned out that various PHP versions ship with various PCRE versions, which support different Unicode versions and contain bugs in that support. So instead we generate our regexp ourselves. Then, using another script, we generate a UTF-8 encoded text file containing the concatenation of every character in the Unicode 5.2.0 character database above U+001F (the lower ones are skipped to avoid patch freaking out). Finally, a test compares the search_simplify() result for this file with a previously stored search_simplify()'d version of it.

We currently exclude 5321 characters out of the 21829 in the character database.

chx’s picture

For comparison, here are the beginnings of the current regexp:

'\x{0}-\x{2f}\x{3a}-\x{40}\x{5b}-\x{60}\x{7b}-\x{bf}\x{d7}\x{f7}\x{2b0}-\x{385}'.

compare this to

  '\x{0}-\x{2F}\x{3A}-\x{40}\x{5B}-\x{60}\x{7B}-\x{A9}\x{AB}-\x{B1}\x{B4}' .
  '\x{B6}-\x{B8}\x{BB}\x{BF}\x{D7}\x{F7}\x{2C2}-\x{2C5}\x{2D2}-\x{2DF}' .   

It's clearly visible that the new regexp is much more fine grained in what to exclude and what is included, however the beginning is quite the same.

Status: Needs review » Needs work

The last submitted patch, search_unicode_tests.patch, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
90.87 KB
Failed on MySQL 5.0 InnoDB, with: 16,893 pass(es), 1 fail(s), and 0 exception(es). View

I dunno. I am out of ideas. I am posting one with only the BMP (ie only up to U+FFFF) but my hopes are quite low at this point. Up until now I understood the problems of the testbot.

Status: Needs review » Needs work

The last submitted patch, search_unicode_tests.patch, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
55.08 KB
Failed on MySQL 5.0 InnoDB, with: 17,238 pass(es), 3 fail(s), and 0 exception(es). View

Well, now we will see where this fails. I generated a file separated by chr(10) characters; the parts alternate between includes and excludes. I got 334 passes, 0 fails, and 0 exceptions, and we will see what the testbot delivers. And it still only takes 4 sec on my laptop.
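
The shape of such a test might look roughly like this. It is a sketch only: the four-line fixture, the assumption that included lines come first, and the helper sketch_strip_boundaries() are all made up for illustration and are not taken from the patch.

```php
<?php
// Hypothetical stand-in for search_simplify(): blank out excluded characters.
function sketch_strip_boundaries($text) {
  $text = preg_replace('/[\pC\pP\pS\pZ]+/u', ' ', $text);
  return trim($text);
}

// Assumed layout: even lines hold characters that must stay in the index,
// odd lines hold characters that must be treated as word boundaries.
$fixture = implode(chr(10), array('abc', '!?.', '०१२', '«»'));

$failures = array();
foreach (explode(chr(10), $fixture) as $i => $line) {
  $out = sketch_strip_boundaries($line);
  // Included lines must survive untouched, excluded lines must vanish.
  $expected = ($i % 2 == 0) ? $line : '';
  if ($out !== $expected) {
    $failures[] = $i;
  }
}

echo (count($failures) ? 'failed lines: ' . implode(',', $failures) : 'all lines ok') . "\n";
```

Checking line by line has the advantage that a failure points at a specific run of characters rather than at one huge string comparison.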

Status: Needs review » Needs work

The last submitted patch, search_unicode_tests.patch, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
56.26 KB
Failed on MySQL 5.0 InnoDB, with: 17,235 pass(es), 6 fail(s), and 7 exception(es). View

Bot test.

chx’s picture

FileSize
56.26 KB
Failed on MySQL 5.0 InnoDB, with: 17,265 pass(es), 9 fail(s), and 4 exception(es). View

With less typos in testBotTellMeWhyDoYouFail

Status: Needs review » Needs work

The last submitted patch, search_unicode_tests.patch, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
56.27 KB
Failed on MySQL 5.0 InnoDB, with: 17,186 pass(es), 8 fail(s), and 0 exception(es). View

Sigh.

Status: Needs review » Needs work

The last submitted patch, search_unicode_tests.patch, failed testing.

Heine’s picture

Status: Needs work » Needs review
FileSize
56.28 KB
Failed on MySQL 5.0 InnoDB, with: 17,187 pass(es), 8 fail(s), and 0 exception(es). View

Use \x syntax to see whether encoding issues between testbot and d.o. cause this.

Status: Needs review » Needs work

The last submitted patch, search_unicode_tests_5.patch, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
55.34 KB
Passed on all environments. View

poor, poor issue. mb_strtolower mixes up ohm with omega... and other snafus. There are a few Unicode characters where the lowercase character has a different number of UTF-8 bytes, so the above tests using strlen instead of drupal_strlen were doomed to fail. The previous tests were wrong because my machine did not have mbstring compiled in, so my machine generated uppercase characters while the testbot lowercased them, and so the identical check failed...
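
A small illustration of the byte-length trap, assuming the ohm sign is one of the affected characters (sketch_char_count() is a hypothetical stand-in for drupal_strlen(); what mb_strtolower() returns can depend on the PHP and mbstring version):

```php
<?php
// Hypothetical stand-in for drupal_strlen(): count characters, not bytes.
function sketch_char_count($text) {
  return preg_match_all('/./us', $text);
}

$ohm = "\xE2\x84\xA6";  // U+2126 OHM SIGN, three UTF-8 bytes.
$omega = "\xCF\x89";    // U+03C9 GREEK SMALL LETTER OMEGA, two UTF-8 bytes.

// Byte lengths differ although both strings are a single character.
echo 'bytes: ' . strlen($ohm) . ' vs ' . strlen($omega) . "\n";
echo 'chars: ' . sketch_char_count($ohm) . ' vs ' . sketch_char_count($omega) . "\n";

// Only meaningful where mbstring is compiled in.
if (function_exists('mb_strtolower')) {
  echo 'lowercased ohm is omega: ';
  var_dump(mb_strtolower($ohm, 'UTF-8') === $omega);
}
```

Any check that compares strlen() before and after lowercasing will disagree between a machine with mbstring and one without it.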

aspilicious’s picture

It passes!

Heine’s picture

Status: Needs review » Reviewed & tested by the community

ahem.

chx’s picture

FileSize
56.99 KB
Passed on all environments. View

Revert the number-punctuation regexp from properties to code points. At this point, the patch straight applies to D6 too.

webchick’s picture

Version: 7.x-dev » 6.x-dev
Status: Reviewed & tested by the community » Patch (to be ported)

Excellent work! Not only do we fix a bug in Drupal core for a few billion people, but we also can file bug reports upstream for PCRE. :D While I was expecting tests that just ran a couple more strings through the existing search tests, chx tells me that these tests are bullet-proof and ensure we get no further regressions in this tweaky, obtuse area of code, which sounds great to me.

Committed to HEAD. Since this was fixed in such a way that it does not require changes to requirements/APIs, also moving down to 6.x for consideration.

Garrett Albright’s picture

Status: Patch (to be ported) » Needs review
FileSize
49.89 KB

EDIT: Ignore this stupid patch.

Garrett Albright’s picture

FileSize
10.74 KB

(Well, I'm not sure how I pulled that off, but anyway, here's a reroll with just the search-related stuff.)

D6 patch! Without tests, obviously, but I was able to successfully get results when using the Thai text in #5, and, if accepted, this also eliminates the need for a D6 port of #493770: Search incorrectly splits some katakana words (I was able to get expected results using some of the hiragana examples in that issue).

Garrett Albright’s picture

FileSize
1.89 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch search-pedantic-grammar-D7.patch. View

Patch to fix some niggly grammatical issues introduced in the D7 patch in #73.

kaakuu’s picture

Is it somehow and kindly possible to post search.module, patched and in entirety,for D6 and D7 as a text attachment, please?

tstoeckler’s picture

Since the D7 patch was committed, you can just go to the Drupal project page (http://drupal.org/project/drupal) and download Drupal 7.

jhodgdon’s picture

Just a note that some of this for D7 will be moved out of the search module and used in truncate_utf8(), if this issue gets fixed:
#768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug)

Also, see #56 above to learn how the Unicode character file was generated for the tests. Note that it ends up being alternate lines of word/boundary characters, which is how the test works (the latest version of the tests for D7 has a lot more comments on how they work).

Status: Needs review » Needs work

The last submitted patch, search-pedantic-grammar-D7.patch, failed testing.

jhodgdon’s picture

Version: 6.x-dev » 7.x-dev
Status: Needs work » Needs review

I just took a look at the patch in #77, since this is a D7 patch. As a note, it's not actually suggesting grammar fixes per se -- it's line wrapping, extra spaces, and capitalization.... let's see.

The first two sections are inconsistent:

  * Characters with the following General_category (gc) property values are
  * excluded from the search index. Also, they are used as word boundaries.
- * While this does not fully conform to the  Word Boundaries algorithm
- * described in http://unicode.org/reports/tr29, as PCRE does not contain the
- * Word_Break property table, this simpler algorithm has to do.
+ * While this does not fully conform to the Word Boundaries algorithm described
+ * in http://unicode.org/reports/tr29, as PCRE does not contain the Word_Break
+ * property table, this simpler algorithm has to do.
  * - Cc, Cf, Cn, Co, Cs: Other.
  * - Pc, Pd, Pe, Pf, Pi, Po, Ps: Punctuation.
  * - Sc, Sk, Sm, So: Symbols.
  * - Zl, Zp, Zs: Separators.
  *
  * Consequently, the index only contains characters with the following
- * General_category (gc) property values:
+ * General_Category (gc) property values:

this fixes General_category -> General_Category in only one of the two spots it appears.... I think we should just leave it as-is. Also, this hunk has moved to unicode.inc and has been reworded, so the other changes suggested here have already been taken care of.

-  // search behavior with acronyms and URLs. No need to use the unicode modifer
+  // search behavior with acronyms and URLs. No need to use the Unicode modifier

This is a suggestion to capitalize Unicode in search.module, but it's not complete and doesn't apply to the current code. There are two spots where this could be done. But they're in code comments (not docblocks) so I don't think this is very high priority. Let's leave it.

So I guess we can proceed to Drupal 6, and review the patch in #76. Setting status appropriately (will review that patch in a separate comment).

jhodgdon’s picture

Version: 7.x-dev » 6.x-dev

whoops, wrong version.

jhodgdon’s picture

Status: Needs review » Needs work

I cannot get the patch in #76 to apply to the current Drupal 6. We need a new patch.

udvranto’s picture

subscribing

udvranto’s picture

I applied the patch manually to 6.20. Still does not work for Bengali characters. Do I need to update the index database?

jhodgdon’s picture

Yes, after applying the patch, you would definitely need to reindex your site, because this would change how your site is indexed as well as searched.

jhodgdon’s picture

Someone just reported another example of this for Tamil at #1108194: Drupal unicode search does not work! (closed as duplicate)

jhodgdon’s picture

Version: 6.x-dev » 7.x-dev
Issue summary: View changes
Status: Needs work » Fixed

Talked with Gabor (the Drupal 6 branch maintainer) and D6 issues are really not being committed unless they're really essential -- we really don't have a test system for Drupal 6 and it's too dangerous. So... putting this back to D7 / fixed.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.