Just upgraded drupal.hu to Drupal 6 to eat some dog food, and it looks like the search module is miserably misbehaving. I get boatloads of these *errors* in dblog:
Duplicate entry 'nevet-33-node' for key 1 query: INSERT INTO search_index (word, sid, type, score) VALUES ('nevet', 33, 'node', 0.401937290034) - [***]/root/modules/search/search.module - line 574.
Comment | File | Size | Author |
---|---|---|---|
#26 | d6-search-collation.patch | 1.07 KB | Damien Tournoud |
Comments
Comment #1
cburschka: search_wipe() is called before doing these inserts, and search_wipe() should delete all of that data. Why isn't it doing that?
Comment #2
cburschka: Addendum: also, the array is keyed by words, so those should be unique within a single re-indexing.
Comment #3
Gábor Hojtsy: Hm, now that you've provided this info, I think it could easily be a collation problem. "word" is defined as a simple varchar column, so whatever collides in the collation used will collide there as well.
Comment #4
cburschka: I get it: PHP compares array keys byte-by-byte, so multiple strings that are distinct in PHP can still collide under MySQL's collation.
That's nasty. Is there any way to avoid having to search for each word before adding it? Or would it be simpler to force a binary collation on this column?
Comment #5
Gábor Hojtsy: To elaborate on that: depending on the UTF-8 collation used, several characters may be treated as identical by the index, like é and e for Hungarian. It is a bad MySQL bug which we have encountered before (http://bugs.mysql.com/bug.php?id=22337). Do case-insensitive searches rely on MySQL's collations? Depending on the answer, we can convert this field to binary or not.
Comment #6
Gábor Hojtsy: Hm, I tested setting the word column to

varbinary(10)

and then to

varchar(10) character set utf8 collate utf8_bin NOT NULL default ''

and neither of those fixed it. It might not be a collation issue after all. Hm.
Comment #7
Gábor Hojtsy: varbinary() would not be good anyway, since it does not work at the character level, so the binary collation was the only contender. With that in place, I got *even more* duplicate key errors during indexing. Not promising. (I have now revoked the search permission for anonymous and authenticated users, since I was wiping the search index by hand at the SQL level to be able to make these changes, which would have made for a less than satisfactory experience.)
Comment #8
cburschka: Unfortunately I can't reproduce this with a Lorem Ipsum on utf8_general_ci; the search indexer just runs right through without errors. Can you post the collation and the specific node text this is happening with?
Perhaps var_dump()ing the $results array before inserting it would provide helpful information, too.
Comment #9
moshe weitzman: We have a similar issue in D5 which was marked 'fixed'. Not sure if this is a dupe or not. See http://drupal.org/node/143160
Comment #10
cburschka: Well, the older issue was fixed by adding a unique index on the table. So instead of having duplicate values in the table, we now get errors about trying to insert duplicate values.
Comment #11
Gábor Hojtsy: Hm, I looked into the actual data causing this. For the error mentioned in the topic starter, it is http://drupal.hu/node/33 (as shown in the error message). The node and its comments contain "nevet" and "nevét", and these obviously collide under the collation used. My search_index table is:
So to try to reproduce this, just include "nevet" and "nevét" in a node and see what happens when it is indexed. These two words definitely do not collide as PHP array keys, but they do in MySQL's setup above, as Drupal uses it. As things stand, the search module is unusable on sites using accented characters.
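To make the collision concrete, here is a minimal illustration (not from the original thread; it assumes a MySQL server with a utf8 connection character set):

```sql
-- Under utf8_general_ci, the collation in play here, the accented and
-- unaccented forms compare as equal, so a UNIQUE key sees them as duplicates.
SELECT 'nevet' = 'nevét' COLLATE utf8_general_ci;  -- 1 (equal)
SELECT 'nevet' = 'nevét' COLLATE utf8_bin;         -- 0 (distinct)
```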
Comment #12
keith.smith: Yep. A Drupal 6 install (with search enabled) containing "nevet" and "nevét" in a node (and nothing else) will trigger a duplicate-entry error on running cron.
Comment #13
cburschka: So why won't the binary collation work? Is the unique key on word-sid-type stripping high-bit characters regardless of collation?
Comment #14
catch: Is there any reason someone would want to index accented and unaccented characters separately with the search module? It's very late in the cycle for this, but I'm wondering whether something similar to the accents module might be an option. Does going from varchar to binary have performance/space implications here?
Comment #15
cburschka: What happens to search if it is used with a non-Latin character set like Japanese or Hebrew?
But for that problem I would indeed suggest stripping out or replacing high-bit characters, and doing the same when processing a search query.
Comment #16
douggreen: I can also reproduce this error.
Responding to #1: search_wipe() is working fine. It wipes {search_index} for the sid/type combination, then starts inserting the word/sid/type values found. If this list includes the example text "nevet" and "nevét", the first word gets inserted and the second creates the duplicate error.
So, to summarize the above: PHP recognizes a difference between these two words, but MySQL does not.
Comment #17
cburschka: How can PHP ensure uniqueness without messing up non-Latin alphabets?
Comment #18
cburschka: While the root cause is tricky to solve properly, I know a trivial solution to the error problem:
This would choose only the word with the maximum score for each set of duplicates. If this sounds like a good solution, I'll make a patch.
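The original snippet is not reproduced in the thread; one hedged, SQL-side sketch of "keep only the maximum score per set of duplicates" (values are illustrative) would be:

```sql
-- If the collation treats this word as a duplicate of an existing row,
-- keep the higher of the two scores instead of failing the INSERT.
INSERT INTO search_index (word, sid, type, score)
VALUES ('nevét', 33, 'node', 0.40)
ON DUPLICATE KEY UPDATE score = GREATEST(score, VALUES(score));
```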
Comment #19
Gábor Hojtsy: Arancaytar: if no better solution comes around, this would slow down reindexing but solve the problem for foreign languages. If we go this way, it would be great to add code comments describing why we do this (since people might think that the wipe already removes the word occurrences).
Note that we are on track for an (again hopefully final) RC4 tonight (in around 13 hours), so it would be great to solve this as well.
Comment #20
cburschka: I know. :/ That's why I hate to settle for such a primitive solution. But this is one of six criticals still in the queue, and if the next release candidate can downgrade it to a performance issue, that would be an improvement.
Comment #21
catch: FWIW, I just did a very rudimentary test with some Japanese text and a minimum word length of one character and didn't run into any issues. It looks like the bug is specific to particular accents and nothing more.
Comment #22
Gábor Hojtsy: Most accents used in our text cause bugs (we use accents common to Central Europe). One indexing round with 100 nodes resulted in around 5 pages of errors in watchdog; that's why I have turned the module off on the live site for now and am only testing it on my private test site.
Comment #23
catch: Sorry, "particular accents" should have been "accents on Roman characters"; cafe and café cause the same error with utf8_general_ci, for example.
There is more on this bug/behaviour on the MySQL forums, but it doesn't look all that helpful.
What about a strtr() to strip accents before insertion? Non-trivial, but probably more performant than #18. However, I guess this could really mess things up in some languages.
As mentioned, the accents module does this, but I haven't installed it, and my first look at its code was five minutes ago.
Comment #24
Gábor Hojtsy: Well, "nevet" and "nevét" definitely mean two different things: "nevet" == "laughs" or "that name" depending on the context :), while "nevét" == "her name" or "his name". I know from practice that most of the time Hungarians are fine writing without accents, but if you remove accents, the results are sometimes funny/unexpected. E.g. "szár" means "stalk", while "szar" means "shit", so when you search for the first, you definitely don't want results for the second to appear.
It could be a fine decision that Drupal's built-in search module will not fit these rules perfectly. In fact, if you try http://www.google.com/search?q=szár you will see that both accented and non-accented results appear, so it would not be an unusual or (from the user's perspective) inherently unexpected path to take. For this to work, we would need a list of accented characters to replace with their non-accented versions.
Comment #25
catch: Also, on many sites, even ones not using locale/content translation, we'll see "nevet" and "nevét" alongside "café" and "cafe" (right here, for example), so stripping diacritics makes sense to me for core.
I looked at utf8_decode(), but of course that completely destroys non-Latin characters, so it's useless.
Ideally this should happen on both indexing and searching; search currently does explicit queries for e and é with different result sets. If we strip when indexing, we have to do the same when searching. Unfortunately, Google wasn't my friend for a nice clean function to do this.
Comment #26
Damien Tournoud: @Gábor #7: this is clearly a problem related to collations. I ran into it several weeks ago while testing the migration of http://drupalfr.org to Drupal 6, but failed to report it due to severe time constraints.
One way to solve this bug is to set the {search_index}.word column to the utf8_bin collation. I just validated on a test site that it solves the problem. But this would mean that we differentiate between different versions of a word (accented/not accented, etc.).
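A hedged sketch of that column change (it assumes the default D6 column length of 50 and no table prefix; verify against your own schema before running):

```sql
-- Switch the word column to the binary collation so that accented and
-- unaccented forms no longer collide in the unique key.
ALTER TABLE search_index
  MODIFY word VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL DEFAULT '';
```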
@catch #25: In fact, collation is not an enemy; it should be our friend. Implementing a collation is difficult work, and language-specific on top of that. The one-size-fits-all 'utf8_general_ci' is not optimal, but should work well for most Latin-based languages. Language-specific collation and stemming should be on our work plan for D7. In the meantime, because scores are additive, I suggest we do the following:
Enclosed patch does exactly that.
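The patch body itself is not reproduced in this thread; the following is a hedged sketch of what "scores are additive" implies at the SQL level (values are illustrative):

```sql
-- When the collation treats two words as the same key, add their scores
-- together rather than raising a duplicate-key error.
INSERT INTO search_index (word, sid, type, score)
VALUES ('nevét', 33, 'node', 0.40)
ON DUPLICATE KEY UPDATE score = score + VALUES(score);
```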
Comment #27
cburschka: Applies, and fixes the log warnings. Re-indexing a node containing "café cafe" logs errors without the patch, but not afterward.
A second review would be good.
Comment #28
catch: The patch removes the error. If I have two nodes like this:
1. café cafe
2. cafe
Searching for either café or cafe gives me the same result. However, the highlighting is still dependent on the specific string used.
This seems OK to me, really; it might even be a feature, since it would show results for lazy typing but still differentiate the nevet/nevét case on display. Very nice!
Comment #29
catch: Looks like mine was the second review.
Comment #30
Gábor Hojtsy: I am testing this patch live on drupal.hu now, thanks! I will commit later today if all looks fine.
Comment #31
Gábor Hojtsy: This did not solve it on my site. It looks like there is a problem with word length in the unique key as well:
Note that the key entry is a shorter string than the inserted word, so longer words can produce duplicates in the key. Can anyone else reproduce this problem? (Note that these words do not contain any non-English characters.)
Comment #32
Damien Tournoud: @Gábor: it looks like a different issue, perhaps related to the upgrade path. Could you dump the schema of the {search_index} table in your current database? The default one from my empty D6 installation is:
Is yours any different?
Comment #33
cburschka: I also cannot reproduce this on a fresh installation. I agree it's probably an update-path problem where the unique key has to be reset to the full length of the field.
Comment #34
cburschka: Here's the key-adding code in system.install.
I was unable to find any relevant update on search_index that changed the length of the word field. So the cause is not yet identified.
Edit: Just for reference, my table structure is identical:
Comment #35
catch: I can't reproduce this key-length issue on a clean install either.
My live D5 database, which has been around since 4.5, also doesn't have any explicit key lengths. I just checked in case some might be sticking around:
Comment #36
keith.smith: If Gábor's post in #11 is current, the only difference I see is the line:
`type` varchar(16) default '',
Comment #37
cburschka: I didn't notice that. Well, with VARCHAR(50) and no explicit key length, I have no idea why the key gets truncated to 10 characters. Weird.
Comment #38
Gábor Hojtsy: Hold on: I broke my own database while trying out fixes. I had set the column to varchar(10). Let me run a better test now with a correct table. Sorry for taking your time; I should have been more alert to what I changed. Excuse me.
Comment #39
Gábor Hojtsy: Works perfectly on drupal.hu as well, superb. Thanks for the patch. Committed to 6.x and RTBC for 7.x.
Comment #40
Damien Tournoud: This is still missing for 7.x.
Comment #41
Dries: I've committed this to CVS HEAD. Thanks, guys.
Comment #42
Anonymous (not verified): Automatically closed -- issue fixed for two weeks with no activity.
Comment #43
daimhin: I am seeing this problem with my 6.2 install. Has the fix discussed in this posting been applied to 6.2 core?
Comment #44
Damien Tournoud: This patch went in before the release of Drupal 6.0. Please open another issue with a detailed description of your problem.
Comment #45
rajmataj: I need to open this ticket again, because I don't believe this has been fixed. I have a new Drupal 6.2 install, and all was running fine for the past week. However, since a few days ago I have been getting numerous errors in my log, dozens of them, all logged within the same minute:
Duplicate entry 'personal-13-node' for key 1 query: INSERT INTO search_index (word, sid, type, score) VALUES ('personal', 13, 'node', 0.949597315436) in /home/bcims/public_html/modules/search/search.module on line 579.
I could definitely use some help in solving this. I was able to construct the template for Drupal 6.2, but I am a novice at PHP and MySQL. I can do some rudimentary work on the error, but assistance from more experienced programmers would be very much appreciated.
There is a fellow who wrote this article on the problem:
http://cmsreport.com/node/1189
Comment #46
catch: RajP: Please open a new issue rather than re-opening this old one. Your duplicate is likely caused by a slightly different version of the bug, so it will need to be fixed in a different issue. A quick fix is to truncate your search tables and reindex (you can do this via the UI by disabling and uninstalling the search module, then re-enabling it).
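For reference, a hedged sketch of the manual (SQL-level) version of this quick fix (table names assume a default, prefix-less Drupal 6 install; back up your database first):

```sql
-- Empty the search tables so the index can be rebuilt from scratch.
TRUNCATE TABLE search_dataset;
TRUNCATE TABLE search_index;
TRUNCATE TABLE search_total;
-- Then run cron repeatedly until the index is fully rebuilt.
```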
Comment #47
kenorb: Is this one:
related to this issue?
Comment #48
kenorb: Duplicates:
#143160: search_index has duplicate (sid, word, type, fromsid=0) entries
http://drupal.org/node/180032
Comment #49
kenorb: #218403: Duplicate entry errors in search indexer