Just upgraded drupal.hu to Drupal 6 to eat some dog food, and it looks like the search module is miserably misbehaving. I get boatloads of these *errors* in dblog:

Duplicate entry 'nevet-33-node' for key 1 query: INSERT INTO search_index (word, sid, type, score) VALUES ('nevet', 33, 'node', 0.401937290034) - [***]/root/modules/search/search.module - 574. sor.

Files: 
CommentFileSizeAuthor
#26 d6-search-collation.patch1.07 KBDamien Tournoud

Comments

cburschka’s picture

search_wipe() is called before doing these inserts.

<?php
 
# line 567
 
search_wipe($sid, $type, TRUE);

 
// Insert cleaned up data into dataset
 
db_query("INSERT INTO {search_dataset} (sid, type, data, reindex) VALUES (%d, '%s', '%s', %d)", $sid, $type, $accum, 0);

 
// Insert results into search index
 
foreach ($results[0] as $word => $score) {
   
db_query("INSERT INTO {search_index} (word, sid, type, score) VALUES ('%s', %d, '%s', %f)", $word, $sid, $type, $score);
   
search_dirty($word);
  }                    
?>

And search_wipe should delete all that stuff:

<?php
/**
 * Wipes a part of or the entire search index.
 *
 * @param $sid
 *  (optional) The SID of the item to wipe. If specified, $type must be passed
 *  too.
 * @param $type
 *  (optional) The type of item to wipe.
 */
function search_wipe($sid = NULL, $type = NULL, $reindex = FALSE) {
  if (
$type == NULL && $sid == NULL) {
   
module_invoke_all('search', 'reset');
  }
  else {
   
db_query("DELETE FROM {search_dataset} WHERE sid = %d AND type = '%s'", $sid, $type);
   
db_query("DELETE FROM {search_index} WHERE sid = %d AND type = '%s'", $sid, $type);
   
// Don't remove links if re-indexing.
   
if (!$reindex) {
     
db_query("DELETE FROM {search_node_links} WHERE sid = %d AND type = '%s'", $sid, $type);
    }
  }
}
?>

Why isn't it doing that?

cburschka’s picture

Addendum: Also, the array is keyed by words, so those should be unique for a single re-indexing.

Gábor Hojtsy’s picture

Hm, now that you provided this info, I think it could easily be a collation problem. "word" is defined to be a simple varchar column, so whatever collides in the used collation will collide there as well.

cburschka’s picture

I get it: PHP's array keys collate differently, allowing multiple strings that would collide in MySQL's collation.

That's nasty. Is there any way to avoid having to search for each word before adding it? Or would it be simpler to force a binary collation on this column?

Gábor Hojtsy’s picture

To elaborate on that, depending on the UTF collation used, several chars might be the same for the index, like é and e for Hungarian. It is a bad MySQL bug which we encountered before (http://bugs.mysql.com/bug.php?id=22337). Are case-insensitive searches supported relying on MySQL's collations? Depending on the answer we can convert this field to binary or not.

Gábor Hojtsy’s picture

Hm, tested to set the word column to varbinary(10) and then to varchar(10) character set utf8 collate utf8_bin NOT NULL default '' and none of that worked to fix this. It might not be a collation issue after all. Hm.

Gábor Hojtsy’s picture

Varbinary() would not be good anyway, since it does not work on the character level. So the binary collation was the only contender. With that in place, I got *even more* duplicate key errors for the indexing. Not promising. (Now I did revoke permission for anonymous and authenticated users to use the search, since I was wiping the search index on the SQL level by hand to be able to do these changes so it would have been a less then satisfactory experience).

cburschka’s picture

Unfortunately I can't reproduce this with a Lorem Ipsum on utf8_general_ci. The search indexer just runs right through without errors. Can you post the collation and specific node text that this is happening with?

Perhaps vardumping the $results array before inserting it would be helpful information, too.

moshe weitzman’s picture

We have a similar issue in D5 which was marked 'fixed'. Not sure if this is a dupe of not. See http://drupal.org/node/143160

cburschka’s picture

Well, the older issue was fixed by adding a unique index on the table. So instead of having duplicate values in the table, we now get errors about trying to insert duplicate values.

Gábor Hojtsy’s picture

Hm, I looked into actual data causing this. For the error mentioned in the topic starter, it is http://drupal.hu/node/33 (as shown in the error message). The node and its comments contain "nevet" and "nevét", and this obviously collides with the collation used. My search_index table is:

mysql> show create table search_index;
...
| search_index | CREATE TABLE `search_index` (
  `word` varchar(50) NOT NULL default '',
  `sid` int(10) unsigned NOT NULL default '0',
  `type` varchar(16) default '',
  `score` float default NULL,
  UNIQUE KEY `word_sid_type` (`word`,`sid`,`type`),
  KEY `sid_type` (`sid`,`type`),
  KEY `word` (`word`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 |
...

So to try to reproduce, for example, just include "nevet" and "nevét" in a node and see what happens on its indexing. These two words are definitely not colliding in PHP arrays, but they are in MySQL's above setup, as Drupal uses it. This way, search module is unusable as-is with sites using accented chars.

keith.smith’s picture

Yep. A Drupal 6 install (with search enabled) containing "nevet" and "nevét" in a node (and nothing else) will trigger a:

user warning: Duplicate entry 'nevét-3-node' for key 1 query: INSERT INTO search_index (word, sid, type, score) VALUES ('nevét', 3, 'node', 1) in /web/test/modules/search/search.module on line 574.

on running cron.

cburschka’s picture

So why won't the binary collation work? Is the unique key on word-sid-type is stripping high-bit characters regardless of collation?

catch’s picture

Is there any reason someone would want to index accented and unaccented characters separately with search module? Very late in the cycle for this but I'm wondering if something similar to accents module might be an option? Does varchar to binary have performance/space implications here?

cburschka’s picture

What happens to search if it is used in a non-Latin character set like Japanese or Hebrew?

But for that problem I would indeed suggest stripping out or replacing high-bit characters, and doing the same when processing a search query.

douggreen’s picture

I can also reproduce this error.

responding to #1, search_wipe is working fine. It wipes {search_index} for the sid/type combination, then starts inserting the word/sid/type values found, and if this list included the example text "nevet" and "nevét", the first word gets inserted, and the word creates the duplicate error.

So to summarize the above, php recognizes a difference between these two words, but mysql does not.

cburschka’s picture

How can PHP ensure uniqueness without messing up non-Latin alphabets?

cburschka’s picture

While the root cause is a tricky one to solve properly, I know a trivial solution to the error problem:

<?php
 
foreach ($results[0] as $word => $score) {
    if ((float)
db_result(db_query("SELECT score FROM {search_index} WHERE word = '%s' AND sid = %d AND type = '%s'", $word, $sid, $type)) < $score) {
     
db_query("INSERT INTO {search_index} (word, sid, type, score) VALUES ('%s', %d, '%s', %f)", $word, $sid, $type, $score);
     
search_dirty($word);
    }
  }
?>

This would choose only the word with the maximum score for each set of duplicates. If this sounds like a good solution, I'll make a patch.

Gábor Hojtsy’s picture

Arancaytar: if no better solution comes around, this would slow down the reindexing but solve the problem for foreign languages. If this way is chosen, code comments would be great to be added to describe why do we do this (since people might think that the wipe removes the word occurrences already).

Note that we are on track for an (again hopefully final) RC4 tonight (in around 13 hours), so it would be great to solve this as well.

cburschka’s picture

this would slow down the reindexing

I know. :/ That's why I hate to be satisfied with such a primitive solution. But this is one of six criticals still in the queue, and if the next release candidate can downgrade it to a performance issue, that would be an improvement.

catch’s picture

fwiw I just did a very rudimentary test with some Japanese text and minimum length one character and didn't run into any issues - looks like it's a bug specific to particular accents and nothing more than that.

Gábor Hojtsy’s picture

Most accents used in our text (we use accents common to Central Europe) cause bugs. One indexing round with 100 nodes resulted in around 5 pages of errors in watchdog, that's why I turned the module off for now on the live site, and only test it on my private test site.

catch’s picture

Sorry particular accents should've been "accents on roman characters", cafe and café causes the same error with utf8_general_ci for example.

More on this bug/behaviour at the mysql forums, doesn't look all that helpful.

What about a strtr to strip accents before insertion? Non-trivial but probably more performant than #18. However I guess this could really mess things up in some languages.

As mentioned, accents module does this but I've not installed it, and first look at the code was five minutes ago.

Gábor Hojtsy’s picture

Well, "nevet" and "nevét" definitely means two different things. "nevet" == "laughs" or "that name" depending on the context :), while "nevét" == "her name" or "his name". I know from practice that most of the time, Hungarians are fine writing without accents, but if you remove accents, the results are sometimes funny/unexpected. Eg. "szár" means "stalk", while "szar" means "shit", so when you search for the first, you definitely don't want results on the second to appear.

It could be a fine decision that Drupal's built in search module will not be perfectly fitting for these rules. In fact, if you try http://www.google.com/search?q=szár you will see that both accented and non-accented results appear. So it would not be an unusual or (from user perspective) inherently unexpected path to take. For this to work, we would need to have a list of accented chars to replace with the non-accented versions.

catch’s picture

Also with many sites, even ones not using locale/content translate, we'll see "nevet" and "nevét" alongside "café" and "cafe", right here for example - so stripping diacritics makes sense to me for core.

Looked at utf8_decode() but of course that completely destroys non-latin characters, so useless.

Ideally this should happen with both indexing and searching - it currently does explicit searches for e and é with different result sets. If we strip when indexing, we have to do the same with searching. Google wasn't my friend for a nice clean function to do this unfortunately.

Damien Tournoud’s picture

Status:Active» Needs review
StatusFileSize
new1.07 KB

@Gábor#7: this is clearly a problem related to collations. I ran into it several weeks ago will testing the migration of http://drupalfr.org to Drupal 6, but failed to report it due to severe time constraints.

One way to solve that bug is to set the {search_index}.word column to utf8_bin_ci. I just validated on a test site that it solve the problem.

But, with this would mean that we would differentiate between different versions of a word (accented/not accented, etc.).

@catch#25: In fact, collation is not an enemy, it should be our friend. The implementation of the collation is a difficult work, and moreover language-specific. The one-size-fits-all 'utf8_general_ci' is not optimal, but should works well for most latin based languages. Doing a language specific collation and steeming should be in our work plan for D7.

In the meanwhile, because scores are additive, I suggest we do that:

<?php
   
// The database will collate similar words (accented and non-accented forms, etc.),
    // and the score is additive, so first add and then insert.
   
db_query("UPDATE {search_index} SET score = score + %d WHERE word = '%s' AND sid = '%d' AND type = '%s'", $score, $word, $sid, $type);
    if (!
db_affected_rows()) {
     
db_query("INSERT INTO {search_index} (word, sid, type, score) VALUES ('%s', %d, '%s', %f)", $word, $sid, $type, $score);
    }
?>

Enclosed patch does exactly that.

cburschka’s picture

Applies, fixes the log warnings. Re-indexing a node containing "café cafe" will log errors without the patch, but not afterward.

A second review would be good.

catch’s picture

Patch removes the error. If I have two nodes like this:

1. café cafe
2. cafe

Searching for either café or cafe gives me the same result. However, the highlighting is still dependent on the specific string used.

This seems ok to me really, might even be a feature since it'd show results with lazy typing, but still differentiate in the nevet nevét case on display. Very nice!

catch’s picture

Status:Needs review» Reviewed & tested by the community

Looks like mine was the second review.

Gábor Hojtsy’s picture

I am testing this patch live on Drupal.hu now, thanks! Will commit later today if all looks fine.

Gábor Hojtsy’s picture

Status:Reviewed & tested by the community» Needs review

Did not solve it on my site. Looks like there is a problem on word length in the unique key as well:

Duplicate entry 'modositott-91-node' for key 1 query: INSERT INTO search_index (word, sid, type, score) VALUES ('modositottam', 91, 'node', 0.56910543131) - .../modules/search/search.module - 578.

Duplicate entry 'informatik-108-node' for key 1 query: INSERT INTO search_index (word, sid, type, score) VALUES ('informatika', 108, 'node', 4) - .../modules/search/search.module - 578.

...

See, that the key entry has a shorter string then the inserted word, so longer words can result in duplicates in the key. Anyone else can reproduce this problem? (Note that these words do not contain any non-English chars).

Damien Tournoud’s picture

@Gábor: It looks like a different issue, perhaps related to the upgrade path. Could you dump the schema of the {search_index} table in your current database?

The default one from my empty D6 installation is:

CREATE TABLE IF NOT EXISTS `search_index` (
  `word` varchar(50) NOT NULL default '',
  `sid` int(10) unsigned NOT NULL default '0',
  `type` varchar(16) default NULL,
  `score` float default NULL,
  UNIQUE KEY `word_sid_type` (`word`,`sid`,`type`),
  KEY `sid_type` (`sid`,`type`),
  KEY `word` (`word`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

Is yours any different?

cburschka’s picture

Also cannot reproduce on fresh installation. I agree it's probably an update path problem where the unique key has to be reset to the full length of the field.

cburschka’s picture

Here's the key-adding code in system.install.

<?php
function system_update_6036() {
   ...
// if mysql:
     
$ret[] = update_sql("ALTER IGNORE TABLE {search_index} ADD UNIQUE KEY word_sid_type (word, sid, type)");
   ...
// otherwise
     
db_add_unique_key($ret, 'search_index', 'word_sid_type', array('word', 'sid', 'type'));
?>

I was unable to find any relevant update on search_index that changed the length of the word field. So the cause is not yet identified.

Edit: Just for reference, my table structure is identical:

CREATE TABLE `search_index` (
  `word` varchar(50) NOT NULL default '',
  `sid` int(10) unsigned NOT NULL default '0',
  `type` varchar(16) default NULL,
  `score` float default NULL,
  UNIQUE KEY `word_sid_type` (`word`,`sid`,`type`),
  KEY `sid_type` (`sid`,`type`),
  KEY `word` (`word`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
catch’s picture

I can't reproduce this key length issue either on a clean install.

CREATE TABLE `search_index` (
  `word` varchar(50) NOT NULL default '',
  `sid` int(10) unsigned NOT NULL default '0',
  `type` varchar(16) default NULL,
  `score` float default NULL,
  UNIQUE KEY `word_sid_type` (`word`,`sid`,`type`),
  KEY `sid_type` (`sid`,`type`),
  KEY `word` (`word`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 |

My live D5 database which has been around since 4.5 also doesn't have any explicit key lengths - just checked in case there might be some sticking around:

CREATE TABLE `search_index` (
  `word` varchar(50) NOT NULL default '',
  `sid` int(10) unsigned NOT NULL default '0',
  `type` varchar(16) default NULL,
  `fromsid` int(10) unsigned NOT NULL default '0',
  `fromtype` varchar(16) default NULL,
  `score` float default NULL,
  KEY `sid_type` (`sid`,`type`),
  KEY `from_sid_type` (`fromsid`,`fromtype`),
  KEY `word` (`word`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 |
keith.smith’s picture

If Gabor's post in #11 is current, the only difference I see is the line:

`type` varchar(16) default '',

cburschka’s picture

Didn't notice that. Well, with VARCHAR(50) and no explicit key length, I have no idea why the key gets truncated to 10 characters. Weird.

Gábor Hojtsy’s picture

Status:Needs review» Reviewed & tested by the community

Hold on, I broke my own database while trying out fixes. I set it to varchar(10). Let me see a better test now with a correct table. Sorry for taking your time, I should have been more alert to what I change. Excuse me.

Gábor Hojtsy’s picture

Version:6.x-dev» 7.x-dev

Works perfectly on drupal.hu as well, superb. Thanks for the patch. Committed to 6.x and RTBC for 7.x.

Damien Tournoud’s picture

This is still missing for 7.x.

Dries’s picture

Status:Reviewed & tested by the community» Fixed

I've committed this to CVS HEAD. Thanks guys.

Anonymous’s picture

Status:Fixed» Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.

daimhin’s picture

Version:7.x-dev» 6.2

I am seeing this problems with my 6.2 install. Has the fix that is discussed in this posting been applied to the 6.2 core?

Damien Tournoud’s picture

Version:6.2» 7.x-dev

This patch went in before the release of Drupal 6.0. Please open another issue with a detailed description of your problems.

rajmataj’s picture

Version:7.x-dev» 6.2
Status:Closed (fixed)» Active

...Need to open this ticket again because I don't believe this has been fixed. I have a new Drupal 6.2 install, and all was running fine for the past week. However, since a few days ago I'm getting numerous errors in my log, dozens of them, all logged within the same minute:

Duplicate entry &#039;personal-13-node&#039; for key 1 query: INSERT INTO search_index (word, sid, type, score) VALUES (&#039;personal&#039;, 13, &#039;node&#039;, 0.949597315436) in /home/bcims/public_html/modules/search/search.module on line 579.

I could definitely use some help in solving this. I was able to construct the template for Drupal 6.2 and am a novice at php and mysql. I can do some rudimentary work on the error but some assistance from more experienced programmers would be very appreciated.

There is a fellow who wrote this article on the problem: http://cmsreport.com/node/1189

catch’s picture

Status:Active» Closed (fixed)

RajP. Please open a new issue rather than re-opening this old one. It's likely your duplicate is caused by a slightly different version of the bug so will need to be fixed in a different issue. A quick fix for this is to truncate your search tables and reindex (you can do this via the ui by disabling and uninstalling search module then re-enabling it).

kenorb’s picture

Version:6.2» 6.9

Is this one:

Duplicate entry &#039;1107-node&#039; for key 1 query: INSERT INTO search_dataset (sid, type, data, reindex) VALUES (1107, &#039;node&#039;, &#039; monter poszycia kadłubów samolotów zagranica referencja job1107 ocena oceń kiepsko ok dobra super extra twoja ocena brak &#039;, 0) w pliku /home/sites/co.uk/public_html/modules/search/search.module, linia 571.

Is related to this issue?

  // Insert cleaned up data into dataset
  db_query("INSERT INTO {search_dataset} (sid, type, data, reindex) VALUES (%d, '%s', '%s', %d)", $sid, $type, $accum, 0);
kenorb’s picture