Duplicate entry errors in search indexer [#218403]

Comment	File	Size	Author
#26	d6-search-collation.patch	1.07 KB	Damien Tournoud

Comment #1

they

CreditAttribution: cburschka commented 6 February 2008 at 10:51

search_wipe() is called before doing these inserts.

  # line 567
  search_wipe($sid, $type, TRUE);

  // Insert cleaned up data into dataset
  db_query("INSERT INTO {search_dataset} (sid, type, data, reindex) VALUES (%d, '%s', '%s', %d)", $sid, $type, $accum, 0);

  // Insert results into search index
  foreach ($results[0] as $word => $score) {
    db_query("INSERT INTO {search_index} (word, sid, type, score) VALUES ('%s', %d, '%s', %f)", $word, $sid, $type, $score);
    search_dirty($word);
  }

And search_wipe should delete all that stuff:


/**
 * Wipes a part of or the entire search index.
 *
 * @param $sid
 *  (optional) The SID of the item to wipe. If specified, $type must be passed
 *  too.
 * @param $type
 *  (optional) The type of item to wipe.
 */
function search_wipe($sid = NULL, $type = NULL, $reindex = FALSE) {
  if ($type == NULL && $sid == NULL) {
    module_invoke_all('search', 'reset');
  }
  else {
    db_query("DELETE FROM {search_dataset} WHERE sid = %d AND type = '%s'", $sid, $type);
    db_query("DELETE FROM {search_index} WHERE sid = %d AND type = '%s'", $sid, $type);
    // Don't remove links if re-indexing.
    if (!$reindex) {
      db_query("DELETE FROM {search_node_links} WHERE sid = %d AND type = '%s'", $sid, $type);
    }
  }
}

Why isn't it doing that?

Log in or register to post comments

Comment #2

cburschka

they

CreditAttribution: cburschka commented 6 February 2008 at 10:54

Addendum: Also, the array is keyed by words, so those should be unique for a single re-indexing.

Log in or register to post comments

Comment #3

Gábor Hojtsy

he/him

Hungarian

Hungary

CreditAttribution: Gábor Hojtsy commented 6 February 2008 at 11:21

Hm, now that you provided this info, I think it could easily be a collation problem. "word" is defined to be a simple varchar column, so whatever collides in the used collation will collide there as well.

Log in or register to post comments

Comment #4

cburschka

they

CreditAttribution: cburschka commented 6 February 2008 at 11:33

I get it: PHP's array keys collate differently, allowing multiple strings that would collide in MySQL's collation.

That's nasty. Is there any way to avoid having to search for each word before adding it? Or would it be simpler to force a binary collation on this column?

Log in or register to post comments

Comment #5

Gábor Hojtsy

he/him

Hungarian

Hungary

CreditAttribution: Gábor Hojtsy commented 6 February 2008 at 11:34

To elaborate on that, depending on the UTF collation used, several chars might be the same for the index, like é and e for Hungarian. It is a bad MySQL bug which we encountered before (http://bugs.mysql.com/bug.php?id=22337). Are case-insensitive searches supported relying on MySQL's collations? Depending on the answer we can convert this field to binary or not.

Log in or register to post comments

Comment #6

Gábor Hojtsy

he/him

Hungarian

Hungary

CreditAttribution: Gábor Hojtsy commented 6 February 2008 at 11:44

Hm, tested to set the word column to varbinary(10) and then to varchar(10) character set utf8 collate utf8_bin NOT NULL default '' and none of that worked to fix this. It might not be a collation issue after all. Hm.

Log in or register to post comments

Comment #7

Gábor Hojtsy

he/him

Hungarian

Hungary

CreditAttribution: Gábor Hojtsy commented 6 February 2008 at 12:04

Varbinary() would not be good anyway, since it does not work on the character level. So the binary collation was the only contender. With that in place, I got *even more* duplicate key errors for the indexing. Not promising. (Now I did revoke permission for anonymous and authenticated users to use the search, since I was wiping the search index on the SQL level by hand to be able to do these changes so it would have been a less then satisfactory experience).

Log in or register to post comments

Comment #8

cburschka

they

CreditAttribution: cburschka commented 6 February 2008 at 12:06

Unfortunately I can't reproduce this with a Lorem Ipsum on utf8_general_ci. The search indexer just runs right through without errors. Can you post the collation and specific node text that this is happening with?

Perhaps vardumping the $results array before inserting it would be helpful information, too.

Log in or register to post comments

Comment #9

moshe weitzman CreditAttribution: moshe weitzman commented 6 February 2008 at 14:15

We have a similar issue in D5 which was marked 'fixed'. Not sure if this is a dupe of not. See http://drupal.org/node/143160

Log in or register to post comments

Comment #10

cburschka

they

CreditAttribution: cburschka commented 6 February 2008 at 16:26

Well, the older issue was fixed by adding a unique index on the table. So instead of having duplicate values in the table, we now get errors about trying to insert duplicate values.

Log in or register to post comments

Comment #11

Gábor Hojtsy

he/him

Hungarian

Hungary

CreditAttribution: Gábor Hojtsy commented 6 February 2008 at 17:03

Hm, I looked into actual data causing this. For the error mentioned in the topic starter, it is http://drupal.hu/node/33 (as shown in the error message). The node and its comments contain "nevet" and "nevét", and this obviously collides with the collation used. My search_index table is:

mysql> show create table search_index;
...
| search_index | CREATE TABLE `search_index` (
  `word` varchar(50) NOT NULL default '',
  `sid` int(10) unsigned NOT NULL default '0',
  `type` varchar(16) default '',
  `score` float default NULL,
  UNIQUE KEY `word_sid_type` (`word`,`sid`,`type`),
  KEY `sid_type` (`sid`,`type`),
  KEY `word` (`word`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 |
...

So to try to reproduce, for example, just include "nevet" and "nevét" in a node and see what happens on its indexing. These two words are definitely not colliding in PHP arrays, but they are in MySQL's above setup, as Drupal uses it. This way, search module is unusable as-is with sites using accented chars.

Log in or register to post comments

Comment #12

keith.smith CreditAttribution: keith.smith commented 6 February 2008 at 17:13

Yep. A Drupal 6 install (with search enabled) containing "nevet" and "nevét" in a node (and nothing else) will trigger a:

user warning: Duplicate entry 'nevét-3-node' for key 1 query: INSERT INTO search_index (word, sid, type, score) VALUES ('nevét', 3, 'node', 1) in /web/test/modules/search/search.module on line 574.

on running cron.

Log in or register to post comments

Comment #13

cburschka

they

CreditAttribution: cburschka commented 6 February 2008 at 22:14

So why won't the binary collation work? Is the unique key on word-sid-type is stripping high-bit characters regardless of collation?

Log in or register to post comments

Comment #14

catch

he/him

English

CreditAttribution: catch commented 6 February 2008 at 22:26

Is there any reason someone would want to index accented and unaccented characters separately with search module? Very late in the cycle for this but I'm wondering if something similar to accents module might be an option? Does varchar to binary have performance/space implications here?

Log in or register to post comments

Comment #15

cburschka

they

CreditAttribution: cburschka commented 6 February 2008 at 22:34

What happens to search if it is used in a non-Latin character set like Japanese or Hebrew?

But for that problem I would indeed suggest stripping out or replacing high-bit characters, and doing the same when processing a search query.

Log in or register to post comments

Comment #16

douggreen CreditAttribution: douggreen commented 7 February 2008 at 01:35

I can also reproduce this error.

responding to #1, search_wipe is working fine. It wipes {search_index} for the sid/type combination, then starts inserting the word/sid/type values found, and if this list included the example text "nevet" and "nevét", the first word gets inserted, and the word creates the duplicate error.

So to summarize the above, php recognizes a difference between these two words, but mysql does not.

Log in or register to post comments

Comment #17

cburschka

they

CreditAttribution: cburschka commented 7 February 2008 at 07:15

How can PHP ensure uniqueness without messing up non-Latin alphabets?

Log in or register to post comments

Comment #18

cburschka

they

CreditAttribution: cburschka commented 7 February 2008 at 08:04

While the root cause is a tricky one to solve properly, I know a trivial solution to the error problem:


  foreach ($results[0] as $word => $score) {
    if ((float)db_result(db_query("SELECT score FROM {search_index} WHERE word = '%s' AND sid = %d AND type = '%s'", $word, $sid, $type)) < $score) {
      db_query("INSERT INTO {search_index} (word, sid, type, score) VALUES ('%s', %d, '%s', %f)", $word, $sid, $type, $score);
      search_dirty($word);
    }
  }

This would choose only the word with the maximum score for each set of duplicates. If this sounds like a good solution, I'll make a patch.

Log in or register to post comments

Comment #19

Gábor Hojtsy

he/him

Hungarian

Hungary

CreditAttribution: Gábor Hojtsy commented 7 February 2008 at 09:37

Arancaytar: if no better solution comes around, this would slow down the reindexing but solve the problem for foreign languages. If this way is chosen, code comments would be great to be added to describe why do we do this (since people might think that the wipe removes the word occurrences already).

Note that we are on track for an (again hopefully final) RC4 tonight (in around 13 hours), so it would be great to solve this as well.

Log in or register to post comments

Comment #20

cburschka

they

CreditAttribution: cburschka commented 7 February 2008 at 09:46

this would slow down the reindexing

I know. :/ That's why I hate to be satisfied with such a primitive solution. But this is one of six criticals still in the queue, and if the next release candidate can downgrade it to a performance issue, that would be an improvement.

Log in or register to post comments

Comment #21

catch

he/him

English

CreditAttribution: catch commented 7 February 2008 at 09:51

fwiw I just did a very rudimentary test with some Japanese text and minimum length one character and didn't run into any issues - looks like it's a bug specific to particular accents and nothing more than that.

Log in or register to post comments

Comment #22

Gábor Hojtsy

he/him

Hungarian

Hungary

CreditAttribution: Gábor Hojtsy commented 7 February 2008 at 10:19

Most accents used in our text (we use accents common to Central Europe) cause bugs. One indexing round with 100 nodes resulted in around 5 pages of errors in watchdog, that's why I turned the module off for now on the live site, and only test it on my private test site.

Log in or register to post comments

Comment #23

catch

he/him

English

CreditAttribution: catch commented 7 February 2008 at 10:51

Sorry particular accents should've been "accents on roman characters", cafe and café causes the same error with utf8_general_ci for example.

More on this bug/behaviour at the mysql forums, doesn't look all that helpful.

What about a strtr to strip accents before insertion? Non-trivial but probably more performant than #18. However I guess this could really mess things up in some languages.

As mentioned, accents module does this but I've not installed it, and first look at the code was five minutes ago.

Log in or register to post comments

Comment #24

Gábor Hojtsy

he/him

Hungarian

Hungary

CreditAttribution: Gábor Hojtsy commented 7 February 2008 at 10:58

Well, "nevet" and "nevét" definitely means two different things. "nevet" == "laughs" or "that name" depending on the context :), while "nevét" == "her name" or "his name". I know from practice that most of the time, Hungarians are fine writing without accents, but if you remove accents, the results are sometimes funny/unexpected. Eg. "szár" means "stalk", while "szar" means "shit", so when you search for the first, you definitely don't want results on the second to appear.

It could be a fine decision that Drupal's built in search module will not be perfectly fitting for these rules. In fact, if you try http://www.google.com/search?q=szár you will see that both accented and non-accented results appear. So it would not be an unusual or (from user perspective) inherently unexpected path to take. For this to work, we would need to have a list of accented chars to replace with the non-accented versions.

Log in or register to post comments

Comment #25

catch

he/him

English

CreditAttribution: catch commented 7 February 2008 at 11:32

Also with many sites, even ones not using locale/content translate, we'll see "nevet" and "nevét" alongside "café" and "cafe", right here for example - so stripping diacritics makes sense to me for core.

Looked at utf8_decode() but of course that completely destroys non-latin characters, so useless.

Ideally this should happen with both indexing and searching - it currently does explicit searches for e and é with different result sets. If we strip when indexing, we have to do the same with searching. Google wasn't my friend for a nice clean function to do this unfortunately.

Log in or register to post comments

Comment #26

Damien Tournoud CreditAttribution: Damien Tournoud commented 7 February 2008 at 12:22

Status:

Active

» Needs review

File	Size
d6-search-collation.patch	1.07 KB

@Gábor#7: this is clearly a problem related to collations. I ran into it several weeks ago will testing the migration of http://drupalfr.org to Drupal 6, but failed to report it due to severe time constraints.

One way to solve that bug is to set the {search_index}.word column to utf8_bin_ci. I just validated on a test site that it solve the problem.

But, with this would mean that we would differentiate between different versions of a word (accented/not accented, etc.).

@catch#25: In fact, collation is not an enemy, it should be our friend. The implementation of the collation is a difficult work, and moreover language-specific. The one-size-fits-all 'utf8_general_ci' is not optimal, but should works well for most latin based languages. Doing a language specific collation and steeming should be in our work plan for D7.

In the meanwhile, because scores are additive, I suggest we do that:

    // The database will collate similar words (accented and non-accented forms, etc.),
    // and the score is additive, so first add and then insert.
    db_query("UPDATE {search_index} SET score = score + %d WHERE word = '%s' AND sid = '%d' AND type = '%s'", $score, $word, $sid, $type);
    if (!db_affected_rows()) {
      db_query("INSERT INTO {search_index} (word, sid, type, score) VALUES ('%s', %d, '%s', %f)", $word, $sid, $type, $score);
    }

Enclosed patch does exactly that.

Log in or register to post comments

Comment #27

cburschka

they

CreditAttribution: cburschka commented 7 February 2008 at 12:31

Applies, fixes the log warnings. Re-indexing a node containing "café cafe" will log errors without the patch, but not afterward.

A second review would be good.

Log in or register to post comments

Comment #28

catch

he/him

English

CreditAttribution: catch commented 7 February 2008 at 12:32

Patch removes the error. If I have two nodes like this:

1. café cafe
2. cafe

Searching for either café or cafe gives me the same result. However, the highlighting is still dependent on the specific string used.

This seems ok to me really, might even be a feature since it'd show results with lazy typing, but still differentiate in the nevet nevét case on display. Very nice!

Log in or register to post comments

Comment #29

catch

he/him

English

CreditAttribution: catch commented 7 February 2008 at 12:33

Status:

Needs review

» Reviewed & tested by the community

Looks like mine was the second review.

Log in or register to post comments

Comment #30

Gábor Hojtsy

he/him

Hungarian

Hungary

CreditAttribution: Gábor Hojtsy commented 7 February 2008 at 13:08

I am testing this patch live on Drupal.hu now, thanks! Will commit later today if all looks fine.

Log in or register to post comments

Comment #31

Gábor Hojtsy

he/him

Hungarian

Hungary

CreditAttribution: Gábor Hojtsy commented 7 February 2008 at 13:23

Status:

Reviewed & tested by the community

» Needs review

Did not solve it on my site. Looks like there is a problem on word length in the unique key as well:

Duplicate entry 'modositott-91-node' for key 1 query: INSERT INTO search_index (word, sid, type, score) VALUES ('modositottam', 91, 'node', 0.56910543131) - .../modules/search/search.module - 578.

Duplicate entry 'informatik-108-node' for key 1 query: INSERT INTO search_index (word, sid, type, score) VALUES ('informatika', 108, 'node', 4) - .../modules/search/search.module - 578.

...

See, that the key entry has a shorter string then the inserted word, so longer words can result in duplicates in the key. Anyone else can reproduce this problem? (Note that these words do not contain any non-English chars).

Log in or register to post comments

Comment #32

Damien Tournoud CreditAttribution: Damien Tournoud commented 7 February 2008 at 13:31

@Gábor: It looks like a different issue, perhaps related to the upgrade path. Could you dump the schema of the {search_index} table in your current database?

The default one from my empty D6 installation is:

CREATE TABLE IF NOT EXISTS `search_index` (
  `word` varchar(50) NOT NULL default '',
  `sid` int(10) unsigned NOT NULL default '0',
  `type` varchar(16) default NULL,
  `score` float default NULL,
  UNIQUE KEY `word_sid_type` (`word`,`sid`,`type`),
  KEY `sid_type` (`sid`,`type`),
  KEY `word` (`word`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

Is yours any different?

Log in or register to post comments

Comment #33

cburschka

they

CreditAttribution: cburschka commented 7 February 2008 at 13:34

Also cannot reproduce on fresh installation. I agree it's probably an update path problem where the unique key has to be reset to the full length of the field.

Log in or register to post comments

Comment #34

cburschka

they

CreditAttribution: cburschka commented 7 February 2008 at 14:06

Here's the key-adding code in system.install.

function system_update_6036() {
   ... // if mysql:
      $ret[] = update_sql("ALTER IGNORE TABLE {search_index} ADD UNIQUE KEY word_sid_type (word, sid, type)");
   ... // otherwise
      db_add_unique_key($ret, 'search_index', 'word_sid_type', array('word', 'sid', 'type'));

I was unable to find any relevant update on search_index that changed the length of the word field. So the cause is not yet identified.

Edit: Just for reference, my table structure is identical:

CREATE TABLE `search_index` (
  `word` varchar(50) NOT NULL default '',
  `sid` int(10) unsigned NOT NULL default '0',
  `type` varchar(16) default NULL,
  `score` float default NULL,
  UNIQUE KEY `word_sid_type` (`word`,`sid`,`type`),
  KEY `sid_type` (`sid`,`type`),
  KEY `word` (`word`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8

Log in or register to post comments

Comment #35

catch

he/him

English

CreditAttribution: catch commented 7 February 2008 at 14:32

I can't reproduce this key length issue either on a clean install.

CREATE TABLE `search_index` (
  `word` varchar(50) NOT NULL default '',
  `sid` int(10) unsigned NOT NULL default '0',
  `type` varchar(16) default NULL,
  `score` float default NULL,
  UNIQUE KEY `word_sid_type` (`word`,`sid`,`type`),
  KEY `sid_type` (`sid`,`type`),
  KEY `word` (`word`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 |

My live D5 database which has been around since 4.5 also doesn't have any explicit key lengths - just checked in case there might be some sticking around:

CREATE TABLE `search_index` (
  `word` varchar(50) NOT NULL default '',
  `sid` int(10) unsigned NOT NULL default '0',
  `type` varchar(16) default NULL,
  `fromsid` int(10) unsigned NOT NULL default '0',
  `fromtype` varchar(16) default NULL,
  `score` float default NULL,
  KEY `sid_type` (`sid`,`type`),
  KEY `from_sid_type` (`fromsid`,`fromtype`),
  KEY `word` (`word`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 |

Log in or register to post comments

Comment #36

keith.smith CreditAttribution: keith.smith commented 7 February 2008 at 14:35

If Gabor's post in #11 is current, the only difference I see is the line:

`type` varchar(16) default '',

Log in or register to post comments

Comment #37

cburschka

they

CreditAttribution: cburschka commented 7 February 2008 at 15:13

Didn't notice that. Well, with VARCHAR(50) and no explicit key length, I have no idea why the key gets truncated to 10 characters. Weird.

Log in or register to post comments

Comment #38

Gábor Hojtsy

he/him

Hungarian

Hungary

CreditAttribution: Gábor Hojtsy commented 7 February 2008 at 16:16

Status:

Needs review

» Reviewed & tested by the community

Hold on, I broke my own database while trying out fixes. I set it to varchar(10). Let me see a better test now with a correct table. Sorry for taking your time, I should have been more alert to what I change. Excuse me.

Log in or register to post comments

Comment #39

Gábor Hojtsy

he/him

Hungarian

Hungary

CreditAttribution: Gábor Hojtsy commented 7 February 2008 at 16:43

Version:

6.x-dev

» 7.x-dev

Works perfectly on drupal.hu as well, superb. Thanks for the patch. Committed to 6.x and RTBC for 7.x.

Log in or register to post comments

Comment #40

Damien Tournoud CreditAttribution: Damien Tournoud commented 2 April 2008 at 06:58

This is still missing for 7.x.

Log in or register to post comments

Comment #41

Dries CreditAttribution: Dries commented 2 April 2008 at 20:13

Status:

Reviewed & tested by the community

» Fixed

I've committed this to CVS HEAD. Thanks guys.

Log in or register to post comments

Comment #42

Anonymous (not verified) CreditAttribution: Anonymous commented 16 April 2008 at 20:22

Status:

Fixed

» Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.

Log in or register to post comments

Comment #43

daimhin CreditAttribution: daimhin commented 17 April 2008 at 19:09

Version:

7.x-dev

» 6.2

I am seeing this problems with my 6.2 install. Has the fix that is discussed in this posting been applied to the 6.2 core?

Log in or register to post comments

Comment #44

Damien Tournoud CreditAttribution: Damien Tournoud commented 17 April 2008 at 20:45

Version:

6.2

» 7.x-dev

This patch went in before the release of Drupal 6.0. Please open another issue with a detailed description of your problems.

Log in or register to post comments

Comment #45

rajmataj CreditAttribution: rajmataj commented 14 May 2008 at 23:32

Version:	7.x-dev	» 6.2
Status:	Closed (fixed)	» Active

...Need to open this ticket again because I don't believe this has been fixed. I have a new Drupal 6.2 install, and all was running fine for the past week. However, since a few days ago I'm getting numerous errors in my log, dozens of them, all logged within the same minute:

Duplicate entry 'personal-13-node' for key 1 query: INSERT INTO search_index (word, sid, type, score) VALUES ('personal', 13, 'node', 0.949597315436) in /home/bcims/public_html/modules/search/search.module on line 579.

I could definitely use some help in solving this. I was able to construct the template for Drupal 6.2 and am a novice at php and mysql. I can do some rudimentary work on the error but some assistance from more experienced programmers would be very appreciated.

There is a fellow who wrote this article on the problem: http://cmsreport.com/node/1189

Log in or register to post comments

Comment #46

catch

he/him

English

CreditAttribution: catch commented 14 May 2008 at 23:58

Status:

Active

» Closed (fixed)

RajP. Please open a new issue rather than re-opening this old one. It's likely your duplicate is caused by a slightly different version of the bug so will need to be fixed in a different issue. A quick fix for this is to truncate your search tables and reindex (you can do this via the ui by disabling and uninstalling search module then re-enabling it).

Log in or register to post comments

Comment #47

kenorb CreditAttribution: kenorb commented 16 January 2009 at 16:11

Version:

6.2

» 6.9

Is this one:

Duplicate entry &#039;1107-node&#039; for key 1 query: INSERT INTO search_dataset (sid, type, data, reindex) VALUES (1107, &#039;node&#039;, &#039; monter poszycia kadłubów samolotów zagranica referencja job1107 ocena oceń kiepsko ok dobra super extra twoja ocena brak &#039;, 0) w pliku /home/sites/co.uk/public_html/modules/search/search.module, linia 571.

Is related to this issue?

  // Insert cleaned up data into dataset
  db_query("INSERT INTO {search_dataset} (sid, type, data, reindex) VALUES (%d, '%s', '%s', %d)", $sid, $type, $accum, 0);