(Updated in #83)

Problem/Motivation

* First we had exceptions like SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '113-à' for key 'PRIMARY', e.g. when 113-à collided with 113-a in the index, as the two are equal under the utf8_general_ci collation
* Then all Search API tables were switched to utf8_bin collation by committing #46 in #60
* This fixes the duplicate-entry exceptions, but has the repercussion that sorting now follows binary collation: A..Z, a..z

Workarounds

* In #59 a sandbox preprocessor module was linked: Search API Transliterate

Proposed resolution

Find another way to fix the index collisions that does not kill case-insensitive sorting.
Brainstormed options:
* (best) change the index (and maybe the PHP insert code) in a way that allows any collation without index collisions
* use the db_merge approach from #58 so we can have any collation without index collisions (see the sketch after this list), and find similar solutions for other DB systems
* stick to binary collation but add an option to set the collation in the sort statement (how to do this in MySQL, PostgreSQL, SQLite, Oracle, SQL Server)
* use the general_ci collation but emulate it in PHP so we have no collisions
* use binary collation and do all normalization in PHP
* consider having "original word" and "collated word" columns
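
A rough sketch of the db_merge() idea, assuming the fulltext table layout used by search_api_db (columns item_id, word and score, with the primary key on item_id and word, as in the snippets further down this thread). With a merge, a collation that treats 'a' and 'à' as equal updates the existing row instead of raising a 1062 error:

foreach ($words as $word) {
  db_merge($table, $this->query_options)
    ->key(array(
      'item_id' => $id,
      'word' => $word['value'],
    ))
    // Insert the score for new rows ...
    ->fields(array('score' => $word['score'] * $boost))
    // ... but add it to the stored value when the key already exists.
    ->expression('score', 'score + :score', array(':score' => $word['score'] * $boost))
    ->execute();
}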

Remaining tasks

To thoroughly understand the original problem of index collisions we need some backtraces.
Important: we should understand whether the problem arises in regular indexes, in fulltext indexes, or in both.

This way we can understand if and how the collisions can be fixed without losing collation.
Note: the relevant method should be SearchApiDbService::indexItem().
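
A rough sketch of how such backtraces could be collected, assuming (as in the workaround snippets below) that the failing statement is an insert query built in that method:

try {
  $query->execute();
}
catch (PDOException $e) {
  // Log the failing table plus a backtrace, so we can tell whether the
  // collision happens in a regular or in a fulltext index table.
  watchdog('search_api_db', 'Duplicate entry in table @table: @message<br /><pre>@trace</pre>', array(
    '@table' => $table,
    '@message' => $e->getMessage(),
    '@trace' => $e->getTraceAsString(),
  ), WATCHDOG_ERROR);
  throw $e;
}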

User interface changes

API changes

Data model changes

Comments

drunken monkey’s picture

Priority: Normal » Major

Oh god, that's this ***** MySQL collation setting, most probably, which might count "a" and "à" as the same letter.
OK, then I guess we should specify the collation when creating tables for the database search. Asking how this works in #1144644: Enable specifying the collation when creating a database table. As a local fix for now, you can just change the collation for the relevant table (or only its "value" column) to "utf8_bin". The relevant table is "search_api_db_INDEX_FIELD", so in your case (and for the default node index): "search_api_db_default_node_index_body_value".
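
For reference, a rough sketch of the column-level variant (table name, column type and length are guesses; check the real definition with SHOW CREATE TABLE first, since MySQL's MODIFY must repeat the full column definition):

db_query("ALTER TABLE {search_api_db_default_node_index_body_value}
  MODIFY value VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL");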

I can't really tell why your workaround with the tokenizer shouldn't work, though.

miprof’s picture

OK,
you are very fast.
For the tokenizer, do I try [à] or "à"? Which is the correct syntax?

drunken monkey’s picture

Maybe try "[àÀ]". Apart from that, both variants should work.

miprof’s picture

Still the same. But this error is very strange:
I have another node of the same type, and when I put "à" in the same field there is no error ???

I will test more to find the context.

drunken monkey’s picture

The error will only occur when you have both "à" and "a" (or "á", "â", "ã", …) as individual tokens in the same field.

And, as said, please try changing the MySQL collation of the table or field.

miprof’s picture

Yes,
it's OK with #1.

Thanks a lot!

yareckon’s picture

sub

drunken monkey’s picture

Status: Active » Needs review

As per the suggestion in #1206126-16: SearchApiException: Unknown server profile_server, let's mention this issue in the README.

As for fixing this bug, I can't really do anything unless Drupal lets me influence the collation setting it uses, which pretty much makes fixing this in D7 impossible. If you want to support this at least for D8, try to create some publicity for #1144644: Enable specifying the collation when creating a database table.

drunken monkey’s picture

Title: SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '113-à' for key 'PRIMARY' » Problems with umlauts, accented characters, etc.
Project: Search API » Search API Database Search
Version: 7.x-1.x-dev » 7.x-1.0-beta1
Component: Database search » Code
Status: Needs review » Active

Moving this to the new module.

I also now committed the above patch to the README.txt. As said, fixing this seems pretty much impossible to me in D7, but maybe someone will finally come up with a solution, or at least a workaround.
In any case, leaving the issue open so people running into this problem will (hopefully) find it.

nattie’s picture

Hi, I had the same problem. To resolve it I added two verifications before the insert queries in the service.inc file. Hope it will help you.
Line 360:

$query = db_insert($table, $this->query_options)
  ->fields(array('item_id', 'word', 'score'));

foreach ($words as $word) {
  // Check whether this (item_id, word) pair is already indexed.
  $exists = db_select($table, 't')
    ->fields('t', array('score'))
    ->condition('item_id', $id)
    ->condition('word', $word['value'])
    ->execute()
    ->fetchField();

  if ($exists === FALSE) {
    $query->values(array(
      'item_id' => $id,
      'word' => $word['value'],
      'score' => $word['score'] * $boost,
    ));
    $query->execute();
  }
}
Line 402:

$exists = db_query('SELECT item_id FROM {' . $table . '} WHERE item_id = :id AND value = :value', array(
  ':id' => $id,
  ':value' => $value,
))->fetchField();

if ($exists === FALSE) {
  db_insert($table, $this->query_options)
    ->fields(array(
      'item_id' => $id,
      'value' => $value,
    ))
    ->execute();
}

drunken monkey’s picture

Just changing the table collation should help, too; it won't screw with your search results and also won't need any patches to the code.

tahiticlic’s picture

Well, I've changed all the search api index table collation, and that didn't help. Error is still there.

drunken monkey’s picture

Well, I've changed all the search api index table collation, and that didn't help. Error is still there.

Even after clearing the index? Then you should check manually which keys are seen as identical.
(Or maybe the columns still use the other collation, and you have to change that specifically for each "word" column … ?)

etiennechataignier’s picture

Same problem here.
I cleared the index.
I set collation to utf8_bin for tables search_api_db_*.

Then you should check manually which keys are seen as identical.

I'd like to.
The error log prints

SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '706-therond' for key 'PRIMARY'

From this information, how may I find the 2 keys?

drunken monkey’s picture

From this information, how may I find the 2 keys?

SELECT word FROM $table WHERE item_id = 706 AND word = 'therond'
This should give you the first word that got indexed. The second one is "therond".
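
More generally, here is an untested query that lists every group of tokens a case/accent-insensitive collation would fold together (run it against a table that already uses utf8_bin; the table name is a placeholder for your real index table):

$collisions = db_query("SELECT item_id, word COLLATE utf8_general_ci AS folded, GROUP_CONCAT(word) AS variants
  FROM {search_api_db_default_node_index_body_value}
  GROUP BY item_id, folded
  HAVING COUNT(*) > 1")->fetchAll();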

PedroMiguel’s picture

If you have already changed the table collation to utf8_bin and it still doesn't work, change the collation on the table columns as well.

Works for me.

yareckon’s picture

drunken monkey, could you add a test for the collation issue and have it complain with a Drupal message? It would probably catch more folks than an entry in the README.txt.

drunken monkey’s picture

I cannot really think of a good easy test there. The way to do this would probably be something like in #10, running a SELECT before each INSERT and thus considerably slowing down the (already rather slow) indexing operation. However, this way we could not only detect but also fix the error, I guess. So maybe we should do that after all, if this is such a frequent problem …

Also, carefully reading the README.txt before using a module is a very sensible practice that I want to encourage, rather than help people work around.

tahiticlic’s picture

Changing the collation for fields inside tables did the trick, thanks!

yareckon’s picture

drunken monkey, I'm not suggesting doing it at run time before each insert, just some sort of hook_requirements-like thing that would do a one-off test for this problem when folks try to activate/first configure an index with a MySQL backend. If it comes back busted, just throw up a status message like any other module.
Basically, if it fails: "Warning: to index non-ASCII characters you must change your MySQL collation settings. See the search_api README.txt for more information."

drunken monkey’s picture

drunken monkey, I'm not suggesting doing it at run time before each insert, just some sort of hook_requirements-like thing that would do a one-off test for this problem when folks try to activate/first configure an index with a MySQL backend. If it comes back busted, just throw up a status message like any other module.
Basically, if it fails: "Warning: to index non-ASCII characters you must change your MySQL collation settings. See the search_api README.txt for more information."

Ah, you're right, that would also be an option.

Or, come to think of it, we could inspect the exception, and if it's a duplicate key issue, log a more helpful error message.
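
A minimal sketch of that idea (the wrapped $query and the message wording are just placeholders, not the module's actual code):

try {
  $query->execute();
}
catch (PDOException $e) {
  // SQLSTATE 23000 (MySQL error 1062) indicates a duplicate key; point
  // the user at the collation hint in the README instead of only
  // dumping the raw exception.
  if ($e->getCode() == '23000') {
    watchdog('search_api_db', 'Duplicate entry while indexing. This usually means the MySQL collation treats accented and unaccented characters as equal; see the README.txt for how to change it.', array(), WATCHDOG_ERROR);
  }
  throw $e;
}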

etiennechataignier’s picture

#16
Thanks! It works.
For my next projects, I'll set utf8_bin from the very beginning of the Drupal install.

drunken monkey’s picture

For my next projects, I'll set utf8_bin from the very beginning of the Drupal install.

You mean for all tables? In that case, please be aware that this has some probably undesirable side effects, like not ignoring the case for sorting (example sort: A, B, C, a, b, c). Depending on your language this might not be a problem after all, though.

jefkin’s picture

Because I had to do a little SQL research to find the right syntax for this, I'm posting it here so others might not need to suffer as much.

This is no panacea, but it is a workable way to break through this limitation. Using your favorite MySQL client:

mysql> ALTER TABLE search_api_db_node_index_body_value CONVERT TO CHARACTER SET 'utf8' COLLATE 'utf8_bin';

Obviously you need to replace 'search_api_db_node_index_body_value' with your table name(s). I was running my index on a big old complicated data type with about 9000 nodes and wasn't sure which tables needed it, so I ended up running this statement for each of the different Search API index tables, getting the list with:

mysql> show tables like '%search%';

I didn't try it myself, but if you can't use MySQL directly, you might get it done with PHP (db_query() executes the statement itself, so no extra execute() call is needed):

db_query("ALTER TABLE {search_api_db_node_index_body_value} CONVERT TO CHARACTER SET 'utf8' COLLATE 'utf8_bin'");

perhaps via the Devel module's "Execute PHP" page.

Also note that (for me at least) the entire index had to be cleared after the collation change and the data re-indexed. Not ideal, but workable.

nils.destoop’s picture

Or if you have many Search API tables:

  $search_api_tables = db_query("SHOW TABLES LIKE '%search_api%'")->fetchCol();
  foreach ($search_api_tables as $table) {
    // The names returned by SHOW TABLES are already prefixed, so don't
    // wrap them in {} here.
    db_query("ALTER TABLE $table CONVERT TO CHARACTER SET 'utf8' COLLATE 'utf8_bin'");
  }

Ken Hawkins’s picture

Although I've set the search indexes to utf8_bin, I'm still getting complaints from the system that "Couldn't index items. Check the logs for details.", with the offending error relating to "1062 Duplicate entry".

Stranger still, before switching to utf8_bin the indexing simply failed; now I get the errors (and the item remains uncounted as indexed), yet the item appears appropriately in the search results.

SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '45-cafes' for key 'PRIMARY'

drunken monkey’s picture

@ #24, #25: You only need to use this with the 'search_api_db_%' tables. The ones from Search API itself are fine as they are. On the other hand, it shouldn't do any harm, so it's probably fine either way.

@ #26: Very strange. Did you clear the index after changing the collation?

freakalis’s picture

This works great for me, but is it possible to add COLLATE 'utf8_bin' to all tables created by search_api_db, so that we don't need to run this "fix" manually?

freakalis’s picture

Seems like db_create_table() does not allow setting the collation on a table.

Here is a patch that tries to resolve this problem by setting the collation to utf8_bin after the table is created.

An update function is also included that converts all existing search_api_db tables to utf8_bin collation.

freakalis’s picture

Status: Active » Needs review

Forgot to change the status to "needs review"...

moniuch’s picture

I set utf8_bin on all search_api tables, cleared my node index and re-created it. The errors are still there. When I compared the error messages with the contents of the nodes, I started to wonder how Search API copes with situations where a word occurs more than once in a node.
In my case the index failed with the message "866-możliwość" (this is a Polish word). When I looked at node/866 I saw this word appears three times. Isn't this a limitation? Is there a setting in the UI that I neglected to check that might cure this?

tahiticlic’s picture

You have to set utf8_bin on fields also, not only tables (text fields at least).

moniuch’s picture

True, that helped, thx

drunken monkey’s picture

If you also set the collation on tables of string fields (not only on fulltext ones), that will make sorting on those fields case-sensitive. So, for fixing this problem, you should probably only change the collation of the "word" column in the tables of fulltext fields.

@ #29: Thanks for the patch! If this is also possible for single columns, it might really be a good idea to do this.
However, does this work for all DBMSs the same, or only for MySQL?

aendra’s picture

I've tried changing all search_api-related tables to utf8_bin, and have cleared the index; alas, I still get the errors.

As per #32, am I to take it that, if I have a field named "field_blabla", I should change the collation on the table "field_data_field_blabla" to utf8_bin as well?

Thanks!

tahiticlic’s picture

You have to set utf8_bin on fields also, not only tables (text fields at least).

modestmoes’s picture

#29 worked for me. Thanks!

sylus’s picture

Rolled the patch against the latest dev release (July 2nd) - commit: a6203a3

checker’s picture

Perhaps you could add something like a variable_get() switch to enable or disable this, in case the feature is not needed by everyone.

sylus’s picture

Can confirm that the patch breaks Postgres installs:

SQLSTATE[42601]: Syntax error: 7 ERROR: syntax error at or near "CONVERT" LINE 1: ALTER TABLE search_api_db_database_node_index_nid CONVERT TO... ^

rob_johnston’s picture

It will also break an MS SQL Server install:

SQLSTATE[42000]: [Microsoft][SQL Server Native Client 10.0][SQL Server]Incorrect syntax near the keyword 'CONVERT'.

I believe that COLLATE can only be applied to the database as a whole or to a column of a table, but not to a table (as in the patch). It gets worse...

No version of SQL Server supports UTF-8 and from what I've read, depending on what one wishes to do, it should be UTF-16 or an nvarchar field (which uses UCS-2) or varbinary... all have pluses and minuses.

There is another way of doing it, which is to specify the collation when doing the select, such as:

SELECT Place FROM Locations ORDER BY Place
COLLATE Traditional_Spanish_ci_as ASC;

Would something similar to this method work with other database systems (ci = case insensitive, as = accent sensitive)?

Anyone who knows more about SQL Server can feel free to chime in.
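
For MySQL, at least, the same per-query collation override exists. A rough sketch (table and column names are placeholders):

// Sort a utf8_bin column case-insensitively for one query only. Note
// that MySQL cannot use the column's index for a non-default collation,
// so this can be slow on large tables
// (http://dev.mysql.com/worklog/task/?id=826).
$result = db_query("SELECT item_id, value
  FROM {search_api_db_default_node_index_title}
  ORDER BY value COLLATE utf8_general_ci ASC");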

rob_johnston’s picture

An update to the patch by @sylus to execute different SQL depending on the database being used.

pvhee’s picture

Status: Needs review » Reviewed & tested by the community

Can confirm that #42 is working fine with the latest DEV release. RTBC?

checker’s picture

I'm also using patch #42 without problems (under MySQL). Other DBMSs could easily be added later if there are issues.

geek-merlin’s picture

This patch solved my "integrity constraint violations" after deleting and recreating the DB backend.
Re-rolled it with bogus whitespace removed, but no other changes.

drunken monkey’s picture

Version: 7.x-1.0-beta1 » 7.x-1.x-dev
Status: Reviewed & tested by the community » Needs review

Well, I guess a fix for MySQL is better than nothing … If someone is interested in making this work on other DBMSs, they can provide another patch.

I've just cleaned the patch up a bit, removing the dummy code for other DBMSs, using the right database and also fixing the update hook to use the correct data.
Please review and confirm that this still fixes your problems.

Also, as mentioned before, this will also make sorting case-sensitive, which usually isn't desired. We could easily just fix the tables for fulltext fields in this way (or also for multi-valued fields, as they cannot be searched, either) which would remove this problem – however, there could theoretically be clashes in normal fields too, in some rare cases.
So, what would you prefer: being able to sort case-insensitively, or getting rid of this bug (on MySQL servers) also in edge cases?

Please also see #1794374: Searching for "...é..." yields other results than searching for "...e..." for a half-related issue. It would suggest that we should maybe not counteract MySQL's default collation, but rather emulate it. A better solution would probably be to provide a preprocessor for that, though.

rob_johnston’s picture

The first way of determining the database system in #42, using $database = $databases['default']['default'];, was completely invented by me, and I've been keeping my eyes open for an official way of doing it. I see that the latest patch uses:
$db_driver = $databases[$key][$target]['driver'];

I recently saw that the Views module, in its /includes/handlers.inc file, determines the database with:
$db_type = Database::getConnection()->databaseType();

Just thought I'd add that, as I don't know which is better. Both return a value like 'mysql', 'pgsql', 'sqlsrv', etc.

You could probably change the switch-case statement in the service.inc file to be a simple if ($db_driver === 'mysql') as well since none of the other cases are being used.

drunken monkey’s picture

The first way of determining the database system in #42, using $database = $databases['default']['default'];, was completely invented by me, and I've been keeping my eyes open for an official way of doing it. I see that the latest patch uses: $db_driver = $databases[$key][$target]['driver'];

Yes, we have to do it that way as we allow other databases than the default one to be used for the search server.

You could probably change the switch-case statement in the service.inc file to be a simple if ($db_driver === 'mysql') as well since none of the other cases are being used.

Yes, but in contrast to the update hook, this might actually later be filled, so I thought I'd leave it. It's not a performance problem in any way, but provides a visual reminder of what's missing.

So, does the patch work for you as expected?

rob_johnston’s picture

I guess that was just a long-winded question. No hurry, but can you tell me or point me to something that talks about the difference between $db_driver = $databases[$key][$target]['driver']; and $db_type = Database::getConnection()->databaseType();? They both return the identical string. I'm going to be making changes to handle sqlsrv in a couple more modules.

Sylus is handling the install profile I'm using and I'm doing some of the sqlsrv work. If the patch doesn't work after he includes it then I'll let somebody know.

dimarick’s picture

This patch works fine for me.
Installation:
1. Apply the patch to search_api_db/service.inc
2. Clear the index (but do not reindex yet)
3. Index the content
4. See no errors!

But I don't know where to get a full collation table like MySQL's utf8_general_ci.

davvid’s picture

#50 works, but it's probably useless if you want to search with words that contain replaced characters. Am I right?

milesw’s picture

Patch #46 applied cleanly against 7.x-1.0-beta4 and fixes my MySQL indexing errors related to accents. Thanks for the patch!

Vacilando’s picture

I concur with @milesw in #52: "Patch #46 applied cleanly against 7.x-1.0-beta4 and fixes my MySQL indexing errors related to accents. Thanks for the patch!"

checker’s picture

#46 fixes my db errors with umlauts.

mansspams’s picture

Status: Needs review » Reviewed & tested by the community

#46 applies and works. Better solution still needed.

sardbaba’s picture

#46 worked for me.

Steps:

  1. Download and apply patch at #46
  2. Run update.php (the patch invokes hook_update_N)
  3. Clear the index
  4. Index the content

Maico de Jong’s picture

#46 worked for me.

almul0’s picture

I was dealing with this issue this evening and tested a lot of utf8 collations on the search_api_db_* tables. Since my problem occurred when trying to index the content, I finally swapped the db_insert statements for db_merge statements.

These statements prevent the index failure when indexing a set of words like '--a---' and '--á---', which are the same to the database under the default collation.

Before, I was applying the utf8_bin collation; with this patch, I reverted the tables to their original DEFAULT collation.

I am using it together with the patch for partial matching (Add option for partial matching); it works quite well, and now my Views search with an exposed fulltext filter works well, returning content case- and accent-insensitively.

IMPORTANT: I made this patch for a case- and accent-insensitive search.

almul0’s picture

Finally, thanks to grisendo, I have found a fix for indexing problematic words. It is a sandbox project that transliterates the words; it is a Search API processor, and it works fine.

I can revert my patch for my purposes, but maybe it will be useful for others.

Search API Transliterate

drunken monkey’s picture

Title: Problems with umlauts, accented characters, etc. » Fix problems with umlauts, accented characters, etc.
Status: Reviewed & tested by the community » Active

Finally, thanks to grisendo, I have found a fix for indexing problematic words. It is a sandbox project that transliterates the words; it is a Search API processor, and it works fine.

Ah, cool! Great to know this exists, could surely be helpful for some. A pity that it's only a sandbox project for now …

Anyway, regarding the issue at hand: thank you all a lot for testing, and sorry it took me a while to get back to this. I have committed #46 now; hopefully this will help most people. Setting the issue back to "active", though, because code for the other DBMSs is still needed.

ndf’s picture

Tested patch #46 and sandbox module #59 on an index with 1000+ nodes, with Search API facets enabled.
Both the patch and the module do what they say they do.

I prefer patch #46 over the module from #59, because the module removes all special characters from the index; therefore Search API facets and views-fields:node-index won't show these special characters anymore.

If you want to show special characters but ignore them while searching, you need the patch. Good that it's committed to dev!

brahimmouhamou’s picture

Patch #46 works like a charm! Thanks!

cmonnow’s picture

I didn't see it written anywhere, but perhaps it can be suggested in the README (or at least noted here) that for PostgreSQL you would set (ALTER) the string character columns of the "search_api_db_" tables to collation "C" (COLLATE "C"). I haven't actually tested anything, though, and don't even know the default behaviour. Just putting it out there for discussion.

drunken monkey’s picture

Issue summary: View changes

Thanks for the note, good to know!
Do you have a Postgres database for testing? If so, and you can tell me the correct syntax for this ALTER statement, then I could add this fix to the module. Would be great to be able to do this out of the box for Postgres users, too!

cmonnow’s picture

Hi drunken monkey,

Unfortunately I'm new to PostgreSQL myself and, like I guess most people here, converted for the PostGIS features.

I don't know how (or if) it's possible to change the collation without defining the column type as well, e.g.

ALTER TABLE search_api_db_database_node_index_body_summary
   ALTER COLUMN word SET DATA TYPE character varying(50) COLLATE pg_catalog."C"

is the basic syntax that works for me (at least on empty columns); the type in this case was redefined to what it already was (character varying(50)). I haven't actually tested the search indexes themselves at all.

I haven't reached the stage of my project where I index foreign languages yet, but it seems the dilemma of mapping similar letters, as described in https://drupal.org/node/1794374, would be essential for an international site that involves "cross-cultural" communications.

drunken monkey’s picture

Status: Active » Needs review

Ah, excellent, that's all I needed to know!
Can you then please test the attached patch and see if it succeeds in producing tables with the correct collation? Thanks!

(Before committing we should maybe just wrap the whole switch in a try/catch – for PSQL, there are not enough testers to be sure whether this also works in 8.x.)

cmonnow’s picture

Hi drunken monkey,

I can confirm that the patch worked as expected in a clean, standard Drupal 7.26 install using PostgreSQL 9.2 (with 2 nodes indexed).

The "word" column was successfully assigned a collation of "C". I tested the imaginary word "bànànà" in an article and it was found using the string "bànànà" (and not "banana", as expected).

(Aside: I wish there were more PostgreSQL users in general, since there's only a fraction of the user-friendly info out there compared with MySQL.)

Cheers

cmonnow’s picture

And yep, the try/catch sounds like a great idea for development queries. I suppose we can use the Database API's built-in error handling/reporting as well, e.g.:

try {
  switch ($this->connection->databaseType()) {
    // All the collation fixes.
  }
}
catch (Exception $e) {
  watchdog('search_api_db', 'Search API Database Search is unable to alter the table/column collation (needed to support universal character indexing). Message: @message, query: @query', array('@message' => $e->getMessage(), '@query' => $e->query_string), WATCHDOG_ERROR);

  // Optionally also present the message on screen, since it's a back-end error.
  drupal_set_message(t('Table/column collation could not be altered to support universal character sets. Message: %message, query: %query',
    array('%message' => $e->getMessage(), '%query' => $e->query_string)), 'error');
}

drunken monkey’s picture

Status: Needs review » Fixed

Good to hear! I've committed this fix now.
I think if anyone wants a less well-known DBMS supported, they should create a new issue – this one has been used enough.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

miro_dietiker’s picture

The project front page still contains a hint that there may "currently be problems", with a reference to this issue.
I guess you should update it as well.

Mykolas’s picture

Category: Bug report » Feature request

What if I want the search to be accent-insensitive? COLLATE in the search query seems to work, but it is very slow, because the 'word' column has an index with utf8_bin collation and MySQL does not support indexes with custom collations (http://dev.mysql.com/worklog/task/?id=826).

jamsilver’s picture

FYI: I've raised a related issue about ensuring sorting on a string field always occurs case insensitively, since the change here moved us over to utf8_bin collation which forces a case-sensitive sort. See: #2793009: Sorting on a string field should be case insensitive in mysql, but is case sensitive.

geek-merlin’s picture

Title: Fix problems with umlauts, accented characters, etc. » Fix character collation problems
Issue summary: View changes

Now we have #2793009: Sorting on a string field should be case insensitive in mysql, but is case sensitive.
This is a PITA.

I read and grokked the whole thread, and I'm quite sure that we need a deeper understanding and a more thorough approach.
Find ideas in the issue summary above.

geek-merlin’s picture

After some thinking, my gut feeling is:
* have one column with binary collation and use it in the index
* have one column with general collation and use it for sorting
(both columns would contain the same word)
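
A rough sketch of what that could look like (all names are made up, and the ALTER statement is MySQL-only):

// 'word' keeps binary collation and stays in the primary key, so no
// collisions; 'word_sort' gets a case-insensitive collation and is only
// used for ORDER BY. Both columns receive the same token on insert.
$schema = array(
  'fields' => array(
    'item_id' => array('type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE),
    'word' => array('type' => 'varchar', 'length' => 50, 'not null' => TRUE),
    'word_sort' => array('type' => 'varchar', 'length' => 50, 'not null' => TRUE),
    'score' => array('type' => 'float', 'not null' => TRUE),
  ),
  'primary key' => array('item_id', 'word'),
);
db_create_table('search_api_db_example_index_body', $schema);
db_query("ALTER TABLE {search_api_db_example_index_body}
  MODIFY word VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  MODIFY word_sort VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL");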

geek-merlin’s picture

Issue summary: View changes

Added some summary links for "collation in select/sort"; that may be even better.