(Updated in #83)
Problem/Motivation
* First we had exceptions like SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '113-à' for key 'PRIMARY' when, e.g., "113-à" collided with "113-a" in the index, as these are equal under the utf8_general_ci collation
* Then all Search API DB tables were switched to utf8_bin collation when #46 was committed in #60
* This fixes the duplicate-index exceptions, but has the repercussion that sorting now uses binary collation and is case-sensitive: A..Z, a..z
Workarounds
* In #59 a sandbox preprocessor module was linked: Search API Transliterate
Proposed resolution
Find another way to fix index collisions which does not kill case-insensitive sorting.
Possible approaches (brainstormed):
* (best) change the index (and maybe the PHP insert code) in a way that allows any collation without index collisions
* use the db_merge approach from #58 so we can have any collation without index collisions (and find similar solutions for other db systems)
* stick to binary collation but add an option to set the collation in the sort statement (how to do this in MySQL, Postgres, SQLite, Oracle, SQL Server?)
* use general-ci collation but emulate it in php so we have no collisions
* use binary collation and do all normalization in php
* consider having "original word" and "collated word" columns
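As a concrete illustration of the "set collation in the sort statement" idea above: on MySQL, a per-query collation override keeps the binary index intact while sorting case-insensitively. The table and column names below are only examples, not a worked-out schema.

```sql
-- Storage and index stay utf8_bin (no duplicate-key collisions),
-- but this one query sorts case-insensitively.
SELECT item_id, word
FROM search_api_db_default_node_index_body_value
ORDER BY word COLLATE utf8_general_ci;
```

Whether the other supported DBMSs offer an equivalent per-query override is exactly the open question in the list above.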
Remaining tasks
To thoroughly understand the original problem of index collisions we need some backtraces.
Important: we should understand whether the problem arises in regular indexes, fulltext indexes, or both.
This way we can understand if and how the collisions can be fixed without losing collation.
Note: The relevant method should be \SearchApiDbService::indexItem().
Comments
Comment #1
drunken monkey commented:
Oh god, that's this ***** MySQL collation setting, most probably, which might count "a" and "à" as the same letter.
OK, then I guess we should specify the collation when creating tables for the database search. I am asking how this works in #1144644: Enable specifying the collation when creating a database table. As a local fix for now, you can just change the collation of the relevant table (or only its "value" column) to "utf8_bin". The relevant table is "search_api_db_INDEX_FIELD", so in your case (and for the default node index): "search_api_db_default_node_index_body_value".
I can't really tell why your workaround with the tokenizer shouldn't work, though.
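For reference, the local fix described above might look like this on MySQL (the table name is the default node index's body table mentioned above; check the actual column type with SHOW CREATE TABLE before running the column variant):

```sql
-- Whole table:
ALTER TABLE search_api_db_default_node_index_body_value
  CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;
-- Or only the "value" column (adjust the type to match the existing one):
ALTER TABLE search_api_db_default_node_index_body_value
  MODIFY value VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_bin;
```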
Comment #2
miprof commented:
OK, you are very fast!
For the tokenizer, should I try [à] or "à"? Which is the correct syntax?
Comment #3
drunken monkey commented:
Maybe try "[àÀ]". Apart from that, both variants should work.
Comment #4
miprof commented:
The same error – but it is very strange: I have another piece of content on the same node, and when I put "à" in the same field there is no error.
I will test more to find the context.
Comment #5
drunken monkey commented:
The error will only occur when you have both "à" and "a" (or "á", "â", "ã", …) as individual tokens in the same field.
And, as said, please try changing the MySQL collation of the table or field.
Comment #6
miprof commented:
Yes, it works with the fix from #1.
Thanks a lot!
Comment #7
yareckon commented:
Subscribing.
Comment #8
drunken monkey commented:
As per the suggestion in #1206126-16: SearchApiException: Unknown server profile_server, let's mention this issue in the README.
As for fixing this bug, I can't really do anything unless Drupal lets me influence the collation setting it uses, which pretty much makes fixing this in D7 impossible. If you want to support this at least for D8, try to create some publicity for #1144644: Enable specifying the collation when creating a database table.
Comment #9
drunken monkey commented:
Moving this to the new module.
I also now committed the above patch to the README.txt. As said, fixing this seems pretty impossible to me in D7, but maybe someone will finally come up with a solution, or at least a workaround.
In any case, leaving the issue open so people running into this problem will (hopefully) find it.
Comment #10
nattie commented:
Hi, I had the same problem. To resolve it I added two verification queries before the insert requests in the service.inc file. Hope it will help you.
Line 360:
foreach ($words as $word) {
  // Only insert the word if no equal (under the current collation) word has
  // already been indexed for this item. fetchField() returns FALSE when no
  // matching row exists yet.
  $exists = db_select($table, 't')
    ->fields('t', array('score'))
    ->condition('item_id', $id)
    ->condition('word', $word['value'])
    ->execute()
    ->fetchField();
  if ($exists === FALSE) {
    db_insert($table, $this->query_options)
      ->fields(array(
        'item_id' => $id,
        'word' => $word['value'],
        'score' => $word['score'] * $boost,
      ))
      ->execute();
  }
}
Line 402:
// Use placeholders to avoid SQL injection; {} lets Drupal prefix the table.
$exists = db_query('SELECT 1 FROM {' . $table . '} WHERE item_id = :id AND value = :value',
  array(':id' => $id, ':value' => $value))->fetchField();
if ($exists === FALSE) {
  db_insert($table, $this->query_options)
    ->fields(array(
      'item_id' => $id,
      'value' => $value,
    ))
    ->execute();
}
Comment #11
drunken monkey commented:
Just changing the table collation should help, too, won't screw with your search results and also won't need any patches to the code.
Comment #12
tahiticlic commented:
Well, I've changed the collation of all the Search API index tables, and that didn't help. The error is still there.
Comment #13
drunken monkey commented:
Even after clearing the index? Then you should check manually which keys are seen as identical.
(Or maybe the columns still use the other collation, and you have to change that specifically for each "word" column … ?)
Comment #14
etiennechataignier commented:
Same problem here.
I cleared the index and set the collation to utf8_bin for the search_api_db_* tables, but the error log still prints the duplicate entry exception.
From this information, how may I find the two colliding keys?
Comment #15
drunken monkey commented:
SELECT word FROM $table WHERE item_id = 706 AND word = 'therond'
This should give you the first word that got indexed. The second one is "therond".
Comment #16
PedroMiguel commented:
If you have already changed the table collation to utf8_bin and it still doesn't work, also change the collation on the individual columns.
That works for me.
Comment #17
yareckon commented:
drunken monkey, could you add a test for the collation issue and have it complain with a Drupal message? That would probably catch more folks than an entry in the README.txt.
Comment #18
drunken monkey commented:
I cannot really think of a good easy test there. The way to do this would probably be something like in #10, running a SELECT before each INSERT and thus considerably slowing down the (already rather slow) indexing operation. However, this way we could not only detect but also fix the error, I guess. So maybe we should do that after all, if this is such a frequent problem …
Also, carefully reading the README.txt before using a module is a very sensible practice that I want to encourage, rather than help people work around.
Comment #19
tahiticlic commented:
Changing the collation for the columns inside the tables did the trick, thanks!
Comment #20
yareckon commented:
drunken monkey, I'm not suggesting doing it at run time before each insert, just some sort of hook_requirements()-like thing that would do a one-off test for this problem when folks try to activate or first configure an index with a MySQL backend. If it comes back busted, just throw up a status message like any other module.
Basically, if it fails: "Warning: to index non-ASCII characters you must change your MySQL collation settings. See the search_api README.txt for more information."
Comment #21
drunken monkey commented:
Ah, you're right, that would also be an option.
Or, come to think of it, we could inspect the exception, and if it's a duplicate key issue we could log a more helpful error message.
Comment #22
etiennechataignier commented:
Re #16: Thanks! It works.
For my next projects, I'll set utf8_bin from the very beginning of the Drupal install.
Comment #23
drunken monkey commented:
You mean for all tables? In that case, please be aware that this has some probably undesirable side effects, like not ignoring the case for sorting (example sort: A, B, C, a, b, c). Depending on your language this might not be a problem after all, though.
Comment #24
jefkin commented:
Because I had to do a little SQL research to find the right syntax for this, I'm posting it here so others might not need to suffer as much.
This is no panacea, but it is a workable way to break through this limitation. Using your favorite MySQL client:
mysql> ALTER TABLE search_api_db_node_index_body_value CONVERT TO CHARACTER SET 'utf8' COLLATE 'utf8_bin';
Obviously you need to replace 'search_api_db_node_index_body_value' with your table name(s). I was running my index on a big, complicated data type with about 9000 nodes and wasn't sure which tables needed it, so I ended up running this statement for each of the different Search API index tables, getting the list with:
mysql> show tables like '%search%';
I didn't try it myself, but if you can't use MySQL directly, you might be able to run the statement from PHP, perhaps via the Devel module's "Execute PHP" page.
Also note that (for me at least) the entire index had to be cleared after the collation change and the data re-indexed. Not ideal, but workable.
Comment #25
nils.destoop commented:
Or if you have many Search API tables:
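(The original snippet was not preserved. A statement generator along these lines is one common way to handle many tables at once; it outputs a list of ALTER statements to review and run manually:)

```sql
SELECT CONCAT('ALTER TABLE `', table_name,
              '` CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;')
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name LIKE 'search\_api\_db\_%';
```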
Comment #26
Ken Hawkins commented:
Although I've set the search indexes to utf8_bin, I'm still getting complaints from the system that "Couldn't index items. Check the logs for details.", with the offending error relating to "1062 Duplicate entry".
Stranger still, before switching to utf8_bin the indexing simply failed; now I get the errors (and the item remains unindexed in the counter), yet the items appear appropriately in the search results.
Comment #27
drunken monkey commented:
@ #24, #25: You only need to use this with the 'search_api_db_%' tables. The ones from Search API itself are fine as they are. On the other hand, it shouldn't do any harm, so it's probably fine either way.
@ #26: Very strange. Did you clear the index after changing the collation?
Comment #28
freakalis commented:
This worked great for me, but is it possible to add COLLATE 'utf8_bin' to all tables created by search_api_db, so that we don't need to run this "fix" manually?
Comment #29
freakalis commented:
It seems db_create_table() does not allow setting the collation of a table.
Here is a patch that tries to resolve this problem by setting the utf8_bin collation after the table is created.
An update function is also included that updates all existing search_api_db tables to utf8_bin collation.
Comment #30
freakalis commented:
Forgot to change the status to "needs review" …
Comment #31
moniuch commented:
I set utf8_bin on all search_api tables, cleared my node index and re-created it. The errors are still there. When I compared the error messages with the contents of the nodes, I started to wonder how search_api copes with situations where a word occurs more than once in a node.
In my case the index failed with the message "866-możliwość" (a Polish word). When I looked at node/866 I saw this word appears three times. Isn't this a limitation? Is there a setting in the UI that I neglected to check that might cure this?
Comment #32
tahiticlic commented:
You have to set utf8_bin on fields also, not only tables (text fields at least).
Comment #33
moniuch commented:
True, that helped, thanks!
Comment #34
drunken monkey commented:
If you also set the collation on the tables of string fields (not only on fulltext ones), that will make sorting on those fields case-sensitive. So, for fixing this problem, you should probably only change the collation of the "word" column in the tables of fulltext fields.
@ #29: Thanks for the patch! If this is also possible for single columns, it might really be a good idea to do this.
However, does this work the same for all DBMSs, or only for MySQL?
Comment #35
aendra commented:
I've tried changing all search_api-related tables to utf8_bin, and have cleared the index; alas, I still get the errors.
As per #32, am I to take it that, if I have a field named "field_blabla", I should change the collation on the table "field_data_field_blabla" to utf8_bin also?
Thanks!
Comment #36
tahiticlic commented:
You have to set utf8_bin on fields also, not only tables (text fields at least).
Comment #37
modestmoes commented:
#29 worked for me. Thanks!
Comment #38
sylus commented:
Rolled the patch against the latest dev release (July 2nd) – commit a6203a3.
Comment #39
checker commented:
Perhaps you can add something like a variable_get() toggle to enable or disable this, in case the feature is not needed by everyone.
Comment #40
sylus commented:
I can confirm that the patch breaks Postgres installs:
Comment #41
rob_johnston commented:
It will also break an MS SQL Server install:
SQLSTATE[42000]: [Microsoft][SQL Server Native Client 10.0][SQL Server]Incorrect syntax near the keyword 'CONVERT'.
I believe that COLLATE can only be applied to the database as a whole or to a column of a table, but not to a table (as in the patch). It gets worse …
No version of SQL Server supports UTF-8, and from what I've read, depending on what one wishes to do, it should be UTF-16, or an nvarchar field (which uses UCS-2), or varbinary … all have pluses and minuses.
There is another way of doing it, which is to specify the collation when doing the SELECT, such as:
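(The original example was not preserved. A per-query collation override in T-SQL might look like the following; the table name is only illustrative, and the exact collation to pick depends on the server's defaults:)

```sql
SELECT item_id, word
FROM search_api_db_default_node_index_body_value
WHERE word = N'therond' COLLATE Latin1_General_CI_AS;
```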
Would something similar to this method work with other database systems (ci = case insensitive, as = accent sensitive)?
Anyone who knows more about SQL Server can feel free to chime in.
Comment #42
rob_johnston commented:
An update to the patch by @sylus, to execute different SQL depending on the database being used.
Comment #43
pvhee commented:
Can confirm that #42 is working fine with the latest dev release. RTBC?
Comment #44
checker commented:
I'm also using patch #42 without problems (under MySQL). Other DBMSs could easily be added later if there are issues.
Comment #45
geek-merlin commented:
This patch solved my "integrity constraint violations" after deleting and recreating the DB backend.
I re-rolled it with bogus whitespace removed, but no other changes.
Comment #46
drunken monkey commented:
Well, I guess a fix for MySQL is better than nothing … If someone is interested in making this work on other DBMSs, they can provide another patch.
I've just cleaned the patch up a bit, removing the dummy code for other DBMSs, using the right database and also fixing the update hook to use the correct data.
Please review and confirm that this still fixes your problems.
Also, as mentioned before, this will make sorting case-sensitive, which usually isn't desired. We could easily just fix the tables for fulltext fields in this way (or also for multi-valued fields, as they cannot be sorted on, either), which would remove this problem – however, there could theoretically be clashes in normal fields too, in some rare cases.
So, what would you prefer: being able to sort case-insensitively, or getting rid of this bug (on MySQL servers) also in edge cases?
Please also see #1794374: Searching for "...é..." yields other results than searching for "...e..." for a half-related issue. It suggests that we should maybe not counteract MySQL's default collation, but rather emulate it. A better solution would probably be to provide a preprocessor for that, though.
Comment #47
rob_johnston commented:
The first way of determining the database system in #42, using
$database = $databases['default']['default'];
was completely invented by me, and I've been keeping my eyes open for an official way of doing it. I see that the latest patch uses:
$db_driver = $databases[$key][$target]['driver'];
I recently saw that the Views module, in its includes/handlers.inc file, determines the database with:
$db_type = Database::getConnection()->databaseType();
Just thought I'd add that, as I don't know which is better. Both return a value like 'mysql', 'pgsql', 'sqlsrv', etc.
You could probably change the switch-case statement in the service.inc file to a simple
if ($db_driver === 'mysql')
as well, since none of the other cases are being used.
Comment #48
drunken monkey commented:
Yes, we have to do it that way, as we allow databases other than the default one to be used for the search server.
Yes, but in contrast to the update hook, this might actually be filled later, so I thought I'd leave it. It's not a performance problem in any way, but provides a visual reminder of what's missing.
So, does the patch work for you as expected?
Comment #49
rob_johnston commented:
I guess that was just a long-winded question. No hurry, but can you tell me, or point me to something that talks about, the difference between
$db_driver = $databases[$key][$target]['driver'];
and
$db_type = Database::getConnection()->databaseType();
? They both return the identical string. I'm going to be making changes to handle sqlsrv in a couple more modules.
Sylus is handling the install profile I'm using, and I'm doing some of the sqlsrv work. If the patch doesn't work after he includes it, then I'll let somebody know.
Comment #50
dimarick commented:
This patch works fine for me.
Installation:
1. Apply the patch to search_api_db/service.inc
2. Clear the index (but do not reindex yet)
3. Index the content
4. See no errors!
But I don't know where to get a full collation table like MySQL's utf8_general_ci.
Comment #51
davvid commented:
#50 works, but it's probably useless if you want to search with words that contain replaced characters. Am I right?
Comment #52
milesw commented:
Patch #46 applied cleanly against 7.x-1.0-beta4 and fixes my MySQL indexing errors related to accents. Thanks for the patch!
Comment #53
Vacilando commented:
I concur with @milesw in #52: "Patch #46 applied cleanly against 7.x-1.0-beta4 and fixes my MySQL indexing errors related to accents. Thanks for the patch!"
Comment #54
checker commented:
#46 fixes my DB errors with umlauts.
Comment #55
mansspams commented:
#46 applies and works. A better solution is still needed.
Comment #56
sardbaba commented:
#46 worked for me.
Steps:
Comment #57
Maico de Jong commented:
#46 worked for me.
Comment #58
almul0 commented:
I was dealing with this issue this evening and tested a lot of utf8 collations on the search_api_db_* tables. Since my problem occurred while indexing content, I finally swapped the db_insert statements for db_merge statements.
These statements prevent the index failure when indexing a set of words like '--a---' and '--á---', which are the same for the database under the default collation.
Before, I was applying the utf8_bin collation; with this patch I reverted the tables to their original DEFAULT collation.
I am using it together with the partial matching patch from "Add option for partial matching", and it works quite well: my Views search using an exposed fulltext form field now returns content case- and accent-insensitively.
IMPORTANT: I made this patch for a case- and accent-insensitive search.
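For MySQL specifically, db_merge() compiles to roughly the following, which shows why the duplicate-key exception disappears even under utf8_general_ci (table name and values are only illustrative):

```sql
-- If 'à' collides with an already-indexed 'a' under the table's collation,
-- the existing row is updated instead of raising error 1062.
INSERT INTO search_api_db_default_node_index_body_value (item_id, word, score)
VALUES (113, 'à', 1.0)
ON DUPLICATE KEY UPDATE score = VALUES(score);
```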
Comment #59
almul0 commented:
Finally, thanks to grisendo, I have found a patch for indexing problematic words. It is a sandbox project that transliterates words; it is a Search API processor and it works fine.
I can revert the previous patch for my purposes, but maybe it will be useful for others:
Search API Transliterate
Comment #60
drunken monkey commented:
Ah, cool! Great to know this exists, it could surely be helpful for some. A pity that it's only a sandbox project for now …
Anyway, regarding the issue at hand: thank you all a lot for testing, and sorry it took me a while to get back to this. I committed #46 now, hopefully this will help most people. Setting the issue back to "active", though, because code for the other DBMSs is still needed.
Comment #61
ndf commented:
Tested patch #46 and the sandbox module from #59 on an index with 1000+ nodes with Search API facets enabled.
Both the patch and the module do what they say they do.
I prefer patch #46 over the module from #59, because the module removes all special characters from the index; therefore Search API facets and the Views node-index fields won't show these special characters anymore.
If you want to show special characters but ignore them while searching, you need the patch. Good that it's committed to dev!
Comment #62
brahimmouhamou commented:
Patch #46 works like a charm! Thanks!
Comment #63
cmonnow commented:
I didn't see it written anywhere, but perhaps it can be suggested in the README (or at least here) that for PostgreSQL you would ALTER the string character columns of the "search_api_db_" tables to use COLLATE "C". I haven't actually tested anything, though, and don't even know the default behaviour. Just putting it out there for discussion.
Comment #64
drunken monkey commented:
Thanks for the note, good to know!
Do you have a Postgres database for testing? If so, and you can tell me the correct syntax for this ALTER statement, then I could add this fix to the module. It would be great to be able to do this out of the box for Postgres users, too!
Comment #65
cmonnow commented:
Hi drunken monkey,
Unfortunately I'm new to PostgreSQL myself and, like I guess most people here, converted for the PostGIS features.
I don't know how, or if, it's possible to change the "collate" type without also defining the column type. The basic syntax that works for me (at least on empty columns) redefines the type as it already was (character varying(50)). I haven't actually tested the search indexes at all yet.
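(The snippet itself was not preserved. Based on the description – redefining the column type as character varying(50) while adding the collation – the statement would presumably look like this, with an illustrative table name:)

```sql
ALTER TABLE search_api_db_default_node_index_body_value
  ALTER COLUMN word TYPE character varying(50) COLLATE "C";
```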
I haven't reached the stage of my project where I index foreign languages, but it seems the dilemma of mapping similar letters as described in https://drupal.org/node/1794374 would be essential for an international site that involves "cross-cultural" communications.
Comment #66
drunken monkey commented:
Ah, excellent, that's all I needed to know!
Can you then please test the attached patch and see if it succeeds in producing tables with the correct collation? Thanks!
(Before committing we should maybe just wrap the whole switch in a try/catch – for Postgres, there are not enough testers to be sure whether this also works in 8.x.)
Comment #67
drunken monkey
Comment #76
cmonnow commented:
Hi drunken monkey,
I can confirm that the patch worked as expected in a clean, standard Drupal 7.26 install using PostgreSQL 9.2 (with 2 nodes indexed).
The "word" column was successfully assigned a collation of "C". I tested the imaginary word "bànànà" in an article, and it was found using the string "bànànà" (and not "banana", as expected).
(Aside: I wish there were more PostgreSQL users in general, since there's a fraction of the user-friendly info out there compared with MySQL.)
Cheers
Comment #77
cmonnow commented:
And yep, the try/catch sounds like a great idea for development-stage queries. I suppose we could use the Database API's built-in error handling/reporting as well, e.g.
Comment #78
drunken monkey commented:
Good to hear! I've committed this fix now.
I think if anyone wants a less well-known DBMS supported, they should create a new issue – this one has been used enough.
Comment #80
miro_dietiker commented:
The project front page still contains a hint about there "currently being problems", with a reference to this issue.
I guess you should update it as well.
Comment #81
Mykolas commented:
What if I want the search to be accent-insensitive? COLLATE in the search query seems to work, but is very slow, because the 'word' column has an index with utf8_bin collation and MySQL does not support indexes with custom collations (http://dev.mysql.com/worklog/task/?id=826).
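To illustrate the trade-off described here (with an illustrative table name): the override below makes the comparison accent-insensitive, but because the index was built with utf8_bin, MySQL cannot use it for this comparison and falls back to scanning the column:

```sql
SELECT item_id
FROM search_api_db_default_node_index_body_value
WHERE word = 'cafe' COLLATE utf8_general_ci;
```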
Comment #82
jamsilver commented:
FYI: I've raised a related issue about ensuring that sorting on a string field always occurs case-insensitively, since the change here moved us over to utf8_bin collation, which forces a case-sensitive sort. See: #2793009: Sorting on a string field should be case insensitive in mysql, but is case sensitive.
Comment #83
geek-merlin commented:
Now we have #2793009: Sorting on a string field should be case insensitive in mysql, but is case sensitive.
This is a pain.
I read and grokked the whole thread, and I'm quite sure that we need a deeper understanding and a more thorough approach.
Find ideas in the issue description.
Comment #84
geek-merlin commented:
After some thinking, my gut feeling is:
* have one column with binary collation and use it for the index
* have one column with general collation and use it for sorting
(both columns contain the same word)
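A sketch of what such a two-column table might look like on MySQL (names and sizes are only illustrative, not a worked-out schema):

```sql
CREATE TABLE search_api_db_example_field (
  item_id INT UNSIGNED NOT NULL,
  -- Binary collation: backs the unique key, so 'a' and 'à' never collide.
  word VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  -- General collation: used only for ORDER BY, keeping sorts case-insensitive.
  word_sort VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  score FLOAT NOT NULL,
  PRIMARY KEY (item_id, word)
);
```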
Comment #85
geek-merlin commented:
Added some summary links for "collation in select/sort" … maybe even better options exist.