(Updated in #83)

Problem/Motivation

* First we had exceptions like SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '113-à' for key 'PRIMARY', e.g. when 113-à collided with 113-a in the index, as the two are equal under the utf8_general_ci collation
* Then all Search API tables were switched to utf8_bin collation by committing #46 in #60
* This fixes the duplicate-entry exceptions, but has the repercussion that sorting now follows binary collation: A..Z, a..z

Workarounds

* In #59 a sandbox preprocessor module was linked: Search API Transliterate

Proposed resolution

Find another way to fix the index collisions that does not kill case-insensitive sorting.
Brainstormed options:
* (best) change the index (and maybe the PHP insert code) in a way that allows any collation without index collisions
* use the db_merge approach from #58 so we can have any collation without index collisions (see the sketch after this list), and find similar solutions for other DB systems
* stick to binary collation but add an option to set the collation in the sort statement (how to do this in MySQL, PostgreSQL, SQLite, Oracle, SQL Server)
* use the general_ci collation but emulate it in PHP so we have no collisions
* use binary collation and do all normalization in PHP
* consider having "original word" and "collated word" columns
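
A rough sketch of the db_merge() idea, assuming the fulltext table layout used by search_api_db (columns item_id, word and score, with the primary key on item_id and word, as in the snippets further down this thread). With a merge, a collation that treats 'a' and 'à' as equal updates the existing row instead of raising a 1062 error:

foreach ($words as $word) {
  db_merge($table, $this->query_options)
    ->key(array(
      'item_id' => $id,
      'word' => $word['value'],
    ))
    // Insert the score for new rows ...
    ->fields(array('score' => $word['score'] * $boost))
    // ... but add it to the stored value when the key already exists.
    ->expression('score', 'score + :score', array(':score' => $word['score'] * $boost))
    ->execute();
}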

Remaining tasks

To thoroughly understand the original problem of index collisions we need some backtraces.
Important: we should understand whether the problem arises in regular indexes, in fulltext indexes, or in both.

This way we can understand if and how the collisions can be fixed without losing collation.
Note: the relevant method should be SearchApiDbService::indexItem().
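
A rough sketch of how such backtraces could be collected, assuming (as in the workaround snippets below) that the failing statement is an insert query built in that method:

try {
  $query->execute();
}
catch (PDOException $e) {
  // Log the failing table plus a backtrace, so we can tell whether the
  // collision happens in a regular or in a fulltext index table.
  watchdog('search_api_db', 'Duplicate entry in table @table: @message<br /><pre>@trace</pre>', array(
    '@table' => $table,
    '@message' => $e->getMessage(),
    '@trace' => $e->getTraceAsString(),
  ), WATCHDOG_ERROR);
  throw $e;
}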

User interface changes

API changes

Data model changes

Comments

drunken monkey’s picture

Priority: Normal » Major

Oh god, that's this ***** MySQL collation setting, most probably, which might count "a" and "à" as the same letter.
OK, then I guess we should specify the collation when creating tables for the database search. Asking how this works in #1144644: Enable specifying the collation when creating a database table. As a local fix for now, you can just change the collation for the relevant table (or only its "value" column) to "utf8_bin". The relevant table is "search_api_db_INDEX_FIELD", so in your case (and for the default node index): "search_api_db_default_node_index_body_value".
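
For reference, a rough sketch of the column-level variant (table name, column type and length are guesses; check the real definition with SHOW CREATE TABLE first, since MySQL's MODIFY must repeat the full column definition):

db_query("ALTER TABLE {search_api_db_default_node_index_body_value}
  MODIFY value VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL");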

I can't really tell why your workaround with the tokenizer shouldn't work, though.

miprof’s picture

OK,
you are very fast.
For the tokenizer, do I try [à] or "à"? Which is the correct syntax?

drunken monkey’s picture

Maybe try "[àÀ]". Apart from that, both variants should work.

miprof’s picture

Still the same. But this error is very strange:
I have another node of the same type, and when I put "à" in the same field there is no error ???

I will test more to find the context.

drunken monkey’s picture

The error will only occur when you have both "à" and "a" (or "á", "â", "ã", …) as individual tokens in the same field.

And, as said, please try changing the MySQL collation of the table or field.

miprof’s picture

Yes,
it's OK with #1.

Thanks a lot!

yareckon’s picture

sub

drunken monkey’s picture

Status: Active » Needs review

As per the suggestion in #1206126-16: SearchApiException: Unknown server profile_server, let's mention this issue in the README.

As for fixing this bug, I can't really do anything unless Drupal lets me influence the collation setting it uses, which pretty much makes fixing this in D7 impossible. If you want to support this at least for D8, try to create some publicity for #1144644: Enable specifying the collation when creating a database table.

drunken monkey’s picture

Title: SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '113-à' for key 'PRIMARY' » Problems with umlauts, accented characters, etc.
Project: Search API » Search API Database Search
Version: 7.x-1.x-dev » 7.x-1.0-beta1
Component: Database search » Code
Status: Needs review » Active

Moving this to the new module.

I also now committed the above patch to the README.txt. As said, fixing this seems pretty much impossible to me in D7, but maybe someone will finally come up with a solution, or at least a workaround.
In any case, leaving the issue open so people running into this problem will (hopefully) find it.

nattie’s picture

Hi, I had the same problem. To resolve it I added two verifications before the insert queries in the service.inc file. Hope it will help you.
Line 360:

$query = db_insert($table, $this->query_options)
  ->fields(array('item_id', 'word', 'score'));

foreach ($words as $word) {
  // Check whether this (item_id, word) pair is already indexed.
  $exists = db_select($table, 't')
    ->fields('t', array('score'))
    ->condition('item_id', $id)
    ->condition('word', $word['value'])
    ->execute()
    ->fetchField();

  if ($exists === FALSE) {
    $query->values(array(
      'item_id' => $id,
      'word' => $word['value'],
      'score' => $word['score'] * $boost,
    ));
    $query->execute();
  }
}
Line 402:

$exists = db_query('SELECT item_id FROM {' . $table . '} WHERE item_id = :id AND value = :value', array(
  ':id' => $id,
  ':value' => $value,
))->fetchField();

if ($exists === FALSE) {
  db_insert($table, $this->query_options)
    ->fields(array(
      'item_id' => $id,
      'value' => $value,
    ))
    ->execute();
}

drunken monkey’s picture

Just changing the table collation should help, too; it won't screw with your search results and also won't need any patches to the code.

tahiticlic’s picture

Well, I've changed all the search api index table collation, and that didn't help. Error is still there.

drunken monkey’s picture

Well, I've changed all the search api index table collation, and that didn't help. Error is still there.

Even after clearing the index? Then you should check manually which keys are seen as identical.
(Or maybe the columns still use the other collation, and you have to change that specifically for each "word" column … ?)

etiennechataignier’s picture

Same problem here.
I cleared the index.
I set collation to utf8_bin for tables search_api_db_*.

Then you should check manually which keys are seen as identical.

I'd like to.
The error log prints

SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '706-therond' for key 'PRIMARY'

From this information, how may I find the 2 keys?

drunken monkey’s picture

From this information, how may I find the 2 keys?

SELECT word FROM $table WHERE item_id = 706 AND word = 'therond'
This should give you the first word that got indexed. The second one is "therond".
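
More generally, here is an untested query that lists every group of tokens a case/accent-insensitive collation would fold together (run it against a table that already uses utf8_bin; the table name is a placeholder for your real index table):

$collisions = db_query("SELECT item_id, word COLLATE utf8_general_ci AS folded, GROUP_CONCAT(word) AS variants
  FROM {search_api_db_default_node_index_body_value}
  GROUP BY item_id, folded
  HAVING COUNT(*) > 1")->fetchAll();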

PedroMiguel’s picture

If you have already changed the table collation to utf8_bin and it still doesn't work, change the collation on the table columns as well.

Works for me.

yareckon’s picture

drunken monkey, could you add a test for the collation issue and have it complain with a Drupal message? It would probably catch more folks than an entry in the README.txt.

drunken monkey’s picture

I cannot really think of a good easy test there. The way to do this would probably be something like in #10, running a SELECT before each INSERT and thus considerably slowing down the (already rather slow) indexing operation. However, this way we could not only detect but also fix the error, I guess. So maybe we should do that after all, if this is such a frequent problem …

Also, carefully reading the README.txt before using a module is a very sensible practice that I want to encourage, rather than help people work around.

tahiticlic’s picture

Changing the collation for fields inside tables did the trick, thanks!

yareckon’s picture

drunken monkey, I'm not suggesting doing it at run time before each insert, just some sort of hook_requirements-like thing that would do a one-off test for this problem when folks try to activate/first configure an index with a MySQL backend. If it comes back busted, just throw up a status message like any other module.
Basically, if it fails: "Warning: to index non-ASCII characters you must change your MySQL collation settings. See the search_api README.txt for more information."

drunken monkey’s picture

drunken monkey, I'm not suggesting doing it at run time before each insert, just some sort of hook_requirements-like thing that would do a one-off test for this problem when folks try to activate/first configure an index with a MySQL backend. If it comes back busted, just throw up a status message like any other module.
Basically, if it fails: "Warning: to index non-ASCII characters you must change your MySQL collation settings. See the search_api README.txt for more information."

Ah, you're right, that would also be an option.

Or, come to think of it, we could inspect the exception, and if it's a duplicate key issue, log a more helpful error message.
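
A minimal sketch of that idea (the wrapped $query and the message wording are just placeholders, not the module's actual code):

try {
  $query->execute();
}
catch (PDOException $e) {
  // SQLSTATE 23000 (MySQL error 1062) indicates a duplicate key; point
  // the user at the collation hint in the README instead of only
  // dumping the raw exception.
  if ($e->getCode() == '23000') {
    watchdog('search_api_db', 'Duplicate entry while indexing. This usually means the MySQL collation treats accented and unaccented characters as equal; see the README.txt for how to change it.', array(), WATCHDOG_ERROR);
  }
  throw $e;
}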

etiennechataignier’s picture

#16
Thanks! It works.
For my next projects, I'll set utf8_bin from the very beginning of the Drupal install.

drunken monkey’s picture

For my next projects, I'll set utf8_bin from the very beginning of the Drupal install.

You mean for all tables? In that case, please be aware that this has some probably undesirable side effects, like not ignoring the case for sorting (example sort: A, B, C, a, b, c). Depending on your language this might not be a problem after all, though.

jefkin’s picture

Because I had to do a little SQL research to find the right syntax for this, I'm posting it here so others might not need to suffer as much.

This is no panacea, but it is a workable way to break through this limitation. Using your favorite MySQL client:

mysql> ALTER TABLE search_api_db_node_index_body_value CONVERT TO CHARACTER SET 'utf8' COLLATE 'utf8_bin';

Obviously you need to replace 'search_api_db_node_index_body_value' with your table name(s). I was running my index on a big old complicated data type with about 9000 nodes and wasn't sure which tables needed it, so I ended up running this statement for each of the different Search API index tables, getting the list with:

mysql> show tables like '%search%';

I didn't try it myself, but if you can't use MySQL directly, you might get it done with PHP (db_query() executes the statement itself, so no extra execute() call is needed):

db_query("ALTER TABLE {search_api_db_node_index_body_value} CONVERT TO CHARACTER SET 'utf8' COLLATE 'utf8_bin'");

perhaps via the Devel module's "Execute PHP" page.

Also note that (for me at least) the entire index had to be cleared after the collation change and the data re-indexed. Not ideal, but workable.

nils.destoop’s picture

Or if you have many Search API tables:

  $search_api_tables = db_query("SHOW TABLES LIKE '%search_api%'")->fetchCol();
  foreach ($search_api_tables as $table) {
    // The names returned by SHOW TABLES are already prefixed, so don't
    // wrap them in {} here.
    db_query("ALTER TABLE $table CONVERT TO CHARACTER SET 'utf8' COLLATE 'utf8_bin'");
  }

Ken Hawkins’s picture

Although I've set the search indexes to utf8_bin, I'm still getting complaints from the system that "Couldn't index items. Check the logs for details.", with the offending error relating to "1062 Duplicate entry".

Stranger still, before switching to utf8_bin the indexing simply failed; now I get the errors (and the item remains uncounted as indexed), yet the item appears appropriately in the search results.

SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '45-cafes' for key 'PRIMARY'

drunken monkey’s picture

@ #24, #25: You only need to use this with the 'search_api_db_%' tables. The ones from Search API itself are fine as they are. On the other hand, it shouldn't do any harm, so it's probably fine either way.

@ #26: Very strange. Did you clear the index after changing the collation?

freakalis’s picture

This works great for me, but is it possible to add COLLATE 'utf8_bin' to all tables created by search_api_db, so that we don't need to run this "fix" manually?

freakalis’s picture

Seems like db_create_table() does not allow setting the collation on a table.

Here is a patch that tries to resolve this problem by setting the collation to utf8_bin after the table is created.

An update function is also included that converts all existing search_api_db tables to utf8_bin collation.

freakalis’s picture

Status: Active » Needs review

Forgot to change the status to "needs review"...

moniuch’s picture

I set utf8_bin on all search_api tables, cleared my node index and re-created it. The errors are still there. When I compared the error messages with the contents of the nodes, I started to wonder how Search API copes with situations where a word occurs more than once in a node.
In my case the index failed with the message "866-możliwość" (this is a Polish word). When I looked at node/866 I saw this word appears three times. Isn't this a limitation? Is there a setting in the UI that I neglected to check that might cure this?

tahiticlic’s picture

You have to set utf8_bin on fields also, not only tables (text fields at least).

moniuch’s picture

True, that helped, thx

drunken monkey’s picture

If you also set the collation on tables of string fields (not only on fulltext ones), that will make sorting on those fields case-sensitive. So, for fixing this problem, you should probably only change the collation of the "word" column in the tables of fulltext fields.

@ #29: Thanks for the patch! If this is also possible for single columns, it might really be a good idea to do this.
However, does this work for all DBMSs the same, or only for MySQL?

aendra’s picture

I've tried changing all search_api-related tables to utf8_bin, and have cleared the index; alas, I still get the errors.

As per #32, am I to take it that, if I have a field named "field_blabla", I should change the collation on the table "field_data_field_blabla" to utf8_bin as well?

Thanks!

tahiticlic’s picture

You have to set utf8_bin on fields also, not only tables (text fields at least).

modestmoes’s picture

#29 worked for me. Thanks!

sylus’s picture

Rolled the patch against the latest dev release (July 2nd) - commit: a6203a3

checker’s picture

Perhaps you could add something like a variable_get() switch to enable or disable this, in case the feature is not needed by everyone.

sylus’s picture

Can confirm that the patch breaks Postgres installs:

SQLSTATE[42601]: Syntax error: 7 ERROR: syntax error at or near "CONVERT" LINE 1: ALTER TABLE search_api_db_database_node_index_nid CONVERT TO... ^

rob_johnston’s picture

It will also break an MS SQL Server install:

SQLSTATE[42000]: [Microsoft][SQL Server Native Client 10.0][SQL Server]Incorrect syntax near the keyword 'CONVERT'.

I believe that COLLATE can only be applied to the database as a whole or to a column of a table, but not to a table (as in the patch). It gets worse...

No version of SQL Server supports UTF-8 and from what I've read, depending on what one wishes to do, it should be UTF-16 or an nvarchar field (which uses UCS-2) or varbinary... all have pluses and minuses.

There is another way of doing it, which is to specify the collation when doing the select, such as:

SELECT Place FROM Locations ORDER BY Place
COLLATE Traditional_Spanish_ci_as ASC;

Would something similar to this method work with other database systems (ci = case insensitive, as = accent sensitive)?

Anyone who knows more about SQL Server can feel free to chime in.
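
For MySQL, at least, the same per-query collation override exists. A rough sketch (table and column names are placeholders):

// Sort a utf8_bin column case-insensitively for one query only. Note
// that MySQL cannot use the column's index for a non-default collation,
// so this can be slow on large tables
// (http://dev.mysql.com/worklog/task/?id=826).
$result = db_query("SELECT item_id, value
  FROM {search_api_db_default_node_index_title}
  ORDER BY value COLLATE utf8_general_ci ASC");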

rob_johnston’s picture

An update to the patch by @sylus to execute different SQL depending on the database being used.

pvhee’s picture

Status: Needs review » Reviewed & tested by the community

Can confirm that #42 is working fine with the latest DEV release. RTBC?

checker’s picture

I'm also using patch #42 without problems (under MySQL). Other DBMSs could easily be added later if there are issues.

geek-merlin’s picture

This patch solved my "integrity constraint violations" after deleting and recreating the DB backend.
Re-rolled it with bogus whitespace removed, but no other changes.

drunken monkey’s picture

Version: 7.x-1.0-beta1 » 7.x-1.x-dev
Status: Reviewed & tested by the community » Needs review

Well, I guess a fix for MySQL is better than nothing … If someone is interested in making this work on other DBMSs, they can provide another patch.

I've just cleaned the patch up a bit, removing the dummy code for other DBMSs, using the right database and also fixing the update hook to use the correct data.
Please review and confirm that this still fixes your problems.

Also, as mentioned before, this will also make sorting case-sensitive, which usually isn't desired. We could easily just fix the tables for fulltext fields in this way (or also for multi-valued fields, as they cannot be searched, either) which would remove this problem – however, there could theoretically be clashes in normal fields too, in some rare cases.
So, what would you prefer: being able to sort case-insensitively, or getting rid of this bug (on MySQL servers) also in edge cases?

Please also see #1794374: Searching for "...é..." yields other results than searching for "...e..." for a half-related issue. It would suggest that we should maybe not counteract MySQL's default collation, but rather emulate it. A better solution would probably be to provide a preprocessor for that, though.

rob_johnston’s picture

The first way of determining the database system in #42, using $database = $databases['default']['default'];, was completely invented by me, and I've been keeping my eyes open for an official way of doing it. I see that the latest patch uses:
$db_driver = $databases[$key][$target]['driver'];

I recently saw that the Views module, in its /includes/handlers.inc file, determines the database with:
$db_type = Database::getConnection()->databaseType();

Just thought I'd add that, as I don't know which is better. Both return a value like 'mysql', 'pgsql', 'sqlsrv', etc.

You could probably change the switch-case statement in the service.inc file to be a simple if ($db_driver === 'mysql') as well since none of the other cases are being used.

drunken monkey’s picture

The first way of determining the database system in #42, using $database = $databases['default']['default'];, was completely invented by me, and I've been keeping my eyes open for an official way of doing it. I see that the latest patch uses: $db_driver = $databases[$key][$target]['driver'];

Yes, we have to do it that way as we allow other databases than the default one to be used for the search server.

You could probably change the switch-case statement in the service.inc file to be a simple if ($db_driver === 'mysql') as well since none of the other cases are being used.

Yes, but in contrast to the update hook, this might actually later be filled, so I thought I'd leave it. It's not a performance problem in any way, but provides a visual reminder of what's missing.

So, does the patch work for you as expected?

rob_johnston’s picture

I guess that was just a long-winded question. No hurry, but can you tell me or point me to something that talks about the difference between $db_driver = $databases[$key][$target]['driver']; and $db_type = Database::getConnection()->databaseType();? They both return the identical string. I'm going to be making changes to handle sqlsrv in a couple more modules.

Sylus is handling the install profile I'm using and I'm doing some of the sqlsrv work. If the patch doesn't work after he includes it then I'll let somebody know.

dimarick’s picture

This patch works fine for me.
Installation:
1. Apply the patch to search_api_db/service.inc
2. Clear the index (but do not reindex yet)
3. Index the content
4. See no errors!

But I don't know where to get a full collation table like MySQL's utf8_general_ci.

davvid’s picture

#50 works, but it's probably useless if you want to search with words that contain replaced characters. Am I right?

milesw’s picture

Patch #46 applied cleanly against 7.x-1.0-beta4 and fixes my MySQL indexing errors related to accents. Thanks for the patch!

Vacilando’s picture

I concur with @milesw in #52: "Patch #46 applied cleanly against 7.x-1.0-beta4 and fixes my MySQL indexing errors related to accents. Thanks for the patch!"

checker’s picture

#46 fixes my db errors with umlauts.

mansspams’s picture

Status: Needs review » Reviewed & tested by the community

#46 applies and works. Better solution still needed.

sardbaba’s picture

#46 worked for me.

Steps:

  1. Download and apply patch at #46
  2. Run update.php (the patch invokes hook_update_N)
  3. Clear the index
  4. Index the content

Maico de Jong’s picture

#46 worked for me.

almul0’s picture

I was dealing with this issue this evening and tested a lot of utf8 collations on the search_api_db_* tables. Since my problem occurred when trying to index the content, I finally swapped the db_insert statements for db_merge statements.

These statements prevent the index failure when indexing a set of words like '--a---' and '--á---', which are the same to the database under the default collation.

Before, I was applying the utf8_bin collation; with this patch, I reverted the tables to their original DEFAULT collation.

I am using it together with the patch for partial matching (Add option for partial matching); it works quite well, and now my Views search with an exposed fulltext filter works well, returning content case- and accent-insensitively.

IMPORTANT: I made this patch for a case- and accent-insensitive search.

almul0’s picture

Finally, thanks to grisendo, I have found a fix for indexing problematic words. It is a sandbox project that transliterates the words; it is a Search API processor, and it works fine.

I can revert my patch for my purposes, but maybe it will be useful for others.

Search API Transliterate

drunken monkey’s picture

Title: Problems with umlauts, accented characters, etc. » Fix problems with umlauts, accented characters, etc.
Status: Reviewed & tested by the community » Active

Finally, thanks to grisendo, I have found a fix for indexing problematic words. It is a sandbox project that transliterates the words; it is a Search API processor, and it works fine.

Ah, cool! Great to know this exists, could surely be helpful for some. A pity that it's only a sandbox project for now …

Anyway, regarding the issue at hand: thank you all a lot for testing, and sorry it took me a while to get back to this. I have committed #46 now; hopefully this will help most people. Setting the issue back to "active", though, because code for the other DBMSs is still needed.

ndf’s picture

Tested patch #46 and sandbox module #59 on an index with 1000+ nodes, with Search API facets enabled.
Both the patch and the module do what they say they do.

I prefer patch #46 over the module from #59, because the module removes all special characters from the index; therefore Search API facets and views-fields:node-index won't show these special characters anymore.

If you want to show special characters but ignore them while searching, you need the patch. Good that it's committed to dev!

brahimmouhamou’s picture

Patch #46 works like a charm! Thanks!

cmonnow’s picture

I didn't see it written anywhere, but perhaps it can be suggested in the README (or at least noted here) that for PostgreSQL you would set (ALTER) the string character columns of the "search_api_db_" tables to collation "C" (COLLATE "C"). I haven't actually tested anything, though, and don't even know the default behaviour. Just putting it out there for discussion.

drunken monkey’s picture

Issue summary: View changes

Thanks for the note, good to know!
Do you have a Postgres database for testing? If so, and you can tell me the correct syntax for this ALTER statement, then I could add this fix to the module. Would be great to be able to do this out of the box for Postgres users, too!

cmonnow’s picture

Hi drunken monkey,

Unfortunately I'm new to PostgreSQL myself and, like I guess most people here, converted for the PostGIS features.

I don't know how (or if) it's possible to change the collation without defining the column type as well, e.g.

ALTER TABLE search_api_db_database_node_index_body_summary
   ALTER COLUMN word SET DATA TYPE character varying(50) COLLATE pg_catalog."C"

is the basic syntax that works for me (at least on empty columns); the type in this case was redefined to what it already was (character varying(50)). I haven't actually tested the search indexes themselves at all.

I haven't reached the stage of my project where I index foreign languages yet, but it seems the dilemma of mapping similar letters, as described in https://drupal.org/node/1794374, would be essential for an international site that involves "cross-cultural" communications.

drunken monkey’s picture

Status: Active » Needs review

Ah, excellent, that's all I needed to know!
Can you then please test the attached patch and see if it succeeds in producing tables with the correct collation? Thanks!

(Before committing we should maybe just wrap the whole switch in a try/catch – for PSQL, there are not enough testers to be sure whether this also works in 8.x.)

cmonnow’s picture

Hi drunken monkey,

I can confirm that the patch worked as expected in a clean, standard Drupal 7.26 install using PostgreSQL 9.2 (with 2 nodes indexed).

The "word" column was successfully assigned a collation of "C". I tested the imaginary word "bànànà" in an article and it was found using the string "bànànà" (and not "banana", as expected).

(Aside: I wish there were more PostgreSQL users in general, since there's only a fraction of the user-friendly info out there compared with MySQL.)

Cheers

cmonnow’s picture

And yep, the try/catch sounds like a great idea for development queries. I suppose we can use the Database API's built-in error handling/reporting as well, e.g.:

try {
  switch ($this->connection->databaseType()) {
    // All the collation fixes.
  }
}
catch (Exception $e) {
  watchdog('search_api_db', 'Search API Database Search is unable to alter the table/column collation (needed to support universal character indexing). Message: @message, query: @query', array('@message' => $e->getMessage(), '@query' => $e->query_string), WATCHDOG_ERROR);

  // Optionally also present the message on screen, since it's a back-end error.
  drupal_set_message(t('Table/column collation could not be altered to support universal character sets. Message: %message, query: %query',
    array('%message' => $e->getMessage(), '%query' => $e->query_string)), 'error');
}

drunken monkey’s picture

Status: Needs review » Fixed

Good to hear! I've committed this fix now.
I think if anyone wants a less well-known DBMS supported, they should create a new issue – this one has been used enough.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

miro_dietiker’s picture

The project front page still contains a hint that there may "currently be problems", with a reference to this issue.
I guess you should update it as well.

Mykolas’s picture

Category: Bug report » Feature request

What if I want the search to be accent-insensitive? COLLATE in the search query seems to work, but it is very slow, because the 'word' column has an index with utf8_bin collation and MySQL does not support indexes with custom collations (http://dev.mysql.com/worklog/task/?id=826).

jamsilver’s picture

FYI: I've raised a related issue about ensuring sorting on a string field always occurs case insensitively, since the change here moved us over to utf8_bin collation which forces a case-sensitive sort. See: #2793009: Sorting on a string field should be case insensitive in mysql, but is case sensitive.

geek-merlin’s picture

Title: Fix problems with umlauts, accented characters, etc. » Fix character collation problems
Issue summary: View changes

Now we have #2793009: Sorting on a string field should be case insensitive in mysql, but is case sensitive.
This is a PITA.

I read and grokked the whole thread, and I'm quite sure that we need a deeper understanding and a more thorough approach.
Find ideas in the issue summary above.

geek-merlin’s picture

After some thinking, my gut feeling is:
* have one column with binary collation and use it in the index
* have one column with general collation and use it for sorting
(both columns would contain the same word)
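
A rough sketch of what that could look like (all names are made up, and the ALTER statement is MySQL-only):

// 'word' keeps binary collation and stays in the primary key, so no
// collisions; 'word_sort' gets a case-insensitive collation and is only
// used for ORDER BY. Both columns receive the same token on insert.
$schema = array(
  'fields' => array(
    'item_id' => array('type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE),
    'word' => array('type' => 'varchar', 'length' => 50, 'not null' => TRUE),
    'word_sort' => array('type' => 'varchar', 'length' => 50, 'not null' => TRUE),
    'score' => array('type' => 'float', 'not null' => TRUE),
  ),
  'primary key' => array('item_id', 'word'),
);
db_create_table('search_api_db_example_index_body', $schema);
db_query("ALTER TABLE {search_api_db_example_index_body}
  MODIFY word VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  MODIFY word_sort VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL");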

geek-merlin’s picture

Issue summary: View changes

Added some summary links for "collation in select/sort"; that may be even better.