drunken monkey commented
Category: Bug report » Support request
Status: Active » Postponed (maintainer needs more info)
drunken monkey commented
Title: Why the hyphen "-" is ignored, while used in keywords? » Special characters are always ignored in searches on DB backend
Version: 8.x-1.0-rc4 » 8.x-1.x-dev
Component: General code » Database backend
Category: Support request » Bug report
Status: Postponed (maintainer needs more info) » Active
Comments
Comment #2
vanyamtv CreditAttribution: vanyamtv commented

Comment #3
drunken monkey commented
That depends on a lot of factors:
- Are you using the "Tokenizer" or "Ignore characters" processor in your index?
- What server backend are you using?
- Is the hyphen stand-alone or part of a compound word?
In general, the Search API itself doesn't define any keyword handling on its own. This all depends on the backend and can also be influenced by processors, hooks, etc.
Comment #4
Patrick R. CreditAttribution: Patrick R. at UEBERBIT GmbH commented
Having a problem with this as well. The hyphen in my case is part of the compound word "e-Daitem" (a product name), and I'm using the search_api_db backend without any of the aforementioned processors. By debugging I found out that the "damage" is ultimately done in Drupal\search_api_db\Plugin\search_api\backend\Database::splitIntoWords(), and there doesn't seem to be a way to prevent this from happening through configuration.
I'm also wondering if this (second) process of preparing the search keywords might conflict with some of the parse mode options which can be set in the "search_api_fulltext" views filter plugin. Not sure about that, though.
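For illustration, the splitting behavior described above can be sketched in Python. The real implementation is PHP, in Database::splitIntoWords(), and the exact character classes it uses may differ; this ASCII-only sketch just mimics the observed rule that anything non-alphanumeric acts as a word boundary:

```python
import re

def split_into_words(text):
    # Rough sketch of search_api_db's word splitting: lowercase the text,
    # then treat every run of non-alphanumeric characters as a boundary.
    return [w for w in re.split(r'[^a-z0-9]+', text.lower()) if w]

# The compound product name is torn apart, so a search for "e-Daitem"
# can never match the term as a whole:
print(split_into_words('e-Daitem'))  # ['e', 'daitem']
```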
Comment #5
drunken monkey commented
Oh, you're right. It indeed seems like there's no way to avoid removing all special characters from the keywords, even though it's possible to get words indexed with them – leading, of course, to no results being found for such keywords.
However, in this example, wouldn't just ignoring the hyphen (via the "Ignore characters" processor) produce the correct result for you? I think it will work correctly in most cases.
However, this doesn't change the fact that this is indeed a bug: at the very least, if it's not possible to have keywords with special characters, it also shouldn't be possible to index words with them.
So, even though this would discard quite some work put into supporting this, the simplest solution here seems to be removing those characters from indexed text, too. (I.e., use splitIntoWords() even for already tokenized text.)
For a more complicated but also more correct/flexible solution (actually taking the tokenizer, etc., settings properly into account) …
… I don't actually know what to do, really. We'd have to remember whether we had to use our custom word-splitting during indexing or whether we can rely on a tokenizer being present, and in the latter case only split keywords on space characters (which the tokenizer will produce in this case). This would be pretty complicated for solving such a small problem, and it wouldn't even be completely reliable, as it's possible to enable the tokenizer for just some fields, in which case we'd need even more code to generate a correct query (with two different sets of keywords, depending on the field). (Though that's a general issue for processors and inadvisable in any case – cf. #2859683: Processors don't correctly preprocess keywords per field.)
Not really sure what to do here. Any opinions?
Yes, it does, but that's "by design", or at least a "known issue" – due to its inner workings, the DB backend doesn't support phrase searches, so those have to be split into individual words even if the parse mode says otherwise.
Comment #6
Patrick R. CreditAttribution: Patrick R. at UEBERBIT GmbH commented
I tried that out and it seems to work reasonably well, thanks for the idea. It causes some side effects: I now get isolated warnings during indexing from the processor, e.g. when HTML text contains hyperlinks and all special characters such as ":" and "/" get stripped ("An overlong word (more than 50 characters) was encountered while indexing"). But I guess that's not too bad.
Initially I also got some strange SQL error ("Numeric value out of range: 1264 Out of range value for column 'score' ...") when re-indexing, but that disappeared as soon as I adjusted the processor weights so that the "HTML filter" processor runs before "Ignore characters".
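The ordering issue is easy to reproduce conceptually: if characters like ":" and "/" are deleted from raw HTML before the HTML filter has stripped markup and split the text, a whole URL collapses into one giant token. A Python sketch of this effect – the character list here is a hypothetical "Ignore characters" setting, not the module's default:

```python
import re

def ignore_characters(text, ignored='./:-'):
    # Sketch of the "Ignore characters" processor: the configured
    # characters are simply deleted from the text.
    return re.sub('[' + re.escape(ignored) + ']', '', text)

url = 'https://www.example.com/sites/default/files/long-article-name-2017.pdf'
token = ignore_characters(url)
# The URL fuses into a single 57-character token, which is what trips the
# "overlong word (more than 50 characters)" warning during indexing.
print(len(token))  # 57
```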
Ah okay, thanks for the clarification. Maybe that should go into the module's README.txt to prevent confusion, if there is no way for the search backend to limit the available parse mode options or to remove the option altogether. :-)
Comment #7
drunken monkey commented
OK, good to hear this is now mostly working for you.
Since it's still a bug, though, I guess I'll leave this open for now to see if anyone else wants to weigh in, or has trouble with this.
Comment #8
bander2 CreditAttribution: bander2 as a volunteer commented
Just ran across this. I have content with a unique identifier field that follows a couple of different formats (from other organizations, so we don't control the formats). Some look like 012.34.5679 and some look like MEEC-123456. There are probably others.
I would like a user who knows the identifier to be able to search for it and get the result. I can confirm that using the "ignore characters" processor seems to solve the problem. But given the fact that I need to support multiple formats, I am concerned that I will run into a wall eventually.
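As a sanity check of why the processor helps here: as long as the same characters are ignored at both index and query time, both identifier formats normalize to matching tokens. A Python sketch, where the ignored-character set is a guessed configuration that happens to cover both formats:

```python
import re

def normalize(text, ignored='.-'):
    # Strip the ignored characters, then lowercase – mirroring the idea
    # that the "Ignore characters" processor runs on indexed text and
    # on search keys alike.
    return re.sub('[' + re.escape(ignored) + ']', '', text.lower())

indexed = {normalize(t) for t in ('012.34.5679', 'MEEC-123456')}
# A user typing either identifier verbatim now matches the indexed token:
print(normalize('012.34.5679') in indexed)   # True
print(normalize('MEEC-123456') in indexed)   # True
```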
Comment #9
k4v CreditAttribution: k4v as a volunteer commented
I cannot find email addresses in fulltext search with the DB backend, for the same reason: splitKeys() splits them at the "@" and "." characters.
Comment #10
drunken monkey commented
OK, while probably not a large problem, this is definitely annoying for some, and might be a hidden problem for many more, so let's attempt to fix this, hopefully in a way that breaks (or makes the experience worse on) as few sites as possible.
I guess, for determining which strategy to use, we can simply check whether the Tokenizer processor is enabled on the index. This should be good enough in almost all cases.
Patch attached, please test/review!
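The strategy from the comment above – choose the key-splitting behavior based on whether the Tokenizer processor is enabled – can be sketched like this. This is a Python illustration of the logic only (the actual patch is PHP in the DB backend), and it assumes the Tokenizer is configured to split on whitespace only:

```python
import re

def prepare_keys(keys, tokenizer_enabled):
    # If the Tokenizer processor already normalized the indexed text, it
    # reduced separators to spaces, so splitting the search keys on
    # whitespace suffices and special characters survive inside words.
    if tokenizer_enabled:
        return keys.lower().split()
    # Otherwise mirror the aggressive splitting used at index time.
    return [w for w in re.split(r'[^a-z0-9]+', keys.lower()) if w]

print(prepare_keys('e-Daitem', tokenizer_enabled=True))   # ['e-daitem']
print(prepare_keys('e-Daitem', tokenizer_enabled=False))  # ['e', 'daitem']
```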
Comment #11
drunken monkey commented
Could someone please test the patch and confirm this resolves, or at least alleviates, the problem for them?
Comment #12
Jabastin Arul CreditAttribution: Jabastin Arul commented
Thanks for the patch, but it doesn't work for me. The hyphen is still not showing in autocomplete suggestions.
Comment #13
drunken monkey commented
Autocomplete suggestions are another thing entirely. Please test the patch with the normal search, not autocomplete, and create a new issue if you (still) have problems with the latter.
Comment #14
drunken monkey commented
Last chance for testing whether this patch makes things better or worse for you!
Otherwise I’ll just hope for the best and commit.
Comment #15
drunken monkey commented
OK, not quite – still needs tests, I guess. (Tests-only patch is the interdiff.)
Comment #17
borisson_ commented
The test coverage here seems really good. +1. I think the fix is also correct.
Comment #19
drunken monkey commented
Good to hear, thanks a lot for reviewing!
Committed.