drunken monkey commented
Category: Bug report » Support request
Status: Active » Postponed (maintainer needs more info)
drunken monkey commented
Title: Why the hyphen "-" is ignored, while used in keywords? » Special characters are always ignored in searches on DB backend
Version: 8.x-1.0-rc4 » 8.x-1.x-dev
Component: General code » Database backend
Category: Support request » Bug report
Status: Postponed (maintainer needs more info) » Active
Comments
Comment #2
vanyamtv CreditAttribution: vanyamtv commented

Comment #3
drunken monkey commented
That depends on a lot of factors:
- Are you using the "Tokenizer" or "Ignore characters" processor in your index?
- What server backend are you using?
- Is the hyphen stand-alone or part of a compound word?
In general, the Search API itself doesn't define any keyword handling on its own. This all depends on the backend and can also be influenced by processors, hooks, etc.
Comment #4
Patrick R. CreditAttribution: Patrick R. at UEBERBIT GmbH commented
Having a problem with this as well. The hyphen in my case is part of the compound word "e-Daitem" (a product name), and I'm using the search_api_db backend without any of the aforementioned processors. By debugging I found out that the "damage" is ultimately done in Drupal\search_api_db\Plugin\search_api\backend\Database::splitIntoWords(), and there doesn't seem to be a way to prevent this from happening through configuration.
I'm also wondering if this (second) process of preparing the search keywords might conflict with some of the parse mode options which can be set in the "search_api_fulltext" views filter plugin. Not sure about that, though.
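For illustration, the splitting behavior described above can be sketched in Python. The real implementation is PHP, in Database::splitIntoWords(), and the exact character classes it uses may differ; this ASCII-only sketch just mimics the observed rule that anything non-alphanumeric acts as a word boundary:

```python
import re

def split_into_words(text):
    # Rough sketch of search_api_db's word splitting: lowercase the text,
    # then treat every run of non-alphanumeric characters as a boundary.
    return [w for w in re.split(r'[^a-z0-9]+', text.lower()) if w]

# The compound product name is torn apart, so a search for "e-Daitem"
# can never match the term as a whole:
print(split_into_words('e-Daitem'))  # ['e', 'daitem']
```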
Comment #5
drunken monkey commented
Oh, you're right. It indeed seems like there's no way to avoid removing all special characters from the keywords, even though it's possible to get words indexed with them – leading, of course, to no results being found for such keywords.
However, in this example, wouldn't just ignoring the hyphen (via the "Ignore characters" processor) produce the correct result for you? I think it will work correctly in most cases.
However, this doesn't change the fact that this is indeed a bug: at the very least, if it's not possible to have keywords with special characters, it also shouldn't be possible to index words with them.
So, even though this would discard quite some work put into supporting this, the simplest solution here seems to be removing those characters from indexed text, too. (I.e., use splitIntoWords() even for already tokenized text.)
For a more complicated but also more correct/flexible solution (actually taking the tokenizer, etc., settings properly into account) …
… I don't actually know what to do, really. We'd have to remember whether we had to use our custom word-splitting during indexing or whether we can rely on a tokenizer being present, and in the latter case only split keywords on space characters (which the tokenizer will produce in this case). This would be pretty complicated for solving such a small problem, and it wouldn't even be completely reliable, as it's possible to enable the tokenizer for just some fields, in which case we'd need even more code to generate a correct query (with two different sets of keywords, depending on the field). (Though that's a general issue for processors and inadvisable in any case – cf. #2859683: Processors don't correctly preprocess keywords per field.)
Not really sure what to do here. Any opinions?
Yes, it does, but that's "by design", or at least a "known issue" – due to its inner workings, the DB backend doesn't support phrase searches, so those have to be split into individual words even if the parse mode says otherwise.
Comment #6
Patrick R. CreditAttribution: Patrick R. at UEBERBIT GmbH commented
I tried that out and it seems to work reasonably well, thanks for the idea. It causes some side effects: I now get isolated warnings during indexing from the processor, e.g. when HTML text contains hyperlinks and all special characters such as ":" and "/" get stripped ("An overlong word (more than 50 characters) was encountered while indexing"). But I guess that's not too bad.
Initially I also got some strange SQL error ("Numeric value out of range: 1264 Out of range value for column 'score' ...") when re-indexing, but that disappeared as soon as I adjusted the processor weights so that the "HTML filter" processor runs before "Ignore characters".
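The ordering issue is easy to reproduce conceptually: if characters like ":" and "/" are deleted from raw HTML before the HTML filter has stripped markup and split the text, a whole URL collapses into one giant token. A Python sketch of this effect – the character list here is a hypothetical "Ignore characters" setting, not the module's default:

```python
import re

def ignore_characters(text, ignored='./:-'):
    # Sketch of the "Ignore characters" processor: the configured
    # characters are simply deleted from the text.
    return re.sub('[' + re.escape(ignored) + ']', '', text)

url = 'https://www.example.com/sites/default/files/long-article-name-2017.pdf'
token = ignore_characters(url)
# The URL fuses into a single 57-character token, which is what trips the
# "overlong word (more than 50 characters)" warning during indexing.
print(len(token))  # 57
```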
Ah okay, thanks for the clarification. Maybe that should go into the module's README.txt to prevent confusion, if there is no way for the search backend to limit the available parse mode options or to remove the option altogether. :-)
Comment #7
drunken monkey commented
OK, good to hear this is now mostly working for you.
Since it's still a bug, though, I guess I'll leave this open for now to see if anyone else wants to weigh in, or has trouble with this.
Comment #8
bander2 CreditAttribution: bander2 as a volunteer commented
Just ran across this. I have content with a unique identifier field that follows a couple of different formats (from other organizations, so we don't control the formats). Some look like 012.34.5679 and some look like MEEC-123456. There are probably others.
I would like a user who knows the identifier to be able to search for it and get the result. I can confirm that using the "ignore characters" processor seems to solve the problem. But given the fact that I need to support multiple formats, I am concerned that I will run into a wall eventually.
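As a sanity check of why the processor helps here: as long as the same characters are ignored at both index and query time, both identifier formats normalize to matching tokens. A Python sketch, where the ignored-character set is a guessed configuration that happens to cover both formats:

```python
import re

def normalize(text, ignored='.-'):
    # Strip the ignored characters, then lowercase – mirroring the idea
    # that the "Ignore characters" processor runs on indexed text and
    # on search keys alike.
    return re.sub('[' + re.escape(ignored) + ']', '', text.lower())

indexed = {normalize(t) for t in ('012.34.5679', 'MEEC-123456')}
# A user typing either identifier verbatim now matches the indexed token:
print(normalize('012.34.5679') in indexed)   # True
print(normalize('MEEC-123456') in indexed)   # True
```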
Comment #9
k4v CreditAttribution: k4v as a volunteer commented
I cannot find email addresses in fulltext search with the DB backend, for the same reason: splitKeys() splits them at the "@" and "." characters.
Comment #10
drunken monkey commented
OK, while probably not a large problem, this is definitely annoying for some, and might be a hidden problem for many more, so let's attempt to fix this, hopefully in a way that breaks (or makes the experience worse on) as few sites as possible.
I guess, for determining which strategy to use, we can simply check whether the Tokenizer processor is enabled on the index. This should be good enough in almost all cases.
Patch attached, please test/review!
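The strategy from the comment above – choose the key-splitting behavior based on whether the Tokenizer processor is enabled – can be sketched like this. This is a Python illustration of the logic only (the actual patch is PHP in the DB backend), and it assumes the Tokenizer is configured to split on whitespace only:

```python
import re

def prepare_keys(keys, tokenizer_enabled):
    # If the Tokenizer processor already normalized the indexed text, it
    # reduced separators to spaces, so splitting the search keys on
    # whitespace suffices and special characters survive inside words.
    if tokenizer_enabled:
        return keys.lower().split()
    # Otherwise mirror the aggressive splitting used at index time.
    return [w for w in re.split(r'[^a-z0-9]+', keys.lower()) if w]

print(prepare_keys('e-Daitem', tokenizer_enabled=True))   # ['e-daitem']
print(prepare_keys('e-Daitem', tokenizer_enabled=False))  # ['e', 'daitem']
```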
Comment #11
drunken monkey commented
Could someone please test the patch and confirm this resolves, or at least alleviates, the problem for them?
Comment #12
Jabastin Arul CreditAttribution: Jabastin Arul commented
Thanks for the patch, but it doesn't work for me. The hyphen is still not showing in autocomplete suggestions.
Comment #13
drunken monkey commented
Autocomplete suggestions are another thing entirely. Please test the patch with the normal search, not autocomplete, and create a new issue if you (still) have problems with the latter.
Comment #14
drunken monkey commented
Last chance for testing whether this patch makes things better or worse for you!
Otherwise I’ll just hope for the best and commit.
Comment #15
drunken monkey commented
OK, not quite – still needs tests, I guess. (Tests-only patch is the interdiff.)
Comment #17
borisson_ commented
The test coverage here seems really good. +1. I think the fix is also correct.
Comment #19
drunken monkey commented
Good to hear, thanks a lot for reviewing!
Committed.