Comments

vanyamtv created an issue. See original summary.

vanyamtv’s picture

Title: Why the hypnen "-" is ignored, while used in keywords? » Why the hyphen "-" is ignored, while used in keywords?
drunken monkey’s picture

Category: Bug report » Support request
Status: Active » Postponed (maintainer needs more info)

That depends on a lot of factors:
- Are you using the "Tokenizer" or "Ignore characters" processor in your index?
- What server backend are you using?
- Is the hyphen stand-alone or part of a compound word?

In general, the Search API itself doesn't define any keyword handling on its own. This all depends on the backend and can also be influenced by processors, hooks, etc.

Patrick R.’s picture

Having a problem with this as well. The hyphen in my case is part of a compound word "e-Daitem" (a product name) and I'm using the search_api_db backend without any of the aforementioned processors. By debugging I found out that the "damage" is ultimately done in Drupal\search_api_db\Plugin\search_api\backend\Database::splitIntoWords() and there doesn't seem to be a way to prevent this from happening by configuration.
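The effect Patrick describes can be modeled like this (a Python sketch of the splitting behavior, not the module's actual PHP code — the real splitIntoWords() works differently internally, but the observable result on keywords is the same: every non-alphanumeric character acts as a word boundary and is discarded):

```python
import re

def split_into_words(text):
    """Model of a splitter that treats any non-alphanumeric
    character as a word boundary and throws it away."""
    return [w for w in re.split(r"[^a-zA-Z0-9]+", text.lower()) if w]

# The hyphen in a compound product name is silently dropped,
# so the search keywords no longer contain it:
print(split_into_words("e-Daitem"))  # ['e', 'daitem']
```

This is also why searches for email addresses fail, as noted later in this thread: the "@" and "." are treated as boundaries just like the hyphen.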

I'm also wondering if this (second) process of preparing the search keywords might conflict with some of the parse mode options which can be set in the "search_api_fulltext" views filter plugin. Not sure about that, though.

drunken monkey’s picture

Title: Why the hyphen "-" is ignored, while used in keywords? » Special characters are always ignored in searches on DB backend
Version: 8.x-1.0-rc4 » 8.x-1.x-dev
Component: General code » Database backend
Category: Support request » Bug report
Status: Postponed (maintainer needs more info) » Active

Having a problem with this as well. The hyphen in my case is part of a compound word "e-Daitem" (a product name) and I'm using the search_api_db backend without any of the aforementioned processors. By debugging I found out that the "damage" is ultimately done in Drupal\search_api_db\Plugin\search_api\backend\Database::splitIntoWords() and there doesn't seem to be a way to prevent this from happening by configuration.

Oh, you're right. It indeed seems like there's no way to avoid removing all special characters from the keywords, even though it's possible to get words indexed with them – leading, of course, to no results being found for such keywords.
However, in this example, wouldn't just ignoring the hyphen (via the "Ignore characters" processor) produce the correct result for you? I think it will work correctly in most cases.

However, this doesn't change the fact that this is indeed a bug: at the very least, if it's not possible to have keywords with special characters, it also shouldn't be possible to index words with them.
So, even though this would discard quite some work put into supporting this, the simplest solution here seems to be removing those characters from indexed text, too. (I.e., use splitIntoWords() even for already tokenized text.)
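The "simplest solution" above can be sketched as follows (a hypothetical Python model, not the module's code — function names are made up for illustration). The point is that already-tokenized text gets run through the same splitting as query keywords, so the index can never contain a word that the query side would mangle:

```python
import re

def split_into_words(text):
    """Same aggressive splitting applied to query keywords."""
    return [w for w in re.split(r"[^a-zA-Z0-9]+", text.lower()) if w]

def index_tokens(tokens):
    """Re-split even pre-tokenized text so indexed words never
    contain characters that keyword splitting would strip."""
    out = []
    for token in tokens:
        out.extend(split_into_words(token))
    return out

# A tokenizer may have produced "e-daitem" as one token;
# re-splitting keeps index and query consistent:
print(index_tokens(["e-daitem", "product"]))  # ['e', 'daitem', 'product']
```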

For a more complicated but also more correct/flexible solution (actually taking the tokenizer, etc., settings properly into account) …
… I don't actually know what to do, really. We'd have to remember whether we had to use our custom word-splitting during indexing, or whether we can rely on a tokenizer being present and, in the latter case, only split keywords on space characters (which the tokenizer will produce). This would be pretty complicated for solving such a small problem, and it wouldn't even be completely reliable, since it's possible to enable the tokenizer for only some fields, in which case we'd need even more code to generate a correct query (with two different sets of keywords, depending on the field). (Though that's a general issue for processors and inadvisable in any case – cf. #2859683: Processors don't correctly preprocess keywords per field.)

Not really sure what to do here. Any opinions?

I'm also wondering if this (second) process of preparing the search keywords might conflict with some of the parse mode options which can be set in the "search_api_fulltext" views filter plugin. Not sure about that, though.

Yes, it does, but that's "by design", or at least a "known issue" – due to its inner workings, the DB backend doesn't support phrase searches, so those have to be split into individual words even if the parse mode says otherwise.
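The degradation of phrase searches described here can be sketched like this (a hypothetical Python model; the function name is made up, not taken from the module):

```python
import re

def prepare_keys(keys):
    """DB-backend-style behavior: a quoted phrase cannot be
    matched as a phrase, so it is degraded into individual
    word keywords regardless of the parse mode setting."""
    phrase = keys.strip('"')
    return [w for w in re.split(r"\s+", phrase) if w]

# A phrase query becomes a plain AND-of-words query:
print(prepare_keys('"green tea"'))  # ['green', 'tea']
```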

Patrick R.’s picture

However, in this example, wouldn't just ignoring the hyphen (via the "Ignore characters" processor) produce the correct result for you? I think it will work correctly in most cases.

I tried that out and it seems to work reasonably well, thanks for the idea. It causes some side effects: I now get isolated warnings during indexing caused by the processor, e.g. when there are hyperlinks in HTML text and all special characters such as ":" and "/" get stripped ("An overlong word (more than 50 characters) was encountered while indexing"). But I guess that's not too bad.
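The warning is easy to reproduce conceptually (a Python illustration with a made-up URL; the 50-character limit is the one quoted in the warning message):

```python
# Once ":" and "/" are stripped by an ignore-characters setup,
# a URL collapses into a single very long token that exceeds
# the 50-character indexing limit mentioned in the warning.
url = "https://www.example.com/some/rather/long/path/to/an-interesting-article.html"
token = "".join(ch for ch in url if ch.isalnum())
print(len(token) > 50)  # True
```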

Initially I also got some strange SQL error ("Numeric value out of range: 1264 Out of range value for column 'score' ...") when re-indexing, but that disappeared as soon as I adjusted the processor weights so that the "HTML filter" processor runs before "Ignore characters".

Yes, it does, but that's "by design", or at least a "known issue" – due to its inner workings, the DB backend doesn't support phrase searches, so those have to be split into individual words even if the parse mode says otherwise.

Ah, okay, thanks for the clarification. If there is no way for the search backend to limit the available parse mode options (or to remove the phrase option altogether), maybe that should go into the module's README.txt to prevent confusion. :-)

drunken monkey’s picture

OK, good to hear this is now mostly working for you.
Since it's still a bug, though, I guess I'll leave this open for now to see if anyone else wants to weigh in, or has trouble with this.

bander2’s picture

Just ran across this. I have content with a unique identifier field that follows a couple of different formats (from other organizations, so we don't control the formats). Some look like 012.34.5679 and some look like MEEC-123456. There are probably others.

I would like a user who knows the identifier to be able to search for it and get the result. I can confirm that using the "ignore characters" processor seems to solve the problem. But given the fact that I need to support multiple formats, I am concerned that I will run into a wall eventually.
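The "Ignore characters" workaround amounts to applying the same character stripping on both the indexed values and the entered keywords, so differently formatted identifiers still match. A Python sketch (the ignored-character set here is just an example configuration, not a module default):

```python
IGNORE = set(".-")  # example set of characters configured to be ignored

def normalize(value):
    """Strip ignored characters identically at index time and
    at query time so identifiers match in either format."""
    return "".join(ch for ch in value.lower() if ch not in IGNORE)

print(normalize("MEEC-123456"))  # meec123456
print(normalize("012.34.5679"))  # 012345679
```

As long as every format's separator characters are in the configured set, a user typing the identifier verbatim will match the indexed value; the concern about "hitting a wall" arises when a new format uses a separator that isn't configured yet.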

k4v’s picture

I cannot find email addresses in fulltext search with the DB backend, for the same reason: splitKeys() splits them at the "@" and "." characters.

drunken monkey’s picture

Status: Active » Needs review
FileSize
4.77 KB

OK, while probably not a large problem, this is definitely annoying for some, and might be a hidden problem for many more, so let’s attempt to fix this. Hopefully in a way that breaks (or makes the experience worse on) as few sites as possible.
I guess, for determining which strategy to use, we can simply check whether the Tokenizer processor is enabled on the index. This should be good enough in almost all cases.
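The strategy described here can be sketched as follows (a hypothetical Python model of the patch's idea, not the actual code — function and parameter names are made up):

```python
import re

def split_keys(keys, tokenizer_enabled):
    """If the Tokenizer processor already normalized indexed text
    to space-separated words, only split keywords on whitespace
    (preserving special characters); otherwise fall back to the
    aggressive splitting that mirrors what indexing did."""
    if tokenizer_enabled:
        return [w for w in keys.split() if w]
    return [w for w in re.split(r"[^a-zA-Z0-9]+", keys.lower()) if w]

# With the Tokenizer enabled, the hyphen survives into the query:
print(split_keys("e-Daitem", tokenizer_enabled=True))   # ['e-Daitem']
print(split_keys("e-Daitem", tokenizer_enabled=False))  # ['e', 'daitem']
```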

Patch attached, please test/review!

drunken monkey’s picture

Could someone please test the patch and confirm this resolves, or at least alleviates, the problem for them?

Jabastin Arul’s picture

Thanks for the patch, but it doesn't work for me. The hyphen is still not showing in autocomplete suggestions.

drunken monkey’s picture

Thanks for the patch, but it doesn't work for me. The hyphen is still not showing in autocomplete suggestions.

Autocomplete suggestions are another thing entirely. Please test the patch with the normal search, not autocomplete, and create a new issue if you (still) have problems with the latter.

drunken monkey’s picture

Last chance for testing whether this patch makes things better or worse for you!
Otherwise I’ll just hope for the best and commit.

drunken monkey’s picture

borisson_’s picture

Status: Needs review » Reviewed & tested by the community

The test coverage here seems really good. +1. I think the fix is also correct.

  • drunken monkey committed 29e2039 on 8.x-1.x
    Issue #2873023 by drunken monkey, borisson_: Fixed handling of special...
drunken monkey’s picture

Status: Reviewed & tested by the community » Fixed

Good to hear, thanks a lot for reviewing!
Committed.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.