Problem/Motivation
The 8.x change "Search removes diacritics in indexing rather than relying on database collation" described in https://www.drupal.org/node/2447357 (based on the issue at https://www.drupal.org/node/731298) is incompatible with several languages. It adds a call to a new function, \Drupal::service('transliteration')->removeDiacritics($text), inside the search_simplify function. This call always runs, both during indexing and during actual searches, and removes all diacritical marks from the input text. Because it happens before hook_search_preprocess implementations have had a chance to affect the text, any preprocessing that relies on accented characters becomes impossible. Stemming, a suggested use case for the hook, is one example.
Some examples of common stemming algorithms that expect input to have accented characters to produce reliable results are the following Snowball stemmers:
- Swedish: cannot replace 'löst' with 'lös' in step 3 when input only has 'lost',
- Danish: similarly, in step 3, cannot replace 'løst' with 'løs',
- Italian: in step 1, 'ità' won't get removed; in step 2, 'erà', 'erò' or 'irà' won't get removed,
- Spanish: all steps rely on the existence of diacritical marks, and
- French: all steps rely on the existence of diacritical marks.
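A minimal sketch of the ordering problem, using the Swedish case from the list above. Both functions are hypothetical stand-ins written for illustration: example_swedish_step3() implements only the single 'löst' rule (the real Snowball stemmer also checks region R1 and runs earlier steps first), and example_remove_diacritics() is a toy replacement, not Drupal's removeDiacritics() implementation.

```php
<?php
// Hypothetical stand-in for one Snowball Swedish step 3 rule: 'löst' -> 'lös'.
// The real stemmer restricts the match to region R1 and runs steps 1 and 2 first.
function example_swedish_step3(string $word): string {
  return preg_replace('/löst$/u', 'lös', $word);
}

// Toy diacritics removal covering only the character used in this example.
// This is NOT Drupal's removeDiacritics() implementation.
function example_remove_diacritics(string $text): string {
  return strtr($text, ['ö' => 'o']);
}

$word = 'hopplöst';

// Stemming sees the original text: the rule matches.
echo example_swedish_step3($word), "\n"; // prints "hopplös"

// Diacritics are removed first, as search_simplify does: the rule no longer matches.
echo example_swedish_step3(example_remove_diacritics($word)), "\n"; // prints "hopplost"
```

In the second call the stemmer only ever sees 'hopplost', so the suffix rule has nothing to match.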
Removing diacritics in the actual search phase also makes the search too greedy, producing results not related to what the user was searching for.
Proposed resolution
There are several possibilities:
- Don't remove diacritical marks.
- Make removing them optional per language (and probably per diacritical mark for best results) as originally planned in the linked issue, with sensible defaults.
- Run the diacritical mark removal function after hook_search_preprocess implementations.
Workaround
It's possible to change the provider of the transliteration service to a custom class. This custom class can extend PhpTransliteration to retain all of its functionality, but have its own removeDiacritics function that does not alter the input text. This function will get called instead of the one in PhpTransliteration.
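A minimal sketch of this workaround, assuming a custom module named mymodule (the module name, class names and file paths are hypothetical). It overrides the class of the transliteration service through a service provider, which Drupal discovers by the naming convention {ModuleName}ServiceProvider in the module's namespace. It has not been tested against every 8.x release.

```php
<?php
// File: mymodule/src/NoDiacriticsRemovalTransliteration.php (hypothetical)

namespace Drupal\mymodule;

use Drupal\Core\Transliteration\PhpTransliteration;

/**
 * Keeps all PhpTransliteration behaviour except diacritics removal.
 */
class NoDiacriticsRemovalTransliteration extends PhpTransliteration {

  /**
   * {@inheritdoc}
   */
  public function removeDiacritics($string) {
    // Return the text unchanged so that hook_search_preprocess
    // implementations still see the accented characters.
    return $string;
  }

}
```

```php
<?php
// File: mymodule/src/MymoduleServiceProvider.php (hypothetical)

namespace Drupal\mymodule;

use Drupal\Core\DependencyInjection\ContainerBuilder;
use Drupal\Core\DependencyInjection\ServiceProviderBase;

/**
 * Swaps the class behind the 'transliteration' service.
 */
class MymoduleServiceProvider extends ServiceProviderBase {

  /**
   * {@inheritdoc}
   */
  public function alter(ContainerBuilder $container) {
    $container->getDefinition('transliteration')
      ->setClass('Drupal\mymodule\NoDiacriticsRemovalTransliteration');
  }

}
```

Note that this changes the service for every caller of removeDiacritics(), not only search, and existing content would need to be reindexed for the search index to reflect the change.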
| Comment | File | Size | Author |
|---|---|---|---|
| #4 | drupal-search-diacritics.PNG | 38.74 KB | ataimist |
Comments
Comment #2
ataimist commented
Comment #3
ataimist commented
Comment #4
ataimist commented
Clarified issue description per discussion in https://www.drupal.org/node/731298#comment-11973067 & added screenshot. Moved search fuzzy matching problem to new issue: https://www.drupal.org/node/2858595
Comment #5
ataimist commented
Comment #18
smustgrave commented
Thank you for reporting this problem. We rely on issue reports like this one to resolve bugs and improve Drupal core.
Since there has been no activity here for over 8 years we are asking if this problem persists on a currently supported version of Drupal. To help, add a comment explaining if the problem still occurs or not. Any extra detail you can provide can help others who experienced this.
Since we need more information to move forward with this issue, the status is now Postponed (maintainer needs more info). If we don't receive additional information to help with the issue, it may be closed after three months.
Thanks!
Comment #20
quietone commented
Another 6 months and no indication if this is still valid. Therefore, I am closing this issue.
If this is incorrect, re-open the issue. Or you can create a new issue and reference this one.