I am setting up Search API for a German-language website. The default stemmer won't be much help for German, but I assumed it wouldn't hurt either. It does, though: it seems that stemming isn't applied at indexing time, only at search time. For example, I have the title "Eine neue WEBinar-Reihe". The following keywords are indexed:

select * from search_api_db_default_index_text;
+------------------+---------------+----------------------------+-------+
| item_id          | field_name    | word                       | score |
+------------------+---------------+----------------------------+-------+
| entity:node/1:de | title         | eine                       |  5000 |
| entity:node/1:de | title         | neue                       |  5000 |
| entity:node/1:de | title         | webinarreihe               |  5000 |
...

But on search, the stemmer transforms the search keywords to "ein", "neue" and "webinarreih". It's cutting the final e from "eine" and "webinarreihe", due to Porter2::step5(). So far that seems to be correct, but if it happens during search it should also happen during indexing.
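To make the mismatch concrete, here is a minimal sketch in Python (not Search API code). The indexed words and the stemmed forms are taken from the report above; the function names and the plain set lookup are hypothetical stand-ins for the real index and stemmer.

```python
# Words stored in the index for the title (left unstemmed, as in the table above).
INDEXED_WORDS = {"eine", "neue", "webinarreihe"}

# Stemmed forms as reported in this issue; this lookup only replays them,
# it is NOT an implementation of Porter2.
REPORTED_STEMS = {
    "eine": "ein",
    "neue": "neue",
    "webinarreihe": "webinarreih",
}


def stem_at_search(keyword: str) -> str:
    """Return the reported stem for a keyword, or the keyword itself."""
    return REPORTED_STEMS.get(keyword, keyword)


def matching_keywords(keywords: list[str], stem_query: bool) -> list[str]:
    """Return the keywords whose (optionally stemmed) form is in the index."""
    hits = []
    for keyword in keywords:
        term = stem_at_search(keyword) if stem_query else keyword
        if term in INDEXED_WORDS:
            hits.append(keyword)
    return hits


keywords = ["eine", "neue", "webinarreihe"]
print(matching_keywords(keywords, stem_query=True))   # stemmed query terms
print(matching_keywords(keywords, stem_query=False))  # unstemmed query terms
```

With stemming applied only to the query, just "neue" still matches, because "ein" and "webinarreih" were never written to the index.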

I disabled stemming, as it's not useful here. But maybe the findings are still helpful. Maybe it's simply related to the language. I don't know whether the default stemmer only works on English content, which would be correct, but then something needs to tell the search that it is running in German and should not invoke the stemmer either (the current website has only German enabled; English is disabled).


Comments

osopolar created an issue.

drunken monkey’s picture

Thanks a lot for reporting this problem!

First off: "I guess it shouldn't hurt" is definitely not true. Stemmers for the wrong language will almost always do more harm than good, since they will cause matches where none should be.

This is also why, at indexing time, we take care to only process English items and leave all others alone.
However, you are right, we forgot to do the same at search time. The attached patch should fix this – with this, enabling the stemmer on a site without English content will have no effect whatsoever (except for a minimal decrease in performance).
Actually, it would probably make sense to add a supportsIndex() implementation to completely hide the processor on sites without English language, but that's a separate issue (#2922742: Problems with French stemming).
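The idea behind the fix can be sketched as follows, in Python rather than the module's PHP. This is only an illustration under stated assumptions: `toy_stem`, `process_item`, `process_query` and the rule for reading the query's language codes are all hypothetical, not the processor's actual API or the committed patch.

```python
from typing import Optional

# Assumption: the stemmer supports English only.
SUPPORTED_LANGCODES = {"en"}


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: drop one trailing 'e'."""
    if len(word) > 3 and word.endswith("e"):
        return word[:-1]
    return word


def process_item(words: list[str], langcode: str) -> list[str]:
    """Index time: stem only items in a supported language."""
    if langcode not in SUPPORTED_LANGCODES:
        return list(words)
    return [toy_stem(word) for word in words]


def process_query(keywords: list[str], langcodes: Optional[list[str]]) -> list[str]:
    """Search time: apply the same language gate as at index time.

    Assumption: a query with no language restriction (None) may hit
    English items, so it is still stemmed.
    """
    if langcodes is not None and not SUPPORTED_LANGCODES.intersection(langcodes):
        return list(keywords)
    return [toy_stem(keyword) for keyword in keywords]


# German item and German query: both left untouched, so they still match.
print(process_item(["eine", "webinarreihe"], "de"))
print(process_query(["eine", "webinarreihe"], ["de"]))
```

The point of the sketch is only that the same language check guards both code paths, so indexed words and query terms stay comparable.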

The last submitted patch, 2: 2922024-2--stemmer_skip_non_english_queries--tests_only.patch, failed testing.

drunken monkey’s picture

Would be great if you could confirm this patch works for you.

osopolar’s picture

Thanks for your fast response, and sorry for not being able to give feedback. Currently I am on a different d7 project, but I will definitely get back to the d7 project with Search API to test this. I'm not sure it will happen this year, though.

borisson_’s picture

Status: Needs review » Reviewed & tested by the community

I like the tests, they are very clear. I have not manually tested this but thumbs up!

drunken monkey’s picture

Status: Reviewed & tested by the community » Fixed

OK, then, good enough. Thanks a lot for reviewing!
Committed.

  • drunken monkey committed 87cb223 on 8.x-1.x
    Issue #2922024 by drunken monkey, borisson_: Fixed Stemmer incorrectly...

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.