Problem/Motivation

I’ve recently installed Search API with the DB back end, and had a problem where searching for ‘building’ returned no results, even though there are nodes with that word in their title.

It turned out that Stemmer was silently skipping those nodes because their language is set to ‘und’, so their text was indexed unstemmed. At search time, however, Stemmer reduced the keyword 'building' from the Views filter to 'build', which could not match the unstemmed 'building' stored in the index, so there were no results. It’s the same problem described at https://www.drupal.org/project/search_api/issues/2954334: the content on our site was migrated from D7 and doesn’t have the right language set.
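To make the mismatch concrete, here is a minimal Python sketch of it. This is not Search API code: `toy_stem`, `index_tokens` and `search` are hypothetical helpers, and the suffix stripping is only a crude stand-in for a real English stemmer.

```python
def toy_stem(word):
    """Crude stand-in for an English stemmer (assumption, not the real algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_tokens(title, langcode):
    """Return the tokens stored in the index; only English items are stemmed."""
    tokens = title.lower().split()
    if langcode != "en":
        # Non-English items (including 'und') are skipped silently.
        return set(tokens)
    return {toy_stem(token) for token in tokens}


def search(keyword, index):
    """Search keywords are always stemmed before being matched against the index."""
    stemmed = toy_stem(keyword.lower())
    return sorted(item_id for item_id, tokens in index.items() if stemmed in tokens)


index = {
    "node/1": index_tokens("Building regulations", "und"),  # stored as 'building'
    "node/2": index_tokens("Building regulations", "en"),   # stored as 'build'
}
print(search("building", index))  # ['node/2']
```

The ‘und’ node has the same title but is never found, because the keyword is stemmed while the indexed text is not.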

Please could the Stemmer processor log a warning message when it skips processing an item because that item’s language isn’t English? That would have helped me track down the problem much faster, and as there are still many D7 sites to be migrated I think it will help other users in future.

Thank you for all your work on this module.

Comments

davidhk created an issue.

drunken monkey

Component: General code » Plugins
Status: Active » Needs review
Attachments: 2 new files (2.02 KB, 5.04 KB)

Thanks for reporting this issue!

In general, my advice is always to fix the underlying data: if you know that the content is English, then it should also be marked as English in the database. That way, you’ll save yourself a lot of problems down the line. However, it’s true that people regularly report such problems, so it does seem like something we might want to address.

I have two suggested patches. The first, following your suggestion, just logs a notice (a warning seems too strong) for any items that are skipped because their language is unspecified. (If they have a different language, I don’t think we should log anything.) Hopefully, users will notice these and investigate.

The other patch would instead make this behavior configurable. Not only would that allow people with und language content to have that content still stemmed (if they just know that it will be English), the mere presence of the option on the configuration page might also alert users to this behavior.
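As a rough illustration of the two ideas so far (logging skipped items, or making the behavior configurable), here is a minimal Python sketch. It is not the code from either patch: `items_to_stem` and the `stem_unspecified` flag are hypothetical names, and since Python's `logging` module has no "notice" level, `info` stands in for it.

```python
import logging

logger = logging.getLogger("stemmer_sketch")

UNSPECIFIED = "und"


def items_to_stem(items, stem_unspecified=False):
    """Return the IDs of items that would be stemmed.

    `items` maps an item ID to its language code. `stem_unspecified` is a
    hypothetical setting that treats 'und' content as English.
    """
    selected = []
    for item_id, langcode in items.items():
        if langcode == "en" or (stem_unspecified and langcode == UNSPECIFIED):
            selected.append(item_id)
        elif langcode == UNSPECIFIED:
            # Only unspecified-language items are reported; other languages
            # are skipped without a log entry.
            logger.info("Skipped stemming of %s: language is unspecified", item_id)
    return selected


items = {"node/1": "en", "node/2": "und", "node/3": "de"}
print(items_to_stem(items))
print(items_to_stem(items, stem_unspecified=True))
```

With the flag off, only the English item is selected and the ‘und’ item produces a log entry; with it on, the ‘und’ item is stemmed as well.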

Now that I think more about it, there seems to be a third option: adapt the processor to process all items when English is the only language on the site. In that case, I think we can pretty safely assume that all text will be English. (And where it is not, we are still better off stemming it anyway, for the sake of consistency – as all search keywords will be stemmed, there’d be a lot of mismatches otherwise.)
This should also solve the problem for a lot of people – usually, when you have a lot of content with language und, I guess you’d be on a single-language site.
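The decision rule behind this third option can be sketched in a few lines of Python. This only illustrates the idea and is not the processor's actual code: `should_stem` is a hypothetical helper, and `site_langcodes` is assumed to be the list of languages configured on the site.

```python
def should_stem(item_langcode, site_langcodes):
    """Decide whether an item's text is stemmed (sketch of the third option)."""
    if item_langcode == "en":
        return True
    # Assumption: if English is the only language on the site, all content
    # (including 'und') is treated as English and stemmed.
    return set(site_langcodes) == {"en"}


print(should_stem("und", ["en"]))        # True
print(should_stem("und", ["en", "de"]))  # False
```

On a multilingual site the previous behavior is kept, so only items explicitly marked as English are stemmed.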

What would you say is the best solution?

The last submitted patch, 2: 3267092-2--stemmer_skipping_und_log_notice.patch, failed testing.

drunken monkey

Attachments: 1 new file (7.96 KB)

Thinking further about it, I’m more and more convinced that the third option makes most sense. It is therefore implemented in the attached patch – please test/review!

drunken monkey

Would be great to get this patch tested/reviewed so I can commit it. It does seem like a nice improvement for inexperienced users.

  • drunken monkey committed 1205ca43 on 8.x-1.x
    Issue #3267092 by drunken monkey: Fixed stemming of content with invalid...
drunken monkey

Status: Needs review » Fixed

I decided to just merge this now, as I’m pretty confident that this will actually be an improvement.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.