There are a couple of minor problems with the default tokenizer settings for English sites.

The default tokenizer config treats apostrophes as whitespace characters, which means that O'Brien gets indexed as O and Brien (if not too short for the server). This is mostly a problem for names, as contractions usually aren't important words. But it also adds unsearchable or incorrect words to the index: "won't" becomes "won" and "t", so a search for "won't" will also return results for "won".
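To make the splitting concrete, here is a minimal Python sketch of the behaviour described above. It is not the module's code: it assumes the tokenizer simply splits on every character that is not a letter or a digit, and the `min_length` parameter is a hypothetical stand-in for the server's minimum word length.

```python
import re

# Python's re module has no \p{L} / \p{N}, so [\W_] (anything that is not a
# letter or a digit) is used here as an approximation of the whitespace class.
SPACES = re.compile(r"[\W_]+")

def tokenize(text, min_length=1):
    """Split on non-alphanumeric characters and drop tokens that are too short."""
    return [t for t in SPACES.split(text) if len(t) >= min_length]

print(tokenize("O'Brien won't"))                # ['O', 'Brien', 'won', 't']
print(tokenize("O'Brien won't", min_length=3))  # ['Brien', 'won']
```

With a minimum length of 3, the name is only findable as "Brien" and the contraction only as "won".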

Some stopword lists (see #1161676: Stopwords processor) leave apostrophes in contractions (you'll, won't, etc.) and others take them out (youll, wont). The default tokenizer config passes "you" and "ll", and "won" and "t", to the stopwords processor (if not too short for the server). If the apostrophe is added to the ignorable characters list, you get "wont" and "youll". This works with the apostrophe-free lists but not with the lists that include apostrophes. Adjusting the whitespace characters may be a bit much to ask of some users.
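The interaction with the two kinds of stopword list can be sketched like this. Again, this is an illustration under assumptions, not the module's implementation: it assumes ignorable characters are deleted before the text is split, and the two tiny stopword sets are made up for the example.

```python
import re

def tokenize(text, spaces, ignorable):
    """Delete ignorable characters, then split on 'whitespace' characters."""
    text = re.sub(ignorable, "", text)
    return [t for t in re.split(f"(?:{spaces})+", text) if t]

def remove_stopwords(tokens, stopwords):
    return [t for t in tokens if t.lower() not in stopwords]

# Hypothetical two-word stopword lists, one of each style.
WITH_APOSTROPHES = {"you'll", "won't"}
WITHOUT_APOSTROPHES = {"youll", "wont"}

text = "you'll find it won't work"

# Apostrophe treated as whitespace: contractions are split in two.
default_tokens = tokenize(text, spaces=r"[\W_]", ignorable=r"[-]")
# Apostrophe treated as ignorable: contractions collapse into one word.
ignoring_tokens = tokenize(text, spaces=r"[\W_]", ignorable=r"[-']")

print(default_tokens)
print(ignoring_tokens)
print(remove_stopwords(ignoring_tokens, WITHOUT_APOSTROPHES))
print(remove_stopwords(ignoring_tokens, WITH_APOSTROPHES))
```

Only the combination of an ignorable apostrophe and an apostrophe-free list removes the contractions. In the other cases the tokens never match the list entries.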

So I guess the request here is to have defaults that work a little better with English (just because it's Drupal's native language) or to document some better settings for English so site admins can easily change them.

Comments

drunken monkey’s picture

So, am I understanding you correctly: the default tokenizer settings should be changed to add apostrophes to the ignored characters?
Hm, this really makes sense mostly for English, e.g. for French it might even be detrimental … But I guess it won't matter much in various other languages, and having the default work well in English (as long as it's not too specific) probably makes sense.
So if I understand the request correctly, I think we can do it, yes.

awolfey’s picture

I'm not completely sure. A quick survey shows that most lists do leave the apostrophes in stop words. I think that will work with English and not create problems for French. To leave them in, we'd have [^\p{L}\p{N}'] for whitespace characters and leave ignorable characters the way they are, [-]. That would get "O'Brien" into the index. Stopwords files should then list contractions with apostrophes, like "won't, wouldn't, y'all".
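A rough Python sketch of this proposal, for illustration only. Python's `re` lacks `\p{L}` and `\p{N}`, so the pattern below is an approximation of the whitespace class suggested above, and the stopword set is just the three examples from this comment.

```python
import re

# Stand-in for a whitespace class that excludes the apostrophe: any character
# that is not a letter, a digit, or a straight apostrophe splits the text.
SPACES_KEEPING_APOSTROPHES = re.compile(r"(?:[^\w']|_)+")

def tokenize(text):
    return [t for t in SPACES_KEEPING_APOSTROPHES.split(text) if t]

# Stopword list that keeps apostrophes in contractions.
STOPWORDS = {"won't", "wouldn't", "y'all"}

tokens = tokenize("O'Brien won't call, y'all")
kept = [t for t in tokens if t.lower() not in STOPWORDS]

print(tokens)  # ["O'Brien", "won't", 'call', "y'all"]
print(kept)    # ["O'Brien", 'call']
```

Note that this only covers the straight apostrophe. A typographic apostrophe (’) would still be treated as whitespace unless it is added to the class as well.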

drunken monkey’s picture

Status: Active » Needs review
File size: 1.14 KB

Well, it will create problems for French, as e.g. "l'heur" isn't split anymore, which means a search for "heur" wouldn't match anything. But of course, this issue was from the beginning about improving the defaults for English – it's impossible to create a default that will be suitable for every language, anyways. So setting a better default for English, and noting that it might not be suitable for other languages makes sense, I guess.

Patch attached.

drunken monkey’s picture

Status: Needs review » Fixed

Committed.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

cudevdev’s picture

Status: Closed (fixed) » Active

Reopening.

We have the same issue, with basically the same last name: a search for O'Brian doesn't match O'Brian. We are using Solr.

I understand that using the tokenizer would fix this issue, but in another issue thread it was mentioned:

https://www.drupal.org/node/2279121#comment-8904581

Are you maybe using the "Tokenizer" processor?
If so, that should always be disabled when using Solr!

That tells me something needs to be changed in the configuration files that are installed in Solr. (schema.xml, I believe?)

Any help would be very much appreciated!

drunken monkey’s picture

Status: Active » Closed (fixed)

That seems to be a different issue, then. Please open a new one!
Also, I can't reproduce the problem, so please try the latest module versions and state your Solr server version in the new issue.