Compound words
User contribution by mpp, not yet verified in depth by Search API Solr developers but considered useful.
What is partial search
If the user enters "ball" as the query, the search engine will consider a document a match if it contains "volleyball", "beachvolleyball", etc. In other words, a document matches if the search term is contained within a word.
Implementing partial search
A few different approaches exist for doing partial string search with Solr.
DictionaryCompoundWordTokenFilterFactory (recommended)
This filter splits, or decompounds, compound words into individual words using a dictionary of the component words. Each input token is passed through unchanged. If it can also be decompounded into subwords, each subword is also added to the stream at the same logical position.
Notes
- Add and configure DictionaryCompoundWordTokenFilterFactory and a nouns.txt to your field type configuration (see search_api_solr.solr_field_type.text_nl_7_0_0.yml as an example).
- Instead of modifying the default field type configuration that ships with the module, it is recommended to "derive" from it by introducing your own content domain.
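To make the configuration more concrete, here is a sketch of what the resulting analyzer chain could look like in a Solr schema. The field type name, tokenizer choice, and parameter values below are illustrative assumptions, not the exact output of search_api_solr (which generates this from the field type YAML configuration):

```xml
<!-- Illustrative sketch of a field type using the dictionary decompounder.
     Names and parameter values are assumptions; check the generated schema
     for the actual configuration. -->
<fieldType name="text_partial" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- nouns.txt lists component words, one per line,
         e.g. "beach", "volley", "ball". -->
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="nouns.txt"
            minWordSize="5"
            minSubwordSize="2"
            maxSubwordSize="15"
            onlyLongestMatch="false"/>
  </analyzer>
</fieldType>
```

With such a chain, indexing "beachvolleyball" would also emit the subwords found in the dictionary (e.g. "beach", "volley", "ball") at the same position, so a query for "ball" matches the document.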
NGramTokenizerFactory (discouraged)
Reads the field text and generates n-gram tokens of sizes in the given range.
Notes
There are some disadvantages to this approach:
- Index size grows quadratically with word length, since all substrings (n-grams) of each word within the configured size range are stored in the index.
- Since it isn't aware of any dictionary it will result in mismatches, as the generated tokens may be meaningless (e.g. beachvolleyball will be split up into b, be, bea, beac, ...). If you want this behaviour you may want to consider autocompletion instead.
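For comparison, an n-gram based field type could look roughly like the sketch below. The field type name and gram sizes are illustrative assumptions; note that n-grams are typically applied at index time only, with a plain tokenizer at query time:

```xml
<!-- Illustrative sketch of an n-gram field type (discouraged for partial
     search for the reasons listed above). Names and sizes are assumptions. -->
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Emits every substring of length 2..15, hence the index blow-up. -->
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="15"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```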
Add wildcards around keywords with search_api_query_alter_keywords (discouraged)
With search_api_query_alter_keywords one can alter a search query before it gets executed. In this case we'd need to add a '*' wildcard before and after each keyword.
Notes
- You have to use a direct query in order to be able to use wildcards in your query, which may expose you to query injection if the input is not properly filtered.
- The same mismatches will happen as when using an NGram tokenizer.
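As an untested illustration, such an alteration could be sketched with the Drupal 7 Search API query alter hook. The module name and the exact splitting logic are assumptions; the input must be sanitized before being used in a direct query:

```php
/**
 * Sketch (untested) of wrapping keywords in wildcards via
 * hook_search_api_query_alter(). Assumes Drupal 7 Search API and
 * simple string keys; "mymodule" is a placeholder.
 */
function mymodule_search_api_query_alter(SearchApiQueryInterface $query) {
  $keys = $query->getKeys();
  if (is_string($keys) && strlen($keys)) {
    // Wrap each whitespace-separated keyword in '*' wildcards.
    // Requires a direct (unparsed) query, so sanitize $keys first
    // to avoid query injection.
    $wrapped = implode(' ', array_map(function ($word) {
      return '*' . $word . '*';
    }, preg_split('/\s+/', $keys)));
    $query->keys($wrapped);
  }
}
```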