Compound words

Last updated on 4 April 2022
User contribution by mpp, not yet verified in depth by Search API Solr developers but considered useful.

If the user enters "ball" as the query, the search engine should treat a document as a match if it contains "volleyball", "beachvolleyball", etc. In other words, the search term only needs to be contained somewhere within a word of the document.

Several approaches exist for partial string search with Solr.

DictionaryCompoundWordTokenFilterFactory (advised)

This filter splits, or decompounds, compound words into individual words using a dictionary of the component words. Each input token is passed through unchanged. If it can also be decompounded into subwords, each subword is also added to the stream at the same logical position.
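
The behaviour described above can be illustrated with a small Python sketch. This is not Solr's implementation, only a simplified model of it: the function name decompound and the tiny word list are made up for illustration, and the size parameters and their default values are assumptions loosely modelled on the filter's configuration options.

```python
def decompound(token, dictionary, min_word_size=5,
               min_subword_size=2, max_subword_size=15):
    """Return the original token followed by every dictionary subword in it.

    Simplified model: real filters also track token positions and may
    keep only the longest match, which is not modelled here.
    """
    tokens = [token]  # the input token is always passed through unchanged
    if len(token) < min_word_size:
        return tokens  # short tokens are not decompounded
    lowered = token.lower()
    for start in range(len(lowered) - min_subword_size + 1):
        for size in range(min_subword_size, max_subword_size + 1):
            end = start + size
            if end > len(lowered):
                break
            candidate = lowered[start:end]
            if candidate in dictionary:
                tokens.append(candidate)
    return tokens


# Hypothetical dictionary of component words.
words = {"beach", "volley", "ball", "volleyball"}

print(decompound("beachvolleyball", words))
# ['beachvolleyball', 'beach', 'volley', 'volleyball', 'ball']
```

Because only dictionary words are emitted, a query for "ball" matches the document, while a meaningless fragment such as "chvol" does not.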

Notes

NGramTokenizerFactory (discouraged)

Reads the field text and generates n-gram tokens of sizes in the given range.

Notes

There are some disadvantages to this approach:

  • Its memory use grows quadratically with word length, because it stores every substring of the word (within the configured size range) in the index.
  • Since it isn't aware of a dictionary, it will produce mismatches, as many of the generated n-grams are meaningless (e.g. beachvolleyball will be split into b, be, bea, beac, ...). If you want this behaviour, you may want to consider autocompletion instead.
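
Both disadvantages can be seen in a short Python sketch. It is a simplified model rather than Solr's tokenizer: the function name ngrams is hypothetical, and the order in which the grams are produced is not meant to match what Solr emits.

```python
def ngrams(word, min_gram, max_gram):
    """Return every substring of word whose length is in [min_gram, max_gram]."""
    grams = []
    for size in range(min_gram, max_gram + 1):
        for start in range(len(word) - size + 1):
            grams.append(word[start:start + size])
    return grams


word = "beachvolleyball"  # 15 characters
all_grams = ngrams(word, 1, len(word))

print(len(all_grams))        # 120, i.e. 15 * 16 / 2
print("ball" in all_grams)   # True: the wanted match
print("chvol" in all_grams)  # True: a meaningless match
```

A single 15-character word yields 120 tokens when all sizes are allowed, which is why a narrow gram size range is usually chosen if this approach is used at all.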

Add wildcards around keywords with search_api_query_alter_keywords (discouraged)

With search_api_query_alter_keywords one can alter a search query before it gets executed. In this case we'd need to add a '*' wildcard before and after each keyword.

Notes

  • You have to use a direct query in order to be able to use tokens in your query, which may lead to query injection if the input is not properly filtered.
  • The same mismatches will happen as when using an NGram tokenizer.
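
The keyword rewriting, including the filtering needed to avoid query injection, could look roughly like the following. It is written in Python purely to illustrate the logic; an actual Drupal implementation would live in PHP. The function names are hypothetical, and the list of escaped characters is an assumption based on the special characters of the Lucene query syntax, so it should be checked against the Solr version in use.

```python
# Characters treated as special by the query parser (assumed list).
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\/')


def escape_term(term):
    """Backslash-escape special characters so user input stays a plain term."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in term)


def wrap_in_wildcards(keywords):
    """Escape each whitespace-separated keyword and surround it with '*'."""
    return " ".join("*" + escape_term(word) + "*" for word in keywords.split())


print(wrap_in_wildcards("ball"))        # *ball*
print(wrap_in_wildcards("beach ball"))  # *beach* *ball*
print(wrap_in_wildcards("title:ball"))  # *title\:ball*
```

Escaping happens before the wildcards are added, so a '*' or ':' typed by the user is matched literally instead of being interpreted as query syntax.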
