Processors

Last updated on 22 May 2023

Processors are plugins which you can enable for an index to change the indexed data or the search queries on it in some way. Processors are very versatile and can have many different effects on an index, but they generally operate in three phases of the search process, also listed in the middle of the "Processors" tab (where you can re-order the enabled processors for each phase):

Preprocess index

In this phase, the processor manipulates the indexed field data before it is sent to the server. Usually, this means that text data is pre-processed in some way – e.g., by removing HTML markup, by changing the text to lower case so the search will be case-insensitive, by breaking text into individual words, etc.

Note, though, that the preprocessing of indexed content will also influence returned facet values (in most cases), so expect these to be lowercase, or missing ignored characters, too.
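
To see why facet values change, consider a minimal Python sketch. This is not Search API code: the function name, the ignored character and the sample values are made up for illustration. The point is only that facet counts are built from the preprocessed values, not the original ones.

```python
from collections import Counter


def preprocess(value, ignored_chars="-"):
    """Hypothetical index-time preprocessing: lowercase, then drop ignored characters."""
    return "".join(ch for ch in value.lower() if ch not in ignored_chars)


# Made-up field values as they exist on the site.
raw_values = ["Sci-Fi", "sci-fi", "Drama"]

# The server only ever sees the preprocessed values, so facets are built from those.
facet_counts = Counter(preprocess(v) for v in raw_values)
print(dict(facet_counts))  # {'scifi': 2, 'drama': 1}
```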

Preprocess query

In many cases, when manipulating indexed data, the same manipulation needs to occur when searching. For example, when all indexed text was lowercased, the search keywords also all need to be lowercase to be able to match indexed content. However, any other kind of manipulation of the search query is possible here, too.

Postprocess query

This is probably the rarest phase for processors to use, postprocessing the query after it has run to manipulate the search results in some way. A prominent example would be the "Highlight" processor, which highlights the search keywords in the returned results. But this phase could also be used, for example, to re-run the query completely with some changes in case no results were found the first time.
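
The three phases can be pictured with a small, self-contained Python sketch. It is only a model of the idea, under the assumption of a toy in-memory "server". The class and method names (Processor, preprocess_index, preprocess_query, postprocess_results) are hypothetical and are not the Search API plugin interface.

```python
class Processor:
    """Hypothetical base class; every phase is a no-op unless overridden."""

    def preprocess_index(self, items):
        return items

    def preprocess_query(self, keywords):
        return keywords

    def postprocess_results(self, results, keywords):
        return results


class Lowercase(Processor):
    """Uses both preprocessing phases so index and query stay comparable."""

    def preprocess_index(self, items):
        return [{name: value.lower() for name, value in item.items()} for item in items]

    def preprocess_query(self, keywords):
        return [keyword.lower() for keyword in keywords]


class MatchInfo(Processor):
    """Uses the postprocessing phase to annotate each result with the matched keywords."""

    def postprocess_results(self, results, keywords):
        return [
            dict(result, matched=[k for k in keywords if k in result["title"].split()])
            for result in results
        ]


def build_index(items, processors):
    for processor in processors:
        items = processor.preprocess_index(items)
    return items  # stand-in for "sent to the server"


def search(index, keywords, processors):
    for processor in processors:
        keywords = processor.preprocess_query(keywords)
    # Toy matching: an item matches if any keyword equals one of its title words.
    results = [item for item in index if any(k in item["title"].split() for k in keywords)]
    for processor in processors:
        results = processor.postprocess_results(results, keywords)
    return results


processors = [Lowercase(), MatchInfo()]
index = build_index([{"title": "Hello World"}, {"title": "Goodbye"}], processors)
print(search(index, ["HELLO"], processors))
# [{'title': 'hello world', 'matched': ['hello']}]
```

Without the query-side lowercasing, the keyword "HELLO" would not match the lowercased index, which is why index and query preprocessing usually come in pairs.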

Processors also have some other capabilities not explicitly listed as a "stage" on this page:

Alter indexed items: This step comes before "Preprocess index" and is usually used for preventing certain items from being indexed completely. For instance, the "Entity status" processor can be used to keep unpublished content from being indexed.
Add properties: Processors can also add new properties that will become available for indexing (or even force them to always be indexed). For example, all properties listed under "General" when adding fields (that is, the datasource-independent properties) are defined by processors internally.
Arbitrary changes to the index: Finally, processors are able to change the index upon saving – for example, to ensure certain fields that they need are indexed.

Processors included in the Search API module

The following is a list of all processor plugins included in the Search API module itself, along with short descriptions. Contrib modules might define additional processors. (Note: If you know of a contrib module defining additional processor(s), please feel free to link to their documentation from here as "Related content"! That way, this page becomes an even better resource for looking up processor documentation.)

Note that not all processors might be available for a certain index. This is usually based on the index's datasources. For instance, the Role filter is only available for indexes that contain user entities (usually in the form of the "User" datasource). Furthermore, backend plugins like Apache Solr might hide some processors that aren't recommended for use with that backend.

In general, if you use a specialized backend such as Solr or OpenSearch, you should use your backend’s built-in processors/filters instead of Search API’s processors.

Content access

Adds access checks to searches on this index, regarding the "Content" and "Comment" entity types. When this processor is enabled on an index, appropriate filters will automatically be added to all searches so that they only return results that the current user is allowed to view. Some searches (e.g., search views) provide the option to override this behaviour on a per-search basis, though. Check the corresponding module's documentation for details. Also note that no access checks will be performed on entities of other types, and some contrib modules might implement content access mechanisms not compatible with this processor (which might then be ignored). If you use access modules not provided in Core, please make sure after setup that access checks work as expected and no information leaks occur in searches.

You should also be aware that these access checks are solely based on the indexed data. If content is edited in a way that changes its access permissions (e.g., by being unpublished), this change will only take effect once the node is indexed in its latest state. This means there can be a gap between changing the node and the update of the access checks on search results. During that time, depending on the data displayed for search results, users could see data that should not be accessible to them. If you need to avoid that, use the index's Index items immediately option.

In addition to this processor, search views by default also manually check the access information of all items returned in the search results. These additional checks will not be properly reflected by things like facets or the pager, so this should only be used as a last resort. If you are sure that you have set up the search in a way that will only return results which the user can access, you can disable these additional checks by enabling the "Skip item access checks" option in the view's "Query settings". When checking this option, you also have the option to "Bypass access checks" altogether – this will disable the checks done by this processor for this view. Use only if you're certain this is what you want to do!

Entity status

This provides a simpler way to handle access, for sites that only use the "Published" state of content to determine access. The processor will simply exclude all unpublished content and comments, and all inactive users from being indexed at all. The same restrictions as for the "Content access" processor apply, though, regarding the need to Index items immediately to avoid information leaks for recently unpublished content.
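
As a rough illustration of what "excluding items from being indexed" means, the following Python sketch filters a batch of items before indexing. The dictionaries and the "status" key are assumptions made for this example (truthy meaning published content or an active user); this is not the processor's actual code.

```python
def is_indexable(item):
    """Keep items whose (assumed) "status" flag is truthy; items without one are kept."""
    return bool(item.get("status", True))


def alter_indexed_items(items):
    """Drop unpublished/inactive items so they never reach the index."""
    return [item for item in items if is_indexable(item)]


batch = [
    {"id": "node/1", "status": True},   # published content
    {"id": "node/2", "status": False},  # unpublished content
    {"id": "user/7", "status": False},  # blocked (inactive) user
    {"id": "file/3"},                   # no status concept in this sketch
]
print([item["id"] for item in alter_indexed_items(batch)])  # ['node/1', 'file/3']
```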

Highlight

This processor will highlight matched search keywords in the search results and also provide an excerpt for display on search results pages. Whether highlighted fields data will be used depends on the module providing the search page. Refer to the issue queue of that module if this is not working properly. It works well with Solr, though you need to configure it carefully for good performance.
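
The general idea of highlighting and excerpt creation can be sketched in a few lines of Python. This is a simplified illustration, not the processor's implementation: the function names, the "<strong>" wrapper and the "radius" parameter are assumptions of this example, and HTML escaping is ignored.

```python
import re


def _keyword_pattern(keywords):
    """Build a case-insensitive whole-word pattern, or None if there are no keywords."""
    if not keywords:
        return None
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def highlight(text, keywords, prefix="<strong>", suffix="</strong>"):
    """Wrap every whole-word keyword occurrence in prefix/suffix."""
    pattern = _keyword_pattern(keywords)
    if pattern is None:
        return text
    return pattern.sub(lambda m: prefix + m.group(0) + suffix, text)


def excerpt(text, keywords, radius=20):
    """Return a highlighted snippet of up to `radius` characters around the first match."""
    pattern = _keyword_pattern(keywords)
    match = pattern.search(text) if pattern else None
    if match is None:
        return text[: 2 * radius]  # no match: fall back to the start of the text
    start = max(0, match.start() - radius)
    end = min(len(text), match.end() + radius)
    snippet = text[start:end]
    if start > 0:
        snippet = "..." + snippet
    if end < len(text):
        snippet += "..."
    return highlight(snippet, keywords)


print(highlight("The Dog barks", ["dog"]))  # The <strong>Dog</strong> barks
```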

HTML filter

Strips HTML tags from the selected fields and decodes HTML entities. If you are indexing HTML content (like the "Body" field of content) and the search server doesn't handle HTML on its own, this should be activated to avoid indexing HTML tags, as well as to give, for instance, words appearing in a header a higher boost.
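
A minimal Python sketch of the two effects described here (removing markup with entity decoding, and boosting words by their enclosing tag) might look as follows. It uses only the standard library. The boost values and the (word, boost) output format are assumptions for illustration and do not reflect how Search API stores boosts.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects (word, boost) pairs; the boost comes from the enclosing tags."""

    def __init__(self, tag_boosts):
        super().__init__(convert_charrefs=True)  # entities arrive already decoded
        self.tag_boosts = tag_boosts
        self.stack = []   # currently open tags
        self.tokens = []  # collected (word, boost) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching opening tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the highest boost among the open tags; 1.0 for unconfigured tags.
        boost = max([self.tag_boosts.get(t, 1.0) for t in self.stack], default=1.0)
        for word in data.split():
            self.tokens.append((word, boost))


def html_filter(html, tag_boosts=None):
    """Strip tags, decode entities and return the words with their boosts."""
    parser = _TextExtractor(tag_boosts or {})
    parser.feed(html)
    parser.close()
    return parser.tokens


print(html_filter("<h2>Big &amp; bold</h2><p>plain text</p>", {"h2": 3.0}))
# [('Big', 3.0), ('&', 3.0), ('bold', 3.0), ('plain', 1.0), ('text', 1.0)]
```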

Ignore case

Makes fulltext searches, string filters and sorts (on the selected fields) case-insensitive by lowercasing all indexed field values and search keywords.

Ignore characters

Allows you to remove certain characters from indexed field values and search keywords.

Index hierarchy

This processor is mainly used in conjunction with hierarchical taxonomy vocabularies. If you have such a vocabulary, you usually want searches for a high-level category to also return results tagged with lower-level terms – for instance, filtering for "Europe" should also return content from "Denmark". This processor will facilitate this behavior by indexing, for every encountered taxonomy term, all its parent/ancestor terms, too.

This also works for fields of other types that reference entities of the same type. If you have such a setup and want hierarchy functionality for that, too, you can also use this processor.
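
The effect on the indexed data can be sketched in Python as follows. The "parents" mapping and the function names are hypothetical and only serve to illustrate the idea of indexing every ancestor along with the term itself.

```python
def with_ancestors(term, parents):
    """Return the term followed by all of its ancestors (each listed once)."""
    collected = []
    pending = [term]
    while pending:
        current = pending.pop(0)
        if current in collected:
            continue  # guards against cycles and shared ancestors
        collected.append(current)
        pending.extend(parents.get(current, []))
    return collected


def index_terms(tagged_terms, parents):
    """Expand the terms an item is tagged with into the values that get indexed."""
    indexed = []
    for term in tagged_terms:
        for value in with_ancestors(term, parents):
            if value not in indexed:
                indexed.append(value)
    return indexed


# Made-up hierarchy: Denmark is a child of Europe, which is a child of World.
parents = {"Denmark": ["Europe"], "Europe": ["World"]}
print(index_terms(["Denmark"], parents))  # ['Denmark', 'Europe', 'World']
```

Because "Europe" is now among the indexed values of content tagged only with "Denmark", a filter on "Europe" matches that content without any special handling at search time.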

Role filter

When indexing users, this processor allows you to exclude users with certain roles from being indexed. For instance, you might not want to include the "Anonymous" user in the index.

Stemmer

"Stemming" in this context refers to bringing indexed words to their "stem". This ensures, for example, that a search for "walk" will also return results containing "walking". For more information please read the Wikipedia article. This processor implements stemming only for the English language, at the moment, using the "Porter2" stemming algorithm. Support for other languages might be added in the future.

Stopwords

Keeps certain (configured) words from being indexed, usually very common words that don't add much meaning. This can be used to make matching and scoring more accurate, and also improve performance (by keeping the fulltext index smaller). For best results, this should be used alongside (and after) "Tokenizer".
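
The following Python sketch shows the basic idea and why the order relative to tokenization matters. The function name and the case-insensitive comparison are assumptions of this example, not a description of the processor's exact behavior.

```python
def remove_stopwords(tokens, stopwords):
    """Drop every token that equals a configured stopword (compared case-insensitively)."""
    stop = {word.lower() for word in stopwords}
    return [token for token in tokens if token.lower() not in stop]


stopwords = ["the", "a", "of"]

# Already tokenized text: stopwords are removed word by word.
print(remove_stopwords(["The", "history", "of", "Denmark"], stopwords))
# ['history', 'Denmark']

# Untokenized text: the whole string is one "token", so nothing matches.
print(remove_stopwords(["The history of Denmark"], stopwords))
# ['The history of Denmark']
```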

Tokenizer

Splits indexed text into individual words. As dedicated search backends, like Apache Solr or OpenSearch, typically do a very good job in this regard, it is mainly meant for use with the Database backend, for which it offers more control over the tokenization process.

The processor works in the following way when indexing a piece of text (say, a node’s body):

If enabled, rudimentary CJK handling is applied.
Numbers only separated by punctuation (like dates, telephone numbers, etc.) are merged to a single string of digits, such that it is possible to find them even when formatted in a slightly different way.
The configured “ignored characters” are handled, if any: occurrences of two or more consecutive “ignored characters” are replaced by spaces, then all remaining ones are removed from the text.
The text is then split into tokens, taking the configured “whitespace characters” as the separators.
Finally, all tokens that are shorter than the configured “minimum word length” are removed.

At search time, the keywords entered by the user are processed in a similar manner to ensure they match as expected.
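
The steps listed above can be approximated with a short Python sketch. It is an illustration only: CJK handling is left out, and the function name, the default character sets and the regular expressions are assumptions of this example, not the processor's actual configuration or code.

```python
import re


def _char_class(chars):
    """Build a regex character class matching any of the given characters."""
    return "[" + re.escape(chars) + "]"


def tokenize(text, ignored="'-.", whitespace=" \t\n", min_length=3):
    # Step 1 (rudimentary CJK handling) is omitted in this sketch.

    # Step 2: merge digits that are only separated by punctuation.
    text = re.sub(r"(?<=\d)[^\w\s]+(?=\d)", "", text)

    # Step 3: runs of two or more ignored characters become a space,
    # then every remaining ignored character is removed.
    if ignored:
        ignored_class = _char_class(ignored)
        text = re.sub(ignored_class + "{2,}", " ", text)
        text = re.sub(ignored_class, "", text)

    # Step 4: split on the configured whitespace characters.
    if whitespace:
        tokens = [t for t in re.split(_char_class(whitespace) + "+", text) if t]
    else:
        tokens = text.split()

    # Step 5: drop tokens shorter than the minimum word length.
    return [t for t in tokens if len(t) >= min_length]


print(tokenize("Call 555-1234 on 2023-05-22... it's a re-run"))
# ['Call', '5551234', '20230522', 'its', 'rerun']
```

Running the same function on a search keyword such as "555.1234" yields "5551234" as well, which is how differently formatted numbers can still match.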

Transliteration

Improves indexing and searching of non-English content with special characters (accents, umlauts, or completely different alphabets) by transliterating the complete text to the standard English alphabet. Depending on your site and use case, this might or might not be what you want – please investigate the effects carefully before settling on its use.
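
To get a feeling for the effect, here is a small Python sketch that strips accents using Unicode decomposition. It is a deliberately limited stand-in (the function name is made up): it handles accents and umlauts only, not other alphabets, and a real transliteration component covers far more.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Crème brûlée"))  # Creme brulee
print(strip_accents("Müller"))        # Muller
```

The second example hints at why the results need checking: "Müller" becomes "Muller" here, so a search for the common German spelling "Mueller" would still not match.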

There are also three other processors which are hidden from the UI and always active, which provide additional properties for indexing.

Aggregated fields ("Aggregated field" property)

With the "Aggregated field" property you can create fields that combine the contents of multiple properties – either by aggregating them to a single value (for instance, the sum of all values, or the first value) or by just including the value of all of them ("Union" type). Since this works for fields across multiple datasources, this can for example be used to be able to sort on the item labels, by providing a field that contains the item label regardless of datasource.
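
A simplified Python sketch of the three aggregation types mentioned above is given below. Items are plain dictionaries and the function and property names are hypothetical; the actual property is configured in the UI and supports more options than shown here.

```python
def aggregate(values, mode):
    """Combine a list of values according to the aggregation type."""
    if mode == "union":
        return list(values)       # keep every value
    if mode == "first":
        return values[:1]         # first available value, if any
    if mode == "sum":
        return [sum(values)]      # single numeric value
    raise ValueError("unsupported aggregation type: " + mode)


def aggregated_field(item, properties, mode):
    """Collect the values of the given properties from one item and aggregate them."""
    values = []
    for prop in properties:
        value = item.get(prop)
        if value is None:
            continue  # property does not exist for this item's datasource
        values.extend(value if isinstance(value, list) else [value])
    return aggregate(values, mode)


# A common "label" field across datasources: content has a title, users have a name.
items = [{"title": "Zebra facts"}, {"name": "alice"}]
print([aggregated_field(item, ["title", "name"], "first") for item in items])
# [['Zebra facts'], ['alice']]
```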

Rendered item ("Rendered HTML output" property)

This provides a text property containing the complete entity (or item), rendered in the selected view mode. With a correctly configured view mode (no field labels or other presentation text) this can be used to easily create a field with the item's complete text contents, but also (for instance) to include the result output in the index already prepared for display (with a backend and search adequately configured – this is not trivial to set up).

Note that not all datasources support viewing of their items, so the property might be empty for some datasources. This will be apparent when configuring the property, though.

URL field ("URI" property)

This provides the item's URL (if any – might be empty) for indexing.

Use only Solr processors with Search API Solr

If you are using a Search API Solr server, you should primarily use the Solr processors.

The "Processors" page (/admin/config/search/search-api/index/my_index/processors) lists the built-in processor plugins of the Search API (and those added by other modules) that can influence indexing and/or searching. Some of these, like "Ignore case" or "Tokenizer", offer very common functionality that almost every search engine needs.

However, Apache Solr, being a full-fledged search engine, already provides this functionality in a much better, faster and more flexible way than the Search API processors do. To avoid conflicts, it is therefore strongly recommended to disable the Search API processors that would interfere with this built-in functionality.

Please disable those processors (unless you are sure you want them) and let Solr handle these things for you. This is Solr's default behavior, so you most likely won’t have to change anything there.

These are the processors which should be disabled, if Search API Solr is used. They will have the text "It is recommended not to use this processor with the selected server." under them if used with a Search API Solr server:

Ignore case
Stemmer
Stopwords
Tokenizer
Transliteration

It is not possible to enable these processors on an index that uses a Search API Solr server, since the check-boxes are disabled. However, if you enabled them while the index used a different Search API server and then switched the same index over to a Search API Solr server, they will still be enabled and should be disabled.
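
If you want to audit an index after such a switch, the check itself is simple. The following Python sketch only illustrates the logic: the processor labels are the five listed above, while the function name, the "solr" backend string and the way the enabled processors are obtained are assumptions of this example.

```python
# Labels of the processors this page recommends disabling with Search API Solr.
DISCOURAGED_WITH_SOLR = {"Ignore case", "Stemmer", "Stopwords", "Tokenizer", "Transliteration"}


def processors_to_disable(enabled_processors, backend):
    """Return the enabled processors that should be switched off for this backend."""
    if backend != "solr":
        return []
    return sorted(p for p in enabled_processors if p in DISCOURAGED_WITH_SOLR)


# Hypothetical index that was moved from another server to a Solr server.
enabled = ["Content access", "Highlight", "Tokenizer", "Ignore case"]
print(processors_to_disable(enabled, "solr"))  # ['Ignore case', 'Tokenizer']
```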

Additional processor information for the Search API Solr module (such as Boost by date) is documented under Search API Solr HowTos.