What is indexing
From text to lexemes
In this chapter I will explain what "indexing" is in a PostgreSQL full text search setting and how this module goes about creating such an "FTS index".
As said on the introduction page, a full text solution must convert your passages of text (field data in our case) into something else before it can perform full text searches. In PostgreSQL this "something else" is called a tsvector - a data type specially designed to hold processed lexemes.
What are lexemes?
Full text searching is all about mimicking "natural language" lookups: looking for a word in a text the same way a human (language-trained) brain would. For instance, if you search a piece of text for the word "child", your brain might be satisfied when it finds the word "kid". For us this is a natural language "match", but for a computer these are totally different words with no relation to each other.
In order to go from "child" to "kid" we have to convert our "normal" text into a tsvector data type. The system has to atomize each word - convert it to its most basic natural form, called a lexeme. When you search a tsvector for certain words (using a tsquery data type, the query counterpart of the tsvector), each word entered in your search phrase will also be converted to lexemes. Then each lexeme in your tsquery is compared to the lexemes in your tsvector.
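To see this matching in action, here is a small illustration using PostgreSQL's built-in "english" configuration (which stems words, but note that it ships without a synonym dictionary):

```sql
-- Convert text to a tsvector: stop words are dropped, the rest become lexemes
SELECT to_tsvector('english', 'The kids are playing outside');
-- 'kid':2 'outsid':5 'play':4

-- The @@ operator matches a tsquery against a tsvector
SELECT to_tsvector('english', 'The kids are playing outside')
       @@ to_tsquery('english', 'plays');
-- true, because both "playing" and "plays" stem to the lexeme "play"
```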
Okay, so lexemes are the most atomic form of a natural word, but you still have not explained how to go from "child" to "kid" ...
Yes, that's correct ... to be able to convert found words to their atomic form, PostgreSQL uses a set of dictionaries to decide how each word should be converted. In a typical dictionary chain you have several dictionaries, each with their own purpose:
- Stop word dictionary: Removes common stop words such as "a", "the", "we", "us", ...
- Synonym dictionary: Finds synonyms for single words.
- Stemming dictionary: Stems words to their basic form.
- Thesaurus dictionary: Finds synonyms for whole phrases.
- Morphing dictionary: Finds similar words across different writing styles.
- ...
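If you want to see which dictionary in a chain handles each word, PostgreSQL's ts_debug() function exposes the whole parsing pipeline. For example, with the built-in "english" configuration:

```sql
-- For each token: its type, the dictionaries consulted and the emitted lexemes
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('english', 'The kids are playing');
-- Stop words such as "The" and "are" yield an empty lexeme list;
-- "kids" and "playing" are stemmed to {kid} and {play} by english_stem.
```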
In our case, we want to search for the word "child" in a text that has the word "kid" in it. When the system indexes that text, it would hit the synonym dictionary and the word "kid" would get converted into "child", as this is the main synonym (according to that dictionary). We can thus say the lexeme "child" is proposed and stored in the tsvector.
Then, likewise, when we enter the search keyword "child", which will be handled by a tsquery, it undergoes the same conversion sequence. In this case, however, the synonym dictionary sees that "child" is already the main synonym and leaves it alone, so for the tsquery the lexeme "child" is proposed as well.
You now have a match between "child" (in the tsvector) and "child" (in the tsquery). That's how we can search for "child" and still get a match on "kid".
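The built-in configurations do not ship with a kid/child mapping, but you can set one up yourself with a synonym dictionary. A sketch (the dictionary, configuration and file names below are illustrative; the synonym file lives in PostgreSQL's $SHAREDIR/tsearch_data/ directory):

```sql
-- Assumes a file my_synonyms.syn containing the line:  kid child
CREATE TEXT SEARCH DICTIONARY my_synonyms (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms
);

-- Clone the english configuration and put the synonym
-- dictionary in front of the stemmer for plain words
CREATE TEXT SEARCH CONFIGURATION my_english (COPY = english);
ALTER TEXT SEARCH CONFIGURATION my_english
    ALTER MAPPING FOR asciiword
    WITH my_synonyms, english_stem;

-- Searching for "child" now matches a text containing "kid"
SELECT to_tsvector('my_english', 'The kid plays')
       @@ to_tsquery('my_english', 'child');
-- true
```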
Note: We call it a dictionary chain because the dictionaries are consulted in order: as soon as one produces a result (i.e. "emits a lexeme"), further processing stops.
PS: If you wish to know even more about the inner workings of this awesome PostgreSQL feature, feel free to check out the three-chapter series I wrote about the subject.
How does this module index?
Now that you have a basic idea of how PostgreSQL goes from text to lexemes, let us see how this module creates these "FTS indexes" under the hood.
When you first create an FTS index for a field, you must choose which columns you wish to index (e.g. value, summary, ...) and which of the field's instances (see chapter: How to create an index).
Once chosen, PostgreSQL will generate a new table for this field/column(s)/instance(s) combination. This table holds both the same primary keys as the field's data table and a tsvector column for each chosen field column.
Finally, a database constraint will be set up between the field's data table and the FTS index table to manage cascading deletes: if you delete a field entry, the FTS index will be updated accordingly.
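In simplified form, the generated FTS index table could look something like the sketch below. All table and column names here are made up for illustration - the actual names (and key columns) are generated by the module:

```sql
-- Hypothetical index table for a "body" field with its "value" column chosen
CREATE TABLE fts_index_body (
    entity_id   integer NOT NULL,
    revision_id integer NOT NULL,
    body_value  tsvector,            -- the processed lexemes
    PRIMARY KEY (entity_id, revision_id),
    FOREIGN KEY (entity_id, revision_id)
        REFERENCES field_data_body (entity_id, revision_id)
        ON DELETE CASCADE            -- field deletions propagate to the index
);
```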
Once this setup process is complete, the FTS index table is still empty. You have to hit the "Index remaining items" button to start the actual conversion process to turn each applicable entry in your field's data table into lexemes.
This conversion process runs as a batch operation and is handled partially by Drupal and partially by PostgreSQL. Drupal is consulted to load the correct field data associated with the field instance and the entity it is attached to. This information is then processed and handed off to PostgreSQL, which in turn performs the actual tsvector conversion.
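Conceptually, the PostgreSQL side of each batch item boils down to a statement like the following (table, column names and values are illustrative, not the module's actual generated SQL):

```sql
-- Store the lexemes for one field entry in the hypothetical index table
INSERT INTO fts_index_body (entity_id, revision_id, body_value)
VALUES (42, 42, to_tsvector('english', 'The kids are playing outside'));
```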
Besides the manual indexing via the "Index remaining items" button, the module also self-maintains the FTS index by triggering upon inserts and updates on field data that matches the field/column(s)/instance(s) combination.
It does this by scheduling changes to field data onto an internal system queue. This queue is then processed (and thus the appropriate FTS index updated) during cron runs. A queue approach was chosen so that saving or updating entities does not incur extra load.