Problem/Motivation
When indexing nodes or other search content, we preprocess the information (remove HTML tags, lower case, remove diacriticals and accents, etc.). Then we put the words in the text into two tables:
a) search_index: a table whose rows are the words in the text, along with the number of times the word appears, and a global "score" column indicating how common the word is in general (e.g., "the" is very common so when matching, "the" doesn't get much weight, while "swimming" is less common, so if a page patches "swimming" that would get more weight in a search).
b) search_dataset: the entire (preprocessed) text of the page gets added to a TEXT field.
However, when it comes time to search, we only use the search_index table to weight one matching page vs. another. For actual matching, we use a LIKE query on search_dataset, even if the search requested is a very simple single-word query.
This seems very inefficient, since the search_index table is an indexed database table, and LIKE queries on TEXT fields are not efficient.
Proposed resolution
Figure out if, at least in some cases, the LIKE queries on search_dataset could be replaced with more efficient exact match queries on search_index.
Remaining tasks
a) Determine if this is possible.
b) Make a patch
c) See if it's actually more efficient.
User interface changes
None. This would all be behind the scenes.
API changes
Not really. The SQL query that the Search module eventually uses for searches would be different, so if someone was doing a query alter assuming some particular query format, it might fail.
Data model changes
None.
Comments
Comment #1
aspilicious CreditAttribution: aspilicious commentedMoving to D8
Comment #2
pwolanin CreditAttribution: pwolanin commentedComment #3
anavarreComment #4
jhedstromI think this is more than just a performance issue. These steps illustrate the issue:
Comment #5
jhodgdonThat is not related to this issue... But we do have all kinds of tests for search rankings so I am surprised at this. Please file a separate issue so we can explore the problem.
Comment #6
jhedstromI opened #2544830: Views search filter needs additional WHERE on search_index table for the issue in #4 since jhodgdon informed me this is working as expected in the core search box.
Comment #7
xjmThis issue was marked as a beta target for the 8.0.x beta, but is not applicable as an 8.1.x beta target, so untagging.
This sounds like a disruptive bug to fix, so moving to 8.2.x.
Comment #8
xjmComment #9
pwolanin CreditAttribution: pwolanin as a volunteer commentedThere are some places in core we are doing LIKE queries to get case insensitive searches on non-MySQL dbs.
I'm not sure why this would be disruptive, but seems like it needs some research.
Comment #10
pwolanin CreditAttribution: pwolanin as a volunteer commentedwould be good to update the summary with specifics of which queries are wrong.
Comment #11
pwolanin CreditAttribution: pwolanin as a volunteer commentedLooks like this was fixed already at #2544830: Views search filter needs additional WHERE on search_index table
Can this issue be closed now?
Comment #12
jhodgdonSo, this bug is not actually about what was stated in comment #4 -- and that issue was fixed.
This bug is about performance:
We make an index of words (search_index), and we also add the entire (preprocessed) text of the page to the search_dataset table.
However, when we actually do a search, even if it is a simple one-word search, we don't use the search_index table with an exact match. Instead, we do a LIKE query on search_dataset.
Why are we doing this? Wouldn't it be more efficient to use the word index table (an indexed database table whose rows are single words) rather than doing a LIKE query on a TEXT field that stores the entire text of each node?
That is what this issue is about. I've updated the issue summary a bit...
Comment #24
catchComment #25
catchMoving to a task.
It would be good to get a proof of concept patch for the single word search against search_index and then compare an
EXPLAIN
of the queries generated.