Performance: node search should use search_index for single word queries [#740900]

Problem/Motivation

When indexing nodes or other search content, we preprocess the information (remove HTML tags, lower case, remove diacriticals and accents, etc.). Then we put the words in the text into two tables:

a) search_index: a table whose rows are the words in the text, along with the number of times the word appears, and a global "score" column indicating how common the word is in general (e.g., "the" is very common so when matching, "the" doesn't get much weight, while "swimming" is less common, so if a page patches "swimming" that would get more weight in a search).

b) search_dataset: the entire (preprocessed) text of the page gets added to a TEXT field.

However, when it comes time to search, we only use the search_index table to weight one matching page vs. another. For actual matching, we use a LIKE query on search_dataset, even if the search requested is a very simple single-word query.

This seems very inefficient, since the search_index table is an indexed database table, and LIKE queries on TEXT fields are not efficient.

Proposed resolution

Figure out if, at least in some cases, the LIKE queries on search_dataset could be replaced with more efficient exact match queries on search_index.

Remaining tasks

a) Determine if this is possible.
b) Make a patch
c) See if it's actually more efficient.

User interface changes

None. This would all be behind the scenes.

API changes

Not really. The SQL query that the Search module eventually uses for searches would be different, so if someone was doing a query alter assuming some particular query format, it might fail.

Data model changes

None.

Comments

Comment #1

aspilicious CreditAttribution: aspilicious commented 24 June 2011 at 14:44

Version:

7.x-dev

» 8.x-dev

Moving to D8

Comment #2

pwolanin CreditAttribution: pwolanin commented 20 August 2014 at 13:51

Issue summary:	View changes
Issue tags:		+beta target

Comment #3

anavarre

French

🇪🇺

CreditAttribution: anavarre commented 4 September 2014 at 12:06

Issue tags:

+Performance

Comment #4

jhedstrom

English

Portland, OR

CreditAttribution: jhedstrom at Phase2 commented 3 August 2015 at 17:36

I think this is more than just a performance issue. These steps illustrate the issue:

Clean installation of D8
Create a page titled 'Testing one two two two two' (node/add/page)
Create a page titled 'Testing one one one' (node/add/page)
Run cron task manually. (admin/reports/status)
Verify search index is at 100% (admin/config/search/pages)
Add search score field, keys filter, and score sort DESC to content view (admin/structure/views/view/content)
Execute a few search and notice how the score never changes.

admin/content?keys=testing
admin/content?keys=one
admin/content?keys=two

Also note how 'Testing one two two two two' appear first when searching for 'one'.

Comment #5

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 3 August 2015 at 17:47

That is not related to this issue... But we do have all kinds of tests for search rankings so I am surprised at this. Please file a separate issue so we can explore the problem.

Comment #6

jhedstrom

English

Portland, OR

CreditAttribution: jhedstrom at Phase2 commented 3 August 2015 at 18:05

I opened #2544830: Views search filter needs additional WHERE on search_index table for the issue in #4 since jhodgdon informed me this is working as expected in the core search box.

Comment #7

xjm

she/her

English

CreditAttribution: xjm commented 13 March 2016 at 20:45

Version:

8.0.x-dev

» 8.2.x-dev

This issue was marked as a beta target for the 8.0.x beta, but is not applicable as an 8.1.x beta target, so untagging.

This sounds like a disruptive bug to fix, so moving to 8.2.x.

Comment #8

xjm

she/her

English

CreditAttribution: xjm commented 13 March 2016 at 20:51

Issue tags:

-beta target

Comment #9

pwolanin CreditAttribution: pwolanin as a volunteer commented 13 March 2016 at 21:50

There are some places in core we are doing LIKE queries to get case insensitive searches on non-MySQL dbs.

I'm not sure why this would be disruptive, but seems like it needs some research.

Comment #10

pwolanin CreditAttribution: pwolanin as a volunteer commented 13 March 2016 at 21:52

Issue tags:

+Needs issue summary update

would be good to update the summary with specifics of which queries are wrong.

Comment #11

pwolanin CreditAttribution: pwolanin as a volunteer commented 13 March 2016 at 22:01

Looks like this was fixed already at #2544830: Views search filter needs additional WHERE on search_index table

Can this issue be closed now?

Comment #12

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 14 March 2016 at 18:02

Issue summary:

View changes

So, this bug is not actually about what was stated in comment #4 -- and that issue was fixed.

This bug is about performance:

We make an index of words (search_index), and we also add the entire (preprocessed) text of the page to the search_dataset table.

However, when we actually do a search, even if it is a simple one-word search, we don't use the search_index table with an exact match. Instead, we do a LIKE query on search_dataset.

Why are we doing this? Wouldn't it be more efficient to use the word index table (an indexed database table whose rows are single words) rather than doing a LIKE query on a TEXT field that stores the entire text of each node?

That is what this issue is about. I've updated the issue summary a bit...

Comment #13

14 March 2016 at 18:02

Version:

8.2.x-dev

» 8.3.x-dev

Drupal 8.2.0-beta1 was released on August 3, 2016, which means new developments and disruptive changes should now be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #14

14 March 2016 at 18:02

Version:

8.3.x-dev

» 8.4.x-dev

Drupal 8.3.0-alpha1 will be released the week of January 30, 2017, which means new developments and disruptive changes should now be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #15

14 March 2016 at 18:02

Version:

8.4.x-dev

» 8.5.x-dev

Drupal 8.4.0-alpha1 will be released the week of July 31, 2017, which means new developments and disruptive changes should now be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #16

14 March 2016 at 18:02

Version:

8.5.x-dev

» 8.6.x-dev

Drupal 8.5.0-alpha1 will be released the week of January 17, 2018, which means new developments and disruptive changes should now be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #17

14 March 2016 at 18:02

Version:

8.6.x-dev

» 8.7.x-dev

Drupal 8.6.0-alpha1 will be released the week of July 16, 2018, which means new developments and disruptive changes should now be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #18

14 March 2016 at 18:02

Version:

8.7.x-dev

» 8.8.x-dev

Drupal 8.7.0-alpha1 will be released the week of March 11, 2019, which means new developments and disruptive changes should now be targeted against the 8.8.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #19

14 March 2016 at 18:02

Version:

8.8.x-dev

» 8.9.x-dev

Drupal 8.8.0-alpha1 will be released the week of October 14th, 2019, which means new developments and disruptive changes should now be targeted against the 8.9.x-dev branch. (Any changes to 8.9.x will also be committed to 9.0.x in preparation for Drupal 9’s release, but some changes like significant feature additions will be deferred to 9.1.x.). For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Comment #20

14 March 2016 at 18:02

Version:

8.9.x-dev

» 9.1.x-dev

Drupal 8.9.0-beta1 was released on March 20, 2020. 8.9.x is the final, long-term support (LTS) minor release of Drupal 8, which means new developments and disruptive changes should now be targeted against the 9.1.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Comment #21

14 March 2016 at 18:02

Version:

9.1.x-dev

» 9.2.x-dev

Drupal 9.1.0-alpha1 will be released the week of October 19, 2020, which means new developments and disruptive changes should now be targeted for the 9.2.x-dev branch. For more information see the Drupal 9 minor version schedule and the Allowed changes during the Drupal 9 release cycle.

Comment #22

14 March 2016 at 18:02

Version:

9.2.x-dev

» 9.3.x-dev

Drupal 9.2.0-alpha1 will be released the week of May 3, 2021, which means new developments and disruptive changes should now be targeted for the 9.3.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Comment #23

14 March 2016 at 18:02

Version:

9.3.x-dev

» 9.4.x-dev

Drupal 9.3.0-rc1 was released on November 26, 2021, which means new developments and disruptive changes should now be targeted for the 9.4.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Comment #24

catch

he/him

English

CreditAttribution: catch at Third and Grove commented 22 April 2022 at 20:57

Comment #25

catch

he/him

English

CreditAttribution: catch at Third and Grove commented 27 April 2022 at 09:49

Title:	Performance: node search is not really using the index table	» Performance: node search should use search_index for single word queries
Category:	Bug report	» Task

Moving to a task.

It would be good to get a proof of concept patch for the single word search against search_index and then compare an EXPLAIN of the queries generated.

Comment #26

27 April 2022 at 09:49

Version:

9.4.x-dev

» 9.5.x-dev

Drupal 9.4.0-alpha1 was released on May 6, 2022, which means new developments and disruptive changes should now be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Comment #27

27 April 2022 at 09:49

Version:

9.5.x-dev

» 10.1.x-dev

Drupal 9.5.0-beta2 and Drupal 10.0.0-beta2 were released on September 29, 2022, which means new developments and disruptive changes should now be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Comment #28

27 April 2022 at 09:49

Version:

10.1.x-dev

» 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch, which currently accepts only minor-version allowed changes. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Performance: node search should use search_index for single word queries

Problem/Motivation

Proposed resolution

Remaining tasks

User interface changes

API changes

Data model changes

Comments

Related issues