Problem/Motivation

When indexing nodes or other search content, we preprocess the information (remove HTML tags, lower case, remove diacriticals and accents, etc.). Then we put the words in the text into two tables:

a) search_index: a table whose rows are the words in the text, along with the number of times the word appears, and a global "score" column indicating how common the word is in general (e.g., "the" is very common so when matching, "the" doesn't get much weight, while "swimming" is less common, so if a page patches "swimming" that would get more weight in a search).

b) search_dataset: the entire (preprocessed) text of the page gets added to a TEXT field.

However, when it comes time to search, we only use the search_index table to weight one matching page vs. another. For actual matching, we use a LIKE query on search_dataset, even if the search requested is a very simple single-word query.

This seems very inefficient, since the search_index table is an indexed database table, and LIKE queries on TEXT fields are not efficient.

Proposed resolution

Figure out if, at least in some cases, the LIKE queries on search_dataset could be replaced with more efficient exact match queries on search_index.

Remaining tasks

a) Determine if this is possible.
b) Make a patch
c) See if it's actually more efficient.

User interface changes

None. This would all be behind the scenes.

API changes

Not really. The SQL query that the Search module eventually uses for searches would be different, so if someone was doing a query alter assuming some particular query format, it might fail.

Data model changes

None.

Comments

aspilicious’s picture

Version: 7.x-dev » 8.x-dev

Moving to D8

pwolanin’s picture

Issue summary: View changes
Issue tags: +beta target
anavarre’s picture

Issue tags: +Performance
jhedstrom’s picture

I think this is more than just a performance issue. These steps illustrate the issue:

  • Clean installation of D8
  • Create a page titled 'Testing one two two two two' (node/add/page)
  • Create a page titled 'Testing one one one' (node/add/page)
  • Run cron task manually. (admin/reports/status)
  • Verify search index is at 100% (admin/config/search/pages)
  • Add search score field, keys filter, and score sort DESC to content view (admin/structure/views/view/content)
  • Execute a few search and notice how the score never changes.
    • admin/content?keys=testing
    • admin/content?keys=one
    • admin/content?keys=two
  • Also note how 'Testing one two two two two' appear first when searching for 'one'.
jhodgdon’s picture

That is not related to this issue... But we do have all kinds of tests for search rankings so I am surprised at this. Please file a separate issue so we can explore the problem.

jhedstrom’s picture

I opened #2544830: Views search filter needs additional WHERE on search_index table for the issue in #4 since jhodgdon informed me this is working as expected in the core search box.

xjm’s picture

Version: 8.0.x-dev » 8.2.x-dev

This issue was marked as a beta target for the 8.0.x beta, but is not applicable as an 8.1.x beta target, so untagging.

This sounds like a disruptive bug to fix, so moving to 8.2.x.

xjm’s picture

Issue tags: -beta target
pwolanin’s picture

There are some places in core we are doing LIKE queries to get case insensitive searches on non-MySQL dbs.

I'm not sure why this would be disruptive, but seems like it needs some research.

pwolanin’s picture

would be good to update the summary with specifics of which queries are wrong.

pwolanin’s picture

Looks like this was fixed already at #2544830: Views search filter needs additional WHERE on search_index table

Can this issue be closed now?

jhodgdon’s picture

Issue summary: View changes

So, this bug is not actually about what was stated in comment #4 -- and that issue was fixed.

This bug is about performance:

We make an index of words (search_index), and we also add the entire (preprocessed) text of the page to the search_dataset table.

However, when we actually do a search, even if it is a simple one-word search, we don't use the search_index table with an exact match. Instead, we do a LIKE query on search_dataset.

Why are we doing this? Wouldn't it be more efficient to use the word index table (an indexed database table whose rows are single words) rather than doing a LIKE query on a TEXT field that stores the entire text of each node?

That is what this issue is about. I've updated the issue summary a bit...

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.0-beta1 was released on August 3, 2016, which means new developments and disruptive changes should now be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.0-alpha1 will be released the week of January 30, 2017, which means new developments and disruptive changes should now be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.0-alpha1 will be released the week of July 31, 2017, which means new developments and disruptive changes should now be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.0-alpha1 will be released the week of January 17, 2018, which means new developments and disruptive changes should now be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.6.x-dev » 8.7.x-dev

Drupal 8.6.0-alpha1 will be released the week of July 16, 2018, which means new developments and disruptive changes should now be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.7.x-dev » 8.8.x-dev

Drupal 8.7.0-alpha1 will be released the week of March 11, 2019, which means new developments and disruptive changes should now be targeted against the 8.8.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.8.x-dev » 8.9.x-dev

Drupal 8.8.0-alpha1 will be released the week of October 14th, 2019, which means new developments and disruptive changes should now be targeted against the 8.9.x-dev branch. (Any changes to 8.9.x will also be committed to 9.0.x in preparation for Drupal 9’s release, but some changes like significant feature additions will be deferred to 9.1.x.). For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 8.9.x-dev » 9.1.x-dev

Drupal 8.9.0-beta1 was released on March 20, 2020. 8.9.x is the final, long-term support (LTS) minor release of Drupal 8, which means new developments and disruptive changes should now be targeted against the 9.1.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 9.1.x-dev » 9.2.x-dev

Drupal 9.1.0-alpha1 will be released the week of October 19, 2020, which means new developments and disruptive changes should now be targeted for the 9.2.x-dev branch. For more information see the Drupal 9 minor version schedule and the Allowed changes during the Drupal 9 release cycle.

Version: 9.2.x-dev » 9.3.x-dev

Drupal 9.2.0-alpha1 will be released the week of May 3, 2021, which means new developments and disruptive changes should now be targeted for the 9.3.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.0-rc1 was released on November 26, 2021, which means new developments and disruptive changes should now be targeted for the 9.4.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

catch’s picture

catch’s picture

Title: Performance: node search is not really using the index table » Performance: node search should use search_index for single word queries
Category: Bug report » Task

Moving to a task.

It would be good to get a proof of concept patch for the single word search against search_index and then compare an EXPLAIN of the queries generated.

Version: 9.4.x-dev » 9.5.x-dev

Drupal 9.4.0-alpha1 was released on May 6, 2022, which means new developments and disruptive changes should now be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.5.x-dev » 10.1.x-dev

Drupal 9.5.0-beta2 and Drupal 10.0.0-beta2 were released on September 29, 2022, which means new developments and disruptive changes should now be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 10.1.x-dev » 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch, which currently accepts only minor-version allowed changes. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.