Problem/Motivation

Users expect that searching for a phrase surrounded by double quotes should only show results that include the quoted words consecutively in the specified order.

Currently search_api_db expands double-quoted phrases into individual words, so searching for a double-quoted phrase includes results that contain those keywords but not necessarily consecutively and not necessarily in the specified order, leading to poor search result relevance.

Previously this was mentioned in #3090482: Fulltext search + phrase search, but that issue is closed with the advice to use Apache Solr. Nevertheless, phrase searching can be important to users of smaller sites that don't have the resources to maintain an Apache Solr instance, so I think it could be helpful to include phrase searching in search_api_db.

Proposed resolution

The attached patch adds a simple phrase matching implementation:

  • When indexing documents, it includes bigrams (pairs of consecutive words, a.k.a. biwords) in the index.
  • When searching, it preserves double-quoted phrases instead of expanding them into individual words. Since the index contains bigrams, and not longer phrases (N-grams), it splits phrases with more than 2 words into multiple ANDed 2-word phrase groupings.

This new feature is optional, toggled by a new option in the database backend configuration form:
Screenshot of the database backend configuration form, with the new option highlighted in pink

Caveats:

  • This bigram-based implementation only considers whether each individual pair of double-quoted words is consecutive, not whether an entire 3+ word phrase is consecutive. For example, it converts the query "malleable logarithmic casing" to "malleable logarithmic" AND "logarithmic casing". Even so, this still provides a substantial improvement in search result relevance compared to not having any phrase search capability.
  • Including bigrams increases the size of the index and increases indexing time. (On the plus side, it doesn't appreciably affect search performance.) I tested on a local copy of a site consisting of about 6,500 hand-written English articles (about 55 MB storage according to information_schema.tables):
    • Baseline:
      • search_api_db_content_search_text has about 1.6 million rows (about 210 MB storage according to information_schema.tables)
      • 88 seconds to reindex entire site
    • With this patch:
      • search_api_db_content_search_text has about 4.8 million rows (about 1.1 GB storage according to information_schema.tables)
      • 190 seconds to reindex entire site

    Since search_api_db is recommended for smaller sites, probably having much less content than the aforementioned test site, I wouldn't expect the index storage/time increases to have a significant impact on the recommended use case for search_api_db.

Remaining tasks

  • Decide whether this is an acceptable solution
  • Review and commit patch

Comments

smokris created an issue. See original summary.

smokris’s picture

Assigned: smokris » Unassigned
Status: Active » Needs review
StatusFileSize
new16.51 KB

Patch attached.

Status: Needs review » Needs work

The last submitted patch, 2: 3345999-bigram.patch, failed testing. View results

smokris’s picture

Status: Needs work » Needs review
StatusFileSize
new17 KB

Revised patch to address failure in testAutocompletion.

smokris’s picture

StatusFileSize
new17.01 KB

Rebased patch.

drunken monkey’s picture

Assigned: Unassigned » drunken monkey
StatusFileSize
new8.72 KB
new16.52 KB

Thanks a lot for the initiative!
You’re right, being able to support phrase searches with the DB backend would be great. As you say, even a rudimentary support via bigrams should already be a vast improvement for such searches.

The large increase in DB storage size is a bit worrisome, though. I can imagine some people not being happy with that. Therefore, we should probably ask around for input from others before committing this, and maybe make this new functionality optional.

Apart from that, the patch looks very good already. Especially great that you already included tests! (Though we might want to add a few more to also cover some edge cases.) I just had a few notes, mostly code style:

  1. +++ b/modules/search_api_db/src/Plugin/search_api/backend/Database.php
    @@ -85,6 +85,14 @@ class Database extends BackendPluginBase implements AutocompleteBackendInterface
    +  const TOKEN_LENGTH_MAX = 50;
    

    Seems like this could be protected?

  2. +++ b/modules/search_api_db/src/Plugin/search_api/backend/Database.php
    @@ -1121,7 +1129,7 @@ class Database extends BackendPluginBase implements AutocompleteBackendInterface
    +              'length' => self::TOKEN_LENGTH_MAX,
    

    I’d use static instead of self everywhere, just on the off chance that someone sub-classes the backend plugin to override the max token length. (Seems also like this could be configurable, but totally separate issue, of course.)

  3. +++ b/modules/search_api_db/src/Plugin/search_api/backend/Database.php
    @@ -1243,6 +1251,119 @@ class Database extends BackendPluginBase implements AutocompleteBackendInterface
    +    $all_tokens = [];
    

    Doesn’t seem like $all_tokens (together with the whole separate first loop) is actually needed? You fill it sequentially, and then immediately loop over it a single time in the same order. I therefore just combined the two loops into one.

  4. +++ b/modules/search_api_db/src/Plugin/search_api/backend/Database.php
    @@ -1243,6 +1251,119 @@ class Database extends BackendPluginBase implements AutocompleteBackendInterface
    +      if (mb_strlen($bigram) <= self::TOKEN_LENGTH_MAX) {
    

    Hm, yes, bit of a problem what to do in this case. I think instead of discarding it we should just trim it to the maximum allowed length.

This and a bit more is all fixed in the attached patch revision. However, I think I’ll still work on this a bit more, a few other things also don’t seem ideal.

drunken monkey’s picture

Assigned: drunken monkey » Unassigned
StatusFileSize
new4.47 KB
new18.74 KB

This fixes handling of overlong bigrams, and adds tests for that.

drunken monkey’s picture

StatusFileSize
new5.3 KB
new19.52 KB

And with this, tokens below the minimum length should also be ignored in the bigrams.

smokris’s picture

StatusFileSize
new4.5 KB
new21.28 KB

Great; thanks for reviewing this. I read through your changes and I think they make sense. A couple thoughts:

  • While testing your latest patch on the 6,500-article corpus I mentioned above, I ran into some Integrity constraint violation: Duplicate entry … for key PRIMARY errors during indexing, which I traced to 2 problems:
    • Source token 0 gets detected as a number, then converted to emptystring since the ltrim() removes the leading zero, which in this case is the only character in the numeric string. Then it concatenates the bigram: the previous (non-empty) word + space + emptystring. When inserting this into a MySQL/MariaDB database, the separator space (which is now the trailing space) apparently gets trimmed, resulting in a duplicate primary key. I addressed this by adding a special case: ltrim() shouldn't trim 0 to emptystring.
    • A source token that's exactly 49 characters long (TOKEN_LENGTH_MAX - 1) results in a similar situation to the above: the first word in the bigram is 49 characters, then the space is added, then the bigram is trimmed to 50 characters. When inserting this into a MySQL/MariaDB database, the trailing space apparently gets trimmed, resulting in a duplicate key. I addressed this by trimming $bigram_base_form so that it doesn't try to insert a key with trailing whitespace, which would be a duplicate of a singleton word.
  • I removed the $dpm = lines which I presume you temporarily added for debugging

maybe make this new functionality optional

Yes, optional makes sense to me. Should it be enabled or disabled by default?

drunken monkey’s picture

Thanks a lot, this all sounds reasonable to me.
Had to think a bit about the cleanNumericString() special case, but yes: When a numeric string becomes empty after stripping minuses and zeroes, then I guess 0 really is the only possibly correct token, it should never be empty.
I just added a reference to the cleanNumericString() method into its test method, as I try to always do that when accessing something via reflection. That way, it’s still possible to find all usages using an IDE.

Now, I think only this is missing:

Yes, optional makes sense to me. Should it be enabled or disabled by default?

Hm, since it will bloat the database quite a bit, I guess we’d better keep it disabled by default? Or just disabled for existing indexes, but enabled for new ones, since in the latter case the users have a higher chance of seeing the option?
What do you think? Will users appreciate it more if phrase search “just works” (as most people would expect), or if we use their database storage sparingly? (Now that I write it like that, it more sounds like the former to me … Please just pick the option that makes most sense to you, and we’ll hope for the best.)

tedfordgif’s picture

Although Solr is the primary backend we use, I find myself considering the db backend for specific use cases when Solr is not available, sometimes on large sites. I like the option of not touching existing indexes, and enabling it by default for new indexes with a sufficiently noticeable warning. Thanks for your work on this!

drunken monkey’s picture

Status: Needs review » Needs work

NW for the option.
Are you going to work on that, smokris? Otherwise, I can also put it on my todo list, but no guarantees on when I’ll get to it.

smokris’s picture

Issue summary: View changes
Status: Needs work » Needs review
StatusFileSize
new27.77 KB
new8.81 KB
new95.6 KB

Will users appreciate it more if phrase search “just works” (as most people would expect), or if we use their database storage sparingly?

For new installations, enabling it by default sounds great.

For existing installations, if this change will be included in an upcoming 1.x minor-version release, I think it might be disruptive/surprising if it were automatically enabled and suddenly the site's index size grew significantly. (Though the upcoming 2.0 major-version release (#3126115: Create a 2.0.0 release) might be a suitable time to automatically enable it for existing installations too?)

I've attached an updated patch that includes a first draft of the UI for toggling support for phrase indexing/searching, and a screenshot of it with the new option highlighted in pink. It's enabled by default for new installations, and for existing installations I added a hook_update_N implementation that disables it.

drunken monkey’s picture

Thanks a lot for your continued work, looks great!
(And sorry it took me a while to get back to you.)

+++ b/modules/search_api_db/src/Plugin/search_api/backend/Database.php
@@ -580,6 +592,11 @@
+      'info' => ucfirst($this->configuration['phrase']),

This would fail to be translated on non-English websites. However, in this specific case, it should be fine to just wrap the expression in $this->t() (since we those values are already passed as literals in another place).

Otherwise, this patch is already very good, there were only a few small details I changed. Please see the attached revision.

Most notably, I changed the description for the “Phrase indexing” option to be split into one description for the setting itself, and separate descriptions for the individual options. It now reads as follows:

Phrase indexing
Choose how phrase queries (queries with double quotes) should be handled. Better matching comes at the price of slower indexing and larger database size.

  • Disabled: Double-quoted phrases in search queries will be expanded into individual words, so searches will include results that contain those keywords but not necessarily consecutively and not necessarily in the same order. (This is the standard behavior of version 8.x-1.29 and earlier.)
  • Bigram: Pairs of consecutive words will be included in the index, so double-quoted searches will only include results that contain those keywords consecutively and in the same order. Warning: this will increase the database index size by about 5x, and will increase indexing time by about 2x.

I’m not convinced that this is the best we can do, but it’s an improvement. Unfortunately, I’m rather bad at writing such descriptions in a way that make them understandable to laypeople.
This would be an alternative (just for the two options) – feedback appreciated on which you think would be better:

  • Disabled: This will ignore quoted phrase in search keywords and just look for the individual words separately. For instance, "blue house" would match any item that contains the words “blue” and “house” somewhere in its indexed text, no matter where and in what order.
  • Bigram: This treats quoted phrases in keywords like it is generally expected, matching only items that contain those words in the exact same order, consecutively. For instance, "blue house" would match only items that contain the word “blue” immediately followed by the word “house”. Please make sure that the associated increase in database size (about 5x) and indexing time (about 2x) is acceptable for your site.

Any combination of the two, or completely new suggestions, are of course also very welcome!

In any case, thanks a lot once again, this would really be a great improvement!

ressa’s picture

This would be an awesome feature in Search API. And I agree, it's something the users expect, so thanks a lot for working on it!

I just tried the latest patch, and it works perfectly.

+1 for enabling it by default in new installations.

About the wording, it is difficult to formulate these concepts, but I think you succeeded with the #2 version, which makes sense to me. Here is a new suggestion, with some slight tweaks:

Disabled: Ignore quotes in searches and just look for the individual words separately. For instance, "blue house" matches any item that contains the words "blue" and "house" somewhere in its indexed text, no matter where and in what order.

Bigram: Treat a quoted phrase in a search like it is generally expected [...] (the rest is great!)

aharown07’s picture

I have applied the patch to a dev site with good success in results so far, with Bigram enabled.
The test site has just over 23,000 items indexed.

Seems to work really well with facets and everything as expected.

To echo ressa, thanks for your work on this. I hope it makes it into a future release. Also makes sense to me to enable by default in new installs.

drunken monkey’s picture

StatusFileSize
new1.89 KB
new28.35 KB

Thanks a lot for both your feedback!
Sounds very good so far. I’m just gonna leave it open a few more days to maybe get another review, but then I thinwe can merge this.

Attached is a patch with the adapted descriptions suggested in #15, thanks a lot for those as well!

drunken monkey’s picture

Status: Needs review » Fixed

OK, merged.
Thanks again, everyone – especially, thanks a lot, @smokris!

ressa’s picture

So cool, this is a great feature to have. Thanks @drunken monkey and @smokris!

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

jeff.a’s picture

Excellent feature, thanks everyone. One possible addition that could be very useful, in addition to Bigram, would be "Required". This would default to phrase searching but without the user needing to enter the quotes. This would be useful for specific/technical searches, names of places/locations, names of people, leading zeros, etc.

Not sure about the performance ramifications of doing this and if another index would be needed. Seems like it could be done within the Bigram index.

berdir’s picture

The release notes are a bit confusing, they say the new feature is disabled by default, but that's not correct, for new servers, it's enabled by default, only existing servers/sites have it disabled through the update function.

That matters with default config in install profiles for example, if you don't explicitly provide the setting, it will default to bigram.