Problem/Motivation
Users expect that searching for a phrase surrounded by double quotes should only show results that include the quoted words consecutively in the specified order.
Currently search_api_db expands double-quoted phrases into individual words, so searching for a double-quoted phrase includes results that contain those keywords but not necessarily consecutively and not necessarily in the specified order, leading to poor search result relevance.
Previously this was mentioned in #3090482: Fulltext search + phrase search, but that issue is closed with the advice to use Apache Solr. Nevertheless, phrase searching can be important to users of smaller sites that don't have the resources to maintain an Apache Solr instance, so I think it could be helpful to include phrase searching in search_api_db.
Proposed resolution
The attached patch adds a simple phrase matching implementation:
- When indexing documents, it includes bigrams (pairs of consecutive words, a.k.a. biwords) in the index.
- When searching, it preserves double-quoted phrases instead of expanding them into individual words. Since the index contains bigrams, and not longer phrases (N-grams), it splits phrases with more than 2 words into multiple ANDed 2-word phrase groupings.
This new feature is optional, toggled by a new option in the database backend configuration form:

Caveats:
- This bigram-based implementation only considers whether each individual pair of double-quoted words is consecutive, not whether an entire 3+ word phrase is consecutive. For example, it converts the query
"malleable logarithmic casing"to"malleable logarithmic" AND "logarithmic casing". Even so, this still provides a substantial improvement in search result relevance compared to not having any phrase search capability. - Including bigrams increases the size of the index and increases indexing time. (On the plus side, it doesn't appreciably affect search performance.) I tested on a local copy of a site consisting of about 6,500 hand-written English articles (about 55 MB storage according to
information_schema.tables):- Baseline:
search_api_db_content_search_texthas about 1.6 million rows (about 210 MB storage according toinformation_schema.tables)- 88 seconds to reindex entire site
- With this patch:
search_api_db_content_search_texthas about 4.8 million rows (about 1.1 GB storage according toinformation_schema.tables)- 190 seconds to reindex entire site
Since search_api_db is recommended for smaller sites, probably having much less content than the aforementioned test site, I wouldn't expect the index storage/time increases to have a significant impact on the recommended use case for search_api_db.
- Baseline:
Remaining tasks
- Decide whether this is an acceptable solution
- Review and commit patch
| Comment | File | Size | Author |
|---|---|---|---|
| #17 | 3345999-17--db_backend_phrase_search.patch | 28.35 KB | drunken monkey |
| #9 | 3345999-9--db_backend_phrase_search.patch | 21.28 KB | smokris |
Comments
Comment #2
smokrisPatch attached.
Comment #4
smokrisRevised patch to address failure in
testAutocompletion.Comment #5
smokrisRebased patch.
Comment #6
drunken monkeyThanks a lot for the initiative!
You’re right, being able to support phrase searches with the DB backend would be great. As you say, even a rudimentary support via bigrams should already be a vast improvement for such searches.
The large increase in DB storage size is a bit worrisome, though. I can imagine some people not being happy with that. Therefore, we should probably ask around for input from others before committing this, and maybe make this new functionality optional.
Apart from that, the patch looks very good already. Especially great that you already included tests! (Though we might want to add a few more to also cover some edge cases.) I just had a few notes, mostly code style:
Seems like this could be protected?
I’d use static instead of self everywhere, just on the off chance that someone sub-classes the backend plugin to override the max token length. (Seems also like this could be configurable, but totally separate issue, of course.)
Doesn’t seem like $all_tokens (together with the whole separate first loop) is actually needed? You fill it sequentially, and then immediately loop over it a single time in the same order. I therefore just combined the two loops into one.
Hm, yes, bit of a problem what to do in this case. I think instead of discarding it we should just trim it to the maximum allowed length.
This and a bit more is all fixed in the attached patch revision. However, I think I’ll still work on this a bit more, a few other things also don’t seem ideal.
Comment #7
drunken monkeyThis fixes handling of overlong bigrams, and adds tests for that.
Comment #8
drunken monkeyAnd with this, tokens below the minimum length should also be ignored in the bigrams.
Comment #9
smokrisGreat; thanks for reviewing this. I read through your changes and I think they make sense. A couple thoughts:
Integrity constraint violation: Duplicate entry … for key PRIMARYerrors during indexing, which I traced to 2 problems:0gets detected as a number, then converted to emptystring since theltrim()removes the leading zero, which in this case is the only character in the numeric string. Then it concatenates the bigram: the previous (non-empty) word + space + emptystring. When inserting this into a MySQL/MariaDB database, the separator space (which is now the trailing space) apparently gets trimmed, resulting in a duplicate primary key. I addressed this by adding a special case:ltrim()shouldn't trim0to emptystring.$bigram_base_formso that it doesn't try to insert a key with trailing whitespace, which would be a duplicate of a singleton word.$dpm =lines which I presume you temporarily added for debuggingYes, optional makes sense to me. Should it be enabled or disabled by default?
Comment #10
drunken monkeyThanks a lot, this all sounds reasonable to me.
Had to think a bit about the
cleanNumericString()special case, but yes: When a numeric string becomes empty after stripping minuses and zeroes, then I guess0really is the only possibly correct token, it should never be empty.I just added a reference to the
cleanNumericString()method into its test method, as I try to always do that when accessing something via reflection. That way, it’s still possible to find all usages using an IDE.Now, I think only this is missing:
Hm, since it will bloat the database quite a bit, I guess we’d better keep it disabled by default? Or just disabled for existing indexes, but enabled for new ones, since in the latter case the users have a higher chance of seeing the option?
What do you think? Will users appreciate it more if phrase search “just works” (as most people would expect), or if we use their database storage sparingly? (Now that I write it like that, it more sounds like the former to me … Please just pick the option that makes most sense to you, and we’ll hope for the best.)
Comment #11
tedfordgif commentedAlthough Solr is the primary backend we use, I find myself considering the db backend for specific use cases when Solr is not available, sometimes on large sites. I like the option of not touching existing indexes, and enabling it by default for new indexes with a sufficiently noticeable warning. Thanks for your work on this!
Comment #12
drunken monkeyNW for the option.
Are you going to work on that, smokris? Otherwise, I can also put it on my todo list, but no guarantees on when I’ll get to it.
Comment #13
smokrisFor new installations, enabling it by default sounds great.
For existing installations, if this change will be included in an upcoming 1.x minor-version release, I think it might be disruptive/surprising if it were automatically enabled and suddenly the site's index size grew significantly. (Though the upcoming 2.0 major-version release (#3126115: Create a 2.0.0 release) might be a suitable time to automatically enable it for existing installations too?)
I've attached an updated patch that includes a first draft of the UI for toggling support for phrase indexing/searching, and a screenshot of it with the new option highlighted in pink. It's enabled by default for new installations, and for existing installations I added a
hook_update_Nimplementation that disables it.Comment #14
drunken monkeyThanks a lot for your continued work, looks great!
(And sorry it took me a while to get back to you.)
This would fail to be translated on non-English websites. However, in this specific case, it should be fine to just wrap the expression in
$this->t()(since we those values are already passed as literals in another place).Otherwise, this patch is already very good, there were only a few small details I changed. Please see the attached revision.
Most notably, I changed the description for the “Phrase indexing” option to be split into one description for the setting itself, and separate descriptions for the individual options. It now reads as follows:
I’m not convinced that this is the best we can do, but it’s an improvement. Unfortunately, I’m rather bad at writing such descriptions in a way that make them understandable to laypeople.
This would be an alternative (just for the two options) – feedback appreciated on which you think would be better:
Any combination of the two, or completely new suggestions, are of course also very welcome!
In any case, thanks a lot once again, this would really be a great improvement!
Comment #15
ressaThis would be an awesome feature in Search API. And I agree, it's something the users expect, so thanks a lot for working on it!
I just tried the latest patch, and it works perfectly.
+1 for enabling it by default in new installations.
About the wording, it is difficult to formulate these concepts, but I think you succeeded with the #2 version, which makes sense to me. Here is a new suggestion, with some slight tweaks:
Comment #16
aharown07 commentedI have applied the patch to a dev site with good success in results so far, with Bigram enabled.
The test site has just over 23,000 items indexed.
Seems to work really well with facets and everything as expected.
To echo ressa, thanks for your work on this. I hope it makes it into a future release. Also makes sense to me to enable by default in new installs.
Comment #17
drunken monkeyThanks a lot for both your feedback!
Sounds very good so far. I’m just gonna leave it open a few more days to maybe get another review, but then I thinwe can merge this.
Attached is a patch with the adapted descriptions suggested in #15, thanks a lot for those as well!
Comment #19
drunken monkeyOK, merged.
Thanks again, everyone – especially, thanks a lot, @smokris!
Comment #20
ressaSo cool, this is a great feature to have. Thanks @drunken monkey and @smokris!
Comment #22
jeff.a commentedExcellent feature, thanks everyone. One possible addition that could be very useful, in addition to Bigram, would be "Required". This would default to phrase searching but without the user needing to enter the quotes. This would be useful for specific/technical searches, names of places/locations, names of people, leading zeros, etc.
Not sure about the performance ramifications of doing this and if another index would be needed. Seems like it could be done within the Bigram index.
Comment #23
berdirThe release notes are a bit confusing, they say the new feature is disabled by default, but that's not correct, for new servers, it's enabled by default, only existing servers/sites have it disabled through the update function.
That matters with default config in install profiles for example, if you don't explicitly provide the setting, it will default to bigram.