Updated: Comment #N
Problem/Motivation
The core Search module allows the use of AND (implicit) and OR (explicit) in keyword/node search. For instance, you can search for "boy dog" and it will look for nodes with both boy and dog in them; you can search for "boy OR dog" and it will find nodes with either.
This is normally working fine, but in some cases, due to the way the "first pass" of the query works, if you use the same keyword twice in the query (which is actually a valid thing to want do), the search fails even though it shouldn't.
For example, if you create a node with the following body:
I created this page with some very interesting content in it.
Here are some search results, all for queries that should find the node -- keep in mind that AND is implied unless you explicitly type OR, and that parens are not recognized:
- this
- found page
- this OR that
- found page
- this OR that this
- found page
- this OR that this OR foo
- DID NOT FIND
- this OR that this OR interesting
- found page
- this OR foo this OR interesting
- found page
- this OR foo this OR foo
- found page
All of these are correct except the "this OR that this OR foo", which is not working.
Proposed resolution
Fix the SearchQuery class so that these searches all work. See comment #3 for an explanation of what the root cause of the bug is -- basically that the "first pass" query is incorrectly saying that the search doesn't work and aborting the search. The solution is to define a search with multiple OR clauses as "not simple" so that it will fall through to doing the full query (which does work correctly).
Remaining tasks
Get the patch reviewed and committed (it works and has tests).
User interface changes
None, except that more searches will work correctly.
API changes
None.
Original report by @spartlow
Search combines keywords by AND except where adjacent keywords explicitly use OR. So this search:
match1 OR miss1 match2 OR miss2
is treated as:
(match1 OR miss1) AND (match2 OR miss2)
and will return a node that has match1 and match2 in it (but happens to not have miss1 or miss2 in it).
However, if the same exact keyword is used in two OR statements such as:
match1 OR miss1 match1 OR miss2
then that same node is NOT returned. This is the bug.
Interestingly the following search WILL find that node. It's just when an additional OR is added that it fails.
match1 OR miss1 match1
Background:
I'd like to use search hooks to conditionally add certain keywords to searches. Theoretically, I'd want to do (user_key1 user_key2) OR (my_key). So instead I have to add "OR my_key" to each of the user's search keywords, which I suspect would work if it wasn't for this bug. I'm still looking for a work-around.
Comment | File | Size | Author |
---|---|---|---|
#51 | 1842226-51.patch | 5.44 KB | aerozeppelin |
#51 | test-only-fail-1842226-51.patch | 840 bytes | aerozeppelin |
#33 | Screen Shot 2014-09-23 at 10.53.12 AM.png | 76.14 KB | mgifford |
#26 | 1842226-search-or-26.patch | 6.59 KB | jhodgdon |
#9 | 1842226-9-test-only.patch | 2.31 KB | jhodgdon |
Comments
Comment #1
jhodgdonComment #2
jhodgdonSorry, I meant to add a comment here.
This is very odd.
I tested this in Drupal 8 with a node I created with the following body:
Here are some search results:
Anyway... it's odd. The bug is confirmed and still exists in 8.x.
Comment #2.0
jhodgdonAdded additional info.
Comment #3
jhodgdonI think I figured out what the cause of the problem is.
During SearchQuery::parseSearchExpression(), the entered search expression is parsed into words and AND/OR query conditions.
So let's say we have a node whose text is just
foo
If we enter search expression
foo OR baz foo OR bar
which is supposed to be interpreted as
(foo OR baz) AND (foo OR bar)
This should match. And the query conditions that are created are correct.
However, during the pre-query step, SearchQuery sets up a "normalization" query to figure out whether to even run the full query. To do this, during parsing it keeps track of a list of words that have to match, and of how many matches it needs to find to even continue with the full query.
For the AND/OR case envisioned in SearchQuery, where all the words are distinct, you would make a list of 4 distinct words and know that you need to find two matches to continue (one for each of the OR pieces). But in this case, it only finds 3 distinct words (foo, baz, bar) and still thinks it needs to find 2 matches (one for each of the OR pieces). But this fails because only one of those 3 words is a match.
The only way to fix it is to not require the full number of matches to continue in the case of a combined AND/OR query. There are already member variables that check whether the query is "complex" and we can use them to decide not to impose this "we need 2 words matching to continue" criterion.
However... I tried to fix this and my fix didn't work... uploading the patch for now and maybe someone can see what is going on, or maybe I'll get back to it. Also added tests for this problem in the patch, but they do not pass locally even with the code patch. Sigh.
Comment #5
jhodgdonDang, stray chunk got into that patch. Sorry. Trying again.
Comment #6
jhodgdonComment #8
jhodgdonOK, that's the failing test. Now we just need to figure out how to fix the bug.
Comment #9
jhodgdonDoh! I had a typo in the tests: minum vs. minim. Patch is otherwise the same.
So on my machine, the tests pass with the patch, and fail without the patch. Let's make sure the test bot agrees...
Comment #10
jhodgdonDoh again. :)
Comment #12
jhodgdonThat test was supposed to fail, and it failed in the right places to illustrate the bug.
Comment #13
jhodgdon9: 1842226-9.patch queued for re-testing.
Comment #14
jhodgdonThis still needs to be reviewed. Should be straightforward... has a test... fix is pretty simple...
Comment #15
jhodgdonAdded an issue summary.
Comment #16
jhodgdon9: 1842226-9.patch queued for re-testing.
Comment #17
jhodgdon9: 1842226-9.patch queued for re-testing.
Comment #19
jhodgdonStraight reroll.
Comment #20
jhodgdon19: 1842226-18.patch queued for re-testing.
Comment #21
pwolanin CreditAttribution: pwolanin commentedThis variable naming or logic doesn't make sense:
If it's simple it's NOT simple?
I don't think we usually include issue numbers in comments, or core would just be a long list of them:
See issue #1842226.
Comment #22
jhodgdonOn the case...
Comment #23
jhodgdonThe code was correct, but I agree with #21 that the local variable names were confusing. So I took the opportunity to go through this whole parseSearchExpression() method and clean up the comments and local variable names a bit. Hope that is OK -- there were a lot of local variables with "or" and "and" in the name that I thought were confusing.
Also took out mentions of the issue number.
Comment #24
pwolanin CreditAttribution: pwolanin commentedLooks better. Seems like this might be redundant?
At least, the comment implies it is, given this other code:
Comment #25
jhodgdonRE #24... no, they are actually for two separate cases.
a) Search query is something like
cat OR mouse dog OR chicken
In this case, the first step of parsing gives you something like:
So in the lower loop, $has_or gets set to TRUE the first time through, and then this code
the second time through says "Oh, we already had one OR, here is another one, so this is not a "simple" query.
b) A search query something like:
cat OR mouse dog chicken
which gets parsed to:
so that get $has_or set to TRUE from the first array, and $has_and set to TRUE for the other two words. Then
says "This query had both AND and OR clauses, so it's not simple".
Does that make sense?
Comment #26
jhodgdonReroll for PSR-4
Comment #31
hughmccracken CreditAttribution: hughmccracken commentedRe #26 - I repeated the search with the single node test page:
this
found page
this OR that
found page
this OR that this
found page
this OR that this OR foo
found page
this OR that this OR interesting
found page
this OR foo this OR interesting
found page
this OR foo this OR foo
found page
Looks good to me!
Comment #32
jhodgdonThanks for testing muzza299! Do you want to mark this "reviewed and tested by the community" then?
Comment #33
mgiffordI tested with the patch and without.
My only page just had "I repeated the search with the single node test page" as the body.
My query was "search OR the single OR test page" which I think should emulate the only instance that wasn't working before.
I don't know why the phrase "the search" is bold since it isn't in the search. That's there both with or without the patch.
I also really have no idea why
<a title="View user profile." href="/user/1" lang="" about="/user/1" typeof="schema:Person" property="schema:name" datatype="" class="username">admin</a> - 09/23/2014 - 10:45 - 0 comments
shows up, but it's in the page without the patch too.Can you tell me where I got my search wrong. You can edit the content in both instances above for the next few hours.
Comment #34
jhodgdonThanks for testing!
You have the word "the" in your search query, as well as the word "search", which is why both words are highlighted in the results.
The "view user profile" thing is a completely separate issue: #2297711: Fix HTML escaping due to Twig autoescape
And just to clarify... This issue's bug is that if you have the same word in your query twice, results do not happen. So you'd need to do something like "search OR single search OR test". Your test query does not have the same word in there twice. I think the issue summary explains it nicely...
Comment #35
mgiffordI still can't replicate the bug.. Here's the same instance above (still with 4hrs left on it):
http://sa41d02fd7a6a027.s2.simplytest.me/search/node?keys=search+OR+sing...
http://sfe7665a808d27f8.s2.simplytest.me/search/node?keys=search+OR+sing...
I get the same result with or without the patch.
Please feel free to edit the content or the search query in the example above and show me what you mean..
Comment #36
jhodgdonWell, I am still able to follow the steps in the issue summary and reproduce this issue. There is also a failing test
Comment #37
jhodgdonSorry, I hit save before I meant to. There is also a failing test in comment #9. I do not know what your simplytest.me sites are doing, or how they are set up, but I can still reproduce the issue by following the steps in the issue summary...
Comment #38
mgiffordI just loaded Core + the patch in #26 as per:
http://simplytest.me/project/drupal/8.0.x?patch[]=https://www.drupal.org...
If you're logged in you've got 6hrs to test, evaluate, share.
There's a common u/p so it is really easy to pass around and see configurations.
I've looked through the summary. I don't see steps in the form of 1, 2, 3, 4...
By using SimplyTest.me we can look at the exactly the same interface and make sure that we're talking about the same thing. They are both still active [40 minutes left] or you can set up your own.
I can't repeat it.
Comment #39
jhodgdonThe issue summary gives some sample content and some sample searches. If I put in that sample content, run cron to index for searching, and run the sample searches, I get the same results as reported in the issue summary.
If you are not getting these on simplytest.me, without the patch, I do not know what you did. I am still able to reproduce the original bug report without the patch.
If you made a simplytest.me with the patch, I would assume the problems are fixed with the patch -- the patch has tests with it, which pass with the patch, and I have personally tested the patch and it fixes the problem.
Comment #40
mgiffordAs I said in #33, I tested with & without the patch.
Comment #41
jhodgdonOK, I just did the following:
a) Went to simplytest.me
b) Selected Drupal Core version 8.0.x
c) Installed with Standard profile
d) Created a Basic page node with title "Page" and body "I created this page with some very interesting content in it."
e) Went to admin/config/system/cron and ran cron.
f) Went to search/node
g) Searched for
this OR that this OR foo
The node was not found in the results.
h) As a reality check, searched for:
this OR that
The node was found.
My simplytest.me will expire in 12 minutes so I will not bother to paste the URL here. But maybe you can follow those instructions and verify that ***without the patch*** this bug still exists. Then you can proceed to add the patch and follow the same sequence and verify that the search works.
Or you could read the patch and the tests that it adds. There is a test-only patch in #9 that demonstrates that the tests fail without the patch (you can still look at the results there). The latest patch (with tests and code fix) passes.
I don't know what else to suggest. The steps to reproduce the issue are in the issue summary at the top of this page...
Comment #42
pwolanin CreditAttribution: pwolanin commentedGiven the clear test case here and clear understanding of the bug, I think this is rtbc.
Comment #43
alexpottCommitted ca306bf and pushed to 8.0.x. Thanks!
Comment #45
jhodgdonThanks for the commit! I'm probably not going to port this to 7.x any time soon, so... any takers?
Comment #49
Rajab Natshah CreditAttribution: Rajab Natshah at Vardot commentedComment #50
Rajab Natshah CreditAttribution: Rajab Natshah at Vardot commentedComment #51
aerozeppelin CreditAttribution: aerozeppelin at California State University San Bernardino commentedPorting this to D7.