The simpletest's written by Steven for the search module (see #205631) uncovered a bug in the search results regarding search queries with short words.
The problem is with the logic for "simple" searches. "Simple" searches are ones that don't need to be joined with the {search_dataset} -- something previously referred to as the second pass or second pass arguments. Searches are faster when they avoid the join with {search_dataset}, so the setting of the "simple" variable tries to avoid this. However, the simpletest's found one case that wasn't quite as simple as original thought. This patch fixes that.
I'm not completely satisfied with this patch, because it adds 5 lines of code, and uses a variable named "count2". If we do go with the extra count variable, the variable names need to be cleaned up. But before we go with this solution, I'd like to see if a more elegant solution materializes, possibly by counting counting the terms in the $query array.
We should do some performance tests and/or look at the queries, just to make sure that this patch is not joining {search_dataset} too often.
Comment | File | Size | Author |
---|---|---|---|
#14 | 205631.patch | 2.16 KB | chx |
#8 | 205631.patch | 2.81 KB | douggreen |
205631.patch | 1.46 KB | douggreen | |
Comments
Comment #1
Rok Žlender CreditAttribution: Rok Žlender commentedI just committed search tests which run with no failures after this patch. http://drupal.org/node/205631
Comment #2
Gábor HojtsyThis patch does not look like affecting performance much by looking at the code. Are you talking about performance implications of a possible alternate version?
Comment #3
douggreen CreditAttribution: douggreen commentedI had two concerns: (a) a minor concern that the code didn't seem elegant to me, and thinking that there might be a more elegant way to set the $simple variable, (b) a bigger concern that the additional case which sets the $simple variable (introduced by this patch) is as narrow as it can be, and doesn't introduce full table queries on the {search_dataset} more often than needed.
Comment #4
chx CreditAttribution: chx commentedThe simpletests provided by Steven (thanks much!) prove that the patch works. The patch needs a bit of explanation which I will try to give:
While
$count
counts scoring words,$count2
counts the words that are either numeric or are long enough. So if you just type in a bunch of words -- which I would say is the typical case -- that will be interpreted as a series of words ANDed together, this code will fire and if just one of them is long enough then it's still a simple query.Comment #5
douggreen CreditAttribution: douggreen commentedif there are short words, i.e. words smaller than the minimum_word_size, then it's not a simple query and {search_dataset} must be used. The simple test for phrases is handled elsewhere. So this test just checks for simple AND and simple OR terms, and if any of the words are smaller than the minimum_word_size, then, it's not so simple after all.
Comment #6
Gábor HojtsyLet's get a better variable name instead of $count2 then.
Comment #7
chx CreditAttribution: chx commented" if any of the words are smaller than the minimum_word_size, then, it's not so simple after all." -- you sure that's what this piece does? To my eyes it seems that if there are no words above the minimum size then will the new !$simple fire.
Comment #8
douggreen CreditAttribution: douggreen commentedIf the word is too small then $count2 == 0 and $simple is set to FALSE. @chx, does that make sense, or do you still think I'm seeing something wrong?
How about changing both $count variables to $num_new_scores and $num_valid_words, respectively. The first count, actually keeps track of the number of words that were added to the $scores array, that weren't already there... thus the suggested variable name, while the second count keeps track of the number of valid words regardless of whether they've already been added to the $scores array or not.
Comment #9
Gábor HojtsyNew variable names look
sway better! I guess the performance argument is left.Comment #10
Gábor HojtsyChx notes there is an already committed hunk from your previously committed patch included.
Comment #11
chx CreditAttribution: chx commentedI am probably daft but I still do not get it. The issue here says "if any of the words are smaller than the minimum_word_size" but $num_valid_words apparently counts the number of the words that are longer than the minimum_word_size and if that's zero then it's not a simple query. So it's "if all of the words are smaller than the minimum_word_size" notice the any vs all. What am I missing here?
Comment #12
chx CreditAttribution: chx commentedSorry, crosspost.
Comment #13
douggreen CreditAttribution: douggreen commentedsearch_parse_query() first the breaks the search phrase into individual array elements, with the exception that expressions are still lumped into a single array. Then _search_parse_query is called on each of these array elements.
If you search for
alpha beta
, which looks for the words alpha and beta, search_parse_query() first puts this into the $keys['positive'] array where each of these words in the array('alpha', 'beta'). Then _search_parse_query() is called twice, once with each of these words, and $num_valid_words is set to 1, on each call.If you search for
"alpha beta"
, which looks for the words alpha and beta, but also does a second pass using {search_dataset} to make sure that these two words occur next to each other as in a phrase search for "alpha beta", This time, search_parse_query() puts the two words into a single array element, and _search_parse_query() is called only once, and the explode and loop in _search_parse_query, then iterate over both words, and $num_valid_words returns 2.If either of these words is smaller than the minimum_word_size, We need to set $simple to FALSE We actually only need to figure this out for the first case, the AND case, because we already catch phrase searches on line 1285. So, looking at the first case, the AND search, _search_parse_query is called twice, once with each word. One of these calls with return $num_valid_words of 1, and the other call will return $num_valid_words of 0. On that second call that returns $num_valid_words of 0, $simple will be set to FALSE.
At least that's what I think (and hope) is happening.
Comment #14
chx CreditAttribution: chx commentedOh that thingie is called per word when it matters! My bad. Let me resubmit your patch (do not credit me) with the already committed hunk removed.
Comment #15
Gábor HojtsyCommitted, thanks.
Comment #16
Anonymous (not verified) CreditAttribution: Anonymous commentedAutomatically closed -- issue fixed for two weeks with no activity.