Early Bird Registration for DrupalCon Portland 2024 is open! Register by 23:59 PST on 31 March 2024, to get $100 off your ticket.
I believe the fix is pretty simple I think the (int) cast is not really needed, see following
search.module
function search_index($sid, $type, $text)
$word = (int)ltrim($word, '-0'); //original line, causes overflow with large numbers
$word = ltrim($word, '-0'); //fixed line
function _search_parse_query(&$word, &$scores, $not = false)
$s = $num ? ((int)ltrim($s, '-0')) : $s; //original line, causes overflow with large numbers
$s = $num ? (ltrim($s, '-0')) : $s; //fixed line
Comment | File | Size | Author |
---|---|---|---|
#82 | drupal-74673-82-prevent-integer-overflow-within-search.patch | 935 bytes | Alan D. |
#76 | 74673-slash-tests.patch | 974 bytes | mcarbone |
#74 | 74673-74.patch | 16.32 KB | mcarbone |
#69 | 74673-69.patch | 16.32 KB | mcarbone |
#62 | 74673-62.patch | 16.45 KB | mcarbone |
Comments
Comment #1
magico CreditAttribution: magico commented@stilldodge: could you make a real patch for this (it is still present in HEAD)?
Thanks!
Comment #2
Jo Wouters CreditAttribution: Jo Wouters commentedI tested this with head, and could not repeat it.
Numbers up to 19 digits (bigger then the maxint on my machine) are correctly indexed, and I can search for them (and find them).
Comment #3
magico CreditAttribution: magico commentedThanks Jo!
Comment #4
(not verified) CreditAttribution: commentedComment #5
marikeeler CreditAttribution: marikeeler commentedHi, I know this is 'closed' But I have an issue with searching for #'s (digits). I tried the first posts fix (removing int) and it's still not working. Not sure what the HEAD aspect of this thread means, can someone explain?
I'm not a php expert.
Main issue: website is a forum, posts will be made with Order #'s, example. post title = 'PO Order # 455667' .
(Order # will also be in the body of post).
I would like users to type in the order # in the search box, and have it come up in results. no luck just yet.
(words work, just no numbers)
please please help.
Thanks
Comment #6
jhodgdonI am not having this problem in Drupal 6.15. If I put a number in some content, I can search for that number.
I tried it in the title as well as the body, and I tried it with and without a space between the # and the number.
Can you provide more information on how your site is set up, which modules you are using, etc.?
Comment #7
jhodgdonActually, I was able to duplicate this in Drupal 7 with a sufficiently large number. My page content is:
9999999999999888888888888877777777777776666666666666
978-0446365383
111
1111111111111111
1111111111111111111111111111111111
11111111111111111111111111111111111111111111111
Everything except the first number is searchable, but the first number isn't, and appears in my search index as 2147483647.
Comment #8
catchDowngrading from critical, strange bug though!
Comment #9
jhodgdonSee the original bug report to see why it is happening (int cast of the word, for some unknown reason?). Should be an easy fix, but we need a patch and probably a test would be good too.
Comment #10
sin CreditAttribution: sin commentedThis (int) cuts fractional part of a float numbers to not to be indexed and searched. We need to replace it like this:
$word = ltrim($word, '-0')
$dot_pos = strpos($word, '.');
if ($dot_pos) {
$word = substr($word , 0, $dot_pos);
}
Comment #11
jhodgdonGood idea. We still need a patch and a test.
Comment #13
sin CreditAttribution: sin commentedI'll do the patch.
Comment #14
sin CreditAttribution: sin commentedComment #15
sin CreditAttribution: sin commentedOther thread for reference:
http://drupal.org/node/323299
Comment #17
jhodgdonIt looks like you need to set your line endings on your development enviroment/editor to Unix-style, and redo the patch.
Comment #18
jhodgdonOther comments on the patch:
a)
Could use a comment saying it is stripping off leading zeros and - signs, and why this is a good idea.
b) Since the code that is preprocessing numbers is duplicated in search.extender.inc and search.module, I really think it should be made into a little function. The main reason for that is that it's essential it gets done the same way both times, and since the code has grown to 5-6 lines, it could easily be modified one place and not the other.
c) Needs a test to verify we can index/search for large numbers, so that this doesn't happen again. The test should include negative numbers, numbers starting with zero, and large numbers. I think the easiest thing would be to make a node including all three of these, run cron to index it, and then search for all of the numbers and verify that the node shows up in the search results.
Comment #19
jhodgdonI'm working on this - new patch plus tests.
Comment #20
jhodgdonI did some testing... a couple of notes:
a) It turns out that truncating to integer is totally unnecessary. The reason is that search_simplify() is removing all punctuation (including decimal points) inside numbers.
b) I think that the real reason that large number searching is failing (at least in Drupal 7) is that the search index word field is only 50 characters. So words are truncated to 50 characters during search indexing. However, they weren't being truncated in the same way during searching, so they weren't matching the values in the index and were not found.
I'm still working on this... will try to get a working patch with a test. Right now my test isn't passing, so I have a bit of work to do (tomorrow).
Comment #21
jhodgdonA few more notes:
1) Removing the initial - sign on a number is unnecessary, because all punctuation including hyphens is removed during search_simplify anyway.
2) There is no way to search for negative numbers. The - signs are removed during search indexing, so the page is indexed as if it was a positive number. Then during searching, -(text) indicates to find a page without (text) in it. That's just a note; this patch doesn't do anything about it or change that in any way.
3) The ltrim() that removes leading zeros was not allowing you to search for 0234 and have it match 234. It only enabled that these words were scored the same. This seemed like it wasn't really the intention, so I made it so 0234 matches 234.
So, I've made a patch that makes sure truncation and leading 0 removal happens the same way during search indexing as searching, so that numbers can be searched more successfully. It also removes unnecessary (int) casts and redundant removal of -.
I've also added some tests. There are two sets of tests: (a) test that various numbers can be searched for, and (b) test that numbers with internal punctuation and leading zeros match each other.
Without the patches in search.module and search.extender.inc, my new tests all pass except the larger-than-50-characters number search test, and the leading-zero-matching tests. With the patches, all tests in search.test pass.
Gosh this is an old bug (July 2006?!?) with only a 5-digit issue number...
Comment #22
sin CreditAttribution: sin commentedThank you for working on it!
You right about (int), search_simplify indeed removes a decimal point, (int) looks useless, now I have no idea what it was intended for, +1 to drop it.
Search works fine with your patch. I noticed that searching for 0.777 returns no results before the patch but works with your patch as expected.
Ancient bug indeed. The fix may worth backporting.
Comment #23
jhodgdon#21: 74673.patch queued for re-testing.
Comment #25
jhodgdonI don't know why that patch wouldn't apply, but here's a reroll (it applied fine for me).
Comment #26
douggreen CreditAttribution: douggreen commentedSimplify keyword according to indexing rules and external preprocessors.
toSimplify keyword as it would be simplified during indexing.
makes less sense to me now.Some suggested changes attached in patch, feel free to use, or go back to your last one...
Comment #28
douggreen CreditAttribution: douggreen commentedFailure has something to do with removing the search_simplify argument. Since we don't need the argument, this might be a legitimate failure, but we might need to fix the test
Comment #29
jhodgdonThere's an additional call to search_simplify that is not in the patch (unchanged line), and it doesn't pass TRUE. So we need the argument. I've put it back. I also revised that comment so hopefully it is better, and otherwise used your patch. Hopefully it will pass tests this time...
Comment #30
jhodgdonComment #31
douggreen CreditAttribution: douggreen commented@jhodgdon, sorry for being a pain here, but the only other call to search_simplify I see is in search.test. Where is this call that doesn't pass TRUE?
Comment #32
douggreen CreditAttribution: douggreen commentedoops, didn't mean to mark it RTBC yet, sorry :(
Comment #33
jhodgdonDuh. I was searching the wrong repository (D6). You are correct that in the actual code, it's always TRUE.
And now I am remembering that I left that argument in there just for that test, because I was too afraid of splitting up the test strings, which I couldn't really edit very well (they have all kinds of interesting arguments in them). We could make the default be FALSE, although that would mean yet another thing to document as a change from D6 to D7. Or you could take a stab at breaking up those strings. I can't do it in my editing setup.
Comment #34
duellj CreditAttribution: duellj commentedI've tested the patch, and both the search changes and the testing are working properly. I can verify the following:
-Search correctly handles complex numbers
-Tests pass correctly with patch
-Tests fail without patch
Comment #35
douggreen CreditAttribution: douggreen commented@jhodgdon if you have the energy, I'd prefer that remove the extra argument and fix the test, but I'll leave this RTBC and not fight very hard.
Comment #36
jhodgdonOK, I'll see what I can do. If it's impossible/nearly so, I may decide not to though. Last time I looked, there were many non-Latin characters in that part of the tests that I couldn't edit properly, and I didn't see a way to break it up into smaller chunks, but we'll see.
Comment #37
jhodgdonI guess another option would be to make the default of the argument go the other way, and pass in FALSE (for no truncate) from the test. In that case, we would need to document this function change on the module upgrade guide (since the default behavior would be different from D6).
Comment #38
douggreen CreditAttribution: douggreen commentedMy suggestion to remove the argument is that we have extra code in core, that is only there because our tests need it. Any extra function argument, whether the default or not, has this same problem.
Comment #39
jhodgdonOK, point taken. I'll see what I can do about splitting up those strings in the test.
Comment #40
jhodgdonHmmm...
The problem is the file modules/search/tests/UnicodeTest.txt
I don't think I have an editor that can edit this file, because it just looks like a bunch of non-printable characters on my screen (I tried several editors in both Windows and Ubuntu Linux). It would need to be truncated to at most 50-character lines to work with the proposed changes to search_simplify.
Does anyone have an editor/system that can handle this file? Maybe we should ask chx, who put these characters in on this issue:
#604002: Poor search support of some Unicode scripts
to see if he can perhaps truncate them to 50 character lines?
Comment #41
jhodgdonOK, I've fixed the test so that it breaks the strings up into shorter ones before trying search_simplify() on them. This way we don't have to modify that unicode text file. I also added a bunch of comments to the test, because it took me a bit to figure out what was going on.
The search simplify test now passes for me. Let's see what the test bot says...
Comment #42
jhodgdonComment #43
jhodgdonOh. I also found some trailing spaces in search.test and removed those (I changed my editor settings so they are showing up in BRIGHT RED and they are hard to miss...)
Comment #44
jhodgdon#41: 74673-v41.patch queued for re-testing.
Comment #45
jhodgdon#41: 74673-v41.patch queued for re-testing.
Comment #46
jhodgdon#41: 74673-v41.patch queued for re-testing.
Comment #47
jhodgdon#41: 74673-v41.patch queued for re-testing.
Comment #49
jhodgdonHere's a reroll. Patch still could use a review...
Comment #50
jhodgdonWhoops. I guess I didn't attach that last reroll. Here's a reroll.
Comment #51
jhodgdonIn case someone wants to review this patch, here's a summary of what the patch does:
- search_simplify() now truncates everything to 50 characters, because the word index db table only stores 50 characters. This means that both at indexing time and search time, characters past 50 are ignored, and they can match. As things were previously, that was only done at indexing time (implicitly when you saved the word in the index db), so longer words wouldn't come up as search matches.
- Tests for unicode in search_simplify() had to be rewritten to take this into account.
- Changes were made when indexing/searching numbers, so that leading zeros were removed in numbers, but numbers were not cast using (int), as that was causing this bug.
- Tests were added to test number searching/matching.
- A few spaces at ends of lines were removed.
Comment #52
mcarbone CreditAttribution: mcarbone commentedThis is a fantastic patch. Tests are thorough, and the change is very much needed. Two minor things you may want to add:
1) While it's true that negative numbers can't be searched (e.g., -100) because of the not functionality, it actually does work fine if you use phrase searching with quotes (e.g., "-100"). I created a node with -123456.7890 as the body and was able to find it with both "-123456.7890" and "-1234567890" -- so I think an example like this might be worth adding to the tests.
2) However, a minor bug popped up when searching for that number. If I search for 123456.7890 it comes up in the result, and the occurrence of that number in the node is highlighted in bold. If I search for 1234567890, the result comes up, but it is *not* highlighted in bold. In other words, the search_excerpt() function that generates the search snippet isn't using search_simplify() before highlighting the word match.
Comment #53
mcarbone CreditAttribution: mcarbone commented#2 above also applies to numbers with > 50 characters. I created a node with 666666666666666666666666666666666666666666666666666666666666 as the body. If I search for the whole thing, it gets highlighted in the snippet. If I remove a few characters so that it's still longer than 50 characters, it pulls up the correct result but doesn't bold the original number string.
Comment #54
David_Rothstein CreditAttribution: David_Rothstein commentedI'm guessing this got moved between queues accidentally :)
Looks like a great patch, but no time for a real review at the moment, unfortunately.
Comment #55
mcarbone CreditAttribution: mcarbone commentedI think I might take a stab at resolving the issues in #52 and #53 -- jhodgdon, ping me on freenode if you've already started on them but hopefully you haven't.
Comment #56
mcarbone CreditAttribution: mcarbone commented#50: 74673-50.patch queued for re-testing.
Comment #58
mcarbone CreditAttribution: mcarbone commentedI reran the test because I discovered locally that a recent change broke slashes in search. See #915214: Regression: Slashes broken in search.
Comment #59
mcarbone CreditAttribution: mcarbone commentedAll right, patch re-rolled to account for issues in #52 and #53 above, with additional tests for both quoted negatives and for highlighting simplified matches in search snippets.
The code to support simplified matches in search snippets is somewhat non-trivial, but I'm fairly sure it's performant. It runs only on keywords that required simplification for matching, and so is O(kn) where k is the number of such keywords and n is the number of words in the original text. There's no reason to expect k to be larger than a handful.
This will fail testing until #915214: Regression: Slashes broken in search is resolved.
Comment #61
jhodgdonSupporting more matches in search results is a separate issue, so please don't combine the patches... that issue has been around for a while. I'm on vacation in Europe for the next three weeks, not checking email much or doing patches... sorry I can't be of more help at the moment!
Comment #62
mcarbone CreditAttribution: mcarbone commentedjhodgdon, you're totally correct -- the search_excerpt issue I mentioned above exists without your original patch. (Newly created issue here: #916086: search_excerpt() doesn't highlight words that are matched via search_simplify())
I've re-attached your original patch with one line added with an additional test for a quoted negative. Minus the irrelevant search_excerpt issue, my original review from #52 then finds this patch RTBC.
However, the original patch will still fail until #298561: menu_tail should automatically act as a load function as well to allow slashes in search is committed so I won't switch it to that until this passes tests.
Have an excellent vacation.
Comment #64
jhodgdon#62: 74673-62.patch queued for re-testing.
Comment #65
jhodgdonI took a quick look at this patch, and the code part looks reasonable (didn't really look at the tests yet)... what exactly in here is dependent on that other issue? Can we take that part out and get this in?
Comment #67
mcarbone CreditAttribution: mcarbone commentedYeah, it's just the test for "12/34/56789" -- the slashes need the menu_tail_load fix to work. I'll strip that out and put up a new patch later today.
Comment #68
mcarbone CreditAttribution: mcarbone commentedOh, this is coming back to me. Those tests are pretty much all yours, except for one I added with a negative number in quotes.
Comment #69
mcarbone CreditAttribution: mcarbone commentedI simply commented out the two tests that were failing, as I really hope they get added in as soon as the menu_tail_load fix goes in. (And I really hope that fix goes in soon, as date search is broken without it.)
This patch, jhodgdon, was pretty much entirely created by you, and I reviewed it awhile back, but I'm not sure it'd be appropriate for me to RTBC it or not at this point.
Comment #70
jhodgdonWell, I think both of us like the patch quite a bit, so it can probably be marked RTBC (assuming the test bot agrees). Thanks for all the work you did!
Comment #72
jhodgdonLooks like the patch fails 8 tests...
Comment #73
jhodgdonpunting back to you, mcarbone...
Comment #74
mcarbone CreditAttribution: mcarbone commentedWeird, I somehow uploaded the wrong patch.
Comment #75
webchickGreat bug fix, great tests. What's not to love?
Committed to HEAD. Thanks!
Moving back to 6.x for consideration.
Comment #76
mcarbone CreditAttribution: mcarbone commentedBefore we switch this to 6.x-dev to be ported, can we get in these two small tests that we commented out before because they were waiting for #298561: menu_tail should automatically act as a load function as well to allow slashes in search?
Comment #77
jhodgdonGood idea, thanks -- assuming the tests pass of course. :)
Comment #78
Dries CreditAttribution: Dries commentedCommitted to CVS HEAD. Thanks.
Comment #79
jhodgdonBack to 6.x for consideration.
Comment #80
jhodgdonTalked with Gabor (the Drupal 6 branch maintainer) and D6 issues are really not being committed unless they're really essential -- we really don't have a test system for Drupal 6 and it's too dangerous. So... putting this back to D7 / fixed.
Comment #82
Alan D. CreditAttribution: Alan D. commentedA simple patch if you do need this on D6 :/