Closed (fixed)
Project:
Porter Algorithm Search Stemmer
Version:
6.x-2.x-dev
Component:
Code
Priority:
Normal
Category:
Bug report
Assigned:
Reporter:
Created:
11 Jul 2010 at 14:42 UTC
Updated:
27 Aug 2010 at 16:50 UTC
Jump to comment: Most recent file
Comments
Comment #1
jhodgdonThis is correct. It only currently affects people using Porter Stemmer with Search by Page, because the core search module didn't get this excerpt fix into D7 (much less D6).
The problem is that when the SBP excerpt function lets Porter Stemmer highlight words, Porter Stemmer could miss some matches because it doesn't lower case the words before stemming them. There's a suggested fix over on that other issue.
Comment #2
gpk commentedComment #3
gpk commentedComment #4
jhodgdonProbably this can be made more efficient by lower casing once instead every time inside the if(). And it needs a test. And a port to Drupal 7. But it's along the right idea.
Comment #5
gpk commentedIndeed, indeed. But you missed the best bit - search finally *works*!!!!! --> Many thanks for all your tireless work on this :-D
>more efficient by lower casing once
I guess one could store lowercased words (or stems) in a static array.. wonder if it would be worth it, given that the hook can be invoked with the same $key more than once.
Reference:
http://drupalcode.org/viewvc/drupal/contributions/modules/search_by_page...
http://drupalcontrib.org/api/function/porterstemmer_sbp_excerpt_match/6
Comment #6
jhodgdonYou're probably right about that, and I don't think the same key would be likely to come in more than once. I was thinking it was in a loop, but I was working from memory rather than looking at the code.
Anyway, I do very much appreciate the patch, as well as of course finding the bug in the first place. :)
And just for the record, my work on search is not tireless. I am actually very very very tired of the way things have and haven't been accepted into Drupal 7. Sigh. Anyway, not an issue with Porter Stemmer or Search by Page. :)
Comment #7
jhodgdonThis prompted me to write a test for the excerpt matching function, which uncovered some more issues with it. They're now fixed in the dev branch for Drupal 6, and here's the patch I ended up with.
Care to give it a try gpk? Then I can put out a new release...
Comment #9
jhodgdonI also just made this change in the 7.x-dev branch.
Comment #10
gpk commentedHi Jennifer,
I can confirm that porterstemmer.module 1.2.2.10 fixes the capitalization problem and also the problem you discovered where a "false match" prevents a subsequent match from being found.
I tested this on a node with content "qqQqqeat qqQqqeateat qqQqqeating hello world" and searched for "qqqqqeat". Without the fix I got
qqQqqeat qqQqqeateat qqQqqeating hello world
After the fix I got
qqQqqeat qqQqqeateat qqQqqeating hello world
(I also did some other tests.)
While I was at it I dug around for the other problem I alluded to at #27 and #29 of #493270: search_excerpt() doesn't work well with stemming and will post my findings there!
Many thanks,
Comment #11
gpk commentedOn second thoughts, your test covers the case where lowercase text in the key matches uppercase in the content, but not vice versa, e.g. what if the key entered was "waLk" ?
Comment #12
jhodgdonHmmm. I'm not too concerned about waLk, since I think it's unlikely to be entered, but I will add a test for a key of Walk and verify that it matches too...
Well. I am going to first see if the search module automatically lower cases search strings before you even get to this point. If it does that, then we don't need the test, but if it doesn't, then we need the test.
Comment #13
gpk commentedThe test is needed for the same reason that drupal_strtolower() appears twice in this line i.e. $key may contain uppercase...
if (drupal_strtolower($foundstem) == drupal_strtolower($key)) {Comment #14
jhodgdonI get it. I just wanted to check to see whether search.module lowercases all search query keywords before we even get to the point of finding the excerpt. If it does, then we don't actually have to worry about $key being anything but lowercase. If it doesn't, then we need the drupal_strtolower($key) and a new test.
Let's see... In core search, the answer is no -- the keywords are not lowercased before the extract. So you are correct, additional tests are needed.
Comment #15
jhodgdonOK. I've added tests for the keyword capitalization as well, to both D6 and D7 versions. New test: