search_excerpt() doesn't highlight words that are matched via search_simplify() [#916086]

Comment	File	Size	Author
#100	highlight_words_that_are_matched_via_search_simplify-916986-100.patch	14.34 KB	RaulMuroc

#94	interdiff-91-94.txt	2.27 KB	jhodgdon
#94	916086-94-search_excerpt_fixup.patch	23.7 KB	jhodgdon

#91	916086-91-search_excerpt_fixup.patch	22.2 KB	jhodgdon

#87	916086-86-search_excerpt_fixup.patch	22.22 KB	jhodgdon

#82	916086-82-search_excerpt_fixup.patch	22.2 KB	jhodgdon

#76	916086-76-search_excerpt_fixup.interdiff.txt	8.47 KB	jhodgdon
#76	916086-76-search_excerpt_fixup.patch	22.22 KB	jhodgdon

#70	916086-70.patch	20.24 KB	pwolanin

#66	916086-66.patch	20.23 KB	pwolanin

#63	916086-63.patch	20.35 KB	jhodgdon

#58	916086-58-fix-preg.patch	19.85 KB	jhodgdon

#56	916086-rework-excerpt-function.patch	19.39 KB	jhodgdon

#54	916086-newtests.patch	5.75 KB	jhodgdon

#52	916086-morefails.patch	9.62 KB	jhodgdon

#47	916086-fixlogic.patch	2.19 KB	jhodgdon

#43	search_simplify_highlight-916086-43.patch	5.58 KB	Albert Volkman

#32	916086-32.patch	9.2 KB	mcarbone

#24	916086-24.patch	9.23 KB	mcarbone

#20	916086-20.patch	9.22 KB	mcarbone

#14	916086.patch	8.98 KB	mcarbone

#10	916086.patch	8.97 KB	mcarbone

#7	916086.patch	8.95 KB	mcarbone

#5	916086.patch	9.01 KB	mcarbone

	search_excerpt_simplified.patch	3.44 KB	mcarbone

Comment #1

eustace CreditAttribution: eustace commented 4 October 2010 at 04:53

@mcarbone, I'll like to verify the bad behavior as reported before applying the patch, however, I'm seeing some anomalies.

Search only gets a hit when "123456.7890" is part of the node title. Not true when it is only in the node body.
Search does not get a hit with "1234567890" even though it's reported it does but without a link.
Probably not related but previous search results remain persistent after current searches. Clearing caches and re-initializing page don't seem to help.

Is there something I'm not doing? Thx

Log in or register to post comments

Comment #2

mcarbone CreditAttribution: mcarbone commented 4 October 2010 at 13:43

Silly me, you would have to apply the latest patch at #74673: Search does not correctly handle large numbers (> maxint) so you cannot search by UPC code for example in order to reproduce with my steps above. Without it, you could do it using hyphens instead of periods, I'm pretty sure.

I don't see how your caching issue would be related to this thread.

Log in or register to post comments

Comment #3

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 15 October 2010 at 21:36

Status:

Needs review

» Closed (duplicate)

This issue is a duplicate of
#493270: search_excerpt() doesn't work well with stemming
actually

Log in or register to post comments

Comment #4

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 22 November 2010 at 16:41

Status:

Closed (duplicate)

» Active

Reopening, because we're not allowed to make the big API change that would actually resolve the whole issue, so this is no longer a duplicate.

Log in or register to post comments

Comment #5

mcarbone CreditAttribution: mcarbone commented 24 November 2010 at 14:51

Status:

Active

» Needs review

File	Size
916086.patch	9.01 KB

Took the patch from #493270: search_excerpt() doesn't work well with stemming, stripped out the new hook, and tweaked and moved around some comments.

Log in or register to post comments

Comment #6

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 24 November 2010 at 16:15

Status:

Needs review

» Needs work

There are several whitespace issues in your patch, such as here (should be 2 space indentation, not 3):

       if (preg_match('/' . $boundary . $key . $boundary . '/iu', $text, $match, PREG_OFFSET_CAPTURE, $included[$key])) {
-        $p = $match[0][1];
+         $p = $match[0][1];

Also in the new function search_simplify_excerpt_match(), a few of the doc header lines end in spaces, the indentation in @param and @return section is wrong (should be 2 spaces), and there should be a blank line between the @params and the @return.

Regarding the actual code, I'm not sure I fully understand this bit:

+      // Iterate through the text in the window until we find the full original
+      // matching text.
+      for ($i = strlen($simplified_key); $i <= strlen($window); $i++) {
+        $keyfound = substr($text, $value[1], $i);
+        if ($simplified_key == search_simplify($keyfound)) {
+          break;
+        }
+      }

It looks like it's making an assumption that the original text is at least as long as the simplified text? I guess that is currently the case with the current search_simplify() function, but I'm concerned that if this assumption is silently hard-wired into this function, it might not always be true. Or am I misreading this code? Also, the second arg to substr is the requested length of the string, so the end of the loop should not be strlen($window), but strlen($window) - $value[1], plus or minus 1.

Other than that, the proposed fix looks good to me.

Log in or register to post comments

Comment #7

mcarbone CreditAttribution: mcarbone commented 24 November 2010 at 17:45

Status:

Needs work

» Needs review

File	Size
916086.patch	8.95 KB

"It looks like it's making an assumption that the original text is at least as long as the simplified text? I guess that is currently the case with the current search_simplify() function, but I'm concerned that if this assumption is silently hard-wired into this function, it might not always be true."

You are right that that assumption was being made, as I was trying to minimize the number of iterations. But there's little increase in searching the whole window, so I've fixed this.

"Also, the second arg to substr is the requested length of the string, so the end of the loop should not be strlen($window), but strlen($window) - $value[1], plus or minus 1."

No, the second arg to substr is the position of the beginning of the window in the original text where the match begins, which we got from using the PREG_SPLIT_OFFSET_CAPTURE flag. So what this little bit of code is doing is taking that window, which it knows begins with the match from the strpos above it, and then checking a larger and larger portion of that window until it finds some original text that when simplified matches the simplified key. It ends at strlen($window) because on the very last step it's simplifying the entire 80 character window. It should pretty much never get that far except when the match is derived from a very large piece of original text.

Let me do some justifying: the problem here is that an infinite number of strings can simplify to one single key, just by adding a bunch of characters that get stripped. So we could have a key that's 5 characters long that's matched by something in the original text that's 10000 characters long. To go through the entire text to account for every possible length of original text would not be very performant, so I figured since the excerpt window only looks 80 characters ahead from the matching position anyway, that's the window I'd use to look for a simplified match. I believe this will catch nearly all real-world examples, but it's true that it may fail on really long words that simplify down. The alternative would be to keep expanding the window until the end of the full text, rather than 80 characters. That would catch all possible matches, but we'd essentially be simplifying the entire text for every boundary, which for long pieces of text could get ugly.

Log in or register to post comments

Comment #8

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 24 November 2010 at 18:06

Sorry, I meant the *third* argument, not the second arg to substr, but I see now that strlen($window) is the correct max length.

So this looks good to me, and the tests demonstrate several cases... Can we add tests for accented characters though -- those are another important case? There are some specific examples at #731298: Searches for words with diacritics/accents: word not highlighted in results and #157684: Search module does not highlight non-ASCII search terms that could be used.

And while you're at it

+      for ($i = 1; $i <= strlen($window); $i++) {

This is somewhat inefficient, since strlen($window) gets evaluated each time through the loop, but is unchanging. Might be better to evaluate once outside the loop?

Log in or register to post comments

Comment #9

24 November 2010 at 18:31

Status:

Needs review

» Needs work

The last submitted patch, 916086.patch, failed testing.

Log in or register to post comments

Comment #10

mcarbone CreditAttribution: mcarbone commented 24 November 2010 at 18:36

Status:

Needs work

» Needs review

File	Size
916086.patch	8.97 KB

As I argued in #731298: Searches for words with diacritics/accents: word not highlighted in results, diacritic highlighting in excerpts have nothing to do with search_simplify, so I don't think it's appropriate to add a test here for them. That thread suggests using search_simplify to remove diacritics entirely, so if that was done then tests could be added there. But right now, the reason diacritic matching happens is because of the particular mysql collation being used in the index tables.

Added strlen fix.

Log in or register to post comments

Comment #11

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 24 November 2010 at 18:46

This is still not quite right:

+      for ($i = 1; $i <= $length; $i++) {

$length is hard-wired to 80, but there's no guarantee that $window is that long.

Point taken on the diacriticals.

Log in or register to post comments

Comment #12

mcarbone CreditAttribution: mcarbone commented 24 November 2010 at 19:15

Well, if I set $length after $window is set then there's no point creating a variable at all, as strlen($window) would only get called once in the for loop anyway. So either I can switch it back to the way I had it, or I can leave it as it is now. I think it's OK as is, because even though the window length might not be 80 at the end, we know in that case it will necessarily find a match before it gets to anything beyond the actual window size.

Log in or register to post comments

Comment #13

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 24 November 2010 at 19:34

That is not true. The ending condition is re-evaluated each loop iteration.

Log in or register to post comments

Comment #14

mcarbone CreditAttribution: mcarbone commented 24 November 2010 at 19:49

File	Size
916086.patch	8.98 KB

Oops, you're right. Attached.

Log in or register to post comments

Comment #15

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 24 November 2010 at 20:10

OK, that looks good as far as logic goes, and I think it's almost ready.

One last thing (sorry, should have noticed this earlier).

Should we be using drupal_strlen() etc. rather than strlen() here (i.e. unicode-safe functions)?
http://api.drupal.org/api/drupal/includes--common.inc/group/php_wrappers/7

Log in or register to post comments

Comment #16

26 November 2010 at 00:40

Status:

Needs review

» Needs work

The last submitted patch, 916086.patch, failed testing.

Log in or register to post comments

Comment #17

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 26 November 2010 at 14:32

Those exceptions in the test failure are in search.module, and look real... no idea what's going on...

They are saying:
preg_replace(): Unknown modifier '3' Warning search.module 1245 search_excerpt()
I'm going to hit retest just in case.

Log in or register to post comments

Comment #18

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 26 November 2010 at 14:33

Status:

Needs work

» Needs review

#14: 916086.patch queued for re-testing.

Log in or register to post comments

Comment #19

26 November 2010 at 15:10

Status:

Needs review

» Needs work

The last submitted patch, 916086.patch, failed testing.

Log in or register to post comments

Comment #20

mcarbone CreditAttribution: mcarbone commented 30 November 2010 at 21:52

Status:

Needs work

» Needs review

File	Size
916086-20.patch	9.22 KB

Found the problem causing the test failures - preg_quotes wasn't being called on the keywords found via search_simplify(). Also put in drupal_strlen's.

Log in or register to post comments

Comment #21

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 30 November 2010 at 23:13

You are using drupal_srlen() but not drupal_substr(), which is a problem. You might try using the Coder module and see what it says, as it would find that error.

Log in or register to post comments

Comment #22

mcarbone CreditAttribution: mcarbone commented 1 December 2010 at 13:32

You know, I thought of that, but then I 1) looked around the module and saw tons of substr usage, and 2) read this line in the API: "Note that for cutting off a string at a known character/substring location, the usage of PHP's normal strpos/substr is safe and much faster." Any idea why drupal_substr usage is so inconsistent in core?

Log in or register to post comments

Comment #23

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 1 December 2010 at 16:03

My understanding is that if you are going to use drupal_strlen to find the length of a string, then you need to use drupal_substr to substring it, for consistency.

Let me think about whether we should go back to just strlen/substr or not...

Log in or register to post comments

Comment #24

mcarbone CreditAttribution: mcarbone commented 1 December 2010 at 17:10

File	Size
916086-24.patch	9.23 KB

I think you're right about the inconsistency, and I think we have to use them if we expect this to work for multibyte characters. If there's a performance issue, that probably has to be address in the confines of drupal_substr, not here, and if there are incorrect uses of substr in the code, that should also probably be addressed elsewhere.

Log in or register to post comments

Comment #25

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 1 December 2010 at 19:12

Ugh. Right after the patched hunk where you're calling search_simplify_excerpt_match(), the return value is used in strpos (not drupal_strpos):

+      // Note that a space was added to the front and end of $text above.
+      if ($p) {
         if (($q = strpos(' ' . $text, ' ', max(0, $p - 61))) !== FALSE) {

So I guess we have to use substr/strpos in that function after all, not drupal_substr/drupal_strpos. So these need to be changed back... Large apology for the misdirection. I don't think we should try to fix the whole module's usage of strlen/strpos/etc.

Log in or register to post comments

Comment #26

mcarbone CreditAttribution: mcarbone commented 1 December 2010 at 22:33

drupal_strpos doesn't exist.

Log in or register to post comments

Comment #27

mcarbone CreditAttribution: mcarbone commented 1 December 2010 at 22:36

Which is to say, I'm inclined to keep this particular patch adhering to coding standards, and the inconsistency issue + the strpos issue can be discussed in a new thread. 'Cause if multibyte is broken, we're not going to fix it here, and the solution here will work fine otherwise.

Log in or register to post comments

Comment #28

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 1 December 2010 at 23:04

OK... right after the strpos(), in the next line, it does a substr()...

I'm going to have to think about this a bit.

Log in or register to post comments

Comment #29

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 2 December 2010 at 16:22

Status:

Needs review

» Needs work

I took a closer look at this patch, thinking about drupal_* vs. plain-old strlen and friends.

And by the way, the patch needs a reroll -- the search.test hunk doesn't apply any more (at least not for me at the moment).

So at line 1164 [line numbers are after patching the file] in search_excerpt(), we have:

      if (preg_match('/' . $boundary . $key . $boundary . '/iu', $text, $match, PREG_OFFSET_CAPTURE, $included[$key])) {
        $p = $match[0][1];

That is returning the offset into $text, presumably in bytes and not characters. A few lines down, if there was no match, we obtain $p from a call to search_simplify_excerpt_match(), and then farther down $p is used in the following:

        if (($q = strpos(' ' . $text, ' ', max(0, $p - 61))) !== FALSE) {
          $end = substr($text . ' ', $p, 80);
          if (($s = strrpos($end, ' ')) !== FALSE) {
            // Account for the added spaces.
            $q = max($q - 1, 0);
            $s = min($s, drupal_strlen($end) - 1);
            $ranges[$q] = $p + $s;
            $length += $p + $s - $q;
            $included[$key] = $p + 1;

So this has a mix of strpos, substr, and drupal_strlen. I don't think we should be mixing these -- if we use drupal_strlen, which counts the number of characters (not bytes), then we should use drupal_substr(), which picks off characters (not bytes) and PHP's mb_strpos() and friends. And if we use strpos and strrpos, then we should use the bare PHP strlen(). Right?

search_simplify_excerpt_match() has the same problem. We need to use the same set of functions in it that we are using in the parent search_excerpt() function, since we are going to use its return value in the parent function.

And as a note, I see that PHP has mb_ereg_search* functions -- maybe we should be using those?

Log in or register to post comments

Comment #30

mcarbone CreditAttribution: mcarbone commented 2 December 2010 at 16:41

Yeah, as I said above, I think we have a bit of a mess here. But as I see it, we have two problems:

1) the search_simplify excerpt bug that is the title of this issue
2) general string function inconsistency in search.module, which is likely making it buggy for multibyte languages -- but *not* for standard uft8 that clearly most people are using as we haven't seen any bug reports otherwise

I think we should fix #1 here by sticking to drupal coding standards internally in the patch -- it will have no affect either way on #2, which is already a problem. We should open a new ticket for #2, which can then explore the issues you raise above which I think are thorny and have very broad implications throughout the code (namely, can drupal_str functions be used in conjunction with PHP's strpos and other string functions? What about mb_ereg_search* functions?). Basically, it'd be useful I think to have those questions in a more general interest thread so we can attract those members in the community who are more familiar with them.

If you agree with this, I'll reroll the patch just to address the search.test changes. If not, we should retitle the thread to capture the broader issue.

Log in or register to post comments

Comment #31

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 2 December 2010 at 17:02

The problem is that if you just use the drupal_* functions in the new code for the patch, it becomes inconsistent. I think we either need to fix search_excerpt() completely to use multi-byte safe functions in this patch, or use the non-mulit-byte functions in tihs patch.

I'm not advocating fixing all of search.module. search_excerpt() is an independent function. But it can't mix the two types of functions.

Log in or register to post comments

Comment #32

mcarbone CreditAttribution: mcarbone commented 2 December 2010 at 20:00

Status:

Needs work

» Needs review

File	Size
916086-32.patch	9.2 KB

Discussed with jhodgdon on IRC, and opened up a new issue to discuss the multibyte issue. (#987472: search.module doesn't consistently support multibyte characters)

Re-rolled this patch to use non-mb-safe versions to be consistent with rest of search_excerpt code.

Log in or register to post comments

Comment #33

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 2 December 2010 at 20:08

This looks right, pending testing bot... I agree we should handle the multi-byte stuff on a separate issue, and this is consistent with what the search module is doing now.

Log in or register to post comments

Comment #34

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 2 December 2010 at 21:07

Status:

Needs review

» Reviewed & tested by the community

OK. Test bot agrees. I think we've hashed through this enough... THANK YOU mcarbone for all your work, and for taking an interest in the deep dark corners of the search module!!!!

Log in or register to post comments

Comment #35

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 1 January 2011 at 23:23

#32: 916086-32.patch queued for re-testing.

Log in or register to post comments

Comment #36

2 January 2011 at 19:41

Status:

Reviewed & tested by the community

» Needs work

The last submitted patch, 916086-32.patch, failed testing.

Log in or register to post comments

Comment #37

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 3 January 2011 at 16:50

The failed tests are in a User admin test, not related to this issue...

Log in or register to post comments

Comment #38

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 3 January 2011 at 16:50

Status:

Needs work

» Needs review

#32: 916086-32.patch queued for re-testing.

Log in or register to post comments

Comment #39

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 3 January 2011 at 16:51

Status:

Needs review

» Reviewed & tested by the community

Log in or register to post comments

Comment #40

pwolanin CreditAttribution: pwolanin commented 4 January 2011 at 04:00

the code comments could use a little copy editing, and the use of single-letter vars like $p isn't ideal, but seems like those already exist in the function and might as well be handled in a separate cleanup issue (if at all).

Log in or register to post comments

Comment #41

webchick

she/they

English

Vancouver 🇨🇦

CreditAttribution: webchick commented 4 January 2011 at 05:41

Status:

Reviewed & tested by the community

» Fixed

Ah, nuts. I meant to commit this weeks ago and it somehow got buried. :( Sorry about that. Since it's more involved than the ~2-10 line patches I've been allowing myself ~24 hours before 7.0, I asked Peter to give it a second look and he said he doesn't see that this would break stuff.

So here's hoping he's right. ;)

Committed to HEAD. Thanks!!

Log in or register to post comments

Comment #42

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 4 January 2011 at 14:57

Version:	7.x-dev	» 6.x-dev
Status:	Fixed	» Patch (to be ported)

We should perhaps consider porting this to D6. This issue has been around for a LONG time. Thanks for committing it!

Log in or register to post comments

Comment #43

Albert Volkman CreditAttribution: Albert Volkman commented 10 July 2012 at 21:07

Status:

Patch (to be ported)

» Needs review

File	Size
search_simplify_highlight-916086-43.patch	5.58 KB

D6 backport.

Log in or register to post comments

Comment #44

apaderno

he/him

Italian

Brescia, 🇮🇹 🇪🇺

CreditAttribution: apaderno commented 5 January 2013 at 15:56

#43: search_simplify_highlight-916086-43.patch queued for re-testing.

Log in or register to post comments

Comment #45

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 6 February 2013 at 22:31

Version:	6.x-dev	» 8.x-dev
Status:	Needs review	» Needs work

I think we need to revisit this patch. I just tested it in D7 with Porter Stemmer and it doesn't actually work.

The reason is that in search_simplify_excerpt_match() can return a string that is not a complete word, and then the search_excerpt function will not highlight it. Here's the code:

      // Iterate through the text in the window until we find the full original
      // matching text.
      $length = strlen($window);
      for ($i = 1; $i <= $length; $i++) {
        $keyfound = substr($text, $value[1], $i);
        if ($simplified_key == search_simplify($keyfound)) {
          break;
        }
      }

Instead of iterating through every possible length of a substring, it should only be looking at full words or multiple full words. For instance, if my actual node has the word "finding" in it, and I search for "find" using Porter Stemmer, it will match. Then in this code, it returns "find" as the matching keyword rather than realizing it needs to go to the entire word "finding" and returning that.

Hope that makes sense... Basically what needs to be done is that instead of doing a for loop with $i++, it should be skipping ahead to only substring on "boundary" characters.

Log in or register to post comments

Comment #46

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 6 February 2013 at 22:59

Issue tags:

+Needs tests, +Needs backport to D7

Here's a patch. I tested it in D7 with Porter Stemmer and it seems to work, and then made it into a d8 patch.

Hopefully it will pass the existing tests that were committed with the original patch. The tests didn't use a stemming-type test module, so they didn't catch the case that I just identified. So they need to be added to, but let's see if this passes the existing tests first. There were some tests and tests modules on
#493270: search_excerpt() doesn't work well with stemming
that should help -- that issue has been marked as a duplicate of this one.

Log in or register to post comments

Comment #47

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 6 February 2013 at 22:59

Status:

Needs work

» Needs review

File	Size
916086-fixlogic.patch	2.19 KB

forgot the patch and status change

Log in or register to post comments

Comment #48

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 6 February 2013 at 23:06

And I guess that other issues didn't have any tests that will help. What is needed is a test module that invokes hook_search_preprocess() to do some simplifications, such as replacing "finding" and "finds" with "find", and also replacing "people" with "person" just for fun. Then the excerpt test should have some added assertions that would verify that if your search keyword is "find" and the text is one of "finding", "finds", "find", that the word will be highlighted, and also that if your keyword is "person" or "people", "people" will be highlighted. It might also be useful to do some negative-match tests for excerpts, like making sure that if your text is find and your keyword is person, nothing is highlighted. And some tests for having the keyword inside a sentence, like if you search for "find" and your sentence is "I am finding this to be good", you get the original sentence with "finding" in bold.

Anyway let's see if the existing tests pass.

Log in or register to post comments

Comment #49

6 February 2013 at 23:55

Status:

Needs review

» Needs work

The last submitted patch, 916086-fixlogic.patch, failed testing.

Log in or register to post comments

Comment #50

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 7 February 2013 at 00:04

Status:

Needs work

» Needs review

Well... two test fails in the Excerpt text. I'm not going to have time to work on this for a few days, so if someone else wants to take it up, please do. Right now I'm not sure whether the test failures are due to logic in the tests or logic in the new patch.

Log in or register to post comments

Comment #51

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 7 February 2013 at 00:04

Status:

Needs review

» Needs work

inadvertent status change, sorry

Log in or register to post comments

Comment #52

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 7 February 2013 at 06:14

Status:

Needs work

» Needs review

File	Size
916086-morefails.patch	9.62 KB

Wow. This thing is quite a mess. Here's ... well it still doesn't quite work. I've added a few more tests and it's not working when the keywords are found right at the beginning or end of the string. I got it to almost work but ... I am out of time for the next few days... This whole approach is quite a huge messy thing and the code is really convoluted...

Log in or register to post comments

Comment #53

7 February 2013 at 07:11

Status:

Needs review

» Needs work

The last submitted patch, 916086-morefails.patch, failed testing.

Log in or register to post comments

Comment #54

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 7 February 2013 at 15:05

Status:

Needs work

» Needs review

File	Size
916086-newtests.patch	5.75 KB

So... I don't think my patch to the code in #52 is going to work in all cases. The code for search_excerpt and the helper function introduced in the originally-committed patch here probably needs to be totally reworked.

As a start to doing this, here's a test-only patch that I think should ideally pass if the excerpt functionality is working. I do not expect it to pass. The goal now should be to rewrite the excerpt functionality so it will pass.

I think that the whole system in the search_excerpt system is currently flawed. What I would suggest is rewriting it so that it would:

a) Figure out a list of keywords that are actually found in the text (both keywords from the directly-entered search string, and keywords that, when run through search_simplify(), are found in the text).

b) Extract an excerpt of the text of about 256 characters that contains these keywords, making sure that there is a buffer of about 60 characters before and after each keyword (breaking at word boundaries).

c) Return this excerpt, with "..." (translated) showing where pieces of the original text have been left out.

Right now (a) and (b) are mixed up together and the logic is very convoluted. And I think the whole function will need to be rewritten to make all these tests pass anyway.

Log in or register to post comments

Comment #55

7 February 2013 at 16:51

Status:

Needs review

» Needs work

The last submitted patch, 916086-newtests.patch, failed testing.

Log in or register to post comments

Comment #56

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 8 February 2013 at 20:31

Assigned:	mcarbone	» jhodgdon
Status:	Needs work	» Needs review

File	Size
916086-rework-excerpt-function.patch	19.39 KB

I decided I could spend some more time on this today, and I totally reworked the search_excerpt() function, made some new tests, and got them all to pass. Hopefully the new code in search_excerpt() actually makes sense (as opposed to the, at least I thought, very cryptic old code), and I think the test coverage is good enough to be fairly certain it is working well. And hopefully someone can review this? Please? :)

Log in or register to post comments

Comment #57

8 February 2013 at 21:35

Status:

Needs review

» Needs work

The last submitted patch, 916086-rework-excerpt-function.patch, failed testing.

Log in or register to post comments

Comment #58

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 8 February 2013 at 22:46

Status:

Needs work

» Needs review

File	Size
916086-58-fix-preg.patch	19.85 KB

Doh! Part of that test failure (the 50 exceptions) is real, having to do with needing to quote "/" in preg matching because "/" is the preg delimiter. Good thing someone wrote all these search module tests. :)

Here's a new patch that passes the tests that just failed (and the original excerpt test). Let's see how this works on the entire test suite.

Log in or register to post comments

Comment #59

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 8 February 2013 at 22:55

Hopefully the search.module code changes in this patch speak for themselves.

I have a couple of notes on the test code changes:

- Previously, search_excerpt() always put "..." at the end of each excerpt, even if the excerpt went to the end of the passed-in string, while it omitted the "..." if it started at the beginning of the passed-in string. Now the ending ... is omitted if the text ends at the end. So, one test's assertion changed due to that change in logic/behavior.

- Highlighting in excerpts is now always complete words, not including the trailing space. So a couple of assertions that codified some kind-of bad previous behavior that had spaces ending the STRONG tags had to be changed for this I think better behavior.

- I hijacked an existing test module that already had a hook_search_preprocess() in it, and told it that if the passed-in language code was 'ex' (for "excerpt"), some special preprocessing should be done so that I could test search excerpts working with preprocessing.

The rest of the test changes are additional tests that I added to make sure there was better test coverage of a few edge cases, and of preprocessing.

Log in or register to post comments

Comment #60

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 10 February 2013 at 14:38

I had a thought this morning...

Before we commit this patch, I think we should add one more test:
- In the test class, have it preprocess "hypertext markup language" to "html" (preprocessing is always done on lower-case text).
- In the excerpt test, verify that if you have a string with "hypertext markup language" at the start, end, and in the middle, and you look for an excerpt with "html", that "hypertext markup language" is highlighted.

This should work, but I think we should test it to make sure and to make sure it stays working.

I'm also unclear as to whether we have tests to verify that these types of preprocessing steps work with search itself (do matches come up in search results). I know that they *do* work in D7, since I maintain the Porter Stemmer module and it works fine in D7 (aside from this excerpt problem), but I don't know if we have core tests that verify that it works for search matching (e.g., with the tests we have now and the proposed new test from this comment, would you be able to search for "dependency", "finding", and "html" and match a node whose body is "The person finds the DIC to be useful for hypertext markup language construction". That would also be interesting to add as a test. Note that the current tests' preprocessing is only done in a fictitious language called "ex", so we might need to change that to a real language like "de" and set that language in the search test.

Log in or register to post comments

Comment #61

pwolanin CreditAttribution: pwolanin commented 11 February 2013 at 14:56

quick comments:

$keys = array_values(array_unique($newkeys));

Why do you need the array_values() call?

For function _search_find_match_with_simplify(), can it return NULL or FALSE instead of an emty string on failure, or do some matches return and empty string and still need to be skipped?

It seems you are removing a function that's nominally part of the API?

Log in or register to post comments

Comment #62

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 11 February 2013 at 15:49

RE #61 - right on all counts:
- We probably don't need the array_values call now. It was needed in a previous iteration while I was trying to fix this.
- Yes, we could return NULL or FALSE instead of an empty string from _search_find_match_with_simplify()
- Yes, I am removing a flawed function that was nominally part of the API. The D7 port will probably not be able to remove that function, but as it's not needed in d8 and it doesn't work anyway, I don't see a reason to keep it around. We can file a change notice just in case.

Anyway, I can try a patch with those two items changed (no array_values and change the function to return NULL if there is no match). Maybe later today...

Log in or register to post comments

Comment #63

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 11 February 2013 at 17:53

File	Size
916086-63.patch	20.35 KB

Here's a patch that removes array_values() and changes the return value of _search_find_match_with_simplify() to NULL if nothing is found, as suggested in #61.

I also added the additional excerpt tests I suggested in #60.

On my test machine, the search numbers and excerpt tests pass... so let's make sure with the testbot.

Log in or register to post comments

Comment #64

pwolanin CreditAttribution: pwolanin commented 13 February 2013 at 19:08

Some minor comments:

+  if (count($workkeys) > 0) {
+    while ($length < 256 && count($workkeys)) {

I'm not sure you need the if{} at all( not seeing an else) and why not just use a cast to boolean?

Also, I'm not sure why you need strlen here:

+  $temp = trim($key);
+  if (strlen($temp) < 1) {
+    return NULL;
   }

and in the preg_match below you could just put preg_quote($text, '/')

Maybe I'll re-roll with some minor changes like that, but overall looks good.

Log in or register to post comments

Comment #65

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 13 February 2013 at 20:32

If you'd like to reroll with some code cleanup, that would be great! As long as the new patch passes all the new (and old) tests, I'm all for it. :)

Log in or register to post comments

Comment #66

pwolanin CreditAttribution: pwolanin commented 13 February 2013 at 22:11

File	Size
916086-66.patch	20.23 KB

ok, here's a re-roll with mostly cosmetic changes. It passes tests locally.

note that this removes the _search_quote_slashes_for_preg() function in favor of just quoting each key.

Log in or register to post comments

Comment #67

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 13 February 2013 at 22:26

I will be surprised if this works, because I don't think the $keys array is having the slashes escaped (only workkeys). There's a preg_replace at the bottom of the function that uses $keys and will need to have slashes escaped.

Log in or register to post comments

Comment #68

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 13 February 2013 at 22:27

Oh wait. Your gymnastics with $keys and $workkeys ... I guess it will work.

Log in or register to post comments

Comment #69

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 13 February 2013 at 22:34

So, I'm OK with this patch, assuming the tests pass. I personally would never use an array variable as a boolean in a condition... I always use count() or empty() I guess, but if it works, OK. These two statements are inconsistent in how you've coded them though:

+  while ($length < 256 && $workkeys) {
...
+  if (empty($ranges)) {

I'd go ahead and use empty() in the first one or preferably (for me) count() for both.

But this is minor. If the test bot comes back green, I am in favor of this being RTBC. I wrote most of the patch so I won't mark it that myself, but I'm in favor of pwolanin's version in #66, so if he is in favor of it also, let's ship it. :)

Log in or register to post comments

Comment #70

pwolanin CreditAttribution: pwolanin commented 13 February 2013 at 22:52

File	Size
916086-70.patch	20.24 KB

Ok, changed the loop to:

  while ($length < 256 && !empty($workkeys)) {

Log in or register to post comments

Comment #71

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 13 February 2013 at 23:02

Still in favor of RTBC. :)

Log in or register to post comments

Comment #72

pwolanin CreditAttribution: pwolanin commented 14 February 2013 at 04:25

Status:

Needs review

» Reviewed & tested by the community

Log in or register to post comments

Comment #73

catch

he/him

English

CreditAttribution: catch commented 18 February 2013 at 19:01

Status:

Reviewed & tested by the community

» Needs review

Do we need to use preg_quote() here, or could we just use something other than '/' for the delimiters?

Also $workkeys and $newkeys could either use underscores and/or more descriptive variable names I think.

Log in or register to post comments

Comment #74

pwolanin CreditAttribution: pwolanin commented 19 February 2013 at 14:49

@catch - since we are looking at user input, any delimiter we choose would need to be quoted.

Log in or register to post comments

Comment #75

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 19 February 2013 at 18:46

Yes, whatever character we choose as a delimiter would need to be quoted, so I don't see a reason to choose a different one either. It seems like / is fairly common for regexp delimiters, so let's stick with that.

Regarding variable names, OK. $workkeys came from the old function... I will not have time to rework the patch today for better variable names, but maybe pwolanin will or I can do it in a few days.

Log in or register to post comments

Comment #76

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 4 March 2013 at 21:32

File	Size
916086-76-search_excerpt_fixup.patch	22.22 KB

916086-76-search_excerpt_fixup.interdiff.txt	8.47 KB

Here is a new patch. I went through search_excerpt() and _search_find_match_with_simplify(). I fixed up variable names and comments. Hopefully both functions are a little bit more comprehensible than they were the last iteration.

I'm attaching an interdiff from the patch in #70 along with this new patch.

Log in or register to post comments

Comment #77

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 22 April 2013 at 15:17

Assigned:

jhodgdon

» Unassigned

This still needs a review. @pwolanin??

Log in or register to post comments

Comment #78

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 22 April 2013 at 15:18

Issue tags:

-Needs tests, -Needs backport to D7

#76: 916086-76-search_excerpt_fixup.patch queued for re-testing.

Log in or register to post comments

Comment #79

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 6 September 2013 at 16:07

#76: 916086-76-search_excerpt_fixup.patch queued for re-testing.

Log in or register to post comments

Comment #80

6 September 2013 at 16:33

Status:	Needs review	» Needs work
Issue tags:		+Needs tests, +Needs backport to D7

The last submitted patch, 916086-76-search_excerpt_fixup.patch, failed testing.

Log in or register to post comments

Comment #81

pwolanin CreditAttribution: pwolanin commented 6 September 2013 at 19:27

Issue tags:

+Needs reroll

needs reroll for D8 search plugins

Log in or register to post comments

Comment #82

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 6 September 2013 at 19:49

Status:	Needs work	» Needs review
Issue tags:	-Needs reroll

File	Size
916086-82-search_excerpt_fixup.patch	22.2 KB

Here's a straight reroll of the patch above.

I haven't looked at this code or docs in ages and maybe it can be improved... still needs a review though.

Log in or register to post comments

Comment #83

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 6 September 2013 at 19:54

Assigned:	Unassigned	» jhodgdon
Issue tags:	-Needs tests

I guess this wasn't assigned to me. Anyway, I just glanced through again and I think this is still ready to go.

I'll update the issue summary too.

Log in or register to post comments

Comment #84

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 6 September 2013 at 20:02

Summary updated.

Log in or register to post comments

Comment #85

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 18 October 2013 at 20:04

Issue tags:

-Needs backport to D7

#82: 916086-82-search_excerpt_fixup.patch queued for re-testing.

Log in or register to post comments

Comment #86

18 October 2013 at 20:06

Status:	Needs review	» Needs work
Issue tags:		+Needs backport to D7

The last submitted patch, 916086-82-search_excerpt_fixup.patch, failed testing.

Log in or register to post comments

Comment #87

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 18 October 2013 at 20:09

Status:

Needs work

» Needs review

File	Size
916086-86-search_excerpt_fixup.patch	22.22 KB

Here's a reroll. This thing still needs a review...

Log in or register to post comments

Comment #87.0

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 18 October 2013 at 20:09

Issue summary:

View changes

create an issue summary

Log in or register to post comments

Comment #88

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 10 December 2013 at 17:48

Priority:	Normal	» Major
Issue summary:	View changes

We have at least three other issues that are postponed pending this issue being reviewed and committed, so I am going to up the priority on this one. We really need to get this in. I'll click retest and then can someone please review it?

Log in or register to post comments

Comment #89

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 10 December 2013 at 17:49

87: 916086-86-search_excerpt_fixup.patch queued for re-testing.

Log in or register to post comments

Comment #90

10 December 2013 at 17:50

Status:

Needs review

» Needs work

The last submitted patch, 87: 916086-86-search_excerpt_fixup.patch, failed testing.

Log in or register to post comments

Comment #91

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 10 December 2013 at 19:46

Issue summary:	View changes
Status:	Needs work	» Needs review

File	Size
916086-91-search_excerpt_fixup.patch	22.2 KB

4 files were hidden/shown/deleted

File	Size
916086-70.patch	20.24 KB

916086-76-search_excerpt_fixup.patch	22.22 KB

916086-76-search_excerpt_fixup.interdiff.txt	8.47 KB
916086-82-search_excerpt_fixup.patch	22.2 KB

Straight reroll of previous patch.

Log in or register to post comments

Comment #92

10 December 2013 at 21:16

Status:

Needs review

» Needs work

The last submitted patch, 91: 916086-91-search_excerpt_fixup.patch, failed testing.

Log in or register to post comments

Comment #93

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 10 December 2013 at 22:54

It looks like a recently-added test was relying on the old behavior, and will need to be fixed up.

Log in or register to post comments

Comment #94

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 10 December 2013 at 23:51

Status:

Needs work

» Needs review

File	Size
916086-94-search_excerpt_fixup.patch	23.7 KB

interdiff-91-94.txt	2.27 KB

1 file was hidden/shown/deleted

File	Size
916086-86-search_excerpt_fixup.patch	22.22 KB

I figured out what the problem was. The test SearchPreprocessedLangcodeText::testPreprocessLangcode() was relying on some old behavior: when an excerpt was being generated, it was always running something through hook_search_preprocess(). The current behavior is that if you search for something that is an exact match of the keyword in the unprocessed node, you don't run the preprocess at all. So it was looking for a message that was not there.

I changed the test so that the test module adds some additional text to the node during search indexing, and we search for that text. This causes the excerpt generation to invoke the preprocessing, generating the expected message.

Hope that makes sense... perhaps the interdiff will illuminate things.

The failing test now passes on my test site. Let's see if this addition to the test module breaks any other tests. Hopefully not.

Log in or register to post comments

Comment #95

andypost

he/him

Russian

CreditAttribution: andypost commented 11 December 2013 at 18:12

Status:

Needs review

» Reviewed & tested by the community

It works and have test coverage, let's unblock other issues.
Tested this manually too!!!

Log in or register to post comments

Comment #96

catch

he/him

English

CreditAttribution: catch commented 20 December 2013 at 12:14

Version:	8.x-dev	» 7.x-dev
Status:	Reviewed & tested by the community	» Patch (to be ported)

We need a way to differentiate between issues assigned to a committer because they were working on them, vs. assigned because another committer handed the issue over...

Committed/pushed to 8.x, thanks!

Log in or register to post comments

Comment #97

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 20 December 2013 at 15:26

Ah, I was wondering why this issue was in the queue for so long waiting for a commit. :) I'll just try to remember to un-assign issues when they're ready for commit. Thanks catch!

Log in or register to post comments

Comment #98

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 4 February 2014 at 20:36

Assigned:

jhodgdon

» Unassigned

If someone else wants to port this to D7, please do. I probably will not make time to do it any time soon, sorry.

Log in or register to post comments

Comment #99

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 17 July 2014 at 15:18

If someone decides to port this to Drupal 7, we also need to make sure to get the fix for #2301715: search_excerpt() uses strrpos() incorrectly when we port this patch.

Log in or register to post comments

Comment #100

RaulMuroc CreditAttribution: RaulMuroc commented 20 November 2014 at 12:17

Status:

Patch (to be ported)

» Needs review

File	Size
highlight_words_that_are_matched_via_search_simplify-916986-100.patch	14.34 KB

First patch (or try) for D7.

Log in or register to post comments

Comment #101

20 November 2014 at 12:37

Status:

Needs review

» Needs work

The last submitted patch, 100: highlight_words_that_are_matched_via_search_simplify-916986-100.patch, failed testing.

Log in or register to post comments

Comment #102

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 22 November 2014 at 00:17

The Drupal 7 port needs to be somewhat different from the Drupal 8 code, because in Drupal 7, there is no $lancode for search_excerpt. Hence, most of the test failures.

Also, we need to make sure to use as a starting point the current Drupal 8 code, not whatever was in the patch on this issue. The reason is that after this was done, #2301715: search_excerpt() uses strrpos() incorrectly was also fixed, which changed the code somewhat. We don't want to reopen that other issue for Drupal 7; we just want to make sure that the code for backport to Drupal 7 on *this* issue is the current, fixed Drupal 8 code.

Log in or register to post comments

Comment #103

RaulMuroc CreditAttribution: RaulMuroc commented 23 November 2014 at 08:45

Ok will try to work sometime on that but I'm currently out of time to spend on it, sorry about that.

thank you for your effort.

Log in or register to post comments

Comment #104

3 August 2016 at 00:04

webchick committed 801a098 on 8.3.x

#916086 by mcarbone, jhodgdon: Fixed search_excerpt() doesn't highlight...

catch committed f82a068 on 8.3.x

Issue #916086 by jhodgdon, mcarbone, pwolanin, Albert Volkman:...

Log in or register to post comments

Comment #105

3 August 2016 at 00:05

webchick committed 801a098 on 8.3.x

#916086 by mcarbone, jhodgdon: Fixed search_excerpt() doesn't highlight...

catch committed f82a068 on 8.3.x

Issue #916086 by jhodgdon, mcarbone, pwolanin, Albert Volkman:...

Log in or register to post comments

Comment #106

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 15 November 2016 at 20:06

Version:	7.x-dev	» 8.1.x-dev
Status:	Needs work	» Fixed

I guess we should mark this issue 8.x and Fixed, instead of 7.x and Needs backport. If someone wants to pursue the 7.x version, it should be another issue as per current policies.