Updated: Comment #81

Problem/Motivation

The Search module uses a "simplification" process to alter words at index time (the words in the text that is indexed) and at search time (the keywords). However, when producing an excerpt for display on the search results page, it doesn't consider simplification, so even if there is a match between the simplified keyword and the simplified indexed text, it may not be highlighted in the search results.

Proposed resolution

Fix the search_excerpt() function so that instead of just highlighting exact matches between the entered keywords and the text, it also highlights matches between the simplified keywords and the simplified text.

Remaining tasks

Patch is ready.

User interface changes

Search results will show better excerpts with more things highlighted.

API changes

None... Well, at least the function signatures are the same. The return value of search_excerpt will be improved.

#731298: Searches for words with diacritics/accents: word not highlighted in results
#493270: search_excerpt() doesn't work well with stemming

Original report by mcarbone

To recreate:

1) Create a node with the text "123456.7890" and index it.

2) Search for "123456.7890" and the node will be found with the number highlighted.

3) Search for "1234567890" and the node will be found, but the matching number won't be highlighted.

This is because search_excerpt() looks for exact keyword matches only, and doesn't take into account the fact that both keywords and indexed words are run through search_simplify() first.

Attached patch checks if a keyword has an exact match in the node -- if not, it checks to see if there's a simplified match in the node and if so, adds the original word as a potential keyword to be highlighted. Tests included.

This differs from #731298: Searches for words with diacritics/accents: word not highlighted in results because those words with diacritics aren't matching because of Drupal preprocessing, but because of the collation in the database (and hence is a thornier issue). This also differs from #493270: search_excerpt() doesn't work well with stemming because it involves core preprocessing, and not stemming from a contributed module (and hence is delayed to D8 because it requires an API change).

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

eustace’s picture

@mcarbone, I'll like to verify the bad behavior as reported before applying the patch, however, I'm seeing some anomalies.

  • Search only gets a hit when "123456.7890" is part of the node title. Not true when it is only in the node body.
  • Search does not get a hit with "1234567890" even though it's reported it does but without a link.
  • Probably not related but previous search results remain persistent after current searches. Clearing caches and re-initializing page don't seem to help.

Is there something I'm not doing? Thx

mcarbone’s picture

Silly me, you would have to apply the latest patch at #74673: Search does not correctly handle large numbers (> maxint) so you cannot search by UPC code for example in order to reproduce with my steps above. Without it, you could do it using hyphens instead of periods, I'm pretty sure.

I don't see how your caching issue would be related to this thread.

jhodgdon’s picture

Status: Needs review » Closed (duplicate)

This issue is a duplicate of
#493270: search_excerpt() doesn't work well with stemming
actually

jhodgdon’s picture

Status: Closed (duplicate) » Active

Reopening, because we're not allowed to make the big API change that would actually resolve the whole issue, so this is no longer a duplicate.

mcarbone’s picture

Status: Active » Needs review
FileSize
9.01 KB

Took the patch from #493270: search_excerpt() doesn't work well with stemming, stripped out the new hook, and tweaked and moved around some comments.

jhodgdon’s picture

Status: Needs review » Needs work

There are several whitespace issues in your patch, such as here (should be 2 space indentation, not 3):

       if (preg_match('/' . $boundary . $key . $boundary . '/iu', $text, $match, PREG_OFFSET_CAPTURE, $included[$key])) {
-        $p = $match[0][1];
+         $p = $match[0][1];

Also in the new function search_simplify_excerpt_match(), a few of the doc header lines end in spaces, the indentation in @param and @return section is wrong (should be 2 spaces), and there should be a blank line between the @params and the @return.

Regarding the actual code, I'm not sure I fully understand this bit:

+      // Iterate through the text in the window until we find the full original
+      // matching text.
+      for ($i = strlen($simplified_key); $i <= strlen($window); $i++) {
+        $keyfound = substr($text, $value[1], $i);
+        if ($simplified_key == search_simplify($keyfound)) {
+          break;
+        }
+      }

It looks like it's making an assumption that the original text is at least as long as the simplified text? I guess that is currently the case with the current search_simplify() function, but I'm concerned that if this assumption is silently hard-wired into this function, it might not always be true. Or am I misreading this code? Also, the second arg to substr is the requested length of the string, so the end of the loop should not be strlen($window), but strlen($window) - $value[1], plus or minus 1.

Other than that, the proposed fix looks good to me.

mcarbone’s picture

Status: Needs work » Needs review
FileSize
8.95 KB

"It looks like it's making an assumption that the original text is at least as long as the simplified text? I guess that is currently the case with the current search_simplify() function, but I'm concerned that if this assumption is silently hard-wired into this function, it might not always be true."

You are right that that assumption was being made, as I was trying to minimize the number of iterations. But there's little increase in searching the whole window, so I've fixed this.

"Also, the second arg to substr is the requested length of the string, so the end of the loop should not be strlen($window), but strlen($window) - $value[1], plus or minus 1."

No, the second arg to substr is the position of the beginning of the window in the original text where the match begins, which we got from using the PREG_SPLIT_OFFSET_CAPTURE flag. So what this little bit of code is doing is taking that window, which it knows begins with the match from the strpos above it, and then checking a larger and larger portion of that window until it finds some original text that when simplified matches the simplified key. It ends at strlen($window) because on the very last step it's simplifying the entire 80 character window. It should pretty much never get that far except when the match is derived from a very large piece of original text.

Let me do some justifying: the problem here is that an infinite number of strings can simplify to one single key, just by adding a bunch of characters that get stripped. So we could have a key that's 5 characters long that's matched by something in the original text that's 10000 characters long. To go through the entire text to account for every possible length of original text would not be very performant, so I figured since the excerpt window only looks 80 characters ahead from the matching position anyway, that's the window I'd use to look for a simplified match. I believe this will catch nearly all real-world examples, but it's true that it may fail on really long words that simplify down. The alternative would be to keep expanding the window until the end of the full text, rather than 80 characters. That would catch all possible matches, but we'd essentially be simplifying the entire text for every boundary, which for long pieces of text could get ugly.

jhodgdon’s picture

Sorry, I meant the *third* argument, not the second arg to substr, but I see now that strlen($window) is the correct max length.

So this looks good to me, and the tests demonstrate several cases... Can we add tests for accented characters though -- those are another important case? There are some specific examples at #731298: Searches for words with diacritics/accents: word not highlighted in results and #157684: Search module does not highlight non-ASCII search terms that could be used.

And while you're at it

+      for ($i = 1; $i <= strlen($window); $i++) {

This is somewhat inefficient, since strlen($window) gets evaluated each time through the loop, but is unchanging. Might be better to evaluate once outside the loop?

Status: Needs review » Needs work

The last submitted patch, 916086.patch, failed testing.

mcarbone’s picture

Status: Needs work » Needs review
FileSize
8.97 KB

As I argued in #731298: Searches for words with diacritics/accents: word not highlighted in results, diacritic highlighting in excerpts have nothing to do with search_simplify, so I don't think it's appropriate to add a test here for them. That thread suggests using search_simplify to remove diacritics entirely, so if that was done then tests could be added there. But right now, the reason diacritic matching happens is because of the particular mysql collation being used in the index tables.

Added strlen fix.

jhodgdon’s picture

This is still not quite right:

+      for ($i = 1; $i <= $length; $i++) {

$length is hard-wired to 80, but there's no guarantee that $window is that long.

Point taken on the diacriticals.

mcarbone’s picture

Well, if I set $length after $window is set then there's no point creating a variable at all, as strlen($window) would only get called once in the for loop anyway. So either I can switch it back to the way I had it, or I can leave it as it is now. I think it's OK as is, because even though the window length might not be 80 at the end, we know in that case it will necessarily find a match before it gets to anything beyond the actual window size.

jhodgdon’s picture

That is not true. The ending condition is re-evaluated each loop iteration.

mcarbone’s picture

FileSize
8.98 KB

Oops, you're right. Attached.

jhodgdon’s picture

OK, that looks good as far as logic goes, and I think it's almost ready.

One last thing (sorry, should have noticed this earlier).

Should we be using drupal_strlen() etc. rather than strlen() here (i.e. unicode-safe functions)?
http://api.drupal.org/api/drupal/includes--common.inc/group/php_wrappers/7

Status: Needs review » Needs work

The last submitted patch, 916086.patch, failed testing.

jhodgdon’s picture

Those exceptions in the test failure are in search.module, and look real... no idea what's going on...

They are saying:
preg_replace(): Unknown modifier '3' Warning search.module 1245 search_excerpt()
I'm going to hit retest just in case.

jhodgdon’s picture

Status: Needs work » Needs review

#14: 916086.patch queued for re-testing.

Status: Needs review » Needs work

The last submitted patch, 916086.patch, failed testing.

mcarbone’s picture

Status: Needs work » Needs review
FileSize
9.22 KB

Found the problem causing the test failures - preg_quotes wasn't being called on the keywords found via search_simplify(). Also put in drupal_strlen's.

jhodgdon’s picture

You are using drupal_srlen() but not drupal_substr(), which is a problem. You might try using the Coder module and see what it says, as it would find that error.

mcarbone’s picture

You know, I thought of that, but then I 1) looked around the module and saw tons of substr usage, and 2) read this line in the API: "Note that for cutting off a string at a known character/substring location, the usage of PHP's normal strpos/substr is safe and much faster." Any idea why drupal_substr usage is so inconsistent in core?

jhodgdon’s picture

My understanding is that if you are going to use drupal_strlen to find the length of a string, then you need to use drupal_substr to substring it, for consistency.

Let me think about whether we should go back to just strlen/substr or not...

mcarbone’s picture

FileSize
9.23 KB

I think you're right about the inconsistency, and I think we have to use them if we expect this to work for multibyte characters. If there's a performance issue, that probably has to be address in the confines of drupal_substr, not here, and if there are incorrect uses of substr in the code, that should also probably be addressed elsewhere.

jhodgdon’s picture

Ugh. Right after the patched hunk where you're calling search_simplify_excerpt_match(), the return value is used in strpos (not drupal_strpos):

+      // Note that a space was added to the front and end of $text above.
+      if ($p) {
         if (($q = strpos(' ' . $text, ' ', max(0, $p - 61))) !== FALSE) {

So I guess we have to use substr/strpos in that function after all, not drupal_substr/drupal_strpos. So these need to be changed back... Large apology for the misdirection. I don't think we should try to fix the whole module's usage of strlen/strpos/etc.

mcarbone’s picture

drupal_strpos doesn't exist.

mcarbone’s picture

Which is to say, I'm inclined to keep this particular patch adhering to coding standards, and the inconsistency issue + the strpos issue can be discussed in a new thread. 'Cause if multibyte is broken, we're not going to fix it here, and the solution here will work fine otherwise.

jhodgdon’s picture

OK... right after the strpos(), in the next line, it does a substr()...

I'm going to have to think about this a bit.

jhodgdon’s picture

Status: Needs review » Needs work

I took a closer look at this patch, thinking about drupal_* vs. plain-old strlen and friends.

And by the way, the patch needs a reroll -- the search.test hunk doesn't apply any more (at least not for me at the moment).

So at line 1164 [line numbers are after patching the file] in search_excerpt(), we have:

      if (preg_match('/' . $boundary . $key . $boundary . '/iu', $text, $match, PREG_OFFSET_CAPTURE, $included[$key])) {
        $p = $match[0][1];

That is returning the offset into $text, presumably in bytes and not characters. A few lines down, if there was no match, we obtain $p from a call to search_simplify_excerpt_match(), and then farther down $p is used in the following:

        if (($q = strpos(' ' . $text, ' ', max(0, $p - 61))) !== FALSE) {
          $end = substr($text . ' ', $p, 80);
          if (($s = strrpos($end, ' ')) !== FALSE) {
            // Account for the added spaces.
            $q = max($q - 1, 0);
            $s = min($s, drupal_strlen($end) - 1);
            $ranges[$q] = $p + $s;
            $length += $p + $s - $q;
            $included[$key] = $p + 1;

So this has a mix of strpos, substr, and drupal_strlen. I don't think we should be mixing these -- if we use drupal_strlen, which counts the number of characters (not bytes), then we should use drupal_substr(), which picks off characters (not bytes) and PHP's mb_strpos() and friends. And if we use strpos and strrpos, then we should use the bare PHP strlen(). Right?

search_simplify_excerpt_match() has the same problem. We need to use the same set of functions in it that we are using in the parent search_excerpt() function, since we are going to use its return value in the parent function.

And as a note, I see that PHP has mb_ereg_search* functions -- maybe we should be using those?

mcarbone’s picture

Yeah, as I said above, I think we have a bit of a mess here. But as I see it, we have two problems:

1) the search_simplify excerpt bug that is the title of this issue
2) general string function inconsistency in search.module, which is likely making it buggy for multibyte languages -- but *not* for standard uft8 that clearly most people are using as we haven't seen any bug reports otherwise

I think we should fix #1 here by sticking to drupal coding standards internally in the patch -- it will have no affect either way on #2, which is already a problem. We should open a new ticket for #2, which can then explore the issues you raise above which I think are thorny and have very broad implications throughout the code (namely, can drupal_str functions be used in conjunction with PHP's strpos and other string functions? What about mb_ereg_search* functions?). Basically, it'd be useful I think to have those questions in a more general interest thread so we can attract those members in the community who are more familiar with them.

If you agree with this, I'll reroll the patch just to address the search.test changes. If not, we should retitle the thread to capture the broader issue.

jhodgdon’s picture

The problem is that if you just use the drupal_* functions in the new code for the patch, it becomes inconsistent. I think we either need to fix search_excerpt() completely to use multi-byte safe functions in this patch, or use the non-mulit-byte functions in tihs patch.

I'm not advocating fixing all of search.module. search_excerpt() is an independent function. But it can't mix the two types of functions.

mcarbone’s picture

Status: Needs work » Needs review
FileSize
9.2 KB

Discussed with jhodgdon on IRC, and opened up a new issue to discuss the multibyte issue. (#987472: search.module doesn't consistently support multibyte characters)

Re-rolled this patch to use non-mb-safe versions to be consistent with rest of search_excerpt code.

jhodgdon’s picture

This looks right, pending testing bot... I agree we should handle the multi-byte stuff on a separate issue, and this is consistent with what the search module is doing now.

jhodgdon’s picture

Status: Needs review » Reviewed & tested by the community

OK. Test bot agrees. I think we've hashed through this enough... THANK YOU mcarbone for all your work, and for taking an interest in the deep dark corners of the search module!!!!

jhodgdon’s picture

#32: 916086-32.patch queued for re-testing.

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 916086-32.patch, failed testing.

jhodgdon’s picture

The failed tests are in a User admin test, not related to this issue...

jhodgdon’s picture

Status: Needs work » Needs review

#32: 916086-32.patch queued for re-testing.

jhodgdon’s picture

Status: Needs review » Reviewed & tested by the community
pwolanin’s picture

the code comments could use a little copy editing, and the use of single-letter vars like $p isn't ideal, but seems like those already exist in the function and might as well be handled in a separate cleanup issue (if at all).

webchick’s picture

Status: Reviewed & tested by the community » Fixed

Ah, nuts. I meant to commit this weeks ago and it somehow got buried. :( Sorry about that. Since it's more involved than the ~2-10 line patches I've been allowing myself ~24 hours before 7.0, I asked Peter to give it a second look and he said he doesn't see that this would break stuff.

So here's hoping he's right. ;)

Committed to HEAD. Thanks!!

jhodgdon’s picture

Version: 7.x-dev » 6.x-dev
Status: Fixed » Patch (to be ported)

We should perhaps consider porting this to D6. This issue has been around for a LONG time. Thanks for committing it!

Albert Volkman’s picture

Status: Patch (to be ported) » Needs review
FileSize
5.58 KB

D6 backport.

apaderno’s picture

jhodgdon’s picture

Version: 6.x-dev » 8.x-dev
Status: Needs review » Needs work

I think we need to revisit this patch. I just tested it in D7 with Porter Stemmer and it doesn't actually work.

The reason is that in search_simplify_excerpt_match() can return a string that is not a complete word, and then the search_excerpt function will not highlight it. Here's the code:

      // Iterate through the text in the window until we find the full original
      // matching text.
      $length = strlen($window);
      for ($i = 1; $i <= $length; $i++) {
        $keyfound = substr($text, $value[1], $i);
        if ($simplified_key == search_simplify($keyfound)) {
          break;
        }
      }

Instead of iterating through every possible length of a substring, it should only be looking at full words or multiple full words. For instance, if my actual node has the word "finding" in it, and I search for "find" using Porter Stemmer, it will match. Then in this code, it returns "find" as the matching keyword rather than realizing it needs to go to the entire word "finding" and returning that.

Hope that makes sense... Basically what needs to be done is that instead of doing a for loop with $i++, it should be skipping ahead to only substring on "boundary" characters.

jhodgdon’s picture

Here's a patch. I tested it in D7 with Porter Stemmer and it seems to work, and then made it into a d8 patch.

Hopefully it will pass the existing tests that were committed with the original patch. The tests didn't use a stemming-type test module, so they didn't catch the case that I just identified. So they need to be added to, but let's see if this passes the existing tests first. There were some tests and tests modules on
#493270: search_excerpt() doesn't work well with stemming
that should help -- that issue has been marked as a duplicate of this one.

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
2.19 KB

forgot the patch and status change

jhodgdon’s picture

And I guess that other issues didn't have any tests that will help. What is needed is a test module that invokes hook_search_preprocess() to do some simplifications, such as replacing "finding" and "finds" with "find", and also replacing "people" with "person" just for fun. Then the excerpt test should have some added assertions that would verify that if your search keyword is "find" and the text is one of "finding", "finds", "find", that the word will be highlighted, and also that if your keyword is "person" or "people", "people" will be highlighted. It might also be useful to do some negative-match tests for excerpts, like making sure that if your text is find and your keyword is person, nothing is highlighted. And some tests for having the keyword inside a sentence, like if you search for "find" and your sentence is "I am finding this to be good", you get the original sentence with "finding" in bold.

Anyway let's see if the existing tests pass.

Status: Needs review » Needs work

The last submitted patch, 916086-fixlogic.patch, failed testing.

jhodgdon’s picture

Status: Needs work » Needs review

Well... two test fails in the Excerpt text. I'm not going to have time to work on this for a few days, so if someone else wants to take it up, please do. Right now I'm not sure whether the test failures are due to logic in the tests or logic in the new patch.

jhodgdon’s picture

Status: Needs review » Needs work

inadvertent status change, sorry

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
9.62 KB

Wow. This thing is quite a mess. Here's ... well it still doesn't quite work. I've added a few more tests and it's not working when the keywords are found right at the beginning or end of the string. I got it to almost work but ... I am out of time for the next few days... This whole approach is quite a huge messy thing and the code is really convoluted...

Status: Needs review » Needs work

The last submitted patch, 916086-morefails.patch, failed testing.

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
5.75 KB

So... I don't think my patch to the code in #52 is going to work in all cases. The code for search_excerpt and the helper function introduced in the originally-committed patch here probably needs to be totally reworked.

As a start to doing this, here's a test-only patch that I think should ideally pass if the excerpt functionality is working. I do not expect it to pass. The goal now should be to rewrite the excerpt functionality so it will pass.

I think that the whole system in the search_excerpt system is currently flawed. What I would suggest is rewriting it so that it would:

a) Figure out a list of keywords that are actually found in the text (both keywords from the directly-entered search string, and keywords that, when run through search_simplify(), are found in the text).

b) Extract an excerpt of the text of about 256 characters that contains these keywords, making sure that there is a buffer of about 60 characters before and after each keyword (breaking at word boundaries).

c) Return this excerpt, with "..." (translated) showing where pieces of the original text have been left out.

Right now (a) and (b) are mixed up together and the logic is very convoluted. And I think the whole function will need to be rewritten to make all these tests pass anyway.

Status: Needs review » Needs work

The last submitted patch, 916086-newtests.patch, failed testing.

jhodgdon’s picture

Assigned: mcarbone » jhodgdon
Status: Needs work » Needs review
FileSize
19.39 KB

I decided I could spend some more time on this today, and I totally reworked the search_excerpt() function, made some new tests, and got them all to pass. Hopefully the new code in search_excerpt() actually makes sense (as opposed to the, at least I thought, very cryptic old code), and I think the test coverage is good enough to be fairly certain it is working well. And hopefully someone can review this? Please? :)

Status: Needs review » Needs work

The last submitted patch, 916086-rework-excerpt-function.patch, failed testing.

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
19.85 KB

Doh! Part of that test failure (the 50 exceptions) is real, having to do with needing to quote "/" in preg matching because "/" is the preg delimiter. Good thing someone wrote all these search module tests. :)

Here's a new patch that passes the tests that just failed (and the original excerpt test). Let's see how this works on the entire test suite.

jhodgdon’s picture

Hopefully the search.module code changes in this patch speak for themselves.

I have a couple of notes on the test code changes:

- Previously, search_excerpt() always put "..." at the end of each excerpt, even if the excerpt went to the end of the passed-in string, while it omitted the "..." if it started at the beginning of the passed-in string. Now the ending ... is omitted if the text ends at the end. So, one test's assertion changed due to that change in logic/behavior.

- Highlighting in excerpts is now always complete words, not including the trailing space. So a couple of assertions that codified some kind-of bad previous behavior that had spaces ending the STRONG tags had to be changed for this I think better behavior.

- I hijacked an existing test module that already had a hook_search_preprocess() in it, and told it that if the passed-in language code was 'ex' (for "excerpt"), some special preprocessing should be done so that I could test search excerpts working with preprocessing.

The rest of the test changes are additional tests that I added to make sure there was better test coverage of a few edge cases, and of preprocessing.

jhodgdon’s picture

I had a thought this morning...

Before we commit this patch, I think we should add one more test:
- In the test class, have it preprocess "hypertext markup language" to "html" (preprocessing is always done on lower-case text).
- In the excerpt test, verify that if you have a string with "hypertext markup language" at the start, end, and in the middle, and you look for an excerpt with "html", that "hypertext markup language" is highlighted.

This should work, but I think we should test it to make sure and to make sure it stays working.

I'm also unclear as to whether we have tests to verify that these types of preprocessing steps work with search itself (do matches come up in search results). I know that they *do* work in D7, since I maintain the Porter Stemmer module and it works fine in D7 (aside from this excerpt problem), but I don't know if we have core tests that verify that it works for search matching (e.g., with the tests we have now and the proposed new test from this comment, would you be able to search for "dependency", "finding", and "html" and match a node whose body is "The person finds the DIC to be useful for hypertext markup language construction". That would also be interesting to add as a test. Note that the current tests' preprocessing is only done in a fictitious language called "ex", so we might need to change that to a real language like "de" and set that language in the search test.

pwolanin’s picture

quick comments:

$keys = array_values(array_unique($newkeys));

Why do you need the array_values() call?

For function _search_find_match_with_simplify(), can it return NULL or FALSE instead of an emty string on failure, or do some matches return and empty string and still need to be skipped?

It seems you are removing a function that's nominally part of the API?

jhodgdon’s picture

RE #61 - right on all counts:
- We probably don't need the array_values call now. It was needed in a previous iteration while I was trying to fix this.
- Yes, we could return NULL or FALSE instead of an empty string from _search_find_match_with_simplify()
- Yes, I am removing a flawed function that was nominally part of the API. The D7 port will probably not be able to remove that function, but as it's not needed in d8 and it doesn't work anyway, I don't see a reason to keep it around. We can file a change notice just in case.

Anyway, I can try a patch with those two items changed (no array_values and change the function to return NULL if there is no match). Maybe later today...

jhodgdon’s picture

FileSize
20.35 KB

Here's a patch that removes array_values() and changes the return value of _search_find_match_with_simplify() to NULL if nothing is found, as suggested in #61.

I also added the additional excerpt tests I suggested in #60.

On my test machine, the search numbers and excerpt tests pass... so let's make sure with the testbot.

pwolanin’s picture

Some minor comments:

+  if (count($workkeys) > 0) {
+    while ($length < 256 && count($workkeys)) {

I'm not sure you need the if{} at all( not seeing an else) and why not just use a cast to boolean?

Also, I'm not sure why you need strlen here:

+  $temp = trim($key);
+  if (strlen($temp) < 1) {
+    return NULL;
   }

and in the preg_match below you could just put preg_quote($text, '/')

Maybe I'll re-roll with some minor changes like that, but overall looks good.

jhodgdon’s picture

If you'd like to reroll with some code cleanup, that would be great! As long as the new patch passes all the new (and old) tests, I'm all for it. :)

pwolanin’s picture

FileSize
20.23 KB

ok, here's a re-roll with mostly cosmetic changes. It passes tests locally.

note that this removes the _search_quote_slashes_for_preg() function in favor of just quoting each key.

jhodgdon’s picture

I will be surprised if this works, because I don't think the $keys array is having the slashes escaped (only workkeys). There's a preg_replace at the bottom of the function that uses $keys and will need to have slashes escaped.

jhodgdon’s picture

Oh wait. Your gymnastics with $keys and $workkeys ... I guess it will work.

jhodgdon’s picture

So, I'm OK with this patch, assuming the tests pass. I personally would never use an array variable as a boolean in a condition... I always use count() or empty() I guess, but if it works, OK. These two statements are inconsistent in how you've coded them though:

+  while ($length < 256 && $workkeys) {
...
+  if (empty($ranges)) {

I'd go ahead and use empty() in the first one or preferably (for me) count() for both.

But this is minor. If the test bot comes back green, I am in favor of this being RTBC. I wrote most of the patch so I won't mark it that myself, but I'm in favor of pwolanin's version in #66, so if he is in favor of it also, let's ship it. :)

pwolanin’s picture

FileSize
20.24 KB

Ok, changed the loop to:

  while ($length < 256 && !empty($workkeys)) {
jhodgdon’s picture

Still in favor of RTBC. :)

pwolanin’s picture

Status: Needs review » Reviewed & tested by the community
catch’s picture

Status: Reviewed & tested by the community » Needs review

Do we need to use preg_quote() here, or could we just use something other than '/' for the delimiters?

Also $workkeys and $newkeys could either use underscores and/or more descriptive variable names I think.

pwolanin’s picture

@catch - since we are looking at user input, any delimiter we choose would need to be quoted.

jhodgdon’s picture

Yes, whatever character we choose as a delimiter would need to be quoted, so I don't see a reason to choose a different one either. It seems like / is fairly common for regexp delimiters, so let's stick with that.

Regarding variable names, OK. $workkeys came from the old function... I will not have time to rework the patch today for better variable names, but maybe pwolanin will or I can do it in a few days.

jhodgdon’s picture

Here is a new patch. I went through search_excerpt() and _search_find_match_with_simplify(). I fixed up variable names and comments. Hopefully both functions are a little bit more comprehensible than they were the last iteration.

I'm attaching an interdiff from the patch in #70 along with this new patch.

jhodgdon’s picture

Assigned: jhodgdon » Unassigned

This still needs a review. @pwolanin??

jhodgdon’s picture

jhodgdon’s picture

Status: Needs review » Needs work
Issue tags: +Needs tests, +Needs backport to D7

The last submitted patch, 916086-76-search_excerpt_fixup.patch, failed testing.

pwolanin’s picture

Issue tags: +Needs reroll

needs reroll for D8 search plugins

jhodgdon’s picture

Status: Needs work » Needs review
Issue tags: -Needs reroll
FileSize
22.2 KB

Here's a straight reroll of the patch above.

I haven't looked at this code or docs in ages and maybe it can be improved... still needs a review though.

jhodgdon’s picture

Assigned: Unassigned » jhodgdon
Issue tags: -Needs tests

I guess this wasn't assigned to me. Anyway, I just glanced through again and I think this is still ready to go.

I'll update the issue summary too.

jhodgdon’s picture

Summary updated.

jhodgdon’s picture

Issue tags: -Needs backport to D7

Status: Needs review » Needs work
Issue tags: +Needs backport to D7

The last submitted patch, 916086-82-search_excerpt_fixup.patch, failed testing.

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
22.22 KB

Here's a reroll. This thing still needs a review...

jhodgdon’s picture

Issue summary: View changes

create an issue summary

jhodgdon’s picture

Priority: Normal » Major
Issue summary: View changes

We have at least three other issues that are postponed pending this issue being reviewed and committed, so I am going to up the priority on this one. We really need to get this in. I'll click retest and then can someone please review it?

jhodgdon’s picture

Status: Needs review » Needs work

The last submitted patch, 87: 916086-86-search_excerpt_fixup.patch, failed testing.

jhodgdon’s picture

Issue summary: View changes
Status: Needs work » Needs review
FileSize
22.2 KB

Straight reroll of previous patch.

Status: Needs review » Needs work

The last submitted patch, 91: 916086-91-search_excerpt_fixup.patch, failed testing.

jhodgdon’s picture

It looks like a recently-added test was relying on the old behavior, and will need to be fixed up.

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
23.7 KB
2.27 KB

I figured out what the problem was. The test SearchPreprocessedLangcodeText::testPreprocessLangcode() was relying on some old behavior: when an excerpt was being generated, it was always running something through hook_search_preprocess(). The current behavior is that if you search for something that is an exact match of the keyword in the unprocessed node, you don't run the preprocess at all. So it was looking for a message that was not there.

I changed the test so that the test module adds some additional text to the node during search indexing, and we search for that text. This causes the excerpt generation to invoke the preprocessing, generating the expected message.

Hope that makes sense... perhaps the interdiff will illuminate things.

The failing test now passes on my test site. Let's see if this addition to the test module breaks any other tests. Hopefully not.

andypost’s picture

Status: Needs review » Reviewed & tested by the community

It works and have test coverage, let's unblock other issues.
Tested this manually too!!!

catch’s picture

Version: 8.x-dev » 7.x-dev
Status: Reviewed & tested by the community » Patch (to be ported)

We need a way to differentiate between issues assigned to a committer because they were working on them, vs. assigned because another committer handed the issue over...

Committed/pushed to 8.x, thanks!

jhodgdon’s picture

Ah, I was wondering why this issue was in the queue for so long waiting for a commit. :) I'll just try to remember to un-assign issues when they're ready for commit. Thanks catch!

jhodgdon’s picture

Assigned: jhodgdon » Unassigned

If someone else wants to port this to D7, please do. I probably will not make time to do it any time soon, sorry.

jhodgdon’s picture

If someone decides to port this to Drupal 7, we also need to make sure to get the fix for #2301715: search_excerpt() uses strrpos() incorrectly when we port this patch.

RaulMuroc’s picture

Status: Patch (to be ported) » Needs review
FileSize
14.34 KB

First patch (or try) for D7.

Status: Needs review » Needs work
jhodgdon’s picture

The Drupal 7 port needs to be somewhat different from the Drupal 8 code, because in Drupal 7, there is no $lancode for search_excerpt. Hence, most of the test failures.

Also, we need to make sure to use as a starting point the current Drupal 8 code, not whatever was in the patch on this issue. The reason is that after this was done, #2301715: search_excerpt() uses strrpos() incorrectly was also fixed, which changed the code somewhat. We don't want to reopen that other issue for Drupal 7; we just want to make sure that the code for backport to Drupal 7 on *this* issue is the current, fixed Drupal 8 code.

RaulMuroc’s picture

Ok will try to work sometime on that but I'm currently out of time to spend on it, sorry about that.

thank you for your effort.

  • webchick committed 801a098 on 8.3.x
    #916086 by mcarbone, jhodgdon: Fixed search_excerpt() doesn't highlight...
  • catch committed f82a068 on 8.3.x
    Issue #916086 by jhodgdon, mcarbone, pwolanin, Albert Volkman:...

  • webchick committed 801a098 on 8.3.x
    #916086 by mcarbone, jhodgdon: Fixed search_excerpt() doesn't highlight...
  • catch committed f82a068 on 8.3.x
    Issue #916086 by jhodgdon, mcarbone, pwolanin, Albert Volkman:...
jhodgdon’s picture

Version: 7.x-dev » 8.1.x-dev
Status: Needs work » Fixed

I guess we should mark this issue 8.x and Fixed, instead of 7.x and Needs backport. If someone wants to pursue the 7.x version, it should be another issue as per current policies.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.