Support from Acquia helps fund testing for Drupal Acquia logo

Comments

brianV’s picture

Version: 6.2 » 7.x-dev

Just assigning this to D7, since that is where this will have to be fixed.

Search results are trimming at a given character limit. They should be trimmed at a word boundary. Should be an easy fix - if nobody else takes it on, I will get to it eventually.

jhodgdon’s picture

Looks like this still needs to be fixed, and can we have a test for it too?

Freso’s picture

Status: Active » Needs review
Issue tags: +Needs tests
FileSize
654 bytes

I believe this this will fix it, but I haven't tested. There should be a test for this.

jhodgdon’s picture

The patch looks like it should work in the case that the exerpt didn't find the keywords... A couple of questions:

a) Is it possible for search_exerpt to truncate incorrectly if it did find a keyword? ... I looked at the code: No. So that's OK.

b) Is putting " ..." after the excerpt the right thing to do in all languages? I doubt it. However, that's probably a separate issue.

c) Will this patch work for languages (Chinese, Japanese) that don't have spaces between words? No, I don't think so. It looks to me as though in this case, the proposed patch will probably return a 0-character excerpt, which is not OK. Or is there something I am missing? I think the correct solution is to use $boundary to find a word boundary, rather than the truncate_utf8() function?

Pasqualle’s picture

andypost’s picture

Interesting question about word boundary. Suppose $boundary is language specific so should be defined in iso.inc

jhodgdon’s picture

Perhaps (word boundary definitions in iso.inc)... But the search excerpt function is making a search excerpt of the page for search results, which is trucated content (e.g. node title/body truncated to N characters), and the content could be in any language. Search doesn't know about languages, in its current incarnation. So having a language-aware set of boundary characters would not help until search becomes more language aware. (There are separate issues about that, but at this point it's D8 material.)

Anyway, both this patch and #718662 patch are using the PHP function truncate_utf8(), which obviously wouldn't care about Drupal's iso.inc or any other definitions of word boundaries. I don't know whether or not truncate_utf9() does something sensible for Chinese/Japanese -- I know just enough about languages and UTF to know that I don't know much and bring up questions. Who should we ask?

Freso’s picture

c) Will this patch work for languages (Chinese, Japanese) that don't have spaces between words? No, I don't think so. It looks to me as though in this case, the proposed patch will probably return a 0-character excerpt, which is not OK. Or is there something I am missing? I think the correct solution is to use $boundary to find a word boundary, rather than the truncate_utf8() function?

If this is indeed an issue, and it might very well be, it should be filed as a separate issue against the truncate_utf8() function.

Anyway, both this patch and #718662 patch are using the PHP function truncate_utf8(), which obviously wouldn't care about Drupal's iso.inc or any other definitions of word boundaries. I don't know whether or not truncate_utf9() does something sensible for Chinese/Japanese [...]

truncate_utf8() is a Drupal function, not a PHP function. (There's also no such thing as "UTF-9".)

andypost’s picture

I think it's good idea to post about this in http://groups.drupal.org/japan

jhodgdon’s picture

Sorry, my mistake about PHP vs Drupal function on trucate_utf8.

Obviously utf9 was a typo. There is really no need to be so snarky in the issue queue when someone makes a typo. Thanks!

Can someone find an answer about whether spaces are valid word boundaries in all languages? I am extremely busy this week and don't have time to research it on IRC.

brianV’s picture

FileSize
7.69 KB
5.47 KB

@jhodgon:

Taking Chinese as an example, there are no characters which denote word boundaries, since a word is often wholly contained within a given character.

I've attached two scripts I've found that handle Chinese reasonably well using different approaches. However, keep in mind that this is just one language - the same applies more or less to Japanese, Korean and other similar languages which require somewhat different algorithms to parse.

So, short form, if we want truncate_utf8() to handle languages that don't use a space for word separation, it's going to have to get a whole lot more complicated and resource hungry. If you think this is worth pursuing further, let's do so in a separate issue. In the meantime, I think we can disregard the improper handling of CJK results for the purpose of this issue, at least.

jhodgdon’s picture

Status: Needs review » Needs work

Thanks brianV for confirming what I suspected: a space is not a word boundary in all languages.

So the patch in #3 is not going to work for us in general. And since the search module unfortunately doesn't know anything about what language it's dealing with, I don't know what will. Not an easy problem.

Actually, we could fix the actual reported issue about HTML entities by converting HTML entities in this function to characters before truncating (not on a word boundary). It would still stop in a word, but it at least wouldn't make half-entities.

So I also took a quick look at the the 12 functions reported by api.drupal.org in D7 that call truncate_utf8() [note that there could be other calls in classes, since api.drupa.org is not displaying them yet!]. Most do not use the word boundary functionality. Those that do:
http://api.drupal.org/api/function/_menu_parents_recurse/7
http://api.drupal.org/api/function/_book_toc_recurse/7
http://api.drupal.org/api/function/theme_dblog_message/7
http://api.drupal.org/api/function/comment_submit/7

Those are also suspect.

Also I think the "..." after truncating that is in truncate_utf8 as well (almost all the calls either use that or put their own "..." afterwards) is possibly also not correct? I don't know about that.

We have some CJK word boundary code in the search module already, by the way. I don't know how well it works.

Not too sure what to do about all of this.

brianV’s picture

Status: Needs work » Needs review

@jhodgon:

That is a separate issue about trancate_utf8(), and should be filed as such. It in no way affects this issue. The problems with CJK have to be fixed there, not in this issue.

jhodgdon’s picture

Status: Needs review » Needs work

Well, until truncate_utf8() is fixed, the solution suggested here is not acceptable, because it will likely result in zero-length excerpts in some languages, which would be worse than what we have currently (which is excerpts that break in the middle of a word). So I'm marking this back to "needs work".

Plus, I still don't know if putting "..." at the end of an excerpt is the right thing for all languages. Any ideas on that?

So, let's get back to basics here. The problem reported here is that the current way of doing things is broken in that it can break in the middle of an HTML entity. I think that one solution to that would be to use http://php.net/manual/en/function.html-entity-decode.php to decode the entities, then truncate without the word boundary restriction). That way we wouldn't be breaking up entities, and that would solve the reported issue.

And by the way, there are about 25 issues that come up when you search for truncate_utf8 (including this one). I will go through them later today and file a new one if this is a new issue about it not working for CJK.

But this one is almost certainly relevant:
#200185: truncate_utf8() is used as a substring function
As noted by Gabor:
http://drupal.org/node/200185#comment-662567
truncate_utf8() is NOT meant to be used as a substring function. So we should not be using it here.

jhodgdon’s picture

Damien Tournoud’s picture

Title: wrong trim used for search result » Wrong trim used for search result
Priority: Minor » Normal
Status: Needs work » Reviewed & tested by the community
Issue tags: +Needs backport to D6, +Needs backport to D5

#3 is ok.

@jhodgdon: in the current form of truncate_utf8(), if there is no space anywhere it will simply truncate at the indicated character length. This will never result in a zero-length string.

This likely should be backported to D5 and D6.

jhodgdon’s picture

Title: Wrong trim used for search result » wrong trim used for search result
Priority: Normal » Minor
Status: Reviewed & tested by the community » Needs work
Issue tags: -Needs backport to D6, -Needs backport to D5
Damien Tournoud’s picture

Title: wrong trim used for search result » Wrong trim used for search result
Priority: Minor » Normal
Status: Needs work » Reviewed & tested by the community
Issue tags: +Needs backport to D6, +Needs backport to D5
Freso’s picture

Here's the patch for #3 against latest DRUPAL-5 and DRUPAL-6.

Also, jhodgon, the UTF-9 comment was not meant to be snarky. I really thought you thought there would be a truncate_utf9() function. Sorry about that. :/

jhodgdon’s picture

Status: Reviewed & tested by the community » Needs review

Sorry about the cross-posting there, and about being prickly, Freso. :)

And Damien is correct about the truncate function working OK if a space isn't found (#16), although in this case it will still be broken as far as possibly breaking in the middle of an entity or HTML tag. (Hopefully we'll also fix the truncate function to work better for non-Latin languages, on that other issue).

But what about #14 suggestion -- shouldn't we be doing an entity decode and maybe a tag removal before we truncate?

Dries’s picture

Version: 7.x-dev » 6.x-dev
Status: Needs review » Reviewed & tested by the community

Committed to CVS HEAD. Thanks!

jhodgdon’s picture

Version: 6.x-dev » 7.x-dev
Status: Reviewed & tested by the community » Needs work

See #20. Shouldn't we be decoding entitites and removing html tags before truncating?

Damien Tournoud’s picture

Yes, we should probably decode entities in search_excerpt(), just after the call to strip_tags().

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
1.75 KB

Here's a patch that does an entity decode/encode.

It appears to work. I was able to test both the "I found keywords to highlight" and "I didn't" paths in the code by using the Porter Stemmer module, and both are displaying OK on the search results page.

Status: Needs review » Needs work

The last submitted patch, 269911.patch, failed testing.

jhodgdon’s picture

Looks like those test failures are real... will need to investigate.

andypost’s picture

comment search test is fragile because it depends on exact string comparison in search results

jhodgdon’s picture

Yeah, probably the exact string comparison changed a bit and the test needs a rewrite if we go with this patch.

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
1.82 KB

Here's a new patch. On my test box it passes the search tests... let's see what the testbot says.

jpmckinney’s picture

Addresses the issue in the OP. Can we have a test for this?

jhodgdon’s picture

Status: Needs review » Needs work

Yeah, a test would be a good idea indeed. A node with lots of entities would need to be constructed so we could verify the old function would break it wrong. Probably the test would also need to use a preprocessing function in order to exercise the "I didn't actually find the keyword, just take the first N characters" path in the code (the preprocess function changes the text so the exact search keyword isn't found in the node text, and then the excerpt function doesn't work well, see #493270: search_excerpt() doesn't work well with stemming, which I would have loved to get fixed for D7 but it was moved to D8, sigh, maybe I should move it back, since all sorts of other API changes have gotten in after the deadline, sigh, and that's a really ugly bug.)

That aside, we've also been discussing on #768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug) that '...' is not the correct thing to do for non-Latin languages. So we should be passing the ... that is used in the search_excerpt() function through t() before pasting together the ranges.

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
5.02 KB

OK.

I fixed the above issues and I wrote a test. It (of course) uncovered some other problems in this function, so I had to fix them too in order to get the test to pass.

douggreen’s picture

Status: Needs review » Needs work

Patch works for me, a couple minor comments:

  • Is the space after the $text not needed in strpos(' ' . $text) ... it's in original source, should it be strpos(' ' . $text . ' ')?
  • Avoid running strlen() twice, by using min(), [untested] something like $s = min(strlen($end) - 1, $s);
  • ... if you do the above min(), it makes sense to do same thing using max() in line above it.
  • Please add comment above htmlspecialchars (around +1117) that says something about re-encoding previously decoded html entities.
  • Similarly, comment // If we didn't find anything, return the beginning. doesn't make sense now that we're re-encoding it, and not just returning the beginning string (and "return the beginning" isn't a complete enough sentence anyways).
jhodgdon’s picture

FileSize
5.19 KB

Regarding space before/after:

        if (($q = strpos(' ' . $text, ' ', max(0, $p - 61))) !== FALSE) {
          $end = substr($text . ' ', $p, 80);

The space before is in the first line, and the space after is in the second line.

Otherwise, I think I've addressed your comments - thanks!

jhodgdon’s picture

Status: Needs work » Needs review
douggreen’s picture

Status: Needs review » Reviewed & tested by the community

Looks great! I'm not ecstatic about the comments, but let's address those separately .. maybe we can work together on all search comments and have a single cleanup patch.

webchick’s picture

Status: Reviewed & tested by the community » Needs work

Could we better comment that test a little?

mradcliffe’s picture

Status: Needs work » Needs review
Issue tags: +Needs documentation
FileSize
5.71 KB

I added the following as an introduction to the test case:

'Passes keywords and a sample marked up string, "The quick brown fox jumps over the lazy dog", and compares it to the correctly marked up string. The correctly marked up string contains either highlighted keywords or the original marked up string if no keywords matched the string.'

jhodgdon’s picture

Looks fine to me...

jhodgdon’s picture

Status: Needs review » Reviewed & tested by the community
klausi’s picture

Status: Reviewed & tested by the community » Needs work

Does not apply anymore.

jhodgdon’s picture

Status: Needs work » Needs review

#42: 269911-cleanup-docs.patch queued for re-testing.

Status: Needs review » Needs work

The last submitted patch, 269911-cleanup-docs.patch, failed testing.

apaderno’s picture

Status: Needs work » Needs review
FileSize
10.72 KB

I re-rolled the patch, which didn't apply anymore. It seems my editor caught some spaces that should have been removed.

Status: Needs review » Needs work

The last submitted patch, 269911_cleanup_docs_rerolled.patch, failed testing.

apaderno’s picture

Status: Needs work » Needs review
FileSize
5.74 KB

Let's try once more.

jhodgdon’s picture

Status: Needs review » Reviewed & tested by the community

Looks good to me.

jhodgdon’s picture

#50: 269911_cleanup_docs_cleaned.patch queued for re-testing.

jhodgdon’s picture

#50: 269911_cleanup_docs_cleaned.patch queued for re-testing.

pwolanin’s picture

#50: 269911_cleanup_docs_cleaned.patch queued for re-testing.

jhodgdon’s picture

#50: 269911_cleanup_docs_cleaned.patch queued for re-testing.

jhodgdon’s picture

#50: 269911_cleanup_docs_cleaned.patch queued for re-testing.

jhodgdon’s picture

Title: Wrong trim used for search result » Search result trimming should not fall inside HTML entities/tags

pwolanin asked me to change the title to something better

Damien Tournoud’s picture

Status: Reviewed & tested by the community » Needs work

Please use our own decode_entities() instead of html_entity_decode(), which is buggy. And check_plain() instead of htmlspecialchars().

Also, this is not really great for translatability purposes:

+  $text = (isset($newranges[0]) ? '' : t('... ')) . implode(t(' ... '), $out) . t(' ...');

I suggest we do it this way:

$separators = explode('!excerpt', t('... !excerpt ... !excerpt ...'));
$text = (isset($newranges[0]) ? '' : $separators[0]) . implode($separators[1]), $out) . $separators[2]);

That will give more context to translators.

jhodgdon’s picture

Good suggestions. I'll make another patch later.

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
5.59 KB

Here we go (I hope)... let's see if it still passes the tests at least.

Status: Needs review » Needs work

The last submitted patch, 269911-drupalway.patch, failed testing.

jhodgdon’s picture

Hm... Failed "Entities are converted in excerpt". Perhaps the substitution of the drupal functions wasn't successful. Will need to investigate.

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
5.59 KB

Ah. Looks like there was a screwed-up character in that last patch. This should work better.

andypost’s picture

jhodgdon, could you use cvs diff -up to see which functions are changing

While you on it, is it possible to make highlighting swappable?
As maintainer of http://drupal.org/project/rustemmer I need to patch search.module because I need highlight a stemmed words which mostly different from words contained in fragment

jhodgdon’s picture

andypost: I wish. I use Eclipse for patching and they don't have a -p option as far as I can tell. I'm too lazy to learn another tool...

Regarding making highlighting swappable: I tried, and it was rejected for D7. So it's been pushed off to D8. I know what you're saying -- I am the maintainer of Porter Stemmer for English... Anyway, it's a separate issue. See
#493270: search_excerpt() doesn't work well with stemming
If you're interested, I do have a working version of this in the Search by Page contrib module and Porter Stemmer (with a hook in SBP that lets Porter Stemmer do its own highlighting). Your stemming module could implement the same hook... I have both D6 and D7 versions in SBP/Porter.

And I would still love to get that into D7, but you can read the history on that other issue...

pwolanin’s picture

Status: Needs review » Reviewed & tested by the community

I think it's possible we can rip decode_entities() in favor of the php built-in html_entity_decode() since Drupal 7 requires PHP 5.2 and utf8 support was added in 5.0. http://www.php.net/manual/en/function.html-entity-decode.php

However, that should be a separate issue. This patch looks good.

Dries’s picture

Status: Reviewed & tested by the community » Fixed

Committed to CVS HEAD. Thanks.

Damien Tournoud’s picture

@pwolanin: I can confirm that html_entity_decode($input, ENT_QUOTES, 'UTF-8') works correctly and passes our tests. This charset parameter must have hidden in a cave or something... we refactored the whole decode_entities() in D7 to workaround supposed limitations in the entity table set of PHP :)

pwolanin’s picture

ok, well assuming the built-in is dramatically faster, let's have a follow-up for that conversion: #878408: Replace decode_entities() with built-in html_entity_decode()

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.