Support for Drupal 7 is ending on 5 January 2025—it’s time to migrate to Drupal 10! Learn about the many benefits of Drupal 10 and find migration tools in our resource center.
Search result should not be trimmed inside word, and nowise on html code. It could look awful.
example:
készítése
looks like:
készít&eac ...
Comment | File | Size | Author |
---|---|---|---|
#63 | 269911-fixchars.patch | 5.59 KB | jhodgdon |
#60 | 269911-drupalway.patch | 5.59 KB | jhodgdon |
#50 | 269911_cleanup_docs_cleaned.patch | 5.74 KB | apaderno |
#48 | 269911_cleanup_docs_rerolled.patch | 10.72 KB | apaderno |
#42 | 269911-cleanup-docs.patch | 5.71 KB | mradcliffe |
Comments
Comment #1
brianV CreditAttribution: brianV commentedJust assigning this to D7, since that is where this will have to be fixed.
Search results are trimming at a given character limit. They should be trimmed at a word boundary. Should be an easy fix - if nobody else takes it on, I will get to it eventually.
Comment #2
jhodgdonLooks like this still needs to be fixed, and can we have a test for it too?
Comment #3
Freso CreditAttribution: Freso commentedI believe this this will fix it, but I haven't tested. There should be a test for this.
Comment #4
jhodgdonThe patch looks like it should work in the case that the exerpt didn't find the keywords... A couple of questions:
a) Is it possible for search_exerpt to truncate incorrectly if it did find a keyword? ... I looked at the code: No. So that's OK.
b) Is putting " ..." after the excerpt the right thing to do in all languages? I doubt it. However, that's probably a separate issue.
c) Will this patch work for languages (Chinese, Japanese) that don't have spaces between words? No, I don't think so. It looks to me as though in this case, the proposed patch will probably return a 0-character excerpt, which is not OK. Or is there something I am missing? I think the correct solution is to use $boundary to find a word boundary, rather than the truncate_utf8() function?
Comment #5
Pasquallesimilar problem fixed in dblog module
#718662: DBLog listings truncate messages in the middle of HTML tags
Comment #6
andypostInteresting question about word boundary. Suppose $boundary is language specific so should be defined in iso.inc
Comment #7
jhodgdonPerhaps (word boundary definitions in iso.inc)... But the search excerpt function is making a search excerpt of the page for search results, which is trucated content (e.g. node title/body truncated to N characters), and the content could be in any language. Search doesn't know about languages, in its current incarnation. So having a language-aware set of boundary characters would not help until search becomes more language aware. (There are separate issues about that, but at this point it's D8 material.)
Anyway, both this patch and #718662 patch are using the PHP function truncate_utf8(), which obviously wouldn't care about Drupal's iso.inc or any other definitions of word boundaries. I don't know whether or not truncate_utf9() does something sensible for Chinese/Japanese -- I know just enough about languages and UTF to know that I don't know much and bring up questions. Who should we ask?
Comment #8
Freso CreditAttribution: Freso commentedIf this is indeed an issue, and it might very well be, it should be filed as a separate issue against the
truncate_utf8()
function.truncate_utf8()
is a Drupal function, not a PHP function. (There's also no such thing as "UTF-9".)Comment #9
andypostI think it's good idea to post about this in http://groups.drupal.org/japan
Comment #10
jhodgdonSorry, my mistake about PHP vs Drupal function on trucate_utf8.
Obviously utf9 was a typo. There is really no need to be so snarky in the issue queue when someone makes a typo. Thanks!
Can someone find an answer about whether spaces are valid word boundaries in all languages? I am extremely busy this week and don't have time to research it on IRC.
Comment #11
brianV CreditAttribution: brianV commented@jhodgon:
Taking Chinese as an example, there are no characters which denote word boundaries, since a word is often wholly contained within a given character.
I've attached two scripts I've found that handle Chinese reasonably well using different approaches. However, keep in mind that this is just one language - the same applies more or less to Japanese, Korean and other similar languages which require somewhat different algorithms to parse.
So, short form, if we want truncate_utf8() to handle languages that don't use a space for word separation, it's going to have to get a whole lot more complicated and resource hungry. If you think this is worth pursuing further, let's do so in a separate issue. In the meantime, I think we can disregard the improper handling of CJK results for the purpose of this issue, at least.
Comment #12
jhodgdonThanks brianV for confirming what I suspected: a space is not a word boundary in all languages.
So the patch in #3 is not going to work for us in general. And since the search module unfortunately doesn't know anything about what language it's dealing with, I don't know what will. Not an easy problem.
Actually, we could fix the actual reported issue about HTML entities by converting HTML entities in this function to characters before truncating (not on a word boundary). It would still stop in a word, but it at least wouldn't make half-entities.
So I also took a quick look at the the 12 functions reported by api.drupal.org in D7 that call truncate_utf8() [note that there could be other calls in classes, since api.drupa.org is not displaying them yet!]. Most do not use the word boundary functionality. Those that do:
http://api.drupal.org/api/function/_menu_parents_recurse/7
http://api.drupal.org/api/function/_book_toc_recurse/7
http://api.drupal.org/api/function/theme_dblog_message/7
http://api.drupal.org/api/function/comment_submit/7
Those are also suspect.
Also I think the "..." after truncating that is in truncate_utf8 as well (almost all the calls either use that or put their own "..." afterwards) is possibly also not correct? I don't know about that.
We have some CJK word boundary code in the search module already, by the way. I don't know how well it works.
Not too sure what to do about all of this.
Comment #13
brianV CreditAttribution: brianV commented@jhodgon:
That is a separate issue about trancate_utf8(), and should be filed as such. It in no way affects this issue. The problems with CJK have to be fixed there, not in this issue.
Comment #14
jhodgdonWell, until truncate_utf8() is fixed, the solution suggested here is not acceptable, because it will likely result in zero-length excerpts in some languages, which would be worse than what we have currently (which is excerpts that break in the middle of a word). So I'm marking this back to "needs work".
Plus, I still don't know if putting "..." at the end of an excerpt is the right thing for all languages. Any ideas on that?
So, let's get back to basics here. The problem reported here is that the current way of doing things is broken in that it can break in the middle of an HTML entity. I think that one solution to that would be to use http://php.net/manual/en/function.html-entity-decode.php to decode the entities, then truncate without the word boundary restriction). That way we wouldn't be breaking up entities, and that would solve the reported issue.
And by the way, there are about 25 issues that come up when you search for truncate_utf8 (including this one). I will go through them later today and file a new one if this is a new issue about it not working for CJK.
But this one is almost certainly relevant:
#200185: truncate_utf8() is used as a substring function
As noted by Gabor:
http://drupal.org/node/200185#comment-662567
truncate_utf8() is NOT meant to be used as a substring function. So we should not be using it here.
Comment #15
jhodgdonSeparate issue on truncate_utf8()
#768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug)
Comment #16
Damien Tournoud CreditAttribution: Damien Tournoud commented#3 is ok.
@jhodgdon: in the current form of truncate_utf8(), if there is no space anywhere it will simply truncate at the indicated character length. This will never result in a zero-length string.
This likely should be backported to D5 and D6.
Comment #17
jhodgdonNew truncate_utf8() issue:
#768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug)
Comment #18
Damien Tournoud CreditAttribution: Damien Tournoud commentedComment #19
Freso CreditAttribution: Freso commentedHere's the patch for #3 against latest DRUPAL-5 and DRUPAL-6.
Also, jhodgon, the UTF-9 comment was not meant to be snarky. I really thought you thought there would be a truncate_utf9() function. Sorry about that. :/
Comment #20
jhodgdonSorry about the cross-posting there, and about being prickly, Freso. :)
And Damien is correct about the truncate function working OK if a space isn't found (#16), although in this case it will still be broken as far as possibly breaking in the middle of an entity or HTML tag. (Hopefully we'll also fix the truncate function to work better for non-Latin languages, on that other issue).
But what about #14 suggestion -- shouldn't we be doing an entity decode and maybe a tag removal before we truncate?
Comment #21
Dries CreditAttribution: Dries commentedCommitted to CVS HEAD. Thanks!
Comment #24
jhodgdonSee #20. Shouldn't we be decoding entitites and removing html tags before truncating?
Comment #25
Damien Tournoud CreditAttribution: Damien Tournoud commentedYes, we should probably decode entities in search_excerpt(), just after the call to strip_tags().
Comment #26
jhodgdonHere's a patch that does an entity decode/encode.
It appears to work. I was able to test both the "I found keywords to highlight" and "I didn't" paths in the code by using the Porter Stemmer module, and both are displaying OK on the search results page.
Comment #28
jhodgdonLooks like those test failures are real... will need to investigate.
Comment #29
andypostcomment search test is fragile because it depends on exact string comparison in search results
Comment #30
jhodgdonYeah, probably the exact string comparison changed a bit and the test needs a rewrite if we go with this patch.
Comment #33
jhodgdonHere's a new patch. On my test box it passes the search tests... let's see what the testbot says.
Comment #34
jpmckinney CreditAttribution: jpmckinney commentedAddresses the issue in the OP. Can we have a test for this?
Comment #35
jhodgdonYeah, a test would be a good idea indeed. A node with lots of entities would need to be constructed so we could verify the old function would break it wrong. Probably the test would also need to use a preprocessing function in order to exercise the "I didn't actually find the keyword, just take the first N characters" path in the code (the preprocess function changes the text so the exact search keyword isn't found in the node text, and then the excerpt function doesn't work well, see #493270: search_excerpt() doesn't work well with stemming, which I would have loved to get fixed for D7 but it was moved to D8, sigh, maybe I should move it back, since all sorts of other API changes have gotten in after the deadline, sigh, and that's a really ugly bug.)
That aside, we've also been discussing on #768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug) that '...' is not the correct thing to do for non-Latin languages. So we should be passing the ... that is used in the search_excerpt() function through t() before pasting together the ranges.
Comment #36
jhodgdonOK.
I fixed the above issues and I wrote a test. It (of course) uncovered some other problems in this function, so I had to fix them too in order to get the test to pass.
Comment #37
douggreen CreditAttribution: douggreen commentedPatch works for me, a couple minor comments:
strpos(' ' . $text)
... it's in original source, should it bestrpos(' ' . $text . ' ')
?$s = min(strlen($end) - 1, $s);
// If we didn't find anything, return the beginning.
doesn't make sense now that we're re-encoding it, and not just returning the beginning string (and "return the beginning" isn't a complete enough sentence anyways).Comment #38
jhodgdonRegarding space before/after:
The space before is in the first line, and the space after is in the second line.
Otherwise, I think I've addressed your comments - thanks!
Comment #39
jhodgdonComment #40
douggreen CreditAttribution: douggreen commentedLooks great! I'm not ecstatic about the comments, but let's address those separately .. maybe we can work together on all search comments and have a single cleanup patch.
Comment #41
webchickCould we better comment that test a little?
Comment #42
mradcliffeI added the following as an introduction to the test case:
'Passes keywords and a sample marked up string, "The quick brown fox jumps over the lazy dog", and compares it to the correctly marked up string. The correctly marked up string contains either highlighted keywords or the original marked up string if no keywords matched the string.'
Comment #43
jhodgdonLooks fine to me...
Comment #44
jhodgdonComment #45
klausiDoes not apply anymore.
Comment #46
jhodgdon#42: 269911-cleanup-docs.patch queued for re-testing.
Comment #48
apadernoI re-rolled the patch, which didn't apply anymore. It seems my editor caught some spaces that should have been removed.
Comment #50
apadernoLet's try once more.
Comment #51
jhodgdonLooks good to me.
Comment #52
jhodgdon#50: 269911_cleanup_docs_cleaned.patch queued for re-testing.
Comment #53
jhodgdon#50: 269911_cleanup_docs_cleaned.patch queued for re-testing.
Comment #54
pwolanin CreditAttribution: pwolanin commented#50: 269911_cleanup_docs_cleaned.patch queued for re-testing.
Comment #55
jhodgdon#50: 269911_cleanup_docs_cleaned.patch queued for re-testing.
Comment #56
jhodgdon#50: 269911_cleanup_docs_cleaned.patch queued for re-testing.
Comment #57
jhodgdonpwolanin asked me to change the title to something better
Comment #58
Damien Tournoud CreditAttribution: Damien Tournoud commentedPlease use our own decode_entities() instead of html_entity_decode(), which is buggy. And check_plain() instead of htmlspecialchars().
Also, this is not really great for translatability purposes:
I suggest we do it this way:
That will give more context to translators.
Comment #59
jhodgdonGood suggestions. I'll make another patch later.
Comment #60
jhodgdonHere we go (I hope)... let's see if it still passes the tests at least.
Comment #62
jhodgdonHm... Failed "Entities are converted in excerpt". Perhaps the substitution of the drupal functions wasn't successful. Will need to investigate.
Comment #63
jhodgdonAh. Looks like there was a screwed-up character in that last patch. This should work better.
Comment #64
andypostjhodgdon, could you use
cvs diff -u
p to see which functions are changingWhile you on it, is it possible to make highlighting swappable?
As maintainer of http://drupal.org/project/rustemmer I need to patch search.module because I need highlight a stemmed words which mostly different from words contained in fragment
Comment #65
jhodgdonandypost: I wish. I use Eclipse for patching and they don't have a -p option as far as I can tell. I'm too lazy to learn another tool...
Regarding making highlighting swappable: I tried, and it was rejected for D7. So it's been pushed off to D8. I know what you're saying -- I am the maintainer of Porter Stemmer for English... Anyway, it's a separate issue. See
#493270: search_excerpt() doesn't work well with stemming
If you're interested, I do have a working version of this in the Search by Page contrib module and Porter Stemmer (with a hook in SBP that lets Porter Stemmer do its own highlighting). Your stemming module could implement the same hook... I have both D6 and D7 versions in SBP/Porter.
And I would still love to get that into D7, but you can read the history on that other issue...
Comment #66
pwolanin CreditAttribution: pwolanin commentedI think it's possible we can rip decode_entities() in favor of the php built-in html_entity_decode() since Drupal 7 requires PHP 5.2 and utf8 support was added in 5.0. http://www.php.net/manual/en/function.html-entity-decode.php
However, that should be a separate issue. This patch looks good.
Comment #67
Dries CreditAttribution: Dries commentedCommitted to CVS HEAD. Thanks.
Comment #68
Damien Tournoud CreditAttribution: Damien Tournoud commented@pwolanin: I can confirm that
html_entity_decode($input, ENT_QUOTES, 'UTF-8')
works correctly and passes our tests. This charset parameter must have hidden in a cave or something... we refactored the whole decode_entities() in D7 to workaround supposed limitations in the entity table set of PHP :)Comment #69
pwolanin CreditAttribution: pwolanin commentedok, well assuming the built-in is dramatically faster, let's have a follow-up for that conversion: #878408: Replace decode_entities() with built-in html_entity_decode()