I have come across a bug in Views 2, drupal 6.12.
I have a View that displays the title, teaser (with html entities) and link to node. The language of the content is Greek (When I tested on Lorem ipsum text with html entities, there is NO problem, trim works as expected). When I set the teaser to be trimmed at say 320 characters, with Trim only on a word boundary and Add an ellipsis, the output is:
Σκεφτείτε να έλειπε ο γαμπρός από τον γάμο. Μόνη της θα παντρευόταν η νύφη; Αλώστε και η ίδια η λέξη «παντρεύομαι...
It stops at the second html entity. If I where to trim the teaser at 380 characters then it starts to render the entities correctly. The output is:
Σκεφτείτε να έλειπε ο γαμπρός από τον γάμο. Μόνη της θα παντρευόταν η νύφη; Αλώστε και η ίδια η λέξη «παντρεύομαι» δηλώνει την κατάσταση που θα βρεθεί η νύφη μετά το μυστήριο: Υπό του ανδρός. Κι όλα αυτά γιατί; Γιατί και ο γαμπρός έχει θέση στην εκκλησία. Δεν είναι το ίδιο φανταχτερός και βέβαια δεν έχει το ίδιο άγχος με τη νύφη. Μέσα από τις σελίδες του site...
Now if I were to select not to Trim only on a word boundary the problem does not happen.
Selecting the Field can contain HTML does not change the output.
The text I have as seen in the fckeditor is:
<p>Σκεφτείτε να έλειπε ο γαμπρός από τον γάμο. Μόνη της θα παντρευόταν η νύφη; Αλώστε και η ίδια η λέξη «παντρεύομαι» δηλώνει την κατάσταση που θα βρεθεί η νύφη μετά το μυστήριο: Υπό του ανδρός. Κι όλα αυτά γιατί; Γιατί και ο γαμπρός έχει θέση στην εκκλησία. Δεν είναι το ίδιο φανταχτερός και βέβαια δεν έχει το ίδιο άγχος με τη νύφη. Μέσα από τις σελίδες του site μπορείτε να κάνετε μια πρώτη επίσκεψη «στην αγορά» παίρνοντας μια ιδία για τις τάσεις και τη μόδα.</p>
Any help would be appreciated.
Comment | File | Size | Author |
---|---|---|---|
#29 | 513396_trim_mb.patch | 2.76 KB | dawehner |
#25 | 513396_trim_mb.patch | 1.01 KB | yhager |
#8 | views-513396.patch | 744 bytes | jcisio |
#4 | yh.patch | 888 bytes | yhager |
#2 | yh.patch | 644 bytes | yhager |
Comments
Comment #1
dwb17 CreditAttribution: dwb17 commentedDoes anyone have any idea?
Comment #2
yhager CreditAttribution: yhager commentedThis is a problem with PHP unicode character handling. Can you say if the attached patch fixes this for you (it does for me, on Hebrew characters).
Note views maintainers: I do not think this patch is fit for the general public as is, but fact is that PHP fails on utf-8 strings word boundary. Do you think there is a way to "check if this is not ascii text, and if php has mb support, and only then use the mb_ereg functions in this patch"?
I am well aware that mb_ereg is deprecated - but until PHP 6 lands, I have not found any better solution. Open to suggestions.
Comment #3
dawehnerI think this library is not enabled everywhere, you have to check for function_exists.
Additional it would be perhaps cool if you could add
// TODO: replace this with cleanstring of ctools
Comment #4
yhager CreditAttribution: yhager commentedrerolled.
Comment #5
jcisio CreditAttribution: jcisio commentedI think the "word boundary" character \b is not what we want here. Definition:
That means "abc xyz" cut at 5 character would return "abc " instead of "abc" (what we want), because the 4th and 5th characters are both word boundary.
I replace
if (preg_match("/(.*)\b.+/us", $value, $matches)) {
by
if (preg_match("/(.*)\s.*/s", $value, $matches)) {
It works for utf8 Vietnamese strings. Please confirm that it works for other languages, too.
Comment #6
yhager CreditAttribution: yhager commentedTrue, but if you cut it at 3, you get 'abc' and there is no whitespace there, so you lose the entire word. With word-boundary search, you still get it.
Comment #7
jcisio CreditAttribution: jcisio commentedI don't think I understand what you mean by "lose the entire word". Actually "abc xyz" cut at 5 returns "abc ...", but it should return "abc...", shouldn't it?
The code I submitted is for the original code. I don't use "u" modifier as it just works without. With "u", it doesn't work, don't know why. So it needs more tests.
Comment #8
jcisio CreditAttribution: jcisio commentedSubmit a patch against 6.x-2.x so that it's easier to test. The only problem I can see is that "abc xyz www" trimmed at 7 will return "abc". But that's another problem.
Comment #9
yhager CreditAttribution: yhager commented> I don't think I understand what you mean by "lose the entire word".
try to cut "abc xyz" at 3 with your patch, and you get an empty string, instead of 'abc'.
> The only problem I can see is that "abc xyz www" trimmed at 7 will return "abc". But that's another problem.
No, it's the same problem. If you happen to cut the string *exactly* at the end of a word, you lose that last word in the result.
Comment #10
jcisio CreditAttribution: jcisio commentedNot the mb encoding problem anyway ;)
If we're willing to change the issue title, I propose to replace the empty string with t('(empty)'). That's the users who must take care for their title, and the admin who should always set the length to at least 10-15.
I don't see in any case that we need to trim at word boundary and the defined length is less than 3-4 times an average word length. And lost the last word is not a problem, either, as we don't need an exact length.
The last one, if we really need it, just check if (N+1)th character is \s, we need to increase N by 1 before doing the truncation.
PS: \s is dependant on environment, but usually it is
[\t\r\n ]
or more, so we may want to replace \s by something like[\s\.\?,;]
Comment #11
merlinofchaos CreditAttribution: merlinofchaos commentedThe problem presented in #5 can be solved by adding a trim() function after the word boundary check, rather than trying to redefine word boundary which regex actually does pretty well for us.
The solution in #4 seems ok (though the patch name is generic -- putting the issue # in the patch name is helpful. I have a LOT of patches in my views directory =) and I'm going to go ahead and go with it. I think it will do what we want. The fact that we keep a space after word boundary testing is a different issue I think and can be addressed separately. #4 is committed to all branches.
jcisio, would you like to create a new issue for the trailing whitespace? A simple trim() should fix it.
Comment #12
merlinofchaos CreditAttribution: merlinofchaos commentedComment #13
yhager CreditAttribution: yhager commentedThanks for committing this - sorry for the patch name confusion.
Comment #14
OnkelTem CreditAttribution: OnkelTem commentedAll russian text is cut to a null-length string, if using this Views feature for Teaser for example with selected "cut by words" checkbox.
I don't understand - is this really FIXED or not, and if yes - what Views version should I download to get it working? Or what patch to apply?
P.S. This patch views-513396.patch has nothing to do with my issue. Russian text is simply disappeared.
regards
UPD. http://drupal.org/node/376722#comment-2467170 - fix to replace silly preg_match's "\b"
Comment #16
jcisio CreditAttribution: jcisio commentedThis patch doesn't work. Random latest posts on my homepage display like this (I cut the beginning, just leave the relevant part):
3 corrects, and 3 wrongs.
However, the patch that I submitted in #8 works. Maybe my PHP has already built-in multibyte support?
Comment #17
egd CreditAttribution: egd commentedI can confirm that the submitted patch does not fix the problem. Here is an example with Bulgarian.
preg_match("/(.*)\s.+/us","асд асд асд ас", $matches); var_dump($matches)
produces:
array(2) {
[0]=>
string(25) "асд асд асд ас"
[1]=>
string(20) "асд асд асд"
}
whereas
preg_match("/(.*)\b.+/us","асд асд асд ас", $matches); var_dump($matches)
produces:
array(0) {
}
which is a problem.
Comment #18
merlinofchaos CreditAttribution: merlinofchaos commentedIf you've got bulgarian, it should be using mb_ereg not preg_match
Comment #19
egd CreditAttribution: egd commentedviews_trim_text() should be able to handle Bulgarian as well :-).
Anyway, I am new to php. Isn't POSIX regex support (e.g. ereg) supposed to be dropped in PHP6?
Comment #20
merlinofchaos CreditAttribution: merlinofchaos commented1) I can only offer what PHP gives me
2) preg is known to not be multibyte safe
Statements like this are annoying. Yes I would freakin' love to support every language everywhere all the time. Thank you for enlightening me.
Then maybe in PHP6 we'll have a mb_preg_replace() or something.
My *point* was that the code you're demonstrating doesn't work is not the path that the code will follow. The patch is set up to use mb_ereg if multibyte is available and preg if not. If you're using Bulgarian and do not have the mb_ library, then preg is our best effort. It turns out our best effort isn't going to work.
Comment #21
jcisio CreditAttribution: jcisio commentedA few tests:
1 and 2 (even without the u modifier) are ok. 3 is not ok (not matching at all). 4 returns "асд асд асд а?", so not ok.
As 4 is actually implemented in Views, it has the problem with trim at the middle of words as reported in #16. And I don't know why #14 reports that the patch didn't work.
ereg is deprecated in PHP 5.3 (I don't know if it is available in PHP6), but not mb_ereg. So #4 were safe to use if it worked.
Comment #22
yhager CreditAttribution: yhager commented@jcisio, @egd — are you sure your PHP is set up to support multibyte? I cannot recreate any of your reports using a simple test program:
test.php:
data:
And running this I get:
Comment #23
egd CreditAttribution: egd commented@yhager - I have run your test code and it indeed works.
I believe the whole confusion here (@merlinofchaos - Sorry about that) comes from the fact that the patch is not in the "6.x-2.10" (datestamp = "1270766108") version of views module as seen on drupal.org.
Or am I missing something?
Comment #24
jcisio CreditAttribution: jcisio commentedConfirm! The patch is not in 6.x-2.x-dev. Now it should patch against views.module and work!
Thanks, @yhager.
I found the error where
mb_ereg("(.*)\b.+","асд асд асд ас", $matches); var_dump($matches);
didn't work, too. My PHP doesn't use utf8 by default, so the callmb_regex_encoding('UTF-8');
is necessary.Comment #25
yhager CreditAttribution: yhager commentedAssuming maintainer wants patches to flow from 6.x-3.x backwards, I am attaching a trivial reroll against DRUPAL-6--3 branch (untested).
Can please someone test this and RTBC?
Comment #26
yhager CreditAttribution: yhager commentedComment #27
egd CreditAttribution: egd commentedThanks yhager. Your patch works for me.
Comment #28
jcisio CreditAttribution: jcisio commentedOk, I've just downloaded 6.x-3.x-dev and patch works without any notice.
Comment #29
dawehnerI wrote a simpletest for your test examples , which failed on some places. After this i applied the patch, and after the patch it worked fine.
Comment #30
merlinofchaos CreditAttribution: merlinofchaos commentedCommitted to all branches. Test committed to all 3.x branches. THanks!
Comment #32
remaye CreditAttribution: remaye commentedSorry for re-opening this thread,
I'm wondering if this bug could also happen with accentuated chars in french language ?
Also I'm using Views 6.x-3.0+53-dev (2012-mar-06) and normally this bug should be fixed, right ?
So let me know if you think this should a different issue, but the problem is :
With "Trim only on a word boundary" checked, words are cut before or after any accentuated char in the middle of the world :
like "référence" can be cut at "r" or "ré" or "réf" or "réfé" but not "référ" nor "référe", "référen" ...
And there is also some strange behaviours like :
"verrieres avec désenfumage" cut at "verrieres avec..." but
"verrières avec désenfumage" cut at "verriè..."
like if "è" would count for more than one char... !?
I'm quite confused about what is happening.
Thanks to let me know what you think...
Comment #33
jcisio CreditAttribution: jcisio commentedDo the VIews test cases pass in your system? In a UTF-8 coded string, "è" is one char but two bytes, but it should not be cut at that position.