I have come across a bug in Views 2 on Drupal 6.12.
I have a View that displays the title, the teaser (with HTML entities) and a link to the node. The content is in Greek. (When I tested with Lorem ipsum text containing HTML entities, there was no problem; trimming worked as expected.) When I set the teaser to be trimmed at, say, 320 characters, with "Trim only on a word boundary" and "Add an ellipsis" enabled, the output is:

Σκεφτείτε να έλειπε ο γαμπρός από τον γάμο. Μόνη της θα παντρευόταν η νύφη; Αλώστε και η ίδια η λέξη «παντρεύομαι...

It stops at the second HTML entity. If I trim the teaser at 380 characters instead, the entities render correctly. The output is:

Σκεφτείτε να έλειπε ο γαμπρός από τον γάμο. Μόνη της θα παντρευόταν η νύφη; Αλώστε και η ίδια η λέξη «παντρεύομαι» δηλώνει την κατάσταση που θα βρεθεί η νύφη μετά το μυστήριο: Υπό του ανδρός. Κι όλα αυτά γιατί; Γιατί και ο γαμπρός έχει θέση στην εκκλησία. Δεν είναι το ίδιο φανταχτερός και βέβαια δεν έχει το ίδιο άγχος με τη νύφη. Μέσα από τις σελίδες του site...

If I uncheck "Trim only on a word boundary", the problem does not occur.
Checking "Field can contain HTML" does not change the output.
The text, as seen in FCKeditor, is:

<p>Σκεφτείτε να έλειπε ο γαμπρός από τον γάμο. Μόνη της θα παντρευόταν η νύφη; Αλώστε και η ίδια η λέξη &laquo;παντρεύομαι&raquo; δηλώνει την κατάσταση που θα βρεθεί η νύφη μετά το μυστήριο: Υπό του ανδρός. Κι όλα αυτά γιατί; Γιατί και ο γαμπρός έχει θέση στην εκκλησία. Δεν είναι το ίδιο φανταχτερός και βέβαια δεν έχει το ίδιο άγχος με τη νύφη. Μέσα από τις σελίδες του site μπορείτε να κάνετε μια πρώτη επίσκεψη &laquo;στην αγορά&raquo; παίρνοντας μια ιδία για τις τάσεις και τη μόδα.</p>

Any help would be appreciated.


Comments

dwb17’s picture

Does anyone have any idea?

yhager’s picture

Version: 6.x-2.6 » 6.x-2.8
Status: Active » Needs review
FileSize
644 bytes

This is a problem with PHP's Unicode character handling. Can you say whether the attached patch fixes this for you? (It does for me, with Hebrew characters.)

Note to Views maintainers: I do not think this patch is fit for the general public as is, but the fact is that PHP fails to find word boundaries in UTF-8 strings. Do you think there is a way to "check if this is not ASCII text, and if PHP has mb support, and only then use the mb_ereg functions in this patch"?

I am well aware that mb_ereg is deprecated - but until PHP 6 lands, I have not found any better solution. Open to suggestions.

dawehner’s picture

Status: Needs review » Needs work

I think this library is not enabled everywhere, so you have to check with function_exists.

Additionally, it would be nice if you could add
// TODO: replace this with cleanstring of ctools

yhager’s picture

Status: Needs work » Needs review
FileSize
888 bytes

Rerolled.

jcisio’s picture

Title: Trim field to a maximum length - Greek language » Trim field to a maximum length - Multibyte encodings

I think the "word boundary" character \b is not what we want here. Definition:

A word boundary is a position in the subject string where the current character and the previous character do not both match \w or \W (i.e. one matches \w and the other matches \W), or the start or end of the string if the first or last character matches \w, respectively.

That means "abc xyz" cut at 5 characters would return "abc " instead of "abc" (what we want), because the 4th and 5th characters are both at a word boundary.

I replaced
if (preg_match("/(.*)\b.+/us", $value, $matches)) {
with
if (preg_match("/(.*)\s.*/s", $value, $matches)) {

It works for utf8 Vietnamese strings. Please confirm that it works for other languages, too.
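For readers following along, here is a minimal sketch of how the two patterns behave on the "abc xyz" example. It is written in Python rather than PHP purely as an illustration: Python's re module is Unicode-aware by default, so it models only the matching logic and not the multibyte behaviour of preg_match or mb_ereg. The helper name group_one is hypothetical.

```python
import re

def group_one(pattern, value):
    """Return capture group 1, or None when the pattern does not match."""
    match = re.match(pattern, value, re.DOTALL)
    return match.group(1) if match else None

# "abc xyz" cut at 5 characters leaves "abc x".
cut = "abc xyz"[:5]

# Word-boundary pattern: the last boundary with text after it sits
# between the space and "x", so the space stays in the result.
with_boundary = group_one(r"(.*)\b.+", cut)

# Whitespace pattern: the capture stops before the last whitespace.
with_whitespace = group_one(r"(.*)\s.*", cut)

print(repr(with_boundary))    # 'abc '
print(repr(with_whitespace))  # 'abc'
```

The only difference that matters here is whether the trailing space ends up inside the captured group.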

yhager’s picture

> That means "abc xyz" cut at 5 characters would return "abc " instead of "abc" (what we want), because the 4th and 5th characters are both at a word boundary.

True, but if you cut it at 3, you get 'abc' and there is no whitespace there, so you lose the entire word. With word-boundary search, you still get it.

jcisio’s picture

I don't think I understand what you mean by "lose the entire word". Currently "abc xyz" cut at 5 returns "abc ...", but it should return "abc...", shouldn't it?

The change I submitted is against the original code. I didn't use the "u" modifier because it works without it. With "u" it doesn't work, and I don't know why, so it needs more testing.

jcisio’s picture

FileSize
744 bytes

Submitting a patch against 6.x-2.x so that it's easier to test. The only problem I can see is that "abc xyz www" trimmed at 7 will return "abc". But that's another problem.

yhager’s picture

> I don't think I understand what you mean by "lose the entire word".

Try to cut "abc xyz" at 3 with your patch, and you get an empty string instead of 'abc'.

> The only problem I can see is that "abc xyz www" trimmed at 7 will return "abc". But that's another problem.

No, it's the same problem. If you happen to cut the string *exactly* at the end of a word, you lose that last word in the result.

jcisio’s picture

Not the mb encoding problem anyway ;)

If we're willing to change the issue title, I propose replacing the empty string with t('(empty)'). It is up to users to take care of their titles, and admins should always set the length to at least 10-15.

I can't think of any case where we need to trim at a word boundary and the configured length is less than 3-4 times the average word length. Losing the last word is not a problem either, as we don't need an exact length.

Finally, if we really need it: check whether the (N+1)th character is \s, and if so increase N by 1 before doing the truncation.

PS: \s depends on the environment, but usually it is [\t\r\n ] or more, so we may want to replace \s with something like [\s\.\?,;]
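A rough sketch of the (N+1)th-character idea, written in Python as an illustration only (not the Views code, and not PHP). The function name trim_on_whitespace is hypothetical, and the empty-string fallback mirrors the behaviour discussed above rather than anything confirmed from the patch.

```python
import re

def trim_on_whitespace(text, max_length):
    """Cut text at max_length characters, then drop the trailing partial word."""
    if len(text) <= max_length:
        return text
    # If the cut lands exactly at the end of a word, the next character is
    # whitespace: extend the cut by one so that word is not thrown away.
    if text[max_length].isspace():
        max_length += 1
    cut = text[:max_length]
    match = re.match(r"(.*)\s.*", cut, re.DOTALL)
    # No whitespace at all: nothing can be kept on a word boundary.
    return match.group(1) if match else ""

print(trim_on_whitespace("abc xyz www", 7))  # abc xyz
print(trim_on_whitespace("abc xyz", 3))      # abc
print(trim_on_whitespace("abc xyz", 5))      # abc
```

With the extra check, a cut that falls exactly at the end of a word keeps that word, which covers the "abc xyz www" at 7 case.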

merlinofchaos’s picture

The problem presented in #5 can be solved by adding a trim() call after the word boundary check, rather than trying to redefine word boundary, which regex actually does pretty well for us.

The solution in #4 seems ok (though the patch name is generic -- putting the issue # in the patch name is helpful; I have a LOT of patches in my views directory =) ), and I'm going to go ahead and go with it. I think it will do what we want. The fact that we keep a space after word boundary testing is a different issue, I think, and can be addressed separately. #4 is committed to all branches.

jcisio, would you like to create a new issue for the trailing whitespace? A simple trim() should fix it.
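To make the trim() suggestion concrete, here is a small Python sketch (an illustration only; the real code is PHP and would use trim() or rtrim()). The function name trim_on_word_boundary and the ellipsis parameter are hypothetical. Python's re treats Cyrillic letters as word characters by default, so this says nothing about how preg_match or mb_ereg handle the same input.

```python
import re

def trim_on_word_boundary(text, max_length, ellipsis="..."):
    """Cut at max_length, back up to a word boundary, strip trailing space."""
    if len(text) <= max_length:
        return text
    value = text[:max_length]
    match = re.match(r"(.*)\b.+", value, re.DOTALL)
    if match:
        value = match.group(1)
    # The boundary match can leave the separating space behind; strip it
    # before appending the ellipsis.
    return value.rstrip() + ellipsis

print(trim_on_word_boundary("abc xyz", 5))         # abc...
print(trim_on_word_boundary("асд асд асд ас", 9))  # асд асд...
```

The word-boundary regex stays as it is; only the leftover whitespace is removed afterwards.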

merlinofchaos’s picture

Status: Needs review » Fixed

yhager’s picture

Thanks for committing this - sorry for the patch name confusion.

OnkelTem’s picture

All Russian text is cut to a zero-length string when using this Views feature, for example on a teaser with the "cut by words" checkbox selected.
I don't understand: is this really FIXED or not? If it is, which Views version should I download to get it working, or which patch should I apply?

P.S. The patch views-513396.patch has nothing to do with my issue. The Russian text simply disappears.

regards

UPD. http://drupal.org/node/376722#comment-2467170 - fix to replace silly preg_match's "\b"

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

jcisio’s picture

Version: 6.x-2.8 » 6.x-2.x-dev
Status: Closed (fixed) » Active

This patch doesn't work. Random latest posts on my homepage display like this (I cut the beginning, just leave the relevant part):

Tuy nhiên, những hi vọng
Tuy nhiên, những... (correct)

Giả sử chúng tôi có 3 Apple
Giả sử chú... (wrong)

siêu nhỏ này là bộ xử lý
siêu nhỏ này là b... (wrong)

Di động của nhà sản xuất Phần Lan
Di động của nhà sản xuất... (correct)

khoảng cách từ đại lí đến
khoảng cách từ đại... (correct)

của hãng bao gồm ba dòng
của hãng bao gồ... (wrong)

3 correct and 3 wrong.

However, the patch that I submitted in #8 works. Maybe my PHP already has built-in multibyte support?

egd’s picture

I can confirm that the submitted patch does not fix the problem. Here is an example with Bulgarian.

preg_match("/(.*)\s.+/us","асд асд асд ас", $matches); var_dump($matches);

produces:

array(2) {
[0]=>
string(25) "асд асд асд ас"
[1]=>
string(20) "асд асд асд"
}

whereas

preg_match("/(.*)\b.+/us","асд асд асд ас", $matches); var_dump($matches);

produces:

array(0) {
}

which is a problem.

merlinofchaos’s picture

If you've got Bulgarian, it should be using mb_ereg, not preg_match.

egd’s picture

views_trim_text() should be able to handle Bulgarian as well :-).

Anyway, I am new to PHP. Isn't POSIX regex support (e.g. ereg) supposed to be dropped in PHP 6?

merlinofchaos’s picture

> views_trim_text() should be able to handle Bulgarian as well

1) I can only offer what PHP gives me
2) preg is known to not be multibyte safe

Statements like this are annoying. Yes I would freakin' love to support every language everywhere all the time. Thank you for enlightening me.

> Anyway, I am new to php. Isn't POSIX regex support (e.g. ereg) supposed to be dropped in PHP6?

Then maybe in PHP6 we'll have a mb_preg_replace() or something.

My *point* was that the code you're demonstrating as broken is not the path the code will follow. The patch is set up to use mb_ereg if multibyte is available and preg if not. If you're using Bulgarian and do not have the mb_ library, then preg is our best effort. It turns out our best effort isn't going to work.

jcisio’s picture

A few tests:

1. preg_match("/(.*)\s.+/us","асд асд асд ас", $matches);var_dump($matches);
2. preg_match("/(.*)\s.+/s","асд асд асд ас", $matches);var_dump($matches);
3. preg_match("/(.*)\b.+/us","асд асд асд ас", $matches);var_dump($matches);
4. mb_ereg("(.*)\b.+","асд асд асд ас", $matches); var_dump($matches); 

1 and 2 (even without the u modifier) are ok. 3 is not ok (not matching at all). 4 returns "асд асд асд а?", so not ok.

Since 4 is what is actually implemented in Views, it has the problem of trimming in the middle of words, as reported in #16. And I don't know why #14 reports that the patch didn't work.

ereg is deprecated in PHP 5.3 (I don't know if it is available in PHP 6), but mb_ereg is not. So #4 would be safe to use if it worked.

yhager’s picture

@jcisio, @egd: are you sure your PHP is set up to support multibyte? I cannot reproduce any of your reports using a simple test program:

test.php:


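<?php
// One sample string per line, read from the file named "data".
$data = file_get_contents('data');
// Greedy capture up to the last word boundary that still has text after it.
$regex = "(.*)\b.+";
// Make mb_ereg() treat the input as UTF-8.
mb_regex_encoding('UTF-8');
foreach (explode("\n", $data) as $line) {
  print "$line\n";
  // Try several trim lengths, counted in characters rather than bytes.
  foreach (array(15, 20, 25) as $length) {
    $value = mb_substr($line, 0, $length);
    $found = mb_ereg($regex, $value, $matches);
    if ($found) {
      print " ($length) ==> ". $matches[1] . "\n";
    }
  }
}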

data:

Tuy nhiên, những hi vọng
Giả sử chúng tôi có 3 Apple
siêu nhỏ này là bộ xử lý
Di động của nhà sản xuất Phần Lan
khoảng cách từ đại lí đến
của hãng bao gồm ba dòng
сд асд асд ас
асд асд асд ас

And running this I get:

 $ php test.php
Tuy nhiên, những hi vọng
 (15) ==> Tuy nhiên, 
 (20) ==> Tuy nhiên, những
 (25) ==> Tuy nhiên, những hi 
Giả sử chúng tôi có 3 Apple
 (15) ==> Giả sử 
 (20) ==> Giả sử chúng 
 (25) ==> Giả sử chúng tôi 
siêu nhỏ này là bộ xử lý
 (15) ==> siêu nhỏ 
 (20) ==> siêu nhỏ này 
 (25) ==> siêu nhỏ này là 
Di động của nhà sản xuất Phần Lan
 (15) ==> Di động 
 (20) ==> Di động của 
 (25) ==> Di động của nhà 
khoảng cách từ đại lí đến
 (15) ==> khoảng cách
 (20) ==> khoảng cách từ
 (25) ==> khoảng cách từ 
của hãng bao gồm ba dòng
 (15) ==> của hãng 
 (20) ==> của hãng bao 
 (25) ==> của hãng bao gồm ba
сд асд асд ас
 (15) ==> сд асд 
 (20) ==> сд асд асд
 (25) ==> сд асд асд 
асд асд асд ас
 (15) ==> асд асд
 (20) ==> асд асд 
 (25) ==> асд асд асд 

egd’s picture

@yhager - I have run your test code and it indeed works.

I believe the whole confusion here (@merlinofchaos - Sorry about that) comes from the fact that the patch is not in the "6.x-2.10" (datestamp = "1270766108") version of views module as seen on drupal.org.

Or am I missing something?

jcisio’s picture

Confirmed! The patch is not in 6.x-2.x-dev. Applied against views.module, it should now work!

Thanks, @yhager.

I also found out why mb_ereg("(.*)\b.+","асд асд асд ас", $matches); var_dump($matches); didn't work for me. My PHP doesn't use UTF-8 by default, so the call to mb_regex_encoding('UTF-8'); is necessary.

yhager’s picture

Status: Active » Needs review
FileSize
1.01 KB

Assuming the maintainer wants patches to flow from 6.x-3.x backwards, I am attaching a trivial reroll against the DRUPAL-6--3 branch (untested).
Can someone please test this and RTBC?

yhager’s picture

Version: 6.x-2.x-dev » 6.x-3.x-dev

egd’s picture

Status: Needs review » Reviewed & tested by the community

Thanks yhager. Your patch works for me.

jcisio’s picture

Ok, I've just downloaded 6.x-3.x-dev and the patch works without any notices.

dawehner’s picture

FileSize
2.76 KB

I wrote a simpletest for your test examples, which failed in some places. Then I applied the patch, and with the patch it worked fine.

merlinofchaos’s picture

Status: Reviewed & tested by the community » Fixed

Committed to all branches. Test committed to all 3.x branches. Thanks!

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

remaye’s picture

Sorry for re-opening this thread.
I'm wondering whether this bug could also happen with accented characters in French?

Also, I'm using Views 6.x-3.0+53-dev (2012-Mar-06), where this bug should normally be fixed, right?

Let me know if you think this should be a separate issue, but the problem is this:

With "Trim only on a word boundary" checked, words are cut before or after any accented character in the middle of the word:
for example, "référence" can be cut at "r" or "ré" or "réf" or "réfé" but not at "référ", "référe", "référen" ...

There is also some strange behaviour, such as:
"verrieres avec désenfumage" is cut at "verrieres avec..." but
"verrières avec désenfumage" is cut at "verriè..."
as if "è" counted for more than one character!?

I'm quite confused about what is happening.
Thanks for letting me know what you think...

jcisio’s picture

Do the Views test cases pass on your system? In a UTF-8 encoded string, "è" is one character but two bytes, so the string should not be cut at that position.
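As a small illustration of the character/byte distinction, here is a Python sketch (not PHP, and not the Views code). The helper name char_and_byte_counts is hypothetical. The decomposed (NFD) form at the end is shown only as another way "è" can occupy more than one unit; nothing in this thread confirms that this is what happens in the report above.

```python
import unicodedata

def char_and_byte_counts(text):
    """Return (number of code points, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))

word = "verri\u00e8res"  # "verrières" with a precomposed è
print(char_and_byte_counts(word))  # (9, 10)

# Cutting by bytes can split the two-byte "è"; the dangling lead byte is
# dropped here when decoding with errors="ignore".
byte_cut = word.encode("utf-8")[:6].decode("utf-8", errors="ignore")
print(byte_cut)  # verri

# In decomposed form, "è" is "e" plus a combining accent: two code points.
decomposed = unicodedata.normalize("NFD", word)
print(len(decomposed))  # 10
```

A trim that counts bytes, or that sees decomposed text, can therefore land in the middle of what the reader perceives as one letter.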