Trim field to a maximum length - Multibyte encodings [#513396]

I have come across a bug in Views 2 on Drupal 6.12.
I have a View that displays the title, teaser (with HTML entities) and link to node. The content is in Greek (when I tested with Lorem ipsum text containing HTML entities, there was NO problem; trim worked as expected). When I set the teaser to be trimmed at, say, 320 characters, with "Trim only on a word boundary" and "Add an ellipsis" enabled, the output is:

Σκεφτείτε να έλειπε ο γαμπρός από τον γάμο. Μόνη της θα παντρευόταν η νύφη; Αλώστε και η ίδια η λέξη «παντρεύομαι...

It stops at the second HTML entity. If I were to trim the teaser at 380 characters, it starts to render the entities correctly. The output is:

Σκεφτείτε να έλειπε ο γαμπρός από τον γάμο. Μόνη της θα παντρευόταν η νύφη; Αλώστε και η ίδια η λέξη «παντρεύομαι» δηλώνει την κατάσταση που θα βρεθεί η νύφη μετά το μυστήριο: Υπό του ανδρός. Κι όλα αυτά γιατί; Γιατί και ο γαμπρός έχει θέση στην εκκλησία. Δεν είναι το ίδιο φανταχτερός και βέβαια δεν έχει το ίδιο άγχος με τη νύφη. Μέσα από τις σελίδες του site...

Now if I choose not to "Trim only on a word boundary", the problem does not happen.
Selecting "Field can contain HTML" does not change the output.
The text, as seen in FCKeditor, is:

<p>Σκεφτείτε να έλειπε ο γαμπρός από τον γάμο. Μόνη της θα παντρευόταν η νύφη; Αλώστε και η ίδια η λέξη «παντρεύομαι» δηλώνει την κατάσταση που θα βρεθεί η νύφη μετά το μυστήριο: Υπό του ανδρός. Κι όλα αυτά γιατί; Γιατί και ο γαμπρός έχει θέση στην εκκλησία. Δεν είναι το ίδιο φανταχτερός και βέβαια δεν έχει το ίδιο άγχος με τη νύφη. Μέσα από τις σελίδες του site μπορείτε να κάνετε μια πρώτη επίσκεψη «στην αγορά» παίρνοντας μια ιδία για τις τάσεις και τη μόδα.</p>

Any help would be appreciated.

Comment	File	Size	Author
#29	513396_trim_mb.patch	2.76 KB	dawehner
#25	513396_trim_mb.patch	1.01 KB	yhager
#8	views-513396.patch	744 bytes	jcisio
#4	yh.patch	888 bytes	yhager
#2	yh.patch	644 bytes	yhager

Comments

Comment #1

dwb17 commented 4 September 2009 at 08:33

Does anyone have any idea?

Comment #2

yhager commented 27 January 2010 at 16:51

Version:	6.x-2.6	» 6.x-2.8
Status:	Active	» Needs review

Status	File	Size
new	yh.patch	644 bytes

This is a problem with PHP's Unicode character handling. Can you say whether the attached patch fixes this for you? (It does for me, on Hebrew characters.)

Note to Views maintainers: I do not think this patch is fit for the general public as is, but the fact is that PHP fails to find word boundaries in UTF-8 strings. Do you think there is a way to "check if this is not ASCII text, and if PHP has mb support, and only then use the mb_ereg functions in this patch"?

I am well aware that mb_ereg is deprecated - but until PHP 6 lands, I have not found any better solution. Open to suggestions.

Comment #3

dawehner commented 27 January 2010 at 19:25

Status:	Needs review	» Needs work

I think this library is not enabled everywhere, so you have to check with function_exists().

Additionally, it would be nice if you could add
// TODO: replace this with cleanstring of ctools

Comment #4

yhager commented 27 January 2010 at 21:14

Status:	Needs work	» Needs review

Status	File	Size
new	yh.patch	888 bytes

rerolled.

Comment #5

jcisio commented 8 February 2010 at 23:44

Title:	Trim field to a maximum length - Greek language	» Trim field to a maximum length - Multibyte encodings

I think the "word boundary" character \b is not what we want here. Definition:

A word boundary is a position in the subject string where the current character and the previous character do not both match \w or \W (i.e. one matches \w and the other matches \W), or the start or end of the string if the first or last character matches \w, respectively.

That means "abc xyz" cut at 5 characters would return "abc " instead of "abc" (what we want), because the last word boundary in the cut string falls after the space, between the 4th and 5th characters.

I replaced
if (preg_match("/(.*)\b.+/us", $value, $matches)) {
with
if (preg_match("/(.*)\s.*/s", $value, $matches)) {

It works for UTF-8 Vietnamese strings. Please confirm that it works for other languages, too.
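
For readers who want to see the difference between the two patterns, here is a minimal sketch on the ASCII example from this comment. The variable names are illustrative only and the snippet is not taken from any of the patches.

```php
<?php
// "abc xyz" already cut to 5 characters, as in the example above.
$value = 'abc x';

// Original pattern: the capture ends at the last word boundary, which sits
// after the space, so the captured text keeps a trailing space.
preg_match('/(.*)\b.+/us', $value, $boundary);

// Proposed pattern: the capture stops before the last whitespace character.
preg_match('/(.*)\s.*/s', $value, $whitespace);

var_dump($boundary[1]);   // string(4) "abc "
var_dump($whitespace[1]); // string(3) "abc"
```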

Comment #6

yhager commented 9 February 2010 at 06:44

> That means "abc xyz" cut at 5 character would return "abc " instead of "abc" (what we want), because the 4th and 5th characters are both word boundary.

True, but if you cut it at 3, you get 'abc' and there is no whitespace there, so you lose the entire word. With word-boundary search, you still get it.

Comment #7

jcisio commented 9 February 2010 at 09:25

I don't think I understand what you mean by "lose the entire word". Actually "abc xyz" cut at 5 returns "abc ...", but it should return "abc...", shouldn't it?

The code I submitted is a change to the original code. I don't use the "u" modifier because it works without it. With "u" it doesn't work, and I don't know why, so it needs more testing.

Comment #8

jcisio commented 9 February 2010 at 09:40

Status	File	Size
new	views-513396.patch	744 bytes

Submitting a patch against 6.x-2.x so that it's easier to test. The only problem I can see is that "abc xyz www" trimmed at 7 will return "abc". But that's another problem.

Comment #9

yhager commented 9 February 2010 at 21:51

> I don't think I understand what you mean by "lose the entire word".

Try to cut "abc xyz" at 3 with your patch, and you get an empty string instead of 'abc'.

> The only problem I can see is that "abc xyz www" trimmed at 7 will return "abc". But that's another problem.

No, it's the same problem. If you happen to cut the string *exactly* at the end of a word, you lose that last word in the result.

Comment #10

jcisio commented 10 February 2010 at 10:32

Not the mb encoding problem anyway ;)

If we're willing to change the issue title, I propose replacing the empty string with t('(empty)'). It is the users who must take care of their titles, and the admin should always set the length to at least 10-15.

I don't see any case where we need to trim at a word boundary and the defined length is less than 3-4 times the average word length. Losing the last word is not a problem either, as we don't need an exact length.

Finally, if we really need it: just check whether the (N+1)th character is \s, and if so increase N by 1 before doing the truncation.

PS: \s is dependent on the environment, but usually it is [\t\r\n ] or more, so we may want to replace \s with something like [\s\.\?,;]
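
The (N+1)th-character check described above could be sketched roughly as follows. This is a sketch under assumptions, not code from any patch in this issue: trim_on_whitespace() is a hypothetical helper, and it assumes the mbstring extension is available for character-based cutting.

```php
<?php
// Hypothetical helper, not part of Views: trim $text to $n characters on
// whitespace, keeping the last word when the cut lands exactly at its end.
function trim_on_whitespace($text, $n) {
  // Nothing to trim if the text already fits.
  if (mb_strlen($text, 'UTF-8') <= $n) {
    return $text;
  }
  // If the character right after the cut is whitespace, include it so the
  // regex below treats the last word as complete.
  $next = mb_substr($text, $n, 1, 'UTF-8');
  if ($next !== '' && preg_match('/^\s$/u', $next)) {
    $n++;
  }
  $value = mb_substr($text, 0, $n, 'UTF-8');
  // Drop everything from the last whitespace character onwards.
  if (preg_match('/(.*)\s.*/su', $value, $matches)) {
    $value = $matches[1];
  }
  return $value;
}

var_dump(trim_on_whitespace('abc xyz www', 7)); // string(7) "abc xyz"
var_dump(trim_on_whitespace('abc xyz www', 6)); // string(3) "abc"
```

With the check in place, a cut that lands exactly at the end of a word keeps that word, while a cut in the middle of a word still drops the partial word.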

Comment #11

merlinofchaos commented 13 March 2010 at 01:17

The problem presented in #5 can be solved by adding a trim() function after the word boundary check, rather than trying to redefine word boundary which regex actually does pretty well for us.

The solution in #4 seems ok (though the patch name is generic -- putting the issue # in the patch name is helpful, as I have a LOT of patches in my views directory =) ), and I'm going to go ahead and go with it. I think it will do what we want. The fact that we keep a space after word boundary testing is a different issue, I think, and can be addressed separately. #4 is committed to all branches.

jcisio, would you like to create a new issue for the trailing whitespace? A simple trim() should fix it.
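
A minimal sketch of that follow-up, assuming the value has already been cut to the maximum length. It uses rtrim() rather than trim() so that only the trailing whitespace is touched; that choice is an assumption of this example, not something decided in the thread.

```php
<?php
// Value already cut to the maximum length ("abc xyz" cut at 5 characters).
$value = 'abc x';

// Word-boundary step: keep everything up to the last word boundary.
if (preg_match('/(.*)\b.+/us', $value, $matches)) {
  $value = $matches[1];
}

// Follow-up step: strip the whitespace the boundary match leaves behind.
$value = rtrim($value);

var_dump($value); // string(3) "abc"
```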

Comment #12

merlinofchaos commented 13 March 2010 at 01:18

Status:	Needs review	» Fixed

Comment #13

yhager commented 13 March 2010 at 06:36

Thanks for committing this - sorry for the patch name confusion.

Comment #14

OnkelTem commented 15 March 2010 at 19:50

All Russian text is cut to a zero-length string when using this Views feature, for example on a Teaser with the "cut by words" checkbox selected.
I don't understand: is this really FIXED or not, and if yes, which Views version should I download to get it working? Or which patch should I apply?

P.S. The patch views-513396.patch has nothing to do with my issue. The Russian text simply disappears.

regards

UPD. http://drupal.org/node/376722#comment-2467170 - fix to replace silly preg_match's "\b"

Comment #15

29 March 2010 at 20:00

Status:	Fixed	» Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Comment #16

jcisio commented 12 April 2010 at 15:58

Version:	6.x-2.8	» 6.x-2.x-dev
Status:	Closed (fixed)	» Active

This patch doesn't work. Random latest posts on my homepage display like this (I cut the beginning, just leave the relevant part):

Tuy nhiên, những hi vọng
Tuy nhiên, những... (correct)

Giả sử chúng tôi có 3 Apple
Giả sử chú... (wrong)

siêu nhỏ này là bộ xử lý
siêu nhỏ này là b... (wrong)

Di động của nhà sản xuất Phần Lan
Di động của nhà sản xuất... (correct)

khoảng cách từ đại lí đến
khoảng cách từ đại... (correct)

của hãng bao gồm ba dòng
của hãng bao gồ... (wrong)

3 correct, 3 wrong.

However, the patch that I submitted in #8 works. Maybe my PHP has already built-in multibyte support?

Comment #17

egd commented 12 April 2010 at 21:17

I can confirm that the submitted patch does not fix the problem. Here is an example with Bulgarian.

preg_match("/(.*)\s.+/us","асд асд асд ас", $matches); var_dump($matches);

produces:

array(2) {
  [0]=>
  string(25) "асд асд асд ас"
  [1]=>
  string(20) "асд асд асд"
}

whereas

preg_match("/(.*)\b.+/us","асд асд асд ас", $matches); var_dump($matches);

produces:

array(0) {
}

which is a problem.

Comment #18

merlinofchaos commented 12 April 2010 at 19:56

If you've got Bulgarian, it should be using mb_ereg, not preg_match.

Comment #19

egd commented 12 April 2010 at 21:02

views_trim_text() should be able to handle Bulgarian as well :-).

Anyway, I am new to PHP. Isn't POSIX regex support (e.g. ereg) supposed to be dropped in PHP6?

Comment #20

merlinofchaos commented 12 April 2010 at 21:32

> views_trim_text() should be able to handle Bulgarian as well

1) I can only offer what PHP gives me
2) preg is known to not be multibyte safe

Statements like this are annoying. Yes I would freakin' love to support every language everywhere all the time. Thank you for enlightening me.

> Anyway, I am new to php. Isn't POSIX regex support (e.g. ereg) supposed to be dropped in PHP6?

Then maybe in PHP6 we'll have a mb_preg_replace() or something.

My *point* was that the code you're showing to fail is not the path that the code will follow. The patch is set up to use mb_ereg if multibyte is available and preg if not. If you're using Bulgarian and do not have the mb_ library, then preg is our best effort. It turns out our best effort isn't going to work.

Comment #21

jcisio commented 12 April 2010 at 22:47

A few tests:

1. preg_match("/(.*)\s.+/us","асд асд асд ас", $matches);var_dump($matches);
2. preg_match("/(.*)\s.+/s","асд асд асд ас", $matches);var_dump($matches);
3. preg_match("/(.*)\b.+/us","асд асд асд ас", $matches);var_dump($matches);
4. mb_ereg("(.*)\b.+","асд асд асд ас", $matches); var_dump($matches);

1 and 2 (even without the u modifier) are ok. 3 is not ok (not matching at all). 4 returns "асд асд асд а?", so not ok.

As 4 is what is actually implemented in Views, it has the problem of trimming in the middle of words, as reported in #16. And I don't know why #14 reports that the patch didn't work.

ereg is deprecated in PHP 5.3 (I don't know if it is available in PHP6), but mb_ereg is not. So #4 would be safe to use if it worked.

Comment #22

yhager commented 13 April 2010 at 04:43

@jcisio, @egd: are you sure your PHP is set up to support multibyte? I cannot recreate any of your reports using a simple test program:

test.php:


<?php
$data = file_get_contents('data');
$regex = "(.*)\b.+";
mb_regex_encoding('UTF-8');
foreach (explode("\n", $data) as $line) {
  print "$line\n";
  foreach (array(15, 20, 25) as $length) {
    $value = mb_substr($line, 0, $length);
    $found = mb_ereg($regex, $value, $matches);
    if ($found) {
      print " ($length) ==> ". $matches[1] . "\n";
    }
  }
}

data:

Tuy nhiên, những hi vọng
Giả sử chúng tôi có 3 Apple
siêu nhỏ này là bộ xử lý
Di động của nhà sản xuất Phần Lan
khoảng cách từ đại lí đến
của hãng bao gồm ba dòng
сд асд асд ас
асд асд асд ас

And running this I get:

 $ php test.php
Tuy nhiên, những hi vọng
 (15) ==> Tuy nhiên, 
 (20) ==> Tuy nhiên, những
 (25) ==> Tuy nhiên, những hi 
Giả sử chúng tôi có 3 Apple
 (15) ==> Giả sử 
 (20) ==> Giả sử chúng 
 (25) ==> Giả sử chúng tôi 
siêu nhỏ này là bộ xử lý
 (15) ==> siêu nhỏ 
 (20) ==> siêu nhỏ này 
 (25) ==> siêu nhỏ này là 
Di động của nhà sản xuất Phần Lan
 (15) ==> Di động 
 (20) ==> Di động của 
 (25) ==> Di động của nhà 
khoảng cách từ đại lí đến
 (15) ==> khoảng cách
 (20) ==> khoảng cách từ
 (25) ==> khoảng cách từ 
của hãng bao gồm ba dòng
 (15) ==> của hãng 
 (20) ==> của hãng bao 
 (25) ==> của hãng bao gồm ba
сд асд асд ас
 (15) ==> сд асд 
 (20) ==> сд асд асд
 (25) ==> сд асд асд 
асд асд асд ас
 (15) ==> асд асд
 (20) ==> асд асд 
 (25) ==> асд асд асд

Comment #23

egd commented 13 April 2010 at 05:48

@yhager - I have run your test code and it indeed works.

I believe the whole confusion here (@merlinofchaos - Sorry about that) comes from the fact that the patch is not in the "6.x-2.10" (datestamp = "1270766108") version of views module as seen on drupal.org.

Or am I missing something?

Comment #24

jcisio commented 13 April 2010 at 06:30

Confirmed! The patch is not in 6.x-2.x-dev. Now it should be applied against views.module and work!

Thanks, @yhager.

I also found out why mb_ereg("(.*)\b.+","асд асд асд ас", $matches); var_dump($matches); didn't work for me: my PHP doesn't use UTF-8 by default, so the call mb_regex_encoding('UTF-8'); is necessary.
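
Putting the points from this discussion together (check that mb_ereg() exists, set the regex encoding explicitly, fall back to preg otherwise), a sketch might look like the following. trim_to_word_boundary() is a hypothetical name and this is not the committed patch; restoring the previous regex encoding afterwards is an extra precaution assumed here.

```php
<?php
// Hypothetical helper, not the committed patch: cut $value back to the last
// word boundary, preferring mb_ereg() when the mbstring extension is loaded.
function trim_to_word_boundary($value) {
  if (function_exists('mb_ereg')) {
    // Remember the current regex encoding so it can be restored afterwards.
    $previous = mb_regex_encoding();
    mb_regex_encoding('UTF-8');
    $found = mb_ereg('(.*)\b.+', $value, $matches);
    mb_regex_encoding($previous);
  }
  else {
    // Best effort without mbstring.
    $found = preg_match('/(.*)\b.+/us', $value, $matches);
  }
  return $found ? $matches[1] : $value;
}

var_dump(trim_to_word_boundary('асд асд асд ас'));
```

As noted earlier in the thread, the result can keep a trailing space, so a caller may still want to strip it.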

Comment #25

yhager commented 13 April 2010 at 07:51

Status:	Active	» Needs review

Status	File	Size
new	513396_trim_mb.patch	1.01 KB

Assuming the maintainer wants patches to flow from 6.x-3.x backwards, I am attaching a trivial reroll against the DRUPAL-6--3 branch (untested).
Can someone please test this and RTBC?

Comment #26

yhager commented 13 April 2010 at 07:51

Version:	6.x-2.x-dev	» 6.x-3.x-dev

Comment #27

egd commented 13 April 2010 at 08:29

Status:	Needs review	» Reviewed & tested by the community

Thanks yhager. Your patch works for me.

Comment #28

jcisio commented 13 April 2010 at 09:49

Ok, I've just downloaded 6.x-3.x-dev and the patch works without any notices.

Comment #29

dawehner commented 22 April 2010 at 21:45

Status	File	Size
new	513396_trim_mb.patch	2.76 KB

I wrote a simpletest for your test examples, which failed in some places. After that I applied the patch, and with the patch it worked fine.

Comment #30

merlinofchaos commented 29 April 2010 at 18:26

Status:	Reviewed & tested by the community	» Fixed

Committed to all branches. Test committed to all 3.x branches. Thanks!

Comment #31

13 May 2010 at 18:30

Status:	Fixed	» Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Comment #32

remaye commented 9 May 2012 at 15:36

Sorry for re-opening this thread,
I'm wondering if this bug could also happen with accented characters in French?

Also, I'm using Views 6.x-3.0+53-dev (2012-mar-06), and normally this bug should be fixed, right?

So let me know if you think this should be a different issue, but the problem is:

With "Trim only on a word boundary" checked, words are cut before or after any accented character in the middle of the word:
for example, "référence" can be cut at "r" or "ré" or "réf" or "réfé" but not at "référ", "référe", "référen" ...

And there is also some strange behaviour, such as:
"verrieres avec désenfumage" cut at "verrieres avec..." but
"verrières avec désenfumage" cut at "verriè..."
as if "è" counted for more than one character!?

I'm quite confused about what is happening.
Thanks for letting me know what you think...

Comment #33

jcisio commented 10 May 2012 at 07:11

Do the Views test cases pass on your system? In a UTF-8 encoded string, "è" is one character but two bytes, but it should not be cut at that position.
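
To illustrate the character/byte distinction mentioned above, here is a minimal sketch, assuming PHP 7+ and the mbstring extension. It does not reproduce what Views does internally; it only shows how a byte-based cut differs from a character-based one.

```php
<?php
// "\u{00E8}" is the precomposed "è" (PHP 7+ escape syntax), used here so the
// example does not depend on how this file itself is encoded.
$e_grave = "\u{00E8}";
var_dump(strlen($e_grave));             // int(2): bytes
var_dump(mb_strlen($e_grave, 'UTF-8')); // int(1): characters

// Cutting by bytes can split the character and leave invalid UTF-8 behind;
// cutting by characters keeps it whole.
$word = "verri\u{00E8}res";
$byte_cut = substr($word, 0, 6);
$char_cut = mb_substr($word, 0, 6, 'UTF-8');
var_dump(mb_check_encoding($byte_cut, 'UTF-8')); // bool(false)
var_dump($char_cut === "verri\u{00E8}");         // bool(true)
```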
