truncate_utf8() only works for latin languages (and drupal_substr has a bug) [#768040]

Comment	File	Size	Author
#75	768040-bettercomments.patch	21.48 KB	jhodgdon

#71	768040-reroll.patch	21.06 KB	jhodgdon

#68	768040-fix-substr.patch	21.06 KB	jhodgdon

#65	768040-moretests-fails.patch	18.86 KB	jhodgdon

#56	768040-56-fixsearchcomment.patch	12.13 KB	janusman

#53	768040-53-fixsearchcomment.patch	15.32 KB	janusman

#52	768040-52-fixsearchcomment.patch	12.44 KB	janusman

#47	768040-fixsearchcomment.patch	12.51 KB	jhodgdon

#45	768040-v7-alternative.patch	12.53 KB	codycraven

#43	768040-v7-alternative.patch	12.47 KB	codycraven

#37	768040-v7.patch	12.58 KB	jhodgdon

#32	768040-v6.patch	7.47 KB	jhodgdon

#31	768040-v5.patch	7.46 KB	jhodgdon

#25	768040-v4.patch	6.9 KB	jhodgdon

#23	768040-v3.patch	11.43 KB	jhodgdon

#8	768040-v2.patch	11.15 KB	jhodgdon

#4	768040.patch	6.37 KB	jhodgdon

Comment #1

Damien Tournoud CreditAttribution: Damien Tournoud commented 11 April 2010 at 14:18

Indeed, there is clearly room for improvement here.

For (a), splitting "words" at PREG_CLASS_SEARCH_EXCLUDE boundaries would go a long way already (there cannot be any perfect solution without (1) knowing the language, (2) a corpus and complex machine learning algorithms).

For (b), allowing to translate the ellipsis could also probably cover 90% of the use cases.

Log in or register to post comments

Comment #2

Damien Tournoud CreditAttribution: Damien Tournoud commented 11 April 2010 at 14:19

Title:

truncate_utf8 features are not locale-aware

» truncate_utf8() only works for latin languages

Clarifying the title.

Log in or register to post comments

Comment #3

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 11 April 2010 at 14:27

Priority:

Normal

» Critical

Issues related to this:

#11883: Add test coverage for auto-comment title functionality and word-breaking
#34171: Title truncation from description always removes last word
#86461: Truncate long paths and path aliases on admin/build/path
#134090: Change 'Top referrers' report to 'Recent referrers'
#147045: Forum: Link to last post in 'Last post' column
#200185: truncate_utf8() is used as a substring function
#269911: Search result trimming should not fall inside HTML entities/tags
#362609: node teaser generation breaks when using entities
#404356: Aggregator should strip tags before using beginning of description tag as title
#504376: Watchdog entries go missing because of non-plain link in the watchdog view
#718662: DBLog listings truncate messages in the middle of HTML tags

As there are so many other issues that rely on this functionality, I'm bumping this one up to critical.

There is also
#180452: Need new truncate function with more features
which is a D8 feature request at this point.

Log in or register to post comments

Comment #4

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 13 April 2010 at 00:01

Status:

Active

» Needs review

File	Size
768040.patch	6.37 KB

Here's a first attempt at a patch using methodology discussed above. It may change something in existing code, because the word boundary character class includes punctuation and symbols. But the change should be for the better.

Note that the patch moves the regexp class from search.module to unicode.inc (with a rename), and translates the '...' added at the end.

Let's see what the test bot says and go from there...

Log in or register to post comments

Comment #5

13 April 2010 at 00:40

Status:

Needs review

» Needs work

The last submitted patch, 768040.patch, failed testing.

Log in or register to post comments

Comment #6

Damien Tournoud CreditAttribution: Damien Tournoud commented 13 April 2010 at 01:19

You will have to be smarter then that :)

I suggest the following heuristic:

Find the longest substring that is <= $len (you can use the regexp to do that, but anchor it to the start and limit the .* to $len characters)
If that string is > $len / 2, use it as-is, else truncate at a non-word safe position

Log in or register to post comments

Comment #7

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 13 April 2010 at 15:21

Assigned:

Unassigned

» jhodgdon

What? There were only 25,480 exceptions. I don't see any problem. :)

Must have been a Monday patch... I'll see if I can do better (and test it this time before submitting it).

Log in or register to post comments

Comment #8

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 13 April 2010 at 15:53

Status:

Needs work

» Needs review

File	Size
768040-v2.patch	11.15 KB

Here's a new patch.

I'm not sure about the heuristic of only truncating to a word boundary if the string would be 1/2 of the desired length or more. That's not how the old function worked, and I'm not sure we want to change it in that way.

So, this patch has the benefit of actually working (at least, the dblog looks reasonable). It may cause a few test failures, though, since the definition of a word boundary has changed to include punctuation. Let's see what the test bot says this time...

Log in or register to post comments

Comment #9

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 13 April 2010 at 16:01

I should also note that I think putting a space before the ... is silly. I left that in there for backwards compatibility, but I think the normal thing (in English at least) would be to have text like:

This is a really long sentence that has been...

rather than

This is a really long sentence that has been ...

Thoughts?

Log in or register to post comments

Comment #10

Damien Tournoud CreditAttribution: Damien Tournoud commented 13 April 2010 at 16:20

Agreed on the uselessness of the space prefix.

// space exists AND is not on position 0

^ There was an heuristic already. Basically we don't want to cut a string too short simply because the last ponctuation sign appears too soon in the string.

Log in or register to post comments

Comment #11

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 13 April 2010 at 16:31

The patch in #8 requires at least one character in the string to use the word boundary, which is the same as the original heuristic. I think if we apply any other heuristics, it should be an additional input to the function, like $min_length. Thoughts?

So I guess we are agreed that there shouldn't be a space before the ...

I can remove that (and likely will need to fix a few tests where it's assuming the space is there). It looks like the patch in #8 works this time, though, which is good.

Log in or register to post comments

Comment #12

jpmckinney CreditAttribution: jpmckinney commented 13 April 2010 at 20:19

I never liked "...". "..." is three periods. Ellipsis (… = …) is distinct from three periods. Ellipsis is the correct punctuation.

Log in or register to post comments

Comment #13

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 13 April 2010 at 21:08

The helip entity may be the correct punctuation, but it is probably not the right punctuation for all cases, because sometimes you just need ascii characters and not HTML entities.

We could conceivably add another input argument to this function to give you the option of ... or the elipses entity. Or callers of this function could not use the ... argument and add helip themselves.

But, as discussed before, ... or even the entity is probably not the right thing for non-latin languages anyway. Which is why t('...') is probably the right thing to do... I think?

Log in or register to post comments

Comment #14

jpmckinney CreditAttribution: jpmckinney commented 13 April 2010 at 23:04

Yeah, … would be a poor choice, given that check_plain is often called after passing a string through truncate_utf8. But, would it be possible instead to put t('…') in place of t('...')? I don't know what the policy is on putting non-ASCII characters in the code.

Log in or register to post comments

Comment #15

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 13 April 2010 at 23:09

Not sure... is it a valid UTF-8 character? I also don't understand why ... in plain ASCII is so objectionable?

Log in or register to post comments

Comment #16

jpmckinney CreditAttribution: jpmckinney commented 14 April 2010 at 02:07

Typography snobs would object, but ... is otherwise quite alright.

Log in or register to post comments

Comment #17

Anonymous (not verified) CreditAttribution: Anonymous commented 14 April 2010 at 03:02

The orizontal ellipsis is a UTF-8 character (E2 80 A6).
I don't see any particular reason to use that character instead of 3 dots, except that it would allow to see two characters more from the original string (which is not a valid reason, IMO).

The character to use when truncating is a secondary detail; the most important thing would be to make the function work for more languages (if that is possible).

Log in or register to post comments

Comment #18

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 14 April 2010 at 13:45

Issue tags:

+Needs documentation

OK... [that was three dots :) ]

So what about the patch in #8 then? Questions/comments that have been brought up:
a) Should we have a heuristic like "must be at least half the string" rather than the current "must be at least 1 character" to use word-safe? My inclination is not, unless we make it an input (let the caller specify the minimum length for wordsafe truncation). I'd be inclinded to leave the default at 1 if we make it an input, for backwards compatibility.
b) Should we remove the space before the ... that is appended?
c) Should we switch to appending the horizontal elipses character/entity rather than three . characters?

The patch in #8 does none of the above, but is not backwards-compatible because it is different on what it considers a word boundary (previously it was a space, now it's a space, punctuation, or CJK word-boundary indicators).

I am inclined to say yes on (b) and no on (a) and (c), but am open to community consensus because they are all reasonable suggestions in my opinion.

In any case, we we need to document this change in the module upgrade guide when we decide/commit, and likely update tests if we make any of the changes above, and update the function header along with the code if we implement any of the three.

It would also be great if someone could come up with a CJK word-safe trunctation test.

Log in or register to post comments

Comment #19

jpmckinney CreditAttribution: jpmckinney commented 14 April 2010 at 15:20

I would be happy with just b).

Log in or register to post comments

Comment #20

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 15 April 2010 at 13:04

That's two for (b)... anyone else?

Log in or register to post comments

Comment #21

Anonymous (not verified) CreditAttribution: Anonymous commented 15 April 2010 at 14:17

I agree that implementing only point b is preferable. The reported patch seems good to me, and I agree that a different euristic should probably be activated by an optional argument; in that way it is guaranteed the compatibility with the past code.

Log in or register to post comments

Comment #22

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 15 April 2010 at 14:47

Damien (see comment #10 above) also agreed about (b), and it sounds like we're all "OK but not mandatory" about (a). I'll see what I can do.

Log in or register to post comments

Comment #23

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 15 April 2010 at 15:02

File	Size
768040-v3.patch	11.43 KB

Here's a new patch.

Changes from orginal function:
1. If using wordsafe, finds non-latin-safe word boundary (including punctuation), instead of specifically looking for ' ' as word boundary.
2. If appending dots, appends t('...') instead of ' ...' -- so '...' is translated, and there is no space before the ...
3. If using wordsafe mode, allows you to specify a minimum length (defaults to 1 character for backwards compatibility).

2 & 3 are changes from last patch.

Let's see what the test bot says. There may be tests that are looking for 'truncated stuff ...' somewhere and now they'll have 'truncated stuff...' without the extra space.

It would be great if someone could provide a CJK test of this function, too, and maybe we should write some unit tests too?

Log in or register to post comments

Comment #24

drewish CreditAttribution: drewish commented 17 April 2010 at 00:32

This patch looks pretty good to me but the comments might need a little tweaking:

  * @param $dots
- *   Flag to add trailing dots. Defaults to FALSE.
+ *   If TRUE, add t('...') to the end of the truncated string (defaults FALSE).

It seems like "Defaults to FALSE." is the way we normally say it.

+ * @param $min_length
+ *   If using word-safe mode, minimum length to truncate to before appending
+ *   dots (if using dots).
+ *

Just from reading that it seems unclear what this does... the interactions with $length and $dots aren't clear without reading the code.

Log in or register to post comments

Comment #25

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 18 April 2010 at 17:19

File	Size
768040-v4.patch	6.9 KB

OK, how about this version? I decided it might make things clearer to change the name of the 2nd arg for clarity, and I rewrote the rest of the doc.

Log in or register to post comments

Comment #26

drewish CreditAttribution: drewish commented 19 April 2010 at 18:15

This looks much better.

Does it make sense to change $max_return_length to $max_length?

$min_length's help text is better but it's not clear to me why you'd ever want to use that. Also "Has no effect if $wordsafe is not TRUE." seems like we could change not TRUE to FALSE.

Log in or register to post comments

Comment #27

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 19 April 2010 at 18:36

As for why to have $min_length, this is discussed above in #10, #18, #21. The reason is that if the last space is only after a couple of characters, followed by maybe a really long URL with no spaces in it, you probably don't want to return that really short string because no one will understand what it is supposed to be.

I changed max_length to max_return_length (which refers to string with dots) to distinguish it from min_length (which refers to the string without dots). It seemed wrong for min_length and max_length to not be parallel, but they are used for different reasons presumably (max to truncate for db storage, min for understandability).

Thoughts?

I can make another go of clarifying the doc...

Log in or register to post comments

Comment #28

drewish CreditAttribution: drewish commented 19 April 2010 at 18:39

That URL example should go into the docs... that's exactly the answer I'd like to see when reading them.

Log in or register to post comments

Comment #29

Anonymous (not verified) CreditAttribution: Anonymous commented 19 April 2010 at 19:32

I am a typography snob :-)

The main reason it makes a difference is that the dots in an ellipsis ought to spaced further apart than three dots get spaced by most faces/engines. (And less far apart than if they had actual spaces between them.) A second reason is that logically the thing is a single symbol, not three symbols.

If we cannot rely on everything being able to deal with the Unicode ellipsis character or HTML entity, then I would not want to argue for using either. That seems like asking for trouble.

Using t('...') allows those of us that really care about this to put a translation from three dots to the Unicode character. And in the forthcoming Nirvana where all software knows about Unicode, this could be revised to do the actually right thing.

Log in or register to post comments

Comment #30

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 19 April 2010 at 23:21

RE #28 -- OK, I will add that example to the patch.

Log in or register to post comments

Comment #31

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 19 April 2010 at 23:53

File	Size
768040-v5.patch	7.46 KB

OK, here's a new patch...

Log in or register to post comments

Comment #32

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 20 April 2010 at 00:04

File	Size
768040-v6.patch	7.47 KB

One more go after some verbal comments from drewish...

Log in or register to post comments

Comment #33

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 29 April 2010 at 23:00

Bump. Can we get this reviewed and hopefully committed? I have some search patches to do that will need to use this preg class whose name is changing...

Log in or register to post comments

Comment #34

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 29 April 2010 at 23:01

#32: 768040-v6.patch queued for re-testing.

Log in or register to post comments

Comment #35

codycraven CreditAttribution: codycraven commented 30 April 2010 at 03:37

Review of #32:

+++ includes/unicode.inc	20 Apr 2010 00:03:47 -0000
@@ -213,44 +274,70 @@
+ * @param $max_return_length

What is the reason for renaming $len to $max_return_length? I would much rather see $max_length for consistency with $min_length.

+++ includes/unicode.inc	20 Apr 2010 00:03:47 -0000
@@ -213,44 +274,70 @@
  * @param $dots

Since we are modifying variable names, I would like to see $dots replaced with $ellipsis as it is more correct and global rather than the colloquial dot (short for dot-dot-dot) - especially considering that we are making this translatable.

+++ includes/unicode.inc	20 Apr 2010 00:03:47 -0000
@@ -213,44 +274,70 @@
+  if ($dots) {
+    $dotstring = t('...');
+    $max_return_length -= drupal_strlen($dotstring);
+  }

Bad practice to only define a variable in the event of an if statement.

This should be changed to something like:

$dotstring = '';
if ($dots) {
  $dotstring = t('...');
  $max_return_length -= drupal_strlen($dotstring);
}

+++ includes/unicode.inc	20 Apr 2010 00:03:47 -0000
@@ -213,44 +274,70 @@
+    // Find the last word boundary, if there is one within $min_length to ¶

Trailing space needs to be removed.

Log in or register to post comments

Comment #36

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 30 April 2010 at 14:24

Status:

Needs review

» Needs work

- The reason max_return_length and min_length are not named in a parallel way is that they do not behave in a parallel way. Read the doc/code for details -- one governs return length and the other is length before dots are added. I didn't want them to look like they behaved the same, so I didn't name them the same.

- Agreed on other thoughts. I'll update the patch.

Log in or register to post comments

Comment #37

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 30 April 2010 at 14:57

Status:

Needs work

» Needs review

File	Size
768040-v7.patch	12.58 KB

OK. Here's a new patch with many updated variable names and some updated documentation. I realized the URL example was wrong, since . and : are punctuation, so I modified it slightly. I also realized that the search.module chunk of this patch has been missing since that patch in #23, so this patch puts it back in.

Log in or register to post comments

Comment #38

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 30 April 2010 at 17:07

As noted in #23, do we need some tests for this function before the patch is committed? Also, the changes to the function will need to go into the module update guide.

Log in or register to post comments

Comment #39

codycraven CreditAttribution: codycraven commented 1 May 2010 at 07:30

Regarding #36 and #37

I'm glad that my feedback was helpful.

I'm still not keen on having a name such as $max_return_length as a function parameter, however if you do decide to keep it as such please use the variable consistently throughout (not just renaming it to $max_length at the beginning of the function).

If others agree that the variable should be left as $max_return_length and you do decide to keep the re-assignment to $max_length for use in the function, then please do so by reference (technically by alias in php) by doing $max_length =& $max_return_length for memory allocation reasons.

Other than that gripe, I think the patch looks pretty good. I do think that some test cases should be added, especially for the min_length functionality since it is new. Creating test-cases for the other functionality would be difficult as there is no way we can write a test case for every language to test the truncation but we should at least have some.

Log in or register to post comments

Comment #40

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 1 May 2010 at 17:46

Seriously, memory allocation for one non-array varialble worries you?

The reason I used a second variable $max_length inside the function is that it no longer has the same meaning -- it's the length to truncate the string to, not the total returned length.

Log in or register to post comments

Comment #41

codycraven CreditAttribution: codycraven commented 2 May 2010 at 03:23

jhodgdon,

It's not the memory allocation of the specific variable, it is that if this is an acceptable practice throughout the entire project you could conceivably have tens of thousands of duplicate, unnecessary variables in memory at a single instance in time on high volume sites.

I don't understand how the variable $max_length no longer has the same meaning as $max_return_length when $max_return_length is not used anywhere in the function. It seems like the variable should be named one or the other, and if it is $max_length then the documentation that you have added for the parameter already accurately explains the use of the variable.

If I am missing something regarding the functionality of $max_return_length please explain it to me. From what I understand I would feel confident in calling the function and setting a max length for the string I would want and would expect that the maximum length returned would be whatever max length I set.

Please note that I am not trying to shred your work apart, I think you have done a fantastic job on the patch and am only interested in providing an independent view from my experience background to help better the patch.

Log in or register to post comments

Comment #42

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 2 May 2010 at 14:05

It's a local variable and should be popped off the stack as soon as the function finishes, so I fail to see how it would increase the footprint of Drupal much to have a few extra bytes in memory for a miniscule amount of time. It's not like we have thousands of functions in the call stack at a time.

Roll a new patch or mark it "needs work" if you feel strongly enough about it. I'm happy with what it is now and it makes logical sense to me that "max_return_length" refers to the length of the returned string in the argument, and "max_length" refers to the length of the string before adding dots to it within the function, which is not the same thing. I don't feel strongly enough about it to object if you want to not use a different variable within the function.

Log in or register to post comments

Comment #43

codycraven CreditAttribution: codycraven commented 2 May 2010 at 21:10

Status:

Needs review

» Reviewed & tested by the community

File	Size
768040-v7-alternative.patch	12.47 KB

jhodgdon,

You are right regarding the actual memory impact in this instance. I just like to avoid setting practices that may be adopted by other contributors because they saw it in core, in which their use could result in actual impact.

From your explanation of the $max_return_length vs $max_length in your code I can see how you are viewing the instance and from that perspective does make sense.

I've attached a .patch file which is equivalent to the patch from #37 with the only changes being the renaming the $max_return_length parameter to $max_length and removing the assignment of $max_return_length to $max_length (which is no longer needed). I am offering the .patch as an alternative based on the discussions above and find no fault in the patch in #37 which I consider RTBC.

**As a side note, this is the first time I have rolled a core patch from NetBeans IDE so hopefully it applies correctly, it does in my local environment.

Log in or register to post comments

Comment #44

2 May 2010 at 21:40

Status:

Reviewed & tested by the community

» Needs work

The last submitted patch, 768040-v7-alternative.patch, failed testing.

Log in or register to post comments

Comment #45

codycraven CreditAttribution: codycraven commented 2 May 2010 at 22:27

Status:

Needs work

» Reviewed & tested by the community

File	Size
768040-v7-alternative.patch	12.53 KB

I realized that my alternative patch (#43) did not remove the assignment of $max_return_length to $max_length but only re-factored it to $max_length = $max_length. The attached patch corrects the fault.

Again as a note this patch has not been tested by anyone other than myself, although it is intended to be identical to the patch from #37 which has been reviewed, except for the re-factoring of a parameter. If this patch is chosen to commit instead of #37 then please review before commit.

**I also dumped NetBeans for generating the patch, so this one should apply.

Log in or register to post comments

Comment #46

chx CreditAttribution: chx commented 3 May 2010 at 00:32

Status:

Reviewed & tested by the community

» Needs work

Well, well. We have no better idea still, right? When working on the Unicode problems, I have looked into implementing TR29 Word Break in user space but back then I just did not have the time to attempt it. I stll dont :) but boy, it would be interesting. You would need to take http://unicode.org/Public/UNIDATA/auxiliary/WordBreakProperty.txt the generator script in the search issue and the rules in http://unicode.org/reports/tr29/#Default_Word_Boundaries . Might be able to build a pcre even that does all of that (with atomic subpatterns).

Until then, the docblock of the pcre you moved makes no sense any more "Consequently, the index only contains characters with the following" this wants to be in search module.

Log in or register to post comments

Comment #47

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 3 May 2010 at 00:45

Status:

Needs work

» Needs review

File	Size
768040-fixsearchcomment.patch	12.51 KB

RE #46 - Yeah, I missed that when I moved it over from search.module. Here's a patch with the text changed to be generic rather than talking about what gets into the search index.

It also adopts codycraven's renaming of the max_length value form the "alternative" patch, which was the name I had given the parameter back in earlier versions of the patches anyway, before codycraven objected to that name back in #35. Whatever. :)

Log in or register to post comments

Comment #48

Garrett Albright CreditAttribution: Garrett Albright commented 4 May 2010 at 17:58

Log in or register to post comments

Comment #49

YesCT CreditAttribution: YesCT commented 18 May 2010 at 08:03

Issue tags:

-Needs documentation

#47: 768040-fixsearchcomment.patch queued for re-testing.

Log in or register to post comments

Comment #50

18 May 2010 at 08:20

Status:

Needs review

» Needs work

The last submitted patch, 768040-fixsearchcomment.patch, failed testing.

Log in or register to post comments

Comment #51

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 18 May 2010 at 14:56

Status:	Needs work	» Needs review
Issue tags:		+Needs documentation

#47: 768040-fixsearchcomment.patch queued for re-testing.

Log in or register to post comments

Comment #52

janusman CreditAttribution: janusman commented 21 May 2010 at 16:27

File	Size
768040-52-fixsearchcomment.patch	12.44 KB

Did some basic testing and found out the current logic needed some tweaking.

Patch from #47:

(123, 1, true, true) == 	1...	==> Should be ...
(123, 2, true, true) == 	12...	==> Should be ...
(123, 3, true, true) == 	...	==> Should be 123 (failure)
(1234, 3, true, true) == 	...	==> Should be ...
(12345 7890, 10, true, true) == 	12345...	==> Should be 12345 7890 (failure)
(123 567 90, 10, true, true) == 	123 567...	==> Should be 123 567 90 (failure)
(1234567890, 10, true, true) == 	1234567...	==> Should be 1234567890 (failure)
(12345678901, 10, true, true) == 	1234567...	==> Should be 1234567...
(12345678901, 11, true, true) == 	12345678...	==> Should be 12345678901 (failure)
(123456789012, 11, true, true) == 	12345678...	==> Should be 12345678...

New patch: (I'm assuming the ellipsis length takes priority over the max_length in strings shorter than the ellipsis itself)

(123, 1, true, true) == 	...	==> Should be ...
(123, 2, true, true) == 	...	==> Should be ...
(123, 3, true, true) == 	123	==> Should be 123
(1234, 3, true, true) == 	...	==> Should be ...
(12345 7890, 10, true, true) == 	12345 7890	==> Should be 12345 7890
(123 567 90, 10, true, true) == 	123 567 90	==> Should be 123 567 90
(1234567890, 10, true, true) == 	1234567890	==> Should be 1234567890
(12345678901, 10, true, true) == 	1234567...	==> Should be 1234567...
(12345678901, 11, true, true) == 	12345678901	==> Should be 12345678901
(123456789012, 11, true, true) == 	12345678...	==> Should be 12345678...

Log in or register to post comments

Comment #53

janusman CreditAttribution: janusman commented 21 May 2010 at 18:16

File	Size
768040-53-fixsearchcomment.patch	15.32 KB

Above patch but with some tests. Might have gone a little overboard with the tests, though =)

Log in or register to post comments

Comment #54

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 21 May 2010 at 18:48

Status:

Needs review

» Needs work

The max length param is supposed to be a hard limit. If the person requests ellipses and the length of t('...') is > max_length, and we need to truncate, I think we should return an empty string.

Good catch on the bad logic though!

Log in or register to post comments

Comment #55

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 21 May 2010 at 18:50

Also, I think your tests are great, but maybe we should also test with some other inputs rather than hard-wiring everything except the length?

Log in or register to post comments

Comment #56

janusman CreditAttribution: janusman commented 24 May 2010 at 17:21

Status:

Needs work

» Needs review

File	Size
768040-56-fixsearchcomment.patch	12.13 KB

Hmm, how about this?

Log in or register to post comments

Comment #57

24 May 2010 at 18:30

Status:

Needs review

» Needs work

The last submitted patch, 768040-56-fixsearchcomment.patch, failed testing.

Log in or register to post comments

Comment #58

janusman CreditAttribution: janusman commented 24 May 2010 at 19:43

Now this is weird; tests fail but direct calls to the function *do* work.

This is the test bot output:

Truncate frànçAIS is über-åwesome to 23 characters is frànçAIS is über-åwesome; expected frànçAIS is über-åwesom	Other	unicode.test	235	UnicodeUnitTest->testTruncate()	
Truncate 123 to 2 characters is ...; expected ..	Other	unicode.test	290	UnicodeUnitTest->testTruncate()

However if I run this code on a block:

echo truncate_utf8("frànçAIS is über-åwesome", 23, FALSE, FALSE);

it produces the expected "frànçAIS is über-åwesom" (test claims the result is "frànçAIS is über-åwesome").

Will try to dig deeper...

Log in or register to post comments

Comment #59

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 24 May 2010 at 21:14

What do you see when you run the tests through simpletest on your testing system, as opposed to the bot?

Looking at the logic in your current truncate_utf8() function, it looks like drupal_substr() is returning 24 characters when asked for 23, or else drupal_strlen() is saying that the string length is <= 23.

  if (drupal_strlen($string) <= $max_length) {
    // No truncation needed, so don't add ellipsis, just return.
     return $string;
   }
... (if wordsafe {} )

   else {
    $string = drupal_substr($string, 0, $max_length);
   }

So there could possibly be some difference in the behavior of drupal_substr() or drupal_strlen() based on some global variables that could be different between the test bot and your system:
http://api.drupal.org/api/function/drupal_substr/7
http://api.drupal.org/api/function/drupal_strlen/7

If so, that's a bug. Would be great to debug it! You can try putting some debug() statements into your tests, see how they work on your testing system, and then submit a patch and see what the test bot says.

Log in or register to post comments

Comment #60

janusman CreditAttribution: janusman commented 24 May 2010 at 23:18

Confirming I see the same failed tests in my environment and from the test bot. Guess I failed to run those first =)

So, yes, it does *seem* like a bug of sorts, the weirdness is that seemingly the same string is truncated fine to 23 characters from within PHP in a block, but not inside the tests (I added a case to $testcase inside helperTestSubStr() and it also fails to truncate correctly to 23 characters).~~So I'm guessing this probably even be a PHP bug...~~ *edit* more probably, it's the global variable like @jhodgon mentions.

I'll release this issue for a while as I don't have any more free time for it... so feel free to take it.

Log in or register to post comments

Comment #61

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 24 May 2010 at 23:22

OK. Thanks for all the work you've done! I will see if I can find some bandwidth to debug it.

I guess it's already assigned to me.

Log in or register to post comments

Comment #62

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 4 June 2010 at 15:25

~~The patch in #56 is also missing the chunk for search.module.~~
EDIT: scratch that. I can't read.

Log in or register to post comments

Comment #63

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 4 June 2010 at 16:00

I am debugging this now. I can see the test failures. And that patch in #56 is missing the search.module chunk.

Log in or register to post comments

Comment #64

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 4 June 2010 at 16:29

Hah. I added a few more tests to unicode.test, to the part where it is testing drupal_substr() and the bug is happening there -- it is not truncating correctly.

So I will need to delve into that...

Log in or register to post comments

Comment #65

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 4 June 2010 at 16:39

Status:

Needs work

» Needs review

File	Size
768040-moretests-fails.patch	18.86 KB

Here's the current patch, so you can see drupal_substr() failing. I also modified the last set of tests a bit - modularized it slightly, and put it into the set of tests that is running in both non-multibyte and multibyte modes, for better test coverage.

So the drupal_substr() tests are failing when the global $multibyte is set to emulation mode, for several cases:
truncating '...' to two characters
truncating 'frànçAIS is über-åwesome' to 23 characters

It seems to be a problem when the requested length is 1 less than the actual string length.

I will continue to investigate, but I thought it might be useful to attach this patch now so you can see the tests failing.

Log in or register to post comments

Comment #66

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 4 June 2010 at 16:57

OK, I found where the bug is. In drupal_substr():

      // Backtrace one byte if the end of the string was not reached.
      if ($iend < $strlen - 1) {
        $iend--;
      }

The problem is that if $length is equal to one less than the string length, at this point $iend == $strlen - 1, and so it doesn't step back one byte even though it should. So this logic is wrong.

I'm going to add some additional tests for negative lengths and negative starts (there were some but not all cases were covered) and see if that has problems too.

Log in or register to post comments

Comment #67

4 June 2010 at 17:00

Status:

Needs review

» Needs work

The last submitted patch, 768040-moretests-fails.patch, failed testing.

Log in or register to post comments

Comment #68

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 4 June 2010 at 17:36

Title:	truncate_utf8() only works for latin languages	» truncate_utf8() only works for latin languages (and drupal_substr has a bug)
Status:	Needs work	» Needs review

File	Size
768040-fix-substr.patch	21.06 KB

Here's a fix.

This is the only way I could figure out to fix the logic problem in drupal_substr. It comes back all green on my test box for the unicode tests. Let's see what the testbot and reviewers say...

I added some additional test cases to the drupal_substr() tests, beyond the ones in #65, which I think should cover all the bases now. Hopefully.

Log in or register to post comments

Comment #69

4 June 2010 at 18:40

Status:

Needs review

» Needs work

The last submitted patch, 768040-fix-substr.patch, failed testing.

Log in or register to post comments

Comment #70

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 4 June 2010 at 22:10

Hmmm.

This is failing on the CJK line I tried to put in the patch. The test works fine on my machine, but I think my patch-creation software is choking on those characters.

And now my patch-applying program is not able to apply the patch above either.

Help?

The problem in the CJK is these lines:

+      array('以呂波耳・�?��?��?��?�。リヌルヲ。', 1, 3,
+              '呂波耳'),

which are coming through garbled... I took this string of characters from search.test around line 347... Can someone perhaps re-roll the patch?

Sigh.

Log in or register to post comments

Comment #71

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 4 June 2010 at 22:25

Status:

Needs work

» Needs review

File	Size
768040-reroll.patch	21.06 KB

Never mind. Tried my other machine and I think it has worked better this time.

Log in or register to post comments

Comment #72

janusman CreditAttribution: janusman commented 8 June 2010 at 13:56

Status:

Needs review

» Reviewed & tested by the community

Awesome.
Run tests locally, all pass.

Marking RTBC. =)

Log in or register to post comments

Comment #73

Dries CreditAttribution: Dries commented 9 June 2010 at 15:53

Great catch with the drupal_substr() bug!

Some minor details:

+++ modules/simpletest/tests/unicode.test	4 Jun 2010 22:24:15 -0000
@@ -215,4 +247,81 @@
+    // Each case is an array with input string, max_length, and expected return
+    // value.

- When reading this patch top-to-bottom it wasn't clear what max_length was going to be used for.

- Instead of saying 'max_length' we could write 'maximum length'?

+++ modules/simpletest/tests/unicode.test	4 Jun 2010 22:24:15 -0000
@@ -215,4 +247,81 @@
+    // Test non-wordsafe, non-ellipsis cases

Missing trailing point?

+++ modules/simpletest/tests/unicode.test	4 Jun 2010 22:24:15 -0000
@@ -215,4 +247,81 @@
+      array('以呂波耳・ほへとち。リヌルヲ。', 6, '以呂波耳・ほ',),

Extra comment before ')' ?

+++ modules/simpletest/tests/unicode.test	4 Jun 2010 22:24:15 -0000
@@ -215,4 +247,81 @@
+    // Test non-wordsafe, ellipsis cases

Missing trailing point?

+++ modules/simpletest/tests/unicode.test	4 Jun 2010 22:24:15 -0000
@@ -215,4 +247,81 @@
+    // Test wordsafe, ellipsis cases

Missing trailing point?

+++ modules/simpletest/tests/unicode.test	4 Jun 2010 22:24:15 -0000
@@ -215,4 +247,81 @@
+   * Runs test cases for trucation.
+   *
+   * Helper function for helperTestTruncate().

I'd remove the first sentence, or somehow combine them.

Log in or register to post comments

Comment #74

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 9 June 2010 at 16:08

Status:

Reviewed & tested by the community

» Needs work

Good points. I will reroll later today or tomorrow.

Log in or register to post comments

Comment #75

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 9 June 2010 at 18:25

Status:

Needs work

» Needs review

File	Size
768040-bettercomments.patch	21.48 KB

Here's a new patch, with a redo of the comments and fix for the extra comma mentioned above. Hopefully the unicode characters didn't get screwed up again...

Log in or register to post comments

Comment #76

janusman CreditAttribution: janusman commented 10 June 2010 at 14:35

Status:	Needs review	» Reviewed & tested by the community
Issue tags:	-Needs documentation

Changes look good, tests pass locally. I think it's RTBC.

This is still tagged "Needs Documentation". Guess we can untag it now too =)

Log in or register to post comments

Comment #77

Dries CreditAttribution: Dries commented 10 June 2010 at 15:21

Status:

Reviewed & tested by the community

» Fixed

This looks great now. Committed to CVS HEAD. Thanks all!

Log in or register to post comments

Comment #78

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 10 June 2010 at 15:50

Status:

Fixed

» Reviewed & tested by the community

Just as a note, if this is committed, we should probably port to Drupal 6, which has the same exact functions currently.

Log in or register to post comments

Comment #79

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 10 June 2010 at 15:51

Version:	7.x-dev	» 6.x-dev
Status:	Reviewed & tested by the community	» Patch (to be ported)

Oops, cross post.

Log in or register to post comments

Comment #80

Dave Reid

he/him

English

Nebraska USA

CreditAttribution: Dave Reid commented 30 June 2010 at 23:44

So I was looking into possibly backporting this code into pathauto itself and wanted to look into if there were any challenges to providing this with D6/PHP4/etc. As far as we can tell there should not be any reason not to backport. We can provide the minimum length parameter as an optional variable. However, the tricky thing might be the constant being moved from search.module to unicode.inc.

Log in or register to post comments

Comment #81

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 1 July 2010 at 00:13

I forgot about that. We can't really remove that DEFINE from search.module, as contrib modules are definitely depending on it.

So probably the only thing we could do is move it (with the same name) to unicode.inc.

Log in or register to post comments

Comment #82

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 1 July 2010 at 00:14

Although... that could break other contrib modules. I know of one that copied that define to their own module and left the name the same. Which was a bad idea, but... anyway.

Log in or register to post comments

Comment #83

AlexisWilke CreditAttribution: AlexisWilke commented 19 September 2010 at 10:26

Hi guys,

I was just working on the teaser that is completely broken because it calls this truncate_utf8() function which is also broken for HTML. And none of your patches fix the problem.

If the teaser is to call this function (or anyone else) with a string that may contain HTML, then you MUST take the HTML, comments and entities in account which at this time are totally ignored.

See my fix to the teaser completely removes the call to this function:

#221257: text_summary() should output valid HTML and Unicode text, and not count markup characters as part of the text length

Since it is broken I could not use it. However, a much better fix is to move the teaser cutting capability here and then we'd have nothing to do in the teaser (nothing more than call this truncate_utf8() function with the right size.)

Thank you.
Alexis Wilke

Log in or register to post comments

Comment #84

dpearcefl CreditAttribution: dpearcefl commented 25 June 2011 at 01:21

Priority:	Critical	» Normal
Status:	Patch (to be ported)	» Postponed (maintainer needs more info)

Has this issue been fixed in the latest D6?

Log in or register to post comments

Comment #85

dpearcefl CreditAttribution: dpearcefl commented 25 June 2011 at 18:46

Priority:	Normal	» Major
Status:	Postponed (maintainer needs more info)	» Patch (to be ported)

Log in or register to post comments

Comment #86

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 27 June 2011 at 15:11

No, it has not been fixed in Drupal 6. Someone needs to port the patch. Hence, the issue status...

Log in or register to post comments

Comment #87

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 28 February 2012 at 15:40

Assigned:

jhodgdon

» Unassigned

Unassigning. If someone wants to port this...

Log in or register to post comments

Comment #88

2 March 2016 at 22:18

Status:

Patch (to be ported)

» Closed (outdated)

Automatically closed because Drupal 6 is no longer supported. If the issue verifiably applies to later versions, please reopen with details and update the version.

Log in or register to post comments

truncate_utf8() only works for latin languages (and drupal_substr has a bug)

Comments