There are two problems with truncate_utf8() that come up for some languages, and some combinations of arguments:

a) Truncate to a word boundary -- it assumes word boundaries are spaces. This is not true for CJK languages. See
http://drupal.org/node/269911#comment-2795478

b) Adding "..." at the end -- this is not the right way to indicate continuation in all languages. It should be adding t('...'). See http://drupal.org/node/86461#comment-2829792
for more information.

There are many issues related to one or both of these problems, which I'll list in a comment...

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Damien Tournoud’s picture

Indeed, there is clearly room for improvement here.

For (a), splitting "words" at PREG_CLASS_SEARCH_EXCLUDE boundaries would go a long way already (there cannot be any perfect solution without (1) knowing the language, (2) a corpus and complex machine learning algorithms).

For (b), allowing to translate the ellipsis could also probably cover 90% of the use cases.

Damien Tournoud’s picture

Title: truncate_utf8 features are not locale-aware » truncate_utf8() only works for latin languages

Clarifying the title.

jhodgdon’s picture

Status: Active » Needs review
FileSize
6.37 KB

Here's a first attempt at a patch using methodology discussed above. It may change something in existing code, because the word boundary character class includes punctuation and symbols. But the change should be for the better.

Note that the patch moves the regexp class from search.module to unicode.inc (with a rename), and translates the '...' added at the end.

Let's see what the test bot says and go from there...

Status: Needs review » Needs work

The last submitted patch, 768040.patch, failed testing.

Damien Tournoud’s picture

You will have to be smarter then that :)

I suggest the following heuristic:

  • Find the longest substring that is <= $len (you can use the regexp to do that, but anchor it to the start and limit the .* to $len characters)
  • If that string is > $len / 2, use it as-is, else truncate at a non-word safe position
jhodgdon’s picture

Assigned: Unassigned » jhodgdon

What? There were only 25,480 exceptions. I don't see any problem. :)

Must have been a Monday patch... I'll see if I can do better (and test it this time before submitting it).

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
11.15 KB

Here's a new patch.

I'm not sure about the heuristic of only truncating to a word boundary if the string would be 1/2 of the desired length or more. That's not how the old function worked, and I'm not sure we want to change it in that way.

So, this patch has the benefit of actually working (at least, the dblog looks reasonable). It may cause a few test failures, though, since the definition of a word boundary has changed to include punctuation. Let's see what the test bot says this time...

jhodgdon’s picture

I should also note that I think putting a space before the ... is silly. I left that in there for backwards compatibility, but I think the normal thing (in English at least) would be to have text like:

This is a really long sentence that has been...

rather than

This is a really long sentence that has been ...

Thoughts?

Damien Tournoud’s picture

Agreed on the uselessness of the space prefix.

// space exists AND is not on position 0

^ There was an heuristic already. Basically we don't want to cut a string too short simply because the last ponctuation sign appears too soon in the string.

jhodgdon’s picture

The patch in #8 requires at least one character in the string to use the word boundary, which is the same as the original heuristic. I think if we apply any other heuristics, it should be an additional input to the function, like $min_length. Thoughts?

So I guess we are agreed that there shouldn't be a space before the ...

I can remove that (and likely will need to fix a few tests where it's assuming the space is there). It looks like the patch in #8 works this time, though, which is good.

jpmckinney’s picture

I never liked "...". "..." is three periods. Ellipsis (&hellip; = …) is distinct from three periods. Ellipsis is the correct punctuation.

jhodgdon’s picture

The helip entity may be the correct punctuation, but it is probably not the right punctuation for all cases, because sometimes you just need ascii characters and not HTML entities.

We could conceivably add another input argument to this function to give you the option of ... or the elipses entity. Or callers of this function could not use the ... argument and add helip themselves.

But, as discussed before, ... or even the entity is probably not the right thing for non-latin languages anyway. Which is why t('...') is probably the right thing to do... I think?

jpmckinney’s picture

Yeah, &hellip; would be a poor choice, given that check_plain is often called after passing a string through truncate_utf8. But, would it be possible instead to put t('…') in place of t('...')? I don't know what the policy is on putting non-ASCII characters in the code.

jhodgdon’s picture

Not sure... is it a valid UTF-8 character? I also don't understand why ... in plain ASCII is so objectionable?

jpmckinney’s picture

Typography snobs would object, but ... is otherwise quite alright.

Anonymous’s picture

The orizontal ellipsis is a UTF-8 character (E2 80 A6).
I don't see any particular reason to use that character instead of 3 dots, except that it would allow to see two characters more from the original string (which is not a valid reason, IMO).

The character to use when truncating is a secondary detail; the most important thing would be to make the function work for more languages (if that is possible).

jhodgdon’s picture

Issue tags: +Needs documentation

OK... [that was three dots :) ]

So what about the patch in #8 then? Questions/comments that have been brought up:
a) Should we have a heuristic like "must be at least half the string" rather than the current "must be at least 1 character" to use word-safe? My inclination is not, unless we make it an input (let the caller specify the minimum length for wordsafe truncation). I'd be inclinded to leave the default at 1 if we make it an input, for backwards compatibility.
b) Should we remove the space before the ... that is appended?
c) Should we switch to appending the horizontal elipses character/entity rather than three . characters?

The patch in #8 does none of the above, but is not backwards-compatible because it is different on what it considers a word boundary (previously it was a space, now it's a space, punctuation, or CJK word-boundary indicators).

I am inclined to say yes on (b) and no on (a) and (c), but am open to community consensus because they are all reasonable suggestions in my opinion.

In any case, we we need to document this change in the module upgrade guide when we decide/commit, and likely update tests if we make any of the changes above, and update the function header along with the code if we implement any of the three.

It would also be great if someone could come up with a CJK word-safe trunctation test.

jpmckinney’s picture

I would be happy with just b).

jhodgdon’s picture

That's two for (b)... anyone else?

Anonymous’s picture

I agree that implementing only point b is preferable. The reported patch seems good to me, and I agree that a different euristic should probably be activated by an optional argument; in that way it is guaranteed the compatibility with the past code.

jhodgdon’s picture

Damien (see comment #10 above) also agreed about (b), and it sounds like we're all "OK but not mandatory" about (a). I'll see what I can do.

jhodgdon’s picture

FileSize
11.43 KB

Here's a new patch.

Changes from orginal function:
1. If using wordsafe, finds non-latin-safe word boundary (including punctuation), instead of specifically looking for ' ' as word boundary.
2. If appending dots, appends t('...') instead of ' ...' -- so '...' is translated, and there is no space before the ...
3. If using wordsafe mode, allows you to specify a minimum length (defaults to 1 character for backwards compatibility).

2 & 3 are changes from last patch.

Let's see what the test bot says. There may be tests that are looking for 'truncated stuff ...' somewhere and now they'll have 'truncated stuff...' without the extra space.

It would be great if someone could provide a CJK test of this function, too, and maybe we should write some unit tests too?

drewish’s picture

This patch looks pretty good to me but the comments might need a little tweaking:

  * @param $dots
- *   Flag to add trailing dots. Defaults to FALSE.
+ *   If TRUE, add t('...') to the end of the truncated string (defaults FALSE).

It seems like "Defaults to FALSE." is the way we normally say it.

+ * @param $min_length
+ *   If using word-safe mode, minimum length to truncate to before appending
+ *   dots (if using dots).
+ *

Just from reading that it seems unclear what this does... the interactions with $length and $dots aren't clear without reading the code.

jhodgdon’s picture

FileSize
6.9 KB

OK, how about this version? I decided it might make things clearer to change the name of the 2nd arg for clarity, and I rewrote the rest of the doc.

drewish’s picture

This looks much better.

Does it make sense to change $max_return_length to $max_length?

$min_length's help text is better but it's not clear to me why you'd ever want to use that. Also "Has no effect if $wordsafe is not TRUE." seems like we could change not TRUE to FALSE.

jhodgdon’s picture

As for why to have $min_length, this is discussed above in #10, #18, #21. The reason is that if the last space is only after a couple of characters, followed by maybe a really long URL with no spaces in it, you probably don't want to return that really short string because no one will understand what it is supposed to be.

I changed max_length to max_return_length (which refers to string with dots) to distinguish it from min_length (which refers to the string without dots). It seemed wrong for min_length and max_length to not be parallel, but they are used for different reasons presumably (max to truncate for db storage, min for understandability).

Thoughts?

I can make another go of clarifying the doc...

drewish’s picture

That URL example should go into the docs... that's exactly the answer I'd like to see when reading them.

Anonymous’s picture

I am a typography snob :-)

The main reason it makes a difference is that the dots in an ellipsis ought to spaced further apart than three dots get spaced by most faces/engines. (And less far apart than if they had actual spaces between them.) A second reason is that logically the thing is a single symbol, not three symbols.

If we cannot rely on everything being able to deal with the Unicode ellipsis character or HTML entity, then I would not want to argue for using either. That seems like asking for trouble.

Using t('...') allows those of us that really care about this to put a translation from three dots to the Unicode character. And in the forthcoming Nirvana where all software knows about Unicode, this could be revised to do the actually right thing.

jhodgdon’s picture

RE #28 -- OK, I will add that example to the patch.

jhodgdon’s picture

FileSize
7.46 KB

OK, here's a new patch...

jhodgdon’s picture

FileSize
7.47 KB

One more go after some verbal comments from drewish...

jhodgdon’s picture

Bump. Can we get this reviewed and hopefully committed? I have some search patches to do that will need to use this preg class whose name is changing...

jhodgdon’s picture

#32: 768040-v6.patch queued for re-testing.

codycraven’s picture

Review of #32:

+++ includes/unicode.inc	20 Apr 2010 00:03:47 -0000
@@ -213,44 +274,70 @@
+ * @param $max_return_length

What is the reason for renaming $len to $max_return_length? I would much rather see $max_length for consistency with $min_length.

+++ includes/unicode.inc	20 Apr 2010 00:03:47 -0000
@@ -213,44 +274,70 @@
  * @param $dots

Since we are modifying variable names, I would like to see $dots replaced with $ellipsis as it is more correct and global rather than the colloquial dot (short for dot-dot-dot) - especially considering that we are making this translatable.

+++ includes/unicode.inc	20 Apr 2010 00:03:47 -0000
@@ -213,44 +274,70 @@
+  if ($dots) {
+    $dotstring = t('...');
+    $max_return_length -= drupal_strlen($dotstring);
+  }

Bad practice to only define a variable in the event of an if statement.

This should be changed to something like:

$dotstring = '';
if ($dots) {
  $dotstring = t('...');
  $max_return_length -= drupal_strlen($dotstring);
}
+++ includes/unicode.inc	20 Apr 2010 00:03:47 -0000
@@ -213,44 +274,70 @@
+    // Find the last word boundary, if there is one within $min_length to ¶

Trailing space needs to be removed.

jhodgdon’s picture

Status: Needs review » Needs work

- The reason max_return_length and min_length are not named in a parallel way is that they do not behave in a parallel way. Read the doc/code for details -- one governs return length and the other is length before dots are added. I didn't want them to look like they behaved the same, so I didn't name them the same.

- Agreed on other thoughts. I'll update the patch.

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
12.58 KB

OK. Here's a new patch with many updated variable names and some updated documentation. I realized the URL example was wrong, since . and : are punctuation, so I modified it slightly. I also realized that the search.module chunk of this patch has been missing since that patch in #23, so this patch puts it back in.

jhodgdon’s picture

As noted in #23, do we need some tests for this function before the patch is committed? Also, the changes to the function will need to go into the module update guide.

codycraven’s picture

Regarding #36 and #37

I'm glad that my feedback was helpful.

I'm still not keen on having a name such as $max_return_length as a function parameter, however if you do decide to keep it as such please use the variable consistently throughout (not just renaming it to $max_length at the beginning of the function).

If others agree that the variable should be left as $max_return_length and you do decide to keep the re-assignment to $max_length for use in the function, then please do so by reference (technically by alias in php) by doing $max_length =& $max_return_length for memory allocation reasons.

Other than that gripe, I think the patch looks pretty good. I do think that some test cases should be added, especially for the min_length functionality since it is new. Creating test-cases for the other functionality would be difficult as there is no way we can write a test case for every language to test the truncation but we should at least have some.

jhodgdon’s picture

Seriously, memory allocation for one non-array varialble worries you?

The reason I used a second variable $max_length inside the function is that it no longer has the same meaning -- it's the length to truncate the string to, not the total returned length.

codycraven’s picture

jhodgdon,

It's not the memory allocation of the specific variable, it is that if this is an acceptable practice throughout the entire project you could conceivably have tens of thousands of duplicate, unnecessary variables in memory at a single instance in time on high volume sites.

I don't understand how the variable $max_length no longer has the same meaning as $max_return_length when $max_return_length is not used anywhere in the function. It seems like the variable should be named one or the other, and if it is $max_length then the documentation that you have added for the parameter already accurately explains the use of the variable.

If I am missing something regarding the functionality of $max_return_length please explain it to me. From what I understand I would feel confident in calling the function and setting a max length for the string I would want and would expect that the maximum length returned would be whatever max length I set.

Please note that I am not trying to shred your work apart, I think you have done a fantastic job on the patch and am only interested in providing an independent view from my experience background to help better the patch.

jhodgdon’s picture

It's a local variable and should be popped off the stack as soon as the function finishes, so I fail to see how it would increase the footprint of Drupal much to have a few extra bytes in memory for a miniscule amount of time. It's not like we have thousands of functions in the call stack at a time.

Roll a new patch or mark it "needs work" if you feel strongly enough about it. I'm happy with what it is now and it makes logical sense to me that "max_return_length" refers to the length of the returned string in the argument, and "max_length" refers to the length of the string before adding dots to it within the function, which is not the same thing. I don't feel strongly enough about it to object if you want to not use a different variable within the function.

codycraven’s picture

Status: Needs review » Reviewed & tested by the community
FileSize
12.47 KB

jhodgdon,

You are right regarding the actual memory impact in this instance. I just like to avoid setting practices that may be adopted by other contributors because they saw it in core, in which their use could result in actual impact.

From your explanation of the $max_return_length vs $max_length in your code I can see how you are viewing the instance and from that perspective does make sense.

I've attached a .patch file which is equivalent to the patch from #37 with the only changes being the renaming the $max_return_length parameter to $max_length and removing the assignment of $max_return_length to $max_length (which is no longer needed). I am offering the .patch as an alternative based on the discussions above and find no fault in the patch in #37 which I consider RTBC.

**As a side note, this is the first time I have rolled a core patch from NetBeans IDE so hopefully it applies correctly, it does in my local environment.

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 768040-v7-alternative.patch, failed testing.

codycraven’s picture

Status: Needs work » Reviewed & tested by the community
FileSize
12.53 KB

I realized that my alternative patch (#43) did not remove the assignment of $max_return_length to $max_length but only re-factored it to $max_length = $max_length. The attached patch corrects the fault.

Again as a note this patch has not been tested by anyone other than myself, although it is intended to be identical to the patch from #37 which has been reviewed, except for the re-factoring of a parameter. If this patch is chosen to commit instead of #37 then please review before commit.

**I also dumped NetBeans for generating the patch, so this one should apply.

chx’s picture

Status: Reviewed & tested by the community » Needs work

Well, well. We have no better idea still, right? When working on the Unicode problems, I have looked into implementing TR29 Word Break in user space but back then I just did not have the time to attempt it. I stll dont :) but boy, it would be interesting. You would need to take http://unicode.org/Public/UNIDATA/auxiliary/WordBreakProperty.txt the generator script in the search issue and the rules in http://unicode.org/reports/tr29/#Default_Word_Boundaries . Might be able to build a pcre even that does all of that (with atomic subpatterns).

Until then, the docblock of the pcre you moved makes no sense any more "Consequently, the index only contains characters with the following" this wants to be in search module.

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
12.51 KB

RE #46 - Yeah, I missed that when I moved it over from search.module. Here's a patch with the text changed to be generic rather than talking about what gets into the search index.

It also adopts codycraven's renaming of the max_length value form the "alternative" patch, which was the name I had given the parameter back in earlier versions of the patches anyway, before codycraven objected to that name back in #35. Whatever. :)

Garrett Albright’s picture

YesCT’s picture

Issue tags: -Needs documentation

#47: 768040-fixsearchcomment.patch queued for re-testing.

Status: Needs review » Needs work

The last submitted patch, 768040-fixsearchcomment.patch, failed testing.

jhodgdon’s picture

Status: Needs work » Needs review
Issue tags: +Needs documentation

#47: 768040-fixsearchcomment.patch queued for re-testing.

janusman’s picture

Did some basic testing and found out the current logic needed some tweaking.

Patch from #47:

(123, 1, true, true) == 	1...	==> Should be ...
(123, 2, true, true) == 	12...	==> Should be ...
(123, 3, true, true) == 	...	==> Should be 123 (failure)
(1234, 3, true, true) == 	...	==> Should be ...
(12345 7890, 10, true, true) == 	12345...	==> Should be 12345 7890 (failure)
(123 567 90, 10, true, true) == 	123 567...	==> Should be 123 567 90 (failure)
(1234567890, 10, true, true) == 	1234567...	==> Should be 1234567890 (failure)
(12345678901, 10, true, true) == 	1234567...	==> Should be 1234567...
(12345678901, 11, true, true) == 	12345678...	==> Should be 12345678901 (failure)
(123456789012, 11, true, true) == 	12345678...	==> Should be 12345678...

New patch: (I'm assuming the ellipsis length takes priority over the max_length in strings shorter than the ellipsis itself)

(123, 1, true, true) == 	...	==> Should be ...
(123, 2, true, true) == 	...	==> Should be ...
(123, 3, true, true) == 	123	==> Should be 123
(1234, 3, true, true) == 	...	==> Should be ...
(12345 7890, 10, true, true) == 	12345 7890	==> Should be 12345 7890
(123 567 90, 10, true, true) == 	123 567 90	==> Should be 123 567 90
(1234567890, 10, true, true) == 	1234567890	==> Should be 1234567890
(12345678901, 10, true, true) == 	1234567...	==> Should be 1234567...
(12345678901, 11, true, true) == 	12345678901	==> Should be 12345678901
(123456789012, 11, true, true) == 	12345678...	==> Should be 12345678...
janusman’s picture

Above patch but with some tests. Might have gone a little overboard with the tests, though =)

jhodgdon’s picture

Status: Needs review » Needs work

The max length param is supposed to be a hard limit. If the person requests ellipses and the length of t('...') is > max_length, and we need to truncate, I think we should return an empty string.

Good catch on the bad logic though!

jhodgdon’s picture

Also, I think your tests are great, but maybe we should also test with some other inputs rather than hard-wiring everything except the length?

janusman’s picture

Status: Needs work » Needs review
FileSize
12.13 KB

Hmm, how about this?

Status: Needs review » Needs work

The last submitted patch, 768040-56-fixsearchcomment.patch, failed testing.

janusman’s picture

Now this is weird; tests fail but direct calls to the function *do* work.

This is the test bot output:

Truncate frànçAIS is über-åwesome to 23 characters is frànçAIS is über-åwesome; expected frànçAIS is über-åwesom	Other	unicode.test	235	UnicodeUnitTest->testTruncate()	
Truncate 123 to 2 characters is ...; expected ..	Other	unicode.test	290	UnicodeUnitTest->testTruncate()

However if I run this code on a block:

echo truncate_utf8("frànçAIS is über-åwesome", 23, FALSE, FALSE);

it produces the expected "frànçAIS is über-åwesom" (test claims the result is "frànçAIS is über-åwesome").

Will try to dig deeper...

jhodgdon’s picture

What do you see when you run the tests through simpletest on your testing system, as opposed to the bot?

Looking at the logic in your current truncate_utf8() function, it looks like drupal_substr() is returning 24 characters when asked for 23, or else drupal_strlen() is saying that the string length is <= 23.

  if (drupal_strlen($string) <= $max_length) {
    // No truncation needed, so don't add ellipsis, just return.
     return $string;
   }
... (if wordsafe {} )

   else {
    $string = drupal_substr($string, 0, $max_length);
   }

So there could possibly be some difference in the behavior of drupal_substr() or drupal_strlen() based on some global variables that could be different between the test bot and your system:
http://api.drupal.org/api/function/drupal_substr/7
http://api.drupal.org/api/function/drupal_strlen/7

If so, that's a bug. Would be great to debug it! You can try putting some debug() statements into your tests, see how they work on your testing system, and then submit a patch and see what the test bot says.

janusman’s picture

Confirming I see the same failed tests in my environment and from the test bot. Guess I failed to run those first =)

So, yes, it does *seem* like a bug of sorts, the weirdness is that seemingly the same string is truncated fine to 23 characters from within PHP in a block, but not inside the tests (I added a case to $testcase inside helperTestSubStr() and it also fails to truncate correctly to 23 characters).So I'm guessing this probably even be a PHP bug... *edit* more probably, it's the global variable like @jhodgon mentions.

I'll release this issue for a while as I don't have any more free time for it... so feel free to take it.

jhodgdon’s picture

OK. Thanks for all the work you've done! I will see if I can find some bandwidth to debug it.

I guess it's already assigned to me.

jhodgdon’s picture

The patch in #56 is also missing the chunk for search.module.
EDIT: scratch that. I can't read.

jhodgdon’s picture

I am debugging this now. I can see the test failures. And that patch in #56 is missing the search.module chunk.

jhodgdon’s picture

Hah. I added a few more tests to unicode.test, to the part where it is testing drupal_substr() and the bug is happening there -- it is not truncating correctly.

So I will need to delve into that...

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
18.86 KB

Here's the current patch, so you can see drupal_substr() failing. I also modified the last set of tests a bit - modularized it slightly, and put it into the set of tests that is running in both non-multibyte and multibyte modes, for better test coverage.

So the drupal_substr() tests are failing when the global $multibyte is set to emulation mode, for several cases:
truncating '...' to two characters
truncating 'frànçAIS is über-åwesome' to 23 characters

It seems to be a problem when the requested length is 1 less than the actual string length.

I will continue to investigate, but I thought it might be useful to attach this patch now so you can see the tests failing.

jhodgdon’s picture

OK, I found where the bug is. In drupal_substr():

      // Backtrace one byte if the end of the string was not reached.
      if ($iend < $strlen - 1) {
        $iend--;
      }

The problem is that if $length is equal to one less than the string length, at this point $iend == $strlen - 1, and so it doesn't step back one byte even though it should. So this logic is wrong.

I'm going to add some additional tests for negative lengths and negative starts (there were some but not all cases were covered) and see if that has problems too.

Status: Needs review » Needs work

The last submitted patch, 768040-moretests-fails.patch, failed testing.

jhodgdon’s picture

Title: truncate_utf8() only works for latin languages » truncate_utf8() only works for latin languages (and drupal_substr has a bug)
Status: Needs work » Needs review
FileSize
21.06 KB

Here's a fix.

This is the only way I could figure out to fix the logic problem in drupal_substr. It comes back all green on my test box for the unicode tests. Let's see what the testbot and reviewers say...

I added some additional test cases to the drupal_substr() tests, beyond the ones in #65, which I think should cover all the bases now. Hopefully.

Status: Needs review » Needs work

The last submitted patch, 768040-fix-substr.patch, failed testing.

jhodgdon’s picture

Hmmm.

This is failing on the CJK line I tried to put in the patch. The test works fine on my machine, but I think my patch-creation software is choking on those characters.

And now my patch-applying program is not able to apply the patch above either.

Help?

The problem in the CJK is these lines:

+      array('以呂波耳・�?��?��?��?�。リヌルヲ。', 1, 3,
+              '呂波耳'),

which are coming through garbled... I took this string of characters from search.test around line 347... Can someone perhaps re-roll the patch?

Sigh.

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
21.06 KB

Never mind. Tried my other machine and I think it has worked better this time.

janusman’s picture

Status: Needs review » Reviewed & tested by the community

Awesome.
Run tests locally, all pass.

Marking RTBC. =)

Dries’s picture

Great catch with the drupal_substr() bug!

Some minor details:

+++ modules/simpletest/tests/unicode.test	4 Jun 2010 22:24:15 -0000
@@ -215,4 +247,81 @@
+    // Each case is an array with input string, max_length, and expected return
+    // value.

- When reading this patch top-to-bottom it wasn't clear what max_length was going to be used for.

- Instead of saying 'max_length' we could write 'maximum length'?

+++ modules/simpletest/tests/unicode.test	4 Jun 2010 22:24:15 -0000
@@ -215,4 +247,81 @@
+    // Test non-wordsafe, non-ellipsis cases

Missing trailing point?

+++ modules/simpletest/tests/unicode.test	4 Jun 2010 22:24:15 -0000
@@ -215,4 +247,81 @@
+      array('以呂波耳・ほへとち。リヌルヲ。', 6, '以呂波耳・ほ',),

Extra comment before ')' ?

+++ modules/simpletest/tests/unicode.test	4 Jun 2010 22:24:15 -0000
@@ -215,4 +247,81 @@
+    // Test non-wordsafe, ellipsis cases

Missing trailing point?

+++ modules/simpletest/tests/unicode.test	4 Jun 2010 22:24:15 -0000
@@ -215,4 +247,81 @@
+    // Test wordsafe, ellipsis cases

Missing trailing point?

+++ modules/simpletest/tests/unicode.test	4 Jun 2010 22:24:15 -0000
@@ -215,4 +247,81 @@
+   * Runs test cases for trucation.
+   *
+   * Helper function for helperTestTruncate().

I'd remove the first sentence, or somehow combine them.

jhodgdon’s picture

Status: Reviewed & tested by the community » Needs work

Good points. I will reroll later today or tomorrow.

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
21.48 KB

Here's a new patch, with a redo of the comments and fix for the extra comma mentioned above. Hopefully the unicode characters didn't get screwed up again...

janusman’s picture

Status: Needs review » Reviewed & tested by the community
Issue tags: -Needs documentation

Changes look good, tests pass locally. I think it's RTBC.

This is still tagged "Needs Documentation". Guess we can untag it now too =)

Dries’s picture

Status: Reviewed & tested by the community » Fixed

This looks great now. Committed to CVS HEAD. Thanks all!

jhodgdon’s picture

Status: Fixed » Reviewed & tested by the community

Just as a note, if this is committed, we should probably port to Drupal 6, which has the same exact functions currently.

jhodgdon’s picture

Version: 7.x-dev » 6.x-dev
Status: Reviewed & tested by the community » Patch (to be ported)

Oops, cross post.

Dave Reid’s picture

So I was looking into possibly backporting this code into pathauto itself and wanted to look into if there were any challenges to providing this with D6/PHP4/etc. As far as we can tell there should not be any reason not to backport. We can provide the minimum length parameter as an optional variable. However, the tricky thing might be the constant being moved from search.module to unicode.inc.

jhodgdon’s picture

I forgot about that. We can't really remove that DEFINE from search.module, as contrib modules are definitely depending on it.

So probably the only thing we could do is move it (with the same name) to unicode.inc.

jhodgdon’s picture

Although... that could break other contrib modules. I know of one that copied that define to their own module and left the name the same. Which was a bad idea, but... anyway.

AlexisWilke’s picture

Hi guys,

I was just working on the teaser that is completely broken because it calls this truncate_utf8() function which is also broken for HTML. And none of your patches fix the problem.

If the teaser is to call this function (or anyone else) with a string that may contain HTML, then you MUST take the HTML, comments and entities in account which at this time are totally ignored.

See my fix to the teaser completely removes the call to this function:

#221257: text_summary() should output valid HTML and Unicode text, and not count markup characters as part of the text length

Since it is broken I could not use it. However, a much better fix is to move the teaser cutting capability here and then we'd have nothing to do in the teaser (nothing more than call this truncate_utf8() function with the right size.)

Thank you.
Alexis Wilke

dpearcefl’s picture

Priority: Critical » Normal
Status: Patch (to be ported) » Postponed (maintainer needs more info)

Has this issue been fixed in the latest D6?

dpearcefl’s picture

Priority: Normal » Major
Status: Postponed (maintainer needs more info) » Patch (to be ported)
jhodgdon’s picture

No, it has not been fixed in Drupal 6. Someone needs to port the patch. Hence, the issue status...

jhodgdon’s picture

Assigned: jhodgdon » Unassigned

Unassigning. If someone wants to port this...

Status: Patch (to be ported) » Closed (outdated)

Automatically closed because Drupal 6 is no longer supported. If the issue verifiably applies to later versions, please reopen with details and update the version.