When commenting any type of content, title of comment generates automatically. According to source code the title shold be 29 characters length. When using native characters, the length is shorter. Run a simple test:
write comment to node with the text: "A B C D E F G H I J K L M N O P Q R S T U W X Y Z"
the title should be "A B C D E F G H I J K L M N", it's ok
write comment to node with the text: "А Б В Г Д Е Ё Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ъ Ы Ю Э Ю Я"
the title should be "А Б В Г Д Е Ё Ж З И Й К Л М",
but actual result is: "А Б В Г Д Е Ё Ж З"
Short titles are inconvinient for non-english speakers, the result is ofthen only one word in a title. I see problem in function truncate_utf8 in the file unicode.inc. I solved the problem by rewriting this function, the code is below. Please, cosider my variant:
function truncate_utf8($string, $len, $wordsafe = FALSE, $dots = FALSE) {
$string = drupal_substr($string, 0, $len - ($dots ? 4 : 0));
if ($wordsafe && ($last_space=mb_strrpos($string, ' ')))
$string = drupal_substr($string, 0, $last_space);
if ($dots)
$string .= ' ...';
return $string;
}
Comment | File | Size | Author |
---|---|---|---|
#57 | 200185_rename_truncate_utf8-56.patch | 14.87 KB | Freso |
#54 | 200185_rename_truncate_utf8-54.patch | 14.02 KB | Freso |
#52 | 200185_rename_truncate_utf8-52.patch | 14.02 KB | Freso |
#49 | 200185_rename_truncate_utf8-49.patch | 12.9 KB | Freso |
#47 | rename_truncate_utf8.patch | 12.78 KB | Freso |
Comments
Comment #1
PanchoI can reproduce this bug with other utf8 characters (here: German umlauts), though it will be most prominent in languages with a non latin alphabet. Should be fixed.
Comment #2
PanchoI took a look at truncate_utf8 and found it to be completely broken (therefore I mark this critical)!
truncate_utf8 didn't even use drupal_substr() but tried to reinvent the wheel, and did so in a wrong way. Obviously this function has been incorrectly patched when introducing unicode support.
I also checked out aslw's replacement function, which unfortunately has three drawbacks:
Therefore I had to rewrite the function.
As a replacement for mb_strrpos() is IMHO unavoidable for this task (correct me if I'm wrong), I adapted an existing one (written by Niels Leenheer & Andy Matsubara).
Now, instead of adding this replacement in line, which wouldn't have made much sense, I added the new API function drupal_substr() to unicode.inc, so the functionality can be used throughout Drupal.
If someone could come up with an alternative that gets along with only one drupal_substr() call, that might be even better. Otherwise I guess this is okay in terms of performance.
I tested the new function both in the comment title and the node teaser use case and found it to work absolutely correct with and without non-latin chars. Still, as truncate_utf8 is being used in many places, some more testing would be awesome.
One thing we might discuss is whether the function name should be changed to something more generic like drupal_truncate(). True that this function is UTF8-safe, but it is just our standard truncate function. This would also be more in line with all the other UTF8-safe functions in unicode.inc.
Comment #3
PanchoComment #4
Gábor HojtsyAs pointed out elsewhere, truncate_utf8() is not used how it was designed. It is designed to split off a string at a certain number of **bytes**, and it walks back from that number of bytes, when that position is in the middle of an UTF-8 sequence. It is not designed to split off a string at a certain number of **chars**.
See Steven's note at http://drupal.org/node/26688:
As far as I know, we only run on natively utf-8 supporting databases, so we might not need truncate_utf8() to do this. But in any case, all uses of truncate_utf8() should be examined and noted what it is used for at that place. Lots of places it is probably misused as a substring function, instead of a byte count based truncation function, some places might use it as a byte count based function, so any change in behavior will break them.
We seem to have these uses of it in core:
I remember that there should be an issue, which was probably postponed to Drupal 7 which tried to do this same utf8-ification of truncate_utf8(), then suggested a new function to be added, so there might very well be previous analysis. Feel free to come up with creative search terms to find that issue, I was unable to.
Comment #5
Gábor HojtsyBetter title :)
Comment #6
Gábor Hojtsytruncate_utf8() was introduced by Steven in http://drupal.org/cvs?commit=911 This was on April 15, 2004, more then 3 years ago. Actually, more then a year later, Steven proposed and soon committed a general utf8 subsystem for Drupal in the above mentioned and quoted issue: http://drupal.org/node/26688
When you look at the original truncate_utf8() commit, you will see that it was introduced in place of substr() calls, because substr() was not utf8 safe. Now what about simply turning all substr()-itended uses of truncate_utf8() to drupal_substr() which was introduced one year after truncate_utf8(), but was never found wide use in Drupal?
Seems like truncate_utf8() should have been retired with the requirement of utf-8 databases, which happened between 4.6 and 4.7 and the introduction of the utf8 lib, which happened much earlier.
Looking at my grep above, it looks like that outside unicode.inc, where mime_header_encode() reuses truncate_utf8(), it is only used in core as a (sometimes word safe) drupal_substr(). So if something needs to be developed here, then it is a wordsafe drupal_substr() with a $dots option under a different name and the usage of that where appropriate.
Comment #7
chx CreditAttribution: chx commentedI researched out characters having Separator property and then coded the beginning of what Gabor asks. Ugly. I want PHP6.
Comment #8
chx CreditAttribution: chx commentedUpdated the code because the array was storing numbers and we had characters.
Comment #9
PanchoGábor:
Thanks for your very useful comments. I checked out those old discussions with Steven and came up with a solution that doesn't introduce more unicode functions. Also, it uses the more expensive drupal_substr() only twice in the rather seldom case, that there is no space in the truncated string. Otherwise I resorted to a faster option using strrpos() and substr().
Surely, truncate_utf8() has been misused. But is has been misused at least from 2004 on, when the $wordsafe flag was overhastily introduced into truncate_utf8().
Enclosed is a new patch replacing the one in #2. As it only changes truncate_utf8(), it allows for nothing more and nothing less than testing my improved algorithm.
To avoid confusion, for the next step I follow up separately.
Comment #10
PanchoGábor again:
Ok, now let's proceed to the next problem.
I acknowledged that changing truncate_utf8() would break mime_header_encode(), which seems to be the only proper use case.
Therefore I removed both the inadequate $wordsafe and $dots flags from the original truncate_utf8() and renamed the function to drupal_truncate_bytes().
Then, I created the revised function (as in #9) as the new function drupal_truncate_chars().
(Because I didn't want to change all calls of the former truncate_utf8(), I enclosed a wrapper function for testing purposes)
Aside from the complete patch, I enclosed the added functions in a textfile, because the patch is hardly readable.
Comment #11
Panchochx:
This seems to be a more complete approach to the problem, which is awesome!
I just think this would rather be a new feature for D7, as we would need to elaborate quite a lot on it.
While I'm having a hard time reading the code, I just want to drop some comments:
So if we can get this properly done for D6 and noone objects to adding this feature, it would certainly be great! Otherwise I'd postpone this to D7 and invite you to review my (smaller) proposal in #10.
Comment #12
Gábor HojtsyI love the approach taken in #10. This is a nasty bug in Drupal for three years, as truncate_utf8() was used in place of substr(), and we are moving to the right direction. I also like the renaming of these functions, to get in line with Drupal standards, and let all contrib maintainers know that they were using the wrong function for their purpose. Drupal 6 is an "internationalization release", so it would be a shame to release without this old bug fixed.
That said, the code in #10 suffers from some small coding style errors, namely around brackets like
} else {
. Otherwise obviously needs more testing.Comment #13
Gábor HojtsyActually, this one is a better title. truncate_utf8() is utf8 safe but is not a substring function for "exact" lengths.
Comment #14
chx CreditAttribution: chx commentedI got the list from http://www.fileformat.info/info/unicode/category/Zs/list.htm (and then line and paragraph separators). this quote tells me that 0x00A0, 0x202F and 0x180E are nonbreaking spaces. If we deal with Unicode, then we need to deal with Unicode, so a single 0x0020 won't cut it IMO. I am using the converter http://mikebabcock.ca/code/ here to convert Unicode code points to UTF-8.
Comment #15
chx CreditAttribution: chx commentedI can init this nice little array now easily as it contains only numbers and bools... enrolled the comments into the array.
Comment #16
Steven CreditAttribution: Steven commentedWhy don't you use preg_match/u with \x{} instead of munching bytes with byzantine hex loops?
Comment #17
chx CreditAttribution: chx commentedPresuming this flag is not set, we will get some sort of error from PCRE. My code skips characters it does not recognize -- it's not a full UTF8 validator, no, but it recovers gracefully. This might or might not be good. Also what kind of regexp will match until the last space but if the last space is not given then at most N chars...? Of course, we might want to cross #10 and PCRE... that might work. Question is, what do want to do with invalid UTF8 sequences? How much effort we want to take recognizing them? Should we just fail or skip parts that are invalid per UTF8...?
Comment #18
Gábor HojtsyJust to note, I realized I went a bit too much overboard with marking the truncate_utf8() renaming a good idea so late in the cycle. We discussed this a bit after my comment above with chx, and agreed it is better kept with its old name. Otherwise there are a few open questions here still.
Comment #19
PanchoHmm... you are thinking about contrib using truncate_utf8() for truncating at character length? Think, what you said in #12 was exactly right:
It is okay that contrib modules get broken, as they have used the wrong function. Still this is a small edit for contrib developers and shouldn't be a problem if documented.
Included is a rerolled patch correcting the code style errors noted in #12, adding some spaces for clarity and rewording documentation for the $wordsafe flag. Note that the wrapper function is still included, so this may not be committed.
Comment #20
Gábor HojtsySmall edit or not, we are past beta 4 and heading to RC1, so we are not supposed to change ANY API whatsoever. This is not an exception. We can add a better function here for sure, fix the use in core, and let contrib maintainers know about it, but we are not supposed to break anything.
Comment #21
PanchoOkay, accepted. Rerolled the patch in #19.
The existing, but commonly misused function truncating a string to a number of bytes is renamed to drupal_truncate_bytes().
The new function truncating a string to a number of chars is now called truncate_utf8(). This should meet contrib developers' expectations.
In a next step this one should be renamed to drupal_truncate_chars() in D7. Then we can also discuss chx' more complete approach to wordsafe truncating.
Now I could safely remove the wrapper function. From my point of view, this is rtbc, unless you see some other open questions here.
Comment #22
chx CreditAttribution: chx commentedThis will be interesting for the contrib authors who sometimes have no idea about the difference in bytes and Unicode characters. Despite I like my patch better I can see it being D7 so this one is good to go (though I would have more liked a proper patch with -p but that's a minor tidbit, it's just easier to review).
Comment #23
Panchochx (off-topic): I'd love to add the -p option to Eclipse's patch creation, but it seems to be impossible... or do you know someone who succeded with this?
Comment #24
Gábor HojtsyCommitted #21. Please document the change in the module updates page:
- truncate_utf8() is now UTF-8 characters based
- those who used it as byte based now should use drupal_truncate_bytes()
Left open for Drupal 7 with chx's patch.
Comment #25
chx CreditAttribution: chx commentedhttp://www.php.net/manual/en/reference.pcre.pattern.modifiers.php#54805
http://www.pcre.org/news.txt
Reading http://www.php.net/ChangeLog-4.php#4.3.5
So yes, preg_match is quite useable for our purposes.
Comment #26
bjaspan CreditAttribution: bjaspan commentedsubscribe
Comment #27
PanchoNow that the 7.x-dev branch is open, I enclose a patch that renames the new truncate_utf8() to drupal_truncate_chars() for consistancy, as it was agreed upon before (see Gábor's comment in #12).
All core functions that used to reference truncate_utf8() are changed to reflect this renaming.
This needs to be documented as an API change. Then, the next step to accomplish are chx' major improvements to a UTF8-enabled wordsafe truncating. So please set this back to CNW after committing.
Comment #28
PanchoRTBC is too bold, as this surely needs to be tested.
This is straightforward stuff though, so one more complete test should be enough to set this to RTBC.
Comment #29
PanchoOh, and of course this is no more critical...
Comment #30
Susurrus CreditAttribution: Susurrus commentedWould this have anything to do with #179521: New function to trim strings?
Comment #31
Susurrus CreditAttribution: Susurrus commentedI've tested it and it didn't apply cleanly, so I fixed that up and updated the documentation a little.
I think that the
drupal_truncate_chars()
should be changed so that any text can be appended to the end of a string. It would make it useful elsewhere in the code, especially where a space isn't desired in the ellipsis text. I changed the function so that it appends a...
instead of...
, so it's missing the starting space. This means it can replace much more code in Drupal. I think the order of the last two arguments should also be switched.I tested this code in the book heirarchy, but the simpletest module fails for a lot of tests right now...
Comment #32
Freso CreditAttribution: Freso commentedThanks for the heads up Susurrus! The
$dots
argument does seem to make "my" issue non-existant, but this also means that there's some of the stuff in my patch over there that should probably go into here. You've also changed a bunch of strings in_locale_get_predefined_list()
(includes/locale.inc), which shouldn't be part of this patch if they really need to be changed. Oh, and just tried to apply the patch. Lots and lots of failed hunks (actually, all the hunks fail).I'm going to give this patch some love, I think, and marking my own child a duplicate (even if it is older... *sigh*). Some of the discussion from #179521: New function to trim strings should still be considered though, namely these:
$append
here would be the$dots
.(Replace
drupal_trim
instances withdrupal_truncate_chars
in the above.)Note that my
drupal_trim
made it possible to make a custom$append
, so that you could have'...'
,' ...'
, or" (this has been truncated as it was overly long, so now I'm appending this message to make it even longer!)"
, and this should perhaps be considered here as well? With a default ofNULL
or''
, we should be able to safely append it in all cases. (If this is to be included, I'd also suggest reordering so that$wordsafe
is the last argument.)Comment #33
Freso CreditAttribution: Freso commentedWhile working on this, I realised that
drupal_truncate_chars
already does take care of the$dots
in the$len
it is given - but just by simply subtracting the length of the$dots
from$len
! Doing the replacements as has been done in the previous patch, this would lead to a regression (I could probably dig up the issue if requested) in that links truncated to, say, 15 characters would have 14 character links (http://abc.de/) "truncated" into a full 15 characters (http://abc.d...)! This is not fixed in the attached patch, so this still needs work IMO. I'm thinking about how to go about this with the current code (it worked beautifully indrupal_trim
).Apart from this, this also cleaned up some of the Doxygen line-wrapping as well as incorporated some stuff from my
drupal_trim
patch. I'll return soon with a way to fix the regression mentioned above. Feel free to test this in the mean while. ;)Comment #34
Freso CreditAttribution: Freso commentedAh, somehow I missed the first check in the function which should cancel the perceived regression. Setting to CNR. (I'm still not fully satisfied with the way
$wordsafe
and$dots
work together, but that shouldn't hold this up. I'll post a new patch if I can make it more elegant though.)Comment #35
Freso CreditAttribution: Freso commentedActually, this might be good. This should now enable the string "foo bar xyz" to be truncated (
$len = 8
) to "foo bar" ($wordsafe
, no$dots
), "foo ..." ($wordsafe
,$dots
, "foo b..." (no$wordsafe
,$dots
), or "foo bar " (no$wordsafe
, no$dots
). Please test and review!Comment #36
Freso CreditAttribution: Freso commentedOh, and this point is still up in the air:
(And re: simpletesting, please go give #243335: Move simpletest to core a look (and a test (and a review)).)
Comment #37
catchDid a visual review of the patch, amd tested it with a php page comparing generated output to Freso's expected output. Everything looks good.
IMO this is ready. I'd be in favour of a custom append as well, but no reason that shouldn't go in a follow up patch, worth looking at this old issue trying to use a proper ellipses instead of three dots, could maybe be done there: http://drupal.org/node/44987
Going to leave this open for one more pair of eyes to look at it before it goes RTBC, but no complaints from me at all.
Comment #38
Freso CreditAttribution: Freso commentedHere's a quick test. I haven't looked into where to place it yet, but if anyone has any brilliant idea, I'll happily turn this snippet into a proper patch against SimpleTest module's tests. :)
Comment #39
Freso CreditAttribution: Freso commentedPatch was broken due to commit of #245115: Fix Drupal's awkward coding standards for the . operator, and needed a single update for the new coding style as well. It still needs someone to review and I'd still like feedback to my comment #36. :)
Comment #40
Freso CreditAttribution: Freso commentedPatch had gotten some offset. Re-rolled. Also not sure what to do about the testing. Should the above test(s) be used, or should they be made into unit tests? I'm not really sure how to do the latter yet. I'll have to poke chx about it, I think.
Comment #41
Dries CreditAttribution: Dries commentedPlease include the tests in the patch, and add them to system.test or something.
Feel free to remove that other function, if it can be removed safely. Write tests for it if necessary.
Good job Freso.
Comment #42
Freso CreditAttribution: Freso commentedxmlrpc.inc seems to have xmlrpc.inc.test, so perhaps I should just make a unicode.inc.test?
/me goes to add tests to patch + rip out
_filter_url_trim()
Comment #43
Dries CreditAttribution: Dries commentedxmlrpc.inc.test is not currently called. It is something that needs to be fixed.
Comment #44
Freso CreditAttribution: Freso commentedI'm still rather rough around the edges, but here's an attempt at a unicode.test file, testing this function. The test itself hasn't actually been tested yet, but I thought I'd put it up here so that it isn't lost and so that people might comment on it before I get back to it again.
Comment #45
Freso CreditAttribution: Freso commentedComment #46
Freso CreditAttribution: Freso commentedSince Dries said he wanted me to give this some more love, I'm bumping this so it's not hiding away on page 4 of my issue list...
Comment #47
Freso CreditAttribution: Freso commentedCall it coincidence or synchronicity, but page 4 is exactly where this is placed right now. (Just on the tip of going to page 5 actually, but still hanging on at the edge of page 4.) However, as
truncate_utf8()
is about to be used (as a substring function) in Pathauto, I thought I'd give this patch another kick in its behind. :)The attached is a re-roll (including converting some recently added
truncate_utf8()
calls). I haven't gotten around removing_filter_url_trim()
, but I don't see that as a necessity of having this patch go in, and will increase the complexity (albeit slightly) of it instead. I'll happily work on it, oncedrupal_truncate_chars()
is born andtruncate_utf8()
is no more.I haven't tested or run any tests against this patch, but I'd still love to hear suggestions on how to improve the test file from #44.
Comment #48
Anonymous (not verified) CreditAttribution: Anonymous commentedThe last submitted patch failed testing.
Comment #49
Freso CreditAttribution: Freso commentedAaand... a re-roll. If anyone would help me with the tests so this could go in, I'd appreciate it.
Comment #50
bjaspan CreditAttribution: bjaspan commentedI'm not up on the details of this patch yet. However, I notice the following code:
This is poor DX. I do not know what FALSE and TRUE mean as arguments to drupal_truncate_chars. As a reviewer, for all I know this introduces a security hole. Even if I know what the two arguments are for, I'll never remember what order they are in or what TRUE and FALSE actually do. If those arguments are always or almost always provided, I suggest
with similar constants for the inverse operations (TRUNCATE_NOT_WORDSAFE, TRUNCATE_NO_DOTS or similar). If the arguments are often not provided (which, given that they have default values, I suspect might be case), I suggest an l()-style optional array:
Comment #51
Freso CreditAttribution: Freso commented@ bjaspan: Thanks for the excellent input! I'm going to go with the
l()
-style optional array, as that would also allow for further values (e.g. further customisation of the "dots", per #179521: New function to trim strings), without breaking other functions already using this one. I'm going to do some grocery shopping, but I'll get right on it when I get back! :)Comment #52
Freso CreditAttribution: Freso commentedHere's a patch incorporating
l()
-style$options
, and even adds a parameter to alter what the "dots" should be (per #179521: New function to trim strings). Thank you for the wonderful suggestion, bjaspan! :DComment #54
Freso CreditAttribution: Freso commentedOops.
Also: <3 the testing bot.
Comment #57
Freso CreditAttribution: Freso commentedRe-roll and putting it up against the Testbot to see if it still complains. It still needs to have some unit tests included though...
Comment #60
jhodgdonThis is still an issue, and still needs to be fixed.
Related issues that are caused by misuse of this function:
#362609: node teaser generation breaks when using entities (D6)
#269911: Search result trimming should not fall inside HTML entities/tags (D7)
Probably others as well...
Further note: It should be documented that word breaking will NOT work in CJK languages (Chinese, etc.) since it relies on having a space between words, which is not actually true. See
http://drupal.org/node/269911#comment-2795478
My opinion is that the word-breaking feature should be removed, because of this.
Comment #61
jhodgdonI've just filed a different issue to address the problem that the truncate_utf8 function is not good with locales:
#768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug)
Comment #62
jhodgdonOK....
So, on #768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug), we are fixing up truncate_utf8() so that it can be used as a function to truncate strings of any character set to a particular number of characters (not bytes), possibly only on word boundaries, and then append t('...') afterwards if desired. So it will be set up as a great UI-facing substring function.
It looks like we now also have the drupal_truncate_bytes() function, which truncates UTF8 strings safely to a number of bytes. That must have been added in a different patch/issue.
So I think this issue has been resolved, and I'm going to just mark it as a duplicate at this point (duplicate of 768040 and the unknown issue that added the drupal_truncate_bytes() function).