Problem:
It doesn't make any sense to me that FieldPluginBase::trimText() in Views and Unicode::truncate() have slightly different but almost identical behaviour. I could understand this when Views was contrib and wanted to implement its own logic, and I understand it could be beneficial for the two functions to have different arguments, but now that Views is in Core, can we merge the two? We already have visibly diverging handling of truncation now that Views uses "..." and Unicode uses "…".
One difference that I can see is that Views handles HTML entities at the end of strings, whereas the Unicode truncate doesn't have a concept of HTML. If that's the only reason to keep the Views truncation kicking around then that functionality should probably be moved to \Drupal\Component\Utility\Html.
Solution:
- Introduce Html::truncate() that mirrors/wraps Unicode::truncate() but handles HTML entities, etc.
- FieldPluginBase::trimText() simply wraps Html::truncate() and/or Unicode::truncate()
| Comment | File | Size | Author |
|---|---|---|---|
| #21 | Screen Shot 2014-07-20 at 1.06.51 pm.png | 106.96 KB | thedavidmeister |
Comments
Comment #1
thedavidmeister commentedComment #2
thedavidmeister commentedHow each function works currently:
Unicode::truncate() features:
FieldPluginBase::trimText() features:
Common functionality:
Based on the above, I could see Html::truncate() looking something like something as simple as:
and then trimText() looking like:
Comment #3
thedavidmeister commentedSo, Views isn't great because, unless I've read the function wrong, truncating the following to 10 characters:
<strong>foo bar baz foo</strong>Gives:
<strong>fo</strong>Which is technically correct I suppose, but I'm sure not many people's *intention* when dealing with HTML.
Woah http://alanwhipple.com/2011/05/25/php-truncate-string-preserving-html-ta...
Comment #4
thedavidmeister commentedFrom IRC:
https://ideone.com/5Gbe5J
https://ideone.com/u0cd9c
https://ideone.com/gm7Zgb
https://ideone.com/0fZUBk
Comment #5
mgiffordYou've made a good case for merging the two. Interesting link to http://alanwhipple.com
Comment #6
thedavidmeister commentedmore reading - http://www.pjgalbraith.com/2011/11/truncating-text-html-with-php/
Comment #7
thedavidmeister commentedThis is sort of a feature request... I guess...
Comment #8
thedavidmeister commentedMore things wrong with the Views approach... Because of the regex being run, an existing malformed HTML entity inside a chunk of text will cause the whole thing to be chopped right back to the entity.
sometext   more textbecomes:
sometextThis is also sort of a bug report because Views seems pretty buggy (or at least naive) at the moment.
Comment #9
mgiffordI think this has to be resolved by the Views folks.
Comment #10
thedavidmeister commentedNo, I don't think this should be something in Views!
This functionality should be lower level than that, and Views should simply use it.
I suspect that the only reason it was ever in Views in the first place is because Core has never provided something decent to achieve this totally normal functionality.
Comment #11
mgiffordAre you going to have time to write up a patch with Html::truncate() & FieldPluginBase::trimText()?
Comment #12
thedavidmeister commentedI've been thinking about it on and off. Not sure the best way yet...
I put the start of a sandbox up at https://github.com/thedavidmeister/html_truncate_sandbox
The bit I'm wondering about atm:
Say you have
'foo bar'and you want to truncate it at 5 characters, what you want to see is"foo b"without word safe, and "foo" with word safe.If we move the cursor to "foo&n", which is 5 characters, we can't easily know that "&n" is actually the start of
(which we'd count as one character).. this messes with our counting.This is basically where I assume the Views people got to, which led to what I was complaining about in #8
Comment #13
mgiffordNice to have this here https://github.com/thedavidmeister/html_truncate_sandbox/blob/master/src...
Would something like this account for the character entities?
We just need a bit of an extra buffer in the function to accommodate for the entities, right?
Comment #14
thedavidmeister commentedwell not exactly that, because some html entities are longer than that, like
¤.I was actually thinking using
get_html_translation_table(), then getting the length of the longest entity from that and doing something similar to what you suggested.Comment #15
mgiffordI've been trying to think of an elegant way to use PHP's get_html_translation_table, but I'm coming up short. We first need to determine if there is a "&" within the first few characters being truncated. We'd then need to isolate that html entity to determine how long it is. Finally we'd adjust the maxsize to account for that.
What about if in every string we just used html_entity_decode to convert them to single characters, then we calculated the maxsize, before finally adding back in the entities with htmlentities.
I do worry about performance for doing this type of check, although it would have already been fairly well optimized in PHP I would assume.
Comment #16
thedavidmeister commentedThat could potentially work, would have to write some more tests to see if we can break that.
Comment #17
damien tournoud commentedIf you know that you don't have any HTML tags in the input, just convert to plaintext, do the truncation there and convert back to HTML.
If you have any HTML tags in the input... all bets are off and good luck with that.
Comment #18
thedavidmeister commentedWell that's exactly what Views claims (and has claimed for years) to handle.
Comment #19
damien tournoud commented@thedavidmeister: I don't see any grand claim in the current implementation. It really just tries to not truncate in the middle of an HTML entity, but that's really about it. If you pass it anything with tags, it's going to mess it up pretty nicely.
I stand by #17: there is only one case truncating HTML is doable, and it's when there is no tags whatsoever. In that case, just convert to plaintext, do the truncation there and convert back to HTML.
If there are any tags, it's basically anyone's guess what the proper behavior should be. Should the result be a truncation of the *visible* text rendered in the browser? How do you know what that is going to be without knowing the CSS context? What is it reasonable to do with other visible elements (images and stuff)?
So I would recommend to stop pretending that we can remotely handle truncation of arbitrary HTML.
Comment #20
mgiffordAt the very least can we move the Views truncation functionality to \Drupal\Component\Utility\Html?
Html::truncate() & FieldPluginBase::trimText() seem like useful central functions even if we don't have a solution for arbitrary HTML.
Comment #21
thedavidmeister commentedNo "grand claims" for sure, but from the D7 interface, I'll show you where the confusion comes from, for me:
and also
I certainly expect, after reading this in the UI and not reading the code, the following:
- Views is aware of HTML (not limited to HTML entities, it just says "HTML")
- The characters being counted for determining the maximum for trimming wont include invisible characters inside tags, after all, we've told Views that this is an HTML string and there's no caveats listed in the UI
- Views won't do anything at all to my HTML entities, it didn't mention HTML entities once in the UI, why would it be damaging those?
What's wrong with the DOMDocument approach - using that to normalize the string, then getting the inner text of tags? That looks like it would work to me.
You're probably right, this issue could benefit from being broken into two parts - improving the organization/centralization of some decent existing functionality, and then improving said functionality.
Comment #24
manuel garcia commentedComment #35
larowlanAdding a new API is a feature request in my book.
Comment #36
larowlan