Problem:

It doesn't make any sense to me that FieldPluginBase::trimText() in Views and Unicode::truncate() have slightly different but almost identical behaviour. I could understand this when Views was contrib and wanted to implement its own logic, and I understand it could be beneficial for the two functions to have different arguments, but now that Views is in Core, can we merge the two? We already have visibly diverging handling of truncation now that Views uses "..." and Unicode uses "…".

One difference that I can see is that Views handles HTML entities at the end of strings, whereas the Unicode truncate doesn't have a concept of HTML. If that's the only reason to keep the Views truncation kicking around then that functionality should probably be moved to \Drupal\Component\Utility\Html.

Solution:

- Introduce Html::truncate() that mirrors/wraps Unicode::truncate() but handles HTML entities, etc.
- FieldPluginBase::trimText() simply wraps Html::truncate() and/or Unicode::truncate()

Comments

thedavidmeister’s picture

Title: Views' FieldPluginBase::trimText() and Unicode::truncate() should be functionally identical. » Views' FieldPluginBase::trimText() should use Unicode::truncate() and/or a new Html::truncate()
thedavidmeister’s picture

How each function works currently:

Unicode::truncate() features:

  • doesn't use legacy procedural code
  • uses a real ellipsis character and not just "..."
  • ability to truncate the ellipsis if $max_length is small (questionable utility now that ellipsis is just 1 character)
  • ability to disable "word safe" if the string is under an optional minimum length
  • knows to remove any "." characters from the end of the trimmed string if an ellipsis is added

FieldPluginBase::trimText() features:

  • Removes "scraps" of HTML entities from the end of trimmed strings
  • Optional toggle to ensure that any HTML tags left in the trimmed string are normalized (closed off) automatically

Common functionality:

  • ability to toggle the visibility of an ellipsis
  • ability to enable "word safe" which will only trim on word boundaries
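
The Unicode::truncate() features listed above can be approximated in plain PHP. This is only a minimal sketch for discussion, not Drupal's implementation: truncate_sketch() is a hypothetical name, and the minimum word-safe length option is left out.

```php
<?php
/**
 * Hypothetical sketch of word-safe truncation with an optional ellipsis.
 */
function truncate_sketch(string $string, int $max_length, bool $wordsafe = false, bool $add_ellipsis = false): string {
  if (mb_strlen($string, 'UTF-8') <= $max_length) {
    return $string;
  }
  $ellipsis = $add_ellipsis ? "\u{2026}" : '';
  // Reserve room for the ellipsis character.
  $budget = max($max_length - mb_strlen($ellipsis, 'UTF-8'), 0);
  $cut = mb_substr($string, 0, $budget, 'UTF-8');

  if ($wordsafe) {
    $next = mb_substr($string, $budget, 1, 'UTF-8');
    // Only back up to the previous whitespace when the cut landed mid-word.
    if (!preg_match('/^\s$/u', $next) && preg_match('/^(.*)\s+\S*$/us', $cut, $matches)) {
      $cut = $matches[1];
    }
  }
  if ($add_ellipsis) {
    // Drop trailing whitespace and "." so the ellipsis does not follow a dot.
    $cut = rtrim($cut, " \t\r\n.");
  }
  return $cut . $ellipsis;
}

echo truncate_sketch('foo bar baz foo', 10, true, true), "\n";
```

With word safe enabled the cut backs up to the end of "bar" rather than splitting "baz".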

Based on the above, I could see Html::truncate() looking as simple as:

function truncate($html, $max_length, $word_safe = FALSE, $add_ellipsis = FALSE, $min_wordsafe_length = 1) {
  $value = Unicode::truncate($html, $max_length, $word_safe, $add_ellipsis, $min_wordsafe_length);

  // Remove scraps of HTML tags and entities from the end of the string.
  $value = rtrim(preg_replace('/(?:<(?!.+>)|&(?!.+;)).*$/us', '', $value));

  // Close off any tags left open by the cut.
  return Html::normalize($value);
}

and then trimText() looking like:

function trimText($alter, $value) {
  if (!empty($alter['html'])) {
    $value = Html::truncate($value, $alter['max_length'], $alter['word_boundary'], $alter['ellipsis']);
  }
  else {
    $value = Unicode::truncate($value, $alter['max_length'], $alter['word_boundary'], $alter['ellipsis']);
  }
  return $value;
}
thedavidmeister’s picture

Title: Views' FieldPluginBase::trimText() should use Unicode::truncate() and/or a new Html::truncate() » Add a way to truncate HTML strings without counting or damaging HTML elements (and use it in Views) - Html::truncate()

So, Views isn't great because, unless I've read the function wrong, truncating the following to 10 characters:

<strong>foo bar baz foo</strong>

Gives:

<strong>fo</strong>

Which is technically correct I suppose, but I'm sure it isn't what most people *intend* when dealing with HTML.
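
To make the counting explicit, here is a small standalone sketch in plain PHP. It does not call Views; it only shows that a character count taken on the raw markup spends 8 of the 10 characters on the opening tag, whereas counting the visible text would keep far more of it.

```php
<?php
$html = '<strong>foo bar baz foo</strong>';

// Counting raw markup: "<strong>" alone uses up 8 of the 10 characters.
$raw_cut = mb_substr($html, 0, 10, 'UTF-8');

// Counting only the visible text, with the tags stripped first.
$visible_cut = mb_substr(strip_tags($html), 0, 10, 'UTF-8');

echo $raw_cut, "\n";      // <strong>fo
echo $visible_cut, "\n";  // foo bar ba
```

Closing the open tag in the first result gives the `<strong>fo</strong>` shown above.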

Woah http://alanwhipple.com/2011/05/25/php-truncate-string-preserving-html-ta...

mgifford’s picture

Issue tags: +typography

You've made a good case for merging the two. Interesting link to http://alanwhipple.com

thedavidmeister’s picture

Category: Task » Feature request

This is sort of a feature request... I guess...

thedavidmeister’s picture

Category: Feature request » Bug report

More things wrong with the Views approach... Because of the regex being run, an existing malformed HTML entity inside a chunk of text will cause the whole thing to be chopped right back to the entity.

sometext &nbsp more text

becomes:

sometext

This is also sort of a bug report because Views seems pretty buggy (or at least naive) at the moment.
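
The behaviour can be reproduced outside Views with the same pattern quoted earlier in this issue. This is a standalone sketch; strip_scraps() is a hypothetical wrapper name, not a Views function.

```php
<?php
// The "scraps" pattern quoted earlier in this issue.
const SCRAPS_PATTERN = '/(?:<(?!.+>)|&(?!.+;)).*$/us';

// Hypothetical wrapper: applies the pattern, then trims trailing whitespace.
function strip_scraps(string $value): string {
  return rtrim(preg_replace(SCRAPS_PATTERN, '', $value));
}

// A malformed entity in the middle of the text takes everything after it along.
$malformed = strip_scraps('sometext &nbsp more text');

// A well-formed entity is left alone.
$wellformed = strip_scraps('sometext &nbsp; more text');

// The intended case: a half-cut entity at the very end of the string.
$scrap = strip_scraps('foo &am');

echo $malformed, "\n";   // sometext
echo $wellformed, "\n";  // sometext &nbsp; more text
echo $scrap, "\n";       // foo
```

The pattern cannot tell an entity that was cut by truncation from an ampersand that was never followed by a semicolon in the first place.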

mgifford’s picture

Component: markup » views.module

I think this has to be resolved by the Views folks.

thedavidmeister’s picture

Component: views.module » markup

No, I don't think this should be something in Views!

This functionality should be lower level than that, and Views should simply use it.

I suspect that the only reason it was ever in Views in the first place is because Core has never provided something decent to achieve this totally normal functionality.

mgifford’s picture

Are you going to have time to write up a patch with Html::truncate() & FieldPluginBase::trimText()?

thedavidmeister’s picture

I've been thinking about it on and off. Not sure the best way yet...

I put the start of a sandbox up at https://github.com/thedavidmeister/html_truncate_sandbox

The bit I'm wondering about atm:

Say you have 'foo&nbsp;bar' and you want to truncate it at 5 characters, what you want to see is "foo&nbsp;b" without word safe, and "foo" with word safe.

If we move the cursor to "foo&n", which is 5 characters, we can't easily know that "&n" is actually the start of &nbsp; (which we'd count as one character).. this messes with our counting.
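
A quick standalone sketch of the miscount in plain PHP (nothing here is Drupal API): the raw string is 12 characters long, the rendered text is 7, and a raw cut at 5 lands inside the entity.

```php
<?php
$html = 'foo&nbsp;bar';

// Length of the markup as stored.
$raw_length = mb_strlen($html, 'UTF-8');

// Length of what a reader sees, with the entity decoded to one character.
$visible = html_entity_decode($html, ENT_QUOTES | ENT_HTML401, 'UTF-8');
$visible_length = mb_strlen($visible, 'UTF-8');

// Cutting the raw markup at 5 characters splits the entity.
$raw_cut = mb_substr($html, 0, 5, 'UTF-8');

echo $raw_length, ' ', $visible_length, ' ', $raw_cut, "\n";  // 12 7 foo&n
```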

This is basically where I assume the Views people got to, which led to what I was complaining about in #8

mgifford’s picture

Nice to have this here https://github.com/thedavidmeister/html_truncate_sandbox/blob/master/src...

Would something like this account for the character entities?

        if (!$wordsafe) {
          $delta = $counter - $maxlength;
          $fragment = mb_substr($token, 0, $delta);
          $newtext[] = $fragment;
        }
        elseif (strpos($str, '&') !== FALSE) {
          $maxlength = $maxlength + 4;
        }

We just need a bit of an extra buffer in the function to accommodate for the entities, right?

thedavidmeister’s picture

Well, not exactly that, because some HTML entities are longer than that, like &curren;.

I was actually thinking of using get_html_translation_table(), then getting the length of the longest entity from that and doing something similar to what you suggested.
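
Roughly, and only as a sketch, the lookup could be done like this. The HTML 4.01 entity set is an assumption here, and the table only contains named entities, so numeric references such as &#8230; would not be covered by this buffer.

```php
<?php
// Map of characters to their named entities (HTML 4.01 set, an assumption).
$table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// Length in bytes of the longest named entity, including "&" and ";".
$longest_length = max(array_map('strlen', $table));

echo $longest_length, "\n";
```

That value would then be the extra look-ahead buffer instead of a fixed 4.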

mgifford’s picture

I've been trying to think of an elegant way to use PHP's get_html_translation_table, but I'm coming up short. We first need to determine if there is a "&" within the first few characters being truncated. We'd then need to isolate that html entity to determine how long it is. Finally we'd adjust the maxsize to account for that.

What if, for every string, we just used html_entity_decode to convert the entities to single characters, then calculated the maxsize, before finally adding the entities back in with htmlentities?

I do worry about performance for doing this type of check, although it would have already been fairly well optimized in PHP I would assume.

thedavidmeister’s picture

"What if, for every string, we just used html_entity_decode to convert the entities to single characters, then calculated the maxsize, before finally adding the entities back in with htmlentities?"

That could potentially work, would have to write some more tests to see if we can break that.
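
As a starting point for those tests, a minimal sketch of the decode, truncate, re-encode idea. truncate_entities_sketch() is a hypothetical name, and the sketch assumes the input contains no tags, since htmlentities() would escape them.

```php
<?php
/**
 * Hypothetical sketch: truncates tag-free HTML by visible characters.
 */
function truncate_entities_sketch(string $html, int $max_length): string {
  $flags = ENT_QUOTES | ENT_HTML401;
  // Turn every entity into the single character it represents.
  $text = html_entity_decode($html, $flags, 'UTF-8');
  // Count and cut on visible characters.
  $text = mb_substr($text, 0, $max_length, 'UTF-8');
  // Re-encode. Any "<" or ">" in the input would be escaped at this point.
  return htmlentities($text, $flags, 'UTF-8');
}

echo truncate_entities_sketch('foo&nbsp;bar', 5), "\n";  // foo&nbsp;b
```

One thing to watch for is that the round trip is not an identity: a literal "é" in the input comes back as "&eacute;".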

damien tournoud’s picture

If you know that you don't have any HTML tags in the input, just convert to plaintext, do the truncation there and convert back to HTML.

If you have any HTML tags in the input... all bets are off and good luck with that.

thedavidmeister’s picture

"If you have any HTML tags in the input..."

Well that's exactly what Views claims (and has claimed for years) to handle.

damien tournoud’s picture

@thedavidmeister: I don't see any grand claim in the current implementation. It really just tries to not truncate in the middle of an HTML entity, but that's really about it. If you pass it anything with tags, it's going to mess it up pretty nicely.

I stand by #17: there is only one case where truncating HTML is doable, and that is when there are no tags whatsoever. In that case, just convert to plaintext, do the truncation there and convert back to HTML.

If there are any tags, it's basically anyone's guess what the proper behavior should be. Should the result be a truncation of the *visible* text rendered in the browser? How do you know what that is going to be without knowing the CSS context? What is it reasonable to do with other visible elements (images and stuff)?

So I would recommend to stop pretending that we can remotely handle truncation of arbitrary HTML.

mgifford’s picture

At the very least can we move the Views truncation functionality to \Drupal\Component\Utility\Html?

Html::truncate() & FieldPluginBase::trimText() seem like useful central functions even if we don't have a solution for arbitrary HTML.

thedavidmeister’s picture

No "grand claims" for sure, but from the D7 interface, I'll show you where the confusion comes from, for me:

Trim this field to a maximum length
Enable to trim the field to a maximum length of characters

and also

Field can contain HTML
If checked, HTML corrector will be run to ensure tags are properly closed after trimming.

I certainly expect, after reading this in the UI and not reading the code, the following:

- Views is aware of HTML (not limited to HTML entities, it just says "HTML")
- The characters being counted for determining the maximum for trimming won't include invisible characters inside tags; after all, we've told Views that this is an HTML string and there are no caveats listed in the UI
- Views won't do anything at all to my HTML entities, it didn't mention HTML entities once in the UI, why would it be damaging those?

"I stand by #17: there is only one case where truncating HTML is doable, and that is when there are no tags whatsoever. In that case, just convert to plaintext, do the truncation there and convert back to HTML."

What's wrong with the DOMDocument approach - using that to normalize the string, then getting the inner text of tags? That looks like it would work to me.
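
Something along these lines is what I have in mind. It is a rough sketch only: truncate_visible_text_sketch() is a hypothetical name, it has no word safe or ellipsis handling, and it does nothing about the questions of CSS context or images raised above.

```php
<?php
/**
 * Hypothetical sketch: truncates by visible characters, leaving tags intact.
 */
function truncate_visible_text_sketch(string $html, int $max_length): string {
  $previous = libxml_use_internal_errors(true);
  $dom = new DOMDocument();
  // Wrap the fragment in a full document that declares UTF-8.
  $dom->loadHTML(
    '<!DOCTYPE html><html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    . '</head><body>' . $html . '</body></html>'
  );
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  $body = $dom->getElementsByTagName('body')->item(0);
  $xpath = new DOMXPath($dom);
  $remaining = $max_length;

  // Walk the text nodes in document order, spending the character budget.
  foreach (iterator_to_array($xpath->query('.//text()', $body)) as $text) {
    $length = mb_strlen($text->data, 'UTF-8');
    if ($remaining <= 0) {
      // Budget already spent: drop this text entirely.
      $text->parentNode->removeChild($text);
    }
    elseif ($length > $remaining) {
      // The cut falls inside this text node.
      $text->data = mb_substr($text->data, 0, $remaining, 'UTF-8');
      $remaining = 0;
    }
    else {
      $remaining -= $length;
    }
  }

  // Serialize the children of <body>, so the wrapper is not returned.
  $output = '';
  foreach ($body->childNodes as $child) {
    $output .= $dom->saveHTML($child);
  }
  return $output;
}

echo truncate_visible_text_sketch('<strong>foo bar baz foo</strong>', 10), "\n";
```

Elements that end up with no text after the cut are left in place as empty tags here; whether they should be pruned is one of the open design questions.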

"At the very least can we move the Views truncation functionality to \Drupal\Component\Utility\Html?"

You're probably right, this issue could benefit from being broken into two parts - improving the organization/centralization of some decent existing functionality, and then improving said functionality.

Version: 8.0.x-dev » 8.1.x-dev

Drupal 8.0.6 was released on April 6 and is the final bugfix release for the Drupal 8.0.x series. Drupal 8.0.x will not receive any further development aside from security fixes. Drupal 8.1.0-rc1 is now available and sites should prepare to update to 8.1.0.

Bug reports should be targeted against the 8.1.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.9 was released on September 7 and is the final bugfix release for the Drupal 8.1.x series. Drupal 8.1.x will not receive any further development aside from security fixes. Drupal 8.2.0-rc1 is now available and sites should prepare to upgrade to 8.2.0.

Bug reports should be targeted against the 8.2.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

manuel garcia’s picture

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.6 was released on February 1, 2017 and is the final full bugfix release for the Drupal 8.2.x series. Drupal 8.2.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.3.0 on April 5, 2017. (Drupal 8.3.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.3.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.6 was released on August 2, 2017 and is the final full bugfix release for the Drupal 8.3.x series. Drupal 8.3.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.4.0 on October 4, 2017. (Drupal 8.4.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.4.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.4 was released on January 3, 2018 and is the final full bugfix release for the Drupal 8.4.x series. Drupal 8.4.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.5.0 on March 7, 2018. (Drupal 8.5.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.5.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.6 was released on August 1, 2018 and is the final bugfix release for the Drupal 8.5.x series. Drupal 8.5.x will not receive any further development aside from security fixes. Sites should prepare to update to 8.6.0 on September 5, 2018. (Drupal 8.6.0-rc1 is available for testing.)

Bug reports should be targeted against the 8.6.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.6.x-dev » 8.8.x-dev

Drupal 8.6.x will not receive any further development aside from security fixes. Bug reports should be targeted against the 8.8.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.9.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 8.8.x-dev » 8.9.x-dev

Drupal 8.8.7 was released on June 3, 2020 and is the final full bugfix release for the Drupal 8.8.x series. Drupal 8.8.x will not receive any further development aside from security fixes. Sites should prepare to update to Drupal 8.9.0 or Drupal 9.0.0 for ongoing support.

Bug reports should be targeted against the 8.9.x-dev branch from now on, and new development or disruptive changes should be targeted against the 9.1.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 8.9.x-dev » 9.2.x-dev

Drupal 8 is end-of-life as of November 17, 2021. There will not be further changes made to Drupal 8. Bugfixes are now made to the 9.3.x and higher branches only. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.2.x-dev » 9.3.x-dev

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.15 was released on June 1st, 2022 and is the final full bugfix release for the Drupal 9.3.x series. Drupal 9.3.x will not receive any further development aside from security fixes. Drupal 9 bug reports should be targeted for the 9.4.x-dev branch from now on, and new development or disruptive changes should be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.4.x-dev » 9.5.x-dev

Drupal 9.4.9 was released on December 7, 2022 and is the final full bugfix release for the Drupal 9.4.x series. Drupal 9.4.x will not receive any further development aside from security fixes. Drupal 9 bug reports should be targeted for the 9.5.x-dev branch from now on, and new development or disruptive changes should be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

larowlan’s picture

Category: Bug report » Feature request
Issue tags: +Bug Smash Initiative

Adding a new API is a feature request in my book.

larowlan’s picture

Version: 9.5.x-dev » 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 11.x-dev » main

Drupal core is now using the main branch as the primary development branch. New developments and disruptive changes should now be targeted to the main branch.

Read more in the announcement.