Problem/Motivation

Many HTML tags add horizontal space or a line break between them when rendered.
PlainTextOutput::renderFromHtml() simply removes HTML tags but does not add spaces, which can lead to a result that is surprising for users.

Steps to reproduce

Call PlainTextOutput::renderFromHtml('<p>Foo</p><p>Bar</p>'); .
The result will be FooBar but it would make more sense to have Foo Bar.

We noticed this when using HTML fields as metatag tokens. Some sentences are joined together without a space between them.

Proposed resolution

The method could add a space between each tag before stripping the tags.

Remaining tasks

Review

User interface changes

NA

Introduced terminology

NA

API changes

NA

Data model changes

NA

Release notes snippet

NA

CommentFileSizeAuthor
#7 Screenshot 2023-07-11 at 5.36.21 PM.png48.9 KBkeshavv

Issue fork drupal-3372446

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Comments

prudloff created an issue. See original summary.

prudloff’s picture

Status: Active » Needs review
cilefen’s picture

Version: 9.5.x-dev » 11.x-dev
smustgrave’s picture

Status: Needs review » Needs work
Issue tags: +Needs test
cilefen’s picture

Issue tags: -Needs test +Needs tests

😉

keshavv’s picture

StatusFileSize
new48.9 KB

I have tested the MR with this code snippet.

print_r(\Drupal\Component\Render\PlainTextOutput::renderFromHtml('<div class="test">Some content</div><p>Foo</p><p>Bar</p>'));

It results
After Patch

prudloff’s picture

Status: Needs work » Needs review
Issue tags: -Needs tests

I added some tests.

smustgrave’s picture

Issue summary: View changes
Status: Needs review » Needs work

Fixed up summary just slightly.

Left 1 comment on MR

If you are another contributor eager to jump in, please allow the previous poster at least 48 hours to respond to feedback first, so they have the opportunity to finish what they started!

smustgrave’s picture

Status: Needs work » Reviewed & tested by the community

You are absolutely correct. That was the only feedback I had

nod_’s picture

Status: Reviewed & tested by the community » Needs work

I'll put that at least to NW, it's possibly a won't fix situation depending on how much exists out there that could help with the situation.

So the fix here is to add a space between all tags, this breaks pretty fast, for example:

<strong>test</strong><sup>nospace!</sup>

Here we would not expect a space to be added. Whitespace in HTML is very complex, see https://blog.dwac.dev/posts/html-whitespace/ so hand crafting rules by hand is not a reasonable solution. We could parse the string as HTML and use textContent property from DOMNode and hope it does things correctly with html5.

prudloff’s picture

We could parse the string as HTML and use textContent property from DOMNode and hope it does things correctly with html5.

I did a quick test like this:

$dom = \Drupal\Component\Utility\Html::load('<p>Giraffes.</p><p>Wombats.</p>'); echo $dom->textContent;

But it seems it never adds any space.

A more robust solution would be to use the html2text library but it would be a bit overkill to add a new dependency just for this.

prudloff’s picture

Status: Needs work » Closed (won't fix)

I did more tests and I agree we won't be able to handle various scenarios without having complex rules or add a new dependency, so I'm closing as wontfix.
Sites that need this can use a library like html2text instead.

anybody’s picture

We recently ran into this with Metatag module and created #3555964: PlainTextOutput::renderFromHtml removes important whitespace from HTML as we didn't find this one.

Over there, I suggested the following solutions:

From my perspective, this definitely isn't a feature request, but a (major) bug! The issue with this is, that the outcome of the function is wrong and unexpected and it's used in several places in core and contrib.

One super widely used that has wrong results that are a real-world issue for SEO is Metatags: #3555694: :summary Tokens with HTML are shortened too much - first strip HTML, then trim

Should we reopen this and call it a bug or proceed in #3555964: PlainTextOutput::renderFromHtml removes important whitespace from HTML?

Using a library here instead of the poor-man's strip_tags() might also be able to fix #2896735: PlainTextOutput::renderFromHtml may render potentially dangerous output

I totally understand that this was closed as feature-request, but after I saw the consequences in Metatag and potentially other core places, I'd strongly vote to improve the situation someway.

michael.j.gleeson’s picture

Have to second this,
It is a bug, the metatags are not rendering correctly.

I am getting spaces not being rendered at the end of sentences and headers running into paragraphs.

The HTML tag stripping is not allowing for common sense white spaces.

e.g.

meta description="HeaderThis is my paragraph......"

michael.j.gleeson’s picture

For those finding this issue one solution is to use Smart trim, See documentation