Problem/Motivation

Under certain circumstances, Drupal\Component\Utility\Html::normalize() (in D7, it's _filter_htmlcorrector()) will add messy </body></html> to the resulting HTML. This happens when the HTML ends in the middle of an attribute, for example:

<p>Here <img alt="ao

This will produce output like:

You can reproduce on Drupal 7 or 8 by following these steps:

Drupal 8

  1. Install Drupal 8.3.1 with the standard profile
  2. Go to /admin/structure/types/manage/article/display/teaser and configure the "Body" filed to trim at 20 characters
  3. Go to /admin/config/content/formats/manage/basic_html and both (a) enable the "Correct faulty and chopped off HTML" filter and (b) disable the "Restrict images to this site" filter (only necessary for the example HTML, not necessary to trigger the bug)
  4. Go to /node/add/article and use this HTML as the body (be sure to click the "Source" button in the WYSIWYG toolbar before pasting it, otherwise you're adding text not HTML):
    Here <img alt="aoeunhteoas unthoaesn theoausnth oaesntheo asnthoae" src="http://flowjournal.org/wp-content/uploads/2011/12/Im-Not-Here.png"  /> it is
    
  5. Go to /node and observe output like in the screenshot

Drupal 7

  1. Install Drupal 7.54 with the standard profile
  2. Go to /admin/structure/types/manage/article/display/teaser and configure the "Body" filed to trim at 20 characters
  3. Go to /admin/config/content/formats/filtered_html and add <img> to the "Allowed HTML tags" (under "Limit allowed HTML tags")
  4. Go to /node/add/article and use this HTML as the body:
    Here <img alt="aoeunhteoas unthoaesn theoausnth oaesntheo asnthoae" src="http://flowjournal.org/wp-content/uploads/2011/12/Im-Not-Here.png"  /> it is
    
  5. Go to /node and observe output like in the screenshot

Proposed resolution

Remove the potentially-misinterpreted </body></html> closing tags that \Drupal\Component\Utility\Html::load() adds to the end of the text before parsing into the DOM, as they are not required for the HTML soup to be successfully parsed.

By the way, \Drupal\views\Plugin\views\field\FieldPluginBase::trimText() has some code to avoid this exact problem when trimming fields:

      // Remove scraps of HTML entities from the end of a strings
      $value = rtrim(preg_replace('/(?:<(?!.+>)|&(?!.+;)).*$/us', '', $value));

Remaining tasks

  1. Consider backporting patch to Drupal 7 and Drupal 10

User interface changes

None.

API changes

None.

Data model changes

None.

Original summary

_filter_htmlcorrector leaves in fragmentary tags that may be passed in and break the rest of the page. This is most liable to happen when using the "Field can contain HTML" filter in the Views module, but could also occur any other time a developer were to trim a string that contains HTML and pass it to this function.

Example (trimmed to 250 chars):

Lorem ipsum dolor sit amet, consectetur adipiscing elit. <strong>Aliquam posuere enim</strong>. Sed ultrices semper tortor. Pellentesque cenim consectetur. Nulla sed risus eu ipsum venenatis <a class="sample" href="http://www.example.com/partial/path

Output is identical to input, breaking any HTML that follows on the page. Ideal output would be:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. <strong>Aliquam posuere enim</strong>. Sed ultrices semper tortor. Pellentesque cenim consectetur. Nulla sed risus eu ipsum venenatis 

Patch attached.

Issue fork drupal-418620

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

seaneffel’s picture

Confirming the problem, when selecting the "Field can contain HTML" as on or off, the resulting field always displays clipped HTML tags. Glad someone saw this, I thought it was my bad CSS on the theme layer. Here is a sample of an unclosed a and p tag that clued me in to the problem:

<div class="field-content"><p>This Week in NeighborMedia: a recap of the week's captivating Cambridge stories produced by <a href="http://www.cctvcambridge.org/...</div>

The ellipsis comes from the field setting in views.

I'll try the patch and report back.

vann’s picture

This patch corrects the same problem I was having, however it doesn't correct the problem if the "Trim only on a word boundary" checkbox is checked in Views. Since I don't know what code is executed to trim the text, I can't tell precisely where the code is deciding where the "word boundary" is, but it clearly thinks it's somewhere inside a tag. The result is the trimmed tag is incorrectly displayed as text to the end user.

Here is the input and the incorrect output with a trim length of 500, only on word boundary:

input:

<div class="snap_preview"><br />
<p>Listen to an interview with Gustavo Vilchis on Worldview on Chicago Public Radio!</p>
</div>
<div class="snap_preview"><br />
<p>Listen to an interview with Gustavo Vilchis on Worldview on Chicago Public Radio!</p>
<ul>
    <li><a href="http://www.chicagopublicradio.org/Program_WV.aspx?episode=30344">http://www.chicagopublicradio.org/Program_WV.aspx?episode=30344</a></li>
</ul>
<a rel="nofollow"><img alt="" style="border: 0px;" src="http://feeds.wordpress.com/1.0/comments/teachingrebellion.wordpress.com/353/" /></a> <a rel="nofollow"><img alt="" style="border: 0px;" src="http://feeds.wordpress.com/1.0/delicious/teachingrebellion.wordpress.com/353/" /></a> <a rel="nofollow"><img alt="" style="border: 0px;" src="http://feeds.wordpress.com/1.0/stumble/teachingrebellion.wordpress.com/353/" /></a> <a rel="nofollow"><img alt="" style="border: 0px;" src="http://feeds.wordpress.com/1.0/digg/teachingrebellion.wordpress.com/353/" /></a> <a rel="nofollow"><img alt="" style="border: 0px;" src="http://feeds.wordpress.com/1.0/reddit/teachingrebellion.wordpress.com/353/" /></a> <img alt="" style="border: 0px;" src="http://stats.wordpress.com/b.gif?host=teachingrebellion.wordpress.com&amp;blog=5191692&amp;post=353&amp;subd=teachingrebellion&amp;ref=&amp;feed=1" /></div>

output:

<div class="snap_preview"><br/>
<p>Listen to an interview with Gustavo Vilchis on Worldview on Chicago Public Radio!</p>
<ul>
    <li><a href="http://www.chicagopublicradio.org/Program_WV.aspx?episode=30344" class="views-processed">http://www.chicagopublicradio.org/Program_WV.aspx?episode=30344</a></li>
</ul>
<a rel="nofollow" class="views-processed"><img style="border: 0px none ;" src="http://feeds.wordpress.com/1.0/comments/teachingrebellion.wordpress.com/353/" alt=""/></a> <a rel="nofollow" class="views-processed">img alt="" src="http://feeds.wordpress.com/1.0/</a></div>
multiplextor’s picture

Status: Needs review » Closed (won't fix)

Closed. The reason: expired.

dsnopek’s picture

Title: _filter_htmlcorrector leaves tag fragments » _filter_htmlcorrector() leaves messy </body></html> in certain situations
Version: 6.10 » 7.x-dev
Issue summary: View changes
Status: Closed (won't fix) » Active
FileSize
12.63 KB

Update issues summary because this problem still exists in Drupal 7 (might be present in Drupal 8 too)

dsnopek’s picture

Status: Active » Needs review
FileSize
537 bytes

Here's a Drupal 7 patch that works in my testing (based on Views code)

dsnopek’s picture

Drupal 8 theoretically has the same issue. In D8, the function is Drupal\Component\Utility\Html::normalize(), which uses pretty much the exact same code as _filter_htmlcorrector(). If I run this code via Devel or drush, it produces the same messy HTML as in Drupal 7:

use Drupal\Component\Utility\Html;
print Html::normalize('<p>Here <img alt="ao');

However, I can't seem to be able to reproduce on Drupal 8 using similar steps as I can on Drupal 7! I'm not sure what's different, but maybe we can backport whatever that is?

Status: Needs review » Needs work

The last submitted patch, 5: drupal7-htmlcorrector-tag-fragments-418620-5.patch, failed testing.

dsnopek’s picture

Title: _filter_htmlcorrector() leaves messy </body></html> in certain situations » Drupal\Component\Utility\Html::normalize() leaves messy </body></html> in certain situations
Version: 7.x-dev » 8.3.x-dev
Issue summary: View changes

Aha! The difference is that the "Correct faulty and chopped off HTML" filter isn't enabled on the "Basic HTML" text format by default on Drupal 8. So, I can reproduce on D8 now too!

I've updated the issue summary to reflect this.

dsnopek’s picture

Here's an updated Drupal 7 patch which should fix the failed test. It doesn't add new test coverage for this specific bug - which should really be added - for now, it just fixes the existing tests.

Status: Needs review » Needs work

The last submitted patch, 9: drupal7-htmlcorrector-tag-fragments-418620-9.patch, failed testing.

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.6 was released on August 2, 2017 and is the final full bugfix release for the Drupal 8.3.x series. Drupal 8.3.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.4.0 on October 4, 2017. (Drupal 8.4.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.4.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.4 was released on January 3, 2018 and is the final full bugfix release for the Drupal 8.4.x series. Drupal 8.4.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.5.0 on March 7, 2018. (Drupal 8.5.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.5.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.6 was released on August 1, 2018 and is the final bugfix release for the Drupal 8.5.x series. Drupal 8.5.x will not receive any further development aside from security fixes. Sites should prepare to update to 8.6.0 on September 5, 2018. (Drupal 8.6.0-rc1 is available for testing.)

Bug reports should be targeted against the 8.6.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.6.x-dev » 8.8.x-dev

Drupal 8.6.x will not receive any further development aside from security fixes. Bug reports should be targeted against the 8.8.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.9.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 8.8.x-dev » 8.9.x-dev

Drupal 8.8.7 was released on June 3, 2020 and is the final full bugfix release for the Drupal 8.8.x series. Drupal 8.8.x will not receive any further development aside from security fixes. Sites should prepare to update to Drupal 8.9.0 or Drupal 9.0.0 for ongoing support.

Bug reports should be targeted against the 8.9.x-dev branch from now on, and new development or disruptive changes should be targeted against the 9.1.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 8.9.x-dev » 9.2.x-dev

Drupal 8 is end-of-life as of November 17, 2021. There will not be further changes made to Drupal 8. Bugfixes are now made to the 9.3.x and higher branches only. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.2.x-dev » 9.3.x-dev

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.15 was released on June 1st, 2022 and is the final full bugfix release for the Drupal 9.3.x series. Drupal 9.3.x will not receive any further development aside from security fixes. Drupal 9 bug reports should be targeted for the 9.4.x-dev branch from now on, and new development or disruptive changes should be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

smustgrave’s picture

Status: Needs work » Postponed (maintainer needs more info)

Wondering if this is still an issue in D9,5?

Normalize() seems to be a different from when this ticket was opened.

mfb’s picture

Version: 9.4.x-dev » 9.5.x-dev
Status: Postponed (maintainer needs more info) » Needs work

I followed the instructions in the issue summary to verify, and yes, unfortunately this is still an issue in D9.5 and D10.1

smustgrave’s picture

Status: Needs work » Needs review
Issue tags: +Bug Smash Initiative
FileSize
833 bytes
1.37 KB

Not really sold on this solution but the regex using for D7 didn't work on D9.5

But added some tests

The last submitted patch, 21: 418620-21-tests-only.patch, failed testing. View results

Version: 9.5.x-dev » 10.1.x-dev

Drupal 9.5.0-beta2 and Drupal 10.0.0-beta2 were released on September 29, 2022, which means new developments and disruptive changes should now be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

smustgrave’s picture

Wonder if ckeditor5 has any bearing on this.

catch’s picture

Status: Needs review » Needs work

ckeditor5 won't make any difference here. There is #2441811: Upgrade filter system to HTML5 but I don't think that really affects this either.

Added test coverage looks good, I'm also not entirely sure about the string replace but don't really have a better idea either. However if we go that way we should add a comment.

smustgrave’s picture

Status: Needs work » Needs review
FileSize
1.68 KB
1.42 KB

Added comment

larowlan’s picture

+++ b/core/lib/Drupal/Component/Utility/Html.php
@@ -252,7 +252,9 @@ public static function resetSeenIds() {
+    // Remove any leftover <body> or <html> tags.
+    return str_replace(['&lt;/body&gt;', '&lt;/html&gt;'], '', $serialize);

Super nit, but they're technically 'closing' tags.

What I don't understand is how the </body> and </html> get there - \Drupal\Component\Utility\Html::serialize works by finding the body element and then looping over its child elements, so the wrapping html and body elements shouldn't even be in the picture.

I think we need to do some more forensics here.

mfb’s picture

As far as I understand, the HTML-encoded closing tags come from the load() method, where they are parsed as part of the cut-off attribute, not closing tags, and they get "repaired" (i.e. HTML-encoded) to make them valid as part of the attribute.

borisson_’s picture

Status: Needs review » Reviewed & tested by the community

What I don't understand is how the and get there - \Drupal\Component\Utility\Html::serialize works by finding the body element and then looping over its child elements, so the wrapping html and body elements shouldn't even be in the picture.

They are added because of the reasons mentioned by @mfb in #28.
https://3v4l.org/r6nNq#v8.1.18 is a 3v4l with the minimum of load & serialize. I think this could also be a bug in php core's dom handling, but I think we should just fix it like in #26.
Setting this back to rtbc.

olli’s picture

#26 seems to remove any closing body and html elements, also intentional ones inside code blocks.

Should we instead try to fix this somehow in Html::load() that adds them? https://3v4l.org/vm89s

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 26: 418620-26.patch, failed testing. View results

Version: 10.1.x-dev » 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch, which currently accepts only minor-version allowed changes. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

mfb’s picture

Yes, it actually seems quite reasonable to just leave out the potentially-misinterpreted </body></html> and rely on DOMDocument to deal with the HTML soup. Let's see if tests pass.

Status: Needs review » Needs work

The last submitted patch, 33: 418620-33.patch, failed testing. View results

mfb’s picture

Status: Needs work » Needs review
FileSize
584 bytes
2.21 KB

Well, the one failing test honestly seems like an improvement, so I'm fixing the test.

smustgrave’s picture

Status: Needs review » Reviewed & tested by the community

Seems odd the closing bracket was the failing test but agreed.

mfb’s picture

Issue summary: View changes

Clarified proposed resolution and remaining tasks. Btw, it looks like these closing tags should still be removed even with #2441811: Upgrade filter system to HTML5, which is not surprising since DOMDocument is still in use under the hood.

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 35: 418620-35.patch, failed testing. View results

borisson_’s picture

Status: Needs work » Reviewed & tested by the community

Known fail

alexpott’s picture

Status: Reviewed & tested by the community » Needs work

We recently committed #2441811: Upgrade filter system to HTML5 which makes this one not apply.

mfb’s picture

Status: Needs work » Needs review
FileSize
2.08 KB

Rerolled.

alexpott’s picture

Status: Needs review » Needs work

I think I agree with #33 and

Yes, it actually seems quite reasonable to just leave out the potentially-misinterpreted </body></html> and rely on DOMDocument to deal with the HTML soup. Let's see if tests pass.

+++ b/core/tests/Drupal/Tests/Core/EventSubscriber/RssResponseRelativeUrlFilterTest.php
@@ -49,8 +49,7 @@ public function providerTestOnResponse() {
   <item>
      <title>Drupal 8 turns one!</title>
      <link>https://www.drupal.org/blog/drupal-8-turns-one</link>
-     <description>&lt;a href="localhost/node/1"&gt;Hello&lt;/a&gt;
-    </description>
+     <description>&lt;a href="localhost/node/1"&gt;Hello&lt;/a&gt;</description>
   </item>
   </channel>
 </rss>

This change is not longer needed.

longwave made their first commit to this issue’s fork.

longwave’s picture

Status: Needs work » Needs review

Discussed with @alexpott. It seems that most of the HTML boilerplate in Html::load() is no longer required and we can get away with just parsing <body> . $html.

  • Doctype is not needed, if I reserialize the DOMDocument then the doctype is added.
  • Encoding meta tag is not needed, UTF-8 is the default but we can also specify this to the parser.
  • <html> tag is not needed, again if I reserialize the document then the tag is added around <body>.

Opened an MR with the new test and my suggested changes.

mfb’s picture

Changes look good to me (especially getting rid of the ancient Content-Type meta tag :)

Status: Needs review » Needs work

The last submitted patch, 41: 418620-40.patch, failed testing. View results

longwave’s picture

Status: Needs work » Needs review

Added the comment, reworded slightly.

smustgrave’s picture

Status: Needs review » Reviewed & tested by the community

Looks good! Big win getting #2441811: Upgrade filter system to HTML5 to land!

alexpott’s picture

Status: Reviewed & tested by the community » Fixed

Committed f101ff7 and pushed to 11.x. Thanks!

  • alexpott committed f101ff7f on 11.x
    Issue #418620 by mfb, smustgrave, longwave, dsnopek, greenbeans,...
mfb’s picture

This bugfix will need a little tweaking if folks want to backport to Drupal 7 or Drupal 10.1, but should be easy..

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.