Problem/Motivation
Under certain circumstances, Drupal\Component\Utility\Html::normalize()
(in D7, it's _filter_htmlcorrector()
) will add messy </body></html>
to the resulting HTML. This happens when the HTML ends in the middle of an attribute, for example:
<p>Here <img alt="ao
This will produce output like:
You can reproduce on Drupal 7 or 8 by following these steps:
Drupal 8
- Install Drupal 8.3.1 with the standard profile
- Go to
/admin/structure/types/manage/article/display/teaser
and configure the "Body" filed to trim at 20 characters - Go to
/admin/config/content/formats/manage/basic_html
and both (a) enable the "Correct faulty and chopped off HTML" filter and (b) disable the "Restrict images to this site" filter (only necessary for the example HTML, not necessary to trigger the bug) - Go to
/node/add/article
and use this HTML as the body (be sure to click the "Source" button in the WYSIWYG toolbar before pasting it, otherwise you're adding text not HTML):
Here <img alt="aoeunhteoas unthoaesn theoausnth oaesntheo asnthoae" src="http://flowjournal.org/wp-content/uploads/2011/12/Im-Not-Here.png" /> it is
- Go to
/node
and observe output like in the screenshot
Drupal 7
- Install Drupal 7.54 with the standard profile
- Go to
/admin/structure/types/manage/article/display/teaser
and configure the "Body" filed to trim at 20 characters - Go to
/admin/config/content/formats/filtered_html
and add<img>
to the "Allowed HTML tags" (under "Limit allowed HTML tags") - Go to
/node/add/article
and use this HTML as the body:
Here <img alt="aoeunhteoas unthoaesn theoausnth oaesntheo asnthoae" src="http://flowjournal.org/wp-content/uploads/2011/12/Im-Not-Here.png" /> it is
- Go to
/node
and observe output like in the screenshot
Proposed resolution
Remove the potentially-misinterpreted </body></html>
closing tags that \Drupal\Component\Utility\Html::load()
adds to the end of the text before parsing into the DOM, as they are not required for the HTML soup to be successfully parsed.
By the way, \Drupal\views\Plugin\views\field\FieldPluginBase::trimText()
has some code to avoid this exact problem when trimming fields:
// Remove scraps of HTML entities from the end of a strings
$value = rtrim(preg_replace('/(?:<(?!.+>)|&(?!.+;)).*$/us', '', $value));
Remaining tasks
- Consider backporting patch to Drupal 7 and Drupal 10
User interface changes
None.
API changes
None.
Data model changes
None.
Original summary
_filter_htmlcorrector leaves in fragmentary tags that may be passed in and break the rest of the page. This is most liable to happen when using the "Field can contain HTML" filter in the Views module, but could also occur any other time a developer were to trim a string that contains HTML and pass it to this function.
Example (trimmed to 250 chars):
Lorem ipsum dolor sit amet, consectetur adipiscing elit. <strong>Aliquam posuere enim</strong>. Sed ultrices semper tortor. Pellentesque cenim consectetur. Nulla sed risus eu ipsum venenatis <a class="sample" href="http://www.example.com/partial/path
Output is identical to input, breaking any HTML that follows on the page. Ideal output would be:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. <strong>Aliquam posuere enim</strong>. Sed ultrices semper tortor. Pellentesque cenim consectetur. Nulla sed risus eu ipsum venenatis
Patch attached.
Comment | File | Size | Author |
---|
Issue fork drupal-418620
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #1
seaneffel CreditAttribution: seaneffel commentedConfirming the problem, when selecting the "Field can contain HTML" as on or off, the resulting field always displays clipped HTML tags. Glad someone saw this, I thought it was my bad CSS on the theme layer. Here is a sample of an unclosed a and p tag that clued me in to the problem:
The ellipsis comes from the field setting in views.
I'll try the patch and report back.
Comment #2
vann CreditAttribution: vann commentedThis patch corrects the same problem I was having, however it doesn't correct the problem if the "Trim only on a word boundary" checkbox is checked in Views. Since I don't know what code is executed to trim the text, I can't tell precisely where the code is deciding where the "word boundary" is, but it clearly thinks it's somewhere inside a tag. The result is the trimmed tag is incorrectly displayed as text to the end user.
Here is the input and the incorrect output with a trim length of 500, only on word boundary:
input:
output:
Comment #3
multiplextor CreditAttribution: multiplextor commentedClosed. The reason: expired.
Comment #4
dsnopekUpdate issues summary because this problem still exists in Drupal 7 (might be present in Drupal 8 too)
Comment #5
dsnopekHere's a Drupal 7 patch that works in my testing (based on Views code)
Comment #6
dsnopekDrupal 8 theoretically has the same issue. In D8, the function is
Drupal\Component\Utility\Html::normalize()
, which uses pretty much the exact same code as_filter_htmlcorrector()
. If I run this code via Devel or drush, it produces the same messy HTML as in Drupal 7:However, I can't seem to be able to reproduce on Drupal 8 using similar steps as I can on Drupal 7! I'm not sure what's different, but maybe we can backport whatever that is?
Comment #8
dsnopekAha! The difference is that the "Correct faulty and chopped off HTML" filter isn't enabled on the "Basic HTML" text format by default on Drupal 8. So, I can reproduce on D8 now too!
I've updated the issue summary to reflect this.
Comment #9
dsnopekHere's an updated Drupal 7 patch which should fix the failed test. It doesn't add new test coverage for this specific bug - which should really be added - for now, it just fixes the existing tests.
Comment #19
smustgrave CreditAttribution: smustgrave at Mobomo commentedWondering if this is still an issue in D9,5?
Normalize() seems to be a different from when this ticket was opened.
Comment #20
mfbI followed the instructions in the issue summary to verify, and yes, unfortunately this is still an issue in D9.5 and D10.1
Comment #21
smustgrave CreditAttribution: smustgrave at Mobomo commentedNot really sold on this solution but the regex using for D7 didn't work on D9.5
But added some tests
Comment #24
smustgrave CreditAttribution: smustgrave at Mobomo commentedWonder if ckeditor5 has any bearing on this.
Comment #25
catchckeditor5 won't make any difference here. There is #2441811: Upgrade filter system to HTML5 but I don't think that really affects this either.
Added test coverage looks good, I'm also not entirely sure about the string replace but don't really have a better idea either. However if we go that way we should add a comment.
Comment #26
smustgrave CreditAttribution: smustgrave at Mobomo commentedAdded comment
Comment #27
larowlanSuper nit, but they're technically 'closing' tags.
What I don't understand is how the
</body> and </html>
get there -\Drupal\Component\Utility\Html::serialize
works by finding the body element and then looping over its child elements, so the wrapping html and body elements shouldn't even be in the picture.I think we need to do some more forensics here.
Comment #28
mfbAs far as I understand, the HTML-encoded closing tags come from the load() method, where they are parsed as part of the cut-off attribute, not closing tags, and they get "repaired" (i.e. HTML-encoded) to make them valid as part of the attribute.
Comment #29
borisson_They are added because of the reasons mentioned by @mfb in #28.
https://3v4l.org/r6nNq#v8.1.18 is a 3v4l with the minimum of load & serialize. I think this could also be a bug in php core's dom handling, but I think we should just fix it like in #26.
Setting this back to rtbc.
Comment #30
olli CreditAttribution: olli commented#26 seems to remove any closing body and html elements, also intentional ones inside code blocks.
Should we instead try to fix this somehow in
Html::load()
that adds them? https://3v4l.org/vm89sComment #33
mfbYes, it actually seems quite reasonable to just leave out the potentially-misinterpreted
</body></html>
and rely on DOMDocument to deal with the HTML soup. Let's see if tests pass.Comment #35
mfbWell, the one failing test honestly seems like an improvement, so I'm fixing the test.
Comment #36
smustgrave CreditAttribution: smustgrave at Mobomo commentedSeems odd the closing bracket was the failing test but agreed.
Comment #37
mfbClarified proposed resolution and remaining tasks. Btw, it looks like these closing tags should still be removed even with #2441811: Upgrade filter system to HTML5, which is not surprising since DOMDocument is still in use under the hood.
Comment #39
borisson_Known fail
Comment #40
alexpottWe recently committed #2441811: Upgrade filter system to HTML5 which makes this one not apply.
Comment #41
mfbRerolled.
Comment #42
alexpottI think I agree with #33 and
This change is not longer needed.
Comment #45
longwaveDiscussed with @alexpott. It seems that most of the HTML boilerplate in
Html::load()
is no longer required and we can get away with just parsing<body> . $html
.<html>
tag is not needed, again if I reserialize the document then the tag is added around<body>
.Opened an MR with the new test and my suggested changes.
Comment #46
mfbChanges look good to me (especially getting rid of the ancient Content-Type meta tag :)
Comment #48
longwaveAdded the comment, reworded slightly.
Comment #49
smustgrave CreditAttribution: smustgrave at Mobomo commentedLooks good! Big win getting #2441811: Upgrade filter system to HTML5 to land!
Comment #50
alexpottCommitted f101ff7 and pushed to 11.x. Thanks!
Comment #52
mfbThis bugfix will need a little tweaking if folks want to backport to Drupal 7 or Drupal 10.1, but should be easy..