Problem/Motivation
The filter system should be upgraded to HTML5 to match modern standards. All modern browsers now parse HTML5, our themes use an HTML5 doctype and so Drupal should treat HTML input and output the same way.
Steps to reproduce
Currently the filter system parses and outputs XHTML only. This causes PHP warnings when trying to parse modern tags:
>>> use \Drupal\Component\Utility\Html;
>>> Html::normalize("<p><figure><figcaption>Caption</figcaption></figure><video></video></p>");
PHP Warning: DOMDocument::loadHTML(): Tag figure invalid in Entity, line: 1 in /var/www/html/drupal/core/lib/Drupal/Component/Utility/Html.php on line 289
PHP Warning: DOMDocument::loadHTML(): Tag figcaption invalid in Entity, line: 1 in /var/www/html/drupal/core/lib/Drupal/Component/Utility/Html.php on line 289
PHP Warning: DOMDocument::loadHTML(): Tag video invalid in Entity, line: 1 in /var/www/html/drupal/core/lib/Drupal/Component/Utility/Html.php on line 289
=> "<p><figure><figcaption>Caption</figcaption></figure><video></video></p>"
Also, <br /> and <img /> tags are self-closing. In HTML5, these are void elements and should be output simply as <br> and <img>.
Proposed resolution
Use masterminds/html5 instead of DOMDocument to parse and output HTML in the \Drupal\Component\Utility\Html utility class.
Remaining tasks
User interface changes
None
API changes
\Drupal\Component\Utility\Html::load(), \Drupal\Component\Utility\Html::serialize() and \Drupal\Component\Utility\Html::normalize() will parse and output HTML5 instead of XHTML. Input filters or other code that relies on this utility class will also now parse and output HTML5 instead of XHTML.
This does mean there are some minor changes to output, but these should not affect valid HTML documents. The change record contains known examples of output changes that may affect tests: https://www.drupal.org/node/3225468
Data model changes
None
Release notes snippet
The filter system has been upgraded to output HTML5. HTML output will have minor changes in some cases, for more information read the change record.
| Comment | File | Size | Author |
|---|---|---|---|
| #95 | 2441811-nr-bot.txt | 90 bytes | needs-review-queue-bot |
Issue fork drupal-2441811
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #1
daffie commentedPostponed on #2441645: Upgrade Drupal\simpletest\WebTestBase to HTML5.
Comment #2
daffie commentedThere are two new tests changes added to this patch from the original patch from #1333730: [Meta] PHP DOM (libxml2) misinterprets HTML5.
Comment #3
alexpottComment #4
yched commentedThe CR that was published as part of #1333730: [Meta] PHP DOM (libxml2) misinterprets HTML5 states that the filter system is already HTML5-compatible, but that's a bit ahead of it's time :-)
Shouldn't this be bumped to critical ?
Comment #5
daffie commentedPostponed on #2441373: Upgrade tests to HTML5.
Comment #6
mgiffordComment #15
catchComment #16
jibran#2441373: Upgrade tests to HTML5 is not postponed anymore.
Comment #17
joseph.olstadI've rerolled a patch in #2441373: Upgrade tests to HTML5
it needs a bit of work to get it in.
meanwhile, can also try to work on this one.
Comment #18
effulgentsia commentedComment #19
longwaveUnsure why we are waiting on tests first, let's see what fails if we just upgrade \Drupal\Component\Utility\Html to the HTML5 parser.
Comment #21
longwaveOpened an MR with #19, some more fixes to the filter system and some test changes to accommodate HTML5 output.
BigPipe rendering is broken and I'm not sure why yet. Also unsure if the caption filter output should now have newlines? The template has them but somehow they were stripped before, the new parser keeps them instead.
Otherwise, this doesn't look too difficult to get the rest of the tests working.
Comment #22
joseph.olstad@longwave, patch 19 works perfectly ! Thanks!
Comment #23
longwaveUn-postponing this - I think we can do this separately from changing the other tests.
Comment #24
longwaveOne way of fixing the caption filter is to wrap the templates for it in
{% apply spaceless %}.Comment #25
longwaveComment #26
andypostMR used so patches are hidden
btw interesting to compare performance of test runs using different parsers
Comment #27
joseph.olstadhow about applying this patch to the merge request?
patch-73
#2441373-73: Upgrade tests to HTML5
Comment #28
longwave@joseph.olstad this issue is about changing the filter system only, and fixing any tests or code that fail directly due to that. That issue is separate, covering all other tests that use DOMDocument to parse HTML.
Comment #29
longwaveAll failed tests work locally, let's see how this testbot run goes.
Comment #30
longwave@andypost re: performance, seems no real difference:
9.3.x HEAD run on DOMDocument:
Unit tests: 44 sec
Kernel tests: 6 min 55 sec
Functional tests: 34 min 12 sec
JavaScript tests: 6 min 55 sec
Latest patch here on HTML5:
Unit tests: 42 sec
Kernel tests: 6 min 51 sec
Functional tests: 34 min 3 sec
JavaScript tests: 6 min 59 sec
Comment #31
joseph.olstadWow great job @longwave!
I started using your patch already, it's fixed an extra xml:lang that we had.
Comment #32
andypostLeft 2 questions in MR - maybe empty attributes have a meaning, probably better to split off
The bigger question is template changes for spaceless
Comment #33
longwaveAddressed @andypost's feedback, put the empty attribute back in as close as I can - HTML5 won't output
arguments=""butargumentsalone means the same thing according to spec.We have made minor changes to classy before, the other option here is to leave the template alone, but this means the output is actually different - the caption filter output will contain newlines and all related tests will have to be updated to match. To me it's better to keep the output the same as it was before.
Comment #34
wim leersAs a pseudo-maintainer of
filter.module: epic work here! 🙏Comment #35
daffie commentedComment #36
longwave@daffie Thanks for the feedback, all points addressed.
Comment #37
daffie commentedComment #38
longwaveComment #39
daffie commentedComment #40
longwaveComment #41
longwaveFigured out a way to simplify it even further using DOMXPath to extract all child nodes under
<body>.Comment #42
daffie commentedAll code changes look good to me.
The MR fixed the problem as described in the IS.
For me it is RTBC.
For the committer: In some places is the theme classy changed, by adding "spaceless" to pieces of the theme. Not sure if that is allowed. I leave it up to the committer to make a decision about it.
Comment #43
longwaveThanks for the code reviews.
I think this still needs an issue summary update to bring it up to our recent issue summary standards and explain what we are doing and why.
I also think this needs a change record because filtered output is changing in subtle ways; e.g. br and img tags are now void elements instead of self closed elements.
Comment #44
longwaveUpdated IS and added CR: https://www.drupal.org/node/3225468
Comment #45
daffie commentedBoth the IS and the CR look good to me.
Back to RTBC.
@longwave: Thank you for your work on this issue!
Edit: For the committer: Can the people who worked on #1277290: Use a proper HTML parser for every core filter get commit credits in this issue.
Comment #47
larowlanFWIW anyone in Maintainers.txt can assign credit for core
Doing this now.
Comment #48
larowlanLeft a couple of questions in the MR
Comment #49
wim leersEpic work here! Left a bunch of questions, one of which is critical to resolve and for which I added — it was previously asked in #24, #32 and #42.
Comment #50
longwaveThanks for the reviews! Fixed up a few test cases as per the feedback, and replied in the MR about the two tricky cases - escaping in attributes and invalid
</br>.Comment #51
daffie commentedThe testbot is failing.
Comment #52
longwaveSeems to be a random fail, cannot reproduce locally. I merged 9.3.x back in and pushed, let's see if it happens again.
Comment #53
longwaveAdded more detail to IS.
Comment #54
longwaveReplaced comments referring to (X)HTML with just HTML, as I think that's more correct now.
::escapeCdataElement()says:Is this still important? We don't call
::loadHTML()any more, and we don't care about HTML4 browsers either, so can we just remove this entirely?Comment #55
longwaveReverted the changes to the filter-caption templates and adjusted the tests to expect newlines in the output instead.
To make this clear: the newlines are already output at runtime in newer versions of libxml2 as per #3204929: Html::load() inconsistent space removal with old libxml2 versions. - three people have reproduced FilterKernelTest failing locally in HEAD due to this, but DrupalCI is on an older version that doesn't exhibit this problem. After discussion with @alexpott we decided it was less disruptive to change the tests over changing the templates.
Comment #56
joseph.olstadLooks like stellar fantastic work to me.
Merge Request needs a rebase
or, reroll from this: https://git.drupalcode.org/project/drupal/-/merge_requests/965.diff
Comment #57
longwaveRebased and force pushed.
Comment #58
lauriiiSeems like the questions that was tagged for frontend framework manager review was already addressed by changing the test assertions.
Comment #59
longwaveRebased again following #3139409: Replace usages of AssertLegacyTrait::assertRaw, that is deprecated
Comment #60
daffie commentedThere are unresolved threads on the MR.
Comment #61
longwaveThanks for the review, added a couple of comments and replied to the other MR feedback.
Comment #62
daffie commentedAll points of @phenaproxima have been addressed.
Back to RTBC.
Comment #63
lauriiiThere are some remaining issues with HTML5 support due to
Drupal\Component\Utility\Xss::filter()not fully supporting HTML5. I added test coverage for this in the MR branch.Comment #64
joseph.olstadearly testing on this fix, (great work btw, hope we can get through this relatively unscathed), we are an early adopter and have now backed off the filter change (we were using patch 19 above), one of my clients tested the patch on D8.9.x, early test results show the change causing an issue with a popular contrib module called linkit.
for some reason this change could cause the node references to not be automatically converted to the path alias after a cache rebuild and page load so they would be /node/123 instead of being converted to /friendly-title-example
I will make sure to run some more tests myself hopefully on the final version of this change.
we may have to adjust contrib logic once this is finalized.
This is just an FYI so that I remind myself to test the latest incarnation of linkit against the finalized solution. I am however using a patch (that I made changes to myself) for the linkit module so not sure if it's my patch or vanilla linkit or both.
Comment #65
longwaveI think this XSS change might need breaking out into its own issue. The underlying problem is this part of the regex in
Xss::filter():This is trivially matching anything that looks like a tag, and doesn't take attributes into account. I don't know how to safely modify this so it can skip quoted values that might contain
>.We then have a further regex which is responsible for splitting up the tag that the previous regex has found:
Similarly this one doesn't seem to be able to take attributes containing
>into account.The famous StackOverflow answer of https://stackoverflow.com/a/1732454 came to mind when I started looking at this; I'm not sure this is even possible to do with regular expressions alone.
I also tried to search for alternative libraries but wasn't able to find an alternative XSS filter written in PHP. kses (which this code is based on) is also used in WordPress and contains the same regex we have here minus the comments, so I assume it suffers from the same issue: https://github.com/WordPress/WordPress/blob/master/wp-includes/kses.php#...
Comment #66
longwaveXss::filter() claims that:
If we keep some of the existing regexes for the first part, can the HTML5 library be trusted to help with the rest? We could construct a DOM document, do any filtering in the DOM, and then serialize back to HTML, which will then be well-formed? This won't be as performant as doing it with regex only, but if it's not physically possible to do it with regex, I don't know what other option we have.
Comment #67
lauriii+1 on moving the problem with
Drupal\Component\Utility\Xss::filter()to a follow-up issue, and moving this issue back to RTBC once the test changes have been reverted.Comment #68
longwaveOpened #3227831: Xss::filter() does not handle HTML tags inside attribute values as followup for the XSS filtering issue.
Reverted @lauriii's failing test and back to RTBC as requested in #67.
Comment #69
effulgentsia commentedSince we don't know if or when #68 will result in any changes to
Xss::filter(), for the purposes of this issue, we have to live with that function as it is. One of the things that function requires is for some html entities (like less-than and greater-than) to be encoded when within attribute values.I think we should have a requirement (and add the needed tests for it) that if for a given input $foo, if
Xss::filter($foo)is rendered and appears a certain way in the browser, then renderingXss::filter(Html::normalize($foo))should appear the same way in the browser (the literal string can be different, but it should be effectively the same in terms of what a browser displays).With this MR as-is, the above doesn't hold, because:
would be seen as ok by Xss::filter(), but when you call Html::normalize() on it, then it gets converted to:
which is not seen as ok by Xss::filter() per #68.
This could be fixed by changing:
to also pass in
'encode_entities' => TRUE.However, that ends up encoding a lot of entities. For example:
would get normalized to:
That might be ok, but it's kind of ugly for anyone who needs to read that HTML.
I don't know if the html5 library provides for any more fine-grained control. It would be great if we could ask it to encode less-than and greater-than within attribute values, without it also encoding every other possible html entity.
Comment #71
andypostThere's https://github.com/Masterminds/html5-php/releases/tag/2.7.5 with PHP 8/8.1 support
Comment #73
smustgrave commentedIf the hold up is https://www.drupal.org/project/drupal/issues/3227831 can you see if its a duplicate of https://www.drupal.org/project/drupal/issues/2552837 or https://www.drupal.org/project/drupal/issues/2544110. Both of those I'm actively working on and keep close on for review / change requests.
Comment #77
joseph.olstadAre we targetting 10.0.x for this or 10.1.x ?
For those that can't wait, I have a simple solution for the track and source tags.
#2857273: \Drupal\filter\Plugin\Filter\FilterHtmlCorrector and Html::normalize() and incorrectly "corrects" <source> tags
Comment #78
chi commentedComment #80
wim leersHTML5 has now been around for >15 years. More than half the lifetime of Drupal. We should get this fixed at some point 😅
This was last RTBC almost 2 years ago, in #68.
It was un-RTBC'd in #69 over concerns about XSS filtering.
Since then, there have been 3 releases: https://github.com/Masterminds/html5-php/releases. Nothing in them relating to what was surfaced in #69 AFAICT.
But they helpfully also provided this overview: https://github.com/Masterminds/html5-php#known-issues-or-things-we-desig.... And even more helpfully: https://github.com/Masterminds/html5-php#serializer-design. For us to be able to reuse 99% of this but provide our own handling of encoding in attribute values, we could:
\Drupal\Component\Utility\Html::serialize()not call\Masterminds\HTML5::saveHTML()which calls\Masterminds\HTML5::save(), which hardcodes the use of\Masterminds\HTML5\Serializer\OutputRules— this would add ~10 LoC\Drupal\Component\Utility\HtmlSerializerRules extends \Masterminds\HTML5\Serializer\OutputRuleswhich handles attribute values differently by only overriding\Masterminds\HTML5\Serializer\OutputRules::attrs(), which is ~30 out of the ~550, or 5%Thoughts? 🤓
Comment #81
andypostFYI looking at PHP 8.3 changes I see many commits for
domextensionMoreover 2 commits already caused regressions and reverted, see
- https://github.com/php/php-src/issues/11469
- https://github.com/php/php-src/issues/11507
So ++ to rely on external library
Comment #82
effulgentsia commented+1. Or maybe rename this to
\Drupal\Component\Html5\Serializer\OutputRulesdepending on the answer to below.What would you do instead? Create a
\Drupal\Component\Html5\Html5that extends\Masterminds\HTML5and overrides thesave()method to use ourOutputRulesinstead? Or are you envisioning a different way of integrating our OutputRules overrides intoHtml::serialize()?Comment #84
longwaveOpened MR!4274 from 11.x with the work from the 9.4.x branch merged in, no other changes yet - some of the conflicts in FilterKernelTest especially were tricky so let's see how this goes first.
Comment #85
longwaveFixed some of the tests, but it looks like
HtmlTest::testTransformRootRelativeUrlsToAbsolutewas perhaps failing correctly?Comment #86
andypostFixed CS to run tests
Comment #87
smustgrave commentedFailure seems like it could be legit to issue.
Comment #88
dwwTagging to be smashed. Hope to get a chance to review + dig in soon.
Comment #89
longwaveThere is a typo in WildcardHtmlSupportTest -
class"instead ofclass- the HTML5 parser is stricter and rejectsclass"as an invalid attribute.I am still not sure about the "fix" to ::testTransformRootRelativeUrlsToAbsolute(), will dig into that again.
This also still needs work to address #69 through #82.
Comment #90
longwaveReverted the change to
HtmlTest::testTransformRootRelativeUrlsToAbsolute()and solved the issue another way.This relates to #3311595: Html::transformRootRelativeUrlsToAbsolute() replaces "\r\n" with " \n". Code was added to Html::load() to pre-normalize newlines because of the way DOMDocument used to handle them. As we are no longer using
DOMDocument::loadHTML()we can hopefully remove the special cases inHtml::load()and normalize newlines inHtml::serialize()instead, I think - this makes HtmlTest pass again locally, and hopefully doesn't break anything else,Comment #91
longwaveAddressed #69 through #82 by adding a test as suggested in #69 and then implementing @Wim Leers' suggestion from #80. FilterKernelTest and XssTest pass locally, let's see what happens with the full suite.
Not sure this is worth introducing as an additional component as suggested in #82, but reviews and comments welcome.
Comment #92
smustgrave commentedRegarding #82 maybe we could add a new component for our ruleset but is that needed right now or could be in a follow up?
From an agile standpoint I think this change makes things better right now and can be improved on.
Comment #93
dwwSorry to do this, but sounds like this still needs more review before truly RTBC.
Comment #94
joseph.olstadWow this is some great work @longwave, you've really been on a marathon run burning the midnight oil on this!
Comment #95
needs-review-queue-bot commentedThe Needs Review Queue Bot tested this issue. It no longer applies to Drupal core. Therefore, this issue status is now "Needs work".
This does not mean that the patch needs to be re-rolled or the MR rebased. Read the Issue Summary, the issue tags and the latest discussion here to determine what needs to be done.
Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.
Comment #96
longwaveRebased.
Comment #97
wim leersSorry for the long wait, @longwave 🙈 Here's a very thorough review!
I think this is very close to being ready, and mostly just needs some additional comments to ensure this remains maintainable in the future, as well as additions for the obscure edge cases that are changing, in both the issue summary and the change record.
Superb work, and wonderful to finally see this on the verge of being committable 😊
Comment #98
longwaveThanks for the in-depth review! Responded to all points: fixed some nits, added some docs to both the code and change record, rebased and pushed. Also reverted the BigPipe change but I have a feeling this will fail because the placeholders must match the exact string that is output by the normalizer, let's see though.
Comment #99
longwaveUpdated the IS and change record.
Comment #100
longwaveTest failures.
Comment #101
andypostThere's new RFC to add HTML5 support to PHP (8.4 I bet) https://wiki.php.net/rfc/domdocument_html5_parser
Comment #102
mfbThe HTML5 RFC is good news and it's about time, although there will be more waiting to be able to use it.
Comment #103
longwaveYeah, I feel like we should move forward with this anyway, then when Drupal increases the minimum version enough to use the new HTML5 DOM, we can migrate away from
masterminds/html5. I also like thatmasterminds/html5is explicitly called out in the RFC as it helps to confirm that we have chosen the correct solution here.Comment #104
wim leers💯
Comment #105
longwaveThis should be ready for another round of review. Fixed the test fails - removed the big pipe placeholder
argumentsattribute when it is empty to hopefully avoid confusion, and fixed a bug in HtmlSerializerRules that was double-escaping some ampersands.Comment #106
smustgrave commentedFrom what I can tell all threads have had a response at least.
CR reads well.
And after 8 years think moving to HTML5 makes sense :) also getting this in about a month out from 10.2 would give contrib modules good time to fix.
Comment #107
alexpottI think the changes to bigpipe mean that that module should have a post update to ensure the render cache is clearer so any cached placeholders have the expected HTML.
Comment #108
andypostFYI this upgrade makes the related issue obsolete so removes last blocker for PHP 8.3 compatibility (and independent of libxml version) #3383577-20: TextSummaryTest:testLength() fails on some libxml versions
Comment #109
andypostAdded
pib_pipe.post_update.phpwithbig_pipe_post_update_html5_placeholders()for #107Comment #110
longwaveThank you @andypost.
Comment #111
alexpottCommitted 201ae2e and pushed to 11.x. Thanks!
Comment #113
wim leers🥳 Thanks everyone, and @longwave in particular! 👏 8.5 years in the making 😅
Comment #114
joseph.olstadNicely done folks! I wonder how this will affect contrib? We shall see! :)
Comment #115
andypostThank you all!
It was a blocker for PHP 8.3 compatibility!
Meantime the parent issue needs some attention for follow-ups as mostly all
libxmlrelated issues could be closed #1333730: [Meta] PHP DOM (libxml2) misinterprets HTML5Comment #117
cilefen commentedThere is a report calling this a regression #3445910: Issue with HTML ` ` not being correctly filtered out.