Problem/Motivation

The filter system should be upgraded to HTML5 to match modern standards. All modern browsers now parse HTML5, our themes use an HTML5 doctype and so Drupal should treat HTML input and output the same way.

Steps to reproduce

Currently the filter system parses and outputs XHTML only. This causes PHP warnings when trying to parse modern tags:

>>> use \Drupal\Component\Utility\Html;
>>> Html::normalize("<p><figure><figcaption>Caption</figcaption></figure><video></video></p>");
PHP Warning:  DOMDocument::loadHTML(): Tag figure invalid in Entity, line: 1 in /var/www/html/drupal/core/lib/Drupal/Component/Utility/Html.php on line 289
PHP Warning:  DOMDocument::loadHTML(): Tag figcaption invalid in Entity, line: 1 in /var/www/html/drupal/core/lib/Drupal/Component/Utility/Html.php on line 289
PHP Warning:  DOMDocument::loadHTML(): Tag video invalid in Entity, line: 1 in /var/www/html/drupal/core/lib/Drupal/Component/Utility/Html.php on line 289
=> "<p><figure><figcaption>Caption</figcaption></figure><video></video></p>"

Also, <br /> and <img /> tags are self-closing. In HTML5, these are void elements and should be output simply as <br> and <img>.

Proposed resolution

Use masterminds/html5 instead of DOMDocument to parse and output HTML in the \Drupal\Component\Utility\Html utility class.

Remaining tasks

User interface changes

None

API changes

\Drupal\Component\Utility\Html::load(), \Drupal\Component\Utility\Html::serialize() and \Drupal\Component\Utility\Html::normalize() will parse and output HTML5 instead of XHTML. Input filters or other code that relies on this utility class will also now parse and output HTML5 instead of XHTML.

This does mean there are some minor changes to output, but these should not affect valid HTML documents. The change record contains known examples of output changes that may affect tests: https://www.drupal.org/node/3225468

Data model changes

None

Release notes snippet

The filter system has been upgraded to output HTML5. HTML output will have minor changes in some cases, for more information read the change record.

Issue fork drupal-2441811

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Comments

daffie’s picture

Status: Active » Postponed
daffie’s picture

There are two new tests changes added to this patch from the original patch from #1333730: [Meta] PHP DOM (libxml2) misinterprets HTML5.

alexpott’s picture

Status: Postponed » Active
yched’s picture

The CR that was published as part of #1333730: [Meta] PHP DOM (libxml2) misinterprets HTML5 states that the filter system is already HTML5-compatible, but that's a bit ahead of it's time :-)

Shouldn't this be bumped to critical ?

daffie’s picture

Status: Active » Postponed
mgifford’s picture

Version: 8.0.x-dev » 8.1.x-dev

Drupal 8.0.6 was released on April 6 and is the final bugfix release for the Drupal 8.0.x series. Drupal 8.0.x will not receive any further development aside from security fixes. Drupal 8.1.0-rc1 is now available and sites should prepare to update to 8.1.0.

Bug reports should be targeted against the 8.1.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.9 was released on September 7 and is the final bugfix release for the Drupal 8.1.x series. Drupal 8.1.x will not receive any further development aside from security fixes. Drupal 8.2.0-rc1 is now available and sites should prepare to upgrade to 8.2.0.

Bug reports should be targeted against the 8.2.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.6 was released on February 1, 2017 and is the final full bugfix release for the Drupal 8.2.x series. Drupal 8.2.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.3.0 on April 5, 2017. (Drupal 8.3.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.3.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.6 was released on August 2, 2017 and is the final full bugfix release for the Drupal 8.3.x series. Drupal 8.3.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.4.0 on October 4, 2017. (Drupal 8.4.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.4.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.4 was released on January 3, 2018 and is the final full bugfix release for the Drupal 8.4.x series. Drupal 8.4.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.5.0 on March 7, 2018. (Drupal 8.5.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.5.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.6 was released on August 1, 2018 and is the final bugfix release for the Drupal 8.5.x series. Drupal 8.5.x will not receive any further development aside from security fixes. Sites should prepare to update to 8.6.0 on September 5, 2018. (Drupal 8.6.0-rc1 is available for testing.)

Bug reports should be targeted against the 8.6.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.6.x-dev » 8.8.x-dev

Drupal 8.6.x will not receive any further development aside from security fixes. Bug reports should be targeted against the 8.8.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.9.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 8.8.x-dev » 8.9.x-dev

Drupal 8.8.7 was released on June 3, 2020 and is the final full bugfix release for the Drupal 8.8.x series. Drupal 8.8.x will not receive any further development aside from security fixes. Sites should prepare to update to Drupal 8.9.0 or Drupal 9.0.0 for ongoing support.

Bug reports should be targeted against the 8.9.x-dev branch from now on, and new development or disruptive changes should be targeted against the 9.1.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

catch’s picture

Title: Upgrade filter system to HTML5 » [pp-2] Upgrade filter system to HTML5
jibran’s picture

Title: [pp-2] Upgrade filter system to HTML5 » [PP-1] Upgrade filter system to HTML5

#2441373: Upgrade tests to HTML5 is not postponed anymore.

joseph.olstad’s picture

I've rerolled a patch in #2441373: Upgrade tests to HTML5
it needs a bit of work to get it in.

meanwhile, can also try to work on this one.

effulgentsia’s picture

Version: 8.9.x-dev » 9.3.x-dev
longwave’s picture

StatusFileSize
new1.71 KB

Unsure why we are waiting on tests first, let's see what fails if we just upgrade \Drupal\Component\Utility\Html to the HTML5 parser.

longwave’s picture

Opened an MR with #19, some more fixes to the filter system and some test changes to accommodate HTML5 output.

BigPipe rendering is broken and I'm not sure why yet. Also unsure if the caption filter output should now have newlines? The template has them but somehow they were stripped before, the new parser keeps them instead.

Otherwise, this doesn't look too difficult to get the rest of the tests working.

joseph.olstad’s picture

@longwave, patch 19 works perfectly ! Thanks!

longwave’s picture

Title: [PP-1] Upgrade filter system to HTML5 » Upgrade filter system to HTML5
Status: Postponed » Needs work

Un-postponing this - I think we can do this separately from changing the other tests.

longwave’s picture

Status: Needs work » Postponed

One way of fixing the caption filter is to wrap the templates for it in {% apply spaceless %}.

longwave’s picture

Status: Postponed » Needs work
andypost’s picture

Issue tags: +Test suite performance

MR used so patches are hidden

btw interesting to compare performance of test runs using different parsers

joseph.olstad’s picture

how about applying this patch to the merge request?

patch-73

#2441373-73: Upgrade tests to HTML5

longwave’s picture

@joseph.olstad this issue is about changing the filter system only, and fixing any tests or code that fail directly due to that. That issue is separate, covering all other tests that use DOMDocument to parse HTML.

longwave’s picture

Status: Needs work » Needs review

All failed tests work locally, let's see how this testbot run goes.

longwave’s picture

@andypost re: performance, seems no real difference:

9.3.x HEAD run on DOMDocument:
Unit tests: 44 sec
Kernel tests: 6 min 55 sec
Functional tests: 34 min 12 sec
JavaScript tests: 6 min 55 sec

Latest patch here on HTML5:
Unit tests: 42 sec
Kernel tests: 6 min 51 sec
Functional tests: 34 min 3 sec
JavaScript tests: 6 min 59 sec

joseph.olstad’s picture

Wow great job @longwave!

I started using your patch already, it's fixed an extra xml:lang that we had.

andypost’s picture

Left 2 questions in MR - maybe empty attributes have a meaning, probably better to split off

The bigger question is template changes for spaceless

longwave’s picture

Addressed @andypost's feedback, put the empty attribute back in as close as I can - HTML5 won't output arguments="" but arguments alone means the same thing according to spec.

We have made minor changes to classy before, the other option here is to leave the template alone, but this means the output is actually different - the caption filter output will contain newlines and all related tests will have to be updated to match. To me it's better to keep the output the same as it was before.

wim leers’s picture

As a pseudo-maintainer of filter.module: epic work here! 🙏

daffie’s picture

Status: Needs review » Needs work
longwave’s picture

Status: Needs work » Needs review

@daffie Thanks for the feedback, all points addressed.

daffie’s picture

Status: Needs review » Needs work
longwave’s picture

Status: Needs work » Needs review
daffie’s picture

Status: Needs review » Needs work
longwave’s picture

Status: Needs work » Needs review
longwave’s picture

Figured out a way to simplify it even further using DOMXPath to extract all child nodes under <body>.

daffie’s picture

Status: Needs review » Reviewed & tested by the community

All code changes look good to me.
The MR fixed the problem as described in the IS.
For me it is RTBC.

For the committer: In some places is the theme classy changed, by adding "spaceless" to pieces of the theme. Not sure if that is allowed. I leave it up to the committer to make a decision about it.

longwave’s picture

Status: Reviewed & tested by the community » Needs work
Issue tags: +Needs issue summary update, +Needs change record

Thanks for the code reviews.

I think this still needs an issue summary update to bring it up to our recent issue summary standards and explain what we are doing and why.

I also think this needs a change record because filtered output is changing in subtle ways; e.g. br and img tags are now void elements instead of self closed elements.

longwave’s picture

Issue summary: View changes
Status: Needs work » Needs review
Issue tags: -Needs issue summary update, -Needs change record

Updated IS and added CR: https://www.drupal.org/node/3225468

daffie’s picture

Status: Needs review » Reviewed & tested by the community

Both the IS and the CR look good to me.
Back to RTBC.

@longwave: Thank you for your work on this issue!

Edit: For the committer: Can the people who worked on #1277290: Use a proper HTML parser for every core filter get commit credits in this issue.

larowlan credited chx.

larowlan’s picture

Edit: For the committer: Can the people who worked on #1277290: Use a proper HTML parser for every core filter get commit credits in this issue.

FWIW anyone in Maintainers.txt can assign credit for core

Doing this now.

larowlan’s picture

Status: Reviewed & tested by the community » Needs review

Left a couple of questions in the MR

wim leers’s picture

Epic work here! Left a bunch of questions, one of which is critical to resolve and for which I added Needs frontend framework manager review — it was previously asked in #24, #32 and #42.

longwave’s picture

Thanks for the reviews! Fixed up a few test cases as per the feedback, and replied in the MR about the two tricky cases - escaping in attributes and invalid </br>.

daffie’s picture

Status: Needs review » Needs work

The testbot is failing.

longwave’s picture

Status: Needs work » Needs review

Seems to be a random fail, cannot reproduce locally. I merged 9.3.x back in and pushed, let's see if it happens again.

longwave’s picture

Issue summary: View changes

Added more detail to IS.

longwave’s picture

Replaced comments referring to (X)HTML with just HTML, as I think that's more correct now.

::escapeCdataElement() says:

   * \DOMDocument::loadHTML() in \Drupal\Component\Utility\Html::load() makes
   * CDATA sections from the contents of inline script and style tags. This can
   * cause HTML4 browsers to throw exceptions.

Is this still important? We don't call ::loadHTML() any more, and we don't care about HTML4 browsers either, so can we just remove this entirely?

longwave’s picture

Reverted the changes to the filter-caption templates and adjusted the tests to expect newlines in the output instead.

To make this clear: the newlines are already output at runtime in newer versions of libxml2 as per #3204929: Html::load() inconsistent space removal with old libxml2 versions. - three people have reproduced FilterKernelTest failing locally in HEAD due to this, but DrupalCI is on an older version that doesn't exhibit this problem. After discussion with @alexpott we decided it was less disruptive to change the tests over changing the templates.

joseph.olstad’s picture

Looks like stellar fantastic work to me.
Merge Request needs a rebase

or, reroll from this: https://git.drupalcode.org/project/drupal/-/merge_requests/965.diff

longwave’s picture

Rebased and force pushed.

lauriii’s picture

Seems like the questions that was tagged for frontend framework manager review was already addressed by changing the test assertions.

longwave’s picture

daffie’s picture

Status: Needs review » Needs work

There are unresolved threads on the MR.

longwave’s picture

Status: Needs work » Needs review

Thanks for the review, added a couple of comments and replied to the other MR feedback.

daffie’s picture

Status: Needs review » Reviewed & tested by the community

All points of @phenaproxima have been addressed.
Back to RTBC.

lauriii’s picture

Status: Reviewed & tested by the community » Needs work

There are some remaining issues with HTML5 support due to Drupal\Component\Utility\Xss::filter() not fully supporting HTML5. I added test coverage for this in the MR branch.

joseph.olstad’s picture

early testing on this fix, (great work btw, hope we can get through this relatively unscathed), we are an early adopter and have now backed off the filter change (we were using patch 19 above), one of my clients tested the patch on D8.9.x, early test results show the change causing an issue with a popular contrib module called linkit.

for some reason this change could cause the node references to not be automatically converted to the path alias after a cache rebuild and page load so they would be /node/123 instead of being converted to /friendly-title-example

I will make sure to run some more tests myself hopefully on the final version of this change.

we may have to adjust contrib logic once this is finalized.

This is just an FYI so that I remind myself to test the latest incarnation of linkit against the finalized solution. I am however using a patch (that I made changes to myself) for the linkit module so not sure if it's my patch or vanilla linkit or both.

longwave’s picture

I think this XSS change might need breaking out into its own issue. The underlying problem is this part of the regex in Xss::filter():

      <[^>]*(>|$)       # a string that starts with a <, up until the > or the end of the string

This is trivially matching anything that looks like a tag, and doesn't take attributes into account. I don't know how to safely modify this so it can skip quoted values that might contain >.

We then have a further regex which is responsible for splitting up the tag that the previous regex has found:

    if (!preg_match('%^<\s*(/\s*)?([a-zA-Z0-9\-]+)\s*([^>]*)>?|(<!--.*?-->)$%', $string, $matches)) {

Similarly this one doesn't seem to be able to take attributes containing > into account.

The famous StackOverflow answer of https://stackoverflow.com/a/1732454 came to mind when I started looking at this; I'm not sure this is even possible to do with regular expressions alone.

I also tried to search for alternative libraries but wasn't able to find an alternative XSS filter written in PHP. kses (which this code is based on) is also used in WordPress and contains the same regex we have here minus the comments, so I assume it suffers from the same issue: https://github.com/WordPress/WordPress/blob/master/wp-includes/kses.php#...

longwave’s picture

Xss::filter() claims that:

   * This code does four things:
   * - Removes characters and constructs that can trick browsers.
   * - Makes sure all HTML entities are well-formed.
   * - Makes sure all HTML tags and attributes are well-formed.
   * - Makes sure no HTML tags contain URLs with a disallowed protocol (e.g.
   *   javascript:).

If we keep some of the existing regexes for the first part, can the HTML5 library be trusted to help with the rest? We could construct a DOM document, do any filtering in the DOM, and then serialize back to HTML, which will then be well-formed? This won't be as performant as doing it with regex only, but if it's not physically possible to do it with regex, I don't know what other option we have.

lauriii’s picture

Issue tags: +Needs followup

+1 on moving the problem with Drupal\Component\Utility\Xss::filter() to a follow-up issue, and moving this issue back to RTBC once the test changes have been reverted.

longwave’s picture

Status: Needs work » Reviewed & tested by the community
Issue tags: -Needs followup
Related issues: +#3227831: Xss::filter() does not handle HTML tags inside attribute values

Opened #3227831: Xss::filter() does not handle HTML tags inside attribute values as followup for the XSS filtering issue.

Reverted @lauriii's failing test and back to RTBC as requested in #67.

effulgentsia’s picture

Status: Reviewed & tested by the community » Needs work

Since we don't know if or when #68 will result in any changes to Xss::filter(), for the purposes of this issue, we have to live with that function as it is. One of the things that function requires is for some html entities (like less-than and greater-than) to be encoded when within attribute values.

I think we should have a requirement (and add the needed tests for it) that if for a given input $foo, if Xss::filter($foo) is rendered and appears a certain way in the browser, then rendering Xss::filter(Html::normalize($foo)) should appear the same way in the browser (the literal string can be different, but it should be effectively the same in terms of what a browser displays).

With this MR as-is, the above doesn't hold, because:

$foo = '<span data-caption="foo &lt;em&gt;bar&lt;/em&gt;"></span>';

would be seen as ok by Xss::filter(), but when you call Html::normalize() on it, then it gets converted to:

<span data-caption="foo <em>bar</em>"></span>

which is not seen as ok by Xss::filter() per #68.

This could be fixed by changing:

+    $html5 = new HTML5(['disable_html_ns' => TRUE]);

to also pass in 'encode_entities' => TRUE.

However, that ends up encoding a lot of entities. For example:

<a href="some/url"></a>

would get normalized to:

<a href="some&sol;url"></a>

That might be ok, but it's kind of ugly for anyone who needs to read that HTML.

I don't know if the html5 library provides for any more fine-grained control. It would be great if we could ask it to encode less-than and greater-than within attribute values, without it also encoding every other possible html entity.

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.0-rc1 was released on November 26, 2021, which means new developments and disruptive changes should now be targeted for the 9.4.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

andypost’s picture

Version: 9.4.x-dev » 9.5.x-dev

Drupal 9.4.0-alpha1 was released on May 6, 2022, which means new developments and disruptive changes should now be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

smustgrave’s picture

If the hold up is https://www.drupal.org/project/drupal/issues/3227831 can you see if its a duplicate of https://www.drupal.org/project/drupal/issues/2552837 or https://www.drupal.org/project/drupal/issues/2544110. Both of those I'm actively working on and keep close on for review / change requests.

edurenye made their first commit to this issue’s fork.

Version: 9.5.x-dev » 10.1.x-dev

Drupal 9.5.0-beta2 and Drupal 10.0.0-beta2 were released on September 29, 2022, which means new developments and disruptive changes should now be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

joseph.olstad’s picture

Are we targetting 10.0.x for this or 10.1.x ?

For those that can't wait, I have a simple solution for the track and source tags.

#2857273: \Drupal\filter\Plugin\Filter\FilterHtmlCorrector and Html::normalize() and incorrectly "corrects" <source> tags

chi’s picture

Version: 10.1.x-dev » 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch, which currently accepts only minor-version allowed changes. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

wim leers’s picture

HTML5 has now been around for >15 years. More than half the lifetime of Drupal. We should get this fixed at some point 😅

This was last RTBC almost 2 years ago, in #68.

It was un-RTBC'd in #69 over concerns about XSS filtering.

Since then, there have been 3 releases: https://github.com/Masterminds/html5-php/releases. Nothing in them relating to what was surfaced in #69 AFAICT.

But they helpfully also provided this overview: https://github.com/Masterminds/html5-php#known-issues-or-things-we-desig.... And even more helpfully: https://github.com/Masterminds/html5-php#serializer-design. For us to be able to reuse 99% of this but provide our own handling of encoding in attribute values, we could:

  1. have \Drupal\Component\Utility\Html::serialize() not call \Masterminds\HTML5::saveHTML() which calls \Masterminds\HTML5::save(), which hardcodes the use of \Masterminds\HTML5\Serializer\OutputRules — this would add ~10 LoC
  2. instead provide \Drupal\Component\Utility\HtmlSerializerRules extends \Masterminds\HTML5\Serializer\OutputRules which handles attribute values differently by only overriding \Masterminds\HTML5\Serializer\OutputRules::attrs(), which is ~30 out of the ~550, or 5%

Thoughts? 🤓

andypost’s picture

FYI looking at PHP 8.3 changes I see many commits for dom extension
Moreover 2 commits already caused regressions and reverted, see
- https://github.com/php/php-src/issues/11469
- https://github.com/php/php-src/issues/11507

So ++ to rely on external library

effulgentsia’s picture

provide \Drupal\Component\Utility\HtmlSerializerRules extends \Masterminds\HTML5\Serializer\OutputRules which handles attribute values differently by only overriding \Masterminds\HTML5\Serializer\OutputRules::attrs()

+1. Or maybe rename this to \Drupal\Component\Html5\Serializer\OutputRules depending on the answer to below.

have \Drupal\Component\Utility\Html::serialize() not call \Masterminds\HTML5::saveHTML() which calls \Masterminds\HTML5::save(), which hardcodes the use of \Masterminds\HTML5\Serializer\OutputRules

What would you do instead? Create a \Drupal\Component\Html5\Html5 that extends \Masterminds\HTML5 and overrides the save() method to use our OutputRules instead? Or are you envisioning a different way of integrating our OutputRules overrides into Html::serialize()?

longwave’s picture

Opened MR!4274 from 11.x with the work from the 9.4.x branch merged in, no other changes yet - some of the conflicts in FilterKernelTest especially were tricky so let's see how this goes first.

longwave’s picture

Status: Needs work » Needs review

Fixed some of the tests, but it looks like HtmlTest::testTransformRootRelativeUrlsToAbsolute was perhaps failing correctly?

andypost’s picture

Fixed CS to run tests

smustgrave’s picture

Status: Needs review » Needs work

Failure seems like it could be legit to issue.

dww’s picture

Issue tags: +Bug Smash Initiative

Tagging to be smashed. Hope to get a chance to review + dig in soon.

longwave’s picture

Status: Needs work » Needs review
Issue tags: +ddd2023

There is a typo in WildcardHtmlSupportTest - class" instead of class - the HTML5 parser is stricter and rejects class" as an invalid attribute.

I am still not sure about the "fix" to ::testTransformRootRelativeUrlsToAbsolute(), will dig into that again.

This also still needs work to address #69 through #82.

longwave’s picture

Reverted the change to HtmlTest::testTransformRootRelativeUrlsToAbsolute() and solved the issue another way.

This relates to #3311595: Html::transformRootRelativeUrlsToAbsolute() replaces "\r\n" with "&#13;\n". Code was added to Html::load() to pre-normalize newlines because of the way DOMDocument used to handle them. As we are no longer using DOMDocument::loadHTML() we can hopefully remove the special cases in Html::load() and normalize newlines in Html::serialize() instead, I think - this makes HtmlTest pass again locally, and hopefully doesn't break anything else,

longwave’s picture

Addressed #69 through #82 by adding a test as suggested in #69 and then implementing @Wim Leers' suggestion from #80. FilterKernelTest and XssTest pass locally, let's see what happens with the full suite.

Not sure this is worth introducing as an additional component as suggested in #82, but reviews and comments welcome.

smustgrave’s picture

Status: Needs review » Reviewed & tested by the community

Regarding #82 maybe we could add a new component for our ruleset but is that needed right now or could be in a follow up?

From an agile standpoint I think this change makes things better right now and can be improved on.

dww’s picture

Status: Reviewed & tested by the community » Needs review

Sorry to do this, but sounds like this still needs more review before truly RTBC.

joseph.olstad’s picture

Wow this is some great work @longwave, you've really been on a marathon run burning the midnight oil on this!

needs-review-queue-bot’s picture

Status: Needs review » Needs work
StatusFileSize
new90 bytes

The Needs Review Queue Bot tested this issue. It no longer applies to Drupal core. Therefore, this issue status is now "Needs work".

This does not mean that the patch needs to be re-rolled or the MR rebased. Read the Issue Summary, the issue tags and the latest discussion here to determine what needs to be done.

Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.

longwave’s picture

Status: Needs work » Needs review

Rebased.

wim leers’s picture

Status: Needs review » Needs work
Issue tags: +Needs change record updates, +Needs issue summary update

Sorry for the long wait, @longwave 🙈 Here's a very thorough review!

I think this is very close to being ready, and mostly just needs some additional comments to ensure this remains maintainable in the future, as well as additions for the obscure edge cases that are changing, in both the issue summary and the change record.

Superb work, and wonderful to finally see this on the verge of being committable 😊

longwave’s picture

Status: Needs work » Needs review

Thanks for the in-depth review! Responded to all points: fixed some nits, added some docs to both the code and change record, rebased and pushed. Also reverted the BigPipe change but I have a feeling this will fail because the placeholders must match the exact string that is output by the normalizer, let's see though.

longwave’s picture

Updated the IS and change record.

longwave’s picture

Status: Needs review » Needs work

Test failures.

andypost’s picture

There's new RFC to add HTML5 support to PHP (8.4 I bet) https://wiki.php.net/rfc/domdocument_html5_parser

mfb’s picture

The HTML5 RFC is good news and it's about time, although there will be more waiting to be able to use it.

longwave’s picture

Yeah, I feel like we should move forward with this anyway, then when Drupal increases the minimum version enough to use the new HTML5 DOM, we can migrate away from masterminds/html5. I also like that masterminds/html5 is explicitly called out in the RFC as it helps to confirm that we have chosen the correct solution here.

wim leers’s picture

Yeah, I feel like we should move forward with this anyway

💯

longwave’s picture

Status: Needs work » Needs review

This should be ready for another round of review. Fixed the test fails - removed the big pipe placeholder arguments attribute when it is empty to hopefully avoid confusion, and fixed a bug in HtmlSerializerRules that was double-escaping some ampersands.

smustgrave’s picture

Status: Needs review » Reviewed & tested by the community

From what I can tell all threads have had a response at least.

CR reads well.

And after 8 years think moving to HTML5 makes sense :) also getting this in about a month out from 10.2 would give contrib modules good time to fix.

alexpott’s picture

Status: Reviewed & tested by the community » Needs work

I think the changes to bigpipe mean that that module should have a post update to ensure the render cache is clearer so any cached placeholders have the expected HTML.

andypost’s picture

FYI this upgrade makes the related issue obsolete so removes last blocker for PHP 8.3 compatibility (and independent of libxml version) #3383577-20: TextSummaryTest:testLength() fails on some libxml versions

andypost’s picture

Status: Needs work » Needs review

Added pib_pipe.post_update.php with big_pipe_post_update_html5_placeholders() for #107

longwave’s picture

Status: Needs review » Reviewed & tested by the community

Thank you @andypost.

alexpott’s picture

Status: Reviewed & tested by the community » Fixed
Issue tags: +10.2.0 release highlights, +10.2.0 release notes

Committed 201ae2e and pushed to 11.x. Thanks!

  • alexpott committed 201ae2e3 on 11.x
    Issue #2441811 by longwave, daffie, andypost, edurenye, lauriii, joseph....
wim leers’s picture

🥳 Thanks everyone, and @longwave in particular! 👏 8.5 years in the making 😅

joseph.olstad’s picture

Nicely done folks! I wonder how this will affect contrib? We shall see! :)

andypost’s picture

Thank you all!
It was a blocker for PHP 8.3 compatibility!

Meantime the parent issue needs some attention for follow-ups as mostly all libxml related issues could be closed #1333730: [Meta] PHP DOM (libxml2) misinterprets HTML5

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

cilefen’s picture

There is a report calling this a regression #3445910: Issue with HTML `&nbsp;` not being correctly filtered out.