Problem/Motivation

There's feature freeze of PHP 8.4 coming but new HTML5 classes already available (in alpha releases) and core nww using masterminds/html5 library after #1333730: [Meta] PHP DOM (libxml2) misinterprets HTML5

Steps to reproduce

- new parser and classes implemented in https://wiki.php.net/rfc/domdocument_html5_parser (WHATWG DOM spec added)
- new API is opt-in https://wiki.php.net/rfc/opt_in_dom_spec_compliance
- CSS selectors and more https://wiki.php.net/rfc/dom_additions_84

Proposed resolution

Explore effect of replacement of html5-php library with native implementation from PHP 8.2 and prepare core to get rid off PHP-library in favour of native dom extension (which is required already)

Remaining tasks

- replace create/load methods with new classes and try fix tests
- compare replacement of CSS-selectors library with native implementation

User interface changes

Introduced terminology

API changes

Data model changes

Release notes snippet

Issue fork drupal-3463613

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Comments

andypost created an issue. See original summary.

andypost’s picture

mfb’s picture

andypost’s picture

Priority: Normal » Major
Issue tags: +Vienna2025
Related issues: +#3551569: Update symfony/* with dependencies for PHP 8.5

Raising priority as upgrade of masterminds library caused tests to fail in #3551569: Update symfony/* with dependencies for PHP 8.5

longwave’s picture

Symfony 7.4 has chosen to default to the PHP HTML5 parser in their DomCrawler component if you are on PHP 8.4, and it will become the only option in Symfony 8, so we should probably think about doing the same.

Version: 11.x-dev » main

Drupal core is now using the main branch as the primary development branch. New developments and disruptive changes should now be targeted to the main branch.

Read more in the announcement.

berdir made their first commit to this issue’s fork.

berdir’s picture

Status: Active » Needs work

Started looking into this.

This is going to require some work and we won't be able to make this an internal replacement.

The new API has completely new classes in the \Dom namespace, masterminds/html5 returns a classic \DOMDocument, and those things are very much part of the API's we offer around HTML::load() and HTML::serialize()

I started by providing completely new replacement methods for them, and for testing, converted Html::normalize() as well as Html::transformRootRelativeUrlsToAbsolute() to those new methods.

Some tests I did run for those methods required some adjustments, they don't treat invalid HTML in quite the same way.

First I had the HTML_NO_DEFAULT_NS flag, somehow that didn't work at all with the transform unit tests. I also have no idea what to do with the special stuff in HTML::serialize(). I think we maybe don't need to bother with that cdata/non-html5-browser stuff. But the escaping I'm not sure yet how to convert to the new API.

Among other things, I used https://devm.io/php/better-html-parsing-in-php as a source/guide.

berdir’s picture

As the test fail in core/tests/Drupal/Tests/Component/Utility/XssTest.php for example shows is that this breaks some stuff due to the removed special handling of attributes. I haven't investigated yet if this is possible to restore but my gut feeling says no because while currently this part is PHP code, with the new native approach is not and likely not designed to allow extensibility at that point. Happy to be proven wrong.

My suggestion would be to revert the change to Html::normalize() for now, stick with the custom approach there and instead update existing calls and usages that specifically call out Html::load() and Html::serialize() in more specific scenarios where we know that it should not result in the issues that we see here, such as tests and *maybe* existing filter stuff.

And instead, we'll likely need to start from within our XSS filtering and undo the limitations that are currently documented on \Drupal\Component\Utility\HtmlSerializerRules, which possibly doesn't just affect PHP but also JS and will need to be one very carefully.

Either way, best scenario to me for 12.0 is that we can offer an alternative, more performant API for cases where that works and only deprecate masterminds/html5 later on.

longwave’s picture

Maybe we introduce an Html5 class alongside Html that uses the new library, and then deprecate the Html class/methods?