Problem/Motivation
There's feature freeze of PHP 8.4 coming but new HTML5 classes already available (in alpha releases) and core nww using masterminds/html5 library after #1333730: [Meta] PHP DOM (libxml2) misinterprets HTML5
Steps to reproduce
- new parser and classes implemented in https://wiki.php.net/rfc/domdocument_html5_parser (WHATWG DOM spec added)
- new API is opt-in https://wiki.php.net/rfc/opt_in_dom_spec_compliance
- CSS selectors and more https://wiki.php.net/rfc/dom_additions_84
Proposed resolution
Explore effect of replacement of html5-php library with native implementation from PHP 8.2 and prepare core to get rid off PHP-library in favour of native dom extension (which is required already)
Remaining tasks
- replace create/load methods with new classes and try fix tests
- compare replacement of CSS-selectors library with native implementation
User interface changes
Introduced terminology
API changes
Data model changes
Release notes snippet
Issue fork drupal-3463613
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #2
andypostComment #3
andypostComment #4
mfbComment #5
andypostRaising priority as upgrade of masterminds library caused tests to fail in #3551569: Update symfony/* with dependencies for PHP 8.5
Comment #6
longwaveSymfony 7.4 has chosen to default to the PHP HTML5 parser in their DomCrawler component if you are on PHP 8.4, and it will become the only option in Symfony 8, so we should probably think about doing the same.
Comment #10
berdirStarted looking into this.
This is going to require some work and we won't be able to make this an internal replacement.
The new API has completely new classes in the \Dom namespace, masterminds/html5 returns a classic \DOMDocument, and those things are very much part of the API's we offer around HTML::load() and HTML::serialize()
I started by providing completely new replacement methods for them, and for testing, converted Html::normalize() as well as Html::transformRootRelativeUrlsToAbsolute() to those new methods.
Some tests I did run for those methods required some adjustments, they don't treat invalid HTML in quite the same way.
First I had the HTML_NO_DEFAULT_NS flag, somehow that didn't work at all with the transform unit tests. I also have no idea what to do with the special stuff in HTML::serialize(). I think we maybe don't need to bother with that cdata/non-html5-browser stuff. But the escaping I'm not sure yet how to convert to the new API.
Among other things, I used https://devm.io/php/better-html-parsing-in-php as a source/guide.
Comment #11
berdirAs the test fail in core/tests/Drupal/Tests/Component/Utility/XssTest.php for example shows is that this breaks some stuff due to the removed special handling of attributes. I haven't investigated yet if this is possible to restore but my gut feeling says no because while currently this part is PHP code, with the new native approach is not and likely not designed to allow extensibility at that point. Happy to be proven wrong.
My suggestion would be to revert the change to Html::normalize() for now, stick with the custom approach there and instead update existing calls and usages that specifically call out Html::load() and Html::serialize() in more specific scenarios where we know that it should not result in the issues that we see here, such as tests and *maybe* existing filter stuff.
And instead, we'll likely need to start from within our XSS filtering and undo the limitations that are currently documented on \Drupal\Component\Utility\HtmlSerializerRules, which possibly doesn't just affect PHP but also JS and will need to be one very carefully.
Either way, best scenario to me for 12.0 is that we can offer an alternative, more performant API for cases where that works and only deprecate masterminds/html5 later on.
Comment #12
longwaveMaybe we introduce an
Html5class alongsideHtmlthat uses the new library, and then deprecate theHtmlclass/methods?