Problem/Motivation
libxml2 prior to 2.9.0 removed blank elements (whitespace) when parsing HTML. This behaviour was reverted in later versions of libxml2 in https://gitlab.gnome.org/GNOME/libxml2/-/commit/f933c898132f20a50ba39ac6...
In #3204929: Html::load() inconsistent space removal with old libxml2 versions. we added LIBXML_NOBLANKS to DOMDocument::loadHTML() calls to preserve compatibility with the previous behaviour, so our tests pass the same way on both the old and new versions.
We should decide whether stripping whitespace is the right thing to do, or whether we should now preserve whitespace and update our tests to match the new behaviour.
Steps to reproduce
Proposed resolution
Remaining tasks
User interface changes
API changes
Data model changes
Release notes snippet
| Comment | File | Size | Author |
|---|---|---|---|
| #8 | 3298822-8.patch | 1.51 KB | olli |
Comments
Comment #2
catchIf we change this, would we need a minimum libxml2 version as a dev dependency?
Comment #3
longwaveI guess the action here is either preserve whitespace, or close this as won't fix.
There are other issues with DOMDocument::loadHTML() in that it does not parse HTML5 elements properly. #2441811: Upgrade filter system to HTML5 fixes this and needs similar changes to tests, as the PHP-based HTML5 parsing library also preserves whitespace, so to me one reason to make this change here is to help prepare for that issue.
Comment #4
longwave@catch Yes, although the parent issue claims to show different behaviour in 2.9.4 vs 2.9.10, yet the libxml2 changelog notes this as a change in 2.9.0, so this might be OS-dependent in some way, which will be hard for us to detect - I guess OS packages may patch libxml2 in their own ways?
https://gitlab.gnome.org/GNOME/libxml2/-/blob/5930fe01963136ab92125feec0...
Maybe we either need an explicit check for the behaviour, or we rewrite the tests so they work either way?
Also note that libxml2 2.9.0 is now 9 years old, maybe we should just forget about the old behaviour once DrupalCI is running new enough versions everywhere.
Comment #5
longwaveHaving said that there are also bug fixes around the HTML blanks behaviour in 2.9.5, so maybe that should be our minimum version. This is very difficult to test without compiling or acquiring old versions of libxml2!
Comment #6
andypostShould we extend report/status with warning?
We already checking APCu, opCache, db, supported image formats
Comment #7
andypostMaybe there's some package to detect system capabilities?
Comment #8
olli commentedHere is a patch that reverts #3204929: Html::load() inconsistent space removal with old libxml2 versions..
Without this
O<sub>2</sub> <strong>oxygen</strong>is rendered without space as O2oxygen.Comment #9
longwave@olli Can you add a test case that covers your example?
Comment #11
longwaveThis will also need a change record, and can only go into 10.1.x now as it's a behaviour change.
Comment #13
wim leersIsn't this obsolete thanks to #1333730: [Meta] PHP DOM (libxml2) misinterprets HTML5?
Comment #14
poker10 commentedAfter #2441811: Upgrade filter system to HTML5, it seems like there is still one single usage of
DOMDocument::loadHTML()inActiveLinkResponseFilter::setLinkActiveClass().Comment #15
andypostMore changes can come in PHP 8.4 https://wiki.php.net/rfc/opt_in_dom_spec_compliance