Problem/Motivation

libxml2 prior to 2.9.0 removed blank elements (whitespace) when parsing HTML. This behaviour was reverted in later versions of libxml2 in https://gitlab.gnome.org/GNOME/libxml2/-/commit/f933c898132f20a50ba39ac6...

In #3204929: Html::load() inconsistent space removal with old libxml2 versions. we added LIBXML_NOBLANKS to DOMDocument::loadHTML() calls to preserve compatibility with the previous behaviour, so our tests pass the same way on both the old and new versions.

We should decide whether stripping whitespace is the right thing to do, or whether we should now preserve whitespace and update our tests to match the new behaviour.

Steps to reproduce

Proposed resolution

Remaining tasks

User interface changes

API changes

Data model changes

Release notes snippet

CommentFileSizeAuthor
#8 3298822-8.patch1.51 KBolli

Comments

longwave created an issue. See original summary.

catch’s picture

If we change this, would we need a minimum libxml2 version as a dev dependency?

longwave’s picture

Title: Decide whether to preserve whitespace in DOMDocument::loadHTML() calls » Preserve whitespace in DOMDocument::loadHTML() calls

I guess the action here is either preserve whitespace, or close this as won't fix.

There are other issues with DOMDocument::loadHTML() in that it does not parse HTML5 elements properly. #2441811: Upgrade filter system to HTML5 fixes this and needs similar changes to tests, as the PHP-based HTML5 parsing library also preserves whitespace, so to me one reason to make this change here is to help prepare for that issue.

longwave’s picture

@catch Yes, although the parent issue claims to show different behaviour in 2.9.4 vs 2.9.10, yet the libxml2 changelog notes this as a change in 2.9.0, so this might be OS-dependent in some way, which will be hard for us to detect - I guess OS packages may patch libxml2 in their own ways?

https://gitlab.gnome.org/GNOME/libxml2/-/blob/5930fe01963136ab92125feec0...

Maybe we either need an explicit check for the behaviour, or we rewrite the tests so they work either way?

Also note that libxml2 2.9.0 is now 9 years old, maybe we should just forget about the old behaviour once DrupalCI is running new enough versions everywhere.

longwave’s picture

Having said that there are also bug fixes around the HTML blanks behaviour in 2.9.5, so maybe that should be our minimum version. This is very difficult to test without compiling or acquiring old versions of libxml2!

andypost’s picture

Should we extend report/status with warning?

We already checking APCu, opCache, db, supported image formats

andypost’s picture

Maybe there's some package to detect system capabilities?

olli’s picture

Status: Active » Needs review
StatusFileSize
new1.51 KB

Here is a patch that reverts #3204929: Html::load() inconsistent space removal with old libxml2 versions..

Without this O<sub>2</sub> <strong>oxygen</strong> is rendered without space as O2oxygen.

longwave’s picture

@olli Can you add a test case that covers your example?

Status: Needs review » Needs work

The last submitted patch, 8: 3298822-8.patch, failed testing. View results

longwave’s picture

Version: 10.0.x-dev » 10.1.x-dev
Issue tags: +Needs change record

This will also need a change record, and can only go into 10.1.x now as it's a behaviour change.

Version: 10.1.x-dev » 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch, which currently accepts only minor-version allowed changes. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

wim leers’s picture

poker10’s picture

After #2441811: Upgrade filter system to HTML5, it seems like there is still one single usage of DOMDocument::loadHTML() in ActiveLinkResponseFilter::setLinkActiveClass().

andypost’s picture

Version: 11.x-dev » main

Drupal core is now using the main branch as the primary development branch. New developments and disruptive changes should now be targeted to the main branch.

Read more in the announcement.