Problem/Motivation

The filters 'htmlcorrector', 'html' and the testing system needs html parsing and a valid DOM to work with. This is done by the libxml2 library provided in PHP that cleans html and transform it to a dom. Libxml2 assumes all html is HTML4 and correct it with HTML4 rules. As Drupal will be based on HTML5, typical HTML5 tags and constructions will be marked invalid, added or stripped.
A small test example is that <span lang="en"> transforms to <span lang="en" xml:lang="en">

Proposed resolution

Chosen solution

The consensus is that the best solution is to use html5-php (rewrite of html5lib). The library has almost all the functionality that we need. And the functionality is already implemented. There was a problem with the library and that has been fixed. The functionality that we miss is when the DOMDocument has a default namespace it's not possible to parse it using XPath unless the namespace is registered as something else. In the patch there are two helper-classes to add this functionality.

Remaining tasks

#2441373: Upgrade tests to HTML5
#2667340: Usage of field_prefix and container-inline creates invalid markup.
#2441811: Upgrade filter system to HTML5

@todo: needs an issue created or existing issue linked from this one- convert the filter module to use masterminds/html5

User interface changes

None

API changes

Change in behavior of filter_dom_load, the html-filter and html-corrector

Beta phase evaluation

Reference: https://www.drupal.org/core/beta-changes
Issue priority Major because the bug has many repercussions on filters using the HTML filter, and under some circumstances could result in an invalid DOM. Not critical because it does not render the HTML filter unusable.
Prioritized changes The main goal of this issue is a bugfix to for HTML5 support, which is part of the Drupal 8 product.

Blocked Issues

#1277290: Use a proper HTML parser for every core filter

PHP Bug #60021 DOMDocument errors on HTML5 tags
issue that replaced own (faulty) function with libxml for the html corrector filter in 2009: #374441: Refactor Drupal HTML corrector (PHP5)
#725260: Use PHP's Tidy component for the clean-up HTML filter

Original report

The html filter corrector is now based on XHTML, but Drupal8 should output html5
At the moment
<span lang="en"> transforms to <span lang="en" xml:lang="en">
(see issue #1328768: attributes 'xml:lang' and 'xml:id' transform to 'lang' and 'id' in filter_xss)

two possible causes:
function filter_dom_load currently loads @$dom_document->loadHTML('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>' . $text . '</body></html>');

function filter_dom_serialize uses $dom_document->saveXML

CommentFileSizeAuthor
#233 entity_empbed-invalid_tag-2802269-1.patch1.02 KBvictor-shelepen
#226 1333730-226.patch14.37 KBtwistor
#207 interdiff_193.txt4.16 KBMile23
#207 1333730_207.patch498.57 KBMile23
#193 1333730-193-do-not-test.patch77.23 KBdaffie
#193 1333730-193.patch494.89 KBdaffie
#192 interdiff_186.txt6.21 KBMile23
#192 1333730_192.patch494.76 KBMile23
#186 1333730-186-without-html5-library-do-not-test.patch73.71 KBbabruix
#186 1333730-186.patch491.3 KBbabruix
#183 1333730-183-do-not-test.patch700.24 KBbabruix
#182 interdiff-1333730-180-do-not-test-182.txt201.06 KBbabruix
#182 1333730-182.patch700.24 KBbabruix
#182 1333730-182.patch700.24 KBbabruix
#180 interdiff-1333730-175-180-do-not-test.txt1.08 KBbabruix
#180 1333730-180-do-not-test.patch489.86 KBbabruix
#175 interdiff-1333730-172-175.txt1.98 KBbabruix
#175 1333730-175.patch489.73 KBbabruix
#172 interdiff-1333730-169-172.txt1.08 KBbabruix
#172 1333730-172.patch489.86 KBbabruix
#170 interdiff-1333730-167-169.txt18.81 KBbabruix
#169 interdiff-1333730-167-169.txt0 bytesbabruix
#169 1333730-169.patch489.21 KBbabruix
#167 1333730-167.patch489.78 KBbabruix
#163 1333730-163.patch490.72 KBtwistor
#163 interdiff.txt1.78 KBtwistor
#161 1333730-160.patch490.25 KBtwistor
#161 interdiff.txt1.02 KBtwistor
#159 1333730-159.patch490.08 KBtwistor
#159 interdiff.txt35.12 KBtwistor
#155 1333730-155.patch482.67 KBtwistor
#155 interdiff.txt641 bytestwistor
#153 1333730-153.patch482.04 KBtwistor
#153 interdiff.txt5.25 KBtwistor
#149 1333730-148.patch478.26 KBtwistor
#149 interdiff.txt8.49 KBtwistor
#142 1333730-142.patch470.63 KBtwistor
#142 interdiff.txt1.02 KBtwistor
#140 interdiff.txt27.42 KBtwistor
#140 1333730-140.patch469.84 KBtwistor
#128 interdiff.txt22.86 KBWim Leers
#128 1333730-128.patch443.38 KBWim Leers
#125 1333730_125-do-not-test.patch1.77 KBchx
#122 interdiff-1333730-82-121.txt10.65 KBfilijonka
#121 interdiff-1333730-82-121.txt5.29 KBfilijonka
#121 1333730-121.patch424.43 KBfilijonka
#118 interdiff-1333730-82-118.txt5.29 KBfilijonka
#118 1333730-118.patch424.43 KBfilijonka
#85 interdiff-1333730-79-82.txt5.34 KBfilijonka
#82 interdiff-1333730-79-82.txt2.16 MBfilijonka
#82 1333730-82.patch424.62 KBfilijonka
#79 1333730-79.patch424.92 KBfilijonka
#76 1333730-76.patch9.97 KBfilijonka
#22 filter.html5_.22.patch18.61 KBsun
#22 interdiff.txt2.9 KBsun
#18 filter.html5_.18.patch17.35 KBsun
#18 interdiff.txt2.94 KBsun
#15 interdiff.txt10.76 KBsun
#14 filter.html5_.14.patch17.3 KBsun
#12 drupal8.html5-dom.12.patch4.12 KBsun
#10 libxml2-html5-test.txt1.06 KBDamien Tournoud
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

RobW’s picture

Yep, I'm pretty sure this is a limitation of the loader and serializer functions. Unfortunately changing everything that says XML to HTML doesn't help -- in order to get new HTML5 elements parsed correctly you need to slip them through the anything goes XML element name loophole. HTML here basically means HTML4 only.

There's a new php html parser called html5lib that I believe addresses this. It's MIT licensed, which I'm pretty sure I've read in the issue queue is GPLv2 compatible.

Related links:
http://stackoverflow.com/questions/5033504/php-domdocument-loadhtml-retu...
http://stackoverflow.com/questions/4029341/dom-parser-that-allows-html5-...
http://code.google.com/p/html5lib/

sun’s picture

Priority: Normal » Major
Issue tags: +html5, +Testing system

Ugh. Add WebTestBase::parse() to that.

Hanno’s picture

sun’s picture

Title: htmlcorrector is based on XHTML, while D8 will be based on html5 » PHP DOM (libxml2) only understands XHTML4, misinterprets HTML5, but D8 must cope with HTML5
Priority: Major » Critical

DomCrawler is also based on PHP's DOM extension:
http://api.symfony.com/2.1/Symfony/Component/DomCrawler/Crawler.html
https://github.com/symfony/DomCrawler/blob/master/Crawler.php

As this PHP bug clarifies https://bugs.php.net/bug.php?id=60021 the actual issue is caused by libxml2, which DOM is using under the hood.

libxml2's changelog mentions a commit for the new possible HTML5 meta charset value, but otherwise does not mention html5.

Unless I'm missing something big, this means we need to use a user-space library. Aforementioned PHP bug seems to suggest html5lib.

(Even if libxml2 was fixed [but there doesn't even seem to be an issue to do that], that would only be contained in the latest and greatest release of PHP, and we cannot require that.)

The impact of all of this goes way beyond Filter module, so I'm adjusting the title and bumping this to critical.

Hanno’s picture

Same discussion about html5 parsing with Symfony: https://github.com/symfony/symfony/issues/1733
Is this something we should or can fix in Symfony and let modules use this DOM crawler? Symfony has a workaround, but needs a real solution: https://github.com/symfony/symfony/commit/a57a4aff

Hanno’s picture

Same discussion about html5 parsing with Symfony: https://github.com/symfony/symfony/issues/1733
Is this something we should or can fix in Symfony and let modules use this DOM crawler? Symfony has a workaround, but needs a real solution: https://github.com/symfony/symfony/commit/a57a4aff

Crell’s picture

Symfony's workaround doesn't seem to be a workaround, if I'm reading that correctly. The tests are still verifying that HTML5 does not validate, which is not what we want.

Generally something like this should be fixed as far upstream as possible. None of us are C coders as far as I know, so that means in Symfony. I'm fine with making a fix upstream and then pulling in DomCrawler. What the fix should be, though, I have no idea.

sun’s picture

From where do you get that? Did I overlook something?

I linked to the DomCrawler source code in #4, and that calls into the regular, native DOMDocument & Co PHP classes, which are all bound to the current libxml2 feature-set.

Crell’s picture

I was looking at the tests found via the comment in #6. I didn't look at the current source yet.

In any event, my previous statement still stands. If we and Symfony have the same bug, let's workaround it once in DomCrawler and then leverage it. No sense there being two workarounds for libxml2 floating around.

Damien Tournoud’s picture

Priority: Critical » Normal
FileSize
1.06 KB

This is nothing more then a normal bug.

We only have two things to do:

  • Switch to a HTML doctype instead of our XHTML doctype in filter_dom_load()
  • Silence the warnings of libxml2 with libxml_use_internal_errors(true);

The first change solves the issue highlighted in the original post. The second change silences the invalid tag warnings.

In WebTestBase, we can decide to only ignore "invalid tag" errors (error code 801).

Test script attached.

RobW’s picture

Switch to a HTML doctype instead of our XHTML doctype in filter_dom_load()
Silence the warnings of libxml2 with libxml_use_internal_errors(true);

Would this regress advanced html correction that the current filter provides for xhtml? Or are attribute corrections (like the one the OP was looking to avoid) not really an issue in HTML5?

sun’s picture

Priority: Normal » Critical
Status: Active » Needs review
FileSize
4.12 KB

I'm surprised that it took this long to face these false-errors in actual test results.

This bug blocks #1168246: Freedom For Fieldsets! Long Live The DETAILS. now.

The patch in that issue is causing DOM parser exceptions on the DETAILS and SUMMARY elements:
http://qa.drupal.org/pifr/test/370623

DOMDocument::loadHTML(): Tag details invalid in Entity, line: 1
	Warning	WebTestBase.php	1371	Drupal\simpletest\WebTestBase->drupalPostAJAX()
DOMDocument::loadHTML(): Tag summary invalid in Entity, line: 1
	Warning	WebTestBase.php	1371	Drupal\simpletest\WebTestBase->drupalPostAJAX()

Systematic/infrastructural bugs like this, which are causing irrelevant test failures for other changes, and worse, which are blocking up API changes, are critical.

Attached patch translated #10 into a patch. Not yet sure how to test this.

Status: Needs review » Needs work

The last submitted patch, drupal8.html5-dom.12.patch, failed testing.

sun’s picture

Status: Needs work » Needs review
FileSize
17.3 KB
  1. Updated HTMLCorrector filter tests to HTML5.
  2. Updated TextSummaryTest to HTML5.
  3. Updated drupal_html_to_text() to HTML5.

This patch is expected to come back green.

sun’s picture

FileSize
10.76 KB
sun’s picture

Assigned: Unassigned » sun
Damien Tournoud’s picture

Status: Needs review » Needs work
@@ -472,7 +474,7 @@ function drupal_html_to_text($string, $allowed_tags = NULL) {
           break;
 
         // Horizontal rulers
-        case 'hr':
+        case 'hr/':
           // Insert immediately.
           $output .= drupal_wrap_mail('', implode('', $indent)) . "\n";
           $output = _drupal_html_to_text_pad($output, '-');

I suppose we should support both forms?

-    // Do not transform empty tags to a single closed tag if the tag's content model is not EMPTY.
+    // Do not transform tags whose content model is not EMPTY.
     $f = _filter_htmlcorrector('<div></div>');
-    $this->assertEqual($f, '<div></div>');
+    $this->assertEqual($f, '<div/>');

Is this a regression? If it is not, we should update the comments, because they do not match the result.

sun’s picture

Status: Needs work » Needs review
FileSize
2.94 KB
17.35 KB
  1. Prepared HtmlToTextTest for DrupalUnitTestBase.
  2. Make drupal_html_to_text() also account for HRs with attributes.
  3. Make drupal_html_to_text() account for empty paragraphs.
  4. Updated FilterUnitTest comments for HTML5.
sun’s picture

I think this patch is ready to fly. Would be great to get this blocker out of the way.

Damien Tournoud’s picture

-    // Transform tags whose content model is EMPTY.
+    // Transform tags whose content model is empty.
     $f = _filter_htmlcorrector('<img src="http://example.com/test.jpg">test</img>');
     $this->assertEqual($f, '<img src="http://example.com/test.jpg"/>test');
     $f = _filter_htmlcorrector('<br></br>');
     $this->assertEqual($f, '<br/>');

This is called a "Void element" now in HTML5.

-    // Do not transform tags whose content model is not EMPTY.
+    // Transform empty elements into self-closing tags.
     $f = _filter_htmlcorrector('<div></div>');
     $this->assertEqual($f, '<div/>');

This one is troublesome. <div/> is invalid HTML5 (the / is only permitted at the end of a Start tag). In HTML5 tree insertion algorithm, the self-closing flag is purely informational, so <div/> will be interpreted as <div> in a conformant browser.

This feels like a deal breaker. Someone wants to take a look at the (seemingly unmaintained) html5lib?

Damien Tournoud’s picture

I assume that using ->saveHTML() instead of ->saveXML() should fix the issue described above.

sun’s picture

FileSize
2.9 KB
18.61 KB

Hmmm.

Changed saveXML() to saveHTML().

Regressions:

- Pretty-formats *some* (but not all) of the output. Wasn't able to make it stop that. ::$formatOutput is ignored.

- CDATA escaping for scripts and styles is stripped from the output.

The test results should expose the differences "live" as preformatted debug output in the testbot test results.

RobW’s picture

If I remember correctly we don't need CDATA anymore in html5, unless we're sending the page as application/xhtml+xml and trying to validate (which no one has ever or will ever do). So I think the cdata stripping is correct behavior.

Status: Needs review » Needs work

The last submitted patch, filter.html5_.22.patch, failed testing.

Damien Tournoud’s picture

Note: the $node argument of ->saveHTML() appeared in PHP 5.3.6. I think that's a good reason to bump from 5.3.5 to 5.3.6 as our minimum requirement.

sun’s picture

We have two related issues for that already - I followed up on both to note this issue:
#1751100: Bump minimum version of php required to 5.3.5
#1800122: Bump minimum version of php required to 5.3.10

sun’s picture

Priority: Critical » Major
Issue tags: +Release blocker

Meanwhile, I worked around the test exceptions in #1168246: Freedom For Fieldsets! Long Live The DETAILS. by adding error suppression to drupalPostAjax(), which - unlike all other usages of DOMDocument - did not suppress errors.

However, I still believe that there is no way we can release D8 with HTML5 markup but a parser that expects XHTML all over the place. Thus, this is borderline critical, but I'm demoting it to major, plus tag.

We should bump the minimum PHP version to 5.3.6 or 5.3.10, and quite potentially, just simply ignore libxml's extra formatting for subsequent paragraphs. (in my testing, all markup remained as is, only <p></p><p></p> was formatted with a linebreak between paragraphs, which seems to be a bug in libxml — perhaps even fixed in later PHP versions)

Hanno’s picture

Maybe there is some progress:

PHP 5.3.10 minimum is since March '13 in core, so that's not a blocker anymore. #1800122: Bump minimum version of php required to 5.3.10

FYI There are also some enhancements in the Symfony DOMcrawler:
http://symfony.com/blog/new-in-symfony-2-3-domcrawler-enhancements

mbutcher’s picture

We're just about done with a standards-complaint PHP HTML5 parser and serializer.

https://github.com/technosophos/HTML5-PHP

This was a fork of libhtml5, but has since turned into an almost complete rewrite. It parses HTML 5 (and 3 and 4) and creates a DOMDocument (libxml). It can also serialize any HTML DOMDocument into an HTML5 string/file.

The goal isn't necessarily to clean or transform documents (you can't for instance, feed it HTML2 and expect it to spit out gorgeous and semantic HTML5). But it does parse and write HTML5.

dcam’s picture

http://drupal.org/node/1427826 contains instructions for updating the issue summary with the summary template.

dcam’s picture

Issue summary: View changes

.

Hanno’s picture

Issue summary: View changes

Summary added

Hanno’s picture

Issue summary: View changes

PHP bug added

Hanno’s picture

Issue summary: View changes

.

Hanno’s picture

Issue summary: View changes

more info

Hanno’s picture

Issue summary: View changes

.

Hanno’s picture

Issue summary: View changes

.

Hanno’s picture

Issue summary: View changes

.

Hanno’s picture

I added a summary and hope this describes the issue well enough, please correct. Writing this summary I was unsure whether surpressing errors is still a possible solution, or that we need a real html5 parser.
If we need a html5 parser, best thing to me is that we try to get that in PHP/libxml and a similar solution in Symfony for older PHP versions.
If we have to find a HTML5 parser, some discussion is needed about the expectations of such a parser. I think the expectations for testing differs from the htmlcorrector filter.

Hanno’s picture

Issue summary: View changes

improvements

Hanno’s picture

Issue summary: View changes

.

Hanno’s picture

Issue summary: View changes

issue that introduced libxml for html corrector filter in 2009

Hanno’s picture

mgifford’s picture

Pancho’s picture

Priority: Major » Critical
Issue tags: -Release blocker

Re #27:

However, I still believe that there is no way we can release D8 with HTML5 markup but a parser that expects XHTML all over the place. Thus, this is borderline critical, but I'm demoting it to major, plus tag.

"No way we can release D8" means "critical", and I agree with that. Anyway, we're way beyond thresholds both regarding critical and major bugs.
So all this "major plus 'Release blocker' tag" leads to is covering our real technical debt, as discussed on #1810428: [Policy, no patch] Adjust major and critical task thresholds during code thaw. Which isn't good, even more because this specific issue will take quite some time to be properly solved.

Crell’s picture

mbutcher: Do you think your library is stable enough for us to leverage for D8 core? Or will it be soon?

mfer’s picture

I'm jumping in here because Hanno reached out to me. Thanks.

I'm co-developing html5-php along with mbutcher. There is a definite need for a html5 parser and writer. html5-php parses html5 into either a DOMDocument (for a full document) or a DOMDocumentFragment (for fragment parsing) and the writer can serialize these. Note, fragment parsing and document parsing are two different methods. Using the DOM supplied with libxml means it's compatible with other tools.

It's currently a beta1 release and we believe it to be API and feature complete. We're looking for feedback to improve stability and performance. The sooner we get good feedback the sooner we can release a stable version.

html5-php has composer support and we're using semantic versioning.

If you want to use the libxml parser for html prior to html5 you'd need code to detect which is being used. You could look at the document tag at the start.

I hope this helps fill in the blanks. The release of html5-php will be as fast as we can get feedback on it's use to verify it's all working as expected. The testing setup has over 128 tests and 2k assertions. It is run through travis ci for php 5.3, 5.4, and 5.5.

One note, for proper utf8 handling we do need mbstring to be present. iconv can do some things but there are some versions of PHP that have issues with iconv.

if you have questions I'd be happy to answer them.

mfer’s picture

Issue summary: View changes

Tidy added

Crell’s picture

Thanks, Matt! Which PHP versions do we need to be concerned about, iconv vs. mbstring-wise?

mfer’s picture

At the very least, all or almost all of PHP 5.4 and some of the newer versions of PHP 5.3. I've not tested on PHP 5.5. The details are up at http://us1.php.net/manual/en/function.iconv.php#108643.

The issue pops up when we go to filter invalid characters from a document or fragment. If there are no invalid characters there is no issue. If there are invalid characters and iconv is used we get nothing back from it. This may be an edge case for your use. How often do people put invalid characters in html documents?

Hanno’s picture

@mfer thanks for this excellent overview and it sounds good.
Invalid characters: We will have to deal with invalid characters for the filter module as text can be automatically imported from external sources. Should we filter these characters before parsing?

mfer’s picture

@Hanno by invalid characters I mean characters like utf-16 ones in a utf-8 document. How would you do that in Drupal for the filter system? To see how we do that (with lots of comments) see \HTML5\Parser\UTF8utils::convertToUTF8().

Hanno’s picture

Ah ok, sounds good. Indeed, utf-16 characters can not survive in a mysql utf-8 environment, as mysql throws an error for an Incorrect string value on input.
For illegal utf-8 control characters I see that a function checkForIllegalCodepoints() is included that throws an error if these are found in the text.

mfer’s picture

FYI, the html5-php project just had the first stable release. You can check it out at http://masterminds.github.io/html5-php/.

cosmicdreams’s picture

mfer, what's the way forward? We're going to be sprinting at Drupal Camp Fox Valley this weekend. Anything we can do to help this issue along?

mfer’s picture

Thanks for your interest in working on this. I think there are a few parts to this if I'm reading the issue correctly. Note, I was just here to help this along and am not intimately familiar with the issue.

First, the original description gets into how the html handling in the filter functions should output html5. At it's simplist you can pull in html5-php via composer.json and then update filter_dom_serialize() to use it. I'd start with trying to replace $dom_document->saveXML with \HTML5::saveHTML($dom_document);. Make sure you review the current patches as well.

Second, and you might want to do this first, investigate #1277290: Use a proper HTML parser for every core filter to have a common solution.

Third, as you use this file any bugs or question with html5-php in the html5-php project and I'll try to jump right on responding to them.

Make sure the html5 tests talked about here pass.

Again, others know this issue far better than I do. I'd reach out to them.

cosmicdreams’s picture

That's awesome Matt, Larry is here so I'll see if we can get going on this.

sun’s picture

Assigned: sun » Unassigned

I strongly disagree with the need for an external user-space library for this issue.

Native PHP functions are sufficient. The minimum required PHP version was the only blocker, but that appears to be resolved now.

The use-case is not to filter arbitrary web pages or whatnot. The use-case is to filter simple user input.

cosmicdreams’s picture

sun, are you saying that we can simply close this critical issue because it's not relevant anymore?

mfer’s picture

@sun, the build in PHP functions don't do html5. They can handle html4 and xml. But, there are noticeable differences.

Let's look at a parsing example. PHP sees

<audio src="foo.ogg">
    <track kind="captions" src="foo.en.vtt" srclang="en" label="English">
    <track kind="captions" src="foo.sv.vtt" srclang="sv" label="Svenska">
</audio>

as...

<audio src="foo.ogg">
    <track kind="captions" src="foo.en.vtt" srclang="en" label="English">
        <track kind="captions" src="foo.sv.vtt" srclang="sv" label="Svenska">
        </track>
    </track>
</audio>

While handling this example PHP will also throw a warnings for the audio tag and each of the track tags.

How do you see having native PHP functions handle html5 properly when they currently just don't?

I wish PHP handled html5 well. Then there would be no need for html5-php. I look foward to the day that library can be depreciated.

At the very least the doctype when wrapping a fragment in Drupal should be switched to an html5 one. That will get things closer to html5 when saveHTML is used for the native functions.

sun’s picture

Yes. And for that reason, I'd recommend to proceed with the patch in #22 first.

With that, only if we cannot either blame user input or if we cannot blame outdated PHP versions, we can look into more pain.

Is it legit to ask people to run Drupal on a current version of PHP for full/better HTML5 support? — Yes.

We do not want to scrape and parse entire HTML pages.

And we do not give any kind of guarantee that your input equals the output. If that is desired, disable the HTML filter for privileged users.

Drupal's user input filter system shoots for simple user input only. Nothing else.

KISS.

mfer’s picture

@sun a couple quick things...

Is it legit to ask people to run Drupal on a current version of PHP for full/better HTML5 support? — Yes.

PHP 5.5 does not have html5 support in DOM handling. It's using libxml that doesn't have html5 support either. PHP 5.5.4 (latest as of this post) still fails as my comment in #48 shows. The DOM/libxml handling that works for html5 happens to be the parts compatible with html4.

That being said, what is the current version of PHP with the full or better html5 support? If you're talking better, what does better mean?

You also said,

Drupal's user input filter system shoots for simple user input only. Nothing else.

Can you point me to where this is documented, where it comes from, and the details of what it means? For example, list of supported tags and attributes.

These details will be great for writing tests to make sure the implementation meets it. Maybe with some tweaks the built in PHP functionality will work. But, if full html5 support isn't the requirement than the deviations from that should be worked out, in writing, in tests, and passed through Dries or his proper lieutenant.

mfer’s picture

Issue summary: View changes

Add blocked issue

jibran’s picture

Status: Needs work » Needs review

22: filter.html5_.22.patch queued for re-testing.

Status: Needs review » Needs work

The last submitted patch, 22: filter.html5_.22.patch, failed testing.

The last submitted patch, 22: filter.html5_.22.patch, failed testing.

larowlan’s picture

Crell’s picture

If html5-php actually does what we need (I've not verified that myself), and can be used for even more, that doesn't mean we shouldn't use it if it does the job and we can eliminate or not write our own code. Zend Feed also has Atom support, which core doesn't use at all, but it now means contrib modules can leverage it as the go-to tool for Atom support. That's a good thing.

If we have working code now that does what we need, or is easy to modify, there's no reason to replace it. If it's use a library that is overkill vs. write our own junior version of said library, use the existing library.

longwave’s picture

html5-php's own documentation mentions that it should not be used for filtering:

html5-php is not an appropriate tool to use for this form of filtering. Neither is the PHP DOMDocument system provided by libxml.

If you need this form of filtering please look into tools such as HTML Purifier which can be installed via composer from Packagist.

Maybe we should follow that suggestion, and pull in HTML Purifier for the htmlcorrector filter?

Wim Leers’s picture

#57: filters like the caption filter (also in D8 core) do need HTML5 support, they're not about whitelisting certain HTML and removing the rest, but transforming HTML. For that, DOMDocument is ideal.

Hanno’s picture

Issue in which this parsing leads to invalid html. When a user adds <span lang="en"> in the basic HTML text format, it alters to a duplicate attribute: <span lang="en" xml:lang="en" xml:lang="en">
#2346989: [PP-1] 'Restrict images to this site' leads to incorrect HTML with language attribute

mgifford’s picture

filijonka’s picture

Assigned: Unassigned » filijonka
chx’s picture

I talked to filijonka about next steps for this rather important issue:

  1. pull in htmlpurifier via composer
  2. talk to the htmlpurifier maintainers (heddn and ezyang) about their experience with it
  3. rewrite the htmlcorrector and html filterr based on it. Looking at the beginning of http://htmlpurifier.org/docs and http://htmlpurifier.org/docs#toclink3 this should be really just a few lines of code.
  4. quite probably there are already tests for this so hopefully no testing will be required -- but make sure to check that there are actually tests and if not add them. No need for anything extensive I think since htmlpurifier itself must be tested.
chx’s picture

Issue summary: View changes
Hanno’s picture

@chx interesting. htmlpurifier can't do html5 right now (http://htmlpurifier.org/phorum/read.php?3,7497,7497#msg-7497). Are the maintainers willing to work on html5 and what are the advantages compared to html5-php?

chx’s picture

#57 said that html5-php is not suitable for this sort of filtering?

Hanno’s picture

Ah, but 'this form of filtering' is about sanitizing. HTML5-lib says:

When end users of a website or web application submit markup and that markup is going to be displayed back it needs to be sanitized. For example, it's usually not appropriate to display a script tag as that could leave to script injection security issues. html5-php is not an appropriate tool to use for this form of filtering. Neither is the PHP DOMDocument system provided by libxml. If you need this form of filtering please look into tools such as HTML Purifier which can be installed via composer from Packagist.

(https://github.com/Masterminds/html5-php/wiki/End-User-Submitted-Markup)
So if I read this correct, HTML5-lib is 'only' creating a DOMDocument, but not doing further processing for output, like sanitizing malicious script tags. In Drupal we have our own filter functions for that, so we probably even don't want to have a sanitizing function.

heddn’s picture

It is possible with a few days work upstream. As of now, htmlpurifier only has experimental HTML 5 support. But adding only support to handle specific HTML 5 tags would shrink the amount of necessary upstream work. How thorough an HTML 5 implementation are we seeking?

There are two thrusts to HTML5. The first is implementing the HTML5 parsing specification; absent an extremely motivated individual, this is probably never going to happen. The second is implementing support for all of the new tags that HTML5 adds; I'm not planning on doing this, but the process is not difficult, just a little tedious, and if someone submits a patch to upstream for support and is willing to work with me to get it up to the standards of the HTML Purifier codebase, getting this in is plausible.

I think you could get a large subset of HTML5 going in a few days of work. The easy stuff to do is add the new tags which HTML5 added support, which are basically the same as DIVs. The hard stuff is the really new things, like SVG or Canvas or parsing.

See #1321490: HTML5 Support? & http://htmlpurifier.org/phorum/read.php?3,7497,7497#msg-7497

As far as getting this into core, I'd love to see it happen. Configuring and making htmlpurifier run without any of the additional configuration forms would be fairly simple. The module's implementation is fairly simple & strait forward. It has its own internal caching layer that made the D6 => D7 port a little more involved. I think we've got that figured out and it won't be an issue going to D8.

chx’s picture

So security wise we are good. Security filtering is done by a custom parser that has stood the test of time pretty well. In _filter_html we do use the dom document to apply nofollow. Not sure whether this helps but I thought it relevant.

filijonka’s picture

hi ho

ok so when reading #66 and #67 I get the feeling with have actually two options for a third party vendor.
which one should we use is the question and trying to figure out what they can do and not do and in what state they are in.

HTML5-php seems to be in a ready state, supports composer, supports php namespace (a must?)

HTMLpurifier seems to not be in a ready state (but soon?) and no composer support as far as I can see, doesn't support namespace(?)

Well I'll check these two out and see how they works within drupal to beginning with.

heddn’s picture

htmlpurifier has Composer support and has a PHP namespace. htmlpurifier has been around in Drupal since at least 4.7, so it has a large and continually growing community.

filijonka’s picture

ok good but the documentation doesn't state how to use composer and in the documentation it says it has no support for php namespace but good it's wrong.

Mile23’s picture

On Packagist.org:

HTMLPurifier: https://packagist.org/packages/ezyang/htmlpurifier

HTML5-lib: https://packagist.org/packages/masterminds/html5

All other things being equal, here are some things to consider:

I was able to clone html5-lib and run the tests without having to learn anything. Also, it has a .travis.yml file, and it generated an HTML coverage report without being asked. Yay. (PHPUnit says the biggest change risk scores 18, which is excellent.)

htmlpurifier apparently uses SimpleTest for testing, and doesn't tell you how to install it. The docs say it requires a writable file system for caching, which is a bit of a non-starter.

Should we have a 'modernize htmlpurifier' campaign?

heddn’s picture

I'm not at all familiar with html5-php, but is it more of a parser? htmlpurifier is a HTML sanitizer/filter. From the issue summary, this issue is really looking for a parser, which htmlpurifier is not. Are these assumptions correct?

BTW, htmlpurifier's drupal module uses a DB backed cache.

Hanno’s picture

As I understand from the documentation of HTMLpurifier:

  • HTMLpurifier works with whitelisting. So that does mean we should provide HTMLpurifier with all possible HTML5 tags. Whitelisting is what we want in the filter 'limit allowed html tags', but not in other filters. (See this thread: 'I want to allow all tags' - http://htmlpurifier.org/phorum/read.php?3,3683,3683. a method to extend HTMLpurifier with HTML5 is to declare a whitelist of new HTML tags, and what kind of children and attributes are allowed.)
  • HTMLpurifier itself is planning to integrate html5lib. According to http://htmlpurifier.org/live/TODO this is planned for release 5.0. Meanwhile there are initiatives like https://github.com/kennberg/php-htmlpurfier-html5
ezyang’s picture

Hey guys,

I'm lead dev for HTML Purifier, so I thought I'd answer a few of the questions that were raised by this thread.

1. HTML Purifier is currently in "maintenance" mode. This means that I am fixing bugs and security problems, but not actively working on features. For new features like an alternative parser or more tag support, I need community help to make happen.

2. HTML Purifier currently uses either the built-in dom parser, its built-in (and slightly idiosyncratic) pure PHP parser, or the html5lib parser (which has fallen off the wagon and is pretty out-of-date.) The choice of parser backend is quite swappable, so it would not be difficult to plug in html5-php (indeed, it probably is a drop-in replacement for html5lib).

3. HTML Purifier operates on a whitelist, so yes, you have to teach it about every element/attribute that you want to support. https://github.com/kennberg/php-htmlpurfier-html5 is a good example of how easy it is to add support for some elements and attributes. The main reason why it's not supported in the core is I've been waiting for someone to go through and methodically add support for *as many* elements as humanly possible; most people have just added support for the dozen or things they need and left it at that.

4. Documentation for how to install SimpleTest for use with HTML Purifier can be found by looking at test-settings.sample.php. We have A LOT of tests, it's kind of excessive.

5. By default HTML Purifier does use a file system cache to speed things up (the performance win is nontrivial). However, as mentioned by heddn, what cache is used is pluggable and HTMLPurifier_DefinitionCache_Drupal stores hooks to make this information stored in Drupal's built-in cache.

6. HTML Purifier's documentation doesn't state how to use Composer because I don't use Composer, and the patches adding support for it were contributed by the community. I'd be happy to accept docpatches with more information for how to use Composer (but I assume it's pretty similar across projects?)

7. HTML Purifier is written in an old fashioned PHP 4 style, and doesn't put any code in a namespace (but prefixes all classes with HTMLPurifier_). It would be helpful to know what specific namespace support you need so we can add it; the main reason namespaces haven't been added is inertia, and also desire to avoid breaking backwards-compatibility.

filijonka’s picture

Status: Needs work » Needs review
FileSize
9.97 KB

hi

ok been reading and listening to what u all are saying and my personal view is that perhaps htmlpurifier isn't the right choice for us but, 1. I might be wrong 2. I probably need to read more about it and test it properly, so more input is appreciated.

So by that said this first patch is done with html5-php and it will need more work (no tests i.e) but 1. I wanted it to move along a bit and 2. needed to see what testbot says about it.

note: this patch has nothing to do with previous patches done so no interdiff.

Status: Needs review » Needs work

The last submitted patch, 76: 1333730-76.patch, failed testing.

filijonka’s picture

dooh sorry brain wasn't connected I guess or something, corrected patch will come soon

filijonka’s picture

Status: Needs work » Needs review
FileSize
424.92 KB

added the vendor to the patch..

Status: Needs review » Needs work

The last submitted patch, 79: 1333730-79.patch, failed testing.

filijonka’s picture

FYI all fails are the same type but right now whatever I test on head locally nothing works..so kinda hard to debug

filijonka’s picture

Status: Needs work » Needs review
FileSize
424.62 KB
2.16 MB

I doubt this will be green but it has solved a lot, changing the the way of loading the input to a DOM tree.

It's also a reroll that went straith on wit hno conflicts (mostly composer.*)

Status: Needs review » Needs work

The last submitted patch, 82: 1333730-82.patch, failed testing.

filijonka’s picture

hmm wonder if that interdiff gone haywild or something..

filijonka’s picture

FileSize
5.34 KB

redid the interdiff only.

filijonka’s picture

ok I've been debugging this to my death..

In FilterApiTest we buil up an element, starting with $build, calling it with drupal_render_root.

Via that function we end up in function preRendertext (Drupal\Filter\Element\ProcessedText), this runs through all filters, one of these are FilterTestPostRenderCache (which comes before FilterHtmlCorrector) and in this process there's a placeholder added to the html with tag <drupal-render-cache.....>.

The process for FilterHtmlCorrector is then called and when saving the dom to html the the parser interepret the tag <drupal-render-...> as <drupal ....>.

So when coming to renderDynamicThing (Drupal\filter_test\Plugin\Filter\FilterTestPostRenderCache) the string replacing simply doesn't return a correct html and hence the test $this->assertEqual($expected_markup, $build['#markup'],....) simply doesn't return true any longer.

ok how to solve this?
I'm not sure why we are doing this really, perhaps we can add our own rules? so we can allow this pattern but can we do that only in these special cases or does it actually mean that users could do that too?

[Edited to show the parts of the comment that were filtered out. (April 20, 2024)]

Wim Leers’s picture

#86: a few parts from your comment seem to be missing because you didn't wrap HTML tags in <code> tags like this: <code><p>like this</p>. Can y ou please edit your comment so we can read the full contents? :)

filijonka’s picture

I didn't actually write any html tags or anything in the post. it's just a description of the problem.

I could obviously copy and paste the exact code that I'm mentioning.

Wim Leers’s picture

But it says e.g.

[…] placeholder added to the html with tag .

and

[…] dom to html the the parser interepret the tag as

Surely there are HTML tags there that I can't read?

chx’s picture

Quote from IRC when I talked to @filijonka about this, not sure why he didn't post it here:

When drupal_render_root gets a text <p>hello world!</p> we expect it to get back <p>Hello, world!</p><p>This is a dynamic llama.</p>; while we get back <p>Hello, world!</p><p><drupal-render-cache-placeholder callback="\Drupal\filter_test\Plugin\Filter\FilterTestPostRenderCache::renderDynamicThing.

Wim Leers’s picture

Assigned: filijonka » Wim Leers

#90: that helps a lot, thank you.

Now also reproduced. #86 can be summarized as: FilterAPITest::testProcessedTextElement() fails when #82 is applied.

I'm currently debugging.

Wim Leers’s picture

Assigned: Wim Leers » Unassigned

The problem is quite simple, really. If I add debug() statements to FilterHtmlCorrector, before and after it does its thing:

  public function process($text, $langcode) {
    debug($text);
    $html5 = new HTML5();
    $html_dom = $html5->loadHTMLFragment($text);
    $text = $html5->saveHTML($html_dom);
    debug($text);
    return new FilterProcessResult($text);
  }

then I see this:

'<p>Hello, world!</p><p><drupal-render-cache-placeholder callback="\\Drupal\\filter_test\\Plugin\\Filter\\FilterTestPostRenderCache::renderDynamicThing" token="IsryQfDME-ADk1VMEqLl3v72VUYdbVxBM-SDkh2INIsqPsF4o5TZH7b0p5HHqGPwA0HH5YpQLQ"></drupal-render-cache-placeholder></p>'

and

'<p>Hello, world!</p><p><drupal callback="\\Drupal\\filter_test\\Plugin\\Filter\\FilterTestPostRenderCache::renderDynamicThing" token="IsryQfDME-ADk1VMEqLl3v72VUYdbVxBM-SDkh2INIsqPsF4o5TZH7b0p5HHqGPwA0HH5YpQLQ"></drupal></p>'

(Note how <drupal-render-cache-placeholder … was changed to <drupal …!)

In other words: either HTML5::loadHTMLFragment() or HTML5::saveHTML() break elements with dashes in them.

And AFAIK that's valid HTML5. Worse, still, the spec even states that custom elements (think web components) must have dashes in their name, and that standard elements must not.

Conclusion: masterminds/html5 also doesn't support the spec…

chx’s picture

http://w3c.github.io/webcomponents/spec/custom/

The custom element type identifies a custom element interface and is a sequence of characters that must match the NCName production, must contain a U+002D HYPHEN-MINUS character, and must not contain any uppercase ASCII letters.

Just helping defining what is a "dash".

chx’s picture

However, http://www.w3.org/TR/html5/references.html#references [CUSTOM] is non-normative. I have no idea what to do. This is a superset of the HTML5 standard so masterminds/html5 supports the HTML5 spec but not Web Components. Which should be supported? Perhaps there perhaps in a fork. No idea what to do.

filijonka’s picture

Hi

1. I do apologize to @Wim Leers cause what you concluded in your debug was something I already knew I should have written a better description BUT it also good to have it confirmed from another source.

2. This isn't a tag we actaully use so perhaps a renaming would work in this case BUT should users be able to add there own webcomponents then masterminds/html5 needs to be able to parse those elements to.

Wim Leers’s picture

This isn't a tag we actaully use so perhaps a renaming would work in this case

Indeed.

BUT should users be able to add there own webcomponents then masterminds/html5 needs to be able to parse those elements to.

Exactly!

So: #sadllama :(

filijonka’s picture

ok so this is my plan for now:

1. see if we can easily rename the drupal-render-... tag to drupalrender and see the effects of that

2. See what the plans are from the masterminds to support webcomponents? Did we have any of the masterminds developer in this thread than please respond to this :)

Wim Leers’s picture

Agreed with the plan.

Opened an issue against masterminds/html5: https://github.com/Masterminds/html5-php/issues/65.

filijonka’s picture

Assigned: Unassigned » filijonka
catch’s picture

At this point I think I'd be OK with a patch here that doesn't break render cache placeholders (i.e. via rename of the tag), but also doesn't support web components yet.

If the upstream issue gets positive feedback, we can have a separate critical or major issue to update to the version that adds support?

Wim Leers’s picture

#100: agreed on all aspects.

filijonka’s picture

@catch read my mind. I was thinking on this on the way home thinking that I should ask if we would go on and ignore the support for webcomponents at the moment.

I'm also already thinking a little bit ahead of what to do with exisiting calls for load, serialize and normalize but when I began to write realising that I need to rethink that so I can write it understandable (hopefully).

/* edit: corrected the english, sorry */

filijonka’s picture

did a test to check if html5-lib do support webcomponents,

$html = <<< 'HERE'
  <!doctype html>
  <html>
  <head>
    <title>TEST</title>
  </head>
  <body id='foo'>
    <drupal-renderer>This is a test of the HTML5 parser.</drupal-renderer>
  </body>
  </html>
HERE;

$html5 = new HTML5();
$dom = $html5->loadHTML($html);
print $html5->saveHTML($dom);

print "<br><br>";

$dom = $html5->loadHTMLFragment('<drupal-renderer>This is a test of the HTML5 parser.</drupal-renderer>');
print $html5->saveHTML($dom);

this is the output codewise:

<!DOCTYPE html>
<html>
<head>
<title>TEST</title>
</head>
<body id="foo">
<drupal>This is a test of the HTML5 parser. </drupal>
<br>
<br>
<drupal>This is a test of the HTML5 parser.</drupal>
</body>
</html>

I tried to add this the issue that @Wim Leers opened at masterminds but it ignored my code-tags and since I've never posted there I've no idea how to do that correctly. So added it here instead and also perhaps someone can confirm this.

Wim Leers’s picture

@filijonka: GitHub uses (Github-flavored) Markdown, not HTML. Over there, you can post code like this:

```
your code here, between those two lines of triple backticks
```
filijonka’s picture

hi

looks like support for web components could be included in v. 2.0.2 per https://github.com/Masterminds/html5-php/pull/66

So I'm waiting a lil with a new patch.

Wim Leers’s picture

Awesome! :) Super simple pull request too!

filijonka’s picture

just some thoughts, perhaps i'm overthinking this but began to think on the existing functions HTML::serialize, normalize and load.

I just took from some test file this:

$f = Html::normalize('<p>text');

In the future I guess we should deprecate these three functions so above code would be something like this:

$html5 = new HTML5();
$html_dom = $html5->loadHTMLFragment($text);
$html = $html5->saveHTML($html_dom);

Which is basicly the code we now have in FilterHTMLCorrector.

Perhaps this is not the best approach, perhaps it would be better to actually rewrite the code in the static functions in \Drupal\Component\Utility\Html. That would make it easier for the testwriting.

chx’s picture

@filijonka we can deprecate but we can't remove because beta. API is not set in stone but needs a very good reason to break. And I do not think there's a real good one here.

Wim Leers’s picture

#108++

filijonka’s picture

hmm so updating the code in i.e HTML::normalize would not be ok but updating the code in process() in class FilterHTMLCorrector is ok? I fail to see the difference actually.

No changes need to written about the descriptions in the API for these functions excepts that it works for HTML5 also and not only (X)HTML.

Wim Leers’s picture

No, chx and I are saying it's okay to change implementation details, but we should keep the public API unchanged.

filijonka’s picture

hi

maybe I was unclear but this was actually two suggestions,

one was to keep the code as in current patch and deprecate the functions in HTML OR

keep the code for process as in head and rewrite the functions in HTML.

chx’s picture

the upstream PR has been merged.

Wim Leers’s picture

We don't want to break public APIs unless for a very good reason. Here we don't have a very good reason. So:

keep the code for process as in head and rewrite the functions in HTML.

is highly preferred. Where "HTML" is \Drupal\Component\Utility\Html.

Mile23’s picture

filijonka’s picture

hi

just begun with some testing to rewrite the functions in the HTML class and I think we actually can't do that

example:

Let's say we have made the changes in load so it can parse a html5 and return a correct DOM.

we then have tests like this:

    $dom = Html::load($cached_element['#markup']);
    $xpath = new \DOMXPath($dom);
    $parent = $xpath->query('//details[@class="form-wrapper" and @open="open"]/summary[@role="button" and @aria-expanded and text()="Parent"]')->length;

but since PHP DOM can't handle html5, hence the DOMXPath can't handle a html5 dom. So in this case the xpath variable isn't the same as if the it was xhtml hence the parent variable won't be correct.

I'm I thinking correct?

twistor’s picture

html5-php builds a real DOMDocument.

If you're seeing problems with XPath, it's probably because html elements are in another namespace. Specifically, http://www.w3.org/1999/xhtml.

filijonka’s picture

Status: Needs work » Needs review
FileSize
424.43 KB
5.29 KB

So this patch is only a correction of patch in #82 in the sense of changed the masterminds/html5 to dev-master version so the latest fix from them are included.

So hopefully this will be green

Wim Leers’s picture

Thanks!

+++ b/core/modules/filter/filter.module
@@ -457,13 +457,12 @@
-    $html5 = new HTML5();
-    $html_dom = $html5->loadHTML($text);
+    $html_dom = Html::load($text);

Good.

But the actual patch still says:

-    $html_dom = Html::load($text);
+    $html5 = new HTML5();
+    $html_dom = $html5->loadHTML($text);

So… wrong patch? :)

filijonka’s picture

ehm the diff seems to be all f* up..why are the lines 457 in this diff?

the patch is correct, going to double check this

filijonka’s picture

redid my branches and hence the patch and diff just for the being on the safe side.

sorry for this but better safe than sorry.

filijonka’s picture

FileSize
10.65 KB

gaaah how i can ul wrong diff again..sorry. the patch was ul correctly..here's the diff..

Wim Leers’s picture

Status: Needs review » Needs work

Green patch, yay! :)

+++ b/core/modules/filter/filter.module
@@ -489,12 +490,13 @@ function _filter_html($text, $filter) {
-    $html_dom = Html::load($text);
+    $html5 = new HTML5();
+    $html_dom = $html5->loadHTML($text);

+++ b/core/modules/filter/src/Plugin/Filter/FilterHtmlCorrector.php
@@ -27,7 +28,10 @@ class FilterHtmlCorrector extends FilterBase {
-    return new FilterProcessResult(Html::normalize($text));
+    $html5 = new HTML5();

Why make these changes, why use the HTML5 library directly? We should update our \Drupal\Component\Utility\Html to use this instead. That's what chx and I have been saying since #110 through #114.

filijonka’s picture

@Wim Leers hey read my comment on #118

next step is to find out a way to actually do what we have said in #110-114. And read my comment on #116 on the problems I encountered when trying to do that. But got some ideas actually that I'm going to test.

Does current patch results in any API changes or in any way break our beta rules?

chx’s picture

So I reviewed this (a review only patch is attached) and apparently it addresses the filter system but doesn't address the testing system. Just observing.

filijonka’s picture

elaborate pls?

chx’s picture

I do not have a clear idea on how to continue; just wanted to make it easy for others to grasp where we are at. Sorry for not making this clear, your progress is great.

Wim Leers’s picture

Status: Needs work » Needs review
FileSize
443.38 KB
22.86 KB

At this point, it's more productive for me to just make the changes I think to be necessary rather than continue to repeat them. So I made the changes I asked for in #123 and many comments before that. I also updated WebTestBase.

Many of the tests in FilterUnitTest basically assume XHTML-like parser behavior. But XHTML is the past, HTML5 is the present and the future. Switching to \MasterMinds\HTML5 in favor of \DOMDocument does mean that we'll switch to HTML5-like parser behavior. The most noticable consequence of that is how self-closing tags are omnipresent in XHTML and contrastingly absent in HTML5. See http://tiffanybbrown.com/2011/03/23/html5-does-not-allow-self-closing-tags/ for a clear explanation

I tested with FilterUnitTest and FrameworkTest, those pass; let's see if other tests pass or not.

Status: Needs review » Needs work

The last submitted patch, 128: 1333730-128.patch, failed testing.

filijonka’s picture

Assigned: filijonka » Unassigned

well personally I don't give a fuck about what is productive for you since I'm the one that got this on my shoulders and instead of just marching in you can talk with me. Since we already in #116 noticed some problems with the codeshanges we need to figure that out too. And since I'm working on this trying to find solutions for that and also been writing the patch for it you just made my day totally unproductive.
I thought that was the idea of assigning that people will udnerstand that someone is working on this. I'm unassigning myself and let anyone who wants do this do so.

Wim Leers’s picture

I'm sorry to have uploaded a patch in the mean time. :( But after several miscommunications, I thought this would be clearer for everybody. I was concerned we'd continue to walk in circles. I'm very sorry if you were working to do exactly the same!

But … quite often it's actually relatively productive for two people to have done the exact same work. Because it means you're in a great position to review the changes I'm proposing. It makes it easy for you to spot problems in my patch. In other words, where you did things differently, why did you?

Please consider posting a review based on your experience/the work you were doing in parallel.

In any case, my apologies :(

pfrenssen’s picture

@filijonka, please don't be offended, and don't consider your work to be in vain. These critical issues tend to attract a lot of attention, and it is not unusual for people to work on it in parallel. @WimLeers is also very invested in this issue and you can see from his comments here that his only intention is to help solving this important issue.

Your work here has been invaluable, it would be a great loss if you are put off working on this further.

filijonka’s picture

This code won't work.

    $dom = Html::load($cached_element['#markup']);
    $xpath = new \DOMXPath($dom);
    $parent = $xpath->query('//details[@class="form-wrapper" and @open="open"]/summary[@role="button" and @aria-expanded and text()="Parent"]')->length;

Best way to solve this is to use Symfony/dom-crawler

and change it to this:

    use Symfony\Component\DomCrawler\Crawler;

    $crawler = new Crawler($cached_element['#markup']);
    $parent = count($crawler->filterXPath('//details[@class="form-wrapper" and @open="open"]/summary[@role="button" and @aria-expanded and text()="Parent"]'));

The dom-crawler is if I understand it correctly also considered in issue #2232861: Create BrowserTestBase for web-testing on top of Mink

twistor’s picture

If you want to use the Symfony dom-crawler, you might as well use a regular DOMDocument, that's what it uses internally.

To use the HTML5 Dom, and XPath, you have to do this.

$dom = Html::load($cached_element['#markup']);
$xpath = new \DOMXPath($dom);
$xpath->registerNamespace('html', 'http://www.w3.org/1999/xhtml');
$parent = $xpath->query('//html:details[@class="form-wrapper" and @open="open"]/html:summary[@role="button" and @aria-expanded and text()="Parent"]')->length;
twistor’s picture

I have some *very* old, but well tested, code that could be updated to add the namespaces automatically to the XPath queries.

http://cgit.drupalcode.org/feeds_xpathparser/tree/FeedsXPathParserQueryP...

Wim Leers’s picture

Can you explain why that won't work? XPath logic works fine in WebTestBase, so why can't it work correctly there?

twistor’s picture

My guess would be that Drupal\simpletest\AssertContentTrait::parse() is still using a regular \DOMDocument.

twistor’s picture

Assigned: Unassigned » twistor

I'm just going to go ahead and run with this for a bit.

Wim Leers’s picture

Thanks! :)

twistor’s picture

Status: Needs work » Needs review
FileSize
469.84 KB
27.42 KB

I stuck the XPathExpressionParser in simpletest for now, even though it doesn't belong there. Not sure what to do with it yet.

Let's see what explodes.

Status: Needs review » Needs work

The last submitted patch, 140: 1333730-140.patch, failed testing.

twistor’s picture

FileSize
1.02 KB
470.63 KB

So, the HTML5 parser treats tags inside <noscript> elements as plain text. Not sure if that's correct, don't have time to go trawling through the spec at the moment.

twistor’s picture

Status: Needs work » Needs review

Status: Needs review » Needs work

The last submitted patch, 142: 1333730-142.patch, failed testing.

filijonka’s picture

two question of things I don't understand:

in #123 @Wim Leers writes

why use the HTML5 library directly? We should update our \Drupal\Component\Utility\Html to use this instead.

which was done in #128
This latest patch is not following that as far as I can see.

why not use Symfony/dom-crawler which is well tested and working? It will also be introduced by another patch as I wrote earlier.

andypost’s picture

twistor’s picture

I don't see what using dom-crawler has to do with anything. It would just be a slightly different version of what we have now. It uses the standard DOMDocument internally, so doesn't buy us anything besides maybe a nicer API.

What I'm trying to do in this patch is convert direct usage of \DOMDocument to HTML5 where it makes sense. That includes \Drupal\Component\Utility\Html, as-well-as parts of the testing framework.

My goal is to switch DOMDocument usage with HTML5 usage while changing the tests as little as possible.

filijonka’s picture

ok I'm just trying to understand here;

In my comment #133 I presented a problem with DomXpath with a possible solution using the symfony dom-crawler.

In #134 you presented another solution (which I never got to work personally but might have not done correct).

But from there I got lost, the introduction of XPathExpressionParser? I had the understanding that this was a solution the problem I presented?

I also asked why there are calls directly to html5 like this outside the Html class

$html5 = new HTML5();
$dom = $html5->loadHTML($this->getRawContent());

Trying to understand to be able to give better feedback, that's all.

twistor’s picture

Status: Needs work » Needs review
FileSize
8.49 KB
478.26 KB

Some of there failures are because we produce invalid markup in forms and the HTML5 parser fails to add all of the input elements.

For example, when setting #field_prefix to '<div class="container-inline">' it gets wrapped in a span tag and things fail to parse.

twistor’s picture

In my comment #133 I presented a problem with DomXpath with a possible solution using the symfony dom-crawler.

dom-crawler is not a solution. Someone, somewhere, has to do XPath via the DOM. Whether that's using dom-crawler, or not, the XPath queries have to be namespaced correctly. dom-crawler doesn't do anything different than what SimpleTest already does. They solve a similar problem. It would be nice in the future to use dom-crawler, sure, but that won't help us now.

In #134 you presented another solution (which I never got to work personally but might have not done correct).

It works, as of #142 the render tests are passing.

But from there I got lost, the introduction of XPathExpressionParser? I had the understanding that this was a solution the problem I presented?

The XPathExpressionParser needs a better name, possibly Namespacer, but it's just a way to automatically namespace an xpath query, so that we don't have to update every single one, and makes writing XPath queries significantly easier. It converts //ul/li into //x:ul//x:li.

Masterminds\HTML5 namespaces all HTML elements in the http://www.w3.org/1999/xhtml namespace. This means that to use XPath on it, you have to register the namespace with DOMXPath::registerNamespace(prefix, namespace), and then prefix every element with the same prefix used in the registerNamespace() call. /x:div/x:a[@href = something]

I also asked why there are calls directly to html5 like this outside the Html class

Drupal\Component\Utility\Html provides helpers for dealing with HTML. It needed to be updated as the API needs to be maintained. It was never suggested that that should be the only place masterminds/html5 should ever be used, simply that it needs to be used there. If you look at the patch, an existing API is never changed for direct usage of HTML5. But, most direct usage of the DOMDocument is.

Status: Needs review » Needs work

The last submitted patch, 149: 1333730-148.patch, failed testing.

twistor’s picture

Another comment about Drupal\Component\Utility\Html.

The only thing that would be needed for simpletest would be Drupal\Component\Utility\Html::load(). This won't actually work at the moment, since we already have a fully formed html document and the load() call assumes the load is being called with the body text.

Regarding the XPathExpressionParser, don't focus on that too much right now. I have an updated version that easier to follow and debug, but I'm just using it at the moment to get tests to pass. It will be simple to remove, update, whatever once things get settled.

twistor’s picture

Status: Needs work » Needs review
FileSize
5.25 KB
482.04 KB

The commented out #field_prefix's are just to get the tests to pass, I haven't figured out the correct fix yet.

Status: Needs review » Needs work

The last submitted patch, 153: 1333730-153.patch, failed testing.

twistor’s picture

Status: Needs work » Needs review
FileSize
641 bytes
482.67 KB

This is an interesting one. $this->entity->label() is returning a TranslationWrapper. The wrapper isn't being cast to a string, but rather, being set as attributes on the input element. This was happening before the patch. The reason the failures are happening now is that the HTML5 parser is strict about attributes it will accept.

I'm not sure where the actual fix should go, the current patch just illustrates what needs to happen. Perhaps in FormBuilder?

amateescu’s picture

That sounds really weird. Can you post a backtrace for that label() call? In any case, we might want to do the string casting in \Drupal\Core\Entity\Entity::label() (i.e. $label = (string) $this->{$label_key};).

andypost’s picture

@twistor there css-select in core https://www.drupal.org/node/2276689 is it possible to use it for xpath?

twistor’s picture

@andypost, Unfortunately, no. Symfony\Component\CssSelector just translates CSS selectors to XPath expressions. So the same problem exists, just a little lower.

We're hiding the X-Path complexity for the most part by doing it AssertContentTrait::xpath(). That actually takes care of AssertContentTrait::cssSelect() automatically. The edge cases are where tests are using the DOM + DOMXPath directly.

For some reason, FieldRdfaTestBase and FieldWebTest are re-implementing parts of AssertContentTrait, so that's adding some noise to the patch.

twistor’s picture

FileSize
35.12 KB
490.08 KB

Cleaned up the XPath thingy. I know everybody hates it, but I'm not sure what the alternative is.

Status: Needs review » Needs work

The last submitted patch, 159: 1333730-159.patch, failed testing.

twistor’s picture

Status: Needs work » Needs review
FileSize
1.02 KB
490.25 KB

Status: Needs review » Needs work

The last submitted patch, 161: 1333730-160.patch, failed testing.

twistor’s picture

Status: Needs work » Needs review
FileSize
1.78 KB
490.72 KB
mgifford’s picture

The bot likes it well enough. I can't seem to apply it here though http://simplytest.me/project/drupal/8.0.x?patch[]=https://www.drupal.org...

Not sure what I could be missing. maybe it's something just weird with SimplyTest.me at the moment.

Thanks for pushing ahead on this @twistor

What are the steps for testing this before marking it RTBC?

daffie’s picture

Status: Needs review » Needs work
Issue tags: +Needs reroll

My first review. It looks good. The patch also needs a reroll. And can you add a "do-not-test" patch without the vendor stuff. Dreditor does not like very large patches. Especially on an old computer like mine.

+++ b/core/lib/Drupal/Component/Utility/Html.php
@@ -314,8 +315,9 @@ public static function serialize(\DOMDocument $document) {
+      $html .= $html5->saveHTML($node, ['encode_entities' => FALSE]);

The options parameter encode_entities = FALSE is not necessary. It defaults to false.

+++ b/core/lib/Drupal/Component/XpathHelper/Lexer.php
@@ -0,0 +1,240 @@
+   * @return []string

Can we change this to @return array. And on the next line explain it so more.

+++ b/core/lib/Drupal/Component/XpathHelper/Namespacer.php
@@ -0,0 +1,354 @@
+  public function __construct($expression, $prefix, Lexer $lexer) {

Please add "use Drupal\Component\XpathHelper\Lexer" at the top of the page. It is probably technically not necessary, but it is very helpful for other programmers.

+++ b/core/modules/filter/src/Tests/FilterUnitTest.php
@@ -834,37 +870,29 @@ function testHtmlCorrectorFilter() {
-    $f = Html::normalize('line1<hr>line2');
-    $this->assertEqual($f, 'line1<hr />line2', 'HTML corrector -- Automatically close single tags.');
-
-    $f = Html::normalize('line1<HR>line2');
-    $this->assertEqual($f, 'line1<hr />line2', 'HTML corrector -- Automatically close single tags.');
-
-    $f = Html::normalize('<img src="http://example.com/test.jpg">test</img>');
-    $this->assertEqual($f, '<img src="http://example.com/test.jpg" />test', 'HTML corrector -- Automatically close single tags.');

Reverse the tests from: 'line1


line2' to: 'line1

line2'. And so forth.
+++ b/core/modules/filter/src/Tests/FilterUnitTest.php
@@ -834,37 +870,29 @@ function testHtmlCorrectorFilter() {
+    // @todo Blocked on https://github.com/Masterminds/html5-php/issues/68

We need to make a followup issue for this.

+++ b/core/modules/system/tests/modules/form_test/src/Form/FormTestCheckboxesZeroForm.php
@@ -56,7 +56,7 @@ public function buildForm(array $form, FormStateInterface $form_state, $json = T
-    if ($form_state->has('json')) {
+    if ($form_state->get('json')) {

Why this change?

+++ b/core/tests/Drupal/Tests/Component/XpathHelper/LexerTest.php
@@ -0,0 +1,45 @@
+ * @group Namespacer

+++ b/core/tests/Drupal/Tests/Component/XpathHelper/NamespacerTest.php
@@ -0,0 +1,100 @@
+ * @group Namespacer

Should this not be @group XpathHelper.

+++ b/core/tests/Drupal/Tests/Component/XpathHelper/LexerTest.php
@@ -0,0 +1,45 @@
+  public function test($input, $output) {

Can we rename this function to testLexer($expected, $input).

+++ b/core/tests/Drupal/Tests/Component/XpathHelper/LexerTest.php
@@ -0,0 +1,45 @@
+  public function xpathProvider() {

+++ b/core/tests/Drupal/Tests/Component/XpathHelper/NamespacerTest.php
@@ -0,0 +1,100 @@
+  public function xpathProvider() {

Can you add a docblock for the xpathProvider function. The default is to name the function providerSomething. My suggestion would be to name this function providerLexer() and providerNamespacer(). An other default is to have the first parameter be $expected. That is the reverse from what it is now.

+++ b/core/tests/Drupal/Tests/Component/XpathHelper/NamespacerTest.php
@@ -0,0 +1,100 @@
+  public function test($input, $output) {

Can we rename this function to testNamespacer($expected, $input).

+++ b/core/tests/Drupal/Tests/Component/XpathHelper/NamespacerTest.php
@@ -0,0 +1,100 @@
+  public function testCache() {
+    $this->assertSame('x:a', Namespacer::prefix('a'));
+    $this->assertSame('x:a', Namespacer::prefix('a'));
+
+    $this->assertSame('*[local-name() = "a"]', Namespacer::localize('a'));
+    $this->assertSame('*[local-name() = "a"]', Namespacer::localize('a'));
+  }

Add a docblock for this function. And why twice the same test? What do you want to prove?

daffie’s picture

Issue summary: View changes
Issue tags: -Needs issue summary update

I am not completely sure what is the best naming for the "@group" in the LexerTest and NamespacerTest classes.

+++ b/core/lib/Drupal/Component/Utility/Html.php
@@ -272,8 +274,8 @@ public static function normalize($html) {
 <head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head>

Remove the space and the slash at the end of the meta tag.

+++ b/core/modules/filter/src/Tests/FilterUnitTest.php
@@ -834,37 +870,29 @@ function testHtmlCorrectorFilter() {
+    // @todo Blocked on https://github.com/Masterminds/html5-php/issues/68
+    //$f = Html::normalize('<p>line1<br/><hr/>line2</p>');
+    //$this->assertEqual($f, '<p>line1<br /></p><hr />line2', 'HTML corrector -- Move non-inline elements outside of inline containers.');

The issue is fixed in the html5 library. So this can uncommented. Maybe at some more tests for this.

babruix’s picture

Issue tags: -Needs reroll
FileSize
489.78 KB

This is re-roll of patch in #163. Next will try to add changes from comments #165-166.

babruix’s picture

Status: Needs work » Needs review

Should we wait for test results before adding next patch?

babruix’s picture

babruix’s picture

FileSize
18.81 KB

Status: Needs review » Needs work

The last submitted patch, 169: 1333730-169.patch, failed testing.

babruix’s picture

Status: Needs work » Needs review
FileSize
489.86 KB
1.08 KB
catch’s picture

Looks like that issue's been closed as fixed already?

joseph.olstad’s picture

@mgifford, I had the same issue with patch 163 however patch 167 will build with simplytest.me

@babruix thanks for the work on the patch

*edit*
I tested the patch, my test case passes with or without the patch (patch 167). I had pasted some source html markup into a new node including an openning <source> and <track> and closing </source> and </track> tags ,
in html5 these <source> and <track> tags are supposed to be void tags (no ending tag), in both tests I ran the end tag was properly removed during validation.

This was observed in both of my tests where I tested with the patch and I also re-tested without the patch. both tests gave me desired results.

so my test case probably won't help much but I did test it.

*edit*
Thanks for everyones great work on this issue. I do have some feedback to those that know how to create this patch and have generated these patches. For clarity purposes it would be handy to have a "do not test" patch similar to the one that @chx created in comment #125 and perhaps if someone was really nice maybe a simple readme of where to put the library and what version of the library is meant for the do-not-test patch (a version ? what hash value of html5-php from mastermind/github). I read through this entire issue (all 170+ comments) before posting this.

*edit* where to put the library: the masterminds html5-php library goes in core/vendor

babruix’s picture

@catch yes, but that test still fails, because

Html::normalize('<p>line1<br/><hr/>line2</p>')

becomes
<p>line1<br><hr>line2</p>
and we expect it to be
<p>line1<br/></p><hr/>line2

I have tried to change references to masterminds/html5 github latest commit, maybe I have missed something?

kim.pepper’s picture

Can we please have an do-not-test patch that excludes the library?

Status: Needs review » Needs work

The last submitted patch, 175: 1333730-175.patch, failed testing.

joseph.olstad’s picture

@babruix I added some details to the github queue for the html5-php masterminds library developers (the aforementioned github issue looks to be relevant however it was closed a while back)

From what I've read about the html5 masterminds library I'd expect this output from your test:
<p>line1<br></p><hr>line2 (this is how chrome/firefox do it)

rather than:
<p>line1<br/></p><hr/>line2 (what the current test case assertion expects)

this is because masterminds html5-php library is mimicking firefox/chrome interpretation of html5 but in this case below the actual output is not as expected

what we're currently seeing from html5-php masterminds (fail):
<p>line1<br><hr>line2</p>

It looks like there's still quite a bit of activity in the github queue for the html5-php masterminds library.

This is probably a good enough reason for a no-test-patch and a readme so that others can continue to quickly configure this with the latest pull of the library from gitlab. chx made a no-test-patch on comment #125 but there's no readme on using it .

joseph.olstad’s picture

@babruix
I did some followup with the github community for html5-php masterminds

@Technosophos, insists that the drupal unit test in question needs to be verified with the w3.org spec http://www.w3.org/TR/2014/REC-html5-20141028/ with regards to the unit test case we're concerned with

What the html5-php developers are telling us is before they make a drupal specific exception for html5 validation because we think that the unit test should pass but does not then we should back it up with a reference to the html5 spec to say that our unit test is indeed following spec. It could be that chrome/firefox is not following spec in this case. We should verify the spec, make adjustments to the unit test (if necessary) or if necessary follow up with the html5-php masterminds issue queue with a reference to the w3c html5 spec .

Translation: They're asking us to first provide a specific reference to the html5 spec that justifies the unit assertion test https://qa.drupal.org/pifr/test/941363 prior to making a drupal specific exception in the library.

So in question is the unit test for "HTML corrector -- Move non-inline elements outside of inline containers." , is it valid html5 to w3c spec or not?

babruix’s picture

Status: Needs work » Needs review
FileSize
489.86 KB
1.08 KB

Changed test to expect <p>line1<br></p><hr>line2 and commented it out for now.

Added "-do-not-test" in filename. Not sure which readme is needed, I assumed it is only to skip testing by testbot?

Regarding HTML5 specs, I haven`t found what should be normalised behaviour for <p>line1<br/><hr/>line2</p>.
Seems we are expecting that <hr> tag will be moved out of <p> tag after normalize,
however test message says "Move non-inline elements outside of inline containers."
As paragraph is a block level element, should we change message or the test itself is incorrect?

joseph.olstad’s picture

Hi @babruix, just fyi: the do-not-test patch from #180 will not patch using simplytest.me

I did some additional tests with simplytest.me and patch 175 no longer applies to 8.0.x dev either
and then I decided to see if there is a problem with simplytest.me or 8.0.x dev, and re-tested patch 167 with simplytest.me and it ALSO no longer will build. What is curious is that yesterday morning I WAS able to apply patch 167 successfully. SO I am wondering if a recent change to the 8.0.x dev branch is affecting things or else perhaps a problem with simplytest.me?

http://simplytest.me/project/drupal/8.0.x?patch[]=https://www.drupal.org/files/issues/1333730-180-do-not-test.patch

"An error occurred while patching the project."

http://simplytest.me/project/drupal/8.0.x?patch[]=https://www.drupal.org/files/issues/1333730-175.patch

"An error occurred while patching the project."

http://simplytest.me/project/drupal/8.0.x?patch[]=https://www.drupal.org/files/issues/1333730-167.patch

"An error occurred while patching the project."

So if I understand correctly if someone wanted to test this patch with a different checkout/hashtag/version of the masterminds vendor code all they'd have to do is apply the patch, say patch 175 (which is applying) to the latest drupal 8 core dev branch of 8.0.x and then remove the /core/vendor/masterminds files that is inserted by the patch and then replace that /core/vendor/masterminds with the desired version of masterminds/html5 that they want, place it in /core/vendor and then either make their own patch or run their own tests.

I also grabbed the latest drupal 8.0.x core dev tried applying patch 175 and it will no longer apply. Here's the output from git apply:

<stdin>:1926: trailing whitespace.
php:
<stdin>:1961: trailing whitespace.
KITAITI Makoto [KitaitiMakoto] <KitaitiMakoto@example.com> (contributor)
<stdin>:1977: trailing whitespace.
this software and associated documentation files (the "Software"), to deal in
<stdin>:1978: trailing whitespace.
the Software without restriction, including without limitation the rights to
<stdin>:2019: trailing whitespace.
this software and associated documentation files (the "Software"), to deal in
error: le patch a échoué : composer.json:28
error: composer.json : le patch ne s'applique pas
error: composer.lock : Aucun fichier ou dossier de ce type

translation:
the patch failed : composer.json:28
composer.json :the patch won't apply
composer.lock :no file or folder of this type

Thanks again @babruix for your hard work on this issue.

babruix’s picture

I have rerolled patch and changed HTML5 lib code to the latest release. Now test passes on my local env.

Should we have separate patch with "do-not-test" in name that excludes the external library?
Or we should always have "do-not-test" in patch name for this issue and test manually?

babruix’s picture

FileSize
700.24 KB

Added do-not-test patch that excludes the html5 library.

babruix’s picture

joseph.olstad’s picture

Hi @babruix, I reset my local clone of drupal core dev 8.0.x , did a git pull
then applied your patch 182, it successfully applied (with whitespace warnings on a few lines of masterminds code)

so the corresponding checkout of drupal core 8.0.x dev would be:
1e08b50f421bd (Fri Jan 9 15:14:23 2015 +0000)

patch 182 is tested to build on that checkout.

We'll see if we have a winner. Seeing as how fast dev core is moving it'd be nice to get the patch moving quickly , it might seem like trying to get airspace at the kennedy airport.

babruix’s picture

Here that test is uncommented finally and library correctly updated.
Passes on local.

daffie’s picture

Good work @babruix!

  1. +++ b/core/tests/Drupal/Tests/Component/XpathHelper/NamespacerTest.php
    @@ -0,0 +1,100 @@
    +  public function testCache() {
    +    $this->assertSame('x:a', Namespacer::prefix('a'));
    +
    +    $this->assertSame('*[local-name() = "a"]', Namespacer::localize('a'));
    +  }
    

    If I look at the function name then it suggest that it tests the value of the static $cache variable. But in the function there is no testing for that. The function also needs a docblock.

  2. +++ b/core/tests/Drupal/Tests/Component/XpathHelper/LexerTest.php
    @@ -0,0 +1,47 @@
    +   * @param $expected
    +   * @param $input
    

    Move this to the docblock of providerLexer().

  3. +++ b/core/tests/Drupal/Tests/Component/XpathHelper/NamespacerTest.php
    @@ -0,0 +1,100 @@
    +   * @param $expected
    +   * @param $input
    

    Move this to the docblock of providerNamespacer().

babruix’s picture

1. Which name would you suggest? How about testStaticCache? Because it returns from static value (if that exists): static::$cache[$prefix][$xpath]
2. and 3. Do not see the point to move docblock parameters to methods which have no arguments.

daffie’s picture

1. If we add a test function called testCache(), we should test the working of the class variable called $cache. An other possibility is to remove the function testCache().
2. and 3. For the provider-functions return an array with two elements ($expected and $input). Those need to be documented in a docblock. We do this at the provider-function because some provider-functions are used multiple times.

Mile23’s picture

Daffie asked me to look at the testCache() method.

+++ b/core/tests/Drupal/Tests/Component/XpathHelper/NamespacerTest.php
@@ -0,0 +1,100 @@
+  public function testCache() {
+    $this->assertSame('x:a', Namespacer::prefix('a'));
+
+    $this->assertSame('*[local-name() = "a"]', Namespacer::localize('a'));
+  }

This test method tests two methods, prefix() and localize().

Some analysis of these methods reveal that localize() is a special case of prefix(), and should really just call static::prefix($xpath, NULL);

But for the purposes of testing... Both prefix() and localize() have a dependency on a static cache variable, and the method parse(), which should definitely have unit test coverage.

In an ideal world, parse() would get refactored to accept parameters instead of using class properties, but forget that for now. You can use reflection and other techniques to poke in dependencies.

So to sum up:

  • Consider refactoring localize() to just call prefix() with NULL.
  • Make a testPrefix() with a @dataProvider that hits a few edge cases.
  • Make a testLocalize() with a @dataProvider if you didn't refactor (and maybe even if you did).
  • testPrefix() and testLocalize() replace testCache(), which should then go away.
  • Make a testParse().

Thanks.

Mile23’s picture

Assigned: twistor » Mile23

daffie asked me to take this on.

Mile23’s picture

Status: Needs work » Needs review
FileSize
494.76 KB
6.21 KB

OK, so some of my analysis was a little off, but here's something to chew on.

BTW: Anything static is automatically an anti-pattern. :-)

daffie’s picture

Status: Needs review » Reviewed & tested by the community
FileSize
494.89 KB
77.23 KB

I have made some minor changes to the patch. Which I have listed below. They are mostly documentation. I have tried to make an interdiff.txt file, but I did not succeed.

  1. My first change in the LexerTest class
    -   * @dataProvider providerLexer
    -   * @param $expected
    -   * @param $input
    -   */
    -  public function testLexer($expected, $input) {
    ---
    +   * @covers ::lex
    +   * @dataProvider providerLex
    +   */
    +  public function testLex($expected, $input) {
    

    Change the docblock and renamed the function from testLexer() to testLex().

  2. My second change in the LexerTest class
    -  public function providerLexer() {
    ---
    +  /**
    +   * Data provider for testLex().
    +   *
    +   * @return array
    +   *   - An xpath argument to lex().
    +   *   - Expected output from lex().
    +   */
    +  public function providerLex() {
    

    Added a the docblock and renamed the function from providerLexer() to providerLex().

  3. My first change in the NamespacerTest class
    -   * @covers localize
    ---
    +   * @covers ::localize
    
    -   * @covers prefix
    ---
    +   * @covers ::prefix
    
    -   * @covers parse
    ---
    +   * @covers ::parse
    

    These change are with the following function: testLocalize(), testPrefix() and testParse().

  4. My second change in the NamespacerTest class
    -      ->willReturn ($get_namespaced_element_value);
    ---
    +      ->willReturn($get_namespaced_element_value);
    

    This change is on line 179 in the function testParse().

  5. My third change in the NamespacerTest class is that I removed an empty line (on line 202) in the function testParse().

With the minor changes that I made it all looks good to me.
With the addition from @Mile23 the testing is completed. Thanks Mile23! :-)
All documentation is in order.
The changes is this patch are minor, so I have taken the liberty to give this patch a RTBC.

alexpott’s picture

Status: Reviewed & tested by the community » Needs work
Issue tags: +Needs change record

Looks like we need a change record.

daffie’s picture

Status: Needs work » Reviewed & tested by the community

Added a change record for this issue: [#2407077].

daffie’s picture

Issue tags: -Needs change record
Wim Leers’s picture

The change record is very misleading: With this Drupal 8 now supports HTML5.. This should mention that it's Drupal 8's filter system that now supports it.

daffie’s picture

@winleers: Made the requested change to the change record.

Wim Leers’s picture

Status: Reviewed & tested by the community » Needs work
  1. +++ b/core/lib/Drupal/Component/Utility/Html.php
    @@ -314,8 +315,9 @@ public static function serialize(\DOMDocument $document) {
    +      $html .= $html5->saveHTML($node);
    

    Unlike #165, I think it'd be better to explicitly list the 'encode_entities' => FALSE option, just to be on the safe side when upgrading in the future, and to document/convey that Drupal doesn't encode entities. But that's a matter of opinion, so it's fine to keep it as is.

    (Also: I'd swear the default was different when I worked on it!)

  2. +++ b/core/lib/Drupal/Component/Utility/Html.php
    --- /dev/null
    +++ b/core/lib/Drupal/Component/XpathHelper/Lexer.php
    
    +++ b/core/lib/Drupal/Component/XpathHelper/Lexer.php
    --- /dev/null
    +++ b/core/lib/Drupal/Component/XpathHelper/Namespacer.php
    

    This is a HUGE amount of code being added. Why do we need this?

  3. +++ b/core/modules/editor/editor.admin.inc
    @@ -92,8 +92,8 @@ function editor_image_upload_settings_form(Editor $editor) {
    -    '#field_prefix' => '<div class="container-inline clearfix">',
    -    '#field_suffix' => '</div>',
    +    // '#field_prefix' => '<div class="container-inline clearfix">',
    +    // '#field_suffix' => '</div>',
    
    +++ b/core/modules/image/src/Plugin/Field/FieldType/ImageItem.php
    @@ -206,8 +206,8 @@ public function fieldSettingsForm(array $form, FormStateInterface $form_state) {
    -      '#field_prefix' => '<div class="container-inline">',
    -      '#field_suffix' => '</div>',
    +      // '#field_prefix' => '<div class="container-inline">',
    +      // '#field_suffix' => '</div>',
    
    @@ -233,8 +233,8 @@ public function fieldSettingsForm(array $form, FormStateInterface $form_state) {
    -      '#field_prefix' => '<div class="container-inline">',
    -      '#field_suffix' => '</div>',
    +      // '#field_prefix' => '<div class="container-inline">',
    +      // '#field_suffix' => '</div>',
    

    Needs to be resolved before it can be committed.

  4. +++ b/core/modules/rdf/src/Tests/Field/FieldRdfaTestBase.php
    @@ -178,6 +179,14 @@ protected function parseContent($content) {
    +      // Register the default namespace if it exists.
    +      $namespaces = $elements->getDocNamespaces();
    +      if (!empty($namespaces[''])) {
    +        $xpath = $this->prefixXpath($xpath);
    +        $elements->registerXPathNamespace('x', $namespaces['']);
    +      }
    
    +++ b/core/modules/simpletest/src/AssertContentTrait.php
    @@ -219,6 +228,14 @@ protected function buildXPathQuery($xpath, array $args = array()) {
    +      // Register the default namespace if it exists.
    +      $namespaces = $this->elements->getDocNamespaces();
    +      if (!empty($namespaces[''])) {
    +        $xpath = $this->prefixXpath($xpath);
    +        $this->elements->registerXPathNamespace('x', $namespaces['']);
    +      }
    
    @@ -231,6 +248,32 @@ protected function xpath($xpath, array $arguments = array()) {
    +  protected function prefixXpath($xpath) {
    ...
    +  protected function localizeXpath($xpath) {
    
    +++ b/core/modules/simpletest/src/WebTestBase.php
    @@ -1875,12 +1876,11 @@ protected function drupalProcessAjaxResponse($content, array $ajax_response, arr
    -            $wrapperNode = $xpath->query('//' . $command['selector'])->item(0);
    +            $wrapperNode = $xpath->query($this->localizeXpath('//' . $command['selector']))->item(0);
    
    @@ -1930,14 +1930,14 @@ protected function drupalProcessAjaxResponse($content, array $ajax_response, arr
    -          $buildId = $xpath->query('//input[@name="form_build_id" and @value="' . $command['old'] . '"]')->item(0);
    +          $buildId = $xpath->query($this->localizeXpath('//input[@name="form_build_id" and @value="' . $command['old'] . '"]'))->item(0);
    

    So this is what that big amount of code earlier is for. But why exactly is this necessary?

  5. +++ b/core/modules/simpletest/src/AssertContentTrait.php
    @@ -122,15 +124,22 @@ protected function setDrupalSettings($settings) {
    +      // Check for XML preamble.
    +      if (substr($this->getRawContent(), 0, 5) === '<?xml') {
    +        $dom = new \DOMDocument();
    +        $dom->loadXML($this->getRawContent());
    +      }
    

    What are the use cases for this in Drupal core? Let's document them here.

  6. +++ b/core/modules/system/tests/modules/form_test/src/Form/FormTestCheckboxesZeroForm.php
    @@ -56,7 +56,7 @@ public function buildForm(array $form, FormStateInterface $form_state, $json = T
    -    if ($form_state->has('json')) {
    +    if ($form_state->get('json')) {
    

    This seems wrong.

Wim Leers’s picture

I forgot to say the two most important things: *thank you*, *incredible* progress has been made here! :)

daffie’s picture

Assigned: Mile23 » Unassigned

Thank you @Mile23 for adding some of the tests.

dawehner’s picture

+++ b/core/tests/Drupal/Tests/Component/XpathHelper/NamespacerTest.php
@@ -0,0 +1,204 @@
+  public function testParse($expected, $token_array, $should_copy, $get_namespaced_element) {

Given that amount of ifs in that test I would wonder whether it would be better to just write a couple of dedicated test methods with proper names instead.

cosmicdreams’s picture

Is it ok to update the change notice. I have a revision that I've worked up.

daffie’s picture

@cosmicdreams: give it your best :-)

cosmicdreams’s picture

Mile23’s picture

@dawehner Re: testParse(): Given that amount of ifs in that test I would wonder whether it would be better to just write a couple of dedicated test methods with proper names instead.

There's no good way to break it out and keep the expectations. You'd just end up with three different tests of escalating complexity, the third one being the same as testParse().

Mile23’s picture

Status: Needs work » Needs review
FileSize
498.57 KB
4.16 KB

I did a coverage report for Namespacer and Lexer, and here's a test that addresses the biggest risk, Namespacer::shouldCopy(), which has a CRAP score of 156. 12 with this patch.

Lexer::readToken() is the other big risk, with CRAP of 72 but I'm short on time.

dawehner’s picture

There's no good way to break it out and keep the expectations. You'd just end up with three different tests of escalating complexity, the third one being the same as testParse().

Okay, fair.

Wim Leers’s picture

Status: Needs review » Needs work

None of #199 has been addressed yet, NW for that.

daffie’s picture

@wimleers: I have tried to answer your questions as good as I can.

  1. #199.1 I think that you have a good argument to add the default parameter.
  2. #199.2 and #199.4 This code is added because the html5 library with the command "$xpath->query" can only work with namespaced code. The added classes changes the html5 code to html5 code with namespaces.
  3. #199.3 I do not know why this code is commented out. If this code with this patch generates an ugly user interface. We should create a new issue to fix it. I have removed this part from the patch and let the testbot have a go at it. I got a 21 fails.
  4. #199.5 The html5 library only supports html5 code and not xml code. That is why the if-statement is added.
  5. #199.6 I have know idea why that is in the patch. I have asked the same question and got no answer. I have removed this part from the patch and let the testbot have a go at it. I got a 4 exceptions.
joseph.olstad’s picture

@Wim Leers,
regarding comment #199,

item 1) I'll put your own comment in "a matter of opinion, so it's fine to keep it as is."
item 2) please see @twister comment #150

-    '#field_prefix' => '<div class="container-inline clearfix">',
-    '#field_suffix' => '</div>',
+    // '#field_prefix' => '<div class="container-inline clearfix">',
+    // '#field_suffix' => '</div>',

item 3) good suggestion, just removed those // lines of code , they aren't necessary as they're just commented out lines of code
item 4) "But why exactly is this necessary?" please see again @twister comment #152 and his other comments and also please see the issue descripton under "Chosen solution" and the various discussions that led up to the helper classes being included in the patch (it's quite a long discussion).

item 5) "// Check for XML preamble" . use case for this; checking for markup type before validation, it is logical to check that input is xml and should be validated as xml (not as html5) so this check is necessary. ( this is how I interpret this section of code and the description)

+++ b/core/modules/system/tests/modules/form_test/src/Form/FormTestCheckboxesZeroForm.php
@@ -56,7 +56,7 @@ public function buildForm(array $form, FormStateInterface $form_state, $json = T
-    if ($form_state->has('json')) {
+    if ($form_state->get('json')) {

item 6) Any suggestions for making this line of code clearer? according to the api docs on public function FormState::get an array is returned containing the property from the form. So one might ask, is it normal to put this "get('property')" into a boolean check? I'd say let the php interpreter deal with it for now, we could always revise it down the road in a new issue after the commit.

thanks.

daffie’s picture

The added test for Namespacer::shouldCopy() in #207 is for me RTBC. Thanks @Mile23!

xjm’s picture

Issue summary: View changes
Priority: Critical » Major
Status: Needs work » Postponed

@catch, @alexpott, @webchick, @effulgentsia and I discussed this issue this week. Thanks so much to everyone to sticking with this challenging issue! The examples of this bug in the issue summary (where filter.module does not understand valid HTML5 markup and sometimes even turning valid HTML5 markup into invalid markup) definitely indicate a major bug since we want to fully support HTML5 in Drupal 8. The bug primarily affects markup entered as user input in "blob" fields (as well as test code), and does not regress the overall HTML5 support on the rest of the page.

The example of filter module breaking the DOM is the most serious, but that will only occur under a specific set of circumstances. Furthermore, there is no impact on data integrity, since the user's original input is still stored (just rendered incorrectly on output.) So, the bug does not render the whole filter module unusable, so per our issue priority policy, this is a major bug, not a critical one.

We also discussed scope management for the issue. The attached patch adds the new library to vendor, adds new core components to implement the library, makes broad changes in the filter module to use it for parsing, and also fixes simpletest and all the tests throughout core that are not HTML5-compliant. This is many different bug scopes in one patch, so we have a disruptive 500K patch. To make these fixes less disruptive and more reviewable, let's split this up into 3 or more separate issues. I'd recommend:

  1. An issue to add the library to Drupal. This should include a partial -do-not-test.patch along with the full patch so that we can review the changes to core that are needed to add the library to vendor. (Major task with the "blocker" tag.)
  2. An issue to fix filter module itself once the library is added. (Major bug, postponed on #1)
  3. Issues to make SimpleTest use HTML5 parsing and to fix invalid HTML5 markup in tests. (A major bug and possibly some normal bugs, depending on how it breaks down.)

I've added the beta evaluation to the summary for this issue for the filter bug and marked it postponed (though we might want to create a new issue once the library issue goes in). Who can help with the task of creating the new issues, splitting up the patch, and moving the unaddressed feedback for the relevant parts into the relevant issues?

jibran’s picture

Apparently this blocks #2331685-37: PHPUnit deprecated assertTag(), assertNotTag() and assertSelectEquals() which is critical so this should be critical.

daffie’s picture

@jibran: There are different options for solving [#9492471]. And one possibility has to do with using this issue.

xjm’s picture

Thanks @jibran and @daffie. See #2331685-44: PHPUnit deprecated assertTag(), assertNotTag() and assertSelectEquals(); this is not a blocker to that issue.

cweagans’s picture

Of interest for people following this issue: see #2415305: Replace core text filters with versions that are HTML5 aware for a (perhaps overly simplistic) D7 implementation of this same library in html5_tools

daffie’s picture

Title: PHP DOM (libxml2) only understands XHTML4, misinterprets HTML5, but D8 must cope with HTML5 » [Meta] PHP DOM (libxml2) only understands XHTML4, misinterprets HTML5, but D8 must cope with HTML5
Status: Postponed » Active

This issue is changed to meta issue. See comment #213.

The first child issue is #2429363: Add HTML5-lib to Drupal 8 core for the filter system and for the testing system.

Berdir’s picture

That went in, great.

Question, will this also help to fix #2327341: Xss::attributes() corrupts data attributes that contain colons or JSON-encoded data (and the three related/duplicate issues)?

Hanno’s picture

Great that html5-lib is in!! Thanks all for all the efforts. @Berdir I'm afraid this won't help that issue, as that is caused by another filter function in the Drupal system.

Wim Leers’s picture

Wim Leers’s picture

We should try to move this forward again.

mgifford’s picture

Status: Active » Needs work
Issue tags: +Needs reroll

Agreed.. Important for being a mobile & accessible UI.

twistor’s picture

Assigned: Unassigned » twistor
twistor’s picture

Issue tags: -Needs reroll

I'm just going to re-write this patch. It will be much simpler since the HTML5 parser now supports the disable_html_ns option.

twistor’s picture

Status: Needs work » Needs review
FileSize
14.37 KB

This is mostly a find/replace with the HTML5 parser.

daffie’s picture

Status: Needs review » Active

Hi twistor, this issue has become a meta issue. The patch has been moved to #2441373: Upgrade tests to HTML5. See comment #213 for move info about this. And thank you for working on this problem. :-)

Status: Active » Needs work

The last submitted patch, 226: 1333730-226.patch, failed testing.

daffie’s picture

Status: Needs work » Active
twistor’s picture

Assigned: twistor » Unassigned

Ahh, thanks, I knew that at some point.

Version: 8.0.x-dev » 8.1.x-dev

Drupal 8.0.6 was released on April 6 and is the final bugfix release for the Drupal 8.0.x series. Drupal 8.0.x will not receive any further development aside from security fixes. Drupal 8.1.0-rc1 is now available and sites should prepare to update to 8.1.0.

Bug reports should be targeted against the 8.1.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.9 was released on September 7 and is the final bugfix release for the Drupal 8.1.x series. Drupal 8.1.x will not receive any further development aside from security fixes. Drupal 8.2.0-rc1 is now available and sites should prepare to upgrade to 8.2.0.

Bug reports should be targeted against the 8.2.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

victor-shelepen’s picture

Hello. What is a state of the issue? I want to suppress warnings because of HTML5 rules. I have got the related issue here. #2802269: drupal-entity-inline invalid in Entity

daffie’s picture

In comment #213 have the D8 maintainers decided that it should be solved in multiple child issues. So changing this to the category "plan". The first issue to be solved is #2667340: Usage of field_prefix and container-inline creates invalid markup..

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.6 was released on February 1, 2017 and is the final full bugfix release for the Drupal 8.2.x series. Drupal 8.2.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.3.0 on April 5, 2017. (Drupal 8.3.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.3.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.6 was released on August 2, 2017 and is the final full bugfix release for the Drupal 8.3.x series. Drupal 8.3.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.4.0 on October 4, 2017. (Drupal 8.4.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.4.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.4 was released on January 3, 2018 and is the final full bugfix release for the Drupal 8.4.x series. Drupal 8.4.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.5.0 on March 7, 2018. (Drupal 8.5.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.5.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.6 was released on August 1, 2018 and is the final bugfix release for the Drupal 8.5.x series. Drupal 8.5.x will not receive any further development aside from security fixes. Sites should prepare to update to 8.6.0 on September 5, 2018. (Drupal 8.6.0-rc1 is available for testing.)

Bug reports should be targeted against the 8.6.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.6.x-dev » 8.8.x-dev

Drupal 8.6.x will not receive any further development aside from security fixes. Bug reports should be targeted against the 8.8.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.9.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 8.8.x-dev » 8.9.x-dev

Drupal 8.8.7 was released on June 3, 2020 and is the final full bugfix release for the Drupal 8.8.x series. Drupal 8.8.x will not receive any further development aside from security fixes. Sites should prepare to update to Drupal 8.9.0 or Drupal 9.0.0 for ongoing support.

Bug reports should be targeted against the 8.9.x-dev branch from now on, and new development or disruptive changes should be targeted against the 9.1.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

lhguerra’s picture

Have we gave up on this? I tried the related issues but most of the job seems stale for years. I'm having the xml:lang issue and it is keeping my portal pages from being indexed by Google's AMP service.

By the way, in my case, the xml:lang isn't only being added like a side-effect but it is also being added twice!

This markup:

<span lang=\"en\">test</span>

Becomes this markup:

<span lang=\"en\" xml:lang=\"en\" xml:lang=\"en\">test</span>

This happens when adding the basic_html format to a field.

Wim Leers’s picture

This came up again over at #3191041: [PP-1] PHP Warning: DOMDocument::loadHTML(): Tag drupal-media invalid in Entity, line: 34 in Drupal\Component\Utility\Html::load() — with the media_embed filter that was added in #2940029: Add an input filter to display embedded Media entities we're going to increasingly have custom HTML tags stored/managed/parsed by Drupal, and hence the importance of this issue is increasing.

catch’s picture

effulgentsia’s picture

Title: [Meta] PHP DOM (libxml2) only understands XHTML4, misinterprets HTML5, but D8 must cope with HTML5 » [Meta] PHP DOM (libxml2) misinterprets HTML5
Version: 8.9.x-dev » 9.3.x-dev
andypost’s picture

andypost’s picture

Digging it and found that this bug gone on old tracker https://bugzilla.gnome.org/show_bug.cgi?id=761534 and no new issue filed in gitlab

On PHP side looks it also hanging in https://bugs.php.net/bug.php?id=60021

alexpott’s picture

@andypost here's the new link to the HTML5 support in libxml2 - https://gitlab.gnome.org/GNOME/libxml2/-/issues/211

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.0-rc1 was released on November 26, 2021, which means new developments and disruptive changes should now be targeted for the 9.4.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.4.x-dev » 9.5.x-dev

Drupal 9.4.0-alpha1 was released on May 6, 2022, which means new developments and disruptive changes should now be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.5.x-dev » 10.1.x-dev

Drupal 9.5.0-beta2 and Drupal 10.0.0-beta2 were released on September 29, 2022, which means new developments and disruptive changes should now be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 10.1.x-dev » 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch, which currently accepts only minor-version allowed changes. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Ghost of Drupal Past’s picture

andypost’s picture

fgm’s picture

Just to be sure, is there no equivalent issue for whitespace in pure XML documents ? Asking for XML-RPC.

Also, is there any impact on contrib/custom filters ? Asking for Glossary 2. And Glossify may have the same question.

andypost’s picture

XML-RPC very probably affected as extension has no maintainer https://pecl.php.net/package/xmlrpc and PHP 8.3 got a lot of dom/libxml fixes but in long run https://wiki.php.net/rfc/domdocument_html5_parser

fgm’s picture

@andypost : Drupal XML-RPC stopped using the XML-RPC extension around the PHP4.7 IIRC, so it doesn't matter.

However, reexamining it, we've actually changed it to use the XMLParser extension (meaning the expat parser) instead, so there is no issue about libxml.

longwave’s picture

@fgm With the move to the HTML5 parser/serializer, in most cases there should be no impact on rendering or parsing HTML, but there are some edge case changes if you rely on exact string matches or significant whitespace (e.g. in tests) - see the change record for the examples we discovered in tests in core: https://www.drupal.org/node/3225468

andypost’s picture

andypost’s picture

Looks like all child issues are done so it could be marked fixed in 10.2?!

Wim Leers’s picture

Status: Active » Reviewed & tested by the community

@andypost I was thinking the same thing! 😄

I just went over all the related issues that are in Drupal core and were not yet closed:

I think the honor is @longwave's 🤓

needs-review-queue-bot’s picture

Status: Reviewed & tested by the community » Needs work

The Needs Review Queue Bot tested this issue.

While you are making the above changes, we recommend that you convert this patch to a merge request. Merge requests are preferred over patches. Be sure to hide the old patch files as well. (Converting an issue to a merge request without other contributions to the issue will not receive credit.)

nod_’s picture

Status: Needs work » Reviewed & tested by the community
Issue tags: +no-needs-review-bot
longwave’s picture

Status: Reviewed & tested by the community » Fixed

Agree that all child issues have been fixed, some related issues still remain where we can clean things up but the primary case of using libxml2 for HTML parsing is no longer an issue as we are using the masterminds HTML5 library instead now.

Tried to credit everyone who made useful comments or helped move this issue forward in some way. Thanks everyone, I'm so happy that Drupal now fully supports HTML5!

longwave’s picture

Version: 11.x-dev » 10.2.x-dev

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.