Problem/Motivation
The filters 'htmlcorrector', 'html' and the testing system needs html parsing and a valid DOM to work with. This is done by the libxml2 library provided in PHP that cleans html and transform it to a dom. Libxml2 assumes all html is HTML4 and correct it with HTML4 rules. As Drupal will be based on HTML5, typical HTML5 tags and constructions will be marked invalid, added or stripped.
A small test example is that <span lang="en">
transforms to <span lang="en" xml:lang="en">
Proposed resolution
Chosen solution
The consensus is that the best solution is to use html5-php (rewrite of html5lib). The library has almost all the functionality that we need. And the functionality is already implemented. There was a problem with the library and that has been fixed. The functionality that we miss is when the DOMDocument has a default namespace it's not possible to parse it using XPath unless the namespace is registered as something else. In the patch there are two helper-classes to add this functionality.
Remaining tasks
#2441373: Upgrade tests to HTML5
#2667340: Usage of field_prefix and container-inline creates invalid markup.
#2441811: Upgrade filter system to HTML5
@todo: needs an issue created or existing issue linked from this one- convert the filter module to use masterminds/html5
User interface changes
None
API changes
Change in behavior of filter_dom_load, the html-filter and html-corrector
Beta phase evaluation
Issue priority | Major because the bug has many repercussions on filters using the HTML filter, and under some circumstances could result in an invalid DOM. Not critical because it does not render the HTML filter unusable. |
---|---|
Prioritized changes | The main goal of this issue is a bugfix to for HTML5 support, which is part of the Drupal 8 product. |
Blocked Issues
#1277290: Use a proper HTML parser for every core filter
Related Issues
PHP Bug #60021 DOMDocument errors on HTML5 tags
issue that replaced own (faulty) function with libxml for the html corrector filter in 2009: #374441: Refactor Drupal HTML corrector (PHP5)
#725260: Use PHP's Tidy component for the clean-up HTML filter
Original report
The html filter corrector is now based on XHTML, but Drupal8 should output html5
At the moment
<span lang="en">
transforms to <span lang="en" xml:lang="en">
(see issue #1328768: attributes 'xml:lang' and 'xml:id' transform to 'lang' and 'id' in filter_xss)
two possible causes:
function filter_dom_load
currently loads @$dom_document->loadHTML('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>' . $text . '</body></html>');
function filter_dom_serialize
uses $dom_document->saveXML
Comment | File | Size | Author |
---|---|---|---|
#233 | entity_empbed-invalid_tag-2802269-1.patch | 1.02 KB | victor-shelepen |
#226 | 1333730-226.patch | 14.37 KB | twistor |
Comments
Comment #1
RobW CreditAttribution: RobW commentedYep, I'm pretty sure this is a limitation of the loader and serializer functions. Unfortunately changing everything that says XML to HTML doesn't help -- in order to get new HTML5 elements parsed correctly you need to slip them through the anything goes XML element name loophole. HTML here basically means HTML4 only.
There's a new php html parser called html5lib that I believe addresses this. It's MIT licensed, which I'm pretty sure I've read in the issue queue is GPLv2 compatible.
Related links:
http://stackoverflow.com/questions/5033504/php-domdocument-loadhtml-retu...
http://stackoverflow.com/questions/4029341/dom-parser-that-allows-html5-...
http://code.google.com/p/html5lib/
Comment #2
sunUgh. Add WebTestBase::parse() to that.
Comment #3
Hanno CreditAttribution: Hanno commentedCan Symfony help us with this? (http://symfony.com/doc/current/components/dom_crawler.html)
Comment #4
sunDomCrawler is also based on PHP's DOM extension:
http://api.symfony.com/2.1/Symfony/Component/DomCrawler/Crawler.html
https://github.com/symfony/DomCrawler/blob/master/Crawler.php
As this PHP bug clarifies https://bugs.php.net/bug.php?id=60021 the actual issue is caused by libxml2, which DOM is using under the hood.
libxml2's changelog mentions a commit for the new possible HTML5 meta charset value, but otherwise does not mention html5.
Unless I'm missing something big, this means we need to use a user-space library. Aforementioned PHP bug seems to suggest html5lib.
(Even if libxml2 was fixed [but there doesn't even seem to be an issue to do that], that would only be contained in the latest and greatest release of PHP, and we cannot require that.)
The impact of all of this goes way beyond Filter module, so I'm adjusting the title and bumping this to critical.
Comment #5
Hanno CreditAttribution: Hanno commentedSame discussion about html5 parsing with Symfony: https://github.com/symfony/symfony/issues/1733
Is this something we should or can fix in Symfony and let modules use this DOM crawler? Symfony has a workaround, but needs a real solution: https://github.com/symfony/symfony/commit/a57a4aff
Comment #6
Hanno CreditAttribution: Hanno commentedSame discussion about html5 parsing with Symfony: https://github.com/symfony/symfony/issues/1733
Is this something we should or can fix in Symfony and let modules use this DOM crawler? Symfony has a workaround, but needs a real solution: https://github.com/symfony/symfony/commit/a57a4aff
Comment #7
Crell CreditAttribution: Crell commentedSymfony's workaround doesn't seem to be a workaround, if I'm reading that correctly. The tests are still verifying that HTML5 does not validate, which is not what we want.
Generally something like this should be fixed as far upstream as possible. None of us are C coders as far as I know, so that means in Symfony. I'm fine with making a fix upstream and then pulling in DomCrawler. What the fix should be, though, I have no idea.
Comment #8
sunFrom where do you get that? Did I overlook something?
I linked to the DomCrawler source code in #4, and that calls into the regular, native DOMDocument & Co PHP classes, which are all bound to the current libxml2 feature-set.
Comment #9
Crell CreditAttribution: Crell commentedI was looking at the tests found via the comment in #6. I didn't look at the current source yet.
In any event, my previous statement still stands. If we and Symfony have the same bug, let's workaround it once in DomCrawler and then leverage it. No sense there being two workarounds for libxml2 floating around.
Comment #10
Damien Tournoud CreditAttribution: Damien Tournoud commentedThis is nothing more then a normal bug.
We only have two things to do:
filter_dom_load()
libxml_use_internal_errors(true);
The first change solves the issue highlighted in the original post. The second change silences the invalid tag warnings.
In WebTestBase, we can decide to only ignore "invalid tag" errors (error code 801).
Test script attached.
Comment #11
RobW CreditAttribution: RobW commentedWould this regress advanced html correction that the current filter provides for xhtml? Or are attribute corrections (like the one the OP was looking to avoid) not really an issue in HTML5?
Comment #12
sunI'm surprised that it took this long to face these false-errors in actual test results.
This bug blocks #1168246: Freedom For Fieldsets! Long Live The DETAILS. now.
The patch in that issue is causing DOM parser exceptions on the DETAILS and SUMMARY elements:
http://qa.drupal.org/pifr/test/370623
Systematic/infrastructural bugs like this, which are causing irrelevant test failures for other changes, and worse, which are blocking up API changes, are critical.
Attached patch translated #10 into a patch. Not yet sure how to test this.
Comment #14
sunThis patch is expected to come back green.
Comment #15
sunComment #16
sunComment #17
Damien Tournoud CreditAttribution: Damien Tournoud commentedI suppose we should support both forms?
Is this a regression? If it is not, we should update the comments, because they do not match the result.
Comment #18
sunComment #19
sunI think this patch is ready to fly. Would be great to get this blocker out of the way.
Comment #20
Damien Tournoud CreditAttribution: Damien Tournoud commentedThis is called a "Void element" now in HTML5.
This one is troublesome.
<div/>
is invalid HTML5 (the / is only permitted at the end of a Start tag). In HTML5 tree insertion algorithm, the self-closing flag is purely informational, so<div/>
will be interpreted as<div>
in a conformant browser.This feels like a deal breaker. Someone wants to take a look at the (seemingly unmaintained) html5lib?
Comment #21
Damien Tournoud CreditAttribution: Damien Tournoud commentedI assume that using
->saveHTML()
instead of->saveXML()
should fix the issue described above.Comment #22
sunHmmm.
Changed saveXML() to saveHTML().
Regressions:
- Pretty-formats *some* (but not all) of the output. Wasn't able to make it stop that. ::$formatOutput is ignored.
- CDATA escaping for scripts and styles is stripped from the output.
The test results should expose the differences "live" as preformatted debug output in the testbot test results.
Comment #23
RobW CreditAttribution: RobW commentedIf I remember correctly we don't need CDATA anymore in html5, unless we're sending the page as application/xhtml+xml and trying to validate (which no one has ever or will ever do). So I think the cdata stripping is correct behavior.
Comment #25
Damien Tournoud CreditAttribution: Damien Tournoud commentedNote: the
$node
argument of->saveHTML()
appeared in PHP 5.3.6. I think that's a good reason to bump from 5.3.5 to 5.3.6 as our minimum requirement.Comment #26
sunWe have two related issues for that already - I followed up on both to note this issue:
#1751100: Bump minimum version of php required to 5.3.5
#1800122: Bump minimum version of php required to 5.3.10
Comment #27
sunMeanwhile, I worked around the test exceptions in #1168246: Freedom For Fieldsets! Long Live The DETAILS. by adding error suppression to
drupalPostAjax()
, which - unlike all other usages of DOMDocument - did not suppress errors.However, I still believe that there is no way we can release D8 with HTML5 markup but a parser that expects XHTML all over the place. Thus, this is borderline critical, but I'm demoting it to major, plus tag.
We should bump the minimum PHP version to 5.3.6 or 5.3.10, and quite potentially, just simply ignore libxml's extra formatting for subsequent paragraphs. (in my testing, all markup remained as is, only
<p></p><p></p>
was formatted with a linebreak between paragraphs, which seems to be a bug in libxml — perhaps even fixed in later PHP versions)Comment #28
Hanno CreditAttribution: Hanno commentedMaybe there is some progress:
PHP 5.3.10 minimum is since March '13 in core, so that's not a blocker anymore. #1800122: Bump minimum version of php required to 5.3.10
FYI There are also some enhancements in the Symfony DOMcrawler:
http://symfony.com/blog/new-in-symfony-2-3-domcrawler-enhancements
Comment #29
mbutcher CreditAttribution: mbutcher commentedWe're just about done with a standards-complaint PHP HTML5 parser and serializer.
https://github.com/technosophos/HTML5-PHP
This was a fork of libhtml5, but has since turned into an almost complete rewrite. It parses HTML 5 (and 3 and 4) and creates a DOMDocument (libxml). It can also serialize any HTML DOMDocument into an HTML5 string/file.
The goal isn't necessarily to clean or transform documents (you can't for instance, feed it HTML2 and expect it to spit out gorgeous and semantic HTML5). But it does parse and write HTML5.
Comment #30
dcam CreditAttribution: dcam commentedhttp://drupal.org/node/1427826 contains instructions for updating the issue summary with the summary template.
Comment #30.0
dcam CreditAttribution: dcam commented.
Comment #30.1
Hanno CreditAttribution: Hanno commentedSummary added
Comment #30.2
Hanno CreditAttribution: Hanno commentedPHP bug added
Comment #30.3
Hanno CreditAttribution: Hanno commented.
Comment #30.4
Hanno CreditAttribution: Hanno commentedmore info
Comment #30.5
Hanno CreditAttribution: Hanno commented.
Comment #30.6
Hanno CreditAttribution: Hanno commented.
Comment #30.7
Hanno CreditAttribution: Hanno commented.
Comment #31
Hanno CreditAttribution: Hanno commentedI added a summary and hope this describes the issue well enough, please correct. Writing this summary I was unsure whether surpressing errors is still a possible solution, or that we need a real html5 parser.
If we need a html5 parser, best thing to me is that we try to get that in PHP/libxml and a similar solution in Symfony for older PHP versions.
If we have to find a HTML5 parser, some discussion is needed about the expectations of such a parser. I think the expectations for testing differs from the htmlcorrector filter.
Comment #31.0
Hanno CreditAttribution: Hanno commentedimprovements
Comment #31.1
Hanno CreditAttribution: Hanno commented.
Comment #31.2
Hanno CreditAttribution: Hanno commentedissue that introduced libxml for html corrector filter in 2009
Comment #32
Hanno CreditAttribution: Hanno commentedSame issue #1277290: Use a proper HTML parser for every core filter
Comment #33
mgiffordAdding related meta isue #2035139: [meta] Comply with WCAG guideline 4.1: Compatibility
Comment #34
PanchoRe #27:
"No way we can release D8" means "critical", and I agree with that. Anyway, we're way beyond thresholds both regarding critical and major bugs.
So all this "major plus 'Release blocker' tag" leads to is covering our real technical debt, as discussed on #1810428: [Policy, no patch] Adjust major and critical task thresholds during code thaw. Which isn't good, even more because this specific issue will take quite some time to be properly solved.
Comment #35
Crell CreditAttribution: Crell commentedmbutcher: Do you think your library is stable enough for us to leverage for D8 core? Or will it be soon?
Comment #36
mfer CreditAttribution: mfer commentedI'm jumping in here because Hanno reached out to me. Thanks.
I'm co-developing html5-php along with mbutcher. There is a definite need for a html5 parser and writer. html5-php parses html5 into either a DOMDocument (for a full document) or a DOMDocumentFragment (for fragment parsing) and the writer can serialize these. Note, fragment parsing and document parsing are two different methods. Using the DOM supplied with libxml means it's compatible with other tools.
It's currently a beta1 release and we believe it to be API and feature complete. We're looking for feedback to improve stability and performance. The sooner we get good feedback the sooner we can release a stable version.
html5-php has composer support and we're using semantic versioning.
If you want to use the libxml parser for html prior to html5 you'd need code to detect which is being used. You could look at the document tag at the start.
I hope this helps fill in the blanks. The release of html5-php will be as fast as we can get feedback on it's use to verify it's all working as expected. The testing setup has over 128 tests and 2k assertions. It is run through travis ci for php 5.3, 5.4, and 5.5.
One note, for proper utf8 handling we do need mbstring to be present. iconv can do some things but there are some versions of PHP that have issues with iconv.
if you have questions I'd be happy to answer them.
Comment #36.0
mfer CreditAttribution: mfer commentedTidy added
Comment #37
Crell CreditAttribution: Crell commentedThanks, Matt! Which PHP versions do we need to be concerned about, iconv vs. mbstring-wise?
Comment #38
mfer CreditAttribution: mfer commentedAt the very least, all or almost all of PHP 5.4 and some of the newer versions of PHP 5.3. I've not tested on PHP 5.5. The details are up at http://us1.php.net/manual/en/function.iconv.php#108643.
The issue pops up when we go to filter invalid characters from a document or fragment. If there are no invalid characters there is no issue. If there are invalid characters and iconv is used we get nothing back from it. This may be an edge case for your use. How often do people put invalid characters in html documents?
Comment #39
Hanno CreditAttribution: Hanno commented@mfer thanks for this excellent overview and it sounds good.
Invalid characters: We will have to deal with invalid characters for the filter module as text can be automatically imported from external sources. Should we filter these characters before parsing?
Comment #40
mfer CreditAttribution: mfer commented@Hanno by invalid characters I mean characters like utf-16 ones in a utf-8 document. How would you do that in Drupal for the filter system? To see how we do that (with lots of comments) see \HTML5\Parser\UTF8utils::convertToUTF8().
Comment #41
Hanno CreditAttribution: Hanno commentedAh ok, sounds good. Indeed, utf-16 characters can not survive in a mysql utf-8 environment, as mysql throws an error for an Incorrect string value on input.
For illegal utf-8 control characters I see that a function checkForIllegalCodepoints() is included that throws an error if these are found in the text.
Comment #42
mfer CreditAttribution: mfer commentedFYI, the html5-php project just had the first stable release. You can check it out at http://masterminds.github.io/html5-php/.
Comment #43
cosmicdreams CreditAttribution: cosmicdreams commentedmfer, what's the way forward? We're going to be sprinting at Drupal Camp Fox Valley this weekend. Anything we can do to help this issue along?
Comment #44
mfer CreditAttribution: mfer commentedThanks for your interest in working on this. I think there are a few parts to this if I'm reading the issue correctly. Note, I was just here to help this along and am not intimately familiar with the issue.
First, the original description gets into how the html handling in the filter functions should output html5. At it's simplist you can pull in html5-php via composer.json and then update
filter_dom_serialize()
to use it. I'd start with trying to replace$dom_document->saveXML
with\HTML5::saveHTML($dom_document);
. Make sure you review the current patches as well.Second, and you might want to do this first, investigate #1277290: Use a proper HTML parser for every core filter to have a common solution.
Third, as you use this file any bugs or question with html5-php in the html5-php project and I'll try to jump right on responding to them.
Make sure the html5 tests talked about here pass.
Again, others know this issue far better than I do. I'd reach out to them.
Comment #45
cosmicdreams CreditAttribution: cosmicdreams commentedThat's awesome Matt, Larry is here so I'll see if we can get going on this.
Comment #46
sunI strongly disagree with the need for an external user-space library for this issue.
Native PHP functions are sufficient. The minimum required PHP version was the only blocker, but that appears to be resolved now.
The use-case is not to filter arbitrary web pages or whatnot. The use-case is to filter simple user input.
Comment #47
cosmicdreams CreditAttribution: cosmicdreams commentedsun, are you saying that we can simply close this critical issue because it's not relevant anymore?
Comment #48
mfer CreditAttribution: mfer commented@sun, the build in PHP functions don't do html5. They can handle html4 and xml. But, there are noticeable differences.
Let's look at a parsing example. PHP sees
as...
While handling this example PHP will also throw a warnings for the audio tag and each of the track tags.
How do you see having native PHP functions handle html5 properly when they currently just don't?
I wish PHP handled html5 well. Then there would be no need for html5-php. I look foward to the day that library can be depreciated.
At the very least the doctype when wrapping a fragment in Drupal should be switched to an html5 one. That will get things closer to html5 when saveHTML is used for the native functions.
Comment #49
sunYes. And for that reason, I'd recommend to proceed with the patch in #22 first.
With that, only if we cannot either blame user input or if we cannot blame outdated PHP versions, we can look into more pain.
Is it legit to ask people to run Drupal on a current version of PHP for full/better HTML5 support? — Yes.
We do not want to scrape and parse entire HTML pages.
And we do not give any kind of guarantee that your input equals the output. If that is desired, disable the HTML filter for privileged users.
Drupal's user input filter system shoots for simple user input only. Nothing else.
KISS.
Comment #50
mfer CreditAttribution: mfer commented@sun a couple quick things...
PHP 5.5 does not have html5 support in DOM handling. It's using libxml that doesn't have html5 support either. PHP 5.5.4 (latest as of this post) still fails as my comment in #48 shows. The DOM/libxml handling that works for html5 happens to be the parts compatible with html4.
That being said, what is the current version of PHP with the full or better html5 support? If you're talking better, what does better mean?
You also said,
Can you point me to where this is documented, where it comes from, and the details of what it means? For example, list of supported tags and attributes.
These details will be great for writing tests to make sure the implementation meets it. Maybe with some tweaks the built in PHP functionality will work. But, if full html5 support isn't the requirement than the deviations from that should be worked out, in writing, in tests, and passed through Dries or his proper lieutenant.
Comment #50.0
mfer CreditAttribution: mfer commentedAdd blocked issue
Comment #51
jibran22: filter.html5_.22.patch queued for re-testing.
Comment #54
larowlanNote http://masterminds.github.io/html5-php/
P.i.e.
Comment #55
Crell CreditAttribution: Crell commentedIf html5-php actually does what we need (I've not verified that myself), and can be used for even more, that doesn't mean we shouldn't use it if it does the job and we can eliminate or not write our own code. Zend Feed also has Atom support, which core doesn't use at all, but it now means contrib modules can leverage it as the go-to tool for Atom support. That's a good thing.
If we have working code now that does what we need, or is easy to modify, there's no reason to replace it. If it's use a library that is overkill vs. write our own junior version of said library, use the existing library.
Comment #56
Wim LeersNote that the discussion here also affects another issue — see #1979468-233: ".active" from linkGenerator(), l() and theme_links() forces an upper limit of per-page caching for all content containing links.
Comment #57
longwavehtml5-php's own documentation mentions that it should not be used for filtering:
Maybe we should follow that suggestion, and pull in HTML Purifier for the htmlcorrector filter?
Comment #58
Wim Leers#57: filters like the caption filter (also in D8 core) do need HTML5 support, they're not about whitelisting certain HTML and removing the rest, but transforming HTML. For that,
DOMDocument
is ideal.Comment #59
Hanno CreditAttribution: Hanno commentedIssue in which this parsing leads to invalid html. When a user adds
<span lang="en">
in the basic HTML text format, it alters to a duplicate attribute:<span lang="en" xml:lang="en" xml:lang="en">
#2346989: [PP-1] 'Restrict images to this site' leads to incorrect HTML with language attribute
Comment #60
mgiffordComment #61
filijonka CreditAttribution: filijonka commentedComment #62
chx CreditAttribution: chx commentedI talked to filijonka about next steps for this rather important issue:
Comment #63
chx CreditAttribution: chx commentedComment #64
Hanno CreditAttribution: Hanno commented@chx interesting. htmlpurifier can't do html5 right now (http://htmlpurifier.org/phorum/read.php?3,7497,7497#msg-7497). Are the maintainers willing to work on html5 and what are the advantages compared to html5-php?
Comment #65
chx CreditAttribution: chx commented#57 said that html5-php is not suitable for this sort of filtering?
Comment #66
Hanno CreditAttribution: Hanno commentedAh, but 'this form of filtering' is about sanitizing. HTML5-lib says:
(https://github.com/Masterminds/html5-php/wiki/End-User-Submitted-Markup)
So if I read this correct, HTML5-lib is 'only' creating a DOMDocument, but not doing further processing for output, like sanitizing malicious script tags. In Drupal we have our own filter functions for that, so we probably even don't want to have a sanitizing function.
Comment #67
heddnIt is possible with a few days work upstream. As of now, htmlpurifier only has experimental HTML 5 support. But adding only support to handle specific HTML 5 tags would shrink the amount of necessary upstream work. How thorough an HTML 5 implementation are we seeking?
See #1321490: HTML5 Support? & http://htmlpurifier.org/phorum/read.php?3,7497,7497#msg-7497
As far as getting this into core, I'd love to see it happen. Configuring and making htmlpurifier run without any of the additional configuration forms would be fairly simple. The module's implementation is fairly simple & strait forward. It has its own internal caching layer that made the D6 => D7 port a little more involved. I think we've got that figured out and it won't be an issue going to D8.
Comment #68
chx CreditAttribution: chx commentedSo security wise we are good. Security filtering is done by a custom parser that has stood the test of time pretty well. In
_filter_html
we do use the dom document to apply nofollow. Not sure whether this helps but I thought it relevant.Comment #69
filijonka CreditAttribution: filijonka commentedhi ho
ok so when reading #66 and #67 I get the feeling with have actually two options for a third party vendor.
which one should we use is the question and trying to figure out what they can do and not do and in what state they are in.
HTML5-php seems to be in a ready state, supports composer, supports php namespace (a must?)
HTMLpurifier seems to not be in a ready state (but soon?) and no composer support as far as I can see, doesn't support namespace(?)
Well I'll check these two out and see how they works within drupal to beginning with.
Comment #70
heddnhtmlpurifier has Composer support and has a PHP namespace. htmlpurifier has been around in Drupal since at least 4.7, so it has a large and continually growing community.
Comment #71
filijonka CreditAttribution: filijonka commentedok good but the documentation doesn't state how to use composer and in the documentation it says it has no support for php namespace but good it's wrong.
Comment #72
Mile23On Packagist.org:
HTMLPurifier: https://packagist.org/packages/ezyang/htmlpurifier
HTML5-lib: https://packagist.org/packages/masterminds/html5
All other things being equal, here are some things to consider:
I was able to clone html5-lib and run the tests without having to learn anything. Also, it has a .travis.yml file, and it generated an HTML coverage report without being asked. Yay. (PHPUnit says the biggest change risk scores 18, which is excellent.)
htmlpurifier apparently uses SimpleTest for testing, and doesn't tell you how to install it. The docs say it requires a writable file system for caching, which is a bit of a non-starter.
Should we have a 'modernize htmlpurifier' campaign?
Comment #73
heddnI'm not at all familiar with html5-php, but is it more of a parser? htmlpurifier is a HTML sanitizer/filter. From the issue summary, this issue is really looking for a parser, which htmlpurifier is not. Are these assumptions correct?
BTW, htmlpurifier's drupal module uses a DB backed cache.
Comment #74
Hanno CreditAttribution: Hanno commentedAs I understand from the documentation of HTMLpurifier:
Comment #75
ezyang CreditAttribution: ezyang commentedHey guys,
I'm lead dev for HTML Purifier, so I thought I'd answer a few of the questions that were raised by this thread.
1. HTML Purifier is currently in "maintenance" mode. This means that I am fixing bugs and security problems, but not actively working on features. For new features like an alternative parser or more tag support, I need community help to make happen.
2. HTML Purifier currently uses either the built-in dom parser, its built-in (and slightly idiosyncratic) pure PHP parser, or the html5lib parser (which has fallen off the wagon and is pretty out-of-date.) The choice of parser backend is quite swappable, so it would not be difficult to plug in html5-php (indeed, it probably is a drop-in replacement for html5lib).
3. HTML Purifier operates on a whitelist, so yes, you have to teach it about every element/attribute that you want to support. https://github.com/kennberg/php-htmlpurfier-html5 is a good example of how easy it is to add support for some elements and attributes. The main reason why it's not supported in the core is I've been waiting for someone to go through and methodically add support for *as many* elements as humanly possible; most people have just added support for the dozen or things they need and left it at that.
4. Documentation for how to install SimpleTest for use with HTML Purifier can be found by looking at test-settings.sample.php. We have A LOT of tests, it's kind of excessive.
5. By default HTML Purifier does use a file system cache to speed things up (the performance win is nontrivial). However, as mentioned by heddn, what cache is used is pluggable and HTMLPurifier_DefinitionCache_Drupal stores hooks to make this information stored in Drupal's built-in cache.
6. HTML Purifier's documentation doesn't state how to use Composer because I don't use Composer, and the patches adding support for it were contributed by the community. I'd be happy to accept docpatches with more information for how to use Composer (but I assume it's pretty similar across projects?)
7. HTML Purifier is written in an old fashioned PHP 4 style, and doesn't put any code in a namespace (but prefixes all classes with HTMLPurifier_). It would be helpful to know what specific namespace support you need so we can add it; the main reason namespaces haven't been added is inertia, and also desire to avoid breaking backwards-compatibility.
Comment #76
filijonka CreditAttribution: filijonka commentedhi
ok been reading and listening to what u all are saying and my personal view is that perhaps htmlpurifier isn't the right choice for us but, 1. I might be wrong 2. I probably need to read more about it and test it properly, so more input is appreciated.
So by that said this first patch is done with html5-php and it will need more work (no tests i.e) but 1. I wanted it to move along a bit and 2. needed to see what testbot says about it.
note: this patch has nothing to do with previous patches done so no interdiff.
Comment #78
filijonka CreditAttribution: filijonka commenteddooh sorry brain wasn't connected I guess or something, corrected patch will come soon
Comment #79
filijonka CreditAttribution: filijonka commentedadded the vendor to the patch..
Comment #81
filijonka CreditAttribution: filijonka commentedFYI all fails are the same type but right now whatever I test on head locally nothing works..so kinda hard to debug
Comment #82
filijonka CreditAttribution: filijonka commentedI doubt this will be green but it has solved a lot, changing the the way of loading the input to a DOM tree.
It's also a reroll that went straith on wit hno conflicts (mostly composer.*)
Comment #84
filijonka CreditAttribution: filijonka commentedhmm wonder if that interdiff gone haywild or something..
Comment #85
filijonka CreditAttribution: filijonka commentedredid the interdiff only.
Comment #86
filijonka CreditAttribution: filijonka commentedok I've been debugging this to my death..
In FilterApiTest we buil up an element, starting with $build, calling it with drupal_render_root.
Via that function we end up in function preRendertext (Drupal\Filter\Element\ProcessedText), this runs through all filters, one of these are FilterTestPostRenderCache (which comes before FilterHtmlCorrector) and in this process there's a placeholder added to the html with tag
<drupal-render-cache.....>
.The process for FilterHtmlCorrector is then called and when saving the dom to html the the parser interepret the tag
<drupal-render-...>
as<drupal ....>
.So when coming to renderDynamicThing (Drupal\filter_test\Plugin\Filter\FilterTestPostRenderCache) the string replacing simply doesn't return a correct html and hence the test $this->assertEqual($expected_markup, $build['#markup'],....) simply doesn't return true any longer.
ok how to solve this?
I'm not sure why we are doing this really, perhaps we can add our own rules? so we can allow this pattern but can we do that only in these special cases or does it actually mean that users could do that too?
[Edited to show the parts of the comment that were filtered out. (April 20, 2024)]
Comment #87
Wim Leers#86: a few parts from your comment seem to be missing because you didn't wrap HTML tags in
<code>
tags like this:<code><p>like this</p>
. Can y ou please edit your comment so we can read the full contents? :)Comment #88
filijonka CreditAttribution: filijonka commentedI didn't actually write any html tags or anything in the post. it's just a description of the problem.
I could obviously copy and paste the exact code that I'm mentioning.
Comment #89
Wim LeersBut it says e.g.
and
Surely there are HTML tags there that I can't read?
Comment #90
chx CreditAttribution: chx commentedQuote from IRC when I talked to @filijonka about this, not sure why he didn't post it here:
When
drupal_render_root
gets a text<p>hello world!</p>
we expect it to get back<p>Hello, world!</p><p>This is a dynamic llama.</p>;
while we get back<p>Hello, world!</p><p><drupal-render-cache-placeholder callback="\Drupal\filter_test\Plugin\Filter\FilterTestPostRenderCache::renderDynamicThing
.Comment #91
Wim Leers#90: that helps a lot, thank you.
Now also reproduced. #86 can be summarized as: .
I'm currently debugging.
Comment #92
Wim LeersThe problem is quite simple, really. If I add
debug()
statements toFilterHtmlCorrector
, before and after it does its thing:then I see this:
and
(Note how
<drupal-render-cache-placeholder …
was changed to<drupal …
!)In other words: either
HTML5::loadHTMLFragment()
orHTML5::saveHTML()
break elements with dashes in them.And AFAIK that's valid HTML5. Worse, still, the spec even states that custom elements (think web components) must have dashes in their name, and that standard elements must not.
Conclusion:
masterminds/html5
also doesn't support the spec…Comment #93
chx CreditAttribution: chx commentedhttp://w3c.github.io/webcomponents/spec/custom/
Just helping defining what is a "dash".
Comment #94
chx CreditAttribution: chx commentedHowever, http://www.w3.org/TR/html5/references.html#references [CUSTOM] is non-normative. I have no idea what to do. This is a superset of the HTML5 standard so masterminds/html5 supports the HTML5 spec but not Web Components. Which should be supported? Perhaps there perhaps in a fork. No idea what to do.
Comment #95
filijonka CreditAttribution: filijonka commentedHi
1. I do apologize to @Wim Leers cause what you concluded in your debug was something I already knew I should have written a better description BUT it also good to have it confirmed from another source.
2. This isn't a tag we actaully use so perhaps a renaming would work in this case BUT should users be able to add there own webcomponents then masterminds/html5 needs to be able to parse those elements to.
Comment #96
Wim LeersIndeed.
Exactly!
So: #sadllama :(
Comment #97
filijonka CreditAttribution: filijonka commentedok so this is my plan for now:
1. see if we can easily rename the drupal-render-... tag to drupalrender and see the effects of that
2. See what the plans are from the masterminds to support webcomponents? Did we have any of the masterminds developer in this thread than please respond to this :)
Comment #98
Wim LeersAgreed with the plan.
Opened an issue against
masterminds/html5
: https://github.com/Masterminds/html5-php/issues/65.Comment #99
filijonka CreditAttribution: filijonka commentedComment #100
catchAt this point I think I'd be OK with a patch here that doesn't break render cache placeholders (i.e. via rename of the tag), but also doesn't support web components yet.
If the upstream issue gets positive feedback, we can have a separate critical or major issue to update to the version that adds support?
Comment #101
Wim Leers#100: agreed on all aspects.
Comment #102
filijonka CreditAttribution: filijonka commented@catch read my mind. I was thinking on this on the way home thinking that I should ask if we would go on and ignore the support for webcomponents at the moment.
I'm also already thinking a little bit ahead of what to do with exisiting calls for load, serialize and normalize but when I began to write realising that I need to rethink that so I can write it understandable (hopefully).
/* edit: corrected the english, sorry */
Comment #103
filijonka CreditAttribution: filijonka commenteddid a test to check if html5-lib do support webcomponents,
this is the output codewise:
I tried to add this the issue that @Wim Leers opened at masterminds but it ignored my code-tags and since I've never posted there I've no idea how to do that correctly. So added it here instead and also perhaps someone can confirm this.
Comment #104
Wim Leers@filijonka: GitHub uses (Github-flavored) Markdown, not HTML. Over there, you can post code like this:
Comment #105
filijonka CreditAttribution: filijonka commentedhi
looks like support for web components could be included in v. 2.0.2 per https://github.com/Masterminds/html5-php/pull/66
So I'm waiting a lil with a new patch.
Comment #106
Wim LeersAwesome! :) Super simple pull request too!
Comment #107
filijonka CreditAttribution: filijonka commentedjust some thoughts, perhaps i'm overthinking this but began to think on the existing functions HTML::serialize, normalize and load.
I just took from some test file this:
In the future I guess we should deprecate these three functions so above code would be something like this:
Which is basicly the code we now have in FilterHTMLCorrector.
Perhaps this is not the best approach, perhaps it would be better to actually rewrite the code in the static functions in \Drupal\Component\Utility\Html. That would make it easier for the testwriting.
Comment #108
chx CreditAttribution: chx commented@filijonka we can deprecate but we can't remove because beta. API is not set in stone but needs a very good reason to break. And I do not think there's a real good one here.
Comment #109
Wim Leers#108++
Comment #110
filijonka CreditAttribution: filijonka commentedhmm so updating the code in i.e HTML::normalize would not be ok but updating the code in process() in class FilterHTMLCorrector is ok? I fail to see the difference actually.
No changes need to written about the descriptions in the API for these functions excepts that it works for HTML5 also and not only (X)HTML.
Comment #111
Wim LeersNo, chx and I are saying it's okay to change implementation details, but we should keep the public API unchanged.
Comment #112
filijonka CreditAttribution: filijonka commentedhi
maybe I was unclear but this was actually two suggestions,
one was to keep the code as in current patch and deprecate the functions in HTML OR
keep the code for process as in head and rewrite the functions in HTML.
Comment #113
chx CreditAttribution: chx commentedthe upstream PR has been merged.
Comment #114
Wim LeersWe don't want to break public APIs unless for a very good reason. Here we don't have a very good reason. So:
is highly preferred. Where "HTML" is
\Drupal\Component\Utility\Html
.Comment #115
Mile23Comment #116
filijonka CreditAttribution: filijonka commentedhi
just begun with some testing to rewrite the functions in the HTML class and I think we actually can't do that
example:
Let's say we have made the changes in load so it can parse a html5 and return a correct DOM.
we then have tests like this:
but since PHP DOM can't handle html5, hence the DOMXPath can't handle a html5 dom. So in this case the xpath variable isn't the same as if the it was xhtml hence the parent variable won't be correct.
I'm I thinking correct?
Comment #117
twistor CreditAttribution: twistor commentedhtml5-php builds a real DOMDocument.
If you're seeing problems with XPath, it's probably because html elements are in another namespace. Specifically,
http://www.w3.org/1999/xhtml
.Comment #118
filijonka CreditAttribution: filijonka commentedSo this patch is only a correction of patch in #82 in the sense of changed the masterminds/html5 to dev-master version so the latest fix from them are included.
So hopefully this will be green
Comment #119
Wim LeersThanks!
Good.
But the actual patch still says:
So… wrong patch? :)
Comment #120
filijonka CreditAttribution: filijonka commentedehm the diff seems to be all f* up..why are the lines 457 in this diff?
the patch is correct, going to double check this
Comment #121
filijonka CreditAttribution: filijonka commentedredid my branches and hence the patch and diff just for the being on the safe side.
sorry for this but better safe than sorry.
Comment #122
filijonka CreditAttribution: filijonka commentedgaaah how i can ul wrong diff again..sorry. the patch was ul correctly..here's the diff..
Comment #123
Wim LeersGreen patch, yay! :)
Why make these changes, why use the
HTML5
library directly? We should update our\Drupal\Component\Utility\Html
to use this instead. That's what chx and I have been saying since #110 through #114.Comment #124
filijonka CreditAttribution: filijonka commented@Wim Leers hey read my comment on #118
next step is to find out a way to actually do what we have said in #110-114. And read my comment on #116 on the problems I encountered when trying to do that. But got some ideas actually that I'm going to test.
Does current patch results in any API changes or in any way break our beta rules?
Comment #125
chx CreditAttribution: chx commentedSo I reviewed this (a review only patch is attached) and apparently it addresses the filter system but doesn't address the testing system. Just observing.
Comment #126
filijonka CreditAttribution: filijonka commentedelaborate pls?
Comment #127
chx CreditAttribution: chx commentedI do not have a clear idea on how to continue; just wanted to make it easy for others to grasp where we are at. Sorry for not making this clear, your progress is great.
Comment #128
Wim LeersAt this point, it's more productive for me to just make the changes I think to be necessary rather than continue to repeat them. So I made the changes I asked for in #123 and many comments before that. I also updated
WebTestBase
.Many of the tests in
FilterUnitTest
basically assume XHTML-like parser behavior. But XHTML is the past, HTML5 is the present and the future. Switching to\MasterMinds\HTML5
in favor of\DOMDocument
does mean that we'll switch to HTML5-like parser behavior. The most noticable consequence of that is how self-closing tags are omnipresent in XHTML and contrastingly absent in HTML5. See http://tiffanybbrown.com/2011/03/23/html5-does-not-allow-self-closing-tags/ for a clear explanationI tested with
FilterUnitTest
andFrameworkTest
, those pass; let's see if other tests pass or not.Comment #130
filijonka CreditAttribution: filijonka commentedwell personally I don't give a fuck about what is productive for you since I'm the one that got this on my shoulders and instead of just marching in you can talk with me. Since we already in #116 noticed some problems with the codeshanges we need to figure that out too. And since I'm working on this trying to find solutions for that and also been writing the patch for it you just made my day totally unproductive.
I thought that was the idea of assigning that people will udnerstand that someone is working on this. I'm unassigning myself and let anyone who wants do this do so.
Comment #131
Wim LeersI'm sorry to have uploaded a patch in the mean time. :( But after several miscommunications, I thought this would be clearer for everybody. I was concerned we'd continue to walk in circles. I'm very sorry if you were working to do exactly the same!
But … quite often it's actually relatively productive for two people to have done the exact same work. Because it means you're in a great position to review the changes I'm proposing. It makes it easy for you to spot problems in my patch. In other words, where you did things differently, why did you?
Please consider posting a review based on your experience/the work you were doing in parallel.
In any case, my apologies :(
Comment #132
pfrenssen@filijonka, please don't be offended, and don't consider your work to be in vain. These critical issues tend to attract a lot of attention, and it is not unusual for people to work on it in parallel. @WimLeers is also very invested in this issue and you can see from his comments here that his only intention is to help solving this important issue.
Your work here has been invaluable, it would be a great loss if you are put off working on this further.
Comment #133
filijonka CreditAttribution: filijonka commentedThis code won't work.
Best way to solve this is to use Symfony/dom-crawler
and change it to this:
The dom-crawler is if I understand it correctly also considered in issue #2232861: Create BrowserTestBase for web-testing on top of Mink
Comment #134
twistor CreditAttribution: twistor commentedIf you want to use the Symfony dom-crawler, you might as well use a regular DOMDocument, that's what it uses internally.
To use the HTML5 Dom, and XPath, you have to do this.
Comment #135
twistor CreditAttribution: twistor commentedI have some *very* old, but well tested, code that could be updated to add the namespaces automatically to the XPath queries.
http://cgit.drupalcode.org/feeds_xpathparser/tree/FeedsXPathParserQueryP...
Comment #136
Wim LeersCan you explain why that won't work? XPath logic works fine in
WebTestBase
, so why can't it work correctly there?Comment #137
twistor CreditAttribution: twistor commentedMy guess would be that Drupal\simpletest\AssertContentTrait::parse() is still using a regular \DOMDocument.
Comment #138
twistor CreditAttribution: twistor commentedI'm just going to go ahead and run with this for a bit.
Comment #139
Wim LeersThanks! :)
Comment #140
twistor CreditAttribution: twistor commentedI stuck the XPathExpressionParser in simpletest for now, even though it doesn't belong there. Not sure what to do with it yet.
Let's see what explodes.
Comment #142
twistor CreditAttribution: twistor commentedSo, the HTML5 parser treats tags inside
<noscript>
elements as plain text. Not sure if that's correct, don't have time to go trawling through the spec at the moment.Comment #143
twistor CreditAttribution: twistor commentedComment #145
filijonka CreditAttribution: filijonka commentedtwo question of things I don't understand:
in #123 @Wim Leers writes
why use the HTML5 library directly? We should update our \Drupal\Component\Utility\Html to use this instead.
which was done in #128
This latest patch is not following that as far as I can see.
why not use Symfony/dom-crawler which is well tested and working? It will also be introduced by another patch as I wrote earlier.
Comment #146
andypost+1 to dom-crawler
PS: just compare usage https://packagist.org/packages/masterminds/html5 vs https://packagist.org/packages/symfony/dom-crawler
Comment #147
twistor CreditAttribution: twistor commentedI don't see what using dom-crawler has to do with anything. It would just be a slightly different version of what we have now. It uses the standard DOMDocument internally, so doesn't buy us anything besides maybe a nicer API.
What I'm trying to do in this patch is convert direct usage of \DOMDocument to HTML5 where it makes sense. That includes \Drupal\Component\Utility\Html, as-well-as parts of the testing framework.
My goal is to switch DOMDocument usage with HTML5 usage while changing the tests as little as possible.
Comment #148
filijonka CreditAttribution: filijonka commentedok I'm just trying to understand here;
In my comment #133 I presented a problem with DomXpath with a possible solution using the symfony dom-crawler.
In #134 you presented another solution (which I never got to work personally but might have not done correct).
But from there I got lost, the introduction of XPathExpressionParser? I had the understanding that this was a solution the problem I presented?
I also asked why there are calls directly to html5 like this outside the Html class
Trying to understand to be able to give better feedback, that's all.
Comment #149
twistor CreditAttribution: twistor commentedSome of there failures are because we produce invalid markup in forms and the HTML5 parser fails to add all of the input elements.
For example, when setting #field_prefix to
'<div class="container-inline">'
it gets wrapped in a span tag and things fail to parse.Comment #150
twistor CreditAttribution: twistor commenteddom-crawler is not a solution. Someone, somewhere, has to do XPath via the DOM. Whether that's using dom-crawler, or not, the XPath queries have to be namespaced correctly. dom-crawler doesn't do anything different than what SimpleTest already does. They solve a similar problem. It would be nice in the future to use dom-crawler, sure, but that won't help us now.
It works, as of #142 the render tests are passing.
The XPathExpressionParser needs a better name, possibly Namespacer, but it's just a way to automatically namespace an xpath query, so that we don't have to update every single one, and makes writing XPath queries significantly easier. It converts
//ul/li
into//x:ul//x:li
.Masterminds\HTML5 namespaces all HTML elements in the
http://www.w3.org/1999/xhtml
namespace. This means that to use XPath on it, you have to register the namespace withDOMXPath::registerNamespace(prefix, namespace)
, and then prefix every element with the same prefix used in the registerNamespace() call./x:div/x:a[@href = something]
Drupal\Component\Utility\Html
provides helpers for dealing with HTML. It needed to be updated as the API needs to be maintained. It was never suggested that that should be the only place masterminds/html5 should ever be used, simply that it needs to be used there. If you look at the patch, an existing API is never changed for direct usage of HTML5. But, most direct usage of the DOMDocument is.Comment #152
twistor CreditAttribution: twistor commentedAnother comment about Drupal\Component\Utility\Html.
The only thing that would be needed for simpletest would be Drupal\Component\Utility\Html::load(). This won't actually work at the moment, since we already have a fully formed html document and the load() call assumes the load is being called with the body text.
Regarding the XPathExpressionParser, don't focus on that too much right now. I have an updated version that easier to follow and debug, but I'm just using it at the moment to get tests to pass. It will be simple to remove, update, whatever once things get settled.
Comment #153
twistor CreditAttribution: twistor commentedThe commented out #field_prefix's are just to get the tests to pass, I haven't figured out the correct fix yet.
Comment #155
twistor CreditAttribution: twistor commentedThis is an interesting one. $this->entity->label() is returning a TranslationWrapper. The wrapper isn't being cast to a string, but rather, being set as attributes on the input element. This was happening before the patch. The reason the failures are happening now is that the HTML5 parser is strict about attributes it will accept.
I'm not sure where the actual fix should go, the current patch just illustrates what needs to happen. Perhaps in FormBuilder?
Comment #156
amateescu CreditAttribution: amateescu commentedThat sounds really weird. Can you post a backtrace for that label() call? In any case, we might want to do the string casting in
\Drupal\Core\Entity\Entity::label()
(i.e.$label = (string) $this->{$label_key};
).Comment #157
andypost@twistor there css-select in core https://www.drupal.org/node/2276689 is it possible to use it for xpath?
Comment #158
twistor CreditAttribution: twistor commented@andypost, Unfortunately, no.
Symfony\Component\CssSelector
just translates CSS selectors to XPath expressions. So the same problem exists, just a little lower.We're hiding the X-Path complexity for the most part by doing it AssertContentTrait::xpath(). That actually takes care of AssertContentTrait::cssSelect() automatically. The edge cases are where tests are using the DOM + DOMXPath directly.
For some reason, FieldRdfaTestBase and FieldWebTest are re-implementing parts of AssertContentTrait, so that's adding some noise to the patch.
Comment #159
twistor CreditAttribution: twistor commentedCleaned up the XPath thingy. I know everybody hates it, but I'm not sure what the alternative is.
Comment #161
twistor CreditAttribution: twistor commentedComment #163
twistor CreditAttribution: twistor commentedComment #164
mgiffordThe bot likes it well enough. I can't seem to apply it here though http://simplytest.me/project/drupal/8.0.x?patch[]=https://www.drupal.org...
Not sure what I could be missing. maybe it's something just weird with SimplyTest.me at the moment.
Thanks for pushing ahead on this @twistor
What are the steps for testing this before marking it RTBC?
Comment #165
daffie CreditAttribution: daffie commentedMy first review. It looks good. The patch also needs a reroll. And can you add a "do-not-test" patch without the vendor stuff. Dreditor does not like very large patches. Especially on an old computer like mine.
The options parameter encode_entities = FALSE is not necessary. It defaults to false.
Can we change this to @return array. And on the next line explain it so more.
Please add "use Drupal\Component\XpathHelper\Lexer" at the top of the page. It is probably technically not necessary, but it is very helpful for other programmers.
Reverse the tests from: 'line1
line2' to: 'line1
line2'. And so forth.
We need to make a followup issue for this.
Why this change?
Should this not be @group XpathHelper.
Can we rename this function to testLexer($expected, $input).
Can you add a docblock for the xpathProvider function. The default is to name the function providerSomething. My suggestion would be to name this function providerLexer() and providerNamespacer(). An other default is to have the first parameter be $expected. That is the reverse from what it is now.
Can we rename this function to testNamespacer($expected, $input).
Add a docblock for this function. And why twice the same test? What do you want to prove?
Comment #166
daffie CreditAttribution: daffie commentedI am not completely sure what is the best naming for the "@group" in the LexerTest and NamespacerTest classes.
Remove the space and the slash at the end of the meta tag.
The issue is fixed in the html5 library. So this can uncommented. Maybe at some more tests for this.
Comment #167
babruix CreditAttribution: babruix commentedThis is re-roll of patch in #163. Next will try to add changes from comments #165-166.
Comment #168
babruix CreditAttribution: babruix commentedShould we wait for test results before adding next patch?
Comment #169
babruix CreditAttribution: babruix commentedComment #170
babruix CreditAttribution: babruix commentedComment #172
babruix CreditAttribution: babruix commentedCommented out test blocked on https://github.com/Masterminds/html5-php/issues/68
Comment #173
catchLooks like that issue's been closed as fixed already?
Comment #174
joseph.olstad@mgifford, I had the same issue with patch 163 however patch 167 will build with simplytest.me
@babruix thanks for the work on the patch
*edit*
I tested the patch, my test case passes with or without the patch (patch 167). I had pasted some source html markup into a new node including an openning
<source>
and<track>
and closing</source>
and</track>
tags ,in html5 these
<source>
and<track>
tags are supposed to be void tags (no ending tag), in both tests I ran the end tag was properly removed during validation.This was observed in both of my tests where I tested with the patch and I also re-tested without the patch. both tests gave me desired results.
so my test case probably won't help much but I did test it.
*edit*
Thanks for everyones great work on this issue. I do have some feedback to those that know how to create this patch and have generated these patches. For clarity purposes it would be handy to have a "do not test" patch similar to the one that @chx created in comment #125 and perhaps if someone was really nice maybe a simple readme of where to put the library and what version of the library is meant for the do-not-test patch (a version ? what hash value of html5-php from mastermind/github). I read through this entire issue (all 170+ comments) before posting this.
*edit* where to put the library: the masterminds html5-php library goes in core/vendor
Comment #175
babruix CreditAttribution: babruix commented@catch yes, but that test still fails, because
becomes
<p>line1<br><hr>line2</p>
and we expect it to be
<p>line1<br/></p><hr/>line2
I have tried to change references to masterminds/html5 github latest commit, maybe I have missed something?
Comment #176
kim.pepperCan we please have an do-not-test patch that excludes the library?
Comment #178
joseph.olstad@babruix I added some details to the github queue for the html5-php masterminds library developers (the aforementioned github issue looks to be relevant however it was closed a while back)
From what I've read about the html5 masterminds library I'd expect this output from your test:
<p>line1<br></p><hr>line2
(this is how chrome/firefox do it)rather than:
<p>line1<br/></p><hr/>line2
(what the current test case assertion expects)this is because masterminds html5-php library is mimicking firefox/chrome interpretation of html5 but in this case below the actual output is not as expected
what we're currently seeing from html5-php masterminds (fail):
<p>line1<br><hr>line2</p>
It looks like there's still quite a bit of activity in the github queue for the html5-php masterminds library.
This is probably a good enough reason for a no-test-patch and a readme so that others can continue to quickly configure this with the latest pull of the library from gitlab. chx made a no-test-patch on comment #125 but there's no readme on using it .
Comment #179
joseph.olstad@babruix
I did some followup with the github community for html5-php masterminds
@Technosophos, insists that the drupal unit test in question needs to be verified with the w3.org spec http://www.w3.org/TR/2014/REC-html5-20141028/ with regards to the unit test case we're concerned with
What the html5-php developers are telling us is before they make a drupal specific exception for html5 validation because we think that the unit test should pass but does not then we should back it up with a reference to the html5 spec to say that our unit test is indeed following spec. It could be that chrome/firefox is not following spec in this case. We should verify the spec, make adjustments to the unit test (if necessary) or if necessary follow up with the html5-php masterminds issue queue with a reference to the w3c html5 spec .
Translation: They're asking us to first provide a specific reference to the html5 spec that justifies the unit assertion test https://qa.drupal.org/pifr/test/941363 prior to making a drupal specific exception in the library.
So in question is the unit test for "HTML corrector -- Move non-inline elements outside of inline containers." , is it valid html5 to w3c spec or not?
Comment #180
babruix CreditAttribution: babruix commentedChanged test to expect
<p>line1<br></p><hr>line2
and commented it out for now.Added "-do-not-test" in filename. Not sure which readme is needed, I assumed it is only to skip testing by testbot?
Regarding HTML5 specs, I haven`t found what should be normalised behaviour for
<p>line1<br/><hr/>line2</p>
.Seems we are expecting that
<hr>
tag will be moved out of<p>
tag after normalize,however test message says "Move non-inline elements outside of inline containers."
As paragraph is a block level element, should we change message or the test itself is incorrect?
Comment #181
joseph.olstadHi @babruix, just fyi: the do-not-test patch from #180 will not patch using simplytest.me
I did some additional tests with simplytest.me and patch 175 no longer applies to 8.0.x dev either
and then I decided to see if there is a problem with simplytest.me or 8.0.x dev, and re-tested patch 167 with simplytest.me and it ALSO no longer will build. What is curious is that yesterday morning I WAS able to apply patch 167 successfully. SO I am wondering if a recent change to the 8.0.x dev branch is affecting things or else perhaps a problem with simplytest.me?
http://simplytest.me/project/drupal/8.0.x?patch[]=https://www.drupal.org/files/issues/1333730-180-do-not-test.patch
http://simplytest.me/project/drupal/8.0.x?patch[]=https://www.drupal.org/files/issues/1333730-175.patch
http://simplytest.me/project/drupal/8.0.x?patch[]=https://www.drupal.org/files/issues/1333730-167.patch
So if I understand correctly if someone wanted to test this patch with a different checkout/hashtag/version of the masterminds vendor code all they'd have to do is apply the patch, say patch 175 (which is applying) to the latest drupal 8 core dev branch of 8.0.x and then remove the /core/vendor/masterminds files that is inserted by the patch and then replace that /core/vendor/masterminds with the desired version of masterminds/html5 that they want, place it in /core/vendor and then either make their own patch or run their own tests.
I also grabbed the latest drupal 8.0.x core dev tried applying patch 175 and it will no longer apply. Here's the output from git apply:
translation:
the patch failed : composer.json:28
composer.json :the patch won't apply
composer.lock :no file or folder of this type
Thanks again @babruix for your hard work on this issue.
Comment #182
babruix CreditAttribution: babruix commentedI have rerolled patch and changed HTML5 lib code to the latest release. Now test passes on my local env.
Should we have separate patch with "do-not-test" in name that excludes the external library?
Or we should always have "do-not-test" in patch name for this issue and test manually?
Comment #183
babruix CreditAttribution: babruix commentedAdded do-not-test patch that excludes the html5 library.
Comment #184
babruix CreditAttribution: babruix commentedComment #185
joseph.olstadHi @babruix, I reset my local clone of drupal core dev 8.0.x , did a git pull
then applied your patch 182, it successfully applied (with whitespace warnings on a few lines of masterminds code)
so the corresponding checkout of drupal core 8.0.x dev would be:
1e08b50f421bd (Fri Jan 9 15:14:23 2015 +0000)
patch 182 is tested to build on that checkout.
We'll see if we have a winner. Seeing as how fast dev core is moving it'd be nice to get the patch moving quickly , it might seem like trying to get airspace at the kennedy airport.
Comment #186
babruix CreditAttribution: babruix commentedHere that test is uncommented finally and library correctly updated.
Passes on local.
Comment #187
daffie CreditAttribution: daffie commentedGood work @babruix!
If I look at the function name then it suggest that it tests the value of the static $cache variable. But in the function there is no testing for that. The function also needs a docblock.
Move this to the docblock of providerLexer().
Move this to the docblock of providerNamespacer().
Comment #188
babruix CreditAttribution: babruix commented1. Which name would you suggest? How about
testStaticCache
? Because it returns from static value (if that exists):static::$cache[$prefix][$xpath]
2. and 3. Do not see the point to move docblock parameters to methods which have no arguments.
Comment #189
daffie CreditAttribution: daffie commented1. If we add a test function called testCache(), we should test the working of the class variable called $cache. An other possibility is to remove the function testCache().
2. and 3. For the provider-functions return an array with two elements ($expected and $input). Those need to be documented in a docblock. We do this at the provider-function because some provider-functions are used multiple times.
Comment #190
Mile23Daffie asked me to look at the
testCache()
method.This test method tests two methods,
prefix()
andlocalize()
.Some analysis of these methods reveal that
localize()
is a special case ofprefix()
, and should really just callstatic::prefix($xpath, NULL);
But for the purposes of testing... Both
prefix()
andlocalize()
have a dependency on a static cache variable, and the methodparse()
, which should definitely have unit test coverage.In an ideal world,
parse()
would get refactored to accept parameters instead of using class properties, but forget that for now. You can use reflection and other techniques to poke in dependencies.So to sum up:
localize()
to just callprefix()
with NULL.testPrefix()
with a@dataProvider
that hits a few edge cases.testLocalize()
with a@dataProvider
if you didn't refactor (and maybe even if you did).testPrefix()
andtestLocalize()
replacetestCache()
, which should then go away.testParse()
.Thanks.
Comment #191
Mile23daffie asked me to take this on.
Comment #192
Mile23OK, so some of my analysis was a little off, but here's something to chew on.
BTW: Anything static is automatically an anti-pattern. :-)
Comment #193
daffie CreditAttribution: daffie commentedI have made some minor changes to the patch. Which I have listed below. They are mostly documentation. I have tried to make an interdiff.txt file, but I did not succeed.
Change the docblock and renamed the function from testLexer() to testLex().
Added a the docblock and renamed the function from providerLexer() to providerLex().
These change are with the following function: testLocalize(), testPrefix() and testParse().
This change is on line 179 in the function testParse().
With the minor changes that I made it all looks good to me.
With the addition from @Mile23 the testing is completed. Thanks Mile23! :-)
All documentation is in order.
The changes is this patch are minor, so I have taken the liberty to give this patch a RTBC.
Comment #194
alexpottLooks like we need a change record.
Comment #195
daffie CreditAttribution: daffie commentedAdded a change record for this issue: [#2407077].
Comment #196
daffie CreditAttribution: daffie commentedComment #197
Wim LeersThe change record is very misleading:
. This should mention that it's Drupal 8's filter system that now supports it.Comment #198
daffie CreditAttribution: daffie commented@winleers: Made the requested change to the change record.
Comment #199
Wim LeersUnlike #165, I think it'd be better to explicitly list the
'encode_entities' => FALSE
option, just to be on the safe side when upgrading in the future, and to document/convey that Drupal doesn't encode entities. But that's a matter of opinion, so it's fine to keep it as is.(Also: I'd swear the default was different when I worked on it!)
This is a HUGE amount of code being added. Why do we need this?
Needs to be resolved before it can be committed.
So this is what that big amount of code earlier is for. But why exactly is this necessary?
What are the use cases for this in Drupal core? Let's document them here.
This seems wrong.
Comment #200
Wim LeersI forgot to say the two most important things: *thank you*, *incredible* progress has been made here! :)
Comment #201
daffie CreditAttribution: daffie commentedThank you @Mile23 for adding some of the tests.
Comment #202
dawehnerGiven that amount of ifs in that test I would wonder whether it would be better to just write a couple of dedicated test methods with proper names instead.
Comment #203
cosmicdreams CreditAttribution: cosmicdreams commentedIs it ok to update the change notice. I have a revision that I've worked up.
Comment #204
daffie CreditAttribution: daffie commented@cosmicdreams: give it your best :-)
Comment #205
cosmicdreams CreditAttribution: cosmicdreams commentedhttps://www.drupal.org/node/2407077 look good / accurate?
Comment #206
Mile23There's no good way to break it out and keep the expectations. You'd just end up with three different tests of escalating complexity, the third one being the same as
testParse()
.Comment #207
Mile23I did a coverage report for
Namespacer
andLexer
, and here's a test that addresses the biggest risk,Namespacer::shouldCopy()
, which has a CRAP score of 156. 12 with this patch.Lexer::readToken()
is the other big risk, with CRAP of 72 but I'm short on time.Comment #208
dawehnerOkay, fair.
Comment #209
Wim LeersNone of #199 has been addressed yet, NW for that.
Comment #210
daffie CreditAttribution: daffie commented@wimleers: I have tried to answer your questions as good as I can.
Comment #211
joseph.olstad@Wim Leers,
regarding comment #199,
item 1) I'll put your own comment in"a matter of opinion, so it's fine to keep it as is."item 2)please see @twister comment #150item 3)good suggestion, just removed those // lines of code , they aren't necessary as they're just commented out lines of codeitem 4)"But why exactly is this necessary?" please see again @twister comment #152 and his other comments and also please see the issue descripton under "Chosen solution" and the various discussions that led up to the helper classes being included in the patch (it's quite a long discussion).item 5)"// Check for XML preamble" . use case for this; checking for markup type before validation, it is logical to check that input is xml and should be validated as xml (not as html5) so this check is necessary. ( this is how I interpret this section of code and the description)item 6) Any suggestions for making this line of code clearer? according to the api docs on public function FormState::get an array is returned containing the property from the form. So one might ask, is it normal to put this "get('property')" into a boolean check? I'd say let the php interpreter deal with it for now, we could always revise it down the road in a new issue after the commit.
thanks.
Comment #212
daffie CreditAttribution: daffie commentedThe added test for
Namespacer::shouldCopy()
in #207 is for me RTBC. Thanks @Mile23!Comment #213
xjm@catch, @alexpott, @webchick, @effulgentsia and I discussed this issue this week. Thanks so much to everyone to sticking with this challenging issue! The examples of this bug in the issue summary (where filter.module does not understand valid HTML5 markup and sometimes even turning valid HTML5 markup into invalid markup) definitely indicate a major bug since we want to fully support HTML5 in Drupal 8. The bug primarily affects markup entered as user input in "blob" fields (as well as test code), and does not regress the overall HTML5 support on the rest of the page.
The example of filter module breaking the DOM is the most serious, but that will only occur under a specific set of circumstances. Furthermore, there is no impact on data integrity, since the user's original input is still stored (just rendered incorrectly on output.) So, the bug does not render the whole filter module unusable, so per our issue priority policy, this is a major bug, not a critical one.
We also discussed scope management for the issue. The attached patch adds the new library to vendor, adds new core components to implement the library, makes broad changes in the filter module to use it for parsing, and also fixes simpletest and all the tests throughout core that are not HTML5-compliant. This is many different bug scopes in one patch, so we have a disruptive 500K patch. To make these fixes less disruptive and more reviewable, let's split this up into 3 or more separate issues. I'd recommend:
-do-not-test.patch
along with the full patch so that we can review the changes to core that are needed to add the library to vendor. (Major task with the "blocker" tag.)I've added the beta evaluation to the summary for this issue for the filter bug and marked it postponed (though we might want to create a new issue once the library issue goes in). Who can help with the task of creating the new issues, splitting up the patch, and moving the unaddressed feedback for the relevant parts into the relevant issues?
Comment #214
jibranApparently this blocks #2331685-37: PHPUnit deprecated assertTag(), assertNotTag() and assertSelectEquals() which is critical so this should be critical.
Comment #215
daffie CreditAttribution: daffie commented@jibran: There are different options for solving [#9492471]. And one possibility has to do with using this issue.
Comment #216
xjmThanks @jibran and @daffie. See #2331685-44: PHPUnit deprecated assertTag(), assertNotTag() and assertSelectEquals(); this is not a blocker to that issue.
Comment #217
cweagansOf interest for people following this issue: see #2415305: Replace core text filters with versions that are HTML5 aware for a (perhaps overly simplistic) D7 implementation of this same library in html5_tools
Comment #218
daffie CreditAttribution: daffie commentedThis issue is changed to meta issue. See comment #213.
The first child issue is #2429363: Add HTML5-lib to Drupal 8 core for the filter system and for the testing system.
Comment #219
BerdirThat went in, great.
Question, will this also help to fix #2327341: Xss::attributes() corrupts data attributes that contain colons or JSON-encoded data (and the three related/duplicate issues)?
Comment #220
Hanno CreditAttribution: Hanno commentedGreat that html5-lib is in!! Thanks all for all the efforts. @Berdir I'm afraid this won't help that issue, as that is caused by another filter function in the Drupal system.
Comment #221
Wim Leers#2327341: Xss::attributes() corrupts data attributes that contain colons or JSON-encoded data is now fixed for D8 (it's open only for D7 backport).
Comment #222
Wim LeersWe should try to move this forward again.
Comment #223
mgiffordAgreed.. Important for being a mobile & accessible UI.
Comment #224
twistor CreditAttribution: twistor as a volunteer commentedComment #225
twistor CreditAttribution: twistor as a volunteer commentedI'm just going to re-write this patch. It will be much simpler since the HTML5 parser now supports the disable_html_ns option.
Comment #226
twistor CreditAttribution: twistor as a volunteer commentedThis is mostly a find/replace with the HTML5 parser.
Comment #227
daffie CreditAttribution: daffie commentedHi twistor, this issue has become a meta issue. The patch has been moved to #2441373: Upgrade tests to HTML5. See comment #213 for move info about this. And thank you for working on this problem. :-)
Comment #229
daffie CreditAttribution: daffie commentedComment #230
twistor CreditAttribution: twistor as a volunteer commentedAhh, thanks, I knew that at some point.
Comment #233
victor-shelepen CreditAttribution: victor-shelepen commentedHello. What is a state of the issue? I want to suppress warnings because of HTML5 rules. I have got the related issue here. #2802269: drupal-entity-inline invalid in Entity
Comment #234
daffie CreditAttribution: daffie commentedIn comment #213 have the D8 maintainers decided that it should be solved in multiple child issues. So changing this to the category "plan". The first issue to be solved is #2667340: Usage of field_prefix and container-inline creates invalid markup..
Comment #241
lhguerra CreditAttribution: lhguerra at Taller commentedHave we gave up on this? I tried the related issues but most of the job seems stale for years. I'm having the xml:lang issue and it is keeping my portal pages from being indexed by Google's AMP service.
By the way, in my case, the
xml:lang
isn't only being added like a side-effect but it is also being added twice!This markup:
Becomes this markup:
This happens when adding the basic_html format to a field.
Comment #242
Wim LeersThis came up again over at #3191041: [PP-1] PHP Warning: DOMDocument::loadHTML(): Tag drupal-media invalid in Entity, line: 34 in Drupal\Component\Utility\Html::load() — with the
media_embed
filter that was added in #2940029: Add an input filter to display embedded Media entities we're going to increasingly have custom HTML tags stored/managed/parsed by Drupal, and hence the importance of this issue is increasing.Comment #243
catchIt looks like the next steps here are:
#2667340: Usage of field_prefix and container-inline creates invalid markup. which blocks
#2441373: Upgrade tests to HTML5 which blocks
#2441811: Upgrade filter system to HTML5
Comment #244
effulgentsia CreditAttribution: effulgentsia at Acquia commentedComment #245
andypostFaced it with #3133749-24: Deprecate masterminds/html5 as a production dependency for Drupal 10
Comment #246
andypostDigging it and found that this bug gone on old tracker https://bugzilla.gnome.org/show_bug.cgi?id=761534 and no new issue filed in gitlab
On PHP side looks it also hanging in https://bugs.php.net/bug.php?id=60021
Comment #247
alexpott@andypost here's the new link to the HTML5 support in libxml2 - https://gitlab.gnome.org/GNOME/libxml2/-/issues/211
Comment #252
Ghost of Drupal PastNote https://wiki.php.net/rfc/domdocument_html5_parser
Comment #253
andypostAs just commited #2441811: Upgrade filter system to HTML5
TODOs
- needs to check how many parts are still affected by libxml
- triaging remaining child issues
Comment #254
fgmJust to be sure, is there no equivalent issue for whitespace in pure XML documents ? Asking for XML-RPC.
Also, is there any impact on contrib/custom filters ? Asking for Glossary 2. And Glossify may have the same question.
Comment #255
andypostXML-RPC very probably affected as extension has no maintainer https://pecl.php.net/package/xmlrpc and PHP 8.3 got a lot of dom/libxml fixes but in long run https://wiki.php.net/rfc/domdocument_html5_parser
Comment #256
fgm@andypost : Drupal XML-RPC stopped using the XML-RPC extension around the PHP4.7 IIRC, so it doesn't matter.
However, reexamining it, we've actually changed it to use the XMLParser extension (meaning the expat parser) instead, so there is no issue about libxml.
Comment #257
longwave@fgm With the move to the HTML5 parser/serializer, in most cases there should be no impact on rendering or parsing HTML, but there are some edge case changes if you rely on exact string matches or significant whitespace (e.g. in tests) - see the change record for the examples we discovered in tests in core: https://www.drupal.org/node/3225468
Comment #258
andypostPHP 8.4 accepted HTML5 parser https://github.com/php/php-src/commit/f09340905263d0abb7cb9d60dc2dd1c1a5...
Comment #259
andypostLooks like all child issues are done so it could be marked fixed in 10.2?!
Comment #260
Wim Leers@andypost I was thinking the same thing! 😄
I just went over all the related issues that are in Drupal core and were not yet closed:
I think the honor is @longwave's 🤓
Comment #261
needs-review-queue-bot CreditAttribution: needs-review-queue-bot as a volunteer commentedThe Needs Review Queue Bot tested this issue.
While you are making the above changes, we recommend that you convert this patch to a merge request. Merge requests are preferred over patches. Be sure to hide the old patch files as well. (Converting an issue to a merge request without other contributions to the issue will not receive credit.)
Comment #262
nod_Comment #263
longwaveAgree that all child issues have been fixed, some related issues still remain where we can clean things up but the primary case of using libxml2 for HTML parsing is no longer an issue as we are using the masterminds HTML5 library instead now.
Tried to credit everyone who made useful comments or helped move this issue forward in some way. Thanks everyone, I'm so happy that Drupal now fully supports HTML5!
Comment #264
longwave