The real reason for wrapping the snippet in a HTML document was that DOM can't export a snippet back only a full document so you needed a wrapper around it. That's totally unnecessary IMO. If you are worried about UTF-8 there's an issue open already to throw validation errors on entering those. The database won't store them anyways.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Status: Needs review » Needs work

The last submitted patch, 72_lines_less_with_simplexml.patch, failed testing.

chx’s picture

Status: Needs work » Needs review
FileSize
4.22 KB

I made a typo there.

Damien Tournoud’s picture

The reasons all this is there are documented, both in the code and in the commit messages. Doing a simple HTML-to-XML is definitely not necessary.

chx’s picture

Care to copy-paste some here? Even, write tests that fail with this code?

chx’s picture

FileSize
2.54 KB

Well, small moves then: remove the ugly step-by-step. I will add a test.

Edit: https://bugs.php.net/bug.php?id=54429 code comes from here.

Edit: i tested with script and style tags both.

chx’s picture

FileSize
4.72 KB

Well, #721536: HTML corrector filter has problems with unescaped CDATA and incorrectly closed tags clearly stated it's only the XML parser that has problems with this. Well, converting to text nodes instead of adding CDATA makes it work and just look at the tests how much prettier the output is. The code is much nicer too.

JacobSingh’s picture

It's been too long for me to recall of the intricacies of this, but the fix seems reasonable. My only concern is that the test has changd, but chx explained that we now convert to CDATA and then iterate through and remove the CDATA elements and turn them to text nodes. I guess this won't cause problems. So RTBC, but I'd like to see us move away from this kludge eventually and use a source code corrector / formatter library instead of the DOM extension + hacks.

JacobSingh’s picture

Status: Needs review » Reviewed & tested by the community
JacobSingh’s picture

Status: Reviewed & tested by the community » Needs work

heh, I guess not. I sent this to chx to test: <p><script type="text/javascript">alert("<script>test</script>")</script></p> And it borked.

chx’s picture

Yes but it's broken on the original as well!

chx’s picture

Assigned: chx » Unassigned
chx’s picture

This is impossible to fix.Feeding <p><script type="text/javascript">1 > 0; alert("<script>test</script>")</script></p> into loadHTML results in <p><script type="text/javascript">1 > 0; alert("<script>test</script>")</p> the closing script tag is simply gone immediately. The only way to fix this is to not use the DOM extension.

sun’s picture

chx’s picture

Title: Simplify the HTML corrector » Use a proper HTML parser for every core filter

Feeding <p><script type="text/javascript">alert("<script>test</script>")</script></p> into filter_xss doesnt yield <p></p> as it should either.

grendzy’s picture

das-peter’s picture

Being evil here #998590-22: Prevent double CDATA section escaping in filter_dom_serialize_escape_cdata_element() to avoid warnings lead me to this issue.
In a discussion with sun, eugenmayer and derein, about the above mentioned issue, eugenmayer brought up a link to this html purifier: http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/

According to chx this could be a candidate for this issue - thus it shall be mentioned here :)

Here the result of a short test with it (no further configuration of the purifier):
Input:
<p><script type="text/javascript">1 > 0; alert("<script>test</script>")</script></p> and <p><script type="text/javascript"><!--//--><![CDATA[// ><!-- alert("test");//--><!]]></script></p>
Output:
<p><script type="text/javascript">1 &gt; 0; alert("test</script>")</p> and <p><script type="text/javascript"><!--// --><![CDATA[// ><!-- alert("test");//--><!]]></script></p>

grendzy’s picture

FileSize
9.03 KB

chx: re #12 what would the expected behavior be? According to http://www.w3.org/TR/html5/syntax.html#cdata-rcdata-restrictions that's not valid HTML:

The text in raw text and RCDATA elements must not contain any occurrences of the string "</"…

WebKit produces the same DOM as loadHTML in this case:

The first </script> closes the element and the final, spurious </script> is removed.

chx’s picture

We can find simpler broken stuff. Try <script>alert(']]>');</script> for example (coming from the issue das-peter linked)

grendzy’s picture

I haven't read all of #998590: Prevent double CDATA section escaping in filter_dom_serialize_escape_cdata_element() to avoid warnings, but CDATA sections have a similar limitation, right? http://www.w3.org/TR/html5/syntax.html#cdata-sections

the text must not contain the string "]]>".

Anyway, if there are fatal flaws in PHP DOM that prevent us from using it, can we get links to a bug report with the upstream vendor? If I understand correctly, that would be https://bugzilla.gnome.org/buglist.cgi?product=libxml2.

chx’s picture

You constantly link to HTML5 -- do we expect and rely , even for security (!) that we run in an HTML5 compatible browser-parser?

chx’s picture

Priority: Normal » Major

also just because a standard says the text must not contain the string "]]>". what happens if it does?

grendzy’s picture

I'll refer to html5 again, since it is the first version of html to specify detailed parsing rules and error handling.

Consume every character up to the next occurrence of the three character sequence U+005D RIGHT SQUARE BRACKET U+005D RIGHT SQUARE BRACKET U+003E GREATER-THAN SIGN (]]>), or the end of the file (EOF), whichever comes first. Emit a series of character tokens consisting of all the characters consumed except the matching three character sequence at the end (if one was found before the end of the file).

Switch to the data state.

Once in the data state, subsequent characters are consumed as plain text (at least until the next & or < is encountered).

Of course we can't assume the user agent follows html5 rules, but the error handling rules in html5 were based on largely existing browser behaviors. In this case XHTML behaves the same, unless the subsequent text node causes a parse error - in which case the document is either firmly rejected (with XML mime type), or error-corrected in a user-agent-dependant way.

In IRC chx suggested Internet Explorer might have flaws that would leave it vulnerable when html was parsed and filtered according to PHP DOM rules. I don't know much about IE's weaknesses here, so can't really comment. If anyone has an example that will actually execute in IE after being parsed / filtered with PHP DOM that would be useful.

chx’s picture

OK so I createda file continaing <html><body><script>alert(']]>');</script> </body></html> loaded into Chrome and it alerted ]]> just fine. If you try to DOM-load that and then CDATA-escape it, all hell breaks lose. Here's my preferred method (because it's very easy to see), namely convert it into SimpleXML then dump it: <html><body><script><![CDATA[alert(']]]]><![CDATA[>');]]></script></body></html> (simplexml_import_dom(DOMDocument::loadHTMLFile('test.html'))->asXml())

If you run current HEAD on this snippet, a warning galore is all you get.

And yet, the browser works.

grendzy’s picture

Thanks, that clears up the example. Sorry if this is a dumb question... what was the purpose of adding the CDATA wrapper? It seems like that is the root if this issue, and not any particular flaw in the parser. Was it just for consistency with the XHTML doctype? Now that the we have #1077566: Convert html.tpl.php to HTML5, perhaps we no longer need to add the CDATA wrapper? (Omitting the wrapper causes no problem for html5, or html4 for that matter).

Thinking more about SimpleXML... in addition to this issue if there are other behaviors we don't like in the future... we are pretty much powerless to change it. Even if we have an issue accepted by the libxml developers it could take years to filter downstream into CentOS / Ubuntu etc.

If we are searching for alternatives has anyone looked at http://code.google.com/p/html5lib/ ? It 's based directly on the WHATWG specification so we could have a filter system that's truly fluent in HTML with tokenizing / parsing rules consistent with modern browsers.

chx’s picture

I did look at http://code.google.com/p/html5lib/downloads/list and immediately scratched the idea, the project seems to be dead.

#721536: HTML corrector filter has problems with unescaped CDATA and incorrectly closed tags is the original issue. We need some sort of HTML corrector that keeps JavaScript alive while correcting HTML.

greggles’s picture

htmlpurifier seems like a good call as well. It has lots of tests and decent usage - http://drupal.org/project/htmlpurifier

sun’s picture

At some point, there was a rumor that parts of our filter system would be based on a very old (initial) version of htmlpurifier.

So check_markup() was check_output() before, and filter_xss() was valid_input_data() before. filter_xss() is borrowed from http://sourceforge.net/projects/kses/ but has never been updated to the latest version of 2005. That said, neither check_output() nor kses is based on htmlpurifier; kses is based on http://savannah.nongnu.org/projects/gnuheter but diverged from it. And so did we.

grendzy’s picture

FWIW, I asked in #whatwg about the html5lib project:

grendzy: Hi! Drupal community is looking for a more sophisticated parser to replace PHP DOM (a.k.a SimpleXML, I think based on libxml2). Is http://code.google.com/p/html5lib/ abandoned? Last commit was almost 2 years ago. Thanks!
jgraham: I am not aware thatanyone is actively working on the PHP port
jgraham: If you would like to take over that would be easy to arrange
jgraham: But you should maybe check the performance before you decide what you want to do
smaug____: wasn't there some plan to support hsivonen's parser with libxml2
smaug____: grendzy: take hsivonen's parser, and generate php code from java files
hsivonen: smaug____: there's a plan. now that View Source is out of the way, it might actually become real
erlehmann: grendzy, as far as i can say, html5lib was usable 1 year ago.
erlehmann: i used the PHP portion for a wordpress plugin.
erlehmann: and am now using python.
erlehmann: PHP is pig disgusting.
miketayl_r is now known as miketaylr.
grendzy: thanks folks… anyone mind if I quote this chat on a drupal.org discussion?
AryehGregor: Go ahead.
AryehGregor: It's publicly logged.
grendzy: cool, thanks again for the feedback

YesCT’s picture

I think an issue summary update (tips to do that: http://drupal.org/node/1427826)
and some suggested next steps would be helpful here.

I'll take a stab:
Next step:
implement an initial try at a patch that would use: htmlpurifier

mgifford’s picture

Pancho’s picture

Status: Needs work » Postponed

Not exactly. But unsurprisingly, both filter issues lead to the search for a replacement to PHP DOM aka SimpleXML.
The other library looks more promising, though.
So we should probably postpone this one on #1333730: [Meta] PHP DOM (libxml2) misinterprets HTML5 and focus our efforts there. As soon as we're having a proper HTML5 parser, this here will be a very straightforward followup.

mgifford’s picture

Sounds good.

Berdir’s picture

Issue summary: View changes
Status: Postponed » Active

I think this can be unpostponed now? The linked issue is still open but it is a meta, we have a html5 parser library now.

Hanno’s picture

Hanno’s picture

Version: 8.0.x-dev » 8.1.x-dev

Drupal 8.0.6 was released on April 6 and is the final bugfix release for the Drupal 8.0.x series. Drupal 8.0.x will not receive any further development aside from security fixes. Drupal 8.1.0-rc1 is now available and sites should prepare to update to 8.1.0.

Bug reports should be targeted against the 8.1.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.9 was released on September 7 and is the final bugfix release for the Drupal 8.1.x series. Drupal 8.1.x will not receive any further development aside from security fixes. Drupal 8.2.0-rc1 is now available and sites should prepare to upgrade to 8.2.0.

Bug reports should be targeted against the 8.2.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.6 was released on February 1, 2017 and is the final full bugfix release for the Drupal 8.2.x series. Drupal 8.2.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.3.0 on April 5, 2017. (Drupal 8.3.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.3.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.6 was released on August 2, 2017 and is the final full bugfix release for the Drupal 8.3.x series. Drupal 8.3.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.4.0 on October 4, 2017. (Drupal 8.4.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.4.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.4 was released on January 3, 2018 and is the final full bugfix release for the Drupal 8.4.x series. Drupal 8.4.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.5.0 on March 7, 2018. (Drupal 8.5.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.5.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.6 was released on August 1, 2018 and is the final bugfix release for the Drupal 8.5.x series. Drupal 8.5.x will not receive any further development aside from security fixes. Sites should prepare to update to 8.6.0 on September 5, 2018. (Drupal 8.6.0-rc1 is available for testing.)

Bug reports should be targeted against the 8.6.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.6.x-dev » 8.8.x-dev

Drupal 8.6.x will not receive any further development aside from security fixes. Bug reports should be targeted against the 8.8.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.9.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 8.8.x-dev » 8.9.x-dev

Drupal 8.8.7 was released on June 3, 2020 and is the final full bugfix release for the Drupal 8.8.x series. Drupal 8.8.x will not receive any further development aside from security fixes. Sites should prepare to update to Drupal 8.9.0 or Drupal 9.0.0 for ongoing support.

Bug reports should be targeted against the 8.9.x-dev branch from now on, and new development or disruptive changes should be targeted against the 9.1.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

longwave’s picture

Version: 8.9.x-dev » 9.3.x-dev

I think this can be closed as a duplicate of #2441811: Upgrade filter system to HTML5 now.

daffie’s picture

Status: Active » Closed (duplicate)

Lets closed this issue as a duplicate of #2441811: Upgrade filter system to HTML5.