Use a proper HTML parser for every core filter [#1277290]

Comment	File	Size	Author
	72_lines_less_with_simplexml.patch	4.27 KB	chx

#2	1277290.patch	4.22 KB	chx

#5	1277290_5.patch	2.54 KB	chx

#6	1277290_6.patch	4.72 KB	chx

#17	webkit_script.png	9.03 KB	grendzy

Comment #1

13 September 2011 at 04:22

Status: Needs review » Needs work

The last submitted patch, 72_lines_less_with_simplexml.patch, failed testing.

Comment #2

chx commented 13 September 2011 at 04:26

Status: Needs work » Needs review

File	Size
1277290.patch	4.22 KB

I made a typo there.

Comment #3

Damien Tournoud commented 13 September 2011 at 15:27

The reasons all this is there are documented, both in the code and in the commit messages. Doing a simple HTML-to-XML is definitely not necessary.

Comment #4

chx commented 13 September 2011 at 15:29

Care to copy-paste some here? Or even write tests that fail with this code?

Comment #5

chx commented 13 September 2011 at 19:18

File	Size
1277290_5.patch	2.54 KB

Well, small moves then: remove the ugly step-by-step. I will add a test.

Edit: the code comes from https://bugs.php.net/bug.php?id=54429.

Edit: I tested with both script and style tags.

Comment #6

chx commented 13 September 2011 at 19:36

File	Size
1277290_6.patch	4.72 KB

Well, #721536: HTML corrector filter has problems with unescaped CDATA and incorrectly closed tags clearly stated it's only the XML parser that has problems with this. Well, converting to text nodes instead of adding CDATA makes it work and just look at the tests how much prettier the output is. The code is much nicer too.

Comment #7

JacobSingh commented 13 September 2011 at 20:57

It's been too long for me to recall all of the intricacies of this, but the fix seems reasonable. My only concern is that the test has changed, but chx explained that we now convert to CDATA and then iterate through and remove the CDATA elements and turn them into text nodes. I guess this won't cause problems. So RTBC, but I'd like to see us move away from this kludge eventually and use a source code corrector / formatter library instead of the DOM extension + hacks.

Comment #8

JacobSingh commented 13 September 2011 at 20:57

Status: Needs review » Reviewed & tested by the community

Comment #9

JacobSingh commented 13 September 2011 at 20:59

Status: Reviewed & tested by the community » Needs work

heh, I guess not. I sent this to chx to test: <p><script type="text/javascript">alert("<script>test</script>")</script></p> And it borked.

Comment #10

chx commented 13 September 2011 at 21:36

Yes but it's broken on the original as well!

Comment #11

chx commented 14 September 2011 at 01:09

Assigned: chx » Unassigned

Comment #12

chx commented 16 September 2011 at 05:32

This is impossible to fix. Feeding <p><script type="text/javascript">1 > 0; alert("<script>test</script>")</script></p> into loadHTML results in <p><script type="text/javascript">1 > 0; alert("<script>test</script>")</p>: the closing script tag is simply gone immediately. The only way to fix this is to not use the DOM extension.

Comment #13

sun commented 16 September 2011 at 12:31

Comment #14

chx commented 27 September 2011 at 23:27

Title: Simplify the HTML corrector » Use a proper HTML parser for every core filter

Feeding <p><script type="text/javascript">alert("<script>test</script>")</script></p> into filter_xss doesn't yield <p></p> as it should, either.

Comment #15

grendzy commented 27 September 2011 at 23:35

Comment #16

das-peter commented 11 November 2011 at 08:43

Being evil here #998590-22: Prevent double CDATA section escaping in filter_dom_serialize_escape_cdata_element() to avoid warnings led me to this issue.
In a discussion with sun, eugenmayer and derein, about the above mentioned issue, eugenmayer brought up a link to this html purifier: http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/

According to chx this could be a candidate for this issue - thus it shall be mentioned here :)

Here is the result of a short test with it (no further configuration of the purifier):
Input:
<p><script type="text/javascript">1 > 0; alert("<script>test</script>")</script></p> and <p><script type="text/javascript"><![CDATA[// ><!]]></script></p>
Output:
<p><script type="text/javascript">1 > 0; alert("test</script>")</p> and <p><script type="text/javascript"><![CDATA[// ><!]]></script></p>

Comment #17

grendzy commented 11 November 2011 at 17:48

File	Size
webkit_script.png	9.03 KB

chx: re #12 what would the expected behavior be? According to http://www.w3.org/TR/html5/syntax.html#cdata-rcdata-restrictions that's not valid HTML:

The text in raw text and RCDATA elements must not contain any occurrences of the string "</"…

WebKit produces the same DOM as loadHTML in this case:

The first </script> closes the element and the final, spurious </script> is removed.
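
To make the rule concrete, here is a minimal Python sketch of where a script element's raw text ends. It is only an approximation: `first_script_body` is a hypothetical helper (not part of Drupal, PHP DOM, or WebKit), and it ignores the `<!--` escape states that the full HTML5 tokenizer also handles.

```python
import re

# Hypothetical helper for illustration only. It approximates the HTML5 rule
# that a <script> element's raw text ends at the first "</script" followed
# by whitespace, "/" or ">". The "<!--" escape states are not modelled.
SCRIPT_OPEN = re.compile(r"<script\b[^>]*>", re.IGNORECASE)
SCRIPT_CLOSE = re.compile(r"</script(?=[\t\n\f\r />])", re.IGNORECASE)


def first_script_body(html):
    """Return (body, rest) for the first <script> element, or None."""
    opened = SCRIPT_OPEN.search(html)
    if opened is None:
        return None
    closed = SCRIPT_CLOSE.search(html, opened.end())
    if closed is None:
        # Unterminated script: everything up to EOF counts as script text.
        return html[opened.end():], ""
    body = html[opened.end():closed.start()]
    # Skip past the ">" that finishes the end tag.
    tag_end = html.find(">", closed.end())
    rest = html[tag_end + 1:] if tag_end != -1 else ""
    return body, rest


markup = '<p><script type="text/javascript">alert("<script>test</script>")</script></p>'
body, rest = first_script_body(markup)
print(repr(body))  # script text as a parser following this rule would see it
print(repr(rest))  # what remains, including the now spurious second </script>
```

Under this rule the script body stops right after `test`, so the second `</script>` has no open element left to close.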

Comment #18

chx commented 13 November 2011 at 12:10

We can find simpler broken stuff. Try <script>alert(']]>');</script> for example (coming from the issue das-peter linked)

Comment #19

grendzy commented 13 November 2011 at 19:27

I haven't read all of #998590: Prevent double CDATA section escaping in filter_dom_serialize_escape_cdata_element() to avoid warnings, but CDATA sections have a similar limitation, right? http://www.w3.org/TR/html5/syntax.html#cdata-sections

the text must not contain the string "]]>".

Anyway, if there are fatal flaws in PHP DOM that prevent us from using it, can we get links to a bug report with the upstream vendor? If I understand correctly, that would be https://bugzilla.gnome.org/buglist.cgi?product=libxml2.

Comment #20

chx commented 13 November 2011 at 21:47

You constantly link to HTML5 -- do we expect, and rely on even for security (!), that we run in an HTML5-compatible browser parser?

Comment #21

chx commented 17 November 2011 at 05:54

Priority: Normal » Major

Also, a standard saying the text must not contain the string "]]>" is not enough: what happens if it does?

Comment #22

grendzy commented 17 November 2011 at 18:12

I'll refer to html5 again, since it is the first version of html to specify detailed parsing rules and error handling.

Consume every character up to the next occurrence of the three character sequence U+005D RIGHT SQUARE BRACKET U+005D RIGHT SQUARE BRACKET U+003E GREATER-THAN SIGN (]]>), or the end of the file (EOF), whichever comes first. Emit a series of character tokens consisting of all the characters consumed except the matching three character sequence at the end (if one was found before the end of the file).

Switch to the data state.

Once in the data state, subsequent characters are consumed as plain text (at least until the next & or < is encountered).
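
As a rough illustration of the quoted rule, the CDATA section state can be sketched in a few lines of Python. `consume_cdata_section` is a hypothetical name, and this is a simplification for discussion, not code from any real parser.

```python
def consume_cdata_section(text, start):
    """Sketch of the quoted CDATA section state (hypothetical helper).

    `start` is the index just after "<![CDATA[". Returns the characters to
    emit and the index at which the tokenizer resumes in the data state.
    """
    end = text.find("]]>", start)
    if end == -1:
        # No terminator before EOF: everything left is emitted.
        return text[start:], len(text)
    # Emit everything before "]]>" and resume just after it.
    return text[start:end], end + len("]]>")


source = "<![CDATA[alert(']]>');]]>"
emitted, resume = consume_cdata_section(source, len("<![CDATA["))
print(repr(emitted))          # characters inside the section
print(repr(source[resume:]))  # leftover, consumed as plain text in the data state
```

So a "]]>" inside the intended content ends the section early, and the remainder is treated as ordinary text rather than causing a hard failure.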

Of course we can't assume the user agent follows html5 rules, but the error handling rules in html5 were largely based on existing browser behaviors. In this case XHTML behaves the same, unless the subsequent text node causes a parse error - in which case the document is either firmly rejected (with XML mime type), or error-corrected in a user-agent-dependent way.

In IRC chx suggested Internet Explorer might have flaws that would leave it vulnerable when html was parsed and filtered according to PHP DOM rules. I don't know much about IE's weaknesses here, so can't really comment. If anyone has an example that will actually execute in IE after being parsed / filtered with PHP DOM that would be useful.

Comment #23

chx commented 20 November 2011 at 03:15

OK so I created a file containing <html><body><script>alert(']]>');</script> </body></html>, loaded it into Chrome, and it alerted ]]> just fine. If you try to DOM-load that and then CDATA-escape it, all hell breaks loose. Here's my preferred method (because it's very easy to see), namely convert it into SimpleXML then dump it: <html><body><script><![CDATA[alert(']]]]><![CDATA[>');]]></script></body></html> (simplexml_import_dom(DOMDocument::loadHTMLFile('test.html'))->asXml())

If you run current HEAD on this snippet, warnings galore is all you get.

And yet, the browser works.
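
For readers unfamiliar with the `]]]]><![CDATA[>` sequence in that dump: it is the usual way to carry a literal "]]>" inside CDATA, by closing the section after "]]" and reopening it before ">". A minimal Python sketch follows; `wrap_in_cdata` is a hypothetical helper, and Python's standard XML parser stands in for libxml2 here, which is an assumption about equivalent behavior, not a test of PHP DOM.

```python
import xml.etree.ElementTree as ET


def wrap_in_cdata(text):
    """Wrap text in CDATA, splitting any "]]>" across two sections.

    Hypothetical helper for illustration; not Drupal's implementation.
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


script = "alert(']]>');"
wrapped = wrap_in_cdata(script)
print(wrapped)

# An XML parser joins the two sections back into the original text.
root = ET.fromstring("<script>" + wrapped + "</script>")
print(root.text == script)
```

This round-trips in an XML parser, but an HTML parser treats script content as raw text and does not interpret the CDATA markers, which is one reason mixing the two serializations is fragile.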

Comment #24

grendzy commented 20 November 2011 at 03:33

Thanks, that clears up the example. Sorry if this is a dumb question... what was the purpose of adding the CDATA wrapper? It seems like that is the root of this issue, and not any particular flaw in the parser. Was it just for consistency with the XHTML doctype? Now that we have #1077566: Convert html.tpl.php to HTML5, perhaps we no longer need to add the CDATA wrapper? (Omitting the wrapper causes no problem for html5, or html4 for that matter).

Thinking more about SimpleXML... in addition to this issue if there are other behaviors we don't like in the future... we are pretty much powerless to change it. Even if we have an issue accepted by the libxml developers it could take years to filter downstream into CentOS / Ubuntu etc.

If we are searching for alternatives, has anyone looked at http://code.google.com/p/html5lib/ ? It's based directly on the WHATWG specification, so we could have a filter system that's truly fluent in HTML, with tokenizing / parsing rules consistent with modern browsers.

Comment #25

chx commented 20 November 2011 at 03:46

I did look at http://code.google.com/p/html5lib/downloads/list and immediately scratched the idea, the project seems to be dead.

#721536: HTML corrector filter has problems with unescaped CDATA and incorrectly closed tags is the original issue. We need some sort of HTML corrector that keeps JavaScript alive while correcting HTML.

Comment #26

greggles commented 9 May 2012 at 20:15

htmlpurifier seems like a good call as well. It has lots of tests and decent usage - http://drupal.org/project/htmlpurifier

Comment #27

sun commented 9 May 2012 at 21:59

At some point, there was a rumor that parts of our filter system would be based on a very old (initial) version of htmlpurifier.

So check_markup() was check_output() before, and filter_xss() was valid_input_data() before. filter_xss() is borrowed from http://sourceforge.net/projects/kses/ but has never been updated to the latest version of 2005. That said, neither check_output() nor kses is based on htmlpurifier; kses is based on http://savannah.nongnu.org/projects/gnuheter but diverged from it. And so did we.

Comment #28

grendzy commented 9 May 2012 at 22:06

FWIW, I asked in #whatwg about the html5lib project:

grendzy: Hi! Drupal community is looking for a more sophisticated parser to replace PHP DOM (a.k.a SimpleXML, I think based on libxml2). Is http://code.google.com/p/html5lib/ abandoned? Last commit was almost 2 years ago. Thanks!
jgraham: I am not aware that anyone is actively working on the PHP port
jgraham: If you would like to take over that would be easy to arrange
jgraham: But you should maybe check the performance before you decide what you want to do
smaug____: wasn't there some plan to support hsivonen's parser with libxml2
smaug____: grendzy: take hsivonen's parser, and generate php code from java files
hsivonen: smaug____: there's a plan. now that View Source is out of the way, it might actually become real
erlehmann: grendzy, as far as i can say, html5lib was usable 1 year ago.
erlehmann: i used the PHP portion for a wordpress plugin.
erlehmann: and am now using python.
erlehmann: PHP is pig disgusting.
miketayl_r is now known as miketaylr.
grendzy: thanks folks… anyone mind if I quote this chat on a drupal.org discussion?
AryehGregor: Go ahead.
AryehGregor: It's publicly logged.
grendzy: cool, thanks again for the feedback

Comment #29

YesCT commented 9 January 2013 at 07:54

Issue tags: +Needs issue summary update

I think an issue summary update (tips to do that: http://drupal.org/node/1427826)
and some suggested next steps would be helpful here.

I'll take a stab:
Next step:
implement an initial try at a patch that would use: htmlpurifier

Comment #30

mgifford commented 2 June 2013 at 12:09

Duplicate? - #1333730: [Meta] PHP DOM (libxml2) misinterprets HTML5

Comment #31

Pancho commented 11 July 2013 at 04:00

Status: Needs work » Postponed

Not exactly. But unsurprisingly, both filter issues lead to the search for a replacement to PHP DOM aka SimpleXML.
The other library looks more promising, though.
So we should probably postpone this one on #1333730: [Meta] PHP DOM (libxml2) misinterprets HTML5 and focus our efforts there. As soon as we're having a proper HTML5 parser, this here will be a very straightforward followup.

Comment #32

mgifford commented 12 July 2013 at 13:47

Sounds good.

Comment #33

Berdir at MD Systems GmbH commented 8 October 2015 at 00:27

Issue summary:	View changes
Status:	Postponed	» Active

I think this can be unpostponed now? The linked issue is still open but it is a meta, we have a html5 parser library now.

Comment #34

Hanno commented 8 October 2015 at 07:49

Comment #35

Hanno commented 8 October 2015 at 07:52

Yes, good idea to wait until #2441373: Upgrade tests to HTML5 is fixed?

Comment #36

8 October 2015 at 07:52

Version: 8.0.x-dev » 8.1.x-dev

Drupal 8.0.6 was released on April 6 and is the final bugfix release for the Drupal 8.0.x series. Drupal 8.0.x will not receive any further development aside from security fixes. Drupal 8.1.0-rc1 is now available and sites should prepare to update to 8.1.0.

Bug reports should be targeted against the 8.1.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #37

8 October 2015 at 07:52

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.9 was released on September 7 and is the final bugfix release for the Drupal 8.1.x series. Drupal 8.1.x will not receive any further development aside from security fixes. Drupal 8.2.0-rc1 is now available and sites should prepare to upgrade to 8.2.0.

Bug reports should be targeted against the 8.2.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #38

8 October 2015 at 07:52

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.6 was released on February 1, 2017 and is the final full bugfix release for the Drupal 8.2.x series. Drupal 8.2.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.3.0 on April 5, 2017. (Drupal 8.3.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.3.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #39

8 October 2015 at 07:52

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.6 was released on August 2, 2017 and is the final full bugfix release for the Drupal 8.3.x series. Drupal 8.3.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.4.0 on October 4, 2017. (Drupal 8.4.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.4.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #40

8 October 2015 at 07:52

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.4 was released on January 3, 2018 and is the final full bugfix release for the Drupal 8.4.x series. Drupal 8.4.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.5.0 on March 7, 2018. (Drupal 8.5.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.5.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #41

8 October 2015 at 07:52

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.6 was released on August 1, 2018 and is the final bugfix release for the Drupal 8.5.x series. Drupal 8.5.x will not receive any further development aside from security fixes. Sites should prepare to update to 8.6.0 on September 5, 2018. (Drupal 8.6.0-rc1 is available for testing.)

Bug reports should be targeted against the 8.6.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Comment #42

8 October 2015 at 07:52

Version: 8.6.x-dev » 8.8.x-dev

Drupal 8.6.x will not receive any further development aside from security fixes. Bug reports should be targeted against the 8.8.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.9.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Comment #43

8 October 2015 at 07:52

Version: 8.8.x-dev » 8.9.x-dev

Drupal 8.8.7 was released on June 3, 2020 and is the final full bugfix release for the Drupal 8.8.x series. Drupal 8.8.x will not receive any further development aside from security fixes. Sites should prepare to update to Drupal 8.9.0 or Drupal 9.0.0 for ongoing support.

Bug reports should be targeted against the 8.9.x-dev branch from now on, and new development or disruptive changes should be targeted against the 9.1.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.