Support for Drupal 7 is ending on 5 January 2025—it’s time to migrate to Drupal 10! Learn about the many benefits of Drupal 10 and find migration tools in our resource center.
The real reason for wrapping the snippet in a HTML document was that DOM can't export a snippet back only a full document so you needed a wrapper around it. That's totally unnecessary IMO. If you are worried about UTF-8 there's an issue open already to throw validation errors on entering those. The database won't store them anyways.
Comment | File | Size | Author |
---|---|---|---|
72_lines_less_with_simplexml.patch | 4.27 KB | chx | |
#2 | 1277290.patch | 4.22 KB | chx |
#5 | 1277290_5.patch | 2.54 KB | chx |
#6 | 1277290_6.patch | 4.72 KB | chx |
#17 | webkit_script.png | 9.03 KB | grendzy |
Comments
Comment #2
chx CreditAttribution: chx commentedI made a typo there.
Comment #3
Damien Tournoud CreditAttribution: Damien Tournoud commentedThe reasons all this is there are documented, both in the code and in the commit messages. Doing a simple HTML-to-XML is definitely not necessary.
Comment #4
chx CreditAttribution: chx commentedCare to copy-paste some here? Even, write tests that fail with this code?
Comment #5
chx CreditAttribution: chx commentedWell, small moves then: remove the ugly step-by-step. I will add a test.
Edit: https://bugs.php.net/bug.php?id=54429 code comes from here.
Edit: i tested with script and style tags both.
Comment #6
chx CreditAttribution: chx commentedWell, #721536: HTML corrector filter has problems with unescaped CDATA and incorrectly closed tags clearly stated it's only the XML parser that has problems with this. Well, converting to text nodes instead of adding CDATA makes it work and just look at the tests how much prettier the output is. The code is much nicer too.
Comment #7
JacobSingh CreditAttribution: JacobSingh commentedIt's been too long for me to recall of the intricacies of this, but the fix seems reasonable. My only concern is that the test has changd, but chx explained that we now convert to CDATA and then iterate through and remove the CDATA elements and turn them to text nodes. I guess this won't cause problems. So RTBC, but I'd like to see us move away from this kludge eventually and use a source code corrector / formatter library instead of the DOM extension + hacks.
Comment #8
JacobSingh CreditAttribution: JacobSingh commentedComment #9
JacobSingh CreditAttribution: JacobSingh commentedheh, I guess not. I sent this to chx to test:
<p><script type="text/javascript">alert("<script>test</script>")</script></p>
And it borked.Comment #10
chx CreditAttribution: chx commentedYes but it's broken on the original as well!
Comment #11
chx CreditAttribution: chx commentedComment #12
chx CreditAttribution: chx commentedThis is impossible to fix.Feeding
<p><script type="text/javascript">1 > 0; alert("<script>test</script>")</script></p>
into loadHTML results in<p><script type="text/javascript">1 > 0; alert("<script>test</script>")</p>
the closing script tag is simply gone immediately. The only way to fix this is to not use the DOM extension.Comment #13
sunComment #14
chx CreditAttribution: chx commentedFeeding
<p><script type="text/javascript">alert("<script>test</script>")</script></p>
into filter_xss doesnt yield<p></p>
as it should either.Comment #15
grendzy CreditAttribution: grendzy commentedComment #16
das-peter CreditAttribution: das-peter commentedBeing evil here #998590-22: Prevent double CDATA section escaping in filter_dom_serialize_escape_cdata_element() to avoid warnings lead me to this issue.
In a discussion with sun, eugenmayer and derein, about the above mentioned issue, eugenmayer brought up a link to this html purifier: http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/
According to chx this could be a candidate for this issue - thus it shall be mentioned here :)
Here the result of a short test with it (no further configuration of the purifier):
Input:
<p><script type="text/javascript">1 > 0; alert("<script>test</script>")</script></p> and <p><script type="text/javascript"><!--//--><![CDATA[// ><!-- alert("test");//--><!]]></script></p>
Output:
<p><script type="text/javascript">1 > 0; alert("test</script>")</p> and <p><script type="text/javascript"><!--// --><![CDATA[// ><!-- alert("test");//--><!]]></script></p>
Comment #17
grendzy CreditAttribution: grendzy commentedchx: re #12 what would the expected behavior be? According to http://www.w3.org/TR/html5/syntax.html#cdata-rcdata-restrictions that's not valid HTML:
WebKit produces the same DOM as loadHTML in this case:
The first </script> closes the element and the final, spurious </script> is removed.
Comment #18
chx CreditAttribution: chx commentedWe can find simpler broken stuff. Try <script>alert(']]>');</script> for example (coming from the issue das-peter linked)
Comment #19
grendzy CreditAttribution: grendzy commentedI haven't read all of #998590: Prevent double CDATA section escaping in filter_dom_serialize_escape_cdata_element() to avoid warnings, but CDATA sections have a similar limitation, right? http://www.w3.org/TR/html5/syntax.html#cdata-sections
Anyway, if there are fatal flaws in PHP DOM that prevent us from using it, can we get links to a bug report with the upstream vendor? If I understand correctly, that would be https://bugzilla.gnome.org/buglist.cgi?product=libxml2.
Comment #20
chx CreditAttribution: chx commentedYou constantly link to HTML5 -- do we expect and rely , even for security (!) that we run in an HTML5 compatible browser-parser?
Comment #21
chx CreditAttribution: chx commentedalso just because a standard says the text must not contain the string "]]>". what happens if it does?
Comment #22
grendzy CreditAttribution: grendzy commentedI'll refer to html5 again, since it is the first version of html to specify detailed parsing rules and error handling.
Once in the data state, subsequent characters are consumed as plain text (at least until the next & or < is encountered).
Of course we can't assume the user agent follows html5 rules, but the error handling rules in html5 were based on largely existing browser behaviors. In this case XHTML behaves the same, unless the subsequent text node causes a parse error - in which case the document is either firmly rejected (with XML mime type), or error-corrected in a user-agent-dependant way.
In IRC chx suggested Internet Explorer might have flaws that would leave it vulnerable when html was parsed and filtered according to PHP DOM rules. I don't know much about IE's weaknesses here, so can't really comment. If anyone has an example that will actually execute in IE after being parsed / filtered with PHP DOM that would be useful.
Comment #23
chx CreditAttribution: chx commentedOK so I createda file continaing <html><body><script>alert(']]>');</script> </body></html> loaded into Chrome and it alerted ]]> just fine. If you try to DOM-load that and then CDATA-escape it, all hell breaks lose. Here's my preferred method (because it's very easy to see), namely convert it into SimpleXML then dump it: <html><body><script><![CDATA[alert(']]]]><![CDATA[>');]]></script></body></html> (
simplexml_import_dom(DOMDocument::loadHTMLFile('test.html'))->asXml()
)If you run current HEAD on this snippet, a warning galore is all you get.
And yet, the browser works.
Comment #24
grendzy CreditAttribution: grendzy commentedThanks, that clears up the example. Sorry if this is a dumb question... what was the purpose of adding the CDATA wrapper? It seems like that is the root if this issue, and not any particular flaw in the parser. Was it just for consistency with the XHTML doctype? Now that the we have #1077566: Convert html.tpl.php to HTML5, perhaps we no longer need to add the CDATA wrapper? (Omitting the wrapper causes no problem for html5, or html4 for that matter).
Thinking more about SimpleXML... in addition to this issue if there are other behaviors we don't like in the future... we are pretty much powerless to change it. Even if we have an issue accepted by the libxml developers it could take years to filter downstream into CentOS / Ubuntu etc.
If we are searching for alternatives has anyone looked at http://code.google.com/p/html5lib/ ? It 's based directly on the WHATWG specification so we could have a filter system that's truly fluent in HTML with tokenizing / parsing rules consistent with modern browsers.
Comment #25
chx CreditAttribution: chx commentedI did look at http://code.google.com/p/html5lib/downloads/list and immediately scratched the idea, the project seems to be dead.
#721536: HTML corrector filter has problems with unescaped CDATA and incorrectly closed tags is the original issue. We need some sort of HTML corrector that keeps JavaScript alive while correcting HTML.
Comment #26
greggleshtmlpurifier seems like a good call as well. It has lots of tests and decent usage - http://drupal.org/project/htmlpurifier
Comment #27
sunAt some point, there was a rumor that parts of our filter system would be based on a very old (initial) version of htmlpurifier.
So check_markup() was check_output() before, and filter_xss() was valid_input_data() before. filter_xss() is borrowed from http://sourceforge.net/projects/kses/ but has never been updated to the latest version of 2005. That said, neither check_output() nor kses is based on htmlpurifier; kses is based on http://savannah.nongnu.org/projects/gnuheter but diverged from it. And so did we.
Comment #28
grendzy CreditAttribution: grendzy commentedFWIW, I asked in #whatwg about the html5lib project:
Comment #29
YesCT CreditAttribution: YesCT commentedI think an issue summary update (tips to do that: http://drupal.org/node/1427826)
and some suggested next steps would be helpful here.
I'll take a stab:
Next step:
implement an initial try at a patch that would use: htmlpurifier
Comment #30
mgiffordDuplicate? - #1333730: [Meta] PHP DOM (libxml2) misinterprets HTML5
Comment #31
PanchoNot exactly. But unsurprisingly, both filter issues lead to the search for a replacement to PHP DOM aka SimpleXML.
The other library looks more promising, though.
So we should probably postpone this one on #1333730: [Meta] PHP DOM (libxml2) misinterprets HTML5 and focus our efforts there. As soon as we're having a proper HTML5 parser, this here will be a very straightforward followup.
Comment #32
mgiffordSounds good.
Comment #33
BerdirI think this can be unpostponed now? The linked issue is still open but it is a meta, we have a html5 parser library now.
Comment #34
Hanno CreditAttribution: Hanno commentedComment #35
Hanno CreditAttribution: Hanno commentedYes, good idea to wait until #2441373: Upgrade tests to HTML5 is fixed?
Comment #44
longwaveI think this can be closed as a duplicate of #2441811: Upgrade filter system to HTML5 now.
Comment #45
daffie CreditAttribution: daffie commentedLets closed this issue as a duplicate of #2441811: Upgrade filter system to HTML5.