Problem/Motivation
Initially reported by @lauriii in #2441811: Upgrade filter system to HTML5, HTML5 allows unescaped less-than and greater-than in HTML attributes, e.g.
<img src="llama.jpg" data-caption="<em>Loquacious llama!</em>" />
Xss::filter() does not handle this:
>>> use \Drupal\Component\Utility\Xss;
>>> Xss::filter('<img src="llama.jpg" data-caption="Loquacious llama!" />', ['img', 'em']);
=> "<img src="llama.jpg" data-caption="Loquacious llama!" />"
>>> Xss::filter('<img src="llama.jpg" data-caption="<em>Loquacious llama!</em>" />', ['img', 'em']);
=> "<img src="llama.jpg">Loquacious llama!</em>" />"
In other words when an attribute contains a tag (or even just a >) the output is mangled, and part of the attribute value may end up in the HTML body instead.
Xss::filter() uses two regular expressions to try and extract tags from HTML:
<[^>]*(>|$) # a string that starts with a <, up until the > or the end of the string
This trivially matches anything that looks like a tag, but does not handle attributes that contain >.
if (!preg_match('%^<\s*(/\s*)?([a-zA-Z0-9\-]+)\s*([^>]*)>?|(<!--.*?-->)$%', $string, $matches)) {
Similarly this seems unable to handle attributes that contain >.
Steps to reproduce
Proposed resolution
Remaining tasks
Determine whether regex is sufficient to filter HTML in this way: https://stackoverflow.com/a/1732454
Improve the regex to handle attributes that contain tag characters, or replace Xss::filter() with something more robust.
User interface changes
API changes
Data model changes
Release notes snippet
| Comment | File | Size | Author |
|---|
Issue fork drupal-3227831
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #2
longwaveFailing test from @lauriii in #2441811: Upgrade filter system to HTML5, backported to 9.3.x without the HTML5 filter changes.
Comment #4
longwaveWith the help of https://stackoverflow.com/a/1736801 I think I might have solved this.
I replaced the naive regex
<[^>]*(>|$)with<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])*(>|$)which allows single and double quoted attributes to be ignored when processing the tags, this change needed to be done in two places. As the second regex was quite long I converted it to multiline form and added comments at the same time. Finally,<and>were always stripped from attributes, but HTML5 allows this so I removed that line.Comment #5
longwaveAs this is changing XSS filtering I think this needs a security review.
Comment #7
longwaveThe changes to \Drupal\Tests\editor\Unit\EditorXssFilter\StandardTest will need some scrutiny.
Comment #8
effulgentsia commentedDidn't HTML4 allow that too? If so, what's the new use case where it's actually coming up? The regex has always been more restrictive than necessary, and I'm not opposed to making it less restrictive if/when it's safe to do so, but I'm curious what's prompting this issue.
Comment #9
longwaveThe XML spec says this is not allowed for
<at least:https://www.w3.org/TR/2008/REC-xml-20081126/#NT-AttValue
I so far haven't found an explicit reference to this in the HTML4 spec at https://www.w3.org/TR/html4/
The HTML 5 spec is much more succinct: https://html.spec.whatwg.org/#attributes
The driver for this issue is CKEditor 5, which wants to store unescaped HTML in data attributes: #3222808: Follow-up for #3201646: markup in image captions is lost
Comment #10
longwavehttps://www.w3.org/TR/html4/charset.html has this snippet which implies unescaped attributes are allowed, although should not be used:
Comment #11
longwaveComment #12
effulgentsia commentedI'm pretty nervous about us adding more complexity to the regular expression in an effort to mimic HTML5's complex parsing logic. The regex we have in HEAD has been battle tested for security over many years, and any change we introduce has the risk of introducing an XSS vector, unless it's a regex that we're grabbing from a source that has a very good track record for security.
I asked in #3222808-18: Follow-up for #3201646: markup in image captions is lost if we really need this. If we really do, I think switching from the regex to the masterminds/html5 library might be a more secure approach than tweaking the regex.
Comment #13
longwaveI was also wary of changing these regexes although \Drupal\Tests\editor\Unit\EditorXssFilter\StandardTest is a pretty good set of test cases, most of them are unchanged, but of the ones that are different most of those look like the output is better than before - a number of bad cases reduced to the empty string and others where the HTML still looks protected but more valid than before.
However I share the concerns and understand if we think trying out masterminds/html5 is a good idea - although that will likely come with an entirely different set of changes to \Drupal\Tests\editor\Unit\EditorXssFilter\StandardTest that require even more scrutiny.
Comment #14
longwaveHaving said that, masterminds/html5 says it is not appropriate for filtering: https://github.com/Masterminds/html5-php/wiki/End-User-Submitted-Markup - maybe we need to look at http://htmlpurifier.org/
Comment #17
smustgrave commentedThis appears to be a duplicate of
https://www.drupal.org/project/drupal/issues/2552837 or https://www.drupal.org/project/drupal/issues/2544110 both of which I've been working on.
If you agree can you close and move over credit?
Comment #19
smustgrave commentedClosing as a duplicate of #2552837: XSS::filter and filter_xss can create malformed attributes when you would expect them to be stripped