Problem/Motivation

Initially reported by @lauriii in #2441811: Upgrade filter system to HTML5, HTML5 allows unescaped less-than and greater-than in HTML attributes, e.g.

<img src="llama.jpg" data-caption="<em>Loquacious llama!</em>" />

Xss::filter() does not handle this:

>>> use \Drupal\Component\Utility\Xss;

>>> Xss::filter('<img src="llama.jpg" data-caption="Loquacious llama!" />', ['img', 'em']);
=> "<img src="llama.jpg" data-caption="Loquacious llama!" />"

>>> Xss::filter('<img src="llama.jpg" data-caption="<em>Loquacious llama!</em>" />', ['img', 'em']);
=> "<img src="llama.jpg">Loquacious llama!</em>" /&gt;"

In other words when an attribute contains a tag (or even just a >) the output is mangled, and part of the attribute value may end up in the HTML body instead.

Xss::filter() uses two regular expressions to try and extract tags from HTML:

      <[^>]*(>|$)       # a string that starts with a <, up until the > or the end of the string

This trivially matches anything that looks like a tag, but does not handle attributes that contain >.

    if (!preg_match('%^<\s*(/\s*)?([a-zA-Z0-9\-]+)\s*([^>]*)>?|(<!--.*?-->)$%', $string, $matches)) {

Similarly this seems unable to handle attributes that contain >.

Steps to reproduce

Proposed resolution

Remaining tasks

Determine whether regex is sufficient to filter HTML in this way: https://stackoverflow.com/a/1732454
Improve the regex to handle attributes that contain tag characters, or replace Xss::filter() with something more robust.

User interface changes

API changes

Data model changes

Release notes snippet

CommentFileSizeAuthor
#2 3227831-test-only.patch2.81 KBlongwave

Issue fork drupal-3227831

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Comments

longwave created an issue. See original summary.

longwave’s picture

Status: Active » Needs review
StatusFileSize
new2.81 KB

Failing test from @lauriii in #2441811: Upgrade filter system to HTML5, backported to 9.3.x without the HTML5 filter changes.

longwave’s picture

With the help of https://stackoverflow.com/a/1736801 I think I might have solved this.

I replaced the naive regex <[^>]*(>|$) with <(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])*(>|$) which allows single and double quoted attributes to be ignored when processing the tags, this change needed to be done in two places. As the second regex was quite long I converted it to multiline form and added comments at the same time. Finally, < and > were always stripped from attributes, but HTML5 allows this so I removed that line.

longwave’s picture

Issue tags: +Needs security review

As this is changing XSS filtering I think this needs a security review.

Status: Needs review » Needs work

The last submitted patch, 2: 3227831-test-only.patch, failed testing. View results

longwave’s picture

Status: Needs work » Needs review

The changes to \Drupal\Tests\editor\Unit\EditorXssFilter\StandardTest will need some scrutiny.

effulgentsia’s picture

HTML5 allows unescaped less-than and greater-than in HTML attributes

Didn't HTML4 allow that too? If so, what's the new use case where it's actually coming up? The regex has always been more restrictive than necessary, and I'm not opposed to making it less restrictive if/when it's safe to do so, but I'm curious what's prompting this issue.

longwave’s picture

The XML spec says this is not allowed for < at least:

https://www.w3.org/TR/2008/REC-xml-20081126/#NT-AttValue

AttValue	   ::=   	'"' ([^<&"] | Reference)* '"'
			|  "'" ([^<&'] | Reference)* "'"
The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings " &amp; " and " &lt; " respectively.

The right angle bracket (>) may be represented using the string " &gt; ", and MUST, for compatibility, be escaped using either " &gt; " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.

I so far haven't found an explicit reference to this in the HTML4 spec at https://www.w3.org/TR/html4/

The HTML 5 spec is much more succinct: https://html.spec.whatwg.org/#attributes

An attribute value is a string. Except where otherwise specified, attribute values on HTML elements may be any string value, including the empty string, and there is no restriction on what text can be specified in such attribute values.

The driver for this issue is CKEditor 5, which wants to store unescaped HTML in data attributes: #3222808: Follow-up for #3201646: markup in image captions is lost

longwave’s picture

Status: Needs review » Needs work

https://www.w3.org/TR/html4/charset.html has this snippet which implies unescaped attributes are allowed, although should not be used:

Authors wishing to put the "<" character in text should use "&lt;" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter). Similarly, authors should use "&gt;" (ASCII decimal 62) in text instead of ">" to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when it appears in quoted attribute values.

longwave’s picture

Status: Needs work » Needs review
effulgentsia’s picture

I'm pretty nervous about us adding more complexity to the regular expression in an effort to mimic HTML5's complex parsing logic. The regex we have in HEAD has been battle tested for security over many years, and any change we introduce has the risk of introducing an XSS vector, unless it's a regex that we're grabbing from a source that has a very good track record for security.

I asked in #3222808-18: Follow-up for #3201646: markup in image captions is lost if we really need this. If we really do, I think switching from the regex to the masterminds/html5 library might be a more secure approach than tweaking the regex.

longwave’s picture

I was also wary of changing these regexes although \Drupal\Tests\editor\Unit\EditorXssFilter\StandardTest is a pretty good set of test cases, most of them are unchanged, but of the ones that are different most of those look like the output is better than before - a number of bad cases reduced to the empty string and others where the HTML still looks protected but more valid than before.

However I share the concerns and understand if we think trying out masterminds/html5 is a good idea - although that will likely come with an entirely different set of changes to \Drupal\Tests\editor\Unit\EditorXssFilter\StandardTest that require even more scrutiny.

longwave’s picture

Having said that, masterminds/html5 says it is not appropriate for filtering: https://github.com/Masterminds/html5-php/wiki/End-User-Submitted-Markup - maybe we need to look at http://htmlpurifier.org/

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.0-rc1 was released on November 26, 2021, which means new developments and disruptive changes should now be targeted for the 9.4.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.4.x-dev » 9.5.x-dev

Drupal 9.4.0-alpha1 was released on May 6, 2022, which means new developments and disruptive changes should now be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

smustgrave’s picture

This appears to be a duplicate of

https://www.drupal.org/project/drupal/issues/2552837 or https://www.drupal.org/project/drupal/issues/2544110 both of which I've been working on.

If you agree can you close and move over credit?

Version: 9.5.x-dev » 10.1.x-dev

Drupal 9.5.0-beta2 and Drupal 10.0.0-beta2 were released on September 29, 2022, which means new developments and disruptive changes should now be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

smustgrave’s picture