Xss::filter() does not handle HTML tags inside attribute values [#3227831]

Problem/Motivation

Initially reported by @lauriii in #2441811: Upgrade filter system to HTML5, HTML5 allows unescaped less-than and greater-than in HTML attributes, e.g.

<img src="llama.jpg" data-caption="<em>Loquacious llama!</em>" />

Xss::filter() does not handle this:

>>> use \Drupal\Component\Utility\Xss;

>>> Xss::filter('<img src="llama.jpg" data-caption="Loquacious llama!" />', ['img', 'em']);
=> "<img src="llama.jpg" data-caption="Loquacious llama!" />"

>>> Xss::filter('<img src="llama.jpg" data-caption="<em>Loquacious llama!</em>" />', ['img', 'em']);
=> "<img src="llama.jpg">Loquacious llama!</em>" /&gt;"

In other words when an attribute contains a tag (or even just a >) the output is mangled, and part of the attribute value may end up in the HTML body instead.

Xss::filter() uses two regular expressions to try and extract tags from HTML:

      <[^>]*(>|$)       # a string that starts with a <, up until the > or the end of the string

This trivially matches anything that looks like a tag, but does not handle attributes that contain >.

    if (!preg_match('%^<\s*(/\s*)?([a-zA-Z0-9\-]+)\s*([^>]*)>?|(<!--.*?-->)$%', $string, $matches)) {

Similarly this seems unable to handle attributes that contain >.

Steps to reproduce

Proposed resolution

Remaining tasks

Determine whether regex is sufficient to filter HTML in this way: https://stackoverflow.com/a/1732454
Improve the regex to handle attributes that contain tag characters, or replace Xss::filter() with something more robust.

User interface changes

API changes

Data model changes

Release notes snippet

Comment	File	Size	Author
#2	3227831-test-only.patch	2.81 KB	longwave

Issue fork drupal-3227831

Show commands

Start within a Git clone of the project using the version control instructions.

Add & fetch this issue fork’s repository

Or, if you do not have SSH keys set up on git.drupalcode.org:

Add & fetch this issue fork’s repository

3227831-xssfilter-html-in-attributes changes, plain diff MR !1047
Check out this branch for the first time

Check out existing branch, if you already have it locally

About issue forks

Comments

Comment #1

12 August 2021 at 09:35

longwave created an issue. See original summary.

Comment #2

longwave

he/him

English

commented 12 August 2021 at 09:50

Status:

Active

» Needs review

Status	File	Size
new	3227831-test-only.patch	2.81 KB

Failing test from @lauriii in #2441811: Upgrade filter system to HTML5, backported to 9.3.x without the HTML5 filter changes.

Comment #3

12 August 2021 at 10:38

longwave opened merge request !1047

Comment #4

longwave

he/him

English

commented 12 August 2021 at 10:41

1 file was hidden/shown/deleted

Status	File	Size
hidden	3227831-test-only.patch	2.81 KB

With the help of https://stackoverflow.com/a/1736801 I think I might have solved this.

I replaced the naive regex <[^>]*(>|$) with <(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])*(>|$) which allows single and double quoted attributes to be ignored when processing the tags, this change needed to be done in two places. As the second regex was quite long I converted it to multiline form and added comments at the same time. Finally, < and > were always stripped from attributes, but HTML5 allows this so I removed that line.

Comment #5

longwave

he/him

English

commented 12 August 2021 at 10:42

Issue tags:

+Needs security review

As this is changing XSS filtering I think this needs a security review.

Comment #6

12 August 2021 at 10:50

Status:

Needs review

» Needs work

The last submitted patch, 2: 3227831-test-only.patch, failed testing. View results

Comment #7

longwave

he/him

English

commented 12 August 2021 at 12:03

Status:

Needs work

» Needs review

The changes to \Drupal\Tests\editor\Unit\EditorXssFilter\StandardTest will need some scrutiny.

Comment #8

effulgentsia commented 12 August 2021 at 14:02

HTML5 allows unescaped less-than and greater-than in HTML attributes

Didn't HTML4 allow that too? If so, what's the new use case where it's actually coming up? The regex has always been more restrictive than necessary, and I'm not opposed to making it less restrictive if/when it's safe to do so, but I'm curious what's prompting this issue.

Comment #9

longwave

he/him

English

commented 12 August 2021 at 14:47

The XML spec says this is not allowed for < at least:

https://www.w3.org/TR/2008/REC-xml-20081126/#NT-AttValue

AttValue	   ::=   	'"' ([^<&"] | Reference)* '"'
			|  "'" ([^<&'] | Reference)* "'"

The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings " & " and " < " respectively.

The right angle bracket (>) may be represented using the string " > ", and MUST, for compatibility, be escaped using either " > " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.

I so far haven't found an explicit reference to this in the HTML4 spec at https://www.w3.org/TR/html4/

The HTML 5 spec is much more succinct: https://html.spec.whatwg.org/#attributes

An attribute value is a string. Except where otherwise specified, attribute values on HTML elements may be any string value, including the empty string, and there is no restriction on what text can be specified in such attribute values.

The driver for this issue is CKEditor 5, which wants to store unescaped HTML in data attributes: #3222808: Follow-up for #3201646: markup in image captions is lost

Comment #10

longwave

he/him

English

commented 12 August 2021 at 14:56

Status:

Needs review

» Needs work

https://www.w3.org/TR/html4/charset.html has this snippet which implies unescaped attributes are allowed, although should not be used:

Authors wishing to put the "<" character in text should use "<" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter). Similarly, authors should use ">" (ASCII decimal 62) in text instead of ">" to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when it appears in quoted attribute values.

Comment #11

longwave

he/him

English

commented 12 August 2021 at 14:56

Status:

Needs work

» Needs review

Comment #12

effulgentsia commented 12 August 2021 at 15:23

I'm pretty nervous about us adding more complexity to the regular expression in an effort to mimic HTML5's complex parsing logic. The regex we have in HEAD has been battle tested for security over many years, and any change we introduce has the risk of introducing an XSS vector, unless it's a regex that we're grabbing from a source that has a very good track record for security.

I asked in #3222808-18: Follow-up for #3201646: markup in image captions is lost if we really need this. If we really do, I think switching from the regex to the masterminds/html5 library might be a more secure approach than tweaking the regex.

Comment #13

longwave

he/him

English

commented 12 August 2021 at 15:38

I was also wary of changing these regexes although \Drupal\Tests\editor\Unit\EditorXssFilter\StandardTest is a pretty good set of test cases, most of them are unchanged, but of the ones that are different most of those look like the output is better than before - a number of bad cases reduced to the empty string and others where the HTML still looks protected but more valid than before.

However I share the concerns and understand if we think trying out masterminds/html5 is a good idea - although that will likely come with an entirely different set of changes to \Drupal\Tests\editor\Unit\EditorXssFilter\StandardTest that require even more scrutiny.

Comment #14

longwave

he/him

English

commented 12 August 2021 at 15:42

Having said that, masterminds/html5 says it is not appropriate for filtering: https://github.com/Masterminds/html5-php/wiki/End-User-Submitted-Markup - maybe we need to look at http://htmlpurifier.org/

Comment #15

12 August 2021 at 15:42

Version:

9.3.x-dev

» 9.4.x-dev

Drupal 9.3.0-rc1 was released on November 26, 2021, which means new developments and disruptive changes should now be targeted for the 9.4.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Comment #16

12 August 2021 at 15:42

Version:

9.4.x-dev

» 9.5.x-dev

Drupal 9.4.0-alpha1 was released on May 6, 2022, which means new developments and disruptive changes should now be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Comment #17

smustgrave commented 26 July 2022 at 23:58

This appears to be a duplicate of

https://www.drupal.org/project/drupal/issues/2552837 or https://www.drupal.org/project/drupal/issues/2544110 both of which I've been working on.

If you agree can you close and move over credit?

Comment #18

26 July 2022 at 23:58

Version:

9.5.x-dev

» 10.1.x-dev

Drupal 9.5.0-beta2 and Drupal 10.0.0-beta2 were released on September 29, 2022, which means new developments and disruptive changes should now be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Comment #19

smustgrave commented 17 February 2023 at 21:28

Status:

Needs review

» Closed (duplicate)

Closing as a duplicate of #2552837: XSS::filter and filter_xss can create malformed attributes when you would expect them to be stripped

Xss::filter() does not handle HTML tags inside attribute values

Problem/Motivation

Steps to reproduce

Proposed resolution

Remaining tasks

User interface changes

API changes

Data model changes

Release notes snippet

Issue fork drupal-3227831

Comments

Comment #1

Comment #2

Comment #3

Comment #4

Comment #5

Comment #6

Comment #7

Comment #8

Comment #9

Comment #10

Comment #11

Comment #12

Comment #13

Comment #14

Comment #15

Comment #16

Comment #17

Comment #18

Comment #19

Related issues

Referenced by

News items

Our community

Documentation

Drupal code base

Governance of community