All,
I'm sorry for the delay but with some misunderstanding under my part, this has been delayed... Below is the issue and a proposed resolution
https://www.drupal.org/node/2899340

Original Report

[...] All code everywhere else assumes that the node title is HTML escaped, and therefore safe for attributes. By switching to filter_xss() it breaks that assumption and allows people to break out of for example a title attribute easily.

To fix this, I'd suggest going back to check_plain(), and then manually decoding specific HTML tags - i.e. look for <em> and switch that back to .

That way everything else that check_plain() does is handled and it's just a case of decoding entities known to be safe.

Problem

Xss::filterAdmin() can only be used on complete HTML strings...If we XSS filter a string but don't escape it, then using it as an attribute (any attribute) is an XSS vector - you can break out of the attribute with foo" onmouseover="alert(1)

How to reproduce

An example of how this impacts 7.x:

Create a custom theme (or for this example, just modify Bartik)
Note that the docs of node.tpl.php say $title: the (sanitized) title of the node., which is true because template_preprocess_node() calls $variables['title'] = check_plain($node->title);.
Ok, great, so now modify your theme's (or Bartik's) node.tpl.php, so that instead of

<a href="<?php print $node_url; ?>"><?php print $title; ?></a>

you have

<a href="<?php print $node_url; ?>" title="<?php print $title; ?>"><?php print $title; ?></a>

You might ask: why would you duplicate link text in the title attribute. I don't know, but I've seen that pattern quite a bit on the web. In any case, that's safe, because check_plain() is suitable for attribute values.
But, now install https://www.drupal.org/project/html_title. And with that enabled, create a node with the following title: foo" onmouseover="alert(1).
Boom: you now have XSS. Because, html_title_preprocess_node() does $vars['title'] = filter_xss($vars['elements']['#node']->title, ...);, which is a fine sanitization for something printed as an HTML fragment, but not an ok sanitization for an attribute value.

Proposed Soutions

To see the original proposed solution see the Original Report.

My solution is as follows:
The idea is that I use the same HTML correction that the filter module does (which is basically load it into php dom and let it go to town) and THEN I apply all of

  • escaping to double quotes
  • single quotes
  • and xss filter checking

Because php dom doesn't escape quotes (because it's valid XML) I had to hack that part up a bit... but I make sure to only do this to "text nodes". This is because I want to maintain the same usability for class, id, and other attributes as to break a little as possible while still preventing the security hole.

Comments

generalredneck created an issue. See original summary.

generalredneck’s picture

Attached is my approach. Second some feedback for the approach from the security issue. I fixed the comment issue

@general redneck: The strategy you lay out makes sense to me. I'm not super experienced with php dom functions you're using and if maybe it's a performance issue?

+/**
+ * Helper function to help filter out unwanted XSS opportunities.
+ */
+function html_title_filter_xss($title) {

I think this could get a lot more detail in the doblock to explain when it is appropriate to use this function vs. other functions.

My response for context

About the approach, Yes, PHPDom is a much slower way to tackle this, however, I couldn't find an acceptable way to handle HTML using regex or string replacing without destroying other attributes one may add to the allowed HTML tags. Removing these attributes could be devastating to existing sites and would decrease the usefulness (particularly in styling) of html_title. Given that, I opted to use a technology known for parsing HTML. Additionally, the check HTML filter is typically applied to at least every body field by default when administrators use filtered_html as the default text formatter. In doing so, it runs a very close approximation to what I'm doing here using _filter_html(). My question is would this be called any more (or less) than _filter_html(), and is it a big enough change

With that said, it may be acceptable to store the filtered version of the html as the title instead of filtering on output, which may help. I don't know what the consequences of this would be, or why we don't do something similar instead of "filtering" on the body field...

As for documentation... yeah you definitely hit on one of my weaknesses and I could improve that. Your suggestions are invaluable.

generalredneck’s picture

Issue summary: View changes
generalredneck’s picture

StatusFileSize
new5.34 KB

attaching modified template for people to use to test

  • generalredneck committed 76218fb on 7.x-1.x
    Issue #2968477 by generalredneck, greggles, catch: html_title -...
generalredneck’s picture

Status: Needs review » Fixed

Pushing this fix. If anyone has any issues feel free to open a new issue.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

tostinni’s picture

Issue summary: View changes

Fixing HTML code for clarity

weynhamz’s picture

CI&T's 8.x port is also applied with same fix https://github.com/ciandt-china-dev/drupal-module-html_title/pull/4