Follow-up from #715142-71: [beta blocker blocker] Various URLs escaped twice, since check_url() resp. filter_xss_bad_protocol() calls check_plain().

In Drupal 6, the following functions all implement protocol stripping (via check_url()) before outputting a URL:
l()
theme_image()
theme_textfield()
theme_form()
theme_more_help_link()
theme_feed_icon()
theme_more_link()
template_preprocess_search_result()

In Drupal 7, l() and theme_image() do not, but the others do.

I don't understand why the inconsistency in D7. Either we made a mistake with l() and theme_image() and they should strip protocols as they do in D6, or we took a partial step in the correct direction by deciding that generic functions should not be biased against dangerous protocols: i.e., dangerous protocols aren't bad per se, they are only bad for user-entered text, and that therefore, only those functions that specifically deal with variables containing user-entered text should strip protocols. In other words, code that calls l() or theme('image') for a user-entered URL should be responsible for filtering out dangerous protocols, and that it *should* be possible for a module to call l() or theme('image') with a URL containing a custom protocol that would normally be disallowed for user-entered text, but that whose value the module knows to be safe.

If the latter, we need to finish the job, which is what this patch does, and document it (which this patch does not yet do).

Or maybe, we need to revert l() and theme_image() to the D6 way.

This is critical due to security implications. This patch is an "API change" because it requires callers of these functions to strip protocols if they are passing variables containing user-entered text. #715142: [beta blocker blocker] Various URLs escaped twice, since check_url() resp. filter_xss_bad_protocol() calls check_plain() adds a drupal_strip_dangerous_protocols() function that makes it possible for callers to do so.

This patch will need to be re-rolled after #715142: [beta blocker blocker] Various URLs escaped twice, since check_url() resp. filter_xss_bad_protocol() calls check_plain() lands, but there's no reason to wait on that issue before having the discussion of whether this patch is desirable in principle.

CommentFileSizeAuthor
#1 limit_url_protocol_stripping.patch3.88 KBeffulgentsia
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

effulgentsia’s picture

webchick’s picture

My preference would be to revert the behaviour to D6, but have something opt-in in the $options array to bypass security checks in "generic" functions like l() or theme_image(), if that's actually required. I'm not sure why exactly it is, since we've gotten fine an entire 2+ years of D6's lifecycle without it, and afaik D5 was the same.

I tried reading through the annotation of l() for a few mins, but couldn't seem to track down the commit where this behaviour got changed in D7. Any ideas of its origin? That might help inform the decision here.

chx’s picture

Priority: Critical » Normal

The l behaviour was changed in #354812: filter_xss_bad_protocol is called hundreds of times on some pages. for performance reasons. And while we do not strip dangerous protocols any more, there is no security hole -- the colon after the dangerous protocol will be URL escaped and prefixed with base_path(). See webchick's test in #715142-66: [beta blocker blocker] Various URLs escaped twice, since check_url() resp. filter_xss_bad_protocol() calls check_plain(). The reason for this lies in the way url recognizes external URLs. It finds the colon inside an so fires protocol stripping over it which happily removes the dangerous protocol then the comparison against the original will fail so it will be recognized as an internal path. There is nothing to stop javascript:alert('xss'); to be a valid URL alias after all.

theme_image falls back to file_create_url(). If there is no :// in the URL then it's made sure it starts with a / and so it can't have a dangerous protocol in it. If it does have a :// then it's either a known file scheme, HTTP or HTTPS or we return FALSE.

Demoted to normal but I think this is a won't fix.

sun’s picture

I seriously think this is critical and we won't do ourselves a favor by ignoring the issue. Effectively, the originally identified unpredictable rendering chain problems in #10 and #21 of the originating issue need to be considered.

drupal_strip_dangerous_protocols(), as introduced in the other issue/patch now, is a first step to bring clarity into output handling of URLs.

More precisely, handling of potentially harmful URLs, which originate from user input, have to be sanitized via drupal_strip_dangerous_protocols(). As this new function does not escape for HTML, it can be invoked at any time - ideally, as early as possible, e.g., in a preprocess function, i.e., at a point where themers are less affected and concerned. And, at a point where we avoid the rendering chain problems.

The actual escaping for output in HTML then happens via

  • @-placeholder in t(), i.e., check_plain()
  • drupal_attributes(), which invokes check_plain() for all attributes (which is more or less the trigger for this entire chaining problem)
  • l(), always returning HTML

Effectively: check_plain() is and has to be invoked in theme() functions only. If a URL originates from user input, then the URL must be safe for output at any point in time; i.e., having dangerous stuff stripped off as early as possible in the rendering chain.

Decoupling URL security from HTML output escaping.

D7 supporting preprocess functions for regular theme functions helps here; though it's also debatable whether stripping harmful protocols and other security measures shouldn't happen before entering the theme system.

chx’s picture

@sun have you read my analysis?

sun’s picture

yes, @chx, I did. No mention of user input or rendering chain or theme system in there.

Who is responsible for stripping bad protocols from a URL that's derived from user input? Who is responsible for escaping a URL for HTML output?

All I can read out of the analysis in #3 is that url() tries to detect potentially harmful things, and if it happens to do so, it automatically invokes nothing and simply treats the URL as internal. If that is the case and actually sufficient, why do we manually invoke check_url() and filter_xss_bad_protocol() then, and why don't we always use l() or url() instead?

Counter question: Did we understand the OP and #6 on the original issue? Those just showcase two of many examples, which constitute that we have a unresolved chaining issue regarding escaping of URLs for HTML, which is too closely tied to stripping of bad protocols and general URL sanitation.

I'd bet that most Drupal developers + themers do not really have a clue of as to when to call which URL related function for sanitation/escaping in which context.

The patch in #1 simply continues what has been done in previous patches for D7. Rightfully, IMO, as all of those generic theme functions just receive data for output in HTML. Hence, check_plain() is all those theme functions are responsible for. Most of them can therefore be simplified, cleaned up, and shortened to use drupal_attributes(). The latter being no API change, therefore exists as patch in another issue already.

Continuing this route, the caller asking for themed output, or a preprocessor that runs prior to the effective theme function, is responsible for sanitizing URLs that can derive from user input. In a way, so that it can be used and passed/carried on as theme variable to subsequent theme functions, prematurely output in a theme wrapper, #pre_render, or whatnot.

At least, that would sound like a crystal clear resolution for the chaining issue at hand.

neclimdul’s picture

We really seem to be making a mountain out of a mole hill here. For one the patch in #1 seems almost reasonable(though I've got a gut feeling its not) and completely unrelated to the OP out regressions and the ensuing discussion about theme_image() and l().

Keeping with said discussion, chx specifically addressed the concerns that there was some sort of regression in l() and theme_image()'s stripping of bad protocols. For various flexibility(a slew of file api patches) and performance(#354812: filter_xss_bad_protocol is called hundreds of times on some pages.) reasons that filtering has been obfuscated some but its there. This directly address's webchick's comments.

Next, hidden in the OP is “or we took a partial step in the correct direction by deciding that generic functions should not be biased against dangerous protocols: i.e., dangerous protocols aren't bad per se...” Correct me if I miss understand this but it seems flat out wrong to assume that the code calling l() or url() or any general function like it could be called with sanitized URLs.

For l() and url(), we're passing in code that could be an alias or anything, we can't sanitize that prior to sending it in. This is also in chx's comments, ie

javascript:alert('xss');

is a valid alias. If we want l('Foo', 'node/123'); to path correctly, we can't change the assumptions url() and l() make. Technically you can do this anyways by passing array('external' => TRUE) in the options to l() though.

The same can be extended to functions dealing with the File API since the alter hook or stream handlers could do literally anything with the raw argument. So at worst, the regression there would be the possibility that file_create_url can't link to some really goofy url unless you've setup a stream handler.

The reverse doesn't really make sense since you have to expect l()'s HTML to be safe, and changing the fact that we expect url()'s return to be unsafe seems a bit insane. This really actually does address your "user input or rendering chain or theme system" concerns it seems. l(), url() are designed so they can be used late, ie in the theme system so they do what they do for a reason. They also error on the side of helping avoid bad protocols slipping through when they are used. This seems like the ideal situation.

effulgentsia’s picture

Status: Needs review » Needs work

chx and I discussed this in IRC. Here's the summary:

  • url() and file_create_url() are indeed safe (with the exception of the next bullet point), in that they return URLs that are either scheme-less (i.e., start with "/") or with a white-listed scheme. In the case of url(), the white list is $conf['filter_allowed_protocols']. In the case of file_create_url(), it's whatever a file stream wrapper class outputs in its getExternalUrl() method.
  • The exception to the above is if url() is called with $options['external'] set to TRUE. chx and I agree that this exception is bad, and if we want to support url() being called with the ability to bypass protocol security, then we should add a new option for that (e.g., $options['bypass protocol security']) and decouple it from $options['external']. Better name ideas welcome :)
  • Given the above, the only time that check_url() or drupal_strip_dangerous_protocols() needs to be called is when outputting a URL that is not the result of url() or file_create_url(). This should be rare: we should encourage developers to always route URLs through one of these two functions.
  • Given the above, we can achieve the separation sun wants. If theme functions or preprocess functions are given URLs to work with, they should only need to call check_plain(). It is the module's responsibility to ensure that URLs passed to theme functions have gone through url(), file_create_url(), or drupal_strip_dangerous_protocols() (this last one being rare, as per the above bullet point). In some cases, theme/preprocess functions aren't passed URLs but rather either a Drupal path or a variable that can either be a Drupal path or a URL. In this case, it is already the case that the theme/preprocess function must call url() or file_create_url(), or a function that will call one of those (e.g., l()).
  • The above means that essentially, we can deprecate check_url(), since we're fully decoupling where in the pipeline protocols are stripped from where in the pipeline check_plain() is called. But we need to leave this soon-to-be-useless wrapper function around for D7 BC.

I'll try to roll a patch reflecting all this (both code changes and docs) this week. But please feel free to comment in the meantime.

chx’s picture

$options['bypass protocol security']) and decouple it from $options['external']. <= more, we can remove code that acts on external entirely. Aside from the bypass protocol security, it only serves as a performance boost option and a dangerous one at that. Once again, if you send in an external url with a whitelist protocol it's handled the same if external is set or if it's not (aside from an expensive check).

effulgentsia’s picture

I haven't forgot about this, and hope to get to it soon. I just marked #858874: The more link in the node module's "Recent content" block always links to http://admin/content a duplicate.

ohnobinki’s picture

+