When token_replace() is called with array('sanitize' => FALSE), the format of the return value is undefined.

Sometimes it is plain text (i.e. text that must be passed through check_plain() before being printed on an HTML page), sometimes it is HTML (i.e. text that can be printed directly on an HTML page, but probably should be passed through filter_xss_admin() or similar), and sometimes it is somewhere in between, i.e. no matter which function you pass it through, the output will be garbled.

It works most of the time, though, because most tokens do not contain characters that are special to HTML, e.g. &, < etc.

This issue has been discussed in #461938-53: Core should consistently filter_xss_admin() on $site_slogan and check_plain $site_name and the following comments, but now it has its own ticket. There is a small overlap with the attached patch in this ticket and the latest patch for #461938, so we need a quick reroll once the first of these has been committed.

Comments

sun’s picture

Title: Text format of token_replace(,, array('sanitize' => FALSE)) is undefined » Resulting string format of token_replace(..., array('sanitize' => FALSE)) is undefined
Issue tags: +Security
sun’s picture

+++ modules/comment/comment.tokens.inc	17 Nov 2010 20:13:33 -0000
@@ -122,86 +122,85 @@ function comment_tokens($type, $tokens, 
-          $replacements[$original] = $sanitize ? check_plain($mail) : $mail;
+          $replacements[$original] = check_plain($mail);

Changes like this are definitely wrong. When $sanitize is FALSE, then the caller is asking for the original, raw token values, not to be output, but used in other contexts.

Powered by Dreditor.

dave reid’s picture

I would much rather prefer we can pass in $options['output'] = 'plain' or $options['output'] = 'html' or similar. By automatically applying check_plain() to token results we're removing flexibility to contrib modules that need the raw output.

c960657’s picture

Changes like this are definitely wrong. When $sanitize is FALSE, then the caller is asking for the original, raw token values, not to be output, but used in other contexts.

I think that depends on whether $sanitize == FALSE is supposed to indicate 1) that the output is plain text, or 2) that the output is unsanitized HTML (that should be passed through filter_xss_admin(), not check_plain()). I thought there was consensus about case 1 (i.e. the behaviour is similar to the CHECK_PLAIN/PASS_THROUGH option for drupal_set_title()).

In case 2 we should run decode_entities(strip_tags()) on tokens that may contain HTML in order to convert them to plain text. We can do that, but that actually destroys information. I don't think “sanitize” is the right word to describe this process.

Or, 3) we can have a three-state option, so the function may return either sanitized HTML, unsanitized HTML or plain-text.

sun’s picture

So far, the current $sanitize option is meant to return either 1) HTML or 2) *raw* token values (not escaped or sanitized in any way).

If $sanitize is FALSE, then token values are not going to be output, and also not going to end up in HTML. The caller asks for the raw token values as-is. It basically maps to the PASS_THROUGH option of dst().

If $sanitize is TRUE, then token values are going to be output in a HTML context and should therefore be sanitized and escaped in the usual way.

Right now, for some tokens, this means that we're merely escaping via check_plain(), which would map to the CHECK_PLAIN option of dst(). For other tokens, this means that we're sanitizing the value via filter_xss() or other functions.

Can we write down a concrete use-case, in which the current $sanitize option fails?

It sounds like you'd want to replace the current TRUE/FALSE values with:

  * @param $sanitize
  *   (optional) Defaults to TRUE. One of the following values:
  *   - TRUE: Escape and sanitize token values for HTML, respecting user access.
  *   - CHECK_PLAIN: Escape token values for HTML.
  *   - PASS_THROUGH: Return raw token values.
  *   - FALSE: Same as PASS_THROUGH for backwards compatibility.
David_Rothstein’s picture

So is this a critical bug, or is it a feature request?

I don't think I understand where the bug is. Currently, if 'sanitize' is TRUE you get something that is safe to put in HTML (but with the implementing modules deciding exactly what filtering is appropriate for each token depending on where it came from). If you don't want the default filtering, you can pass 'sanitize' as FALSE and then do your own. The user mail system is an example of somewhere that does that (http://api.drupal.org/api/drupal/modules--user--user.module/function/_us...).

So like @sun, I would be curious to see a specific example of where the above doesn't work.

fago’s picture

I think with sanitize TRUE everything is ok, however some stuff - like text fields with text processing enabled, is sanitized by default. This might be problematic for modules expecting raw input values.

Thus, for Entity metadata, I've added a flag 'sanitized' => TRUE for the metadata of those properties. Then to ease using it I've adding a 'decode' option:

   *   (optional) A keyed array of options:
   *   - sanitize: A boolean flag indicating that textual properties should be
   *     sanitized for display to a web browser. Defaults to FALSE.
   *   - decode: If set to TRUE and some textual data is already sanitized, it
   *     strips HTML tags and decodes HTML entities. Defaults to FALSE.

Maybe this would fit for token too?
I don't see anything critical here though.

@ improving $sanitize
I don't see why one want have specific CHECK_PLAIN, TRUE, and PASS_THROUGH options though. Only the provider of the data can know how it needs to be sanitized, not the caller.

c960657’s picture

Can we write down a concrete use-case, in which the current $sanitize option fails?

I'm not sure we understand each other correctly. My point is that if users just gets the raw token values, he doesn't know which format they are in.

For instance, if he calls token_replace('Our slogan is [site:slogan]', array(), array('sanitize' => FALSE)), he will get an HTML string. If he need to use that in a plain-text context (e.g. in a log file or an email subject), he will need to convert it to plain text using decode_entities(strip_tags()). On the other hand, if calls calls token_replace('This is [site:name]', array(), array('sanitize' => FALSE)), he will get a plain-text string that can be used as is. If he calls token_replace($text, array(), array('sanitize' => FALSE)) where $text is some user-supplied value, he has no way of knowing whether he needs to translate the string to plain-text or not. Calling decode_entities(strip_tags()) on a string that is already plain text will not always work.

sun’s picture

Priority: Critical » Normal

Somehow the provided examples are bogus/wonky.

Calling token_replace('Our slogan is [site:slogan]', array(), array('sanitize' => FALSE)) for usage in a HTML context is wrong. If you want to output HTML, then you use the default value for $sanitize (TRUE).

The $sanitize FALSE option solely exists for code that intends to NOT output token values, but use them in other contexts -- e.g., the file path configuration in a field's settings form.

Demoting to normal, as long as we don't have a valid bug report here.

c960657’s picture

StatusFileSize
new52.07 KB

Calling token_replace('Our slogan is [site:slogan]', array(), array('sanitize' => FALSE)) for usage in a HTML context is wrong. If you want to output HTML, then you use the default value for $sanitize (TRUE).

I was talking about a non-HTML context, e.g. in a log file or an email subject.

But let's consider the example with the file field, and let's assume that file_field_widget_uri() actually passes array('sanitize' => FALSE) to token_replace() (it currently does not, but I assume it's supposed to - I have added this change to the patch):

Assume that I have a site with geek humour with the witty name “<jokes>” and the witty slogan “and <stuff>”, i.e. that's what the users will see. On the admin form I have to enter the slogan as and &lt;stuff>, because the slogan field is supposed to contain HTML and is not check_plain()'ed before being displayed on the site. I have configured a file field to store its files in the directory [site:name]-[site:slogan]. This will store the files in http://example.com/sites/default/files/%3Cjokes%3E-and%20%26lt%3Bstuff%3E/foo.txt - in pretty-print (the text displayed in the status bar of modern browser when you hover over the link) http://example.com/sites/default/files/<jokes>-and &lt;stuff>/foo.txt.

What the user sees is the raw values, i.e. exactly what was input on the Site information page. From an end-user point of view it does not make sense, that &lt; appears verbatim in the URL - he would expect to see the < character be encoded the same way for the name and the slogan, i.e. so that the URL would pretty-print as http://example.com/sites/default/files/<jokes>-and <stuff>/foo.txt.

Do you agree?

c960657’s picture

Version: 7.x-dev » 8.x-dev
StatusFileSize
new35.43 KB
dave reid’s picture

The example for using the site slogan isn't really valid in this case. Although it *can* be used it's not a realistic example and this direction is not acceptable to me.

Regardless adding 'sanitize' => FALSE in file_field_widget_uri() is something that *should* be done.

Status: Needs review » Needs work

The last submitted patch, token-sanitize-5.patch, failed testing.

dave reid’s picture

We probably also probably need to fix some locations where we should also be calling 'sanitize' => FALSE:

system_mail() with $context['subject']
system_send_email_action() with $context['recipient']
system_goto_action()

c960657’s picture

> The example for using the site slogan isn't really valid in this case. Although it *can*
> be used it's not a realistic example and this direction is not acceptable to me.
Some other examples: Putting the output from token_replace() in a mail subject, in a CSV file or other text file, or in a message posted on IRC.

In these cases we want the result in plain-text, not in HTML, even though some tokens contain HTML and some contain just plain text. Currently, the only way to do this is decode_entities(strip_tags(token_replace($foo, $context, array('sanitize' => TRUE))).

Example:
Create a node with the title “foo &copy; bar” and the body “bar &copy; <strong>foo”. On the node view page, the title appears as “foo &copy; bar” (i.e. exactly as entered) and the body as “bar © foo” (i.e. interpreted as HTML).

Now consider this code used to generate a text file:

$string = "Title: [node:title]\nBody: [node:body]";
$node = node_load(1);
$context = array('node' => $node);
file_put_contents('/tmp/node-a.txt', decode_entities(strip_tags(token_replace($string, $context, array('sanitize' => TRUE)))));
file_put_contents('/tmp/node-b.txt', token_replace($string, $context, array('sanitize' => FALSE)));

node-a.txt contains this:

Title: foo &copy; bar
Body: bar © foo

This is a plain-text representation of what was shown on the node view page, so this is good.

node-b.txt contains this:

Title: foo &copy; bar
Body: bar &copy; <strong>foo</strong>

Here the title is right, but the body contains the raw HTML source code, i.e. the resulting text file contains a mix between HTML and plain text. This is not really useful, except perhaps for generating debug output. In most other real-world use-cases you want to convert the result into some specific format, either plain-text or HTML (or some other format) independent of how the value is represented internally in Drupal.

c960657’s picture

StatusFileSize
new38.18 KB
c960657’s picture

Status: Needs work » Needs review
damien tournoud’s picture

I'm not sure I'm happy with this patch.

Just to clarify, we are mixing up two issues all over the place:

  • The output format: HTML, plain-text, something else?
  • The sanitization: when output as HTML, is the content pre-filtered for XSS or not?

In core, we basically have four possible options:

  1. raw value (ie. not processed in any way); you are on your own
  2. plain-text (if the source value was HTML, is has been un-encoded)
  3. HTML (unfiltered) (note: if the source value was plain-text, it has been check_plained())
  4. HTML (filtered) (note: if the source value was plain-text, it has been check_plained())

Because we only have one parameter to control those options 'sanitized', the debate is which one of the four it should be:

  • @c960657 here is arguing that it should be (3) and (4)
  • @sun and @David_Rothstein are arguing that it should be (1) and (4)
  • I am arguing that it should be (2) and (4)

After all, the option "you are on your own" has absolutely no use case in core. You cannot use it to output it as plain-text, because it is *not* plain-text :)

c960657’s picture

@Damian, thank you for the clear summary.

I don't have strong opinions on whether it should be (2)+(4) or (3)+(4). My patch is implementing (3)+(4), but I'd be happy to make a new one implementing (2)+(4). My main concern is that we don't do (1).

c960657’s picture

StatusFileSize
new70.34 KB

(One year later ...)

@Damian: I agree, (2)+(4) (or posibly (2)+(3)) makes more sense.

This patch implements (2) and (4). Text passed through _text_sanitize() or check_markup() is considered safe enough (is this how we usually do?).

I also eliminated renamed the option “sanitize” to favour “html” (similar to the option for l()). I think this wording is more precise.

Status: Needs review » Needs work

The last submitted patch, token-sanitize-8.patch, failed testing.

c960657’s picture

Status: Needs work » Needs review
StatusFileSize
new70.48 KB
c960657’s picture

StatusFileSize
new66.67 KB

Reroll.

Status: Needs review » Needs work

The last submitted patch, token-sanitize-10.patch, failed testing.

c960657’s picture

Status: Needs work » Needs review
StatusFileSize
new66.87 KB
c960657’s picture

StatusFileSize
new66.02 KB

Reroll.

c960657’s picture

StatusFileSize
new67.78 KB

Status: Needs review » Needs work
Issue tags: -Security

The last submitted patch, token-sanitize-13.patch, failed testing.

mgifford’s picture

Issue summary: View changes

Looks like this patch needs to be completely rewritten..

Version: 8.0.x-dev » 8.1.x-dev

Drupal 8.0.6 was released on April 6 and is the final bugfix release for the Drupal 8.0.x series. Drupal 8.0.x will not receive any further development aside from security fixes. Drupal 8.1.0-rc1 is now available and sites should prepare to update to 8.1.0.

Bug reports should be targeted against the 8.1.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.9 was released on September 7 and is the final bugfix release for the Drupal 8.1.x series. Drupal 8.1.x will not receive any further development aside from security fixes. Drupal 8.2.0-rc1 is now available and sites should prepare to upgrade to 8.2.0.

Bug reports should be targeted against the 8.2.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.6 was released on February 1, 2017 and is the final full bugfix release for the Drupal 8.2.x series. Drupal 8.2.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.3.0 on April 5, 2017. (Drupal 8.3.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.3.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.6 was released on August 2, 2017 and is the final full bugfix release for the Drupal 8.3.x series. Drupal 8.3.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.4.0 on October 4, 2017. (Drupal 8.4.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.4.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.4 was released on January 3, 2018 and is the final full bugfix release for the Drupal 8.4.x series. Drupal 8.4.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.5.0 on March 7, 2018. (Drupal 8.5.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.5.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.6 was released on August 1, 2018 and is the final bugfix release for the Drupal 8.5.x series. Drupal 8.5.x will not receive any further development aside from security fixes. Sites should prepare to update to 8.6.0 on September 5, 2018. (Drupal 8.6.0-rc1 is available for testing.)

Bug reports should be targeted against the 8.6.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.6.x-dev » 8.8.x-dev

Drupal 8.6.x will not receive any further development aside from security fixes. Bug reports should be targeted against the 8.8.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.9.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 8.8.x-dev » 8.9.x-dev

Drupal 8.8.7 was released on June 3, 2020 and is the final full bugfix release for the Drupal 8.8.x series. Drupal 8.8.x will not receive any further development aside from security fixes. Sites should prepare to update to Drupal 8.9.0 or Drupal 9.0.0 for ongoing support.

Bug reports should be targeted against the 8.9.x-dev branch from now on, and new development or disruptive changes should be targeted against the 9.1.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

catch’s picture

Status: Needs work » Closed (outdated)

The API has changed (with SafeMarkup and etc.) and there's no longer a $sanitize parameter.

catch’s picture

Issue tags: +Bug Smash Initiative