When token_replace() is called with array('sanitize' => FALSE), the format of the return value is undefined.
Sometimes it is plain text (i.e. text that must be passed through check_plain() before being printed on an HTML page), sometimes it is HTML (i.e. text that can be printed directly on an HTML page, but probably should be passed through filter_xss_admin() or similar), and sometimes it is somewhere in between, i.e. no matter which function you pass it through, the output will be garbled.
It works most of the time, though, because most tokens do not contain characters that are special to HTML, e.g. &, < etc.
This issue has been discussed in #461938-53: Core should consistently filter_xss_admin() on $site_slogan and check_plain $site_name and the following comments, but now it has its own ticket. There is a small overlap with the attached patch in this ticket and the latest patch for #461938, so we need a quick reroll once the first of these has been committed.
| Comment | File | Size | Author |
|---|---|---|---|
| #27 | token-sanitize-13.patch | 67.78 KB | c960657 |
| #26 | token-sanitize-12.patch | 66.02 KB | c960657 |
| #25 | token-sanitize-11.patch | 66.87 KB | c960657 |
| #23 | token-sanitize-10.patch | 66.67 KB | c960657 |
| #22 | token-sanitize-9.patch | 70.48 KB | c960657 |
Comments
Comment #1
sunComment #2
sunChanges like this are definitely wrong. When $sanitize is FALSE, then the caller is asking for the original, raw token values, not to be output, but used in other contexts.
Powered by Dreditor.
Comment #3
dave reidI would much rather prefer we can pass in $options['output'] = 'plain' or $options['output'] = 'html' or similar. By automatically applying check_plain() to token results we're removing flexibility to contrib modules that need the raw output.
Comment #4
c960657 commentedI think that depends on whether $sanitize == FALSE is supposed to indicate 1) that the output is plain text, or 2) that the output is unsanitized HTML (that should be passed through filter_xss_admin(), not check_plain()). I thought there was consensus about case 1 (i.e. the behaviour is similar to the CHECK_PLAIN/PASS_THROUGH option for drupal_set_title()).
In case 2 we should run decode_entities(strip_tags()) on tokens that may contain HTML in order to convert them to plain text. We can do that, but that actually destroys information. I don't think “sanitize” is the right word to describe this process.
Or, 3) we can have a three-state option, so the function may return either sanitized HTML, unsanitized HTML or plain-text.
Comment #5
sunSo far, the current $sanitize option is meant to return either 1) HTML or 2) *raw* token values (not escaped or sanitized in any way).
If $sanitize is FALSE, then token values are not going to be output, and also not going to end up in HTML. The caller asks for the raw token values as-is. It basically maps to the PASS_THROUGH option of dst().
If $sanitize is TRUE, then token values are going to be output in a HTML context and should therefore be sanitized and escaped in the usual way.
Right now, for some tokens, this means that we're merely escaping via check_plain(), which would map to the CHECK_PLAIN option of dst(). For other tokens, this means that we're sanitizing the value via filter_xss() or other functions.
Can we write down a concrete use-case, in which the current $sanitize option fails?
It sounds like you'd want to replace the current TRUE/FALSE values with:
Comment #6
David_Rothstein commentedSo is this a critical bug, or is it a feature request?
I don't think I understand where the bug is. Currently, if 'sanitize' is TRUE you get something that is safe to put in HTML (but with the implementing modules deciding exactly what filtering is appropriate for each token depending on where it came from). If you don't want the default filtering, you can pass 'sanitize' as FALSE and then do your own. The user mail system is an example of somewhere that does that (http://api.drupal.org/api/drupal/modules--user--user.module/function/_us...).
So like @sun, I would be curious to see a specific example of where the above doesn't work.
Comment #7
fagoI think with sanitize TRUE everything is ok, however some stuff - like text fields with text processing enabled, is sanitized by default. This might be problematic for modules expecting raw input values.
Thus, for Entity metadata, I've added a flag 'sanitized' => TRUE for the metadata of those properties. Then to ease using it I've adding a 'decode' option:
Maybe this would fit for token too?
I don't see anything critical here though.
@ improving $sanitize
I don't see why one want have specific CHECK_PLAIN, TRUE, and PASS_THROUGH options though. Only the provider of the data can know how it needs to be sanitized, not the caller.
Comment #8
c960657 commentedI'm not sure we understand each other correctly. My point is that if users just gets the raw token values, he doesn't know which format they are in.
For instance, if he calls token_replace('Our slogan is [site:slogan]', array(), array('sanitize' => FALSE)), he will get an HTML string. If he need to use that in a plain-text context (e.g. in a log file or an email subject), he will need to convert it to plain text using decode_entities(strip_tags()). On the other hand, if calls calls token_replace('This is [site:name]', array(), array('sanitize' => FALSE)), he will get a plain-text string that can be used as is. If he calls token_replace($text, array(), array('sanitize' => FALSE)) where $text is some user-supplied value, he has no way of knowing whether he needs to translate the string to plain-text or not. Calling decode_entities(strip_tags()) on a string that is already plain text will not always work.
Comment #9
sunSomehow the provided examples are bogus/wonky.
Calling
token_replace('Our slogan is [site:slogan]', array(), array('sanitize' => FALSE))for usage in a HTML context is wrong. If you want to output HTML, then you use the default value for $sanitize (TRUE).The $sanitize FALSE option solely exists for code that intends to NOT output token values, but use them in other contexts -- e.g., the file path configuration in a field's settings form.
Demoting to normal, as long as we don't have a valid bug report here.
Comment #10
c960657 commentedI was talking about a non-HTML context, e.g. in a log file or an email subject.
But let's consider the example with the file field, and let's assume that file_field_widget_uri() actually passes
array('sanitize' => FALSE)to token_replace() (it currently does not, but I assume it's supposed to - I have added this change to the patch):Assume that I have a site with geek humour with the witty name “<jokes>” and the witty slogan “and <stuff>”, i.e. that's what the users will see. On the admin form I have to enter the slogan as
and <stuff>, because the slogan field is supposed to contain HTML and is not check_plain()'ed before being displayed on the site. I have configured a file field to store its files in the directory[site:name]-[site:slogan]. This will store the files inhttp://example.com/sites/default/files/%3Cjokes%3E-and%20%26lt%3Bstuff%3E/foo.txt- in pretty-print (the text displayed in the status bar of modern browser when you hover over the link)http://example.com/sites/default/files/<jokes>-and <stuff>/foo.txt.What the user sees is the raw values, i.e. exactly what was input on the Site information page. From an end-user point of view it does not make sense, that
<appears verbatim in the URL - he would expect to see the<character be encoded the same way for the name and the slogan, i.e. so that the URL would pretty-print ashttp://example.com/sites/default/files/<jokes>-and <stuff>/foo.txt.Do you agree?
Comment #11
c960657 commentedComment #12
dave reidThe example for using the site slogan isn't really valid in this case. Although it *can* be used it's not a realistic example and this direction is not acceptable to me.
Regardless adding 'sanitize' => FALSE in file_field_widget_uri() is something that *should* be done.
Comment #14
dave reidWe probably also probably need to fix some locations where we should also be calling 'sanitize' => FALSE:
system_mail() with $context['subject']
system_send_email_action() with $context['recipient']
system_goto_action()
Comment #15
c960657 commented> The example for using the site slogan isn't really valid in this case. Although it *can*
> be used it's not a realistic example and this direction is not acceptable to me.
Some other examples: Putting the output from token_replace() in a mail subject, in a CSV file or other text file, or in a message posted on IRC.
In these cases we want the result in plain-text, not in HTML, even though some tokens contain HTML and some contain just plain text. Currently, the only way to do this is decode_entities(strip_tags(token_replace($foo, $context, array('sanitize' => TRUE))).
Example:
Create a node with the title “foo © bar” and the body “bar © <strong>foo”. On the node view page, the title appears as “foo © bar” (i.e. exactly as entered) and the body as “bar © foo” (i.e. interpreted as HTML).
Now consider this code used to generate a text file:
node-a.txt contains this:
This is a plain-text representation of what was shown on the node view page, so this is good.
node-b.txt contains this:
Here the title is right, but the body contains the raw HTML source code, i.e. the resulting text file contains a mix between HTML and plain text. This is not really useful, except perhaps for generating debug output. In most other real-world use-cases you want to convert the result into some specific format, either plain-text or HTML (or some other format) independent of how the value is represented internally in Drupal.
Comment #16
c960657 commentedComment #17
c960657 commentedComment #18
damien tournoud commentedI'm not sure I'm happy with this patch.
Just to clarify, we are mixing up two issues all over the place:
In core, we basically have four possible options:
Because we only have one parameter to control those options
'sanitized', the debate is which one of the four it should be:After all, the option "you are on your own" has absolutely no use case in core. You cannot use it to output it as plain-text, because it is *not* plain-text :)
Comment #19
c960657 commented@Damian, thank you for the clear summary.
I don't have strong opinions on whether it should be (2)+(4) or (3)+(4). My patch is implementing (3)+(4), but I'd be happy to make a new one implementing (2)+(4). My main concern is that we don't do (1).
Comment #20
c960657 commented(One year later ...)
@Damian: I agree, (2)+(4) (or posibly (2)+(3)) makes more sense.
This patch implements (2) and (4). Text passed through _text_sanitize() or check_markup() is considered safe enough (is this how we usually do?).
I also eliminated renamed the option “sanitize” to favour “html” (similar to the option for l()). I think this wording is more precise.
Comment #22
c960657 commentedComment #23
c960657 commentedReroll.
Comment #25
c960657 commentedComment #26
c960657 commentedReroll.
Comment #27
c960657 commentedComment #30
mgiffordLooks like this patch needs to be completely rewritten..
Comment #39
catchThe API has changed (with SafeMarkup and etc.) and there's no longer a $sanitize parameter.
Comment #40
catch