Clean up plain-text checking [#18817]

Back story

Bart Jansens brought up some possible XSS issues recently which prompted me to reevaluate the text checking in Drupal.

The big issue is "plain text": what a user typically enters in single-line boxes. When this text is put into HTML, special characters like < > and & need to be escaped. When it is put into an HTML attribute, " needs to be escaped too. Note that it is not about filtering (what check_output() does), but about handling plain-text (commonly associated with single-line editboxes).

The current 'standard' in Drupal is to do escaping when outputting the titles in HTML, and keep the plain-text in the database. We should keep this as it is in line with "keep exactly what the user typed and process it later". Furthermore it is hard to undo this escaping, which is needed when communicating with other services or when exporting data.

This brings up the problem: where should text be escaped? There are in fact many places where this escaping is not done yet, which means there are problems if the user enters e.g. < or & in a title. Usually these bugs are in admin areas, so not critical for security, but they should still be fixed in the spirit of XHTML validation and usability.

So this means that before the call to check_plain(), the data is plain-text and no HTML can be used in it. Once the text has been run through check_plain(), it is HTML, and tags can be added. Logically, it is easy: data should be escaped as soon as it becomes HTML, but exactly deciding where data becomes HTML is tricky.

Any parameter which is passed as plain-text and only checked later in the code can not contain HTML, so I've had to decide where to add missing escapings based on context and common sense.

This issue also affects URLs, but there the XSS issues are more important. Plus all of this should not be confused with urlencode() which is simply used to put arbitrary data into an URL without it disrupting normal URL semantics (directories, anchors, query arguments, etc).

Concrete changes

drupal_specialchars() and check_form() were almost the same, except that one would escape " while the other wouldn't. This difference is negligable and it does no harm to escape too much. In the interest of cleaning up text handling, I merged these into check_plain().
check_url() had a toss at extra XSS protection in it, but it was in fact broken and wholly incorrect. I replaced it with what was probably the original intention.
Many URL handling functions in fact worked with HTML-escaped URLs (using & instead of & for example). I got rid of this, and did the escaping in a more logical place (e.g. drupal_goto no longer has to reconvert the &).
Node/comment titles used to be run through strip_tags(), which means you couldn't use HTML but were still forced to escape angle brackets and amps. I changed them to be consistent with other plain-text fields in Drupal.
Profile checkbox items were handled inconsistently as well and were fixed.
The l() function now escapes the title by default, in the interest of keeping the code cleaner. But the old behaviour is still available (and needed in some places) with an extra parameter.

Text handling in Drupal after this patch

Any piece of user submitted text should be run through one of the check_ functions before being put into HTML. Use check_plain() for plain-text, check_output() for rich text.
Dynamic data in URLs should be urlencode()d, regardless of where that URL ends up.
URLs in HTML attributes should be check_url()'d.

If you are unsure if you're doing the right thing, the easiest test is to try out putting HTML tags in your module's fields where they are not intended. They should be displayed on-screen and not interpreted.

Comment	File	Size	Author
#9	check_1.patch	121.43 KB	Steven
#9
#7	check_0.patch	91.92 KB	Steven
#7
	check.patch	91.83 KB	Steven

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Comment #1

Bart Jansens CreditAttribution: Bart Jansens commented 15 March 2005 at 13:10

Looks good.
I spent some time trying to find places where the check_plain() call was missing and I found only one; the URL of an aggregator feed isn't checked.

One thing that isn't entirely clear is who is responsible for calling check_plain() or check_output(). In some modules this is done by a theme function, in others the data passed to the theme functions is already checked. I'm not saying this patch should fix all this (it's already big enough), but just something to keep in mind for the future. Currently some themes might forget a check.

Comment #2

Steven CreditAttribution: Steven commented 16 March 2005 at 01:29

For nodes/comments the responsibility lies with the theme engine anyway, not the template. The other cases are not themed as often.

Comment #3

TDobes CreditAttribution: TDobes commented 28 March 2005 at 10:49

Steven, these are excellent and important changes, but the patch is quite large. If this is intended to land before 4.6 is released, I recommend that it land sooner rather than later so that it receives more widespread testing to iron out any bugs that may appear.

In the patches to theme.inc, themes/chameleon.theme, and themes/engines/xtemplate/xtemplate.engine, you pass the page title through strip_tags (fixing this bug) but not through check_plain. After we run it through strip_tags, is there any reason why we wouldn't want to run it through check_plain to escape text just like anywhere else? In fact, it might be useful to add a second parameter to check_plain that would do this for us (strip_tags first, then htmlspecialchars) We need to update the theme documentation after this patch lands so that contrib themes display titles in a manner consistent with core.

In the patch for modules/comment.inc, the strip_tags function is removed from the comment subject auto-generation code. Because comment bodies can contain HTML, there should probably be a strip_tags in there somewhere to prevent undesirable results. I see that the comment subjects are passed through a check_plain which should render the HTML inert, but it would still be more pleasant to see something like "What I think is" rather than "What I think"

In modules/path.module, there is a typo: check_plani should be check_plain

Note that this patch also includes a fix for this bug as well. (Invalid search results link in the watchdog.)

Comment #4

TDobes CreditAttribution: TDobes commented 28 March 2005 at 10:52

I forgot to say:
After the minor issues I noted are addressed, +1. This is very helpful for usability as it removes the burden of escaping text from users. It's also a major plus for consistency; before this patch, I could name at least one field that was escaped on some pages and not on others. Thanks for addressing this!

Comment #5

Uwe Hermann CreditAttribution: Uwe Hermann commented 28 March 2005 at 16:41

+1. Looks good.

Comment #6

Dries CreditAttribution: Dries commented 28 March 2005 at 17:04

The approach taken in this patch looks good. I second that this should be committed ASAP (1) such that it gets tested before the Drupal 4.6 release and (2) such that contributed modules/themes can update their code accordingly. (Note that I haven't actually tested the patch -- I just looked at the diff.)

I'm a bit worried about people not using check_plain() -- it is easy enough to forget wrapping $node->title in check_plain(). I wonder what we can do about that, or what the implications could be like. The least we could do is add a test to the code-checker script. Either way, it improves the current behavior so +1 from me.

Comment #7

Steven CreditAttribution: Steven commented 28 March 2005 at 23:24

File	Size
check_0.patch	91.92 KB

The forgetting to use check_plain() argument is IMO not valid, because such checking needs to be done already. Because of the very inconsistent way that node/comment titles were handled, this was not necessary before for just this case, leading people to believe that it was allowed to be omitted everywhere else too. Some modules (like aggregator) had no escaping in them whatsoever.

In any case, I will write a short text explaining how to deal with text processing in Drupal. It is an important topic, and I must say I'm quite disappointed by how the size of this patch has scared away most people. It's an important issue that needs to be adressed and taken into account by everyone.

But the best way to check is to try to put HTML into plain-text fields and making sure it comes out literally. E.g. "this text should not be underlined".

In the patches to theme.inc, themes/chameleon.theme, and themes/engines/xtemplate/xtemplate.engine, you pass the page title through strip_tags (fixing this bug) but not through check_plain. After we run it through strip_tags, is there any reason why we wouldn't want to run it through check_plain to escape text just like anywhere else? In fact, it might be useful to add a second parameter to check_plain that would do this for us (strip_tags first, then htmlspecialchars)

First of all, a plain strip_tags() should never be used to validate or check user input, as it prevents you from using any HTML, but still forces you to escape special characters manually. Hence you get all the downsides of HTML but no advantages. This is a very rare case, so I would not modify check_plain() for this.

The reason we do strip_tags() on the page title is different: the page title, at this point, has already been converted into HTML. It may contains HTML tags (as the theme_confirm issue shows). And it is inserted without any extra processing into the <h1> tag of the page. However, the <title> tag in the HTML <head> may not contain any tags, just HTML-escaped text. So, we strip out any tags, and present only the text. Because the title was HTML before this point, we do not need to do any further processing. Remember, check_plain() converts from plain-text to HTML. Strip_tags() still outputs HTML, albeit without any tags in it.

If the source of the page title was dynamic plain-text user input, then at one point or the other it should've been check_plain()ed before being passed to drupal_set_title().

In the patch for modules/comment.inc, the strip_tags function is removed from the comment subject auto-generation code. Because comment bodies can contain HTML, there should probably be a strip_tags in there somewhere to prevent undesirable results. I see that the comment subjects are passed through a check_plain which should render the HTML inert, but it would still be more pleasant to see something like "What I think is" rather than "What I think"

The reason I did this is because strip_tags() automatically assumes the comment is formatted as HTML. This is not always the case due to flexible input formats being used. On top of that, due to the comment title extraction being run before filtering, the chance of there being HTML inside the title is lower.

And finally, because now we auto-generate the comment on preview rather than submit, users will see if HTML is included in the subject, and can correct it if needed. I think this is an acceptable compromise.

I've attached an up-to-date patch.

Comment #8

TDobes CreditAttribution: TDobes commented 29 March 2005 at 09:23

Steven: After reading your comments, I agree with you and retract my earlier complaints. +1 for applying this soon.

However, I still think leaving the strip_tags on the comment subject autogeneration code will do no harm and may help in some cases. Some sites may actually operate with the comment subject field turned off so all comment subjects are auto-generated. If those sites allow HTML, HTML in comment subjects is nearly guaranteed. For sites that don't allow HTML, although it will not be helpful, I can't see how it will do any harm.

Also, your patches do not include a strip_tags for the <title> element in chameleon.theme. You'll notice that the $title variable actually isn't yet defined when it's used... there's a patch waiting to fix that.

Comment #9

Steven CreditAttribution: Steven commented 29 March 2005 at 19:59

File	Size
check_1.patch	121.43 KB

Here's an updated patch. Uwe pointed to some more issues in user.module, and I found several more myself.

By TDobes' suggestion, I readded the strip_tags() for comment subject autogeneration, but only if the body matches '<[A-Za-z]'. This is the best compromise I think. Of course you are still free to use amps and angles yourself in the subject, they are simply left out when autogenerating.

I also fixed the chameleon issue while I was at it, because this patch affects the same code and it's gotten a green light anyway.

Comment #10

Steven CreditAttribution: Steven commented 29 March 2005 at 20:01

There is one more improvement that can go in... the creation of a theme_dynamic_text() function which returns ''. check_plain($foo) .'' . This is a very common piece of code.

I'll post an updated patch later, but in the meantime you can try it out already, as this last change won't affect functionality.

The best test you can do is to fill in (complete!) HTML tags in various plain text fields and see if they come out literally rather than as HTML.

Comment #11

Steven CreditAttribution: Steven commented 29 March 2005 at 20:40

Okay, I forgot the chameleon fix... will post a complete patch later, with theme_dynamic_text() implemented.

Comment #12

Steven CreditAttribution: Steven commented 31 March 2005 at 09:12

I committed a modified patch after a green light from Dries:

- I added theme('placeholder', $text) which returns ''. check_plain($text) .''. This is used a lot, plus it improves themability.

- I fixed the chameleon title bug.

- I changed the comment subject generation. Now it:
1) Filters the comment body into HTML (check_output)
2) Strips out all HTML tags (strip_tags)
3) Converts entities back to plain-text (decode_entities)
This will safely handle any markup in the comment. For example, if the comment starts with a codefilter escaped <?php ?> block, it will end up literally in the subject again without any color markup or entities.