When handling and outputting text in HTML, you need to be careful that proper filtering or escaping is done. Otherwise, there might be bugs when users try to use angle brackets or ampersands, or, worse, you could open up XSS exploits.
When handling data, the golden rule is to store exactly what the user typed. When a user edits a post they created earlier, the form should contain the same things as it did when they first submitted it. This means that conversions are performed when content is output, not when saved to the database (be sure to read the db_query() documentation on how to use the database API securely).
To help you see where checks are needed, it is handy to mentally 'color' in each string depending on which format its data is in. Is it plain-text, HTML, BBcode or Textile? Then, whenever you concatenate two strings, you need to make sure they are both in the same format. If they are not, an appropriate check, conversion or filtering must be applied.
User-submitted data in Drupal can be divided into three categories:
This is simple text without any markup. What the user entered is displayed on screen exactly as typed, and is not interpreted in any way. This is generally the format used for single-line text fields.
When outputting plain-text, you need to pass it through check_plain() before it can be put inside HTML. This will convert quotes, ampersands and angle brackets into entities, causing the string to be shown literally on screen in the browser.
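As a rough illustration of what this escaping does, the sketch below uses PHP's built-in htmlspecialchars() as a stand-in for check_plain(). The helper name example_escape() is hypothetical and is not part of Drupal; in real module code you would call check_plain() itself.

```php
<?php
// Hypothetical stand-in for check_plain() (an assumption, not Drupal's code):
// escape the characters that are special in HTML, including both quote types.
function example_escape($text) {
  return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Plain-text exactly as a user might have typed it.
$title = 'Tom & "Jerry" <u>xss</u>';

// The escaped string can be placed between tags or inside an attribute value.
print '<h2 title="' . example_escape($title) . '">' . example_escape($title) . "</h2>\n";
```

The same escaped value is used in both positions because quotes are converted too, which is what makes it usable inside an attribute.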
Most themeable functions and APIs take HTML for their arguments, and there are a few that automatically sanitize text by first passing it through check_plain():
- t(): the placeholders (e.g. '%name' or '@name') are passed as plain-text and will be escaped when inserted into the translatable string. You can disable this escaping by using placeholders of the form '!name' but only if you are sure that the string is safe.
- l(): the link caption should be passed as plain-text (unless this behavior is explicitly overridden).
- menu items and breadcrumbs: the menu item titles and breadcrumb titles are automatically sanitized.
- theme('placeholder'): the placeholder text is plain-text.
- Block descriptions (but not titles--see below)
- User names when printed using theme_username() (Drupal 6 and earlier only. D7 expects the name to be sanitized already).
- Form API (FAPI) #default_value element
- Form API (FAPI) #options element when the type is a select box (in Drupal 7 #options element is always sanitized).
$form['safe'] = array(
  '#type' => 'textfield',
  '#default_value' => $u_supplied, // FAPI will pass through check_plain()
);
$form['also_safe'] = array(
  '#type' => 'select',
  '#default_value' => 0, // FAPI will pass through check_plain()
  // FAPI will sanitize the '#options' attribute with check_plain() for select boxes.
  '#options' => node_get_types('names'),
);
// In Drupal 7 this is XSS safe as options are run through filter_xss_admin().
$form['drupal6_unsafe'] = array(
  '#type' => 'checkboxes',
  '#default_value' => 0, // FAPI will pass through check_plain()
  // In Drupal 6, FAPI will NOT sanitize the '#options' attribute on other elements than select boxes.
  '#options' => node_get_types('names'),
);
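The difference between the t() placeholder prefixes can be pictured with a small imitation. This is only a sketch: example_format() is a hypothetical helper that performs no translation, and htmlspecialchars() stands in for check_plain(). It assumes the Drupal 6 behavior in which '%' placeholders are escaped and additionally wrapped in emphasis.

```php
<?php
// Hypothetical, simplified imitation of how t() treats its placeholders.
// It does not translate anything; it only shows the escaping behaviors.
function example_format($string, array $args = array()) {
  foreach ($args as $key => $value) {
    switch ($key[0]) {
      case '@':
        // Escaped plain-text.
        $args[$key] = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
        break;

      case '%':
        // Escaped plain-text, wrapped in emphasis.
        $args[$key] = '<em>' . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . '</em>';
        break;

      case '!':
        // Inserted as-is: only acceptable for strings that are already safe HTML.
        break;
    }
  }
  return strtr($string, $args);
}

$name = '<script>alert(1)</script>';
print example_format('Hello @name', array('@name' => $name)) . "\n"; // escaped
print example_format('Hello !name', array('!name' => $name)) . "\n"; // raw: XSS if $name is user input
```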
Some places require that you first sanitize any text:
- Drupal 6 and earlier only: page titles set through drupal_set_title(). The page title is displayed in the HTML of the page, where it makes sense to use tags like <em> for clarity. When the page title is displayed in the HTML <title> tag, however, all tags will be stripped out.
drupal_set_title($node->title); // XSS vulnerability in D6, correct in D7
drupal_set_title(check_plain($node->title)); // Correct in D6
- Drupal 6 and earlier only: block titles passed in through hook_block(). For the same reason as the page title, using HTML here is commonly done.
- Watchdog messages
Drupal 6/7 (the message and variables are passed through t() by the watchdog function):
watchdog('content', "Deleted !title", array('!title' => $node->title)); // XSS
watchdog('content', "Deleted %title", array('%title' => $node->title)); // Safe; @title works as well
- Form elements #description and #title
$form['bad'] = array(
  '#type' => 'textfield',
  '#default_value' => check_plain($u_supplied), // bad: escaped twice
  '#description' => t("Old data: !data", array('!data' => $u_supplied)), // XSS
);
$form['good'] = array(
  '#type' => 'textfield',
  '#default_value' => $u_supplied,
  '#description' => t("Old data: @data", array('@data' => $u_supplied)),
);
- Drupal 6 only: Form elements - #options when #type = checkboxes or #type = radios
// This is XSS safe in Drupal 7+.
$form['drupal6_bad'] = array(
  '#type' => 'checkboxes',
  '#options' => array($u_supplied0, $u_supplied1),
);
$form['good'] = array(
  '#type' => 'checkboxes',
  '#options' => array(check_plain($u_supplied0), check_plain($u_supplied1)),
);
- Form elements - #value of #type markup and item need to be safe. Note that the default form element #type is markup!
$form['unsafe'] = array('#value' => $user->name); // XSS
$form['safe'] = array('#value' => check_plain($user->name));
// or
$form['safe'] = array('#value' => theme('username', $user));
This is text which is marked up in some language (HTML, Textile, etc). It is stored in the markup-specific format, and converted to HTML on output using the various filters that are enabled. This is generally the format used for multi-line text fields.
All you need to do is pass the rich text through check_markup() and you'll get HTML returned, safe for outputting. You should also allow the user to choose the input format with a format widget through filter_form() and should pass the chosen format along to check_markup().
Note that you must make sure that the author of a post is allowed to use a particular input format, typically by checking with filter_access() when the content is being submitted. In Drupal 6, check_markup() performs this check for the current user by default. However, because content is filtered on output, the current user is often not the person who originally wrote the content. In that case, you can disable this check by passing $check = false to check_markup().
As of Drupal 4.7 there is a third way of dealing with text. There are some places in the administration section where it is impractical to invoke the filter system (for rich text), but where some simple markup is desired, such as a link or some emphasis (so plain text is not acceptable).
Examples include the mission statement, posting guidelines, and forum descriptions.
For such cases, you can use a regular text-area, and pass the text through filter_xss_admin() when you output it. This will allow most HTML tags to pass through, while still blocking possibly harmful scripts or styles.
URLs across Drupal require special handling in two ways:
- If you wish to put any sort of dynamic data into a URL, you need to pass it through urlencode(). If you don't, characters like '#' or '?' will disrupt the normal URL semantics. urlencode() will prevent this by escaping them with %XX syntax. Note that Drupal paths (e.g. 'node/123') are passed through urlencode() as a whole since Drupal 4.7, so you don't need to urlencode individual parts of them. This convenience does not apply to other parts of the URL like GET query arguments or fragment identifiers.
- When using user-submitted URLs in a hyperlink, you need to use check_url() rather than just check_plain(). check_url() will call check_plain(), but also perform additional XSS checks to ensure the URL is safe for clicking on.
Note that all Drupal functions which return URLs (url(), request_uri(), etc.) output plain URLs which have not been HTML escaped in any way (in other words, they are plain-text). Remember to use check_url() to escape them when outputting HTML (or XML). Don't use check_url() in situations where a real URL is expected, e.g. in the HTTP Location: ... header.
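The two rules can be combined in a short sketch. Here urlencode() is the real PHP function, while example_check_url() is a hypothetical and deliberately strict stand-in for check_url(): it assumes only absolute http and https URLs are acceptable, which is a narrower allowlist than Drupal's, and uses htmlspecialchars() in place of check_plain(). The example.com address is a placeholder.

```php
<?php
// Hypothetical stand-in for check_url() (an assumption, not Drupal's code).
function example_check_url($url) {
  // Allowlist: anything that is not an absolute http(s) URL is rejected.
  if (!preg_match('@^https?://@i', $url)) {
    return '';
  }
  // The URL is plain-text, so it still has to be escaped for HTML.
  return htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
}

// Dynamic data placed in a query argument must be encoded first.
$keywords = 'cats & dogs #1?';
$url = 'http://example.com/search?keys=' . urlencode($keywords);

// In HTML output, the URL is checked and escaped.
print '<a href="' . example_check_url($url) . '">search</a>' . "\n";
print '<a href="' . example_check_url('javascript:alert(1)') . '">blocked</a>' . "\n";

// For an HTTP header, the raw, unescaped URL would be used instead:
// header('Location: ' . $url);
```

Rejecting by allowlist rather than searching for known-bad schemes keeps the sketch short; it is not meant to reproduce the exact set of protocols Drupal permits.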
All the rules above can be summed up quite easily: no piece of user-submitted content should ever be placed as-is into HTML. If you are unsure of whether this is the case, you can always test it by submitting a piece of text like <u>xss</u> into your module's fields. If the text comes out underlined or mangles existing tags, you know you have a problem.
Here are some examples of good and bad code. The variables used here ($title for the title, $url for the URL, and a third holding the marked-up text) are assumed to be user-submitted fields. They are fresh from the database and thus contain exactly what the user submitted without any changes.
Bad (the title is placed into HTML without any escaping):
<?php print "<tr><td>$title</td></tr>"; ?>
<?php print "<a href=\"/...\" title=\"$title\">view node</a>"; ?>
Good (the title is plain-text and may not be placed into HTML as is):
<?php print '<tr><td>'. check_plain($title) .'</td></tr>'; ?>
<?php print '<a href="/..." title="'. check_plain($title) .'">view node</a>'; ?>
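Bad (l() already sanitizes the link caption, so the title ends up escaped twice):
<?php print l(check_plain($title), 'node/'. $nid); ?>
Good (the caption is passed to l() as plain-text):
<?php print l($title, 'node/'. $nid); ?>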
Bad (the URL is either not escaped at all, or only escaped with check_plain()):
<?php print "<a href=\"/$url\">"; ?>
<?php print '<a href="/'. check_plain($url) .'">'; ?>
Good (URLs must be checked with check_url()):
<?php print '<a href="/'. check_url($url) .'">'; ?>
When writing a filter which translates from another markup language into HTML, you need to ensure you don't open any holes yourself. Generally, the same rules apply: check URLs with check_url() and ensure no literal HTML can be injected by escaping appropriately using check_plain().
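As a minimal sketch of that approach, the hypothetical filter below converts a single BBcode-style construct, [url=ADDRESS]TEXT[/url], into a link. The function names are invented for illustration, htmlspecialchars() stands in for check_plain(), and the scheme check is a simplified stand-in for check_url() that assumes only http and https links are wanted. A real filter would also need to be registered with the filter system, which is not shown.

```php
<?php
// Hypothetical callback: build the link from already-escaped pieces.
function example_filter_url_callback($matches) {
  // Allowlist the scheme; anything else is left as (escaped) literal text.
  if (!preg_match('@^https?://@i', $matches[1])) {
    return $matches[0];
  }
  return '<a href="' . $matches[1] . '">' . $matches[2] . '</a>';
}

// Hypothetical filter: escape everything first, then convert the markup.
function example_filter($text) {
  // After this step no literal HTML from the user can survive.
  $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
  // The captured address and caption are therefore already escaped.
  return preg_replace_callback(
    '@\[url=([^\]\s]+)\](.*?)\[/url\]@s',
    'example_filter_url_callback',
    $safe
  );
}

print example_filter('[url=http://example.com/?a=1&b=2]<b>hi</b>[/url]') . "\n";
print example_filter('[url=javascript:alert(1)]click[/url]') . "\n";
```

Escaping the whole input before doing any conversion means the only HTML in the result is the HTML the filter itself generates.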