Handle text in a secure fashion

When handling and outputting text in HTML, make sure that proper filtering or escaping is done. Otherwise bugs may appear when users enter angle brackets or ampersands, or worse, you could open up cross-site scripting (XSS) holes.

When handling data, the golden rule is to store exactly what the user typed. When a user edits a post they created earlier, the form should contain the same things as it did when they first submitted it. This means that conversions are performed when content is output, not when saved to the database (be sure to read the db_query() documentation on how to use the database API securely).

To help you see where checks are needed, it is handy to mentally 'color' each string according to the format its data is in. Is it plain-text, HTML, BBCode or Textile? Then, whenever you concatenate two strings, make sure they are both in the same format. If they are not, an appropriate check, conversion or filter must be applied first.
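
The 'coloring' idea can be sketched in a few lines of plain PHP. This is only an illustration: example_check_plain() is a hypothetical stand-in that escapes with htmlspecialchars() so the snippet runs outside Drupal, and the variable names are invented for the example.

```php
<?php
// Hypothetical stand-in for check_plain(), so the sketch runs outside Drupal.
function example_check_plain($text) {
  return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

$name = 'Tom & <b>Jerry</b>';            // "colored" plain-text: exactly what the user typed
$html = '<p>Posted by <em>%s</em></p>';  // "colored" HTML: markup written by the developer

// Wrong: a plain-text string is dropped into an HTML string unchanged.
$bad = sprintf($html, $name);

// Right: the plain-text string is converted to HTML before the two are joined.
$good = sprintf($html, example_check_plain($name));

print $good . "\n";
```

In $bad the user's <b> tag would be interpreted by the browser; in $good it is shown literally.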

User-submitted data in Drupal can be divided into three categories:

  1. Plain-text

    This is simple text without any markup. What the user entered is displayed on screen exactly as typed, and is not interpreted in any way. This is generally the format used for single-line text fields.

    When outputting plain-text, you need to pass it through check_plain() before it can be put inside HTML. This will convert quotes, ampersands and angle brackets into entities, causing the string to be shown literally on screen in the browser.

    Most themeable functions and APIs take HTML for their arguments, but a few take plain-text and sanitize it automatically by first passing it through check_plain():

    • t(): the placeholders (e.g. '%name' or '@name') are passed as plain-text and will be escaped when inserted into the translatable string. You can disable this escaping by using placeholders of the form '!name'.
    • l(): the link caption should be passed as plain-text (unless overridden with the $html parameter).
    • menu items and breadcrumbs: the menu item titles and breadcrumb titles are automatically sanitized.
    • theme('placeholder'): the placeholder text is plain-text.
    • Block descriptions (but not titles--see below)
    • User names when printed using theme_username()
    • Form API (FAPI) #default_value element and #options element when the type is a select box.
      Examples:
      $form['safe'] = array(
        '#type' => 'textfield',
        '#default_value' => $u_supplied,
      );

      $form['also_safe'] = array(
        '#type' => 'select',
        '#default_value' => 0,
        // May contain unsafe values, but FAPI passes each option through
        // check_plain() before displaying it to the user.
        '#options' => node_get_types('names'),
      );

    Some places require that you first sanitize any text:

    • page titles set through drupal_set_title(). The page title is displayed in the page's HTML, where it makes sense to use tags like <em> for clarity. When the page title is output in the HTML <title> tag, however, all tags are stripped out.
      Examples:
      drupal_set_title($node->title); // XSS vulnerability, bad
      drupal_set_title(check_plain($node->title));  // Correct
    • block titles passed in through hook_block(). As with the page title, HTML is commonly used here.
    • Watchdog messages
      Examples:
      Drupal 5:
      watchdog('content', t("Deleted !title", array('!title' => $node->title))); // XSS
      watchdog('content', t("Deleted %title", array('%title' => $node->title))); // Correct (@title also works)

      Drupal 6 (the message and variables are passed through t() by the watchdog function):
      watchdog('content', "Deleted !title", array('!title' => $node->title)); // XSS
      watchdog('content', "Deleted %title", array('%title' => $node->title)); // Correct (@title also works)
    • Form elements #description and #title
      Examples:
      $form['bad'] = array(
        '#type' => 'textfield',
        '#default_value' => check_plain($u_supplied),  // bad: escaped twice
        '#description' => t("Old data: !data", array('!data' => $u_supplied)), // XSS
      );

      $form['good'] = array(
        '#type' => 'textfield',
        '#default_value' => $u_supplied,
        '#description' => t("Old data: @data", array('@data' => $u_supplied)),
      );
    • Form elements - #options when #type = checkboxes
      Examples:
      $form['bad'] = array(
        '#type' => 'checkboxes',
        '#options' => array($u_supplied0, $u_supplied1),
      );

      $form['good'] = array(
        '#type' => 'checkboxes',
        '#options' => array(check_plain($u_supplied0), check_plain($u_supplied1)),
      );
    • Form elements - #value of #type markup and #type item must already be safe. Note that the default form element #type is markup!
      Examples:
      $form['unsafe'] = array('#value' => $user->name); // XSS
      $form['safe'] = array('#value' => check_plain($user->name));
      or
      $form['safe'] = array('#value' => theme('username', $user));
  2. Rich text

    This is text which is marked up in some language (HTML, Textile, etc). It is stored in the markup-specific format, and converted to HTML on output using the various filters that are enabled. This is generally the format used for multi-line text fields.

    All you need to do is pass the rich text through check_markup() and you'll get HTML returned, safe for outputting. You should also allow the user to choose the input format with a format widget through filter_form() and should pass the chosen format along to check_markup().

    Note that you must make sure that the author of a post is allowed to use a particular input format. As a safeguard, check_markup() performs this check for the current user by default. However, because content is filtered on output, the current user is often not the person who originally wrote the content. In that case, you must disable this check by passing $check = FALSE to check_markup(), and make sure that the format is checked with filter_access() when the content is submitted.

  3. Admin-only HTML

    As of Drupal 4.7 there is a third way of dealing with text. There are some places in the administration section where it is impractical to invoke the filter system (for rich text), but where some simple markup is desired, such as a link or some emphasis (so plain text is not acceptable).

    Examples include the mission statement, posting guidelines, and forum descriptions.

    For such cases, you can use a regular textarea, and pass the text through filter_xss_admin() when you output it. This allows most HTML tags to pass through, while still blocking potentially harmful scripts and styles.
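
For the rich text case (category 2), the split between checking the format at submission time and filtering at output time might look roughly like the sketch below. It assumes the Drupal 6 signatures filter_access($format) and check_markup($text, $format, $check). The two guarded definitions are simplified stand-ins that exist only so the snippet runs outside Drupal; they do not reproduce the real filter system. The example_* helpers and the format numbers are hypothetical.

```php
<?php
// Stand-ins, used only when Drupal is not loaded. Assumption: the
// submitting user is allowed to use format 1 and nothing else.
if (!function_exists('filter_access')) {
  function filter_access($format) {
    return $format == 1;
  }
}
if (!function_exists('check_markup')) {
  function check_markup($text, $format, $check = TRUE) {
    if ($check && !filter_access($format)) {
      return 'n/a';
    }
    // Crude placeholder for a real filter: escape, then keep line breaks.
    return nl2br(htmlspecialchars($text, ENT_QUOTES, 'UTF-8'));
  }
}

// Hypothetical helper, called when the form is submitted: the author
// (the current user at that moment) must be allowed to use the format.
function example_format_allowed($format) {
  return filter_access($format);
}

// Hypothetical helper, called on output: the viewer is usually not the
// author, so the access check is skipped with $check = FALSE.
function example_render_body($body, $format) {
  return check_markup($body, $format, FALSE);
}

$body = "a <b>\nc";
print example_render_body($body, 2) . "\n";
```

The important point is where each check happens: filter_access() runs once against the author, and the output path never depends on who is viewing the content.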

URLs across Drupal require special handling in two ways:

  1. If you wish to put any sort of dynamic data into a URL, you need to pass it through urlencode(). If you don't, characters like '#' or '?' will disrupt the normal URL semantics. urlencode() will prevent this by escaping them with %XX syntax. Note that Drupal paths (e.g. 'node/123') are passed through urlencode() as a whole since Drupal 4.7 so you don't need to urlencode individual parts of it. This convenience does not apply to other parts of the URL like GET query arguments or fragment identifiers.
  2. When using user-submitted URLs in a hyperlink, you need to use check_url() rather than just check_plain(). check_url() will call check_plain(), but also perform additional XSS checks to ensure the URL is safe for clicking on.

Note that all Drupal functions which return URLs (url(), request_uri(), etc.) output plain URLs which have not been HTML escaped in any way (in other words, they are plain-text). Remember to use check_url() to escape them when outputting HTML (or XML). Don't use check_url() in situations where a real URL is expected, e.g. in the HTTP Location: ... header.
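
The two URL rules can be illustrated together in plain PHP. This is a hedged sketch, not Drupal's implementation: example_check_plain() and example_check_url() are hypothetical stand-ins, and the stand-in for check_url() uses a short assumed whitelist of schemes and simply rejects anything else. The search URL and variable names are made up for the example.

```php
<?php
// Hypothetical stand-in for check_plain().
function example_check_plain($text) {
  return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Simplified, hypothetical stand-in for check_url(): returns '' when the
// URL uses a scheme outside an assumed whitelist, otherwise the
// HTML-escaped URL. Relative URLs (no scheme) are allowed.
function example_check_url($url) {
  $allowed = array('http', 'https', 'ftp', 'mailto');
  // A colon that appears before any '/', '?' or '#' marks a scheme.
  if (preg_match('/^([^\/?#:]*):/', $url, $matches)) {
    if (!in_array(strtolower($matches[1]), $allowed)) {
      return '';
    }
  }
  return example_check_plain($url);
}

// Rule 1: dynamic data placed in a query string goes through urlencode().
$keys = 'cats & dogs #1';
$url = 'http://example.com/search?keys=' . urlencode($keys) . '&page=2';

// Rule 2: the finished URL is still plain-text, so it is checked and
// escaped when it is placed inside an HTML attribute.
$link = '<a href="' . example_check_url($url) . '">results</a>';
print $link . "\n";
```

Note the two different kinds of escaping: urlencode() protects the URL's own syntax, while the check_url() step protects the surrounding HTML.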

In practice

All the rules above can be summed up quite easily: no piece of user-submitted content should ever be placed as-is into HTML. If you are unsure of whether this is the case, you can always test it by submitting a piece of text like <u>xss</u> into your module's fields. If the text comes out underlined or mangles existing tags, you know you have a problem.
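
The manual test described above can also be turned into a small check. The following is a minimal sketch in plain PHP; example_probe_is_escaped() is a hypothetical helper, and the two sample outputs stand in for whatever your module actually renders.

```php
<?php
// Hypothetical helper: TRUE when the probe string appears in the rendered
// output only in its escaped form, never as live markup.
function example_probe_is_escaped($rendered, $probe = '<u>xss</u>') {
  $escaped = htmlspecialchars($probe, ENT_QUOTES, 'UTF-8');
  return strpos($rendered, $probe) === FALSE
    && strpos($rendered, $escaped) !== FALSE;
}

$probe = '<u>xss</u>';

// Output of a module that forgot to escape the field.
$bad_output = '<td>' . $probe . '</td>';

// Output of a module that escaped the field once.
$good_output = '<td>' . htmlspecialchars($probe, ENT_QUOTES, 'UTF-8') . '</td>';

var_dump(example_probe_is_escaped($bad_output));
var_dump(example_probe_is_escaped($good_output));
```

A check like this only detects missing escaping. It does not notice text that has been escaped twice, so looking at the rendered page is still worthwhile.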

Here are some examples of good and bad code. $title, $body and $url are assumed to be user-submitted fields containing a title, a piece of marked up text and a URL respectively. They are fresh from the database and thus contain exactly what the user submitted without any changes.

Bad:
<?php print '<tr><td>'. $title .'</td></tr>'; ?>
<?php print '<a href="/..." title="'. $title .'">view node</a>'; ?>

Good (the title is plain-text and may not be placed into HTML as is):
<?php print '<tr><td>'. check_plain($title) .'</td></tr>'; ?>
<?php print '<a href="/..." title="'. check_plain($title) .'">view node</a>'; ?>

Bad:
<?php print l(check_plain($title), 'node/'. $nid); ?>

Good (l() already contains a check_plain() call by default):
<?php print l($title, 'node/'. $nid); ?>

Bad:
<?php print '<a href="/'. $url .'">'; ?>
<?php print '<a href="/'. check_plain($url) .'">'; ?>

Good (URLs must be checked with check_url()):
<?php print '<a href="/'. check_url($url) .'">'; ?>

Writing filters

When writing a filter which translates from another markup language into HTML, you need to ensure you don't open any holes yourself. Generally, the same rules apply: check URLs with check_url() and ensure no literal HTML can be injected by escaping appropriately using check_plain().
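
As an illustration of these rules, here is a minimal sketch of a filter for an invented BBCode-like syntax ([b] and [url=]). It is plain PHP and not wired into Drupal's filter hooks. All example_* functions are hypothetical, and example_check_url() is a simplified stand-in for check_url() with an assumed whitelist of schemes.

```php
<?php
// Hypothetical stand-in for check_plain().
function example_check_plain($text) {
  return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Simplified, hypothetical stand-in for check_url(): '' for schemes outside
// an assumed whitelist, otherwise the HTML-escaped URL.
function example_check_url($url) {
  $allowed = array('http', 'https', 'ftp', 'mailto');
  if (preg_match('/^([^\/?#:]*):/', $url, $matches)) {
    if (!in_array(strtolower($matches[1]), $allowed)) {
      return '';
    }
  }
  return example_check_plain($url);
}

// Callback for [url=...]label[/url]; the label is already escaped.
function example_filter_link($matches) {
  // The URL was escaped together with the rest of the text. Undo that once
  // so example_check_url() sees the raw URL and escapes it exactly once.
  $url = example_check_url(htmlspecialchars_decode($matches[1], ENT_QUOTES));
  if ($url === '') {
    return $matches[2];  // Unsafe URL: keep the label, drop the link.
  }
  return '<a href="' . $url . '">' . $matches[2] . '</a>';
}

function example_filter_process($text) {
  // Escape everything first, so no literal HTML from the user survives.
  $html = example_check_plain($text);
  // Only the tags the filter itself generates are added afterwards.
  $html = preg_replace('/\[b\](.*?)\[\/b\]/s', '<strong>$1</strong>', $html);
  return preg_replace_callback('/\[url=([^\]\s]+)\](.*?)\[\/url\]/s', 'example_filter_link', $html);
}

$input = '[b]Hi[/b] <script>x</script> [url=javascript:alert(1)]click[/url] '
  . '[url=http://example.com/?a=1&b=2]ok[/url]';
$output = example_filter_process($input);
print $output . "\n";
```

Escaping the whole input first and then adding only known-safe tags keeps the filter's "HTML-colored" output entirely under its own control.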

mfb - July 3, 2008 - 00:49

In addition to checkboxes, #options are also unsanitized if #type = radios.

Drupal is a registered trademark of Dries Buytaert.