This handbook page is based on a similar article by Robert Douglass at http://www.lullabot.com/articles/drupal_input_formats_and_filters. It is modified for inclusion here by permission.
Processing textual content for output in a browser is one of Drupal's most critical tasks. Without such processing we would all have to become masters at typing in HTML text! This section of the handbook explains what filters and input formats are, why they are important, how they are used, and why they impact site security.
Filters and Input Formats
The pillars of Drupal's text handling are filters and input formats. A filter is a set of rules that can be applied to transform text in some way. Some filters strip certain HTML tags or security hazards from text. Other filters look for special patterns and expand the text in a meaningful way. Other fun-oriented filters, such as the Pirate Filter, rewrite the text altogether (in this case, to make it "talk like a pirate"). Filters know how to do one thing, and do it well; text in, filtered text out.
Some filters have extra configuration options. The HTML filter, for example, strips all but an allowed set of HTML tags from text. The set of allowed tags can be determined by the administrator.
An input format is an ordered collection of filters. Any text that is displayed in the browser should first be run through the filters in an input format. The input format applies all of its filters in the configured order, so that one filter feeds its output to the next, forming a chain. This chaining of filters can be the source of great flexibility as well as great confusion. The flexibility comes from the fact that filters can be made to work together; the confusion comes from cases where filters inadvertently work against each other, with one filter undoing the work of the previous one.
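The chaining described above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not Drupal's implementation; the names escape_html, newlines_to_br and apply_format are hypothetical. It shows how the same two filters produce different results depending on their order.

```python
import html
from typing import Callable, List

# A filter is modeled here as a plain function: text in, filtered text out.
Filter = Callable[[str], str]

def escape_html(text: str) -> str:
    """Toy filter (hypothetical): escape HTML special characters."""
    return html.escape(text, quote=False)

def newlines_to_br(text: str) -> str:
    """Toy filter (hypothetical): put a <br /> tag before each newline."""
    return text.replace("\n", "<br />\n")

def apply_format(filters: List[Filter], text: str) -> str:
    """Run text through each filter in order; each output feeds the next."""
    for f in filters:
        text = f(text)
    return text

raw = "1 < 2\nsecond line"

# Escaping first, then adding line breaks: the filters cooperate.
good = apply_format([escape_html, newlines_to_br], raw)

# Reversed order: the escape filter undoes the <br /> added just before it.
bad = apply_format([newlines_to_br, escape_html], raw)

print(good)
print(bad)
```

In the second ordering the line break tag is itself escaped and would appear as visible text, which is the kind of conflict between filters mentioned above.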
Input versus Output
Drupal captures input in its raw form, saving whatever gets submitted straight to the database without alteration. Then, before displaying any such content in the browser, Drupal processes the text by choosing an input format to apply. Why doesn't Drupal apply the filters in an input format before saving input into the database? The answer is simple: flexibility. Changing the text a user has entered before saving it in the database would make it impossible to get back to the original state, so a site administrator could never safely change the configuration of the filters. By filtering on output, not on input, Drupal gives the site administrator the option of changing how content is displayed at any time.
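A small sketch may make the store-raw, filter-on-output idea concrete. It is a Python illustration under stated assumptions, not Drupal code: the dictionary standing in for the database and the names save_post, render_post and escape_all are all hypothetical.

```python
import html

# Hypothetical in-memory "database": text is stored exactly as submitted.
database = {}

def save_post(post_id, raw_text):
    """Save the submission unaltered; no filtering happens on input."""
    database[post_id] = raw_text

def escape_all(text):
    """Toy filter (hypothetical): show every tag as visible text."""
    return html.escape(text, quote=False)

def render_post(post_id, filters):
    """Apply the currently configured filters at display time."""
    text = database[post_id]
    for f in filters:
        text = f(text)
    return text

save_post(1, "<em>hello</em>")

# The administrator starts with a strict configuration...
strict = render_post(1, [escape_all])

# ...and later relaxes it. The stored text never changed, so the
# original markup is still available to the new configuration.
relaxed = render_post(1, [])

print(strict)
print(relaxed)
```

Had the escaping been applied before saving, the relaxed configuration would have nothing but the already-escaped text to work with.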
Drupal's Core Filters
Drupal 5 ships with the following core filters:
- HTML Filter: The HTML filter is primarily responsible for removing HTML tags from text. It can be configured to allow any number of tags (a whitelist), and it will remove the rest. It removes them either by stripping them or by escaping them into entities like this: &lt;div&gt;. If tags are escaped, they show up in the output as visible tags: <div>Some text</div>. The set of tags allowed by default includes: <a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd>
The final task of the HTML filter is to add a spam link deterrent to anchor tags. The deterrent, proposed by Google, tells search engines which links they should not follow or give credit to when crawling the web. If this option is enabled, rel="nofollow" will be added as an attribute of all anchor tags.
- Line Break Converter: This filter converts line breaks into <br> or <p> tags depending on whether a single or double line break is found. This preserves the paragraph formatting in the text that is input.
- URL Filter: Any web or email addresses that are found in the text will be converted to clickable links, thus saving the user the hassle of having to type <a href="....">.
- PHP Evaluator: The PHP Evaluator is the most radical of all Drupal's core filters. It looks for text enclosed in <?php ... ?> and evaluates it as PHP code. This effectively allows you to program and extend Drupal just by submitting content to the site! In 99% of cases, this is a bad idea, and the initial attraction of harnessing such power should be tempered by a healthy sense of fear. In most cases it is better to write a simple module instead of using the PHP Evaluator. Furthermore, in the wrong hands, the PHP Evaluator is an enormous security risk. If the PHP Evaluator filter is not configured properly, it represents a major security loophole that could compromise the security of an entire site.
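To give a feel for what the first three filters in the list above do, here is a deliberately simplified Python sketch. It is an assumption-laden illustration, not Drupal's code: the function names and regular expressions are hypothetical, the tag filter ignores attributes and so is not safe for real use, and the URL filter handles only http and https addresses (no email addresses, no nofollow option).

```python
import html
import re

# Default whitelist as listed above.
ALLOWED_TAGS = {"a", "em", "strong", "cite", "code",
                "ul", "ol", "li", "dl", "dt", "dd"}

# Naive tag matcher (hypothetical); real HTML filtering is far more involved.
TAG_RE = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")
URL_RE = re.compile(r"https?://[^\s<]+")

def html_filter(text, escape=False):
    """Keep whitelisted tags; strip or escape every other tag."""
    def handle(match):
        if match.group(1).lower() in ALLOWED_TAGS:
            return match.group(0)
        return html.escape(match.group(0)) if escape else ""
    return TAG_RE.sub(handle, text)

def line_break_converter(text):
    """Double line breaks become paragraphs, single ones become <br />."""
    paragraphs = re.split(r"\n{2,}", text.strip())
    return "\n".join(
        "<p>" + p.replace("\n", "<br />\n") + "</p>" for p in paragraphs
    )

def url_filter(text):
    """Wrap bare web addresses in anchor tags."""
    return URL_RE.sub(
        lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), text
    )

stripped = html_filter("<div><em>Hi</em></div>")
escaped = html_filter("<div><em>Hi</em></div>", escape=True)
paragraphs = line_break_converter("one\ntwo\n\nthree")
linked = url_filter("See http://drupal.org for more")

print(stripped)
print(escaped)
print(paragraphs)
print(linked)
```

The PHP Evaluator is intentionally left out of the sketch, since evaluating submitted code is exactly the behavior the warning above is about.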
Drupal's Core Input Formats
Drupal also comes with three input formats pre-defined.
- Filtered HTML: This is the workhorse input format that is used most of the time for displaying posts such as blogs, pages, forum topics and so forth. It combines the URL Filter, the HTML Filter and the Line Break Converter in a way that allows users a small set of HTML tags for formatting while taking care of paragraphs and URLs behind the scenes. This is also the default input format for new Drupal installations. More on default input formats later.
- PHP Code: This input format consists of only one filter, the PHP Evaluator filter. This input format is to be used when the goal is embedding PHP code in a post.
- Full HTML: The Full HTML input format applies only the Line Break Converter filter. No HTML tags are stripped and no weblinks are converted to anchor tags.
Filters and Security
Filters and security go hand in hand. Without filters, there would be no security for your site, as malicious attackers would have free rein to use scripts to deface your site, subject your users to phishing scams, and steal important data such as passwords.
The heart of the security offered by filters comes from the HTML Filter and the calls it makes to filter_xss and check_plain. These are the functions that Drupal uses to prevent attacks based on user input. For this reason, all of your user-submitted output should be run through the HTML Filter. It is tempting to ignore this advice, especially if you are having trouble getting the configuration settings just right for your purposes. Don't ignore this advice. You may end up sorry.
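The following sketch shows, in Python rather than PHP, the kind of escaping that protects a page from script injection. It is only an analogy: check_plain_like is a hypothetical name, and Python's html.escape stands in for the escaping that Drupal's check_plain performs. It does not reproduce filter_xss.

```python
import html

def check_plain_like(text):
    """Hypothetical analogue of plain-text escaping: every HTML special
    character is converted to an entity, so the browser shows it as text."""
    return html.escape(text, quote=True)

# A malicious "comment" submitted by a visitor.
comment = '<script>alert("stolen cookie")</script>'

# Printed unfiltered, the browser would execute the script.
unsafe_page = "<p>" + comment + "</p>"

# Escaped, the same submission is displayed harmlessly as visible text.
safe_page = "<p>" + check_plain_like(comment) + "</p>"

print(unsafe_page)
print(safe_page)
```

The unsafe version contains a live script element; the safe version contains only entities, which is why user-submitted text should never reach the browser unfiltered.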
Also worth reiterating is the fact that the PHP Evaluator filter poses an extreme risk if it can be used by anyone but highly trusted, PHP-competent site administrators. Most sites will be better off deleting the PHP code input format and not extending use of the PHP Evaluator filter to anyone.
Finally, it should be obvious that the Full HTML input format, which does not use the HTML Filter, is insecure and should be offered only to those users who can be trusted not to ruin your site. Most sites will be better off deleting this input format.