I've seen some discussion about filtering HTML from user input and about how a user indicates whether a submission is plain text or HTML.

Below are modified versions of two functions in node.module:

1. node_filter()
2. node_filter_html()

With the code below, HTML filtering is all or nothing. If filtering is turned on, the allowed tags pass through unchanged and the disallowed tags are escaped, i.e. converted to:
&lt;tag&gt;
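To make the intended behaviour concrete, here is a minimal sketch of the "allowed tags pass, everything else is escaped" rule. It is not the code from this post: `sketch_filter_tags()` is a hypothetical helper that uses `preg_replace_callback()` rather than the phpBB-style loop, and it only handles well-formed tags (stray `<` or `&` characters in the text are left alone).

```php
<?php
// Hypothetical helper, not part of node.module: tags whose name is in
// $allowed are kept as-is, every other tag is escaped so it shows literally.
function sketch_filter_tags(string $text, array $allowed): string {
    return preg_replace_callback(
        '/<\/?([a-z][a-z0-9]*)\b[^>]*>/i',
        function (array $m) use ($allowed): string {
            $name = strtolower($m[1]);
            return in_array($name, $allowed, true)
                ? $m[0]
                : htmlspecialchars($m[0], ENT_NOQUOTES);
        },
        $text
    );
}

echo sketch_filter_tags('<b>bold</b> and <script>alert(1)</script>', ['a', 'b', 'i']);
// <b>bold</b> and &lt;script&gt;alert(1)&lt;/script&gt;
```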

If HTML filtering is turned off, the Drupal "line break" filter is used and all tags are escaped in the same way.
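A rough sketch of that unfiltered path, with loud assumptions: `sketch_plain_text()` is a hypothetical name, and PHP's `nl2br()` stands in for the Drupal line break filter, whose real implementation is not shown in this post. The sketch escapes first and adds line breaks second, so the inserted `<br />` tags are not escaped along with the user's own tags.

```php
<?php
// Hypothetical helper, not from node.module. nl2br() is only a stand-in
// for the Drupal "line break" filter (an assumption).
function sketch_plain_text(string $text): string {
    // Escape &, < and > first so that every tag is displayed literally.
    $escaped = htmlspecialchars($text, ENT_NOQUOTES);
    // Add <br /> before each newline after escaping, so the inserted
    // breaks remain real markup.
    return nl2br($escaped);
}

echo sketch_plain_text("<i>one</i>\ntwo");
// &lt;i&gt;one&lt;/i&gt;<br />
// two
```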

To test this code, the allowed_html entry in the variables table needs to contain the allowed tags as a comma-separated list, in the following format:

a,b,dd,dl,dt,i,li,ol,u,ul,p,br
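For illustration, a list in that format can be turned into an array of tag names like this. This is a sketch only: `sketch_parse_allowed_html()` is a hypothetical helper, and the string is passed in directly instead of being read through `variable_get()`.

```php
<?php
// Hypothetical helper: splits the comma-separated list into tag names and
// trims each entry, so "a, b" works as well as "a,b".
function sketch_parse_allowed_html(string $stored): array {
    return array_map('trim', explode(',', $stored));
}

$tags = sketch_parse_allowed_html('a,b,dd,dl,dt,i,li,ol,u,ul,p,br');
echo count($tags), "\n";   // 12
echo $tags[11], "\n";      // br
```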

Note: I am by no means a PHP coder. The code below is predominantly a cut-and-paste job from the phpBB code. I don't know whether it breaks anything else; if it breaks your system, sorry ....

-- node_filter --

<code>function node_filter($text) {
    $html_entities_match = array('#&#', '#<#', '#>#');
    $html_entities_replace = array('&amp;', '&lt;', '&gt;');

    $unhtml_specialchars_match = array('#&gt;#', '#&lt;#', '#&quot;#', '#&amp;#');
    $unhtml_specialchars_replace = array('>', '<', '"', '&');

  if (variable_get("filter_html", 0)) {
      /*
      ** filter out any unwanted html tags
      */
     $text = node_filter_html($text);
  }
  else
  {
    /*
    ** use the drupal "line break" filter
    */
    $text = preg_replace($html_entities_match, $html_entities_replace, node_filter_line($text));
  }

  /*
  ** filter links
  */
  if (variable_get("filter_link", 0)) $text = node_filter_link($text);
  return $text;
}</code>

-- node_filter_html --

<code>function node_filter_html($text) {
/*
** This function will prepare a posted message for
** entry into the database.
**
** This code has been shamelessly "borrowed" from the
** phpBB project [www.phpbb.com]
**
*/
    $html_entities_match = array('#&#', '#<#', '#>#');
    $html_entities_replace = array('&amp;', '&lt;', '&gt;');

    $unhtml_specialchars_match = array('#&gt;#', '#&lt;#', '#&quot;#', '#&amp;#');
    $unhtml_specialchars_replace = array('>', '<', '"', '&');

    //
    // Clean up the message
    //
    $message = trim($text);

        // Allowed tag names, stored as a comma-separated list.
        $allowed_html_tags = explode(',', variable_get("allowed_html", "a,b,dd,dl,dt,i,li,ol,u,ul,p,br"));

        $end_html = 0;
        $start_html = 1;
        $tmp_message = '';
        $message = ' ' . $message . ' ';

        while ( $start_html = strpos($message, '<', $start_html) )
        {
            $tmp_message .= preg_replace($html_entities_match, $html_entities_replace, substr($message, $end_html + 1, ( $start_html - $end_html - 1 )));

            if ( $end_html = strpos($message, '>', $start_html) )
            {
                $length = $end_html - $start_html + 1;
                $hold_string = substr($message, $start_html, $length);

                if ( ( $unclosed_open = strrpos(' ' . $hold_string, '<') ) != 1 )
                {
                    $tmp_message .= preg_replace($html_entities_match, $html_entities_replace, substr($hold_string, 0, $unclosed_open - 1));
                    $hold_string = substr($hold_string, $unclosed_open - 1);
                }

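                // Allow the tag only if its name matches a whole entry in the list;
                // \b stops "b" from also matching <body> or <br>.
                $tagallowed = false;
                for($i = 0; $i < sizeof($allowed_html_tags); $i++)
                {
                    $match_tag = preg_quote(trim($allowed_html_tags[$i]), '/');
                    if ( preg_match('/^<\/?' . $match_tag . '\b(?!(\s*)style(\s*)\\=)/i', $hold_string) )
                    {
                        $tagallowed = true;
                        break;
                    }
                }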

                $tmp_message .= ( $length && !$tagallowed ) ? preg_replace($html_entities_match, $html_entities_replace, $hold_string) : $hold_string;

                $start_html += $length;
            }
            else
            {
                $tmp_message .= preg_replace($html_entities_match, $html_entities_replace, substr($message, $start_html, strlen($message)));

                $start_html = strlen($message);
                $end_html = $start_html;
            }
        }

        if ( $end_html != strlen($message) && $tmp_message != '' )
        {
            $tmp_message .= preg_replace($html_entities_match, $html_entities_replace, substr($message, $end_html + 1));
        }

        $message = ( $tmp_message != '' ) ? trim($tmp_message) : trim($message);
    $text = $message;
    return $text;
}</code>
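One thing worth checking in isolation is the per-tag match. A pattern that only tests the start of the tag name, such as `/^<\/?b/i`, also accepts `<body>`, so a short entry like "b" or "i" can let through tags that were never meant to be allowed. The sketch below shows a stricter check with a word boundary after the name; `sketch_tag_matches()` is a hypothetical helper, not a function from node.module or phpBB.

```php
<?php
// Hypothetical helper: true when $tag_text opens or closes exactly the
// element $name (case-insensitive), not merely a tag starting with $name.
function sketch_tag_matches(string $tag_text, string $name): bool {
    return preg_match('/^<\/?' . preg_quote($name, '/') . '\b/i', $tag_text) === 1;
}

var_dump(sketch_tag_matches('<b>', 'b'));               // bool(true)
var_dump(sketch_tag_matches('</B>', 'b'));              // bool(true)
var_dump(sketch_tag_matches('<body onload="x">', 'b')); // bool(false)

// For comparison, the prefix-only pattern accepts <body>:
var_dump(preg_match('/^<\/?b/i', '<body>'));            // int(1)
```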

Comments

al

Priority: Minor » Normal

Is this still an issue?
IMHO, we still haven't got <br /> filtering right yet. The conditions under which it is applied aren't very obvious; it doesn't work intuitively.

We probably need to look at this in more detail...

jonbob

I don't think this applies anymore.