- input/output filters: right now we use the same functions to filter both input (for example before node_saving) and output. Using different functions could be useful. This feature has been called pre/post filter in the past, but I suggest to call them input/output to avoid confusion with smarty (where "pre/post filters" = before/after template has been parsed)
- processing order: let admin choose the order of filters through an interface

sidenote: we can use permissions to have some filters apply just to some roles (for example html stripping would not be applied to admins); this is up to module writers

CommentFileSizeAuthor
#16 filter.clashing.patch12.99 KBSteven
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

al’s picture

Title: filter revamp » Rewrite filter system
Component: Code » Node system
Category: feature » task

Changing this to a task, as it's not really a feature - it's already there, but needs redoing.

adrian’s picture

Priority: Major » Normal

There has been some more discussions on the devel mailing list about this, and this kind of touches on some of the ideas and thoughts I have for drupal filters.

moshe weitzman’s picture

CVS Drupal was recently changed so the filter() hook acts on output only. This improves things for some applications, but not all. I'm amrking this as Fixed for now. Post more detail and reopen the item if you are interested in doing more with filters.

al’s picture

Title: Rewrite filter system » Filter system - support expensive filters.
Component: Node system » other

The filter system now uses on-output filters. These guarantee to not mess up the original user-contributed text, which is great for editing. Unfortunately, it means some filters that only need to be run once end up being run every time content is displayed (allowing for caching, obviously).

We could do with having some form of on-submission filtering. These will affect how previews look (but not the text in the edit box), and will then on submit have their chained output stored in a field in the db. An expiry field would also be good so we can update the filtered output when a cron run is done.

al’s picture

marco’s picture

Assigned: Unassigned » marco

I'm working on this and have some stuff on my sandbox. My proposal is to have a filter_in() function which saves unfiltered text and then filters it. ATM in my sandbox there is a working version, but Dries gave me some suggestions to make it simpler, so hold on.

I also suggest a filter_out() function which should be applied in check_output(); it will be used in rare cases, only when needed; for example a filter to highlight Google search words if Google is the referer.

I assign this to myself and will work on it as soon as we branch.

Bèr Kessels’s picture

Ax suggested all filtyer issues should be discussed here. instead of copiying all stuf into this [task] i give a link.
here's a [bug] with some discussion on in-outout filtering.
> http://drupal.org/node/view/1812
About filter conflicts. Suggesting to add a weight to a filter.
> http://drupal.org/node/view/2281

Bèr Kessels’s picture

From: http://drupal.org/project/update/2345

URL of homepage:
http://homepage/">http://homepage/

The solution is to change check_output() to drupal_specialchars() in profile.module for profile_homepage entry.

Bèr Kessels’s picture

Another problem with filters.

As far a I know this problem was not yet addressed.

When I insert eg [image:21|foo|bar] an image should occor. So far no real problem. But what happens when I remove image 21?
This would also be valid for the [url] and [title]. The latter already kind of solves it by referring the user to the search.

All nodes with a token should be updated, listed or mentioned to some administrator.

I see three solutions:
On cron run all on-output filters are checked. All nodes are checked to see if all tokens are still valid. A list is compiled with links to the nodes with the specific tokens that need to be updated.

On cron run all on-output filters are checked. All nodes are checked to see if all tokens are still valid. Those tokens are preg_replaced with some default token.

The module should use the _delete() hook to update or list all tokens that will become invalid. Some warning can be outputted to the user who is deleting the node. eg:
The node you are about to delete is referred to by:
- nodetitle1
- nodetitle1
- nodetitle1
Removing this node will make these invalid.

The filter story becomes more and more complex.....

Ber

Steven’s picture

Assigned: marco » Steven
Bèr Kessels’s picture

Proposing filter management.

Hi,
I've finished some stuff and now i'm starting with the filter module. I'd like to find a solution for the annoying filterclashing.

I looked at the way mozilla mail and outlook handle filters and translated this into a drupal UI.

Please look at my proposal at
http://devel.bakkers.nl/misc/first-level-filter.png
http://devel.bakkers.nl/misc/second-level-filter.png

Those are quite riough html files, but it shows the menu, breadcrumb and UI in a good enogh way.

Basically what i propose is:
First level shows a overview. With weighting.
Second level shows the filter page itself. chopped up on one page per filter. Each installed filter will get a page like this.
Please note the Continue radio-options below, they allow some configuarion on whether other filters might still filter the node afther that specific filter was run.

Please give feedback and comments on this proposal, for i can then take up this issue and try to solve it before the 15th of February.

Bèr

Steven’s picture

I was planning on doing this after Jan 31st (when my exams end), but if you can do it, that's good too.

I want to see if it isn't possible to do a generic ordering control however: Drupal's weighting is quick'n'dirtyish, but no-one has ever done a better, adequate solution. A Javascripted dynamic up/down control (which still works through page reloads when Javascript is off) would be ideal for this, and could be applied in various places around Drupal. I have some rough ideas how to do this, but no code yet.
The output for such a control would still be an order, so the DB queries can still say 'order by weight' without a problem. It's only the UI that changes.

I don't like the idea of stopping when a filter has hit a match: can you give examples of where this would be useful? Remember that filters also have to deal with user-supplied HTML, so they need to be robust anyway.
Also, the UI for it is pretty confusing: you have to dig into every filter's own page to change this setting, plus there's no overview to see the big picture about this setting. Remember also that some filters have no settings of their own (in fact, removing the on-switch from title.module is on my list because it's redundant).

Also, what you've done (maybe not on purpose) is call for multiple filters per module (by separating 'legacy filtering' and 'html filtering'). I don't think this is a good idea, because it will complicate the code a lot. Remember that the reason for filter ordering is simply to make sure filters play nice with eachother. If a single module performs multiple filters, I don't think it's crazy to assume it can do this in the most appropriate order itself and doesn't need an admin to tweak it. If a module really does more than 1 extensive filter that might clash with itself, it should be split into 2 modules.

Finally, you allow for filters to be run before the default filters. I don't like this either. All clashing issues with the default filters require pre-escaping and cannot be solved by ordering anyhow (see mailinglist a while ago). Any filter that runs before the default ones will have its output treated as user-supplied input, with tag-stripping (or escaping) as a consequence. The default filters are quite important.

Some other discussed features:
- Role based filtering. This is tricky because the filtering is performed on output: if you check permissions, these will be the permissions of the person who is viewing the node, not of the person who created it, which is not expected behaviour. This is what prevents some modules from handling their own role-specific settings (which would probably be good enough for 90% of the cases because usually it's only the HTML-tags setting which has to be role-specific).

- Caching: I haven't had time or possibility to benchmark my filter cache yet (though it won't apply to drupal CVS anymore now, but the actual changes are small), but I think my idea was the most clean way to get around to it.

I think it's important to keep the filter system's interface simple (only check_output now): right now it is quite transparent to the rest of Drupal and I'd like to keep it that way. Permissions require some sort of context identification, or that we split filtering back into submission-filtering and viewing-filtering (with resulting headaches for making sure everything is filtered correctly, and that user input isn't mangled beyond recognition).

Let's focus on ordering for now though ;).

Bèr Kessels’s picture

First of all: My personal focus lies on fixing the clashing of filters (refer http://drupal.org/node/view/5073#7841). So I will not pay attention to cashing, and permissions.

[...]
The output for such a control would still be an order, so the DB queries can still say 
'order by weight' without a problem. It's only the UI that changes.

I will stick to the current [weight v]. When a new mechanism is designed it can be implemented here too. But this is out of scope for the filter system imo.

I don't like the idea of stopping when a filter has hit a match: can you give examples of
where this would be useful? Remember that filters also have to deal with user-supplied HTML, 
so they need to be robust anyway.

This is my solution for the clashing. I could design some escaping mechanism, but I think this will solve only a part of the problem.
Together with weighting, a good pair of brains and the stopping the problem will be soved, AFAIK.

When a filter (eg. [weblinks], and [title]) clash, (title will be run before weblinks, so all [weblinks:234] are converted to /title/weblinks:234) You can either do this:

  1. filter all html
  2. filter all [weblinks] if it was found do not filter the replaced <a's> anymore
  3. filter all [title]
  4. filter all urls. all uls, except the [weblinks] replaced part.
Also, the UI for it is pretty confusing: you have to dig into every filter's own page to change this setting, 
plus there's no overview to see the big picture about this setting. Remember also that some filters have no
settings of their own (in fact, removing the on-switch from title.module is on my list because it's redundant).

True, But I wanted to follow the way most mail clients have theyr filter UI designed. I do not want to re-invent the weel, so I looked at some existing filter UI and translated that to drupal.

If the continue-settings are not voted out, i will move them to the 'big picture, the first screen.

My main reason for splitting the screens was to avoid one huge, cluttered screen. The big-picture will get lost when all options, help texts textfields are rendered in one page.
Besides, the filter setting will not change too often, AFAIK, so reducing 'amount of clicks' is not really an improvement here, admins don't have to be here often enough.

I will, however, first implement the weighting, the 'stop-on-found etc' might come later, if anybody else sees the benefits of this.

Bèr

Steven’s picture

This is my solution for the clashing. I could design some escaping mechanism, but I think this will solve only a part of the problem. Together with weighting, a good pair of brains and the stopping the problem will be soved, AFAIK. When a filter (eg. [weblinks], and [title]) clash, (title will be run before weblinks, so all [weblinks:234] are converted to /title/weblinks:234) You can either do this:

1. filter all html
2. filter all [weblinks] if it was found do not filter the replaced anymore
3. filter all [title]
4. filter all urls. all uls, except the [weblinks] replaced part.

Well there are two clashing issues: one is what you describe (interfilter clashing like weblinks and title), and the other is that escaping happens too late (clashing between filters and html tag stripping). Another example of interfilter clashing is glossary and urlfilter, where glossary will escape terms in an URL before urlfilter can turn it into a link.

I don't get what you mean with "4. filter all urls. all urls, except the [weblinks] replaced part." though. If this is the result of your stopping mechanism, then wouldn't this mean that for example with weblinks.module and urlfilter.module, any post that contains a [weblink] tag will not have plain URLs escaped? This is not desired behaviour, and this is what I meant with making filters robust: urlfilter needs to handle existing <a>-tags anyway, so it doesn't have a problem when it encounters a link generated by weblinks. I'm all for filter ordering, I just don't see how your stopping mechanism solves the general issue of filters being too greedy. I think you're confusing filter-matching (e.g. email sorting, where stop conditions are needed) with text processing (drupal's filters).

adrian’s picture

Watever the solution that is decided on may be, i feel it's important that we keep in mind that not just the node content has to be filtered The right answer would need to be able to filter comments, poll entries , titles and a fair amount of other text. I think a whitelist set up by the module developer is the easiest way to get this done. Ie: comment.module author specifies that fields x, x and x are to be filtered.

event.module author specifies that location / etc is to be filtered, this is important, as you could maybe want to link the location to the website using the textile "sometext":http://url method, only to find it doesn't work.. thus confusing the user.

Steven’s picture

FileSize
12.99 KB

Here's a patch which fixes all discussed Filter Clashing issues. It adds two significant changes:
- Filters can be re-ordered by the admin under admin > config > filters > ordering.
- A subhook "prepare" is available for escaping HTML before the HTML stripper gets to it (e.g. to prevent HTML tags inside <?php ?> from being removed, discussed on mailinglist)

The two filter hooks (filter and conf_filters) have been merged into one filter hook with an op parameter. This seems the cleanest and most flexible solution (we use the same technique with blocks and taxonomy). Like this:

/* 'hook' is the placeholder for the name of the module */

function hook_filter($op, $text = "") {
  switch ($op) {
    case "name":
      return t("Example filter");
    case "prepare":
      // Do preparing
      // $text = ....
      return $text;
    case "process":
      // Do processing
      // $text = ....
      return $text;
    case "config":
      // Output any options needed
      // $output .= form_...(...)
      return $output;
  }
}

Where:
- "prepare" returns $text with all escaping done (e.g. <code></code> tags for PHP or C++). This happens before HTML tags are stripped by the default stripper (new, was discussed on mailinglists as "hook_filter_prepare")
- "process" returns $text with all regular filtering done (replacement for hook_filter)
- "config" returns the configuration options for this module (replacement for hook_conf_filters)
- "name" returns a friendly name for the filter (for display on the ordering page)

The attached patch is 99% complete: the only thing that is missing is database schemes and update.php code for PGSQL and MSSQL. I cannot do these myself.
All other things have been tested and work as far as I can see.
I suggest we apply this, and then post a task asking for the relevant schemes.

Existing filter modules need to be modified: when this patch hits CVS, I'll make a short text on how to convert filter modules (though as you can see above, the changes are mostly syntactic) and convert some popular filtering modules in contrib.

Steven’s picture

(ack, crappy formatting)

Here's a patch which fixes all discussed Filter Clashing issues. It adds two significant changes:
- Filters can be re-ordered by the admin under admin > config > filters > ordering.
- A subhook "prepare" is available for escaping HTML before the HTML stripper gets to it (e.g. to prevent HTML tags inside <?php ?> from being removed, discussed on mailinglist)

The two filter hooks (filter and conf_filters) have been merged into one filter hook with an op parameter. This seems the cleanest and most flexible solution (we use the same technique with blocks and taxonomy). Like this:

/* 'hook' is the placeholder for the name of the module */

function hook_filter($op, $text = "") {
  switch ($op) {
    case "name":
      return t("Example filter");
    case "prepare":
      // Do preparing
      // $text = ....
      return $text;
    case "process":
      // Do processing
      // $text = ....
      return $text;
    case "config":
      // Output any options needed
      // $output .= form_...(...)
      return $output;
  }
}

Where:
- "prepare" returns $text with all escaping done (e.g. <code></code> tags for PHP or C++). This happens before HTML tags are stripped by the default stripper (new, was discussed on mailinglists as "hook_filter_prepare")
- "process" returns $text with all regular filtering done (replacement for hook_filter)
- "config" returns the configuration options for this module (replacement for hook_conf_filters)
- "name" returns a friendly name for the filter (for display on the ordering page)

The attached patch is 99% complete: the only thing that is missing is database schemes and update.php code for PGSQL and MSSQL. I cannot do these myself.
All other things have been tested and work as far as I can see. I suggest we apply this, and then post a task asking for the relevant schemes.

Existing filter modules need to be modified: when this patch hits CVS, I'll make a short text on how to convert filter modules (though as you can see above, the changes are mostly syntactic) and convert some popular filtering modules in contrib.

Steven’s picture

Dries’s picture

I don't fully agree with the design. Despite my concerns, I'm extremely likely to commit this (probably tomorrow after work) as I recognize that this marks a significant step forward and that my concerns can easily be dealt with later on. I'll summarize my remarks nonetheless:

  • I'd change prepare » strip tags » process to prepare » process. That is, make the strip tags stage a regular filter that can be reordered like any other filter. Moreover, I don't understand why having a fixed strip tags stage is preferred and will provide two reasons not to:
    1. I want to make the strip tags filter the last filter instead of the first so it can strip all Javascript and prevent filters from inserting unwanted Javascript. I might want to do the same for the image tag or any other HTML tags/attribute that I don't want users to use, or filters to emit.
    2. I want to make the strip tags filter the last filter instead of the first so it can strip all unclosed tags. PHP's strip_tags() function, used by the current strip tags code, should do so. Alternatively, I might want to disable the default strip tags filter and use an alternative implementation that is either more robust or that can fix some invalid (X)HTML.

    I believe that making the strip tags filter a normal filter instead of a special case makes gives me more control, makes for less code (less settings) that is faster.

  • Change 'config' to 'settings'.
  • Change some double quotes to single quotes where appropriate.
Steven’s picture

Your concerns about flexible strip tags seem based on the fact that you do not trust filter modules: I don't get this at all. The function of a filtering module is usually quite concise. If a security hole were to be discovered, there is no reason at all to try and cover for it by stripping the output of anything dangerous. Instead, the code should simply be fixed, as we do with every other security hole in Drupal (besides, we have a global XSS catcher for really naughty things). If a filter module outputs invalid (X)HTML, it needs to be fixed as to not do that.

The point of the HTML filter is to keep the user at bay, not the filtering modules. For example, what happens when the filter is set to escape all HTML rather than strip disallowed tags? If you move it after other filters, suddenly you'll see HTML generated by filters being escaped too.
User input and filter-output require different treatments: this is why the HTML filter/stripper is locked as first 'real' filter.

I think you underestimate the difference between filter features and HTML tags: there is certainly no single 1-to-1 mapping between them. Controlling filter features by controlling the allowed HTML tags is like killing flies with a rocket launcher.
If a filtering module emits Javascript, it's probably required for it to work. And if you strip the script-tag, you'd end up with a nice chunk of Javascript code in visible-text. If this Javascripted functionality is optional, don't you think the module author would allow you to toggle it in a much cleaner fashion?

Giving the admin the ability to move the HTML stripper around is something that will IMO not be used at all, and will only cause yet another area where possible malconfigurations can occur. I don't see any proper reason for unlocking it other than adding some useless tinkertoy to Drupal and making filters more confusing than they need to be. Drupal has several areas with locked settings (modules, roles, permissions for uid 1, ...), and each is justified, trading off flexibility for ease of setup and usage.

Bèr Kessels’s picture

There're some more reasons for making html-filtering behave like all others. Besides the fact that consistency should be a valid argument, of course.

And other things like textile. Lets say I want to use textile to markup my content, but want to disable h1. etc, to prevent my document outline from breaking. In an ideal world this would be done by altering the textile module, but our world is far from perfect, so the easyest way would be to place html-filtering after textile and not include into the allowed tags.

The same goes for other markup modules such as texturiser, BBcode, etc.

Chris Johnson’s picture

This filter ordering patch appears to have filtered all of my existing content right out of existence. Every node's text is replaced with this one word: "process". Reverting to filter.module version 1.5 fixes it.

To say I was horrified when I first saw all my content missing is an understatement. Fortunately it was still in the database, but just being filtered out on display.

Gábor Hojtsy’s picture

Chris, update all the modules having filters you use from CVS. If you have some modules not yet updated to handle the new filtering code, you could correct them in contrib CVS (I have done so for glossary, bbcode and weblink).

Chris Johnson’s picture

That may explain it. I've got an older copy of the taxonomy.module because it was broken in CVS.

I think I may wait until there is a somewhat more stable CVS before dropping it on my site again. Solving emergency problems on top of everything else is a bit much.

Steven’s picture

For information on how to convert your module, see the Converting 4.3 modules to CVS page in the docs.

Chris Johnson’s picture

Actually, I believe Dries already fixed the taxonomy module in HEAD. So current CVS should be good. I'm just going to be more cautious in general, is all. :-)