[meta] Input filters and text formats [#807996]

There's been some exciting brainstorming about the possible future of input filters and formats in #659580: Specify what the line break converter should do and rewrite it in DOM and #653988: Line break filter corrupts existing XHTML. Here's the thinking so far.

Just to clarify my point of view on this, Pathologic, an input filter, was my first contrib module, back in the D5 days, and still going strong.

The D7 HTML Corrector filter basically loads the HTML as a PHP DOMDocument object, which appears to be pretty flexible about parsing tag soup, then serializes it back out. The basic idea in reforming filters for D8 is that we actually keep that object around for a bit and pass it around for other filters to work on before we serialize it. This will be a great help for filters that would rather navigate and modify a DOM than parse with regular expressions.

The biggest sticking point that we need to consider with this sort of approach is that it will no longer be possible to not "correct" HTML while also running other filters on it, as doing this is inherent in creating and serializing the DOMDocument. We could possibly offer some sort of "passthrough" approach which just doesn't run any filters on it at all, but as soon as it hits the filter system, it's gonna be "corrected." Is this a good idea? Let the debate commence! I personally am of the mind that the net benefit of allowing filters like Pathologic to be able to fiddle with things using the DOM far outweighs other concerns.

Let's back up a bit. The full filtering process will look like this: The text will pass through preprocess filters. This is where filters which convert Markdown, Textile, BBcode, etc to HTML will go. The text (hopefully all HTML at this point) then gets loaded into a PHP DOMDocument, which is then passed around to "mid-process" filters to work on. Once those are done, the DOMDocument is serialized back to HTML, and then can be passed through post-process filters, for filters which need to work on HTML instead of a DOMDocument for whatever reason.
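
As a rough illustration of the three phases described above, here is a minimal sketch in plain PHP. It is not Drupal code: sketch_run_phases() and the three example filters are hypothetical names invented for this example, and the DOM loading only approximates what core's filter_dom_load() and filter_dom_serialize() do.

```php
<?php
// Hypothetical sketch; none of these names exist in Drupal.

/**
 * Runs text through string preprocess filters, DOM filters, and string
 * postprocess filters, in that order.
 */
function sketch_run_phases($text, array $pre, array $dom_filters, array $post) {
  // Phase 1: plain string filters (Markdown, BBCode, ...).
  foreach ($pre as $filter) {
    $text = $filter($text);
  }

  // Phase 2: load the (hopefully HTML) text into a DOMDocument. Wrapping it
  // in a full document with a declared charset keeps non-ASCII text intact.
  $previous = libxml_use_internal_errors(TRUE);
  $dom = new DOMDocument();
  $dom->loadHTML('<!DOCTYPE html><html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>' . $text . '</body></html>');
  libxml_clear_errors();
  libxml_use_internal_errors($previous);
  foreach ($dom_filters as $filter) {
    // Objects are passed by handle, so filters modify $dom in place.
    $filter($dom);
  }

  // Serialize only the children of <body> back to a string.
  $text = '';
  foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $node) {
    $text .= $dom->saveXML($node);
  }

  // Phase 3: string filters that need serialized HTML.
  foreach ($post as $filter) {
    $text = $filter($text);
  }
  return $text;
}

// Example filters, one per phase (all made up for illustration).
$bbcode_bold = function ($text) {
  return preg_replace('#\[b\](.*?)\[/b\]#s', '<strong>$1</strong>', $text);
};
$nofollow = function (DOMDocument $dom) {
  foreach ($dom->getElementsByTagName('a') as $link) {
    $link->setAttribute('rel', 'nofollow');
  }
};
$ellipsis = function ($text) {
  return str_replace('...', '&hellip;', $text);
};

$out = sketch_run_phases(
  '[b]bold[/b] text... <a href="/x">link</a>',
  array($bbcode_bold),
  array($nofollow),
  array($ellipsis)
);
```

Note that the DOM round trip in the middle "corrects" the markup whether or not any DOM filter is configured, which is the sticking point raised earlier.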

This three-stage approach will mean that the filter rearranging page can be (debatably) done away with. We can make sure that filters like BBcode will run before filters like Pathologic by virtue of the fact that the former will be a preprocess filter and the latter will be a DOM (or possibly postprocess) filter. I think this will be a wonderful usability boon for novice users. Possibly, filters can carry their own weights if they need to run before or after other filters in a particular stage, but the user never needs to see that, just as they never need to see module weights in {system}.

Coders of many filters in contrib will be able to easily roll a D8 version without having to rewrite their filter to use the DOM simply by making their filter a postprocess filter - so it still has standard HTML as an input. Eventually, if it makes sense to do so, they can create a new major release which uses the DOM instead.

And while we're reinventing wheels, #226963: Context-aware text filters (provide more meta information to the filter system) needs to happen too.

I've never been a major kitten killer as of yet, but I'm maybe possibly volunteering myself to take a major role in this, pending community feedback.

Comment	File	Size	Author
#15	filter-tng-macros.png	49.98 KB	wim leers
#15	filter-tng-macros-clicked.png	49.92 KB	wim leers

Comments

Comment #2

sun

German

Karlsruhe

commented 26 May 2010 at 11:13

Title:	Input filters & formats: TNG	» [meta] Input filters and text formats
Issue tags:		+FilterSystemRevamp

Good thoughts. However, we need to revamp a lot more in the filter system, which is tightly connected to what you're describing here:

1) Split text processing and filters into two key stages: security filters and macro filters. Security filters sanitize user input (e.g., HTML filter, Escape all HTML filter, etc.). Security filters always run first, before any other filters. Macro filters transform content dynamically (e.g., Linebreak filter, URL filter, most of the contributed filters). So there need to be two separate stacks: security filters and macro filters. Each filter stack needs separate filter weights.

Note that 1) slightly clashes with the outlined idea in the original description of this issue, unless security/macro stages would be combined into a matrix with preprocess/dom/process stages somehow.

2) Introduce markup language: The DOM stage mentioned in the original description of this issue requires knowing the markup language of the processed text. As of now, Drupal's text processing is only intended for (X)HTML(4). However, some (already existing) use-cases in contrib, as well as HTML5 and in particular trying to load a text string into a DOMDocument requires to declare what kind of markup language we are expecting and dealing with.

3) Get rid of text formats.

I already planned to work on a replacement filter system for D8 or beyond in http://drupal.org/project/filter, would love to have someone to collaborate with. (didn't actually start with that yet though)

Comment #3

damien tournoud commented 26 May 2010 at 11:27

Hm. I don't agree at all.

I would say we need two stages: (a) text processing and (b) filtering.

Text processing is the first operation. It transforms whatever the user has input into HTML Markup. You can only pick one text processor per text format: Markdown, BBCode, Textile, unfiltered HTML, simple HTML (unfiltered HTML with auto-paragraphs). This text processor is used to decide which type of client-side editor to display to the user (WYSIWYG, WYSIWYM, simple textfield, etc.), and is used to display help text that can actually be understood by the end-user.

Filtering is the second operation. It takes the HTML markup output by the text processor and further filters it. This stage is purely based on DOM manipulations, and that's where we have most of the contrib macro-filters (Views Embed, URL filter, etc.).

Comment #4

sun commented 26 May 2010 at 12:46

You actually agreed to a lot, just called it differently. :)

"You can only pick one text processor per text format"

More or less the same as 2) the introduction of markup language.

"..., unfiltered HTML, simple HTML (unfiltered HTML with auto-paragraphs)"

This presumes that the user input has been validated and sanitized, or otherwise the text processors quoted here could not exist or would have no distinction. Furthermore, a strictly enforced BBCode markup language would require a security filter that escapes or removes all non-BBCode, i.e., all raw HTML. Only after applying security filters on the user input, we can safely transform BBCode into HTML.

In short: Before we even try to process any user input, the user input needs to be sanitized, so whatever input is processed and macro-filtered is known to be safe. That's the most important distinction of input filters that we are currently presuming, but as of now, not able to enforce in the configuration (UI and API) of text formats.

"Filtering is the second operation. ... This stage is purely based on DOM manipulations, and that's where we have most of the contrib macro-filters (Views Embed, URL filter, etc.)"

Likewise, this also maps smoothly to my points 1) + 2), i.e., macro filters are executed in a separate, second stack. However, some (or perhaps even many) macro filters are much simpler and faster to implement and execute as string manipulations. Of course, DOM manipulations are useful and sometimes needed, too. We need to make both ways possible.

The Linebreak (auto-paragraph) filter actually is a macro filter and therefore needs to be applied in the second stage, i.e., after security filters have run.

Comment #5

damien tournoud commented 26 May 2010 at 12:56

"Only after applying security filters on the user input, we can safely transform BBCode into HTML."

There we disagree. It's basically the job of the text processor to do that. The good ones (ie. Markdown) already do that.

"The Linebreak (auto-paragraph) filter actually is a macro filter and therefore needs to be applied in the second stage, i.e., after security filters have run."

There we also disagree. The linebreak filter is an integral part of a text processor. It has no value independently of the text processor.

Comment #6

Garrett Albright commented 26 May 2010 at 20:40

sun, I'm going to have to agree with Damien here; it makes more sense to do security stuff once we have consistent input; ie, HTML (or a DOM object). If we expect the implementors of the Markdown, BBCode, etc filters to implement their own mechanisms for this sort of thing, we're going to end up with a lot of redundant work of various and unpredictable quality and implementation - and this is something which should be in core, anyway.

As for this part:

"As of now, Drupal's text processing is only intended for (X)HTML(4). However, some (already existing) use-cases in contrib, as well as HTML5 and in particular trying to load a text string into a DOMDocument requires to declare what kind of markup language we are expecting and dealing with."

Note in the OP I state that there should probably be some way to pass through the system without doing any filtering at all (or perhaps just removing the entire DOM step). This might be the only approach if we want to get something like the PHP filter to work, should we still want to keep that in core (no, please).

Comment #7

Garrett Albright commented 7 July 2010 at 03:26

I made a thing: http://github.com/GarrettAlbright/filtertng

This code has yet to be executed, but hopefully one can see the process I have in mind starting to take shape. Doing it to push this idea, as well as to learn how to use Git and keep myself busy. Once it can actually run, perhaps I'll push it to CVS as a module to piggy-back D7's filtering at first. Feedback would be appreciated.

Comment #8

cpelham commented 10 June 2011 at 00:34

OK, it's been 11 months. :) I can't see any activity in the project pages linked to in #2 and #7. Are you guys still ruminating and planning to come back to this at some point, or has this been usurped by another initiative elsewhere?

Comment #9

Garrett Albright commented 11 June 2011 at 09:01

I've taken my eye off of it, yes. I still have ideas I'd like to see implemented, but it's tough undertaking a task of this size alone, particularly when paid work calls. A classic conundrum…

Would you be willing to help out code-wise on this sort of thing?

Comment #10

barbi commented 14 June 2011 at 04:49

I am interested in helping out code-wise. Can you please split the task into smaller chunks and help me get started?

Comment #11

cpelham commented 15 June 2011 at 00:50

Did you take a look at Garrett's pseudo-code for a module to accomplish this? In his file filtertng.module the steps are nicely broken down, so you could start by coding, for example, one of the functions he suggests are needed:

<?php
/**
 * @file
 * Filter: TNG - A new way of filtering and formatting text for Drupal 8.
 */

/**
 * Implements hook_filtertng_info().
 *
 * Defines filters.
 */
function filtertng_filtertng_info() {
  return array(
    'foobar' => array(
      'title' => t('Convert &ldquo;foo&rdquo; into &ldquo;bar.&rdquo;'),
      'phase' => 'postfilter',
      'weight' => 0,
      'module' => 'filtertng',
    ),
  );
}

/**
 * Implements hook_filtertng_formats().
 *
 * Defines formats. In practice, the format should not use any filters except
 * those that are defined in the same module, or in core, or in a module which
 * the module it belongs to requires.
 */
function filtertng_filtertng_formats() {
  return array(
    'foobar' => array(
      'title' => t('Converts &ldquo;foo&rdquo; into &ldquo;bar&rdquo; and does nothing else.'),
      'prefilter' => array(),
      'filter' => array(),
      'postfilter' => array(
        // The key is the ID of the filter, and the array stores option values
        // for the filter, as set by hook_filtertng_[filter_name]_settings().
        // Our simple example doesn't have any settings yet.
        'foobar' => array(),
      ),
    ),
  );
}

/**
 * Runs filters on text. The equivalent of check_markup().
 * 
 * @param $text
 *   The text to filter/format.
 * @param $format
 *   The ID of the format to run on the text.
 * @param $params
 *   An array of parameters for how the text should be filtered. Items include:
 *   - langcode: The language code the text is in. Defaults to the site default
 *     language.
 *   - context: An array of contextual "tags" which may alter how the text is
 *     filtered. For example, if the text is in a field, the name of that field
 *     name, whether this is for a node view or a feed item, or…
 *   - cacheable: Whether the output of this format is cacheable, and whether
 *     we should try reading the cache to provide output without re-filtering.
 *     Defaults to TRUE. 
 */
function filtertng_filter($text, $format = 'foobar', $params = array()) {
  // Fill in the $params array
  global $language;
  $params += array(
    // The global $language is an object in D7, not an array.
    'langcode' => $language->language,
    'context' => array('generic'),
    'cacheable' => TRUE,
  );
  // Can we use the cache?
  if ($params['cacheable']) {
    // serialize() copes with the nested 'context' array; implode() would not.
    $cache_id = $format . ':' . hash('crc32', serialize($params)) . ':' . hash('sha1', $text);
    if ($cached = cache_get($cache_id, 'cache_filtertng')) {
      return $cached->data;
    }
  }
  
  // If we're still here, we couldn't return cached data. Build the input format
  // and filter the text.
  $format = filtertng_format_load($format);
  // …

  // Okay, do prefilter filtering. These filters take the text by reference, so
  // call them directly; module_invoke() cannot pass arguments by reference.
  foreach ($format['prefilter'] as $filter_id => $options) {
    $function = $options['#module'] . "_filtertng_{$filter_id}_filter";
    $function($text, $options, $params);
  }

  // Now create a DOM object from the text (which will hopefully be HTML at this
  // point).
  $dom = filter_dom_load($text);
  // …And do this phase of filtering. Objects are passed by handle, so filters
  // can modify $dom in place.
  foreach ($format['filter'] as $filter_id => $options) {
    module_invoke($options['#module'], "filtertng_{$filter_id}_filter", $dom, $options, $params);
  }

  // Now serialize and do the final phase.
  $text = filter_dom_serialize($dom);
  foreach ($format['postfilter'] as $filter_id => $options) {
    $function = $options['#module'] . "_filtertng_{$filter_id}_filter";
    $function($text, $options, $params);
  }

  // Cache the result, if we can.
  if ($params['cacheable']) {
    cache_set($cache_id, $text, 'cache_filtertng');
  }

  return $text;
}

/**
 * Load a text format. Find and sort all of its filters.
 *
 * @param $format
 *   The ID of the format to load.
 * @return An input format item.
 */
function filtertng_format_load($format) {
  $formats = &drupal_static(__FUNCTION__);
  if (isset($formats[$format])) {
    return $formats[$format];
  }

  // Is this format defined in code?
  $formats = module_invoke_all('filtertng_formats');
  if (isset($formats[$format])) {
    // Do we need to sort the filters?
    if (!isset($formats[$format]['#prepared']) || !$formats[$format]['#prepared']) {
      _filtertng_filters_prepare($formats[$format]);
    }
    return $formats[$format];
  }

  // Okay, time to hit the database.
  // …
  // Some day.
}

/**
 * List text formats.
 *
 * @return An array of text formats, with the system names as keys and
 *   human-friendly titles as values.
 */
function filtertng_formats_list() {
  $formats = &drupal_static(__FUNCTION__);
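  if ($formats === NULL) {
    // Map each format's system name to its human-friendly title.
    $formats = array();
    foreach (module_invoke_all('filtertng_formats') as $format_id => $format) {
      $formats[$format_id] = $format['title'];
    }
  }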
  return $formats;
}
  

/**
 * Sort the filters in a format.
 *
 * @param $format
 *   The format containing the filters to sort.
 */
function _filtertng_filters_prepare(&$format) {
  $filters = module_invoke_all('filtertng_info');
  foreach (array('prefilter', 'filter', 'postfilter') as $phase) {
    if (isset($format[$phase])) {
      if (count($format[$phase])) {
        $weights = array();
        foreach ($format[$phase] as $filter_id => &$options) {
          if (isset($filters[$filter_id])) {
            $weights[] = intval($filters[$filter_id]['weight']);
            $options['#module'] = $filters[$filter_id]['module'];
          }
          else {
            // Drop filters that no module defines.
            unset($format[$phase][$filter_id]);
          }
        }
        // Break the reference left over from the foreach loop.
        unset($options);
        array_multisort($weights, $format[$phase]);
      }
    }
    else {
      $format[$phase] = array();
    }
  }
  $format['#prepared'] = TRUE;
}

Comment #12

Garrett Albright commented 17 July 2012 at 10:40

Bump.

I'm currently underemployed and looking for something to work on to get my creative energy out and distract me from worrying about how I'm going to pay next month's rent (Fishing for sympathy? Maybe a little, though I'd rather have a contract), and I remembered this thing. It's still something I'd like to see happen, but still not something I want to undertake all by myself. It might be too late for D8, but maybe we can at least get something started that can be slipped into D9 early on. Anyone else interested?

(Issue recap: I propose changing the text format system from how it works now, where basically a bunch of filters execute in order and they're all expected to take text in and spit text out, to a three-phase system where the middle phase uses a PHP DOM object and the pre- and post-DOM phases would still work on text. For example, a format could have Markdown or BBCode filters in the pre-DOM phase, Pathologic and the HTML limiter (to remove disallowed attributes or tags) in the DOM phase, then Typogrify in the post-DOM phase since it uses external libraries which expect fully-formed HTML input. Being able to use a DOM object for many filters would be a lot less awkward than having to parse around using regular expressions and such. Also, allowing filters to put themselves in phases like this would allow us to do away with the user interface for sorting filters, or at least hide it away. Please ignore the aforelinked Github repo - this was before sandbox repos on d.o, and I think I've since deleted it.)

Comment #13

sun commented 17 July 2012 at 11:59

I'd love to have a skype conf call on this. (user: unleashedmind)

There was some disagreement at the beginning, but I agree with @Damien. However, the main challenge to discuss will be to figure out how the new architectural design with text processors (+ subprocessors?) + filters will look.

Comment #14

Garrett Albright commented 18 July 2012 at 13:40

Who would be in such a conference call?

Comment #15

wim leers

Ghent 🇧🇪🇪🇺

commented 31 July 2012 at 13:10

Note that I come at this from the "in-place content editing" angle, i.e. the Edit/Spark angle.

The content below comes from a variety of sources: my personal thoughts, input from the Spark team, but also input from Daniel "sun" Kudwien, Dave Reid and Nate "quicksketch" Haug, from one of our "D8 WYSIWYG blocker" calls (for notes from those calls, see our Google Doc).

Concerns about forcing each filter to use DOMDocument

From a purely logical POV, as well as a purist POV, it makes perfect sense to me to use DOMDocument for every filter (except for e.g. Markdown/Textile/… of course). However, this has side effects that need to be considered as well.
Imagine the case of an <img> with a data-caption attribute. It's perfectly reasonable (and it's actually even an elegant solution) to transform this image to something like: <div class="captioned-image"><img … /><div class="captioned-image-caption">This is a caption</div></div>.
The "Drupal way" to implement this, would be to use a theme('captioned_image', array('image' => '<img … />', 'caption' => 'This is a caption')) call — a theme function. This function would just print the opening of the outer div, then the image it receives, then the caption, then the closing of the outer div. By forcing every filter to use DOMDocument, it would effectively be impossible to use a theme function in this traditional sense; we'd have to pass in the DOMDocument object and force users to use DOMDocument manipulation functions. That also means it'd be impossible to use a theme template file.
Now, there is a work-around: let theme functions work like they work otherwise, then create another DOMDocument out of the fragment that the theme function creates, parse it, reconstruct the same tree for use in the original DOMDocument.
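
To make that work-around concrete, here is a minimal sketch in plain PHP. It assumes nothing about Drupal: sketch_theme_captioned_image() is a hypothetical stand-in for a theme function that only concatenates strings, and sketch_load_body() and sketch_replace_with_markup() are made-up helper names.

```php
<?php
// Hypothetical sketch; these function names are invented for illustration.

/**
 * Stand-in for a theme function: plain string concatenation, no DOM.
 */
function sketch_theme_captioned_image($image_html, $caption) {
  return '<div class="captioned-image">' . $image_html
    . '<div class="captioned-image-caption">' . htmlspecialchars($caption) . '</div></div>';
}

/**
 * Parses an HTML fragment into a DOMDocument, wrapped in <body>.
 */
function sketch_load_body($html) {
  $previous = libxml_use_internal_errors(TRUE);
  $dom = new DOMDocument();
  $dom->loadHTML('<!DOCTYPE html><html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>' . $html . '</body></html>');
  libxml_clear_errors();
  libxml_use_internal_errors($previous);
  return $dom;
}

/**
 * Replaces a node with themed markup by parsing it and importing the result.
 */
function sketch_replace_with_markup(DOMNode $old, $html) {
  $body = sketch_load_body($html)->getElementsByTagName('body')->item(0);
  foreach ($body->childNodes as $child) {
    // importNode() copies the node into the target document.
    $copy = $old->ownerDocument->importNode($child, TRUE);
    $old->parentNode->insertBefore($copy, $old);
  }
  $old->parentNode->removeChild($old);
}

/**
 * DOM filter: turns <img data-caption="..."> into themed captioned markup.
 */
function sketch_caption_filter(DOMDocument $dom) {
  // Copy the live node list first, because the loop modifies the tree.
  $images = iterator_to_array($dom->getElementsByTagName('img'));
  foreach ($images as $img) {
    if (!$img->hasAttribute('data-caption')) {
      continue;
    }
    $caption = $img->getAttribute('data-caption');
    $img->removeAttribute('data-caption');
    $html = sketch_theme_captioned_image($dom->saveXML($img), $caption);
    sketch_replace_with_markup($img, $html);
  }
}

$dom = sketch_load_body('<p><img src="a.png" data-caption="Hi" /></p>');
sketch_caption_filter($dom);
$out = $dom->saveHTML();
```

The cost is an extra parse per themed fragment, which is the trade-off to weigh against keeping theme functions and templates usable.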
There are different ways to deal with this; we just have to decide *how* we want to deal with it.

Concerns about requiring all content to be HTML as soon as "preprocessing" or "text processing" is done

Both Garrett Albright and Damien Tournoud (#3) say that all content should be HTML markup as soon as "preprocessing" or "text processing" phases are done. They're different names for the same thing: converting Markdown/Textile/… into HTML.
But … what about other non-HTML content such as tokens ([site:name]), Media module syntax ({{ type: "node/image", nid: 123, …}}), etc.?
To be able to only have to deal with HTML mark-up, and thus to be able to use just DOMDocument-based parsing from this point out, we will have to convert all of the aforementioned syntaxes into a single, HTML markup-based syntax — even if it's just wrapping the existing syntax in the standardized HTML markup-based syntax.
E.g.: <macro type="token">[site:name]</macro>, <macro type="media">{{ type: "node/image", nid: 123, …}}</macro>.
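
A minimal sketch of that wrapping step for tokens, in plain PHP. The function names and the token pattern are assumptions made up for this example (the pattern only handles simple [type:name] tokens), and <macro> is the hypothetical element proposed above, not an existing HTML tag.

```php
<?php
// Hypothetical sketch; names and the token pattern are assumptions.

/**
 * Wraps [type:name]-style tokens in a <macro type="token"> element.
 */
function sketch_wrap_tokens($text) {
  $pattern = '/\[[a-z0-9_-]+(?::[a-z0-9_-]+)+\]/i';
  return preg_replace_callback($pattern, function ($matches) {
    return '<macro type="token">' . htmlspecialchars($matches[0], ENT_NOQUOTES) . '</macro>';
  }, $text);
}

/**
 * Finds wrapped macros with DOM calls instead of per-syntax string parsing.
 */
function sketch_find_macros($html) {
  $previous = libxml_use_internal_errors(TRUE);
  $dom = new DOMDocument();
  $dom->loadHTML('<!DOCTYPE html><html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>' . $html . '</body></html>');
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  $found = array();
  foreach ($dom->getElementsByTagName('macro') as $macro) {
    $found[] = array(
      'type' => $macro->getAttribute('type'),
      'source' => $macro->textContent,
    );
  }
  return $found;
}

$wrapped = sketch_wrap_tokens('Welcome to [site:name]!');
$macros = sketch_find_macros($wrapped);
```

Once wrapped, every later stage can locate macros the same way regardless of the original syntax.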

Need for standardized way of handling "macro filter tags"

The things mentioned above (i.e. tokens, Media module, etc.) can be called "macro filter tags" in general, or even just "macros".
Yet even with all of the above, there can still be (edge?) cases that can't work this way. For example: oEmbed. This is something Drupal needs to be able to support.
We could even argue that Drupal should move all of its macros to that syntax, because it is on track to become an industry standard.
This possibility at least seems very promising: @sun discussed with @EclipseGc on Blocks/Layouts in D8 + Inline API + "oEmbed": D8 will most likely expose every entity + every field + piece of content on its own URL already. We can re-use all the new fancy content plugins in D8! :) Prototypes: http://groups.drupal.org/node/242403
However, if we go for oEmbed or something similarly URL-based (which makes sense given the above), we then face The Preview Problem. Imagine a node with an image field. Imagine then that we want to refer to this image field from within the body of the node using oEmbed. How could we do that if the node has not yet been saved? This could be solved by either always immediately saving entities or by leveraging #1642062: Add TempStore for persistent, limited-term storage of non-cache data.

My own analysis

This is the analysis I made (#1699722: "True WYSIWYG" and compatibility with Drupal's text formats/filters), rewritten here for clarity. Keep in mind that it's done from the POV of WYSIWYG compatibility. That is, we want to achieve "true WYSIWYG" (i.e. what's in the WYSIWYG must match the final output exactly), but we still want users to be able to insert and change embeddables (token_filter/media/oEmbed/…).

I believe the solution lies in classifying filters. In my mind, there are three distinct classes of filters, and AFAICT all existing filters fall into one of these classes (if I'm wrong, just let me know!):

1) non-HTML markup filters (or "different markup" filters): Markdown/Textile/… but also e.g. the PHP filter.
2) transformation filters: Typogrify, link ads, etc. But also for example the "data-caption attribute to an actual caption" case described above.
3) macro filters (or "token" or "wrapper" filters): e.g. token_filter, media, inline

The first two are "destructive by design": it is not feasible to apply the filters and still keep the related pre-filter values around. (Keep in mind that we're looking at this from the WYSIWYG angle.)
This sounds vague, but think about the third one, which does not need to be destructive: it is possible to replace the macros with their expanded values, yet still include some metadata so that the WYSIWYG editor can know the actual macro and thus show an editing interface when you click it. All that would be necessary to pull this off, is to have an additional parameter passed in to check_markup(): bool $wrap_macro_tags. This flag could be set by a module and would apply to all filtered text on the current page. When set to TRUE, it would then wrap each macro using something like this:

function filter_wrap_macro($macro, $value, $module) {
  $attributes = array(
    'class'               => 'edit-editable-macro',
    'data-macro'          => $macro,
    'data-macro-provider' => $module,
  );
  return '<span' . drupal_attributes($attributes) . '>' . $value . '</span>';
}

All of this is exactly what I did in Edit module's proposed modifications to the filter module: see Edit's filter.inc (I'm calling my version of check_markup() from hook_field_attach_view_alter()).

This approach has already been prototyped and proven to work (technically, it has not been usability tested). At least the following people also like the approach: sun, Dave Reid, quicksketch.
Attached are two screenshots from a (ugly!) prototype that show how it can work. The UI would need to be much better (e.g. it doesn't scale to selecting a different image from the site's media gallery), but the point is that expanded macro tags are detectable through JS, and you can click them to edit them.

So, when you're analyzing whether a certain rich text field is allowed to get WYSIWYG editing, you can then analyze the text format it's using:

- if only "macro filters": just load the WYSIWYG editor and apply it to the current HTML (with $wrap_macro_tags = TRUE)
- if >0 "transformation" filters: load the original using AJAX and don't run the "transformation" filters (but do run macro filters like in the previous bullet). Not perfect WYSIWYG, but very, very close. Certain filters may want to provide a JS/WYSIWYG-counterpart, e.g. in the case of data-caption attributes for image captions. But in other cases, it may not make sense: for Typogrify it's near impossible to do, for link ads it's undesirable because it could interfere with the creation of links.
- if >0 "non-HTML markup filters": no WYSIWYG editor, show the regular field widget as it appears on node/edit
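
The decision above could be sketched roughly like this in plain PHP. Everything here is hypothetical: the function name, the class labels ('markup', 'transformation', 'macro') and the returned mode names are made up for illustration. The precedence when a format mixes classes is also an assumption: the most restrictive class wins.

```php
<?php
// Hypothetical sketch; labels and mode names are invented for illustration.

/**
 * Picks an editing mode from the classes of the filters in a text format.
 *
 * @param array $filter_classes
 *   One class label per enabled filter: 'markup', 'transformation' or 'macro'.
 */
function sketch_editing_mode(array $filter_classes) {
  // Any non-HTML markup filter rules out WYSIWYG entirely.
  if (in_array('markup', $filter_classes, TRUE)) {
    return 'plain-widget';
  }
  // Transformation filters: edit the original text, fetched separately.
  if (in_array('transformation', $filter_classes, TRUE)) {
    return 'wysiwyg-on-original';
  }
  // Only macro filters (or no filters): edit the rendered HTML in place.
  return 'wysiwyg-in-place';
}

$mode_macro_only = sketch_editing_mode(array('macro', 'macro'));
$mode_transform = sketch_editing_mode(array('macro', 'transformation'));
$mode_markdown = sketch_editing_mode(array('markup', 'transformation'));
```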

Thoughts? :)

Comment #16

wim leers commented 31 July 2012 at 13:15

Of course I forgot something. I forgot to say that you could consider bool $wrap_macro_tags as a context, so in that sense, it ties back to #226963: Context-aware text filters (provide more meta information to the filter system).

Comment #17

wim leers commented 31 July 2012 at 16:38

Referenced by #1706688: [meta] In-place editing, inline macros, editables, and Wysiwyg in core.

Comment #18

wim leers commented 2 August 2012 at 10:44

RE: using DOMDocument for everything.

I managed to prove myself wrong :) It *is* possible to use DOMDocument-based parsing *and* still have theme functions. Proof: http://drupalcode.org/project/edit.git/blob/0d8b07896824f196fb78c670bb16....

RE: standardized way for handling "macro filter tags".

@sun and I had a discussion about this last night. To truly accommodate the use of macros, they need context, hence they need #226963: Context-aware text filters (provide more meta information to the filter system). And @sun says about macros:

"there's technically just simply no dependency on the filter system, and even more so, the current filter system heavily limits possibilities and would have to be re-architected first to allow fully-fledged macro implementations to happen in the first place"

And:

"it's a huge architectural change for the filter system, which will require tons of careful conceptual design and implementation work. My gut feeling also tells me that once this would happen, someone will raise the final and inevitable question why Filter module still exists and is not part of Text module."

Combine that with the fact that the filter cache was almost removed because Field API is doing the caching already (source: #226963-44: Context-aware text filters (provide more meta information to the filter system)).

So:
- filter system would need to be rearchitected;
- filter system would become stateful instead of stateless (in that the output would depend on the context), which leads to caching issues (cache explosion);
- "macro filters" implemented in the filter system would still need to do their own parsing, and thus no standardized syntax would be enforced;
- we *do* want a standardized syntax, as well as a "Macro API", so that macro providers don't need to do all the parsing work anymore;
- if we have a Macro API, we can allow modules to add more context, and then let others react to that context;
- Field API — which is where by far the majority of these contextual filters would be needed — does have the necessary context: hook_field_attach_view_alter(&$output, $context);
- CONCLUSION: we shouldn't alter the filter system to accommodate macro filters, we should move that into a separate module that hooks into and alters fields, entities, comments and whatnot on output. Then we could have an even simpler classification: 1) "non-HTML markup filters", 2) "transformation filters".

In my discussion with @sun, I proposed this:

So, roughly, the ideal macro system would:
- hook into field/entity/whatnot "view" events and alter the output of those;
- yet still provide an API.

I think this means we'd need to:
- have a macro.module in Drupal core (or outside core?), which implements hook_field_attach_view_alter() and others;
- it then calls module_invoke_all('macro_context', &$context), allowing other modules to provide context ($context is passed by reference);
- with $context = array('field' => $field, 'entity' => $entity) in the case of hook_field_attach_view_alter();
- the Edit module would implement hook_macro_context() and check whether a user would have access to edit the given field, and add ('edit-macro-wrap' => TRUE) to the context
- macro module then does the parsing to find *all* macros (macros would have a standardized syntax), each macro is annotated to know which module implements it;
- macro module then calls module_invoke($module, 'macro_render', $macro, array $context), with $context = array('field' => $field, 'entity' => $entity, 'wrap-macro' => TRUE), for every macro;
- macro module then also calls drupal_alter('macro_render', $original_macro, $rendered_macro, $context), with $context still the same;
- Edit module would be able to detect its 'edit-macro-wrap' context, but only for fields that are editable, and would then be able to wrap these rendered macros;
- this assumes order MAY NOT matter.

(ROUGHLY!)

To which he responded:

"well, you just described the architecture of Inline API 2.x :P"

(See #1671276-49: Integrate with Field API instead of random forms and textareas.)

A downside to this multi-phased approach (phase 1: filtering, phase 2: expand/render macros upon viewing, *after* filtering) is that it becomes impossible to apply filters to macros. Use case: a macro inserts an image, and you want this image to be captioned.
The macro module would then need to support captions natively to cope with this use case. Alternatively, the macro module could simply apply filters to the expanded macros manually. However, some if not most filters need to be able to e.g. set data- attributes on the affected HTML, wrap the affected HTML in some way, or annotate the affected HTML in some other way. So most likely, the macro module applying filters to the expanded macros manually would make little sense.

Comment #19

wim leers commented 2 August 2012 at 11:29

Having stated my own analysis in #15, outlined the discussion with @sun in #18, and then reread the entire issue, it seems to me there is a way to unify all of the outlined goals.

Filter stages

The filter system would need to have 4 stages:

1. HTML generator filters: the end result of running these filters MUST be HTML. Unnecessary when user is expected to enter full HTML.
   Examples: Markdown, Textile, PHP filter, but also core filters such as filter_autop ("Convert line breaks into HTML"), filter_url ("Convert URLs into links") and filter_html_escape ("Display any HTML as plain text").
2. Security filters: strip tags that the user MAY NOT use. Unnecessary when everything is allowed.
   Examples: core's filter_html ("Limit allowed HTML tags") filter.
3. HTML DOM transformation filters: DOM-based transformations; filters SHOULD NOT use regular expressions when they can use DOM manipulation instead.
   Examples: Pathologic, data- attributes-based image caption filter (see first line of #18).
4. HTML text transformation filters: string-based transformations.
   Examples: Typogrify, insertion of link ads.

At each of the stages below, the number of filters is 0–N.

Stage 1. Input: raw text. Output: HTML.
Stage 2. Input: HTML with potentially disallowed tags. Output: HTML with only allowed tags.
Stage 3. Input: HTML DOM. Output: potentially modified HTML DOM.
Stage 4. Input: HTML string. Output: potentially modified HTML string.

Note that because of the transition from stage 3 to stage 4, we're implicitly executing core's filter_htmlcorrector:

function _filter_htmlcorrector($text) {
  return filter_dom_serialize(filter_dom_load($text));
}
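
A rough sketch of how the four stages could chain together, in plain PHP with no Drupal dependencies. The function name run_filter_stages(), the stage keys and the shape of the $filters array are all hypothetical, not the API of core or of any contrib module; the DOM load and serialize steps only approximate what filter_dom_load() and filter_dom_serialize() do.

```php
<?php
// Hypothetical sketch of the four-stage pipeline. The stage keys and the
// shape of $filters are assumptions for illustration only.
function run_filter_stages($text, array $filters) {
  // Stages 1 and 2 are string-based: generate HTML, then restrict it.
  foreach (array('generate', 'security') as $stage) {
    foreach ($filters[$stage] ?? array() as $filter) {
      $text = $filter($text);
    }
  }

  // Stage 3: parse once, then let every DOM filter modify the same document.
  $dom = new DOMDocument();
  $previous = libxml_use_internal_errors(TRUE);
  $dom->loadHTML('<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>' . $text . '</body></html>');
  libxml_clear_errors();
  libxml_use_internal_errors($previous);
  foreach ($filters['dom'] ?? array() as $filter) {
    $filter($dom);
  }

  // Serializing the <body> children is the implicit "HTML corrector" step.
  $text = '';
  foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $node) {
    $text .= $dom->saveHTML($node);
  }

  // Stage 4 is string-based again.
  foreach ($filters['text'] ?? array() as $filter) {
    $text = $filter($text);
  }
  return $text;
}

// Usage: one generator filter and one DOM filter; the other stages are empty.
$html = run_filter_stages('Hello <b>world', array(
  'generate' => array(function ($text) { return '<p>' . $text . '</p>'; }),
  'dom' => array(function (DOMDocument $dom) {
    foreach ($dom->getElementsByTagName('b') as $b) {
      $b->setAttribute('class', 'emphasis');
    }
  }),
));
```

Because the document is parsed once and serialized once, any number of stage 3 filters share the same parse, and the correction of unbalanced tags happens whether or not a corrector filter is configured.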

Filter orders won't matter anymore?

I'm hopeful that this division in stages also means that we can stop forcing users to think about ordering their filters, which is a UX nightmare. Because of the basic ordering enforced by the different stages, there are fewer possible combinations, and many potentially incorrect combinations are excluded automatically. The fact that stage 3 works on the DOM level solves a lot of potential problems as well. The most problematic stage is surely going to be stage 4, hence as many filters as possible should move over to stage 3, where conflict is far less likely.

How can we prove that order doesn't matter anymore?

Stages + lack of order would make for a much better UX

Also, since the filter system cares about outputting valid HTML, I think it'd be safe to say there will actually always be just *one* filter in stage 2. Then the "allowed tags" could become the prominent setting of a text format (while still allowing it to be set to "all tags allowed").
If all of the above assumptions are true, the UI could become very simple:

Step 1: do you want the text to be written as HTML directly, as Markdown, as Drupal's simplified version of HTML (auto URL, auto paragraph) or as Drupal's plain text (escape HTML, auto URL, auto paragraph)? (By default, core will only offer full HTML, Drupal's simplified version of HTML and Drupal's plain text.)
Step 2: specify allowed tags.
Step 3: which transformations do you want to apply? (By default, core would only offer filter_html_escape

Comment #20

Garrett Albright commented 3 August 2012 at 06:55

You lost me near the beginning when you started talking about theme functions. What does content filtering have to do with theme functions?

Also, not every filter would use the DOM object; just the ones that want to. My idea is that filters could specify themselves as a pre-DOM filter (in which case, they would get the content before it's converted into a DOM object; this is where Markdown/Textile/BBCode/etc filters would go, as well as ones using Tokens and such); a DOM filter (in which case they would get the DOM object); or a post-DOM filter (in which case they would get the content after it has been serialized from the DOM object; this is where SmartyPants/Typogrify-style filters would go).

Frankly, I haven't read all of your post(s) yet, but you seemed to have some fundamental concepts of the plan wrong to start with, so I thought I'd correct those ASAP and read the rest when I have time. Hopefully it gives you a better idea of the possibilities here.

Comment #21

Garrett Albright commented 3 August 2012 at 08:00

Of course, now I look silly because I see that you eventually do get at the pre-DOM, DOM, post-DOM idea, but yeah, there's still a lot of stuff there that I don't quite grasp the relevance of. Caching, for example. I can see why we'd want to change how it works, but isn't that outside the scope of this issue, which could conceivably work with the current caching mechanism or with no caching at all?

As for filtering out unwanted HTML tags before creating the DOM object, I disagree; stripping out unwanted tags (or attributes or combinations thereof) is an example of something that would be far easier to do if we could manipulate and navigate around a DOM object instead of hoping for the best with regular expressions as we are now.
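
To illustrate the point, here is a minimal sketch of tag stripping done by walking the DOM. The function name strip_disallowed_tags() is hypothetical, and this is not a reviewed security filter: it only unwraps disallowed elements (keeping their children) and ignores attributes, protocols and script content, all of which a real filter would have to handle.

```php
<?php
// Hypothetical sketch: remove every element whose tag is not in $allowed,
// keeping its children in place. Not a complete security filter.
function strip_disallowed_tags(DOMDocument $dom, array $allowed) {
  $body = $dom->getElementsByTagName('body')->item(0);
  if ($body === NULL) {
    return;
  }
  // Snapshot the node list first: it is live, so changing the tree while
  // iterating over it directly would skip nodes.
  $elements = iterator_to_array($body->getElementsByTagName('*'), FALSE);
  foreach ($elements as $element) {
    if (in_array(strtolower($element->nodeName), $allowed, TRUE)) {
      continue;
    }
    // Unwrap: move the children up, then drop the now-empty element.
    while ($element->firstChild) {
      $element->parentNode->insertBefore($element->firstChild, $element);
    }
    $element->parentNode->removeChild($element);
  }
}

// Usage: only <p> and <em> are allowed, so the <font> wrapper is removed.
$dom = new DOMDocument();
$dom->loadHTML('<html><body><p>a <font>b <em>c</em></font></p></body></html>');
strip_disallowed_tags($dom, array('p', 'em'));
```

Nesting is handled by the tree itself here, which is the part that is hard to get right with regular expressions.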

Comment #22

wim leers (Ghent 🇧🇪🇪🇺) commented 3 August 2012 at 08:56

#20:
1. Some content filters, for example image captions, may need to have customizable output. Then you need theme functions. I initially thought that when every filter has to do its thing through DOM manipulations, it would be impossible to use theme functions, but I proved myself wrong :)
2. In my opinion, the order outlined in #19 makes more sense.

#21:
1. If you refer to the "cache explosion" point: imagine a "[current-user:name]" token in a piece of text that gets expanded by filters and then gets cached in a different cache entry for each user. That is the sort of caching issue you can expect when you include context in the filter system.
2. I don't think I explicitly stated I don't want to use DOM manipulation for stripping out unwanted tags? In any case: I agree with you. :)
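
A small sketch of why context leads to a cache explosion. The function filter_cache_id() and the $context array are hypothetical; the only assumption is that any context a filter depends on has to become part of the cache key, since otherwise one user would be served another user's expanded token.

```php
<?php
// Hypothetical sketch: a cache ID built from format, language and text,
// plus whatever context the filters depend on.
function filter_cache_id($format, $langcode, $text, array $context = array()) {
  $cid = $format . ':' . $langcode . ':' . hash('sha256', $text);
  if ($context) {
    // Sort so that the same context always yields the same ID.
    ksort($context);
    $cid .= ':' . hash('sha256', serialize($context));
  }
  return $cid;
}

// The same text, rendered for three users, needs three cache entries.
$text = 'Hello [current-user:name]!';
$ids = array();
foreach (array(1, 2, 3) as $uid) {
  $ids[] = filter_cache_id('filtered_html', 'en', $text, array('uid' => $uid));
}
$entries = count(array_unique($ids));
```

Without context, the ID depends only on the format, language and text, so all users share one entry.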

Comment #23

wim leers (Ghent 🇧🇪🇪🇺) commented 2 September 2012 at 14:09

At DrupalCon Munich, we showed off Spark Drupal in its current state. The reception was very good: pretty much universally, people were enthusiastic about it! :)

Thanks to that, we've started working on step 1 of proposing Aloha Editor as the core WYSIWYG editor. That first step will be without in-place editing (because there are still things that need to be fleshed out in that area); it will just be "WYSIWYG in core for back-end forms", i.e. for forms with text processing and an HTML-based input format (i.e. no WYSIWYG for Markdown etc.).
That first step is now done and ready for review, over at #1760386-4: Migrate Aloha Editor integration from the Edit module and make it work on the back-end.

Why is this relevant to this issue? Well, I have implemented what I have proposed in this issue:

The proposed modifications to the Filter module now live in a stand-alone "filter_true_wysiwyg" module. This makes it easier to review this. I've also cleaned up this code (in the Edit module, I had both my original, "simplest possible thing that can work" approach and the proposed approach — #807996-19: [meta] Input filters and text formats) and have added explanatory comments throughout to more clearly communicate the reasoning.

So, as of today, you can review the approach right there, in the code. It's working on D7. I've also included plenty of code comments to explain the rationale for each thing. filter_true_wysiwyg is less than 300 lines of code, about half of which is comments.

The changes necessary for filter modules to support this functionality are limited to a single line of code — I've included a "caption" module which is based on the caption_filter module (but my implementation is simpler and more robust) and implements the extra things necessary: it declares that it is of the type FILTER_TYPE_TRANSFORM_DOM.

Note that this does not (yet) deal with the "macro tag" issue (e.g. token_filter, media module, etc.). It's still possible to include that as well, but it's something that can be done on top of the work done in the filter_true_wysiwyg module. Doing that work may only make sense if we pursue some sort of "macro tag API" in core, i.e. something like sun's Inline module 2.x.

Comment #24

Garrett Albright commented 7 September 2012 at 09:44

I don't like the constant names, since they seem to define what a filter should do instead of when it's run; it may be confusing if I want my filter to run during the FILTER_TYPE_HTML_GENERATOR phase, but it doesn't actually generate HTML. I think something like FILTER_TYPE_PRE_DOM, FILTER_TYPE_DOM, FILTER_TYPE_POST_DOM would be less restrictive. In the case of FILTER_TYPE_SECURITY, what if I don't want a filter of that type to run in a given format? Will I still be able to disable it?

Comment #25

wim leers (Ghent 🇧🇪🇪🇺) commented 7 September 2012 at 09:56

Yes, they define what a filter should do, because otherwise there's no way to reason about what a filter does. That's the main point: so that e.g. "true WYSIWYG editors" can reason about this.

That being said, I've outlined a strict order above (in #19) anyway, so it does actually imply the order as well.

You will still be able to configure which filters should be applied for a given text format. FILTER_TYPE_SECURITY only implies that these filters, if configured for a given text format, can never be disabled when rendering output for the user, to prevent security holes.
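
As a sketch of that rule: a caller (for example a WYSIWYG editor) may ask to skip certain filter types when rendering, but security filters are always kept. The constant values, the function filters_to_apply() and the 'type' key on each filter are assumptions for illustration, not the actual filter_true_wysiwyg code.

```php
<?php
// Hypothetical constants; the names follow this issue, the values are made up.
const FILTER_TYPE_HTML_GENERATOR = 1;
const FILTER_TYPE_SECURITY = 2;
const FILTER_TYPE_TRANSFORM_DOM = 3;

// Returns the filters of a text format that must run, given the filter types
// the caller would like to skip.
function filters_to_apply(array $enabled_filters, array $types_to_skip = array()) {
  // Security filters are never skippable, whatever the caller asks for.
  $types_to_skip = array_diff($types_to_skip, array(FILTER_TYPE_SECURITY));
  return array_filter($enabled_filters, function ($filter) use ($types_to_skip) {
    return !in_array($filter['type'], $types_to_skip, TRUE);
  });
}

// Usage: the caller tries to skip both security and DOM transformation.
$enabled = array(
  'markdown' => array('type' => FILTER_TYPE_HTML_GENERATOR),
  'filter_html' => array('type' => FILTER_TYPE_SECURITY),
  'caption' => array('type' => FILTER_TYPE_TRANSFORM_DOM),
);
$applied = filters_to_apply($enabled, array(FILTER_TYPE_SECURITY, FILTER_TYPE_TRANSFORM_DOM));
// 'caption' is skipped; 'markdown' and 'filter_html' remain.
```

Whether a security filter is part of a text format at all stays a configuration choice; the restriction only applies to skipping it at render time.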

Comment #26

andypost (he/him, Russian) commented 9 June 2014 at 16:18

is this issue still valid?

Comment #27

Garrett Albright commented 10 June 2014 at 04:57

I personally would still love to see a system like I mention in the OP, but it looks like D8 went in a different direction largely to accommodate WYSIWYG in core.

Comment #28

wim leers (Ghent 🇧🇪🇪🇺) commented 11 June 2014 at 20:32

#27: D8 didn't go in a different direction at all, since D8 didn't address this at all!

The only thing D8 added was the part that was most crucial for D8, took the least work, and met the least resistance: a classification of filters. That may be a baby step to help this issue move forward, but that doesn't mean this issue is no longer relevant or valid; it most certainly is.

I'd love to see this happen, but I'm afraid this is now D9 material. Too few people care about this to change/improve this. The current system is mostly "good enough", AFAICT.

Comment #29

Garrett Albright commented 12 June 2014 at 07:40

Well, I care. And if we could get some sort of consensus about getting something like this in core for D9, I'd be glad to work with you to make it happen.

Comment #30

wim leers (Ghent 🇧🇪🇪🇺) commented 12 June 2014 at 08:24

I know you care :) But as you can tell by the number of people participating in this issue over the past 4 years, there are few who do.

In any case, I'd be happy to help with reviews when you get back to work on this :)

Shall we move this to 9.x-dev?

Comment #31

Garrett Albright commented 17 June 2014 at 15:21

Version:

8.x-dev

» 9.x-dev

Boop.

Comment #32

Garrett Albright commented 31 July 2014 at 17:11

Given the new versioning scheme, do we think that this might be something that can be done in a point release (8.1.x), or are we still going to wait for 9.x?

Comment #33

wim leers (Ghent 🇧🇪🇪🇺) commented 31 July 2014 at 18:31

If you can achieve this without breaking APIs, then yes, we can do this in a point release.

Comment #34

catch

he/him

English

29 April 2016 at 20:47

Version:

10.1.x-dev

» 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch, which currently accepts only minor-version allowed changes. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Comment #49

smustgrave commented 21 July 2025 at 16:53

Issue still valid?

Comment #50

andypost (he/him, Russian) commented 21 July 2025 at 21:05

there's even more in terms of HTML 5 adoption, like the related issues and specifically #3463613: Explore PHP 8.4 native HTML 5 parser vs html5-php

Comment #51

21 July 2025 at 21:05

Version:

11.x-dev

» main

Drupal core is now using the main branch as the primary development branch. New developments and disruptive changes should now be targeted to the main branch.
