Looking through #1286004: Research/Build suitable captioning system for inline images, something I have been thinking for awhile is that Drupal could use a system to handle custom markup. The common method for this is using the square brackets, ala BBCode, but perhaps a better way would be to come up with a custom 'drupal' html tag:

<drupal tag="{tag type}" {attributes....}>{text node}</drupal>

Or self closing:

<drupal tag="{tag type}" {attributes....} />

The benefit here over the BBCode style is that these type tags can be handled via dom manipulation as well as grepped out through a regular expression.

While this may seem something more for contrib, if in core it would provide a template for any contrib modules that want to insert custom markup through special tags. Likewise the WYSIWYG editor that is decided upon for core could have a plugin shipped with it that does automatic parsing of these tags and provides an API for the WYSIWYG replacement.

Types of use these tags could have:

  • Image Captions
  • Embed Media Filter
  • Inserting inline polls
  • Automatic referencing of other entities/nodes
  • Node paging

An example of an embed could be something list:

<drupal tag="embed" src="http://youtube.com/v/xxxxxx" width="400" height="300" autoplay="false" />

Then the module that provides the embed type tag would be responsible for validating and parsing out the actual embed src and decide if it should be rendered or not.

If there is interest in this, it's something I would gladly jump in and help on. I've already done similar systems for numerous clients and it works out really well.


quicksketch’s picture

You might take a look at my attempt at solving this same problem over in http://drupal.org/project/caption_filter. It implements a [caption] filter (similar to WordPress) in both PHP and JavaScript. It currently only works in tinyMCE but there's a patch for CKEditor in there too. It'll be interesting to see if Aloha (or the chosen editor) has support for this kind of on-the-fly parsing at all.

Jamie Holly’s picture

I've done something really similar on my own site and a couple of client's sites. Anymore I more favor a regular markup style tag like the >drupal type="" tag I mentioned above. That way if you ever stop using a filter that implements a custom tag, it will just not show up in the regular HTML. Of course if we went to a standardized method of handling custom tags then that will be easily remedied as well.

Aloha would be able to do this, but it would take some work to get it going. They don't have a placeholders system in place like CKEditor, but it would be doable through the selection and range stuff and then altering the DOM. Convert the tags on aloha-editable-activated and change them back on aloha-editable-deactivated.

quicksketch’s picture

They don't have a placeholders system in place like CKEditor

The placeholder system in CKEditor isn't very good in my experience. I suppose for literal "placeholders" it does the job just fine (think things like single images, youtube embeds, or the Drupal <!--break--> tag), but for "wrappers" like I had for Caption Filter, the placeholder system provides no value.

I agree that using HTML tags can often be beneficial though, for the reasons you state but also because newer versions of jQuery have native support for "data" attributes, which basically allow you to super-charge tags with completely fabricated properties. Take a look at Confluence's implementation (TinyMCE) using data attributes, which includes a wrapper DIV around images and then a normal IMG tag. The data attributes hold all the internal information (like asset ID, resize dimensions, an ImageCache-like preset, etc.) Then you end up with situation that your unfiltered HTML is still valueable because the IMG tag itself is actually valid, but it's been augmented with special properties that only work when the filter is running.

Spark itself probably won't worry about alternative formats (BBCode, Markdown, etc), but if it's aiming to be a solution for core we'll need something that can handle non-HTML based formats too.

Jamie Holly’s picture

Yeah the placeholders in CK pretty much suck, but they are better than nothing. An alternative would be to wrap any filter generated markup in a division and then have jquery grab the calculated width and height during the pre-edit (or whatever) stage and do a simple graphical representation of how that would appear.

As far as the data attribute, that is very powerful but I think we want to keep everything simple enough for the layman to look at the markup and get an idea of what is happening. If we can follow a normal HTML syntax as close as possible, then we will alleviate a lot of confusion down the road. That's why I was thinking a <drupal type="{filter type}" > style syntax. It's easy to grep out server side and for jQuery to process on the client side.

quicksketch’s picture

If we can follow a normal HTML syntax as close as possible, then we will alleviate a lot of confusion down the road.

Right, data attributes *are* normal HTML and even better, they're actually valid markup. So not only will they not render if filtering is disabled, they also are valid HTML and will pass through validators. If the filtering were disabled, the main content would still actually function; in a way you could consider it graceful degradation. Here's an example:

<img src="/files/example.png" width="100" height="100" data-file-id="10" data-caption="This is the image caption" data-align="right" />

In jQuery land, these data-* attribute automatically become accessible on the IMG element as $('img').data('caption') (http://api.jquery.com/data/#data-html5).

Jamie Holly’s picture

But the question is how will we get filtering to inject these tokens? You really can't do a simple diff between the raw and filtered text, as that would rely upon filtering order (ie: autop would be a diff as well as linkify). About the only real solution I see if having the tokens put in during filter's process function. That would then be left up to the contrib author's to utilize a wrapper function to put the tokens in before their replacement. Something like:

$text = str_replace($search, _filter_tokenize($replace, (array) $data), $text);

That's not a real biggie for D8, but could be considered a pseudo API change in D7 and might take awhile for contrib to catch up. Until that happens, anyone trying inplace editing in D7 would be met with some really funky side effects.

quicksketch’s picture

Yeah the parsing code on the PHP side would be a bit more difficult, but it's not impossible. We could use one particular attribute to trigger "filter this tag", then use a regular expression to select the whole tag that contained the data attribute, then work specifically on that one tag. I did something similar for the Image Resize Filter module. If you want to see this approach in action, try out Confluence, it's Wiki software that uses data attributes in its WYSIWYG with great success (though it's written in Java).

Regarding filtering order... well yeah that problem will continue exist just like it does today. Though using <drupal> tags or any other kind of pseudo-tag wouldn't solve the problem either.

Wim Leers’s picture

Hopefully Dave Reid will chime in on this issue as well :) (I've linked to this issue from #1580210-142: Figure out what WYSIWYG editor to use.)

Personally, I think it's important that we tackle this problem as part of Spark. If we don't, then WYSIWYG in Drupal still can't come fully to fruition. Fortunately, we also want to tackle this as part of Spark.

@quicksketch offers a very interesting approach in #7 (especially when combined with #5). @Dave Reid, what are your thoughts on this?

Jamie Holly’s picture

@quicksketch - the attribute approach is interesting and parsing it out via regex wouldn't be that bad, but I could see if being problematic on more complex filters, especially if they output two top level blocks. That's why I feel a filter_tokenize API would be a much better and sure method to tackle this:

function myfilter_process($settings, $text){
  $search = '[[some weird tag I came up with]]';
  $replace = '<div class="my_weird_tag">....</div>";
  return ($search, filter_tokenize($replace, 'myfilter', array('editable=>TRUE, 'settings'=>('some att'=>1));

Now for new Drupal supplied functions

//In filter.module

function filter_tokenize($text, $filterName, $attributes){
  if (filter_should_tokenize()){
     $tokenStart = '<!--filter name="' . $filterName . '"{extra parsing to inject attributed'"-->';
     $tokenEnd = '<!--/filter-->';
    return $tokenStart . $text . $tokenEnd;

  return $text;

// We don't need to have the tokens on regular view, just when editing, so let's set a flag.
function filter_should_tokenize($enable = NULL){
  static $enabled = FALSE;
  if ($enable !== NULL){
    $enabled = $enable;

  return $enabled;

// Menu handler for inline_edit_prepare or whatever.

function inline_edit_prepare($element){
{.... our other prepare code...}

It's basic, just off the top of my head this morning and could definitely be improved, but you do get the general idea, but it does make sure the markup created within the filter process handler is fully marked. We can also use the editable setting to indicate to Aloha if that area should be editable or not by using the block plugin that's in Aloha common (some other WYSIWYG do offer something similar).

Either way we go, the tokenize or attribute route, it's still going to require module maintainers to update their code. The problem with my approach is that in D7 the functions wouldn't exist unless you installed D7.{whatever version} whereas your attribute approach wouldn't break things in lower point versions.

Jamie Holly’s picture

Ohh and back to the original issue. I still think core should work on coming up with a standardized syntax for custom tags and build an API to make implementing them easier. That would more be a core initiative that Spark would have to deal with and something that should definitely go to D8., but would get away from the wide range of formats being used now ([[tag]], [tag], <--{tag}-->, etc.).

lightsurge’s picture

#9 Looks similar to the sort of method I suggested in the other issue? #125/#132

Both look fine to me but I suppose putting the tokens in a data attribute as quicksketch suggests in #5 seems a little more graceful, because a html comment should be just that really, a comment, not a functional reference to data.

Jamie Holly’s picture

#11 - Yeah, that's where I got that idea LOL. The using of comments for functional reference isn't the best. We could also go to a custom wrapper tag:

<drupalfilter name='' editable=true {.....}>


This provides an included benefit of being able to pull it out and grab the attributes with standard jQuery, where as the comment method would require a regex. That means this route would follow the typical flow of DOM selection and manipulation much better, plus those tags would only be included when you are editing, not on normal view.

lightsurge’s picture

@#11 Having said that, no, it would mean having to encapsulate multiple tags rendered from a token into a new div (otherwise which tag do we put the data attribute into)... which seems bad here. So the html comment, even though it's perhaps taking using a html comment not quite as intended.

lightsurge’s picture

@#12 Wouldn't this invalidate the markup tho?

lightsurge’s picture

I think this will be a two-pronged attack. When it comes to tokens, that's quite easy, you just have a filter that runs before all others, as described here http://drupal.org/node/1580210#comment-6070130

1) So basically after the first filter, our tokens in markup are changed inline with this example:

<!--TOKEN-STARTED="]token[" i.e. an escaped token -->

By the end of the input filter process they'll have become inline with this:

<h2>I'll be your token output for today!</h2>

2) But in the case of non-tokens, i.e. slices of text that are targetted by things like semantic linking filters, we have to compare the before/after filtering inline with http://drupal.org/node/1580210#comment-6070182

In 2 we'd be running a diff of before and after the complete filtering process (filters may alter the same markup more than once, remember, which is unlike tokens which only get replaced once and is why they're easier to deal with here *), and we'd stick a html comment around those discovered diff output sections with what was originally in the markup. So on save we record what's in the comment instead and the input filters can do their business as intended the next time that textarea is filtered, rather than those dynamic changes having been hard-saved.. As part of this I think we'd have to filter_xss() the original markup and escape it in that html comment (so on save it will be sanitised and no longer warts and all with XSS code, etc - perhaps the same with the tokens as well, although in both cases that doesn't seem like much of an issue to me).

But anyway, 2 is much more of a grotty solution than 1) and might involve some gripes.

* I might be wrong here, is there any reason we'd want subsequent filters to alter markup generated from tokens earlier in the filtering process?

Jamie Holly’s picture

@#14 - yes it would be invalid markup since you aren't using a valid tag, but this will only be used when preparing for inline edit. When you start the inline edit, an AJAX call is made to Drupal and the tokenized element value is then sent back to be edited. On regular views those tags wouldn't be put in at all.

MustangGB’s picture

Presumably invalid markup shouldn't be used because a clean fallback for clients with JavaScript disabled in needed.
Some data attributes like #5 look much neater than random comment tags or invalid HTML.

lightsurge’s picture

@#17 Yeah as quicksketch #5 says that includes any validators that won't take account of javascript alterations.

Like I say I think there'll be problems with using the data attribute, might work great for applying a caption to an image, but it wouldn't work great necessarily for a token that gets converted into a whole bunch of markup. With either, in the case of 2) in #15 there's also the issue of bloating the markup in certain cases (say for example a paragraph of text is filtered to have p tags added, and we end of up with this whole paragraph without the tags in the comment/data attribute), so that solution is likely to be much more complex, but likely doable.

Perhaps we're getting ahead of ourselves a bit here as far as Spark, at this stage of the game, goes. It looks like we do have sketches of solution[s] to this particular problem of before/after input filtering, so we might be safe-ish in saying that it won't get in the way from the idea becoming a reality in any sort of unfixable way, or impinging on any other considerations like the choice of wysiwyg.

I think as useful as the method would be, putting invalid tags into markup and relying on javascript to turn them valid, doesn't seem right.

Jamie Holly’s picture

#17 - If they don't have javascript enabled they won't get the invalid markup. This is only injected in a prepare AJAX callback right before in place editing takes place. In place editing alone requires Javascript to be enabled. The whole thing is we need a way to mark the beginning and ending of the code rendered by a filter. A simple attribute isn't enough for that. Take an ad filter. Some ad injection code does something like this:

<script type="text/javascript">
var adunit = '250x300';
<iframe ....></iframe>

Marking the opening

Now when the WYSIWYG editor prepares the content for output it can iterate through whatever tag we come up with and send do an editor prepare call on that filtered content, either leaving it alone or putting some sort of placeholder in there and letting it activate a custom settings dialog when needed. Before the getContents is called to send it back to Drupal, that entire block is replaced with the appropriate markup ([tag ...], whatever). The best part is that this would all work with normal jQuery selectors as well, keeping something easy to use and that people are already familiar with.

lightsurge’s picture

The whole thing is we need a way to mark the beginning and ending of the code rendered by a filter. A simple attribute isn't enough for that. Take an ad filter. Some ad injection code does something like this

Totally get this, that's why I think the data attr is no good here.

This is only injected in a prepare AJAX callback right before in place editing takes place

Not sure I get this though.. I'm assuming you mean at this point, we have the final markup from the input filters, and in javascript we add the <drupalfilter name='' editable=true {.....}> wrappers to the relevant bits and add the relevant (token etc.) data? At this stage, how do we find the relevant bits/data? The ajax call diffs markup against the original content?

Jamie Holly’s picture

#20 - The flow would actually go this way:

  1. The content is clicked to edit. In aloha, this fires a aloha-editable-activated event.
  2. Before editing is enabled the editor makes an AJAX call to Drupal to get the "prepared for edit" content, which is when the tags are put in (running through check_markup without caching and a flag set to indicate placeholders should be injected)
  3. The content is sent back to the editor and the content to be edited is replaced with this new content
  4. The editor then lookps through the custom tags and takes whatever action is necessaary (leave it alone, insert placeholders, etc.)
  5. You can now edit

When you save the content:

  1. The editor again goes through these tags, replacing them with the appropriate custom markup ([caption], whatever).
  2. The finalized content is sent back to Drupal.
  3. Drupal runs the regular filters on it (check_markup).
  4. Content sent back to editor.
  5. Edited content replaced with the new, filtered content.

So the wrapper tags are only inserted when content is being edited and inserted through Drupal.

And another added benefit of these tags is that the data attribute can be used on them. That means if I offer a custom tag with some settings, a click on that placeholder can open a dialog. I can change some settings and then set it with the data-... attribute (or even with just $(this).data()), which makes #1 in the save workflow even easier to handle.

ezra-g’s picture

Since it hasn't been explicitly mentioned in this issue: http://drupal.org/project/token_filter

Jamie Holly’s picture

#22 - Token filter really isn't what we are talking about here. Instead we're talking about a way to wrap markup inserted via filters as a way to identify that content in the in place editor.

Although Token Filter is an excellent example of why we need a function to wrap the output. Token Filter can do replacements that are just plain text, no markup at all. There's really no way of identifying that before being ready for edit with an IPE and if it isn't marked then if you do an IPE, the text generated from the token will then be saved with the content, losing the token all together.

Wim Leers’s picture

We had a call with the Aloha Editor developers. We now know more about what is needed to make this work with Aloha Editor. Looks like HTML comments are off the table. Custom tag vs. attributes both still seem viable.

Please see #1617670: Aloha Editor: adoption concerns discussion.

Thanks for all the great input so far! :)

Jamie Holly’s picture

After playing around and coming up with a very (VERY) crude example, I've figured that we wouldn't really need a custom tag, but could wrap the filtered area in div or span with a class to easily select it. Let the module decide what kind of wrap it gets depending on what it does. Here's the example I came up with:

[caption align=right]<img src="/sites/default/files/drupal.jpg" alt="" width="144">This is a right aligned caption[/caption]
<p style="">Some test text for editing.</p>

Regular that would render to:

<div class="caption caption-right">
<div class="caption-inner" style="width: 144px;"><img src="/sites/default/files/drupal.jpg" alt="" width="144" />This is a right aligned caption</div>
<p>Some test text for editing.</p>

When you click to edit the area, an ajax call is sent to Drupal to tokenize the filters. The body then becomes:

<div class="drupal-ipe-block" name="caption_filter" align="right" width="144px" id="caption_filter">
<div class="caption caption-right"> <div class="caption-inner" style="width: 144px;"
><img src="/sites/default/files/drupal.jpg" alt="" width="144" />This is a right aligned caption</div> </div> </div>
<p>Some test text for editing.</p>

Next in my flow is a drupalblock plugin for Aloha I threw together real quick:

var DrupalPlugin = Plugin.create( 'drupal', {
      settings: {},

      init: function () {
        var that = this;
        Aloha.bind('aloha-editable-deactivated', function(jEvent, params) {

      _createBlocks: function(e) {
        // Make sure that block hasn't been processed already
          var name = jQuery(this).attr('name');
          if (Drupal.alohaFilters[name]){
            Drupal.alohaFilters[name].onBlock(this, e);
          jQuery( this).alohaBlock();


Drupal.alohaFilters is part of our own internal system. This will allow module maintainers to simply add callbacks for block behaviors.

Here's the very rough routine I threw together in my example:

Drupal.alohaFilters.caption_filter = {
    onBlock : function(elem, aloha){
      var img = $('img', elem);
      $(elem).bind('click', function(){
        var caption = $(".caption-inner", elem).text();
        var p;
        if (p = prompt('Enter caption', caption)){
          $(".caption-inner", elem).html('').append(img).append(p);
    onSave : function(elem, aloha){
      var img = $('img', elem);
      var caption = $(elem).text();
      var src=img.attr('src');
      var width = img.attr('width');
      var tag = '[caption align="right"]<img src="' + src + '" width="' + width + '" />' + $(elem).text() +'[/caption]';

Just hacking at the Aloha plugin, here's the save handler, which calls on onSave for our blocks:

var saveIfChanged = function () {
        var html = Drupal.behaviors.alohaEditor.getContentFromContainer($container);
        var tempDiv = $("<div />");
        jQuery('.drupal-ipe-block', tempDiv).each(function(){
          var name = jQuery(this).attr('name');
          if (Drupal.alohaFilters[name]){
        //jQuery( this).alohaBlock();
        var save = tempDiv.html();
        if (region.html != save) {
          // Do our save, get back filtered content and replace editor with content
         // then reinitialize the editor so blocks get added in again.

What happens is you can't edit the image caption. If you click on it a prompt pops up asking for a new caption. It then updates the caption for reverse parsing on save.

Everything is working as expected. I just did this as a proof of concept and in no way as actual code, but it looks like what we want to do is very doable. I might try to throw the example up on one of my servers tomorrow.

Jamie Holly’s picture

One other addition on this. We could implement a way for module developers to more easily add in their WYSIWYG js and css files:

function caption_filter_filter_info() {
  $filters = array();
  $filters['caption_filter'] = array(
    'title' => t('Convert [caption] tags and allow image alignment'),
    'process callback' => 'caption_filter_process_filter',
    'tips callback' => 'caption_filter_tips',
    'weight' => 20,
    'wysiwyg_files' => array(
       'js' => array('image_caption_wysiwyg.js'),
       'css' => array('image_caption_wysiwyg.css')
  return $filters;

Then on pages with the editor those files are added in from the filter list of every input format that user has access to (unless we implement a lazy load method, which I really like - see #1542344: Use AMD for JS architecture)

lightsurge’s picture

@Jamie Holly

running through check_markup without caching and a flag set to indicate placeholders should be injected

Are you suggesting then that there be a change to drupal core to allow the calling of check_markup with a flag (argument) to add some kind of wrapper to each piece of token output? (if not, sorry if I'm being a bit dumb here) I can see how that would be really useful, even though as mentioned in the other thread, I'm not too sure about adding divs.

lightsurge’s picture

I can see how that would be really useful, even though as mentioned in the other thread, I'm not too sure about adding divs

Of course, if such a flag were available, then I suppose these divs would only be added on a javascrip trigger on 'inline edit' which would be okay (edit: well, better).

Jamie Holly’s picture

@#27 - Yeah this would require a change to the filter module. Since we are talking D8, it is a good time to figure this out so any API changes can go in. A fallback solution for D7 would be for each module to manually add the wrapper tag if the page request is for the IPE prepare.

The modifications I did to filter.module are really simple (and does need improved, but done just for proof of concept):

function filter_tokenize($text, $filtername, $type, $attributes = array()){
  if (filter_enable_tokenize()){
    $tag = $type=='inline' ? 'span' : 'div';
    $defaults = array(
      'class' => array('drupal-ipe-block'),
      'name' => $filtername
    $attributes = array_merge_recursive($defaults, $attributes);
    $tagStart = '<' . $tag . drupal_attributes($attributes) . ">\n";
    $tagEnd = "\n</" . $tag . ">";
    return $tagStart . $text . $tagEnd;
  return $text;

function filter_enable_tokenize($enable = NULL){
  static $enabled;
  if ($enable !== NULL){
    $enabled = $enable;
  return $enabled;

In all honesty it would be best to drop the filter_enable_tokenize() and just add a new flag to check_markup indicating if filters should tokenize and pass that variable to the filter processor functions.

@#28 - that's the only time they are added; when you are actually editing that content. Once you save your edit, then the onSave function is called for the JS for your filter (Drupal.alohaFilters.caption_filter.onSave above) and that replaces the added div or span with the actual markup code your filter parses. Once all those run, that content is sent back to Drupal to save and then gets filtered again. That filtered output (the same as you see normally) is sent back to the browser and the edited content replaced with it and the editing session terminated.

lightsurge’s picture

Since the Spark distribution is for 7.x and it's supposed to be an incubator for UX improvements, the 'Pressflow of UX', it seems like it should work as far as possible out of the box with contrib modules (i.e. you wouldn't incubate a reptile egg with an environment that would kill it :D).. dunno, maybe it's also about incubating changes in contrib, but the changes in contrib would be more complex quite possibly than the changes in core - and along with that all the replicated effort, which seems a bad way of doing it.

So I think these changes, if they became the plan, should be made to 7.x

Jamie Holly’s picture

Another possibility without doing core changes in D7 is implementing a custom check_markup when getting the content to edit. It would be rather tricky, but doable:

  • Loop through the input filters.
  • Store the text in a variable.
  • Process the returned text for the filter
  • Do a diff between the stored text and the new text.
  • Any additions in the new text would get wrapped with the div or span (figure the type of element needed by the first tag).
  • The deletion from the stored variable would be the custom tag used to trigger the replacement. Store that in a data attribute of the wrapper.
  • Also store the filter name in a data attribute.
  • Return the new text, complete with wrapper.

It's not perfect, nor foolproof and of course there are certain filters you wouldn't want this done with, like auto_p. That would create one heck of a mess!!!!

lightsurge’s picture

@Jamie Holly I totally +1 your idea with improving check_markup here but I think if a security maintainer reads this they may well lay a reptile egg and probably aren't thinking about incubating it... hehe ;)

(Based on creating an alternative check_markup() I mean, not your idea for 8.x!)

lightsurge’s picture

Just to emphasise not criticising there, very similar in #31 to how I envisaged this working, but figure it would not make it due to same reason in #32. Reason why I favour your idea of patching core.

An alternative is to course run a filter at the beginning and the end of the cycle (or after each input filter, possibly), the first filter to catch the tokens, and the second to catch other non-token orientated alterations.

lightsurge’s picture

Could we not use hook_filter_info_alter?

function spark_filter_info_alter(&$info) {
$filter = $info['php_code']['process callback'];
$info['php_code']['process callback'] = 'spark_get_filter_changes($filter)';

This is a bit beyond me, but could spark_get_filter_changes() then check for an IPE context, otherwise leaving the filter to do its stuff as normal?

Jamie Holly’s picture

@#33 - the only problem with patching core in D7 is that we would need the filter_tokenize function. Now contrib also has to start using that function for all this to work, meaning we would have contrib modules that are dependent upon a certain point release of Drupal (or do a function_exists()), or maintain a separate Spark branch of that module.

But yeah, I think the patch route is by far the best. You are sure to wrap everything you need this way, so long as contrib does their part.

lightsurge’s picture

#34 was a bad code illustration, but you get my point, we could override all process callbacks to a single function by adding the 'real' process callback function to the $filter variable... to get to the same place as #31, without having to have some weird weighting of input formatters like in #33 or mobbing the check_markup() function, and being able to react to progressive (i.e. 'I am a piece of plain text' to 'I am a piece of text with a bit more' to 'I am a piece of text with a bit more and more') filter alterations as well as token replacement.

Jamie Holly’s picture

@#36 - still won't work without some sort of diff against the actual text. Think of how input filter processors work:

function mymodule_filter_process($text){
  $search = '[token]';
  $replace = 'my new text';

  return str_replace($search, $replace, $text);

"This is [token]" goes in and "This is my new text" comes out.

Even using a single override you have the same problem:

function filter_process_override($text, $filter){
  //$text = "This is [token]";
  $newText = call_user_func_array($filter['original_processor'], func_get_args());

  //$newText = "This is my new text" ;

  return $newText;

Now we got to figure out the logic to compare $text to $newText, extract those changes and wrap them with whatever placeholders we come up with. Again, there are going to be some filters that this will create an absolute mess on. Not only auto_p, like I mentioned above, but also things like markdown.

The other big benefit of going the extreme filter_tokenize route is that we can have custom attributes that can be sent to our plugin in Aloha to configure how the blocks should act without the contrib module needing to include any Javascript. Provide a basic framework and then allow contrib to leverage that framework's API to create more robust dialogs and that for their blocks.

lightsurge’s picture

No, it would run a diff, my thought that was if you override process callbacks for all filters as in the crude example to the same function e.g. spark_add_placeholders(), adding the original filter process callback function as a variable to each $filter object which gets passed, you could indeed run a diff for each filter against the text-up-till-now ($text) and the text that the current filter would change it to (by triggering the real filter process function), then add your wrappers with data attributes etc, and so on. So it's progressive, and seems to tick all boxes.

lightsurge’s picture

Re-reading your post I see that you understood what I mean, I can't think of anything but diff for 7.x without an extreme patch which as in your case would require contrib modules to change quite a bit, though.

Jamie Holly’s picture

The diff thing is still going to be pretty complex since we have to function on a character level granularity and that still is very complex. I just saw this, but haven't played with it: https://github.com/gorhill/PHP-FineDiff

Still there are filters this is going to create a mess on. Say I got a filter that changes all links to rel='nofollow'. Every link is going to be marked as a change, but without doing some actual DOM navigation, then we have no way to figure out where to really start it. Going with a plain diff, it would detect rel="nofollow" as the only change in text like this:

<a href="http://example.com">click me</a>

So now we have to go back and mark it at the start of the tag and know to look for the closing tag to end our mark.

Next problem - two tags side by side:

This is my test post.


Yada yada yada

Now the filter processor is going to change all of those at once and return it to our main processor. A diff is only going to see that as one chunk of code, not two and break things. This is probably the most common scenario we would see.

With these problems, I don't think it is a real viable solution without some insane logic going in to parse everything out, and given the second example, I just can't figure out code in my head that would split that into two separate areas.

lightsurge’s picture

Understand about the first bit and the markup bloat issue (perhaps that would be 'unlikely' to exceed bloating the markup more than twice the size, tho... heh) and agree it's not ideal.

Not too sure about what you mean in the second, but it wouldn't surprise me at all if there's an issue with this diffing methodology.

Jamie Holly’s picture

The second is real simple. Say I have a post that has this in it:

This is my post

[image id=1][image id=2]

Do you like my images?

That's the code in $text in our main filter processor function. Now we get our new text by passing that into the original filter's processor. That gives us:

This is my post

<img src="http://example.com/sites/default/files/someimage.jpg" /><img src="http://example.com/sites/default/files/someimage.jpg 2" />

Do you like my images?

You have now got a diff that encompasses two separate tags, but the diff will only see it as a single change, as it should.

One way to separate the two tags would be a data attribute on the starting tag of the replaced text, then as it moves forward through the diff stack it gets to either a) original content or b) another tag with the data attribute. It then marks that as the end of that block. Of course about the only way to really get this is with full DOM parsing on each chunk from the diff and that alone could become rather intensive. The other option would be a walker class that can go back to the closing of the previous block and then insert the ending marker there and the next opening marker right after that.

lightsurge’s picture

Gotcha... alternatively I suppose in this case we could simply make the tokens inaccessible in IPE (e.g. using aloha blocks) and point the user to the nodeform, and for any plugins to make sure any new tokens inserted through IPE are diffable. Although looking for another data-spark-ipe attribute might work, I suppose.

Bearing in mind I accept this diff method is effectively pants.

Edit: or as part of our diffing module, separate the tokens before the diff and then squish their outcome back together again.

lightsurge’s picture

Parsing disaster really.

Jamie Holly’s picture

The problem I see with the diff approach is that there really is far too many areas that it can break. That would be a disaster for Spark. Say I got a high traffic site and do a quick IPE on a big piece of content. This diff method really borks the content, but I don't realize that until after it's saved and reprocessed. Now you got an end user rushing to try and recreate the content on the node form. That would be really ugly and very feasible given this approach.

lightsurge’s picture

Title:Implement A Custom Markup/Tag Framework.» Provide WYSIWYG integration and only store the proper, "pre-transform" mark-up
Priority:Normal» Critical

Yeah, it'll be really interesting to see if the Spark tries to patch 7.x core to accommodate this or develop a workaround as has also been discussed here.

As per http://drupal.org/node/1580210#comment-6074266 Wim Leers marked this problem as critical, but this issue isn't marked as such, so adjusting accordingly (hope you don't mind me adjusting the title, but I thought 'Implement A Custom Markup/Tag Framework' is basically a solution to this problem):

#110: That's critical indeed. I'd imagine that whenever the current visitor is allowed to edit the content on the current page, it would be rendered with some additional mark-up, enabling us to provide WYSIWYG integration and to only store the proper, "pre-transform" mark-up.

lightsurge’s picture

Title:Provide WYSIWYG integration and only store the proper, "pre-transform" mark-up» Provide WYSIWYG [filter compatibility] and only store the proper, "pre-transform" mark-up
Jamie Holly’s picture

I've got the example up I talked about earlier, using the image caption filter as an example and using the wrapper function in the filter processor:


lightsurge’s picture

As far as diffing goes, utilising filter_dom_load() on the pre/post filter markup and comparing the objects might work?

Wim Leers’s picture

Thanks, Jamie Holly, you rock! :D

However, as already indicated by you and others in this and other issues: I'm afraid it's far more complex to provide a complete solution. Providing a partial solution will put is in a not-so-nice position.

"token"/"wrapper" filters
The solution you've demonstrated is kind of similar to the "token"-type filter. It's really more a wrapper, but the solutions for both are similar and seemingly even the same. In either case, the filters are transforming only content that is more or less meaningless without being transformed, i.e. content that is intended to be transformed.

overriding filters
But, what in the case of "overriding" filters. For example:

  • the Typogrify filter replaces all instances of & with <span class="amp">&</span>, all capitalized acronyms such as CDN with <span class="caps">CDN</span>, and typographically incorrect quotes such as " with and
  • link ads, which replaces only certain keywords with (spammy) links
  • non-HTML input formats that are transformed to HTML, such as Markdown, Textile … — this we can't support, of course

Also note that the current filter system doesn't do any sort of classification of filters (also useful: filter example module).

It'd be easy enough to add a flag to hook_filter_info() to indicate that a filter is of the "token" kind, of the "different markup" kind, or the "transformation" kind (can't think of a better name, but I mean e.g. Typogrify). It's very likely that I've missed other kinds of filters. Just thinking out loud. The quick 'n dirty approach that comes to mind:

  • if only "token" filters: just load the WYSIWYG editor and apply it to the current HTML (assuming that it already loaded the special "edit mode" version)
  • if >0 "transformation" filters: load the content using AJAX and don't run the "transformation" filters. Not perfect WYSIWYG, but very, very close.
  • if >0 "different markup" filters: no WYSIWYG editor, show the regular body form as it appears on node/edit

sun (the maintainer of the WYSIWYG module) mentioned that he has had this exact discussion in the WYSIWYG module issue queue. Hopefully he'll chime in here. And even better: hopefully we can collaborate on this instead of reinventing the wheel!

lightsurge’s picture

http://gitorious.org/htmldiff shows quite well the amount of work needed for Spark to isolate the "transformation"/"different markup" type filter output.

lightsurge’s picture


if only "token" filters: just load the WYSIWYG editor and apply it to the current HTML (assuming that it already loaded the special "edit mode" version)

I presume here by special edit mode you mean with the token output in spans/divs with pre-transform (i.e. tokens) in a data attribute?

"transformation"/"different markup"

By 'different markup' you mean filters from modules like http://drupal.org/project/caption_filter ? That's a bit disappointing, I suppose, but then there doesn't seem to be any way of doing it, apart from patching core/contrib as in Jamie Holly's example.

Jamie Holly’s picture

@#50 - Yeah there is no 100% way to get everything in this, especially given how the current filter system works. Things like in content ads provide a big problem with no global solution.

What we're looking at is some big changes most likely having to go into the contrib modules and a fundamental change to the filtering system. That's going to make have a D7 version of Spark close to impossible or a ton of problems pop up.

The filter classification idea is really good. It would be nice to see some focus on that for D8, then we might be able to come up with a really good solution. I would love to get Sun's input on that and see the issue he is talking about. I just searched the queue and couldn't find it.

Jamie Holly’s picture

@#53 - since Spark is going to be a distribution, a patched filter module isn't that far out of the scope. Take the project page:

Spark aims to be "the Pressflow of Drupal authoring experience."

Pressflow is a patched version of Drupal, so Spark could be the same thing.

Contrib is where the real problem is. One possible solution would be a change in the filter_info, setting a flag if that filter is Spark compatible. Then we take a two pronged approach, combining both the wrapper and diff idea. If the filter is spark compatible, then that means that filter is using the wrapper/tokenize function and acts accordingly. If the flag isn't set, then the HTMLDiff is performed and wrapped with something that says "hey - replace this with the original token on save" and that markup, text, whatever is made into a non-editable block in Aloha. That might even push the maintainers of non-compatible modules to get their modules Spark compatible, especially if Spark becomes a part of D8.

Wim Leers’s picture

#53: not necessarily. We can manually add a new property to existing filters to categorize them, using hook_filter_info_alter()… :)

Wim Leers’s picture

#54: darn, should've refreshed :) I'm no fan of any diff-based approach though. Inefficient, unreliable, etc.

Of course, there's only so much we can do in Spark, given the limited timeframe until D8. So my gut says: as part of Spark, we only want to deal with "token"-type filters. We can then address the rest later, and probably come up with a better solution because we'd have built a subset of what we'll need in D8.

lightsurge’s picture

Don't know if it might be this issue I came across that sun could've been talking about #1377566: wysiwyg vs (server-side) filters

Jamie Holly’s picture

@#56 - Perhaps get an initiative going for D8 to get the base of what will be needed for filtering? Given the popularity of this idea as well as the WYSIWYG module, it might be worth looking into. That gives us roughly 8 months to figure it all out, which I'll admit is a very short amount of time in Drupal time.

@#57 - not sure if that's it or not, but it is an interesting discussion. One big difference between WYSIWYG and Spark is that any WYSIWYG solution has to be built to function in all the editors they use. At least with Spark we can go to a more custom tailored solution for Aloha or whatever editor is chosen.

lightsurge’s picture

Apologies for the rough code here. I was messing about with php dom which I know next to nothing about.. but the way you can flatten nodes into a straight array and run an array_diff off that seems nowhere near as bad as using real diff (especially HTMLdiff)...

Plus, you can limit it to just the tags you want, say for example if you wanted Aloha to add a data attribute to an A tag that gets wrapped around a piece of plain text, but not bother with stuff like paragraphs (otherwise we'd have data attributes all over the place).

The original text here is:

<p>Hello world</p>

The filtered output of this is:

<p>Hello world</p><a href="http://www.google.com" rel="nofollow" data="Hello world">Hello world</a>

So while the P block is unaffected, the A tag, which is new from the input filters, get's a data attribute. I know that data attribute here should be something like 'REMOVE ME', but I was also interested to see how easy it was to put the plain text from the node into the attribute.

Something like this could work in hook_filter_info_alter(); - is this still too clumsy? Obviously this could probably be cut down to just a few lines... very rough illustration!

= new DOMDocument();
$snippet = '<p>Hello world</p>';

$snippet_filtered_dom = new DOMDocument();
$snippet_filtered='<p>Hello world</p><a href="http://www.google.com" rel="nofollow">Hello world</a>';

$snippet_array = get_dom_array($snippet_dom);
$snippet_filtered_array = get_dom_array($snippet_filtered_dom);
$snippet_diff = array_diff($snippet_filtered_array, $snippet_array);

$snippet_diff_keys = array_keys($snippet_diff);

if (
$snippet_diff_keys) {
$changed_node = $snippet_filtered_dom->getElementsByTagName('body')->item(0);
$node = $changed_node;
block_nodes($node, &$changed_node, $snippet_diff_keys);
  foreach (
$changed_node->childNodes as $child_node) {
$new_body_content .= $snippet_filtered_dom->saveXML($child_node);

block_nodes($body_node, &$changed_node, $snippet_diff_keys, $i = 0) {
  foreach (
$body_node->childNodes as $node){
    if (
in_array($node->nodeName, array('p', 'a'))) {
      if (
in_array($i, $snippet_diff_keys)) {
$node->setAttribute("data", $node->nodeValue);
$changed_node = $body_node;
    if (
$node->childNodes) {
block_nodes($node, &$changed_node, $snippet_diff_keys, $i);

get_dom_array($dom) {
$body_content= array();
$body_node = $dom->getElementsByTagName('body')->item(0);
get_nodes_recursively($body_node, &$body_content, $dom);

get_nodes_recursively($parent, &$body_content, $dom) {
  foreach (
$parent->childNodes as $node){
    if (
in_array($node->nodeName, array('p', 'a'))) {
$body_content[] = $dom->saveXML($node);
    if (
$node->childNodes) {
get_nodes_recursively($node, &$body_content, $dom);
Jamie Holly’s picture

It would work in some cases, but there are plenty others mentioned above that would give unexpected results and end up breaking things.

Not only that, but this would be required to run after each filter in a format. In most cases that won't be too bad, but if you have 10 or 12 filters in a format, then it will become a serious performance hit, especially with longer and more complex content.

lightsurge’s picture

@#60 yeah I suppose I'm really only thinking of how to add at least a few bits more of that real wysiwyg appearance, rather than all of it. I rarely use much by way of tokens in user-editable areas apart from Media tags, but I do use a lot of input filters that change how textareas look, link stuff together, etc - I can imagine clicking 'Edit' in our IPE wysiwyg and suddenly the markup and UX becomes something it noticeably wasn't a moment before. It'll still be good tho of course.

Method in #59 seems easier to 'tame' than the sort of diff methodology that's been mentioned, and quite easy to exclude markup generated from tokens (which should already have the data attributes in some parent container making it possible to exclude them). It does somewhat rely on things like modules adding classes etc. to tags that will identify the markup as being different to any user-input for array_diff(). But I can bung it in a hook, so I can at least explore using it for my own purposes if need be :)

Not as good as your idea that allows modules to choose which bits of their output markup gets wrapped for Aloha, must admit.

Jamie Holly’s picture

@#61 - The hook idea would help. The big concern here is you are taking the simplest filter modules and making them some of the more complex ones, since you are now having to perform a character level diff on that section and figure out exactly what your filter changed.

After 3 days of heavy thought on this, the most viable option is a combination of the wrapper and what Wim Leers suggested in #50. Honestly it seems like grounds for a totally re engineered filter system. It would be nice to see it go OOP where processors extend a base class. Then we could do a simple $this->replace on the text and have the replace function add wrappers if needed. Then we could also offer a base format of special quick tags ([...]) that can easily be extended upon without the need for a bunch of extra filters. Of course something like this wouldn't be until D9 most likely.

lightsurge’s picture

you are now having to perform a character level diff on that section

Well, it's not a character level diff if I'm understanding you correctly. The array_diff is done against the DOM nodes before/after filter, each value in the two arrays is a node. So stuff like in #40 adding rel='nofollow' to a link wouldn't be a problem, except that I haven't got as far as working out if you can tell Aloha HOW it has changed (and sort of doubt that'd be easy), rather than simply that it HAS been altered by the filters. But simple stuff like a string that filters into a link works quite well, leaving the link there but letting you edit the text (so on next save the link might disappear, point somewhere else, etc), and there's a certain amount of non-regexpy stuff you can do with this method to make sure nothing crazy starts happening to the markup.

Know what you mean about going OOP, seems like that would work well with the filter system, but a major overhaul.

Jamie Holly’s picture

@#63 - Until you get into things like token_filter, where:

Welcome to [site_name]


Welcome to My Siite

Now you just got the text of that node changing and you need to figure out exactly what did change in that text. So you need what the replacement was and what was replaced (what triggered this change? [site_name] in my example).

lightsurge’s picture

I did factor this in. Could use a filter that runs before all others that modifies such tokens to something like:

<div data="(escaped) [token]" data-do-something-special-with-this-node"Yes!">[token]</div>

and when it runs thru the actual filters that do the token replace, the second (unescaped) [token] would be converted to markup, and the first could be used by an aloha plugin for user-customisation. The [token] would be what was saved. Example in #59 could be configured to ignore a dom node with such a data attribute, and it wouldn't pay any attention unless directed to tokens in plain text anyway.

Potentially the Spark team could easily do this to incorporate such things as token filter module as part of the distro, all you have to do is add a filter that's weighted at -100 etc.

Jamie Holly’s picture

@#65 - that was actually one of the original parts of this issue. The problem is that while [] is current for token tags in filters, it's not the only thing used. Some modules use comments and other identifiers for these things. Luckily they seem to be decreasing in numbers, but there are still some out there.

I would love to see Drupal adopt a universal syntax for custom markup/replacement tags. That would really help in these situations, except this one since the damage is already done.

lightsurge’s picture

Yeah... it's a difficult upgrade path though. I have such tokens or 'tags' all the way through my sites, newsletters, streams, etc etc often different in their composition as you say.

In fact as far as wysiwyg textareas go, for me that's where tokens are actually more scarce. I've used fields and formatters to similar effect in the past, image captions for example. Although I do see the dynamic usefulness of say Media tags, wish I had that a while ago (in some projects I'm left with the task of having to try and migrate images to it that have been inserted as markup, and stored using things like IMCE).

Problem with tokens I think is that you need some sort of helper to insert them, otherwise I find people don't use them. Out of the box I found Media module, great at it is at inserting from a media archive, isn't greatly intuitive in terms of changing the tag settings once they're inserted through WYSIWYG.

Definitely an architectural thing, 'coulda woulda shoulda' had some sort of system for tokens that WYSIWYGs could have jumped on from the beginning.

Jamie Holly’s picture

I view them as a really necessary evil. I've got a few clients that have hired writers, most of who have worked in the news media as journalists. You find with a lot of them that anything beyond maybe a simple hyperlink is Greek to them. A "quicktags" method becomes a simple, yet secure way of injecting complex code. I can have those tags run after HTML filtering and still keep things like script, iframe and other bad code out of the allowed tags. When dealing with WYSIWYG, I find this even more important, especially when it comes to pasting in content.

A global system developed in Drupal can handle the insert problem relatively easy. Just set up a hook_quicktag_info (or whatever name it is) and offer tips like filters do now. Have a button that opens a pop-up that lists all the tags and gives examples. I actually have a system like this in place on crooksandliars.com that works out extremely well. On top of that there is an admin section where the site admins can easily add new tags and descriptions and that. It even has role based access control as to what tags a person can insert and can not.

Upgrade path? Yeah that is an absolute nightmare. The only real path is looping through all the field contents that allow these tags, performing a regex and transforming to the new syntax then saving the modified field. If you got a bunch of filters and tens of thousands of nodes, that would lead to one really slow upgrade!

Wim Leers’s picture

Title:Provide WYSIWYG [filter compatibility] and only store the proper, "pre-transform" mark-up» Brainstorm: provide WYSIWYG [filter compatibility] and only store the proper, "pre-transform" mark-up
Version:» 7.x-1.x-dev
Assigned:Unassigned» Wim Leers
Status:Active» Fixed

This will be continued in the Edit module's issue queue, in #1699722: "True WYSIWYG" and compatibility with Drupal's text formats/filters. Foundations have already been implemented.

This issue was the catalyst to that, but I'm now closing it because we're moving this to the Edit module and because this issue was more of a brainstorm issue — which was very helpful :) Thanks to all of you, and to lightsurge and Jamie Holly in particular!

An initial (very basic!) implementation of this is being done at #1686568: Nested editables: macro filter tags (first: token_filter), for the token_filter module. For image captions, we worked with the Aloha Editor developers to build a "CaptionedImage" Aloha Block, which stores the caption as a data attribute on the image tag — see #1699740: "Perfect images + captions".

Status:Fixed» Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.