There's been some exciting brainstorming about the possible future of input filters and formats in #659580: Specify what the line break converter should do and rewrite it in DOM and #653988: Line break filter corrupts existing XHTML. Here's the thinking so far.
Just to clarify my point of view on this: Pathologic, an input filter, was my first contrib module, back in the D5 days, and it is still going strong.
The D7 HTML Corrector filter basically loads the HTML as a PHP DOMDocument object, which appears to be pretty flexible about parsing tag soup, then serializes it back out. The basic idea in reforming filters for D8 is that we actually keep that object around for a bit and pass it around for other filters to work on before we serialize it. This will be of great benefit to filters which would rather navigate and modify a DOM than parse with regular expressions.
The biggest sticking point that we need to consider with this sort of approach is that it will no longer be possible to run other filters on HTML without also "correcting" it, as correction is inherent in creating and serializing the DOMDocument. We could possibly offer some sort of "passthrough" approach which just doesn't run any filters on it at all, but as soon as it hits the filter system, it's gonna be "corrected." Is this a good idea? Let the debate commence! I personally am of the mind that the net benefit of allowing filters like Pathologic to fiddle with things using the DOM far outweighs other concerns.
Let's back up a bit. The full filtering process will look like this: The text will pass through preprocess filters. This is where filters which convert Markdown, Textile, BBcode, etc to HTML will go. The text (hopefully all HTML at this point) then gets loaded into a PHP DOMDocument, which is then passed around to "mid-process" filters to work on. Once those are done, the DOMDocument is serialized back to HTML, and then can be passed through post-process filters, for filters which need to work on HTML instead of a DOMDocument for whatever reason.
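To make the three phases concrete, here is a minimal sketch of the flow. It is written in Python with `xml.dom.minidom` purely for illustration (Drupal itself would use PHP's DOMDocument), and every name in it (`run_format`, `bbcode_bold`, `absolutize_links`, `smart_ellipsis`, the `example.com` base URL) is hypothetical. Note one difference: minidom only accepts well-formed markup, whereas DOMDocument tolerates tag soup.

```python
import re
from xml.dom import minidom


def run_format(text, pre_filters, dom_filters, post_filters):
    """Run text through the pre-DOM, DOM and post-DOM phases, in that order."""
    # Phase 1: text in, text out (Markdown, Textile, BBCode converters).
    for pre_filter in pre_filters:
        text = pre_filter(text)
    # Phase 2: parse once, then let every DOM filter modify the same tree.
    doc = minidom.parseString("<body>" + text + "</body>")
    for dom_filter in dom_filters:
        dom_filter(doc)
    # Serialize the children of the wrapper element back to markup.
    text = "".join(node.toxml() for node in doc.documentElement.childNodes)
    # Phase 3: text in, text out again, for filters that need serialized HTML.
    for post_filter in post_filters:
        text = post_filter(text)
    return text


def bbcode_bold(text):
    """Pre-DOM filter: convert [b]...[/b] to <strong>."""
    return re.sub(r"\[b\](.*?)\[/b\]", r"<strong>\1</strong>", text)


def absolutize_links(doc):
    """DOM filter in the spirit of Pathologic: make root-relative links absolute."""
    for link in doc.getElementsByTagName("a"):
        href = link.getAttribute("href")
        if href.startswith("/"):
            link.setAttribute("href", "http://example.com" + href)


def smart_ellipsis(text):
    """Post-DOM filter in the spirit of Typogrify: three dots become an ellipsis."""
    return text.replace("...", "\u2026")


print(run_format('<p>[b]Hi[/b] <a href="/node/1">there</a>...</p>',
                 [bbcode_bold], [absolutize_links], [smart_ellipsis]))
```

The caller never chooses an order between `bbcode_bold` and `absolutize_links`; the phase each filter declares decides it.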
This three-stage approach will mean that the filter rearranging page can be (debatably) done away with. We can make sure that filters like BBcode will run before filters like Pathologic by virtue of the fact that the former will be a preprocess filter and the latter will be a DOM (or possibly postprocess) filter. I think this will be a wonderful usability boon for novice users. Possibly, filters can carry their own weights if they need to run before or after other filters in a particular stage, but the user never needs to see that, just as they never need to see module weights in {system}.
Coders of many filters in contrib will be able to easily roll a D8 version without having to rewrite their filter to use the DOM simply by making their filter a postprocess filter - so it still has standard HTML as an input. Eventually, if it makes sense to do so, they can create a new major release which uses the DOM instead.
And while we're reinventing wheels, #226963: Context-aware text filters (provide more meta information to the filter system) needs to happen too.
I've never been a major kitten killer as of yet, but I'm maybe possibly volunteering myself to take a major role in this, pending community feedback.
| Comment | File | Size | Author |
|---|---|---|---|
| #15 | filter-tng-macros.png | 49.98 KB | wim leers |
| #15 | filter-tng-macros-clicked.png | 49.92 KB | wim leers |
Comments
Comment #2
sun
Good thoughts. However, we need to revamp a lot more in the filter system, which is tightly connected to what you're describing here:
1) Split text processing and filters into two key stages: security filters and macro filters. Security filters sanitize user input (e.g., HTML filter, Escape all HTML filter, etc). Security filters always run first, before any other filters. Macro filters dynamically convert content (e.g., Linebreak filter, URL filter, most of the contributed filters). So there need to be two separate stacks: security filters and macro filters. Each filter stack needs separate filter weights.
Note that 1) slightly clashes with the outlined idea in the original description of this issue, unless security/macro stages would be combined into a matrix with preprocess/dom/process stages somehow.
2) Introduce markup language: The DOM stage mentioned in the original description of this issue requires knowing the markup language of the processed text. As of now, Drupal's text processing is only intended for (X)HTML(4). However, some (already existing) use-cases in contrib, as well as HTML5, and in particular trying to load a text string into a DOMDocument, require declaring what kind of markup language we are expecting and dealing with.
3) Get rid of text formats.
I already planned to work on a replacement filter system for D8 or beyond in http://drupal.org/project/filter, would love to have someone to collaborate with. (didn't actually start with that yet though)
Comment #3
damien tournoud commented
Hm. I don't agree at all.
I would say we need two stages: (a) text processing and (b) filtering.
Text processing is the first operation. It transforms whatever the user has input into HTML markup. You can only pick one text processor per text format: Markdown, BBCode, Textile, unfiltered HTML, simple HTML (unfiltered HTML with auto-paragraphs). This text processor is used to decide which type of client-side editor to display to the user (WYSIWYG, WYSIWYM, simple textfield, etc.), and is used to display help text that can actually be understood by the end-user.
Filtering is the second operation. It takes the HTML markup output by the text processor and further filters it. This stage is purely based on DOM manipulations, and that's where we have most of the contrib macro-filters (Views Embed, URL filter, etc.).
Comment #4
sun
You actually agreed to a lot, just called it differently. :)
More or less the same as 2) the introduction of markup language.
This presumes that the user input has been validated and sanitized, or otherwise the text processors quoted here could not exist or would have no distinction. Furthermore, a strictly enforced BBCode markup language would require a security filter that escapes or removes all non-BBCode, i.e., all raw HTML. Only after applying security filters on the user input, we can safely transform BBCode into HTML.
In short: Before we even try to process any user input, the user input needs to be sanitized, so whatever input is processed and macro-filtered is known to be safe. That's the most important distinction of input filters that we are currently presuming, but as of now, not able to enforce in the configuration (UI and API) of text formats.
Likewise, this also maps smoothly to my points 1) + 2), i.e., macro filters are executed in a separate, second stack. However, some (or perhaps even many) macro filters are much simpler and faster to implement and execute as string manipulations. Of course, DOM manipulations are useful and sometimes needed, too. We need to make both ways possible.
The Linebreak (auto-paragraph) filter actually is a macro filter and therefore needs to be applied in the second stage, i.e., after security filters have run.
Comment #5
damien tournoud commented
There we disagree. It's basically the job of the text processor to do that. The good ones (ie. Markdown) already do that.
There we also disagree. The linebreak filter is an integral part of a text processor. It has no value independently of the text processor.
Comment #6
Garrett Albright commented
sun, I'm going to have to agree with Damien here; it makes more sense to do security stuff once we have consistent input; ie, HTML (or a DOM object). If we expect the implementors of the Markdown, BBCode, etc filters to implement their own mechanisms for this sort of thing, we're going to end up with a lot of redundant work of various and unpredictable quality and implementation - and this is something which should be in core, anyway.
As for this part:
Note in the OP I state that there should probably be some way to pass through the system without doing any filtering at all (or perhaps just removing the entire DOM step). This might be the only approach if we want to get something like the PHP filter to work, should we still want to keep that in core (no, please).
Comment #7
Garrett Albright commented
I made a thing: http://github.com/GarrettAlbright/filtertng
This code has yet to be executed, but hopefully one can see the process I have in mind starting to take shape. Doing it to push this idea, as well as to learn how to use Git and keep myself busy. Once it can actually run, perhaps I'll push it to CVS as a module to piggy-back D7's filtering at first. Feedback would be appreciated.
Comment #8
cpelham commented
OK, it's been 11 months. :) I can't see any activity in the project pages linked to in #2 and #7. Are you guys still ruminating and planning to come back to this at some point, or has this been usurped by another initiative elsewhere?
Comment #9
Garrett Albright commented
I've taken my eye off of it, yes. I still have ideas I'd like to see implemented, but it's tough undertaking a task of this size alone, particularly when paid work calls. A classic conundrum…
Would you be willing to help out code-wise on this sort of thing?
Comment #10
barbi commented
I am interested in helping out code-wise. Can you please split the task into smaller chunks and help me get started?
Comment #11
cpelham commented
Did you take a look at Garrett's pseudo-code for a module to accomplish this? In his file filtertng.module the steps are nicely broken down so you could just work on trying to code for example one of the functions he suggests is needed:
Comment #12
Garrett Albright commented
Bump.
I'm currently underemployed and looking for something to work on to get my creative energy out and distract me from worrying about how I'm going to pay next month's rent (Fishing for sympathy? Maybe a little, though I'd rather have a contract), and I remembered this thing. It's still something I'd like to see happen, but still not something I want to undertake all by myself. It might be too late for D8, but maybe we can at least get something started that can be slipped into D9 early on. Anyone else interested?
(Issue recap: I propose changing the text format system from how it works now, where basically a bunch of filters execute in order and they're all expected to take text in and spit text out, to a three-phase system where the middle phase uses a PHP DOM object and the pre- and post-DOM phases would still work on text. For example, a format could have Markdown or BBCode filters in the pre-DOM phase, Pathologic and the HTML limiter (to remove disallowed attributes or tags) in the DOM phase, then Typogrify in the post-DOM phase since it uses external libraries which expect fully-formed HTML input. Being able to use a DOM object for many filters would be a lot less awkward than having to parse around using regular expressions and such. Also, allowing filters to put themselves in phases like this would allow us to do away with the user interface for sorting filters, or at least hide it away. Please ignore the aforelinked Github repo - this was before sandbox repos on d.o, and I think I've since deleted it.)
Comment #13
sun
I'd love to have a skype conf call on this. (user: unleashedmind)
There was some disagreement at the beginning, but I agree with @Damien. However, the main challenge to discuss will be to figure out how the new architectural design with text processors (+ subprocessors?) + filters will look.
Comment #14
Garrett Albright commented
Who would be in such a conference call?
Comment #15
wim leers
Note that I come at this from the "in-place content editing" angle, i.e. the Edit/Spark angle.
The content below comes from a variety of sources: my personal thoughts, input from the Spark team, but also input from Daniel "sun" Kudwien, Dave Reid and Nate "quicksketch" Haug, from one of our "D8 WYSIWYG blocker" calls (for notes from those calls, see our Google Doc).
Concerns about forcing each filter to use DOMDocument
From a purely logical POV, as well as a purist POV, it makes perfect sense to me to use DOMDocument for every filter (except for e.g. Markdown/Textile/… of course). However, this has side effects that need to be considered as well.
Imagine the case of an `<img>` with a `data-caption` attribute. It's perfectly reasonable (and it's actually even an elegant solution) to transform this image to something like: `<div class="captioned-image"><img … /><div class="captioned-image-caption">This is a caption</div></div>`.
The "Drupal way" to implement this would be to use a `theme('captioned_image', array('image' => '<img … />', 'caption' => 'This is a caption'))` call, i.e. a theme function. This function would just print the opening of the outer div, then the image it receives, then the caption, then the closing of the outer div. By forcing every filter to use DOMDocument, it would effectively be impossible to use a theme function in this traditional sense; we'd have to pass in the DOMDocument object and force users to use DOMDocument manipulation functions. That also means it'd be impossible to use a theme template file.
Now, there is a work-around: let theme functions work like they work otherwise, then create another DOMDocument out of the fragment that the theme function creates, parse it, and reconstruct the same tree for use in the original DOMDocument.
There are different ways to deal with this; we just have to decide *how* we want to deal with this.
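The work-around can be sketched roughly as follows. This is an illustration in Python's `xml.dom.minidom`, not Drupal code: `theme_captioned_image` stands in for a string-based theme function, and `caption_filter` is a hypothetical DOM filter that parses the themed string into a second document and imports the result into the original tree.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape


def theme_captioned_image(image, caption):
    """Stand-in for a theme function: strings in, markup string out."""
    return ('<div class="captioned-image">' + image +
            '<div class="captioned-image-caption">' + escape(caption) +
            '</div></div>')


def caption_filter(doc):
    """DOM filter: replace each <img data-caption="..."> with themed markup."""
    # Work on a copy of the node list, since the tree changes inside the loop.
    for img in list(doc.getElementsByTagName("img")):
        if not img.hasAttribute("data-caption"):
            continue
        caption = img.getAttribute("data-caption")
        img.removeAttribute("data-caption")
        # The theme function only ever sees and returns strings.
        themed = theme_captioned_image(img.toxml(), caption)
        # Parse the themed fragment into its own document, then copy the
        # resulting tree into the original document.
        fragment = minidom.parseString(themed)
        new_node = doc.importNode(fragment.documentElement, True)
        img.parentNode.replaceChild(new_node, img)


doc = minidom.parseString('<p><img src="cat.png" data-caption="A cat"/></p>')
caption_filter(doc)
print(doc.documentElement.toxml())
```

The cost is one extra parse per themed fragment, which is the trade-off for keeping theme functions and templates string-based.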
Concerns about requiring all content to be HTML as soon as "preprocessing" or "text processing" is done
Both Garrett Albright and Damien Tournoud (#3) say that all content should be HTML markup as soon as "preprocessing" or "text processing" phases are done. They're different names for the same thing: converting Markdown/Textile/… into HTML.
But … what about other non-HTML content such as tokens (`[site:name]`), Media module syntax (`{{ type: "node/image", nid: 123, …}}`), etc.?
To be able to only have to deal with HTML mark-up, and thus to be able to use just DOMDocument-based parsing from this point on, we will have to convert all of the aforementioned syntaxes into a single, HTML markup-based syntax, even if it's just wrapping the existing syntax in the standardized HTML markup-based syntax.
E.g.: `<macro type="token">[site:name]</macro>`, `<macro type="media">{{ type: "node/image", nid: 123, …}}</macro>`.
Need for standardized way of handling "macro filter tags"
The things mentioned above (i.e. tokens, Media module, etc.) can be called "macro filter tags" in general, or even just "macros".
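As a rough illustration of such wrapping, a pre-DOM step could look like the sketch below (Python, for illustration only). The two regular expressions are simplified assumptions about what token and Media syntax look like, and `wrap_macros` is a hypothetical name.

```python
import re
from xml.sax.saxutils import escape

# Simplified, assumed patterns; real token and Media syntax is richer.
TOKEN = re.compile(r"\[[a-z-]+:[a-z-]+\]")   # e.g. [site:name]
MEDIA = re.compile(r"\{\{.*?\}\}")           # e.g. {{ type: "node/image", nid: 123 }}


def wrap_macros(text):
    """Wrap each known macro syntax in a <macro type="..."> element."""
    def wrapper(kind):
        # Escape the macro body so it stays plain text inside the element.
        return lambda m: '<macro type="%s">%s</macro>' % (kind, escape(m.group(0)))
    text = TOKEN.sub(wrapper("token"), text)
    return MEDIA.sub(wrapper("media"), text)


print(wrap_macros('Welcome to [site:name]! {{ type: "node/image", nid: 123 }}'))
```

After this step, DOM filters can find every macro with a single element lookup instead of each re-parsing its own syntax.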
Yet even with all of the above, there can still be (edge?) cases that can't work this way. For example: oEmbed. This is something Drupal needs to be able to support.
We could even argue that Drupal should move all of its macros to that syntax, because it is on track to become an industry standard.
This possibility at least seems very promising: @sun discussed with @EclipseGc on Blocks/Layouts in D8 + Inline API + "oEmbed": D8 will most likely expose every entity + every field + piece of content on its own URL already. We can re-use all the new fancy content plugins in D8! :) Prototypes: http://groups.drupal.org/node/242403
However, if we go for oEmbed or something similarly URL-based (which makes sense given the above), we then face The Preview Problem. Imagine a node with an image field. Imagine then that we want to refer to this image field from within the body of the node using oEmbed. How could we do that if the node has not yet been saved? This could be solved by either always immediately saving entities or by leveraging #1642062: Add TempStore for persistent, limited-term storage of non-cache data.
My own analysis
This is the analysis I made (#1699722: "True WYSIWYG" and compatibility with Drupal's text formats/filters), rewritten here for clarity. Keep in mind that it's done from the POV of WYSIWYG compatibility. That is, we want to achieve "true WYSIWYG" (i.e. what's in the WYSIWYG must match the final output exactly), but we still want users to be able to insert and change embeddables (token_filter/media/oEmbed/…).
I believe the solution lies in classifying filters. In my mind, there are three distinct classes of filters, and AFAICT all existing filters fall in either of these classes (if I'm wrong, just let me know!):
`data-caption` attribute to an actual caption" case described above.
The first two are "destructive by design": it is not feasible to apply the filters and still keep the related pre-filter values around. (Keep in mind that we're looking at this from the WYSIWYG angle.)
This sounds vague, but think about the third one, which does not need to be destructive: it is possible to replace the macros with their expanded values, yet still include some metadata so that the WYSIWYG editor can know the actual macro and thus show an editing interface when you click it. All that would be necessary to pull this off is to have an additional parameter passed in to `check_markup()`: `bool $wrap_macro_tags`. This flag could be set by a module and would apply to all filtered text on the current page. When set to TRUE, it would then wrap each macro using something like this:
All of this is exactly what I did in Edit module's proposed modifications to the filter module: see Edit's filter.inc (I'm calling my version of `check_markup()` from `hook_field_attach_view_alter()`).
This approach has already been prototyped and proven to work (technically; it has not been usability tested). At least the following people also like the approach: sun, Dave Reid, quicksketch.
Attached are two screenshots from a (ugly!) prototype that show how it can work. The UI would need to be much better (e.g. it doesn't scale to selecting a different image from the site's media gallery), but the point is that expanded macro tags are detectable through JS, and you can click them to edit them.
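To illustrate the idea of a non-destructive macro expansion, here is a small sketch in Python. The wrapper markup (`<span class="macro" data-macro="...">`) is invented for this example and is not the markup used by the Edit module prototype; `expand_tokens` and its `values` argument are likewise hypothetical.

```python
import re
from xml.sax.saxutils import escape, quoteattr

TOKEN = re.compile(r"\[([a-z-]+:[a-z-]+)\]")  # simplified token pattern (assumption)


def expand_tokens(text, values, wrap_macro_tags=False):
    """Expand tokens; optionally keep the original macro as metadata."""
    def expand(match):
        raw, name = match.group(0), match.group(1)
        if name not in values:
            return raw  # unknown token: leave it untouched
        expanded = escape(values[name])
        if not wrap_macro_tags:
            return expanded  # destructive: the macro itself is gone
        # Non-destructive: an editor can read data-macro to offer editing.
        return '<span class="macro" data-macro=%s>%s</span>' % (
            quoteattr(raw), expanded)
    return TOKEN.sub(expand, text)


values = {"site:name": "My Site"}
print(expand_tokens("Welcome to [site:name]", values))
print(expand_tokens("Welcome to [site:name]", values, wrap_macro_tags=True))
```

With the flag off the output is what visitors see today; with it on, client-side code can detect expanded macros and map them back to their source.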
So, when you're analyzing whether a certain rich text field is allowed to get WYSIWYG editing, you can then analyze the text format it's using:
`$wrap_macro_tags = TRUE`)
`data-caption` attributes for image captions. But in other cases, it may not make sense: for Typogrify it's near impossible to do, for link ads it's undesirable because it could interfere with the creation of links.
Thoughts? :)
Comment #16
wim leers
Of course I forgot something. I forgot to say that you could consider `bool $wrap_macro_tags` as a context, so in that sense, it ties back to #226963: Context-aware text filters (provide more meta information to the filter system).
Comment #17
wim leers
Referenced by #1706688: [meta] In-place editing, inline macros, editables, and Wysiwyg in core.
Comment #18
wim leers
RE: using DOMDocument for everything.
I managed to prove myself wrong :) It *is* possible to use DOMDocument-based parsing *and* still have theme functions. Proof: http://drupalcode.org/project/edit.git/blob/0d8b07896824f196fb78c670bb16....
RE: standardized way for handling "macro filter tags".
@sun and I had a discussion about this last night. To truly accommodate the use of macros, they need context, hence they need #226963: Context-aware text filters (provide more meta information to the filter system). And @sun says about macros:
And:
Combine that with the fact that the filter cache was almost removed because Field API is doing the caching already (source: #226963-44: Context-aware text filters (provide more meta information to the filter system)).
So:
- filter system would need to be rearchitected;
- filter system would become stateful instead of stateless (in that the output would depend on the context), which leads to caching issues (cache explosion);
- "macro filters" implemented in the filter system would still need to do their own parsing, and thus no standardized syntax would be enforced;
- we *do* want a standardized syntax, as well as a "Macro API", so that macro providers don't need to do all the parsing work anymore;
- if we have a Macro API, we can allow modules to add more context, and then let others react to that context;
- Field API, which is where by far the majority of these contextual filters would be needed, does have the necessary context: `hook_field_attach_view_alter(&$output, $context);`
- CONCLUSION: we shouldn't alter the filter system to accommodate macro filters; we should move that into a separate module that alters fields, entities, comments and whatnot on output. Then we could have an even simpler classification: 1) "non-HTML markup filters", 2) "transformation filters".
In my discussion with @sun, I proposed this:
To which he responded:
(See #1671276-49: Integrate with Field API instead of random forms and textareas.)
A downside to this multi-phased approach (phase 1: filtering, phase 2: expand/render macros upon viewing, *after* filtering) is that it becomes impossible to apply filters to macros. Use case: a macro inserts an image, and you want this image to be captioned.
The macro module would then need to support captions natively to cope with this use case. Alternatively, the macro module could simply apply filters to the expanded macros manually. However, some if not most filters need to be able to e.g. set data- attributes on the affected HTML, wrap the affected HTML in some way, or annotate the affected HTML in some other way. So most likely, the macro module applying filters to the expanded macros manually would make little sense.
Comment #19
wim leers
Having stated my own analysis in #15 and the discussion with @sun in #18, and having then reread the entire issue, it seems to me there is a way to unify all of the outlined goals.
Filter stages
The filter system would need to have 4 stages:
1) Examples: Markdown, Textile, PHP filter, but also core filters such as `filter_autop` ("Convert line breaks into HTML"), `filter_url` ("Convert URLs into links") and `filter_html_escape` ("Display any HTML as plain text").
2) Examples: core's `filter_html` ("Limit allowed HTML tags") filter.
3) Examples: Pathologic, data- attributes-based image caption filter (see first line of #18).
4) Examples: Typogrify, insertion of link ads.
At each of the stages below, the number of filters is 0–N.
Note that because of the transition from stage 3 to stage 4, we're actually implicitly executing core's `filter_htmlcorrector`:
Filter orders won't matter anymore?
I'm hopeful that this division in stages also means that we can stop forcing users to think about ordering their filters, which is a UX nightmare. Because of the basic ordering enforced by the different stages, there are fewer combinations possible, and definitely many potentially incorrect combinations are excluded automatically. The fact that stage 3 works on the DOM level solves a lot of potential problems as well. The most problematic stage is surely going to be stage 4, hence as many filters as possible should move over to stage 3, where conflict is far less likely.
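A minimal sketch of stage-enforced ordering, in Python for illustration: the stage names loosely follow the `FILTER_TYPE_*` constants mentioned later in this issue, but the numeric values, the name of the fourth stage, and the alphabetical tie-break within a stage are all assumptions made for this example.

```python
from enum import IntEnum


class Stage(IntEnum):
    HTML_GENERATOR = 1   # e.g. Markdown, filter_autop
    SECURITY = 2         # e.g. filter_html
    TRANSFORM_DOM = 3    # e.g. Pathologic, image captions
    TRANSFORM_TEXT = 4   # hypothetical name; e.g. Typogrify, link ads


def execution_order(enabled):
    """Order filters by their declared stage, then by name within a stage.

    `enabled` maps filter name to Stage; the order in which the user
    enabled the filters plays no role.
    """
    return sorted(enabled, key=lambda name: (enabled[name], name))


as_configured = {
    "typogrify": Stage.TRANSFORM_TEXT,
    "pathologic": Stage.TRANSFORM_DOM,
    "filter_html": Stage.SECURITY,
    "markdown": Stage.HTML_GENERATOR,
}
print(execution_order(as_configured))
```

The tie-break inside a stage is arbitrary here, which is exactly the part that still needs proof: the sketch only guarantees ordering between stages, not within one.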
How can we prove that order doesn't matter anymore?
Stages + lack of order would make for a much better UX
Also, since the filter system cares about outputting valid HTML, I think it'd be safe to say there will actually always be just *one* filter in stage 2. Then the "allowed tags" could become the prominent setting of a text format (while still allowing it to be set to "all tags allowed").
If all of the above assumptions are true, the UI could become very simple:
`filter_html_escape`
Comment #20
Garrett Albright commented
You lost me near the beginning when you started talking about theme functions. What does content filtering have to do with theme functions?
Also, not every filter would use the DOM object; just the ones that want to. My idea is that filters could specify themselves as a pre-DOM filter (in which case, they would get the content before it's converted into a DOM object; this is where Markdown/Textile/BBCode/etc filters would go, as well as ones using Tokens and such); a DOM filter (in which case they would get the DOM object); or a post-DOM filter (in which case they would get the content after it has been serialized from the DOM object; this is where SmartyPants/Typogrify-style filters would go).
Frankly, I haven't read all of your post(s) yet, but you seemed to have some fundamental concepts of the plan wrong to start with, so I thought I'd correct those ASAP and read the rest when I have time. Hopefully it gives you a better idea of the possibilities here.
Comment #21
Garrett Albright commented
Of course, now I look silly because I see that you eventually do get at the pre-DOM, DOM, post-DOM idea, but yeah, there's still a lot of stuff there that I don't quite grasp the relevance of. Caching, for example. I can see why we'd want to change how it works, but isn't that outside the scope of this issue, which could conceivably work with the current caching mechanism or with no caching at all?
As for filtering out unwanted HTML tags before creating the DOM object, I disagree; stripping out unwanted tags (or attributes or combinations thereof) is an example of something that would be far easier to do if we could manipulate and navigate around a DOM object instead of hoping for the best with regular expressions as we are now.
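For instance, a tag limiter working on a DOM could be as short as the following sketch (Python's `xml.dom.minidom`, for illustration; `limit_tags` is a hypothetical name, and a real limiter would also need to handle attributes and decide whether the text inside removed elements such as scripts should be kept).

```python
from xml.dom import minidom


def limit_tags(doc, allowed):
    """Remove every element not in `allowed`, keeping its children in place."""
    root = doc.documentElement
    # Snapshot the descendants first, since the tree changes inside the loop.
    for element in list(root.getElementsByTagName("*")):
        if element.tagName in allowed:
            continue
        parent = element.parentNode
        # Move the children up to where the disallowed element sits...
        while element.firstChild:
            parent.insertBefore(element.firstChild, element)
        # ...then drop the now-empty element itself.
        parent.removeChild(element)


doc = minidom.parseString(
    '<body><p>Hi <font color="red"><b>there</b></font></p></body>')
limit_tags(doc, {"p", "b"})
print(doc.documentElement.toxml())
```

No regular expression has to guess where a tag starts or ends; nesting is handled by the tree itself.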
Comment #22
wim leers
#20:
1. Some content filters, for example image captions, may need to have customizable output. Then you need theme functions. I initially thought that when every filter has to do its thing through DOM manipulations, that it would be impossible to use theme functions, but I proved myself wrong :)
2. In my opinion, the order outlined in #19 makes more sense.
#21:
1. If you refer to the "cache explosion" point: imagine a "[current-user:name]" token in a piece of text that gets expanded by filters. The result then gets cached in a different cache entry for each user. That is the sort of caching issue you can expect when you include context in the filter system.
2. I don't think I explicitly stated I don't want to use DOM manipulation for stripping out unwanted tags? In any case: I agree with you. :)
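The cache explosion can be shown with a toy cache key, sketched here in Python. The key layout is an assumption for illustration and is not how Drupal's filter cache builds its IDs; `cache_key` and the `context` argument are hypothetical.

```python
import hashlib


def cache_key(text, format_id, context=None):
    """Build a cache key for filtered text; context becomes part of the key."""
    parts = [format_id, text]
    if context:
        # Any context the filters depend on must vary the cache entry.
        parts += ["%s=%s" % item for item in sorted(context.items())]
    return hashlib.sha256("\x00".join(parts).encode("utf-8")).hexdigest()


text = "Hello [current-user:name]!"
users = ["alice", "bob", "carol"]

# Stateless filtering: one entry serves every user.
stateless = {cache_key(text, "filtered_html") for _ in users}
# Context-aware filtering: one entry per user for the same text.
contextual = {cache_key(text, "filtered_html", {"user": u}) for u in users}
print(len(stateless), len(contextual))
```

The number of entries grows with every distinct context value, for every piece of text that is filtered.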
Comment #23
wim leers
At DrupalCon Munich, we showed off Spark Drupal in its current state. The reception was very good: pretty much universally, people were enthusiastic about it! :)
Thanks to that, we've started working on step 1 of proposing Aloha Editor as the core WYSIWYG editor. That first step will be without in-place editing (because there are still things that need to be fleshed out in that area); it will just be "WYSIWYG in core for back-end forms", i.e. for forms with text processing and an HTML-based input format (i.e. no WYSIWYG for Markdown etc.).
That first step is now done and ready for review, over at #1760386-4: Migrate Aloha Editor integration from the Edit module and make it work on the back-end.
Why is this relevant to this issue? Well, I have implemented what I have proposed in this issue:
So, as of today, you can review the approach right there, in the code. It's working on D7. I've also included plenty of code comments to explain the rationale for each thing.
`filter_true_wysiwyg` is less than 300 lines of code, about half of which is comments.
The changes necessary for filter modules to support this functionality are limited to a single line of code. I've included a "caption" module which is based on the caption_filter module (but my implementation is simpler and more robust) and implements the extra things necessary: it declares that it is of the type `FILTER_TYPE_TRANSFORM_DOM`.
Note that this does not (yet) deal with the "macro tag" issue (e.g. token_filter, media module, etc.). It's still possible to include that as well, but it's something that can be done on top of the work done in the filter_true_wysiwyg module. Doing that work may only make sense if we pursue having some sort of "macro tag API" in core, i.e. something like sun's Inline module 2.x.
Comment #24
Garrett Albright commented
I don't like the constant names, since they seem to define what a filter should do instead of when it's run; it may be confusing if I want my filter to run during the FILTER_TYPE_HTML_GENERATOR phase, but it doesn't actually generate HTML. I think something like FILTER_TYPE_PRE_DOM, FILTER_TYPE_DOM, FILTER_TYPE_POST_DOM would be less restrictive. In the case of FILTER_TYPE_SECURITY, what if I don't want a filter of that type to run in a given format? Will I still be able to disable it?
Comment #25
wim leers
Yes, they define what a filter should do, because otherwise there's no way to reason about what a filter does. That's the main point: so that e.g. "true WYSIWYG editors" can reason about this.
That being said, I've outlined a strict order above (in #19) anyway, so it does actually imply the order as well.
You will still be able to configure which filters should be applied for a given text format. FILTER_TYPE_SECURITY only implies that these filters, if configured for a given text format, can never be disabled when rendering output for the user, to prevent security holes.
Comment #26
andypost
Is this issue still valid?
Comment #27
Garrett Albright commented
I personally would still love to see a system like I mention in the OP, but it looks like D8 went in a different direction largely to accommodate WYSIWYG in core.
Comment #28
wim leers
#27: D8 didn't go in a different direction at all, since D8 didn't address this at all!
The only thing that D8 added was the part most crucial for D8, which was also the least amount of work and caused the least amount of resistance: a classification of filters. That may be a baby step to help this issue move forward, but that doesn't mean this issue is no longer relevant or valid; it most certainly is.
I'd love to see this happen, but I'm afraid this is now D9 material. Too few people care about this to change/improve this. The current system is mostly "good enough", AFAICT.
Comment #29
Garrett Albright commented
Well, I care. And if we could get some sort of consensus about getting something like this in core for D9, I'd be glad to work with you to make it happen.
Comment #30
wim leers
I know you care :) But as you can tell by the number of people participating in this issue over the past 4 years, there are few who do.
In any case, I'd be happy to help with reviews when you get back to work on this :)
Shall we move this to 9.x-dev?
Comment #31
Garrett Albright commented
Boop.
Comment #32
Garrett Albright commented
Given the new versioning scheme, do we think that this might be something that can be done in a point release (8.1.x), or are we still going to wait for 9.x?
Comment #33
wim leers
If you can achieve this without breaking APIs, then yes, we can do this in a point release.
Comment #34
catch
Yep.
Comment #49
smustgrave commented
Issue still valid?
Comment #50
andypost
There's even more in terms of HTML 5 adoption, like the related issues and specifically #3463613: Explore PHP 8.4 native HTML 5 parser vs html5-php