Problem/Motivation

#1809702: WYSIWYG: Add Aloha Editor to core adds support for WYSIWYG editing of core text fields! A requirement of that issue is to add two new capabilities to the filter system:

  1. Have check_markup() run some, but not all, text filters when sending text to a WYSIWYG editor. In particular, we want to run all filters that protect against XSS exploit or otherwise limit the allowed HTML. But we do not want to run token expansion or similar kinds of transformation filters, instead leaving it up to client-side WYSIWYG plugins to handle those.
  2. Enable the Aloha module (or alternate contrib module) to determine when to not enable WYSIWYG editing for a text field. For example, some filters (e.g., Markdown) are (to varying degrees) incompatible with WYSIWYG editing.

This issue is about addressing the Filter system changes needed to support the above.

Proposed resolution

With great help and inspiration from #807996: [meta] Input filters and text formats, this patch proposes adding a 'type' key to hook_filter_info() information that must be set to one of the following:

  1. FILTER_TYPE_MARKUP_LANGUAGE: The filter converts something that's not HTML to HTML in a way that is not compatible with WYSIWYG editing.
  2. FILTER_TYPE_HTML_RESTRICTOR: The filter restricts the HTML allowed, for example, to protect against XSS.
  3. FILTER_TYPE_TRANSFORM_REVERSIBLE: The filter performs a transform for which a WYSIWYG plugin exists to perform the same transformation (and its reverse) client-side. For example, <img data-caption="Druplicon"> may be (reversibly!) transformed to <figure><img><figcaption>Druplicon</figcaption></figure>.
  4. FILTER_TYPE_TRANSFORM_IRREVERSIBLE: The filter performs a transform for which a WYSIWYG plugin does not exist to perform the transformation client-side (especially, the reverse of it); instead, the WYSIWYG editor displays the unfiltered text. For example, Token Filter.

User interface changes

No direct UI changes.

API changes

  • New function filter_get_filter_types_by_format()
  • New optional $filter_types_to_skip parameter in check_markup()
  • New required key 'type' in hook_filter_info()

Caveats

  • A design goal for this issue was to classify filter types as generically as possible, rather than focus on just a WYSIWYG (or particular WYSIWYG editor) use case (see #37). However, a perfect decoupling of type from use case has not been achieved. Whether a filter is FILTER_TYPE_MARKUP_LANGUAGE, FILTER_TYPE_TRANSFORM_REVERSIBLE, or FILTER_TYPE_TRANSFORM_IRREVERSIBLE depends on the desired WYSIWYG experience, and the WYSIWYG plugins available (see #38). Other ways of classifying filters were explored in this issue, but none were considered substantially better.
  • When using a WYSIWYG editor, any filter of type FILTER_TYPE_HTML_RESTRICTOR runs both on "output" (from server to browser) and on "input" (by the WYSIWYG editor prior to submitting the form). This results in data loss (if disallowed HTML is in the text to begin with, the process of editing it in WYSIWYG strips it from what is then saved back). But, we think that losing disallowed HTML is a reasonable and logical consequence of WYSIWYG editing.
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Wim Leers’s picture

Status: Needs review » Active
FileSize
14.7 KB

And … patch!

It is possible to test in combination with the 8.x version of the Aloha module. Steps:
- apply the above patch
- perform a fresh D8 install, with the standard install profile
- apply the #1664602-4: Allow attributes to be passed to drupal_add_[css|js] (SRI) patch
- install the 8.x Aloha module: http://drupalcode.org/project/aloha.git/shortlog/refs/heads/8.x-2.x

Now Aloha Editor should appear wherever you have a processed text field: in the comments, on your node body, etc.

(Pushing Aloha 8.x code right now, it needs a circular reference to this issue… :))

UPDATE: Aloha 8.x code online — relevant issue: #1782326: Port Aloha 7.x-2.x to Drupal 8 (i.e. a 8.x-2.x version).

Wim Leers’s picture

Status: Active » Needs review
Wim Leers’s picture

After planning everything so carefully, of course I forget to add the new file with new unit tests to the patch.

Status: Needs review » Needs work

The last submitted patch, drupal_wysiwyg-in-core-round-one-1782838-3.patch, failed testing.

Wim Leers’s picture

Status: Needs work » Needs review
FileSize
20.13 KB

Looks like testbot is unable to deal with the newish method of adding new files in diffs (http://stackoverflow.com/a/857696).

Trying the old skool way.

Status: Active » Needs work

The last submitted patch, drupal_wysiwyg-in-core-round-one-1782838-5.patch, failed testing.

Wim Leers’s picture

Status: Needs work » Needs review
FileSize
1.02 KB
21.21 KB

First failing test can't be us. Second failing test fixed. (One-line change.)

chx’s picture

Status: Needs review » Needs work
+++ b/core/modules/filter/filter.moduleundefined
@@ -546,6 +617,91 @@ function filter_default_format($account = NULL) {
+    return $filter->status == 1;

just return $filter->status;

+++ b/core/modules/filter/filter.moduleundefined
@@ -546,6 +617,91 @@ function filter_default_format($account = NULL) {
+    if ($filter->status == 0) {

if (!$filter->status) perhaps?

+++ b/core/modules/filter/filter.moduleundefined
@@ -546,6 +617,91 @@ function filter_default_format($account = NULL) {
+      if (is_null($result)) {

if (!isset($result). I must say this needs a ton of comments. I am happy for using array_reduce, heaven knows I am, but you must realize this is the first reduce-with-lambda in core so it needs a bit more explanation.

+++ b/core/modules/filter/filter.moduleundefined
@@ -771,6 +932,19 @@ function check_markup($text, $format_id = NULL, $langcode = '', $cache = FALSE)
+      $filter_types_to_skip = FALSE;

Why all this dance? Why can't we just default filter_types_to_skip to an empty array? I never see a === FALSE check . It'd simplify the code and the doxygen.

+++ b/core/modules/filter/filter.moduleundefined
@@ -1221,10 +1403,12 @@ function theme_filter_guidelines($variables) {
+      'allowed_html' => '<p> <br> <a> <em> <strong> <cite> <blockquote> <code> <ul> <ol> <li> <dl> <dt> <dd>',

So now the security filter is the last and so we need to enable p and br? Weird-ish, why this wasn't a problem before?

sun’s picture

Status: Needs work » Needs review
Issue tags: -wysiwig +wysiwyg

@Wim Leers: Can we split the "filter types" chapter into an own issue, please? There's a long backlog of issues related to that, and I'd like to make sure that we're taking all prior art, considerations, and concerns properly into account. (I'll try to collect issues)

larowlan’s picture

webchick’s picture

This is all plumbing-related stuff so far, so I'm not sure we really need accessibility review on this particular patch. But of course, accessibility review on http://drupal.org/project/aloha is very welcome and encouraged. We're tracking a few issues in the Edit module queue, which can be moved over here when/if necessary.

larowlan’s picture

Removing tag, sorry for being over-zealous - we're seeing some AU Govt agencies starting to enquire about ATAG compliance so I'm being vigilant, but perhaps too much in this case. Will wait for round two.
@mgifford asked the question here https://getsatisfaction.com/aloha_editor/topics/how_accessible_is_aloha but not response yet.

Wim Leers’s picture

#8:

  • First 3 remarks: I prefer to err on the side of explicitness, to prevent any doubt. But you're absolutely right, it can be done that way. Patch updated.
  • Regarding array_reduce() usage: added comments. Let me know if it's sufficient.
  • Remark 4, regarding the dancing: it was a relic of the past, I should indeed have cleaned it up that way. Done, thanks :)
  • Remark 5, regarding the updated format:
    1. In the title and the already very long issue summary, I failed to mention a third point that I'm addressing in this patch: I'm updating the default text formats. You found one such change, the others are in standard.install.
    2. The updated default "Filtered HTML" and "Full HTML" text formats are necessary if we want to allow Aloha Editor to be used for those formats. Right now they use the filter_autop and filter_autourl filters, which both generate HTML. In the issue summary, I've explained why that is a problem for WYSIWYG editing.
    3. The above also explains why this wasn't a problem before: <p> and <br> were indeed not allowed before, even though they implicitly were allowed! If you'd use those tags yourself in the "Filtered HTML" format, they'd be filtered away, but they'd be added by the filter_autop filter, which is executed later.
    4. This should probably be addressed in a separate issue.

#9: Without that, this issue would just be about "allowed tags", which is a tiny scope. I'd prefer to keep all API changes to the filter system in one patch, because they relate to the same problem space (as explained in the issue summary).
If you feel strongly about this though, I'll of course do that. I'm not aware of a long backlog, I only know about #807996: [meta] Input filters and text formats and the other issues that one links to.

#12: AE's accessibility is being discussed over at https://github.com/alohaeditor/Aloha-Editor/issues/590. I've just had a call with the AE guys two days ago, and they're arranging to hire Everett Zufelt as a consultant to ensure accessibility is top-notch. They actually want to go further than any other WYSIWYG editor in terms of accessibility :) You're welcome to join the discussion there!

Status: Needs review » Needs work
Issue tags: -wysiwyg, -in-place editing

The last submitted patch, drupal_wysiwyg-in-core-round-one-1782838-13.patch, failed testing.

Wim Leers’s picture

Status: Needs work » Needs review
Issue tags: +wysiwyg, +in-place editing
Crell’s picture

Wim, slap me if this is too far off topic, but one long-standing gripe with the filter system is that it's global, rather than per-field. Certain fields should be plain text only, regardless of the user, while on others an epic WYSIWYG makes sense, while on others you want just a limited set of manual tags. That has always been missing in core; filters are either all-on or all-off, and user-based access controlled. In D5 and D6 there was the filerbynodetype module (maintained by me) that sort of worked for body fields; in D7, we finally just recently got that functionality into better_formats, per-field.

It seems like that capability dovetails with what's being done here, and is an implication of it; certain filters only make sense in certain situations, one of which is "will there be a WYSIWYG on it", but another is simply "does the business case even exist for this filter to even be possible here". Is that something that makes sense to incorporate in this work, or is it actually separate? (It feels like it's part and parcel here, but but I could be wrong.)

Wim Leers’s picture

AFAICT that's a separate issue: this one is about providing plumbing that is absolutely necessary for "true WYSIWYG" editors to do their work, no matter if it's on a per-format or per-field basis. The changes you're suggesting have — again, AFAICT — significantly more far-reaching implications (and API changes).

geerlingguy’s picture

Imagine you have the text Hello, and welcome to http://drupal.org.; if you apply the "Filtered HTML" filter to it, it will become <p>Hello, and welcome to <a href="http://drupal.org">drupal.org</a>.</p>.

I use the Wysiwyg Linebreaks module in tandem with a Wysiwyg editor to get text to and from an HTML-like format; it was originally adapted from JS used by Wordpress for their Wysiwyg integration (don't know if they still use it anymore) to make sure that when you switch from a plain textarea with no tags to a WYSIWYG (which requires <p>, <br /> and the like), and optionally, back to a plain textarea, the text line spacing and readability is preserved.

This is a bit overkill for a WYSIWYG-as-the-default-and-enabled option, of course (not sure if that's what's being proposed, but it seems so), since end users will practically never see the actual plain text of the edited area... but I like being able to turn off Wysiwyg editors for certain roles, and typing without tags (in lieu of Markdown) is certainly more fun than having to remember all the boilerplate markup.

webchick’s picture

Issue tags: +Spark

Tagging.

Bojhan’s picture

Could you expand on the UI implications of this? I'd probally also be helpful if this got some reviews, perhaps we can do a g.d.o/core post?

webchick’s picture

There are no UI implications of this, yet. This is the filter system foundations that are required to support WYSIWYG editors in core.

I guess we could try g.d.o/core, but IMO there are only a select few people who have the background to review this particular patch. We've reached out to all of them. They are all swamped with other D8 things they are working on. :( We've now asked effulgentsia to dig in.

Wim Leers’s picture

Tracking HEAD.

Bojhan’s picture

Assigned: Wim Leers » sun

Seems like this new patching round should get a review from the maintainer of filter.module.

@sun Could you take a look at this?

effulgentsia’s picture

Status: Needs review » Needs work

Overall, I think this is great, and makes a lot of sense.

+++ b/core/modules/filter/filter.module
@@ -7,6 +7,77 @@
+define('FILTER_TYPE_HTML_GENERATOR', 'html generator');

Why make the value of a constant into a string? Why not use numbers like we do elsewhere?

+++ b/core/modules/filter/filter.module
@@ -547,6 +618,99 @@ function filter_default_format($account = NULL) {
+    // @todo: Remove the fallback for when no filter type is defined. We don't
+    // fail, we just ignore these right now.

Let's resolve this @todo. I think we should either throw an exception when no type is specified or make one of our types the default.

+++ b/core/modules/filter/filter.module
@@ -547,6 +618,99 @@ function filter_default_format($account = NULL) {
+      $setting_name = $filters_metadata[$filter->name]['allowed tags setting'];
+      $allowed_tags = preg_split('/\s+|<|>/', $filter->settings[$setting_name], -1, PREG_SPLIT_NO_EMPTY);

Ick. Why hard-code the format of the setting? Can we instead change 'allowed tags setting' to 'allowed tags callback' to let each filter that wants to restrict tags decide how to return the array of tags?

+++ b/core/modules/rdf/lib/Drupal/rdf/Tests/CommentAttributesTest.php
@@ -180,6 +180,6 @@ function _testBasicCommentRdfaMarkup($comment, $account = array()) {
-    $this->assertEqual((string)$comment_body[0]->p, $comment->comment, t("RDFa markup for the comment body found."));
+    $this->assertEqual((string)$comment_body[0], $comment->comment, t("RDFa markup for the comment body found."));

Is this relevant to this issue?

+++ b/core/profiles/standard/standard.install
@@ -48,16 +38,6 @@ function standard_install() {
-      // URL filter.
-      'filter_url' => array(
-        'weight' => 0,
-        'status' => 1,
-      ),
-      // Line break filter.
-      'filter_autop' => array(
-        'weight' => 1,
-        'status' => 1,
-      ),

I'm not sure about removing these prior to adding aloha.module. I would bundle it into that patch. Otherwise, this patch is a UX regression: standard install gives you non-wysiwyg input without linebreak assistance. Or, maybe a separate patch along the lines of #18, where we replace these with a JS solution for non-wysiwyg.

Wim Leers’s picture

Status: Needs work » Needs review
FileSize
23.39 KB
11.25 KB

Why make the value of a constant into a string? Why not use numbers like we do elsewhere?

Fixed. Assimilated system.module.

Let's resolve this @todo. I think we should either throw an exception when no type is specified or make one of our types the default.

Fair enough :)

Throwing an exception is the simplest solution. Assuming a default makes upgrading easier. But also less precise. The best default we can do is default to FILTER_TYPE_TRANSFORM_TEXT, because it is essentially the "worst case assumption": anything would essentially qualify under this. But it prevents us from reasoning correctly about the filters in a text format: we may default a filter to FILTER_TYPE_TRANSFORM_TEXT, but what if it's actually a HTML-generating filter such as Textile or Markdown? Instant incorrect reasoning. Potentially even worse: imagine somebody has written a special purpose "security" filter. Hence it should never not run. The default classification would prevent it from not running. No matter what default you choose, other things always break down.

The only conclusion possible is that throwing an exception is the only valid solution.

Ick. Why hard-code the format of the setting? Can we instead change 'allowed tags setting' to 'allowed tags callback' to let each filter that wants to restrict tags decide how to return the array of tags?

Excellent point. I tried to keep changes a minimum, but you're right that that is a poor design.

Fixed. Assimilated filter.module's process and settings callbacks.

I'm not sure about removing these prior to adding aloha.module. I would bundle it into that patch. Otherwise, this patch is a UX regression: standard install gives you non-wysiwyg input without linebreak assistance. Or, maybe a separate patch along the lines of #18, where we replace these with a JS solution for non-wysiwyg.

Agreed. Removed from the patch.

Correspondingly, also removed the changed default settings for the filter_html filter: the <p> and <br> tags that were added to the default list of allowed tags, have been removed.

Is this relevant to this issue?

Yes, but not anymore. It is necessary to make the tests pass when the filter_autop filter is disabled, the RDFa module's tests assume a <p> tag will magically appear. When the filter_autop filter is no longer part of the default filters, that assumption no longer holds true.

By undoing the filter system changes in the standard install profile (the previous point addressed, your last point), however, this is no longer necessary. Removed from the patch.

Wim Leers’s picture

Darn :( Careful preparation, but then I accidentally include another patch: ignore the changes in common.inc (they're from #1664602-4: Allow attributes to be passed to drupal_add_[css|js] (SRI)).

Reroll to deal with that. I also noticed that I'm calling $filters_metadata what's called $filters_info elsewhere. I wanted to do that in a separate reroll to make the above patch about the *actual* changes, not renaming things.

Status: Needs review » Needs work
Issue tags: -wysiwyg, -in-place editing, -Spark

The last submitted patch, drupal_wysiwyg-in-core-round-one-1782838-26.patch, failed testing.

Wim Leers’s picture

Status: Needs work » Needs review

Status: Needs review » Needs work
Issue tags: +wysiwyg, +in-place editing, +Spark

The last submitted patch, drupal_wysiwyg-in-core-round-one-1782838-26.patch, failed testing.

effulgentsia’s picture

Status: Needs work » Needs review

Thanks. This looks ready to me, except for some minor nits and questions below, and since it's marked as critical priority, I'm tempted to RTBC. However, not doing so yet, to give time for others to review, especially @sun, given the second sentence of #9.

+++ b/core/modules/filter/filter.module
@@ -7,6 +7,77 @@
+
+
+

Extraneous newlines.

+++ b/core/modules/filter/filter.module
@@ -7,6 +7,77 @@
+ * Formats using filters of this type may not be able to use WYSIWYG editors.
+ *
+ * WYSIWYG use case: ability to detect non-HTML formats, such as Markdown, where
+ * no WYSIWYG editor should be used because it would be impossible to go back to
+ * the original text format.

For this and the other constants, I like this information, but are we okay with coupling this to wysiwyg like this? Personally, I'm okay with that, but pointing this out for others who might have alternate ideas.

+++ b/core/modules/filter/filter.module
@@ -7,6 +7,77 @@
+ * Filters SHOULD NOT use regular expressions when they can use DOM manipulation
+ * instead. This makes filters as robust as possible.

I don't think this statement belongs here. It's up to the filter to decide how to balance robustness with code readability with performance.

+++ b/core/modules/filter/filter.module
@@ -7,6 +7,77 @@
+ * E.g.: the Typogrify filter would transform `WYSIWYG` and `I said "foo"!` into
+ * `<span class="caps">WYSIWYG</span>` and `I said “foo”!`, respectively. Text
+ * link ad systems would transform `fancy car` into something like
+ * `<a href="http://fancycar.com">fancy car</a>`. Neither of those text-based
+ * transformations make sense when doing WYSIWYG editing, nor is it possible to
+ * reliably reverse them.

Is it okay to mention a contrib module like this? Is it okay to use curly quotes in code files? Is it okay to mention fancycar.com, a commercial website? Hoping people more familiar with our docs standards can answer that.

+++ b/core/modules/filter/filter.module
@@ -7,6 +7,77 @@
+const FILTER_TYPE_TRANSFORM_TEXT = 3;

Do we actually need to distinguish FILTER_TYPE_TRANSFORM_DOM and FILTER_TYPE_TRANSFORM_TEXT at all? Would it be simpler to just have FILTER_TYPE_TRANSFORM? Their server-side handling is identical. The only difference is that DOM allows JS to either emulate or not and TEXT doesn't allow that, but why wouldn't there be use cases of JS emulating text transformations (e.g., formatting dates to user's locale)?

+++ b/core/modules/filter/filter.module
@@ -7,6 +7,77 @@
+/**
+ * All of the above implies:
+ * - if a format uses >=1 filters of type FILTER_TYPE_HTML_GENERATOR, no WYSIWYG
+ *   editor can be used.
+ * - if a format uses >=1 filters of type FILTER_TYPE_SECURITY, and a user saves
+ *   modified text through his WYSIWYG editor, any disallowed tags will be lost.
+ *   This seems like a minor annoyance and appears acceptable.
+ * - if a format uses >=1 filters of type FILTER_TYPE_TRANSFORM_TEXT, these
+ *   transformations will not be visible while editing, but will be visible when
+ *   viewing.
+ * - if a format uses >=1 filters of type FILTER_TYPE_TRANSFORM_DOM, these
+ *   transformations may not be visible while editing (it is up to the filter to
+ *   implement support for the WYSIWYG editor, by re-implementing the filtering
+ *   in JavaScript), but will be visible when viewing.
+ */

I'm not sure if this is needed at all, but if it is, it should go above the constants into a @defgroup. See "@defgroup block_caching" in common.inc for an example.

+++ b/core/modules/filter/filter.module
@@ -7,6 +7,77 @@
+
+
+
+

One newline would suffice.

David_Rothstein’s picture

High level review. I have several questions:

  1. Where would something like the Media filter fit in? It's not DOM-based, but it does work in WYSIWYGs. With @effulgentsia's proposed consolidation I suppose it would be FILTER_TYPE_TRANSFORM, but otherwise I'm not sure.
  2. I don't understand the difference between the "generator" and "transform" filter types, at least not with the examples given. I can see why the PHP filter is a generator. Beyond that, it seems pretty fuzzy. The patch identifies Markdown (which does things like transform * List item into <li>List item</li>) as a generator, but Typogrify (which does things like transform WYSIWYG into <span class="caps">WYSIWYG</span>) as a transform filter, and I don't immediately see the difference.
  3. FILTER_TYPE_SECURITY seems like a misleading name, because that's not really what it does. For one thing, the HTML escape filter is classified (by this patch, at least) as a generator, but it certainly provides security too. And on the flip side, filters that can provide security never always do (if they did, #275811: Warn about potentially insecure filter configurations would be a lot simpler to solve). It depends on which other filters are present, and sometimes on how the filter itself is configured. So I think this may need a better name, but it also makes me wonder about the code in the patch that makes check_markup() treat FILTER_TYPE_SECURITY specially.
  4. In general, I don't understand why check_markup() should be allowed to skip certain filters in the text format. In most cases, that seems like it would break things very badly, so at least the patch needs a lot more documentation explaining when you would want to do that. But even reading through these issues, I'm still not getting what any of that has to do with WYSIWYGs... If you want WYSIWYG editing to look like the real, final thing, why would you want the server to return something in-between? (This may be related to my above comments that the various filter type classifications here seem pretty fuzzy to me.)
  5. The patch adds an "allowed tags callback" but how would that work for filters such as HTML Purifier which do more advanced things such as strip particular attributes?
  6. I don't understand how filter_get_allowed_tags_by_format() is supposed to work. Even in simple cases (such as any text format that also has the HTML escape filter enabled) won't it return a completely incorrect set of results? It looks like this function might be assuming the enforced filter ordering discussed in #807996: [meta] Input filters and text formats. But even then I would expect there are cases where it doesn't work correctly.

    The documentation also doesn't explain what "allowed tags" means; is it "allowed on input" (as in, an <img> tag entered by the user will ultimately be displayed), or "allowed on output" (as in, the text format is capable of producing <img> tags in the end, although not necessarily arbitrary ones entered by a user directly)? I assume it's supposed to be the former.

  7. The issue summary describes the goal of allowed tags as:

    First, to have tight integration with a WYSIWYG editor, and to have the WYSIWYG editor reconfigure itself automatically based upon what the user is actually allowed to do, we need to be able to know which HTML tags are actually allowed by an input format.

    A long time ago, Gábor Hojtsy came up with a simple way to allow the WYSIWYG editor to reconfigure itself like this. (I take about 2% credit for the idea since he was inspired to think of it after seeing a slide in a talk I gave, but it was pretty much all him :)

    Basically, the idea is for the WYSIWYG configuration code to treat the filter system like a black box, almost as if it's performing a unit test on it. For each WYSIWYG button, we know what kind of output it produces and how that's supposed to be displayed in the end, so we can just test it and see if it works. As a simple example, a "Bold" button could call check_markup('<strong>test</strong>', $format) and see if it returns something functionally equivalant to what was put in. If it does, then the button is compatible with the text format and can be added to the WYSIWYG.

    I'm not sure what happened to that idea or if it ever made it into the issue queue, but it always made sense to me and doesn't require the WYSIWYG configuration code to have deep knowledge of how the filter system works.

Wim Leers’s picture

#30:

Thanks for *not* setting this to RTBC. I wouldn't be comfortable with this being committed without at least having @sun's blessing. Further, David_Rothstein clearly did a stellar review, raising many good points :)

Extraneous newlines.

One newline would suffice.

Fixed.

For this and the other constants, I like this information, but are we okay with coupling this to wysiwyg like this? Personally, I'm okay with that, but pointing this out for others who might have alternate ideas.

I'm not sure if this is needed at all, but if it is, it should go above the constants into a @defgroup. See "@defgroup block_caching" in common.inc for an example.

Is it okay to mention a contrib module like this? Is it okay to use curly quotes in code files? Is it okay to mention fancycar.com, a commercial website? Hoping people more familiar with our docs standards can answer that.

These should be removed before committing; I only added them to make the patch review easier. Only the first line of the comment block before each const definition should remain. I removed the long, detailed comments in the attached reroll.

I don't think this statement belongs here. It's up to the filter to decide how to balance robustness with code readability with performance.

Do we actually need to distinguish FILTER_TYPE_TRANSFORM_DOM and FILTER_TYPE_TRANSFORM_TEXT at all? Would it be simpler to just have FILTER_TYPE_TRANSFORM? Their server-side handling is identical. The only difference is that DOM allows JS to either emulate or not and TEXT doesn't allow that, but why wouldn't there be use cases of JS emulating text transformations (e.g., formatting dates to user's locale)?

See my reply to #31, point 1.


#31:

Where would something like the Media filter fit in? It's not DOM-based, but it does work in WYSIWYGs. With @effulgentsia's proposed consolidation I suppose it would be FILTER_TYPE_TRANSFORM, but otherwise I'm not sure.

You and @effulgentsia hit on a good point: why the hell is such an implementation detail mentioned? For two reasons: 1) I couldn't think of a better name, 2) the "DOM vs. text" thing is explicitly present in #807996: [meta] Input filters and text formats.

First: why are they separate? The assumption is that DOM-based transformation filters are *reversible* ("reliably reversible transformations"), whereas text transformation filters are *irreversible* ("text-based, irreversible transformations").
Second: better names then are FILTER_TYPE_TRANSFORM_BIDIRECTIONAL and FILTER_TYPE_TRANSFORM_UNIDIRECTIONAL (for the _DOM and _TEXT ones, resp.). Alternatively: FILTER_TYPE_TRANSFORM_REVERSIBLE and FILTER_TYPE_TRANSFORM_IRREVERSIBLE.

Either of those alternative naming schemes communicate far more clearly what's happening: it doesn't matter whether you implement it using DOM traversal or a simple str_replace(), what matters is the reversibility.
To describe it more formally, the following only holds true for FILTER_TYPE_TRANSFORM_REVERSIBLE filters: x = 'foobar'; t'(t($x)) == $x. It must not just be "somewhat reversible", it must be "perfectly reversible", in that you get the exact same result back if you reverse a reversible transformation. If that's not the case, it's an irreversible transformation.

I'm going to call them _REVERSIBLE and _IRREVERSIBLE in the remainder of my reply.

I don't understand the difference between the "generator" and "transform" filter types, at least not with the examples given. I can see why the PHP filter is a generator. Beyond that, it seems pretty fuzzy. The patch identifies Markdown (which does things like transform * List item into <li>List item</li>) as a generator, but Typogrify (which does things like transform WYSIWYG into <span class="caps">WYSIWYG</span&gt;) as a transform filter, and I don't immediately see the difference.

With the above information regarding the FILTER_TYPE_TRANSFORM_* filter types in mind, your question is still valid, but now it's about FILTER_TYPE_HTML_GENERATOR vs. FILTER_TYPE_TRANSFORM_IRREVERSIBLE. E.g.: it is impossible to reliably translate back e.g. HTML to Markdown, because Markdown allows one to mix HTML and Markdown, hence it is is irreversible. I think it's possible to define a format that can be transformed to HTML and transformed back to the original mark-up. But this is not true for virtually all formats out there (they're usually a simplification of HTML, after all).

You could indeed regard FILTER_TYPE_HTML_GENERATOR as pointless. Markdown, Textile, PHP, etc. could be considered FILTER_TYPE_TRANSFORM_IRREVERSIBLE. If there is a magical mark-up language that can be transformed into HTML and back again, then it would indeed be possible to use a HTML WYSIWYG editor on this.

So, from the theoretical POV, I agree.

From the practical POV, however, we deal with FILTER_TYPE_HTML_GENERATOR and FILTER_TYPE_TRANSFORM_IRREVERSIBLE differently. FILTER_TYPE_TRANSFORM_IRREVERSIBLE filters are assumed to be non-essential and thus skippable. FILTER_TYPE_HTML_GENERATOR filters are assumed to be essential and thus non-skippable.
In other words: for text formats with FILTER_TYPE_HTML_GENERATOR filters, we're not working with HTML at all if we skip those filters, and thus it's something that's not editable in a "true WYSIWYG" editor (which is by definition HTML-based, since the language of the web is HTML itself). However, without FILTER_TYPE_TRANSFORM_IRREVERSIBLE filters, you still have a piece of HTML, albeit possibly without some niceties.
Note: it is assumed that text formats *without* FILTER_TYPE_HTML_GENERATOR filters already contain HTML (which is a reasonable assumption since Drupal is a *web* CMS, meaning that it outputs HTML) and thus *are* editable in a WYSIWYG editor.

So, we could rename FILTER_TYPE_HTML_GENERATOR to FILTER_TYPE_TRANSFORM_IRREVERSIBLE_ESSENTIAL. Then we'd have:

  • FILTER_TYPE_TRANSFORM_REVERSIBLE: by definition not an impediment to WYSIWYG editing
  • FILTER_TYPE_TRANSFORM_IRREVERSIBLE_INESSENTIAL: impossible for WYSIWYG editing, but we can just skip these filters while editing
  • FILTER_TYPE_TRANSFORM_IRREVERSIBLE_ESSENTIAL: impossible for WYSIWYG editing, so if a text format uses >=1 filters of this type, we cannot offer WYSIWYG editing

FILTER_TYPE_SECURITY seems like a misleading name, because that's not really what it does. For one thing, the HTML escape filter is classified (by this patch, at least) as a generator, but it certainly provides security too. And on the flip side, filters that can provide security never always do (if they did, #275811: Warn about potentially insecure filter configurations would be a lot simpler to solve). It depends on which other filters are present, and sometimes on how the filter itself is configured. So I think this may need a better name, but it also makes me wonder about the code in the patch that makes check_markup() treat FILTER_TYPE_SECURITY specially.

Maybe it's a misnomer. But I think it's sufficiently explicitly stated what such a filter is about: "strip HTML tags that the user MAY NOT use". It may be a misnomer because there are so many security aspects (e.g. CSRF) that are not so extremely simplistic as "disallow certain HTML tags".
FILTER_TYPE_HTML_TAG_RESTRICTOR may be a better name?
Regarding the special treatment for FILTER_TYPE_SECURITY/FILTER_TYPE_HTML_TAG_RESTRICTOR filters: even when doing WYSIWYG editing, 1) we want to prevent that potential security threats (e.g. <script> tags) are not dealt with because FILTER_TYPE_SECURITY/FILTER_TYPE_HTML_TAG_RESTRICTOR filters did not run, 2) we want to make sure users are seeing a realistic representation of what their content will eventually (i.e. when "completely" filtered) look like, so the same set of tags should be allowed.

In general, I don't understand why check_markup() should be allowed to skip certain filters in the text format. In most cases, that seems like it would break things very badly, so at least the patch needs a lot more documentation explaining when you would want to do that. But even reading through these issues, I'm still not getting what any of that has to do with WYSIWYGs... If you want WYSIWYG editing to look like the real, final thing, why would you want the server to return something in-between? (This may be related to my above comments that the various filter type classifications here seem pretty fuzzy to me.)

Hopefully it's already clear to you by the time you reach this part of my reply.
In 8 words: to prevent crappy HTML from polluting our contents.
In a nutshell: WYSIWYG editing is about editing HTML, but to this day, many (most?) pieces of content in Drupal are not actually HTML, they're some derivative of it, which is then transformed into HTML. We must of course maintain our "on output filtering" approach, to guarantee security etc. Imagine the simple case of e.g. the extlink filter or some "text link ads" filter: without skipping those filters, their output would be saved into the database. The HORROR! Hence we want to be able to skip inessential, irreversible transformations (FILTER_TYPE_TRANSFORM_IRREVERSIBLE_INESSENTIAL).

The patch adds an "allowed tags callback" but how would that work for filters such as HTML Purifier which do more advanced things such as strip particular attributes?

Most excellent point! To which I don't have a good answer. WYSIWYG editors typically offer you a plethora of buttons that you can then configure. The idea is that we automatically configure the WYSIWYG editor as much as possible: if the <em> tag is disallowed by the text format, then we don't show a button for it, etc. But indeed, for e.g. anchors, it makes a lot of sense to configure which attributes are allowed, and thus whether the WYSIWYG editor should provide a UI for it. (And in fact, Aloha Editor supports this, at least to some degree.)

IMVHO, it's excellent follow-up issue material.

I don't understand how filter_get_allowed_tags_by_format() is supposed to work. Even in simple cases (such as any text format that also has the HTML escape filter enabled) won't it return a completely incorrect set of results? It looks like this function might be assuming the enforced filter ordering discussed in #807996: [meta] Input filters and text formats. But even then I would expect there are cases where it doesn't work correctly.

Good find! I should have documented this better (though it is documented implicitly in hook_filter_FILTER_allowed_tags() in filter.api.php), but: the "allowed tags" callbacks are only necessary for *some* filters of the type FILTER_TYPE_SECURITY. Or, as I've stated above: FILTER_TYPE_SECURITY should probably be renamed to FILTER_TYPE_HTML_TAG_RESTRICTOR. In that context, it makes a lot more sense that only then you may need "allowed tags" callbacks.

But in case you'd have an imaginary filter_html_escape filter that *is* of the is of the FILTER_TYPE_HTML_TAG_RESTRICTOR type, then it should be indicating that it allows *no* HTML tags. I.e., it should return the empty array. In that case, the logic in filter_get_allowed_tags_by_format() should work fine (it uses array_intersect() after all).

The documentation also doesn't explain what "allowed tags" means; is it "allowed on input" (as in, an <img> tag entered by the user will ultimately be displayed), or "allowed on output" (as in, the text format is capable of producing <img> tags in the end, although not necessarily arbitrary ones entered by a user directly)? I assume it's supposed to be the former.

It is indeed "allowed on input". As before, filters can essentially do what they want, so they could indeed insert an image tag even though it is not explicitly allowed.

Basically, the idea is for the WYSIWYG configuration code to treat the filter system like a black box, almost as if it's performing a unit test on it. For each WYSIWYG button, we know what kind of output it produces and how that's supposed to be displayed in the end, so we can just test it and see if it works. As a simple example, a "Bold" button could call check_markup('test', $format) and see if it returns something functionally equivalant to what was put in. If it does, then the button is compatible with the text format and can be added to the WYSIWYG.

VERY interesting idea! :)
With that, it would indeed be theoretically possible to get rid of the "allowed tags" callback. Do you see any downsides of going down that path?


In the reroll of the patch:

  • Lengthy in-comment docs removed.
  • FILTER_TYPE_SECURITY -> FILTER_TYPE_HTML_TAG_RESTRICTOR
  • FILTER_TYPE_TRANSFORM_DOM -> FILTER_TYPE_TRANSFORM_REVERSIBLE
  • FILTER_TYPE_TRANSFORM_TEXT -> FILTER_TYPE_TRANSFORM_IRREVERSIBLE
  • (I left FILTER_TYPE_HTML_GENERATOR unchanged for now because there is no clear course of action for that yet.)
effulgentsia’s picture

Thanks, Wim. I'm still not clear on why we want to identify any difference between FILTER_TYPE_TRANSFORM_REVERSIBLE and FILTER_TYPE_TRANSFORM_IRREVERSIBLE_INESSENTIAL. As far as the filter system is concerned, their behavior is identical: do not apply them in PHP when sending text to wysiwyg. Maybe you're imagining some future code that would need to distinguish between these two? If so, what's that code you're imagining?

Just brainstorming a bit, but if the only meaning of these filter types is in relation to wysiwyg, would it be more direct and clearer to call them as such? E.g, instead of $filter_info['type'], make it $filter_info['wysiwyg'], and make the constants FILTER_WYSIWYG_DISALLOW, FILTER_WYSIWYG_APPLY, and FILTER_WYSIWYG_SKIP?

For check_markup(), what if instead of passing it $filter_types_to_skip, we instead overload $format_id to be either a string id, or a $filters array? We do stuff like this in Field API (e.g., the $display parameter in field_default_prepare_view()). Then, whatever code is integrating a wysiwyg editor can be responsible for calling check_markup() with the set of $filters it wants applied.

Finally, I like David's / Gabor's idea of getting rid of allowed tags callback from this issue. It will narrow the scope of this issue to just one thing, and wysiwyg testing check_markup() directly seems an option worth exploring in the aloha module. I like how that approach can test attributes as well as tags. If it turns out to not work well for some reason, we can add 'allowed tags callback' at that time.

Sylvain Lecoy’s picture

The downside with this idea is that it does Not scale at all. Imagine you have to run n times the format function, which you don't know about the complexity at all. So the simple rendering of an editor can possibly take ages because of this phase of init.

effulgentsia’s picture

What the editor needs this for is to determine which toolbar buttons to hide/show. It can generate a test string that covers every toolbar button, call check_markup() once per format, and cache the result. It's possible we'll discover a performance limit, but I don't think we have enough evidence of it at this time to make the idea not worth trying.

David_Rothstein’s picture

Right, the only time this code would need to run is when someone is changing either the WYSIWYG setup or text format setup, which should be pretty rare administrative operations. It would not need to run every time the WYSIWYG is displayed.

I'm not even sure it's conceptually a "cache" as much as it is "configuration" (in the sense that modules like http://drupal.org/project/wysiwyg already provide a user interface for administrators to configure which WYSIWYG buttons to display; this would basically just be additional data which restricts the choice and/or allows them to be configured automatically).

Haven't had time to read through and digest the other comments above yet - just wanted to pop in to help clarify the performance issue.

Wim Leers’s picture

#33:

I'm still not clear on why we want to identify any difference between FILTER_TYPE_TRANSFORM_REVERSIBLE and FILTER_TYPE_TRANSFORM_IRREVERSIBLE_INESSENTIAL. As far as the filter system is concerned, their behavior is identical: do not apply them in PHP when sending text to wysiwyg. Maybe you're imagining some future code that would need to distinguish between these two? If so, what's that code you're imagining?

FILTER_TYPE_TRANSFORM_REVERSIBLE is essentially "WYSIWYG-compatible": because it is reversible, there can be a JS implementation of these filters that at least can give an indication of what the end result would look like. Upon saving the content, however, the changes made by the filter would be undone by the JS, so that you again end up with your source mark-up, the one that may be saved to the DB.

FILTER_TYPE_TRANSFORM_IRREVERSIBLE (later in #32 suggested to be called FILTER_TYPE_TRANSFORM_IRREVERSIBLE_INESSENTIAL) is by nature "WYSIWYG incompatible": it is irreversible, so if it would be applied in JS, there would be no way to go back to the source mark-up, which we need to be able to save it to the DB without saving any cruft that may be generated by the filter.

This is what I said in the OP, strong emphasized lines indicating the relevant parts:

  • DOM transformation filters — DOM-based, reliably reversible transformations. (FILTER_TYPE_TRANSFORM_DOM)

    Filters SHOULD NOT use regular expressions when they can use DOM manipulation instead. This makes filters as robust as possible.

    WYSIWYG use case: these filters MUST NOT be applied when feeding a piece of text into the WYSIWYG editor. Instead, they MAY be re-implemented in JavaScript for each supported WYSIWYG editor.
    E.g. <img data-caption="Druplicon"> may be (reversibly!) transformed to <figure><img><figcaption>Druplicon</figcaption></figure>.

  • Text transformation filters — text-based, irreversible transformations. (FILTER_TYPE_TRANSFORM_TEXT)

    WYSIWYG use case: these filters MUST NOT be applied when feeding a piece of text into the WYSIWYG editor. Furthermore, they MUST NOT be re-implemented in JavaScript.
    E.g.: the Typogrify filter would transform WYSIWYG and I said "foo"! into <span class="caps">WYSIWYG</span> and I said “foo”!, respectively. Text link ad systems would transform fancy car into something like <a href="http://fancycar.com">fancy car</a>. Neither of those text-based transformations make sense when doing WYSIWYG editing, nor is it possible to reliably reverse them.

The opening post.

Just brainstorming a bit, but if the only meaning of these filter types is in relation to wysiwyg, would it be more direct and clearer to call them as such? E.g, instead of $filter_info['type'], make it $filter_info['wysiwyg'], and make the constants FILTER_WYSIWYG_DISALLOW, FILTER_WYSIWYG_APPLY, and FILTER_WYSIWYG_SKIP?

That's what I specifically wanted to avoid! :)
Why avoid this? Because different WYSIWYG editors may have different needs; a specific filter may come with JS support for WYSIWYG editor A but not B. That's why I tried to come up with a categorization in the first place: so that modules can *reason* about the filters used by text formats.

Thoughts?

effulgentsia’s picture

First off, I just want to say thank you to Wim and everyone you worked with (sun, quicksketch, others) in thinking through this problem space. As identified back in quicksketch's DrupalCon London presentation and probably well before then by others as well, Drupal's "apply all filters on output all of the time" approach is fundamentally incompatible with a good wysiwyg experience. To solve this, we need a richer language for describing and working with filters. And a lot of good thinking has gone into identifying the 4 types of filters described in the issue summary and in the patch. Given both my difficulties in understanding the nuances in how each filter maps to the type that it does, and David's observations in #31.1-#31.3, I think we can still improve the terminology or documentation, but I think that's minor relative to the excellent analysis you did getting us here.

Wim and I chatted about #37, and he helped clear up one point of confusion for me. For FILTER_TYPE_TRANSFORM_REVERSIBLE: WYSIWYG use case: these filters MUST NOT be applied when feeding a piece of text into the WYSIWYG editor. Instead, they MAY be re-implemented in JavaScript for each supported WYSIWYG editor. And for FILTER_TYPE_TRANSFORM_IRREVERSIBLE: WYSIWYG use case: these filters MUST NOT be applied when feeding a piece of text into the WYSIWYG editor. Furthermore, they MUST NOT be re-implemented in JavaScript. So my confusion here is, given the MAY in the first part (which implies a "may choose not to"), if I have a filter for which I have not supplied a JavaScript implementation, how do I know whether to set its type to FILTER_TYPE_TRANSFORM_REVERSIBLE or FILTER_TYPE_TRANSFORM_IRREVERSIBLE? Am I stuck in a purely philosophical question (this filter is "in priciple" reversible vs. this filter is "in principle" irreversible), or is there any practical consequence to the choice? Wim's conclusion was that this should not be a philosophical question, but a practical one: that if you have not actually supplied a JavaScript implementation, you should set the type to FILTER_TYPE_TRANSFORM_IRREVERSIBLE, and if some other module ends up supplying a JS implementation on your behalf, then it can alter the type to FILTER_TYPE_TRANSFORM_REVERSIBLE.

With that out of the way, I'm more comfortable having a total of 4 types rather than the 3 I was suggesting earlier.

Next, I tried to think through #31.2: what's the difference between a "generator" and an "irreversible transform". #32 answers this by invoking a concept of "essential" or "non-skippable". In other words, if showing Markdown syntax within a wysiwyg editor is deemed a bad experience, then it should be set to FILTER_TYPE_GENERATOR, with the result that wysiwyg cannot be used for text containing Markdown. OTOH, if showing Markdown syntax within a wysiwyg editor is deemed ok, then it should be set to FILTER_TYPE_TRANFORM_IRREVERSIBLE, with the result that wysiwyg can be used, but any Markdown text is shown as Markdown text not as "what you see is what you get". But then, if some module comes along and implements a JS markdown to html (and back) transformer, then it can set the markdown filter to FILTER_TYPE_TRANFORM_REVERSIBLE, allowing true wysiwyg editing that's automagically converted back into Markdown for database storage. This fluidity of how to classify Markdown based on what code is available and what user experience is desired makes me question the approach of trying to assign semantic names to filter types.

So, brainstorming a little, I came up with an idea of filter flags:
- FILTER_FLAG_REQUIRED_ON_OUTPUT = 0x01
- FILTER_FLAG_POSSIBLE_ON_INPUT = 0x10

I'm still tentative on these names (feedback welcome), but the idea being that each combination of these two flags correlates with one of the 4 filter types in #32:
- 0x00 replaces FILTER_TYPE_TRANSFORM_IRREVERSIBLE: wysiwyg possible but not "true"
- 0x01 replaces FILTER_TYPE_GENERATOR: wysiwyg not possible
- 0x10 replaces FILTER_TYPE_TRANSFORM_REVERSIBLE: true wysiwyg possible
- 0x11 replaces FILTER_TYPE_RESTRICTOR: true wysiwyg possible, filter (e.g., tag stripping) is applied both server-side and client-side.

I don't know if this moves us closer to or farther away from clarity, and usefulness to contexts other than wysiwyg. Thoughts?

Wim Leers’s picture

Status: Needs work » Needs review

To me, it initially feels more harder to grasp. But that might simply be because I'm so familiar with the names I came up with myself.

In general, I think your direction is a good one. Your dual binary filter flags approach goes back to the "filter on output" vs. "filter on input" aspect. I think the aspect that makes it hard to grasp is the inclusion of "required" vs "possible" in these flags. How about this:
- FILTER_FLAG_ESSENTIAL = 0x001
- FILTER_FLAG_SUPPORTS_OUTPUT = 0x010
- FILTER_FLAG_SUPPORTS_INPUT = 0x100

However, the second flag is really *always* true, so may want to remove that one. (Unless we start building client-side/JS/WYSIWYG-only filters, of course… Though I can't really think of a use case for that.) Then we'd end up with:
- FILTER_FLAG_ESSENTIAL = 0x01
- FILTER_FLAG_SUPPORTS_INPUT = 0x10

This is identical to what you have, but just with different names, that make more sense to *me* — but that could really just be me. FILTER_FLAG_SUPPORTS_INPUT would still imply that there's a reversible JS implementation: you send it the "source mark-up", the JS can transform it to the intended end-result, but upon saving, it can transform it back to the "source mark-up". Right?

In summary: your proposal is functionally identical, but *might* indeed improve clarity. What's the best/fastest way to figure out if it indeed improves clarity?

EDIT: I just noticed I said that the *first* flag is always true, but I of course meant the second one (FILTER_FLAG_SUPPORTS_OUTPUT).

effulgentsia’s picture

I like your improved flag names.

What's the best/fastest way to figure out if it indeed improves clarity?

For starters, comments from some of the 45 people following this issue :)

chx’s picture

Status: Needs review » Needs work

A consensus is forming around removing the allowed tags callback from this issue. I agree.

wow I never knew restrictor was a word... good to know.

filter_get_allowed_tags_by_format() should move $filters_info = filter_get_filters(); outside of array_filter and use() it in the closure. One also wonders why dont we store this somewhere. It's not like this can change often. Although perhaps that can be done more easily (no separate update path) after filters have been converted to CMI so let's just leave a @TODO for that.

if ($filter_types_to_skip && in_array($filter_info[$name]['type'], $filter_types_to_skip)) <= just run in_array. It costs next to nothing to run an in_array against an empty array.

By now it's a very simple patch that is close , finally.

Wim Leers’s picture

@chx: Are you implicitly saying that you think the flags approach outlined in #38 and #39 looks good to you?

chx’s picture

The approach yes, the actual flag names are bikeshed and we can happily bikeshed on december 2 until april 1...

Wim Leers’s picture

- allowed tags callback removed
- filter_get_allowed_tags_by_format() removed
- if ($filter_types_to_skip && in_array($filter_info[$name]['type'], $filter_types_to_skip)) improved as per @chx's suggestions — this was a relic of when $filter_types_to_skip defaulted to FALSE
- I decided to stick with the non-flag-based approach for now, because that's easier to understand for most people than bitwise flags. The flags would need to change anyway, so I might as well go that route instead.
- I renamed FILTER_TYPE_HTML_GENERATOR to FILTER_TYPE_MARKUP_LANGUAGE, which communicates more clearly for what purpose it should be used. David Rothstein correctly argued that it's in fact similar to an irreversible transformation, but I think this communicates more clearly what I originally intended.
- I also renamed FILTER_TYPE_HTML_TAG_RESTRICTOR to FILTER_TYPE_HTML_RESTRICTOR (removed the "tag") because it has been rightly pointed out that some filters may strip attributes instead.
- FYI: I implemented the "blackbox allowed tag testing" as suggested by @effulgentsia in #31 in Aloha 8.x-2.x: http://drupalcode.org/project/aloha.git/commitdiff/fe6178a (#1815246: Stop relying on filter_get_allowed_tags_by_format(), figure out which tags & attributes are allowed through blackbox testing). I've proven it to work, with the sole exception that I had to switch from blackbox testing to "greybox" testing to be able to detect that in Filtered HTML, the "p" tag is not really allowed. (filter_html removed it, but filter_autop added it again, the solution was to use two p tags without separating them with a newline, so that filter_autop's magic wouldn't interfere.)

Patch attached. All tests pass locally.

Lars Toomre’s picture

I noticed a few documentation and/or coding style issues in #44. The attached patch includes my corrections and inderdiff shows what I changed.

The biggest changes I made relate to #500866: [META] remove t() from assert message. Hence, I removed the t() from assert messages in the new tests.

chx’s picture

Status: Needs review » Reviewed & tested by the community
FileSize
19.34 KB

Thanks everyone, IMO this is ready. I missed the interdiff from #44 so here is the 32-45 interdiff .

Wim Leers’s picture

Title: WYSIWYG in core: round one — filter types and allowed tags » WYSIWYG in core: round one — filter types

@Lars: thanks!
@chx: WOOT!

Updated the issue title to reflect the scope.

Wim Leers’s picture

Issue summary: View changes

Adding link to Aloha Editor issue.

effulgentsia’s picture

I updated the issue summary and posted a notification to gdo/core.

Crell’s picture

+++ b/core/modules/filter/filter.module
@@ -4,10 +4,31 @@
 /**
+ * Non-HTML markup language filters that generate HTML.
+ */
+const FILTER_TYPE_MARKUP_LANGUAGE = 0;
+
+/**
+ * HTML tag and attribute restricting filters.
+ */
+const FILTER_TYPE_HTML_RESTRICTOR = 1;
+
+/**
+ * Reversible transformation filters.
+ */
+const FILTER_TYPE_TRANSFORM_REVERSIBLE = 2;
+
+/**
+ * Irreversible transformation filters.
+ */
+const FILTER_TYPE_TRANSFORM_IRREVERSIBLE = 3;
+

While I appreciate the potential DX of "just use a constant flag!", there are at least two problems with this.

1) Constants ought to be bound to a class or interface so that they can be lazy loaded. It doesn't look like there is such a class available at present, but IMO the entire filter system should get turned into plugins later. We may be best off punting on this point for now.

2) Much ink has been spilt pointing out that multi-value constants of this sort are inherently dangerous. They are by design completely not extensible. The minute a contrib module decides that it wants to define filter type 4, and some other module decides it also wants to define filter type 4, you have a fatal error (if you're lucky; you could also end up with data loss).

Unless we are 120% certain that there is absolutely no use case for contrib to ever add to this list, these should be string values, not global constants. (And knowing contrib, I don't think it's possible to be 120% certain about anything.)

(To be fair, that is the case for menu types, for instance. That's not extensible. However, those are also bitflags, not discrete constants.)

+++ b/core/modules/filter/filter.module
@@ -550,6 +571,39 @@ function filter_default_format($account = NULL) {
+  // Ignore filters that are disabled.
+  $filters = array_filter($filters, function($filter) {
+    return $filter->status;
+  });

Purdy. :-)

+++ b/core/modules/filter/filter.module
@@ -759,13 +813,18 @@ function filter_list_format($format_id) {
+ * @param array $filter_types_to_skip
+ *   (optional) An array of filter types to skip, or an empty array (default)
+ *   to skip no filter types. All of the format's filters will be applied,
+ *   except for filters of the types that are marked to be skipped.
+ *   FILTER_TYPE_HTML_RESTRICTOR is the only type that cannot be skipped.
  *

This almost makes me thing the constants should be bitflags, not ints. That would actually be more consistent with the use of such constants elsewhere in Drupal, and I think in general.

+++ b/core/modules/filter/lib/Drupal/filter/Tests/FilterAPITest.php
@@ -0,0 +1,111 @@
+  /**
+   * Tests the ability to apply only a subset of filters.
+   */
+  function testCheckMarkup() {
+    $text = "Text with <marquee>evil content and</marquee> a URL: http://drupal.org!";
+    $expected_filtered_text = "Text with evil content and a URL: <a href=\"http://drupal.org\">http://drupal.org</a>!";
+    $expected_filter_text_without_html_generators = "Text with evil content and a URL: http://drupal.org!";
+
+    $this->assertIdentical(
+      check_markup($text, 'filtered_html', '', FALSE, array()),
+      $expected_filtered_text,
+      'Expected filter result.'
+    );
+    $this->assertIdentical(
+      check_markup($text, 'filtered_html', '', FALSE, array(FILTER_TYPE_MARKUP_LANGUAGE)),
+      $expected_filter_text_without_html_generators,
+      'Expected filter result when skipping FILTER_TYPE_MARKUP_LANGUAGE filters.'
+    );
+    // Related to @see FilterSecurityTest.php/testSkipSecurityFilters(), but
+    // this check focuses on the ability to filter multiple filter types at once.
+    // Drupal core only ships with these two types of filters, so this is the
+    // most extensive test possible.
+    $this->assertIdentical(
+      check_markup($text, 'filtered_html', '', FALSE, array(FILTER_TYPE_HTML_RESTRICTOR, FILTER_TYPE_MARKUP_LANGUAGE)),
+      $expected_filter_text_without_html_generators,
+      'Expected filter result when skipping FILTER_TYPE_MARKUP_LANGUAGE filters, even when trying to disable filters of the FILTER_TYPE_HTML_RESTRICTOR type.'
+    );
+  }
+

This should be two separate tests. Merging two tests into a single test method is a moderately large testing faux pas.

Speaking as a module developer (I maintain a filter module, invisimail), it looks like there's not much here that affects me yet other than specifying the type of filter I have. I'm not sure what type invisimail would be, though, especially as, depending on its configuration, it may or may not generate HTML or even Javascript. I would need better guidance for deciding if I should mark it FILTER_TYPE_MARKUP_LANGUAGE, FILTER_TYPE_HTML_RESTRICTOR (since it does affect HTML, but can also generate it), or FILTER_TYPE_TRANSFORM_IRREVERSIBLE (since it's only reversable in some situations, which varies by configuration).

chx’s picture

webchick’s picture

Status: Reviewed & tested by the community » Needs work

That sounds like a "needs work."

Wim Leers’s picture

Status: Needs work » Reviewed & tested by the community

#49: Thanks for the review! :)

This should be two separate tests. Merging two tests into a single test method is a moderately large testing faux pas.

Hm, I don't see what you mean here. There's three test cases, each with a different variety of the fifth parameter to check_markup(). Or do you want me to test the "default" case (the one with the default value for the fifth parameter) in a separate test method?

Speaking as a module developer (I maintain a filter module, invisimail), it looks like there's not much here that affects me yet other than specifying the type of filter I have.

Indeed :)

I'm not sure what type invisimail would be, though, especially as, depending on its configuration, it may or may not generate HTML or even Javascript. I would need better guidance for deciding if I should mark it FILTER_TYPE_MARKUP_LANGUAGE, FILTER_TYPE_HTML_RESTRICTOR (since it does affect HTML, but can also generate it), or FILTER_TYPE_TRANSFORM_IRREVERSIBLE (since it's only reversable in some situations, which varies by configuration).

invisimail does not contain a mark-up language; hence FILTER_TYPE_MARKUP_LANGUAGE is out of the question.
FILTER_TYPE_HTML_RESTRICTOR vs. FILTER_TYPE_TRANSFORM_IRREVERSIBLE can potentially be argued about (hence the need for bitflags/better names), but the main purpose of the invisimail module is not to protect the reader from the resulting HTML to do evil things, the purpose is to perform a transformation so that spambots cannot contact the author. Hence: FILTER_TYPE_TRANSFORM_IRREVERSIBLE.

I *think* @chx kept the status at RTBC and created those follow-up issues to get this in ASAP, so … setting back to RTBC. I can address the test improvements here, if you prefer that. We should definitely to the FILTER TYPE constants -> bit flags + naming stuff in a follow-up, because that is very much prone to bikeshedding.

webchick’s picture

Cool, I'm comfortable with those items being follow-ups.

This has been assigned to sun for awhile, and I would still love to hear from him here, but in the meantime David and Alex and chx gave this very thorough reviews. So I'd like to commit this in ~24 hours, barring anymore serious objections.

chx’s picture

I *think* @chx kept the status at RTBC and created those follow-up issues to get this in ASAP, so … setting back to RTBC <= you are completely correct. I made it clear that while I do have issues with the flag names but that's not to hold this patch up. Totally addressable in followups.

catch’s picture

Patch looks fine to me as well, as do the follow-ups.

Crell’s picture

I'm fine with those being follow-ups process-wise, but I think the DX of those flags still needs a fair bit of work if I'm that confused by them. I'll follow up on the test spinoff with what I meant there.

webchick’s picture

Status: Reviewed & tested by the community » Needs work

Darn. Went to commit this, but does not apply. :( Can we get a quick re-roll?

Wim Leers’s picture

Status: Needs work » Needs review
FileSize
14.47 KB

Reroll from #45.

webchick’s picture

Title: WYSIWYG in core: round one — filter types » Change notice: WYSIWYG in core: round one — filter types
Category: feature » task
Status: Needs review » Active
Issue tags: +Needs change record

Yay! Bot came back green. Heeeeere we gooooo....!

Committed and pushed to 8.x. Thanks!

Let's continue further refining the terminology/docs/representation of these flags in #1816160: Should FILTER_TYPE_* be bitflags strings or stay ints?.

This will need a change notice, since it affects filter module authors. Writing it might even be able to help fix the terminology. :)

Wim Leers’s picture

Title: WYSIWYG in core: round one — filter types » Change notice: WYSIWYG in core: round one — filter types

Change record created: http://drupal.org/node/1817474.

Wim Leers’s picture

Title: Change notice: WYSIWYG in core: round one — filter types » WYSIWYG in core: round one — filter types
Status: Active » Fixed
Issue tags: -Needs change record

.

Tor Arne Thune’s picture

Title: Change notice: WYSIWYG in core: round one — filter types » WYSIWYG in core: round one — filter types
Category: task » feature
Eric_A’s picture

Hm, I'd say that the change record and issue summary would probably profit a lot from a note on why filter_url and filter_autop are (temporarily?) being left on the Filtered HTML text format, despite these filters being of type FILTER_TYPE_MARKUP_LANGUAGE.
I was reading about filter_url and filter_autop being of type FILTER_TYPE_MARKUP_LANGUAGE and it all just didn't add up... For now I'll just continue enjoying some more comments and patches on this beautiful issue :-)

Wim Leers’s picture

@Eric_A: well, the Filtered HTML format has always had those filters. For Aloha to work, they would need to be removed (as stated earlier in this issue), but until Aloha is in core over at #1809702: WYSIWYG: Add Aloha Editor to core, there's no need to do that.
And "beautiful issue"? :)

Automatically closed -- issue fixed for 2 weeks with no activity.

Anonymous’s picture

Issue summary: View changes

Updated and shortened issue summary to match the latest patch.