Problem/Motivation

Since #2441811: Upgrade filter system to HTML5 landed, it seems that any text format that allows <iframe> or <textarea> within the html filter can lead to data loss.

Steps to reproduce

Configure a text format to allow:
<em> <strong> <cite> <blockquote cite> <code> <li> <dl> <dt> <dd> <h4 id class> <h5 id class> <br> <img src alt width height class> <table class> <caption class> <tbody> <thead> <tfoot> <th> <td> <tr> <iframe title src> <div class> <p class> <h2 id class> <h3 id class> <span id class> <a rel accesskey target title name id href hreflang class data-cta-track-sync data-cta-track-async data-cta-description data-cta-placement> <ul type class> <ol start type class> <hr>

Add some content with an <a> tag.

Notice that the <a> will not be rendered on the front-end, and is not persisted into the database.

I'm filing this as critical because existing content being re-saved can lead to data loss.

Proposed resolution

There are 2 MR with different approaches.
MR!5919
MR!5942 uses a custom tokenizer. This is the preferred solution of longwave, #19

@Wim Leers set this to RTBC but did not specify which MR was preferred. #16

Remaining tasks

TBD

User interface changes

TBD

API changes

TBD

Data model changes

TBD

Release notes snippet

TBD

Issue fork drupal-3410303

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Comments

Luke.Leber created an issue. See original summary.

mcdruid’s picture

Unpublishing this public issue for now as requested by @lleber because it may need to be handled in the private security tracker.

longwave’s picture

Republishing after discussion with @Luke.Leber and confirmation that there is no security issue here but a critical data loss bug that can be handled in public.

dslatkin’s picture

I discovered this issue on our own sites, too. I shared a bit of this on Slack, it seems to be caused in the core/modules/filter/src/Plugin/Filter/FilterHtml.php file's getHTMLRestrictions implementation. The $dom = Html::load($html); line uses the new HTML5 parser to convert the "allowed HTML tags" list into a deeply nested DOM data structure since the list has no closing tags. It then walks over the data structure and extracts tags and attribute names. Since iframes are not permitted to have any child content, the new parser drops any elements that would have been inside the iframe.

I found a couple short-term workarounds that worked:

  1. Move the <iframe> tag to the end of the list, so it doesn't have any child content when it gets parsed by Html::load and avoid adding new tags that come after the iframe until the issue is fixed.
  2. Close the <iframe> tag in the list so it becomes something like <iframe></iframe>. Drupal doesn't seem to encourage closing tags here, so using this workaround might cause a future issue.

Also, I'm not entirely sure, but I could imagine there being more elements than just the iframe that drop the child content. It's just that the iframe is one of the most common use cases with the HTML filter.

luke.leber’s picture

FYI - <textarea> has the same effect here as <iframe> does.

That's the only other HTML5 element that I've found to be problematic. Like my comment in the test, it might be worth taking a "kitchen sink" approach with test coverage, given this is largely governed by a third party library nowadays.

Cheers -- thanks for facilitating everything today.

luke.leber’s picture

Title: FilterHtml data loss when iframe is allowed » FilterHtml data loss when iframe and/or textarea is allowed
Issue summary: View changes

Update issue title / summary.

longwave’s picture

Looking at Masterminds\HTML5\Elements there are bitmasks for tags that have certain features:

    // From section 8.1.2: "script", "style"
    // From 8.2.5.4.7 ("in body" insertion mode): "noembed"
    // From 8.4 "style", "xmp", "iframe", "noembed", "noframes"
    /**
     * Indicates the contained text should be processed as raw text.
     */
    const TEXT_RAW = 2;

    // From section 8.1.2: "textarea", "title"
    /**
     * Indicates the contained text should be processed as RCDATA.
     */
    const TEXT_RCDATA = 4;

    /**
     * Indicates that the text inside is plaintext (pre).
     */
    const TEXT_PLAINTEXT = 32;

Guessing all these tags will therefore be affected.

Thinking that instead of trying to process the allowed list as HTML, we should just use regex instead. Normally regex is insufficient for parsing HTML, but this isn't really HTML anyway, it's just a list of strings that look like HTML tags.

wim leers’s picture

Thinking that instead of trying to process the allowed list as HTML, we should just use regex instead. Normally regex is insufficient for parsing HTML, but this isn't really HTML anyway, it's just a list of strings that look like HTML tags.

+1

Surprisingly related: the CKEditor 5 module's \Drupal\ckeditor5\HTMLRestrictions::fromString() reuses the parsing that FilterHtml does, precisely because it already was historically brittle:

…

    // Reuse the parsing logic from FilterHtml::getHTMLRestrictions().
    $configuration = ['settings' => ['allowed_html' => $elements_string]];
    $filter = new FilterHtml($configuration, 'filter_html', ['provider' => 'filter']);
    $allowed_elements = $filter->getHTMLRestrictions()['allowed'];

…

This also means there's a HUGE amount of implicit test coverage, because HtmlRestrictions has >1500 LoC of test coverage since it's so crucial for the CKEditor 4 → 5 upgrade path (as well as providing detailed validation errors and guidance in the admin UI): \Drupal\Tests\ckeditor5\Unit\HTMLRestrictionsTest.

My point is: we can change the parsing logic and be very confident that if it passes tests, that it works fine 😄

longwave’s picture

Status: Active » Needs review

Switched FilterHtml::getHTMLRestrictions() to use regexes instead of HTML DOM to parse the allowed tags and attributes.

Not sure I have covered all possible combinations with the regexes, I might have missed some allowed characters in tags or attribute names, and if someone has done something really weird with attribute values (for example, using >) then I think the regexes might fail. But I think this is a good start.

wim leers’s picture

Status: Needs review » Needs work

As predicted: lots of test failures in \Drupal\Tests\ckeditor5\Unit\HTMLRestrictionsTest as well as things extensively relying on it such as SmartDefaultSettingsTest 🤓

longwave’s picture

Status: Needs work » Needs review

Opened an alternative approach in MR!5942 that modifies the way the HTML5 parser works, instead of trying to use regex. This feels simpler but relies a bit on the internals of the HTML5 library. Not sure which is better.

longwave’s picture

Also modified MR!5919 to be more lenient in the characters it accepts.

wim leers’s picture

Status: Needs review » Reviewed & tested by the community

That looks magnificent! 🤩

(And very nice test coverage too 🦙😜)

longwave’s picture

In case it gets lost in the noise above, not sure which of the two approaches are best, but both are passing tests, so for another committer to decide I guess.

alfattal’s picture

Wim Leers Which one?!

longwave’s picture

After dwelling on it for a while I now prefer the custom tokenizer one, as the HTML5 parser is surely more robust than brittle regex and will hopefully handle edge cases that might exist in real world configurations that the regex could choke on in some way - while our test coverage is good it can't hope to cover every strange thing that someone might put in their filter config.

adrianm6254’s picture

I tried both patches MR!5919 & MR!5942 and they both worked fine.

For now I will keep MR!5942 applied.

alfattal’s picture

MR!5942 fixed the issue for me. Great work, thank you all.

quietone’s picture

Issue summary: View changes
Related issues: +#2441811: Upgrade filter system to HTML5

I'm triaging RTBC issues. I read the IS and the comments.

I did update the proposed resolution to include that there are 2 MRs here with different approaches and related details.

Leaving at RTBC.

larowlan’s picture

Issue tags: +Patch release target

larowlan changed the visibility of the branch 11.x to hidden.

larowlan’s picture

I agree with #19 that the custom tokenizer looks like the better approach

Looking a bit further into the masterminds/html5 internals to understand it a bit better

larowlan’s picture

As this is a critical, not going to hold things up - but I think we need a follow-up for explicit coverage for \Drupal\filter\Plugin\Filter\FilterHtml::getHTMLRestrictions which is a public API.

I will file that.

Updating issue credits

  • larowlan committed cb6d0184 on 10.2.x
    Issue #3410303 by longwave, Luke.Leber, Wim Leers, quietone, dslatkin:...

  • larowlan committed 3ae37397 on 11.x
    Issue #3410303 by longwave, Luke.Leber, Wim Leers, quietone, dslatkin:...
larowlan’s picture

Version: 11.x-dev » 10.2.x-dev
Status: Reviewed & tested by the community » Fixed
Issue tags: -Patch release target

Committed to 11.x and backported to 10.2.x

Filing that follow-up now

wim leers’s picture

Sorry for not having been more explicit in #16 — I was definitely referring to the new class($scanner, $events) extends Tokenizer one, which was also committed 👍

xem8vfdh’s picture

I've been told me separate report may be an instance of this same issue. I've detailed my experience/symptoms here: https://www.drupal.org/project/drupal/issues/3412164#comment-15382735

The situation affecting my site as a result of upgrading from 10.1.7 to 10.2.0, which seems similar to this, is indeed critical. As outlined in my notes, the database appears uncorrupted by the upgrade, but the rendered output is corrupted, and once someone edits/saves a page, they persist the corruption to the database.

xem8vfdh’s picture

given the very bad nature of this bug, I think it should be added to the "Known Issues" section of the 10.2.0 release notes: https://www.drupal.org/project/drupal/releases/10.2.0

In fact, users should be advised to not upgrade to 10.2.0 and wait for 10.2.1

xem8vfdh’s picture

following @larowlan's wise advice, I tested this commit (as a patch using cweagans/composer-patches) and my initial testing suggests that this patch does indeed fix my problem, after rolling back to 10.1.7 and rerunning the 10.2.0 upgrade. I will report back if I encounter any issues. Thank you all for fixing this!

wim leers’s picture

Adding @xeM8VfDh's issue from #34 as a related issue.

Glad to read in #36 that this indeed fixed it 😊

longwave’s picture

I've added this issue to the known issues list at https://www.drupal.org/project/drupal/releases/10.2.0

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.