Problem/Motivation
Drupal RSS feeds have html elements in the description field for each node. This is not allowed, all blocks of html codes should be contained within a CDATA tag.
You can read more at this site, http://webservices.xml.com/pub/a/ws/2002/11/19/rssfeedquality.html.
RSS Standard 2.0
Rss Best Practices for element description
Steps to reproduce
Install drupal
Visit example.com/rss.xml
See double encoding of description content.
Proposed resolution
Wrap rss in CDATA
Introduce event subscriber RssResponseCdata to wrap description in CDATA
Remaining tasks
N/A
User interface changes
N/A
API changes
The RssResponseCdata filter for rss now places html in CDATA properly
Data model changes
N/A
Release notes snippet
Issue fork drupal-3433
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #1
(not verified) commentedIn Drupal 4.1.0 the workaround for this issue is editing /usr/share/drupal/include/common.inc and remove all html_entities() in format_rss_item() and format_rss_channel. You can't remove algo strip_tags() because the XML generated won't be valid. I don't know if this issue has been already fixed in Drupal 4.3.0 as I'm waiting for the Debian maintainer to upload the new package. I'll try then.
Comment #2
Steven commentedThe problem is not malformed XML as far as I can see: all entitities relevant to XML are escaped (as < > ...). HTML entities are doubley escaped (< into &lt;). Drupal outputs correct XML.
It's a question of RSS quality and the general recommendation for using CDATA though: it offers advantages in terms of filesize.
Comment #3
killes@www.drop.org commentedSteven says this isn't a bug. Changing title.
Comment #4
drummIf this problem is what I think it is then you might want to look at the fixentities filter module I made. It replaces un-entity-coded less than signs and ampersands with the proper codes if they are not part of tags or entity codes. This fixes some validity issues with invalid user input in both the feed and xhtml view.
Comment #5
el777 commentedCDATA helps us to save a space and get document more readable. But I see here one pitfall. What if our feed contains substring ]]> in it? So feeb become broken?
Comment #6
LAsan commentedIs this a feature request?
Still applies to current version?
Comment #7
mdupontBumping to 8.x-dev
Comment #8
EvanDonovan commentedDoes anyone know what the current best practice is for RSS feeds serialized in XML? CDATA seems to make sense to me, as per the OP from 2003, but double entity-encoding appears to be more common, and is used in Views as well.
It would be good to have someone assess this issue who is still active with Drupal (i.e., not Steven). Tagging with "Needs markup review" accordingly.
Comment #9
jhedstromComment #16
james.williamsStumbled across this today, on investigating code blocks within RSS feeds. If a node has this content in a field, included in an item of an RSS feed:
(Spaces deliberately added in the code & pre tags here, just to demonstrate the markup, as it doesn't seem possible to nest them in a comment on drupal.org! The actual content would not have these.)
(This example comes from an article I wrote - the first codeblock here: https://www.computerminds.co.uk/articles/open-links-popups-foundation?so... - I'm trying to show the actual HTML that should be used for a link.)
The version in the RSS feed source decodes the
"entities to be actual double-quotes ("). It leaves the>and<angle brackets within thecodetag as they were, even though the ones on the wrapping pre and code elements are encoded. So the interpreted result for RSS readers is:So readers actually render this as a link, within a styled pre/code block, rather than the plain HTML code, which is definitely not wanted.
I suspect wrapping the whole content in CDATA tags would be the best way forward, otherwise Drupal is always going to be trying to fudge the desired result, when CDATA is designed exactly for this.
Otherwise, how else can code blocks within content be included correctly in an RSS feed?
-----
Phew, getting the right encoding written into this comment is hard work!
Comment #20
larowlanComment #21
larowlanI think that we can do this in
template_preprocess_views_view_row_rssby wrapping the description - as that is the only field that should contain HTML. I think this is possibly now a novice issue, tagging as such.Comment #22
lendudeYeah that already has the following comment:
// The description is the only place where we should find HTML.
// @see https://validator.w3.org/feed/docs/rss2.html#hrelementsOfLtitemgt
// If we have a render array, render it here and pass the result to the
// template, letting Twig autoescape it.
Which was done in #2500931: Views feed doesn't encode embedded HTML anymore, there in no talk of cdata there, but some test coverage was added for it in \Drupal\views\Tests\Plugin\DisplayFeedTest which we would need to update
Comment #24
larowlanComment #25
oceanic commentedComment #26
oceanic commentedComment #27
lendudeI tried to reproduce the issues with double escaping and other malformed things, but as far as I can tell, that doesn't happen anymore.
So supporting CDATA vs escaping HTML feels like a feature and not a bug.
Comment #28
lendudeI've tested this further and the google RSS reader has no trouble converting the escaped HTML back to readable HTML at least, so to me, there is no bug here.
So for now I'm moving this to a feature request to use CDATA over escaped HTML, but also postponing on getting some more information on why using CDATA over escaped HTML would be preferable.
Comment #29
quietone commentedComment #30
mradcliffeI am removing the Novice tag from this issue because it looks like @quietone updated the issue summary to reflect why it's been postponed. I don't think there's anything further here to do for a novice without further details.
Comment #36
smustgrave commentedIf still a valid feature request please reopen updating issue summary per #28.
Also could be unrelated but #3259255: Html::escapeCdataElement() not adding CDATA correctly was committed so wonder if that solved anything.
Thanks!
Comment #37
fgmTechnically, as I understand the RSS 2.0 standard, HTML in the items should (must ?) be escaped, and should still be in a CDATA section: I had to look this up when upgrading the G2 module for Drupal 10/9.
This being said, comment #28 mentions Google Reader, which is no longer a thing, but all current RSS readers I've used (only a few, admittedly) handle this "incorrect" escaped HTML correctly these days.
I've reset the issue to active because it still feels like we should have a clearly defined, standards-backed answer, instead of just some "it works" anecdotal info.
Comment #38
luke.stewart commentedAssuming this validator is valid...
https://validator.w3.org/feed/check.cgi?url=drupal.org%2Fnode%2Ffeed
As of now reports that this is a valid feed.
It recommends the following:
description should not contain iframe tag
description should not contain relative URL references:
Missing atom:link with rel="self"
Does this mean that this issue can be closed? (Again?)
Comment #39
james.williams@luke.stewart If you’re seeing “description should not contain iframe tag”, then no, this is still a problem to fix. Drupal is outputting HTML in the description elements which needs escaping via CDATA tags.
Comment #41
nicxvan commentedI've been following this for a while. I found this: https://www.drupal.org/project/views_rss/issues/3409451 and the related with more discussion: https://www.drupal.org/project/views_rss/issues/3079683
I am going to create an issue fork and just wrap the description in CDATA and see if any tests complain to see if we can move this forward.
Comment #42
nicxvan commentedAside from being curious if this will have any failures I do have two implementation questions.
1. Do we need to detect if there is html in the description before adding the CDATA wrapper, my understanding is it is not harmful to add the CDATA wrapper everywhere.
2. Do we need to wrap the main description in views-view-rss.html.twig? It was done in the issue I linked above, but I don't think that is actually necessary.
Comment #44
nicxvan commentedLooks like there was at least one failure in:
I can look at this later if someone else doesn't get to it before me.
There was another failure in AJAX but I think it was unrelated because rerunning it succeeded.
Comment #45
duaelfrI did the exact same change in my own
views-view-row-rss.html.twigfor both the title and the description.While there is no issue with the title, the description gets filtered somehow.
I tried to duplicate the description field with another name:
The "anything_else_but_description" field does not get filtered out.
Does anyone know where it could happen?
Comment #46
nicxvan commentedI'm not sure why that is happening, but you shouldn't need to add cdata to the title as I don't think that can be html.
Comment #47
nicxvan commentedI did some investigation, there is some process that is converting root relative and protocol relative urls to absolute in the twig file when the cdata is not wrapping it.
Here are the assertions with comments alluding to this:
// Verify that root-relative URL is transformed to absolute.
$this->assertSession()->responseContains($file_url_generator->generateAbsoluteString('public://root-relative'));
// Verify that protocol-relative URL is left untouched.
$this->assertSession()->responseContains($protocol_relative_url);
// Verify that absolute URL is left untouched.
$this->assertSession()->responseContains($absolute_url);
I downloaded the artifacts and here are the urls after decoding:
Protocol-relative URL
Root-relative URL
I compared with a passing test to confirm those do get http://localhost. Does anyone know what mechanism is converting those to absolute urls?
I pushed up a description filtered with RAW to test how that gets output. For some reason these tests are not working locally. I'm confident this will still fail, but wanted to check.
Comment #48
nicxvan commentedThat actually worked!
I'm not sure it actually solves the issue, I don't see the CDATA section properly.
Other modules that use xml with CDATA add it like this:
public function addCdata($cdata_text): void {
$node = dom_import_simplexml($this);
$no = $node->ownerDocument;
$node->appendChild($no->createCDATASection($cdata_text));
}
And builds it from a \SimpleXMLElement object first.
I'm not sure we need to do all of that. It's possible I'm looking at the wrong thing.
Comment #49
nicxvan commentedThere is something weird going on.
With this in the template:
<description><![CDATA[{{ description|raw }}]]></description>The output looks like this and is encoded, there is no cdata array in the output.
I then changed the twig template to this to confirm that I was editing the right template:
<description>Before CDATA<![CDATA[{{ description|raw }}]]>After CDATA</description>Now the output looks like this
The description content shows up twice for some reason, the before callout shows up once and the after callout shows up twice. The cdata array is there now too.
Comment #50
fgmFWIW I temporarily had this problem of escaped content after the D9->D10 migration on https://osinet.fr/rss.xml, and is disappeared a few days ago when I upgraded to 10.2.4. Not sure what changed in core that could cause these differences.
Comment #51
duaelfrI traced down the CDATA removal to
\Drupal\Core\EventSubscriber\RssResponseRelativeUrlFilter::transformRootRelativeUrlsToAbsolute()I updated MR !7201 with a fix.
Comment #52
nicxvan commentedGreat find!
Rerunning tests since they are failing, but I can't tell if they are germane.
Comment #53
nicxvan commentedFunctional test is failing, I can't take a look at the moment.
Comment #54
nicxvan commentedAh it's the new test: Drupal\Tests\Core\EventSubscriber\RssResponseRelativeUrlFilter
Comment #55
nicxvan commentedI'm rerunning again, I'm slightly suspicious since the failing test is the one that was just updated, but running that test locally passes, trying to run all tests locally and see if I get failures.
Comment #56
nicxvan commentedOk I can replicate this locally now, I think I can fix this.
Comment #57
nicxvan commentedOk this test should run, it seems to put the closing CDATA tag on a new line.
I patched a site and ran the rss validator and it looks valid as far as the CDATA is concerned.
It shows this error: Missing atom:link with rel="self" which is unrelated to the issue being solved here.
I'll move to needs review once I confirm tests are passing.
Comment #58
nicxvan commentedComment #59
nicxvan commentedComment #60
nicxvan commentedI am using this site: https://validator.w3.org/feed/#validate_by_input
Here is the test rss:
If I remove the cdata tag and use this instead
The validator complains about the div and html as expected.
Comment #61
smustgrave commentedSo as a template change will need to update the other views-view-row-rss.html.twig in core.
But also do a CR to announce to other themes to update their templates.
Comment #62
nicxvan commentedGot some new failures, I'm taking a look at them.
Comment #63
nicxvan commentedHere is the failing test:
I am unable to replicate this failure locally at the moment.
Comment #64
nicxvan commentedI was able to replicate I am pushing up a fix:
I'm not sure if the updated test needs more work.
We also updated stable, let me know if that needs a follow up issue instead.
Comment #65
nicxvan commentedComment #66
nicxvan commentedComment #67
smustgrave commentedI reviewed this the other day and forgot to hit save :(
But tweaked the CR slightly to include an examples.
https://git.drupalcode.org/issue/drupal-3433/-/jobs/1313487 shows the test coverage
All core templates appear to be updated.
Believe this is good.
Comment #68
nicxvan commentedTemporarily moving this back to needs work while I double check something, I think I can simplify this a bit.
Comment #69
nicxvan commentedMoving back to needs review since I changed the approach.
I think the raw filter was not appropriate and I realized we could always wrap the description in CDATA so I moved it to the url transform.
I then used XSS::filterAdmin to make sure no dangerous scripts get through.
I'll bump it back to needs work if the tests don't pass.
Comment #70
nicxvan commentedTests faililng
Comment #71
nicxvan commentedI cleaned up the tests further since we are wrapping the description in CDATA we don't need an extra test case, we just need to update the feed test to account for CDATA and ensure the feed is outputting CDATA (which we already put in place).
Comment #72
nicxvan commentedI also simplified the Change Record since we are no longer updating the templates.
Comment #73
smustgrave commentedEven better! Whenever we don't have to update all the twig templates its definitely the route. Helps contrib themes.
Believe the CR is still needed and what you have is good.
Comment #74
nicxvan commentedComment #75
alexpottIs the HTML admin created or user created. If it is user created we should be using
Xss::filter()because the admin filter is still very liberal.Comment #76
nicxvan commentedI used filterAdmin because that is what views usually uses as a final pass for rewrites and I wanted to preserve the most HTML that might have been configured as allowed on filter formats and added to the view.
You're probably right though, cause if someone entered HTML in a plain text field then it passed through this filter it would convert it to HTML.
I think converting this to filter could strip out some HTML that was enabled by an admin on a text format once it passes through the CDATA conversion, but that is probably better than outputting more HTML than expected.
I'll make the change.
Comment #77
nicxvan commentedOk I used the example in locale for which html tags to allow since the default blank filter even removes paragraph tags:
['a', 'abbr', 'acronym', 'address', 'b', 'bdo', 'big', 'blockquote', 'br', 'caption', 'cite', 'code', 'col', 'colgroup', 'dd', 'del', 'dfn', 'dl', 'dt', 'em', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'i', 'ins', 'kbd', 'li', 'ol', 'p', 'pre', 'q', 'samp', 'small', 'span', 'strong', 'sub', 'sup', 'table', 'tbody', 'td', 'tfoot', 'th', 'thead', 'tr', 'tt', 'ul', 'var']
Comment #78
smustgrave commentedCoverage still appears to be there https://git.drupalcode.org/issue/drupal-3433/-/jobs/1385050
Believe the feedback/change request has been addressed.
CR probably is still useful, 50/50 since it doesn't require a twig change now.
Comment #81
larowlanI think this is a bug, not a feature request. Which means its still eligible for 11.x/10.3.x
Comment #82
larowlanLeft some minor nits on the MR, feel free to self RTBC after addressing.
Comment #83
nicxvan commentedThanks, I updated the comments for both instances!
Comment #84
nicxvan commentedAh my local is out of date, one moment while I rebase.
Comment #85
nicxvan commentedOk, that is resolved now, I'll keep an eye on tests, but changing comments shouldn't affect them.
Comment #86
nicxvan commentedI'm doing a full rebase, there are a bunch of unrelated failures.
Comment #87
nicxvan commentedOk, I rebased on 11.x I'll keep an eye on tests.
Comment #88
nicxvan commentedTests are happy on 11.x.
10.2 branch is failing for an unrelated test.
Comment #89
quietone commentedAdded useful links to the issue summary and updated credit.
Comment #90
longwaveIs piggybacking on RssResponseRelativeUrlFilter really the right thing to do? That event subscriber is only supposed to be fixing up relative URIs, and now we're extending it to convert the description to CDATA - this feels like the wrong place to be doing this. Should we be doing this when the RSS feed is generated? Or maybe we need a separate event subscriber?
Comment #91
nicxvan commentedGood point, I'll take a look later this week unless someone else gets to it first.
Comment #92
nicxvan commentedComment #93
nicxvan commentedGot some failures, my local tests are not working at the moment so I'm working on figuring that out now.
Comment #94
nicxvan commentedGot my local working, this should fix it.
Comment #95
nicxvan commentedI did not update the 10.2 branch this time since I didn't create it and I'm not sure this is eligible for backport anyway.
Comment #96
nicxvan commentedI ran test only and it's failing as expected.
Comment #97
nicxvan commentedComment #98
smustgrave commentedRemoving Needs markup review as this isn't changing the twig template (assuming that's what the tag is for).
Per the slack thread, issue summary and change record have both been updated with new approach.
Small feedback to the MR has also been addressed.
Test-only feature did fail before last push so coverage is there.
LGTM
Comment #99
quietone commentedI made changes to comments, mostly about capitalization. And small changes to the change record to put the change first. I did not do a code review.
After that I then saw that there are two new functions that do not have return types. I need a break so I am setting to NW for that. See https://www.drupal.org/docs/develop/standards/php/php-coding-standards#s....
Comment #100
nicxvan commentedI'll take a look
Comment #101
nicxvan commentedI set string|false since https://www.php.net/manual/en/domdocument.savexml.php shows that, let me know if that should be handled differently.
Comment #102
smustgrave commented2 small comments.
Comment #103
nicxvan commentedCommitted the suggestion and answered your question.
Comment #104
nicxvan commentedI addressed your second comment.
Comment #105
smustgrave commentedThanks LGTM!
Comment #106
longwaveI think this solution is really clean, just one nit about an unused argument.
Comment #107
longwaveIn the interests of not holding up a 20+ year old issue any longer, I fixed the nit and set it back to RTBC and will commit later if the tests pass.
Comment #110
longwaveDecided to commit this to 11.1.x and 10.4.x only as there is a chance of it being vaguely disruptive to existing sites that are using RSS and not necessarily expecting CDATA in their output here, given we wrote a change record it makes sense to put it in a minor release only.
Still, great to see a 20+ year old bug finally fixed, and in such a clean way given we can just add the event subscriber to modify RSS responses.
Committed and pushed eba56b0fd7 to 11.x and ebc8e0639e to 10.4.x. Thanks!
Comment #112
mmillford commentedIs there a way to avoid this xss filter? If we're constructing an rss feed whose content we trust, then there should be a way for that content to survive this event subscriber. I would've assumed that wrapping the trusted content in its own CDATA during rendering would've worked, but this new event subscriber doesn't check for existing cdata, and just filters everything without prejudice.
Comment #113
opsdemon commentedDefinitely agree with this idea. Since 10.4.0 all div, style and img tags are getting stripped from our newsletter RSS feed and we ended up downgrading to 10.3 to get it working again.
We use mailchimp and there is no option to add html structure or styling on the other end. If we could wrap our trusted, styled output from the view using a CDATA section and bypass the new filter then that would clear our upgrade path to 10.4 (which is currently blocked by this issue).
Comment #114
nicxvan commentedThey are event subscribers so you should be able to override them.
Comment #115
opsdemon commentedHow and where would I do that? Any pointers or examples would be much appreciated.
Comment #116
mmillford commentedI put together this module to override those services (both the xss filter and the relative url filter, as both strip an existing cdata tag) and add in a check for cdata sections: https://www.drupal.org/project/rss_trust_cdata
You can use that module or check the source to see how it was implemented in order to write your own.
I still think this core service should include some ability to trust content or just opt out of rewriting, but as nicxvan said, overriding the core services will work for now.
Comment #117
opsdemon commentedThank you for providing a module to override this behaviour.
In this case, the view output comes from a twig template and it turns out that wrapping the output in a CDATA section doesn't work there either (as it also gets stripped out).
Eventually I found a fix to that issue here: views_rss/issues/3079683#comment-15110037
Applying the fix to the .theme file fixed the issue and the newsletter formatting looks ok now.
Comment #118
dries commentedThis commit appears to have broken my RSS feed at https://dri.es/rss.xml: all 'img' tags are now being stripped from the feed.
I suspect the issue is related to this change, as it seems the 'img' tag isn't included in the allow-list.
It is also be worth considering whether other elements like 'figcaption', 'figure', and 'video' should be preserved to maintain important content in RSS feeds.
Comment #119
longwaveThanks for reporting, the regression with img and certain other tags is being tracked at #3497758: Regression: RssResponseCdata filtering out common HTML tags from RSS feeds
Comment #121
alexpottFYI: this issue is being reverted in #3497758: Regression: RssResponseCdata filtering out common HTML tags from RSS feeds
Comment #122
heikkiy commentedEDIT: This comment went to wrong issue.
Comment #123
richarddavies commentedAn issue was just reported with a view RSS feed on our website that I think may be caused/related to this change. We have HTML in the RSS description element and it's being double encoded now. First, tags like
<p>are being encoded as<p>and then everything is wrapped in aCDATAtag. The correct approach is to use one or the other, but not both (https://cyber.harvard.edu/rss/encodingDescriptions.html).I can't seem to find a way to configure the feed to turn off either the CDATA or entity encoding, so how do I fix this?
Comment #124
nicxvan commentedThis change was reverted so it must be something else double encoding.