This was thought to be a bug in aggregator module with html entities + titles. In debugging, problem was actually in feed itself and *not* a bug in drupal.

However, there was no test coverage to validate correct escaping of feed titles. This issue adds a test case already committed to D8 currently working on back port to D7.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Steve Dondley’s picture

Important clarification. This occurs when the feed already has html entities in it. For example, if the feed has a title of:

It's an Honor 

(It's an Honor)

When it gets run through check_plain, the ampersand in the html entity get converted to an html entity, and it becomes:

It&amp#39;s an Honor

So the final output is:

It's an Honor

in the title.

Steve Dondley’s picture

I can't properly display the html entities in the comments. It's all screwed up. Let me try with spaces before the semicolon. You should mentally remove them.

It&#39 ;s an Honor

gets converted to

It&amp ;#39; an Honor

which results in the following output:

It&#39 ;s an Honor
Steven’s picture

Status: Needs review » Needs work

I'm pretty sure this is correct behaviour. Most feed fields contain regular, unmarked up text. Double-escaping like that in the source feed is only valid if the field contains escaped, marked up HTML text (such as the body).

In any case, this needs to be solved at the parsing stage, not the output stage. Drupal takes feeds in various formats.

Finally, using strip_tags() on output as a validation measure is disallowed. It should only be used as a conscious filtering op (such as done for taxonomy terms in title attributes).

Steve Dondley’s picture

> Most feed fields contain regular, unmarked up text.

Well, Google uses htmlentities. They are a pretty major supplier of feeds.

Roadskater.net’s picture

yes google news is pretty important and i like it, and it would be nice to have at least a checkbox available in aggregator setup to specify whether to unescape the title. i know nothing about this, and can't do this myself, i'm sorry to say. i had other google news related problems (duplicate news items) so i'm open to suggestions from those who may have good reasons for thinking google news is not important, or not best. thanks for anything anyone does to help us. please share if there's an adopted fix. thanks.

kastaway’s picture

Version: x.y.z » 4.7.2

Does anybody have brainstorming ideas of how to work around this? I'd settle for just dropping the entities from the RSS feed, which would be better than the current mash of ascii characters on the front page of the site.....

matt westgate’s picture

Rather than strip_tags(), you'll want to use filter_xss($item->title, array()) around the item title. It's been shown on the security mailing list that strip_tags() can be bypassed in terms of XSS exploits.

rosenblum68’s picture

Version: 4.7.2 » 4.7.3

I downloaded newest version of 4.7.3 aggregator.module, and it still has the problem. I has to modify the file in multiple locations (using advise for filter_xss) to get it to work both in the block AND the "more" summary page... just a heads up. http://wellweight.org

edmund.kwok’s picture

Version: 4.7.3 » x.y.z
Assigned: Steve Dondley » edmund.kwok
Status: Needs work » Needs review
FileSize
1.88 KB

Changed strip_tags to filter_xss as suggested and in two other places, block_item and summary_item, including the original page_item.

Btw, issue also applies in HEAD, http://drupal.org/node/63459, marking that duplicate because this was reported first. Also, changing version to cvs to get more attention.

Patch is for cvs version.

edmund.kwok’s picture

FileSize
1.88 KB

Patch for 4.7 cvs.

ahoeben’s picture

In this issue (which got sidetracked), I suggested cleaning up the title before it is inserted in the database. This has the obvious benefit of cleaning it up only once, making my patch a lot smaller than the one proposed here.

The argument against cleaning up the tile before insertion in the database is that it is not the Drupal way to do so. For example, when adding a title to a node, the title is inserted into the database as is (after making it database safe); no removal of tags etc. The rationale behind this is that when the user goes to edit the node, he would expect his/her original input and not a parsed/filtered version of it. There is an obvious need to keep the original input, for consistency to the user.

In the case of the aggregator module however, there is no need to store the original input, which is machine generated and which will never be edited. So I'ld say: filter before insertion in the database, and be done with it.

jacauc’s picture

Version: x.y.z » 5.x-dev

Will this patch be part of drupal 5.0 core?
I am running drupal 5 CVS on a test site, and I still see & # 3 9 ; (spaces inserted there to prevent it from being parsed)

drumm’s picture

Status: Needs review » Needs work

Tags need to be stripped from titles.

edmund.kwok’s picture

Assigned: edmund.kwok » Unassigned

Releasing this issue back into the wild. I'm not sure what's the best way to fix this..

msmiffy’s picture

I see that this problem still exists in 5.1. Any sign of a permanent, main-stream resolution? I'll just go hack the code for now as this is UGLY.

jacauc’s picture

I agree 100%

csevb10’s picture

Version: 5.x-dev » 6.x-dev
Assigned: Unassigned » csevb10
FileSize
1.86 KB

I repatched this for Drupal 6.
There's an aggregator_filter_xss so you can actually filter the feed content independent of the site content which seems like a slightly better solution than invoking the filter_xss directly. I patched the same 3 areas as before.
What does everyone think of this as a viable solution moving forward?

csevb10’s picture

Status: Needs work » Needs review
agentrickard’s picture

I'm going to make a run at closing the Aggregator queue for D6.

Personally, I like http://drupal.org/files/issues/aggregator-striptags.patch and agree with ahoeben from #11.

This is machine-text, so keeping the "integrity" of the original title string is not relevant. Stripping the entities seems perfectly fine during the parsing stage.

catch’s picture

Status: Needs review » Needs work

No longer applies.

ezra-g’s picture

Since it was only one line, in D5.3 I just made the quick edit, removed the items from the feed, updated the feed and had the same #39 business.

underpressure’s picture

What is the status of this problem? I use Google news and get this ugly output as well.

alpha2zee’s picture

Drupal's architecture needs to be changed to allow admins complete control over filtering of both input (what is stored in the database) and output (the final, displayed HTML). Further, this filtering should be customizable as per every content type (newsfeed, user comment, and so on). E.g., machine-generated newsfeed titles (database stored) can be filtered for XSS, appropriate HTML tags, etc., during the input stage to avoid the overhead of similar filtering everytime the title is output as web page content.

Also, appropriate code/principles should be used from better (than Drupal's Filter module) filtering scripts like htmLawed and HTMLPurifier. Though the mentioned scripts can be used in plugged-in modules, they then either cannot have certain functionalities (like, to address the newsfeed-entity issue), or re-do some actions that Drupal's core does anyway (increasing the processing time).

cburschka’s picture

What is the valid standard in XML - " or " or & #39;? I don't know how much more attention people pay to feed validity than website validity - would it be enough to require validity and reject quirky feeds, or do we lose a lot of functionality?

Edit: I note that whatever filter is here on d.o. is one whose code I would probably yell upon seeing. You do not unescape & amp; # 39; twice to make '. You just don't.

alpha2zee’s picture

Lines 289, 1179, 1267, 1352, 1367 and 1394 of the drupal/modules/aggregator/aggregator.module file (version timed 1-10-08 22:14) have code like:

... check_plain($item->title) ...

Looks like replacing all those six check_plain occurrences with aggregator_filter_xss should take care of the issue.

jdefay’s picture

I implemented the changes suggested by ahoeben above in his patch in post #11. Then I tried changing all the check_plain references to aggregator_filter_xss, but still got funky characters in Google News feeds on my site .

ahoeben's patch seems to clean titles that show up in feed blocks, but not in the Drupal aggregator category view which was still giving me funky single quotes in feed titles. It looks like the funky symbols are still being stored in the database, and ahoeben's patch fixes them when they are displayed in blocks.

Since the funky stuff was still getting displayed I wanted to remove them before they go into the database & thus make ahoeben's display patch unnecessary. So I added htmlspecialchars_decode() in the aggregator.module aggregator_parse_feed function.

It works great, whenever a "& #39;" shows up in the title of my news feeds, it's replaced with a single quote and then inserted into the database for later use.

Here's the line before:

$title = $item['TITLE'];

Here it is after:

$title = htmlspecialchars_decode($item['TITLE'], ENT_QUOTES);

TimAlsop’s picture

Can somebody let me know what patch/changes are recommended for a Drupal 6.3 environment ? I am happy to modify code, but wanted to know which code changes to make so that these special characters are changed before the google news feed items are written to the database.

dominich’s picture

For me on 6.3 it's as the above post. I have a google news feed displayed in a block on a site, and the title links were showing &blah; style text - much in the same way that Firefox RSS feed reader does for sites like Slashdot et all interestingly.

Anyway - all I had to do was as jdefay says:

add htmlspecialchars_decode() in the aggregator.module aggregator_parse_feed function.

Here's the line before:

$title = $item['TITLE'];

Here it is after:

$title = htmlspecialchars_decode($item['TITLE'], ENT_QUOTES);

jdefay rox.

D

TimAlsop’s picture

I tried this, but when I update feed I get:

Fatal error: Call to undefined function: htmlspecialchars_decode() in /data01/mysite/public_html/drupal/modules/aggregator/aggregator.module on line 738

Is there something else I need to do so that this function is available ?

Thanks,
Tim

TimAlsop’s picture

I just noticed that the htmlspecialchars_decode() function is only available in PHP >= 5.1.0. I am using PHP 4.4.4 which is supported by Drupal 6.3. Unfortunately I am unable to upgrade PHP because my site is hostsed on a server managed by my ISP and uses cpanel. So, the version of PHP is out of my control.

Is there a PHP 4.4.4 function which will do same/similar, or do I need to suffer this until PHP 5.1.0 or later is available ?

Thanks,
Tim

TimAlsop’s picture

I found some suggestions at http://uk.php.net/htmlspecialchars_decode

So, I have now changed code as shown below:

if ( !function_exists('htmlspecialchars_decode') )
{
    function htmlspecialchars_decode($text)
    {
        return strtr($text, array_flip(get_html_translation_table(HTML_SPECIALCHARS)));
    }
}

    // Resolve the item's title. If no title is found, we use up to 40
    // characters of the description ending at a word boundary but not
    // splitting potential entities.
    if (!empty($item['TITLE'])) {
      $title = htmlspecialchars_decode($item['TITLE'], ENT_QUOTES);
    }

I don't get any error now, so I will see if I get any news items with special characters included. I will report my findings in next few days.

bradnana’s picture

This bug still exists as of D6.6 in core aggregator.module. Any plans for a patch to be brought into core?

--Brad

GetActive’s picture

Version: 6.x-dev » 6.8

It also/still happens on the "domain.com/user" page where the title is generated dynamically below the login form.

domesticat’s picture

I can confirm that @jdefay's patch (#26) works well for >PHP5; I've been using it for months. I have not tested the other version, nor can I vouch for whether or not doing this is 'the Drupal way.' (I just know that my users need the feeds to work properly!)

Problue Solutions’s picture

thank you jdefay in #26, after much useless information your advice has finally solved my problem :)

cvining’s picture

jdefay,

I just looked at your site, which has a Google News Feed here: http://jason.defay.org/aggregator/categories (titled 'UCSD In the News').

How did you get that to work? When I setup something similar, all the ampersands (&) are stripped out of the links, which breaks them. Example:

http://news.google.com/news/url?sa=T&ct=us/0-0&fd=R&url=http://www3.sign...

http://news.google.com/news/url?sa=Tct=us/0-0fd=Rurl=http://www3.signons...

Any suggestions? Thx.

-- Cronin

jdefay’s picture

Hi Cronin,

I just looked at the Google News Feed 'UCSD In the News' you mentioned on my site and noticed it does in fact have the ampersands in the links. Are you seeing something different on my site?

I just upgraded to 6.10 and haven't reimplemented any of these patches described in this thread. So I can't say for certain if the patches will actually strip the Google tags or not.

What I can say is that I got frustrated with the way Google feeds were causing frequent problems with the Drupal aggregator. I switched over to Yahoo news feeds and am getting consistently clean results. It's too bad because I like the Google news service better and would prefer to use it over Yahoo. I just didn't want to deal with headaches anymore.

Jason

jdefay’s picture

All,

Since Cronin asked about it, I decided to go in and re-implement the simple patch I described in my earlier post to this thread.

I added the htmlspecialchars_decode function back into the aggregator.module, removed all the old items from the news feed, then refreshed it. Viola, no more funky characters in the news feed titles.

Since several new updates/releases of Drupal have happened since my original patch, and all of them still contain the old problematic PHP code, perhaps someone can volunteer to help me get this fix added to the next version of Drupal so we don't all have to keep re-patching every time a new release comes out.

Please feel free to respond to this thread and/or email me if you can help make this happen.

Cheers,

Jason

jdefay’s picture

Status: Needs work » Reviewed & tested by the community

Tim,

Have you had any errors on this yet? If not, your aggregator patch seems to be a better, more backward compatible, improvement over mine.

I implemented your version and tested it on my site. It works great so I'd appreciate it if any of you who are reading this & willing to help would do the same and report back here.

Cheers,

Jason

jdefay’s picture

Oops, I shouldn't have changed the patch status since it looks like the original patch isn't the one we trying to implement. I'm setting back to 'code needs work' until someone can create and submit a patch that meets the Drupal patch standards. Here is what I'm proposing someone help me do:

Write a valid patch that replaces line 737 through 741 with the following:

if ( !function_exists('htmlspecialchars_decode') )
{
    function htmlspecialchars_decode($text)
    {
        return strtr($text, array_flip(get_html_translation_table(HTML_SPECIALCHARS)));
    }
}

    // Resolve the item's title. If no title is found, we use up to 40
    // characters of the description ending at a word boundary but not
    // splitting potential entities.
    if (!empty($item['TITLE'])) {
      $title = htmlspecialchars_decode($item['TITLE'], ENT_QUOTES);
    }

As soon as that patch is up and tested I'll work on getting it into the queue for the next Drupal release.

Cheers,

Jason

domesticat’s picture

Status: Reviewed & tested by the community » Needs work

Been using this patch for a while with no issues; would be glad to see it get in the next version.

cvining’s picture

Jason,

"I just looked at the Google News Feed 'UCSD In the News' you mentioned on my site and noticed it does in fact have the ampersands in the links. Are you seeing something different on my site?"

No, your site looks fine. But my site (http://www.zts.com, see the section "Google Web Search Filtered for Recent Thermoelectric Content"). Check the links: "&"s are stripped out, and broken. I checked and they are stored in the DB that way too.

I just upgraded to drupal 6.9 (a day before 6.10 came out!).

Any thoughts on what's going on? Thanks for the replies.

-- Cronin

jdefay’s picture

Version: 6.10 » 6.8
Assigned: jdefay » csevb10

Sounds like the patch isn't working for some reason. Have you done all of the following?

1. Checked to make sure you have PHP version 5.1.0 or greater installed on your server
2. Changed line 741 in the aggregator.module from $title = $item['TITLE']; to $title = htmlspecialchars_decode($item['TITLE'], ENT_QUOTES);
3. Removed all previous posts from your DB and re-imported them.

You might also try out the patch I just created below. Please give me feedback on the patch, it's in need of QA testing!

Thanks,

Jason

jdefay’s picture

Version: 6.8 » 6.10
Assigned: csevb10 » jdefay
FileSize
1.05 KB

Hi everyone,

I've created and tested the attached patch on my Drupal 6.10 site and am looking for volunteers to do additional patch testing.

The patch chages the aggregator.module as generally descibed in my posts above. It first checks to see if the server has the htmlspecialcharacters_decode() function available (PHP 5.1.0 and greater). If so it uses the function to clean up HTML in the article titles and saves the results to the database. If the function isn't there for some reason, it defaults to the prior method for saving article titles.

I'll give it a few weeks for testing, then if I don't hear anything I'll assume it's good to go and try to get it into the next patch release.

Thanks,

Jason

P.S. Thanks to domesticat for turning me on to WinMerge!

cvining’s picture

Version: 6.8 » 6.10
Assigned: csevb10 » jdefay

Jason,

Thanks much for the help, but after further testing & searching I'm pretty sure my problem is distinct from this thread.

A drupal thread (Aggregator Module Shows HTML Gibberish) which described my problem quite closely (dropping of key symbols like "&") is at this link: http://drupal.org/node/346990

It seems there is a known bug in one of the php libraries, libxml2 2.7.1,which strips out key html symbols (like <, >, & etc.). The bug is described here: http://bugs.php.net/bug.php?id=45996

My host's php installation has libxml2 2.7.2, which (if I'm reading the libxml release notes right) doesn't have the fix in it. But the latest release, libxml2 2.7.3, should. So only a few systems will be affected by this particular bug, and it will go away as the php libraries get updated.

In mean time, guess I'll use some work-around.

Thanks again .

jdefay’s picture

Status: Needs work » Reviewed & tested by the community
FileSize
1.05 KB

I fear there may not be sufficient traffic on this thread to get additional patch testing done. After waiting a few weeks for testers to evaluate the final version of the patch, and not getting any responses, I've gone ahead and updated the status to "reviewed & tested by the community".

While nobody seems to have tested the final version, it was tested and improved by a number of users since it was first created in May of 2006 by Steve Dondley. Since those earlier iterations led us to this final version of the patch, I think it has been sufficiently tested to be committed to the next version of Drupal.

I apologize in advance for not being able to get other developers to provide a supporting opinion on the patch. However, I'm confident this patch will work just fine, particularly since the final patch checks to make sure the htmlspecialchars_decode() function is available on the server before running it.

Cheers,

Jason

jdefay’s picture

Version: 6.10 » 6.x-dev

After re-reading the Drupal PATCH instructions, it appears that I should be upgrading this bug to Version "6.x-dev", so that's what I just did :-)

packdragon’s picture

I've been having problems with my aggregator in that my feeds are displaying HTML code. I've got Drupal 6.10 installed. This patch sounds like exactly what I need. I clicked on the attachment, but it appears to display some text in my browser. How do I download and install this patch?

catch’s picture

Version: 6.x-dev » 7.x-dev
Status: Reviewed & tested by the community » Needs work

The patch seems to be reversed - htmlspeciachars_decode() is removed, not added by #46.

Additionally the current development version of Drupal is 7.x rather than 6.x - where possible we fix issues in 7 first, then backport. http://api.drupal.org/api/function/aggregator_parse_feed hasn't changed that much (although it's in a different file now), so it'd be roughly the same change to make, although Drupal 7 requires PHP 5.2 or greater, so no need for the function_exists() there. Only did a very cursory review of the patch, but if you could re-roll against 7 that'd help to get some more reviewers.

alex_b’s picture

Status: Needs work » Needs review
FileSize
1.04 KB

This might work.

cburschka’s picture

Status: Needs review » Needs work

The patch looks good, and this is a great idea, but we definitely need a test case for it.

jdefay’s picture

Status: Needs work » Needs review

Thanks for re-rolling this Alex_b, sounds like we're ready to test & patch

Status: Needs review » Needs work

The last submitted patch failed testing.

jeffschuler’s picture

Assigned: jdefay » Unassigned
Status: Needs work » Needs review

Re-rolled for current D7 HEAD.

jeffschuler’s picture

Re-rolled for current D7 HEAD.

Status: Needs review » Needs work

The last submitted patch failed testing.

jeffschuler’s picture

Status: Needs work » Needs review

Testbot says that the Error handlers test failed. It passes on my machine.
Patch still applies to HEAD.
Re-setting to Needs Review for another shot.

Re-test of 61456_html_decode_titles_2.patch from comment #55 was requested by Arancaytar.

MichaelCole’s picture

#55: 61456_html_decode_titles_2.patch queued for re-testing.

s4j4n’s picture

subscribing

gausarts’s picture

I have my aggregator feed titles ended with &#160;.
Subscribing. Thanks

Jody Lynn’s picture

Version: 7.x-dev » 8.x-dev
agentrickard’s picture

So after 5 years, we should just declare that we don't care and close this.

Moving to D8 is pointless.

Jody Lynn’s picture

LOL, I don't think this one is really that bad, and it's finally in 'needs review'. (I was trying to update the versions on the more promising issues).

agentrickard’s picture

Version: 8.x-dev » 7.0
Status: Needs review » Reviewed & tested by the community

Well, the patch is green, and should be RTBC, so let's give it a try for 7.1.0.

jeffschuler’s picture

Status: Reviewed & tested by the community » Active

I started writing a test for this for verification/assurance. The included patch adds this test but does not include the patch in #55, (nor any other proposed changes in this issue,) in order to allow testing with-and-without such changes.

The test in its current form adds a feed with an item whose title includes a quote (") and an ampersand (&).

However... these look to be displaying properly -- using SimpleTest and adding the feed manually -- without patching.
Am I testing this properly?
Is anyone actually seeing this error in D7?

I had some issues with testing the apostrophe ('). I see the correct output (manually and in SimpleTest's verbose results) when I use it in the test feed item's title, but assertText() doesn't seem to acknowledge it.

Changing this issue back from RTBC seems appropriate.

jeffschuler’s picture

sorry, here's the test...

agentrickard’s picture

@jeffschuler

Now if you can just combine the patch and the test into one, set the issue to Needs Review and let TestBot try it out.

jeffschuler’s picture

@agentrickard:

Sorry if I wasn't clear: the test passes without applying the patch.

We should demonstrate that something is broken, [change the test to do so,] before fixing it.

agentrickard’s picture

So you are suggesting that this actually got fixed somewhere else?

agentrickard’s picture

Status: Active » Needs review

Marking 'needs review' to test the test patch.

jeffschuler’s picture

Either it's been fixed elsewhere or I'm not testing it properly.

I just tried the same test feed (generated in #67) in Drupal 6 and the entities displayed properly. So, perhaps I'm not testing the correct conditions?

The test uses a manually-generated RSS 0.91 feed (copied from the existing RSS091 test file) whose only item's title includes & and " entities.

I've added items to this feed using the title phrase "It&39;s an Honor" (from comment #1,) as well as a few other examples folks cited as problems in this issue, and they all displayed properly in D6 and D7.

Maybe someone experiencing this issue could point us to a feed they're having a problem with?

agentrickard’s picture

The test passes and accounts for the original condition, so I suppose this was fixed elsewhere and we can close this.

If it recurs, we can re-open?

jeffschuler’s picture

Status: Needs review » Closed (fixed)

Sounds reasonable to me.

smscotten’s picture

Version: 7.0 » 7.7
Status: Closed (fixed) » Active

Patch in #55 worked for me, mostly. This suggests to me that the result of #66 is a faulty test condition, not that the problem magically fixed itself. Testing against an external site known to cause the problem may be a better metric than creating a local feed to test against. (Of course, maybe something else has changed between 7.0 and 7.7)

I said the patch "mostly" worked for me. &rsquo; still appears in my feed titles, but that's a htmlspecialchars_decode() issue and may be outside the scope of this issue.

jeffschuler’s picture

smscotten: can you point us to the feed you're having issues with?

smscotten’s picture

Sorry, that probably should have been one of the first things I put in the post.

Feed: http://www.politifact.com/feeds/statements/truth-o-meter/

And here on my site: http://splicer.com/aggregator/sources/19

Ugh. Looked at the source. Looks like they have double-converted their characters to character entities, resulting in &amp;quote; all through, as described in reply #1.

So the patch in #55 is a workaround, not a fix, because the behavior in the aggregator isn't wrong.

Um... right?

EDIT TO ADD: I sent mail to politifact.com and got a response back saying they were looking into it so it may not even show the issue much longer.

David_Rothstein’s picture

Title: Aggregator titles display quotes and other characters with HTML entity equivalents badly » Aggregator titles display quotes and other characters with HTML entity equivalents badly (write tests)
Version: 7.7 » 8.x-dev
Category: bug » task
Status: Active » Needs work

I saw this issue recently also, but same thing as above; it turned out that the feed itself (on the source site) was incorrectly escaping its titles. So there's no bug here for Drupal to fix, at least not in D7/D8.

If an actual bug is still reproducible in Drupal 6 (i.e. with a feed that is formatted correctly on the source site but broken when imported to Drupal), someone could feel free to set this issue back to 6.x.

In the meantime, the tests in #67 look pretty solid so it's worth getting the patch rerolled and committed; tests are good regardless of whether the bug itself exists...

jeffschuler’s picture

Status: Needs work » Needs review
FileSize
2.1 KB

Cool. Rerolled #67 for 8.x.

David_Rothstein’s picture

Status: Needs review » Reviewed & tested by the community
FileSize
2.08 KB

Patch looks good to me. The tests pass, and they're completely consistent with the way the other tests in the same area of the code are written.

I did a quick reroll to remove the "No newline at end of file" issue from the newly-added file. This should be good to go.

David_Rothstein’s picture

Ah, but someone is going to come point out that Drupal documentation standards now require the PHPDoc to say "Tests a feed.." rather than "Test a feed.." :)

Fixing that in the attached.

xjm’s picture

Haha, you heard me coming! But also....

+++ b/core/modules/aggregator/aggregator.testundefined
@@ -914,4 +918,15 @@ class FeedParserTestCase extends AggregatorTestCase {
+  /**
+  * Tests a feed that uses HTML entities in item titles.
+  */

Indentation is goofed on the second and third lines. Needs one more space before the asterisks.

xjm’s picture

FileSize
2.08 KB

Fixing that. No commit credit please. :P

xjm’s picture

Issue tags: +Needs backport to D7

Oh, and yeah.

Dries’s picture

Version: 8.x-dev » 7.x-dev

Committed the tests to 8.x. Moving to 7.x.

Dave Reid’s picture

Title: Aggregator titles display quotes and other characters with HTML entity equivalents badly (write tests) » [ROLLBACK] Aggregator titles display quotes and other characters with HTML entity equivalents badly (write tests)
Version: 7.x-dev » 8.x-dev
Priority: Normal » Major

This appears to have broken tests. Failure on "Quote&quot; Amp&amp;" found http://qa.drupal.org/8.x-status

andypost’s picture

Title: [ROLLBACK] Aggregator titles display quotes and other characters with HTML entity equivalents badly (write tests) » Aggregator titles display quotes and other characters with HTML entity equivalents badly (write tests)
--- /dev/null
+++ b/core/modules/aggregator/tests/aggregator_test_title_entities.xml

this file was not commited

Dave Reid’s picture

Title: Aggregator titles display quotes and other characters with HTML entity equivalents badly (write tests) » [BROKEN HEAD] Aggregator titles display quotes and other characters with HTML entity equivalents badly (write tests)
Priority: Major » Critical
sun’s picture

+++ b/core/modules/aggregator/aggregator.test
@@ -914,4 +918,15 @@ class FeedParserTestCase extends AggregatorTestCase {
+    $this->assertText("Quote&quot; Amp&amp;");

Why are you testing the page output that has been converted to plain-text?

When testing HTML escaping/sanitization/entities, use assertRaw() to check the actual, unprocessed content in the page output.

assertText() runs the entire page output through filter_xss(), which performs a plethora of HTML escaping, entity conversions, and validations.

pillarsdotnet’s picture

Status: Reviewed & tested by the community » Needs review

Patch suggested by #87 and #89.

pillarsdotnet’s picture

Patch for real this time.

pillarsdotnet’s picture

Category: task » bug
Status: Needs review » Reviewed & tested by the community

Note that since HEAD is broken, this is a critical bug.

andypost’s picture

Category: bug » task
Status: Reviewed & tested by the community » Needs review

+1 to commit asap

# drush test-run FeedParserTestCase
Feed parser functionality 62 passes, 0 fails, 0 exceptions, and 15 debug messages                              [ok]
No leftover tables to remove.                                                                                  [status]
No temporary directories to remove.                                                                            [status]
Removed 1 test result.                                                                                         [status]
andypost’s picture

Status: Needs review » Reviewed & tested by the community
pillarsdotnet’s picture

Category: task » bug
Dries’s picture

Sorry about that. I just committed the patch in #91. I think that should fix HEAD. Let's see ...

Dries’s picture

Priority: Critical » Normal
Status: Reviewed & tested by the community » Fixed
David_Rothstein’s picture

Title: [BROKEN HEAD] Aggregator titles display quotes and other characters with HTML entity equivalents badly (write tests) » Aggregator titles display quotes and other characters with HTML entity equivalents badly (write tests)
Version: 8.x-dev » 7.x-dev
Category: bug » task
Status: Fixed » Patch (to be ported)

Looks good - seems like we can still use a patch for D7.

Regarding assertText() vs assertRaw()... either should be fine here since it's just a text string. Leaving it as assertText() actually has the nice property that it verifies that filter_xss() doesn't do anything funny to &quot; or ;&amp (which it shouldn't), but I suppose that's an extremely backwards way to test that :) Either one is OK.

Albert Volkman’s picture

Status: Patch (to be ported) » Needs review
FileSize
2.08 KB

D7 backport. I *think* I got it right.

Status: Needs review » Needs work

The last submitted patch, core_aggregator_title_entities_test-61456-d7-99.patch, failed testing.

Albert Volkman’s picture

Status: Needs work » Needs review
FileSize
2.05 KB

Trying again.

David_Rothstein’s picture

Should probably use assertRaw() rather than assertText() to be consistent with what went into Drupal 8.

Albert Volkman’s picture

patcher’s picture

does anyone know, is there a similar patch for the feeds module? i am importing a xml with feeds and the html entities are not shown correctly in the title

xjm’s picture

@patcher: This is a core issue. Try looking in the feeds module queue.

aascherson’s picture

FileSize
349.11 KB

Hey guys,

Long time drupal user but not a pro so maybe I am doing something dumb here. I am getthe the tags in my titles from a google alert feed. Been looking everywhere for a solution to this and trying to apply this patch to drupal 7.15 core.
using git, I get a fail on :

checking patch modules/aggregator /aggregator.test ...

I get a file not found error. So it finds the patch, but not the test file that I can see there in the directory. Am new to git, am I doing something stupid? Any guidance would be really appreciated as this is driving me nuts. Have tried to look to see if I can apply patch manually but cant seem to work out what needs to go where on this one. Apologies if this is a newb moment.

For reference I am running this from the modules/aggregator directory, which is also where the patch is. Tried running from /test as well but same problem.

lundj’s picture

Is there any solution in one of the last versions of the aggregator.module?

In version 7.17 I have the same problem with Facebook-RSS-Feeds: http://cl.ly/image/220U1V26180j on http://junge-akademie-wittenberg.de/projekt/sprache-und-politik (right sidebar at the end of the page)
The related Feed for this one: https://www.facebook.com/feeds/page.php?id=301588419892418&format=rss20

Is there any solution for this?

Albert Volkman’s picture

Status: Needs review » Needs work

The last submitted patch, core_aggregator_title_entities_test-61456-d7-108.patch, failed testing.

Albert Volkman’s picture

Status: Needs work » Needs review
FileSize
2.04 KB

Oops.

fbreckx’s picture

fbreckx’s picture

rtrubshaw’s picture

Not proud of this, but since the views that referenced the aggregator feeds already had template files I simply ran the entire output through strtr().

E.g. in "views-view--twitter-feed.tpl.php":

<ul><?php print strtr( $rows, array( '&amp;amp;' => '&amp;', '&amp;#' => '&#', ) ); ?></ul>

(and similar code for other full feed template files)

I found that Twitter, Facebook and Flickr have all - at times - stuck HTML entities in the title. YMMV

chirikamo17640’s picture

#55: 61456_html_decode_titles_2.patch queued for re-testing.

Liam Morland’s picture

ggevalt’s picture

Wow.
This has to be one of the most incredible threads on Drupal forums.
The problem began in 2006 on Drupal 4 and folks seem to be still focusing on it for Drupal 8.
I'm in Drupal 6, I have the issue, and I see no resolution.
If I have missed something, please advise or provide a link as to the solution for Drupal 6 folks which still represent a substantial number of people.
Thanks so much for all you do.
cheers,
g

webengr’s picture

any fix for this? I agree hard to believe drupal 7 is doing this with a standard install. ?

Albert Volkman’s picture

@ggevalt Issues are fixed upstream as to avert regressions.

@webengr Have you tested my latest patch? When a patch is in "Needs Review", it's helpful to test out the patch, report back your findings, and possibly mark the issue as RTBC.

mimes’s picture

Test patch in #110 applied.
Results:

Albert Volkman’s picture

Albert Volkman’s picture

@mimes That's odd, it's still passing on the bot. Do you have a vanilla setup?

GuillaumeDuveau’s picture

Mmmm... spent a while on this. I could sure help with testing the test patch, but I what I believe, it that it's a real-world situation we have with feeds that already have html entities in it, like the original author of the issue said in comment #1 7.5 years ago. I can't really see what the test helps with that. Having an option to html_decode_entities by feed item was proposed and forgotten. I'd happily submit a patch to include that option if any Core maintainer gives me the green light.

Otherwise here is a solution to html_entity_decode the titles without hacking Core. We define and use a new parser thanks to hook_aggregator_parse, in a custom module "mymodule". Basically it's exactly the same parser except that we replace in the copy of aggregator_parse_feed() :

      $item['title'] = $item['title'];

with :

      $item['title'] = html_entity_decode($item['title'], ENT_QUOTES, 'UTF-8');

You have to choose the new parser on admin/config/services/aggregator/settings. Here the code of mymodule.module :

<?php

/**
 * @file
 * MyModule.
 */

/**
 * Implementation of hook_init().
 */
function mymodule_init() {
  module_load_include('inc', 'aggregator', 'aggregator.parser');
  /* note : I tried to move this in mymodule_aggregator_parse() but then admin/config/services/aggregator/settings doesn't show the parser select options. */
}

/**
 * Implements hook_aggregator_parse_info().
 */
function mymodule_aggregator_parse_info() {
  return array(
    'title' => t('MyModule parser'),
    'description' => t('Parses RSS, Atom and RDF feeds, decodes html entities.'),
  );
}

/**
 * Implements hook_aggregator_parse().
 */
function mymodule_aggregator_parse($feed) {
  global $channel, $image;

  // Filter the input data.
  if (mymodule_parse_feed($feed->source_string, $feed)) {
    $modified = empty($feed->http_headers['last-modified']) ? 0 : strtotime($feed->http_headers['last-modified']);

    // Prepare the channel data.
    foreach ($channel as $key => $value) {
      $channel[$key] = trim($value);
    }

    // Prepare the image data (if any).
    foreach ($image as $key => $value) {
      $image[$key] = trim($value);
    }

    $etag = empty($feed->http_headers['etag']) ? '' : $feed->http_headers['etag'];

    // Add parsed data to the feed object.
    $feed->link = !empty($channel['link']) ? $channel['link'] : '';
    $feed->description = !empty($channel['description']) ? $channel['description'] : '';
    $feed->image = !empty($image['url']) ? $image['url'] : '';
    $feed->etag = $etag;
    $feed->modified = $modified;

    // Clear the cache.
    cache_clear_all();

    return TRUE;
  }

  return FALSE;
}

/**
 * Parses a feed and stores its items.
 *
 * @param $data
 *   The feed data.
 * @param $feed
 *   An object describing the feed to be parsed.
 *
 * @return
 *   FALSE on error, TRUE otherwise.
 */
function mymodule_parse_feed(&$data, $feed) {
  global $items, $image, $channel;

  // Unset the global variables before we use them.
  unset($GLOBALS['element'], $GLOBALS['item'], $GLOBALS['tag']);
  $items = array();
  $image = array();
  $channel = array();

  // Parse the data.
  $xml_parser = drupal_xml_parser_create($data);
  xml_set_element_handler($xml_parser, 'aggregator_element_start', 'aggregator_element_end');
  xml_set_character_data_handler($xml_parser, 'aggregator_element_data');

  if (!xml_parse($xml_parser, $data, 1)) {
    watchdog('mymodule', 'The feed from %site seems to be broken, due to an error "%error" on line %line.', array('%site' => $feed->title, '%error' => xml_error_string(xml_get_error_code($xml_parser)), '%line' => xml_get_current_line_number($xml_parser)), WATCHDOG_WARNING);
    drupal_set_message(t('The feed from %site seems to be broken, because of error "%error" on line %line.', array('%site' => $feed->title, '%error' => xml_error_string(xml_get_error_code($xml_parser)), '%line' => xml_get_current_line_number($xml_parser))), 'error');
    return FALSE;
  }
  xml_parser_free($xml_parser);

  // We reverse the array such that we store the first item last, and the last
  // item first. In the database, the newest item should be at the top.
  $items = array_reverse($items);

  // Initialize items array.
  $feed->items = array();
  foreach ($items as $item) {

    // Prepare the item:
    foreach ($item as $key => $value) {
      $item[$key] = trim($value);
    }

    // Resolve the item's title. If no title is found, we use up to 40
    // characters of the description ending at a word boundary, but not
    // splitting potential entities.
    if (!empty($item['title'])) {
      $item['title'] = html_entity_decode($item['title'], ENT_QUOTES, 'UTF-8');
    }
    elseif (!empty($item['description'])) {
      $item['title'] = preg_replace('/^(.*)[^\w;&].*?$/', "\\1", truncate_utf8($item['description'], 40));
    }
    else {
      $item['title'] = '';
    }

    // Resolve the items link.
    if (!empty($item['link'])) {
      $item['link'] = $item['link'];
    }
    else {
      $item['link'] = $feed->link;
    }

    // Atom feeds have an ID tag instead of a GUID tag.
    if (!isset($item['guid'])) {
      $item['guid'] = isset($item['id']) ? $item['id'] : '';
    }

    // Atom feeds have a content and/or summary tag instead of a description tag.
    if (!empty($item['content:encoded'])) {
      $item['description'] = $item['content:encoded'];
    }
    elseif (!empty($item['summary'])) {
      $item['description'] = $item['summary'];
    }
    elseif (!empty($item['content'])) {
      $item['description'] = $item['content'];
    }

    // Try to resolve and parse the item's publication date.
    $date = '';
    foreach (array('pubdate', 'dc:date', 'dcterms:issued', 'dcterms:created', 'dcterms:modified', 'issued', 'created', 'modified', 'published', 'updated') as $key) {
      if (!empty($item[$key])) {
        $date = $item[$key];
        break;
      }
    }

    $item['timestamp'] = strtotime($date);

    if ($item['timestamp'] === FALSE) {
      $item['timestamp'] = aggregator_parse_w3cdtf($date); // Aggregator_parse_w3cdtf() returns FALSE on failure.
    }

    // Resolve dc:creator tag as the item author if author tag is not set.
    if (empty($item['author']) && !empty($item['dc:creator'])) {
      $item['author'] = $item['dc:creator'];
    }

    $item += array('author' => '', 'description' => '');

    // Store on $feed object. This is where processors will look for parsed items.
    $feed->items[] = $item;
  }

  return TRUE;
}
Irene Meisel’s picture

Status: Needs review » Reviewed & tested by the community

tested patch #110 it worked. no fatal error on ubuntu. new test ran + passed.

Irene Meisel’s picture

Issue summary: View changes

  • Commit 478d1a0 on 7.x by David_Rothstein:
    Issue #61456 by Albert Volkman, jeffschuler, David_Rothstein, jdefay,...
David_Rothstein’s picture

Status: Reviewed & tested by the community » Fixed

Committed #110 to 7.x - thanks!

(And I switched t() to format_string() in the test assertion message on commit, since that was changed everywhere in core a long time ago.)

I'm really unclear if there's any functional bugs left to fix here (especially in Drupal 6) but someone can reopen/retitle or create a new issue if there is.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

gojensen’s picture

Maybe I missed something, this still didn't work for me using a facebook page feed... Had to replace the check_plain calls with aggregator_filter_xss as mentioned in post #25. This was running on v7.31. (I could see everything from patch #110 was added, but didn't seem to resolve this specific issue...)

Another problem I have is that when I import the feeds "something" parses the content/description and changes all urls from absolute urls (i.e. http://facebook.com/ ...) to relative paths (i.e. /pagename/photos/image.jpg)... so when a user reads the post and wants to click a link or tiny image in the post they get a drupal error about the site not being found. Anyone know WHERE the scripts strips the url?

gojensen’s picture

Status: Closed (fixed) » Needs review
Liam Morland’s picture

Status: Needs review » Active

The "needs review" status is for issues that have a patch which might fix the issue.

MediaFormat’s picture

Seems fixed for the <item><title>, but not for the <channel><title>

  • Dries committed a084058 on 8.3.x
    - Patch #61456 by jeffschuler, edmund.kwok, jdefay, David_Rothstein,...
  • Dries committed a775f05 on 8.3.x
    - Patch #61456 by jeffschuler, edmund.kwok, jdefay, David_Rothstein,...
  • Dries committed d78f6a6 on 8.3.x
    - Patch #61456 by jeffschuler, edmund.kwok, jdefay, David_Rothstein,...

  • Dries committed a084058 on 8.3.x
    - Patch #61456 by jeffschuler, edmund.kwok, jdefay, David_Rothstein,...
  • Dries committed a775f05 on 8.3.x
    - Patch #61456 by jeffschuler, edmund.kwok, jdefay, David_Rothstein,...
  • Dries committed d78f6a6 on 8.3.x
    - Patch #61456 by jeffschuler, edmund.kwok, jdefay, David_Rothstein,...

  • Dries committed a084058 on 8.4.x
    - Patch #61456 by jeffschuler, edmund.kwok, jdefay, David_Rothstein,...
  • Dries committed a775f05 on 8.4.x
    - Patch #61456 by jeffschuler, edmund.kwok, jdefay, David_Rothstein,...
  • Dries committed d78f6a6 on 8.4.x
    - Patch #61456 by jeffschuler, edmund.kwok, jdefay, David_Rothstein,...

  • Dries committed a084058 on 8.4.x
    - Patch #61456 by jeffschuler, edmund.kwok, jdefay, David_Rothstein,...
  • Dries committed a775f05 on 8.4.x
    - Patch #61456 by jeffschuler, edmund.kwok, jdefay, David_Rothstein,...
  • Dries committed d78f6a6 on 8.4.x
    - Patch #61456 by jeffschuler, edmund.kwok, jdefay, David_Rothstein,...

  • Dries committed a084058 on 9.1.x
    - Patch #61456 by jeffschuler, edmund.kwok, jdefay, David_Rothstein,...
  • Dries committed a775f05 on 9.1.x
    - Patch #61456 by jeffschuler, edmund.kwok, jdefay, David_Rothstein,...
  • Dries committed d78f6a6 on 9.1.x
    - Patch #61456 by jeffschuler, edmund.kwok, jdefay, David_Rothstein,...
poker10’s picture

Status: Active » Fixed
Issue tags: -Needs backport to D7

Moving back to Fixed. If there are any problems left, please create a new issue (feel free to link it here). Thanks.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.