When supplied with an RSS title that includes HTML-encoded special characters, the aggregator module converts the leading ampersand to its HTML code, so the special character code is displayed instead of the desired special character. For example, the entity for ™ (such as &trade;) is converted to its ampersand-escaped form (&amp;trade;) and displayed as literal text.

As an untested fix, on line 320 of aggregator.pages.inc I've used html_entity_decode() to decode special characters before check_plain() encodes them.
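The effect described above can be reproduced outside Drupal. This is only a rough sketch in Python, not the module's code: html.escape() stands in for check_plain(), html.unescape() stands in for html_entity_decode(), and the sample title and function names are made up for illustration.

```python
import html


def check_plain_like(text):
    # Stand-in for Drupal's check_plain(): escape HTML special characters.
    return html.escape(text, quote=True)


def decode_then_escape(text):
    # The proposed workaround: decode entities first, then escape for output.
    return check_plain_like(html.unescape(text))


# Hypothetical item title that still contains an HTML entity.
title = "Acme&trade; Widgets"

escaped = check_plain_like(title)
fixed = decode_then_escape(title)

print(escaped)  # Acme&amp;trade; Widgets  (the browser shows "&trade;" as text)
print(fixed)    # Acme™ Widgets
```

Decoding before escaping keeps the output safe, because the escape step still runs last on whatever the decode step produced.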

Comments

dcam’s picture

Version: 7.27 » 7.x-dev
Priority: Minor » Normal
Issue tags: +Needs steps to reproduce

Thanks for the bug report! Unfortunately, I can't reproduce the issue. Could you provide us with steps to make this happen on a clean install of 7.x, please?

I've tried putting characters that you say are double-encoded in the feed title and in item titles. It didn't fail in either case. I wasn't sure which kind of title you meant. The line you reference is in template_preprocess_aggregator_item(). Despite the fact that the variable you're decoding is named "feed_title", it's actually an item title. So yeah, that's misleading. I think I'll check to see if they're changing that in 8.x. Anyway, because of that I can't be certain which kind of title has the issue.

I'm checking 8.x for this bug too.

Dave Alitz’s picture

I'm referring to the item title.
Here are the steps I followed:

In the output I see:

The underlying HTML code for the item is:

<li><a href="http://articlefeeds.nasdaq.com/~r/nasdaq/symbols/~3/7XPrchs7DC8/see-how-caterpillar-ranks-among-analysts-top-dow-30-picks-cm367514">See How Caterpillar Ranks Among Analysts&amp;#39; Top Dow 30 Picks</a>
</li>

I noticed that when I examined the HTML with Google's Dev Tools, it decoded the &amp; and the HTML appeared to be correct. When I used "View page source", I could see the &amp;.

Ubuntu 14.04 LTS, Apache 2.4.7, PHP 5.5.9, MySQL 5.5.37

Dave Alitz’s picture

I also added html_entity_decode() to line 722 of aggregator.module, in theme_aggregator_block_item(), to fix the item titles in blocks.

dcam’s picture

The item title you're talking about is already double-encoded on the feed you're downloading. Drupal isn't doing it. Is there another example of Drupal double-encoding or is that the only one?

Dave Alitz’s picture

I assumed that the data in the database was raw data from the feed and didn't look back at the feed source. (Perhaps the schema could indicate that some pre-processing has occurred.) I haven't been able to find an instance which isn't double-encoded in the source.

I guess the bug is at the feed source, but perhaps Drupal could handle this a bit better. I've gotten double-encoding from two of the three sources I aggregate, so I expect it's a fairly common occurrence. Perhaps the titles should be run through HTML entity decoding repeatedly until no more replacements can be made, instead of just once.
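The "decode until nothing changes" idea might look something like the following. This is a sketch in Python rather than a patch: html.unescape() stands in for html_entity_decode(), and decode_fully() and max_passes are hypothetical names. The pass limit is an assumption added so that the loop always terminates.

```python
import html


def decode_fully(text, max_passes=5):
    # Decode repeatedly until a pass makes no further change,
    # or until the (assumed) pass limit is reached.
    for _ in range(max_passes):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text


# Item title as it appears in the page source quoted earlier in this thread.
double_encoded = "See How Caterpillar Ranks Among Analysts&amp;#39; Top Dow 30 Picks"
clean = decode_fully(double_encoded)
print(clean)  # See How Caterpillar Ranks Among Analysts' Top Dow 30 Picks
```

The result would still have to be escaped again before output, as check_plain() does now.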

Since it handles well-formatted feeds properly, I guess this becomes more of a feature request than a bug.

dcam’s picture

I was surprised to see that Drupal doesn't store the raw feed data. In fact, I didn't know what you were talking about until I looked at my own database contents. I dug into the code and discovered that Drupal doesn't explicitly decode the entities. PHP's XML parser (see xml_parser_create()) does it. It decodes the five predefined XML entities, which include ampersands and single and double quotes. In this case, I think Drupal is doing its best to store what it believes is the raw data. Doing additional decoding would violate that.
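To make the sequence concrete, here is a small sketch of the two steps. It uses Python's xml.etree.ElementTree as a stand-in for PHP's XML parser and html.escape() as a stand-in for check_plain(), so it illustrates the behaviour rather than Aggregator's actual code. The feed fragment is an assumption, reconstructed from the item title quoted earlier in the thread.

```python
import html
import xml.etree.ElementTree as ET

# Assumed feed fragment: the title is already double-encoded at the source.
feed_item = "<item><title>Analysts&amp;#39; Top Dow 30 Picks</title></item>"

# Step 1: the XML parser decodes exactly one level of entities.
stored = ET.fromstring(feed_item).findtext("title")

# Step 2: output escaping re-encodes the ampersand that is left over.
rendered = html.escape(stored)

print(stored)    # Analysts&#39; Top Dow 30 Picks
print(rendered)  # Analysts&amp;#39; Top Dow 30 Picks
```

The rendered string matches the feed source character for character, which is why it looks as though Drupal did the double-encoding.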

As a result, I don't think this would ever be added. Your best bet is to contact the feed publisher to let them know there's a problem with their feed. My workplace occasionally has to do that with the feeds they import, and we've found that most publishers like to know that sort of thing. Alternatively, you could use Aggregator's API to write a custom module with a parser that decodes HTML entities in the title as many times as you like.

Of course, I'm not a maintainer of Aggregator. You can bump the issue version to 8.x-dev and see what they say. Changes, especially new features, must be added to 8.x first.

laughnan’s picture

I wonder if that is the same thing causing these Craigslist feeds to appear with some weird HTML encoding (http://www.pdxrestore.org/shop).

Dave Alitz’s picture

Yes, you're experiencing the same problem. The craigslist feed you listed has item titles like <title><![CDATA[Places to Find Recycled &amp; Used Material (Metro Areas)]]></title>. Unfortunately, the RSS specification just describes the item title as a string. It doesn't specify whether it should be plain text or some other format.
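The CDATA wrapper matters here, because an XML parser leaves everything inside a CDATA section untouched. Here is a quick sketch, using Python's xml.etree.ElementTree as a stand-in for the PHP parser. The second fragment, without CDATA, is a made-up variant for comparison.

```python
import xml.etree.ElementTree as ET

# Title as published in the craigslist feed (wrapped in CDATA).
cdata_item = (
    "<item><title><![CDATA[Places to Find Recycled &amp; Used Material "
    "(Metro Areas)]]></title></item>"
)
# Hypothetical variant of the same title without the CDATA wrapper.
plain_item = (
    "<item><title>Places to Find Recycled &amp; Used Material "
    "(Metro Areas)</title></item>"
)

cdata_title = ET.fromstring(cdata_item).findtext("title")
plain_title = ET.fromstring(plain_item).findtext("title")

print(cdata_title)  # Places to Find Recycled &amp; Used Material (Metro Areas)
print(plain_title)  # Places to Find Recycled & Used Material (Metro Areas)
```

So with CDATA the literal text "&amp;" reaches the aggregator, and escaping it for output then produces the double-encoded result.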

laughnan’s picture

@Dave So this is a craigslist feed issue? Fascinating. I will have to make some modifications to that RSS feed then.

Dave Alitz’s picture

It's more a deficiency in the RSS specification. The Atom specification improves on RSS by explicitly declaring whether the title is plain text or HTML. On one hand, there's an argument to be made that the aggregator module should try to test for various formats. On the other hand, if you assume that whenever you find HTML you should display it as HTML, you'll break plain text titles about HTML coding.
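A short sketch of why the explicit declaration helps. Atom's type attribute on a title can be "text", "html" or "xhtml". The function below is hypothetical and simplified: it only decodes entities for "html", ignores tags, and does not handle "xhtml" at all.

```python
import html


def title_to_text(value, title_type="text"):
    # With an explicit type there is no need to guess:
    # only titles declared as HTML get their entities decoded.
    if title_type == "html":
        return html.unescape(value)
    return value


# A plain text title that is itself about HTML coding.
title = "Why &amp; is the right way to write an ampersand"

as_text = title_to_text(title, "text")
as_html = title_to_text(title, "html")

print(as_text)  # Why &amp; is the right way to write an ampersand
print(as_html)  # Why & is the right way to write an ampersand
```

RSS gives no such hint, so a parser that always decodes would turn the first title into the second and change its meaning.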

At any rate, there doesn't seem to be much enthusiasm for the Aggregator module in Drupal 8. It looks like it will be in 8 because the feature set is already frozen; but I wouldn't hope for any expanded functionality. Some have suggested that the Feeds module is a better answer.

izmeez’s picture

Great information in this thread and interesting questions:

Will Aggregator survive beyond Drupal 8?

and

Should Aggregator parse and correct feeds? Ask in the D8 queue, as it would be a new feature.

Maybe I am mistaken, but it seems to me that Aggregator serves a useful niche that more powerful modules like Feeds do not. Not adding feed items as nodes has some advantages.

I see now that D8 Aggregator feeds are entities, following a commit to D8 two years ago: #293318-164: Convert Aggregator feeds into entities

Maybe there is some continued interest in the future of Aggregator?