There is currently no check for whether the description tag is empty, so it gets overwritten by content:encoded, summary, or content (checked in that order, whichever is present first).

The comment "// Atom feeds have a content and/or summary tag instead of a description tag." implies mutual exclusion. However, I have come across some feeds with both tags (description containing a summary, and content:encoded containing the whole long article).

I propose a check before overwriting:

aggregator.parser.inc: function aggregator_parse_feed(&$data, $feed)

126
+ if (empty($item['description'])) {
    if (!empty($item['content:encoded'])) {
      $item['description'] = $item['content:encoded'];
    }
    elseif (!empty($item['summary'])) {
      $item['description'] = $item['summary'];
    }
    elseif (!empty($item['content'])) {
      $item['description'] = $item['content'];
    }
+ }
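
The same precedence can be sketched as a standalone function, which makes it easier to try outside the module. This is only an illustration: example_pick_description() is a hypothetical helper, not something that exists in aggregator.parser.inc, and it assumes $item is a plain array keyed by tag name as in the snippet above.

```php
<?php
/**
 * Hypothetical helper (not part of aggregator.parser.inc).
 * Returns the text that would be stored as the item's description.
 */
function example_pick_description(array $item) {
  // Keep a non-empty description untouched.
  if (!empty($item['description'])) {
    return $item['description'];
  }
  // Otherwise fall back in the same order as the existing parser code.
  foreach (array('content:encoded', 'summary', 'content') as $key) {
    if (!empty($item[$key])) {
      return $item[$key];
    }
  }
  // Nothing usable was found.
  return '';
}

// Assumed sample data: an item carrying both tags.
$item = array(
  'description' => 'Short excerpt.',
  'content:encoded' => '<p>The whole long article.</p>',
);
echo example_pick_description($item), "\n";
```

One thing to keep in mind: PHP's empty() also treats the string "0" as empty, so a description consisting of just "0" would still be replaced by a fallback.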

Comments

conan_payne commented:

I've found WordPress feeds appear to contain both description and content:encoded.

For example: http://ma.tt/feed/

Open the URL in Firefox (just to display it formatted using the feed handler), and the description will be shown correctly. Aggregator, however, ends up saving the full content from content:encoded instead.

conan_payne commented:

Looking at some references on the RSS and Atom formats: if content:encoded exists, it should contain the full content. If a description also exists, then the description element is even more likely to contain the summary/excerpt we want, and without a check it gets overwritten:

http://www.rssboard.org/rss-profile#namespace-elements-content-encoded

The content:encoded element defines the full content of an item (optional). This element has a more precise purpose than the description element, which can be the full content, a summary or some other form of excerpt at the publisher's discretion.

http://www.atomenabled.org/developers/syndication/#contentElement

either contains, or links to, the complete content of the entry.

http://www.xml.com/pub/a/2004/04/07/dive.html

In RSS 0.92, RSS 0.93, RSS 0.94, and RSS 2.0, //item/description is sometimes a summary and sometimes full entry content. There is no way to distinguish programmatically whether a description is a summary or full content. The existence of an additional content element in the same entry (such as content:encoded) is a good predictor that description is a summary, but it's not conclusive. And many feeds, such as the default feeds produced by Movable Type, have a summary in the description element but no full content anywhere.
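
To make the "good predictor" heuristic from that last quote concrete, here is a minimal sketch that reads an item carrying both elements. It is an assumption-laden illustration: the feed below is made up, example_read_item() is a hypothetical name, and it uses SimpleXML for brevity rather than the parser Aggregator itself uses.

```php
<?php
// Hypothetical sample feed; real feeds will differ.
$xml = <<<XML
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>Example</title>
      <description>Short excerpt.</description>
      <content:encoded><![CDATA[<p>The whole long article.</p>]]></content:encoded>
    </item>
  </channel>
</rss>
XML;

/**
 * Hypothetical helper: returns the description and content:encoded values
 * of the first item in an RSS 2.0 document.
 */
function example_read_item($xml) {
  // LIBXML_NOCDATA merges CDATA sections into plain text nodes.
  $feed = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOCDATA);
  $item = $feed->channel->item[0];
  // Namespaced children are only reachable through children($namespace_uri).
  $content = $item->children('http://purl.org/rss/1.0/modules/content/');
  return array(
    'description' => (string) $item->description,
    'content:encoded' => (string) $content->encoded,
  );
}

$parsed = example_read_item($xml);
// Heuristic from the quote: both present suggests description is a summary.
// As the quote says, this is a predictor, not a guarantee.
$likely_summary = $parsed['description'] !== '' && $parsed['content:encoded'] !== '';
var_dump($likely_summary);
```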

samuel.sirois commented:

Here is a patch implementing what @conan_payne proposed in the issue summary.

Patch includes new test cases & sample XML file.

Applying the patch gives us the same behaviour as Mozilla Firefox on the new test file (aggregator_test_content_and_description_cohabitation.xml).

samuel.sirois commented:

Status: Active » Needs review

Status: Needs review » Needs work

samuel.sirois commented:

Status: Needs work » Needs review

Changing status to "Needs review" since all tests pass under PHP 5.5 and MySQL 5.5.

Version: 7.22 » 7.x-dev

Core issues are now filed against the dev versions where changes will be made. Document the specific release you are using in your issue comment.