Recently, a poll was promoted to drupal's front page. As I am subscribed to drupal's node feed, this poll showed up in my rss reader within an hour. Soon after, it showed up again. And again. And again.

A quick look at the rss feed that drupal was generating confirmed that drupal is not generating guid tags for the feed, which is generally bad.

The RSS 2.0 specification at Harvard Law states in part:

If the guid element has an attribute named "isPermaLink" with a value of true, the reader may assume that it is a permalink to the item, that is, a url that can be opened in a Web browser, that points to the full item described by the element. An example:

<guid isPermaLink="true">http://inessential.com/2002/09/01.php#a2</guid>

So, since my own site runs drupal, and I've noticed odd duplication issues too, I decided to generate the following patch:

Index: includes/common.inc
===================================================================
RCS file: /cvs/drupal/drupal/includes/common.inc,v
retrieving revision 1.491
diff -u -F^f -r1.491 common.inc
--- includes/common.inc 28 Oct 2005 01:06:36 -0000      1.491
+++ includes/common.inc 23 Nov 2005 05:49:19 -0000
@@ -783,6 +783,7 @@ function format_rss_item($title, $link,
   $output = "<item>\n";
   $output .= ' <title>'. check_plain($title) ."</title>\n";
   $output .= ' <link>'. check_url($link) ."</link>\n";
+  $output .= ' <guid isPermaLink="true">'. check_url($link) ."</guid>\n";
   $output .= ' <description>'. check_plain($description) ."</description>\n";
   foreach ($args as $key => $value) {
     if (is_array($value)) {

It's a very simple patch, and I tested it against 4.6.3, against which it performed flawlessly. I have absolutely no reason to suspect that it won't work flawlessly with HEAD also.

CommentFileSizeAuthor
#7 patch_171.05 KBalexmarkley
patch_16737 bytesalexmarkley
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Gabriel R.’s picture

I've noticed the duplication on the Drupal.org site too, and other Drupal sites.

Thanks for creating the patch!

Dries’s picture

  1. $link is not guaranteed to be a permalink; it can be an URL alias, and thus, it is subject to change.
  2. Why would this fix the poll problem? That problem is caused by the fact that the content of the poll item is updated as people vote. Mind to elaborate?
alexmarkley’s picture

$link is not guaranteed to be a permalink; it can be an URL alias, and thus, it is subject to change.

True. And as a matter of fact, if the node is deleted, the same problem will occur. However, if you read the RSS documentation I linked to, it states:

<guid> is an optional sub-element of <item>.

guid stands for globally unique identifier. It's a string that uniquely identifies the item. When present, an aggregator may choose to use this string to determine if an item is new.

<guid>http://some.server.com/weblogItem3207</guid>

There are no rules for the syntax of a guid. Aggregators must view them as a string. It's up to the source of the feed to establish the uniqueness of the string.

That means that the string doesn't have to be a link, or even a permanent one per se. Its only role is to be a string that is unique inside the context of that channel, which allows the reader to distinguish between new items and old-but-updated items.

Note that this, like all aspects of rss, is of limited usefulness. Because of the transient nature of the internet, things come and go frequently. If someone chooses to delete a node or take down their drupal site, there is no rss implementation in the world that can handle that intelligently.

Why would this fix the poll problem? That problem is caused by the fact that the content of the poll item is updated as people vote. Mind to elaborate?

The reason that this fixes this is because, while the contents of the poll item change, the URL to the poll should not. If the URL does not change, and if the news reader is informed that it will not change, (using the <guid> tag), it will not create duplicates of the item. As a matter of fact, some newsreaders (such as NetNewsWire, if my sources are to be trusted) will do a diff on the current version vs. the old version, and will highlight the differences.

(Note that I tested this with a poll on my scratch installation. It works.)

Again, there are cases where this is not very useful to some people. Ie, if you plan on changing aliases all the time, it simply won't help. However, while I can think of situations where my patch won't avoid duplication, I cannot think of a single situation where this patch would negatively impact the support that is already there. (It should never confuse a RSS 2.0 compliant newsreader.) If anything, it only helps to implement more of the standard.

The only two negative aspects to this that I can think of are:

  1. the execution time to add the strings
  2. the extra bandwidth required to transport them

Otoh, I don't honestly believe those are worth considering in this case.

Anyway, If anybody here has some RSS experience and can drop their $0.02, that would rock.

alexmarkley’s picture

Oh, I forgot to mention. In the example that you mentioned - where somebody was mucking about with aliases - the reader should simply see that as a new item, and happily create a duplicate. (With the new, correct URL, of course.)

breyten’s picture

If we're going to put in a GUID, I'd rather have something like <guid isPermaLink="false">$nid@$base_url</guid> -- We know that the nid is unique, and making guid non perma links avoids exposing links of the type $base_url/node/$nid to the end users, which most site admins don't want.

But still this is a matter of fighting symptoms with workarounds -- The real problem is that Drupal's *aggregator* isn't smart enough to see the difference. We should solve it there.

alexmarkley’s picture

I'd rather have something like <guid isPermaLink="false">$nid@$base_url</guid>

Yes, I think I like that better than I like mine. That's very good. Anyway, what's the process for changing a patch once it's already been submitted? :-P

But still this is a matter of fighting symptoms with workarounds -- The real problem is that Drupal's *aggregator* isn't smart enough to see the difference. We should solve it there.

By aggregator, do you mean the bit that bundles the feed? Or the module that parses and displays feeds?

If you mean the part that bundles the feed, I must protest. The only way for that to work would be to prevent the contents of the <description> tag from changing. But by preventing node updates to go out the rss feed, you would put the rss audience at an unfair disadvantage. (For example, all typos would be embedded in the feed until that item slid off the bottom.)

If you mean the part that parses and displays feeds, I must admit that I personally do not use the rss reader that is built-into drupal. If it does not understand guid tags, that should be fixed too. However, this patch does not attempt to tackle that problem.

To summarize, I am simply recommending guid tags in drupal's feeds because it would prevent, among other negative things, all these stupid duplication problems that I (and my users) see every time I tweak my output filter code.

alexmarkley’s picture

FileSize
1.05 KB

New Patch

As I'm sure you've all guessed, I'm new to this community, so I'm not quite sure how things are normally done. So, since I have a new patch that addresses this problem more correctly, I'm just gonna attach it too.

In attempting to implement the above suggestion, I discovered that format_rss_item has no idea of the $nid of the object in question. Upon furthur review, I realized that it is being used to handle objects that are not drupal nodes! (modules/aggregator.module calls it to rebundle external feed items)

Anyway, what I really wanted to do is patch node.module and pass the additional tag in the call to format_rss_item, so that's what I did.

I tested this technique against a front page poll and a 'typo' in a node. My reader did not mark them as new, but correctly identified them as updated.

I think this addresses all of the previous concerns, and actually ends up being more flexible in the long run.

Dries’s picture

malex: you can attach new or improved patches as you just did. I'll try to review it shortly. Thanks for the explanation, and welcome to the community. You're doing great. :-)

walkah’s picture

+1 to the patch in #7 ... and kudos for a nice read of the rss2 spec :) i also like the extras usage... means other feeds could (possibly) include their own guid's for feeds (i knew that system was gonna be helpful :P ).

Dries - I think this looks ok. you?

Dries’s picture

Status: Needs review » Fixed

Committed to HEAD. Thanks.

Anonymous’s picture

Status: Fixed » Closed (fixed)