I seem to remember this being addressed somewhere on these forums, but my search came up empty, so forgive me if I have missed a discussion on this. We are pulling about 70 RSS feeds onto our (as-yet-unopened) Drupal 4.6.7 site using the Aggregator2 module (the very latest version that works with 4.6.7). Everything is working fine, except the module has the nasty habit of creating duplicate nodes for the same feed item - sometimes not just once, but two, three, four or even more times.

We have a cron run happening every 12 minutes, and the feeds are set to update at intervals ranging from 15 minutes to three hours, depending on their individual importance. (I don't know if this is fact, but it appears that the faster a feed is set to update, the more duplicates it produces.)

I should point out that all these feeds are from legitimate news organizations - Reuters, CBS News, MSNBC, etc. - and that, according to the logs, the cron seems to be executing and completing correctly. We recently changed the individual feeds from "Update" to non-update, per the module writers' recommendation, to keep server resource use down. Maybe it's my skewed observation and maybe it's chance, but it seems that since we did that the duplicating problem has actually gotten worse. (Does selecting "Update" *force* the feed to check for duplicate items?)

Finally, I want to point out that what results are not duplicate node *listings* but actual separate duplicate nodes, each with a different Drupal node number. I thought I read somewhere here that one of the 4.6 upgrades was supposed to get rid of this problem. Although it's not crucial, I find these duplications extremely annoying - and I'm afraid the public will, too. Does anyone know how we can get this to stop? My thanks in advance --

Comments

kirkcaraway

Are you sure the problem is not with the feeds coming from the news sources? My newsreader picks up duplicates coming from sources like CNN, Washington Post, etc. all the time. Sometimes it's the site editing or updating these stories, which causes their RSS feed to post them again. Sometimes you can see no difference, because the change is on their backend.

Take a look at these feeds in a reader (I use Bloglines.com) and make sure the problem isn't there first.
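If eyeballing 70 feeds in a reader is too tedious, the same check can be roughly scripted. This is only a minimal sketch in Python, assuming plain RSS 2.0 with `<item>`, `<title>` and `<guid>` elements; the function name `find_republished_items` and the sample feed are made up for illustration and have nothing to do with Aggregator2 itself. It lists any title that shows up under more than one GUID, which is what a re-posted or edited story would look like from the publisher's side.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def find_republished_items(rss_xml):
    """Return {title: [guid, ...]} for titles that appear under more than one GUID."""
    root = ET.fromstring(rss_xml)
    guids_by_title = defaultdict(list)
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        # Fall back to <link> when the publisher omits <guid>.
        guid = (item.findtext("guid") or item.findtext("link") or "").strip()
        if guid not in guids_by_title[title]:
            guids_by_title[title].append(guid)
    # Keep only titles that were published under several distinct GUIDs.
    return {t: g for t, g in guids_by_title.items() if len(g) > 1}


# Hypothetical sample feed: "Story A" was re-posted with a new GUID.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>Story A</title><guid>http://example.com/a?rev=1</guid></item>
<item><title>Story A</title><guid>http://example.com/a?rev=2</guid></item>
<item><title>Story B</title><guid>http://example.com/b</guid></item>
</channel></rss>"""

if __name__ == "__main__":
    for title, guids in find_republished_items(SAMPLE_FEED).items():
        print(title, guids)
```

If a real feed turns up entries here, the duplicates are coming from the source rather than from the module.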

Good luck.

PS. Is someone working on making Aggregator2 compatible with 4.7?

conkhead

Thanks, kirk, I will do this. However, I was under the impression that the module looks for the unique GUID number of a feed to decide whether it's already been posted or not. (There's a db table for the GUID numbers, so I assume that's what they're there for. Bad assumption on my part?) All the major publishers seem to use a GUID numbering system, so I assume this should take care of the problem. (Or are you saying that the publishers put out duplicate items with *different* GUID numbers? Oh, the horror... :-)
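To make the GUID idea concrete, here is roughly what I would expect a GUID check to look like. It is a sketch in Python with an in-memory SQLite table, not Aggregator2's actual code (which I have not read closely); the table name `seen_items` and the function names are my own invention. The point is only that if every GUID is recorded per feed and checked before a node is created, the same item cannot become a node twice, no matter how often cron runs.

```python
import sqlite3


def open_store():
    """Create an in-memory table that remembers which GUIDs each feed has delivered."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE seen_items ("
        " feed_id INTEGER NOT NULL,"
        " guid TEXT NOT NULL,"
        " PRIMARY KEY (feed_id, guid))"
    )
    return conn


def import_items(conn, feed_id, items):
    """Record unseen GUIDs and return the titles that would become new nodes."""
    created = []
    for item in items:
        # The primary key makes the database reject a repeated (feed_id, guid)
        # pair; INSERT OR IGNORE turns that rejection into a silent no-op.
        cur = conn.execute(
            "INSERT OR IGNORE INTO seen_items (feed_id, guid) VALUES (?, ?)",
            (feed_id, item["guid"]),
        )
        if cur.rowcount == 1:  # a row was actually inserted, so the item is new
            created.append(item["title"])
    conn.commit()
    return created
```

Under this scheme, importing the same batch twice creates nothing the second time. A publisher that re-issues a story under a *different* GUID would still slip through, which is the case kirk describes.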

Re: your P.S.: We decided to build this upcoming (fricking enormous - maybe too big for its own good... :-) site around Drupal version 4.6, and we do not regret that decision; however, we're already planning a site around 4.7, and it would be great to have a version of Aggregator2 for it. This is a much-needed module, IMHO, that performs a unique task. I also hope it's ported --