I have a site with about 25 feeds set up using the feedapi module with the SimplePie parser. I keep getting duplicate nodes created on successive runs of cron. It doesn't happen every time, but it happens a lot of the time. I also have the box checked for checking for duplicates within the feed, so that doesn't seem to be working correctly.

The nodes are created with a pathauto alias, so if the path already exists, pathauto just affixes "-0", "-1", etc. at the end.

I'm not sure if this is related, but I'm also seeing some strange behavior where certain feed nodes are promoted to the front page for no apparent reason. I'm also troubleshooting another issue where I keep hitting the max execution time when running cron, so that may or may not be connected.

Thanks


Comments

Aron Novak’s picture

Status: Active » Postponed (maintainer needs more info)

Can you check whether it also happens without pathauto?

bschoudel’s picture

I can certainly try that.

After a few days of monitoring, I'm noticing a pattern: the duplicates only seem to be created where I have multiple feeds from the same source. For example, I may have an ESPN feed on "Big Ten" basketball and an ESPN feed on "NCAA Basketball". It's on these feeds, or so it appears, that the duplicates are generated.

bschoudel’s picture

I have turned off pathauto and confirmed that I am still getting duplicate nodes on certain feeds.

bschoudel’s picture

Ok, after further review it appears to be a malformed feed. After an article is posted, if the feed runs the next day, the very same article is posted again with a new URL containing the date. Given that feedapi's duplicate detection compares the GUID and URL for exact matches, it makes sense that a new node is created.

I wrote a local module that searches for a substring of the pathauto-generated URL in the hook_nodeapi 'insert' case. If it finds a match, it issues a node_delete() on the node currently being saved.

This seems to have resolved my problem.
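For anyone trying the same workaround, here is a rough sketch of the approach described above. The module name "dedupe", the content type check, and the exact query are assumptions, not the poster's actual code; adapt them to your setup. It strips a trailing "-N" suffix from the freshly generated alias and deletes the new node if an older node already owns the base alias.

```php
/**
 * Hypothetical Drupal 6 implementation of hook_nodeapi().
 * On insert, delete the new node if its pathauto alias is just an
 * existing alias with a "-0", "-1", ... suffix tacked on.
 */
function dedupe_nodeapi(&$node, $op, $a3 = NULL, $a4 = NULL) {
  if ($op != 'insert' || $node->type != 'feed_item') {
    return;
  }
  // Alias pathauto just generated for this node, e.g. "news/some-story-0".
  $alias = drupal_get_path_alias('node/' . $node->nid);
  // Strip a trailing "-<number>" suffix to get the base alias.
  $base = preg_replace('/-\d+$/', '', $alias);
  if ($base != $alias) {
    // Does an older node already own the base alias?
    $existing = db_result(db_query(
      "SELECT src FROM {url_alias} WHERE dst = '%s'", $base));
    if ($existing && $existing != 'node/' . $node->nid) {
      node_delete($node->nid);
    }
  }
}
```

Note that node_delete() inside an insert hook is a blunt instrument; it runs the full deletion path for a node that was saved moments earlier, which is presumably why the poster calls this a local workaround rather than a fix.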

Thanks

pillarsdotnet’s picture

I solved the same problem by writing a parser module, using common_syndication_parser as a starting point. Added functionality includes:

  • Duplicates are resolved by following URL redirects and only storing the final, non-redirected URL.
  • It fetches content from the source site, converts relative URLs to absolute, and filters it through check_markup() before caching.
  • It only shows the paragraphs that contain searched-for terms, in an attempt to satisfy "fair use" standards.

Here's the function that follows redirects to get the "real" URL and avoid duplicate articles.

function _direct_parser_realurl($url) {
  static $curlopts = array(
    CURLOPT_AUTOREFERER => TRUE,
    CURLOPT_COOKIESESSION => TRUE,
    CURLOPT_FOLLOWLOCATION => TRUE,
    CURLOPT_HEADER => TRUE,          // include response headers in the output
    CURLOPT_NOBODY => TRUE,          // HEAD request; we only need the headers
    CURLOPT_RETURNTRANSFER => TRUE,
    CURLOPT_USERAGENT => 'Mozilla/4.0',
  );
  $ch = curl_init($url);
  curl_setopt_array($ch, $curlopts);
  $output = curl_exec($ch);
  curl_close($ch);
  // The output contains the headers of every response in the redirect chain;
  // the last Location: header points at the final URL.
  foreach (explode("\r\n", $output) as $line) {
    if (!strncmp($line, 'Location: ', 10)) {
      $url = substr($line, 10);
    }
  }
  return $url;
}

Compare the source feed with the parsed result.

(But be kind; the Coyle site is very much under construction; the theme is obviously not fleshed out.)

pillarsdotnet’s picture

The above strategy still occasionally produced duplicate stories, when the story was in fact published at multiple URLs.

Filtering unnecessary arguments from the URL helped but did not eliminate the problem.

My latest strategy is to replace the GUID with an MD5 sum of the extracted text. That seems to be working for now.
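The content-hash idea can be sketched as follows. This is not the poster's actual code; the function name is made up, and the normalization steps (strip markup, lowercase, collapse whitespace) are assumptions that make the hash robust against trivial differences between copies of the same story.

```php
<?php
// Sketch of a content-hash GUID. Two copies of the same story that differ
// only in markup, case, or whitespace hash to the same value.
function _example_content_guid($text) {
  $text = strip_tags($text);                  // drop HTML markup
  $text = strtolower(trim($text));            // case-insensitive compare
  $text = preg_replace('/\s+/', ' ', $text);  // collapse runs of whitespace
  return md5($text);
}
```

The trade-off the poster mentions below applies here too: hashing the extracted body is slower than comparing the feed's reported GUID/URL, but it is immune to feeds that republish the same article under new URLs.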

cardentey’s picture

A small variation:

function _direct_parser_realurl($url) {
  static $curlopts = array(
    CURLOPT_AUTOREFERER => TRUE,
    CURLOPT_COOKIESESSION => TRUE,
    CURLOPT_HEADER => TRUE,
    CURLOPT_NOBODY => TRUE,
    CURLOPT_RETURNTRANSFER => TRUE,
    CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1',
  );
  $ch = curl_init($url);
  curl_setopt_array($ch, $curlopts);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
  curl_setopt($ch, CURLOPT_MAXREDIRS, 10);  // follow at most 10 redirects
  curl_exec($ch);

  // Let curl report the final URL instead of parsing Location: headers.
  $newurl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
  if (!$newurl) {
    $newurl = $url;
  }
  curl_close($ch);

  return $newurl;
}

function mymodule_feedapi_after_parse($feed) {
  for ($i = 0; $i < count($feed->items); $i++) {
    $feed->items[$i]->options->original_url = _direct_parser_realurl($feed->items[$i]->options->original_url);
  }
}

Saludos,
Roberto
http://www.sobrefamosos.com

pillarsdotnet’s picture

FileSize
10.45 KB

Thanks for the improvement. I'm attaching my module, in case anyone is interested.

The current version depends on a single regex which is used to select applicable paragraphs from all newsfeeds. Not everyone would prefer this implementation, but it works for me.

Summit’s picture

Hi,

I have had this situation for a long time already; see http://drupal.org/node/251908
Could a solution for this (maybe one of the above) please be added to feedapi 6?
Thanks a lot in advance for considering this!

Greetings,
Martijn

pillarsdotnet’s picture

@Summit:

Not the same problem.

Your issue has to do with pathauto generating duplicate aliases. That doesn't happen on my installation -- the second one gets "-1" added to the name, the third gets "-2", etc. I don't know what is causing your problem, but my hunch is that it has to do with your pathauto installation, not with feedapi.

My issue is different. I want to eliminate duplicate article summaries, even when they come from distinct feed items. To do this, I made a new parser module and rewrote its criteria for duplicate detection.

The original code compares the URL and GUID as reported by the feed.

My code compares a hash of the excerpt of the article as shown on my site.

The original is relatively quick and compliant with the RSS standard.

Mine is relatively slow but immune from lousy implementations of the standard.

I doubt that the feedapi author will choose to incorporate my ideas. But if I get a second paying client who wants a newsfeed, I might go ahead and register a new parser module. Meanwhile, you have a fairly recent snapshot of my working code (above). Feel free to use and improve it. Or pay someone to do it for you.

Aron Novak’s picture

The idea is quite good; sometimes it's useful to consider the actual content of the feed item.
However, as I checked the module source code, it could re-use the common syndication functions instead of copy-pasting them. I imagine this module as a common syndication wrapper plus the small modification you made to the GUID computation.

pillarsdotnet’s picture

Yup. Mine is a horrible ugly hack; it Works For Me (tm), but I'd have to clean it up quite a bit before publishing it as a module. Meanwhile, anybody who wants to use the ideas to make something better is certainly welcome.

Aron Novak’s picture

Anyway, I'm sure it isn't a lot of work to get to a clean module.

virtualdrupal’s picture

Status: Postponed (maintainer needs more info) » Active

I'm experiencing the same issue: if I have two separate feeds from the same source, like "USAToday World Politics" and "USAToday World Health", any time the same article appears in both feeds, I get duplicates...

I'm processing hundreds of feeds, though, so this regex solution might send my cron into a timeout frenzy... any advice? Currently using the SimplePie parser.

I hope it's OK that I changed the status... this is actually pretty critical.

Aron Novak’s picture

Status: Active » Postponed (maintainer needs more info)

#14:
Have you turned on the "Check for duplicates on all feeds" option? If not, do that. If you have, please supply the exact feed URLs where you experience this problem.

virtualdrupal’s picture

Aron, I was wrong, it's not duplicate nodes, it's duplicates of the same node within a view (since each feed is from a different taxonomy), enabling the "distinct node" feature in views2 appears to catch it.

Aron Novak’s picture

Status: Postponed (maintainer needs more info) » Fixed

enabling the "distinct node" feature in views2 appears to catch it

If I understand the situation correctly, this issue is fixed.

virtualdrupal’s picture

Status: Fixed » Active

Well I guess the "distinct:yes" solution caught some of them but not all.

I'm actually experiencing both nodes being created twice from the same feed (with -0 appended to the duplicate's URL) and two different feeds being credited as parents of the same node.

Usually it's the second issue: two separate feeds that carry some identical stories both get credited as the parent, so there is only one node, with two parents. This causes two problems. Views seems to pick it up twice, even though it's technically only one node and "distinct: yes" is configured. So when I'm on a category page (example.com/taxonomy/term/3) with a views block generating all relevant nodes by argument arg(2), I get duplicate results for any article with multiple parents.

The same issue causes a second problem. I use if (!$node->links['feedapi_feed']['title']): in my node.tpl.php to quote the source, but when a node has two parents, I'm not sure how to foreach my way through the array, since $node->links['feedapi_feed']['title'] no longer exists.

Example :: (Typical node)

                    [links] => Array
                        (
                            [feedapi_feed] => Array
                                (
                                    [title] => Feed: Reuters
                                    [href] => node/6449
                                )

                            [feedapi_original] => Array
                                (
                                    [title] => Original article
                                    [href] => http://feeds.reuters.com/~r/reuters/worldNews/~3/1w3sYSzcQqo/idUSTRE55T0LQ20090701
                                )

                        )

Example :: (Problem Node with multiple parents)

                    [links] => Array
                        (
                            [feedapi_feed_263] => Array
                                (
                                    [title] => Feed: Newsweek
                                    [href] => node/263
                                )

                            [feedapi_feed_6526] => Array
                                (
                                    [title] => Feed: MSNBC.com
                                    [href] => node/6526
                                )

                            [feedapi_original] => Array
                                (
                                    [title] => Original article
                                    [href] => http://www.newsweek.com/id/204762
                                )

                        )
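Given link arrays shaped like the two examples above, here is a sketch of one way to collect every parent feed's title regardless of which shape you get (the helper name is made up): match any key that starts with "feedapi_feed", which covers both the plain "feedapi_feed" key and the numbered "feedapi_feed_NID" variants, while skipping "feedapi_original".

```php
<?php
// Collect the titles of all parent-feed links from a node's $links array.
// Works for both the single "feedapi_feed" key and "feedapi_feed_NID" keys.
function _example_feed_sources($links) {
  $sources = array();
  foreach ($links as $key => $link) {
    if (strpos($key, 'feedapi_feed') === 0) {
      $sources[] = $link['title'];  // e.g. "Feed: Newsweek"
    }
  }
  return $sources;
}
```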

The two feeds that make up the problem node in this example are
http://www.newsweek.com/id/43805/output/rss
&
http://www.msnbc.msn.com/id/3032506/device/rss/rss.xml

So in some cases, when both problems combine, I get the same article three times: one node with two parents, plus another node with -0 at the end when one of those parents happened to create the node twice, all of which show up in my views block.

I see the duplicate node (with -0 appended to the duplicate) less often than the multiple parent issue.

Let me know if you need any other details to help troubleshoot, pretty critical problem at the moment.

Best Regards

virtualdrupal’s picture

I should add that I'm running the FeedAPI dev version, updated within the last 7 days, so if anything significant related to duplicates has happened recently, let me know and I'll try another upgrade.

virtualdrupal’s picture

To answer your question from #15, "check for duplicates on all feeds" is selected on every feed.

Eric_A’s picture

I'm actually experiencing nodes being created both twice from the same feed (with -0 being appended on to the duplicate URL)

I'm looking at the 5.x-1.5 code, and the interesting thing to me is that feedapi_inherit_feedapi_item() always returns NULL when $op is 'unique' (or any op, for that matter). The code in _feedapi_invoke_refresh() doesn't distinguish between FALSE and NULL, so in some scenarios we run both the code for a unique item and the code for a non-unique item.

Unfortunately I cannot investigate further right now.

The interesting scenario in _feedapi_invoke_refresh() seems to be this one:

A new item is processed and feedapi_node reports it as unique (so it is saved), and then feedapi_inherit reports it as non-unique (updating an item that may not yet exist everywhere?).

Of course some people may have these processors running in the reverse order.

EDIT:
I have taken another quick look, and it appears that feedapi_inherit causes no harm. It reports a duplicate but does not insert anything itself, nor does it appear to trigger some other insert.

Eric_A’s picture

The 5.x-1.5 _feedapi_node_unique() checks for a duplicate URL or GUID within the same feed, but _feedapi_node_update() uses the nid of the first duplicate URL or GUID across all feeds... This wrong combination of feed and feed item is then used when deleting from and inserting into the list of existing items...

EDIT: _feedapi_update() was a mistake, and I fixed it to the correct name: _feedapi_node_update().

leducmills’s picture

I'm having similar issues with my site: both the creation of duplicate nodes (with -0, -1, etc.) and the appearance of duplicate feeds in the admin/content/feeds menu whenever I edit and save a feed.

The big issue for me is that the feed items aren't always showing all their content. Sometimes the body is missing, and sometimes it isn't shown by default, but when I go in to edit it, it's there, and all I have to do is save it to make it show. I'm lost; I've been trying to fix this for a few days now.

Setup:

Drupal 6.14
FeedAPI 6.x-1.9-beta1
FeedAPI Node 6.x-1.9-beta1
FeedAPI Mapper 6.x-2.0-alpha3
Common syndication parser 6.x-1.9-beta1
FeedAPI Inherit 6.x-1.9-beta1
FeedAPI Taxonomy Compare 6.x-1.4

Eric_A’s picture

Status: Active » Needs work
FileSize
2.46 KB

Follow-up on #22. I know it's not against HEAD, but perhaps this 5.x-1.5 patch gets someone started who has time and deeper knowledge of FeedAPI.

EDIT: #324797: Duplicate items when update items enabled http://drupal.org/node/324797#comment-1607302
All that is missing in that issue is a complete solution for 5.x.

deltab’s picture

subscribing, same issue with 6.x

Anonymous’s picture

subscribing, same issue with 6.x

with simple pie as my XML parser of choice

jannalexx’s picture

Version: 6.x-1.5 » 6.x-1.9-beta2

subscribing, same issue with 6.x

Eric_A’s picture

A fixed (duplicate) 6.x issue is here: #324797: Duplicate items when update items enabled

Summit’s picture

If it was fixed in June in 6.x, how can it still occur?

jannalexx’s picture

Unfortunately this was never fixed for me. I see fuzzy behavior with duplicated content on cron runs, on nodes made by feedapi / SimplePie parser / cache lifetime in seconds: 30.

Eric_A’s picture

Are any of you not using Drupal 6.14, or not using Poormanscron 1.x? If so, see if updating to 6.15 and 2.x fixes your problems.

jannalexx’s picture

This has nothing to do with poormanscron, as it can happen by simply refreshing the feed items.
The common syndication parser works without duplicated content (dev version tested), so I had to switch to it from SimplePie.

focal55’s picture

I had duplicate nodes being created with FeedAPI 6.x-1.x-dev + Drupal 6.16. I was trying to modify the FeedAPI importer to use a custom content type when creating nodes from an import. My custom content type seemed to cause the duplicates, because after I changed the setting back to creating new Feed Items, it works and does not create duplicates.

meghs’s picture

Hello focal55, can you clarify the solution you gave in your post? I am facing the same problem, and I have also used a custom content type for my feed items.