I have a site with about 25 feeds set up using the feedapi module with the SimplePie parser. I keep getting duplicate nodes created on successive runs of cron. It doesn't happen every time, but it happens a lot of the time. I also have the box checked for checking for duplicates within the feed, so that doesn't seem to be working correctly.

The nodes are created with a pathauto alias, so if the path already exists, pathauto just affixes "-0", "-1", etc. at the end.

I'm not sure if this is related, but I'm also seeing some strange behavior where certain feed nodes are promoted to the front page for no apparent reason. I'm also troubleshooting another issue where I keep hitting the max execution time when running cron, so that may or may not be connected.

Thanks


Comments

Aron Novak’s picture

Status: Active » Postponed (maintainer needs more info)

Can you check whether it also happens without pathauto?

bschoudel’s picture

I can certainly try that.

After a few days of monitoring, I'm noticing a pattern: the duplicates only seem to be created where I have multiple feeds from the same source. For example, I may have an ESPN feed on "Big Ten" basketball and an ESPN feed on "NCAA Basketball". It's on these feeds, or so it appears, that the duplicates are generated.

bschoudel’s picture

I have turned off pathauto and confirmed that I am still getting duplicate nodes on certain feeds.

bschoudel’s picture

Ok, after further review it appears to be a malformed feed. After an article is posted, if the feed runs the next day, the very same article is posted again with a new URL containing the date. Given that feedapi's duplicate detection compares the GUID and URL for exact matches, it makes sense that a new node is created.

I wrote a local module that searches for a substring of the pathauto-generated URL in the hook_nodeapi 'insert' case. If it finds a match, it issues a node_delete() on the node currently being saved.

This seems to have resolved my problem.
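For anyone trying the same workaround, here is a rough sketch of the approach described above. The module name "dedupe", the content type check, and the exact query are assumptions, not the poster's actual code; adapt them to your setup. It strips a trailing "-N" suffix from the freshly generated alias and deletes the new node if an older node already owns the base alias.

```php
/**
 * Hypothetical Drupal 6 implementation of hook_nodeapi().
 * On insert, delete the new node if its pathauto alias is just an
 * existing alias with a "-0", "-1", ... suffix tacked on.
 */
function dedupe_nodeapi(&$node, $op, $a3 = NULL, $a4 = NULL) {
  if ($op != 'insert' || $node->type != 'feed_item') {
    return;
  }
  // Alias pathauto just generated for this node, e.g. "news/some-story-0".
  $alias = drupal_get_path_alias('node/' . $node->nid);
  // Strip a trailing "-<number>" suffix to get the base alias.
  $base = preg_replace('/-\d+$/', '', $alias);
  if ($base != $alias) {
    // Does an older node already own the base alias?
    $existing = db_result(db_query(
      "SELECT src FROM {url_alias} WHERE dst = '%s'", $base));
    if ($existing && $existing != 'node/' . $node->nid) {
      node_delete($node->nid);
    }
  }
}
```

Note that node_delete() inside an insert hook is a blunt instrument; it runs the full deletion path for a node that was saved moments earlier, which is presumably why the poster calls this a local workaround rather than a fix.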

Thanks

pillarsdotnet’s picture

I solved the same problem by writing a parser module, using common_syndication_parser as a starting point. Added functionality includes:

  • Duplicates are resolved by following URL redirects and only storing the final, non-redirected URL.
  • It fetches content from the source site, converts relative URLs to absolute, and filters it through check_markup() before caching.
  • It only shows the paragraphs that contain searched-for terms, in an attempt to satisfy "fair use" standards.

Here's the function that follows redirects to get the "real" URL and avoid duplicate articles.

function _direct_parser_realurl($url) {
  static $curlopts = array(
    CURLOPT_AUTOREFERER => TRUE,
    CURLOPT_COOKIESESSION => TRUE,
    CURLOPT_FOLLOWLOCATION => TRUE,
    CURLOPT_HEADER => TRUE,          // include response headers in the output
    CURLOPT_NOBODY => TRUE,          // HEAD request; we only need the headers
    CURLOPT_RETURNTRANSFER => TRUE,
    CURLOPT_USERAGENT => 'Mozilla/4.0',
  );
  $ch = curl_init($url);
  curl_setopt_array($ch, $curlopts);
  $output = curl_exec($ch);
  curl_close($ch);
  // The output contains the headers of every response in the redirect chain;
  // the last Location: header points at the final URL.
  foreach (explode("\r\n", $output) as $line) {
    if (!strncmp($line, 'Location: ', 10)) {
      $url = substr($line, 10);
    }
  }
  return $url;
}

Compare the source feed with the parsed result.

(But be kind; the Coyle site is very much under construction; the theme is obviously not fleshed out.)

pillarsdotnet’s picture

The above strategy still occasionally produced duplicate stories, when the story was in fact published at multiple URLs.

Filtering unnecessary arguments from the URL helped but did not eliminate the problem.

My latest strategy is to replace the GUID with an MD5 sum of the extracted text. That seems to be working for now.
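The content-hash idea can be sketched as follows. This is not the poster's actual code; the function name is made up, and the normalization steps (strip markup, lowercase, collapse whitespace) are assumptions that make the hash robust against trivial differences between copies of the same story.

```php
<?php
// Sketch of a content-hash GUID. Two copies of the same story that differ
// only in markup, case, or whitespace hash to the same value.
function _example_content_guid($text) {
  $text = strip_tags($text);                  // drop HTML markup
  $text = strtolower(trim($text));            // case-insensitive compare
  $text = preg_replace('/\s+/', ' ', $text);  // collapse runs of whitespace
  return md5($text);
}
```

The trade-off the poster mentions below applies here too: hashing the extracted body is slower than comparing the feed's reported GUID/URL, but it is immune to feeds that republish the same article under new URLs.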

cardentey’s picture

A small variation:

function _direct_parser_realurl($url) {
  static $curlopts = array(
    CURLOPT_AUTOREFERER => TRUE,
    CURLOPT_COOKIESESSION => TRUE,
    CURLOPT_HEADER => TRUE,
    CURLOPT_NOBODY => TRUE,
    CURLOPT_RETURNTRANSFER => TRUE,
    CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1',
  );
  $ch = curl_init($url);
  curl_setopt_array($ch, $curlopts);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
  curl_setopt($ch, CURLOPT_MAXREDIRS, 10);  // follow at most 10 redirects
  curl_exec($ch);

  // Let curl report the final URL instead of parsing Location: headers.
  $newurl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
  if (!$newurl) {
    $newurl = $url;
  }
  curl_close($ch);

  return $newurl;
}

function mymodule_feedapi_after_parse($feed) {
  for ($i = 0; $i < count($feed->items); $i++) {
    $feed->items[$i]->options->original_url = _direct_parser_realurl($feed->items[$i]->options->original_url);
  }
}

Saludos,
Roberto
http://www.sobrefamosos.com

pillarsdotnet’s picture

FileSize
10.45 KB

Thanks for the improvement. I'm attaching my module, in case anyone is interested.

The current version depends on a single regex which is used to select applicable paragraphs from all newsfeeds. Not everyone would prefer this implementation, but it works for me.

Summit’s picture

Hi,

I have had this situation for a long time already; see http://drupal.org/node/251908
Could a solution for this (maybe one of the above) please be added to feedapi 6?
Thanks a lot in advance for considering this!

Greetings,
Martijn

pillarsdotnet’s picture

@Summit:

Not the same problem.

Your issue has to do with pathauto generating duplicate aliases. That doesn't happen on my installation -- the second one gets "-1" added to the name, the third gets "-2", etc. I don't know what is causing your problem, but my hunch is that it has to do with your pathauto installation, not with feedapi.

My issue is different. I want to eliminate duplicate article summaries, even when they come from distinct feed items. To do this, I made a new parser module and rewrote its criteria for duplicate detection.

The original code compares the URL and GUID as reported by the feed.

My code compares a hash of the excerpt of the article as shown on my site.

The original is relatively quick and compliant with the RSS standard.

Mine is relatively slow but immune from lousy implementations of the standard.

I doubt that the feedapi author will choose to incorporate my ideas. But if I get a second paying client who wants a newsfeed, I might go ahead and register a new parser module. Meanwhile, you have a fairly recent snapshot of my working code (above). Feel free to use and improve it. Or pay someone to do it for you.

Aron Novak’s picture

The idea is quite good; sometimes it's useful to consider the actual content of the feed item.
However, as I checked the module source code, it could re-use the common syndication functions instead of copy-pasting them. I imagine this module as a common syndication wrapper plus the small modification you made to the GUID computation.

pillarsdotnet’s picture

Yup. Mine is a horrible ugly hack; it Works For Me (tm), but I'd have to clean it up quite a bit before publishing it as a module. Meanwhile, anybody who wants to use the ideas to make something better is certainly welcome.

Aron Novak’s picture

Anyway, I'm sure it isn't a lot of work to get to a clean module.

virtualdrupal’s picture

Status: Postponed (maintainer needs more info) » Active

I'm experiencing the same issue: if I have two separate feeds from the same source, like "USAToday World Politics" and "USAToday World Health", any time the same article appears in both feeds, I get duplicates...

I'm processing hundreds of feeds, though, so this regex solution might send my cron into a timeout frenzy... any advice? Currently using the SimplePie parser.

I hope it's OK that I changed the status... this is actually pretty critical.

Aron Novak’s picture

Status: Active » Postponed (maintainer needs more info)

#14:
Have you turned on the "Check for duplicates on all feeds" option? If not, do that. If you have, please supply the exact feed URLs where you experience this problem.

virtualdrupal’s picture

Aron, I was wrong, it's not duplicate nodes, it's duplicates of the same node within a view (since each feed is from a different taxonomy), enabling the "distinct node" feature in views2 appears to catch it.

Aron Novak’s picture

Status: Postponed (maintainer needs more info) » Fixed

enabling the "distinct node" feature in views2 appears to catch it

If I understand the situation correctly, this issue is fixed.

virtualdrupal’s picture

Status: Fixed » Active

Well I guess the "distinct:yes" solution caught some of them but not all.

I'm actually experiencing both nodes being created twice from the same feed (with -0 appended to the duplicate's URL) and two different feeds being credited as parents of the same node.

Usually it's the second issue: two separate feeds that carry some identical stories both get credited as the parent, so there is only one node, with two parents. This causes two problems. Views seems to pick it up twice, even though it's technically only one node and "distinct: yes" is configured. So when I'm on a category page (example.com/taxonomy/term/3) with a views block generating all relevant nodes by argument arg(2), I get duplicate results for any article with multiple parents.

The same issue causes a second problem. I use if (!$node->links['feedapi_feed']['title']): in my node.tpl.php to quote the source, but when a node has two parents, I'm not sure how to foreach my way through the array, since $node->links['feedapi_feed']['title'] no longer exists.

Example :: (Typical node)

                    [links] => Array
                        (
                            [feedapi_feed] => Array
                                (
                                    [title] => Feed: Reuters
                                    [href] => node/6449
                                )

                            [feedapi_original] => Array
                                (
                                    [title] => Original article
                                    [href] => http://feeds.reuters.com/~r/reuters/worldNews/~3/1w3sYSzcQqo/idUSTRE55T0LQ20090701
                                )

                        )

Example :: (Problem Node with multiple parents)

                    [links] => Array
                        (
                            [feedapi_feed_263] => Array
                                (
                                    [title] => Feed: Newsweek
                                    [href] => node/263
                                )

                            [feedapi_feed_6526] => Array
                                (
                                    [title] => Feed: MSNBC.com
                                    [href] => node/6526
                                )

                            [feedapi_original] => Array
                                (
                                    [title] => Original article
                                    [href] => http://www.newsweek.com/id/204762
                                )

                        )
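Given link arrays shaped like the two examples above, here is a sketch of one way to collect every parent feed's title regardless of which shape you get (the helper name is made up): match any key that starts with "feedapi_feed", which covers both the plain "feedapi_feed" key and the numbered "feedapi_feed_NID" variants, while skipping "feedapi_original".

```php
<?php
// Collect the titles of all parent-feed links from a node's $links array.
// Works for both the single "feedapi_feed" key and "feedapi_feed_NID" keys.
function _example_feed_sources($links) {
  $sources = array();
  foreach ($links as $key => $link) {
    if (strpos($key, 'feedapi_feed') === 0) {
      $sources[] = $link['title'];  // e.g. "Feed: Newsweek"
    }
  }
  return $sources;
}
```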

The two feeds that make up the problem node in this example are
http://www.newsweek.com/id/43805/output/rss
&
http://www.msnbc.msn.com/id/3032506/device/rss/rss.xml

So in some cases, when both problems combine, I get the same article three times: one node with two parents, plus another node with -0 at the end when one of those parents happened to create the node twice, all of which show up in my views block.

I see the duplicate node (with -0 appended to the duplicate) less often than the multiple parent issue.

Let me know if you need any other details to help troubleshoot, pretty critical problem at the moment.

Best Regards

virtualdrupal’s picture

I should add that I'm running the FeedAPI dev version, updated within the last 7 days, so if anything significant related to duplicates has happened recently, let me know and I'll try another upgrade.

virtualdrupal’s picture

To answer your question from #15, "check for duplicates on all feeds" is selected on every feed.

Eric_A’s picture

I'm actually experiencing nodes being created both twice from the same feed (with -0 being appended on to the duplicate URL)

I'm looking at the 5.x-1.5 code, and the interesting thing to me is that feedapi_inherit_feedapi_item() always returns NULL when $op is 'unique' (or any op, for that matter). The code in _feedapi_invoke_refresh() doesn't distinguish between FALSE and NULL, so in some scenarios we run both the code for a unique item and the code for a non-unique item.

Unfortunately I cannot investigate further right now.

The interesting scenario in _feedapi_invoke_refresh() seems to be this one:

A new item is processed and feedapi_node reports it as unique (so it is saved), and then feedapi_inherit reports it as non-unique (updating an item that may not yet exist everywhere?).

Of course some people may have these processors running in the reverse order.

EDIT:
I have taken another quick look, and it appears that feedapi_inherit causes no harm. It reports a duplicate but does not insert anything itself, nor does it appear to trigger some other insert.

Eric_A’s picture

The 5.x-1.5 _feedapi_node_unique() checks for a duplicate URL or GUID within the same feed, but _feedapi_node_update() uses the nid of the first duplicate URL or GUID across all feeds... This wrong combination of feed and feed item is then used when deleting from and inserting into the list of existing items...

EDIT: _feedapi_update() was a mistake, and I fixed it to the correct name: _feedapi_node_update().

leducmills’s picture

I'm having similar issues with my site: both the creation of duplicate nodes (with -0, -1, etc.) and the appearance of duplicate feeds in the admin/content/feeds menu whenever I edit and save a feed.

The big issue for me is that the feed items aren't always showing all their content. Sometimes the body is missing, and sometimes it isn't shown by default, but when I go in to edit it, it's there, and all I have to do is save it to make it show. I'm lost; I've been trying to fix this for a few days now.

Setup:

Drupal 6.14
FeedAPI 6.x-1.9-beta1
FeedAPI Node 6.x-1.9-beta1
FeedAPI Mapper 6.x-2.0-alpha3
Common syndication parser 6.x-1.9-beta1
FeedAPI Inherit 6.x-1.9-beta1
FeedAPI Taxonomy Compare 6.x-1.4

Eric_A’s picture

Status: Active » Needs work
FileSize
2.46 KB

Follow-up on #22. I know it's not against HEAD, but perhaps this 5.x-1.5 patch gets someone started who has time and deeper knowledge of FeedAPI.

EDIT: #324797: Duplicate items when update items enabled http://drupal.org/node/324797#comment-1607302
All that is missing in that issue is a complete solution for 5.x.

deltab’s picture

subscribing, same issue with 6.x

Anonymous’s picture

subscribing, same issue with 6.x

with simple pie as my XML parser of choice

jannalexx’s picture

Version: 6.x-1.5 » 6.x-1.9-beta2

subscribing, same issue with 6.x

Eric_A’s picture

A fixed (duplicate) 6.x issue is here: #324797: Duplicate items when update items enabled

Summit’s picture

If it was fixed in June in 6.x, how can it still occur?

jannalexx’s picture

Unfortunately this was never fixed for me. I see fuzzy behavior with duplicated content on cron runs, on nodes made by feedapi / SimplePie parser / cache lifetime in seconds: 30.

Eric_A’s picture

Are any of you not using Drupal 6.14, or not using Poormanscron 1.x? If so, see if updating to 6.15 and 2.x fixes your problems.

jannalexx’s picture

This has nothing to do with poormanscron, as it can happen by simply refreshing the feed items.
The common syndication parser works without duplicated content (dev version tested), so I had to switch to it from SimplePie.

focal55’s picture

I had duplicate nodes being created with FeedAPI 6.x-1.x-dev + Drupal 6.16. I was trying to modify the FeedAPI importer to use a custom content type when creating nodes from an import. My custom content type seemed to cause the duplicates, because after I changed the setting back to creating new Feed Items, it works and does not create duplicates.

meghs’s picture

Hello focal55, can you clarify the solution you gave in your post? I am facing the same problem, and I have also used a custom content type for my feed items.