I have a site with about 25 feeds set up using the FeedAPI module with the SimplePie parser. I keep getting duplicate nodes created on successive cron runs. It doesn't happen every time, but it happens often. I also have the box checked for checking for duplicates within the feed, so that check doesn't seem to be working correctly.
The nodes are created with a pathauto alias, so if the path already exists, pathauto just appends "-0", "-1", etc. to the end.
I'm not sure if this is related, but I'm also seeing certain feed nodes promoted to the front page for no apparent reason. I'm also troubleshooting another issue where cron keeps reaching the maximum execution time, so I'm not sure whether that is related either.
Thanks
| Comment | File | Size | Author |
|---|---|---|---|
| #24 | feedapi_node-365943-24-D5.patch | 2.46 KB | Eric_A |
| #8 | direct_parser.zip | 10.45 KB | pillarsdotnet |
Comments
Comment #1
Aron Novak: Can you check whether it also happens without pathauto?
Comment #2
bschoudel: I can certainly try that.
After a few days of monitoring, I've noticed a pattern: the duplicates only seem to be created on feeds where I have multiple feeds from the same source. For example, I may have an ESPN feed on "Big Ten" basketball and an ESPN feed on "NCAA Basketball". It appears to be these feeds that generate the duplicates.
Comment #3
bschoudel: I have turned off pathauto and confirmed that I am still getting duplicate nodes on certain feeds.
Comment #4
bschoudel: OK, after further review it appears to be a malformed feed. After an article is posted, if the feed runs the next day, the very same article is posted again with a new URL containing the date. Since FeedAPI's logic compares direct matches of GUID and URL, it makes sense that a new node is created.
I wrote a local module that, in the hook_nodeapi insert case, searches for a substring in the pathauto-generated URL. If it finds a match, it calls node_delete on the node being saved.
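The matching step described above might look roughly like the sketch below. It is only an illustration under assumptions: the function names are hypothetical, the list of existing aliases stands in for a lookup against the alias table, and it treats a new alias as a duplicate when it equals an existing alias plus a numeric "-N" suffix. The actual local module was not posted.

```php
<?php
/**
 * Hypothetical helper: strip the numeric suffix that is appended on an alias
 * collision, so "news/big-ten-preview-0" becomes "news/big-ten-preview".
 */
function example_alias_base($alias) {
  return preg_replace('/-\d+$/', '', $alias);
}

/**
 * Returns TRUE when $new_alias is an existing alias plus a "-N" suffix.
 *
 * $existing_aliases is a plain array here; a real module would query the
 * stored aliases instead, and would then call node_delete() on the new node
 * from its hook_nodeapi() 'insert' case.
 */
function example_alias_is_duplicate($new_alias, array $existing_aliases) {
  $base = example_alias_base($new_alias);
  return $base !== $new_alias && in_array($base, $existing_aliases, TRUE);
}
```

One caveat with this kind of check: a title that legitimately ends in a number (for example "top-10") only counts as a duplicate if the alias without the number also exists, but that can still produce false positives.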
This seems to have resolved my problem.
Thanks
Comment #5
pillarsdotnet: I solved the same problem by writing a parser module, using common_syndication_parser as a starting point. Added functionality includes:
Here's the function which follows redirects to get the "real" url and avoid duplicate articles.
Compare the source feed with the parsed result.
(But be kind; the Coyle site is very much under construction; the theme is obviously not fleshed out.)
Comment #6
pillarsdotnet: The above strategy still occasionally produced duplicate stories when the same story was in fact found at more than one URL.
Filtering unnecessary arguments from the URL helped but did not eliminate the problem.
My latest strategy is to replace the guid with an md5 sum of the extracted text. That seems to be working for now.
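As a rough sketch of the two ideas in this comment (filtering URL arguments, then replacing the GUID with an MD5 sum of the extracted text), something like the following could be used. The function names, the list of ignored arguments, and the whitespace and case normalization are assumptions for illustration, not code from the attached module.

```php
<?php
/**
 * Drop query arguments that do not identify the article.
 * The default list of ignored arguments is an assumption; adjust it per feed.
 * The fragment ("#...") is dropped on purpose.
 */
function example_clean_url($url, array $ignored = array('utm_source', 'utm_medium', 'utm_campaign', 'ref')) {
  $parts = parse_url($url);
  if ($parts === FALSE || !isset($parts['host'])) {
    // Not something we can normalize; leave it alone.
    return $url;
  }
  $query = array();
  if (isset($parts['query'])) {
    parse_str($parts['query'], $query);
    foreach ($ignored as $name) {
      unset($query[$name]);
    }
    // Sort so that argument order does not matter.
    ksort($query);
  }
  $clean = (isset($parts['scheme']) ? $parts['scheme'] : 'http') . '://' . strtolower($parts['host']);
  $clean .= isset($parts['port']) ? ':' . $parts['port'] : '';
  $clean .= isset($parts['path']) ? $parts['path'] : '/';
  if ($query) {
    $clean .= '?' . http_build_query($query, '', '&');
  }
  return $clean;
}

/**
 * Build a GUID from the extracted text instead of trusting the feed's GUID.
 * Tags are stripped and whitespace and case are normalized before hashing.
 */
function example_content_guid($text) {
  $normalized = strtolower(trim(preg_replace('/\s+/', ' ', strip_tags($text))));
  return md5($normalized);
}
```

Two feed items whose extracted text differs only in markup or spacing then get the same GUID, whatever URL they were found on.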
Comment #7
cardentey: A small variation:
Regards,
Roberto
http://www.sobrefamosos.com
Comment #8
pillarsdotnet: Thanks for the improvement. I'm attaching my module, in case anyone is interested.
The current version depends on a single regex which is used to select applicable paragraphs from all newsfeeds. Not everyone would prefer this implementation, but it works for me.
Comment #9
Summit: Hi,
I have had this problem for a long time; see: http://drupal.org/node/251908
Could a solution for this (maybe one of the above) please be added to FeedAPI 6?
Thanks a lot in advance for considering this!
Greetings,
Martijn
Comment #10
pillarsdotnet: @Summit:
Not the same problem.
Your issue has to do with pathauto generating duplicate aliases. Doesn't happen on my installation -- the second one gets a "-1" added to the name; the third gets a "-2"; etc. Dunno what is causing your problem but my hunch is that it has to do with your pathauto installation, not with feedapi.
My issue is different. I want to eliminate duplicate article summaries, even when they come from distinct feed items. To do this, I made a new parser module and rewrote its criteria for duplicate detection.
The original code compares the URL and GUID as reported by the feed.
My code compares a hash of the excerpt of the article as shown on my site.
The original is relatively quick and compliant with the RSS standard.
Mine is relatively slow but immune from lousy implementations of the standard.
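To make the contrast concrete, here is a small self-contained sketch of the two criteria applied to the same items. The item arrays and function names are made up for illustration; neither function is taken from FeedAPI or from the attached parser.

```php
<?php
/**
 * URL/GUID criterion: an item is a duplicate when its URL or its GUID
 * has already been seen.
 */
function example_dedupe_by_url_guid(array $items) {
  $seen = array();
  $kept = array();
  foreach ($items as $item) {
    if (isset($seen[$item['url']]) || isset($seen[$item['guid']])) {
      continue;
    }
    $seen[$item['url']] = TRUE;
    $seen[$item['guid']] = TRUE;
    $kept[] = $item;
  }
  return $kept;
}

/**
 * Excerpt criterion: an item is a duplicate when a hash of its excerpt
 * has already been seen, whatever its URL and GUID say.
 */
function example_dedupe_by_excerpt(array $items) {
  $seen = array();
  $kept = array();
  foreach ($items as $item) {
    $hash = md5(trim($item['excerpt']));
    if (isset($seen[$hash])) {
      continue;
    }
    $seen[$hash] = TRUE;
    $kept[] = $item;
  }
  return $kept;
}

// The same story republished a day later with a new dated URL and GUID.
$items = array(
  array('url' => 'http://example.com/story?date=0101', 'guid' => 'story-0101', 'excerpt' => 'Same story text.'),
  array('url' => 'http://example.com/story?date=0102', 'guid' => 'story-0102', 'excerpt' => 'Same story text.'),
);
```

With this data the URL/GUID criterion keeps both items, while the excerpt criterion keeps only the first. The price is that the excerpt has to be fetched and extracted before the comparison can be made.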
I doubt that the feedapi author will choose to incorporate my ideas. But if I get a second paying client who wants a newsfeed, I might go ahead and register a new parser module. Meanwhile, you have a fairly recent snapshot of my working code (above). Feel free to use and improve it. Or pay someone to do it for you.
Comment #11
Aron Novak: The idea is quite good; sometimes it's useful to consider the actual content of the feed item.
However, when I checked the module source code, I saw that it could reuse the common syndication functions instead of copy-pasting them. I imagine this module as a common syndication wrapper plus the small modification you made to the GUID computation.
Comment #12
pillarsdotnet: Yup. Mine is a horrible ugly hack; it Works For Me (tm) but I'd have to clean it up quite a bit before publishing as a module. Meanwhile, anybody who wants to use the ideas to make something better is certainly welcome.
Comment #13
Aron Novak: Anyway, I'm sure it would not be much work to turn this into a clean module.
Comment #14
virtualdrupal: I'm experiencing the same issue. If I have two separate feeds from the same source, like "USAToday World Politics" and "USAToday World Health", any time they include the same article in both feeds, I get duplicates...
I'm processing hundreds of feeds though, so this regex solution might send my cron into a timeout frenzy... any advice? Currently using the SimplePie parser.
hope it's ok that I changed the status... This is actually pretty critical
Comment #15
Aron Novak: #14:
Have you turned on the "Check for duplicates on all feeds" option? If not, do that. If you did, please supply the exact feed URLs where you experience this problem.
Comment #16
virtualdrupal: Aron, I was wrong. It's not duplicate nodes, it's duplicates of the same node within a view (since each feed is from a different taxonomy). Enabling the "distinct node" feature in Views 2 appears to catch it.
Comment #17
Aron Novak: If I understand the situation correctly, this issue is fixed.
Comment #18
virtualdrupal: Well, I guess the "distinct:yes" solution caught some of them but not all.
I'm actually seeing nodes created twice from the same feed (with -0 appended to the duplicate URL) as well as two different feeds being credited for the same node (as parents).
Usually it's the second issue: two separate feeds that carry some identical stories both get credited as the parent, so there is only one node, with two parents. This causes two problems. Views seems to pick it up twice, even though it's technically only one node and "distinct:yes" is configured. So when I'm on a category page (example.com/taxonomy/term/3) with a views block generating all relevant nodes by argument arg(2), I get duplicate results for any article with multiple parents.
The same issue causes a second problem. I use
if (!$node->links['feedapi_feed']['title']):
as a variable in my node.tpl.php to quote the source, but when the node has two parents, I'm not sure how to foreach my way through the array, as feed_feed->title no longer exists.
Example :: (Typical node)
Example :: (Problem Node with multiple parents)
The two feeds that make up the problem node in this example are
http://www.newsweek.com/id/43805/output/rss
&
http://www.msnbc.msn.com/id/3032506/device/rss/rss.xml
So in some cases when both problems combine I get the same article three times, one with two parents, and then another with -0 at the end when one of those parents happened to randomly create the node twice, all of which show up in my view block.
I see the duplicate node (with -0 appended to the duplicate) less often than the multiple parent issue.
Let me know if you need any other details to help troubleshoot, pretty critical problem at the moment.
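On the node.tpl.php question, one possible way to loop over several parents is sketched below. It rests on an assumption that is not confirmed anywhere in this thread: that each parent feed's link sits in $node->links under its own key beginning with "feedapi_feed". Dump $node->links on an affected node first and adjust the key test to whatever is really there. The function name is hypothetical.

```php
<?php
/**
 * Collect the titles of all parent feed links on a node.
 *
 * Assumption: every parent feed link is stored under a key that starts
 * with 'feedapi_feed' and carries a 'title' entry.
 */
function example_feed_source_titles(array $links) {
  $titles = array();
  foreach ($links as $key => $link) {
    if (strpos((string) $key, 'feedapi_feed') === 0 && is_array($link) && !empty($link['title'])) {
      $titles[] = $link['title'];
    }
  }
  return $titles;
}

// Possible use in a template: print implode(', ', example_feed_source_titles($node->links));
```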
Best Regards
Comment #19
virtualdrupal: I should add that I'm running the FeedAPI dev version, updated in the last 7 days, so if anything significant related to duplicates has happened recently, let me know and I'll try another upgrade.
Comment #20
virtualdrupal: To answer your question from #15, "check for duplicates on all feeds" is selected on every feed.
Comment #21
Eric_A: I'm looking at the 5.x-1.5 code, and the interesting thing to me is that feedapi_inherit_feedapi_item() always returns NULL when $op is 'unique', or for any op for that matter. The code in _feedapi_invoke_refresh() doesn't distinguish between FALSE and NULL, so in some scenarios we run both the code for a unique item and the code for a non-unique item.
Unfortunately I cannot investigate further right now.
The interesting scenario in _feedapi_invoke_refresh() seems to be this one:
A new item is processed and feedapi_node reports it as unique (save it) and then feedapi_inherit reports it to be non unique (update an item that may not yet exist everywhere?).
Of course some people may have these processors running in the reverse order.
EDIT:
I have taken another quick look and it appears that feedapi_inherit causes no harm. It reports a duplicate but does not insert anything itself, neither does it appear to trigger some other insert.
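The NULL versus FALSE point can be shown in isolation with plain PHP. The function names below are hypothetical stand-ins and this is not the FeedAPI code; it only shows how a loose truth test lumps "no answer" (NULL) together with "duplicate" (FALSE), while a strict comparison keeps them apart.

```php
<?php
/**
 * Loose test: NULL and FALSE both read as "not unique".
 */
function example_is_unique_loose($result) {
  return (bool) $result;
}

/**
 * Strict test: only an explicit FALSE marks a duplicate;
 * NULL (the processor expressed no opinion) is not counted as one.
 */
function example_is_unique_strict($result) {
  return $result !== FALSE;
}
```

Whether NULL ought to count as "no objection" is a design decision for the module, so treat the strict variant as one option rather than the fix.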
Comment #22
Eric_A: In 5.x-1.5, _feedapi_node_unique() checks for a duplicate URL or GUID within the same feed, but _feedapi_node_update() uses the nid of the first duplicate URL or GUID from all feeds. This wrong combination of feed and feed item is then used to delete from and insert into the list of existing items.
EDIT: _feedapi_update() was a mistake, and I fixed it to the correct name: _feedapi_node_update().
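The scope mismatch described here can be reproduced with a few lines of plain PHP. The rows and the lookup function are invented for illustration and do not mirror the real FeedAPI tables or queries; they only show how a lookup limited to one feed and a lookup across all feeds can return different nodes for the same URL.

```php
<?php
// Stand-in for the stored feed items: the same URL exists under two feeds.
$rows = array(
  array('feed_nid' => 1, 'nid' => 10, 'url' => 'http://example.com/story'),
  array('feed_nid' => 2, 'nid' => 20, 'url' => 'http://example.com/story'),
);

/**
 * Find the nid of the first stored item with the given URL.
 * When $feed_nid is given, only items of that feed are considered.
 */
function example_find_nid(array $rows, $url, $feed_nid = NULL) {
  foreach ($rows as $row) {
    if ($row['url'] === $url && ($feed_nid === NULL || $row['feed_nid'] === $feed_nid)) {
      return $row['nid'];
    }
  }
  return NULL;
}

// While refreshing feed 2:
// the uniqueness check, limited to feed 2, finds item 20 ...
$within_feed = example_find_nid($rows, 'http://example.com/story', 2);
// ... but an update lookup across all feeds picks item 10, which belongs to feed 1.
$across_feeds = example_find_nid($rows, 'http://example.com/story');
```

Pairing feed 2 with item 10 is the kind of wrong feed and item combination the comment describes.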
Comment #23
leducmills: I'm having similar issues with my site, both in the creation of duplicate nodes (with -0, -1, etc.) and in the appearance of duplicate feeds in the admin/content/feeds menu whenever I edit and save a feed.
The big issue for me is that the feed items aren't always showing all the content - sometimes the body is missing, and sometimes it's not showing by default but when I go in and edit, it's there and all I have to do is save it to get it to show. I'm so lost, been trying to fix this for a few days now.
Setup:
Drupal 6.14
FeedAPI 6.x-1.9-beta1
FeedAPI Node 6.x-1.9-beta1
FeedAPI Mapper 6.x-2.0-alpha3
Common syndication parser 6.x-1.9-beta1
FeedAPI Inherit 6.x-1.9-beta1
FeedAPI Taxonomy Compare 6.x-1.4
Comment #24
Eric_A: Follow-up on #22. I know it's not against HEAD, but perhaps this 5.x-1.5 patch will get someone started who has the time and deeper knowledge of FeedAPI.
EDIT: #324797: Duplicate items when update items enabled http://drupal.org/node/324797#comment-1607302
All that is missing in that issue is a complete solution for 5.x.
Comment #25
deltab: subscribing, same issue with 6.x
Comment #26
Anonymous (not verified): subscribing, same issue with 6.x
with SimplePie as my XML parser of choice
Comment #27
jannalexx: subscribing, same issue with 6.x
Comment #28
Eric_A: A fixed (duplicate) 6.x issue is here: #324797: Duplicate items when update items enabled
Comment #29
Summit: If it was fixed in June in 6.x, how can it still occur?
Comment #30
jannalexx: Unfortunately this was never fixed for me. I see erratic behaviour with duplicated content on cron runs; the nodes are created by FeedAPI with the SimplePie parser and a cache lifetime of 30 seconds.
Comment #31
Eric_A: Are any of you not using Drupal 6.14 and not using Poormanscron 1.x? If you are using one or both then see if updating to 6.15 and 2.x fixes your problems.
Comment #32
jannalexx: This has nothing to do with Poormanscron, as it can happen simply by refreshing feed items.
The common syndication parser works without duplicated content (dev version tested), so I had to switch to it from SimplePie.
Comment #33
focal55: I had duplicate nodes being created with FeedsAPI 6.x-1.x-dev + Drupal 6.16. I was trying to modify the FeedsAPI importer to use a custom content type when creating nodes on import. My custom content type seemed to cause the duplicates, because after I changed the setting back to create new Feed Items it works and does not create duplicates.
Comment #34
meghs: Hello focal55, can you clarify the solution you gave in your post? I am facing the same problem. I have also used a custom content type for my feed items.