Closed (fixed)
Project:
SimpleFeed
Version:
6.x-1.x-dev
Component:
Code
Priority:
Normal
Category:
Bug report
Assigned:
Unassigned
Reporter:
Created:
31 Mar 2008 at 04:36 UTC
Updated:
25 Jun 2008 at 19:44 UTC
Comments
Comment #1
Se7enLC commented
I just made some code changes and I think I might have addressed the problem.
It looks like duplicate checking is based on an md5sum of the title and body. So if either the title or the body changes even slightly, the hash changes and the item is treated as a new one.
I decided to add in a call to "get_permalink()", which for LiveJournal feeds is the URL of the entry. Any changes made to the entry, including title and body, will NOT change that URL. This means that it will be able to update the correct feed item as needed. Fallback is to the title+body method if the $url variable doesn't get filled in.
Code is below, modified in simplefeed_item.module:
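// Prefer the permalink: it stays the same when the entry's title or body is edited.
$url = $item->get_permalink();
// Fall back to hashing title + body when no permalink is available.
$iid = $url ? md5($url) : md5($title . $body);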
Comment #2
mrrijo commented
I had the same problem with the current stable 5 release and Google Alerts. I have just put your code in, and I am waiting a while to see whether any duplication occurs. Thanks for this tip. :)
Comment #3
Se7enLC commented
Well, that didn't work at all. I still got duplicate items.
Comment #4
Se7enLC commented
Looking in the db, it seems that the iid field of simplefeed_feed_item is not being filled in, even though the correct $iid is being saved into $form_state. The node is created using this line of code, which I assume is internal to Drupal:
drupal_execute('feed_item_node_form', $form_state, $node);
There's no other mention of "feed_item_node_form" anywhere else in simplefeed, simplepie or Drupal.
Comment #5
Se7enLC commented
Another possible fix:
In the function "simplefeed_item_feed_parse", toward the end, "drupal_execute('feed_item_node_form', $form_state, $node);" is called. $node is supposed to contain the defaults and $form_state should contain the per-feed-item parameters. As it turns out, the function is using the $iid from the node rather than from the form state.
I added the following line of code right before that function call:
$node->iid = $iid;
I now see the iid being filled into the database. I'm unsure yet how this will affect duplicate checking, but I am hopeful.
EDIT: It seems that this change was a success. New items are being imported, but changed items are not being updated. This may be intentional behavior. I might make changes to drop and re-add changed feed items under the same number.
Comment #6
mfer commented
I'm in the same situation.
Instead of dropping and re-adding, why not update the body, title, and other attributes on the node and save it if there is a change? This way, if any comments or other things have been added to the node on the local site, they will remain.
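The update-in-place idea can be illustrated with a minimal sketch. It is not code from the module: `example_sync_node()` is a hypothetical helper, and a plain array stands in for the parsed feed item. It only copies changed fields onto the existing node object and reports whether a save would be needed.

```php
<?php
// Hypothetical helper (not part of simplefeed): copy changed fields from a
// freshly parsed feed item onto the stored node, and report whether anything
// actually changed.
function example_sync_node(stdClass $node, array $item, array $fields = array('title', 'body')) {
  $changed = FALSE;
  foreach ($fields as $field) {
    if (isset($item[$field]) && (!isset($node->$field) || $node->$field !== $item[$field])) {
      $node->$field = $item[$field];
      $changed = TRUE;
    }
  }
  return $changed;
}

// Example: only the title differs, so only the title is overwritten.
$node = new stdClass();
$node->nid = 42;
$node->title = 'Old title';
$node->body = 'Same body';

if (example_sync_node($node, array('title' => 'New title', 'body' => 'Same body'))) {
  // In Drupal this is where node_save($node) would be called. The nid is not
  // touched, so comments attached to the node stay attached.
}
```

Because the node keeps its nid, nothing that references the node locally has to be recreated, which is the point of updating rather than dropping and re-adding.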
Comment #7
mfer commented
What if we switched out
For
This is different than what was there in the 1.x version of simplefeed. That was using
When get_id(true) is used the returned result is either
or
If no value or false is sent to get_id() it will return the id (from atom feeds), guid, identifier, permalink (link), title, or the same result as if it were true. My only concern here is if it gets to the title part. There may end up being duplicate titles on different posts.
So, are we up for going with
or a similar function of our own that does the same thing except we do something different at the title part? In the few minutes I looked through this it looks like it might be a more robust solution.
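A "similar function of our own" might look roughly like the sketch below. This is an illustration under assumptions, not SimplePie's implementation and not the module's code: `example_item_key()` is a hypothetical name, the item is a plain array with made-up keys, and hashing title plus body at the last step is just one possible way to "do something different at the title part".

```php
<?php
// Hypothetical sketch: walk a list of candidate identifiers in order of
// preference and hash the first one that is present.
function example_item_key(array $item) {
  foreach (array('id', 'guid', 'permalink') as $field) {
    if (!empty($item[$field])) {
      return md5($item[$field]);
    }
  }
  // No stable identifier found. Rather than hashing the title alone (two
  // different posts may share a title), hash title and body together.
  $title = isset($item['title']) ? $item['title'] : '';
  $body = isset($item['body']) ? $item['body'] : '';
  return md5($title . $body);
}
```

The trade-off of this particular fallback is that items without any identifier are again sensitive to body changes, so it only avoids the duplicate-title collision, not the original duplication problem.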
Comment #8
mfer commented
Well, the idea of going with md5($item->get_id()) seems nice until you have to update the iid in the simplefeed_feed_items table, where you don't have all the info you are looking for.
In the data set I have, the duplicates come up when there is a change to the body section of the node. This can be someone going in there and changing something, or it can be a change from FeedBurner. For example, FeedBurner adds blog bling at the bottom of the item. Or, I've noticed FeedBurner will occasionally remove extra spaces. This can happen after an item has been pulled in.
I'm experimenting with this instead:
This is the same title and url we should have for each item. This won't cause duplications due to changes in the body. And, if we want to update the node based on body changes we can now detect the item and update it on something other than the body.
Updating to this is pretty easy. It's just a matter of using an update function similar to the last update.
The only big downside would be that if someone puts out 2 items with the same name that don't have item URLs (so the feed URL is used), the second one won't be added.
I'm comparing this to the current setup. I'll report back in a few days when there has been a chance to see data with the differences.
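The snippet from this comment did not survive, so the following is only a sketch of a key with the described shape (title plus URL, falling back to the feed URL), not the patch that was later committed. `example_title_url_key()` and its parameters are hypothetical names.

```php
<?php
// Hypothetical sketch: build the duplicate-detection key from the title and
// the item URL, using the feed URL when the item has no URL of its own.
function example_title_url_key($title, $item_url, $feed_url) {
  $url = $item_url ? $item_url : $feed_url;
  return md5($title . $url);
}

$feed = 'http://example.com/feed';

// The body is not part of the key, so edits to the body (or trailing markup
// added by a feed service) cannot produce a new key for the same item.
$first_fetch = example_title_url_key('Hello', 'http://example.com/post/1', $feed);
$later_fetch = example_title_url_key('Hello', 'http://example.com/post/1', $feed);
// $first_fetch === $later_fetch is TRUE.

// The downside described above: two items with the same title and no item
// URL both fall back to the feed URL and get the same key.
$a = example_title_url_key('Untitled', '', $feed);
$b = example_title_url_key('Untitled', '', $feed);
// $a === $b is TRUE, so the second item would be treated as a duplicate.
```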
Comment #9
mfer commented
FYI, I'm going to work on a patch for what I proposed in #8 sometime in the next week.
Comment #10
mfer commented
Here is a patch for the 5.x-2.x branch. The 6.x patch is coming soon.
Comment #11
mfer commented
Let's try that again on D5 without my .project and .settings files.
Comment #12
mfer commented
Here is a patch against 6.x-1.x. This is untested since I don't have simplefeed running on Drupal 6 right now, but the change seems pretty straightforward.
Also, I based the new update function on the update 1 function, but I altered it so it goes through the whole table and not just the first 50.
Comment #13
m3avrck commented
Matt, this looks most excellent!
I'm going to double check these patches, running 'em through a few 1000 blogs I have, and confirm 'em both.
Comment #14
xiffy commented
@Se7enLC
Confirmed. I don't know if it is the Drupal way; probably not, since the $iid is attached to the form state, so that would be the preferred method, I suppose.
But that does not work: my iid in the database stayed empty (Drupal 6.2, simplepie 1.1.1, development snapshot of simplefeed). Your $node->iid = $iid; worked like a charm.
This should be in a next release, or at least a working version of the preferred Drupal method.
Cheers
Comment #15
jmaties@drupal.org commented
"db_num_rows" does not exist in Drupal 6.x :(
Comment #16
mfer commented
@m3avrck - any word on the patches in #11 and #12? I've had them running since I posted them without any noticeable problem.
Comment #17
m3avrck commented
Yes, I hope to commit this week; I've been traveling for a bit. I'll also triple check the patches by running them through a few thousand blogs too :)
Comment #18
m3avrck commented
@Se7enLC and @xiffy, please see this new issue for what you guys are experiencing: http://drupal.org/node/269185 -- this is separate from the patches in this issue. Your issue seems to be a PHP4 type issue.
Comment #19
m3avrck commented
Thanks Matt! Patch #11 committed to D5: http://drupal.org/cvs?commit=120679
Comment #20
m3avrck commented
Patch #12 committed to D6: http://drupal.org/cvs?commit=120702
Comment #21
m3avrck commented
Good to go, thanks guys!
Comment #22
gsnedders commented
However, due to things like http://www.intertwingly.net/blog/2005/04/09/Clone-Wars, just using the ID doesn't work. It's widespread enough to be an issue.
Comment #23
mfer commented
@gsnedders - what do you mean just using the id wouldn't work? I don't see how that's an issue since duplicate checking isn't based on just an id. Can you please explain what you mean?
Comment #24
gsnedders commented
I've been running around too much, so what I say may well be nonsense. Looking at it again, what you currently have doesn't work either, as you can have a conforming feed with two different items that share the same title and link.
Comment #25
mfer commented
@gsnedders You can have a feed with 2 separate items that have the same link and title? Can you provide an example of this?
Whatever solution we use for this module needs to take into account the existing design debt of the module as well as an update path for those who are already using it. Any suggestions for an alternative method of duplicate detection?
Comment #26
gsnedders commented
I would take the ID to be unique per item provided that it is not repeated at all in the current feed; if it is, I would then fall back on the link. I would always treat the ID as unique only within the current feed, and not globally (though technically it should be).
I don't have any examples of feeds with multiple items with identical link/title with different IDs, but they shouldn't be treated as duplicates. One thing I do strongly believe is that the given ID should not just be outright ignored (even though it does need to be ignored in some cases for compatibility with the real world).
http://www.詹姆斯.com/blog/2006/08/rss-dup-detection gives a decent description of duplicate detection, FWIW.
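The rule described in this comment can be sketched as follows. This is a minimal illustration under assumptions, not code from simplefeed or SimplePie: `example_feed_item_keys()` is a hypothetical name, and each item is a plain array with optional 'id' and 'link' entries.

```php
<?php
// Hypothetical sketch: trust an item's ID only when it occurs exactly once in
// the current feed, otherwise fall back on the link. Keys are scoped to the
// feed by mixing in the feed URL.
function example_feed_item_keys(array $items, $feed_url) {
  // Count how often each ID occurs in this feed.
  $counts = array();
  foreach ($items as $item) {
    if (!empty($item['id'])) {
      $id = $item['id'];
      $counts[$id] = isset($counts[$id]) ? $counts[$id] + 1 : 1;
    }
  }

  $keys = array();
  foreach ($items as $item) {
    $has_unique_id = !empty($item['id']) && $counts[$item['id']] === 1;
    if ($has_unique_id) {
      $value = 'id:' . $item['id'];
    }
    else {
      $value = 'link:' . (isset($item['link']) ? $item['link'] : '');
    }
    // Prefixing with the feed URL means the same ID in another feed yields a
    // different key, so IDs are never treated as globally unique.
    $keys[] = md5($feed_url . '|' . $value);
  }
  return $keys;
}
```

Items that repeat an ID and also share a link would still collide under this sketch, so a further fallback would be needed for that case.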
Comment #27
Anonymous (not verified) commented
Automatically closed -- issue fixed for two weeks with no activity.