I have many importers that parse external XML and turn it into nodes, setting the URL and GUID correctly. But when the importers check their sources again, some of them create duplicate nodes, ignoring the "unique target" option.

Is anybody else having this problem?

Comments

pedrorocha’s picture

Issue tags: -duplicate +duplicated nodes
danmuzyka’s picture

Version: 6.x-1.0-beta9 » 6.x-1.0-beta10

I am having the same issue on 6.x-1.0-beta10. I checked the feeds_node_item table and confirmed that multiple records, each created by a different parent feed node, have identical URLs and GUIDs, even though I selected both of those fields to be unique targets in the importer configuration. Then I tried deleting all of my imported nodes, importing again from one of my feed nodes, and then editing that feed node to change the feed URL field to match the value of another one of my feed nodes. Bingo! When I imported from the first feed node again, I did NOT get duplicate item nodes.

So, it seems that the "unique target" value only works for feed item nodes that have the same parent feed node. I imagine that importing content from several feeds on the same site, where an item may appear in more than one feed, is a fairly common use case. For instance, just now I was trying to import my del.icio.us bookmarks from different tag-based feeds. In the past, I worked on a news site that had an agreement with Bloomberg News allowing it to import RSS feeds from multiple Bloomberg article categories. Sometimes the same article appeared in two or more feeds. In that case the feeds were populating different parts of the site, so it was a moot point, fortunately.
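To make the behaviour described above concrete, here is a minimal sketch that models it with Python's built-in sqlite3 module. It is not Feeds code: the table is a toy stand-in that borrows the feeds_node_item column names, and the functions `existing_item_id` and `import_item`, the importer id "news", and the example URL and GUID are all hypothetical. The only point is that a lookup scoped by feed_nid cannot see an item that was imported by a different feed node.

```python
import sqlite3

# Toy stand-in for feeds_node_item (assumption: simplified, not the real schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feeds_node_item ("
    "nid INTEGER PRIMARY KEY, feed_nid INTEGER, id TEXT, url TEXT, guid TEXT)"
)


def existing_item_id(feed_nid, importer_id, guid):
    """Find an item by GUID, but only among items of the same parent feed node."""
    row = conn.execute(
        "SELECT nid FROM feeds_node_item "
        "WHERE feed_nid = ? AND id = ? AND guid = ?",
        (feed_nid, importer_id, guid),
    ).fetchone()
    return row[0] if row else None


def import_item(feed_nid, importer_id, url, guid):
    """Insert a row only when the scoped lookup finds nothing; return the nid."""
    nid = existing_item_id(feed_nid, importer_id, guid)
    if nid is None:
        cur = conn.execute(
            "INSERT INTO feeds_node_item (feed_nid, id, url, guid) "
            "VALUES (?, ?, ?, ?)",
            (feed_nid, importer_id, url, guid),
        )
        nid = cur.lastrowid
    return nid


# Same item, imported twice by feed node 10 and once by feed node 20.
first = import_item(10, "news", "http://example.com/a", "guid-a")
again = import_item(10, "news", "http://example.com/a", "guid-a")
other = import_item(20, "news", "http://example.com/a", "guid-a")

count = conn.execute(
    "SELECT COUNT(*) FROM feeds_node_item WHERE guid = 'guid-a'"
).fetchone()[0]
print(first == again, other != first, count)
```

In this model, re-importing from the same feed node reuses the existing row, while importing the same GUID from a second feed node creates a second row.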

Thanks in advance for any help anyone can provide!

johnv’s picture

There are a lot of posts about the Unique target. Regarding your question, see the following link that explains why the behaviour is correct: 1 GUID per feed, not per node type: http://drupal.org/node/761076#comment-2802256

danmuzyka’s picture

@johnv, thanks for the quick reply. I see in the comment you pointed out that @alex_b states that this behavior is deliberate. However, the business logic behind that decision does not make sense to me. @smscotten makes a similar point in http://drupal.org/node/661606#comment-3799942, and if there is a reason that it is better to test uniqueness against other nodes with the same parent feed node rather than against all imported nodes, I am not understanding it.

If there are other issues or comments arguing in favor of the current approach, could you point a few out? Maybe I just overlooked them. Thanks again for your help.

johnv’s picture

@Dan, I am just a user, not a maintainer of this module. In my case, I specify file names as a source, so I can create one super-importer for different 'feeds'. And you're right, when automating that, I'll run into problems.
But since alex_b describes this as the 'as-is/works-as-designed' situation, perhaps you'd better change this issue to a 'feature request' instead of a bug report.

danmuzyka’s picture

Title: Unique targets is being ignored » Unique targets ignored unless nodes have same parent feed node
Category: bug » feature

@johnv, sure thing. I guess I assumed that you were a close friend or colleague of alex_b, or at least had been using this module for long enough that you had particular insight into the rationale behind the current functionality. I'll change the category to feature request if you think that makes more sense. I'm also renaming the issue title for clarity.

johnv’s picture

Status: Active » Closed (duplicate)

Also check this issue, which already contains a patch for this very thing: Attach multiple importers to one content type.

EvanDonovan’s picture

Version: 6.x-1.0-beta10 » 6.x-1.x-dev
Priority: Normal » Major
Status: Closed (duplicate) » Active

Sorry for reopening this, but based on the comments in #661606-14: Support unique targets in mappers and following, I think it is necessary. It was incredibly surprising to me when I discovered that the GUID is only a GUID for a specific feed source.

johnv, I am not sure if #634462: Attach multiple importers to one content type (in D6) actually addresses my needs, since that is more about having multiple feed URLs on a single node, whereas I would like to create them as separate nodes for ease of administration on my site. (There are potentially going to be over a hundred of them.)

For anyone who doubts that GUIDs are currently only specific to a feed source (feed_nid in my case), create multiple feed nodes that pull from the same feed URL, then after running them, try the following query:

SELECT fn.nid,
       fn.feed_nid,
       fn.id,
       fn.url,
       fn.guid
FROM   feeds_node_item fn
       INNER JOIN (SELECT dfn.guid
                   FROM   feeds_node_item dfn
                   GROUP  BY dfn.guid
                   HAVING COUNT(guid) > 1) dup
         ON dup.guid = fn.guid
ORDER  BY fn.guid, fn.nid

That query will show that there are duplicates, and that it is because of the different feed sources.
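For anyone who wants to see what the query returns without touching a live site, here is a small sketch that runs it against toy data using Python's built-in sqlite3 module. The table is a simplified stand-in for feeds_node_item, and the three rows (feed nodes 10 and 20, importer id "news", the example URLs and GUIDs) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for feeds_node_item (assumption: not the real schema).
conn.execute(
    "CREATE TABLE feeds_node_item ("
    "nid INTEGER PRIMARY KEY, feed_nid INTEGER, id TEXT, url TEXT, guid TEXT)"
)
conn.executemany(
    "INSERT INTO feeds_node_item (nid, feed_nid, id, url, guid) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, "news", "http://example.com/a", "guid-a"),
        (2, 10, "news", "http://example.com/b", "guid-b"),
        (3, 20, "news", "http://example.com/a", "guid-a"),  # same item, other feed node
    ],
)

# The duplicate-finding query from the comment above.
DUPLICATES_SQL = """
SELECT fn.nid, fn.feed_nid, fn.id, fn.url, fn.guid
FROM   feeds_node_item fn
       INNER JOIN (SELECT dfn.guid
                   FROM   feeds_node_item dfn
                   GROUP  BY dfn.guid
                   HAVING COUNT(dfn.guid) > 1) dup
         ON dup.guid = fn.guid
ORDER  BY fn.guid, fn.nid
"""

rows = conn.execute(DUPLICATES_SQL).fetchall()
for row in rows:
    print(row)
```

With this data the query lists nids 1 and 3: the same GUID, stored once per feed_nid.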

While I see how making GUIDs specific to a feed source could make sense in some cases (standard RSS feeds, where the same URL could show up in multiple feeds and you might want it from both), it doesn't make sense in others (an XML datasource against which you are running multiple API queries).

I think that there ought to be a way to have a GUID which is really a GUID, i.e., for this field the query in FeedsNodeProcessor's existingItemId() method would use SELECT DISTINCT and leave out the feed_nid condition. That way it would enforce uniqueness across all feeds. (To handle legacy data that already contains duplicates, the query could instead be a select ordered by created date, ascending, limited to the first nid returned. That way, it would always match the oldest one.)
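A rough sketch of the lookup proposed above, again modelled with Python's built-in sqlite3 module rather than Drupal code. Several things here are assumptions: the function name `existing_item_id_global` is hypothetical, the lookup still filters on the importer id, and the `imported` column stands in for the "created date" ordering. The toy rows represent legacy duplicates that already exist in the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for feeds_node_item (assumption: not the real schema).
conn.execute(
    "CREATE TABLE feeds_node_item ("
    "nid INTEGER PRIMARY KEY, feed_nid INTEGER, id TEXT, "
    "imported INTEGER, guid TEXT)"
)
# Legacy duplicates: the same GUID was imported by two different feed nodes.
conn.executemany(
    "INSERT INTO feeds_node_item (nid, feed_nid, id, imported, guid) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (5, 20, "news", 200, "guid-a"),  # newer copy
        (7, 10, "news", 100, "guid-a"),  # older copy
    ],
)


def existing_item_id_global(importer_id, guid):
    """Match a GUID across all feed nodes (no feed_nid condition), oldest first."""
    row = conn.execute(
        "SELECT nid FROM feeds_node_item "
        "WHERE id = ? AND guid = ? "
        "ORDER BY imported ASC LIMIT 1",
        (importer_id, guid),
    ).fetchone()
    return row[0] if row else None


match = existing_item_id_global("news", "guid-a")
missing = existing_item_id_global("news", "guid-z")
print(match, missing)
```

Because the lookup ignores feed_nid, any feed node importing "guid-a" would be matched to the oldest existing item (nid 7 here) instead of creating another copy.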

Does anyone else have a need for this? I would even suggest that this new GUID should be called "GUID" in the interface, since it is more consistent with the standard meaning of "GUID". The other one could be called "Feed-Specific GUID" or something (I know that's a clunky name, can't think of something better offhand).

EvanDonovan’s picture

This is what I am using for now in my existingItemId function in FeedsNodeProcessor:

      switch ($target) {
        case 'nid':
          $nid = db_result(db_query("SELECT nid FROM {node} WHERE nid = %d", $value));
          break;
        case 'url':
          // ead: hacked this to return only the oldest result if there are duplicates,
          // and to be unique across feed nodes (no feed_nid condition in the WHERE clause).
          $nid = db_result(db_query_range("SELECT nid FROM {feeds_node_item} WHERE id = '%s' AND url = '%s' ORDER BY imported ASC", $source->id, $value, 0, 1));
          break;
        case 'guid':
          // ead: hacked this to return only the oldest result if there are duplicates,
          // and to be unique across feed nodes (no feed_nid condition in the WHERE clause).
          $nid = db_result(db_query_range("SELECT nid FROM {feeds_node_item} WHERE id = '%s' AND guid = '%s' ORDER BY imported ASC", $source->id, $value, 0, 1));
          break;
      }

Since this is a hack to the module code, I suppose the correct way to do things would be to create a processor that inherits from FeedsNodeProcessor. But I think my proposal in #661606-18: Support unique targets in mappers would also work, and would mean that it would not be necessary to create an entire new class just for this one thing.

twistor’s picture

Status: Active » Closed (outdated)
Issue tags: -duplicated nodes