Currently, Feeds only supports expiring imported content after a certain time or not at all.

In many cases it would be useful to expire items that no longer exist on the external data source. The resulting synchronization behavior would resemble that of a cache of the external data source.

This caching behavior could be handy when aggregating news feeds with very different frequencies (Feed A publishes 2 items a day while Feed B has published 10 items in the last year; with an expiry time of 30 days, Feed A retains 60 items while Feed B may retain, e.g., only 1).

Further, it could be used when complex join operations across the current results of various external data sources are required. In this scenario, Feeds could pull the current results of the external sources into a local cache (this "cache" could be nodes or Data tables). The full power of the RDBMS or of the node/CCK API could then be used on this cache, while the cache is continuously synchronized with the external data sources on cron.

Files: 
#181: 163-7.x-2.0-alpha7.patch (4.08 KB, gnucifer). Test bot: FAILED [SimpleTest/MySQL]: unable to apply patch.
#174: feeds-d6-synchronization-661314-174.patch (4.79 KB, gnucifer). Test bot: FAILED [SimpleTest/MySQL]: 2,564 pass(es), 0 fail(s), 55 exception(s).
#167: feeds-d6-synchronization-661314-167.patch (4.77 KB, gnucifer). Test bot: FAILED [SimpleTest/MySQL]: 2,564 pass(es), 0 fail(s), 55 exception(s).
#166: 661314-166-feeds-d6-synchronization.patch (4.77 KB, gnucifer). Test bot: FAILED [SimpleTest/MySQL]: 2,562 pass(es), 0 fail(s), 55 exception(s).
#165: 661314-165-feeds-d6-synchronization.patch (4.76 KB, gnucifer). Test bot: FAILED [SimpleTest/MySQL]: 2,566 pass(es), 0 fail(s), 55 exception(s).
#163: feeds-synchronize.patch (4.06 KB, gnucifer). Test bot: FAILED [SimpleTest/MySQL]: unable to apply patch.
#162: feeds_gc-D7-variable_set-661314-161.patch (702 bytes, byrond). Test bot: FAILED [SimpleTest/MySQL]: unable to apply patch.
#161: feeds_gc-D7-155.zip (4.94 KB, byrond).
#153: feeds_gc_D7.zip (3.64 KB, tevans).
#146: feeds_gc.zip (4.77 KB, smithmilner).
#145: feeds_gc.tar_.gz (3.07 KB, rgristroph).
#119: feeds_gc.tar_.gz (2.76 KB, amitkeret).
#118: feeds_gc.tar_.gz (2.76 KB, amitkeret).
#114: delete_orphans-feeds-6.x-1.0-beta11-661314-113.patch (8.96 KB, vimaljoseph). Test bot: FAILED [SimpleTest/MySQL]: unable to apply patch.
#107: delete_orphans-feeds-6.x-1.0-beta10-661314-107.patch (9.12 KB, vimaljoseph). Test bot: FAILED [SimpleTest/MySQL]: unable to apply patch.
#93: delete_orphans-661314-93.patch (8.32 KB, NicolasH). Test bot: FAILED [SimpleTest/MySQL]: unable to apply patch.
#90: delete_orphans-661314-90.patch (8.68 KB, NicolasH). Test bot: FAILED [SimpleTest/MySQL]: unable to apply patch.
#86: feeds-661314-86_use_expiry.patch (9.09 KB, NicolasH). Test bot: FAILED [SimpleTest/MySQL]: unable to apply patch.
#68: feeds_node_import_test.tar_.gz (999 bytes, afeldinger).
#51: feeds-661314-51_use_expiry.patch (5.96 KB, NicolasH). Test bot: FAILED [SimpleTest/MySQL]: 897 pass(es), 815 fail(s), 508 exception(s).
#50: feeds-661314_use_expiry.patch (6.04 KB, NicolasH). Test bot: FAILED [SimpleTest/MySQL]: 897 pass(es), 815 fail(s), 508 exception(s).
#49: feeds-661314_batch_delete_1.patch (9.78 KB, tanc). Test bot: FAILED [SimpleTest/MySQL]: unable to apply patch.
#47: feeds-661314_batch_delete.patch (6.61 KB, NicolasH). Test bot: FAILED [SimpleTest/MySQL]: 897 pass(es), 815 fail(s), 508 exception(s).
#42: feeds-661314.patch (5.05 KB, NicolasH). Test bot: FAILED [SimpleTest/MySQL]: 897 pass(es), 815 fail(s), 508 exception(s).
#41: feeds-661314.patch (7.86 KB, NicolasH). Test bot: FAILED [SimpleTest/MySQL]: unable to apply patch.
#37: feeds_gc_6x.1.beta9_.zip (3.46 KB, info@cgfix.com).
#12: feeds_gc.zip (2.88 KB, zengenuity).
#10: feeds-expire-by-imported.patch (5.19 KB, nelson_rp). Test bot: FAILED [SimpleTest/MySQL]: unable to apply patch.

Comments

SeanBannister’s picture

Sub

janusman’s picture

subscribing

milos.kroulik’s picture

Subscribe. I am really very interested, since this seems to be the situation on my planned site.

rjbrown99’s picture

I'm having the same issue and need this functionality. I'm importing a series of XML files, once per day. The content of the XML files could be updated, and generally this means that a record could just disappear. I have no way to currently reconcile that with the nodes that were created by the feed.

In my specific instance, what I'd like to do is to leave the nodes published and flip one of my CCK fields (is_item_available) from YES to NO. There is residual value in keeping them in the system, I just want a flag to remove them from search and facet results.

I have one other similar problem I am dealing with. In my feeds (which are products), there is a field for "instock". Pretty much everyone defaults that to YES for everything whether it is really in stock or not. I'm thinking about an enhancement that would go and visit the product page at the remote site, make sure we don't get 403/404, and then parse for a phrase such as "out of stock".

One of the ideas I had was this -

1) On import, capture the GUID of every item in the current import feed somewhere
2) For the import in question, compare the GUIDs from the feeds_node_item table against the GUIDs from the import file
3) In the event something exists in the feeds_node_item table that is no longer in the import, flip my CCK field is_item_available to NO.

That would seem to solve the first problem of expiring content. Alex, I'd love to hear your thoughts on this approach.
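
As a rough illustration of step 3 in Drupal 6 terms (a sketch only: $current_guids would have to be collected during the run in step 1, and the CCK field name and value here are just taken from my example above):

<?php
// Sketch: $current_guids was collected during this import run (step 1).
// Anything attached to this feed node whose GUID was not seen in the current
// file gets its availability flag flipped to NO.
$placeholders = db_placeholders($current_guids, 'varchar');
$result = db_query("SELECT nid FROM {feeds_node_item} WHERE feed_nid = %d AND guid NOT IN ($placeholders)", array_merge(array($feed_nid), $current_guids));
while ($row = db_fetch_object($result)) {
  $node = node_load($row->nid);
  $node->field_is_item_available[0]['value'] = 'NO';
  node_save($node);
}
?>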

rjbrown99’s picture

FYI - Alex pointed me to this node:
#289100: FeedAPI GC: handle items what are not in the feed anymore

That's a garbage collector for FeedAPI. Nice starting point. I'm currently thinking about implementing something like this as an override of the expire function in the node processor.

rjbrown99’s picture

OK, here's my new approach for this. The table feeds_node_item has a list of all items that have been imported from a feed. There is a column in here called "imported" which gets set to FEEDS_REQUEST_TIME when a node is created or updated. Based on the current behavior of the FeedsNodeProcessor, there is a bit of code near the top like this:

          // If hash of this item is same as existing hash there is no actual
          // change, skip.
          if ($hash == $this->getHash($nid)) {
            continue;
          }

So what it says is: if the hash of the item has not changed, don't update it. I decided to modify that as follows:

          // If hash of this item is same as existing hash there is no actual
          // change, skip.
          if ($hash == $this->getHash($nid)) {
            // If the item is skipped, we still update the request time. This is for garbage collection later.
            db_query("UPDATE {feeds_node_item} SET imported = %d WHERE nid = %d", FEEDS_REQUEST_TIME, $nid);
            continue;
          }

Basically, even if the node hasn't changed, I run a database query to update the imported time to the current time. The db_query avoids a node_save, so this was no big speed hit for me. If the node did change, that value would be updated by the node_save anyway.

The end result of this change is that any items that are in my current feed now have the same imported value. Anything that is no longer in my feed has an older imported value. The other side benefit of this logic is if an item disappears from a feed today but reappears tomorrow, the imported date would be brought current and we could take action on it.

My next step is to create a custom module to extend the FeedsNodeProcessor function expire to perform a check of that value in the table, and anything older than FEEDS_REQUEST_TIME will be disabled and/or have a CCK field changed. This effectively solves my garbage collection problem.
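
For anyone following along, a bare-bones version of that override could look something like this (a sketch only: the class name is made up, the expire() signature is assumed, and, as discussed further down in this thread, comparing against FEEDS_REQUEST_TIME has its own problems once batching kicks in):

<?php
class MyFeedsNodeProcessor extends FeedsNodeProcessor {
  /**
   * Sketch: unpublish nodes whose feeds_node_item.imported value was not
   * refreshed by the current import run.
   */
  public function expire($time = NULL) {
    $result = db_query("SELECT nid FROM {feeds_node_item} WHERE imported < %d", FEEDS_REQUEST_TIME);
    while ($row = db_fetch_object($result)) {
      $node = node_load($row->nid);
      $node->status = 0; // Or flip a CCK flag instead of unpublishing.
      node_save($node);
    }
    // Return value / batching omitted; this only shows the comparison.
  }
}
?>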

rjbrown99’s picture

As a continued follow up to my own message, this is working well. Instead of extending expire, I created a new function called garbage. Since I already hijacked the FeedsNodeProcessor into my own processor, I added the following underneath the node_save($node):

$this->garbage($source->feed_nid);

I pass the source feed NID into the garbage collector, and then that function looks like this:

  /**
   * Collects garbage
   */
  public function garbage($feed_nid) {
    // Return only items tied to this feed node, which are instock (Term ID 15975 YES), and were not part of this feed import
    // We assume they were removed, so we set the taxonomy value of instock to NO
    $result = db_query('SELECT n.nid FROM {node} n JOIN {feeds_node_item} fni ON n.nid = fni.nid JOIN {term_node} tn ON fni.nid = tn.nid WHERE fni.feed_nid = %d AND fni.imported < %d AND tn.tid = 15975', $feed_nid, FEEDS_REQUEST_TIME);
    while ($node = db_fetch_object($result)) {
      $fullnode = node_load(array('nid' => $node->nid));
      // Make sure we are dealing with items nodes
      if($fullnode->type == 'items') {
        // Change the instock value of the item
        $fullnode->field_stock[0]['value'] = 15976;
        // Prepare the update to the feeds_node_item table and the hash
        $hash = $this->hash($fullnode);
        $fullnode->feeds_node_item = new stdClass();
        $fullnode->feeds_node_item->hash = $hash;
        node_save($fullnode);
      }
    }
  }

This does a database query for anything tied to this specific feed node, that has an imported date older than the current feed, and has a specific taxonomy term set. That term in my case corresponds to "YES" which means the item is in stock. If we find something that is in stock but is no longer part of the feed, the idea is we want to get rid of it. So I load up the node, poke in the taxonomy term for "NO" which is term 15976 in my case, and then save the node. So the end result is anything that isn't in my current feed and was showing as in stock is updated.

Maybe that will help someone else that is having the same problem.

josephcheek’s picture

How well is this working (is it production-ready)? We want to be able to have nodes deleted when the item is removed from the RSS feed as well.

rjbrown99’s picture

So far it works just fine for what I am doing. Anything that disappears from a feed has its taxonomy field flipped from YES to NO, which then causes it to be hidden from my views and facets. I am not deleting anything, just changing CCK/taxonomy fields but that could probably be implemented fairly easily.

nelson_rp’s picture


Sorry, I didn't recognize this bug for what it was and submitted a duplicate: #735556: allow FeedsNodeProcess to expire based on age of import / last update. The attached patch applies to 6.x-1.0-alpha12 (and probably later). It allows either keeping the existing expire behavior or a new expiration behavior based on the last import time. Also, the import time is "touched" on update even if no change is needed.

rohitst’s picture

Can anyone suggest how to configure the settings in order for the garbage collector to work?

zengenuity’s picture


Here is a more general solution based on #7 that might be of greater use to people. It's a complete module. Install it, and you should have a new feed processor called "Node Processor with Garbage Collection". This will unpublish content that is no longer in the feed. It will also republish content if the item shows back up in the feed. If you want to delete items instead of unpublishing them, you only have to change a couple of lines in the FeedsNodeProcessorWithGC.inc file.

I had to copy the code over from the process function in FeedsNodeProcessor.inc. So, if that gets updated from this point on, you may want to update this accordingly.

Also, I've noticed that after you install the module, you need to clear the cache before the new processor will show up.
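
For anyone wondering what the plugin registration in such a module looks like, it boils down to a hook_feeds_plugins() implementation along these lines (a sketch; the names match the attached module, but treat the details as illustrative). This is also why the cache clear mentioned above is needed: Feeds only discovers new plugins once its plugin cache is rebuilt.

<?php
function feeds_gc_feeds_plugins() {
  $info = array();
  $info['FeedsNodeProcessorWithGC'] = array(
    'name' => 'Node Processor with Garbage Collection',
    'description' => 'Creates nodes and unpublishes those no longer present in the feed.',
    'handler' => array(
      'parent' => 'FeedsNodeProcessor',
      'class' => 'FeedsNodeProcessorWithGC',
      'file' => 'FeedsNodeProcessorWithGC.inc',
      'path' => drupal_get_path('module', 'feeds_gc'),
    ),
  );
  return $info;
}
?>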

alex_b’s picture

Status: Active » Needs work

#12: Could you make this a patch? Take a look at cvsdo in case you have problems with adding new files.

mattiasj’s picture

subscribing, would be useful in many cases

josephcheek’s picture

Any chance of getting a re-rolled patch for alpha15?

rjbrown99’s picture

Can we agree on an approach for how to handle this? There are three different options in the issue. I'm going to recap what I think I heard.

#1: For me it's pretty simple - I need to flip a taxonomy term back and forth (YES/NO) because I'm using that term in apachesolr_views to facet on items that are in stock or out of stock. I also have a need for out of stock items to remain on the site and be indexed and searchable.

#2: Others may have a different need, such as complete removal of an item that drops out of a feed.

#3: Another scenario is to expire content based on the length of time since the original import.

Finally, what does it mean to 'expire' - does it remove the node, unpublish it, or flag/mark it somehow (as in my case with the taxonomy term)? Should this be selectable by the Feeds administrator?

In all cases, I think we need something like what I posted in #6, which is the ability to tell whether or not an imported item is still in the current feed. Perhaps we can start with a patch that does that and then expand it into a few different approaches for different use cases.

Can the interested parties comment on which, if any, of the above apply to you? If there's a different scenario, can you describe it?

I'm happy to contribute some code to this - I just think we're all headed in slightly different directions.

zengenuity’s picture

Yeah, I'm happy to help out, as well. We shouldn't roll a patch until there is some agreement about how to handle this, though. How about this?

Add an option to the Node Processor that enables something like my garbage collection from #12, but it's configurable with three options:

1. Off - default
2. Unpublish nodes when no longer in feed
3. Delete nodes when no longer in feed.

I'm not sure if the taxonomy-based approach from #7 makes sense to put in the patch. I understand why it works for you, but that is a pretty specific case. It would take a lot of options to make that accessible to other users for other uses. The unpublish and/or delete probably meets the needs of most people.

I don't think that this option should be in conflict with the concept of expiring content. Those who want to expire content on a time basis should be able to do that on its own or in combination with this garbage collection option.

If others want this, I can transform my add-on module into a patch with the appropriate options.
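
Wired into the processor's settings form, that option could look roughly like this (a sketch assuming the configDefaults()/configForm() pattern Feeds processors use; the 'sync_mode' key is made up):

<?php
public function configDefaults() {
  return array('sync_mode' => 'off') + parent::configDefaults();
}

public function configForm(&$form_state) {
  $form = parent::configForm($form_state);
  $form['sync_mode'] = array(
    '#type' => 'radios',
    '#title' => t('Items no longer in the feed'),
    '#options' => array(
      'off' => t('Do nothing (default)'),
      'unpublish' => t('Unpublish nodes when no longer in feed'),
      'delete' => t('Delete nodes when no longer in feed'),
    ),
    '#default_value' => $this->config['sync_mode'],
  );
  return $form;
}
?>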

alex_b’s picture

Here's what I'm seeing:

- For 'sync' mode, we should limit ourselves to *removing* items that aren't on the feed anymore. This is to keep functionality in Feeds simple. If we break out this functionality into its own method, it can be very easily modified by third parties for custom use cases (I can be convinced otherwise).
- To determine which items exist in the feed in a scalable way, we need to 'touch' existing items by updating a feeds_node_item.touched field. When the whole feed is imported, we need to remove those items that haven't been touched. Removal will need to be batched.

Thoughts?
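
(For reference, the 'touched' column itself would just be a small schema update along these lines; a sketch, with an arbitrary update number.)

<?php
function feeds_update_6005() {
  $ret = array();
  db_add_field($ret, 'feeds_node_item', 'touched', array(
    'type' => 'int',
    'not null' => TRUE,
    'default' => 0,
    'description' => 'Last time this item was seen in its source feed.',
  ));
  return $ret;
}
?>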

kreynen’s picture

I'll throw the Open Media Project's use case into the mix. We've been using FeedAPI to synchronize events on playback servers from several vendors into a content type called om_airing. om_airing is used to display a channel's schedule, but it is also used to find openings in the schedule where producers can insert new shows.

With the options zengenuity described, we'd opt to delete nodes that are no longer in the feed since we import the nodes as unpublished by default. The issue is we'd only want to delete items between the first and last item in the feed... not items that are no longer in the playback server's feed because they have already aired.

I'd add to rjbrown99's cases...

#4: Delete/unpublish nodes whose pubDate falls between the first and last item in the feed but that are no longer in the feed.

1. Off - default
2. Unpublish nodes when no longer in feed
3. Delete nodes when no longer in feed.
4. Unpublish nodes when no longer in feed for date range
5. Delete nodes when no longer in feed for date range

zengenuity’s picture

alex_b's suggestions (#18) both sound reasonable to me.

I'm not quite as sure about the date range thing (#19). I understand what you are saying, but it seems to me that using the first and last time stamps in the feed is likely to cause some removals to be lost. Also, I believe it's possible to create feeds without time stamps at all, which would be totally broken with this option.

alex_b’s picture

Also, I believe it's possible to create feeds without time stamps at all, ...

This is true.

rjbrown99’s picture

I have a bit of an update on my garbage collector in #7 - the hashing routines are wrong. When the initial hash is saved to the feeds_node_item table, the hash is taken on the batch $item, but my collector routine captures a new hash on the $fullnode. That is quite different from what the $item value looks like so the hashes get out of sync and do not match on re-import. I have not yet solved this as I can't think of a clean way to get back to an $item value from the collector to capture an updated hash.

To sum this up, when we finally arrive at a good garbage collector make sure the $hash value in feeds_node_item is calculated properly when updated.

rjbrown99’s picture

More learnings - FEEDS_REQUEST_TIME is updated periodically during the same feeds import run. It's not a consistent value, which means it's probably not the right thing to base the garbage collector on. It seems that this is due to batching - for me I have FEEDS_NODE_BATCH_SIZE defined as 50, and I notice that FEEDS_REQUEST_TIME changes with every 50 items. My theory is that every time we cycle through a batch that variable is reset.

EDIT: I'm starting to think this is an undesired bug in feeds. FEEDS_REQUEST_TIME should be consistent across the entire scope of a single import, shouldn't it? I'd be interested in hearing from alex_b on this one - perhaps this should be its own bug report.

alex_b’s picture

More learnings - FEEDS_REQUEST_TIME is updated periodically during the same feeds import run.

FEEDS_REQUEST_TIME should be the same during a page load. It will not be the same necessarily across an import - that is if the import is batched.

rjbrown99’s picture

Thanks Alex - it sounds to me then like the intended behavior of FEEDS_REQUEST_TIME is to only keep a consistent time across a page load but not an import.

If that's the case, I would suggest that we may want to have a different constant or variable that would keep a consistent time across the span of an import. Otherwise it would be difficult to use time to determine which items are no longer in the current import. The current imported items would all have different times based on when the batch page load took place.

rjbrown99’s picture

OK, here's my shot at a bit of code to keep the same import time across the entire batch run.

  public function process(FeedsImportBatch $batch, FeedsSource $source, $limit = 0) {

  // We know we are at the beginning of a batch if offset/updated/processed are empty.
  if (($batch->offset == NULL) && $batch->updated == NULL && $batch->processed == NULL) {
    // Now capture the initial FEEDS_REQUEST_TIME into a Drupal variable based on the feed NID
    $batchstart = FEEDS_REQUEST_TIME;
    variable_set('feeds_' . $source->feed_nid, $batchstart);
  } else {
    // If we are here, we know we are in the middle of a batch so load the variable
    // D6's variable_get() needs a default; fall back to the current request time.
    $batchstart = variable_get('feeds_' . $source->feed_nid, FEEDS_REQUEST_TIME);
  }

What I'm doing is capturing the time of the very first page load/batch in a variable. Then we can access it later for expiration of items that are no longer on the current feed. Any items on the current import will have a date equal to or more recent than the $batchstart. Now in a garbage collector or expiration routine it gives us the ability to run a query to find any items that are older than that $batchstart. Anything older isn't on the current feed and should be handled appropriately.

In my garbage function, I run my SQL query by date, find all of the older nodes, deal with them, and finally release the variable:

    // Release garbage collector variable
    if (variable_get('feeds_' . $feed_nid, FALSE)) {
      variable_del('feeds_' . $feed_nid);
    }

For the first time I can say that my garbage collector is finally working to flag nodes that are no longer on the current import feed.

tayzlor’s picture

subscribing so i can take a look at this later, looks like it will be really useful!

alex_b’s picture

#26 rjbrown99 - I see.

Can you explain if and why FEEDS_REQUEST_TIME progressing through batches breaks sync mode? I am thinking that a progressing FEEDS_REQUEST_TIME is actually a separate problem that occurs only under certain circumstances. I suggest breaking this item out into a separate issue. If you agree, feel free to go ahead and do so.

rjbrown99’s picture

Sure. It's not that it breaks anything, it just makes expiration of content difficult, or at least I can't see a way to do it.

Here's a scenario. Hopefully this makes sense.

Let's say we previously imported 1,000 feed items and nodes were created. Each of those imported nodes has an entry in the feeds_node_item table, with a timestamp in the 'imported' field based on whatever FEEDS_REQUEST_TIME was when the nodes were imported.

Now we want to import a new feed of 200 items. We also want to expire/remove or take some action on any previous nodes that are no longer on the current feed. So we have to determine which items are current and which ones are old/expired.

In post #6 above, you can see that during a feeds import run I modified the code to update the date on the imported field even if the actual node wasn't changed. Let's assume we are still doing that. Now every item on a current feed has an updated value in the imported field of the database. This is the main method I am using to tell if the item is still on the feed - the imported date.

OK, so the new feed import of 200 items starts and we batch on every 50. So figure a new batch will happen every 50 items, and would produce something like this:

Batch #1: FEEDS_REQUEST_TIME = 1276926314, update imported field with this value
Batch #2: FEEDS_REQUEST_TIME = 1276926482, update imported field with this value
Batch #3: FEEDS_REQUEST_TIME = 1276926568, update imported field with this value
Batch #4: FEEDS_REQUEST_TIME = 1276926642, update imported field with this value

Because the FEEDS_REQUEST_TIME resets on each batch run, the times are different. So now we come to the end of the feeds import run and we want to figure out which of the nodes isn't on the feed any longer. The way I wanted to do this was to run a SQL query that said "select all of the previously imported items tied to feed_nid MYNID and that have a timestamp older than when we started our current feeds run." This would allow us to know which items to expire/unpublish/whatever.

The problem with this approach is that FEEDS_REQUEST_TIME changes during the run, so all of the items on the current import have different timestamps. By the time we get to the end of the batch run we no longer know what the original FEEDS_REQUEST_TIME was. So if we run a query with the time from Batch #4, it will expire all of the items no longer on the feed PLUS all of the items from Batches #1-3 since they are older than that date.

What we really need is to capture the start time for the feeds run so we have a starting point. That way we can then query for all of the older stuff. If the imported date is older than the time the feeds run starts, it clearly isn't on the feed any longer.
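
In query form, that final check would be roughly (a sketch, using the captured run-start time from #26 as $batchstart):

<?php
// Anything attached to this feed node whose imported timestamp predates the
// start of the current run is assumed to have dropped out of the feed.
$result = db_query("SELECT nid FROM {feeds_node_item} WHERE feed_nid = %d AND imported < %d", $feed_nid, $batchstart);
?>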

In #26 (with the part from #6 to update the timestamp on unchanged nodes), my approach there was to figure out if we are at the very beginning of a feeds run and then store that value in the Drupal variables table. In this case, we can then use one consistent timestamp across all of the batches. When we come to the end and the expiration action we can run the SQL query on anything older than that date.

This is now working for me and quite well. There's only one other thing I am considering/testing. Since a batch can resume in the event that it is interrupted, I need to check my code to ensure that the timestamp is not updated on a resume. Otherwise we would expire anything prior to the resume action.

This is long-winded, but hopefully it makes more sense as to what I'm trying to do. I'm open to different thoughts and approaches if there is a better way or I'm missing something. Also if this is a different issue I'm good moving there but my main goal is to determine a way to expire content no longer on the current feeds import run.

Thanks!

rbrownell’s picture

+1 Subscribe

mardok’s picture

+1 Subscribe me too

alex.designworks’s picture

Thank you so much, zengenuity (#12).

Your module works perfectly and does exactly what you described.

I would say that this is a best solution here.

morningtime’s picture

For clarity, we should make a 'conceptual' distinction between "expired" (time-based) and "end of life" (source-presence).

My idea is to keep it very simple. Add a super simple tweak to FeedsNodeProcessor (latest dev):

<?php
if (!empty($nid) && $hash == $this->getHash($nid)) {
  // Morningtime: item still exists in source, update the timestamp to prevent expiry.
  db_query('UPDATE {node} SET created = %d WHERE nid = %d', time(), $nid);
  continue;
}
?>

This minimal change has the following effect:

- items in the source feed will never expire, because they get a fresh timestamp (now)
- "end of life" items will expire after set period

It works because the expire() function in FeedsNodeProcessor checks against "n.created" on {node}.

Of course your feed must refresh more often than your expiry period!

alex_b’s picture

#33 - that's a very interesting approach, it allows us to use the 'expiry' infrastructure for item expiry and sync mode. There is a conceptual clash though as expiry works on an importer basis (across all subscriptions/feeds of a single importer) while this approach requires expiry to be executed directly after an import of a specific source.

We will need to weigh the simplicity of a unified expiry/sync mode against the speed of the current expiry mode.

rbrownell’s picture

+1 subscribe

info@cgfix.com’s picture

I tried the #12 module on the latest Feeds 6.x-1.0-beta9 and got these errors:

* recoverable fatal error: Argument 1 passed to FeedsNodeProcessor::existingItemId() must be an instance of FeedsImportBatch, array given, called in /media/tallman/serverfiles/www/drupal-6/modules/feeds_gc/FeedsNodeProcessorWithGC.inc on line 34 and defined in /media/tallman/serverfiles/www/drupal-6/modules/feeds/plugins/FeedsNodeProcessor.inc on line 308.
* recoverable fatal error: Argument 1 passed to FeedsProcessor::uniqueTargets() must be an instance of FeedsImportBatch, array given, called in /media/tallman/serverfiles/www/drupal-6/modules/feeds/plugins/FeedsNodeProcessor.inc on line 312 and defined in /media/tallman/serverfiles/www/drupal-6/modules/feeds/plugins/FeedsProcessor.inc on line 226.
* recoverable fatal error: Argument 1 passed to FeedsParser::getSourceElement() must be an instance of FeedsImportBatch, array given, called in /media/tallman/serverfiles/www/drupal-6/modules/feeds/plugins/FeedsProcessor.inc on line 233 and defined in /media/tallman/serverfiles/www/drupal-6/modules/feeds/plugins/FeedsParser.inc on line 84.

info@cgfix.com’s picture


Yeah, I fixed it. I was able to bring the #12 module feeds_gc up to date with beta9, so now it works. I just had to swap out a couple of arguments for the appropriate variables. Cool.

Here it is for anyone who needs it. Can we submit this as an actual module or module plugin? I don't know how.

info@cgfix.com’s picture

I think this makes a great module or patch, but I do have to wonder why there isn't simply an option that deletes all items before updating, making every update a fresh import. It wouldn't allow you to retain unpublished nodes for review, but it seems like it would be a whole lot easier.

NicolasH’s picture

+1 for this...

Thanks zengenuity #12 for the module and clausjohnson #37 for updating.

So, @alex_b, is it reasonable to take this approach and:

  1. add a config option of off/unpublish/delete, as suggested in #17, and
  2. use an extra field "touched" in the "feeds_node_item" table to handle this, as you suggested in #18?

Would it also make sense to just change the status field in the node table directly with the unpublish option, rather than a node_load and node_save?

Are there other conceptual things to consider, or would that make an acceptable patch?
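
For comparison, the direct-update variant asked about above would be just the following (a sketch; the trade-off is that it bypasses node_save() and therefore hook_nodeapi(), so modules that react to node updates, such as search indexing, won't notice the change):

<?php
// Unpublish directly, skipping node_load()/node_save().
db_query("UPDATE {node} SET status = 0 WHERE nid = %d", $nid);
?>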

jackocnr’s picture

Thanks for this guys - exactly what I was looking for. I used the tweaked module in #37. I decided I wanted to delete old nodes instead of unpublishing them, so I thought I would post the tweaks you need to make to FeedsNodeProcessorWithGC.inc to get that working: delete everything in the while loop on line 139, and just put the following statement in the loop:
node_delete($node->nid);
You can also safely delete the rest of that function (except the return statement).

Hope this helps someone - it worked for me!

NicolasH’s picture

Assigned: Unassigned » NicolasH
Status: Needs work » Needs review

I've rolled a patch based on #39. It will add some config options to the node processor settings screen.

You will need to run update.php and clear the caches.

There are a few things I'm not sure about yet... it now explicitly sets nodes to published if they are in the feed. This is to accommodate items that were removed (causing a node to be unpublished) and then reappeared, which should make the corresponding node published again. However, it's a bit heavy-handed, since it does this to all nodes in a feed, always.

This means if a user chooses to unpublish a few nodes manually, the import would always override this decision.

Another idea was to integrate workflow into this...rather than only being able to unpublish or delete, change a workflow status and have actions available. Maybe overkill, I don't know.

NicolasH’s picture


Ok, after some more testing I found a major flaw... currently the deletion of nodes isn't really batched; it happens after the batched importing. That doesn't matter much on a medium-sized feed, but it becomes very obvious if you have a CSV file where, say, 3,000 out of 5,000 items have just been removed: Drupal then starts to delete these nodes in one page request, and PHP will most likely time out.

I guess checking whether the number of nodes about to be deleted justifies batching is the right approach...will be looking into that.

Anyway, I think this works currently for most use cases, so pls test...slightly updated patch attached.

tanc’s picture

Excellent work NicolasH, just what I was looking for. The patch in #42 applied cleanly against beta 10 and the functionality works as intended in preliminary testing. I don't have thousands or even hundreds of feed items to keep in sync so I can't speak for large imports/syncs.

alex_b’s picture

Status: Needs review » Needs work

Ok, after some more testing I found a major flaw...currently the deletion of nodes isn't really batched, it happens after the batched importing.

This is indeed a serious flaw. It would not only affect users of large feeds but anybody who enables sync mode after having aggregated for a while from small feeds. This query quickly loads up way more nodes than can possibly be deleted on a page load:

SELECT n.nid FROM {feeds_node_item} fni LEFT JOIN {node} n ON fni.nid = n.nid WHERE fni.feed_nid = %d AND fni.touched < %d

What we could do is work with an expiry flag and leave the actual node deletion up to the expire() process. On import, all nodes that aren't present on the feed would be set to 'expire', and on expire() those nodes would be deleted. The expire method would follow the same procedure itself: set all nodes that are to be expired to expire, then batch them off.
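
In rough code, that flag-and-batch idea inside expire() might look like this (a sketch only: the 'expire' flag column and the batch size are assumptions, and the actual patches below end up using an 'orphaned' column instead):

<?php
// Delete a limited slice of flagged nodes per pass.
$result = db_query_range("SELECT nid FROM {feeds_node_item} WHERE id = '%s' AND expire = 1", $this->id, 0, 25);
while ($row = db_fetch_object($result)) {
  _feeds_node_delete($row->nid);
}
// If flagged rows remain, the expire process is expected to be called again.
$remaining = db_result(db_query("SELECT COUNT(nid) FROM {feeds_node_item} WHERE id = '%s' AND expire = 1", $this->id));
?>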

On another note: I'd rather not support an 'unpublish' feature for sync. If it's not really needed, let's not do it.

NicolasH’s picture

@tanc
I really need to update the patch again, but I've been putting it off until we have done some extensive testing in-house on a real project...hopefully happening this week. Pls don't use this in production...as soon as you reach a number that extends beyond the first batch, you will get strange behaviour. This is apart from what alex_b refers to above. It's easily fixed, I just don't want to throw another five patches up if I find something else later.

@alex_b
Glad you're chiming in...I had a quick look at expire() when I pointed out the flaw, but was hoping for some input from you...this sounds good. So I'll add the expire flagging in the next patch....

On not supporting 'unpublish': we have a few use cases where we'd like to take a different action than simply deleting the node; it's hard to tell whether this would be the same for a lot of people. Hence the suggestion about some possible workflow integration. Maybe in an add-on module?

alex_b’s picture

So I'll add the expire flagging in the next patch....

Glad you're planning on following through with this patch. Note that there is a Drupal 7 branch of Feeds now as well, so you'll have to roll a patch for both branches. There should also be tests to avoid any regressions introduced by future changes. Expiry in 7 is being fixed by #930652: Expiry batching broken.

We have a few use-cases where we'd like to take a different action than simply deleting the node

Just wanted to make sure it's concretely needed :)

NicolasH’s picture


After looking at it again, I think mixing this in with the expire() process could be confusing...the current expiry related functionality and terminology means something quite different...although it's a nice way to utilise what's already there.

For an approach where this sync functionality is separate, here's a rough patch, mainly to get some input on how the batching is best integrated in such a scenario. It introduces a new field, "orphaned", into feeds_node_item.

The batched deletion after import currently works, but it doesn't seem right to call feeds_batch_set() again in FeedsNodeProcessor...it also doesn't display the "Deleting" title...I'm clearly missing something.

Anyway, if this doesn't seem right, I'll go the expire() route...

tanc’s picture

New patch in #47 is working after 24 hours of testing sync with a feed that changes every few hours. Will update if this situation changes.

tanc’s picture


I spoke too soon. Patch in #47 wasn't working after all, in that pretty much everything was getting flagged as orphaned and the clearOrphans method wasn't being triggered in the batch. I've supplied a patch which seems to work for me in preliminary testing. Can someone take a look and review it?

I also had to add a check to see whether there were any items to be processed at all, as sometimes my feeds would time out and all the items would be marked as orphaned. Is there a better value or error to check in the process function that will show if there is a problem or timeout? Currently I've just used $batch->getTotal(FEEDS_PROCESSING) > 0 as a check, but it may not be the best way.

NicolasH’s picture

Status: Needs work » Needs review

Yeah, thanks tanc, I also discovered some timestamp issues. I've abandoned the approach in that patch of doing the pruning after the import and have followed Alex's suggestion to wrap it up with the expire() process. I've also limited it to deleting nodes - no unpublish...although I might follow up on that in a new issue if this gets accepted.

The changes are reasonably minimal, I really hope this gets at least some more feedback...

NicolasH’s picture


Fixed a bug with the flagging query where one feed was able to orphan items from another. I didn't run into this because I only used one feed initially.

Also fixed is the proper unscheduling of the expire process, if deleting orphans is switched off after being switched on.

tyler-durden’s picture

Version: 6.x-1.x-dev » 6.x-1.0-beta10

Is this patch going to be in a dev or beta soon, or does it need more testing? This is going to have to be a required feature for me later this spring, so I can test if needed. I'm currently using Feeds 6.x beta10 on live sites, so I do want to be careful. Has anyone tested it, and is it stable enough?

NicolasH’s picture

TD, we're running this in production, but on a very specific use case: A CSV file with several thousand items with a handful added/deleted every day.

Testing would be very much appreciated. As you can see, there hasn't been much activity here...I know Alex will want tests and a D7 version, but as long as I don't know whether this approach is acceptable I don't want to put any more time into it.

tyler-durden’s picture

Great, thanks for the info NicolasH

I need this feature and I will be running tests mid month, hopefully to make it live by the EOM if it tests ok. I will keep you updated. On my initial live run I will have a little over 100 items, with the handful of items changing maybe once a week so it may take more time to test properly.

I will certainly keep you updated but it may be a month+ or so away.

imclean’s picture

This looks like a great feature, thanks for all the work. Does this apply to the node processor only? We're using the Data module and it would be very handy to be able to replace all data for a particular feed.

In our case, importing a CSV file into a data table should replace all existing data.

NicolasH’s picture

imclean, yes it does only apply to the node processor and it will compare the items of the feed with what is already existing as nodes in Drupal, deleting only what is missing in the feed.

It seems you are wanting to completely purge all data before doing a fresh import. That's more likely to be feasible with Data tables, since it's a lot less expensive than deleting a lot of nodes. This patch doesn't really deal with that.

However, I think it would also be a good feature to include. In the meantime, you could probably achieve the same thing by using one of the Feeds hooks:

<?php
function MYMODULE_feeds_after_parse(FeedsImporter $importer, FeedsSource $source) {
  if ($importer->id == 'my_data_feed') {
    // Empty the table before these new items get added.
    $my_data_table = data_get_handler('feeds_data_stuff');
    $my_data_table->truncate();
  }
}
?>

That should parse the feed and, just before inserting it into the data table, delete all existing contents. I don't have access to a database at the moment; there might be more tables to clean up...

imclean’s picture

Thanks NicolasH, that does the job nicely. I was already using that hook for some other stuff so just added the required code.

I've had a bit of a look but couldn't find anything else which required tidying up. The Data module keeps things pretty lean.

afeldinger’s picture

I have just implemented the same basic functionality, but with a completely different approach. I simply register the node ids of the created or updated nodes in an array. When the processing is done, I pull out all the nodes associated with the import node from the database and remove the orphaned items.

// file: FeedsNodeProcessor.inc
// public function process()

// Add an array outside the loop to track processed nodes.
$processed_nodes = array();

// After finding the NID, add it to the list of processed nodes if not already present (outside the IF).
if ($nid && !in_array($nid, $processed_nodes)) {
  $processed_nodes[] = $nid;
}

// If a new node was created, add its NID to the array after node_save().
if (!in_array($node->nid, $processed_nodes)) {
  $processed_nodes[] = $node->nid;
}

// Delete-orphans functionality added after regular processing. Note: not batched.
if ($this->config['update_orphans'] == FEEDS_DELETE_ORPHANS) {
  $orphaned_total = db_result(db_query("SELECT COUNT(nid) FROM {feeds_node_item} WHERE id = '%s' AND feed_nid = %d AND nid NOT IN (%s)", $source->id, $source->feed_nid, implode(',', $processed_nodes)));
  if ($orphaned_total > 0) {
    $deleted = 0;
    $deleted_nodes = array();
    $result = db_query("SELECT nid FROM {feeds_node_item} WHERE id = '%s' AND feed_nid = %d AND nid NOT IN (%s)", $source->id, $source->feed_nid, implode(',', $processed_nodes));
    while ($node = db_fetch_object($result)) {
      _feeds_node_delete($node->nid);
      $deleted_nodes[] = $node->nid;
      $deleted++;
    }
    if ($deleted) {
      drupal_set_message(format_plural($deleted, 'Deleted @number orphaned @type node.', 'Deleted @number orphaned @type nodes.', array('@number' => $deleted, '@type' => node_get_types('name', $this->config['content_type']))));
    }
    else {
      drupal_set_message(t('No orphaned nodes to delete'));
    }
  }
}

Your approach looks like it's more thought through, though, and I still haven't gotten around to adding batch processing. I'll give this patch a try and report my findings.

What's the probability of the patch getting into dev eventually?

NicolasH’s picture

@afeldinger Yes, very similar. Bear in mind that there is no batching in that approach. As mentioned somewhere above, if you have been running a feed for a while and have a lot of nodes that will now have to be deleted, you will most likely run into timeout/performance/memory issues.

As for probability, I have no idea...there hasn't been any feedback yet. Can't hurt to test it though and report your findings.

NicolasH’s picture

Whoops, of course you already mentioned the missing batching...:)

Anyway, I can't think of another way than flagging it in the database (as opposed to using an array at import time) if you want to schedule batches to be deleted on cron. The expiry process is pretty neat and it makes sense to piggyback it off there...let me know if you can think of other ways.

afeldinger’s picture

Hm, I'm having a bit of trouble after applying your patch. Orphaned nodes are scheduled for expiry, but for some reason they aren't deleted. I can see from my log that cron runs and Job Scheduler finishes just fine, but my orphaned nodes are still present.

Setting the 'Expire nodes' dropdown doesn't seem to change anything.

Am I missing something?

arielkung’s picture

Subscribing (#62).

arielkung’s picture

When you run cron logged in as administrator, nodes are deleted, but if it runs anonymously (from crontab), they aren't.

NicolasH’s picture

Thanks for the testing!

@afeldinger, the "Expire nodes" setting shouldn't really influence deleting orphaned nodes; all that should be required is checking "Delete nodes for missing items". Can you check whether it makes any difference if you run this as user 1, as suggested by arielkung in #64?

afeldinger’s picture

I'm running the cron job manually, while logged in as user 1. I didn't think the expire nodes setting should be necessary, but I can't seem to find the root of the problem.

NicolasH’s picture

Is there a chance of you trying this with one of the CSV files in the test folder and saving the setup as a feature to post here?

afeldinger’s picture


Alright I've tried this out on a clean Drupal 6.20 install with only the required modules. Imported file is /feeds/tests/feeds/nodes.csv, and it's set to import to the corresponding fields in a standard story node type.

It's still scheduling the correct items for deletion, but they still aren't getting deleted at cron. I've tried running cron manually from the status report and issuing the command from terminal, both with no luck.

I'm currently messing about with the settings, mainly to see if I accidentally strike gold.

I was sort of hoping it would work on the clean install, meaning that it was something in my code that was breaking the functionality, but no such luck. I've attached the setup as a feature, but there really isn't much in it.

NicolasH’s picture

Thanks for that, it does make it a bit easier. Apart from some oddness with the entries in the CSV file (2 have duplicate GUIDs, I think that's intentional for testing), I can't replicate them not deleting on cron.

Here are the steps I took:

- Import nodes
- Delete 2 entries from the CSV file now in files/feeds
- Import again. A message says that 2 orphaned nodes have been flagged for deletion.
- Check in DB feeds_node_item whether the orphaned column has been really set to 1.
- In DB, go to feeds_importer_expire and set the appropriate "next" timestamp to 0, to make sure this will run on next cron.
- Go to cron.php, which in my case deleted the flagged nodes.

Anything in this that would point to a difference?

NicolasH’s picture

Version: 6.x-1.0-beta10 » 6.x-1.x-dev

And I do all this against dev...

thedavidmeister’s picture

subscribing.

My use case:

We have an existing CMS at work that some staff use to update two different tables: one has about 2,000 records that are updated occasionally but never expire (these represent real-world places); the other contains records that expire after no more than a month, and we input 300+ a week (representing events at these places).

The records in the second table reference records in the first table by id.

That CMS can be set to export this data as XML, or CSV if needed, and every record has an updated date.

I want to import both into Drupal, where each record is a node, and retain the relationships between records, but that's another story :P

The problem is that the first type of record shouldn't simply "expire", as updates should be reflected in Drupal by the next Cron run.

The second type of record should definitely be totally deleted from the database when they expire, or my server will be drowning in tens of thousands of nodes within no time at all.

The only solution that I can see would be to implement some kind of "update the node if it exists in the feed, else delete it", which, I believe, is what this thread is all about. Right?

NicolasH’s picture

thedavidmeister, feeds module will already update existing nodes if the corresponding item is updated - if you tell it to. No problem there. This issue is about deleting nodes that aren't present in the source anymore ASAP - if you tell it to. So yes, this thread is about the second half of your requirement.

afeldinger’s picture

My process is pretty much the same, EXCEPT:

1. I don't have a feeds_importer_expire table
2. I've been running beta10

So I tried using dev instead, but your patch from post 51 fails to apply. Can you roll one against dev?

NicolasH’s picture

Sorry, missed a step there (or a better explanation). feeds_importer_expire is the callback entry in the job_schedule table.

I'm pretty sure I applied that patch against dev...will check again tomorrow.

Just to double-check, you have an "orphaned" and "touched" column in the feeds_node_item table, yes?

afeldinger’s picture

OK, it's working now - thanks for all your help. The job_schedule 'next' value was the missing piece. By the way, applying the patch to dev works just fine, if you apply it in the right directory. That'll teach me not to work in too many locations at once (%&"#!$€%). Sorry about the confusion.

I'm still unsure exactly why the 'next' setting is necessary - if the expire setting has no influence when deleting orphaned nodes, shouldn't the job run with every cron run?

NicolasH’s picture

Awesome!

Well, I wanted this to be as close to the existing feeds way of handling things as possible. Running heavy processes simply on every cron run is not always viable - I assume that is the reason for job scheduler to be there in the first place (maybe there is more to it).

I could have created a separate job to delete the orphans, say feeds_importer_delete_orphans, but it would have duplicated a lot of code for not much gain. That's why it piggybacks on the already existing job. But as always, there's many ways of doing this...

afeldinger’s picture

Fair enough. Well for my $0.02 it looks like it's working as intended. I think the "your nodes won't be deleted until the next +1 hr cron run" is an important piece of information - at least while testing.

This is actually the only thing that wasn't immediately apparent - the rest of it makes excellent sense. And the delayed job scheduling is indeed a valid point from a performance point-of-view.

Excellent job implementing this functionality Nicolas - an extremely useful (and necessary) addition to an already great module. Two thumbs way, way up.

Attila’s picture

I'm having the same issue and I'd also like to apply the patch but there are a few things which are not quite clear to me:
- Why is the patch made for the dev version? Isn't it possible to use the beta10 instead? Or doesn't it make any difference?
- What do I need to change in the module settings in order to make it work? Just the "Expire nodes:" setting in the node processor settings?
- I need to use the patch from #51, right?

Sorry for my noobish questions but I'm pretty new to Drupal and especially the Feeds module.

Edit: I just ran cron once again, and it seems to work for me for some reason. The only problem I'm having is that I need to run cron twice to import all 91 nodes.

NicolasH’s picture

@Attila
The patch is rolled against the dev version because that's where it most likely would be applied should it ever get accepted by the module maintainer. I haven't tried it against the beta, but there's a good chance it would work. I don't think much has happened in the D6 dev version over the last few months, so they are probably very similar.

If you apply the patch (yes, from #51), make sure you run update.php afterwards for the database changes. The only setting that needs to be enabled is "Delete nodes for missing items" under admin/build/feeds/edit/[YOUR_FEED_NAME]/settings/FeedsNodeProcessor.

Your comment about having to run cron twice doesn't seem to regard this patch, but rather feeds in general? If you want the import process to run every time cron runs, you need to make sure the "Minimum refresh period" setting under admin/build/feeds/edit/[YOUR_FEED_NAME]/settings is set to "As often as possible".

@afeldinger
Thanks for your support for this patch. Let's hope it will pop up on the module maintainer's radar soon. We'd be happy to support this with financial incentives, but I don't think that's the issue here - seems more to be a time/attention constraint (or the approach is so bad that it's been abandoned :)).

Anyway, if this gets any traction again, I'm happy to put some more informative messages/watchdog logs into place. I think the job scheduler abstraction of cron confuses people in general, with or without this patch. You run cron and nothing happens...run it again after 2 hours and it works (depending on the job scheduler settings).

Maybe a dev setting for job scheduler could address this...while turned on, all jobs run on every cron? I think it would eliminate a lot of support requests for feeds.

mikesir87’s picture

Subscribing

Dave Kopecek’s picture

Subscribing

karengreen’s picture

How do I apply the patch in #51 to beta10 using TortoiseSVN Diff?

Do I need to apply previous patches as well, or can I just add this patch onto my beta10 version?

Thanks,

NicolasH’s picture

Don't know about TortoiseSVN. You would already need to have the module in your own SVN repository, then probably right-click on the module folder and find an "apply patch" command or something similar.

I did use TortoiseSVN on Windows and it's fantastic, but I never applied a patch with it... I'm not on Windows anymore, so I can't provide any help on that.

There is the generic patching handbook of course: http://drupal.org/patch/apply. You'll only need the patch in #51, but I can't confirm that it applies properly against beta10...probably will, though.

karengreen’s picture

Thanks NicolasH. The patches I'm used to normally have -- (to delete a line) and ++ (to add a line), so I couldn't apply the patch automatically.
Instead I did it manually, and it worked fine in beta10.

The batch deletion doesn't seem to work for me, though. If I upload an XML file with a node missing, it marks it as orphaned, but nothing happens after that. If I run feeds_cron() or job_scheduler_cron(), it doesn't trigger the batch deletion.

I've had to write a quick cron function that does the deletion:

$result = db_query("SELECT * FROM {feeds_node_item} WHERE orphaned = %d", 1);
while ($node = db_fetch_object($result)) {
  _feeds_node_delete($node->nid);
}

Any ideas?

imclean’s picture

That patch does seem a bit odd, see the patch by tanc at #49 for the usual format. Use diff -up.

http://drupal.org/patch/create
http://drupal.org/node/583948/cvs-instructions/DRUPAL-6--1

NicolasH’s picture


Oh noes, I haven't used the -up flag, I think. Rerolled version attached. Apologies to those who had trouble applying.

@karengreen: I assume you're having the same issue as a few others before... running cron does not mean that the expire process runs; Job Scheduler handles that.

To make sure it runs when cron runs, set the "next" field for "feeds_importer_expire" in the "job_schedule" table to 0, then run cron.
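
In SQL terms (using the table and column names mentioned above), that is simply:

<?php
// Force the expire job to be picked up on the next cron run.
db_query("UPDATE {job_schedule} SET next = 0 WHERE callback = 'feeds_importer_expire'");
?>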

karengreen’s picture

I'm not sure if it's down to this patch or Feeds directly, but I've noticed that after importing a feed, none of the rows in the feeds_node_item table have a feed_nid set, so the delete feed tab doesn't work anymore.

(I'm importing using the standalone form, so it is not attached to anything.)

Any ideas?

that0n3guy’s picture

I just tested out #86. I patched beta10, which worked fine, and it seems to be working when I manually run cron, though I did follow these directions:

To make sure it runs when cron runs, set the "next" field for "feeds_importer_expire" in the "job_schedule" table to 0, then run cron.

I will let you know if something is funky when it just runs automatically over time.

NicolasH’s picture

@karengreen: Is the feed_nid set to 0? If so, that's expected behaviour if you use the standalone form. As for nodes not being deleted, can you confirm it's due to the patch?

NicolasH’s picture


There's been some more activity again. I've re-rolled this against the current dev version via git. No functional changes compared to #86, but the update hook in the install file had to move up a number, so this should really be applied from scratch.

bambilicious’s picture

Subscribing.

manu manu’s picture

Status: Needs review » Needs work

Hi,

It seems that there is a problem with FeedsNodeProcessor::flagOrphans() for feeds which are not using standalone forms.
In my case, 60K nodes were flagged by a feed which contains only 10K nodes.

FeedsNodeProcessor::flagOrphans() should take $source->feed_nid into consideration, as in this quick'n'dirty if/else:

/**
 * Flag nodes that don't have a corresponding feed item in the feed anymore.
 */
public function flagOrphans($batch, $source) {
  if ($source->feed_nid) {
    db_query("UPDATE {feeds_node_item} SET orphaned = %d WHERE id = '%s' AND feed_nid = %d", 0, $source->id, $source->feed_nid);
    db_query("UPDATE {feeds_node_item} SET orphaned = %d WHERE id = '%s' AND touched < %d AND feed_nid = %d", FEEDS_NODE_ORPHANED, $source->id, $batch->started, $source->feed_nid);
  }
  else {
    db_query("UPDATE {feeds_node_item} SET orphaned = %d WHERE id = '%s'", 0, $source->id);
    db_query("UPDATE {feeds_node_item} SET orphaned = %d WHERE id = '%s' AND touched < %d", FEEDS_NODE_ORPHANED, $source->id, $batch->started);
  }
  drupal_set_message(t('Flagged %num orphaned nodes for deletion.', array('%num' => db_affected_rows())));
}

NicolasH’s picture

StatusFileSize
new8.32 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch delete_orphans-661314-93.patch.
[ View ]

Ha, of course, thanks manu manu. I only use standalone forms, so this never came up for me.

Since feed_nid is 0 by default if the importer is not attached to a node, there's not even a need for a conditional statement. I just had to add feed_nid to the query that's already there.

<?php
/**
 * Flag nodes that don't have a corresponding feed item in the feed anymore.
 */
public function flagOrphans($batch, $source) {
  db_query("UPDATE {feeds_node_item} SET orphaned = %d WHERE id = '%s' AND feed_nid = %d", 0, $source->id, $source->feed_nid);
  db_query("UPDATE {feeds_node_item} SET orphaned = %d WHERE id = '%s' AND feed_nid = %d AND touched < %d", FEEDS_NODE_ORPHANED, $source->id, $source->feed_nid, $batch->started);
  drupal_set_message(t('Flagged %num orphaned nodes for deletion.', array('%num' => db_affected_rows())));
}
?>
NicolasH’s picture

Status:Needs work» Needs review

And back to review....

manu manu’s picture

Thanks Nicolas,

It makes sense and works well for me.

manu manu’s picture

Status:Needs review» Reviewed & tested by the community

Changing status, as I have been testing it in production for 2 weeks without problems.

This really should be included in feeds...

math-hew’s picture

OK, I'm stumped. I thought this patch might solve my issues but after reading through the whole thread I don't think it will because the nodes are still going to be completely overwritten when they're updated (right?).

Here's the long version of my situation. Skip below for the TL;DR.

I'm the web manager for a large organization (1,000+ employees) with lots of different departments, many of which have public Drupal-based websites with faculty/staff listings (grid-style views of a "Person" content type, filtered by "Department" taxonomy). My team maintains those listings manually - if someone leaves a department, or their phone number changes, or they get a new portrait photo taken, we just update the node by hand, and that was fine. The problem is, people don't always let us know, and we're duplicating effort because this info already exists in a separate non-public directory which is kept up-to-date by HR. So often someone will leave the organization but I never find out about it, and a year later their manager will email me saying "OMGWTF Jim Jones is still listed on our site, he left a year ago" and I have to explain that I don't know these things unless someone tells me... and so on.

So my bright idea was to get our IT department to provide a nightly CSV data dump of the internal directory which I'd then pull in via Feeds. After much gnashing of teeth, the data dump was set up, and it works fine except for the whole "delete or unpublish nodes that are deleted from the source" issue AND the fact that for many departments, I need to add additional fields that don't exist in the source CSV - for example, some departments want biographies listed for their staff members, and that info doesn't exist in the internal directory.

Conclusion/TL;DR: What I want to have happen is for specific fields to be updated by Feeds, but not for the entire node to be overwritten because then any additional information I've added to fields that don't exist in the source file will be erased. I am completely, utterly stumped as to how to make this happen.

Status:Reviewed & tested by the community» Needs work

The last submitted patch, delete_orphans-661314-93.patch, failed testing.

thedavidmeister’s picture

@#97: I think maybe you misunderstand something about the way Feeds works. This thread is about getting Feeds to delete nodes once they are no longer found in the source, and it seems only partially relevant to the use case you just described.

There's a setting in Feeds already, for the Node processor, that lets you update existing nodes instead of replacing them, but you definitely need something smart set for your GUID so you can keep track of what is being mapped to where.

I don't see why the patch here would change the way that the nodes are processed; it should only change the way they are expired.

If you don't want some data fields overwritten, then don't map anything to those fields and you should be fine.

NicolasH’s picture

Status:Needs work» Needs review

#93: delete_orphans-661314-93.patch queued for re-testing.

NicolasH’s picture

Status:Needs review» Needs work

@#97: This patch should address the deletion of employee entries if they are removed from the source, provided (as mentioned in #99) there is a proper GUID and the feed is properly configured to use it.

Have you applied the patch and found that Feeds is not deleting removed employees? Maybe let's start there. You should be able to address those requirements with Feeds and this patch.

Can anybody with more test bot experience give me pointers as to why the patch doesn't apply?

manu manu’s picture

Maybe the files targeted by the patch have changed in the 6.x-1.x-dev branch?

Aeotrin’s picture

Does this apply to the 7.x versions, or is it already included in 7.x? If so, can I ask how to achieve this functionality (deleting a node if it no longer exists in the CSV file being uploaded)?

NicolasH’s picture

@Aeotrin: I haven't used Feeds in 7, but I'd be surprised if the functionality is there, given that there has been no input from maintainers here since November last year...

I'll be only updating the patch if someone actually finds a bug, but wouldn't consider a 7 version until this goes somewhere. Maybe someone else would, though.

EvanDonovan’s picture

Hope the maintainers take a look at this - it seems like it would be a great feature to have, since I presumed that this is what "expire" already did (rather than expiring everything, even the items that are still in the feed). I presume #93 is the one to test (on 6.x). I don't know why the test bot would be failing - I can't get the PIFR results to show up here.

EvanDonovan’s picture

Just started testing the patch - although my version of the FeedsNodeProcessor class was already somewhat hacked to give me better logging (since I was trying to figure out what was going wrong with how I thought "expire nodes" would work). I'll let you know whether this works, and if it doesn't seem to, I'll remove my hacks to see if those were what was causing the problem.

vimaljoseph’s picture

StatusFileSize
new9.12 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch delete_orphans-feeds-6.x-1.0-beta10-661314-107.patch. Unable to apply patch. See the log in the details link for more information.
[ View ]

I've tested the patch in #93. In it, the field "touched" is updated inside the condition

<?php
if (!($nid = $this->existingItemId($batch, $source)) || ($this->config['update_existing'] != FEEDS_SKIP_EXISTING))
?>

but this will not work if we choose not to update existing nodes.

I've moved $nid = $this->existingItemId($batch, $source) outside the condition and made the following change to the patch in #93 to get this working without enabling "Update existing nodes".

<?php
$nid = $this->existingItemId($batch, $source);
if ($this->config['delete_orphans']) {
  if (!empty($nid)) {
    // To be able to take action on items that have been removed, just indicate that this was still in the feed.
    db_query("UPDATE {feeds_node_item} SET touched = %d WHERE nid = %d", FEEDS_REQUEST_TIME, $nid);
  }
}
// Create/update if item does not exist or update existing is enabled.
if (!($nid) || ($this->config['update_existing'] != FEEDS_SKIP_EXISTING)) {
?>

The default expiry period is set to 3600 seconds. I've commented out that piece of code so that the job will be processed immediately. I'm not sure about the performance implications of this change.

Attaching a patch against the 6.x-1.0-beta10 version with both these changes. Please tell me whether there is anything wrong in the above code.

To apply the patch, use:

cd feeds
patch -p1 < delete_orphans-feeds-6.x-1.0-beta10-661314-107.patch
EvanDonovan’s picture

Status:Needs work» Needs review

I tested the patch in #93, and it seems to be working well. I didn't test the patch in #107, or the case that would make it different. I might not have time now, since #93 meets our use case, and this project has already gone long for us.

If #107 is a re-roll of #93 with a fix, the current status of this issue should be "needs review". Ideally, the patch should be a re-roll against 6.x-dev though.

EvanDonovan’s picture

@vimaljoseph: Was there a particular reason why you commented out the 3600 seconds (one hour) expiry time? For large batches of nodes being imported, that could cause problems.

As it is, the flagging and the expiration are separate actions by design, so that they can usually happen on separate cron runs.

vimaljoseph’s picture

@EvanDonovan: That was for my specific application, where the feeds I'm downloading change very frequently, so if I set the expiration time to 1 hour (which is the default for the Feeds module), our application will be out of sync with the original feed.

EvanDonovan’s picture

Status:Needs review» Needs work

@vimaljoseph: So that should probably not be a part of the patch itself then, since most people won't need that.

NicolasH’s picture

I agree that that value shouldn't be hardcoded, but it's got nothing to do with this issue, so it would probably be better raised as a separate one.

It should just be set via a variable and mentioned in the "hidden settings" in the README, like a bunch of other values that can be set in settings.php or similar.

<?php
$job['period'] = variable_get('feeds_expire_period', 3600);
?>
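
With a change like that in place, a site could then override the period from settings.php, for example (the variable name is the one from the snippet above, which is not in Feeds yet):

<?php
// settings.php: reschedule the expire job every 10 minutes instead of hourly.
$conf['feeds_expire_period'] = 600;
?>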

@vimaljoseph, would you mind taking that functionality out and re-rolling the patch?

EvanDonovan’s picture

@NicolasH: agreed.

There is a "needs review" patch in #927782-35: No new items on cron (issues with job scheduler configuration) for making the $job['period'] dependent on the import period for the feed. I'm not sure if that is the correct solution though. If not, then perhaps a separate issue should be created for that.

vimaljoseph’s picture

StatusFileSize
new8.96 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch delete_orphans-feeds-6.x-1.0-beta11-661314-113.patch. Unable to apply patch. See the log in the details link for more information.
[ View ]

@NicolasH: I've removed the $job['period'] change from #107 and applied your patch from #93, with the fix I mentioned in #107, to the beta11 released today. Attaching it to this ticket.

NicolasH’s picture

@vimaljoseph: Thanks for that.

I need to create a test setup to go through this in more detail on the weekend and then hopefully set it back to RTBC. Apologies to people who really want this committed, I've moved on from the actual use-case project, so it's a bit harder to make up the different scenarios in a pure "hypothetical" environment.

james_g’s picture

For anybody needing an immediate solution to remove nodes that are no longer in the feed, I just posted: http://drupal.org/project/views_now
This module allows the current server timestamp to be used as a views field. By setting the feed importer to update the node created time with that timestamp, its expiration timer gets reset on every feed update. (Example: feeds update every 1 hour and expire is set to three hours.) Items no longer in the feed will expire and be removed after the three hours.

My intention was to create a quick solution that does not involve patching the Feeds module. Because of this, once a solution is agreed on, tested, and committed to Feeds, it could be removed easily and replaced with the updated Feeds module.

morningtime’s picture

The suggested solutions probably work for news items, where only the "oldest" items are removed. But they don't work for syncing product listings, where "obsolete" products are removed at random. In that case you need a separate sync track: a fast comparison of local versus remote items, then removal of the obsolete items.

This sync could happen at the start or finish of a feed refresh, like a hook/callback: "if feed refresh complete: sync local/remote and delete obsolete." It could work with the already downloaded feed to avoid sync issues.
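
To illustrate the comparison step being described, a rough D6-style sketch ($guids_in_feed is a hypothetical array of the GUIDs present in the freshly downloaded feed for this source; the column names are the ones used elsewhere in this thread):

<?php
// Find locally stored items whose GUID no longer appears in the remote feed.
if (!empty($guids_in_feed)) {
  $placeholders = db_placeholders($guids_in_feed, 'varchar');
  $args = array_merge(array($source->id, $source->feed_nid), $guids_in_feed);
  $result = db_query("SELECT nid FROM {feeds_node_item} WHERE id = '%s' AND feed_nid = %d AND guid NOT IN ($placeholders)", $args);
  while ($row = db_fetch_object($result)) {
    // Obsolete: delete (or unpublish, depending on the desired behaviour).
    node_delete($row->nid);
  }
}
?>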

amitkeret’s picture

StatusFileSize
new2.76 KB

I liked #12's idea of putting this functionality in an extension to the existing module rather than patching Feeds' "core".

I've improved on @zengenuity's solution from #12, since the code there was for an older version of Feeds.
I've also added some functionality to it - on the processor's settings page, you can now choose whether to unpublish or delete the orphaned nodes.

This is a separate module. After installation (and flushing the cache, as @zengenuity mentions) you'll have a new "Node Processor with Garbage Collection" option in Processors.

This was tested with Feeds 6.x-1.0-beta10; I'll be happy to get comments about it.

amitkeret’s picture

StatusFileSize
new2.76 KB

Sorry - there was a minor typo in the plugin definition; the fixed module is attached.

EvanDonovan’s picture

Ah - I like this idea as well. Would you consider making it a sandbox project? That way it could get promoted to real project status later, once other people have tested. (I haven't yet.)

NicolasH’s picture

I would like the idea of this being a separate project as well, since I don't think it will ever make it into the module (although this issue is even linked from the ManagingNews project page).

However, the module in #118 misses something that's been discussed here a fair bit, and that's batching and scalability (see #44, #60). That's why the patch to the Feeds module is more complex than the proposed extension: it uses the expire process. Unless you duplicate all that in an extension or do something else that makes it batchable, this simply won't scale.

It will probably work fine for people who start with this right away, but it could be a lot of trouble if a feed already contains hundreds or thousands of nodes when this gets added.

If you go ahead with this, one very basic addition could be to do a quick query when this is enabled for a feed to show how many nodes will actually be deleted on the next run. A warning if it's above some threshold, perhaps, and a suggestion to first clear this out "manually" via some bare-bones batched delete function?
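
For what it's worth, a minimal sketch of such a check (D6 API; $importer_id and $feed_nid are hypothetical placeholders, and the count is just an upper bound on what a first garbage-collection run could end up deleting):

<?php
// How many nodes does this importer/source already own?
$count = db_result(db_query("SELECT COUNT(*) FROM {feeds_node_item} WHERE id = '%s' AND feed_nid = %d", $importer_id, $feed_nid));
drupal_set_message(t('Up to @count previously imported nodes could be deleted on the next run.', array('@count' => $count)), 'warning');
?>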

Renee S’s picture

Sub.

EvanDonovan’s picture

What do you think is the threshold over which most Drupal sites would choke if they tried to delete all the nodes at once?

I guess it depends to a large extent on memory & timeout settings in php.ini, as well as the overall system load, and the number of enabled modules (because that determines the number of nodeapi hooks that could get executed).

I would say maybe 300 or so would be a conservative estimate over which to warn people.

bneil’s picture

Subscribe

NicolasH’s picture

Maybe it's not even worth trying to determine a threshold... just display a message with the number of nodes that will need to be deleted and a generic warning. I still think a button that offers a batched deletion would be the easiest catch-all solution. Anyway, I wonder what amitkeret thinks about all this?

EvanDonovan’s picture

I guess that could work; let people decide for themselves. It's definitely the simplest solution, but I think grouping the deletions into batches would be ideal (300 or so, perhaps).

By the way, I am still using one of the patches from the middle of this thread, but the orphans are not getting deleted on cron; I'm not sure why. I'll look into it more and get back to you.

isotherm’s picture

I am using the separate module from #119. In the garbage collection function, it is important to restrict the feed name. Otherwise, if you have multiple feeds and the importers are not attached to a node, importing one feed will unpublish/delete all of the nodes from the other feeds! That section now reads:

  public function garbage($importer_id, $feed_nid, $nids_in_feed) {
    $counts = array('unpublished' => 0, 'published' => 0, 'deleted' => 0);
   
    // Get all published items from this feed that are not in this feed.
    $sql = "SELECT n.nid FROM {feeds_node_item} fni LEFT JOIN {node} n ON fni.nid = n.nid WHERE fni.id = %s AND fni.feed_nid = %d AND n.status = 1";
    if (!empty($nids_in_feed)) {
      $sql .= " AND n.nid NOT IN (" . db_placeholders($nids_in_feed, 'int') . ")";
    }
    $result = db_query($sql, array_merge(array($importer_id, $feed_nid), $nids_in_feed));

The line calling the garbage collection function should look like this:

    // GC: Call garbage collector.
    $garbage_counts = $this->garbage($source->importer_id, $source->feed_nid, $nids_in_feed);
Countzero’s picture

Hi all,

Could anybody tell me if there is a working D7 solution to this? I only need a 'purge' function to clean out all nodes that don't appear in the source anymore.

The content type used for the import is only used for this (no user-created nodes), so it should be quite simple, but I didn't see anything about this.

Perhaps I misunderstand how the 'Expire' function works, but I couldn't find anything in the docs about it.

Thanks for reading.

soulston’s picture

For those of you wanting an alternate solution this module might be an option although it doesn't have a UI:

http://drupal.org/project/dataset

greg.harvey’s picture

Edit: SNAP! Was discussing with soulston in IRC and we both posted the same thing. D'oh! ;-)

Ahead of this feature being completed, you might want to check out my little utility for managing this sort of thing:
http://drupal.org/project/dataset

I wrote it before this issue even existed to tackle similar problems. It might help some people.

EvanDonovan’s picture

Looks like dataset is just an API module in itself, so it would still need to have some glue code that would add nodes to a dataset on a Feeds import, and then a cron implementation to purge old datasets periodically.

greg.harvey’s picture

That's right - it's very light, just an API - it was fine for my purposes when I wrote it but would require some integration if it were to be used anywhere else. A dataset_feeds module as part of the dataset package might be neat though, to do just that. =)

vin247’s picture

#119 works fine for a while - there are times when I'll have 1 to 3 days without anything going wrong - however there are still days where, at one point in the day, all nodes that are in the feed are unpublished/deleted.

Do I need to do any more configuration? Before, I thought the problem was cron; I noticed that it only occurred when cron was aborted, but now cron is completing every run and I have nothing to go by in the logs to see why this is happening.

Any help would be appreciated. I'm using the solution from #119 with 6.x-1.0-beta11 on Drupal 6.22.

Thanks!

jeni_dc’s picture

I've had the module from #119 running on a site for a few weeks now and haven't noticed any problems like you've described. You mentioned that you thought maybe cron was the problem; in setting this up I realised I had to run cron as an authenticated user to get the nodes to delete properly, so I ended up setting up my cron job according to this:

http://drupal.org/node/995514#comment-4373006

Just thought I'd mention it in case it helps.

Are you sure your feed isn't empty when everything is being deleted/unpublished?

vin247’s picture

Hmm, I've just tried setting cron to run as admin but still no luck. I'm not sure how I can check if the feed is empty, though - any ideas?

jeni_dc’s picture

I would think that loading the feed in your browser at the same time your site's nodes have all been deleted would show if your feed is empty. If it's a CSV just download it and open it.

When all of your nodes are deleted, do they come back on the next cron run? In the settings for your feed processor, what is your setting for "Expire nodes"? That should be set to never; otherwise it's possible that your nodes are expiring and being deleted on one cron run, then being recreated on the next.

vin247’s picture

The feed is updated twice a day (early morning and early afternoon) and is always populated. I'll have another look at it today if it happens; the only problem is catching it when it happens, as the occurrence is random.

All the nodes do come back on the next cron run, and the "Expire nodes" setting is set to never.

vin247’s picture

Checked it straight after it failed again: the feed isn't empty. I just have a feeling that nothing is getting lined up for when cron is run at certain times, so I've decided to run cron only within the same hour as the feed is updated - hopefully that should do the trick.

vin247’s picture

Just started using multiple feed importers and had the problem of one feed unpublishing another. After a few tests, the solution in #127 by isotherm seems to work fine.

vin247’s picture

OK, after a few more tests I've realised solution #127 doesn't work. If anyone has it working with multiple feeds, please let me know how! Thanks!

ndevelder’s picture

#127 is throwing some errors; this is how I fixed them:

Problem: Code thinks that %s value is a column
Solution: Add single quotes around %s:

$sql = "SELECT n.nid FROM {feeds_node_item} fni LEFT JOIN {node} n ON fni.nid = n.nid WHERE fni.id = '%s'

and

Problem: Importer id not being passed to the "garbage" function
Solution: Looking at the definition of FeedsSource here:
http://drupalcontrib.org/api/drupal/contributions--feeds--includes--Feed...
I changed $source->importer_id to $source->id:

$garbage_counts = $this->garbage($source->id, $source->feed_nid, $nids_in_feed);

Not sure if this solves your multiple feeds problem, but it isn't complaining anymore...

ndevelder’s picture

In this case, not complaining != fixed.

The code is not storing the proper nids, and in my case, is deleting 500 of 534 nodes that have just been imported...

Will look a little further into this, but I'm close to trying to find another solution.

kojis’s picture

If you run cron (or the import) again, does it republish the nodes?

In my case, I have found that the first import will pull in most of the feed, and create ~70 nodes. The second run will then seemingly add a few nodes not captured in the first run, and unpublish 50 of the previously imported nodes. The third run will then republish the 50 nodes, and my import is then seemingly complete. Once there is no content change, the feed job will no longer unpublish any nodes.

This scenario is repeatable for me. I have to imagine I'm hitting some kind of threshold, which is why the first import is not completing fully and, in turn, confusing the Garbage Collection mod.

If anyone has any advice on what might be causing this, it would be much appreciated.

kojis’s picture

I was able to resolve the issue by configuring (increasing) the batch size:
http://drupalcontrib.org/api/drupal/contributions--feeds--plugins--Feeds...

My hunch is that the Garbage Collection modified script is actually recognizing the previous import batches as garbage and deleting them, ergo, imports will encounter issues once they exceed 50 nodes.

I'm not too familiar with the intricacies of Feeds, but I guess ultimately, some additional logic could be added to the GC Importer to properly handle this case. Until then, hopefully increasing this threshold will resolve the issues for people.
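
For reference, a hedged example of raising that limit from settings.php. The variable name below is the one newer Feeds branches read (feeds_process_limit); older branches may use a different variable or constant, so check the variable_get() call in the processor code linked above for your version:

<?php
// settings.php: process up to 500 items per pass instead of the default 50,
// so the whole source fits into a single import pass.
$conf['feeds_process_limit'] = 500;
?>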

rgristroph’s picture

StatusFileSize
new3.07 KB

Hi Folks,

Elliotttf ( https://drupal.org/user/61601 ) and I took the feeds_gc module in #119, made changes from #127 and #141 (I think), and ported it to D7.

This should probably become a module at this point, with 6.x and 7.x branches - does anyone want to do that? I could start a sandbox if not.

--Rob

smithmilner’s picture

Status:Needs work» Needs review
StatusFileSize
new4.77 KB

Here is a crack at #119 with the changes from #127 and #141. I made some changes so the garbage collection script doesn't delete/unpublish nodes from previous batches of the same import.

**Edit**: This is for Drupal 6; it works with the latest stable release of Feeds.

makkon’s picture

I have a CSV file with some fields. One of the fields is unique every time, so I use the node processor with "Update existing nodes", and in the mapping I set this unique field as the source with target = GUID.
The first import creates some nodes with unique GUIDs. Now:
1) If I add some rows in the CSV and/or change some non-unique fields in existing rows, and then import, Feeds updates the existing nodes (like the price field) and adds new nodes from CSV rows that don't exist yet.
2) If I delete some fields, nothing happens ("There is no new content."). But I need such nodes to be unpublished.

Do I understand the purpose of feeds_gc correctly? Is it meant for my problem (in point 2)?
If it is, why does nothing happen with #146 (I use the latest dev version)?
PS: there is no "Node Processor with Garbage Collection" in the GUI.

My bad - it worked like a charm. I just had to clear the cache (not sure why the new option didn't show up the first time).

smithmilner’s picture

This plugin is supposed to unpublish content that was imported from that feed but is no longer in the source (your CSV). If the CSV field you removed was the one mapped to GUID, it could have trouble matching the existing items, and they would get unpublished.

makkon’s picture

smithmilner, can you list some situations where this happens?
Even if I mapped only unique fields as GUID?
I tested it on 20 rows (nodes) and it all worked well.

smithmilner’s picture

While using this plugin, Feeds runs the import and builds a list of node IDs that match items from the source. These node IDs are from previous runs of this import. After the list of NIDs is complete, it deletes any nodes previously created by the import that are NOT in the list.

Basically, it will only unpublish/delete nodes that it can't find in the source (your CSV). This will happen if you alter/remove the CSV field you're using as GUID.
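
A minimal sketch of the flow just described (D6 API, not the module's actual code; $nids_in_feed stands for the list of node IDs the current import matched, and $source is the usual source object):

<?php
// Anything previously created by this importer/source that the current import
// did not match is treated as garbage.
$result = db_query("SELECT nid FROM {feeds_node_item} WHERE id = '%s' AND feed_nid = %d", $source->id, $source->feed_nid);
while ($row = db_fetch_object($result)) {
  if (!in_array($row->nid, $nids_in_feed)) {
    // Unpublish rather than delete; the real plugin makes this configurable.
    $node = node_load($row->nid);
    $node->status = 0;
    node_save($node);
  }
}
?>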

vin247’s picture

Running a few tests using #146 - so far so good, thanks!

bancarddata’s picture

rgristroph, thank you for your module at #146. It *almost* works for me. It gets all the way through the import and then stops at 98% with an AJAX 500 error (this doesn't occur until after all of the nodes are actually imported - it seems like some sort of post-process cleanup problem or something). My Apache log shows "PHP Fatal error: Allowed memory size of 167772160 bytes exhausted (tried to allocate 32 bytes) in /drupal/includes/database/log.inc on line 147". This does not occur with the regular Feeds Node Processor, set up identically to the GC importer. Any ideas what might be happening?

tevans’s picture

StatusFileSize
new3.64 KB

@rgristroph

I tested out your D7 module from #145 and it works great under certain conditions. There were a couple of bug fixes required for my use case. In particular, I added an additional condition to the where clause of the "Get all published items from this feed that are not in this feed" query that filters out items belonging to other importers. Without this additional condition, if you have multiple importers going on the same cron run, the second importer will delete items from the first importer, etc. Also, if you choose "Unpublish orphaned nodes" it doesn't appear to re-publish items that are added back to the feed. In my case, I chose to delete orphaned nodes, so I didn't take the time to fix that one. Attached is my patched version of the module if anyone is interested. Please note that this is for Drupal 7 and has only been tested against Feeds 7.x-2.0-alpha4. Thanks again to rgristroph and Elliotttf!

katherinecory’s picture

I'm using the latest dev version of Feeds and sometimes I exhaust my memory limit but other times I get this error:

SQLSTATE[42S22]: Column not found: 1054 Unknown column 'fni.feed_nid' in 'where clause'

I'm using the module from #153

eddib’s picture

Hi guys,

Thanks for making this little module. I used the module attached in #153 from @tevans and have two small issues:

  1. After import I get an SQL syntax error.
    I located the error in the query in FeedsNodeProcessorGC.inc, line 203:
    condition('fni.feed_nid', $feed_nid)
    should be
    condition('fi.feed_nid', $feed_nid).
  2. I also had to comment out lines 56 to 58 to get this working. These lines are not in the original Feeds module. (I suppose you added them to make this work with Rules?)
    if (module_exists('rules')) {
    rules_invoke_event('feeds_import_'. $source->importer()->id, $entity);
    }
anonymous07’s picture

Subscribe

arkjoseph’s picture

rgristroph and tevans, I have Drupal 7.14 and Feeds 7.x-2.0-alpha4.

I have enabled the module but I'm not seeing the new config settings. I also cleared the cache, ran cron and update.php, but I'm still not seeing a way to unpublish or delete nodes from the import. The 6.x branch worked, but I'm not seeing the same for 7.

Thank you

Edit: sorry, I wasn't looking hard enough. I found it.

leenwebb’s picture

I'm using the module in #145 and it's working well for me. I have two importers running simultaneously and with the fix in #153 it is working exactly how I want it to - deleting nodes that don't exist in the feed anymore.

Note! I already had an import in place, then switched the processor to this one set to "delete", and the first time I ran it, it deleted *all* my nodes (not just the ones created by the last import). Whoa! That was exciting. When I reverted to a backup (phew!) and then created the importer from scratch rather than modifying an existing one, it worked as expected.

morales2’s picture

Using the fixes from #155 we are now in business! Thank you

AdAstra’s picture

"Node Processor with Garbage Collection" from #146 works beautifully as a feed importer, thank you for your immense contribution!

In case this helps others:
- Scenario: using the "iCal dateapi parser" to import an iCal calendar feed into a Drupal calendar (with Events as a custom content type to hold the title and date)
- Success: feeds_gc is able to update the Drupal calendar whenever dates are added/deleted from iCal. It even works with recurring dates in iCal.
- Exception: when deleting one or two dates *from* a recurring date series in iCal, the importer throws an "An HTTP error 500 occurred" error, and the corresponding nodes do not get deleted from the Drupal calendar.

It seems like Feeds/iCal Parser knows what to do with a feed item that has been added/deleted, but doesn't know what to make of a date that has been deleted from *within* a feed item (which is a recurring series). It took lots of case studies to pinpoint this shortcoming.

Would love to hear your feedback, and I'm willing to pay a developer to find a workaround.

Cheers!

byrond’s picture

StatusFileSize
new4.94 KB

We are using the module from #153 with changes from #155. I also added the changes from the D6 module in #146 that prevent deleting/unpublishing nodes from previous batches of the same import. We were having this problem with a large CSV file we were importing.

byrond’s picture

StatusFileSize
new702 bytes
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch feeds_gc-D7-variable_set-661314-161.patch. Unable to apply patch. See the log in the details link for more information.
[ View ]

I put the variable_set call in the wrong spot, which makes it useless. Moving it outside the while loop fixed the issue. Here's a patch against #161.

gnucifer’s picture

StatusFileSize
new4.06 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch feeds-synchronize.patch. Unable to apply patch. See the log in the details link for more information.
[ View ]

Here is a general solution to the problem (D7 version, but it should be easy to port). It takes the same approach as #59, but since the processed GUIDs are stored in the state it will also work with batch operations. I have not gotten around to implementing batch operations for the deletes (as I do not need it myself), but it should not be all that difficult. The problem with the module in #146 is that it is not a general solution to the problem: it will only work for nodes.

gnucifer’s picture

Though I think a more scalable solution would be to add a unique import id for each import, mark all imported items with this id, and afterwards purge all feed items for the same importer that have a different import id. Perhaps this approach has already been implemented in this thread?
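
Purely to illustrate that idea (hypothetical, not existing Feeds code: it assumes an extra import_id column on {feeds_node_item} and an $import_id generated once per run):

<?php
// 1) Stamp every item processed in this run with the run's id.
db_query("UPDATE {feeds_node_item} SET import_id = %d WHERE nid = %d", $import_id, $nid);
// 2) After the run, anything this importer/source owns that still carries an
//    older id was no longer in the feed. A real implementation would also
//    delete or unpublish the corresponding nodes, ideally in batches.
db_query("DELETE FROM {feeds_node_item} WHERE id = '%s' AND feed_nid = %d AND import_id <> %d", $source->id, $source->feed_nid, $import_id);
?>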

gnucifer’s picture

StatusFileSize
new4.76 KB
FAILED: [[SimpleTest]]: [MySQL] 2,566 pass(es), 0 fail(s), and 55 exception(s).
[ View ]

On second thought, I have found a much nicer solution. Using the "imported" column of feeds_node_item it's possible to know whether an item was present in an import or not: just save the time the import is starting, and then purge all items older than that start time after the import is finished - super simple! Best of all, it's thread safe and not dependent on any unique identifier! The patch is for the D6 version, but porting to D7 should be really easy and I will probably do that as soon as I have some more time. I also had to hack it in quickly, so it could perhaps be implemented more nicely, and I have not tested it yet.
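
Roughly, the purge step described here boils down to something like this (D6 API, simplified; $started is the timestamp recorded when the import began, and later comments in this thread add a feed_nid condition as well):

<?php
// Anything this importer owns whose "imported" timestamp predates the start of
// the run was not touched by it, i.e. it is no longer in the feed.
$result = db_query("SELECT nid FROM {feeds_node_item} WHERE id = '%s' AND imported < %d", $source->id, $started);
while ($row = db_fetch_object($result)) {
  node_delete($row->nid);
}
?>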

gnucifer’s picture

StatusFileSize
new4.77 KB
FAILED: [[SimpleTest]]: [MySQL] 2,562 pass(es), 0 fail(s), and 55 exception(s).
[ View ]

Forgot to include the id in the db-query, new patch:

gnucifer’s picture

StatusFileSize
new4.77 KB
FAILED: [[SimpleTest]]: [MySQL] 2,564 pass(es), 0 fail(s), and 55 exception(s).
[ View ]

Fixed a small mistake in the reporting of the number of deleted nodes, and I have also verified that the patch works as intended. It might be problematic if a very large number of nodes needs to be deleted, though. In my case 10k nodes needed to be purged, and even though it worked with a high enough execution time limit, with any "normal" PHP settings it would have failed. So support for batching/processing on cron would be desirable.

Renee S’s picture

Thanks, gnucifer! What version of Feeds does this apply to? It didn't apply cleanly for me.

Renee S’s picture

Testing this patch a bit, and I notice that the logic is something like this:

  • Record start-time of first batching process
  • (batching happens)
  • Select nids from {feeds_node_item} where id = X and imported time less than when the last processing started.
  • Delete those.

... however, what if you have multiple feeds on one id (since id is just the content type that holds the feed's URLs)? If I'm just processing one feed node at a time, this will wipe out anything attached to other feed nodes that wasn't imported after the start time of the batch. Or am I missing something? :)

- R

Edited to match gnucifer's description!

gnucifer’s picture

Your description of the logic is sort of correct, but the time compared against is when batching first started, not the start time of the current batch. When all batches have completed - that is, all items have been processed - the items with "imported" dates older than this timestamp are deleted, since those items were not among the items processed (= no longer in the feed).

And yes, you are probably right, but the problem would probably go away if the feed nid were also included as a condition in the query? I'll have a look at it tomorrow; I also discovered some other logical mistake that I will submit along with a possible fix for the above problem. I'm not really happy with the current code, since it feels rather hackish and (obviously) has some logical errors that are not so easy to spot. I have also experienced some odd/inconsistent behavior with large numbers of items, so there could be more issues. The concept is probably alright; I just need to sort out the implementation.

Renee S’s picture

the problem would probably go away if the feed nid were also included as a condition in the query?

Yes, I think that would fix it!

gnucifer’s picture

Sorry, I did not have time to look at it today; I will submit a patch later this week instead.

mxwright’s picture

Is there a D7 port of #165-167 yet? This solution seems ideal for me. Thanks!

gnucifer’s picture

StatusFileSize
new4.79 KB
FAILED: [[SimpleTest]]: [MySQL] 2,564 pass(es), 0 fail(s), and 55 exception(s).
[ View ]

New patch for D6 (against the latest dev): the feed nid is now included in the query as it should be, plus some other minor things are fixed.

@mxwright: You might find #163 useful. I will port the "better" solution to D7 as well, but I will probably not have time to do so until next week.

Renee S’s picture

Thanks, gnucifer - this works as far as it goes; however, it creates load problems for us. We've got feeds with 250+ items that don't change much. What this does is delete anything brought in on a previous batch, so suddenly a feeds import goes from a small thing to a major job, deleting and recreating nodes all over the place, with timeouts and the like. The outcome works, but getting there is painful :)

Thinking about it a bit more, I'm not convinced using created-time is the best way to go about this, given the problems I'm seeing, anyway: YMMV. If you have your feeds set to update rather than replace (as we do), and if feeds are imported in a batch every X hours where there's only a small change every so often, what this will do is delete all the items brought in on the previous batch and replace them all; instead, I'd expect it to do a diff and delete only the items that aren't the same. I think it might be more efficient to match GUIDs for the feed items, and delete the ones that don't match the current import. Thoughts?

gnucifer’s picture

I assume you have mapped GUID as the unique target for the items (and not URL, for example)? I must admit I am still getting some strange behavior from this patch, but in theory it should only delete items that have been removed from the feed since the last import (that is, if GUID is mapped correctly). It should not delete all items from a previous batch and then read them in again.

jelo’s picture

The original use case of this issue was "In many cases it would be useful to expire items that do not exist on the external data source anymore." But the discussion seems to extend to a more complicated problem of syncing and time-based expiry of imported items. However, several related issues that discuss how to delete items that no longer exist in the original source point here.

Hence, I wanted to mention a potential solution for the original use case. I would add as a requirement that the solution should not necessarily have to rely on the feed as an indication of whether an item still exists or not; i.e. Feeds as a module is not even in a position to deal with this problem. In some scenarios it makes sense, e.g. dumping data from one system into a CSV and importing it into the Drupal site. In others it might not: in my specific use case, I import YouTube videos into nodes. The YouTube feed only provides 20 items. If I import 100 items per day, I cannot use a solution that relies on the items in the feed, because it would miss the 80 items that are no longer in the latest-items feed. But I still need to check all 100 items to make sure the videos still exist on YouTube.

I considered using Linkchecker (http://drupal.org/project/linkchecker) as a simple solution for this. An imported item should always have a source URL (the YouTube URL of each video). I can check this source URL for the HTTP response code; if it returns 404, then the item is not available anymore and I might have to take action on my site. It seems like Linkchecker allows you to do exactly that... Right now the features only list unpublishing content, but maybe it would make more sense to extend Linkchecker to execute different actions, such as changing a CCK field (e.g. from "in stock" to "out of stock"). I haven't tried this solution yet, but maybe it already integrates with Rules, which would give us many options. This would be useful if your imported feed does not contain all your items but only a subset, yet you still have to check all previously imported items.

Regarding time-based expiry: wouldn't Rules allow various methods of dealing with this issue?

gnucifer’s picture

This will only work if the items you import are available through a public HTTP URL, though. If you import items with another fetcher - a file or private FTP fetcher, for example - it will not.

jelo’s picture

Good point. It is easy to forget the myriad of use cases that Feeds can be used for, from migrations to regular imports etc. Given that this discussion spans 3 years and more than 150 comments, it gets a little hard to keep track of all the points. Maybe it helps to just collect the criteria that need to be considered when creating a solution. Questions discussed here revolve around:

  • Is the feed publicly available or private or file?
  • Intended outcome: deletion of items that become unavailable in the source vs. dynamic expiry of imported content vs. complex sync scenarios between multiple feeds/sources
  • Does the feed include all items, i.e. can deleted items be identified because they disappear from the feed?
  • Do you control the source feed? If you do, you might just be able to add a field to set a flag for deleted items and process that during the import with mapping.
  • Number of imported items and frequency of import, e.g. impacting performance

For my YouTube example it would mean: publicly available feed, intended outcome is deletion of 404 items, no influence on the feed, the feed does not include all items, and the number of items varies per channel that gets imported. I believe Feeds would not be in the best position to handle this. I tried Linkchecker last night and it looks very promising. There is an outstanding patch that might provide Trigger/Rules integration, which would make it very powerful for addressing this particular scenario.

tyler-durden’s picture

I have applied the patch in #163 to Feeds 7.x-2.0-alpha7 and I think there is an issue; I will document step by step what happens.

I created 1 feed importer attached to a node type for import. I created 2 nodes, and imported a 3-item CSV file for each. All data was the same EXCEPT for the author ID. Each import created the 3 nodes under their respective author ID, for a total of 6 nodes. However, when I deleted the 3 lines from the first CSV and ran the import again on only 1 node importer, it deleted ALL 6 nodes. I would think it should only delete the 3 that that node imported.

I have a field marked unique that I'm using for the GUID, BUT multiple customers may have the same data for that GUID field (some may only use 4 or 5 lines in the CSV, so they will just number them 1, 2, 3, 4, 5, etc.). If multiple imports have the same GUID they still import per respective customer, but this will wreak havoc if I have 100 different node importers for 1 feed type. When someone deletes #4 or #5, for example, ALL GUIDs with 4 or 5 will be deleted, irrespective of the node importer they were created in.

I hope this is clear. Maybe I'm just not understanding something?

gnucifer’s picture

StatusFileSize
new4.08 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 163-7.x-2.0-alpha7.patch. Unable to apply patch. See the log in the details link for more information.
[ View ]

@tyler-durden: I think a missing feed_nid condition is the culprit; I had not added:

<?php
->condition('feed_nid', $source->feed_nid);
?>

which might resolve the problem. This is untested code, though, so I'm not sure if it will work. Another issue might be that the patch in #163 does not apply to 7.x-2.0-alpha7 - you will get rejects. This new patch is against that version, though.

gnucifer’s picture

What I find strange, though, is that $source->feed_nid should not be available, since it's a protected member of the $source object. It's used in other places in the existing code though, so I guess it works??

tyler-durden’s picture

Thanks gnucifer, it may take me a few days to get back to testing but I will keep you updated.

tyler-durden’s picture

I tried that patch and it works as it should; I didn't need to add the extra code you listed. I assume I was not applying the right patch to alpha7 before. Thanks again.

gnucifer’s picture

Ok, good! The extra code was included in the latest patch so no need to add it manually.

nedjo’s picture

Status:Needs review» Closed (duplicate)

While this is the older issue, #1470530: Unpublish/Delete nodes not included in feed has a newer and more developed patch, so marking this issue as a duplicate.

Renee S’s picture

But that patch is for 7. Will it be backported?

gnucifer’s picture

And the patch referred to in #186 will just work for nodes from what I can see; it would be better to put the syncing logic in FeedsProcessor. I think my patch is much preferable, even though the code might be in need of some polish.

thedavidmeister’s picture

Status:Closed (duplicate)» Needs review

As per #187 and #188, this is not a duplicate, as it is for D6 and the patch here does something different to the patch in the linked thread.

imclean’s picture

Features are generally added to the latest version and sometimes backported. Typically there shouldn't be more than one issue for the same feature request, regardless of version.

It seems unlikely 2 different methods of doing a similar thing will be applied to different versions of Feeds.

@gnucifer, please check the latest patch in the other issue. It puts most of the processing in FeedsProcessor, while also adding the option to unpublish rather than delete nodes.

Therefore, I believe this is a duplicate of #1470530: Unpublish/Delete nodes not included in feed.

Status:Needs review» Needs work

The last submitted patch, 163-7.x-2.0-alpha7.patch, failed testing.

gnucifer’s picture

@imclean: Yes, as far as I can see, the latest patch in #1470530 supersedes the patch in this thread in functionality, and we can close this issue as a duplicate.

EDIT: I looked through the patch again and it's not as generic as I first thought, but relies on the entities being nodes, for example invoking node_load_multiple($state->removeList), node_save(), etc. I THINK that the patch in #181 should work for all types of entities, so they are not equivalent. It would be nice if the patch in #1470530 could be reworked to fit this description. I also had some other questions about the patch, but I will post them in #1470530.

EDIT2: Thinking about this some more, I think the approach in #1470530 is flawed to begin with, since only nodes can be published/unpublished and you will never get rid of the node dependency. I like the idea of starting off by saving all the entity ids and then marking them off - that saves you from a potentially quite large SQL query - but I don't think you should hard-code the publish/unpublish/delete behavior. Better to provide, for example, a special Rules event with which the user can create some custom behaviors. In that case you might need another table to keep track of whether the event has been triggered or not, but it should not be impossible to implement.

imclean’s picture

@gnucifer, please look at the patch in #95 of the other issue more closely. Perhaps even apply it so you can get a clearer picture.

  1. FeedsNodeProcessor->Clean() overrides FeedsProcessor->Clean()
  2. node_load_multiple() and node_save() are only used in FeedsNodeProcessor
  3. The unpublishing option only applies to FeedsNodeProcessor and isn't in FeedsProcessor
  4. FeedsProcessor->clean() is generic for all entities
gnucifer’s picture

@imclean: You are correct, sorry! I can't believe how I managed to miss that, as I looked through the patch several times.

imclean’s picture

Status:Needs work» Closed (duplicate)