Problem/Motivation

Feeds has a feature to unpublish or delete items that have been previously imported but no longer appear in the source. This is useful to keep source and target data in sync. For example: you have a list of products in a CSV file that you import regularly. When you remove products from the CSV file, you want them removed from the site as well.

The unpublish/delete operation is not batched: it has to be completed in a single request. This becomes a problem when a large number of items need to be unpublished/deleted: the operation is too big to finish in one request, so it times out and can cause deadlocks.

Proposed resolution

There is no real solution yet, but changes need to be made in FeedsProcessor::clean() and FeedsNodeProcessor::clean().
#7 suggests that the issue could perhaps be solved by using the Queue API (see http://rbayliss.net/drupal-queue-api for a tutorial). We should, however, make sure that this doesn't interfere with other processes: the feed should remain locked until the operation has been completed.
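
As a very rough illustration of the consuming side, a cron queue worker could look something like this (the module prefix, queue name and worker callback are made up for illustration; how the feed lock is held and released still needs to be worked out):

  /**
   * Implements hook_cron_queue_info().
   */
  function mymodule_cron_queue_info() {
    $queues['mymodule_feeds_clean'] = array(
      'worker callback' => 'mymodule_feeds_clean_worker',
      'time' => 60,
    );
    return $queues;
  }

  /**
   * Queue worker: unpublishes one chunk of removed nodes.
   */
  function mymodule_feeds_clean_worker($item) {
    foreach (node_load_multiple($item['nids']) as $node) {
      node_unpublish_action($node);
      node_save($node);
    }
    // The feed must stay locked until the last chunk has been processed;
    // that part is not shown here.
  }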

#2862909: Batched Unpublish/Delete of Non-existent items suggests that for the unpublish action only nodes that are still published should be loaded. Right now, all nodes in $state->removeList are loaded, unpublished and saved, regardless of their publication status.
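
A minimal sketch of that improvement, assuming $state->removeList holds node IDs and is not empty:

  // Only load nodes from the remove list that are still published.
  $published_nids = db_select('node', 'n')
    ->fields('n', array('nid'))
    ->condition('n.nid', $state->removeList, 'IN')
    ->condition('n.status', NODE_PUBLISHED)
    ->execute()
    ->fetchCol();
  if ($published_nids) {
    foreach (node_load_multiple($published_nids) as $node) {
      node_unpublish_action($node);
      node_save($node);
    }
  }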

Remaining tasks

  • Figure out how to batch the unpublish/delete operation. Queue API?
  • Improve the implementation of the unpublish action: only load published nodes from $state->removeList.
  • Write an automated test that ensures that the batched operation gets completed.
  • Write an automated test that ensures that the feed remains locked when the unpublish/delete action is not completed yet.

User interface changes

None.

API changes

TBD.

Data model changes

TBD.

Original report by andreboc

For some reason, after the update to that version the option "Unpublish non-existent nodes" completely stopped working.
It doesn't change the value to "Not published" when the node is missing from the feed.

Did someone find a fix for that issue?

Cheers.


Comments

andreboc created an issue. See original summary.

MegaChriz’s picture

Priority: Critical » Normal
Status: Active » Postponed (maintainer needs more info)

This feature should still work, as the functionality is backed by tests. The only case in which nodes are not unpublished/deleted is when the source is completely empty. This is by design, because in most cases you don't want everything wiped out when supplying an empty source. There is an issue about making this behaviour optional: #2333667: Add an option to delete all nodes when the feed source is empty.

Can you try to reproduce this issue on a clean install? If you can reproduce it, can you provide the steps to reproduce it?

andreboc’s picture

Status: Postponed (maintainer needs more info) » Active

Hey there,

Definitely not working, as another user is having the same issue with the same version:
https://www.drupal.org/node/1470530#comment-10209713

Even though the node isn't in the feed (and the feed is not empty either), the old node doesn't get unpublished at all.

Did a clean installation, same feeds, same problem. Doesn't get unpublished.

Regards,
Andre.

MegaChriz’s picture

Status: Active » Postponed (maintainer needs more info)

Which steps did you follow? Is it something like the following?

  1. Created a Feeds importer.
  2. On basic settings, set "Attach to content type" to "Use standalone form".
  3. At node processor settings, set "Action to take when previously imported nodes are missing in the feed" to "Unpublish non-existent nodes".
  4. On mapping settings, set one target as unique.
  5. Went to /import and clicked the newly created feeds importer in the list.
  6. Imported a file with two items. Result: two nodes were created.
  7. Removed one item from the source.
  8. Re-imported the source. Result: one node was unpublished.

If you followed these steps and still encounter this bug, can you provide an export of your Feeds configuration and the two sources that you were importing? So, the feeds importer, one source file with the nodes that were originally imported and one source file in which some items were removed.

venkatesha.k12’s picture

Followed the steps mentioned in #4.

Both delete and unpublish of nodes work fine on the beta1 version, but when we do it with mass data (4000 records), delete does not work.

I'm getting a "General error: 1205 Lock wait timeout exceeded; try restarting transaction" error.

Is anyone else able to replicate this?

Thanks.

venkatesha.k12’s picture

Attached screenshots for this.

MegaChriz’s picture

Title: Unpublish non-existent nodes - Not working » Unpublish non-existent nodes times out with large number of records
Status: Postponed (maintainer needs more info) » Active

It is possible that this feature doesn't work well yet with a large number of records. In the current implementation, all records to unpublish/delete are processed in one go, and yes, that can fail if the number of records is large. This was in fact mentioned in the original issue: #1470530-47: Unpublish/Delete nodes not included in feed, but it was put on hold in comment 80 of that issue (#1470530-80: Unpublish/Delete nodes not included in feed) because of #1363088: Feeds batch process not fully implemented - needed for large imports.

Patches that improve the implementation are welcome.

See:

  • FeedsProcessor::clean()
  • FeedsNodeProcessor::clean()

Maybe this can be implemented with the Queue API. See http://rbayliss.net/drupal-queue-api for a good tutorial on this API.
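
As a sketch of the producing side (the queue name and the structure of the queue item are illustrative, not code from the module), clean() could push chunks of $state->removeList into a queue that a cron queue worker then drains:

  // Inside a clean() override: queue the removal work instead of doing it now.
  $queue = DrupalQueue::get('mymodule_feeds_clean');
  $queue->createQueue();
  foreach (array_chunk($state->removeList, variable_get('feeds_process_limit', 50)) as $nids) {
    // The item would also need enough context (importer id, feed nid) for the
    // worker to keep the right feed locked until the last chunk is done.
    $queue->createItem(array('nids' => $nids));
  }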

berndlosert’s picture

I had to deal with this issue at work since we deal with lots of feeds, some of which have over 20000 items. The solution I came up with was to create my own FeedsProcessor plugin that overrides the process() method: instead of calling clean(), it calls a custom method that adds the nodes that need to be unpublished to a queue, which gets processed little by little throughout the day.

I would have overridden the clean() method instead, but then I realized that the way it is set up with $state->removeList was not going to work, since some of the feeds I deal with are huge. (The size of the removeList array is proportional to the number of items in the feed being imported, and you quickly run into memory problems when the feed has 100k items or more.)

MegaChriz’s picture

$state->removeList is initialized in FeedsProcessor::initEntitiesToBeRemoved() and contains a list of entity IDs. The list does indeed initially contain the IDs of all items that were imported with the current importer/feed node.

Any suggestions to improve this?

@berndlosert
Would you be willing to share the implementation of your custom processor here, or is it very specific (for example, does it contain business logic)? How do you determine which nodes need to be unpublished?

berndlosert’s picture

@MegaChriz: Yes, there is some very specific stuff in there that will most likely confuse anybody looking at it, since it is tied to other parts of "the business".

In order to determine which nodes need to be unpublished, I had to make several changes to the process() method. Every node that gets loaded in process() goes through entitySave(); I had to rework the hash checking logic to allow for this. entitySave() eventually triggers a presave hook that updates the node's changed property to the current time, so every node touched by process() gets its changed property updated. Since I know when the import started (using $source->state(FEEDS_START)), I know which nodes were affected by the import by simply querying for the nodes whose changed time is greater than the start of the import. The nodes that need to be unpublished are of course the ones whose changed time is older than the start of the import. When the import is done, I query for these nodes and add them one by one into a queue using a loop.
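
Roughly, the query and queueing described above would look something like this (a sketch, not the actual code: it assumes $source->state(FEEDS_START) returns the import start timestamp, that the importer id and feed nid are readable from the source object, and that the queue name is made up):

  $started = $source->state(FEEDS_START);
  $query = db_select('node', 'n');
  $query->join('feeds_item', 'fi', "fi.entity_type = 'node' AND fi.entity_id = n.nid");
  $stale_nids = $query
    ->fields('n', array('nid'))
    ->condition('fi.id', $source->id)
    ->condition('fi.feed_nid', $source->feed_nid)
    ->condition('n.status', NODE_PUBLISHED)
    ->condition('n.changed', $started, '<')
    ->execute()
    ->fetchCol();
  // Queue the stale nodes for unpublishing, one by one.
  $queue = DrupalQueue::get('mymodule_unpublish');
  foreach ($stale_nids as $nid) {
    $queue->createItem($nid);
  }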

berndlosert’s picture

Regarding the hash checking logic, I changed things so that the map(), entityValidate() and entitySaveAccess() methods are called only when one of $entity->feeds_item->is_new, $this->config['skip_hash_check'] or $changed is TRUE.
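
In code terms, the guard looks roughly like this (illustrative only; the surrounding process() code is omitted):

  // Skip the expensive mapping/validation work when nothing changed.
  $needs_processing = !empty($entity->feeds_item->is_new)
    || !empty($this->config['skip_hash_check'])
    || $changed;
  if ($needs_processing) {
    // map(), entityValidate() and entitySaveAccess() run only in this branch.
  }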

MrDaleSmith’s picture

I thought I had this issue, but on investigation it appears that instead I have an issue with Workbench Moderation integration: with Workbench Moderation in place, if the node has already reached the published state, then Feeds' attempt to unpublish it has no effect and the published revision remains displayed. This may be an issue with our local setup, but I thought I'd mention it in case anybody else comes here with the same issue: I've resolved it with a custom implementation of hook_node_save that checks the contents of the feeds_item property before triggering Workbench Moderation's transition process.

MegaChriz’s picture

Title: Unpublish non-existent nodes times out with large number of records » Batch processing for FeedsProcessor::clean() (unpublish/delete non-existent nodes times out with large number of records)
MegaChriz’s picture

#2862909: Batched Unpublish/Delete of Non-existent items suggests the following improvement as well:

When the feed comes to clean up, it is passed $state->removeList. If the method is unpublish, it then uses node_load_multiple() on the array of items and runs node_unpublish_action($node) and node_save($node) on each of them, regardless of whether or not they are already unpublished. This means it could need to unpublish only 50 items, but actually loads (for argument's sake) 500,000 items that are no longer in the source but are already unpublished, and saves them again.

So, the unpublish action should be modified to only load the nodes from $state->removeList that are still published.

MegaChriz’s picture

Issue summary: View changes

I've updated the issue summary to better reflect what needs to be done to fix this issue.

jrusse27’s picture

I've resolved it with a custom implementation of hook_node_save that checks the contents of the feeds_item property before triggering Workbench Moderation's transition process.

I would be very interested to see if that is what is causing nodes to not be unpublished on my site. Would you mind posting a patch of your fix so that I can test?

MrDaleSmith’s picture

@jrusse27 I don't have access to the code I used anymore, but basically the default behaviour of Workbench Moderation is that once a version of a node has been published, that version is displayed to users until it is explicitly unpublished. Feeds was marking the node unpublished, but Workbench Moderation considered this to be a new draft and just kept on displaying the published revision of the node.

I used hook_node_save to check if the feeds item was there and should be unpublished, and then ran the node through the unpublish form submit handler in the module.

However, this was a year ago: both modules have been updated since then (Workbench Moderation significantly), so I don't know if the issue still applies.

MegaChriz’s picture

For the D8 version of Feeds, batch support for the cleaning stage has been added in #2939503-15: Unpublish/Delete nodes not included in feed. Now it just needs to be written for the D7 version too. Note: the code from that other issue cannot easily be backported due to architectural changes in the D8 version, but perhaps it can provide inspiration.

efrainh’s picture

I created this patch for my specific problem, which was a timeout error when deleting a large number of users after importing a huge file (120,000+ records) with the option to delete previously imported users that are missing in the feed. I decided to start implementing a solution where the deletion process is handled with a batch. I haven't seen the D8 implementation, but I tested my patch with my large files and it worked.

Basically, I modified the clean() method to check whether $state->removeList is larger than a limit I set; if it is, I start the batch process. This code is supposed to work on both nodes and users with no problem.
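
For reference, a minimal sketch of one way clean() could be made batch-aware, using Feeds' own progress reporting rather than the Batch API (this is not the attached patch; it assumes Feeds keeps invoking the processor until $state->progress() reports completion and that $state->removeList holds entity IDs):

  protected function clean(FeedsState $state) {
    $limit = variable_get('feeds_process_limit', 50);
    if (empty($state->total)) {
      $state->total = count($state->removeList);
    }
    if (count($state->removeList) > $limit) {
      // Handle one chunk per request and report partial progress so another
      // pass gets scheduled.
      $chunk = array_splice($state->removeList, 0, $limit);
      $this->entityDeleteMultiple($chunk);
      $state->deleted += count($chunk);
      $state->progress($state->total, $state->deleted);
      return;
    }
    // A small list can still be handled in a single request.
    $this->entityDeleteMultiple($state->removeList);
    $state->deleted += count($state->removeList);
    $state->removeList = array();
  }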

I believe this can be a good start, @MegaChriz.

efrainh’s picture

Status: Active » Needs review
MegaChriz’s picture

Cool, I haven't tested the code yet, but it looks like a good start indeed. For the limit we could reuse the variable 'feeds_process_limit'. We also need to figure out how we can get this operation into a queue instead of a batch, so that it also works on cron runs.