Problem/Motivation
Feeds has a feature to unpublish or delete items that have been previously imported, but no longer appear in the source. This is useful to keep source and target data in sync. For example: you have a list of products in a CSV file that you import regularly. When you remove products from the CSV file, you want them to be removed from the site as well.
The unpublish/delete operation is not batched: the complete operation has to be accomplished in one request. This becomes a problem when a large number of items need to be unpublished/deleted: the operation will be too big to complete in one request, so it will time out and possibly cause deadlocks.
Proposed resolution
There is no real solution yet, but changes need to be made in FeedsProcessor::clean() and FeedsNodeProcessor::clean().
#7 suggests that the issue could perhaps be solved by using the Queue API (see http://rbayliss.net/drupal-queue-api for a tutorial). We should, however, make sure that this doesn't interfere with other processes: the feed should remain locked until the operation has been completed.
#2862909: Batched Unpublish/Delete of Non-existent items suggests that for the unpublish action only nodes that are published should be loaded. Right now all nodes in $state->removeList are loaded, unpublished and saved.
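A minimal sketch of that refinement, assuming Drupal 7 core APIs (the surrounding clean() method body is omitted, and variable names are illustrative):

```php
// Sketch only: restrict the unpublish action to nodes that are currently
// published, instead of loading everything in $state->removeList.
$published_nids = db_select('node', 'n')
  ->fields('n', array('nid'))
  ->condition('n.nid', $state->removeList, 'IN')
  ->condition('n.status', NODE_PUBLISHED)
  ->execute()
  ->fetchCol();

foreach (node_load_multiple($published_nids) as $node) {
  $node->status = NODE_NOT_PUBLISHED;
  node_save($node);
}
```

Note that for very large remove lists, the IN condition and node_load_multiple() call would themselves need chunking.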
Remaining tasks
- Figure out how to batch the unpublish/delete operation. Queue API?
- Improve the implementation of the unpublish action: only load published nodes from $state->removeList.
- Write an automated test that ensures that the batched operation gets completed.
- Write an automated test that ensures that the feed remains locked when the unpublish/delete action is not completed yet.
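The Queue API idea from the tasks above could take roughly this shape (sketch only: the queue name, worker callback, item structure, and 'last' flag are assumptions, not an existing Feeds API):

```php
/**
 * Implements hook_cron_queue_info().
 *
 * Sketch: a hypothetical 'feeds_clean' queue for chunked clean-up.
 */
function feeds_cron_queue_info() {
  $queues['feeds_clean'] = array(
    'worker callback' => 'feeds_clean_queue_worker',
    'time' => 60,
  );
  return $queues;
}

/**
 * Queue worker: unpublishes or deletes one chunk of removed items.
 */
function feeds_clean_queue_worker($item) {
  $source = feeds_source($item['importer_id'], $item['feed_nid']);
  // Process this chunk of entity IDs from the remove list here.
  // The feed must remain locked for the whole clean operation, so the
  // lock should only be released by the worker that handles the final
  // chunk (signalled here by a hypothetical 'last' flag).
  if (!empty($item['last'])) {
    // Release the lock and finish the import here.
  }
}
```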
User interface changes
None.
API changes
TBD.
Data model changes
TBD.
Original report by andreboc
For some reason, after updating to that version, the option "Unpublish non-existent nodes" completely stopped working.
It doesn't change the value to "Not Published" when the node is missing from the feed.
Did someone find a fix for that issue?
Cheers.
Comment | File | Size | Author |
---|---|---|---|
#19 | fix_timeout_deleting_large_number_of_non_existent_entities-2571767-19.patch | 3.68 KB | efrainh |
#6 | scrennhsot-2.png | 29.6 KB | venkatesha.k12 |
#6 | scrennhsot-1.png | 31.04 KB | venkatesha.k12 |
Comments
Comment #2
MegaChriz CreditAttribution: MegaChriz as a volunteer commented
This feature should still work, as the functionality is backed up with tests. The only case in which nodes are not unpublished/deleted is when the source is completely empty. This is by design, because in most cases you don't want everything wiped out when supplying an empty source. There is an issue about making this behaviour optional: #2333667: Add an option to delete all nodes when the feed source is empty.
Can you try to reproduce this issue on a clean install? If you can reproduce it, can you provide the steps to reproduce it?
Comment #3
andreboc CreditAttribution: andreboc commented
Hey there,
Definitely not working, as another user is having the same issue with the same version:
https://www.drupal.org/node/1470530#comment-10209713
Even though the node isn't in the feed (the feed is not empty either), the old node doesn't get unpublished at all.
Did a clean installation, same feeds, same problem. Doesn't get unpublished.
Regards,
Andre.
Comment #4
MegaChriz CreditAttribution: MegaChriz as a volunteer commented
Which steps did you follow? Is it something like the following?
If you followed these steps and still encounter this bug, can you provide an export of your Feeds configuration and the two sources that you were importing? So, the feeds importer, one source file with the nodes that were originally imported and one source file in which some items were removed.
Comment #5
venkatesha.k12 CreditAttribution: venkatesha.k12 as a volunteer commented
Followed the steps mentioned in #4.
Both delete and unpublish of nodes work fine on the beta1 version.
But when we do it for mass data (4000 records), delete is not working.
I'm getting a "General error: 1205 Lock wait timeout exceeded; try restarting transaction" error.
Could anyone replicate this?
Thanks.
Comment #6
venkatesha.k12 CreditAttribution: venkatesha.k12 as a volunteer commented
Attached screenshots for this.
Comment #7
MegaChriz CreditAttribution: MegaChriz at WebCoo commented
It is possible that this feature doesn't work well yet with a large number of records. In the current implementation, all records to unpublish/delete are processed in one go. And yes, that can fail if the number of records is big. This was in fact mentioned in the original issue: #1470530-47: Unpublish/Delete nodes not included in feed, but it was put on hold in comment 80 of that issue (#1470530-80: Unpublish/Delete nodes not included in feed) because of #1363088: Feeds batch process not fully implemented - needed for large imports.
Patches that improve the implementation are welcome.
Maybe this can be implemented with the Queue API. See http://rbayliss.net/drupal-queue-api for a good tutorial for this API.
Comment #8
berndlosert CreditAttribution: berndlosert commented
I had to deal with this issue at work, since we deal with lots of feeds and some of them have over 20000 items. The solution I came up with was to create my own FeedsProcessor plugin, which overrides the process() method so that instead of calling clean(), I call my own custom method that adds the nodes that need to be unpublished into a queue that gets processed little by little throughout the day.
I would have overridden the clean() method instead but then I realized that the way it is set up with $state->removeList was not going to work since some of the feeds I deal with are huge. (The size of the removeList array is proportional to the number of items in the feed being imported and you quickly run into memory problems when the feed has 100k items or more).
Comment #9
MegaChriz CreditAttribution: MegaChriz at WebCoo commented
$state->removeList is initialized in FeedsProcessor::initEntitiesToBeRemoved() and contains a list of entity IDs. Initially, the list indeed contains the IDs of all items that were imported with the current importer/feed node. Any suggestions to improve this?
@berndlosert: Would it help to share your implementation of your custom processor here, or is it very specific (for example: contains business logic)? How do you determine which nodes need to be unpublished?
Comment #10
berndlosert CreditAttribution: berndlosert commented
@MegaChriz: Yes, there is some very specific stuff in there which will most likely confuse anybody looking at it, since it is tied to other parts of "the business".
In order to determine which nodes need to be unpublished, I had to make several changes to the process() method. Every node that gets loaded in the process() method goes through entitySave(); I had to rework the hash checking logic to allow for this. The entitySave() call will eventually trigger a presave hook that updates the node's changed property to the current time(). Thus, every node touched by the process() method will have its changed property updated. Since I know when the import started (using $source->state(FEEDS_START)), I know which nodes were affected by the import by simply querying for those nodes that have an updated time greater than the start of the import. The nodes that need to be unpublished are of course the ones whose updated time is older than the start of the import. When the import is done, I simply query for these nodes and add them one by one into a queue using a loop.
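The query described above could look roughly like this (a sketch under assumptions: the feeds_item join, $importer_id, $import_start, and the queue name are illustrative, not the actual code from that site):

```php
// Sketch of the approach described above: find nodes created by this
// importer whose 'changed' timestamp predates the import start, then
// queue them for unpublishing one by one.
$query = db_select('node', 'n');
$query->innerJoin('feeds_item', 'fi', "fi.entity_id = n.nid AND fi.entity_type = 'node'");
$stale_nids = $query
  ->fields('n', array('nid'))
  ->condition('fi.id', $importer_id)
  ->condition('n.changed', $import_start, '<')
  ->execute()
  ->fetchCol();

$queue = DrupalQueue::get('my_unpublish_queue');  // Hypothetical queue name.
foreach ($stale_nids as $nid) {
  $queue->createItem(array('nid' => $nid));
}
```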
Comment #11
berndlosert CreditAttribution: berndlosert commentedRegarding the hash checking logic, I changed things so that the
map()
,entityValidate()
andentitySaveAccess()
methods are called only when one of$entity->feeds_item->is_new
,$this->config['skip_hash_check']
or$changed
is TRUE.Comment #12
MrDaleSmith CreditAttribution: MrDaleSmith at CTI Digital commented
I thought I had this issue, but on investigation it appears that instead I have an issue with Workbench Moderation integration: with Workbench Moderation in place, if the node has already reached the published state, then Feeds' attempt to unpublish it has no effect and the published revision remains displayed. This may be an issue with our local setup, but I thought I'd mention it in case anybody else comes here with the same issue. I've resolved it with a custom implementation of hook_node_save that checks the content of the feeds_item property before implementing Workbench Moderation's transition process.
Comment #13
MegaChriz CreditAttribution: MegaChriz at WebCoo commented
Closed the following issues as duplicates:
#2731437: Batch processing for FeedsProcessor::clean()
#2862909: Batched Unpublish/Delete of Non-existent items
#2731437: Batch processing for FeedsProcessor::clean() has a patch.
Comment #14
MegaChriz CreditAttribution: MegaChriz at WebCoo commented
#2862909: Batched Unpublish/Delete of Non-existent items suggests the following improvement as well:
So, the unpublish action should be modified to only load nodes that are published from $state->removeList.
Comment #15
MegaChriz CreditAttribution: MegaChriz at WebCoo commented
I've updated the issue summary to better reflect what needs to be done to fix this issue.
Comment #16
jrusse27 CreditAttribution: jrusse27 commented
I would be very interested to see if that is what is causing nodes to not be unpublished on my site. Would you mind posting a patch of your fix so that I can test?
Comment #17
MrDaleSmith CreditAttribution: MrDaleSmith at CTI Digital commented
@jrusse27 I don't have access to the code I used any more, but basically the default behaviour of Workbench Moderation is that once a version of a node has been published, that version is displayed to users until explicitly unpublished. Feeds was marking the node unpublished, but Workbench Moderation considered this to be a new draft and just kept on displaying the published revision of the node.
I used hook_node_save to check if the feeds item was there and should be unpublished, and then ran the node through the unpublish form submit handler in the module.
However, this was a year ago; both modules have been updated since then (Workbench Moderation significantly so), and I don't know if the issue still applies.
Comment #18
MegaChriz CreditAttribution: MegaChriz as a volunteer commented
For the D8 version of Feeds, batch support for the cleaning stage has been added in #2939503-15: Unpublish/Delete nodes not included in feed. Now it just needs to be coded for the D7 version too. Note: the code from that other issue cannot easily be backported due to architectural changes in the D8 version, but perhaps it can provide inspiration.
Comment #19
efrainh
I created this patch for my specific problem, which was the timeout error when deleting a large number of users after importing a huge file (120,000+ records) with the option to delete previously imported users missing in the feed. I decided to start implementing a solution where the deletion process is handled with a batch. I haven't seen the D8 implementation, but I tested my patch with my large files and it worked.
Basically, I modified the clean() method to check if $state->removeList is greater than a limit number I set; if it is greater, I start the batch process. This code is supposed to work on nodes and users with no problem.
I believe this can be a good start, @MegaChriz.
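The rough shape of that change might look like the following (an approximation based on the comment, not the actual patch from #19; the batch callback name is hypothetical):

```php
// Approximation only: clean() defers large remove lists to the Batch API
// instead of processing everything inline in one request.
protected function clean(FeedsState $state) {
  $limit = variable_get('feeds_process_limit', FEEDS_PROCESS_LIMIT);
  if (count($state->removeList) > $limit) {
    $batch = array(
      'title' => t('Removing non-existent items'),
      'operations' => array(),
    );
    foreach (array_chunk($state->removeList, $limit) as $chunk) {
      // 'feeds_clean_batch' is a hypothetical batch operation callback.
      $batch['operations'][] = array('feeds_clean_batch', array($this->id, $chunk));
    }
    batch_set($batch);
    return;
  }
  // Small lists keep the existing one-request behaviour.
  // ... existing unpublish/delete loop ...
}
```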
Comment #20
efrainh
Comment #21
MegaChriz CreditAttribution: MegaChriz as a volunteer commented
Cool! I haven't tested the code yet, but it looks like a good start indeed. For the limit we could reuse the variable 'feeds_process_limit'. We also need to figure out how we can get this operation into a queue instead of a batch on cron runs.