Scans all nodes for duplicate node titles and removes all but one copy of each.
By design, it only scans nodes for titles that are exact case-sensitive matches, including whitespace.
The node with the lowest ID is taken as the original; all others are queued for deletion.
Both the scanning (which finds all duplicates) and the actual deletion are performed via batch processing.
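The selection rule above can be illustrated with a minimal sketch. It is written in Python rather than Drupal's PHP and is not the module's actual code: the function names `find_duplicates` and `batches`, the batch size, and the sample nodes are all hypothetical, chosen only to show exact-title matching, keeping the lowest ID, and splitting the deletion queue into chunks.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def find_duplicates(nodes: Iterable[Tuple[int, str]]) -> Dict[int, List[int]]:
    """Map each original node ID to the IDs of its duplicates.

    Titles are compared exactly: case and whitespace both count.
    """
    by_title: Dict[str, List[int]] = defaultdict(list)
    for nid, title in nodes:
        by_title[title].append(nid)

    duplicates: Dict[int, List[int]] = {}
    for nids in by_title.values():
        if len(nids) > 1:
            nids.sort()
            # The lowest ID is treated as the original; the rest are duplicates.
            duplicates[nids[0]] = nids[1:]
    return duplicates


def batches(ids: List[int], size: int) -> Iterator[List[int]]:
    """Yield the deletion queue in fixed-size chunks (hypothetical batch size)."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Hypothetical sample data: (node ID, title).
nodes = [
    (1, "Town Hall"),
    (2, "Library"),
    (3, "Town Hall"),
    (4, "town hall"),    # differs in case, so not a duplicate
    (5, "Town Hall "),   # trailing space, so not a duplicate
    (6, "Town Hall"),
]

dupes = find_duplicates(nodes)
queue = sorted(nid for extra in dupes.values() for nid in extra)

print(dupes)                    # {1: [3, 6]}
print(list(batches(queue, 1)))  # [[3], [6]]
```

In this sketch, nodes 4 and 5 survive because their titles are not byte-for-byte identical to "Town Hall", which mirrors the deliberately strict matching described above.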
This *will* destroy content. Because that's what it does.
2013 - Deprecated in favor of Remove Duplicates
This sandbox will not be getting a full release. It's a one-off tool that has the potential for a lot of support headache for me (solving other folks' unique mistakes), and as I've learned not to make this mistake again, I won't be actively testing it going forward. Please look at "Remove Duplicates" for feature requests.
Originally based on sample code from prabhakarsun at http://drupal.org/node/720190 , this is a full rewrite that uses batch processing to manage this problem over thousands of bad nodes.
The original scenario (for me) was finding a site that was using Feeds to import a large number of (location) nodes from a data source that was expected to be updated occasionally. The problem was that:
- The re-import interval was left at the default of "30 minutes".
- The Feeds mapping did not have a GUID set, so each re-import created new items.
- The original developer left it in that state after minimal testing (it worked once).
By the time I found this, we had 400 nodes x 800 copies of them :-/
TODO: maybe add another check that compares additional fields or values, such as CCK fields. This is even more intensive on large numbers of nodes, so it has not been done.