Note: klausi's original issue summary below -eosrei
Ever had to build a migration against live data? Me too! Source records are added, updated and even deleted. The migrate module currently only supports new and updated source records. Here is a patch to fix that! Plus, some other related things...
Usage
If you are using --update to update all destination objects and want to remove any un-updated objects (OP's original request):
drush migrate-import --all --update
drush migrate-rollback --all --needs-update
If you are using track_changes to update destination objects with changed source data and want to remove any unprocessed objects:
drush migrate-import --all --track-unprocessed
drush migrate-rollback --all --unprocessed
If you are using a highwater field to update destination objects with changed source data and want to remove any unprocessed objects:
drush migrate-import --all --track-unprocessed
drush migrate-rollback --all --unprocessed --do-not-reset-highwater
Features
The following features are implemented:
- Separated needs_update flag data from row import status, includes hook_update_7208().
- Adds two row status values: STATUS_SKIPPED and STATUS_UNPROCESSED. Definitions: STATUS_IGNORED means the row has been ignored by a prepareRow() function, STATUS_SKIPPED means the row has been skipped because of the id list, highwater field, or tracked changes.
- Displays skipped record statistics after import
Processed 19299 (3 created, 4 updated, 0 failed, 0 ignored, 19292 skipped) in 24.6 sec (17/min) - done with 'Example'
- Implement roll back of both unprocessed and needs-update objects.
- New tests for track_changes, skipped records, and rollback unprocessed with 17 new assertions.
Adds a new option to drush migrate-import:
--track-unprocessed Set the status of all mapped rows to
unprocessed. When the import process
is complete, any unprocessed records
can be considered missing from the
source.
Adds three new options to drush migrate-rollback:
--do-not-reset-highwater Do not reset the highwater field
value after the roll back. Generally
used with partial roll backs.
--needs-update Only roll back objects marked
needs-update
--unprocessed Only roll back objects with a status
of unprocessed.
Notes
I've tried to reduce the overall amount of changed lines, which is why I used migrate-rollback instead of a new drush command such as migrate-purge.
--needs-update and --unprocessed are not compatible with --idlist (obviously) during rollback. They will override anything set it using --idlist.
For sanity the syntax of the saveIDMapping() function has been changed to include separate import_stats and needs_update values. I've checked the following contrib modules: commerce_migrate, g2migrate, migrate_d2d, migrate_extras phpbb2drupal, TYPO3_migrate, and wordpress_migrate. Only the current dev of migrate_d2d uses the saveIDMapping() function, so a small patch is required for it. Additionally, variable defaults now come from the database defaults rather than function and database defaults. If a passed value is NULL, it will not be updated in the database. No more losing your $hash value to set a $rollback_action.
Original:
public function saveIDMapping(stdClass $source_row, array $dest_ids,
$needs_update = MigrateMap::STATUS_IMPORTED,
$rollback_action = MigrateMap::ROLLBACK_DELETE, $hash = NULL)
New:
public function saveIDMapping(stdClass $source_row, array $dest_ids,
$import_status = NULL, $needs_update = NULL, $rollback_action = NULL,
$hash = NULL)
I can submit this as multiple patches if it aids review. It was created in five commits against the migrate.git.
Per mikeryan's comment in #5, "This would absolutely have to be an optional, non-default action" There is no noticable change to any existing functionality. All new functionality is completely optional.
OP issue summary:
Use case: User migration from an old platform, during the staging phase of the project I don't want to rollback and re-import users all the time. Users can get deleted on the old platform in the meantime, so when I run migrate-import --update the already imported users that do not exist anymore should also get deleted.
There is a comment in #1222454-3: migrating users with --update fails that says that "in particular, any destination-side data not in the source data will be wiped on --update". But this does not work for me. The records in the migrate_map table are only set to needs_update=1, but the Drupal 7 users do not get deleted.
Is there any other command I must run to delete the users? Do I have to write a script myself that extracts the destids that are set to needs_update=1 and invoke user_delete_multiple()?
Comment | File | Size | Author |
---|---|---|---|
#19 | migrate-rollback-unprocessed-1416672-19.patch | 39.2 KB | 13rac1 |
Comments
Comment #1
klausiThe workaround in my migration class:
Performs pretty much ok (takes a few additional seconds at the end of the migration) on 180 obsolete users out of 23,000.
Comment #2
mikeryanThis is not a supported feature - the line you quote was speaking of the fields within a migrated item, it did not mean that items disappearing from the source data would magically be deleted from the destination side.
Comment #3
nerdcore CreditAttribution: nerdcore commentedThis would be a pretty great feature. It follows naturally from the discussion of highwater marks and updating of records - keeping data consistent between source and destination...
Wherever this is done in the Migration class (whether postImport() or another new function), the class itself has reference to the map table for that migration and its source key from whichever table was used for source data.
The Migration class could either iterate though all records in postImport(), LEFT JOINing the map table to the source data, and if the source ID is NULL, delete the record from the destination.
OR the Migration class could keep track of all source records which already have destinations, then in postImport() simply compare this list of source IDs to the map table, deleting records which are in the map and not in the internal list.
I just tried to var_dump($this) in postImport() to see if I could discern the structure of the source/dest/map data in my Migration class but I was overwhelmed with data. It would be nice to not have to re-iterate all of the source IDs in postImport() if possible.
Comment #4
Steven Jones CreditAttribution: Steven Jones commentedSo, we need importing and deleting for importing and migrating etc.
So you can do something like this, to force the update setting and to delete items after import. This has basically been ripped out of Migration::rollback()
Comment #5
mikeryanThis would absolutely have to be an optional, non-default action - searching for missing records is an expensive action. I'm not sure it even belongs as part of the import step - I'd be more comfortable with it as a separate action (drush migrate-purge or somesuch).
Comment #6
nerdcore CreditAttribution: nerdcore commentedI think having a flag that we could set on the custom Migration subclass would be ideal. It is definitely going to be expensive and should be off by default. But for my $0.02 I think something like:
$this->deleteWhenSourceDeleted = TRUE;
would be great.
I'm currently dealing with a case where migrations are being fired through cron and using a highwater mark to continuously update from new source data, but sometimes a source record gets deleted and we need to keep it in sync (not just add).
Comment #7
Dentorat CreditAttribution: Dentorat commentedSteven, where did you put this? Thanks
Comment #8
Steven Jones CreditAttribution: Steven Jones commented@penthehuman pop that code in your main Migration subclass.
Comment #9
neruda001 CreditAttribution: neruda001 commentedHi,
in the case you have custom entity in Drupal and data from another mysql DB or from several tables in the same db that you want to import as the entity content.
The issue of deleting non-existent source record previously migrated made me thinking a little and then I opted to develop in postMigration() method the algorithm to do that.
Assuming that the query to get the data maybe a little bit complicated, so I want to describe it once in migration subclass.
For the MigrationSqlSource there was a non implemented method originalQuery() so it got me an error when trying to get the originalQuery object in the postImport() using
$this->source->query();
so I had to implement a new method in sql.inc where this file reside in /plugins/sources of the migrate module ( ver 2.5 ). Discussion https://drupal.org/node/1799964.So the new method in sql.inc called for example getQueryCopy() is like this:
and in the postImport method of the migration class now you can get the query object value without having a dangerous reference to it, this is because later you need to edit this object.
just run migrate import to delete non-existent source record from destination and mapping table of the Drupal db.
Comment #10
13rac1 CreditAttribution: 13rac1 commentedRewrote original issue summary. See above. Comment below.
Comment #11
13rac1 CreditAttribution: 13rac1 commentedAdding definitions above.
Here are sanitized results from my current project after resyncing the database, applying this patch, and running Drupal updates:
Comment #12
13rac1 CreditAttribution: 13rac1 commentedAdding note.
Comment #13
13rac1 CreditAttribution: 13rac1 commentedSimplified patch to remove unnecessary code optimization to reduce patch complexity.
Comment #14
13rac1 CreditAttribution: 13rac1 commentedAdjusting title for clarity.
Comment #15
joachim CreditAttribution: joachim commentedWill this affect migrations that specifically expect to have source records go missing?
For example, I'm working with migrations where users upload new source files to add to a migration. In that case, the migration map table will find that it has records that no longer have matching data in the source. See #2223175: using Migrate UI wizard to add to existing source data / use same map tables and https://drupal.org/sandbox/joachim/2231649
Comment #16
13rac1 CreditAttribution: 13rac1 commented@joachim: Are you asking if this patch will break existing migrations? Of course not. This is an optional new feature. If that isn't your question, can you rephrase? Thanks
Comment #17
joachim CreditAttribution: joachim commentedYup, that was my question (well, and future migrations too!). And I'm glad to hear that, thanks! :)
Comment #18
13rac1 CreditAttribution: 13rac1 commentedUpdating patch for recent commits to 7.x-2.x-dev. The patch includes 5kb of tests.
Comment #19
13rac1 CreditAttribution: 13rac1 commentedMinor update to reset skipped item count when ignored item count is reset.
Comment #20
13rac1 CreditAttribution: 13rac1 commentedNULL
Comment #21
mikeryanThere's way too much going on in this patch, well beyond the original scope of providing a means to remove data no longer in the source.
Comment #22
13rac1 CreditAttribution: 13rac1 commentedAh ha! Thank you for responding. I'll break this into multiple dependent issues.
How should "skipped" rows be accounted for? Should they be combined with "ignored" rows? That could reduce the number of code changes.
The first related issue #2249191: needs_update in the migrate_map_% tables is named/used inconsistently
Comment #25
daniroyo CreditAttribution: daniroyo commentedHello!
¿In which state is this feature?
Is it possible to use it?
Comment #26
mikeryanAs the issue status says, the patch needs work (i.e., it needs first of all to be reduced to the specific issue without extraneous refactoring).
Comment #27
cristiroma CreditAttribution: cristiroma commentedUsing @klausi method in #1 to delete in postImport works if you do the import from command line in one shot. However, sometimes we import large amount of data in a cron task with something like:
drush mi WebServiceData --limit="30 seconds"
, thereforepostImport
is executed every time, so it's not feasible.I have written a custom drush command to remove orphans, which can be run independently on the actual migration.
@mikeryan - Would be useful to implement a founction in the MigrateMap/MigrateSQLMap to retrieve all records? Or is anyhthing I miss? Would be simpler than query migrate_map_migration table using
db_select
, to get the records. $migration->getMap()->getMappings()Code looks something like this:
Comment #28
dshumaker CreditAttribution: dshumaker commentedI have the same issue of a missing entry in the source indicates a "possible" entry/node to delete. I looked at several ways of doing it and what I landed on was creating a mysql table called "keep_list" which is a fresh (emptied after each import) table created specifically for the purpose of indicating which entries are in the XML source. I thought about re-purposing the "needs-update" column but any re-purposing is potentially problematic because it's not what the author had in mind for the field/column. Indeed I found that after initial creation of a node the needs_update flag was triggered because it was assumed drupal needed to put "true" values into the fields (which was not the case, in my case, so I forced needs_udpate to be 0 on initial creation).
Then after my import, in the postImport member I loop through the results of a query that indicates which entries were not visited and I set their status to 0 (unpublished).
Then later on I can go back and remove unpublished nodes based on other criteria if need be (via a new Migration class pass).
In light of Klaus' comment on his patch, ""I've tried to reduce the overall amount of changed lines, which is why I used migrate-rollback instead of a new drush command such as migrate-purge.", I don't believe we should re-purpose rollback just because we don't want to create a new command line parameter "--remove-missing-source" or write a new MigrationClass pass that is specifically designed for deletes alltogether. Indeed it seems like creating separate Migration classes is the best practice of sorts.
Basically, semantically, rollback does not mean delete and I don't think we should go that direction because it's un-intuitive. To me, rollback means undo changes and that does seem intuitive.
0.02 cents.
Comment #29
anrikun CreditAttribution: anrikun commentedI fully agree with dshumaker above (#28): rollback does not mean delete and I don't think we should go that direction because it's un-intuitive.
Comment #30
Gomez_in_the_South CreditAttribution: Gomez_in_the_South commentedAs discussed above, the postImport will be run at the end of every batch. This usually happens if called via the Migrate GUI or running drush with a time limit (either via parameter or CLI max_execution_time set).
If drush approaches the PHP memory limit, then it will also batch the migration in order to free up memory. This is another scenario to be aware of.
Therefore a solution as posted in #27 may be more suitable if you need to reliably run a function only after the entire migration has completed.
Sample output from migration logs demonstrating this:
Comment #31
GeduR CreditAttribution: GeduR at Metadrop commentedHi all, I've merged some of the previous code to a module (sandbox), any help is welcome.
Maybe after further exploration it could be merged with the main module.
https://www.drupal.org/sandbox/gedur/2873105
Hope it helps
Comment #32
cosmicdreams CreditAttribution: cosmicdreams commentedShould this issue be assigned to migrate_tools? They already have a concept of unprocessed files that is derived from calculating how many records are coming in from the source vs how many records exist in the local Drupal site. No real "tracking" is happening there but perhaps this ticket will add that.
Comment #33
cosmicdreams CreditAttribution: cosmicdreams commentedAlso, it appears this change targets D7, should we target D8 first?