A fellow Acquian is facing an issue I've heard of before: the source is a longish CSV file that has to be synced regularly (new records added, changed records updated in Drupal), but it has no field suitable for highwater marks. How do we detect changed records so that we avoid rewriting everything on each import operation? He's using an md5 hash of the source row, which I think is the best that can be done under those circumstances, and I think that could be supported directly in MigrateSource:

  • Add a hash column to the migrate map table.
  • Support a track_hashes option to the MigrateSource constructor.
  • When track_hashes is enabled, take a hash of the raw source row (before prepareKey() or anything else is called) and save it away.
  • If we find that the source row already has a map table entry, compare the incoming hash to the saved hash. If they match, skip it (before calling prepareRow).
  • When saveIDMapping is called when done processing a row, save the hash.
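
To make the steps above concrete, here is a rough, language-neutral sketch in Python. It is not Migrate code: `row_hash`, `sync`, and the `id_map` dict (standing in for the map table's proposed hash column) are all hypothetical names, and serializing the row as JSON before hashing is just one assumed way to get a stable digest.

```python
import hashlib
import json


def row_hash(row):
    """Return an MD5 hex digest of a source row (a dict of column -> value)."""
    # Sort keys so the digest does not depend on column order.
    payload = json.dumps(row, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def sync(rows, key_column, id_map):
    """Import new or changed rows and return the keys that were processed.

    id_map is a hypothetical stand-in for the map table: source key -> hash.
    """
    imported = []
    for row in rows:
        key = row[key_column]
        incoming = row_hash(row)
        # Already mapped with an identical hash: nothing changed, skip it.
        if id_map.get(key) == incoming:
            continue
        # A real importer would write the destination object here.
        imported.append(key)
        # Save the hash once the row is done, as in the last step above.
        id_map[key] = incoming
    return imported
```

Running `sync` twice over the same rows should import everything the first time and nothing the second; editing one row afterwards should cause only that row to be picked up again.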

I'm focused on the wizard API work at the moment, but patches welcome....

Files:
Comment   File                    Size      Author
#14       migrate.1835822.patch   9.44 KB   hernani

Comments

mikeryan’s picture

Forgot to add one other thought I had, the hash function should be overridable - say, to remove irrelevant columns from the row before hashing...
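
A rough illustration of what an overridable hash function could look like, sketched in Python rather than in Migrate's PHP. The class names, the `hash_row` method, and the `exported_at` column are hypothetical and only serve the example.

```python
import hashlib
import json


class HashingSource:
    """Hypothetical source class with a default, overridable row hash."""

    def hash_row(self, row):
        # Default behaviour: hash every column of the row.
        payload = json.dumps(row, sort_keys=True)
        return hashlib.md5(payload.encode("utf-8")).hexdigest()


class IgnoreExportTimestamp(HashingSource):
    """Override that drops irrelevant columns before hashing."""

    # Assumed example of a column that changes on every export.
    ignored = {"exported_at"}

    def hash_row(self, row):
        relevant = {k: v for k, v in row.items() if k not in self.ignored}
        return super().hash_row(relevant)
```

With the override in place, two rows that differ only in `exported_at` hash to the same value, so they would not be treated as changed.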

hernani’s picture

+1!

mikeryan’s picture

migrate_d2d will make use of this: #1722850: Support highwater marks for users.

mikeryan’s picture

Issue tags: +Migrate 2.6

Tagging feature requests I'd like to get into Migrate 2.6.

@hernani: Do you have something that could serve at least as a starting patch for this?

Thanks.

hernani’s picture

No Mike, my approach was to handle it somewhat outside the migrate module, but I will see what I can do.

JSCSJSCS’s picture

I am following this because it is something I need. I would point out that hashing the row before calling any other methods will probably not work for those of us who create the source key on the fly in the prepareKey() method. It probably would not work for any field that is "created" or modified by prepareRow() either.

For instance, I use prepareRow() to "fix" street addresses and business names prior to import (changing the source from "Business Name, The Inc" to "The Business Name, Inc.").

This hash comparison method would see unchanged source file records as changed when compared to already imported (modified) destination data.

Right?

JSCSJSCS’s picture

I played with this most of the day and have just started doing some testing. I added a 32-character text field to a content type and used prepareRow() to create an MD5 hash of the concatenated string values of all the row's column data. Then I mapped that to the node's hash field I created.

The idea was that I would use --update to force the incoming source update (second time through the migration) and then compare the prepareRow()'s "new" MD5 hash with the one stored in the node and conditionally save or not save.

I could not figure out how to read and compare the destination's stored hash field contents with the prepareRow() incoming source hash field contents.

I thought I could use the $node->original entity values, but they do not seem to be any different than the new source data for some reason. I tried using hook_node_presave() too, but could not get that to work either. Those values all seemed to be null.

If I could just figure out a way, either in prepareRow() or prepare(), to compare and fire conditional saves, I would be very pleased.

mikeryan’s picture

Status: Active » Fixed

OK, finally got this in. Set the 'track_changes' option on your source constructor to take advantage of this support - if you do this, then your source data (as it is after prepareRow() is called) is hashed and tracked in the map table - if anything in the source row has changed since the item was previously imported, then it will be reimported.
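
Because the hash is taken after prepareRow(), cleanup done there should not make unchanged rows look changed, as long as the cleanup is deterministic. A small Python sketch of that idea follows; `prepare_row` and `prepared_hash` are hypothetical names, and this is an illustration of the principle rather than Migrate's actual code.

```python
import hashlib
import json


def prepare_row(row):
    """Hypothetical cleanup step: move a trailing ', The' to the front."""
    name = row["name"]
    if name.endswith(", The"):
        name = "The " + name[: -len(", The")]
    # Return a copy so the raw source row is left untouched.
    return {**row, "name": name}


def prepared_hash(row):
    """Hash the row as it looks after the cleanup step has run."""
    payload = json.dumps(prepare_row(row), sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```

The same raw row always produces the same prepared row and therefore the same hash, so only a real change in the source (or a change to the cleanup logic itself) would trigger a reimport.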

hernani’s picture

Yay! Thanks!

JSCSJSCS’s picture

Mike,

Thanks for working on this. By "reimported", do you mean that the NID will not change from the original import and that all other data will be overwritten? Will rollback still work? Does the reimported data represent a revision of the original data that can be diffed and/or reverted, or is all original data lost (which would be sub-optimal)?

mikeryan’s picture

Yes, with track_changes enabled, if the source data changes, the destination object is updated in place (no change to its ID). Rollback is unaffected. For node migrations, if your migration sets the defaultValue of 'revision' to 1, then you should get a fresh revision each time a node is reimported.

JSCSJSCS’s picture

Terrific! I can't wait to test this out.

Automatically closed -- issue fixed for 2 weeks with no activity.

hernani’s picture

I had to implement this for d6. Patch attached in #14.