Hash source rows to detect changes [#1835822]

A fellow Acquian is facing an issue I've heard of before: given a source that is a longish CSV file, to be synced regularly (new records added, changed records updated in Drupal), but with no field suitable for highwater marks, how do we detect changed records so we avoid rewriting everything on each import operation? He's using an md5 hash of the source row, which I think is the best that can be done under those circumstances, and I think that could be supported directly in MigrateSource:

Add a hash column to the migrate map table.
Support a track_hashes option to the MigrateSource constructor.
When track_hashes is enabled, take a hash of the raw source row (before prepareKey() or anything else is called) and save it away.
If we find that the source row already has a map table entry, compare the incoming hash to the saved hash. If they match, skip it (before calling prepareRow).
When saveIDMapping is called after a row has been processed, save the hash.
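
The steps above can be sketched roughly as follows. This is an illustration in Python rather than the module's PHP, and it is not Migrate's implementation: `row_hash`, `import_rows`, and the dict standing in for the map table's hash column are all hypothetical names chosen for the example.

```python
import hashlib


def row_hash(row):
    """Return an MD5 hex digest of a raw source row (a dict of column -> value)."""
    # Sort the column names so the digest does not depend on column order.
    joined = "\x1f".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def import_rows(rows, map_table, key="id"):
    """Import rows, skipping any whose hash matches the stored one.

    map_table is a dict of source key -> stored hash, a stand-in for the
    proposed hash column in the migrate map table (an assumption for this
    sketch). Returns the source keys that were actually (re)imported.
    """
    imported = []
    for row in rows:
        incoming = row_hash(row)  # hash the raw row before any preparation
        if map_table.get(row[key]) == incoming:
            continue  # unchanged since the last import: skip it
        # ... prepare the row and write it to the destination here ...
        imported.append(row[key])
        map_table[row[key]] = incoming  # save the hash with the ID mapping
    return imported
```

On a second run over an unchanged file, `import_rows` returns an empty list; only rows whose contents changed are processed again.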

I'm focused on the wizard API work at the moment, but patches welcome....

Comment	File	Size	Author
#14	migrate.1835822.patch	9.44 KB	hernani

Comments

Comment #1

mikeryan

he/him

English

Murphysboro, IL, USA

CreditAttribution: mikeryan commented 8 November 2012 at 20:04

Forgot to add one other thought I had, the hash function should be overridable - say, to remove irrelevant columns from the row before hashing...
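
As a rough sketch of what an overridable hash function might look like (in Python for illustration only; `RowHasher`, `IgnoreColumnsHasher`, and the `exported_at` column are hypothetical and not part of Migrate), a subclass could drop irrelevant columns before delegating to the default hash:

```python
import hashlib


class RowHasher:
    """Default behaviour: hash every column of the row."""

    def hash_row(self, row):
        # Sort the column names so the digest does not depend on column order.
        joined = "\x1f".join(f"{k}={row[k]}" for k in sorted(row))
        return hashlib.md5(joined.encode("utf-8")).hexdigest()


class IgnoreColumnsHasher(RowHasher):
    """Override: remove columns that should not trigger a reimport."""

    def __init__(self, ignored):
        self.ignored = set(ignored)

    def hash_row(self, row):
        relevant = {k: v for k, v in row.items() if k not in self.ignored}
        return super().hash_row(relevant)


# Usage: a changing export timestamp alone does not alter the hash.
hasher = IgnoreColumnsHasher(["exported_at"])
first = {"id": "1", "name": "x", "exported_at": "2012-11-08"}
second = dict(first, exported_at="2012-11-09")
same = hasher.hash_row(first) == hasher.hash_row(second)  # True
```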

Comment #2

hernani CreditAttribution: hernani commented 8 November 2012 at 20:20

+1!

Comment #3

mikeryan

CreditAttribution: mikeryan commented 13 November 2012 at 14:31

migrate_d2d will make use of this: #1722850: Support highwater marks for users.

Comment #4

mikeryan

CreditAttribution: mikeryan commented 17 January 2013 at 22:49

Issue tags:

+Migrate 2.6

Tagging feature requests I'd like to get into Migrate 2.6.

@hernani: Do you have something that could serve as at least a starting patch for this?

Thanks.

Comment #5

hernani CreditAttribution: hernani commented 19 January 2013 at 01:29

No, Mike. My approach was to do it somewhat outside the migrate module, but I will see what I can do.

Comment #6

JSCSJSCS CreditAttribution: JSCSJSCS commented 26 February 2013 at 21:49

I am following this because it is something I need. I would point out that taking the hash before calling any other methods will probably not work for those of us who create the source key on the fly in the prepareKey() method. It probably would not work for any field that is "created" or modified by prepareRow() either.

For instance, I use prepareRow() to "fix" street addresses and business names prior to import (for example, changing the source from "Business Name, The Inc" to "The Business Name, Inc.").

This hash comparison method would see unchanged source file records as changed when compared to already imported (modified) destination data.

Right?

Comment #7

JSCSJSCS CreditAttribution: JSCSJSCS commented 1 March 2013 at 01:35

I played with this most of the day and have just started doing some testing. I added a 32-character text field to a content type and used prepareRow() to create an MD5 hash of the concatenated string values of all the row's column data. Then I mapped that to the node's hash field I created.

The idea was that I would use --update to force the incoming source update (second time through the migration) and then compare the prepareRow()'s "new" MD5 hash with the one stored in the node and conditionally save or not save.

I could not figure out how to read and compare the destination's stored hash field contents with the prepareRow() incoming source hash field contents.

I thought I could use the $node->original entity values, but they do not seem to be any different than the new source data for some reason. I tried using hook_node_presave() too, but could not get that to work either. Those values all seemed to be null.

If I could just figure out a way, either in prepareRow() or prepare(), to compare and fire conditional saves, I would be very pleased.

Comment #8

mikeryan

CreditAttribution: mikeryan commented 9 April 2013 at 00:03

Status:

Active

» Fixed

OK, finally got this in. To take advantage of this support, set the 'track_changes' option on your source constructor. Your source data (as it is after prepareRow() is called) is then hashed and tracked in the map table. If anything in the source row has changed since the item was previously imported, it will be reimported.
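
To illustrate why hashing the row after preparation still leaves unchanged source rows untouched, here is a minimal sketch in Python (not the module's PHP, and not its actual code). `prepare_row`, `hash_prepared`, `needs_import`, and the dict of stored hashes are hypothetical stand-ins for prepareRow() and the map table.

```python
import hashlib


def prepare_row(row):
    """Hypothetical cleanup step standing in for prepareRow()."""
    prepared = dict(row)
    # Collapse runs of whitespace in the name column.
    prepared["name"] = " ".join(prepared["name"].split())
    return prepared


def hash_prepared(row):
    """MD5 hex digest of a prepared row, independent of column order."""
    joined = "\x1f".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def needs_import(row, stored_hashes, key="id"):
    """Return True, recording the new hash, if the prepared row has changed."""
    current = hash_prepared(prepare_row(row))
    if stored_hashes.get(row[key]) == current:
        return False  # same prepared data as last time: nothing to do
    stored_hashes[row[key]] = current
    return True
```

Because the stored hash and the incoming hash are both computed from the prepared source row, a deterministic preparation step yields the same hash for an unchanged source record. The comparison never involves the destination data.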

Comment #9

hernani CreditAttribution: hernani commented 9 April 2013 at 06:21

Yay! Thanks!

Comment #10

JSCSJSCS CreditAttribution: JSCSJSCS commented 9 April 2013 at 16:21

Mike,

Thanks for working on this. By "reimported", do you mean that the NID will not change from the original import and that all other data will be overwritten? Will rollback still work? Does the reimported data represent a revision of the original data that can be DIFF'ed and/or reverted, or is all original data lost (which would be sub-optimal)?

Comment #11

mikeryan

CreditAttribution: mikeryan commented 9 April 2013 at 17:38

Yes. With track_changes enabled, if the source data changes, the destination object is updated in place (no change to its ID). Rollback is unaffected. For node migrations, if your migration sets the defaultValue of 'revision' to 1, then you should get a fresh revision each time a node is reimported.