Problem/Motivation
There needs to be a way to adjust the source key before it gets used. Currently, the setting of the source IDs is embedded inside next(). This should be externalized so it can be overridden.
Proposed resolution
- Hash the keys in the mapping table. This fixes the problem for non-ASCII keys.
- Add the unhashed values to non-indexed (non-PK) columns. This makes the DX of hashing less bad, since the original values would still be available; they are just no longer stored as a PK (with all its limitations).
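Outside Drupal, the hashing approach can be sketched in a few lines. The function name, the 64-character width, and the choice of SHA-256 are illustrative assumptions, not the patch's actual implementation:

```python
import hashlib

def hash_source_ids(source_id_values):
    """Reduce arbitrary source ID values to a fixed-length key."""
    # Serialize the ID values in a deterministic order, then hash.
    # The hex digest has a constant length no matter how long or how
    # non-ASCII the original values are, so it is always safe to store
    # in an indexed VARCHAR column.
    serialized = repr(sorted(source_id_values.items()))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()

# A long, non-ASCII key that would overflow a 255-character ASCII PK.
row_ids = {"title": "Ett långt svenskt namn " * 20}
print(len(hash_source_ids(row_ids)))  # 64
```

The unhashed values can then live in ordinary non-indexed columns, purely for debugging and readability.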
Remaining tasks
Provide a patch with the new direction.
User interface changes
API changes
This adds a public setter on SourcePluginBase & MigrateSourceInterface.
Data model changes
Original summary by @miiimooo.
This is a follow-up to #2613332: Support for non-ascii collations in SQL migration map.
My problem is that I have Swedish characters in the only column used to create terms in a taxonomy. The code in Sql.php modifies the sourceidX field definition to ASCII so it can store keys up to 255 characters. Removing this breaks a number of tests.
At the moment, the CSV source returns type "string" for the identifiers field. But maybe this could be something different so I don't run into this problem with non-ASCII characters?
Comment | File | Size | Author
---|---|---|---
#69 | interdiff.txt | 696 bytes | edysmp
#69 | 2613878-hash_source_keys-69.patch | 25.47 KB | edysmp
#35 | interdiff-33-35.txt | 588 bytes | edysmp
#31 | 2613878-hash_source_keys-31.patch | 23.47 KB | edysmp
#31 | interdiff-29-31.txt | 16.35 KB | edysmp
Comments
Comment #2
heddn

Comment #3
heddn

Comment #4
heddn

Before I get distracted, let's throw out a patch to start the conversation. BTW, I'm classifying this as a bug in the API.
Comment #5
mikeryan

So, next() is just filled with chickens and eggs... Optimally, prepareRow() is the place to take the raw incoming data and transform it into your canonical source for the pipeline, so ideally that would be the place to manipulate source keys as needed. But one of the inputs into prepareRow() (via the Row object) is the existing idmap data (if any), so to retrieve that we have to have the already-fixed-up keys. Without rearranging the chickens and eggs, yes, adding the public setter is fine as far as it goes. Since it requires extending the source plugin, I'd also ideally like to see an event so anyone can get in there.
But is there any deeper refactoring we can do here? The source is tightly coupled to the idmap here; should it be? Ideally the source should just be delivering source data, but it needs to know what rows have previously been processed and thus (usually) skipped. Or it could spew forth events and a listening idmap could tell it when to ignore a row... Anyway, that's all pie in the (9.x) sky...
Comment #6
chx (CreditAttribution: chx)

That setter is missing because it shouldn't exist. IMO this is a won't-fix, and the other issue should focus on documenting getIds instead of prepareRow.
Comment #7
heddn

@chx, sometimes the only value available for the source key is a really long string. Sometimes it includes strange non-UTF-8 characters that aren't supported by the DB as a key. Sometimes the key is a number fabricated at run time. Lots of reasons exist why one might want to modify the source key. Currently, without overriding next(), there isn't any way to perform that override. In every 10 migrations on D7, I end up having to override source keys once or twice.
Why shouldn't it exist? What are some possible alternatives?
Comment #8
miiimooo

Just to add: in my case, if I wanted to handle this in prepareRow() instead, it would mean rewriting the source values once for this migration and then again for any other migrations that might refer to it.
I guess that is an option, but it's conceivable that the rewriting could be complex enough to warrant creating a further mapping table.
Comment #9
heddn

To summarize a discussion in IRC between chx, phenaproxima, neclimdul and myself:

Comment #10
heddn

Another related issue: #2543282: Migrate source CSV dies on long text. Also, I updated the IS given the current direction to solve this. Since this is the 4th issue on this in the last month, I'm going to bump it to major; it seems to be a very common request.
Comment #11
heddn

Comment #12
edysmp

Working on the new direction.

Comment #13
edysmp

Initial work in progress; still needs work on tests.
Comment #15
chx (CreditAttribution: chx)

Thanks for the great work! Overall it looks really good.
Please do not use a separate setSourceIdsHash() method, especially not a public one (if needed for testing, use a protected method). Currently it's not used anywhere but the constructor, and that's the right thing to do; as with the source IDs, the source hash should never change.
Comment #16
Lord_of_Codes (CreditAttribution: Lord_of_Codes)

Changed the function signature of setSourceIdsHash and turned it into a protected method.

Comment #17
jian he (CreditAttribution: jian he)

Comment #19
jian he (CreditAttribution: jian he)

Re-roll.
Comment #21
heddn

The problem is that the source ID set in next() uses the value before it is transformed. Then it freezes the row. Transform changes the values. #15 above (rightly) points out that the setter for the hash should be protected, but then we cannot update it when we transform the value. And getDestination() uses the hash of the old value from PRIOR to the transform.

Comment #22
heddn

This is where I think we need to solve the problem. Instead of passing in the original source values (which can change in transform) or passing in a hash (which is based on the original value), we need to pass in the transformed values and then hash them. Otherwise we will end up creating duplicate entries.
Test case:
- "Spanish" hashed = abc
- "Español" hashed = def

Why?
1. The 1st record (Spanish, which was transformed to Español) doesn't exist in the mapping. Whether we pass in abc or Spanish, it isn't found, so a record is created for Español.
2. The 2nd record (Español) also doesn't exist in the mapping. Passing def or Español, it isn't found either, so a 2nd entry is created for Español.
We need to pass the post-transformation values to lookupDestinationId(). But now we run into another problem: there isn't a way to extract the transformed values of the source IDs from Row. getSourceIdValues() returns the pre-transform values. Transform changes the destination values in the row, not the source values, and the keys for the destination are not the same as for the source.
Possible solution:
Update getSourceIdValues() to merge in any transformed values, and rename the function to getIdValues(), because it isn't the source values any more; it is just the values of the IDs. But I don't see a way to make this happen: there isn't any mapping inside of Row. That's outside of it.
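The duplicate-entry failure mode described above can be reproduced with a toy ID map; the hash helper and the map structure below are illustrative stand-ins, not the real migrate API:

```python
import hashlib

def h(value):
    # Stand-in for the ID map's source hash.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def transform(value):
    # Stand-in for a process pipeline with a static map.
    return {"Spanish": "Español"}.get(value, value)

id_map = {}  # source hash -> destination value

for source in ["Spanish", "Español"]:
    key = h(source)  # hashing the PRE-transform value
    if key not in id_map:  # the lookup misses for both rows
        id_map[key] = transform(source)

# Two map entries were created for the same destination: a duplicate.
print(sorted(id_map.values()))  # ['Español', 'Español']
```

Hashing the value after the transform would make both rows collapse onto one map entry.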
Comment #23
chx (CreditAttribution: chx)

I am confused as to what's happening here. I thought this change would be internal to the SQL idmap.
Comment #24
chx (CreditAttribution: chx)

As for #22, which is a very different issue, without hashing: I think extending the example with source IDs and destination IDs will make it easier to understand.
We have a static map with bypass TRUE mapping Spanish to Español.
Row 1: sourceid1 is Spanish; destinationid1 becomes Español.
Row 2: sourceid1 is Español; destinationid1 becomes Español.
These are two different rows as identified by the source ID. If you want to avoid this, then add a process plugin which skips the row if the destination already exists. See DedupeEntity::exists for the very code you need.
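The suggested workaround, a process step that skips a row when its destination already exists, can be sketched as follows. This is only an analogy to the DedupeEntity::exists approach; the function and parameter names are hypothetical:

```python
def dedupe(rows, transform, existing):
    """Skip a row when its transformed destination already exists,
    analogous to a process plugin built around DedupeEntity::exists."""
    for source in rows:
        dest = transform(source)
        if dest in existing:
            continue  # destination exists: skip the row entirely
        existing.add(dest)
        yield source, dest

static_map = {"Spanish": "Español"}
pairs = list(dedupe(["Spanish", "Español"],
                    lambda s: static_map.get(s, s),
                    set()))
print(pairs)  # [('Spanish', 'Español')]
```

Only the first row produces a destination; the second row maps to an existing destination and is dropped.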
Comment #25
edysmp

I think this is enough; I made changes only in the SQL idmap.
Please review.
Comment #26
edysmp

Comment #28
chx (CreditAttribution: chx)

Thanks for the patch! A few observations:
$this:: is valid syntax. Can we change it to static::?

Comment #29
Ada Hernandez (CreditAttribution: Ada Hernandez at MTech, LLC)

I made the changes referenced in comment #28.
Comment #31
edysmp

Worked on the tests and refined the SQL idmap.

Comment #33
edysmp

Updated test.

Comment #35
edysmp

Updated the getSourceIDsHash function.
Comment #36
chx (CreditAttribution: chx)

This looks incredibly promising; thanks for the persistent hard work!

Comment #38
Ada Hernandez (CreditAttribution: Ada Hernandez at MTech, LLC)

I updated testMessageSave() for the unit test.

Comment #40
edysmp

Final test... for now.

Comment #41
edysmp

Comment #42
chx (CreditAttribution: chx)

Awesome!
Huh. What is this? Why is source_id_values sometimes a list and sometimes an associative array? Is this a testing artifact? What's going on here?
Comment #43
edysmp

Comment #44
chx (CreditAttribution: chx)

This is really close!

+ // source key and value, e.g. ['nid' => 41]. In this case, $source_id_values need to be ordered the same
This comment needs to be 80 columns max.

+ * It is public only for testing purposes.
This needs a line break before it.

But the most serious problems are in saveIdMapping(): previously, if there were no sourceIdFields, we didn't save anything. Whether that makes any sense is not for this patch to decide, so this behavior needs to be kept. Before
+ $fields += array(
add a
if (!$fields) { return; }
to keep the previous behavior. Also, in
$this->eventDispatcher->dispatch(MigrateEvents::MAP_SAVE, new MigrateMapSaveEvent($this, $keys + $fields));
let's remove $keys from here; it's just the hashed source ID values, which should never be leaked to the outside world.

Comment #45
edysmp

Thanks for the review and comments.

Comment #46
chx (CreditAttribution: chx at Smartsheet)

Looks great! Thanks!
Comment #48
alexpott

This is looking like a good solution to a hard problem.
This does not order the keys in the same order as $this->sourceIdFields(). See https://3v4l.org/BiD8Y. Also, the fact that the tests are green implies that we're missing test coverage of this.
This is a bit concerning: why has the order in which files are migrated changed due to this?
Comment #49
chx (CreditAttribution: chx at Smartsheet)

Regarding order: defeat snatched from the jaws of victory. Check https://3v4l.org/ag2u7; it does the ordering according to the first array.
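The ordering idea (reorder looked-up values to match a canonical field order, as in the 3v4l example) can be sketched like this; the field names are hypothetical:

```python
def order_by_fields(field_order, values):
    """Reorder an ID-value mapping to match a canonical field order,
    mirroring the order-by-first-array behavior from the 3v4l example."""
    return {field: values[field] for field in field_order if field in values}

fields = ["source_id1", "source_id2"]            # canonical order
values = {"source_id2": "b", "source_id1": "a"}  # lookup order
print(list(order_by_fields(fields, values)))  # ['source_id1', 'source_id2']
```

Ordering by the canonical field list guarantees the hash input is the same no matter what order the caller supplied the values in.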
Comment #50
chx (CreditAttribution: chx at Smartsheet)

> This is a bit concerning - why has the order that files are migrated changed due to this?

Because previously we ordered on the serial source ID and now we order on the hash of that ID. The order is still deterministic, just different.
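The point can be illustrated directly: sorting by a hash of the key generally produces a different sequence than sorting by the serial key itself, but each sequence is fully deterministic. The helper below is a sketch, not the ID map's code:

```python
import hashlib

def h(value):
    # Illustrative stand-in for the ID map's source hash.
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()

ids = [1, 2, 3, 10]
by_source = sorted(ids)                    # old: order by serial source ID
by_hash = sorted(ids, key=lambda i: h(i))  # new: order by the ID's hash

# Re-running either sort yields the same sequence every time.
print(by_source)
print(by_hash)
```

Since the import is stateless between rows, either deterministic order is acceptable for the migration itself; only tests that assumed the old order notice the change.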
Comment #51
benjy (CreditAttribution: benjy at PreviousNext)

I'm not sure if we added explicit coverage for the key ordering, but we had implicit coverage from the pg driver; just added a test run for that.
Original issue: #2571499: idMap source and destination id filtering requires keys
Comment #52
heddn

PG tests passed. I think that means this is RTBC again?

Comment #53
alexpott

@heddn, nope, the sorting issue needs to be fixed, as explained in #48.1.

Comment #54
benjy (CreditAttribution: benjy at PreviousNext)

@heddn, the PG tests failed? https://www.drupal.org/pift-ci-job/155478

Comment #55
chx (CreditAttribution: chx at Smartsheet)

I discussed this with edysmp yesterday at length, and we came to an agreement on how to fix the sorting; it's not hard; use a loop. I expect they will post a patch soon, test pending.
Comment #56
edysmp

For sorting.

Comment #58
edysmp

And one less problem...

Comment #60
edysmp

Final test.

Comment #61
heddn

Given the blocking nature this causes for contrib migrate, I'm marking this as migrate critical. This really needs to get in before 8.1, or there will be a lot of BC issues.
Comment #62
chx (CreditAttribution: chx at Smartsheet)

This is quite close to ready. My only concern here is:
// Postgress sorts results by order inserted, MySQL sorts by hash.
There are two problems: a) it's "PostgreSQL"; b) while the comment states two facts, it is not clear at all why these facts need to be stated here. So something like: "As PostgreSQL sorts results by order inserted and MySQL sorts by hash, create a consistent order for easier testing".
Comment #63
alexpott

Doesn't this sort of imply that the fix should be in getIdMap(), so that it returns a consistent order?
I thought the whole point was that the order of the arguments does not matter?

Comment #64
heddn

Re #63:
The sort order is always consistent/deterministic. Typically the order doesn't matter, except when running tests. No need to add extra overhead to getIdMap(); just put a deterministic order into the test and it is fine.
The order being passed to lookupDestinationID() is important. See:
Comment #65
chx (CreditAttribution: chx at Smartsheet)

> The sort order is always consistent/deterministic.

This is not true. http://www.postgresql.org/docs/9.1/static/queries-order.html says "The actual order in that case will depend on the scan and join plan types and the order on disk, but it must not be relied on", but we do not care:

> Typically, the order usually doesn't matter, except when running tests.

Correct. The actual import is stateless between rows, so whatever order we get the rows in, that's all fine. The test uses an order for simpler code, but the actual import does not.
Comment #67
heddn

Comment #69
edysmp

Comment #70
chx (CreditAttribution: chx at Smartsheet)

Thanks.

Comment #71
alexpott

Committed da55f60 and pushed to 8.0.x and 8.1.x. Thanks! I committed this to both branches as this is a bug, and although there is an API change here, it is small and migrate is experimental.

Comment #74
davidwbarratt (CreditAttribution: davidwbarratt at Golf Channel)

This issue introduced a critical issue. :(
#2679797: Migration migrate_update_8009 for source hash