Problem/Motivation

There needs to be a way to adjust the source key before it gets used. Currently the setting of the source ids is embedded inside next(). This setting should be externalized so it can be overwritten.

Proposed resolution

  • Hash the keys to the mapping table. This would fix the problem for non-ASCII keys
  • Add the unhashed values to a non-indexed (PK) column(s). This would make the DX of hashing less bad since the original values would still be available, just not stored as a PK (with all its limitations)

Remaining tasks

Provide a patch with the new direction.

User interface changes

API changes

This adds a public setter on SourcePluginBase & MigrateSourceInterface.

Data model changes

Original summary by @miiimooo.
This is a follow up to #2613332: Support for non-ascii collations in SQL migration map.

My problem is that I have Swedish characters in my only column that is used to create terms in a taxonomy. The code is Sql.php modifies the sourceidX field definition to ascii so it can store keys up to 255 characters. Removing this breaks a number of tests.

At the moment CSV returns type "string" for the identifiers field. But maybe this could be something different so I don't run into this problem with non ASCII characters?

CommentFileSizeAuthor
#69 interdiff.txt696 bytesedysmp
#69 2613878-hash_source_keys-69.patch25.47 KBedysmp
#67 interdiff_64-66.txt1.98 KBheddn
#67 2613878-hash_source_keys-66.patch25.38 KBheddn
#64 interdiff_60-64.txt770 bytesheddn
#64 2613878-hash_source_keys-64.patch25.24 KBheddn
#60 2613878-hash_source_keys-60.patch25.24 KBedysmp
#60 interdiff-2613878-58-60.txt1.08 KBedysmp
#58 interdiff-2613878-56-58.txt788 bytesedysmp
#58 2613878-hash_source_keys-58.patch23.86 KBedysmp
#56 interdiff-2613878-45-56.txt1.45 KBedysmp
#56 2613878-hash_source_keys-56.patch24.86 KBedysmp
#43 2613878-hash_source_keys-43.patch25.43 KBedysmp
#43 interdiff-2613878-40-43.txt10.29 KBedysmp
#35 2613878-hash_source_keys-35.patch23.19 KBedysmp
#35 interdiff-33-35.txt588 bytesedysmp
#31 2613878-hash_source_keys-31.patch23.47 KBedysmp
#31 interdiff-29-31.txt16.35 KBedysmp
#29 2613878-hash_source_keys-29.patch10.89 KBAda Hernandez
#29 interdiff-25-29.txt5.43 KBAda Hernandez
#25 2613878-hash_source_keys-25.patch11.05 KBedysmp
#19 2613878-hash_source_keys-19.patch25.83 KBjian he
#16 interdiff-2613878-13-16.txt472 bytesLord_of_Codes
#16 2613878-16.patch26.06 KBLord_of_Codes
#13 use_hash_for_migration-2613878-13.patch26.06 KBedysmp
#4 drupal-migrate_set_source_keys-2613878-4.patch1.71 KBheddn
#33 interdiff-31-33.txt4.22 KBedysmp
#33 2613878-hash_source_keys-33.patch23.17 KBedysmp
#38 2613878-hash_source_keys-38.patch23.19 KBAda Hernandez
#38 interdiff-35-38.txt1.62 KBAda Hernandez
#40 interdiff-2613878-38-40.txt4.01 KBedysmp
#40 2613878-hash_source_keys-40.patch24.71 KBedysmp
#45 interdiff-2613878-43-45.txt2.31 KBedysmp
#45 2613878-hash_source_keys-45.patch25.48 KBedysmp
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

miiimooo created an issue. See original summary.

heddn’s picture

Project: Migrate Source CSV » Drupal core
Version: 8.x-1.x-dev » 8.0.x-dev
Component: Code » ajax system
Issue summary: View changes
heddn’s picture

Title: Non ASCII characters in identifiers » Add ability to overwrite source keys (Non ASCII characters in identifiers)
Component: ajax system » migration system
heddn’s picture

Category: Support request » Bug report
Status: Active » Needs review
FileSize
1.71 KB

Before I get distracted, let's throw out a patch to start the conversation. BTW, I'm classifying this a bug in the API.

mikeryan’s picture

Title: Add ability to overwrite source keys (Non ASCII characters in identifiers) » Add ability to overwrite source keys
Category: Bug report » Feature request
Status: Needs review » Needs work
Issue tags: +Needs tests

So, next() is just filled with chickens and eggs... Optimally prepareRow() is the place to take the raw incoming data and transform it into your canonical source for the pipeline, so ideally that would be the place to manipulate source keys as needed. But one of the inputs into prepareRow (via the Row object) is the existing idmap data (if any), so to retrieve that we have to have the already-fixed-up keys. Without rearranging the chickens and eggs, yes, adding the public setter is fine as far as it goes... It requires extending the source plugin, I'd also ideally like to see an event so anyone can get in there.

But, is there any deeper refactoring we can do here? The source is tightly coupled to the idmap here, should it be? Ideally the source should just be delivering source data - but, it needs to know what rows have previously been processed and thus (usually) skipped. Or it could spew forth events and a listening idmap could tell it when to ignore a row... Anyway, that's all pie in the (9.x) sky...

chx’s picture

That setter is missing because it shouldn't exist. IMO this is a won't fix and the other issue should focus on documenting on getIds instead of prepareRow.

heddn’s picture

@chx, sometimes the only value available for the source key is a really long string. Sometimes it includes strange non-UTF8 characters that aren't supported by the DB as a key. Sometimes the key is a fabricated number at run time. Lots of reasons exist why one might want to modify the source key. Currently, without overriding next(), there isn't any way to perform that override. In ever 10 migrations on D7, I end up having to override source keys 1 or 2 times.

Why shouldn't it exist? What are some possible alternatives.

miiimooo’s picture

Just to add, in my case if I wanted to handle this in prepare_row rather, it would mean rewriting the source values once for this migration and then for any other migrations that might refer to it.
I guess that is an option but it's conceivable that the rewriting could be complex enough to warrant creating a further mapping table..

heddn’s picture

To summarize a discussion in IRC between chx, phenaproxima, neclimdul and myself:

  • Setter isn't a preferred approach because we should just solve the problem.
  • Hash the keys to the mapping table. This would fix the problem for non-ASCII keys
  • Add the unhashed values to a non-indexed (PK) column(s). This would make the DX of hashing less bad since the original values would still be available, just not stored as a PK (with all its limitations)
heddn’s picture

Issue summary: View changes
Priority: Normal » Major
Related issues: +#2543282: Migrate source CSV dies on long text

Another related issue: #2543282: Migrate source CSV dies on long text. Also, updated the IS given the current direction to solve this. Since this is the 4th issue in the last month, I'm going to bump this to a major, since it seems to be a very common request.

heddn’s picture

Title: Add ability to overwrite source keys » Use hash for Migration source keys, rather than verbatim values
edysmp’s picture

Assigned: Unassigned » edysmp

Working on the new direction.

edysmp’s picture

Assigned: edysmp » Unassigned
Status: Needs work » Needs review
FileSize
26.06 KB

Initial work in progress, still needs work on tests.

Status: Needs review » Needs work

The last submitted patch, 13: use_hash_for_migration-2613878-13.patch, failed testing.

chx’s picture

Thanks for the great work! Overall looks really good.

Please do not use a separate setSourceIdsHash method -- especially not a public one (if needed for testing , use a protected method). Currently it's not used anywhere but the constructor and that's the right thing to do ; as before with source ids , same with source hash: it should never change.

Lord_of_Codes’s picture

Changed the function signatures of setSourceIdsHash . Turned it into a protected method.

jian he’s picture

Status: Needs work » Needs review

Status: Needs review » Needs work

The last submitted patch, 16: 2613878-16.patch, failed testing.

jian he’s picture

Status: Needs work » Needs review
FileSize
25.83 KB

re-roll

Status: Needs review » Needs work

The last submitted patch, 19: 2613878-hash_source_keys-19.patch, failed testing.

heddn’s picture

  1. next()
    • setSourceIdHash
  2. processRow
    • transform (process plugins)
  3. import
    • getDestination(getSourceIdHash())

The problem is that the source id that is set in next uses the value before it is transformed. Then it freezes the row. Transform changes the values. #15 above (rightly) points out the setter for the hash should be protected, but then we cannot update when we transform the value. And getDestination uses the hashed value with the old value PRIOR to the transform.

heddn’s picture

+++ b/core/modules/migrate/src/Event/MigrateIdMapMessageEvent.php
@@ -73,13 +73,13 @@ public function getMigration() {
   }

+++ b/core/modules/migrate/src/MigrateExecutable.php
@@ -216,7 +209,6 @@ public function import() {
@@ -235,7 +227,7 @@ public function import() {

@@ -235,7 +227,7 @@ public function import() {
       if ($save) {
         try {
           $this->getEventDispatcher()->dispatch(MigrateEvents::PRE_ROW_SAVE, new MigratePreRowSaveEvent($this->migration, $row));
-          $destination_id_values = $destination->import($row, $id_map->lookupDestinationId($this->sourceIdValues));
+          $destination_id_values = $destination->import($row, $id_map->lookupDestinationId($row->getSourceIDsHash()));
           $this->getEventDispatcher()->dispatch(MigrateEvents::POST_ROW_SAVE, new MigratePostRowSaveEvent($this->migration, $row, $destination_id_values));
           if ($destination_id_values) {
             // We do not save an idMap entry for config.
@@ -424,7 +416,7 @@ protected function currentSourceIds() {

This is where I think we need to solve the problem. Instead of passing in the original source values (which can change in tranform) or passing in a hash (which is based on the original value), we need to pass in the tranformed values, then hash them. Otherwise we will result in duplicate created entries.

Test case:

  1. Spanish (static_map => Español)
  2. Español

Spanish "hashed" = abc
Español "hashed" = def

  1. next() creates the row, sets the hash then freezes the row
  2. processRow() transforms the value
  3. lookupDestinationId() needs to use the transformed values, instead of the hash.

Why?
1st record (Spanish) (which was transformed to Español) doesn't exist mapping. If we pass in abc or Spanish, it doesn't find it, so it creates a record for Español
2nd record (Español) also doesn't exist in the mapping. Passing def or Español, it doesn't find. So it creates a 2nd entry for Español.

We need to pass in the values after the transformation to lookupDestinationId(). But now we run into another problem. There isn't a way to extract the transformed values of the sourceIds from Row. getSourceIdValues() returns the pre-transform values. Transform changes the destination values in row, not the source values. And the keys for destination are not the same as source.

Possible solution:
Update getSourceIdValues() to merge in any transformed values. And rename the function to getIdValues, because it isn't the source values any more. It is just the values of the ids. But I don't see a way to make this happen. There isn't any mapping inside of Row. That's outside of it.

chx’s picture

I am confused as to what's happening here. I thought this change will be internal to the sql idmap.

chx’s picture

As for #22 which is a very different issue, without hashing, I think extending the example with source IDs and destination IDs will make it easier to understand.

We have a static map with bypass TRUE mapping Spanish to Español.

Row1 , sourceid1 is Spanish, destinationid1 becomes Español.

Row2, sourceid1 is Español, destinationid1 becomes Español.

These are two different rows as identified by the source id. If you want to avoid this then add a process plugin which skips the row if the destination already exists. See DedupeEntity::exists for the very code you need.

edysmp’s picture

I think this is enough, I made changes only in sql idmap.

Please review.

edysmp’s picture

Status: Needs work » Needs review

Status: Needs review » Needs work

The last submitted patch, 25: 2613878-hash_source_keys-25.patch, failed testing.

chx’s picture

Thanks for the patch! A few observations:

  1. I didn't even know $this:: is valid syntax. Can we change to static::?
  2. As msgid is a serial, it's a primary key. You can't have two primary keys.
Ada Hernandez’s picture

Status: Needs work » Needs review
FileSize
5.43 KB
10.89 KB

I made the changes referenced in comment #28

Status: Needs review » Needs work

The last submitted patch, 29: 2613878-hash_source_keys-29.patch, failed testing.

edysmp’s picture

Status: Needs work » Needs review
Issue tags: -Needs tests
FileSize
23.47 KB
16.35 KB

Worked in the tests and refined sql idmap.

The last submitted patch, 31: 2613878-hash_source_keys-31.patch, failed testing.

edysmp’s picture

The last submitted patch, 33: 2613878-hash_source_keys-33.patch, failed testing.

edysmp’s picture

updated getSourceIDsHash function.

chx’s picture

This looks incredibly promising thanks for the persistent hard work!

The last submitted patch, 35: 2613878-hash_source_keys-35.patch, failed testing.

Ada Hernandez’s picture

I updated the testMessageSave() for UnitTest.

Status: Needs review » Needs work

The last submitted patch, 38: 2613878-hash_source_keys-38.patch, failed testing.

edysmp’s picture

Final test.. for now.

edysmp’s picture

Status: Needs work » Needs review
chx’s picture

Status: Needs review » Needs work

Awesome!

+    $have_keys = !isset($source_id_values[0]);
+    foreach ($this->sourceIdFields() as $field_name => $source_id) {
+      $id_values_validated += [$field_name => $have_keys ? $source_id_values[$field_name] : array_shift($source_id_values)];
+    }

Huh. What is this. Why is source_id_values sometimes a list sometimes an associated array? Is this a testing artifact? What's going on here?

edysmp’s picture

Status: Needs work » Needs review
FileSize
25.43 KB
10.29 KB
chx’s picture

Status: Needs review » Needs work

This is really close!

+ // source key and value, e.g. ['nid' => 41]. In this case, $source_id_values need to be ordered the same needs to be 80 columns max

+ * It is public only for testing purposes. needs a line break before

But the most serious problems are in saveIdMapping : previously if there are no sourceIdFields, we didn't save anything. Whether that makes any sense is not for this patch to decide so this behavior needs to be changed. Before + $fields += array( add a if (!$fields) { return; } to keep the previous behavior. Also $this->eventDispatcher->dispatch(MigrateEvents::MAP_SAVE, new MigrateMapSaveEvent($this, $keys + $fields)); let's remove $keys from here, it's just the hashed source id values which should never be leaked to the outside world.

edysmp’s picture

chx’s picture

Category: Feature request » Task
Status: Needs work » Reviewed & tested by the community

Looks great! thanks!

The last submitted patch, 45: 2613878-hash_source_keys-45.patch, failed testing.

alexpott’s picture

Status: Reviewed & tested by the community » Needs work

This is looking like a good solution to a hard problem

  1. +++ b/core/modules/migrate/src/Plugin/migrate/id_map/Sql.php
    @@ -172,6 +177,30 @@ public static function create(ContainerInterface $container, array $configuratio
    +    // source key and value, e.g. ['nid' => 41]. In this case, $source_id_values
    +    // need to be ordered the same order as $this->sourceIdFields().
    +    // However, the Migration process plugin doesn't currently have a way to get
    +    // the source key so we presume the values have been passed through in the
    +    // correct order.
    +    if (!isset($source_id_values[0])) {
    +      $source_id_values = array_intersect_key($source_id_values, $this->sourceIdFields());
    +    }
    

    This does not order the keys in the same order as $this->sourceIdFields(). See https://3v4l.org/BiD8Y. Also the fact that the tests are green imply that we're missing test coverage of this.

  2. +++ b/core/modules/user/src/Tests/Migrate/d6/MigrateUserPictureFileTest.php
    @@ -44,7 +44,7 @@ public function testUserPictures() {
    -    $file = array_shift($files);
    +    $file = array_pop($files);
    

    This is a bit concerning - why has the order that files are migrated changed due to this?

chx’s picture

Regarding order: defeat snatched from the jaws of victory. Check https://3v4l.org/ag2u7 it does the ordering according to the first array.

chx’s picture

> This is a bit concerning - why has the order that files are migrated changed due to this?

Because previously we ordered on the serial source and now we order on the serial id hash. The order is still deterministic just different.

benjy’s picture

I'm not sure if we added explicitly coverage for the key ordering but we had implicit coverage from the pg driver, just added a test run for that.

Original issue: #2571499: idMap source and destination id filtering requires keys

heddn’s picture

Status: Needs work » Reviewed & tested by the community

PG tests passed. I think that means this is RTBC again?

alexpott’s picture

Status: Reviewed & tested by the community » Needs work

@heddn nope the sorting issue needs to fixed... as explained in #48.1

benjy’s picture

@heddn, the PG tests failed? https://www.drupal.org/pift-ci-job/155478

chx’s picture

I discussed this with edysmp yesterday at length and we came to an agreement on how to fix the sorting; not hard; use a loop.I expect they will post a patch soon, test pending.

edysmp’s picture

Status: Needs review » Needs work

The last submitted patch, 56: 2613878-hash_source_keys-56.patch, failed testing.

edysmp’s picture

Status: Needs review » Needs work

The last submitted patch, 58: 2613878-hash_source_keys-58.patch, failed testing.

edysmp’s picture

heddn’s picture

Issue tags: +Migrate critical

Given the blocking nature this causes for contrib migrate, I'm marking as migrate critical. This really needs to get before 8.1 or there will be a lot of BC issues.

chx’s picture

This is quite close to ready. My only concern here is // Postgress sorts results by order inserted, MySQL sorts by hash.

There are two problems a) PostgreSQL b) while the comment states two facts it is not clear at all why these facts need to be stated here. So something like "As PostgreSQL sorts results by order inserted and MySQL sorts by hash, create a consistent order for easier testing".

alexpott’s picture

Status: Needs review » Needs work
+++ b/core/modules/user/src/Tests/Migrate/d6/MigrateUserPictureFileTest.php
@@ -42,6 +42,8 @@ public function testUserPictures() {
     foreach ($this->migration->getIdMap() as $destination_ids) {
       $file_ids[] = reset($destination_ids);
     }
+    // Postgress sorts results by order inserted, MySQL sorts by hash.
+    sort($file_ids);

Doesn't this sort of imply that the fix should be in getIdMap() so that that returns a consistent order?

+++ a/core/modules/field/src/Tests/Migrate/d6/MigrateFieldFormatterSettingsTest.php
@@ -181,7 +181,7 @@
+    $this->assertIdentical(array('node', 'story', 'teaser', 'field_test'), Migration::load('d6_field_formatter_settings')->getIdMap()->lookupDestinationID(array('story', 'teaser', 'node', 'field_test')));
-    $this->assertIdentical(array('node', 'story', 'teaser', 'field_test'), Migration::load('d6_field_formatter_settings')->getIdMap()->lookupDestinationID(array('node', 'teaser', 'story', 'field_test')));

I thought the whole point was that the order of the arguments does not matter?

heddn’s picture

Status: Needs work » Needs review
FileSize
25.24 KB
770 bytes

re #63
The sort order is always consistent/deterministic. Typically, the order usually doesn't matter, except when running tests. No need to add extra overhead to getIdMap, just put in a deterministic order into the test and it is fine.

The order being passed to lookupDestinationID() is important. See:

  public function getSourceIDsHash(array $source_id_values) {
    // When looking up the destination ID we require an array with both the
    // source key and value, e.g. ['nid' => 41]. In this case, $source_id_values
    // need to be ordered the same order as $this->sourceIdFields().
    // However, the Migration process plugin doesn't currently have a way to get
    // the source key so we presume the values have been passed through in the
    // correct order.
chx’s picture

Status: Needs review » Reviewed & tested by the community

> The sort order is always consistent/deterministic.

This is not true. http://www.postgresql.org/docs/9.1/static/queries-order.html says "The actual order in that case will depend on the scan and join plan types and the order on disk, but it must not be relied on" but we do not care:

> Typically, the order usually doesn't matter, except when running tests.

Correct. The actual import is stateless between rows so whatever order we get the rows, that's all fine. The test uses an order for simpler code but the actual import does not.

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 64: 2613878-hash_source_keys-64.patch, failed testing.

Status: Needs review » Needs work

The last submitted patch, 67: 2613878-hash_source_keys-66.patch, failed testing.

edysmp’s picture

chx’s picture

Status: Needs review » Reviewed & tested by the community

Thanks.

alexpott’s picture

Status: Reviewed & tested by the community » Fixed

Committed da55f60 and pushed to 8.0.x and 8.1.x. Thanks! I committed this to both branches as this is bug and although there is API change here - it is small and migrate is experimental.

  • alexpott committed bd71f24 on 8.1.x
    Issue #2613878 by edysmp, heddn, Adita, Lord_of_Codes, jian he, chx,...

  • alexpott committed da55f60 on 8.0.x
    Issue #2613878 by edysmp, heddn, Adita, Lord_of_Codes, jian he, chx,...
davidwbarratt’s picture

This issue introduced a critical issue. :(
#2679797: Migration migrate_update_8009 for source hash

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.