Problem/Motivation

When migrated translated entities, we will need multiple rows for a single entity, one per translation. Those rows need to be able to find each other, so that they can become parts of the same entity.

Eg: When migrating a translated term, we might have rows with source IDs:

sourceid1 (tid) sourceid2 (language)
2 en
2 fr
2 de

Somehow when the second row (2, fr) is imported, it has to find the entity created by the second row, so that it can add a translation instead of creating a brand new entity.

This will be necessary for i18n node migrations, see https://www.drupal.org/node/2225775#comment-11187483

Proposed resolution

Currently, one can query the ID map for the destination IDs, using all the source IDs:

$id_map->lookupDestinationId(['tid' => 2, 'langcode' => 'en']);

We can allow the destination to also query the id-map by a subset of source IDs, eg:

$id_map->lookupDestinationIdsSubset(['tid' => 2]);

This would return the set of destination ID values for the rows with the same source tid. Potentially multiple sets of destination ID values would be returned.

Alternatives

We could also add more migrations to solve this problem. We would have a d6_taxonomy_term migration that just migrated the default translation, and then a d6_taxonomy_term_translations migration that migrates the (multiple) translations. It could use a 'migration' process to lookup the entity that was created by the d6_taxonomy_term migration, so they would all be able to share the same ID.

This would have the downside of multiplying the migrations that we create. We'd need up to four migrations for a single content type: node, node_revision, node_translations, and node_revision_translations.

Remaining tasks

Contributor tasks needed
Task Novice task? Contributor instructions Complete?
Discuss what we want to do!
Create a patch Instructions
Add automated tests Instructions

User interface changes

None.

API changes

Add a lookupDestinationIdsSubset() method to MigrateIdMapInterface.

Data model changes

Probably none. But if we want to be able to query with strange source IDs, we may have to add more source-ID-hashes to the migrate map table, see https://www.drupal.org/node/2613878

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

vasi created an issue. See original summary.

vasi’s picture

Issue summary: View changes
mikeryan’s picture

To expand on the hashing issue a bit... Right now, the hash stored in the map is a hash of all source keys, and lookupDestinationId() hashes its argument and queries against the table's hash value. There are a couple of ways the new partial-lookup API could work:

  1. Ignore the hash and build a query with conditions on only the specific source keys provided. The disadvantage is that this API would not work for the use cases that required the hashing in the first place (such as odd chars in a key value)
  2. Save multiple hashes in the map table - the current full-key hash, a hash of just the first key column, a hash of the first two (if there were three), ... Disadvantages would be the added overhead to the map table (for multi-column source keys), and your partial lookup could only be on leading columns (e.g., you wouldn't be able to query on the third source ID column alone) unless we generated hashes for all combinations (not very scalable).

I'm not entirely satisfied with either case - any other ideas?

mikeryan’s picture

I do have to say, between those two implementation approaches, I favor #1 - it's simpler, has less overhead, and the downside scenario is a combination of multiple edge cases - it would only affect those who have a source key column which is problematic, which is one of multiple columns, and they need to do a partial query that includes the problematic column.

pbuyle’s picture

If hashing is required for some use cases, option 2 seems to be the way. But I would store the hash of each key columns. There is no need to store every possible combos (that would grow very fast) So for instance, for an entry whose IDs is ['tid' => 2, 'language' => 'en'] I would store three hashes (['tid' => 2, 'language' => 'en'], 2 and english). To avoid cluttering the map table, a secondary table could be used for the additional hashes. The main map table would stay the same, and the hashes table would have a column for each key, plus a column for the full keys hash (used a foreign key to the main map table). That table would only be used when required for a lookupDestinationIdsSubset() query.

That saids, I'm curious of the use cases for the hashes, dealing with data and querying them is the job of the DB, having a complex hashes system on top of the DB looks like a bad pattern to me.

mikeryan’s picture

pbuyle’s picture

Ok, so it looks like the hash thing is primary because the keys were used a PK on the table. So it is probably safe to use the non-hashed keys in condition. And there is no need to hash each key individually.

mikeryan’s picture

Yay option #1! Thanks for pointing that out @pbuyle...

mikeryan’s picture

A note - at first glance, it would seem that we could just make lookupDestinationId() more flexible, to recognize when the source IDs passed in are incomplete and do the alternate on-these-values-only query. The problem there is that lookupDestinationId() currently returns one destination ID (an array containing the ID values for one destination object), while a partial lookup could potentially return multiple destination IDs. In the use case that's driving this, we just need one, but logically the return should be an array of destination ID arrays. So, we proposed a new API for the sake of BC. Of course, we are still experimental, so we can break BC - we don't want to make gratuitous API breaks, but having a special lookupDestinationIdsSubset() doing almost exactly the same thing as lookupDestinationId() would be enough of an API WTF that I'm thinking now we should go ahead and make the change...

And saying that, actually we should add a lookupDestinationIds() (note the plural) which returns an array of destination ID arrays, change lookupDestinationId() to call lookupDestinationIds() and return the first destination ID, and deprecate lookupDestinationId().

vasi’s picture

Here's an implementation, with pretty complete tests. It changes lookupDestinationId() to be implemented in terms of lookupDestinationIds(). It also changes the PHPDoc for lookupDestinationId() to specify that it returns an empty array on failure, not null—both all users of it and the unit tests were assuming an empty array.

It doesn't include the obvious optimization that when all fields are specified, we could use the primary key (source_ids_hash) for lookup.

At sprints, we had the question of whether we want to index the individual sourceidN fields, now that we are doing lookups with them. @heddn pointed out that the data he was working with is not always indexable by DBs. But he suggested that if we do decide to optimize, we could allow source plugins to add an 'indexable' property to their fields as returned by getIds(), which the ID map could then respect.

mikeryan’s picture

Status: Active » Needs work

It doesn't include the obvious optimization that when all fields are specified, we could use the primary key (source_ids_hash) for lookup.

Let's do that.

But he suggested that if we do decide to optimize, we could allow source plugins to add an 'indexable' property to their fields as returned by getIds(), which the ID map could then respect.

Let's do that (in a follow-up issue).

  1. +++ b/core/modules/migrate/src/Plugin/MigrateIdMapInterface.php
    @@ -209,11 +209,25 @@ public function lookupSourceID(array $destination_id_values);
    +   *   The destination identifier values of the record, or empty on failure.
        */
    

    Let's add an @deprecated here.

  2. +++ b/core/modules/migrate/src/Plugin/MigrateIdMapInterface.php
    @@ -209,11 +209,25 @@ public function lookupSourceID(array $destination_id_values);
    +   * Looks up the destination identifies corresponding to a source key.
    

    identifiers.

  3. +++ b/core/modules/migrate/src/Plugin/migrate/id_map/Sql.php
    @@ -517,16 +517,46 @@ public function lookupSourceID(array $destination_id_values) {
    +    $is_associative = !isset($source_id_values[0]);
    

    Accepting unkeyed values is an (already-existing) undocumented feature - shall we document it in the interface?

  4. +++ b/core/modules/migrate/src/Plugin/migrate/id_map/Sql.php
    @@ -517,16 +517,46 @@ public function lookupSourceID(array $destination_id_values) {
    +    if (!empty($source_id_values)) {
    +      throw new MigrateException("Extra unknown items in source IDs " . print_r($orig_source_ids, TRUE));
    +    }
    

    Let's add an @throws for this.

  5. +++ b/core/modules/migrate/tests/src/Unit/MigrateSqlIdMapTest.php
    @@ -431,6 +431,111 @@ public function testLookupDestinationIdMapping($num_source_fields, $num_destinat
    +    }
    

    Should assert the exception message.

  6. +++ b/core/modules/migrate/tests/src/Unit/MigrateSqlIdMapTest.php
    @@ -431,6 +431,111 @@ public function testLookupDestinationIdMapping($num_source_fields, $num_destinat
    +    }
    

    Should assert the exception message.

Thanks vasi!

vasi’s picture

Status: Needs work » Needs review
FileSize
8.42 KB
1.61 KB

Thanks for the review, trying again.

vasi’s picture

FileSize
9.76 KB
5.21 KB

Oops, wrong version of the patch!

mikeryan’s picture

Issue tags: +Needs change record

Looks good - I think we should have a change record indicated the deprecation and showing how to substitute the new API, with that it's RTBC from my standpoint.

Thanks @vasi!

mikeryan’s picture

Status: Needs review » Reviewed & tested by the community
Issue tags: -Needs change record

Draft change record added.

vasi’s picture

We should also have an issue to remove the deprecated uses in migrate itself, at least before it becomes stable.

alexpott’s picture

Status: Reviewed & tested by the community » Fixed

Nice test and nice solution. Committed b073cee and pushed to 8.1.x and 8.2.x. Thanks!

I agree that the question about indexes can be handled in a followup.

  • alexpott committed 20642d5 on 8.2.x
    Issue #2724941 by vasi: Need to be able to lookup destination IDs by...

  • alexpott committed b073cee on 8.1.x
    Issue #2724941 by vasi: Need to be able to lookup destination IDs by...
heddn’s picture

Commenting late, but per our conversation in IRC from last week, this should be fine. The only reason for storing the hash instead of the actual values is because of Indexing. Indexes essentially don't work when non standard language sets. Migrations are slow already, so making it slower isn't all that terrible. Someone can always micro-tune the DB if they need to.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

mikeryan’s picture

I've now published the change record.