See #2748609: [meta] Preserving auto-increment IDs on migration is fragile.

Problem/Motivation

Preserving serial IDs in the upgrade process (e.g., source site nid 53 => destination site nid 53) means that if any content with the same ID is manually created on the destination side before upgrade, it will be overwritten with data from the source. Since the core upgrade process does not provide an option to not preserve IDs, we should warn the user before executing the upgrade process of any potential conflicts.

Proposed resolution

Run a process before the actual import which identifies conflict risks - cases where an ID in the content to be imported matches an ID that already exists in the destination.

  1. Maintain data integrity
    If this is a required step, the user can choose whether to proceed and overwrite any conflicts, or take an alternative approach (per the docs).
  2. No surprises
    They are told exactly what content could be affected by conflicts.
  3. Provide a path forward
    Yes.
  4. Preserve URLs (e.g., node/17).
    Yes.
  5. Minimize technical debt.
    Audit code should be reasonably isolated and maintainable.
  6. Minimize effort to implement.
    We’d need to first do what the import process normally does, identify and instantiate the necessary migrations. For each migration, we need to identify the destination ID field, see if that field is mapped in the process section, look for any of destination ID values which are not in the migration’s map table, and see if any of *those* IDs are in the source. Oh, but we need to consolidate this across all migrations to the same entity type… Ugh.

Remaining tasks

Implement all the things.

User interface changes

The upgrade summary page (after entering db credentials, before executing migrations) should run the audit and display any conflicts.

API changes

Likely addition of an "audit()" method or somesuch to the Migration plugin.

Data model changes

None.

Files: 
CommentFileSizeAuthor
#7 2876085-6.patch15.55 KBvasi

Comments

mikeryan created an issue. See original summary.

vasi’s picture

At the triage, I discussed the possibility of making this much simpler:

The current approach requires looking at all of the source, process and destination in complex ways, to identify exactly which new IDs will be migrated.

Instead, we could simply check "are there any new destination entities of certain types since the last migration (if any)?" If so, warn the user that these new entities might be overwritten.

This risks false positives: It could be that you have created nid 5, and are about to migrate nids 3, 4 and 6. Even though the migration would work, the simple check will warn you. But this case sounds quite rare, and careful users who know they fall in this case can click through the warning.

cilefen’s picture

Priority: Critical » Major
Issue tags: +Triaged D8 major

@xjm, @alexpott, @effulgentsia, @lauriii, @catch and I discussed this issue at a recent meeting and agreed that since the parent is critical, we would downgrade this to major but also make the other related children major.

heddn’s picture

We discussed this in the weekly migrate meeting and mikeryan, quiteone and heddn all agreed with vasi's approach as outlined in #2.

vasi’s picture

Ok, so here's the proposed implementation plan:

  • Create an interface with the getHighestId() method
  • Implement this on id_map, which will yield the highest ID that has been migrated.
  • Implement this on content destinations, which will yield the highest ID that exists of the destination type. Eg: highest node ID on the site.
  • Add a new field 'stable_ids' field to content destination configuration. If this is set and true, on the first import they will check that </code>$id_map->getHighestId() == $destination->getHighestId(). If not, throw an exception. (Maybe we want to do this as a global pre-flight check instead, so as to not waste user time?)
  • Add the stable_ids to all the relevant migration definitions. Eg: Add to d6_node, but not d6_node_translation, since it doesn't use an auto-increment ID. It's too annoying to automatically determine whether a migration is intended to have stable IDs, so I think this is the best option.
mikeryan’s picture

  • Create an interface with the getHighestId() method
  • Implement this on id_map, which will yield the highest ID that has been migrated.
  • Implement this on content destinations, which will yield the highest ID that exists of the destination type. Eg: highest node ID on the site.

Good thought - and if/when we do #2876086: Manipulate auto-increment values to avoid conflicts, we'd implement this on sources as well, so we'd know how high to set the auto-increment value.

(Maybe we want to do this as a global pre-flight check instead, so as to not waste user time?)

Yes, that's the plan as specified in the issue summary here - this is a pre-flight check, runtime checks would be done in #2876090: Warn if migrated content will overwrite manually-created content.

vasi’s picture

Status: Active » Needs review
Issue tags: +Needs tests
FileSize
15.55 KB

Here's a teensy implementation. Still needs tests, and vetting.

vasi’s picture

Status: Needs review » Needs work

The last submitted patch, 7: 2876085-6.patch, failed testing. View results
- codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

vasi’s picture

Note also that the patch doesn't really do a good job of deciding when to audit IDs. We'll probably want to add 'preserve_ids: true' to all the relevant YAML files? Or we can try to use heuristics to do the equivalent automatically.