Problem/Motivation

If a source ID from a string field is something that evaluates to an integer like '2' or 2, the source ID is set as an integer. This causes the check for matching on existing content to fail and the source row is considered new.

This means that with --sync, items can be rolled back and recreated which is both a performance issue + can cause issues of the migrated content is for example referenced by other content.

Steps to reproduce

  1. Set a source with an ID of type string
  2. Set source rows with a mix of IDs like 1, '2', 'Three'.
  3. Run a migrate import with sync
  4. 1 and '2' get rolled back and recreated, and only 'Three' remains

See the automated test coverage added as an example

Proposed resolution

Cast the source IDs as the type specified for the ID

Remaining tasks

None

User interface changes

None

API changes

None

Data model changes

None

Original issue summary

I was noticing with a migrate_plus Url migration that the sync option was rolling back everything each import when the source had not changed. It's because the in_array id check (https://git.drupalcode.org/project/migrate_tools/blob/8.x-4.x/src/EventS...) is pretty fragile. In my case, $row->getSourceIdValues() was returning:

[ 'item_id' => INT ]

while $id_map->currentSource() was returning

[ 'item_id' => STRING ]

which causes a false negative for the strict in_array check.

I don't think this is the fault of migrate_tools, but I think it can be prevented. Incoming patch hashes IDs to JSON before comparison, which fixes my use case but might be inefficient.

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Comments

ahebrank created an issue. See original summary.

ahebrank’s picture

StatusFileSize
new1.41 KB
heddn’s picture

Status: Active » Postponed (maintainer needs more info)

Seems like the IDs schema should be set to string then, no?

ahebrank’s picture

StatusFileSize
new1.53 KB

How do you mean? In the migration configuration?

Updating to try to fix the deletion error.

heddn’s picture

Yes, when defining the source you get to specify the ids and their schema.

ahebrank’s picture

The different schemes in the source config didn't seem to matter -- I still got ID mismatches when changing from

ids:
    item_id:
      type: integer

to

ids:
    item_id:
      type: string

I will see if I can replicate that with a test.

coufu’s picture

Patch #4 fixes rolling back everything, but it also makes everything update. Definitely preferable to rolling back everything every time, but I would expect the --sync flag to just delete missing items if I had just run the migration without any flags seconds ago.

kirkkala’s picture

Status: Postponed (maintainer needs more info) » Reviewed & tested by the community

Patch #4 fixes for us also the issue of everything got rolled back. Quite important to maintain the entity id's since we are making references to the migrated nodes.

And I don't see any issue with updating existing items when doing an operation called "sync".

roberttstephens’s picture

For anyone digging into this, this seems to be coming from the next() method of Drupal\migrate\Plugin\migrate\id_map\Sql . The query is SELECT map.sourceid1 AS sourceid1, map.destid1 AS destid1 where both columns returned are strings.

heddn’s picture

Status: Reviewed & tested by the community » Closed (duplicate)

The problem seems to make sense. Let's see if we can fix that over in https://gitlab.com/drupalspoons/migrate_tools/-/issues/101. I'm not sure the approach we are using is the best of solutions. While most php installs have ext-json compiled, not all do.

ahebrank’s picture

It should be OK to use ext-json. It's been required by core for a while: https://git.drupalcode.org/project/drupal/-/blob/8.8.x/core/composer.jso...

bhanu951’s picture

Status: Closed (duplicate) » Active

Reopening this as we are back to Drupal.org from DrupalSpoons.

byrond’s picture

I agree with @Coufu in #7 regarding the expectations of the --sync option. The behavior I expect is to rollback items missing from the feed, as well as update only items that have changed if using track_changes. Both the "rollback everything" and "update everything" issues seem to be resolved with #3104105-12: sync option doesnt work with track_changes. This issue can probably be closed in favor of #3104105: sync option doesnt work with track_changes.

egruel’s picture

I have the same issue on 6.0.x so i update the patch for this version.

berliner’s picture

Status: Active » Needs review
StatusFileSize
new1.82 KB

Updated the patch in #4 for 5.2.0 and fixed an error when calling $id_map->delete.

berliner’s picture

As mentioned by @byrond, this seems to have been fixed in 6.0.x by this commit in #3104105: sync option doesnt work with track_changes.

pacproduct’s picture

Version 6.0.1 does seem to fix the issue as long as the ids' "type" property is set correctly.

In my case migrate_tools was rolling back all my items because the source ID was an integer, and I had defined:

  ids:
    source_id:
      type: string
      max_length: 100

Fixing my configuration to type "integer" fixed everything:

  ids:
    source_id:
      type: integer

I overlooked it in the first place, so I'm wondering if this subtlety is documented somewhere?

robertom’s picture

Priority: Minor » Normal
StatusFileSize
new926 bytes

Hi, sorry for my bad english.

This issue is still valid.

I have many d7 to d9 migrations suffering from this because, for example, d7_taxonomy_term_entity_translation states that entity_id will be integer, but the data coming from the db returns it as a string.

The approach used in patch #4/#17 might be incorrect because, if the id is an md5, the conversion might assume it's a numeric value when it doesn't contain letters and the strict in_array check will fail.

Attached a "workaround" that enforce the type of $source_id_values as stated by $source->getIds()

robertom’s picture

StatusFileSize
new1003 bytes

sorry, there is no point in checking repeatedly...

new patch attached

robertom’s picture

Version: 8.x-5.x-dev » 6.0.x-dev
StatusFileSize
new1.58 KB
new1014 bytes

Even with patch #21 I get some unwanted rollbacks when i run a migration group with the --group=... option.

The problem is due to the way the data is stored in the state variable. We need to add the migration id as the key of the array.

Attached a proposed patch and interdiff with #21

Status: Needs review » Needs work

The last submitted patch, 22: sync-id-hash-3104268-22.patch, failed testing. View results
- codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

jamsilver made their first commit to this issue’s fork.

jamsilver’s picture

Status: Needs work » Needs review
StatusFileSize
new3.32 KB

I had a situation with unit tests that failed because it executed the same migration twice in the same page request. It wasn't enough for migrate_tools_sync state to be reset in the constructor.

I've created a Merge Request that adds the following changes to the patch in 22 (interdiff attached):

  1. Reset migrate_tools_sync at the point of scanning through source rows to collect the IDs. This fixes my unit test
  2. Factor that bit of code to a new method for readability.
  3. Rewrote to use settype() as that seemed more elegant and readable

I've also raised this as a Merge Request, rather than as a patch, since that is the workflow in place on this ticket originally. I raised a new one because things seem to have moved along quite a lot since the original was raised.

Tests fail, but the first error I'm seeing looks unrelated to my change. Is there a problem with the tests at the moment?

bbu23’s picture

Unfortunately the MR from #26 cannot be applied to the latest version of the module.

jigarius’s picture

While testing this behavior, I realized that it makes many, many write queries on the key_value table. This in turn results in huge MySQL bin logs which makes the server run out of memory at some point. Is anyone else facing this issue? Is it safe to say that there is an efficiency/memory optimization issue with the --sync option?

Additionally, I was surprised to see that when I run all the migrations one by one with --sync, they run faster than if I run them all with --all --sync.

drush mim {migration.id} --sync: 3m8.960s
drush --all --sync: 8m31.586s

Update: I just realized that my comment seems off-topic. I think I'll create a separate ticket for these problems.

heddn’s picture

Status: Needs review » Needs work

I'm afraid that the change in #26 to move the set into the point of scanning might have lead to #28. Can we revert that change please?

scott_euser made their first commit to this issue’s fork.

scott_euser’s picture

Status: Needs work » Needs review

I updated the MR with a merge from 6.0.x and created a separate issue https://www.drupal.org/project/migrate_tools/issues/3396130 for the pipeline failure

bbu23’s picture

Thank you, tested the patch from MR !40. It seems to have fixed the rolling back everything situation for a migration that uses track_changes. Though the migration is much slower now, quite a difference. Track changes works pretty fast when not used in combination with sync. I'm not sure if this can be improved or not.

pacproduct’s picture

@bbu23 Could you be facing the issue referenced in https://www.drupal.org/project/migrate_tools/issues/3378047 ?

More generally, I have the feeling that this thread should take into consideration the other thread I posted above, because the way migrate_tools_migrate_prepare_row currently works gets exponentially slower with time and should be fixed first, in my opinion.

bbu23’s picture

@pacproduct Yes, thank you, that could be it! One of my migrations took hours with sync option on.

So what is your suggestion? Should I try the patch from the referenced issue that has a dedicated new table to sync or do I need to combine both patches?
Also, is your comment #19 from current issue still valid? I feel that it's a bit strange to have to manually specify a type for each migration. Most of the migrations have the ID coming in as string, that means modifications everywhere. Thanks

pacproduct’s picture

@bbu23 I believe my comment in #19 is still valid, but the best is to try it to be sure.

I feel like the global tendency in programming is to favor strict typing more and more as it tends to reduce issues and bugs.
Thus personally, I'm kind of okay with the idea of configuring what is the type of data you're expecting from the source.
Although maybe `migrate_tools` could warn you when source IDs type do not match the configured type so you know something is dubious, instead of converting it silently which causes it to rollback everything later...

Everyone's situation isn't the same, and there might be cases where having a strict source ID typing wouldn't fit one's need. From the top of my mind, the only case that could be problematic that I can think of is if the source system were to return a mix of different types as the primary ID (a mix of strings and ints, for instance). In that particular case, having a "loose type" for source IDs could be useful.
Hard for me to say which approach is better...

-- --

With regards to your case: If setting properly the source ID types on your migration configurations solves your rollback problem without patching `migrate_tools`, it feels like a quick and healthy fix in my opinion. In which case, you may apply the patch from the referenced issue if it fixes your performance problem.
Otherwise, you may have to merge the 2 patches together. Not sure how tricky this is going to be. YMMV.

scott_euser changed the visibility of the branch 3104268-sync-might-be to hidden.

scott_euser’s picture

Status: Needs review » Needs work

Things have changed too much to safely just resolve conflicts. This will need a more in-depth review and re-application of the MR.

scott_euser changed the visibility of the branch 3104268-sync-might-be3 to hidden.

scott_euser changed the visibility of the branch 3104268-sync-id-too-strict to hidden.

scott_euser’s picture

scott_euser’s picture

Title: sync might be too strict during id comparison; rolls back everything » Sync is be too strict during id comparison and can roll back everything
Issue summary: View changes
Status: Needs work » Needs review
  1. Updated code to work with new code from #3378047: drush migration with '--sync' option has suboptimal performance
  2. Added an example source to the test coverage
  3. Added additional test coverage method to prove the issue and demonstrate the fix: test only fails.
  4. Updated issue summary to standard template with details and clear steps to reproduce
scott_euser’s picture

Title: Sync is be too strict during id comparison and can roll back everything » Sync is too strict during id comparison and can roll back everything
jonasanne’s picture

StatusFileSize
new6.6 KB

Added a diff from merge request for ease of use as patch.

Thankyou @scott_euser for updating this.

jonasanne’s picture

StatusFileSize
new6.61 KB

I made a small mistake with the prev patch.
Goodmorning :D

scott_euser’s picture

Hiding patch, you should download the patch locally as per https://www.drupal.org/docs/develop/git/using-gitlab-to-contribute-to-dr... This avoids confusion for maintainers & further development as to what to work on.

Thanks!

scott_euser’s picture

What would be useful though is RTBC'ing it if you have reviewed (+explaining what you did to test/review). Thanks!

nnevill’s picture

Status: Needs review » Reviewed & tested by the community

Looks good and works!
Thanks!

nicxvan’s picture

Status: Reviewed & tested by the community » Needs work

Needs a rebase. Please use the MR not a patch.

nicxvan changed the visibility of the branch 3104268-sync-id-too-strict2 to hidden.

nicxvan changed the visibility of the branch 6.0.x to hidden.

nicxvan’s picture

Status: Needs work » Needs review

Rebased on 6.0.x

scott_euser’s picture

Status: Needs review » Reviewed & tested by the community

Thanks for sorting out the conflict! Still working well! Phpcs errors are unrelated and due to changes in standards I believe, will need a separate issue to solve that, but not a blocker here.

scott_euser’s picture

scott_euser’s picture

sonfd’s picture

StatusFileSize
new2.91 KB

Here's a patch that applies against 6.0.5 for use with composer. It's the same as the MR, but does not include the tests.

mikelutz’s picture

Status: Reviewed & tested by the community » Needs work
scott_euser’s picture

scott_euser’s picture

Status: Needs work » Needs review

Okay tests for deprecation now passing

macsim’s picture

Status: Needs review » Needs work

Tested MR77 on 6.0.x-dev with a migration where I have bigint as source and string as destination since integer can't be the solution (SQLSTATE[22003]: Numeric value out of range: 1264 Out of range value for column 'sourceid1' at row 1)

Before patch:

vendor/bin/drush mim my_migration --sync
 [notice] Rolled back 74 items - done with 'my_migration'
 [notice] Processed 74 items (74 created, 0 updated, 0 failed, 0 ignored) - done with 'my_migration'

After patch:

vendor/bin/drush mim my_migration --sync
 [notice] No item has been rolled back - done with 'my_migration'
 [notice] Processed 1 item (0 created, 0 updated, 0 failed, 0 ignored) - done with 'my_migration'

Seems good to me

macsim’s picture

Status: Needs work » Needs review

woops sorry for the status change ; for me it's RTBC
I leave "Needs review" until few more people test it

jrockowitz’s picture

StatusFileSize
new7.45 KB
ramzypro’s picture

I'm successfully using the patch from #63 on a sync from an external MySQL data source with no issues.

scott_euser’s picture

Hiding patch to avoid confusion.

I can't change to RTBC since I've been doing the work so next step is someone here in the community confirming they've reviewed and tested to change the status so we can get this back infront of the maintainers now that feedback is resolved + test coverage is in place.

ramzypro’s picture

Status: Needs review » Reviewed & tested by the community

Apologies; I had to read up more on Drupal etiquette about changing status - I have used the patch in #63 on large datasets without issue, many of them mixing string values that start with "0" which are impossible to handle correctly without it.

scott_euser’s picture

Sorry wasn't aimed at you necessarily, just the notification reminded me of this issue and wanted to comment what is needed to move forward :) Thanks for confirming RTBC in any case!

  • heddn committed 1264f3e3 on 6.0.x authored by nicxvan
    Issue #3104268 by scott_euser, heddn, robertom, ahebrank, nicxvan,...
heddn’s picture

Status: Reviewed & tested by the community » Fixed

Hopefully I caught everyone who contributed to this issue. Thanks for all your work on it.

scott_euser’s picture

Thank you!

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.