Problem/Motivation
If a source ID from a string field is something that evaluates to an integer like '2' or 2, the source ID is set as an integer. This causes the check for matching on existing content to fail and the source row is considered new.
This means that with --sync, items can be rolled back and recreated which is both a performance issue + can cause issues of the migrated content is for example referenced by other content.
Steps to reproduce
- Set a source with an ID of type string
- Set source rows with a mix of IDs like 1, '2', 'Three'.
- Run a migrate import with sync
- 1 and '2' get rolled back and recreated, and only 'Three' remains
See the automated test coverage added as an example
Proposed resolution
Cast the source IDs as the type specified for the ID
Remaining tasks
None
User interface changes
None
API changes
None
Data model changes
None
Original issue summary
I was noticing with a migrate_plus Url migration that the sync option was rolling back everything each import when the source had not changed. It's because the in_array id check (https://git.drupalcode.org/project/migrate_tools/blob/8.x-4.x/src/EventS...) is pretty fragile. In my case, $row->getSourceIdValues() was returning:
[ 'item_id' => INT ]
while $id_map->currentSource() was returning
[ 'item_id' => STRING ]
which causes a false negative for the strict in_array check.
I don't think this is the fault of migrate_tools, but I think it can be prevented. Incoming patch hashes IDs to JSON before comparison, which fixes my use case but might be inefficient.
| Comment | File | Size | Author |
|---|
Issue fork migrate_tools-3104268
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #2
ahebrank commentedComment #3
heddnSeems like the IDs schema should be set to string then, no?
Comment #4
ahebrank commentedHow do you mean? In the migration configuration?
Updating to try to fix the deletion error.
Comment #5
heddnYes, when defining the source you get to specify the ids and their schema.
Comment #6
ahebrank commentedThe different schemes in the source config didn't seem to matter -- I still got ID mismatches when changing from
to
I will see if I can replicate that with a test.
Comment #7
coufu commentedPatch #4 fixes rolling back everything, but it also makes everything update. Definitely preferable to rolling back everything every time, but I would expect the --sync flag to just delete missing items if I had just run the migration without any flags seconds ago.
Comment #8
kirkkalaPatch #4 fixes for us also the issue of everything got rolled back. Quite important to maintain the entity id's since we are making references to the migrated nodes.
And I don't see any issue with updating existing items when doing an operation called "sync".
Comment #9
roberttstephens commentedFor anyone digging into this, this seems to be coming from the next() method of Drupal\migrate\Plugin\migrate\id_map\Sql . The query is
SELECT map.sourceid1 AS sourceid1, map.destid1 AS destid1where both columns returned are strings.Comment #10
heddnThe problem seems to make sense. Let's see if we can fix that over in https://gitlab.com/drupalspoons/migrate_tools/-/issues/101. I'm not sure the approach we are using is the best of solutions. While most php installs have ext-json compiled, not all do.
Comment #11
ahebrank commentedIt should be OK to use ext-json. It's been required by core for a while: https://git.drupalcode.org/project/drupal/-/blob/8.8.x/core/composer.jso...
Comment #12
bhanu951 commentedReopening this as we are back to Drupal.org from DrupalSpoons.
Comment #13
byrond commentedI agree with @Coufu in #7 regarding the expectations of the
--syncoption. The behavior I expect is to rollback items missing from the feed, as well as update only items that have changed if usingtrack_changes. Both the "rollback everything" and "update everything" issues seem to be resolved with #3104105-12: sync option doesnt work with track_changes. This issue can probably be closed in favor of #3104105: sync option doesnt work with track_changes.Comment #14
egruel commentedI have the same issue on 6.0.x so i update the patch for this version.
Comment #17
berliner commentedUpdated the patch in #4 for 5.2.0 and fixed an error when calling
$id_map->delete.Comment #18
berliner commentedAs mentioned by @byrond, this seems to have been fixed in 6.0.x by this commit in #3104105: sync option doesnt work with track_changes.
Comment #19
pacproduct commentedVersion 6.0.1 does seem to fix the issue as long as the ids' "type" property is set correctly.
In my case
migrate_toolswas rolling back all my items because the source ID was an integer, and I had defined:Fixing my configuration to type "integer" fixed everything:
I overlooked it in the first place, so I'm wondering if this subtlety is documented somewhere?
Comment #20
robertom commentedHi, sorry for my bad english.
This issue is still valid.
I have many d7 to d9 migrations suffering from this because, for example, d7_taxonomy_term_entity_translation states that entity_id will be integer, but the data coming from the db returns it as a string.
The approach used in patch #4/#17 might be incorrect because, if the id is an md5, the conversion might assume it's a numeric value when it doesn't contain letters and the strict in_array check will fail.
Attached a "workaround" that enforce the type of $source_id_values as stated by $source->getIds()
Comment #21
robertom commentedsorry, there is no point in checking repeatedly...
new patch attached
Comment #22
robertom commentedEven with patch #21 I get some unwanted rollbacks when i run a migration group with the --group=... option.
The problem is due to the way the data is stored in the state variable. We need to add the migration id as the key of the array.
Attached a proposed patch and interdiff with #21
Comment #26
jamsilver commentedI had a situation with unit tests that failed because it executed the same migration twice in the same page request. It wasn't enough for migrate_tools_sync state to be reset in the constructor.
I've created a Merge Request that adds the following changes to the patch in 22 (interdiff attached):
I've also raised this as a Merge Request, rather than as a patch, since that is the workflow in place on this ticket originally. I raised a new one because things seem to have moved along quite a lot since the original was raised.
Tests fail, but the first error I'm seeing looks unrelated to my change. Is there a problem with the tests at the moment?
Comment #27
bbu23Unfortunately the MR from #26 cannot be applied to the latest version of the module.
Comment #28
jigariusWhile testing this behavior, I realized that it makes many, many write queries on the key_value table. This in turn results in huge MySQL bin logs which makes the server run out of memory at some point. Is anyone else facing this issue? Is it safe to say that there is an efficiency/memory optimization issue with the --sync option?
Additionally, I was surprised to see that when I run all the migrations one by one with --sync, they run faster than if I run them all with --all --sync.
drush mim {migration.id} --sync: 3m8.960s
drush --all --sync: 8m31.586s
Update: I just realized that my comment seems off-topic. I think I'll create a separate ticket for these problems.
Comment #29
heddnI'm afraid that the change in #26 to move the set into the point of scanning might have lead to #28. Can we revert that change please?
Comment #31
scott_euser commentedI updated the MR with a merge from 6.0.x and created a separate issue https://www.drupal.org/project/migrate_tools/issues/3396130 for the pipeline failure
Comment #32
bbu23Thank you, tested the patch from MR !40. It seems to have fixed the rolling back everything situation for a migration that uses track_changes. Though the migration is much slower now, quite a difference. Track changes works pretty fast when not used in combination with sync. I'm not sure if this can be improved or not.
Comment #33
pacproduct commented@bbu23 Could you be facing the issue referenced in https://www.drupal.org/project/migrate_tools/issues/3378047 ?
More generally, I have the feeling that this thread should take into consideration the other thread I posted above, because the way
migrate_tools_migrate_prepare_rowcurrently works gets exponentially slower with time and should be fixed first, in my opinion.Comment #34
bbu23@pacproduct Yes, thank you, that could be it! One of my migrations took hours with sync option on.
So what is your suggestion? Should I try the patch from the referenced issue that has a dedicated new table to sync or do I need to combine both patches?
Also, is your comment #19 from current issue still valid? I feel that it's a bit strange to have to manually specify a type for each migration. Most of the migrations have the ID coming in as string, that means modifications everywhere. Thanks
Comment #35
pacproduct commented@bbu23 I believe my comment in #19 is still valid, but the best is to try it to be sure.
I feel like the global tendency in programming is to favor strict typing more and more as it tends to reduce issues and bugs.
Thus personally, I'm kind of okay with the idea of configuring what is the type of data you're expecting from the source.
Although maybe `migrate_tools` could warn you when source IDs type do not match the configured type so you know something is dubious, instead of converting it silently which causes it to rollback everything later...
Everyone's situation isn't the same, and there might be cases where having a strict source ID typing wouldn't fit one's need. From the top of my mind, the only case that could be problematic that I can think of is if the source system were to return a mix of different types as the primary ID (a mix of strings and ints, for instance). In that particular case, having a "loose type" for source IDs could be useful.
Hard for me to say which approach is better...
-- --
With regards to your case: If setting properly the source ID types on your migration configurations solves your rollback problem without patching `migrate_tools`, it feels like a quick and healthy fix in my opinion. In which case, you may apply the patch from the referenced issue if it fixes your performance problem.
Otherwise, you may have to merge the 2 patches together. Not sure how tricky this is going to be. YMMV.
Comment #37
scott_euser commentedThings have changed too much to safely just resolve conflicts. This will need a more in-depth review and re-application of the MR.
Comment #41
scott_euser commentedComment #42
scott_euser commentedComment #43
scott_euser commentedComment #44
jonasanne commentedAdded a diff from merge request for ease of use as patch.
Thankyou @scott_euser for updating this.
Comment #45
jonasanne commentedI made a small mistake with the prev patch.
Goodmorning :D
Comment #46
scott_euser commentedHiding patch, you should download the patch locally as per https://www.drupal.org/docs/develop/git/using-gitlab-to-contribute-to-dr... This avoids confusion for maintainers & further development as to what to work on.
Thanks!
Comment #47
scott_euser commentedWhat would be useful though is RTBC'ing it if you have reviewed (+explaining what you did to test/review). Thanks!
Comment #48
nnevillLooks good and works!
Thanks!
Comment #49
nicxvan commentedNeeds a rebase. Please use the MR not a patch.
Comment #53
nicxvan commentedRebased on 6.0.x
Comment #54
scott_euser commentedThanks for sorting out the conflict! Still working well! Phpcs errors are unrelated and due to changes in standards I believe, will need a separate issue to solve that, but not a blocker here.
Comment #55
scott_euser commentedCreated here for phpcs #3498688: PHPCS fixes for MigrateToolsCommands
Comment #56
scott_euser commentedComment #57
sonfdHere's a patch that applies against 6.0.5 for use with composer. It's the same as the MR, but does not include the tests.
Comment #58
mikelutzComment #59
scott_euser commentedComment #60
scott_euser commentedOkay tests for deprecation now passing
Comment #61
macsim commentedTested MR77 on 6.0.x-dev with a migration where I have
bigintas source andstringas destination sinceintegercan't be the solution (SQLSTATE[22003]: Numeric value out of range: 1264 Out of range value for column 'sourceid1' at row 1)Before patch:
After patch:
Seems good to me
Comment #62
macsim commentedwoops sorry for the status change ; for me it's RTBC
I leave "Needs review" until few more people test it
Comment #63
jrockowitz commentedComment #64
ramzypro commentedI'm successfully using the patch from #63 on a sync from an external MySQL data source with no issues.
Comment #65
scott_euser commentedHiding patch to avoid confusion.
I can't change to RTBC since I've been doing the work so next step is someone here in the community confirming they've reviewed and tested to change the status so we can get this back infront of the maintainers now that feedback is resolved + test coverage is in place.
Comment #66
ramzypro commentedApologies; I had to read up more on Drupal etiquette about changing status - I have used the patch in #63 on large datasets without issue, many of them mixing string values that start with "0" which are impossible to handle correctly without it.
Comment #67
scott_euser commentedSorry wasn't aimed at you necessarily, just the notification reminded me of this issue and wanted to comment what is needed to move forward :) Thanks for confirming RTBC in any case!
Comment #69
heddnHopefully I caught everyone who contributed to this issue. Thanks for all your work on it.
Comment #70
scott_euser commentedThank you!