Problem/Motivation
Migration is very slow when there are lots of rows in the message table. I saw this with file migration (~500K files). I tracked it down to slow queries involving WHERE clause on the table message.
I spent a lot of time to debug this, it's not easy the migration begin with correct ratio and slown down after 1 hour...
At the first of migration i have a ration of import ~ 100 files/s and after ~70K ration slown down to 15 files/s.
You can reproduce this scenario with a migration that write lot of row to message table, in my case i have a lot of message like date not valid or file don't exist. (The source data was not correctly valid.)
Exemple query :
DELETE FROM migrate_message_json_images
WHERE source_ids_hash = 'a03cf05d5026350fa2fd9650fd78d3879aad366235eee7df9af35e268957c90e';
Time Migration without patch : 13 hours.
Time Migration with patch : 1,5 hour.
This is possibly related to #2688297: File migration slows down and eats more and more memory.
Proposed resolution
Add indexes to migrate_message_* table on column source_ids_hash.
Data model changes
No changes to data model but indexe are added. Since this is not done at install, there shouldn't be much of a change.
Comment | File | Size | Author |
---|---|---|---|
#22 | 3085192-16_on_top_of_2845340-16-do-not-test.patch | 676 bytes | Wim Leers |
#21 | add_index_to_migrate_message-3085192-16.patch | 1.36 KB | Wim Leers |
#19 | 3085192-17_on_top_of_2845340-16-do-not-test.patch | 708 bytes | Wim Leers |
#17 | 3085192-17.patch | 1.39 KB | Wim Leers |
#16 | add_index_to_migrate_message-3085192-16.patch | 1.36 KB | Wim Leers |
Comments
Comment #2
focus13 CreditAttribution: focus13 as a volunteer and for Klee Interactive (Klee Group) commentedComment #3
focus13 CreditAttribution: focus13 as a volunteer and for Klee Interactive (Klee Group) commentedComment #5
DamienMcKennaIt also needs an update script to add the index to existing migrate_message_* tables.
Comment #6
mikelutzThis makes sense, but needs an update path and tests around the update path. A slightly more detailed profiling analysis would also help show the reasoning here, and prove that we have a net gain with time saved in reading versus extra time writing when we get large tables (Which I would certainly expect to be the case)
Removing the 'migrate critical' tag and setting to normal priority, The migrate critical tag is for items that are tied to preparation for Drupal 9 and a new Major version. Something like this could be put out in any minor release, so while important, it is not 'migrate critical', nor does it meet the standard of Major priority.
Comment #8
Wim LeersComment #9
Kris77 CreditAttribution: Kris77 commented@DamienMcKenna can you describe how to update with the script please?
I want to try this patch because I have to import about 1milion rows and now it takes two days.
Thanks a lot.
Comment #14
Wim Leershttps://www.drupal.org/u/grasmash is working on a migration … where this patch made his migration three times faster.
This is especially important if you're using #2745797: Add option to content entity destinations for validation (CR: https://www.drupal.org/node/3073707), and your source site has some data integrity problems (common: required fields that were added after lots of content already existed, and the pre-existing content did not get updated to have values for those required fields).
I think I want to challenge the need for an update path and update path tests here. The mantra in the migration system is always:
. Well … isn't that exactly the case here?Comment #15
Wim Leers#2 no longer applied to HEAD. Rebased.
Comment #16
Wim LeersComment #17
Wim LeersActually, I think we can even make this a unique key? No two source rows are supposed to have the same source IDs hash — if they do, then they cannot be identified uniquely either.
The "mapping" tables do already make
source_ids_hash
the primary key, so the uniqueness is already guaranteed.Comment #18
Wim LeersComment #19
Wim LeersHere's a do-not-test patch relative to #2845340: migrate mapping & messages table names are truncated, can lead to incorrect mapping lookups, which is why my previous rebased patches were failing :) Also excluded the test changes in this one, to avoid the need to reroll all the time while having both patches applied.
Comment #21
Wim LeersI don't know what I was thinking … obviously there can be more than one message per source row!
So #16 is correct, #17 is stupid.
Comment #22
Wim LeersHere's a do-not-test patch relative to #2845340: migrate mapping & messages table names are truncated, can lead to incorrect mapping lookups, which is why my previous rebased patches were failing :) Also excluded the test changes in this one, to avoid the need to reroll all the time while having both patches applied.
Comment #23
joachim CreditAttribution: joachim at Factorial GmbH commentedLGTM.
Comment #24
alexpottThis is a major performance bug if it can have so much impact on a migration. Backported to 9.2.x - there's no reason to not get this in asap. I agree that writing an upgrade path is not worth it given how migrations tend to be ephemeral and this is not a breaking bug.
Committed and pushed f1c06e87e0 to 9.3.x and d4e4f7c1d3 to 9.2.x. Thanks!