Problem/Motivation
Migration is very slow when there are lots of rows in the source. I saw this initially with url_alias migration and then with file migration (~200K files). I tracked it down to slow queries involving JOINs.
User@Host: drupal[drupal] @ localhost []
# Thread_id: 62 Schema: drupal QC_hit: No
# Query_time: 184.817538 Lock_time: 0.000041 Rows_sent: 0 Rows_examined: 897152256
use drupal;
SET timestamp=1484412007;
SELECT ua.*, map.sourceid1 AS migrate_map_sourceid1, map.source_row_status AS migrate_map_source_row_status
FROM
url_alias ua
LEFT OUTER JOIN zend8.migrate_map_d7_url_alias map ON pid = map.sourceid1
WHERE ( (map.sourceid1 IS NULL ) OR (map.source_row_status = '1') )
One query went on till 6000 seconds before I finally killed it.
At first, I thought it was my configuration and I kept increasing join_buffer_size
(32M against the default 256K) and trying. While there was some improvement, it was largely slow. The only way to speed this up reliably is to use indexes.
This is possibly related to #2688297: File migration slows down and eats more and more memory, eventually stops.
Solution
Add indexes to migrate_map_* table on source and destination columns.
Data Model changes
No changes to data model but indexes are added. Since this is not done at install, there shouldn't be much of a change.
Comment | File | Size | Author |
---|---|---|---|
#15 | add_indexes_to_migrate_map-2843595-15.patch | 2.51 KB | hussainweb |
#15 | interdiff-5-15.txt | 1.99 KB | hussainweb |
#5 | add_indexes_to_migrate_map-2843595-5.patch | 2.36 KB | hussainweb |
Comments
Comment #2
hussainwebAttaching a patch against 8.2.x (might work for 8.3.x too).
Comment #3
hussainwebComment #5
hussainwebUpdating the test.
Comment #6
jonhattan@hussainweb can you share total time of those big migrations with and w/o the patch? I'm seeing a significant improvement on my side, although no query lasted so long as you reported.
I'm testing the patch with two migrations:
* 150k core files.
* 28k nodes. Configured with several migration processes, so it is consulting maps intensively. Process migrations are configured for users (111 items in the map), files (148k items), terms (72 items) and nodes (33k items across 7 maps).
After patch I see an improvement of ~200s and ~1900s respectively.
Without patch:
* File: Processed 148278 items (148278 created, 0 updated, 0 failed, 0 ignored) - done with 'abo_media__files' [1260.55 sec, 339.58 MB]
* Node: Processed 29014 items (23467 created, 5503 updated, 44 failed, 0 ignored) - done with 'abo_articles__news' [12426.82 sec, 442.34 MB]
Patched:
* File: Processed 148278 items (148278 created, 0 updated, 0 failed, 0 ignored) - done with 'abo_media__files' [1046.27 sec, 339.58 MB]
* Node: Processed 29014 items (23467 created, 5503 updated, 44 failed, 0 ignored) - done with 'abo_articles__news' [10492.26 sec, 442.35 MB]
Also I'm seeing the gain in the node migration increases as the migration advances:
Comment #7
hussainweb@jonhattan, that is an impressive test. I didn't measure time precisely but I can tell for url_alias migration which contains 30k entries, it took about 40 minutes to run before the patch and about 30 seconds with the patch. The file migration where I saw the 6000s query had 300k entries and never finished. It took an unreasonable amount of time even after I increased
join_buffer_size
to 32M. Personally, I would discard the very long times for url_alias and file migrations from before because it could have been a misconfigured MySQL. I have since tuned the config using the MySQL tuner script.I can try repeating the migration without the patch a bit later. I am focused on completing the migration right now. However, your results show a 5-18% gain and that is not small.
Comment #8
jonhattanOn patch review, I've identified what operation benefits each of the indexes.
public function lookupDestinationIds(array $source_id_values)
is the one that benefits this index. It is called by migration process plugin and MigrateExecutable::import().public function lookupSourceID(array $destination_id_values);
is the one that benefits this patch. It is called by id_map\Sql::deleteDestination() and dedupe_entity source plugin.The `source` index has a big impact on any migration, while the `destination` one has a limited scope -- at least in my setups where I've never used `dedupe_entity`.
Other interesting index may be on column `source_row_status`. It is used by SqlBase. I'll try and test it in a few hours.
Comment #9
heddnI'd like to see a test with multiple destination and multiple source ids that are indexed. Not just a single of each.
Comment #10
hussainweb@jonhattan, you're right about destid not being used much but I thought I should put it in for completeness. Our main concern here is joins during migration and I don't think destid is useful there at all. We could consider dropping it from indexes.
@heddn, I think that test is not really checking for efficiency of indexes but to see if they are just present. The existing definition just creates one of sourceid and destid columns and that is why there is only one column in the index each. Are you saying we should add more columns to the map and index? We could do that but is there a benefit there?
Comment #11
heddnre #10: to be more clear, the reason I wanted to see multiple columns indexed for either source or destination is because I think they should have a combined index and I don't think they are setup to combine them. Because you'd always be looking up with both values, no?
Comment #12
jonhattanre #8. Did some tests with an index for `source_row_status` column. I don't see a gain in migrations run (`drush migrate-import`). Only observed a little gain in `drush migrate-status` with cold mysql caches, but nothing relevant.
Comment #13
hussainweb@heddn, #11, I just cross-checked the patch and they are combined indexes. You are correct that it would always be looking with both values and a combined index helps. The source and destination indexes are different, though, as they are not queried together, obviously.
@heddn, @jonhattan, do we agree to remove the index from destination columns? I don't think they would be used frequently for lookups and only result in increased space and slower inserts.
Comment #14
heddn@hussainweb yeah, nixing the destination is fine for now. If we find a need for them, we can add them later. Since you found that it creates a combined index, we should add a test to verify that doesn't change.
Comment #15
hussainwebRemoving destination from index and adding another source id column in the tests to check if the index contains both columns.
Comment #17
hussainwebSetting to Needs Review after a random failure.
Comment #18
heddnLooks good now. Ready for a committer.
Comment #21
hussainwebTentatively setting back to RTBC after retesting after random failure related to CI system (it didn't even kick in project clone).
Comment #24
catchVery nice find. Committed/pushed to 8.3.x and 8.4.x, thanks! Tagging and bumping status retrospectively.
Comment #25
alexpottComment #27
super_romeo CreditAttribution: super_romeo commentedHi!
Rollback of 500 000 records occupied 5 hours:
DELETE FROM migrate_map_upgrade_d7_url_alias WHERE destid1 = '189521'
When I add index:
CREATE INDEX destid ON migrate_map_upgrade_d7_url_alias (destid1) USING BTREE;
it occupied 10 minutes.
Comment #28
heddnThis was committed and closed some time ago. If there is still an issue, please open a new bug report.