Problem/Motivation

Migration is very slow when there are lots of rows in the source. I saw this initially with url_alias migration and then with file migration (~200K files). I tracked it down to slow queries involving JOINs.

 User@Host: drupal[drupal] @ localhost []
# Thread_id: 62  Schema: drupal  QC_hit: No
# Query_time: 184.817538  Lock_time: 0.000041  Rows_sent: 0  Rows_examined: 897152256
use drupal;
SET timestamp=1484412007;
SELECT ua.*, map.sourceid1 AS migrate_map_sourceid1, map.source_row_status AS migrate_map_source_row_status
FROM
url_alias ua
LEFT OUTER JOIN zend8.migrate_map_d7_url_alias map ON pid = map.sourceid1
WHERE ( (map.sourceid1 IS NULL ) OR (map.source_row_status = '1') )

One query went on till 6000 seconds before I finally killed it.

At first, I thought it was my configuration and I kept increasing join_buffer_size (32M against the default 256K) and trying. While there was some improvement, it was largely slow. The only way to speed this up reliably is to use indexes.

This is possibly related to #2688297: File migration slows down and eats more and more memory, eventually stops.

Solution

Add indexes to migrate_map_* table on source and destination columns.

Data Model changes

No changes to data model but indexes are added. Since this is not done at install, there shouldn't be much of a change.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

hussainweb created an issue. See original summary.

hussainweb’s picture

Attaching a patch against 8.2.x (might work for 8.3.x too).

hussainweb’s picture

Status: Active » Needs review

Status: Needs review » Needs work

The last submitted patch, 2: add_indexes_to_migrate_map-2843595-2.patch, failed testing.

hussainweb’s picture

Status: Needs work » Needs review
FileSize
812 bytes
2.36 KB

Updating the test.

jonhattan’s picture

@hussainweb can you share total time of those big migrations with and w/o the patch? I'm seeing a significant improvement on my side, although no query lasted so long as you reported.

I'm testing the patch with two migrations:

* 150k core files.
* 28k nodes. Configured with several migration processes, so it is consulting maps intensively. Process migrations are configured for users (111 items in the map), files (148k items), terms (72 items) and nodes (33k items across 7 maps).

After patch I see an improvement of ~200s and ~1900s respectively.

Without patch:

* File: Processed 148278 items (148278 created, 0 updated, 0 failed, 0 ignored) - done with 'abo_media__files' [1260.55 sec, 339.58 MB]
* Node: Processed 29014 items (23467 created, 5503 updated, 44 failed, 0 ignored) - done with 'abo_articles__news' [12426.82 sec, 442.34 MB]

Patched:

* File: Processed 148278 items (148278 created, 0 updated, 0 failed, 0 ignored) - done with 'abo_media__files' [1046.27 sec, 339.58 MB]
* Node: Processed 29014 items (23467 created, 5503 updated, 44 failed, 0 ignored) - done with 'abo_articles__news' [10492.26 sec, 442.35 MB]

Also I'm seeing the gain in the node migration increases as the migration advances:

#100th node
Importing node 91643 [14.77 sec, 323.24 MB]
Importing node 91643 [14.18 sec, 323.24 MB]
gain: 0.59s

#500th node
Importing node 134921 [34.02 sec, 328.1 MB]  
Importing node 134921 [32.86 sec, 328.1 MB]
gain: 1.16s

#1000th node
Importing node 150625 [63.12 sec, 328.97 MB]
Importing node 150625 [59.38 sec, 328.97 MB]
gain: 3.74s

#3000th node
Importing node 179873 [280.15 sec, 340.78 MB]
Importing node 179873 [232.77 sec, 340.78 MB]
gain: 48s

#5000th node
Importing node 196131 [597.2 sec, 350.05 MB]
Importing node 196131 [497.89 sec, 349.81 MB] 
gain: 100s

#10000th node
Importing node 236129 [1802.89 sec, 378.34 MB]
Importing node 236129 [1533.05 sec, 378.34 MB]
gain: 269s

#15000th node
Importing node 257473 [3729.33 sec, 389.81 MB]
Importing node 257473 [3195.08 sec, 389.82 MB]
gain: 534s

#20000th node
Importing node 276783 [6316.9 sec, 421.27 MB]
Importing node 276783 [5456.51 sec, 421.27 MB]
gain: 860s

#25000th node
Importing node 296347 [9619.2 sec, 433.12 MB]
Importing node 296347 [8407.38 sec, 433.13 MB]
gain: 1212s

#28000th node
Importing node 307477 [11927.59 sec, 441.05 MB]
Importing node 307477 [10135.5 sec, 441.05 MB]
gain: 1792s
hussainweb’s picture

@jonhattan, that is an impressive test. I didn't measure time precisely but I can tell for url_alias migration which contains 30k entries, it took about 40 minutes to run before the patch and about 30 seconds with the patch. The file migration where I saw the 6000s query had 300k entries and never finished. It took an unreasonable amount of time even after I increased join_buffer_size to 32M. Personally, I would discard the very long times for url_alias and file migrations from before because it could have been a misconfigured MySQL. I have since tuned the config using the MySQL tuner script.

I can try repeating the migration without the patch a bit later. I am focused on completing the migration right now. However, your results show a 5-18% gain and that is not small.

jonhattan’s picture

On patch review, I've identified what operation benefits each of the indexes.

  1. +++ b/core/modules/migrate/src/Plugin/migrate/id_map/Sql.php
    @@ -313,8 +313,10 @@ protected function ensureTables() {
    +        $indexes['source'][] = $mapkey;
    

    public function lookupDestinationIds(array $source_id_values) is the one that benefits this index. It is called by migration process plugin and MigrateExecutable::import().

  2. +++ b/core/modules/migrate/src/Plugin/migrate/id_map/Sql.php
    @@ -333,6 +335,7 @@ protected function ensureTables() {
    +        $indexes['destination'][] = $mapkey;
    

    public function lookupSourceID(array $destination_id_values); is the one that benefits this patch. It is called by id_map\Sql::deleteDestination() and dedupe_entity source plugin.

The `source` index has a big impact on any migration, while the `destination` one has a limited scope -- at least in my setups where I've never used `dedupe_entity`.

Other interesting index may be on column `source_row_status`. It is used by SqlBase. I'll try and test it in a few hours.

heddn’s picture

Status: Needs review » Needs work
Issue tags: -migrate
+++ b/core/modules/migrate/tests/src/Unit/MigrateSqlIdMapEnsureTablesTest.php
@@ -72,6 +72,10 @@ public function testEnsureTablesNotExist() {
+      'indexes' => [
+        'source' => ['sourceid1'],
+        'destination' => ['destid1'],
+      ],

I'd like to see a test with multiple destination and multiple source ids that are indexed. Not just a single of each.

hussainweb’s picture

@jonhattan, you're right about destid not being used much but I thought I should put it in for completeness. Our main concern here is joins during migration and I don't think destid is useful there at all. We could consider dropping it from indexes.

@heddn, I think that test is not really checking for efficiency of indexes but to see if they are just present. The existing definition just creates one of sourceid and destid columns and that is why there is only one column in the index each. Are you saying we should add more columns to the map and index? We could do that but is there a benefit there?

heddn’s picture

re #10: to be more clear, the reason I wanted to see multiple columns indexed for either source or destination is because I think they should have a combined index and I don't think they are setup to combine them. Because you'd always be looking up with both values, no?

jonhattan’s picture

re #8. Did some tests with an index for `source_row_status` column. I don't see a gain in migrations run (`drush migrate-import`). Only observed a little gain in `drush migrate-status` with cold mysql caches, but nothing relevant.

hussainweb’s picture

@heddn, #11, I just cross-checked the patch and they are combined indexes. You are correct that it would always be looking with both values and a combined index helps. The source and destination indexes are different, though, as they are not queried together, obviously.

@heddn, @jonhattan, do we agree to remove the index from destination columns? I don't think they would be used frequently for lookups and only result in increased space and slower inserts.

heddn’s picture

@hussainweb yeah, nixing the destination is fine for now. If we find a need for them, we can add them later. Since you found that it creates a combined index, we should add a test to verify that doesn't change.

hussainweb’s picture

Status: Needs work » Needs review
FileSize
1.99 KB
2.51 KB

Removing destination from index and adding another source id column in the tests to check if the index contains both columns.

Status: Needs review » Needs work

The last submitted patch, 15: add_indexes_to_migrate_map-2843595-15.patch, failed testing.

hussainweb’s picture

Status: Needs work » Needs review

Setting to Needs Review after a random failure.

heddn’s picture

Status: Needs review » Reviewed & tested by the community

Looks good now. Ready for a committer.

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.6 was released on February 1, 2017 and is the final full bugfix release for the Drupal 8.2.x series. Drupal 8.2.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.3.0 on April 5, 2017. (Drupal 8.3.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.3.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 15: add_indexes_to_migrate_map-2843595-15.patch, failed testing.

hussainweb’s picture

Status: Needs work » Reviewed & tested by the community

Tentatively setting back to RTBC after retesting after random failure related to CI system (it didn't even kick in project clone).

  • catch committed 1f958ab on 8.4.x
    Issue #2843595 by hussainweb, heddn, jonhattan: Add indexes to...

  • catch committed 9641d34 on 8.3.x
    Issue #2843595 by hussainweb, heddn, jonhattan: Add indexes to...
catch’s picture

Priority: Normal » Major
Issue tags: +Migrate critical

Very nice find. Committed/pushed to 8.3.x and 8.4.x, thanks! Tagging and bumping status retrospectively.

alexpott’s picture

Status: Reviewed & tested by the community » Fixed

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

super_romeo’s picture

Hi!

Rollback of 500 000 records occupied 5 hours:
DELETE FROM migrate_map_upgrade_d7_url_alias WHERE destid1 = '189521'
When I add index:
CREATE INDEX destid ON migrate_map_upgrade_d7_url_alias (destid1) USING BTREE;
it occupied 10 minutes.

heddn’s picture

This was committed and closed some time ago. If there is still an issue, please open a new bug report.