Problem/Motivation

The migration system keeps track of what has been migrated already from a source by writing a record for each migrated source row to a map table.

The SqlBase source plugin base class makes use of the map table by doing an SQL JOIN to the map table, so that the source query is filtered to only those source records that haven't already been imported.

This means that if you do an incremental migration, the migration process doesn't have to go through lots of source records that have already been imported, because they are simply eliminated from the query result.

However, the ContentEntity migration source, which provides entities from the current Drupal site as the source rows, doesn't consider the map.

This means that if you do an incremental migration, or do your migration in batches, either for performance or during development, ALL the entities that have already been migrated are iterated over, loaded, and checked against the map.

This makes incremental migrations very slow, as they have to go over all the already migrated entities before they get to entities that need to be migrated.

Steps to reproduce

Run an incremental migration with lots of source records (10k or so).

Proposed resolution

Move the addMapJoin() method to a new MapJoinTrait.

Use the new trait in both SqlBase and ContentEntity.

In ContentEntity, get the SQL query from the source entity query, and JOIN to the map table.

Remaining tasks

User interface changes

None.

API changes

None.

Data model changes

None.

Release notes snippet

TBD

Issue fork drupal-3188914

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

joachim created an issue. See original summary.

joachim’s picture

In the absence of #3188918: Allow EntityQuery to be converted to the underlying SQL query , this is going to involve adding a query tag to the entity query, then doing the JOIN in a hook_query_TAG_alter().

joachim’s picture

Status: Active » Needs review

MR: https://git.drupalcode.org/project/drupal/-/merge_requests/246

The fix depends on #3188918: Allow EntityQuery to be converted to the underlying SQL query .

The MR includes cherry-picked commits from:

- #3188918: Allow EntityQuery to be converted to the underlying SQL query , because the fix depends on that issue. This allows the MR to be tested.
- #3158436: Give batch_size feature to Drupal\migrate_drupal\Plugin\migrate\source\ContentEntity so it can scale, because although that's independent of this, the two fixes will clash and I'm working on both sequentially for my project...

joachim’s picture

MR for 8.9.x: https://git.drupalcode.org/project/drupal/-/merge_requests/247. Has the same cherry-picks as the 9.2.x MR.

benjifisher’s picture

Issue summary: View changes
Status: Needs review » Needs work
  1. The testbot found some coding standards problems. Please fix those (and get some passing tests) before marking an issue NR.

  2. It looks as though this issue should be postponed on #3188918: Allow EntityQuery to be converted to the underlying SQL query and #3158436: Give batch_size feature to Drupal\migrate_drupal\Plugin\migrate\source\ContentEntity so it can scale.

  3. Why is ContentEntity::mapJoinable() so much simpler than SqlBase::mapJoinable()?

  4. Depending on the answer to the previous question, it might make sense to move mapJoinable() to the new trait. Maybe just some of the conditions are common to the two classes, in which case we could move those conditions to a new method in the trait.

  5. If we do that, then it would be really easy to add the ignore_map option to SqlBase. I have seen cases where joining the tables causes problems. It might not be in scope for this issue.

  6. I think we need some test coverage for the new option.

anmolgoyal74 made their first commit to this issue’s fork.

joachim’s picture

> It looks as though this issue should be postponed [...]

Yes.
The EntityQuery patch is necessary.
The batch size patch is just very tangled up with this one, so one had to come first when I was working on them.

> Depending on the answer to the previous question, it might make sense to move mapJoinable() to the new trait. Maybe just some of the conditions are common to the two classes, in which case we could move those conditions to a new method in the trait.

The SqlBase::mapJoinable() considers highwater marks, and also needs to consider different database servers. It could be looking at ANY kind of SQL database.

Whereas ContentEntity::mapJoinable() needs to consider the entity storage type. It knows it's looking at a Drupal DB, and specifically at Drupal entities.

They're really very different things!

The only thing they have in common is checking $this->configuration['ignore_map'], and I don't think it's worth extracting just that check to a trait.

> If we do that, then it would be really easy to add the ignore_map option to SqlBase. I have seen cases where joining the tables causes problems. It might not be in scope for this issue.

It already has that option. Or do you mean SourcePluginBase? Out of scope for this issue to add it there, I would say.

benjifisher’s picture

@anmolgoyal74: I think MR 247 is for the 8.9.x branch. If you look at MR !246, then you should be able to apply my suggestion to make the same change to the trait.

benjifisher’s picture

@joachim:

Thanks for the reply.

The SqlBase::mapJoinable() considers highwater marks ...

Why don't we check highwater marks in ContentEntityBase?

If there are two common checks, it is a matter of taste. If there are three common checks, then it is worth adding something to the trait.

joachim’s picture

> Why don't we check highwater marks in ContentEntityBase?

We could do, but that feels like a feature request rather than part of this bug. Or at least, a follow-up. This is already quite a big MR.

> @anmolgoyal74: I think MR 247 is for the 8.9.x branch

Yes. I needed to patch my 8.9 site with this, and I figured other people might find it useful too. I added a suffix to the branch name to try to make it clear.

Version: 9.2.x-dev » 9.3.x-dev

Drupal 9.2.0-alpha1 will be released the week of May 3, 2021, which means new developments and disruptive changes should now be targeted for the 9.3.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

chriscalip’s picture

Feedback.

a.) I believe there is a simpler approach to enforcing highwater. Retain the original and simpler initial population method of $ids in method initializeIterator. A simple sql query that returns ids is efficient and effective. The key is to do the enforcement at the php script level and not additionally entangle highwater and batchsize in the db sql queries.

https://gist.github.com/chriscalip/17d4f8ac3160d878dd8f6d92d04bc7a0#file...
Is an implementation of highwater enforcement at php level.

b.) Intention.

- drush mim XXX Execution process only-new records.
- drush mim XXX --update Execution process all previously migrated and new records.

https://gist.github.com/chriscalip/17d4f8ac3160d878dd8f6d92d04bc7a0#file...
Admittedly, gist does workaround to enforce intention that there is a difference between intention to process only-new records and intention to process all (previously migrated and new records.). If we can increase scope to include advance handling between only-new and process all via (--update). That would be great.

joachim’s picture

Your code in your gist has no comments, so I'm not completely following it.

But this:

> $ids = $this->query()->execute();

won't scale for large source migrations. Loading too many IDs will exhaust PHP memory. I've seen it myself.

The whole point of doing an SQL JOIN to the map table is because doing this in PHP is slow and won't scale. Doing a JOIN to the map table means that the already migrated IDs never even make it into PHP, which is what we want.

chriscalip’s picture

@joachim

No, loading array of many ids will not cause PHP memory exhaustion instead loading an array of many entities will lead to memory exhaustion. An entity is several magnitudes higher in memory consumption than an integer. The migration system will collapse even earlier before it reaches method initializeIterator if the row count is in the trillions.

Hence

> $ids = $this->query()->execute();
> $ids = filter($ids);
> $ids = subsets($ids);

Will scale, I have seen it myself.

The whole point is to do filtering/subsets on the IDs to a smaller set of IDs. Do you have a counter example that a simple check of

$isHigher = ($highWater <= $target);
// wherein $highwater is the highwater value.
// wherein $target is the intended target row id; could be revision-id or base-id.

will not work?

Per request I added comments on method initializeIterator.

joachim’s picture

> No, loading array of many ids will not cause PHP memory exhaustion

I've seen that happening! We had a site with several million entities to migrate, and the migration crashed exactly because the query was getting ALL the IDs in one go!

I needed the patch at #3158436: Give batch_size feature to Drupal\migrate_drupal\Plugin\migrate\source\ContentEntity so it can scale and this one to prevent memory exhaustion & also the migration taking AGES to skip past already migrated entities.

If you don't believe me, try:

$startMemory = memory_get_usage();
$array = range(1, 6000000);
echo memory_get_usage() - $startMemory, ' bytes';

Also, this issue is not about highwater. It's about the migration map, which is different and more precise.

On my machine that's 268439632 -- 250 or so MB.

chriscalip’s picture

Counter argument accepted. Fair enough.

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.0-rc1 was released on November 26, 2021, which means new developments and disruptive changes should now be targeted for the 9.4.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.4.x-dev » 9.5.x-dev

Drupal 9.4.0-alpha1 was released on May 6, 2022, which means new developments and disruptive changes should now be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.5.x-dev » 10.1.x-dev

Drupal 9.5.0-beta2 and Drupal 10.0.0-beta2 were released on September 29, 2022, which means new developments and disruptive changes should now be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

joachim’s picture

Status: Needs work » Needs review

Rebased a new branch on 10.1.

needs-review-queue-bot’s picture

Status: Needs review » Needs work
FileSize
4.04 KB

The Needs Review Queue Bot tested this issue. It fails the Drupal core commit checks. Therefore, this issue status is now "Needs work".

Apart from a re-roll or rebase, this issue may need more work to address feedback in the issue or MR comments. To progress an issue, incorporate this feedback as part of the process of updating the issue. This helps other contributors to know what is outstanding.

Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.

ameymudras made their first commit to this issue’s fork.

joachim’s picture

It looks like #3188918: Allow EntityQuery to be converted to the underlying SQL query is not going to get in.

We can maybe do this by decorating the entity query factory instead, the way workspaces module does it?

joachim’s picture

Unfortunately, doing it the way Workspaces module does it isn't going to be possible, because of the way Workspaces module does it: #3353696: [PP-2] Workspace QueryFactory alters queries in a way that's not compatible with any other module doing the same

joachim’s picture

New plan is to use a query tag and hook_query_alter().

Version: 10.1.x-dev » 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch, which currently accepts only minor-version allowed changes. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Berdir’s picture

Status: Needs work » Needs review
FileSize
11.09 KB

Restarting this with an isolated patch that uses an alter hook which calls back to the source plugin. I've also not kept the trait, it's just two instances and both the return argument and the optional argument are only needed for SqlBase, making this more complicated and refactoring SqlBase at the same time seems to cause some test fails too. And as shown further down, entity queries have other things to consider.

Test fails show that this has never been used/tested on an entity with translations, as it caused the query to explode due to the missing table alias. That seemed easy enough to fix at first and I was able to get ContentEntitySource green without major issues.

But then I tried to write a test that actually verifies that the join works, and the combination of how source plugins work, translations, entity queries caused my brain to melt.

This turned out to be _hard_. Entity queries essentially ignore that the fact that such a thing as translations exist, match on any data table row they can and will just aggregate multiple matches back to the ids in the result array. First I thought that I can't get it work with included translations, but it turned out that was actually the easier version, the tricky part was when they were not included, then I needed to ensure to only query the default translation or the query still found translations not matching the map table. The test now verifies with 40 (!) permutations of included translation, revision (which doesn't really do much yet, but it will be a fun one to support once revisions as a source actually work, the current state makes no sense to me at all), map join settings and every possible migrate map combination with two languages.

The test verifies both the query result (which essentially returns only entity ids that need to have at least one translation still migrated) and the returned row count (which is then filtered down on the specific translations and always considers the map status).

Berdir’s picture

Berdir’s picture

Hm, missed cspell, kinda annoying, using the same comment as in SqlBase for that.

joachim’s picture

> Entity queries essentially ignore that the fact that such a thing as translations exist, match on any data table row they can and will just aggregate multiple matches back to the ids in the result array

Is that related to my comment here: #2942948: Add allTranslations() and allTranslations()->count to entity query?

Berdir’s picture

It's related to that, yes. But my change here seems to work fine according to the tests.

We have been running our migration for a few days now (yeah, it's slow) and it's been working well from what see so far. We have noticed one bug though, and that is that the join is also applied to the count query, so that the total is essentially just the remaining articles. I think that doesn't really affect the actual migration, but migrate status is confused then.

The fix seems easy enough, just need to skip the method when there are no fields.

needs-review-queue-bot’s picture

Status: Needs review » Needs work
FileSize
2.13 KB

The Needs Review Queue Bot tested this issue. It fails the Drupal core commit checks. Therefore, this issue status is now "Needs work".

This does not mean that the patch needs to be re-rolled or the MR rebased. Read the Issue Summary, the issue tags and the latest discussion here to determine what needs to be done.

Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.