ContentEntity migration source doesn't consider the migration map [#3188914]

Problem/Motivation

The migration system keeps track of what has been migrated already from a source by writing a record for each migrated source row to a map table.

The SqlBase source plugin base class makes use of the map table by doing an SQL JOIN to the map table, so that the source query is filtered to only those source records that haven't already been imported.

This means that if you do an incremental migration, the migration process doesn't have to go through lots of source records that have already been imported, because they are simply eliminated from the query result.

However, the ContentEntity migration source, which provides entities from the current Drupal site as the source rows, doesn't consider the map.

This means that if you do an incremental migration, or do your migration in batches, either for performance or during development, ALL the entities that have already been migrated are iterated over, loaded, and checked against the map.

This makes incremental migrations very slow, as they have to go over all the already migrated entities before they get to entities that need to be migrated.

Steps to reproduce

Run an incremental migration with lots of source records (10k or so).

Proposed resolution

Move the addMapJoin() method to a new MapJoinTrait.

Use the new trait in both SqlBase and ContentEntity.

In ContentEntity, get the SQL query from the source entity query, and JOIN to the map table.

Remaining tasks

User interface changes

None.

API changes

None.

Data model changes

None.

Release notes snippet

TBD

Comment	File	Size	Author
#35	3188914-nr-bot.txt	2.13 KB	needs-review-queue-bot
#34	contententity-source-map-join-34-interdiff.txt	1.78 KB	Berdir
#34	contententity-source-map-join-34.patch	11.4 KB	Berdir
#34	11.x: PHP 8.2 & MySQL 8 Custom Commands Failed
#32	contententity-source-map-join-32-interdiff.txt	1.48 KB	Berdir
#32	contententity-source-map-join-32.patch	10.96 KB	Berdir
#32	11.x: PHP 8.2 & MySQL 8 29,441 pass
#31	contententity-source-map-join-31-interdiff.txt	1.29 KB	Berdir
#31	contententity-source-map-join-31.patch	10.84 KB	Berdir
#31	11.x: PHP 8.2 & MySQL 8 Custom Commands Failed
#30	contententity-source-map-join-28.patch	11.09 KB	Berdir
#30	11.x: PHP 8.2 & MySQL 8 Custom Commands Failed
#24	3188914-nr-bot.txt	4.04 KB	needs-review-queue-bot

Issue fork drupal-3188914

Show commands

Start within a Git clone of the project using the version control instructions.

Add & fetch this issue fork’s repository

Or, if you do not have SSH keys set up on git.drupalcode.org:

Add & fetch this issue fork’s repository

3188914-10.1-content-entity-migration-source-map changes, plain diff MR !3660
Check out this branch
3188914-content-entity-migration-source-map-89 changes, plain diff MR !247
Check out this branch
3188914-content-entity-migration-source-map changes, plain diff MR !246
Check out this branch

About issue forks

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Comment #1

17 December 2020 at 16:51

joachim created an issue. See original summary.

Comment #2

joachim CreditAttribution: joachim commented 17 December 2020 at 17:02

In the absence of #3188918: Allow EntityQuery to be converted to the underlying SQL query , this is going to involve adding a query tag to the entity query, then doing the JOIN in a hook_query_TAG_alter().

Comment #3

11 January 2021 at 15:59

joachim opened merge request !246

Comment #4

joachim CreditAttribution: joachim commented 11 January 2021 at 16:02

Status:

Active

» Needs review

MR: https://git.drupalcode.org/project/drupal/-/merge_requests/246

The fix depends on #3188918: Allow EntityQuery to be converted to the underlying SQL query .

The MR includes cherry-picked commits from:

- #3188918: Allow EntityQuery to be converted to the underlying SQL query , because the fix depends on that issue. This allows the MR to be tested.
- #3158436: Give batch_size feature to Drupal\migrate_drupal\Plugin\migrate\source\ContentEntity so it can scale, because although that's independent of this, the two fixes will clash and I'm working on both sequentially for my project...

Comment #5

11 January 2021 at 16:14

joachim opened merge request !247

Comment #6

joachim CreditAttribution: joachim commented 11 January 2021 at 16:15

MR for 8.9.x: https://git.drupalcode.org/project/drupal/-/merge_requests/247. Has the same cherry-picks as the 9.2.x MR.

Comment #7

benjifisher

he/him or they/them

English

Boston area

CreditAttribution: benjifisher commented 14 February 2021 at 16:54

Issue summary:	View changes
Status:	Needs review	» Needs work

The testbot found some coding standards problems. Please fix those (and get some passing tests) before marking an issue NR.
It looks as though this issue should be postponed on #3188918: Allow EntityQuery to be converted to the underlying SQL query and #3158436: Give batch_size feature to Drupal\migrate_drupal\Plugin\migrate\source\ContentEntity so it can scale.
Why is ContentEntity::mapJoinable() so much simpler than SqlBase::mapJoinable()?
Depending on the answer to the previous question, it might make sense to move mapJoinable() to the new trait. Maybe just some of the conditions are common to the two classes, in which case we could move those conditions to a new method in the trait.
If we do that, then it would be really easy to add the ignore_map option to SqlBase. I have seen cases where joining the tables causes problems. It might not be in scope for this issue.
I think we need some test coverage for the new option.

Comment #8

14 February 2021 at 17:29

anmolgoyal74 made their first commit to this issue’s fork.

Comment #9

joachim CreditAttribution: joachim at TrainingCloud commented 15 February 2021 at 09:28

> It looks as though this issue should be postponed [...]

Yes.
The EntityQuery patch is necessary.
The batch size patch is just very tangled up with this one, so one had to come first when I was working on them.

> Depending on the answer to the previous question, it might make sense to move mapJoinable() to the new trait. Maybe just some of the conditions are common to the two classes, in which case we could move those conditions to a new method in the trait.

The SqlBase::mapJoinable() considers highwater marks, and also needs to consider different database servers. It could be looking at ANY kind of SQL database.

Whereas ContentEntity::mapJoinable() needs to consider the entity storage type. It knows it's looking at a Drupal DB, and specifically at Drupal entities.

They're really very different things!

The only thing they have in common is checking $this->configuration['ignore_map'], and I don't think it's worth extracting just that check to a trait.

> If we do that, then it would be really easy to add the ignore_map option to SqlBase. I have seen cases where joining the tables causes problems. It might not be in scope for this issue.

It already has that option. Or do you mean SourcePluginBase? Out of scope for this issue to add it there, I would say.

Comment #10

benjifisher

he/him or they/them

English

Boston area

CreditAttribution: benjifisher commented 16 February 2021 at 03:27

@anmolgoyal74: I think MR 247 is for the 8.9.x branch. If you look at MR !246, then you should be able to apply my suggestion to make the same change to the trait.

Comment #11

benjifisher

he/him or they/them

English

Boston area

CreditAttribution: benjifisher commented 16 February 2021 at 03:32

@joachim:

Thanks for the reply.

The SqlBase::mapJoinable() considers highwater marks ...

Why don't we check highwater marks in ContentEntityBase?

If there are two common checks, it is a matter of taste. If there are three common checks, then it is worth adding something to the trait.

Comment #12

joachim CreditAttribution: joachim as a volunteer commented 16 February 2021 at 10:51

> Why don't we check highwater marks in ContentEntityBase?

We could do, but that feels like a feature request rather than part of this bug. Or at least, a follow-up. This is already quite a big MR.

> @anmolgoyal74: I think MR 247 is for the 8.9.x branch

Yes. I needed to patch my 8.9 site with this, and I figured other people might find it useful too. I added a suffix to the branch name to try to make it clear.

Comment #13

16 February 2021 at 10:51

Version:

9.2.x-dev

» 9.3.x-dev

Drupal 9.2.0-alpha1 will be released the week of May 3, 2021, which means new developments and disruptive changes should now be targeted for the 9.3.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Comment #14

chriscalip CreditAttribution: chriscalip commented 15 May 2021 at 00:26

Feedback.

a.) I believe there is a simpler approach to enforcing highwater. Retain the original and simpler initial population method of $ids in method initializeIterator. A simple sql query that returns ids is efficient and effective. The key is to do the enforcement at the php script level and not additionally entangle highwater and batchsize in the db sql queries.

https://gist.github.com/chriscalip/17d4f8ac3160d878dd8f6d92d04bc7a0#file...
Is an implementation of highwater enforcement at php level.

b.) Intention.

- drush mim XXX Execution process only-new records.
- drush mim XXX --update Execution process all previously migrated and new records.

https://gist.github.com/chriscalip/17d4f8ac3160d878dd8f6d92d04bc7a0#file...
Admittedly, gist does workaround to enforce intention that there is a difference between intention to process only-new records and intention to process all (previously migrated and new records.). If we can increase scope to include advance handling between only-new and process all via (--update). That would be great.

Comment #15

joachim CreditAttribution: joachim as a volunteer commented 15 May 2021 at 12:20

Your code in your gist has no comments, so I'm not completely following it.

But this:

> $ids = $this->query()->execute();

won't scale for large source migrations. Loading too many IDs will exhaust PHP memory. I've seen it myself.

The whole point of doing an SQL JOIN to the map table is because doing this in PHP is slow and won't scale. Doing a JOIN to the map table means that the already migrated IDs never even make it into PHP, which is what we want.

Comment #16

chriscalip CreditAttribution: chriscalip commented 17 May 2021 at 14:45

@joachim

No, loading array of many ids will not cause PHP memory exhaustion instead loading an array of many entities will lead to memory exhaustion. An entity is several magnitudes higher in memory consumption than an integer. The migration system will collapse even earlier before it reaches method initializeIterator if the row count is in the trillions.

Hence

> $ids = $this->query()->execute();
> $ids = filter($ids);
> $ids = subsets($ids);

Will scale, I have seen it myself.

The whole point is to do filtering/subsets on the IDs to a smaller set of IDs. Do you have a counter example that a simple check of

$isHigher = ($highWater <= $target);
// wherein $highwater is the highwater value.
// wherein $target is the intended target row id; could be revision-id or base-id.

will not work?

Per request I added comments on method initializeIterator.

Comment #17

joachim CreditAttribution: joachim as a volunteer commented 18 May 2021 at 10:48

> No, loading array of many ids will not cause PHP memory exhaustion

I've seen that happening! We had a site with several million entities to migrate, and the migration crashed exactly because the query was getting ALL the IDs in one go!

I needed the patch at #3158436: Give batch_size feature to Drupal\migrate_drupal\Plugin\migrate\source\ContentEntity so it can scale and this one to prevent memory exhaustion & also the migration taking AGES to skip past already migrated entities.

If you don't believe me, try:

$startMemory = memory_get_usage();
$array = range(1, 6000000);
echo memory_get_usage() - $startMemory, ' bytes';

Also, this issue is not about highwater. It's about the migration map, which is different and more precise.

On my machine that's 268439632 -- 250 or so MB.

Comment #18

chriscalip CreditAttribution: chriscalip commented 18 May 2021 at 16:28

Counter argument accepted. Fair enough.

Comment #19

18 May 2021 at 16:28

Version:

9.3.x-dev

» 9.4.x-dev

Drupal 9.3.0-rc1 was released on November 26, 2021, which means new developments and disruptive changes should now be targeted for the 9.4.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Comment #20

18 May 2021 at 16:28

Version:

9.4.x-dev

» 9.5.x-dev

Drupal 9.4.0-alpha1 was released on May 6, 2022, which means new developments and disruptive changes should now be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Comment #21

18 May 2021 at 16:28

Version:

9.5.x-dev

» 10.1.x-dev

Drupal 9.5.0-beta2 and Drupal 10.0.0-beta2 were released on September 29, 2022, which means new developments and disruptive changes should now be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Comment #22

14 March 2023 at 14:19

joachim opened merge request !3660

Comment #23

joachim CreditAttribution: joachim as a volunteer commented 14 March 2023 at 14:19

Status:

Needs work

» Needs review

Rebased a new branch on 10.1.

Comment #24

needs-review-queue-bot CreditAttribution: needs-review-queue-bot as a volunteer commented 18 March 2023 at 21:57

Status:

Needs review

» Needs work

File	Size
3188914-nr-bot.txt	4.04 KB

The Needs Review Queue Bot tested this issue. It fails the Drupal core commit checks. Therefore, this issue status is now "Needs work".

Apart from a re-roll or rebase, this issue may need more work to address feedback in the issue or MR comments. To progress an issue, incorporate this feedback as part of the process of updating the issue. This helps other contributors to know what is outstanding.

Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.

Comment #25

20 March 2023 at 14:29

ameymudras made their first commit to this issue’s fork.

Comment #26

joachim CreditAttribution: joachim as a volunteer commented 8 April 2023 at 07:53

It looks like #3188918: Allow EntityQuery to be converted to the underlying SQL query is not going to get in.

We can maybe do this by decorating the entity query factory instead, the way workspaces module does it?

Comment #27

joachim CreditAttribution: joachim as a volunteer commented 12 April 2023 at 13:11

Unfortunately, doing it the way Workspaces module does it isn't going to be possible, because of the way Workspaces module does it: #3353696: [PP-2] Workspace QueryFactory alters queries in a way that's not compatible with any other module doing the same

Comment #28

joachim CreditAttribution: joachim as a volunteer commented 14 April 2023 at 12:46

New plan is to use a query tag and hook_query_alter().

Comment #29

14 April 2023 at 12:46

Version:

10.1.x-dev

» 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch, which currently accepts only minor-version allowed changes. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Comment #30

Berdir

German

Switzerland

CreditAttribution: Berdir at MD Systems GmbH commented 26 May 2023 at 22:29

Status:

Needs work

» Needs review

File	Size
contententity-source-map-join-28.patch	11.09 KB
11.x: PHP 8.2 & MySQL 8 Custom Commands Failed

Restarting this with an isolated patch that uses an alter hook which calls back to the source plugin. I've also not kept the trait, it's just two instances and both the return argument and the optional argument are only needed for SqlBase, making this more complicated and refactoring SqlBase at the same time seems to cause some test fails too. And as shown further down, entity queries have other things to consider.

Test fails show that this has never been used/tested on an entity with translations, as it caused the query to explode due to the missing table alias. That seemed easy enough to fix at first and I was able to get ContentEntitySource green without major issues.

But then I tried to write a test that actually verifies that the join works, and the combination of how source plugins work, translations, entity queries caused my brain to melt.

This turned out to be _hard_. Entity queries essentially ignore that the fact that such a thing as translations exist, match on any data table row they can and will just aggregate multiple matches back to the ids in the result array. First I thought that I can't get it work with included translations, but it turned out that was actually the easier version, the tricky part was when they were not included, then I needed to ensure to only query the default translation or the query still found translations not matching the map table. The test now verifies with 40 (!) permutations of included translation, revision (which doesn't really do much yet, but it will be a fun one to support once revisions as a source actually work, the current state makes no sense to me at all), map join settings and every possible migrate map combination with two languages.

The test verifies both the query result (which essentially returns only entity ids that need to have at least one translation still migrated) and the returned row count (which is then filtered down on the specific translations and always considers the map status).

Comment #31

Berdir

German

Switzerland

CreditAttribution: Berdir at MD Systems GmbH commented 26 May 2023 at 22:58

File	Size
contententity-source-map-join-31.patch	10.84 KB
11.x: PHP 8.2 & MySQL 8 Custom Commands Failed
contententity-source-map-join-31-interdiff.txt	1.29 KB

I like you too phpcs.

Comment #32

Berdir

German

Switzerland

CreditAttribution: Berdir at MD Systems GmbH commented 27 May 2023 at 06:11

File	Size
contententity-source-map-join-32.patch	10.96 KB
11.x: PHP 8.2 & MySQL 8 29,441 pass
contententity-source-map-join-32-interdiff.txt	1.48 KB

Hm, missed cspell, kinda annoying, using the same comment as in SqlBase for that.

Comment #33

joachim CreditAttribution: joachim as a volunteer commented 27 May 2023 at 06:49

> Entity queries essentially ignore that the fact that such a thing as translations exist, match on any data table row they can and will just aggregate multiple matches back to the ids in the result array

Is that related to my comment here: #2942948: Add allTranslations() and allTranslations()->count to entity query?

Comment #34

Berdir

German

Switzerland

CreditAttribution: Berdir at MD Systems GmbH commented 5 June 2023 at 06:24

File	Size
contententity-source-map-join-34.patch	11.4 KB
11.x: PHP 8.2 & MySQL 8 Custom Commands Failed
contententity-source-map-join-34-interdiff.txt	1.78 KB

It's related to that, yes. But my change here seems to work fine according to the tests.

We have been running our migration for a few days now (yeah, it's slow) and it's been working well from what see so far. We have noticed one bug though, and that is that the join is also applied to the count query, so that the total is essentially just the remaining articles. I think that doesn't really affect the actual migration, but migrate status is confused then.

The fix seems easy enough, just need to skip the method when there are no fields.

Comment #35

needs-review-queue-bot CreditAttribution: needs-review-queue-bot as a volunteer commented 8 June 2023 at 10:24

Status:

Needs review

» Needs work

File	Size
3188914-nr-bot.txt	2.13 KB

The Needs Review Queue Bot tested this issue. It fails the Drupal core commit checks. Therefore, this issue status is now "Needs work".

This does not mean that the patch needs to be re-rolled or the MR rebased. Read the Issue Summary, the issue tags and the latest discussion here to determine what needs to be done.

Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.