This is useful for migrations with a large number of source records that use the Drupal\migrate_drupal\Plugin\migrate\source\ContentEntity source plugin.

Currently, the more source records this plugin has to process, the more memory is consumed during migrate-import, due to eager loading.

How to reproduce? Have an entity bundle with lots of fields and lots of records (e.g. nodes).

$printMem = function () {
  $usage = memory_get_usage();
  $peak = memory_get_peak_usage();
  var_dump([
    'memory_get_usage()' => $usage,
    'simple usage' => number_format(round($usage / 1024 / 1024)) . ' MB',
    'memory_get_peak_usage()' => $peak,
    'simple peak' => number_format(round($peak / 1024 / 1024)) . ' MB',
  ]);
};

$array = [];
for ($i = 0; $i < 1000000; $i++) { $array[] = $i; }
$printMem();
unset($array);
$printMem();
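// 50 MB usage, 82 MB peak usage.
// Legend: 82 MB is the app bootstrap on a Drush CLI. Use e.g. (drush php) or (drupal shell) to execute this series of PHP statements.

// node_load() is deprecated; Node::load() is the equivalent call.
$node = \Drupal\node\Entity\Node::load(8878);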
$printMem();
// 52 MB usage, 82 MB peak usage.

use Drupal\migrate\MigrateExecutable;
use Drupal\migrate\MigrateMessage;
$migrationMachineName = 'XXX';
$migration = \Drupal::service('plugin.manager.migration')->createInstance($migrationMachineName);
$migration->getIdMap()->prepareUpdate();
$executable = new MigrateExecutable($migration, new MigrateMessage());
$executable->import();
$printMem();
// 439 MB usage, 439 MB peak usage.

Issue fork drupal-3158436

Comments

chriscalip created an issue. See original summary.

joachim’s picture

Title: Give batch_size feature to Drupal\migrate_drupal\Plugin\migrate\source\ContentEntity » Give batch_size feature to Drupal\migrate_drupal\Plugin\migrate\source\ContentEntity so it can scale
Version: 8.9.x-dev » 9.2.x-dev
Category: Feature request » Bug report

I've run into this problem with a ContentEntity migration that has a large number of records to migrate.

With about 2 million source entities, my server ran out of memory with 512MB. This happens even if I try to migrate a single entity with the --limit option to the 'drush mim' command:

> PHP Fatal error: Allowed memory size of 536870912 bytes exhausted (tried to allocate 167772160 bytes) in docroot/core/lib/Drupal/Core/Database/Statement.php on line 112
[warning] Drush command terminated abnormally.

I suspect the problem is here:

  protected function initializeIterator() {
    $ids = $this->query()->execute();
    return $this->yieldEntities($ids);
  }

Before the migration even begins, the iterator is initialised, and here, we load EVERY entity ID for the WHOLE source.

An array of 2 million entity IDs is actually going to be pretty big! Going by what it says on https://stackoverflow.com/questions/36709580/what-is-the-actual-memory-c..., it's probably going to be something like 400MB, which would explain why my server ran out of memory!
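
The size of such an array can be checked empirically. The following is a minimal sketch, not taken from the patch: measureIdArrayBytes() is a hypothetical helper that fills an array with numeric strings (the form in which the entity query returns IDs, as noted later in this thread) and reports the growth in memory_get_usage(). It measures a smaller sample and extrapolates, so the figure is only indicative and will vary with the PHP version.

```php
<?php

/**
 * Measures the memory taken by an array of $count numeric ID strings.
 *
 * Hypothetical helper for illustration only; not part of Drupal.
 */
function measureIdArrayBytes(int $count): int {
  $before = memory_get_usage();
  $ids = [];
  for ($i = 1; $i <= $count; $i++) {
    // Cast to string to mimic IDs coming back from a database query.
    $ids[$i] = (string) $i;
  }
  // $ids is still in scope here, so the difference includes the whole array.
  $after = memory_get_usage();
  return $after - $before;
}

$sampleSize = 100000;
$bytes = measureIdArrayBytes($sampleSize);
$bytesPerId = $bytes / $sampleSize;

printf("%d IDs: %.1f MB, about %.0f bytes per ID\n", $sampleSize, $bytes / 1024 / 1024, $bytesPerId);
printf("Extrapolated to 2 million IDs: %.0f MB\n", $bytesPerId * 2000000 / 1024 / 1024);
```

Whatever the exact number, the cost grows linearly with the number of source entities and is paid before the first row is migrated.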

I'm changing this to a bug, as without this, the ContentEntity migration source doesn't scale to large datasets.

joachim’s picture

I've added a batch to the query in yieldEntities(), and I'm trying a large migration with a debug statement that shows memory_get_peak_usage(). The result shows that memory keeps creeping up:

      49/2363472 [>---------------------------]   0%^ "MEM: 56878608"
      99/2363472 [>---------------------------]   0%^ "MEM: 60614792"
     147/2363472 [>---------------------------]   0%^ "MEM: 63745448"
     197/2363472 [>---------------------------]   0%^ "MEM: 66387128"
     247/2363472 [>---------------------------]   0%^ "MEM: 68985048"
     297/2363472 [>---------------------------]   0%^ "MEM: 71753216"
     345/2363472 [>---------------------------]   0%^ "MEM: 75365616"
     395/2363472 [>---------------------------]   0%^ "MEM: 78616168"
     444/2363472 [>---------------------------]   0%^ "MEM: 81138752"
     494/2363472 [>---------------------------]   0%^ "MEM: 83738640"
     543/2363472 [>---------------------------]   0%^ "MEM: 86345864"
     593/2363472 [>---------------------------]   0%^ "MEM: 89973384"
     643/2363472 [>---------------------------]   0%^ "MEM: 93585248"
     693/2363472 [>---------------------------]   0%^ "MEM: 96147040"
     743/2363472 [>---------------------------]   0%^ "MEM: 98774776"
     793/2363472 [>---------------------------]   0%^ "MEM: 101238888"
     843/2363472 [>---------------------------]   0%^ "MEM: 104826200"
     893/2363472 [>---------------------------]   0%^ "MEM: 107882864"
     943/2363472 [>---------------------------]   0%^ "MEM: 110234888"
     993/2363472 [>---------------------------]   0%^ "MEM: 112600632"
    1042/2363472 [>---------------------------]   0%^ "MEM: 115346560"
    1092/2363472 [>---------------------------]   0%^ "MEM: 118852912"
    1142/2363472 [>---------------------------]   0%^ "MEM: 121736408"
    1192/2363472 [>---------------------------]   0%^ "MEM: 124094272"
    1241/2363472 [>---------------------------]   0%^ "MEM: 126407608"
    1291/2363472 [>---------------------------]   0%^ "MEM: 129376568"
    1341/2363472 [>---------------------------]   0%^ "MEM: 132904624"
    1391/2363472 [>---------------------------]   0%^ "MEM: 135537744"
    1441/2363472 [>---------------------------]   0%^ "MEM: 137886248"

Processing about 1000 source entities adds about 57 MB to the memory usage, so 20K entities is going to be 1GB of memory.
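
As a rough cross-check of that estimate, the growth rate can be computed from the first and last samples in the log above. This is only a sketch: bytesPerEntity() is a hypothetical helper, and a straight line between two samples ignores any unevenness in between.

```php
<?php

/**
 * Returns the average memory growth in bytes per processed entity.
 *
 * Hypothetical helper for illustration only.
 */
function bytesPerEntity(int $countA, int $memA, int $countB, int $memB): float {
  return ($memB - $memA) / ($countB - $countA);
}

// First and last samples from the progress output above.
$rate = bytesPerEntity(49, 56878608, 1441, 137886248);

// Growth per 1000 entities, in MB (1 MB = 1024 * 1024 bytes).
$perThousandMb = $rate * 1000 / 1024 / 1024;

// Number of entities needed to add 1 GB at that rate.
$entitiesPerGb = (int) floor(1024 ** 3 / $rate);

printf("%.0f bytes per entity, %.1f MB per 1000 entities\n", $rate, $perThousandMb);
printf("About %d entities per GB\n", $entitiesPerGb);
```

The result is in the same range as the figures quoted above: tens of megabytes per thousand entities, and a gigabyte reached after roughly twenty thousand entities.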

I'm fixing this by forcibly clearing the entity memory cache every batch.

MR coming!

joachim’s picture

With the clearing of the entity memory cache, memory is creeping up, but much more slowly:

      49/2363472 [>---------------------------]   0%^ "MEM: 56870648"
      99/2363472 [>---------------------------]   0%^ "MEM: 60550400"
     147/2363472 [>---------------------------]   0%^ "MEM: 60854584"
     197/2363472 [>---------------------------]   0%^ "MEM: 61228472"
     247/2363472 [>---------------------------]   0%^ "MEM: 61228472"
     297/2363472 [>---------------------------]   0%^ "MEM: 61228472"
     345/2363472 [>---------------------------]   0%^ "MEM: 61228472"
     395/2363472 [>---------------------------]   0%^ "MEM: 61228472"
     444/2363472 [>---------------------------]   0%^ "MEM: 61228472"
     494/2363472 [>---------------------------]   0%^ "MEM: 61228472"
     543/2363472 [>---------------------------]   0%^ "MEM: 61228472"
     593/2363472 [>---------------------------]   0%^ "MEM: 61228472"
     643/2363472 [>---------------------------]   0%^ "MEM: 61228472"
     693/2363472 [>---------------------------]   0%^ "MEM: 61228472"
     743/2363472 [>---------------------------]   0%^ "MEM: 61228472"
     793/2363472 [>---------------------------]   0%^ "MEM: 61228472"
     843/2363472 [>---------------------------]   0%^ "MEM: 61228472"
     893/2363472 [>---------------------------]   0%^ "MEM: 61228472"
     943/2363472 [>---------------------------]   0%^ "MEM: 61228472"
     993/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    1042/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    1092/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    1142/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    1192/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    1241/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    1291/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    1341/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    1391/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    1441/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    1490/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    1540/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    1589/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    1639/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    1689/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    1739/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    1789/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    1839/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    1889/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    1937/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    1987/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    2036/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    2086/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    2136/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    2186/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    2236/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    2286/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    2336/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    2386/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    2436/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    2485/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    2534/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    2583/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    2633/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    2683/2363472 [>---------------------------]   0%^ "MEM: 61677016"
    2732/2363472 [>---------------------------]   0%^ "MEM: 61716920"
    2782/2363472 [>---------------------------]   0%^ "MEM: 61716920"
    2832/2363472 [>---------------------------]   0%^ "MEM: 61716920"
    2882/2363472 [>---------------------------]   0%^ "MEM: 61716920"
    2932/2363472 [>---------------------------]   0%^ "MEM: 61716920"
    2982/2363472 [>---------------------------]   0%^ "MEM: 61842328"
    3032/2363472 [>---------------------------]   0%^ "MEM: 61842328"
    3082/2363472 [>---------------------------]   0%^ "MEM: 61842328"
    3132/2363472 [>---------------------------]   0%^ "MEM: 61842328"
    3182/2363472 [>---------------------------]   0%^ "MEM: 61842328"
    3231/2363472 [>---------------------------]   0%^ "MEM: 61878600"
    3281/2363472 [>---------------------------]   0%^ "MEM: 61878600"
    3331/2363472 [>---------------------------]   0%^ "MEM: 61878600"
    3379/2363472 [>---------------------------]   0%^ "MEM: 61878600"
    3428/2363472 [>---------------------------]   0%^ "MEM: 61878600"
    3478/2363472 [>---------------------------]   0%^ "MEM: 61878600"
    3528/2363472 [>---------------------------]   0%^ "MEM: 61878600"
    3578/2363472 [>---------------------------]   0%^ "MEM: 61878600"
    3628/2363472 [>---------------------------]   0%^ "MEM: 61878600"

About 200k memory per 1000 source entities, looking at the figures that come later on. So 5k entities will add 1 MB, and using 1GB of memory will take about 5M source rows.

joachim’s picture

Status: Active » Needs review

Opened merge requests for 9.2 and 8.9 (the changes are the same in both branches).

quietone’s picture

Status: Needs review » Needs work
Issue tags: +Needs tests

This needs a test.

Version: 9.2.x-dev » 9.3.x-dev

Drupal 9.2.0-alpha1 will be released the week of May 3, 2021, which means new developments and disruptive changes should now be targeted for the 9.3.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.0-rc1 was released on November 26, 2021, which means new developments and disruptive changes should now be targeted for the 9.4.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.4.x-dev » 9.5.x-dev

Drupal 9.4.0-alpha1 was released on May 6, 2022, which means new developments and disruptive changes should now be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

larowlan’s picture

Version: 9.5.x-dev » 10.1.x-dev

Drupal 9.5.0-beta2 and Drupal 10.0.0-beta2 were released on September 29, 2022, which means new developments and disruptive changes should now be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

_pratik_’s picture

Assigned: Unassigned » _pratik_

_pratik_’s picture

Assigned: _pratik_ » Unassigned
File status: new, size: 3.83 KB

Attached the rerolled patch for 10.1.x.

nivethasubramaniyan’s picture

File status: new, size: 4.58 KB
File status: new, size: 2.15 KB

Fixing CCF in #15

sandeepsingh199’s picture

File status: new, size: 4.15 KB

Tried to fix ERROR of #15 and #16.

sandeepsingh199’s picture

Status: Needs work » Needs review
File status: new, size: 716 bytes

Attaching interdiff.

medha kumari’s picture

Issue tags: -Needs reroll
File status: new, size: 4.15 KB

Rerolled the patch #17 in Drupal 10.1.x.

_utsavsharma’s picture

File status: new, size: 3.83 KB

Rerolled the patch for 10.1.x.
Please review.

shivam-kumar’s picture

File status: new, size: 4.26 KB

Errors in #20 have been fixed in this Patch and rerolled for 10.1.x.

shivam-kumar’s picture

File status: new, size: 1.07 KB

Added interdiff for #21.

needs-review-queue-bot’s picture

Status: Needs review » Needs work
File status: new, size: 1.62 KB

The Needs Review Queue Bot tested this issue. It either no longer applies to Drupal core, or fails the Drupal core commit checks. Therefore, this issue status is now "Needs work".

Apart from a re-roll or rebase, this issue may need more work to address feedback in the issue or MR comments. To progress an issue, incorporate this feedback as part of the process of updating the issue. This helps other contributors to know what is outstanding.

Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.

pooja saraah’s picture

File status: new, size: 4.13 KB
File status: new, size: 761 bytes

Fixed failed commands on #21
Attached patch against Drupal 10.1.x

joachim’s picture

Status: Needs work » Needs review

Failure looks unrelated.

Running this patch on 10.0 -- thanks for the reroll!

smustgrave’s picture

Status: Needs review » Needs work
Issue tags: +Needs Review Queue Initiative

Seems the retest aborted also. I don't think there is a build problem but could be wrong.

Am moving to NW for the tests.

berdir’s picture

The while (TRUE) here seems wrong. You don't need a while with yield; otherwise it will run forever, which is why all the CI runs time out.

Also, we can add a loadMultiple() because loading 50 nodes at once will result in ~50x fewer queries.

joachim’s picture

+++ b/core/modules/migrate_drupal/src/Plugin/migrate/source/ContentEntity.php
@@ -165,31 +177,59 @@ public function __toString() {
+    while (TRUE) {
+      $query = $this->query();
+      if (isset($this->batchSize) && $this->batchSize > 0) {
+        // Run the query in batches, to prevent large source sizes exhausting
+        $query->range($current_batch * $this->batchSize, $this->batchSize);
+      }
+      $ids = $query->execute();
+
+      // End the loop when we run out of source entities.
+      if (empty($ids)) {
+        break;
+      }
+

We need to end the loop when $ids is empty, but we can only determine what $ids is once we've run a query.

It's not something we can put inside the condition for the while().

Hence why I went for the while (TRUE) { break; } construction.

What do you suggest instead?

BTW this issue started with a MR, please could we stick to that and not post patches?

berdir’s picture

You are right. This would be much nicer as a do { ... } while (!empty($ids)) though.

That said, after reading it again, I see the problem. It only works if you do have a batch size, because without a batch size, it will not add a range, and then you will run the same query over and over again and $ids will never be empty. You could do a count($ids) < $batch_size || !$batch_size or so.
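
A minimal, self-contained sketch of that termination condition follows. It is not the code from the MR: yieldIdsInBatches() and $runQuery are hypothetical stand-ins, with $runQuery playing the role of building and executing the entity query for a given offset and limit. The loop stops after a short or empty batch, and after exactly one query when no batch size is set.

```php
<?php

/**
 * Yields IDs batch by batch.
 *
 * $runQuery is a hypothetical stand-in for the entity query: it receives an
 * offset and a limit (NULL meaning "no range") and returns an array of IDs.
 */
function yieldIdsInBatches(callable $runQuery, int $batchSize = 0): \Generator {
  $currentBatch = 0;
  do {
    $ids = $batchSize > 0
      ? $runQuery($currentBatch * $batchSize, $batchSize)
      : $runQuery(0, NULL);
    $currentBatch++;
    foreach ($ids as $id) {
      yield $id;
    }
    // Continue only when batching is on and the last batch was full.
  } while ($batchSize > 0 && count($ids) >= $batchSize);
}

// A fake source of 7 IDs stands in for the database.
$source = range(1, 7);
$calls = 0;
$runQuery = function (int $offset, ?int $limit) use ($source, &$calls): array {
  $calls++;
  return array_slice($source, $offset, $limit);
};

$batched = iterator_to_array(yieldIdsInBatches($runQuery, 3), FALSE);
$batchedCalls = $calls;

$calls = 0;
$unbatched = iterator_to_array(yieldIdsInBatches($runQuery, 0), FALSE);
$unbatchedCalls = $calls;
```

The fake source always returns rows in the same order. A real query would need an explicit sort for the offsets to be meaningful.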

The thing with batch size is that it's quite tricky. Rerunning the query can give you different results, it is for example fundamentally incompatible with the map table join that you're also working on. The map table join will no longer return the results that already match and then the batch would also skip non-processed results, causing you to skip $batch_size * $n on every query execution. Additionally, the query currently has no sort, so some databases will not guarantee a stable sort order, again risking to only process a random subset, so you'd need to add a sort. That's why \Drupal\migrate\Plugin\migrate\source\SqlBase::initializeIterator() is so complicated, it's handling all those different features and can only use some combinations of them.

So you have to pick between using the map join to speed up reruns (and then you can do multiple batches of 100k/1M or so entities) and a batch size anyway.
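
The skipping effect can be reproduced without a database. In the sketch below, plain arrays stand in for the source table and the migrate map table, and simulateBatchedImport() is a hypothetical helper: each "query" returns only rows that are not yet processed, while the offset keeps advancing as if the result set had not shrunk.

```php
<?php

/**
 * Simulates offset-based batching over a query that excludes processed rows.
 *
 * Hypothetical helper for illustration only; arrays stand in for tables.
 */
function simulateBatchedImport(array $sourceIds, int $batchSize): array {
  $processed = [];
  $batch = 0;
  do {
    // "Map join": the query only returns rows that are not yet in the map.
    $remaining = array_values(array_diff($sourceIds, $processed));
    // The offset advances regardless of how the result set has changed.
    $ids = array_slice($remaining, $batch * $batchSize, $batchSize);
    $processed = array_merge($processed, $ids);
    $batch++;
  } while (count($ids) === $batchSize);
  return $processed;
}

$all = range(1, 12);
$processed = simulateBatchedImport($all, 3);
$skipped = array_values(array_diff($all, $processed));

printf("Processed: %s\n", implode(', ', $processed));
printf("Skipped: %s\n", implode(', ', $skipped));
```

With 12 source rows and a batch size of 3, only rows 1 to 3 and 7 to 9 are processed. The other six are never seen, because every query after the first starts past rows that had moved up to the front of the result set.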

IMHO, using loadMultiple() and the memory reset will have a much bigger impact on memory usage, although a large enough data set will eventually make you run into a hard limit as well, but most people will hit the limit much earlier due to loaded entities. Based on a blackfire run against a data set of 420k source entities, Drupal\Core\Entity\Query\Sql\Query::result() uses about 250MB for that. I did some tests with range(0,1000000) which uses only 33MB, but the query gives us strings which clearly use up much more space.

What do you think about focusing on that first, possibly in a new issue? I'm confident that we can get that into core much more easily, as there shouldn't be any negative side effects or possible bugs, unlike batch. Maybe we need a test with more entities than the chunk size, but that's about it. We could maybe use Settings::get('entity_update_batch_size', 50), or a different setting name, so we don't have to create 50 entities in a test.

For those tests, I compared using array_chunk(), array_splice() and array_slice(). array_chunk() more than doubles memory usage as it creates many smaller arrays, array_splice() is extremely slow for this use case (get the first N entries from the array) probably because it reorders the whole array on every call, array_slice() with an incrementing position looks like a clean winner:

  $ids = range(1, 1000003);

  $i = 0;
  while ($ids_chunk = array_slice($ids, $i, 50)) {
    $i += 50;
  }

mathilde_dumond’s picture

I created a related issue and uploaded a patch that runs the query once, but loads the entities in batches and clears the memory cache regularly. It may not be enough for all situations, but it already helps with memory and performance: #3354201: Use loadMultiple() and reset memory cache in ContentEntity source
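
The general shape of that approach can be sketched as follows. This is not the patch from #3354201: yieldEntitiesInChunks(), $loadMultiple and $resetCache are hypothetical stand-ins for the entity storage's loadMultiple() and for clearing the entity memory cache, and a plain array simulates the static cache, so the example runs outside Drupal.

```php
<?php

/**
 * Yields entities chunk by chunk from a list of IDs fetched up front.
 *
 * $loadMultiple and $resetCache are hypothetical stand-ins for loading a
 * chunk of entities and for clearing the entity memory cache.
 */
function yieldEntitiesInChunks(array $ids, int $chunkSize, callable $loadMultiple, callable $resetCache): \Generator {
  $position = 0;
  // array_slice() with an incrementing position, as compared above.
  while ($chunk = array_slice($ids, $position, $chunkSize)) {
    $position += $chunkSize;
    foreach ($loadMultiple($chunk) as $id => $entity) {
      yield $id => $entity;
    }
    // Drop the loaded entities before the next chunk is loaded.
    $resetCache();
  }
}

// A plain array simulates the static entity cache.
$cache = [];
$maxCached = 0;
$loadMultiple = function (array $ids) use (&$cache, &$maxCached): array {
  $loaded = [];
  foreach ($ids as $id) {
    $cache[$id] = ['id' => $id];
    $loaded[$id] = $cache[$id];
  }
  $maxCached = max($maxCached, count($cache));
  return $loaded;
};
$resetCache = function () use (&$cache): void {
  $cache = [];
};

$count = 0;
foreach (yieldEntitiesInChunks(range(1, 120), 50, $loadMultiple, $resetCache) as $entity) {
  $count++;
}
```

In this simulation the cache never holds more than one chunk of entities, while the full ID list is still held in memory for the whole run, which is the limitation mentioned above.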

joachim’s picture

It's been a while since I've worked on all this, and my current migration project is a low-budget affair so not much more than rerolling patches. So I don't remember all the details.

> IMHO, using loadMultiple() and the memory reset will have a much bigger impact on memory usage, although a large enough data set will eventually make you run into a hard limit as well, but most people will hit the limit much earlier due to loaded entities

I hit the memory limit from just getting an array of entity IDs from the query on a migration that was 6M source entities. That sounds like a lot, but it was just a customer portal website. It's not an outlandish use case.

> Rerunning the query can give you different results, it is for example fundamentally incompatible with the map table join that you're also working on. The map table join will no longer return the results that already match and then the batch would also skip non-processed results, causing you to skip $batch_size * $n on every query execution.

Took me a while to get my head round that, but yes, I see what you mean.

What's strange is that I definitely used batch_size AND map join (with the patch from #3188914: ContentEntity migration source doesn't consider the migration map) and I don't remember having problems with it. Maybe there was some query caching going on in MySQL? Or maybe my map join patch wasn't working properly?

At any rate, that's something that could be figured out - don't advance the LIMIT part of the query if the source uses a map join.

The map join is important too, as without it, resuming a migration after a crash takes a horrendous amount of time as it checks everything it's already migrated.

> Additionally, the query currently has no sort, so some databases will not guarantee a stable sort order, again risking to only process a random subset, so you'd need to add a sort.

#2845863: Migrate SQL source query isn't ordered :)

> What do you think about focusing on that first, possibly in a new issue. I'm confident that we can get that into core much more easily as there shouldn't be any negative side effects/possible bugs unlike batch.

You mean doing a separate issue which just handles clearing the entity storage static cache, and postponing this issue to be worked on after the new one? Yes, makes sense to split this up into simpler issues. Looks like @mathilde_dumond has already created it -- thanks! :)

joachim’s picture

Status: Needs work » Postponed

Version: 10.1.x-dev » 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch, which currently accepts only minor-version allowed changes. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

benjifisher’s picture

In #3498915: Move content_entity source plugin to migrate module, we moved the content_entity source plugin to the migrate module. I am adding that as a related issue.

Version: 11.x-dev » main

Drupal core is now using the main branch as the primary development branch. New developments and disruptive changes should now be targeted to the main branch.

Read more in the announcement.