Problem/Motivation

I am attempting to migrate ~300,000 files from Drupal 7. During the migration import, I hit this:

Memory usage is 435.23 MB (85% of limit 512 MB), reclaiming memory.                                                                                                                                        [warning]
Memory usage is now 439.99 MB (86% of limit 512 MB), not enough reclaimed, starting new batch                                                                                                              [warning]

What's interesting is that on the first run I got about 70,000 files in one go before hitting the wall; then it halved, then it halved again, and now I'm down to fewer than 1,000 per run before hitting the memory limit.

Proposed resolution

Figure out what's causing the memory usage to be so high.

Remaining tasks

  1. Figure out what the problem is
  2. Write Patch

User interface changes

N/A

API changes

N/A

Data model changes

N/A

Comments

davidwbarratt created an issue. See original summary.

davidwbarratt’s picture

Issue summary: View changes
mikeryan’s picture

The memory is almost certainly being sucked up by the entity cache, I would think. MigrateExecutable::attemptMemoryReclaim() does the following:

    // Entity storage can blow up with caches so clear them out.
    $manager =  \Drupal::entityManager();
    foreach ($manager->getDefinitions() as $id => $definition) {
      $manager->getStorage($id)->resetCache();
    }

It sounds like this is somehow failing to actually reclaim the entity storage...

mikeryan’s picture

OK, I devel-generated a bunch of stuff in my local D6 site copy and ran the migration against that with a touch of instrumentation. I ended up with massive quantities of comments and I'm seeing:

Upgrading d6_comment
Memory usage is 180.79 MB (71% of limit 256 MB), reclaiming memory.                                          [warning]
before static reclaim: 180.79 MB
before entity reclaim: 180.74 MB
after reclaim: 42.98 MB
Memory usage is 180.28 MB (70% of limit 256 MB), reclaiming memory.                                          [warning]
before static reclaim: 180.28 MB
before entity reclaim: 180.28 MB
after reclaim: 42.55 MB
Memory usage is 180.25 MB (70% of limit 256 MB), reclaiming memory.                                          [warning]
before static reclaim: 180.25 MB
before entity reclaim: 180.25 MB
after reclaim: 42.55 MB
...

So entity cache reclamation in general seems to work fine; what you're seeing seems to be file-specific. I doubt that the migration itself is leaking memory. I suspect the leak is elsewhere, but it needs a deeper dig with massive quantities of files (oh, where is drush generate-files?).
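For anyone who wants to try this kind of before/after measurement outside of Drupal, here is a minimal standalone sketch. It is not the instrumentation used above: measure_reclaim() and format_mb() are hypothetical helpers, and a plain array stands in for an entity static cache.

```php
<?php

/**
 * Formats a byte count as megabytes, similar to the log lines above.
 */
function format_mb(int $bytes): string {
  return number_format($bytes / 1048576, 2) . ' MB';
}

/**
 * Runs a reclaim step and reports memory usage before and after it.
 *
 * @param string $label
 *   Name of the step, used only in the output (hypothetical).
 * @param callable $reclaim
 *   The step that is expected to free memory.
 *
 * @return int
 *   Bytes freed by the step (negative if usage grew).
 */
function measure_reclaim(string $label, callable $reclaim): int {
  $before = memory_get_usage();
  $reclaim();
  $after = memory_get_usage();
  echo "before $label: " . format_mb($before) . "\n";
  echo "after $label: " . format_mb($after) . "\n";
  return $before - $after;
}

// Stand-in for an entity static cache: an array holding large values.
$cache = [];
for ($i = 0; $i < 2000; $i++) {
  $cache[$i] = str_repeat('x', 10000);
}

$freed = measure_reclaim('entity reclaim', function () use (&$cache) {
  // Stand-in for a resetCache() call on an entity storage.
  $cache = [];
});
```

If the "after" number does not drop for a real cache, something else is still holding references to the cached objects.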

Version: 8.0.x-dev » 8.1.x-dev

Drupal 8.0.6 was released on April 6 and is the final bugfix release for the Drupal 8.0.x series. Drupal 8.0.x will not receive any further development aside from security fixes. Drupal 8.1.0-rc1 is now available and sites should prepare to update to 8.1.0.

Bug reports should be targeted against the 8.1.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

swentel’s picture

I've seen this happening as well. The first 40k files go very fast, but after that it slows down to the point where it's simply not useful anymore. I haven't actually checked whether this is a problem with entity cache; that might be suspect too (*). For some reason, I was focusing on the migrate map and message tables. The more records go in, the slower it becomes, and I suspect it's the source_ids_hash column, which acts as an index as well, but I can't confirm that for now. I've started working on a patch for migrate upgrade in #2708723: Allow to run different background processes which does two things:

- allow drush to run different processes in the background using an offset
- use temporary tables: move the data to a temporary table after each chunk, and move it back after a plugin has finished.

Using that technique I was able to import 400k files in a little over 70 minutes.
I will do a test without triggering different background processes, instead staying in the same while loop that drush is in, but still moving the data from the migrate tables to a temporary table.

(*) Although in my case, the actual files aren't found (so far, just trying pure data migration), so there aren't files being saved, only records in the migrate map and message tables telling that the file isn't found.

catch’s picture

Title: Migrate Import produces diminishing returns by eating more and more memory » File migration slows down and eats more and more memory
Priority: Normal » Major

Re-titling this and bumping to major. Anecdotally from irc this does sound like it's specific to the file migration.

Anonymous’s picture

I have in the past run into issues like this in Drupal 7 and it is tough to debug. While I was debugging I found some platform issues beyond Drupal (php bugs I believe), but in the end it turned out to be an infrastructure issue (firewall vs. replication servers not in sync). In any case, we want to catch the issue so we can prevent the issue in the future.

@davidwbarratt and @swentel can we get details on the environment you are running the migrations in?

Specifically:
- what PHP version? and,
- what host OS?

swentel’s picture

@Ryan Weal

I'm running the migration (for now) on my local machine, Ubuntu 15.10 - php 5.6.11-1ubuntu3.3

mikeryan’s picture

@swentel - were you seeing the memory leakage problem reported in the original issue, or just the slowdown?

berdir’s picture

We've seen something similar when migrating a larger amount of files. Only happened for files, not for nodes.

Given that file migrations involve downloading *lots* of files and putting them in the local file system, I can totally imagine that this is outside of our control and something in PHP itself or even lower.

What helped for us was just processing a few hundred items and then doing that in a bash loop.

swentel’s picture

@mikeryan particularly the slowdown, haven't really checked the memory.

ultimike’s picture

Issue summary: View changes
Status: Active » Postponed (maintainer needs more info)

I spent some time today (at the DrupalCon New Orleans migrate sprint) testing this in order to figure out if this could be an issue with Migrate. Here's what I did:

  • Created a D7 site with 10,000 files (image).
  • Created a custom migration to migrate the 10,000 files into D8.
  • Added some code (see attached patch) to MigrateExecutable.php for timing and memory tracking.
  • I ran the migration via drush and graphed the resulting data. The timing data wasn't very interesting; everything looked good there. The memory data is shown in the image below. As you can see, when the memory hits a certain point, the attemptMemoryReclaim() method is called. As the data shows, there doesn't appear to be any indication that the maximum memory use is increasing over time (at least for 10,000 files).

[Image: Memory data]

In speaking with mikeryan, vasi, and others, the issue _could_ be that PHP's memory_limit is either set to "-1" or set to a value higher than the available memory on the machine. In both cases, attemptMemoryReclaim() will **never** be run, and an out-of-memory condition can occur.
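To make that reasoning concrete, here is a simplified standalone sketch of a threshold check. It is not the code from MigrateExecutable: memory_limit_to_bytes() and should_reclaim() are hypothetical names, and the 0.85 threshold is only assumed from the "85% of limit" line in the issue summary.

```php
<?php

/**
 * Converts a php.ini shorthand size such as "512M" or "1G" to bytes.
 *
 * Hypothetical helper. Treats "-1" (no limit) as PHP_INT_MAX.
 */
function memory_limit_to_bytes(string $limit): int {
  $limit = trim($limit);
  if ($limit === '-1') {
    return PHP_INT_MAX;
  }
  // Casting "512M" to int keeps the leading digits: 512.
  $number = (int) $limit;
  switch (strtoupper(substr($limit, -1))) {
    case 'G':
      return $number * 1024 * 1024 * 1024;

    case 'M':
      return $number * 1024 * 1024;

    case 'K':
      return $number * 1024;
  }
  return $number;
}

/**
 * Decides whether a reclaim attempt would be triggered.
 *
 * The 0.85 default is an assumption, not a value confirmed in this thread.
 */
function should_reclaim(int $usage, int $limit, float $threshold = 0.85): bool {
  return $usage > $threshold * $limit;
}

$mb = 1024 * 1024;
$gb = 1024 * $mb;

// 440 MB used with a 512M limit: above 85%, so a reclaim is attempted.
var_dump(should_reclaim(440 * $mb, memory_limit_to_bytes('512M')));
// 6 GB used with no limit: the threshold is effectively unreachable.
var_dump(should_reclaim(6 * $gb, memory_limit_to_bytes('-1')));
// 6 GB used with an 8G limit: still under 85% (6.8 GB), even though a
// machine with less physical memory may already be swapping or failing.
var_dump(should_reclaim(6 * $gb, memory_limit_to_bytes('8G')));
```

Under these assumptions only the first case triggers a reclaim, which matches the two failure modes described above.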

-mike

davidwbarratt’s picture

So in our case, we actually did not want the files being moved over. We decided we'd instead use Stage File Proxy to move them over as needed.

To accomplish this, we overrode the entity:file destination plugin like so:

<?php

namespace Drupal\gc_migrate\Plugin\migrate\destination;

use Drupal\migrate\Plugin\migrate\destination\EntityContentBase;

/**
 * Entity File destination.
 *
 * Every migration that uses this destination must have an optional
 * dependency on the d6_file migration to ensure it runs first.
 *
 * @MigrateDestination(
 *   id = "entity:file"
 * )
 */
class EntityFile extends EntityContentBase {}

We originally "fixed" this problem by creating a drushrc.php file with this content:

<?php

/**
 * @file
 * Drush runtime config.
 */

ini_set('memory_limit', '1G');

The server has 7.5GB of memory, so more than enough to handle it.

We are attempting to migrate over 333,000 files.

With us overriding the destination plugin to not move the files, the only thing I can think of is that there must be something with the file entity save itself.

For now we'll try increasing the memory limit to 2G, but this is a little ridiculous.

davidwbarratt’s picture

Status: Postponed (maintainer needs more info) » Active
catch’s picture

I'd expect the same problem with other entity types, but any chance it's #2635440: Document what cache clearing from ContentEntityStorageBase::resetCache() actually clears clearing the persistent cache? It would be useful to know if switching the entity cache backend to null makes things better or worse while running the migration.

Anonymous’s picture

this looks like an xhprof run would help?

benjy’s picture

@davidwbarratt, given the comment in #13, can you provide a setup that would allow someone here to reproduce your issue?

mikeryan’s picture

Status: Active » Postponed (maintainer needs more info)

@davidwbarratt: A couple other points of clarification:

  1. Have you been using stage_file_proxy all along (with EntityFile overridden as shown), or did you switch to it after the initial memory leak problem? I.e., have you seen the memory leak problem only when using stage_file_proxy, or did you see it with a standard file setup?
  2. What's your PHP version?

@beejeebus: See ultimike's analysis in #13 above.

neclimdul’s picture

I don't have that many files, but I ran a similar test on a very large set of nodes and saw similar results to #13. Memory peaked pretty high during each run, so I had to set the limit at 512, but when I did so it never hit the reclaim and ran smoothly.

Even so, I also ran the same test with a null entity cache in reference to #16. It was something like 5 minutes faster on an 8 hour migration. As long as the run was, though, that's basically within the margin of error for network traffic and other randomness affecting the run, so I don't know if it really made a difference or not.

kevinwal’s picture

Running into this as well with saving files in a migration. I'll follow up with details but still seeing the issue even with patch from #16 #2635440: Remove persistent cache clearing from ContentEntityStorageBase::resetCache()

mikeryan’s picture

On our current project, we've found node migration getting progressively slower, and tracked it down to pathauto. Do the people seeing this problem on file migration (A) have pathauto enabled and (B) have an alias pattern for files?

mikeryan’s picture

(credit where credit is due - @geerlingguy tracked this down to pathauto)

berdir’s picture

See #2765729: PathautoPattern->applies() exponentially slows down operations with large numbers of nodes for the pathauto issue.

I'm not sure that the case for file is because of pathauto as well, we didn't have any file pathauto patterns (which you can only have with file_entity anyway).

However, it's not actually in pathauto itself but some cache context collection/merging that seems to get slower and slower, possibly due to a huge array of cache contexts somewhere. Could be something similar for files as well.

Anonymous’s picture

re. #19 - the point of an xhprof dump would be to get the information that #13 doesn't provide.

nothing i've seen in this issue shows what actually happens when the OP runs their migration.

fixing the issue for the OP without the info that something like an xhprof dump provides is mostly an exercise in educated guessing.

mikeryan’s picture

Two questions:

  1. Has anyone besides davidwbarratt seen a memory leak problem with file migration?
  2. @davidwbarratt: Are you still conducting migrations? If so, do you still see this memory leak?

If the answers are "no", my inclination is to close this as not reproducible.

kevinwal’s picture

We are doing quite a few migrations that can include images and do have pathauto on the sites. I'll try to reproduce.

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.9 was released on September 7 and is the final bugfix release for the Drupal 8.1.x series. Drupal 8.1.x will not receive any further development aside from security fixes. Drupal 8.2.0-rc1 is now available and sites should prepare to upgrade to 8.2.0.

Bug reports should be targeted against the 8.2.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

mpp’s picture

@mikeryan: I can confirm there is a performance issue with migrations of large datasets.

When performing a test migration of 30 000 nodes, it takes 4 hours when I run it at once:
../vendor/bin/drush mi migrate_researchers_en --limit=30000

When I run the same migration but in steps of 5 000, it only takes 20 minutes:

../vendor/bin/drush mi migrate_researchers_en --limit=5000
../vendor/bin/drush mi migrate_researchers_en --limit=15000
../vendor/bin/drush mi migrate_researchers_en --limit=20000
../vendor/bin/drush mi migrate_researchers_en --limit=25000
../vendor/bin/drush mi migrate_researchers_en --limit=30000

I had pinpointed the issue to a pathauto issue (#2765729) but now that patch is in and I'm on Drupal 8.2.

mikeryan’s picture

@mpp: What about memory? That's what this issue is about...

mikeryan’s picture

I will note that for running a large migration in one go, #2309695: Add query batching to SqlBase may be helpful - can you try the patch there?

mpp’s picture

@mikeryan: I get messages "Reclaiming memory" with a max_memory_limit of 1G.

mikeryan’s picture

@mpp: OK - could you try profiling memory usage with xhprof? (and, could someone who has done memory profiling with xhprof chime in with hints? Haven't done it myself to this point...). Alternatively, maybe you could try ultimike's instrumentation in https://www.drupal.org/node/2688297#comment-11189891 above.

heddn’s picture

Issue tags: -Migrate critical

Reviewed this in the weekly migrate maintainers call. Based on the number of reports, we are going to downgrade this from a migrate critical. If this becomes more prevalent, we can always re-add the tag.

ohthehugemanatee’s picture

Adding to the list of reports. :|

I have an SqlBase source with 63000 rows, used in two consecutive paragraph migrations (splitting the source row into two paragraphs). One of the paragraph migrations references previously-migrated files, the other just contains text content. I can run them both individually, but if I run them together I get the OOM issue described here.

The workaround so far is to tag migrations into groups, and write a bash script that runs them in sequence. Not optimal, but it gets me through the migration.

swilmes’s picture

@mikeryan We are having the memory issue on our migration and have run xhprof, which led back to array_merge using massive amounts of memory. We still haven't figured out why, but I may be able to provide more details Wednesday when we revisit the issue.

berdir’s picture

Are you using pathauto? Try using the latest dev, not the beta version. I'll release a new beta soon.

swilmes’s picture

@Berdir I am using pathauto. We have run xhprof while using it, and pathauto uses the most memory when it's used. We ran it without pathauto, and that's when we saw array_merge as the largest consumer of memory.

EDIT: array_merge is not the issue when not using pathauto. I was mistaken, and that was where pathauto was using memory. The memory usage with pathauto disabled seems to be coming from database queries.

hussainweb’s picture

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.6 was released on February 1, 2017 and is the final full bugfix release for the Drupal 8.2.x series. Drupal 8.2.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.3.0 on April 5, 2017. (Drupal 8.3.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.3.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

mikeryan’s picture

A question for anyone seeing memory issues with migration - are you using search_api? I'm seeing some memory issues myself now on a project that has search_api enabled, and given that it has a history of memory issues (although none open on D8 at the moment), that seems a bit suspicious...

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.6 was released on August 2, 2017 and is the final full bugfix release for the Drupal 8.3.x series. Drupal 8.3.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.4.0 on October 4, 2017. (Drupal 8.4.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.4.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

heddn’s picture

I'd be curious if this is solved or helped with #2701335: Run garbage collection during migration memory reclamation.
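As background on why an explicit garbage collection run could matter here: PHP frees most values through reference counting, but objects that reference each other in a cycle are only freed when the cycle collector runs. The standalone sketch below illustrates that with a hypothetical CycleNode class; it says nothing about where cycles might actually occur in a migration.

```php
<?php

/**
 * A node that can point at another node, so two nodes can form a cycle.
 */
class CycleNode {

  /**
   * The other node in the pair, or NULL.
   *
   * @var CycleNode|null
   */
  public $other = NULL;

  /**
   * Some bulk data so the leak is visible in memory_get_usage().
   *
   * @var string
   */
  public $payload;

  public function __construct() {
    $this->payload = str_repeat('x', 10000);
  }

}

/**
 * Creates pairs of objects that reference each other, then drops them.
 */
function make_cycles(int $count): void {
  for ($i = 0; $i < $count; $i++) {
    $a = new CycleNode();
    $b = new CycleNode();
    $a->other = $b;
    $b->other = $a;
  }
}

make_cycles(500);

// The pairs are unreachable now, but reference counting alone cannot free
// them because each object is still referenced by its partner.
$before = memory_get_usage();
$collected = gc_collect_cycles();
$after = memory_get_usage();

echo "collected: $collected\n";
echo 'freed bytes: ' . ($before - $after) . "\n";
```

PHP also runs the collector on its own once enough candidate roots accumulate, so an explicit call mainly helps when memory needs to be released at a specific moment, such as right before checking usage against a limit.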

mpp’s picture

@mikeryan, indeed we're using search_api with search_api_solr.

neclimdul’s picture

I'll take a look, last I looked this was still an issue. (no on search api btw)

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.4 was released on January 3, 2018 and is the final full bugfix release for the Drupal 8.4.x series. Drupal 8.4.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.5.0 on March 7, 2018. (Drupal 8.5.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.5.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.6 was released on August 1, 2018 and is the final bugfix release for the Drupal 8.5.x series. Drupal 8.5.x will not receive any further development aside from security fixes. Sites should prepare to update to 8.6.0 on September 5, 2018. (Drupal 8.6.0-rc1 is available for testing.)

Bug reports should be targeted against the 8.6.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

luksak’s picture

I am running into memory issues when migrating from a SQL source to media entities. The file entities were already migrated earlier. Uninstalling search_api didn't help...

neclimdul’s picture

I've been struggling with this for years, as evidenced by this issue. At this point migrate's memory reclaiming is "pretty good". There does seem to be some place in migrations itself that is leaking, though, and as a result the memory reclaim isn't working. I think I might have tracked it down to the source plugin and prepare row shoving a lot of data onto rows, but glancing at the code again I can't put my finger on what it was.

It was complicated enough, and an immediate solution seemed unlikely enough, that I've had to work around it at this point by just running the migration repeatedly until it finishes. Far from ideal, but it seems to work: the migration is able to work back up to the high_water mark without leaking and then pick up the processing, where it leaks again until it stops or completes.

neclimdul’s picture

Another note: @webflo mentioned in some chat at some point that he thought entity caching was broken by moving to a MemoryBackend instead of the property on the Managers. I haven't been able to reproduce this, and the code looks fine, as there is compatibility code on the managers to expose the same API, but maybe it affects someone else.

luksak’s picture

I found out that I had configured my high water property incorrectly. Now I am able to run the migration in batches.

benjifisher’s picture

Any migration source that derives from SqlBase supports the batch_size configuration.

I am not sure I was seeing the problem described here, but the symptom was pretty much the same as the issue description. On a D7 -> D8 migration, I was running out of memory when migrating something like 100K users. I fixed it by adding

source:
  plugin: d7_user
  batch_size: 10000

to my migration config.
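For readers wondering why batching would help with memory, here is a rough standalone illustration of the general idea: fetch a fixed-size window of rows, hand them out one by one, then fetch the next window. This is a sketch only, not SqlBase's implementation; batched_rows() and the $fetchRange callback are hypothetical, and an array stands in for the source table.

```php
<?php

/**
 * Yields rows one at a time while fetching them in fixed-size batches.
 *
 * @param callable $fetchRange
 *   Hypothetical callback: function (int $offset, int $limit): array.
 *   With a SQL source this would be a query with a range applied.
 * @param int $batchSize
 *   Maximum number of rows fetched (and held in memory) at once.
 *
 * @return \Generator
 *   The rows, in source order.
 */
function batched_rows(callable $fetchRange, int $batchSize): Generator {
  $offset = 0;
  do {
    $rows = $fetchRange($offset, $batchSize);
    foreach ($rows as $row) {
      yield $row;
    }
    $offset += $batchSize;
    // A short (or empty) batch means the source is exhausted.
  } while (count($rows) === $batchSize);
}

// Stand-in for a source table with 25 rows.
$source = range(1, 25);
$calls = 0;
$fetch = function (int $offset, int $limit) use ($source, &$calls): array {
  $calls++;
  return array_slice($source, $offset, $limit);
};

$rows = iterator_to_array(batched_rows($fetch, 10), FALSE);
echo count($rows) . " rows fetched in $calls batches\n";
```

The trade-off is more queries in exchange for never holding more than one batch of source rows at a time.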

luksak’s picture

Huh, interesting... How does this play together with the limit of a migration import?

benjifisher’s picture

This was a while ago, so I may be misremembering, but I think drush mim --limit=5000 my_user_migration interfered with the batch_size setting. I consider that a bug.

Version: 8.6.x-dev » 8.8.x-dev

Drupal 8.6.x will not receive any further development aside from security fixes. Bug reports should be targeted against the 8.8.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.9.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

akalam’s picture

I think the problem comes from the design. Migrate loads all rows every time the migrate:import command is run. This eats more and more memory as the total number of rows grows.

Migrate focused its effort on managing the memory and trying to free it, instead of trying to load only the needed data. We think it would be better to load the entire data for a row only when that row needs to be imported. Imagine a periodic migration with a total count of 1.000.000 rows, where maybe you only need to import 10 new rows.

Here is a code example of a source plugin extending ContentEntity.

<?php

namespace Drupal\my_module\Plugin\migrate\source;

use Drupal\migrate\Row;
use Drupal\migrate_drupal\Plugin\migrate\source\ContentEntity;

/**
 * Lightweight replacement for ContentEntity source plugin.
 *
 * This class is focused on memory performance.
 *
 * @package Drupal\my_module\Plugin\migrate\source
 */
class LightweightContentEntity extends ContentEntity {

  /**
   * {@inheritdoc}
   */
  protected function initializeIterator() {
    $ids = $this->query()->execute();
    return $this->generateIterator($ids);
  }

  /**
   * Returns a lightweight iterator for all entity ids.
   *
   * @param array $ids
   *   The entity ids.
   */
  protected function generateIterator(array $ids) {
    foreach ($ids as $id) {
      yield [
        'id' => $id,
      ];
    }
  }

  /**
   * {@inheritdoc}
   */
  public function next() {
    $this->currentSourceIds = NULL;
    $this->currentRow = NULL;

    // In order to find the next row we want to process, we ask the source
    // plugin for the next possible row.
    while (!isset($this->currentRow) && $this->getIterator()->valid()) {

      $row_data = $this->getIterator()->current() + $this->configuration;
      $this->fetchNextRow();
      $row = new Row($row_data, $this->getIds());

      // Populate the source key for this row.
      $this->currentSourceIds = $row->getSourceIdValues();

      // Pick up the existing map row, if any, unless fetchNextRow() did it.
      if (!$this->mapRowAdded && ($id_map = $this->idMap->getRowBySource($this->currentSourceIds))) {
        $row->setIdMap($id_map);
      }

      // Clear any previous messages for this row before potentially adding
      // new ones.
      if (!empty($this->currentSourceIds)) {
        $this->idMap->delete($this->currentSourceIds, TRUE);
      }

      // Preparing the row gives source plugins the chance to skip.
      if ($this->prepareRow($row) === FALSE) {
        continue;
      }

      // Check whether the row needs processing.
      // 1. This row has not been imported yet.
      // 2. Explicitly set to update.
      // 3. The row is newer than the current highwater mark.
      // 4. If no such property exists then try by checking the hash of the row.
      if (!$row->getIdMap() || $row->needsUpdate() || $this->aboveHighwater($row) || $this->rowChanged($row)) {
        // The call to populateRow() method is the only change between this
        // method and the parent::next() method. We load the entity only when
        // the row should be updated.
        $this->populateRow($row);
        $this->currentRow = $row->freezeSource();
      }

      if ($this->getHighWaterProperty()) {
        $this->saveHighWater($row->getSourceProperty($this->highWaterProperty['name']));
      }
    }
  }

  /**
   * Populates a row based on the entity.
   *
   * @param \Drupal\migrate\Row $row
   *   The migrate row.
   *
   * @throws \Drupal\Component\Plugin\Exception\InvalidPluginDefinitionException
   * @throws \Drupal\Component\Plugin\Exception\PluginNotFoundException
   */
  protected function populateRow(Row $row) {

    $storage = $this->entityTypeManager
      ->getStorage($this->entityType->id());
    $entity = $storage->load($row->get('id'));
    foreach ($entity->toArray() as $property => $value) {
      if ($property != 'id') {
        $row->setSourceProperty($property, $value);
      }
    }
  }

}

I would like someone else to review that approach and say whether it could be interesting to generalize it and move it to the base source plugin somehow.
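Stripped of the Drupal specifics, the idea can be illustrated with a small standalone sketch: iterate over IDs only, and call the expensive loader just for the rows that actually need importing. The names here (rows_to_import(), $needsImport, $load) are hypothetical and are not part of the plugin above.

```php
<?php

/**
 * Yields fully loaded rows only for IDs that need importing.
 *
 * @param array $ids
 *   All source IDs (cheap to hold in memory).
 * @param callable $needsImport
 *   Hypothetical check: function (int $id): bool.
 * @param callable $load
 *   Hypothetical expensive loader: function (int $id): array.
 *
 * @return \Generator
 *   Loaded rows keyed by ID.
 */
function rows_to_import(array $ids, callable $needsImport, callable $load): Generator {
  foreach ($ids as $id) {
    if ($needsImport($id)) {
      yield $id => $load($id);
    }
  }
}

// 1000 source rows, of which the first 990 were imported on earlier runs.
$ids = range(1, 1000);
$imported = array_flip(range(1, 990));
$loads = 0;

$needsImport = function (int $id) use ($imported): bool {
  return !isset($imported[$id]);
};
$load = function (int $id) use (&$loads): array {
  $loads++;
  return ['id' => $id, 'body' => str_repeat('x', 1000)];
};

$rows = iterator_to_array(rows_to_import($ids, $needsImport, $load));
echo count($rows) . " rows loaded with $loads load calls\n";
```

In this toy setup only the 10 new rows are ever loaded, which is the saving the plugin above aims for with entities.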

heddn’s picture

Title: File migration slows down and eats more and more memory » [PP-1] File migration slows down and eats more and more memory
Status: Postponed (maintainer needs more info) » Postponed

At the very least, this should be postponed on #3006750: Remove memory management from MigrateExecutable. Over there is the first step to making memory management more manageable.

After that, we need to have a better idea of what the issue is. It is really, really hard to fix something like this without a lot of profiling and debugging, because we don't really know what is causing the memory to get eaten up and not reclaimed.

But as a pre-step, let's externalize memory management.

Version: 8.8.x-dev » 8.9.x-dev

Drupal 8.8.7 was released on June 3, 2020 and is the final full bugfix release for the Drupal 8.8.x series. Drupal 8.8.x will not receive any further development aside from security fixes. Sites should prepare to update to Drupal 8.9.0 or Drupal 9.0.0 for ongoing support.

Bug reports should be targeted against the 8.9.x-dev branch from now on, and new development or disruptive changes should be targeted against the 9.1.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

wim leers’s picture

Title: [PP-1] File migration slows down and eats more and more memory » [PP-1] File migration slows down and eats more and more memory, eventually stops
Version: 8.9.x-dev » 9.3.x-dev
Priority: Major » Critical
Issue tags: +Performance
Related issues: +#2309695: Add query batching to SqlBase

This is still an ongoing problem. We (@huzooka, @narendraR and I) are currently investigating this too, and using the infrastructure that #2309695: Add query batching to SqlBase introduced did not solve the problem (by the way: literally nothing in Drupal core uses it). The root cause seems to lie at a lower level than that. Expect news soon, @narendraR is digging deeper currently.

This can make a migration impossible to continue. Unless you resort to drush and set the memory limit to "unlimited". But that is not a reasonable demand.

Because this is extremely disruptive for sites migrating, bumping this to critical. The 46 followers of this issue prove that this problem has affected many migrations already.

heddn’s picture

Priority: Critical » Major

We discussed this in the migrate maintainers call last night. Given that the definitions of critical include things like data loss and no other workaround (neither of which is the case here), we suggested this should be dropped to major. Hopefully we have some more details shortly from your research. This has been a tough nut to crack.

bhanu951’s picture

Just to add, I was able to partially solve this issue by adding the batch_size key to the source and --feedback to the migration import command.

wim leers’s picture

@narendraR got stuck in his investigation, because using batch_size is incompatible with \Drupal\migrate\Plugin\migrate\source\SqlBase::mapJoinable():

    // With batching, we want a later batch to return the same rows that would
    // have been returned at the same point within a monolithic query. If we
    // join to the map table, the first batch is writing to the map table and
    // thus affecting the results of subsequent batches. To be safe, we avoid
    // joining to the map table when batching.
    if ($this->batchSize > 0) {
      return FALSE;
    }

… which is not at all mentioned in #2309695: Add query batching to SqlBase. It means this (using batch_size) will not scale: you cannot interrupt a migration and continue it later. It'll need to iterate over every source row until it finds the last one it actually migrated.
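To illustrate the scaling concern with a toy model (not SqlBase itself): when the source query cannot be joined to the map table, a resumed run still has to fetch every source row and skip the migrated ones in PHP. rows_visited_on_resume() below is a hypothetical function that simply counts how many rows get looked at in each case.

```php
<?php

/**
 * Counts how many source rows a resumed run has to look at.
 *
 * @param array $sourceIds
 *   All source row IDs.
 * @param array $migratedIds
 *   IDs that already have a map entry.
 * @param bool $mapJoinable
 *   Whether the source "query" can exclude mapped rows up front.
 *
 * @return int
 *   Number of rows fetched from the source.
 */
function rows_visited_on_resume(array $sourceIds, array $migratedIds, bool $mapJoinable): int {
  $migrated = array_flip($migratedIds);
  if ($mapJoinable) {
    // The "query" itself excludes rows that already have a map entry.
    $sourceIds = array_filter($sourceIds, function ($id) use ($migrated) {
      return !isset($migrated[$id]);
    });
  }
  $visited = 0;
  foreach ($sourceIds as $id) {
    $visited++;
    if (isset($migrated[$id])) {
      // Already migrated: skipped here, but the row was still fetched.
      continue;
    }
    // A real run would process the row at this point.
  }
  return $visited;
}

$source = range(1, 100000);
$migrated = range(1, 99000);

$withJoin = rows_visited_on_resume($source, $migrated, TRUE);
$withoutJoin = rows_visited_on_resume($source, $migrated, FALSE);
echo "with map join: $withJoin, without: $withoutJoin\n";
```

In this model the cost of resuming without the join grows with the total size of the source rather than with the amount of work left.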

I don't see a clear solution. Anyone else? 🤞🤓

neclimdul’s picture

I mean, I don't know how anyone runs large migrations safely outside of drush, but that's an entirely different discussion. You are right, though, that the memory requirement is a non-solution, and I'm pretty sure it still runs into slowing down to basically a stop in the end.

I know how frustrating this is, but it's been years since I've looked at this, so I can just give you a sense of what I was looking at and how we got through our migrations, and maybe that will give you some clue in your search.

1. I'm sure it goes without saying, but #3006750: Remove memory management from MigrateExecutable for flushing caches and managing memory was _key_. It looks nothing like what we used, but flushing caches is needed. I also just nulled out some caches on the container because they were 99.9999% misses with all the writes.
2. With memory management, bigger isn't always better. There's a sweet spot between thrashing the cache clears and the slowdown of PHP's allocation scheme, so choose a "reasonable" value for your memory limit. I believe I have migrate_manifest patched to pass an arg into the GC watcher so we could tune it to batches of migrations as well. Maybe worth investigating.
3. And most specific to this issue, I had the most problems with certain process plugins, so there seemed to be some sort of leakiness there. I hinted at this in an earlier comment. This might be why the "batch" concept worked for some people: probably an internal iterator storing rows as they're generated that buckles under its size, and maybe batching flushes that? Files at the time seemed to be the one I just had to deal with, but there was a lot of tuning of how process plugins worked across the project to keep rows light and things running. Hopefully that's still a relevant and useful pointer.

wim leers’s picture

@neclimdul Thank you, those are all super valuable context and real-world experience anecdotes! 🙏

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.0-rc1 was released on November 26, 2021, which means new developments and disruptive changes should now be targeted for the 9.4.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.4.x-dev » 9.5.x-dev

Drupal 9.4.0-alpha1 was released on May 6, 2022, which means new developments and disruptive changes should now be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.5.x-dev » 10.1.x-dev

Drupal 9.5.0-beta2 and Drupal 10.0.0-beta2 were released on September 29, 2022, which means new developments and disruptive changes should now be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 10.1.x-dev » 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch, which currently accepts only minor-version allowed changes. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

andypost’s picture

Title: [PP-1] File migration slows down and eats more and more memory, eventually stops » File migration slows down and eats more and more memory, eventually stops
Status: Postponed » Needs work

I did a re-roll of #3006750: Remove memory management from MigrateExecutable

But is it still a blocker?

qzmenko’s picture

This is still a problem, but in our case it affects node migration.

We need to migrate ~2 million nodes. At the beginning of the migration, ~10 nodes per second are imported. After 50k imported nodes, the speed becomes ~2 nodes per second.

I tried changing the batch_size in the migration, but it did not affect the migration speed at first glance.

fjgarlin’s picture

I'm affected by this as well, in this case with a user migration (around 2 million users). Memory keeps creeping up.

I've tried different options with no luck. The latest thing I am trying comes from this article, which runs the migration in a loop using the limit option, as in the script it suggests.

This is currently running so I don't know the result yet. It's still not ideal, because when using the --limit option, it still tries to do some gathering of the previous runs.

For example, if I run drush migrate:import my_user_migration --limit 100, the output the first time would be

  Migration my_user_migration [100 inserted, 0 updated...]

But then, on the second run, drush migrate:import my_user_migration --limit 100, the output would be

  Migration my_user_migration [0 inserted, 0 updated...]
  Migration my_user_migration [100 inserted, 0 updated...]

Note the 0 inserted, 0 updated.

--

I even tried a postSave event subscriber where I'd clear some caches, but it still did not make a difference. This is what I tried:

      // '@config.factory', '@entity.memory_cache', '@entity_type.manager'

      $this->memoryCache->deleteAll();
      $this->configFactory->clearStaticCache();
      // Entity storage can blow up with caches so clear them out.
      foreach ($this->entityTypeManager->getDefinitions() as $id => $definition) {
        $this->entityTypeManager->getStorage($id)->resetCache();
      }

berdir’s picture

There might be some other module that keeps things in memory, due to post processing.

resetCache() is a persistent cache clear, so it's fairly expensive and adds costs on its own. It will not add anything useful on top of @entity.memory_cache->deleteAll(), which you do as well.

However, that can only clear the usage of those objects within the entity storage. If anything else holds on to these objects, they will remain in memory. It's pretty much impossible to say what that would be in your case; it would probably require some kind of profiling with xhprof, blackfire, or something like that. If it is specific to users, you could try to look for user presave/insert/update hook implementations.
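
The point about other code holding on to objects can be illustrated with a small, self-contained PHP sketch. It assumes nothing about Drupal itself: ToyMemoryCache is a made-up stand-in for an in-memory cache, and $held_elsewhere represents a hypothetical module that keeps its own reference to an entity.

```php
<?php

declare(strict_types=1);

/**
 * Toy stand-in for an in-memory entity cache (not the Drupal service).
 */
final class ToyMemoryCache {

  /** @var array<string, object> */
  private array $items = [];

  public function set(string $cid, object $item): void {
    $this->items[$cid] = $item;
  }

  public function deleteAll(): void {
    $this->items = [];
  }

}

$cache = new ToyMemoryCache();
$entity = new stdClass();
// A weak reference lets us observe the object without keeping it alive.
$probe = WeakReference::create($entity);

$cache->set('user:1', $entity);
// Hypothetical: some other code keeps its own reference to the entity.
$held_elsewhere = [$entity];
unset($entity);

// Clearing the cache drops only the cache's reference.
$cache->deleteAll();
$alive_after_cache_clear = $probe->get() !== NULL;

// Only when the last holder lets go can PHP free the object.
$held_elsewhere = [];
$alive_after_release = $probe->get() !== NULL;

var_dump($alive_after_cache_clear, $alive_after_release);
```

The first value is true and the second is false: clearing the cache alone frees nothing while another reference exists, which is why profiling is needed to find whatever is actually holding the objects.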

heddn’s picture

For a 2M user migration, I stripped down the user source plugin so it only pulls back uids. Then I moved the actual gathering of data into prepareRow(). It had an amazing effect on the speed and memory usage of the user migration. By default the user source does what is essentially a select * from users. What you want is something more like select uid from users.

fjgarlin’s picture

@heddn - this is the migration and plugin that I am using:
- Migration: https://git.drupalcode.org/project/drupalorg_migrate/-/blob/1.0.x/migrat...
- User plugin: https://git.drupalcode.org/project/drupalorg_migrate/-/blob/1.0.x/src/Pl...

So, your suggestion would be to override the User::query method to:

  public function query() {
    return $this->select('users', 'u')
      ->fields('u', ['uid'])
      ->condition('u.uid', 0, '>');
  }

Then in the prepareRow, do you do:
- A select * from users where uid=$uid
- And then several $row->setSourceProperty for each property?

I am going to try the above locally but wanted to also ask about the approach to make sure I understood you correctly.

fjgarlin’s picture

For what it's worth, I am not seeing any significant increase in speed after doing the above.

Before the change it was around 1100 records per minute
After the change it seems to be around 1150 records per minute

But this difference might just be the output number of the migration or me just looking a second late/early.

The code I did:

  public function query() {
    // Query by UID earlier to speed up queries.
    return $this->select('users', 'u')
      ->fields('u', ['uid'])
      ->condition('u.uid', 0, '>');
  }

  public function prepareRow(Row $row) {
    // Try to determine early if this row needs to be skipped.
    $prepare_row = SourcePluginBase::prepareRow($row);
    if ($prepare_row) {
      $uid = $row->getSourceProperty('uid');

      // Set all properties here as we only queried by UID earlier.
      $row_data = $this->select('users', 'u')
        ->fields('u')
        ->condition('u.uid', $uid)
        ->execute()
        ->fetchAssoc();
      foreach ($row_data as $field => $value) {
        $row->setSourceProperty($field, $value);
      }

      return parent::prepareRow($row);
    }

    return FALSE;
  }

heddn’s picture

Speed should be about the same, especially in the beginning of the migration. But by the time you get to the 1M row mark, your memory usage should be in a better place. That's where this alternative approach (which you outlined well) really starts to shine.

fjgarlin’s picture

Great. Thanks for the info.

I went ahead and committed the above here https://git.drupalcode.org/project/drupalorg_migrate/-/commit/044bdebd94... and I will trigger the full migration again and monitor things.

benjifisher’s picture

Status: Needs work » Postponed (maintainer needs more info)

This issue has been around for almost 10 years. Although several reliable users report running into this issue, no one has been able to provide steps to reproduce (STR) the problem. In Comment #13, @ultimike tried really hard to reproduce the problem just by creating and migrating 10K files, and did not see any evidence of a problem.

I am setting the status to Postponed (maintainer needs more information). If someone can provide STR, then we can un-postpone this issue.

Often, we set a time limit on this status and close the issue if there is no response. In this case, I think we should leave the issue open indefinitely. I think there are many useful comments, and open issues (even if they are postponed) are much more discoverable than closed issues.

Version: 11.x-dev » main

Drupal core is now using the main branch as the primary development branch. New developments and disruptive changes should now be targeted to the main branch.

Read more in the announcement.