Problem/Motivation

When performing a large migration with millions of rows, the getFields() method of the Drupal\user\Plugin\migrate\source\d7\FieldableEntity class is not efficient.

For example, in the Drupal\user\Plugin\migrate\source\d7\User class, the following occurs in prepareRow().

    // Get Field API field values.
    foreach ($this->getFields('user') as $field_name => $field) {
      // Ensure we're using the right language if the entity and the field are
      // translatable.
      $field_language = $entity_translatable && $field['translatable'] ? $language : NULL;
      $row->setSourceProperty($field_name, $this->getFieldValues('user', $field_name, $uid, NULL, $field_language));
    }

If you have a users table with a million users, you might set a batch size of 10k during a migration. This would incur only 100 sql queries to fetch the million rows of data.

But, when prepareRow() is called for each row, another query will be fired by getFields().

For a table with a million rows, you will now incur 1 million sql queries to get the field instances.

Its one thing when it occurs in getFieldValues(), since the data will be different per row. (Though there are ways we could optimize that, too.)

But, in the case of fetching the fields for an entity type/bundle, it is idempotent and therefore will be the same for every entity type/bundle combination.

There is no need to incur this overhead on every invocation.

The cost is even higher when the DB is remote, given the latency of the connection.

Steps to reproduce

1. Run a migration...large or small.
2. Put a breakpoint in the getFields() method. (Or add a watchdog message. Or whatever is your preferred method to track the number of calls.)
3. Observe the number of times this method is called.

Proposed resolution

Add caching, using a class property, to \Drupal\migrate_drupal\Plugin\migrate\source\d7\FieldableEntity::getFields to reduce the cost of fetching the fields to basically once per entity type/bundle.

Note that the d6 source plugin, Node, is already using caching for the fields.

Issue fork drupal-3260276

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

TomTech created an issue. See original summary.

TomTech’s picture

Attached is a patch that implements a simple static cache, to reduce the overhead during migration.

cilefen’s picture

Status: Active » Needs review

Status: Needs review » Needs work

The last submitted patch, 2: optimize-get-fields-3260276-2.patch, failed testing. View results

TomTech’s picture

Status: Needs work » Needs review

erroneous ckeditor5 test initially indicated 1 failure. Tests have been re-run, and all now pass successfully.

mikelutz’s picture

Status: Needs review » Reviewed & tested by the community
Issue tags: +Portland2022

This seems fine. I was thinking about whether it would be a good idea to use a more robust caching system here, but I don't see any advantage, and a few disadvantages. Using drupal_static will get fresh data each process, so we don't have to worry about clearing it if we re-run migrations after changing the data. The amount of data should be reasonable, it scales based on the number of entity types and bundles which should be limited, so I don't see any memory issues arising. most importantly we aren't caching data in memory that might scale based on the number of rows or anything like that. I also don't think that this optimization requires any additional explicit tests. The fact that the entire migration suite of tests run and pass prove that it works.

quietone’s picture

Version: 9.4.x-dev » 10.0.x-dev

Adding tests for D10.

catch’s picture

Status: Reviewed & tested by the community » Needs work

This is a good find.

However, drupal_static() is on its way out, so we shouldn't introduce new usages: #1577902: [META] Remove all usages of drupal_static() & drupal_static_reset() .

This should either use a class property, or the memory cache backend (see discussion in #3047289: Standardize how we implement in-memory caches).

arunkumark’s picture

Status: Needs work » Needs review
FileSize
1.67 KB

As per comment #8, the patch has been re-rolled.

catch’s picture

Status: Needs review » Needs work

We strongly discourage static variables in classes - this should use a class property or the memory cache backend.

arunkumark’s picture

Status: Needs work » Needs review
FileSize
1.99 KB

Static variables have been replaced by class-protected variables.

arunkumark’s picture

quietone’s picture

Status: Needs review » Needs work
+++ b/core/modules/migrate_drupal/src/Plugin/migrate/source/d7/FieldableEntity.php
@@ -20,6 +20,14 @@
+   * Memory cache for cache items.
+   *
+   * An associative array of Cache values and its CID.

I think we can improve on this. How about just one summary line, Stores the field configuration data for each entity field.

There is also a blank line before and after the line that begins $cid = . For me it would be a bit easier to read is the blank line fore it was removed.

Should we do the same for D6 in the d6 node source plugin?

ravi.shankar’s picture

Status: Needs work » Needs review
FileSize
2.03 KB
1.02 KB

Here I have made changes as per comment #13, please review.

I think we should do the same for D6.

danflanagan8’s picture

Status: Needs review » Reviewed & tested by the community

Should we do the same for D6 in the d6 node source plugin?

I'm pretty green on D6 migration, but it looks like the D6 node source already does static caching for field definitions. In the d6 node source, getFieldInfo does the same job as FieldableEntity::getFields(). And getFieldInfo already uses a static cache using its $fieldInfo property. Here's the property definition:

  /**
   * Cached field and field instance definitions.
   *
   * @var array
   */
  protected $fieldInfo;

So I don't think there's any additional work to do on this issue. I think we should be happy with #14. If we were dying for more work we could change the variable name $cache to $fieldInfo to be consistent with what's already in the D6 node source, but I don't think that's a requirement. I'm happy enough to RTBC. If a migrate maintainer disagrees, though, I know the community would be happy to make a quick update. :)

alexpott’s picture

Status: Reviewed & tested by the community » Needs work
+++ b/core/modules/migrate_drupal/src/Plugin/migrate/source/d7/FieldableEntity.php
@@ -19,6 +19,13 @@
+  /**
+   * Stores the field configuration data for each entity field.
+   *
+   * @var array[]
+   */
+  protected $cache = [];

Given this is an abstract class and designed to be extended this name feels way too generic for me.

Let's call it $fieldNameCache instead. So it's a bit more descriptive about what it is.

It's tempting to not add this to $this but use a static variable in the function.

Also this is overriding \Drupal\migrate\Plugin\migrate\source\SourcePluginBase::$cache with a completely different type of thing so very interesting things could happen here.

ravi.shankar’s picture

Status: Needs work » Needs review
FileSize
1.83 KB
1.66 KB

Made changes as per comment #16, please review.

danflanagan8’s picture

Status: Needs review » Needs work

Per @catch in #11:

We strongly discourage static variables in classes - this should use a class property or the memory cache backend.

That makes me uncomfortable using a static variable here. Can we go back to using a protected property but use the new name?

And I would be inclined to use $fieldInfo as the name since that would match the D6 node source as described in #15. We might as well use a name that's already in use for doing this exact thing somewhere else.

jasonawant’s picture

Status: Needs work » Needs review
FileSize
2.01 KB
724 bytes
1.44 KB

Attached patch uses patch from 14 with changed protected property name to match that of the Drupal 6 node source plugin here

Two interdiffs for comparison.

danflanagan8’s picture

Thanks, @jasonawant!

It looks like you missed updating instances of $this->cache to $this->fieldInfo.

jasonawant’s picture

Assigned: Unassigned » jasonawant
Status: Needs review » Needs work
FileSize
2.02 KB
1.06 KB
1.45 KB

That I did! Thanks for calling that out. Here's an updated patch and interdiffs.

jasonawant’s picture

Assigned: jasonawant » Unassigned
Status: Needs work » Needs review
danflanagan8’s picture

Status: Needs review » Reviewed & tested by the community

That looks great, @jasonawant. And thanks for the thorough interdiffs as well as the link in #19. I'm going to throw this back to RTBC. Thanks!

quietone’s picture

Title: Migration FieldableEntity getFields is not efficient » Add static cache to Migration FieldableEntity::getFields

Changing title to state the change being made here.

quietone’s picture

Version: 10.0.x-dev » 11.x-dev
Issue summary: View changes

I am doing triage on the core RTBC queue.

The issue summary is clear. The proposed resolution is not correct, this is not using a static cache. It is a simple change so I have updated the issue summary. I read the comments and found that all points have been addressed.

  • catch committed 1e830175 on 11.x
    Issue #3260276 by arunkumark, jasonawant, ravi.shankar, TomTech,...
catch’s picture

Status: Reviewed & tested by the community » Fixed

Above I'd suggested class property or using memory cache, given this is a trait, using MemoryCache is less straightforward, so class property seems fine.

Committed/pushed to 11.x, thanks!

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.