Problem/Motivation
When performing a large migration with millions of rows, the getFields()
method of the Drupal\user\Plugin\migrate\source\d7\FieldableEntity
class is not efficient.
For example, in the Drupal\user\Plugin\migrate\source\d7\User
class, the following occurs in prepareRow()
.
// Get Field API field values.
foreach ($this->getFields('user') as $field_name => $field) {
// Ensure we're using the right language if the entity and the field are
// translatable.
$field_language = $entity_translatable && $field['translatable'] ? $language : NULL;
$row->setSourceProperty($field_name, $this->getFieldValues('user', $field_name, $uid, NULL, $field_language));
}
If you have a users table with a million users, you might set a batch size of 10k during a migration. This would incur only 100 sql queries to fetch the million rows of data.
But, when prepareRow()
is called for each row, another query will be fired by getFields()
.
For a table with a million rows, you will now incur 1 million sql queries to get the field instances.
Its one thing when it occurs in getFieldValues()
, since the data will be different per row. (Though there are ways we could optimize that, too.)
But, in the case of fetching the fields for an entity type/bundle, it is idempotent and therefore will be the same for every entity type/bundle combination.
There is no need to incur this overhead on every invocation.
The cost is even higher when the DB is remote, given the latency of the connection.
Steps to reproduce
1. Run a migration...large or small.
2. Put a breakpoint in the getFields() method. (Or add a watchdog message. Or whatever is your preferred method to track the number of calls.)
3. Observe the number of times this method is called.
Proposed resolution
Add caching, using a class property, to \Drupal\migrate_drupal\Plugin\migrate\source\d7\FieldableEntity::getFields to reduce the cost of fetching the fields to basically once per entity type/bundle.
Note that the d6 source plugin, Node, is already using caching for the fields.
Comment | File | Size | Author |
---|---|---|---|
#21 | interdiff-14-21.txt | 1.45 KB | jasonawant |
#21 | interdiff-19-21.txt | 1.06 KB | jasonawant |
#21 | 3260276-21.patch | 2.02 KB | jasonawant |
| |||
#11 | optimize-get-fields-3260276-11.patch | 1.99 KB | arunkumark |
#9 | optimize-get-fields-3260276-9.patch | 1.67 KB | arunkumark |
|
Issue fork drupal-3260276
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #2
TomTech CreditAttribution: TomTech commentedAttached is a patch that implements a simple static cache, to reduce the overhead during migration.
Comment #3
cilefen CreditAttribution: cilefen at Institute for Advanced Study commentedComment #5
TomTech CreditAttribution: TomTech commentederroneous ckeditor5 test initially indicated 1 failure. Tests have been re-run, and all now pass successfully.
Comment #6
mikelutzThis seems fine. I was thinking about whether it would be a good idea to use a more robust caching system here, but I don't see any advantage, and a few disadvantages. Using drupal_static will get fresh data each process, so we don't have to worry about clearing it if we re-run migrations after changing the data. The amount of data should be reasonable, it scales based on the number of entity types and bundles which should be limited, so I don't see any memory issues arising. most importantly we aren't caching data in memory that might scale based on the number of rows or anything like that. I also don't think that this optimization requires any additional explicit tests. The fact that the entire migration suite of tests run and pass prove that it works.
Comment #7
quietone CreditAttribution: quietone at PreviousNext commentedAdding tests for D10.
Comment #8
catchThis is a good find.
However, drupal_static() is on its way out, so we shouldn't introduce new usages: #1577902: [META] Remove all usages of drupal_static() & drupal_static_reset() .
This should either use a class property, or the memory cache backend (see discussion in #3047289: Standardize how we implement in-memory caches).
Comment #9
arunkumarkAs per comment #8, the patch has been re-rolled.
Comment #10
catchWe strongly discourage static variables in classes - this should use a class property or the memory cache backend.
Comment #11
arunkumarkStatic variables have been replaced by class-protected variables.
Comment #12
arunkumarkComment #13
quietone CreditAttribution: quietone at PreviousNext commentedI think we can improve on this. How about just one summary line,
Stores the field configuration data for each entity field.
There is also a blank line before and after the line that begins
$cid =
. For me it would be a bit easier to read is the blank line fore it was removed.Should we do the same for D6 in the d6 node source plugin?
Comment #14
ravi.shankar CreditAttribution: ravi.shankar at OpenSense Labs commentedHere I have made changes as per comment #13, please review.
I think we should do the same for D6.
Comment #15
danflanagan8I'm pretty green on D6 migration, but it looks like the D6 node source already does static caching for field definitions. In the d6 node source,
getFieldInfo
does the same job asFieldableEntity::getFields()
. And getFieldInfo already uses a static cache using its$fieldInfo
property. Here's the property definition:So I don't think there's any additional work to do on this issue. I think we should be happy with #14. If we were dying for more work we could change the variable name
$cache
to$fieldInfo
to be consistent with what's already in the D6 node source, but I don't think that's a requirement. I'm happy enough to RTBC. If a migrate maintainer disagrees, though, I know the community would be happy to make a quick update. :)Comment #16
alexpottGiven this is an abstract class and designed to be extended this name feels way too generic for me.
Let's call it $fieldNameCache instead. So it's a bit more descriptive about what it is.
It's tempting to not add this to $this but use a static variable in the function.
Also this is overriding \Drupal\migrate\Plugin\migrate\source\SourcePluginBase::$cache with a completely different type of thing so very interesting things could happen here.
Comment #17
ravi.shankar CreditAttribution: ravi.shankar at OpenSense Labs commentedMade changes as per comment #16, please review.
Comment #18
danflanagan8Per @catch in #11:
That makes me uncomfortable using a static variable here. Can we go back to using a protected property but use the new name?
And I would be inclined to use
$fieldInfo
as the name since that would match the D6 node source as described in #15. We might as well use a name that's already in use for doing this exact thing somewhere else.Comment #19
jasonawantAttached patch uses patch from 14 with changed protected property name to match that of the Drupal 6 node source plugin here
Two interdiffs for comparison.
Comment #20
danflanagan8Thanks, @jasonawant!
It looks like you missed updating instances of
$this->cache
to$this->fieldInfo
.Comment #21
jasonawantThat I did! Thanks for calling that out. Here's an updated patch and interdiffs.
Comment #22
jasonawantComment #23
danflanagan8That looks great, @jasonawant. And thanks for the thorough interdiffs as well as the link in #19. I'm going to throw this back to RTBC. Thanks!
Comment #24
quietone CreditAttribution: quietone at PreviousNext commentedChanging title to state the change being made here.
Comment #25
quietone CreditAttribution: quietone at PreviousNext commentedI am doing triage on the core RTBC queue.
The issue summary is clear. The proposed resolution is not correct, this is not using a static cache. It is a simple change so I have updated the issue summary. I read the comments and found that all points have been addressed.
Comment #27
catchAbove I'd suggested class property or using memory cache, given this is a trait, using MemoryCache is less straightforward, so class property seems fine.
Committed/pushed to 11.x, thanks!