Hello all, it’s time for the weekly migration subsystem meeting. The meeting will take place in slack in various threads
This meeting:
➤ Is for core migrate maintainers and developers and anybody else in the community with an interest in migrations
➤ Usually happens every Thursday and alternates between 1400 and 2100 UTC.
➤ Is done on the #migration channel in Drupal Slack (see www.drupal.org/slack for information).
➤ Happens in threads, which you can follow to be notified of new replies even if you don’t comment in the thread. You may also join the meeting later and participate asynchronously!
➤ Has a public agenda anyone can add to. See the parent issue for an idea of the typical agenda.
➤*Transcript will be exported and posted* to the agenda issue. For anonymous comments, start with a :bust_in_silhouette: emoji. To take a comment or thread off the record, start with a :no_entry_sign: emoji.
2021-07-01 - The hope is that most or all of the maintainers will attend. We will try to focus on longer-term goals than in the weekly meeting.
Add your items here.
| benjifisher |
@Arthur Deryckere (KenowaX): can you explain the problem? Is there an open issue for this? |
| Arthur Deryckere (KenowaX) |
There isn't one yet (that I have created anyways).The problem is rather simple : when working on large projects with a wide panel of migration yml files in the config folder, things get hard trying to stay organized.Here's a small example : |
| Arthur Deryckere (KenowaX) |
These are only migrations from a to dThere would be 3-4 screenshots like these. |
| Arthur Deryckere (KenowaX) |
There are migrations for 3 languages as well. Meaning I have to work with a big quantity of migrations.Collections would enable me to place these inside specific folders like so : |
| Arthur Deryckere (KenowaX) |
image.png |
| Arthur Deryckere (KenowaX) |
Which correspond to the config collections in database :migrations.product_feeds.en , fr and nl respectively. |
| Arthur Deryckere (KenowaX) |
Migrate uses the default "" (empty) collection for configuration files when communicating with the database. |
| Arthur Deryckere (KenowaX) |
Would be nice if we could have a migrate collection by default (maybe ?) and allow for custom subdirectories to organize stuff. |
| benjifisher |
Drupal core does not support using config entities to define migrations. That functionality is provided by the migrate_plus module. |
| Arthur Deryckere (KenowaX) |
Oh, so migrate uses custom modules config for this ?I thought it would support both. |
| benjifisher |
An alternative is to define your migrations using YAML files under mymodule/migrations/. If you do that, then @Matroskeen recently created https://www.drupal.org/project/migrate_scanner to handle pretty much exactly the problem you describe. |
| Arthur Deryckere (KenowaX) |
So, migrate doesn't store migration config files in the database by default. This is migrate_plus' doing ? |
| benjifisher |
Oh, so migrate uses custom modules config for this ?I am not sure what you are asking. By "migrate" do you mean the core migrate module? |
| Arthur Deryckere (KenowaX) |
Yes |
| Arthur Deryckere (KenowaX) |
You answered the question with your previous reply :mymodule/migrations/ |
| mikelutz (he/him) |
Correct, the config entities are defined by the migrate_plus module. |
| benjifisher |
Migrations are plugins. The core migrate module supports plugins defined by YAML files mymodule/migrations/*.yml. |
| Arthur Deryckere (KenowaX) |
But doesn't migrate core store the migration config in the database ? |
| benjifisher |
If you stop using migrate_plus.migration.* config entities to define your migrations, then the main thing you lose is migration groups. |
| benjifisher |
The plugin definitions (not config) are stored in the plugin cache. I think that is in the database, at least by default. |
| Arthur Deryckere (KenowaX) |
Yeah, I can't afford this as I use migration groups for checking the "migration type" with a custom module. |
| benjifisher |
What does that check involve? |
| Arthur Deryckere (KenowaX) |
Comparing the migration config to see if it's a product_feed or not. Now that I think about it, I could just place these in the module that handles the product_feeds and not the others... |
| Matroskeen |
If you’re referring to migration_group property, migrate plugins also support it.I’m using migration_group all the time for my custom migrations, that are located in migrations/ directory. (edited) |
| Arthur Deryckere (KenowaX) |
So, I guess for migrate, there is no use in handling config collections as it doesn't even use them.Migrate_Plus on the other hand could use this functionnality. |
| benjifisher |
I searched for "collection" in the migrate_plus issue queue and did not find anything. You could create a feature request. |
| benjifisher |
@Matroskeen, can you use migration groups to share configuration? I thought that was strictly a migrate_plus feature. |
| Arthur Deryckere (KenowaX) |
I'll look into this a bit further then I'll do a feature request, yes. |
| Matroskeen |
@benjifisher, not sure about configuration, I’m using it primary for Drush commands like import, rollback:drush mim --group=my_group (edited) |
| dinarcon |
Migration groups can be used to share any "configuration" defined in the migration plugin including those that are added by migrate_plus for configuration entities. Here is in example https://github.com/dinarcon/drupal-migrations-basic/blob/main/config/ins... |
| benjifisher |
Does that require migrate_plus to be installed? But it will work with migrations that are not defined by config entities? |
| mikelutz (he/him) |
Well, so standard core migration yamls can have any keys they want, technically. There’s no reason why they can’t have migration_group keys. |
| mikelutz (he/him) |
And the drush command in migrate_tools uses that key in deciding what to run, so you can use it to group migrations for drush. |
| mikelutz (he/him) |
At a glance, the shared configuration should work too on a yml defined core migration, provided you have migrate_plus installed and a migration_group config entity to match. |
| mikelutz (he/him) |
The group configuration is merged in inside of migrate_plus_migration_plugins_alter which should alter core plugins along with config entities. |
| dinarcon |
Right, migrate_plus needs to be installed, but it also works for core migration plugins. (edited) |
| benjifisher |
The DOM process plugins in migrate_plus already violate this. |
| benjifisher |
I was looking at MigrateProcessInterface and noticed * @return string|array * The newly transformed value. */ public function transform(...);(edited) |
| benjifisher |
If we want to allow other return types, then how do we indicate this? An extra method that provides some extra metadata? @marvil07 suggested using TypedData a long time ago. I need to read up on that ... |
| marvil07 |
Oh, I remember this :slightly_smiling_face:On that nedcamp I remember mentioning that using a proper interface would be a good idea instead of just string|array, and yes typed data may be a good start.I do not remember writing explicitely about this, but I may be wrong. |
| benjifisher |
Yesterday, @cornifex brought up #3048464: SubProcess migrate process plugin should throw exception on invalid input (no replies). I wonder if that is related. (I have not looked at the issue yet, just the title.) |
| marvil07 |
I guess the closest was the presentation note:Different parsers than DOMJust an FYI, my goto for HTML parsing has been querypath, it's especially good if you're dealing with old-school HTML (no
, etc.).‒ mikeryan on #2958281-7Make process plugins data extensible: use core typed data.
|
| marvil07 |
^ ok that is not fully related :disappointed: |
| mikelutz (he/him) |
So, my initial thought here is that string/array are kinda handled by the system. all plugins can process strings, and arrays may be multiples and parsed through. The more I think of it, it’s pretty ugly right now. There’s no restriction on what the original source is, but there is after it goes through one process plugin. But most plugins assume they are getting a string or an array. |
| marvil07 |
Process plugins return string|array. Should we change this?Back to the original question: yes, it sounds like something that can help standarize how data is passed across migrate process plugins. |
| mikelutz (he/him) |
What I’m starting to wonder is if we should have some means (not sure what yet, annotations or additional methods) that lets a plugin define it’s allowed inputs and outputs. That would let us validate a pipeline to ensure that a plugin isn’t taking in something it isn’t designed to handle. |
| marvil07 |
and for backward compatibility the interface should be able to convert into a string. |
| marvil07 |
@mikelutz (he/him) that may be useful, right now, I think that validation mainly happens internally on th dom plugins |
| marvil07 |
E.g. https://git.drupalcode.org/project/migrate_plus/-/blob/8.x-5.x/src/Plugi... |
| marvil07 |
if (!($value instanceof \DOMDocument)) { |
| mikelutz (he/him) |
It happens in a lot of plugins in core though. First line of the extract plugin isif (!is_array($value)) { throw new MigrateException('Input should be an array.');}It’s not the only one. |
| mikelutz (he/him) |
explode:if ($strict && !is_string($value)) { throw new MigrateException(sprintf('%s is not a string', var_export($value, TRUE)));} |
| marvil07 |
maybe that is a good first step, something like a validateValue() method added into the migrate process plugin interface |
| benjifisher |
Since we provide a base class for process plugins, the BC requirements are not so strict. It is possible to add to the interface. |
| benjifisher |
Is TypedData the right API to use for this? |
| marvil07 |
this 1st step (validation) is about runtime, but for core to evaluate the YAML without walking through data, declaring output type as Mike mentioned is also needed. |
| benjifisher |
What is the right way to declare types?annotationplugin definitionnew method(s) |
| marvil07 |
My initial though, IIRC, was about declaring a proper interface for the value passed around to/from process plugins. |
| mikelutz (he/him) |
Right, being able to call plugin->validate() before transform() would allow us to standardize error reporting around plugin validation inside the executable at least. I kinda prefer the annotation method. default to any/any, but make the annotations required in drupal 10 |
| benjifisher |
We already have an interface: MigrateProcessInterface. That does not let us change the types. Or do you mean adding methods like allowedInputTypes() and returnTypes() to the interface? |
| benjifisher |
Right, and the base validate() method would check the types specified (in the annotation). Then an implementation could override the method to perform additional checks if it needs to. |
| marvil07 |
The migrate process plugin interface is for the process plugin, I meant to use an interface like MigrateProcessValueInterface to pass around as input/output of proces plugins |
| marvil07 |
i.e. replace string|array with a tentative MigrateProcessValueInterface |
| benjifisher |
I see. Yes, the whole idea of using arrays and multiple() and the handle_multiples annotation seems awkward. |
| benjifisher |
I think we have the outline of a plan. Who wants to volunteer to create an issue? |
| marvil07 |
I even had some code about this on an old core checkout! |
| marvil07 |
This is why I though typed data may be a good fit, it supports the list handling from the base interface https://git.drupalcode.org/project/drupal/-/blob/9.2.x/core/lib/Drupal/C... |
| marvil07 |
Hi @berdir !Your opinion on this would be really welcomed :slightly_smiling_face:We are basically wondering if typed data may be a good fit to represent the data passed between migrate process plugins. |
| berdir |
hm, I'm not sure. I'd say the main use case of type data is to enrich it with metadata, complex structures that can be traversed in a generic way. e.g. you know if something is a list or a map and if a map, what kind of properties it has |
| berdir |
I don't really see the data that's passed around in migrate processing as a good fit for that. would add a lot of complexity and massive API change to get started... (edited) |
| benjifisher |
See the top of the thread: the current @return comment specifies strin|array, but there are already examples that return DOMDocument objects.This is useful. We want to find a way to allow it.So far, these examples are in contrib (migrate_plus), not core.Marco and I are responsible for violating the @return type. |
| benjifisher |
@Arthur Deryckere (KenowaX), are you looking for advice on how to use the existing features or are you suggesting an improvement to the system? |
| marvil07 |
In my experience, the timing is mainly related to what is being done, e.g. a usual performance improvement is tryign to depend less on network operations (e.g. download plugin, or stop getting the source from a remote url). |
| Arthur Deryckere (KenowaX) |
So, not really sure about this as I haven't been a developer very long. But I'm trying to get as many data rows imported as possible in as little time as possible and I'm trying to find ways to optimize this.So I'm looking for advice in how to achieve this.Also, I thought that maybe minimizing the number of round trips to the Database would be a good way to improve on the system. But as I said, haven't been doing this long and I don't really know how migrate handles this. I do believe that a trip is done per row.Also, in the same context : I once added in a custom module a way to add a message in the migration log per imported row. Removing this would reduce import time by up to a half. (So clearly had to find another way to do this. I ended up doing a single message at the end of the migration).These are the kinds of examples I'm looking for, on how to or how not to import to improve performance. (edited) |
| Joshua Turton (srjosh) |
String operations - anything using DOM, or any Php string function - take a while, especially when multiplied by numerous rows. Sometimes you can work around that by doing some level of pre-migration or post-migration step, but sometimes you just gotta bite the bullet and take the time. |
| mikelutz (he/him) |
So, In the typical use case, migrations are run once to import data from a different site initialize a Drupal site, or they are run in the background via cron on a semi regular basis to manage and update data. Because nothing in the migration system is render blocking, there isn’t as much of a need to focus on performance as there might be in a part of Drupal that is executed during a page load. Additionally, as mentioned above, typically the bottlenecks on performance are external. Network time for sources, database write time for destinations, etc. In a file migration that take 4 hours, 3.8 hours are taken by transferring files, for example. |
| Joshua Turton (srjosh) |
There's also the age-old solution - throw more hardware at it. I've had instances where we had someone on the team spin up a monster-sized AWS instance and ran migrations there, instead of on my laptop or the Pantheon env. It can be an economical fix as long as you remember to turn it off when you're done. |
| mikelutz (he/him) |
Typically optimizations on migrations are around ensuring that you aren’t processing rows that you don’t need to on incremental migrations. If you are just importing rows once and going to be done with it, there isn’t much point in trying to optimize it, you will spend more time doing that than you will save. |
| Joshua Turton (srjosh) |
... as long as it isn't crashing or timing out. |
| Arthur Deryckere (KenowaX) |
In my use case, were talking about near a million data rows to be updated daily.I've already set up track_changes in a way that helps me overcome a lot of time. (We went from 3 days to import to 11 hours).Now I'm looking into a way to optimize smaller details.For instance, I've noticed that, for anyone using migrate_source_csv, the source plugin loads a CSV file twice because the CsvObjectFile class is missing a \Countable implementation. That being done, you're already halving the load time for the file as it is not done once but twice. (I know this isn't to do with Migrate Core but I'm looking at this from a large viewpoint).Another thing I noticed is that migrate_plus loads all the configurations from the database. This isn't necessary as only migration configuration files should be loaded (hence, the config collection post had earlier).These are all the small details that can be optimized for better overall performance.@Joshua Turton (srjosh) do you have a concrete example of string php functions ? And what about the case of arrays of strings ? Don't array functions also take a lot of time ? (edited) |
| Joshua Turton (srjosh) |
concrete as in, "I've benchmarked them"? No. Concrete as in, "These specific functions take a while"? Also no. But my general experience has been that anything using a regex or a search and replace is a taxing functionality. |
| Arthur Deryckere (KenowaX) |
Thx. :slightly_smiling_face: |
| Arthur Deryckere (KenowaX) |
Also, on the same page : in a general case, making one big sql call is better than lots of small ones.However, Migrate does one call per row.Couldn't we generate a query string for the entire migration and throw that at the database ? |
| Joshua Turton (srjosh) |
do you have enough RAM on your box to process a million-row dataset? |
| Joshua Turton (srjosh) |
I don't. |
| Arthur Deryckere (KenowaX) |
Doesn't change the fact that doing things 10k rows by 10k rows would be better than row by row. |
| Arthur Deryckere (KenowaX) |
Unless the database connection is kept open for the duration of the migration ? |
| mikelutz (he/him) |
What specifically do you mean when talking about one sql call per row? With what source are you referring? |
Arthur Deryckere (KenowaX), Abhinand Gokhala K, benjifisher, marvil07, Neslee, gaurav mahlawat, anmolgoyal74, mikelutz (he/him), Matroskeen, Joshua Turton (srjosh), quietone, dinarcon, Berdir
Comments
Comment #2
abhinand gokhala k commentedAttending!
Comment #3
rahul b commentedattended
Comment #13
quietone commentedComment #14
quietone commentedComment #15
quietone commentedComment #16
quietone commentedNo corrections needed - from the meeting after this.
Comment #18
gauravvvv commented