Hello all, it’s time for the weekly migration subsystem meeting. The meeting will take place in slack in various threads
This meeting:
➤ Is for core migrate maintainers and developers and anybody else in the community with an interest in migrations
➤ Usually happens every Thursday and alternates between 1400 and 2100 UTC.
➤ Is done on the #migration channel in Drupal Slack (see www.drupal.org/slack for information).
➤ Happens in threads, which you can follow to be notified of new replies even if you don’t comment in the thread. You may also join the meeting later and participate asynchronously!
➤ Has a public agenda anyone can add to. See the parent issue for an idea of the typical agenda.
➤*Transcript will be exported and posted* to the agenda issue. For anonymous comments, start with a :bust_in_silhouette: emoji. To take a comment or thread off the record, start with a :no_entry_sign: emoji.

Core migration issues

Next video meeting

2021-07-01 - The hope is that most or all of the maintainers will attend. We will try to focus on longer-term goals than in the weekly meeting.

Agenda Items

Add your items here.

0️⃣ Who is here today? What is the best movie you have seen recently?

Arthur Deryckere (KenowaX) Yo ! :smile:
Abhinand Gokhala K hi
benjifisher Welcome! I think this is the first time the two of you have joined this meeting. If you have specific questions, then add something to 1️⃣. If you are looking for ways to help, then 4️⃣ is for you.
marvil07 o/
Neslee Hello :wave:
gaurav mahlawat hello
anmolgoyal74 Hi
mikelutz (he/him) Hey all
Matroskeen Hello :wave:I’ve seen just a few movies recently, so I would probably pick “The Mauritanian”
Joshua Turton (srjosh) I'm here! I haven't seen any movies lately, but Wandavision kinda rocked my world.
quietone Vicki. The Two Popes

1️⃣ What should we talk about today? Suggest topics here and I will add threads. I will also check for comments on the issue for today's meeting.

Arthur Deryckere (KenowaX) Could we possibly talk about Migrate and config Collection compatibilities ?
Arthur Deryckere (KenowaX) Could we also talk about optimizing the time migrate takes to import (small and large scale data) ?
Arthur Deryckere (KenowaX) Woops, not enough time. Maybe next time. ^^"
benjifisher The hour is almost up, but I can open a thread and people can comment asynchronously.

2️⃣ Action items. To be added later.

3️⃣ Statistics

benjifisher Fixed since last week's meeting: 3 (not counting the issue for the meeting).
benjifisher RTBC: 6, 1 of which is Major.
benjifisher NR: 32, including 3 Major and 9 that have not been updated in more than three months.
benjifisher Google sheet for recording stats: https://docs.google.com/spreadsheets/d/1o0Rjlc1vnnLP5bM5P-SMMyGzqn7258hi...
benjifisher For the last few weeks, I have been focusing on #2571235: [regression] Roles should depend on objects that are building the granted permissions, which is not on our queue. It is now RTBC, so I will get back to reviewing migrations issues.
benjifisher Correction to RTBC: just 3, of which 2 are Major. I just moved one of them to NW because the patch no longer applies (and the testbot reports this but does not update the issue status) so now there are just 2 RTBC, one of which is Major.

4️⃣ How can I help? Comment in this thread if you are looking for ways to help. Give us some idea of what you would like to do: documentation, code review, testing, project management, ...

Abhinand Gokhala K @benjifisher How can I get the priority tickets for testing and reviewing?
benjifisher Search for issues with status: NR and component: migration system. https://www.drupal.org/project/issues/drupal?status=8&component=migratio...
benjifisher For me, the priorities are the Major issues and the ones that have not been updated in a long time. For someone just getting started, it makes sense to do the opposite.
benjifisher A recently created issue with just a few comments is easier to approach than a Major issue with 100+ comments.
Neslee I am interested in reviewing things, and will follow up on the above link
benjifisher Many issues relate to upgrades from Drupal 6/7. If you need advice on setting up for that, we can talk about it in this channel, either during the weekly meeting or any time (but with slower response times).
benjifisher NR issues need both review and testing (R and T in RTBC). If you can do either one, that is helpful. Detailed comments are even more helpful.
benjifisher For example, "I tested it and it works" is less helpful than a list of the steps you went through and the results you got.
quietone @Neslee Glad to see the interest in reviews. ping me anytime with your questions about review.

5️⃣ Migrate and config Collection compatibilities

benjifisher @Arthur Deryckere (KenowaX): can you explain the problem? Is there an open issue for this?
Arthur Deryckere (KenowaX) There isn't one yet (that I have created anyways).The problem is rather simple : when working on large projects with a wide panel of migration yml files in the config folder, things get hard trying to stay organized.Here's a small example :
Arthur Deryckere (KenowaX) These are only migrations from a to dThere would be 3-4 screenshots like these.
Arthur Deryckere (KenowaX) There are migrations for 3 languages as well. Meaning I have to work with a big quantity of migrations.Collections would enable me to place these inside specific folders like so :
Arthur Deryckere (KenowaX) image.png
Arthur Deryckere (KenowaX) Which correspond to the config collections in database :migrations.product_feeds.en , fr and nl respectively.
Arthur Deryckere (KenowaX) Migrate uses the default "" (empty) collection for configuration files when communicating with the database.
Arthur Deryckere (KenowaX) Would be nice if we could have a migrate collection by default (maybe ?) and allow for custom subdirectories to organize stuff.
benjifisher Drupal core does not support using config entities to define migrations. That functionality is provided by the migrate_plus module.
Arthur Deryckere (KenowaX) Oh, so migrate uses custom modules config for this ?I thought it would support both.
benjifisher An alternative is to define your migrations using YAML files under mymodule/migrations/. If you do that, then @Matroskeen recently created https://www.drupal.org/project/migrate_scanner to handle pretty much exactly the problem you describe.
Arthur Deryckere (KenowaX) So, migrate doesn't store migration config files in the database by default. This is migrate_plus' doing ?
benjifisher Oh, so migrate uses custom modules config for this ?I am not sure what you are asking. By "migrate" do you mean the core migrate module?
Arthur Deryckere (KenowaX) Yes
Arthur Deryckere (KenowaX) You answered the question with your previous reply :mymodule/migrations/
mikelutz (he/him) Correct, the config entities are defined by the migrate_plus module.
benjifisher Migrations are plugins. The core migrate module supports plugins defined by YAML files mymodule/migrations/*.yml.
Arthur Deryckere (KenowaX) But doesn't migrate core store the migration config in the database ?
benjifisher If you stop using migrate_plus.migration.* config entities to define your migrations, then the main thing you lose is migration groups.
benjifisher The plugin definitions (not config) are stored in the plugin cache. I think that is in the database, at least by default.
Arthur Deryckere (KenowaX) Yeah, I can't afford this as I use migration groups for checking the "migration type" with a custom module.
benjifisher What does that check involve?
Arthur Deryckere (KenowaX) Comparing the migration config to see if it's a product_feed or not. Now that I think about it, I could just place these in the module that handles the product_feeds and not the others...
Matroskeen If you’re referring to migration_group property, migrate plugins also support it.I’m using migration_group all the time for my custom migrations, that are located in migrations/ directory. (edited)
Arthur Deryckere (KenowaX) So, I guess for migrate, there is no use in handling config collections as it doesn't even use them.Migrate_Plus on the other hand could use this functionnality.
benjifisher I searched for "collection" in the migrate_plus issue queue and did not find anything. You could create a feature request.
benjifisher @Matroskeen, can you use migration groups to share configuration? I thought that was strictly a migrate_plus feature.
Arthur Deryckere (KenowaX) I'll look into this a bit further then I'll do a feature request, yes.
Matroskeen @benjifisher, not sure about configuration, I’m using it primary for Drush commands like import, rollback:drush mim --group=my_group (edited)
dinarcon Migration groups can be used to share any "configuration" defined in the migration plugin including those that are added by migrate_plus for configuration entities. Here is in example https://github.com/dinarcon/drupal-migrations-basic/blob/main/config/ins...
benjifisher Does that require migrate_plus to be installed? But it will work with migrations that are not defined by config entities?
mikelutz (he/him) Well, so standard core migration yamls can have any keys they want, technically.  There’s no reason why they can’t have migration_group keys.
mikelutz (he/him) And the drush command in migrate_tools uses that key in deciding what to run, so you can use it to group migrations for drush.
mikelutz (he/him) At a glance, the shared configuration should work too on a yml defined core migration, provided you have migrate_plus installed and a migration_group config entity to match.
mikelutz (he/him) The group configuration is merged in inside of migrate_plus_migration_plugins_alter which should alter core plugins along with config entities.
dinarcon Right, migrate_plus needs to be installed, but it also works for core migration plugins. (edited)

6️⃣ Process plugins return string|array. Should we change this?

benjifisher The DOM process plugins in migrate_plus already violate this.
benjifisher I was looking at MigrateProcessInterface and noticed * @return string|array * The newly transformed value. */ public function transform(...);(edited)
benjifisher If we want to allow other return types, then how do we indicate this? An extra method that provides some extra metadata? @marvil07 suggested using TypedData a long time ago. I need to read up on that ...
marvil07 Oh, I remember this :slightly_smiling_face:On that nedcamp I remember mentioning that using a proper interface would be a good idea instead of just string|array, and yes typed data may be a good start.I do not remember writing explicitely about this, but I may be wrong.
benjifisher Yesterday, @cornifex brought up #3048464: SubProcess migrate process plugin should throw exception on invalid input (no replies). I wonder if that is related. (I have not looked at the issue yet, just the title.)
marvil07 I guess the closest was the presentation note:Different parsers than DOMJust an FYI, my goto for HTML parsing has been querypath, it's especially good if you're dealing with old-school HTML (no

, etc.).‒ mikeryan on #2958281-7Make process plugins data extensible: use core typed data.

marvil07 ^ ok that is not fully related :disappointed:
mikelutz (he/him) So, my initial thought here is that string/array are kinda handled by the system.  all plugins can process strings, and arrays may be multiples and parsed through. The more I think of it, it’s pretty ugly right now.  There’s no restriction on what the original source is, but there is after it goes through one process plugin.  But most plugins assume they are getting a string or an array.
marvil07 Process plugins return string|array. Should we change this?Back to the original question: yes, it sounds like something that can help standarize how data is passed across migrate process plugins.
mikelutz (he/him) What I’m starting to wonder is if we should have some means (not sure what yet, annotations or additional methods) that lets a plugin define it’s allowed inputs and outputs.  That would let us validate a pipeline to ensure that a plugin isn’t taking in something it isn’t designed to handle.
marvil07 and for backward compatibility the interface should be able to convert into a string.
marvil07 @mikelutz (he/him) that may be useful, right now, I think that validation mainly happens internally on th dom plugins
marvil07 E.g. https://git.drupalcode.org/project/migrate_plus/-/blob/8.x-5.x/src/Plugi...
marvil07 if (!($value instanceof \DOMDocument)) {
mikelutz (he/him) It happens in a lot of plugins in core though. First line of the extract plugin isif (!is_array($value)) { throw new MigrateException('Input should be an array.');}It’s not the only one.
mikelutz (he/him) explode:if ($strict && !is_string($value)) { throw new MigrateException(sprintf('%s is not a string', var_export($value, TRUE)));}
marvil07 maybe that is a good first step, something like a validateValue() method added into the migrate process plugin interface
benjifisher Since we provide a base class for process plugins, the BC requirements are not so strict. It is possible to add to the interface.
benjifisher Is TypedData the right API to use for this?
marvil07 this 1st step (validation) is about runtime, but for core to evaluate the YAML without walking through data, declaring output type as Mike mentioned is also needed.
benjifisher What is the right way to declare types?annotationplugin definitionnew method(s)
marvil07 My initial though, IIRC, was about declaring a proper interface for the value passed around to/from process plugins.
mikelutz (he/him) Right, being able to call plugin->validate() before transform() would allow us to standardize error reporting around plugin validation inside the executable at least.  I kinda prefer the annotation method. default to any/any, but make the annotations required in drupal 10
benjifisher We already have an interface: MigrateProcessInterface. That does not let us change the types. Or do you mean adding methods like allowedInputTypes() and returnTypes() to the interface?
benjifisher Right, and the base validate() method would check the types specified (in the annotation). Then an implementation could override the method to perform additional checks if it needs to.
marvil07 The migrate process plugin interface is for the process plugin, I meant to use an interface like MigrateProcessValueInterface to pass around as input/output of proces plugins
marvil07 i.e. replace string|array with a tentative MigrateProcessValueInterface
benjifisher I see. Yes, the whole idea of using arrays and multiple() and the handle_multiples annotation seems awkward.
benjifisher I think we have the outline of a plan. Who wants to volunteer to create an issue?
marvil07 I even had some code about this on an old core checkout!
marvil07 This is why I though typed data may be a good fit, it supports the list handling from the base interface https://git.drupalcode.org/project/drupal/-/blob/9.2.x/core/lib/Drupal/C...
marvil07 Hi @berdir !Your opinion on this would be really welcomed :slightly_smiling_face:We are basically wondering if typed data may be a good fit to represent the data passed between migrate process plugins.
berdir hm, I'm not sure. I'd say the main use case of type data is to enrich it with metadata, complex structures that can be traversed in a generic way. e.g. you know if something is a list or a map and if a map, what kind of properties it has
berdir I don't really see the data that's passed around in migrate processing as a good fit for that. would add a lot of complexity and massive API change to get started... (edited)
benjifisher See the top of the thread: the current @return comment specifies strin|array, but there are already examples that return DOMDocument objects.This is useful. We want to find a way to allow it.So far, these examples are in contrib (migrate_plus), not core.Marco and I are responsible for violating the @return type.

7️⃣ optimizing the time migrate takes to import (small and large scale data)

benjifisher @Arthur Deryckere (KenowaX), are you looking for advice on how to use the existing features or are you suggesting an improvement to the system?
marvil07 In my experience, the timing is mainly related to what is being done, e.g. a usual performance improvement is tryign to depend less on network operations (e.g. download plugin, or stop getting the source from a remote url).
Arthur Deryckere (KenowaX) So, not really sure about this as I haven't been a developer very long. But I'm trying to get as many data rows imported as possible in as little time as possible and I'm trying to find ways to optimize this.So I'm looking for advice in how to achieve this.Also, I thought that maybe minimizing the number of round trips to the Database would be a good way to improve on the system. But as I said, haven't been doing this long and I don't really know how migrate handles this. I do believe that a trip is done per row.Also, in the same context : I once added in a custom module a way to add a message in the migration log per imported row. Removing this would reduce import time by up to a half. (So clearly had to find another way to do this. I ended up doing a single message at the end of the migration).These are the kinds of examples I'm looking for, on how to or how not to import to improve performance. (edited)
Joshua Turton (srjosh) String operations - anything using DOM, or any Php string function - take a while, especially when multiplied by numerous rows.  Sometimes you can work around that by doing some level of pre-migration or post-migration step, but sometimes you just gotta bite the bullet and take the time.
mikelutz (he/him) So, In the typical use case, migrations are run once to import data from a different site initialize a Drupal site, or they are run in the background via cron on a semi regular basis to manage and update data. Because nothing in the migration system is render blocking, there isn’t as much of a need to focus on performance as there might be in a part of Drupal that is executed during a page load. Additionally, as mentioned above, typically the bottlenecks on performance are external.  Network time for sources, database write time for destinations, etc. In a file migration that take 4 hours, 3.8 hours are taken by transferring files, for example.
Joshua Turton (srjosh) There's also the age-old solution - throw more hardware at it.  I've had instances where we had someone on the team spin up a monster-sized AWS instance and ran migrations there, instead of on my laptop or the Pantheon env.  It can be an economical fix as long as you remember to turn it off when you're done.
mikelutz (he/him) Typically optimizations on migrations are around ensuring that you aren’t processing rows that you don’t need to on incremental migrations. If you are just importing rows once and going to be done with it, there isn’t much point in trying to optimize it, you will spend more time doing that than you will save.
Joshua Turton (srjosh) ... as long as it isn't crashing or timing out.
Arthur Deryckere (KenowaX) In my use case, were talking about near a million data rows to be updated daily.I've already set up track_changes in a way that helps me overcome a lot of time. (We went from 3 days to import to 11 hours).Now I'm looking into a way to optimize smaller details.For instance, I've noticed that, for anyone using migrate_source_csv, the source plugin loads a CSV file twice because the CsvObjectFile class is missing a \Countable implementation. That being done, you're already halving the load time for the file as it is not done once but twice. (I know this isn't to do with Migrate Core but I'm looking at this from a large viewpoint).Another thing I noticed is that migrate_plus loads all the configurations from the database. This isn't necessary as only migration configuration files should be loaded (hence, the config collection post had earlier).These are all the small details that can be optimized for better overall performance.@Joshua Turton (srjosh) do you have a concrete example of string php functions ? And what about the case of arrays of strings ? Don't array functions also take a lot of time ? (edited)
Joshua Turton (srjosh) concrete as in, "I've benchmarked them"? No.  Concrete as in, "These specific functions take a while"? Also no. But my general experience has been that anything using a regex or a search and replace is a taxing functionality.
Arthur Deryckere (KenowaX) Thx. :slightly_smiling_face:
Arthur Deryckere (KenowaX) Also, on the same page : in a general case, making one big sql call is better than lots of small ones.However, Migrate does one call per row.Couldn't we generate a query string for the entire migration and throw that at the database ?
Joshua Turton (srjosh) do you have enough RAM on your box to process a million-row dataset?
Joshua Turton (srjosh) I don't.
Arthur Deryckere (KenowaX) Doesn't change the fact that doing things 10k rows by 10k rows would be better than row by row.
Arthur Deryckere (KenowaX) Unless the database connection is kept open for the duration of the migration ?
mikelutz (he/him) What specifically do you mean when talking about one sql call per row? With what source are you referring?

8️⃣ Wrap up

benjifisher Thanks for participating! I will update 2️⃣. Please continue to add comments in the threads. In 1-7 days, we will post a transcript for today's meeting.
marvil07 Thanks @benjifisher!

Participants:

Arthur Deryckere (KenowaX), Abhinand Gokhala K, benjifisher, marvil07, Neslee, gaurav mahlawat, anmolgoyal74, mikelutz (he/him), Matroskeen, Joshua Turton (srjosh), quietone, dinarcon, Berdir

Comments

quietone created an issue. See original summary.

abhinand gokhala k’s picture

Attending!

rahul b’s picture

attended

quietone credited KenowaX.

quietone credited marvil07.

quietone credited mikelutz.

quietone credited srjosh.

quietone’s picture

Issue summary: View changes
quietone’s picture

Issue summary: View changes
quietone’s picture

Status: Active » Needs review
quietone’s picture

Status: Needs review » Fixed

No corrections needed - from the meeting after this.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

gauravvvv’s picture