Problem/Motivation

Follow-up from #2745797: Add option to content entity destinations for validation
Relevant discussion at https://youtu.be/9rBkBl4uGlg?t=1265

Discussion on whether we can make entity validation the new default in migrations, and how we can do that.

Proposed resolution

Remaining tasks

User interface changes

API changes

Data model changes

Release notes snippet

Comments

mikelutz created an issue. See original summary.

wim leers’s picture

Issue tags: +migrate-d7-d8
StatusFileSize
new924 bytes

Thanks for creating this, @mikelutz! I'll watch the recording ASAP.


I'm personally frightened by the fact that we've been running migrations without running entity validation. I'm very grateful #2745797: Add option to content entity destinations for validation happened. It's a step in the right direction.

I'd have loved to have seen a "enable entity validation for all the things" flag. It's true that setting that would likely break existing in-development migration projects. But I'm just starting to migrate wimleers.com as an experiment, so that would actually be welcome. Hence the attached patch that simply forces entity validation for all entities.

Where I'm coming from: migrating my own site from D7 to D8

After a number of date and filter fixes, the wimleers.com migration mostly works when using migrate_drupal_ui 🥳

Albeit that in quite a few places, I’m getting entity validation errors if I go to a node edit form and hit “save” without changing anything! 😨

Which confirms the need for us to be able to run with entity validation turned on.

So what happens when I migrate my own site with the attached patch applied?

The results are kinda frightening. It looks like migrate_drupal is not taking dependencies into account when generating and ordering migration tasks.

At the very least, path_alias should run the absolute latest (they’re entities now in 8.8, and hence have validation!), and taxonomy_term should run very early. Those two tweaks alone would fix a lot. I’m not sure a full dependency-based approach is feasible.

But in theory we could introspect the data model for each entity type/bundle and selectively migrate all entities within a given bundle together, and introspect their dependencies (i.e. things they could reference).

OTOH, if there's one thing that we actually have reasonable certainty about, it's that references continue to work. So I think it'd be interesting to have a validate everything except for \Drupal\Core\Entity\Plugin\Validation\Constraint\ValidReferenceConstraint mode.

That's the next thing I will do: investigate how/where that migration execution order is determined, first manually tweak it, and hopefully make it smart by using a dependency graph.

EDIT: and @mikelutz wrote: The core migrations were build on the assumption that we created the entity type/bundle on a previous step. It’s all very very flaky, I agree. — this gives me hope that we can make a leap forward here 😊It makes sense that we shipped with the simplest possible approach first, but now that all the foundations exist, I think we're finally in a position where we can do better!

mikelutz’s picture

Relevant slack discussion:

mikelutz @wimleers (he/him) I can see you and @gabesullice are going to be keeping me very busy, lol.
wimleers (he/him)
wimleers (he/him) This is just me in poking around mode
wimleers (he/him) And trying to migrate wimleers.com
mikelutz Yeah, that’s what scares me. What happens when you get into actual development mode, lol.
wimleers (he/him) Now trying to use https://www.drupal.org/node/3073707 and applying it to all entities, like so:diff --git a/core/modules/migrate/src/Plugin/migrate/destination/EntityContentBase.php b/core/modules/migrate/src/Plugin/migrate/destination/EntityContentBase.phpindex 3f8dae982f..9dd2741c1f 100644--- a/core/modules/migrate/src/Plugin/migrate/destination/EntityContentBase.php+++ b/core/modules/migrate/src/Plugin/migrate/destination/EntityContentBase.php@@ -167,7 +167,7 @@ public function import(Row $row, array $old_destination_id_values = []) { throw new MigrateException('Unable to get entity'); } assert($entity instanceof ContentEntityInterface);- if ($this->isEntityValidationRequired($entity)) {+ if (1||$this->isEntityValidationRequired($entity)) { $this->validateEntity($entity); } $ids = $this->save($entity, $old_destination_id_values);
wimleers (he/him) The results are pretty scary.
wimleers (he/him) It looks like migrate_drupal is not taking dependencies into account when performing migrations
mikelutz Well, about all we can take into account is migration dependencies and the declared required destination module.
wimleers (he/him) At the very least, path_alias should run the absolute latest (they’re entities now in 8.8, and hence have validation!), and taxonomy_term should run very early. Those two tweaks alone would fix a lot. I’m not sure a full dependency-based approach is feasible.
wimleers (he/him) No, we could introspect the data model for each entity type/bundle and selectively migrate per bundle
wimleers (he/him) @phenaproxima (he/him) ^^
mikelutz The core migrations were build on the assumption that we created the entity type/bundle on a previous step. It’s all very very flaky, I agree.
wimleers (he/him) But also a lot of potential for improvement, so yay
mradcliffe I ran my og migration code and was happy that it worked still.
wimleers (he/him) It’s easy for me to come in and complain, but it obviously took ENORMOUS effort to even get to this point. After a number of date and filter fixes, the wimleers.com migration works. Albeit that in quite a few places, I’m getting entity validation errors if I go to a node edit form and hit “save” without changing anything. Which confirms the need for us to be able to run with entity validation turned on.
mradcliffe action module migrations on the other hand keep blowing up on me so I finally created an issue for it - #3095427: Handle missing action plugin during migration gracefully
wimleers (he/him) @mradcliffe “still” related to… what? Having validations turned on?
mradcliffe Related to yay migrations
mradcliffe is running vbo deletions on hundreds of thousands of users on a test site so has a lot of time to read threads
mikelutz On the other hand, it goes to Heddn’s point. at least you have data, and you can adjust it on save. If we fail the migration because the data doesn’t validate, then you don’t have that. So if we require validation, then we need to find a way to massage the data such that it can pass validation, or else everybody is going to end up with migrations that don’t migrate anything.
megachriz I’ve migrated 2 sites so far (with one being very simple). With the slightly complex one I had almost no issues, except that I had a bunch of phantom users afterwards (with usernames like -1, -2, etc. and no mail addresses).
megachriz Be aware that validations can be disruptive in some cases, some validators validate against the current logged in user. The D7 version of Organic Groups does that for example.
wimleers (he/him) @mikelutz I buy the “at least you have your data” argument in that that is certainly better as not being able to migrate at all. BUT! It’d be much better still to get a sense of the % of validation errors that are being observed. What % of Term entities are affected? What % of Node entities? And do the % differ between different bundles? This allows for a much more informed migration, and reduces scary surprises after the fact. We can’t expect people to find the 2000 poorly migrated pieces of data in a pile of 1 million.
megachriz But otherwise I should say that validations should be on by default.
wimleers (he/him) +1 for the user context thing — I already ran into that
wimleers (he/him) Let’s continue this very fruitful discussion on the issue. I summarized what I’ve been doing, my observations, my attempts, and my next steps here: #3095456: [Plan] Discuss strategy for long-term use of entity validation in the migration system#comment-13359633
mikelutz I’ll copy this thread over.
megachriz It would be nice if validations against the current logged in user could be separated from say data type related validation errors (“this value should be null” for example).
wim leers’s picture

StatusFileSize
new2.63 KB

I'm still doing more digging, but I've got a very rough PoC patch working for this:

OTOH, if there's one thing that we actually have reasonable certainty about, it's that references continue to work. So I think it'd be interesting to have a validate everything except for \Drupal\Core\Entity\Plugin\Validation\Constraint\ValidReferenceConstraint mode.

Doing that cuts most of the noise away until we figure out a more migration execution order that respects dependencies (references).

I am finding lots of legitimate validation violations. Expect a big update tomorrow. :)

mikelutz’s picture

That is another reason we don't validate that I hadn't even thought about. The core migrations as ran through the UI preserve all entity ids across the board. So without validation running migrations or using stubs was unnecessary. Once the entire set of migrations ran, then all the references would validate.

I would like to see it all tightened up, though I also worry about circular references and other situations where it might be difficult to manage.

This is out of scope, but I'm just brainstorming here:

Core has no concept of migration groups, nor any concept of running several migrations together. It only supports migration relations as dependencies. Contrib provides the concept of groups, and a drush command to run a set of migration by group or tag, but under the hood, it still spins up a MigrateExecutable instance for each migrations and runs them one after another, completely independently (after using the dependency tree to sort them)

I wonder if it would help both with validation, and with the concept of allowing a dry run if core had some sense of migrations that were not independent, and were meant to be run together as a group. Some sort of Task runner that manages running a set of migrations and can do some stuff at the beginning and end of a group of runs. For the purposes of a dry run, it could keep track of the mapping data a dependent migration would need, and then make sure it's all cleared out at the end. For validation, it could keep track of saved entities and potentially validate them (particularly the references) at the end of the entire group run. OF course at that point the entities are already saved, so about all we could do is report, and maybe roll everything back.

I don't think it would be able to validate references during a dry-run, which makes the dry-run less useful (Although (and this is out of scope for my out of scope thought), maybe we could create a database connection with a new prefix, clone the tables, run the dry-run on THAT, and provide some sort of introspection, though there is a whole other can of worms. Still, to be useful it would require core having some concept of running more than one migration together, which it currently doesn't)

This could also be something done in contrib instead, but we lose the possibility of core doing a post migration group reference validation. Anyway, I'm just brainstorming long term ideas. They would be a lot of work, and I haven't fully thought through the practicality.

wim leers’s picture

I would like to see it all tightened up, though I also worry about circular references and other situations where it might be difficult to manage.

Yep, hence my proposal to exempt specifically entity references. We might be able to detect circular references potential by reasoning about the data model. But even if we can deduce that circular references are possible, we still need to be able to migrate. That's when #4 comes into play.

For validation, it could keep track of saved entities and potentially validate them (particularly the references) at the end of the entire group run.

Interesting. This risks being brittle too though. I think a simpler approach would be to track for which entities ValidReference constraints were not validated, and then validate all those entities again after the migration. Or simpler still, just run validation again after the full migration has been completed. If during migration all entities passed validation except for the ValidReference constraint and after migration all entities pass validation completely, that would allow us to safely conclude that all migrated entities are valid, including circular references :)

That approach does seem highly practical, at least to my new-to-migration-and-its-complexities eyes. Do you see problems in that approach?

wim leers’s picture

Based on my initial exploration work and extrapolating from migrating D7 wimleers.com to D8 (which itself was upgraded from D5), I think that gradually enabling entity validation should be doable.

The argument in #2745797: Add option to content entity destinations for validation was that we don't want to break existing migrations. That is true for custom migrations. But there's nothing stopping migrate_drupal in Drupal core from adding validate: true to all generated migrations. That would not affect custom migrations at all.

When @webchick attempted to migrate her blog about half a year ago, she ran into the problem that it's really hard to act based on the encountered errors. Migrate Drupal UI tells you to go to /admin/reports/dblog for the "detailed upgrade log". But that's literally just a log, often with incomplete information that is extremely hard to act on, even for developers.

I believe the first step is to surface migration messages per migration, which tends to also mean per entity type and if it's a bundleable entity type, per entity type + bundle. This makes it easier for the migrator to create a mental model of which things are failing and how that may have effects on depending migrations.

At that time, #2745797 did not land yet, so that did not include validation errors. But the patch here does enable entity validation upon migration for every entity. When applying the patch in #4 to migrate wimleers.com (along with #3095922: The comment "language" column in D7 might be empty after migrating from D6, this is invalid in D8 applied, which I would never have discovered until months or years after the migration to D8 and it'd then have caused obscure bugs), the super rough PoC UI for surfacing migration messages per migration, posted at #3063856-5: Add ability to view migrate_message table data, looks like this:

That gives you a good sense of what's going on. Lots of entity types have zero validation errors 🥳 Most of the other ones are related to uploaded files that are larger than what is allowed and/or have disallowed extensions (turns out this uncovered a problem in the destination environment, something that would also be excellent for #3061676: Create an audit plugin class/manager for migrations to check!). A few things point to invalid data on the source environment that could easily be fixed on the source environment (caused by validation not being strict enough in D6/D7, or the upgrade having been imperfect in the past). This would empower the migrator to take more targeted action (and so far you don't even have to be a developer).

(We haven't even mentioned #2969551: Migrate messages from caught exceptions need file and line details, but that would of course make #3063856 even more valuable. But that's not strictly related to this issue.)

A consequence of running migrations with entity validation turned on is that we need to become smarter about the order in which we run migrations. I'm happy to have been completely wrong in my assessment in #2: migrations already are run in order that respects dependencies! But there's some dependency information missing I think, but that only becomes apparent once you turn on entity validation. So that should be simple 🙂

So this is what I propose as next steps:

  1. Make migrate messages easier to analyze: #3063856: Add ability to view migrate_message table data
  2. Add the ability in migrate_drupal(_ui) to opt in to migrate with entity validation. It should excempt ValidReferenceConstraint like in #4, but then also run entity validation on all entities again post-migration, like #5 + #6 suggest. That would give us 100% confidence about the data in the destination site, and hence rule out any future weird bugs once running Drupal 8.
  3. Improve the existing migrations based on the information that migrations-with-validation-on uncover. For example #3095922: The comment "language" column in D7 might be empty after migrating from D6, this is invalid in D8, #3095146: Drupal 7 date fields configured to not collect the hour/minute/second granularities can have a different "settings" structure than the migration assumes, #3095195: Drupal 7 date fields configured to not collect the hour/minute/second granularities can have "00" MM or DD attributes, and more to follow.
  4. Similarly, we will probably find more things to audit for, which surfaces valuable use cases for #3061676: Create an audit plugin class/manager for migrations.
  5. Eventually we should be able to make migration with entity validation opt out instead of opt in.
heddn’s picture

Seems a lot of child issues about dependency ordering. Let's add dedicated test coverage about an expected order of running migrations as a first step.

xjm’s picture

Title: [Plan] Discuss strategy for long term use of entity validation in the migration system » [Plan] Discuss strategy for long-term use of entity validation in the migration system
Version: 9.0.x-dev » 9.1.x-dev

Version: 9.1.x-dev » 9.2.x-dev

Drupal 9.1.0-alpha1 will be released the week of October 19, 2020, which means new developments and disruptive changes should now be targeted for the 9.2.x-dev branch. For more information see the Drupal 9 minor version schedule and the Allowed changes during the Drupal 9 release cycle.

Version: 9.2.x-dev » 9.3.x-dev

Drupal 9.2.0-alpha1 will be released the week of May 3, 2021, which means new developments and disruptive changes should now be targeted for the 9.3.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.0-rc1 was released on November 26, 2021, which means new developments and disruptive changes should now be targeted for the 9.4.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.4.x-dev » 9.5.x-dev

Drupal 9.4.0-alpha1 was released on May 6, 2022, which means new developments and disruptive changes should now be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.5.x-dev » 10.1.x-dev

Drupal 9.5.0-beta2 and Drupal 10.0.0-beta2 were released on September 29, 2022, which means new developments and disruptive changes should now be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 10.1.x-dev » 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch, which currently accepts only minor-version allowed changes. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 11.x-dev » main

Drupal core is now using the main branch as the primary development branch. New developments and disruptive changes should now be targeted to the main branch.

Read more in the announcement.