The batch process which is executed from the Migrate UI (/admin/structure/migrate/manage/{migration_group}/migrations/{migration}/execute) causes the PRE_IMPORT and POST_IMPORT events to be triggered once for every batch slice. Those events (described in https://www.drupal.org/node/2544874) should in fact run only once for an entire migration.
If I understand correctly, \Drupal\migrate_tools\MigrateBatchExecutable::calculateBatchLimit causes the migration to be sliced into at most 100 parts. (Introduced in #2470882: Implement running migration processes through the UI.) This by itself should not be a problem, but \Drupal\migrate\MigrateExecutable::import() seems to treat each slice as a separate import. When I run my migration from the command line with Drush, this problem does not occur.
I'm not sure if this should be fixed in core or in migrate_tools, but I'm posting it in the migrate_tools queue because this doesn't happen with just Drupal core.
Comments
Comment #2
mstrelan commented

Further to this, I thought it might be possible to implement a POST_IMPORT listener and check the migration status via $event->getMigration()->getStatus(), however it is always STATUS_IMPORTING. There also doesn't seem to be a way to reliably check the processed, imported or updated count against the number of source rows, unless I'm missing something.
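For context, the approach described above would look roughly like the following event subscriber. This is a sketch only: the module name my_module and the class name are hypothetical, and, as noted, during a batched UI run getStatus() returns STATUS_IMPORTING for every slice, so this check cannot identify the final slice.

```php
<?php

namespace Drupal\my_module\EventSubscriber;

use Drupal\migrate\Event\MigrateEvents;
use Drupal\migrate\Event\MigrateImportEvent;
use Drupal\migrate\Plugin\MigrationInterface;
use Symfony\Component\EventDispatcher\EventSubscriberInterface;

/**
 * Listens to POST_IMPORT and inspects the migration status.
 */
class MyPostImportSubscriber implements EventSubscriberInterface {

  public static function getSubscribedEvents() {
    return [MigrateEvents::POST_IMPORT => 'onPostImport'];
  }

  public function onPostImport(MigrateImportEvent $event) {
    // In a batched UI run this is STATUS_IMPORTING for every slice, so it
    // cannot distinguish the last slice from any earlier one.
    if ($event->getMigration()->getStatus() === MigrationInterface::STATUS_IDLE) {
      // Per this comment, never reached from the batch UI.
    }
  }

}
```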
Comment #3
mikeryan

This isn't really a bug - the PRE_IMPORT and POST_IMPORT events are behaving as designed, triggered on every migration *execution*. For example, migrate_tools uses this post-import hook to report on the items processed during the run - if the hook only fired when the migration was "complete", then it wouldn't report anything unless those were the last 100 rows of the source.
It would also be problematic to make these hooks fire only when a migration has not been run at all, or only when it's completed, because there are scenarios where you can't get an accurate source count and thus can't know for certain whether it's "unstarted" or "complete".
For those scenarios where it can be done, though, you can get the effect yourself: in the PRE_IMPORT hook, get the ID map's processedCount() value - if it's 0, you can do your one-time-ever work. And in the POST_IMPORT hook, get both processedCount() and the source plugin's count() - if they're equal, everything has been processed and you can do your post-everything work.
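The check described above can be sketched as two listener callbacks; a minimal sketch, assuming core's migrate events and that the source is countable (function names here are illustrative, not part of any module):

```php
<?php

use Drupal\migrate\Event\MigrateImportEvent;

/**
 * PRE_IMPORT: run one-time setup only on the very first execution.
 */
function my_pre_import(MigrateImportEvent $event) {
  $migration = $event->getMigration();
  if ($migration->getIdMap()->processedCount() === 0) {
    // Nothing has been processed yet: do the one-time-ever work here.
  }
}

/**
 * POST_IMPORT: run teardown only once everything has been processed.
 */
function my_post_import(MigrateImportEvent $event) {
  $migration = $event->getMigration();
  $processed = $migration->getIdMap()->processedCount();
  // count() can be unreliable or -1 for uncountable sources (e.g. with
  // skip_count), which is the caveat noted above.
  $total = $migration->getSourcePlugin()->count();
  if ($total > 0 && $processed >= $total) {
    // Everything has been processed: do the post-everything work here.
  }
}
```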
Is this helpful?
Comment #4
mstrelan commented

Thanks for your reply. It sounds like it should be helpful, but doesn't work for me.
The $map_count increases with each batch slice until it reaches 989. The $source_count is always 2979.

Processed 3000 items (988 created, 1688 updated, 0 failed, 324 ignored) - done with 'my_migration'.
The 1688 updated is where there are duplicate rows in the source data. I don't have control of the source data, this is what the client wants to import. The 324 ignored is where there is missing data or it's skipped for various reasons.
On subsequent executions of the same migration where the data needs to be updated, the $map_count starts at 989 and doesn't change. I also have not tested what happens if there are new rows or missing rows in the spreadsheet. I believe the ID map retains the old values, so the counts would be unreliable.

What I really need is an event that's triggered when the batch is complete. Ideally this would also need to be fired when the Drush migration is completed so there is no difference between the Batch UI and Drush. I guess I could use hook_batch_alter() to override the finished callback, but then I'd also have to replicate the same for Drush somehow.
When drush mi --limit=100 is executed, how does it know where in the source file to start, and when to finish? Can I use the same logic? I've also tried to check $migration->getSourcePlugin()->next() from the post import event handler but it is always null.

Comment #5
mikeryan

The framework does depend on processing exactly *one* row for each source ID - if the data isn't deduped before (or within) the source plugin, then yes, you're going to have all sorts of trouble (especially anything involving the source counts). In this case, the numbers you're showing suggest that there are 2979 source rows, but only 989 distinct source IDs. The map table is indexed by source ID, so there's no way to track different source rows with identical IDs, and the processed count will never reach the source count. Things would run much more smoothly if you could, say, run the data through some sort of deduping preprocessor.
I take it this is a CSV/XML/JSON source? The source plugin starts reading from the beginning, but by default skips any rows whose ID is already recorded in the map table (this is what happens regardless of --limit, actually). Each row that wasn't already in the map table and thus gets processed is counted, and if you're using --limit it exits when the count reaches that point.
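The skip-and-limit behavior described above can be summarized in simplified pseudocode; this is illustrative only, not core's actual implementation, and the variable names and helper calls are invented for the sketch.

```php
<?php
// Simplified pseudocode of the behavior described above.
$processed = 0;
foreach ($source_rows as $row) {
  // Rows whose source ID is already recorded in the map table are
  // skipped (unless flagged for update) and do not count toward --limit.
  if ($id_map->getRowBySource($row->getSourceIdValues()) && !$needs_update) {
    continue;
  }
  import_row($row);
  // With --limit, exit once that many rows have been processed.
  if (++$processed >= $limit) {
    break;
  }
}
```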
P.S. Please be sure to reset the status to "Active" when giving more info on a "Postponed (maintainer needs more info)" issue - if I hadn't had time to respond upon email notification this morning, your response might very well have lingered for a while ("maintainer needs more info" issues are at the bottom of the list when triaging issues to address).
Comment #6
mstrelan commented

Thanks @mikeryan, this has helped me greatly. Sorry to OP for hijacking the thread, but it sounds like this works as designed.