I have a fairly basic D6->D7 migration set up using migrate_d2d, and I'm having an issue with my node migrations where the migration runs twice. It completes the migration once, stops (or does something else, I can't tell what), and then re-runs with the same nodes, updating them. The final output from the drush mi command is

Processed 4 (2 created, 2 updated, 0 failed, 0 ignored) in 98.7 sec (2/min) - done with 'CNSArticle'           

when I've run it for two nodes using the --idlist parameter. It didn't do this with the user, taxonomy, and file migrations, and the node migrations are fairly straightforward, with no additions to or overrides of the query.

Here's the code to set up the migrations:

/**
 * Register all D6->D7 migrations.
 */
function cns_migrate_register_migrations() {
  // Define arguments used for all migrations.
  $common_arguments = array(
    'source_connection' => 'legacy',
    'source_version' => 6,
    'format_mappings' => array(
      1 => 'filtered_html',
      2 => 'full_html',
      3 => 'full_html',
      4 => 'full_html',
    ),
  );

  $node_arguments = array(
    array(
      'class_name' => 'CNSArticleMigration',
      'description' => t('Migration of CNS article nodes from Drupal 6'),
      'machine_name' => 'CNSArticle',
      'source_type' => 'article',
      'destination_type' => 'article',
      'dependencies' => array('Category', 'Tags'),
    ),
    array(
      'class_name' => 'APArticleMigration',
      'description' => t('Migration of AP article nodes from Drupal 6'),
      'machine_name' => 'APArticle',
      'source_type' => 'articleap',
      'destination_type' => 'articleap',
      'dependencies' => array('Category', 'Tags'),
    ),
  );
  // Tell the node migrations where the users are coming from, so they can
  // set up the dependency and resolve D6->D7 uids.
  $common_node_arguments = $common_arguments + array(
    'user_migration' => 'CNSUser',
    'group_name' => 'node',
  );
  foreach ($node_arguments as $arguments) {
    $arguments = array_merge_recursive($arguments, $common_node_arguments);
    MigrationBase::registerMigration($arguments['class_name'], $arguments['machine_name'],
      $arguments);
  }
} 

And the classes themselves:

class ArticleMigration extends DrupalNode6Migration {
  public function __construct(array $arguments) {
    parent::__construct($arguments);

    // Mappings for taxonomy terms. migrate_d2d exposes D6 term reference
    // source fields keyed by vocabulary ID: here vid 1 feeds
    // field_category and vid 2 feeds field_tags.
    $this->addFieldMapping('field_category', 1)
      ->sourceMigration('Category');
    $this->addFieldMapping('field_category:source_type')->defaultValue('tid');
    $this->addFieldMapping('field_tags', 2)
      ->sourceMigration('Tags');
    $this->addFieldMapping('field_tags:source_type')->defaultValue('tid');
    // Images multifield
    $this->addFieldMapping('field_images:field_image', 'field_image')
      ->sourceMigration('CNSFile');
    $this->addFieldMapping('field_images:field_image:file_class')
      ->defaultValue('MigrateFileFid');
    $this->addFieldMapping('field_images:field_image:preserve_files')
      ->defaultValue(TRUE);
  }

  public function prepare($entity, stdClass $row) {
    // Add source nid to Disqus field so that comments are mapped to new nodes.
    // Uses patch from https://www.drupal.org/node/1175668#comment-7847119.
    // nid 77900 is when Disqus comments were first used.
    if ($row->nid > 77900) {
      $entity->disqus = array(
        'domain' => variable_get('disqus_domain', ''),
        'url' => $entity->path['alias'],
        'title' => $entity->title,
        'did' => NULL,
        'developer' => 1,
      );

      $entity->disqus_identifier_auto = FALSE;
      $entity->disqus_status = TRUE;
      $entity->disqus_identifier = 'node/' . $row->nid;
    }
    else {
      // Pre Disqus, so we disable Disqus on the node.
      $entity->disqus_status = FALSE;
      $entity->disqus_identifier_auto = FALSE;
    }
  }
}

class CNSArticleMigration extends ArticleMigration {
  public function __construct(array $arguments) {
    parent::__construct($arguments);

    // Author
    $this->addFieldMapping('field_author', 'field_source')
      ->sourceMigration('CNSUser');
  }
}

class APArticleMigration extends ArticleMigration {
  public function __construct(array $arguments) {
    parent::__construct($arguments);

    // Author
    $this->addFieldMapping('field_author_ap', 'field_apauthor');
    // AP tracking fields.
    $this->addSimpleMappings(array('field_ap_mgmt_id', 'field_ap_mgmt_seq_num'));
  }
}

This also happens if I use the CNSArticleMigration and APArticleMigration classes directly instead of deriving them from the ArticleMigration base class. I also thought the multifield migrate handler (something I'm having a separate issue with) might be the cause, but removing it didn't make this problem go away.

I've googled around and haven't been able to find a similar issue. Anyone know what could be causing this? I'm not sure whether it's particular to migrate_d2d or to migrate itself, so I'm starting here.

Thanks.


Comments

wonder95’s picture

Issue summary: View changes

Correcting node migration class names in description.

mikeryan’s picture

Status: Active » Postponed (maintainer needs more info)

I don't see anything obviously wrong here. To be clear, you're doing something like

drush mi CNSArticle --idlist=123,456

to get the status of 2 created, 2 updated?

One thing that causes unexpected processing is stub creation - if there's a reference to an as-yet-unmigrated item, a stub will be created, and then when the target migration runs there's an update. I'm not seeing any such references in the above code, however.
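
For reference, the kind of mapping that triggers stubs looks like this (field and migration names here are made up for illustration; nothing like it appears in your code):

// Hypothetical reference mapping. If 'OtherArticle' hasn't migrated the
// referenced node yet, Migrate creates a stub node to satisfy the
// reference, and the later 'OtherArticle' run updates that stub - which
// shows up in the "updated" count.
$this->addFieldMapping('field_related_article', 'field_related_nid')
  ->sourceMigration('OtherArticle');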

Have you tried debugging to see exactly what's being processed? E.g., dump the incoming nid in prepareRow() to see exactly which source nodes are being processed, and the resulting nid in complete() to see what's being created/updated.
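
A quick sketch of what I mean, in your migration class (drush_print() is just one way to surface the values):

public function prepareRow($row) {
  // Let the parent class do its normal row preparation first.
  if (parent::prepareRow($row) === FALSE) {
    return FALSE;
  }
  drush_print('Processing source nid ' . $row->nid);
}

public function complete($entity, stdClass $row) {
  drush_print('Source nid ' . $row->nid . ' -> destination nid ' . $entity->nid);
}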

jcisio’s picture

Status: Postponed (maintainer needs more info) » Active

The problem is in MigrateSourceSQL::getNextRow(). performRewind() does a lot of work, such as filtering by idlist. But later, when getNextRow() runs out of records, it calls getNextBatch(), which applies none of those alterations, so the same IDs are found again.
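
Roughly, the flow is this (paraphrased, with property names approximated - not the exact code in sql.inc):

protected function getNextRow() {
  $row = $this->result->fetchObject();
  // Out of rows in the current batch: fetch the next batch and retry.
  if ($row === FALSE && $this->batchSize > 0) {
    $this->getNextBatch();
    $row = $this->result->fetchObject();
  }
  return $row;
}

protected function getNextBatch() {
  // Bug: this clones the pristine original query, without the idlist
  // (and other) alterations applied in performRewind(), so rows that
  // were already processed can be returned again.
  $query = clone $this->originalQuery;
  $query->range($this->batch * $this->batchSize, $this->batchSize);
  $this->result = $query->execute();
  $this->batch++;
}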

I think there are two places to fix:

- When there is an idlist, return as soon as num_processed == count(idlist).
- Refactor performRewind() and getNextBatch() to apply the same conditions. Or simply remove getNextBatch(); I don't see why it is necessary.

mikeryan’s picture

All right, looks like a bug in getNextBatch() then - as you've discovered, it's a relatively new feature which helps with very large queries.

mikeryan’s picture

Priority: Normal » Major

The problem (or at least a problem - see below) is that getNextBatch() clones the original query, from before the modifications made in performRewind(). So, given our batch size of 1000 and more than 1000 rows in the source:

1. performRewind() clones the original query (which pulls more than 1000 rows), and adds various stuff, including in this particular case an IN() clause that limits the results to just two rows.
2. The two rows are processed.
3. getNextRow() sees no new rows, and calls getNextBatch().
4. getNextBatch() clones the original query, without doing the alterations from performRewind(), and executes it.
5. We are now pulling from the original query starting with row 1001, and thus processing rows we shouldn't.

So... This is clearly wrong, but it doesn't lead to the same result as originally reported, so I'm not quite sure whether fixing getNextBatch() to use the modified query necessarily addresses the original problem. But I'll submit a patch for this, and you can try it.

Thanks.

mikeryan’s picture

Status: Active » Needs review
Issue tags: +Migrate 2.7

I did replicate the originally reported behavior, where the specified item was not in the first batch, and this patch works for me.
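
In outline (simplified - see the patch for the real thing): performRewind() saves a clone of the fully-altered query, and getNextBatch() then clones that saved query instead of the pristine original, so every batch carries the idlist and highwater alterations:

protected function getNextBatch() {
  // Clone the fixed-up query saved at the end of performRewind(),
  // rather than the original unaltered query.
  $query = clone $this->alteredQuery;
  $query->range($this->batch * $this->batchSize, $this->batchSize);
  $this->result = $query->execute();
  $this->batch++;
}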

jcisio’s picture

Version: 7.x-2.6-rc2 » 7.x-2.x-dev
+++ b/plugins/sources/sql.inc
@@ -356,6 +356,9 @@ class MigrateSourceSQL extends MigrateSource {
+    // Save our fixed-up query so getNextBatch() matches it.
+    $this->alteredQuery = clone $this->query();
+

Wouldn't it be clone $this->query? It works fine and fixes the issue.

In #3 I thought about another optimization: when an idlist is provided, we know exactly the maximum number of rows to be processed, because IDs are unique. When this number is reached, we could end the iteration. E.g. when there are 1,000,000 rows and the idlist contains 3 IDs, it avoids running the batch query 1,000 times. However, as I tested, because the query is filtered by the IDs, the execution time of each query is less than 1 ms, so it might not be worth complicating the logic.
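
For the record, that early exit would look something like this (untested sketch, never committed):

protected function getNextRow() {
  // With an idlist, IDs are unique, so once we have processed as many
  // rows as there are IDs, there is nothing left to find.
  if (!empty($this->idList) && $this->numProcessed >= count($this->idList)) {
    return NULL;
  }
  // ... otherwise fall through to the normal fetch / getNextBatch() logic.
}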

  • mikeryan committed 86a140f on 7.x-2.x authored by jcisio
    Issue #2403593 by mikeryan, jcisio: Migration runs twice, updates...
mikeryan’s picture

Committed, thanks.

mikeryan’s picture

Status: Needs review » Fixed
wonder95’s picture

This patch worked perfectly for me.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.