Follow-up to #2824267: Highwater condition isn't added (on remote databases)

Problem/Motivation

per ohthehugemanatee in https://www.drupal.org/node/2824267#comment-11946115:

What I don't understand, is how does this work when the map is NOT joinable? If mapJoinable() returns FALSE, the query conditions look like: "(original conditions) AND (above high water)." How do we know which rows have NEEDS_UPDATE set?

For a detailed analysis, see #20 through #23.

Proposed resolution

Ideally, the highwater condition should not be added to the query when there are NEEDS_UPDATE rows present.

Do not add the highwater query condition if the map table is not joinable (if the map table in the destination site cannot be joined against the source table in the source site).

Remaining tasks

  1. Test coverage: #20 + #21
  2. Fix: #22 + #23
  3. Review

User interface changes

None.

API changes

None.

Data model changes

None.

CommentFileSizeAuthor
#58 2859314-58.patch11.51 KBWim Leers
#38 2859314-38.patch11.48 KBWim Leers
#38 interdiff.txt2.02 KBWim Leers
#33 2859314-33.patch10.8 KBWim Leers
#33 interdiff.txt2.79 KBWim Leers
#32 2859314-32.patch9.68 KBWim Leers
#32 interdiff.txt2.5 KBWim Leers
#30 2859314-30.patch7.18 KBWim Leers
#30 interdiff.txt3.55 KBWim Leers
#29 2859314-29.patch3.63 KBWim Leers
#29 interdiff.txt1.03 KBWim Leers
#23 2859314-23.patch3.72 KBWim Leers
#23 interdiff.txt816 bytesWim Leers
#21 2859314-21-test-only.patch2.92 KBWim Leers
#21 interdiff.txt2.28 KBWim Leers
#20 2859314-20-test-only.patch945 bytesWim Leers
#13 highwater_condition-2859314-13.patch887 bytesmikeryan
#9 Highwater_condition_with_unjoined_maps_skips_NEEDS_UPDATE_rows-2859314-09.patch952 bytesjmmarquez
#6 Highwater_condition_with_unjoined_maps_skips_NEEDS_UPDATE_rows-2859314-6.patch930 bytesgaurav.kapoor
#3 Highwater_condition_with_unjoined_maps_skips_NEEDS_UPDATE_rows-2859314-03.patch898 bytesjmmarquez
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

mikeryan created an issue. See original summary.

jmmarquez’s picture

I am working on it!!

jmmarquez’s picture

Status: Needs review » Needs work
jmmarquez’s picture

gaurav.kapoor’s picture

Status: Needs review » Needs work
gaurav.kapoor’s picture

oops!.My point is that $condition_added should have true value before if condition.

jmmarquez’s picture

jmmarquez’s picture

Status: Needs work » Needs review

Status: Needs review » Needs work
mikeryan’s picture

+++ b/core/modules/migrate/src/Plugin/migrate/source/SqlBase.php
@@ -270,7 +270,8 @@ protected function initializeIterator() {
+      $condition_added = $condition_added?false:true;

Note a couple of Drupal coding standards:

1. FALSE and TRUE should always be uppercase.
2. There need to be spaces around any operators (include ? and :).

That being said, I'm not sure what the intent is here - this is flipping the sense of $condition_added, so the value is actually the opposite of what it's supposed to represent?

mikeryan’s picture

Status: Needs work » Needs review
Issue tags: +Needs tests
FileSize
887 bytes

Could it be this simple?

At any rate, we'll need a test demonstrated the failure.

Status: Needs review » Needs work

The last submitted patch, 13: highwater_condition-2859314-13.patch, failed testing.

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.6 was released on August 2, 2017 and is the final full bugfix release for the Drupal 8.3.x series. Drupal 8.3.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.4.0 on October 4, 2017. (Drupal 8.4.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.4.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.4 was released on January 3, 2018 and is the final full bugfix release for the Drupal 8.4.x series. Drupal 8.4.x will not receive any further development aside from critical and security fixes. Sites should prepare to update to 8.5.0 on March 7, 2018. (Drupal 8.5.0-alpha1 is available for testing.)

Bug reports should be targeted against the 8.5.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.6 was released on August 1, 2018 and is the final bugfix release for the Drupal 8.5.x series. Drupal 8.5.x will not receive any further development aside from security fixes. Sites should prepare to update to 8.6.0 on September 5, 2018. (Drupal 8.6.0-rc1 is available for testing.)

Bug reports should be targeted against the 8.6.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.6.x-dev » 8.8.x-dev

Drupal 8.6.x will not receive any further development aside from security fixes. Bug reports should be targeted against the 8.8.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.9.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 8.8.x-dev » 8.9.x-dev

Drupal 8.8.7 was released on June 3, 2020 and is the final full bugfix release for the Drupal 8.8.x series. Drupal 8.8.x will not receive any further development aside from security fixes. Sites should prepare to update to Drupal 8.9.0 or Drupal 9.0.0 for ongoing support.

Bug reports should be targeted against the 8.9.x-dev branch from now on, and new development or disruptive changes should be targeted against the 9.1.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Wim Leers’s picture

I can confirm this bug.

I's been driving me maddeningly insane 😬😬😬

I've been observing this problem where there's two distinct MySQL DBs, on separate hosts. In that case, this code in \Drupal\migrate\Plugin\migrate\source\SqlBase::mapJoinable() returns FALSE:

    foreach (['username', 'password', 'host', 'port', 'namespace', 'driver'] as $key) {
      if (isset($source_database_options[$key]) && isset($id_map_database_options[$key])) {
        if ($id_map_database_options[$key] != $source_database_options[$key]) {
          return FALSE;
        }
      }
    }

I've been pulling my hair out why I could never reproduce this behavior locally. I first blamed server reliability. But then I reproduced the problem locally, because I had my source DB in SQLite and destination DB in MySQL. Which effectively hits this same edge case.

Of course, it makes total sense that it's impossible to join a source DB table against a migrate mapping DB table in another database. It's just that the migration system never informs you of this explicitly.

More importantly, this should only make things run more slowly, not incorrectly: it is still possible to iterate over every row in the source plugin, perform a query against the mapping table to check the migration mapping status, and thus figure out the correct answer using N cheap queries instead of 1 complex query (basically, JOINing in PHP rather than in SQL).


Writing test coverage for this is simple precisely thanks to the test that #2824267: Highwater condition isn't added (on remote databases) added. We can simply force \Drupal\migrate\Plugin\migrate\source\SqlBase::mapJoinable() to think it is dealing with a remote DB … which is exactly what #2824267: Highwater condition isn't added (on remote databases) aimed to test!

This simple simulation in the test coverage will make Drupal\Tests\migrate\Kernel\HighWaterTest::testHighWaterUpdate() fail, hence proving this bug in the process.

Wim Leers’s picture

Title: Highwater condition with unjoined maps skips NEEDS_UPDATE rows » Highwater condition with unjoined maps skips unprocessed and NEEDS_UPDATE rows
Issue tags: -Needs tests
FileSize
2.28 KB
2.92 KB

In my case, the problem I'm seeing more often is that the migration system, for whatever reason (I still suspect temporary server unreliability), some rows are unprocessed. Say you have rows 1 through 10,000, then rows 3453, 3454, 8434 and 9123 are unprocessed. Somehow, the migration system did not record these as ignored, skipped, failed — nothing. There's simply no corresponding entries for them in the mapping table. I've seen this happen several times now, and never consistently the same rows.

So I'm updating the title from Highwater condition with unjoined maps skips NEEDS_UPDATE rows to Highwater condition with unjoined maps skips unprocessed and NEEDS_UPDATE rows.

I fully realize that this probably points to a deeper rooted problem elsewhere. But that does not mean it's okay for the migration system to not make good on its promises: it is supposed to find unprocessed rows and process them, no matter for what reason they were not yet processed.

In my case, the crucial comment from ::mapJoinable():

      //    conditions in the query). So, ultimately the SQL condition will look
      //    like (original conditions) AND (map IS NULL OR map needs update
      //      OR above high water).

or, the essence of it:

(original conditions) AND (map IS NULL OR map needs update OR above high water)

This comment … well … it turns out to be a lie when using remote databases! 😨😳🙃 This took many hours of debugging 😬

You see, what actually happens on migration setups where the source and destination DBs are not joinable is this:

(original conditions) AND (above high water)

And since later rows are processed, and it's \Drupal\migrate\Plugin\migrate\source\SourcePluginBase::next() that updates the high water mark upon accessing the source, not upon saving the mapping, then any unprocessed rows with high water property values below the high water mark are simply never even considered — they're filtered away in the source plugin's query!

The attached patch's additional test coverage for HighWaterTest simulates this. (The test coverage addition in #20 already ensures that all test methods in HighWaterTest are simulated to use remote databases.)

Wim Leers’s picture

#13: This suggested change is insufficient.

Let me quote the crucial code from ::mapJoinable():

      //    conditions in the query). So, ultimately the SQL condition will look
      //    like (original conditions) AND (map IS NULL OR map needs update
      //      OR above high water).
      $conditions = $this->query->orConditionGroup();
      $condition_added = FALSE;
      $added_fields = [];
      if ($this->mapJoinable()) {
        …
      }
      // 2. If we are using high water marks, also include rows above the mark.
      //    But, include all rows if the high water mark is not set.
      if ($this->getHighWaterProperty()) {
        …
      }

IOW:

(original conditions) AND (map IS NULL OR map needs update OR above high water)

Your suggested simple change (changing that last if to elseif) causes that to change to two possible query shapes:

  1. if the map is joinable (first if-test evaluates to true):
    (original conditions) AND (map IS NULL OR map needs update)
    
  2. if the map is not joinable (first if-test evaluates to false):
    (original conditions) AND (above high water)
    

I hope I've clearly explained why this would be wrong.

Wim Leers’s picture

The simplest possible fix is this change:

      //    conditions in the query). So, ultimately the SQL condition will look
      //    like (original conditions) AND (map IS NULL OR map needs update
      //      OR above high water).
      $conditions = $this->query->orConditionGroup();
      $condition_added = FALSE;
      $added_fields = [];
      if ($this->mapJoinable()) {
        …
      }
      // 2. If we are using high water marks, also include rows above the mark.
      //    But, include all rows if the high water mark is not set.
-       if ($this->getHighWaterProperty()) {
+       if ($this->mapJoinable() && $this->getHighWaterProperty()) {
        …
      }

This means that only the following two query shapes are possible:

  1. (original conditions) AND (map IS NULL OR map needs update OR above high water)
    
  2. (original conditions)
    

Both are guaranteed to yield correct results (the passing tests will prove that). That being said, this will be less efficient for the case where there are no unprocessed or NEEDS_UPDATE rows in the source, and only net new rows in the source: in that scenario (and only that scenario!), (original conditions) AND (above high water) would have been perfectly fine.

However, correctness is always more important than performance. I propose leaving that optimization for a follow-up issue.

Wim Leers’s picture

However, correctness is always more important than performance. I propose leaving that optimization for a follow-up issue.

I am not sure if the migration system tracks enough metadata about a source DB to be able to know which rows are net new — only when the source table has an auto-incrementing integer primary key we can reasonably infer that.

In that case we could do (original conditions) AND (PK in source > MAX(PK) in map AND above high water). But we'd need to introspect the schema of the source and I'm not sure the migration system has ever done this?

The last submitted patch, 20: 2859314-20-test-only.patch, failed testing. View results

Wim Leers’s picture

Issue summary: View changes

The last submitted patch, 21: 2859314-21-test-only.patch, failed testing. View results

Status: Needs review » Needs work

The last submitted patch, 23: 2859314-23.patch, failed testing. View results

Wim Leers’s picture

Version: 8.9.x-dev » 9.2.x-dev
Status: Needs work » Needs review
FileSize
1.03 KB
3.63 KB

Looks like #23 introduced failures in:

  1. Testing Drupal\Tests\migrate\Kernel\HighWaterNotJoinableTest
    F..                                                                 3 / 3 (100%)
    
    Time: 00:02.744, Memory: 4.00 MB
    
    There was 1 failure:
    
    1) Drupal\Tests\migrate\Kernel\HighWaterNotJoinableTest::testSource with data set #0 (array(array(array(1, 'Item 1', 1), array(2, 'Item 2', 2), array(3, 'Item 3', 3))), array(array(2, 'Item 2', 2), array(3, 'Item 3', 3)), 3, array(array('changed')), 1)
    Value at 'array[0][id]' is not correct.
    Failed asserting that two strings are equal.
    --- Expected
    +++ Actual
    @@ @@
    -'2'
    +'1'
    
  2. Testing Drupal\Tests\migrate\Kernel\SqlBaseTest
    .FF                                                                 3 / 3 (100%)
    
    Time: 00:02.659, Memory: 4.00 MB
    
    There were 2 failures:
    
    1) Drupal\Tests\migrate\Kernel\SqlBaseTest::testHighWater
    Expectation failed for method name is "orderBy" when invoked at least once.
    Expected invocation at least once but it never occurred.
    
    /var/www/html/vendor/phpunit/phpunit/src/Framework/MockObject/Matcher.php:235
    /var/www/html/vendor/phpunit/phpunit/src/Framework/MockObject/InvocationHandler.php:174
    /var/www/html/vendor/phpunit/phpunit/src/Framework/MockObject/Api/Api.php:86
    /var/www/html/vendor/phpunit/phpunit/src/Framework/TestResult.php:722
    
    2) Drupal\Tests\migrate\Kernel\SqlBaseTest::testHighWater with data set "high-water value set" (33)
    Expectation failed for method name is "orderBy" when invoked at least once.
    Expected invocation at least once but it never occurred.
    
    /var/www/html/vendor/phpunit/phpunit/src/Framework/MockObject/Matcher.php:235
    /var/www/html/vendor/phpunit/phpunit/src/Framework/MockObject/InvocationHandler.php:174
    /var/www/html/vendor/phpunit/phpunit/src/Framework/MockObject/Api/Api.php:86
    /var/www/html/vendor/phpunit/phpunit/src/Framework/TestResult.php:722
    
  3. 1) Drupal\Tests\comment\Kernel\Plugin\migrate\source\d6\CommentSourceWithHighWaterTest::testSource with data set #0 (array(array(array(1, 0, 2, 3, 'subject value 1', 'comment value 1', 'hostname value 1', 1382255613, 0, '', '', '', '', 'testformat1', 'story'), array(2, 1, 3, 4, 'subject value 2', 'comment value 2', 'hostname value 2', 1382255662, 0, '', '', '', '', 'testformat2', 'page')), array(array(2, 'story', 'en'), array(3, 'page', 'fr'))), array(array(2, 1, 3, 4, 'subject value 2', 'comment value 2', 'hostname value 2', 1382255662, 1, '', '', '', '', 'testformat2', 'page', 'fr')), 2, array(array('timestamp')), 1382255613)
    Value at 'array[0][cid]' is not correct.
    Failed asserting that two strings are equal.
    --- Expected
    +++ Actual
    @@ @@
    -'2'
    +'1'
    

The failure in SqlBaseTest is legitimate: the fix in #23 can be refined slightly to continue to run the ORDER BY :high water property: without doing the WHERE :high_water_property: > SOME_INTEGER.

Fixed that here.

Also, since bug fixes must first land in 9.2.x, assigning that version instead.

Wim Leers’s picture

The failure in HighWaterNotJoinableTest is not a regression this patch introduces; it's this patch that uncovered the incorrect way that that test is testing things.

That test is explicitly setting a high water mark, but is not ensuring the relevant mapping rows exist!

And in the case of a migration already having run, it is the map table's contents that determines which source rows are considered "valid", that's literally what \Drupal\migrate\Plugin\migrate\source\SourcePluginBase::next() exists for: to implement \Iterator::next() which has this description:

    /**
     * Move forward to next element
     * @link https://php.net/manual/en/iterator.next.php
     * @return void Any returned value is ignored.
     * @since 5.0
     */
    public function next();

The current HighWaterNotJoinableTest does not test this correctly. In fact, it does not even test with an unjoinable map, even though that is literally its name.

So we have quite a bit of work to be done here to make this test actually test what it's supposed to test.

Wim Leers’s picture

Self-review of the #30 interdiff to explain it:

  1. +++ b/core/modules/migrate/tests/src/Kernel/HighWaterNotJoinableTest.php
    @@ -19,6 +24,46 @@ class HighWaterNotJoinableTest extends MigrateSqlSourceTestBase {
    +    // Ensure that \Drupal\migrate\Plugin\migrate\source\SqlBase::mapJoinable()
    +    // considers the ID map plugin not joinable.
    +    $connection = $this->prophesize(Connection::class);
    +    $connection->getConnectionOptions()->willReturn([
    +      'driver' => 'foo',
    +      'database' => 'bar',
    +    ]);
    

    This is what will ensure that \Drupal\migrate\Plugin\migrate\source\SqlBase::mapJoinable() will return FALSE: a foo DB is obviously not joinable against a sqlite DB.

  2. +++ b/core/modules/migrate/tests/src/Kernel/HighWaterNotJoinableTest.php
    @@ -19,6 +24,46 @@ class HighWaterNotJoinableTest extends MigrateSqlSourceTestBase {
    +    // Pretend the row with source ID 1 has already been imported; this is
    +    // reflected in the high water mark. We consider none of the other rows
    +    // imported, so those will return the empty array.
    +    $id_map->getRowBySource(Argument::exact(['id' => '1']))->willReturn([
    +      'source_row_status' => MigrateIdMapInterface::STATUS_IMPORTED,
    +    ]);
    +    $id_map->getRowBySource(Argument::any())->willReturn([]);
    

    This mocks the migration mapping so far, and corresponds to the high water mark that this test is already setting: only a single row has been migrated so far.

  3. +++ b/core/modules/migrate/tests/src/Kernel/HighWaterNotJoinableTest.php
    @@ -19,6 +24,46 @@ class HighWaterNotJoinableTest extends MigrateSqlSourceTestBase {
    +    // ::saveIdMapping() gets called when MigrateSourceTestBase::testSource()
    +    // re-iterates over all rows while intentionally triggering a skip exception
    +    // using migrate_skip_all_rows_test_migrate_prepare_row().
    +    $id_map->saveIdMapping(
    +      Argument::any(),
    +      Argument::any(),
    +      Argument::exact(MigrateIdMapInterface::STATUS_IGNORED)
    +    )->willReturn();
    

    This wouldn't be necessary if MigrateSourceTestBase::testSource() wasn't doing the explicit skip-row-testing-thing.

  4. +++ b/core/modules/migrate/tests/src/Kernel/HighWaterNotJoinableTest.php
    @@ -19,6 +24,46 @@ class HighWaterNotJoinableTest extends MigrateSqlSourceTestBase {
    +    // @todo This should not be necessary.
    +    $id_map->delete(Argument::any(), Argument::any())->willReturn();
    

    And this … well, this I believe is another bug in \Drupal\migrate\Plugin\migrate\source\SourcePluginBase::next(): why/how does it make sense to delete messages we've accumulated so far for this source row before we've even decided whether we're going to process this row again?! 🤯🙃

  5. +++ b/core/modules/migrate/tests/src/Kernel/MigrateSqlSourceTestBase.php
    @@ -23,7 +23,7 @@ protected function getDatabase(array $source_data) {
    -    $connection_options = ['database' => ':memory:'];
    +    $connection_options = ['driver' => 'sqlite', 'database' => ':memory:'];
    

    Without this, \Drupal\migrate\Plugin\migrate\source\SqlBase::mapJoinable() will still return TRUE — every DB connection is supposed to specify 'driver', but the absence of strict assertions in the DB abstraction layer made it more forgiving than it should be, and hence allowed this to work okay.

  6. +++ b/core/modules/migrate/tests/src/Kernel/MigrateSqlSourceTestBase.php
    @@ -73,6 +73,10 @@ protected function getDatabase(array $source_data) {
    +    if ($high_water) {
    +      \Drupal::keyValue('migrate:high_water')->set($this->migration->reveal()->id(), $high_water);
    +    }
    

    This is the value that \Drupal\migrate\Plugin\migrate\source\SourcePluginBase::__construct() attempts to look up and assigns to $this->originalHighWater.

Wim Leers’s picture

And finally, Drupal\Tests\comment\Kernel\Plugin\migrate\source\d6\CommentSourceWithHighWaterTest::testSource() fails for roughly the same reasons as Drupal\Tests\migrate\Kernel\HighWaterNotJoinableTest.

We can largely copy/paste the solution.

Wim Leers’s picture

+++ b/core/modules/comment/tests/src/Kernel/Plugin/migrate/source/d6/CommentSourceWithHighWaterTest.php
@@ -18,6 +22,46 @@ class CommentSourceWithHighWaterTest extends MigrateSqlSourceTestBase {
+    // @todo This should not be necessary.
+    $id_map->delete(Argument::any(), Argument::any())->willReturn();

+++ b/core/modules/migrate/tests/src/Kernel/HighWaterNotJoinableTest.php
@@ -19,6 +24,46 @@ class HighWaterNotJoinableTest extends MigrateSqlSourceTestBase {
+    // @todo This should not be necessary.
+    $id_map->delete(Argument::any(), Argument::any())->willReturn();

Per #31.4, I propose we remove the need for this by making SourcePluginBase::next() behave more logically.

The last submitted patch, 29: 2859314-29.patch, failed testing. View results

The last submitted patch, 30: 2859314-30.patch, failed testing. View results

The last submitted patch, 32: 2859314-32.patch, failed testing. View results

Status: Needs review » Needs work

The last submitted patch, 33: 2859314-33.patch, failed testing. View results

Wim Leers’s picture

#32 actually passed, the only problem was this:

Declaring ::setUp without a void return typehint

… even though the parent method already declares it exactly that way 🤷‍♀️

I figured that #33 would trigger more failures. That should be simple to fix. This is because \Drupal\migrate\Row's logic is tightly coupled to implementation details in \Drupal\migrate\Plugin\migrate\source\SourcePluginBase.

Wim Leers’s picture

GREEN! 🥳🤩 Now also testing against 9.1.x.

quietone’s picture

The change in #33 is in another issue which I cannot find. Noting it here hoping that I remember to find it.

@Wim Leers, glad to see action on highwater!

quietone’s picture

Found it, the change in #33 is the same as #3036368: No messages after the migration.. Not marking it as a duplicate because I want to determine if the change is made for the same reason.

quietone’s picture

Meant to add it as a related issue.

Wim Leers’s picture

@Wim Leers, glad to see action on highwater!

😊

Found it, the change in #33 is the same as #3036368: No messages after the migration.. Not marking it as a duplicate because I want to determine if the change is made for the same reason.

My proposed change in #33 is more efficient though: it doesn't delete the row if it did not exist yet. Not only is it more efficient, it requires fewer method invocations to be mocked.

I'm sure it's weird/puzzling why I use $row->changed() as a condition. I'm doing that to verify that the row indeed already has been migrated. Because we cannot rely on Row::getIdMap(): it never evaluates to FALSE due to \Drupal\migrate\Row::$idMap's default value, and the logic in Row's methods now is dependent on $idMap never being empty. This is wrong, but it's also the reality now.

(That's also why I had to do #38.)

quietone’s picture

Status: Needs review » Needs work

In #41 I didn't notice that the change included a change in the logic, I thought it was just moving a block of code. Thanks for pointing it out.

I set aside time to look at this today. And as I started to do so, so did a headache and them some very annoying machine noise outside. So this is not thorough, more like a random assortment of things I noticed.

  1. +++ b/core/modules/migrate/src/Plugin/migrate/source/SourcePluginBase.php
    @@ -382,6 +376,11 @@ public function next() {
    +        if ($row->changed() && !empty($this->currentSourceIds)) {
    

    Is testing for row changed necessary? Can there be more explanation here as to why the row is deleted here.

  2. +++ b/core/modules/migrate/tests/src/Kernel/HighWaterTest.php
    @@ -81,6 +83,20 @@ protected function setUp(): void {
    +  protected function prepareMigration(MigrationInterface $migration) {
    +    // Simulate that the source and destination DB are on different hosts and
    +    // are hence not joinable.
    +    // @see \Drupal\migrate\Plugin\migrate\source\SqlBase::mapJoinable()
    +    $source = $migration->getSourceConfiguration();
    +    if ($source['plugin'] === 'high_water_test') {
    

    This will run for each test in this class and all the migrations in this test class use the source plugin 'high_water_test'. Therefor this changes the conditions for all the tests in the file. Do we not want to retain testing when the map is joinable?

  3. +++ b/core/modules/migrate/tests/src/Kernel/HighWaterTest.php
    @@ -260,6 +276,49 @@ public function testHighWaterUpdate() {
    +    // Add two more item to the source database:
    

    I found this accurate but awkward to read. Maybe something like this? Change as you like.
    // Add two items to the source database, one that is below highwater and one
    // that is above hightwater. The below high water item simulates a source
    // row that failed to migrate for any reason. The above high water one
    // simulates a new row in the source.

  4. +++ b/core/modules/migrate/tests/src/Kernel/HighWaterTest.php
    @@ -260,6 +276,49 @@ public function testHighWaterUpdate() {
    +    // The previously unprocessed rows must have been detected and processed.
    

    I think this is referring to the two items added so let's use the same language used above when adding them. Maybe something like this?

    // The two new items should be detected and processed. Assert the above high water row first because this is more likely to pass tests. The assert order is specific to help with finding the root cause of any error.

  5. +++ b/core/modules/migrate/src/Row.php
    @@ -368,6 +368,15 @@ public function getMultiple(array $properties) {
    +    if (!array_key_exists('original_hash', $this->idMap)) {
    

    Please explain these changes.

Wim Leers’s picture

I set aside time to look at this today.

Thank you! 🙏😊

And as I started to do so, so did a headache and them some very annoying machine noise outside.

😞 yeah that can really destroy my ability to focus too…

#44:

  1. I predicted this would need explaining, so I explained it in #44. Happy to add a comment after you confirm this is the right way to check this.
  2. I agree we should retain that. Keeping "NW" for that.
  3. OMG the 4th word is just plain wrong 🙈Sorry about that.
    The below high water item simulates a source
    // row that failed to migrate for any reason. 
    

    But "failed" usually results in a state of "processed" but with a \Drupal\migrate\Plugin\MigrateIdMapInterface::STATUS_FAILED entry in the map. I'm explicitly saying that is not the case.

    So I'm concerned about this particular wording that you proposed. I completely agree it can and should be better, and I won't be satisfied until it's crystal clear/unambiguous to you too 😊

  4. 👍 Sounds good to me!
  5. They were added in #38, to address the many failures in #33.

    #33 has many failures like this:

    1) Drupal\Tests\migrate_drupal_ui\Functional\d6\Upgrade6Test::testMigrateUpgradeExecute
    Exception: Notice: Undefined index: original_hash
    Drupal\migrate\Row->changed()() (Line: 399)
    

    because these mocked return values are valid but trigger those exceptions:

    +++ b/core/modules/comment/tests/src/Kernel/Plugin/migrate/source/d6/CommentSourceWithHighWaterTest.php
    @@ -18,6 +22,44 @@ class CommentSourceWithHighWaterTest extends MigrateSqlSourceTestBase {
    +    $id_map->getRowBySource(Argument::exact(['cid' => '1']))->willReturn([
    +      'source_row_status' => MigrateIdMapInterface::STATUS_IMPORTED,
    +    ]);
    +    $id_map->getRowBySource(Argument::any())->willReturn([]);
    

    The problem here is that \Drupal\migrate\Row::changed() but also several other methods on that class (f.e. needsUpdate(), ::getHash(), et cetera) assume that \Drupal\migrate\Plugin\MigrateIdMapInterface::getRowBySource() returns at minimum ['original_hash' ⇒ SOMETHING, 'hash' ⇒ SOMETHING, 'source_row_status' ⇒ SOMETHING] aka:

    1. the columns for the migrate_map_* SQL table(s) that are NOT NULL
    2. \Drupal\migrate\Row::$idMap hardcodes this too, and even with a 'source_row_status' => MigrateIdMapInterface::STATUS_NEEDS_UPDATE, that does not make sense — a previously unseen row cannot have a STATUS_NEEDS_UPDATE entry in the migrate ID map plugin, can it?

    IOW: nothing in the \Drupal\migrate\Plugin\MigrateIdMapInterface::getRowBySource() interface guarantees those minimum key-value pairs, and \Drupal\migrate\Plugin\migrate\id_map\NullIdMap::getRowBySource() proves this. But so far nothing has been calling NullIdMap::getRowBySource() (or other alternative map plugins), which is how it went unnoticed all this time.

quietone credited DeFr.

quietone credited SylvainM.

quietone’s picture

Found a duplicate. Adding credit.

The patch in that issue solves the issue in much the same way. I do like that there is no initialization of the map row is setIdMap(). There are tests as well, which I haven't looked at. Maybe they will be helpful?

Finally, I would like to separate the moving of the block of code in SourcePluginBase because it is needed for other than highwater. Highwater is challenging enough. With that in mind I am going to work on #3036368: No messages after the migration.. Just need to figure how to test that.

Wim Leers’s picture

The patch in that issue solves the issue in much the same way.

Good :)

I do like that there is no initialization of the map row is setIdMap().

It's because that issue's patch does not have unit tests for the actual root cause: in the patch here, I'm mocking the map plugin to ensure that we're actually testing the source plugin as a unit.

Second, the tests that it does have do not test the interaction with the map plugin, because they have a custom source plugin @MigrateSource=high_water_test_not_joinable which simply overrides the mapJoinable() function to return FALSE.

Third, as described in #31.4, having mocked the map plugin caused me to stumble upon another bug: the fact that migration messages are deleted at a time/in a state where it does not make sense to do so. That's why that bug is being fixed here too as of #33. That's basically the #3036368: No messages after the migration. issue you linked.

There are tests as well, which I haven't looked at. Maybe they will be helpful?

Not really, see the above.

Wim Leers’s picture

Finally, I would like to separate the moving of the block of code in SourcePluginBase because it is needed for other than highwater. Highwater is challenging enough. With that in mind I am going to work on #3036368: No messages after the migration... Just need to figure how to test that.

Works for me to move that out of here. 😊 It would make this patch simpler for sure.

I've basically already written the test coverage for that here: the interdiff for #33 shows how you can test that. I removed it because that bug was literally a distraction for solving the bug in this issue. You want to change the prophecy that I removed there to basically test that it will never get called. So instead of this from #33:

+++ b/core/modules/comment/tests/src/Kernel/Plugin/migrate/source/d6/CommentSourceWithHighWaterTest.php
@@ -55,8 +55,6 @@ protected function setUp() {
-    // @todo This should not be necessary.
-    $id_map->delete(Argument::any(), Argument::any())->willReturn();

something like:

$id_map->delete(Argument::any(), Argument::any())
  ->willReturnCallback(function () {
    throw new \Exception('This should never be called.');
  }
);
DeFr’s picture

Chiming in from #2996274: Migrate SqlBase: Needs-update source_row_status ignored if mapJoinable() = FALSE and Highwater Marks is ON ; not sure those issues are duplicate, at least the widening of the scope in #21 + the fix means that at least in our case, the proposed fix doesn't work.

The case I needed a fix for is the one where the map *is* joinable (so mapJoinable() will return TRUE) but some external process changes the source_row_status in the migrate map to flag some items that would normally be below the high water mark as explicitely needing an update anyway. (Concretly, site contributors of the site the import is run on can flag imported content as needing to be updated, if they detect that something seems odd. In that case, the source row status is set to NEEDS_UPDATE, and the next import will re-import it).

quietone’s picture

@DeFr, yes, duplicate is isn't right and was premature. On further reflection, I have changed my mind (again) but am refraining from commenting until I review your comment and have a good night sleep.

DeFr’s picture

So, I took the time to re-read #2996274: Migrate SqlBase: Needs-update source_row_status ignored if mapJoinable() = FALSE and Highwater Marks is ON, this issue, SqlBase::initializeIterator and and the client mentionned in #52.

So first of all, what I mentionned above for the client in #52 is incorrect ; even though this client is running both import from the save SQL server, with explicitely set the map to be not joinable, to avoid the performance hit of that cross database join. So I think duplicate was the right call.

I'm still concerned about the proposed fix here, because it means you're going to iterate over every row the source. I'm not really convinced by the correctness over performance here, because if iterating over the whole result set, instantiating Row objects for all of them is acceptable, then migrate as is already provides a better suited alternative: the track_changes property. It's a bit slower still, because it means Migrate will compute need to serialize the row to compute its sha256 hash, but if you're shooting for correctness, then that's probably the right call. High water fields should only be used when you can't do that do to performance concerns. for context, the migrate source for that client is ~500k records, but in this huge amount of record, only ~50 are updated hourly, which is how often that specific migration is ran. With the highwater optimization as is, it means it'll only read 50 records from the source server, process them, and ran rather quickly ; going over the whole 500k records every hour on the other hand would be a huge performance killer. That's why the proposed patch in the duplicate issue only did that when at least one item was explicitly needing an update.

Going back to #21 and why some items might be unprocessed ; I think I know what's going on. Obviously only a theory at this point, I think what's going on here is that highwater fields should not only be monotonously increasing in the source, they also ideally should be unique, because SqlBase adds a > condition to the query. If you have a bunch of content added with exactly the same value for the highwater field and you run the migration in batch, if the batch splits the source items at the wrong time, you'll get some unprocessed items. ( This is likely to happen if you're importing data from a Drupal 7 website which itself get a bunch of data imported and use 'changed' as the high water property, most of the times all those items will get their changed property set to the timestamp that import was started). To work around this, for the client mentioned above, we're using a computed field as the high water, composed of changed + (nid / 1 000 000), assuming that there will never be more than 1 000 0000 items in the source.

Wim Leers’s picture

I'm still concerned about the proposed fix here, because it means you're going to iterate over every row the source.

Correct.

I'm not really convinced by the correctness over performance here,

🤯 Curious to read the rationale now! 😄

the track_changes property

track_changes doesn't fix the problem of the migration system having failed to process and not having written that lack of processing into the id_map plugin. This is what I described in #21. So far, it's been happening for big clients with lots of data, but without any useful hints, errors, log entries or whatnot being generated by the migration system.

With the highwater optimization as is, it means it'll only read 50 records from the source server, process them, and ran rather quickly ; going over the whole 500k records every hour on the other hand would be a huge performance killer.

This is fair; but note that this will only be bad for performance if and only if the source and destination DB are not joinable.

Obviously only a theory at this point, I think what's going on here is that highwater fields should not only be monotonously increasing in the source, they also ideally should be unique, because SqlBase adds a > condition to the query

This can't be the root cause. Not only because as it is, the migration system increments the high water mark whenever it reads from the source, not when it succesfully processes the row, not even because it blindly writes whatever the current high water mark is without checking if it is bigger or smaller, but simply because the migration system fails to track in the id_map plugin which rows were unprocessed. I know the migration system is catching exceptions and explicitly recording failures, but I am seeing repeated evidence that this is not guaranteed to work correctly.

If you have a bunch of content added with exactly the same value for the highwater field and you run the migration in batch

This is a great theory 👏🤩, but alas does not apply to what I've seen sadly 😞 — I've seen it happen literally the first and only time you run a migration, it still fails to process rows and does not record those failures in the id_map plugin.

DeFr’s picture

track_changes doesn't fix the problem of the migration system having failed to process and not having written that lack of processing into the id_map plugin

Are you sure about that ? That'd be really weird, because the original hash is stored in the id map ; in #21 you state that there's no record at all for those record in the id map, so I don't really see how the original hash would end up in there.

This is fair; but note that this will only be bad for performance if and only if the source and destination DB are not joinable.

Yes, but it's also the case where it's likely to affect you the most, because it means that you're probably querying a remote server, transfering all the data over will not be cheap.

It also might be a good reason to purposefully prevent the map from being joinable : getting a query that the SQL server will be able to easily optimize, instead of having to join all your records, can be a huge benefits boost if you're working with large databases.

I've seen it happen literally the first and only time you run a migration

Not sure exactly how you're running that migration, but the batch mentioned isn't incompatible with running that migration a single time ; you can get those batches either by specifying an explicit batch_size in the SQL Source or by simply running the migration on a system that has an explicit memory limit set, in which case Migrate executable will start a new batch when you're approaching the memory limit.

That new query will be ran with the last highwater value recordded, and can lead to records with the same highwater value being skipped.

Version: 9.2.x-dev » 9.3.x-dev

Drupal 9.2.0-alpha1 will be released the week of May 3, 2021, which means new developments and disruptive changes should now be targeted for the 9.3.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Wim Leers’s picture

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.0-rc1 was released on November 26, 2021, which means new developments and disruptive changes should now be targeted for the 9.4.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.4.x-dev » 9.5.x-dev

Drupal 9.4.0-alpha1 was released on May 6, 2022, which means new developments and disruptive changes should now be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.5.x-dev » 10.1.x-dev

Drupal 9.5.0-beta2 and Drupal 10.0.0-beta2 were released on September 29, 2022, which means new developments and disruptive changes should now be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 10.1.x-dev » 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch, which currently accepts only minor-version allowed changes. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.