Follow-up to #2824267: Highwater condition isn't added (on remote databases)
Problem/Motivation
per ohthehugemanatee in https://www.drupal.org/node/2824267#comment-11946115:
What I don't understand, is how does this work when the map is NOT joinable? If mapJoinable() returns FALSE, the query conditions look like: "(original conditions) AND (above high water)." How do we know which rows have NEEDS_UPDATE set?
For a detailed analysis, see #20 through #23.
Proposed resolution
Ideally, the highwater condition should not be added to the query when there are NEEDS_UPDATE rows present.
Do not add the highwater query condition if the map table is not joinable (if the map table in the destination site cannot be joined against the source table in the source site).
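The proposed guard can be sketched as follows (illustrative Python, not the actual Drupal code — in reality the condition-building happens inside `SqlBase::initializeIterator()`, and the names here are made up):

```python
def source_query_conditions(map_joinable, high_water_mark, base_conditions):
    """Sketch of the proposed resolution: only narrow the source query by
    the high water mark when the map table can be JOINed. When it cannot,
    every source row must stay visible so that unprocessed and NEEDS_UPDATE
    rows can still be found by checking the map table row by row."""
    conditions = list(base_conditions)
    if map_joinable and high_water_mark is not None:
        # Safe only because the JOIN against the map table keeps
        # unmigrated/NEEDS_UPDATE rows in the result set.
        conditions.append("high_water_property > %d" % high_water_mark)
    return conditions
```

With `map_joinable=False` the query keeps only the original conditions, which is exactly what this issue proposes.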
Remaining tasks
User interface changes
None.
API changes
None.
Data model changes
None.
| Comment | File | Size | Author |
|---|---|---|---|
| #58 | 2859314-58.patch | 11.51 KB | Wim Leers |
| #38 | 2859314-38.patch | 11.48 KB | Wim Leers |
| #38 | interdiff.txt | 2.02 KB | Wim Leers |
| #33 | 2859314-33.patch | 10.8 KB | Wim Leers |
| #33 | interdiff.txt | 2.79 KB | Wim Leers |
Comments
Comment #2
jmmarquez CreditAttribution: jmmarquez commented
I am working on it!
Comment #3
jmmarquez CreditAttribution: jmmarquez commented
Uploading patch.
Comment #5
jmmarquez CreditAttribution: jmmarquez at La Drupalera by Emergya for La Drupalera by Emergya commented
Comment #6
gaurav.kapoor CreditAttribution: gaurav.kapoor at OpenSense Labs commented
Comment #8
gaurav.kapoor CreditAttribution: gaurav.kapoor at OpenSense Labs commented
Oops! My point is that $condition_added should be set to TRUE before the if condition.
Comment #9
jmmarquez CreditAttribution: jmmarquez at La Drupalera by Emergya for La Drupalera by Emergya commented
Uploading the patch again.
Comment #10
jmmarquez CreditAttribution: jmmarquez at La Drupalera by Emergya for La Drupalera by Emergya commented
Comment #12
mikeryan
Note a couple of Drupal coding standards:
1. FALSE and TRUE should always be uppercase.
2. There need to be spaces around any operators (including ? and :).
That being said, I'm not sure what the intent is here - this is flipping the sense of $condition_added, so the value is actually the opposite of what it's supposed to represent?
Comment #13
mikeryan
Could it be this simple?
At any rate, we'll need a test demonstrating the failure.
Comment #20
Wim Leers
I can confirm this bug.
It's been driving me maddeningly insane 😬😬😬
I've been observing this problem in a setup with two distinct MySQL DBs, on separate hosts. In that case, this code in
\Drupal\migrate\Plugin\migrate\source\SqlBase::mapJoinable()
returns FALSE.
I'd been pulling my hair out over why I could never reproduce this behavior locally. I first blamed server reliability. But then I reproduced the problem locally, because I had my source DB in SQLite and my destination DB in MySQL, which effectively hits this same edge case.
Of course, it makes total sense that it's impossible to join a source DB table against a migrate mapping DB table in another database. It's just that the migration system never informs you of this explicitly.
More importantly, this should only make things run more slowly, not incorrectly: it is still possible to iterate over every row in the source plugin, perform a query against the mapping table to check the migration mapping status, and thus figure out the correct answer using N cheap queries instead of 1 complex query (basically, JOINing in PHP rather than in SQL).
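The "JOINing in PHP rather than in SQL" strategy described above can be illustrated with a small, framework-free Python/sqlite3 simulation (the schema, column names, and status values are simplified stand-ins for Drupal's actual source and migrate_map tables):

```python
import sqlite3

# Two separate connections simulate source and destination living in
# databases that cannot be JOINed against each other.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE source (id INTEGER PRIMARY KEY, changed INTEGER)")
src.executemany("INSERT INTO source VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])

dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE migrate_map (sourceid1 INTEGER PRIMARY KEY, source_row_status INTEGER)")
dst.execute("INSERT INTO migrate_map VALUES (1, 0)")  # only row 1 imported (0 = done)

def rows_to_process():
    """Iterate every source row; do one cheap map lookup per row instead of
    one complex cross-database JOIN (N cheap queries instead of 1)."""
    for sid, changed in src.execute("SELECT id, changed FROM source ORDER BY changed"):
        row = dst.execute(
            "SELECT source_row_status FROM migrate_map WHERE sourceid1 = ?", (sid,)
        ).fetchone()
        if row is None or row[0] == 1:  # unmigrated, or NEEDS_UPDATE (1)
            yield sid

print(list(rows_to_process()))  # rows 2 and 3 still need processing
```

Slower than a single JOIN, but it yields the correct answer regardless of where the map table lives.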
Writing test coverage for this is simple precisely thanks to the test that #2824267: Highwater condition isn't added (on remote databases) added. We can simply force
\Drupal\migrate\Plugin\migrate\source\SqlBase::mapJoinable()
to think it is dealing with a remote DB … which is exactly what #2824267: Highwater condition isn't added (on remote databases) aimed to test!
This simple simulation in the test coverage will make
Drupal\Tests\migrate\Kernel\HighWaterTest::testHighWaterUpdate()
fail, hence proving this bug in the process.
Comment #21
Wim Leers
In my case, the problem I'm seeing more often is that, for whatever reason (I still suspect temporary server unreliability), the migration system leaves some rows unprocessed. Say you have rows 1 through 10,000; then rows 3453, 3454, 8434 and 9123 are unprocessed. Somehow, the migration system did not record these as ignored, skipped or failed — nothing. There are simply no corresponding entries for them in the mapping table. I've seen this happen several times now, and never consistently the same rows.
So I'm updating the title from
to .
I fully realize that this probably points to a deeper-rooted problem elsewhere. But that does not mean it's okay for the migration system not to make good on its promises: it is supposed to find unprocessed rows and process them, no matter why they were not yet processed.
In my case, the crucial comment from
::mapJoinable()
:
or, the essence of it:
This comment … well … it turns out to be a lie when using remote databases! 😨😳🙃 This took many hours of debugging 😬
You see, what actually happens on migration setups where the source and destination DBs are not joinable is this:
And since later rows are processed, and it's
\Drupal\migrate\Plugin\migrate\source\SourcePluginBase::next()
that updates the high water mark upon accessing the source, not upon saving the mapping, any unprocessed rows with high water property values below the high water mark are simply never even considered — they're filtered away in the source plugin's query!
The attached patch's additional test coverage for
HighWaterTest
simulates this. (The test coverage addition in #20 already ensures that all test methods in
HighWaterTest
are simulated to use remote databases.)
Comment #22
Wim Leers
#13: This suggested change is insufficient.
Let me quote the crucial code from
::mapJoinable()
:
IOW:
Your suggested simple change (changing that last
if
to
elseif
) causes that to change to two possible query shapes:
I hope I've clearly explained why this would be wrong.
Comment #23
Wim Leers
The simplest possible fix is this change:
This means that only the following two query shapes are possible:
Both are guaranteed to yield correct results (the passing tests will prove that). That being said, this will be less efficient for the case where there are no unprocessed or NEEDS_UPDATE rows in the source, and only net new rows in the source: in that scenario (and only that scenario!),
(original conditions) AND (above high water)
would have been perfectly fine.
However, correctness is always more important than performance. I propose leaving that optimization for a follow-up issue.
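To see concretely why the unguarded shape loses rows when the map is not joinable, here is a small Python/sqlite3 simulation (the schema and names are illustrative, not Drupal's actual tables):

```python
import sqlite3

# Source table: three rows, ordered by their 'changed' (high water) value.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE source (id INTEGER PRIMARY KEY, changed INTEGER)")
db.executemany("INSERT INTO source VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])

migrated = {1, 3}      # map table contents; row 2 was silently left unprocessed
high_water_mark = 30   # advanced by next() as rows were *read*, not as they were saved

# Shape "(original conditions) AND (above high water)":
with_condition = [r[0] for r in db.execute(
    "SELECT id FROM source WHERE changed > ? ORDER BY changed", (high_water_mark,))]

# Shape "(original conditions)" only — all rows stay visible:
without_condition = [r[0] for r in db.execute("SELECT id FROM source ORDER BY changed")]

# With the condition, row 2 is filtered away before the map is ever consulted:
assert 2 not in with_condition
# Without it, iterating everything and checking the map recovers row 2:
assert [r for r in without_condition if r not in migrated] == [2]
```

Row 2's high water value (20) is below the mark (30), so only the unfiltered shape can ever find it.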
Comment #24
Wim Leers
I am not sure the migration system tracks enough metadata about a source DB to be able to know which rows are net new — only when the source table has an auto-incrementing integer primary key can we reasonably infer that.
In that case we could do
(original conditions) AND (PK in source > MAX(PK) in map AND above high water)
. But we'd need to introspect the schema of the source, and I'm not sure the migration system has ever done this?
Comment #26
Wim Leers
Comment #29
Wim Leers
Looks like #23 introduced failures in:
The failure in
SqlBaseTest
is legitimate: the fix in #23 can be refined slightly to continue to run the
ORDER BY :high_water_property:
without doing the
WHERE :high_water_property: > SOME_INTEGER
.
Fixed that here.
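That refinement can be sketched like so (illustrative Python, with made-up table and property names; the real logic lives in SqlBase):

```python
def build_source_sql(map_joinable, high_water_mark, hw_property="changed"):
    """Sketch of the refinement: the ORDER BY on the high water property is
    always kept (the mark must advance in source order), while the WHERE
    narrowing is applied only when the map table is joinable."""
    sql = "SELECT * FROM source_table"
    if map_joinable and high_water_mark is not None:
        sql += " WHERE %s > %d" % (hw_property, high_water_mark)
    sql += " ORDER BY %s" % hw_property
    return sql

print(build_source_sql(False, 30))  # → SELECT * FROM source_table ORDER BY changed
```

Keeping the ORDER BY is what lets the high water mark keep advancing correctly even when the WHERE clause is omitted.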
Also, since bug fixes must first land in
9.2.x
, assigning that version instead.
Comment #30
Wim Leers
The failure in
HighWaterNotJoinableTest
is not a regression this patch introduces; it's this patch that uncovered the incorrect way that test is testing things.
That test is explicitly setting a high water mark, but is not ensuring the relevant mapping rows exist!
And in the case of a migration having already run, it is the map table's contents that determine which source rows are considered "valid"; that's literally what
\Drupal\migrate\Plugin\migrate\source\SourcePluginBase::next()
exists for: to implement
\Iterator::next()
which has this description:
The current
HighWaterNotJoinableTest
does not test this correctly. In fact, it does not even test with an unjoinable map, even though that is literally its name.
So we have quite a bit of work to be done here to make this test actually test what it's supposed to test.
Comment #31
Wim Leers
Self-review of the #30 interdiff to explain it:
This is what will ensure that
\Drupal\migrate\Plugin\migrate\source\SqlBase::mapJoinable()
will return FALSE: a
foo
DB is obviously not joinable against a
sqlite
DB.
This mocks the migration mapping so far, and corresponds to the high water mark that this test is already setting: only a single row has been migrated so far.
This wouldn't be necessary if
MigrateSourceTestBase::testSource()
wasn't doing the explicit skip-row-testing thing.
And this … well, this I believe is another bug in
\Drupal\migrate\Plugin\migrate\source\SourcePluginBase::next()
: why/how does it make sense to delete messages we've accumulated so far for this source row before we've even decided whether we're going to process this row again?! 🤯🙃
Without this,
\Drupal\migrate\Plugin\migrate\source\SqlBase::mapJoinable()
will still return
TRUE
— every DB connection is supposed to specify
'driver'
, but the absence of strict assertions in the DB abstraction layer made it more forgiving than it should be, and hence allowed this to work okay.
This is the value that
\Drupal\migrate\Plugin\migrate\source\SourcePluginBase::__construct()
attempts to look up and assigns to
$this->originalHighWater
.
Comment #32
Wim Leers
And finally,
Drupal\Tests\comment\Kernel\Plugin\migrate\source\d6\CommentSourceWithHighWaterTest::testSource()
fails for roughly the same reasons as
Drupal\Tests\migrate\Kernel\HighWaterNotJoinableTest
.
.We can largely copy/paste the solution.
Comment #33
Wim Leers
Per #31.4, I propose we remove the need for this by making
SourcePluginBase::next()
behave more logically.
Comment #38
Wim Leers
#32 actually passed; the only problem was this:
… even though the parent method already declares it exactly that way 🤷‍♀️
I figured that #33 would trigger more failures. That should be simple to fix. This is because
\Drupal\migrate\Row
's logic is tightly coupled to implementation details in
\Drupal\migrate\Plugin\migrate\source\SourcePluginBase
.
Comment #39
Wim Leers
GREEN! 🥳🤩 Now also testing against
9.1.x
.
Comment #40
quietone CreditAttribution: quietone as a volunteer commented
The change in #33 is in another issue which I cannot find. Noting it here hoping that I remember to find it.
@Wim Leers, glad to see action on highwater!
Comment #41
quietone CreditAttribution: quietone as a volunteer commented
Found it: the change in #33 is the same as #3036368: No messages after the migration.. Not marking it as a duplicate because I want to determine if the change was made for the same reason.
Comment #42
quietone CreditAttribution: quietone as a volunteer commented
Meant to add it as a related issue.
Comment #43
Wim Leers😊
My proposed change in #33 is more efficient though: it doesn't delete the row if it did not exist yet. Not only is it more efficient, it requires fewer method invocations to be mocked.
I'm sure it's weird/puzzling why I use
$row->changed()
as a condition. I'm doing that to verify that the row has indeed already been migrated, because we cannot rely on
Row::getIdMap()
: it never evaluates to
FALSE
due to
\Drupal\migrate\Row::$idMap
's default value, and the logic in
Row
's methods is now dependent on
$idMap
never being empty. This is wrong, but it's also the reality now.
(That's also why I had to do #38.)
Comment #44
quietone CreditAttribution: quietone as a volunteer commented
In #41 I didn't notice that the change included a change in the logic; I thought it was just moving a block of code. Thanks for pointing it out.
I set aside time to look at this today. And as I started to do so, so did a headache, and then some very annoying machine noise outside. So this is not thorough; more like a random assortment of things I noticed.
Is testing for row changed necessary? Can there be more explanation as to why the row is deleted here?
This will run for each test in this class, and all the migrations in this test class use the source plugin 'high_water_test'. Therefore this changes the conditions for all the tests in the file. Do we not want to retain testing when the map is joinable?
I found this accurate but awkward to read. Maybe something like this? Change as you like.
// Add two items to the source database, one that is below high water and
// one that is above high water. The below high water item simulates a
// source row that failed to migrate for any reason. The above high water
// one simulates a new row in the source.
I think this is referring to the two items added so let's use the same language used above when adding them. Maybe something like this?
// The two new items should be detected and processed. Assert the above high water row first because this is more likely to pass tests. The assert order is specific to help with finding the root cause of any error.
Please explain these changes.
Comment #45
Wim Leers
Thank you! 🙏😊
😞 Yeah, that can really destroy my ability to focus too…
#44:
But "failed" usually results in a state of "processed", but with a
\Drupal\migrate\Plugin\MigrateIdMapInterface::STATUS_FAILED
entry in the map. I'm explicitly saying that is not the case.
So I'm concerned about this particular wording that you proposed. I completely agree it can and should be better, and I won't be satisfied until it's crystal clear/unambiguous to you too 😊
#33 has many failures like this:
because these mocked return values are valid but trigger those exceptions:
The problem here is that
\Drupal\migrate\Row::changed()
, but also several other methods on that class (e.g.
needsUpdate()
,
::getHash()
, et cetera), assume that
\Drupal\migrate\Plugin\MigrateIdMapInterface::getRowBySource()
returns at minimum
['original_hash' => SOMETHING, 'hash' => SOMETHING, 'source_row_status' => SOMETHING]
aka the columns in the
migrate_map_*
SQL table(s) that are
NOT NULL
.
\Drupal\migrate\Row::$idMap
hardcodes this too, and even with a
'source_row_status' => MigrateIdMapInterface::STATUS_NEEDS_UPDATE,
that does not make sense — a previously unseen row cannot have a
STATUS_NEEDS_UPDATE
entry in the migrate ID map plugin, can it?
IOW: nothing in the
\Drupal\migrate\Plugin\MigrateIdMapInterface::getRowBySource()
interface guarantees those minimum key-value pairs, and
\Drupal\migrate\Plugin\migrate\id_map\NullIdMap::getRowBySource()
proves this. But so far nothing has been calling
NullIdMap::getRowBySource()
(or other alternative map plugins), which is how it went unnoticed all this time.
Comment #49
quietone CreditAttribution: quietone as a volunteer commented
Found a duplicate. Adding credit.
The patch in that issue solves the issue in much the same way. I do like that there is no initialization of the map row in setIdMap(). There are tests as well, which I haven't looked at. Maybe they will be helpful?
Finally, I would like to separate out the moving of the block of code in SourcePluginBase, because it is needed for more than highwater. Highwater is challenging enough. With that in mind I am going to work on #3036368: No messages after the migration.. Just need to figure out how to test that.
Comment #50
Wim Leers
Good :)
It's because that issue's patch does not have unit tests for the actual root cause: in the patch here, I'm mocking the map plugin to ensure that we're actually testing the source plugin as a unit.
Second, the tests that it does have do not test the interaction with the map plugin, because they have a custom source plugin
@MigrateSource=high_water_test_not_joinable
which simply overrides the
mapJoinable()
function to return FALSE.
Third, as described in #31.4, having mocked the map plugin caused me to stumble upon another bug: the fact that migration messages are deleted at a time/in a state where it does not make sense to do so. That's why that bug is being fixed here too as of #33. That's basically the #3036368: No messages after the migration. issue you linked.
Not really, see the above.
Comment #51
Wim Leers
Works for me to move that out of here. 😊 It would make this patch simpler, for sure.
I've basically already written the test coverage for that here: the interdiff for #33 shows how you can test that. I removed it because that bug was literally a distraction for solving the bug in this issue. You want to change the prophecy that I removed there to basically test that it will never get called. So instead of this from #33:
something like:
Comment #52
DeFr CreditAttribution: DeFr at Axess Open Web Services commented
Chiming in from #2996274: Migrate SqlBase: Needs-update source_row_status ignored if mapJoinable() = FALSE and Highwater Marks is ON; I'm not sure those issues are duplicates. At least the widening of the scope in #21 plus the fix means that, at least in our case, the proposed fix doesn't work.
The case I needed a fix for is the one where the map *is* joinable (so mapJoinable() will return TRUE) but some external process changes the source_row_status in the migrate map to flag some items that would normally be below the high water mark as explicitly needing an update anyway. (Concretely, site contributors of the site the import runs on can flag imported content as needing to be updated if they detect that something seems odd. In that case, the source row status is set to NEEDS_UPDATE, and the next import will re-import it.)
Comment #53
quietone CreditAttribution: quietone as a volunteer commented
@DeFr, yes, duplicate isn't right and was premature. On further reflection, I have changed my mind (again), but am refraining from commenting until I review your comment and have had a good night's sleep.
Comment #54
DeFr CreditAttribution: DeFr at Axess Open Web Services commented
So, I took the time to re-read #2996274: Migrate SqlBase: Needs-update source_row_status ignored if mapJoinable() = FALSE and Highwater Marks is ON, this issue, SqlBase::initializeIterator, and the client setup mentioned in #52.
First of all, what I mentioned above for the client in #52 is incorrect; this client is running both imports from the same SQL server, but the map was explicitly set to be not joinable, to avoid the performance hit of that cross-database join. So I think duplicate was the right call.
I'm still concerned about the proposed fix here, because it means you're going to iterate over every row of the source. I'm not really convinced by the correctness-over-performance argument here, because if iterating over the whole result set and instantiating Row objects for all of them is acceptable, then migrate as is already provides a better-suited alternative: the track_changes property. It's a bit slower still, because it means Migrate will need to serialize the row to compute its sha256 hash, but if you're shooting for correctness, then that's probably the right call. High water fields should only be used when you can't do that due to performance concerns. For context, the migrate source for that client is ~500k records, but of this huge number of records, only ~50 are updated hourly, which is how often that specific migration is run. With the highwater optimization as is, it'll only read 50 records from the source server, process them, and run rather quickly; going over the whole 500k records every hour, on the other hand, would be a huge performance killer. That's why the proposed patch in the duplicate issue only did that when at least one item explicitly needed an update.
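For readers unfamiliar with track_changes: it boils down to hashing each source row and comparing the hash against the one stored in the map. A framework-free Python sketch of that idea (Drupal uses its own row serialization, not JSON; names here are illustrative):

```python
import hashlib
import json

def row_hash(row):
    """Deterministic hash of a whole source row (stand-in for Drupal's
    sha256-of-serialized-row used by track_changes)."""
    return hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()

# Hashes recorded in the map at import time:
stored_hashes = {1: row_hash({"nid": 1, "title": "old"})}

def needs_reimport(row):
    # Re-import if the row was never migrated, or any field changed.
    return stored_hashes.get(row["nid"]) != row_hash(row)

assert needs_reimport({"nid": 1, "title": "old"}) is False  # unchanged
assert needs_reimport({"nid": 1, "title": "new"}) is True   # changed row
assert needs_reimport({"nid": 2, "title": "x"}) is True     # never migrated
```

This catches any change, at the cost of serializing and hashing every row on every run, which is exactly the trade-off against high water fields described above.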
Going back to #21 and why some items might be unprocessed: I think I know what's going on. It's obviously only a theory at this point, but I think high water fields should not only be monotonically increasing in the source, they should ideally also be unique, because SqlBase adds a > condition to the query. If you have a bunch of content added with exactly the same value for the high water field and you run the migration in batches, and a batch splits the source items at the wrong point, you'll get some unprocessed items. (This is likely to happen if you're importing data from a Drupal 7 website which itself got a bunch of data imported and uses 'changed' as the high water property; most of the time all those items will have their changed property set to the timestamp that import was started.) To work around this, for the client mentioned above, we're using a computed field as the high water, composed of changed + (nid / 1 000 000), assuming that there will never be more than 1 000 000 items in the source.
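The tie problem and the computed-field workaround can be demonstrated with sqlite3 (illustrative schema; in the real setup the computed value would be an expression in the source query):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE node (nid INTEGER PRIMARY KEY, changed INTEGER)")
# Three nodes whose 'changed' values collide (e.g. a bulk import):
db.executemany("INSERT INTO node VALUES (?, ?)", [(1, 100), (2, 100), (3, 100)])

# Suppose a batch stopped after nid 1; the high water mark is now 100.
# The strict ">" condition then skips the remaining tied rows:
skipped = [r[0] for r in db.execute("SELECT nid FROM node WHERE changed > 100")]
assert skipped == []  # nids 2 and 3 are silently skipped

# Workaround: make the high water value unique by mixing in the PK
# (assumes fewer than 1,000,000 items in the source).
mark = 100 + 1 / 1_000_000  # computed mark after processing nid 1
remaining = [r[0] for r in db.execute(
    "SELECT nid FROM node WHERE changed + nid / 1000000.0 > ?", (mark,))]
assert remaining == [2, 3]  # the tied rows are found on the next run
```

With the computed field, each row gets a distinct high water value, so a batch boundary can no longer fall "inside" a tie.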
Comment #55
Wim Leers
Correct.
🤯 Curious to read the rationale now! 😄
track_changes
doesn't fix the problem of the migration system having failed to process a row and not having written that lack of processing into the
id_map
plugin. This is what I described in #21. So far, it's been happening for big clients with lots of data, but without any useful hints, errors, log entries or whatnot being generated by the migration system.
This is fair; but note that this will be bad for performance only if the source and destination DB are not joinable.
This can't be the root cause. Not only because, as it is, the migration system increments the high water mark whenever it reads from the source, not when it successfully processes the row; not even because it blindly writes whatever the current high water mark is without checking whether it is bigger or smaller; but simply because the migration system fails to track in the
id_map
plugin which rows were unprocessed. I know the migration system is catching exceptions and explicitly recording failures, but I am seeing repeated evidence that this is not guaranteed to work correctly.
This is a great theory 👏🤩, but alas it does not apply to what I've seen, sadly 😞 — I've seen it happen literally the first and only time you run a migration: it still fails to process rows and does not record those failures in the
id_map
plugin.
Comment #56
plugin.Comment #56
DeFr CreditAttribution: DeFr at Axess Open Web Services commented
Are you sure about that? That'd be really weird, because the original hash is stored in the id map; in #21 you state that there's no record at all for those records in the id map, so I don't really see how the original hash would end up in there.
Yes, but it's also the case where it's likely to affect you the most, because it means you're probably querying a remote server, and transferring all the data over will not be cheap.
It also might be a good reason to purposefully prevent the map from being joinable: getting a query that the SQL server will be able to easily optimize, instead of having to join all your records, can be a huge performance boost if you're working with large databases.
Not sure exactly how you're running that migration, but the batching mentioned isn't incompatible with running that migration a single time; you can get those batches either by specifying an explicit batch_size in the SQL source or by simply running the migration on a system that has an explicit memory limit set, in which case the migrate executable will start a new batch when you're approaching the memory limit.
That new query will be run with the last highwater value recorded, and can lead to records with the same highwater value being skipped.
Comment #58
Wim Leers
#3190815: Source count caching broken: impossible to enable source count caching for SqlBase-based source plugins (plus, unneeded cache I/O) broke this between
9.1.7
and
9.2.0-rc1
. Rebased. Keeping at current status.