Problem/Motivation

If the source database has multiple files with the same timestamp value, then the returned result is unpredictable and random most of the time when using d7_file migrate source plugin.

We already have this for d6 plugin:

$this->select('files', 'f')
      ->fields('f')
      ->condition('f.filepath', '/tmp%', 'NOT LIKE')
      ->orderBy('f.timestamp')
      // If two or more files have the same timestamp, they'll end up in a
      // non-deterministic order. Ordering by fid (or any other unique field)
      // will prevent this.
      ->orderBy('f.fid');

But for d7 we don't have it yet:

$this->select('file_managed', 'f')
      ->fields('f')
      ->condition('f.uri', 'temporary://%', 'NOT LIKE')
      ->orderBy('f.timestamp');

Proposed resolution

Add orderBy('f.fid') to d7_file source plugin query.

Remaining tasks

Submit a patch/merge request, review, commit.

Issue fork drupal-3220155

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Comments

Matroskeen created an issue. See original summary.

matroskeen’s picture

Issue summary: View changes
Issue tags: +Bug Smash Initiative
quietone’s picture

Did some digging and the d7 File source plugin was added in #2503861: Migrate file_managed table to file entities on 5 Aug 2015 and the change to d6 mentioned in the IS was in #2504815: d6 to d8 migration throws integrity contraint warning with duplicate file paths on 5 Oct 2015. Both phenaproxima and mikeryan worked on those issues, so why didn't they think D7 needed a similar change? I can't find an answer to that.

This comment seems to agree with Matroskeen

matroskeen’s picture

Status: Active » Needs review
StatusFileSize
new3.49 KB
new4.23 KB

No one picked this up, so I'd like to self-post a patch :)
I tried to come up with a test-only patch as well, so let's see.

The last submitted patch, 4: 3220155-4-test_only.patch, failed testing. View results

heddn’s picture

Status: Needs review » Needs work

If we automatically add this, it will break highwatermark adding order by. Marking NW because we can't just simply add the order by or it will break highwater.

matroskeen’s picture

Issue tags: -Novice

Thanks for the reply. It seems it's a good chance for me to dive into highwatermark feature and try to understand how that works.
My assumption about the task complexity was wrong, so I'm removing the "Novice" tag.

vsujeetkumar made their first commit to this issue’s fork.

matroskeen’s picture

Status: Needs work » Needs review
Issue tags: +Needs issue summary update

Before going ahead and fixing the remaining tests, I'd like to validate the approach.

If there is a high watermark, we're sorting by this value. Otherwise, we're doing a sort by source IDs.

I think it should be BC-friendly, and here is why:

  1. if the source plugin already contains orderBy by same ID field(s) - nothing changes, the second one will be ignored;
  2. if the source plugin (original or extended) contains orderBy by another field - the results are still sorted by initial value, and source IDs are just secondary-sorted values (orderBy in SqlBase is coming second);
  3. if the source plugin has no orderBy yet, it'll be sorted by source IDs, that should be primary keys;

I was also thinking about removing existing orderBy from source plugins, but it might change the sorting in some cases.
For example, if there is a plugin, that extends default one and adding some extra orderBy (which behaves as secondary for now). After removing the orderBy from parent plugin, the secondary orderBy becomes primary, which obviously is not expected.

matroskeen’s picture

Hiding initial patches.

matroskeen’s picture

Thinking more about this, it seems that highwater marks are not quite correct when the source plugin already provides some order.
In this case, ordering by highwater value is secondary order because it comes 2nd, isn't it?

quietone’s picture

mikelutz’s picture

Status: Needs review » Closed (works as designed)

This would be NW for test fails, but generally, highwatermarks aren't used in core UI migration, and do break if the source plugin adds an orderby while relying on the default sql source base to manage highwatermark. Strictly speaking the random order in a migration is not a problem. We added a specific order by in d6 to fix a test that was dependant on order, and failed on postgre. Highwater really only works if you are ordering by the highwater field, otherwise highwater has no real meaning. If the records aren't coming in in the order of the highwater field, then things get skipped or otherwise just don't make sense. This is why we don't want to order source plugins unless there is a very specific reason why we need to, which is rare. We probably shouldn't have ordered the d6 files, we probably should have fixed the test to deal with the random order.

So, this is closed, works as designed, and we should try to remove the order by in d6 and fix the actual test, rather than breaking the source plugin for highwater for the sake of the test.