Problem/Motivation

You get an error while upgrading a site from Drupal 6 to Drupal 7:

system module
Update #7061
Failed: PDOException: SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry 'public://sites/default/files/HotspotFinder.apk' for key 'uri': INSERT INTO {file_managed} (fid, uid, filename, uri, filemime, filesize, status, timestamp) VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3, 

Proposed resolution

This issue has no solutions yet.

Workaround / analysis

As boran suggests in #2, you can analyze your files table for the duplicate fids.

Some SQL to help you identify the duplicate filepaths:

SELECT count(*), min(fid), max(fid), filepath FROM files GROUP BY filepath HAVING count(*) > 1 ORDER BY count(*);

Next you could delete the lowest fids with the following SQL. Be aware that this does not fix the problem, but it lets you test other parts of the upgrade process. Some nodes/users/etc. could end up missing their fids.

CREATE TABLE temp_fids AS SELECT min(fid) fid FROM files GROUP BY filepath HAVING count(*)>1;
DELETE FROM files WHERE fid in (SELECT fid FROM temp_fids);
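
To see what would be affected, this hedged query lists the node attachments that point at the fids queued for deletion (via the temp_fids table above). It only covers the core Upload module; CCK filefields and other modules that store fids need their own checks:

SELECT n.nid, n.title, u.vid, f.fid, f.filepath
FROM temp_fids t
INNER JOIN files f ON f.fid = t.fid
INNER JOIN upload u ON u.fid = f.fid
INNER JOIN node n ON n.nid = u.nid
ORDER BY f.filepath;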

As observed in #11, a relation between published and unpublished nodes could be a cause.

Remaining tasks

Is this related to #966210: DB Case Sensitivity: system_update_7061() fails on inserting files with same name but different case?

Original report by boran

system module
Update #7061
Failed: PDOException: SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry 'public://sites/default/files/HotspotFinder.apk' for key 'uri': INSERT INTO {file_managed} (fid, uid, filename, uri, filemime, filesize, status, timestamp) VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3, :db_insert_placeholder_4, :db_insert_placeholder_5, :db_insert_placeholder_6, :db_insert_placeholder_7); Array ( [:db_insert_placeholder_0] => 56 [:db_insert_placeholder_1] => 38 [:db_insert_placeholder_2] => HotspotFinder.apk [:db_insert_placeholder_3] => public://sites/default/files/HotspotFinder.apk [:db_insert_placeholder_4] => application/octet-stream [:db_insert_placeholder_5] => 507667 [:db_insert_placeholder_6] => 1 [:db_insert_placeholder_7] => 1253524245 ) in system_update_7061() (line 2808 of /disk2/www/drupal7/modules/system/system.install).

On the D6 site there are entries in the file table with the same filepath:

| fid | uid | filename             | filepath                                 | filemime                 | filesize | status | timestamp  |
|  54 |  38 | HotspotFinder.apk    | sites/default/files/HotspotFinder.apk    | application/octet-stream |   178090 |      1 | 1252920928 |
|  56 |  38 | HotspotFinder.apk    | sites/default/files/HotspotFinder.apk    | application/octet-stream |   507667 |      1 | 1253524245 |

In D7 the file_managed table has a unique key on the uri column,
so the upgrade script fails, as noted above, when it tries to insert a second file with the same path.

Possible solutions:
- if there are duplicates, only upgrade the first one, reference that, and issue a warning to the user. The user will need to manually update any non-core modules that might reference the duplicate file entries.
- make a copy of the file for each duplicate and reference that


Comments

wessex’s picture

I have just experienced this error too, upgrading from 6.22 > 7.8.

Is there a workaround for the time being?

boran’s picture

Priority: Major » Normal
Status: Needs work » Active
Issue tags: -D7 upgrade path

I was working on a patch to handle the first file and warn about the others, but am lacking time right now.

Suggestion:
Query your files table to see what duplicates you have (do a GROUP BY on the filepath). If these are very few:
- note the fid, and find out which other tables reference the fid (CCK fields, nodes, ...)
- manually delete the duplicates from the files table,
- upgrade to 7.x
- manually correct the affected nodes/fields or whatever else the affected file was attached to.

catch’s picture

clemens.tolboom’s picture

Status: Closed (duplicate) » Needs work

@catch: this report does not have different case file names.

I reopened this as I guess the root cause here is different.

As I commented on #966210: DB Case Sensitivity: system_update_7061() fails on inserting files with same name but different case (http://drupal.org/node/966210#comment-5116402), my system had no case-sensitivity problem, but duplicate entries based on uid in the files table.

What worries me here is that the mentioned files have different sizes. Which one should be used?

clemens.tolboom’s picture

I tried to delete the duplicate file rows by first checking how many duplicates there are.
[edit: replaced filename with filepath, see comment #6]

SELECT min(fid) FROM files GROUP BY filepath HAVING count(*)>1;

then define a delete query

DELETE FROM files where fid in (SELECT min(fid) FROM files GROUP BY filepath HAVING count(*)>1);

which is not allowed by MySQL, so I created a temp table holding the lowest fids to delete

CREATE TABLE temp_clemens AS SELECT min(fid) fid FROM files GROUP BY filepath HAVING count(*)>1;

Then delete the rows to further test the upgrade:

DELETE FROM files WHERE fid IN (SELECT fid FROM temp_clemens);
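
(A hedged aside: MySQL refuses the direct DELETE because it modifies the same table it selects from (error 1093). Wrapping the subquery in a derived table avoids the temporary table:)

DELETE FROM files WHERE fid IN (
  SELECT fid FROM (
    SELECT MIN(fid) AS fid
    FROM files
    GROUP BY filepath
    HAVING COUNT(*) > 1
  ) AS dup
);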

my 2cents

clemens.tolboom’s picture

Issue summary: View changes

fix closing tag

clemens.tolboom’s picture

Hmmm, comment #5 should mention filepath instead of filename, as this is about the D7 file URI schemes like public://img.png.

Deciphered’s picture

#5,

Based on information on the MySQL site, this is the query I used to work around this issue myself:

CREATE TABLE tmp SELECT * FROM files GROUP BY filepath, filename;
DROP TABLE files;
ALTER TABLE tmp RENAME TO files;

Basically it does the same thing.

I still don't know what effect it will have on the site to have removed rows from the files table, but I would guess that it's not really a good thing. If anything, there should be some sort of find-and-replace functionality that replaces the removed fids with the leftover fids.
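
One hedged way to gauge that effect: look for {upload} rows whose fid no longer exists after the rebuild; each hit is an attachment now pointing at a removed duplicate (core Upload only; CCK filefields would need a similar check):

SELECT u.nid, u.vid, u.fid
FROM upload u
LEFT JOIN files f ON f.fid = u.fid
WHERE f.fid IS NULL;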

clemens.tolboom’s picture

@Deciphered #7

What you did was drop the unique constraint on the filepath column, and the other keys as well.
(I would keep the original files table by renaming it instead of dropping it.)

So yes ... that is a nice temporary workaround for testing the rest of the upgrade process.

But no ... I guess that when you delete one of the fids through the UI, the file itself will be deleted too, so the other fid becomes useless.

catch’s picture

Issue tags: +D7 upgrade path

So there was never a unique key on filepath in Drupal 6, but we have a unique key on uri in D7. This looks like at least major to me. Also tagging with upgrade path.

catch’s picture

Priority: Normal » Major
clemens.tolboom’s picture

Issue summary: View changes

Updated issue summary.

clemens.tolboom’s picture

I have a very weird observation based on this SQL. Note this is data from a D5 site.

SELECT n.title, n.type, n.status, f.filepath FROM node n INNER JOIN files f ON n.nid = f.nid 
WHERE f.filepath IN (
  SELECT filepath FROM files GROUP BY filepath HAVING count(*) > 1 ORDER BY count(*)
)
ORDER BY f.filepath, n.status;

This gives a list like this:

+--------------------------------------------------+--------------+--------+---------------------------------------------------------------+
| title                                          | type         | status | filepath                                                      |
+--------------------------------------------------+--------------+--------+---------------------------------------------------------------+
| 2x2 Value Matrix                               | type_x |      0 | files/2x2_Value_Matrix.doc                                    | 
| 2x2 Value Matrix                               | type_x |      1 | files/2x2_Value_Matrix.doc                                    | 
...

That is, _ALL_ duplicate filepath entries have different node.status values.

Maybe the cause was #92820: uploaded file attached to multiple nodes when submitting with two node/adds open

So my fix could be to delete all the files of all unpublished nodes and let the site owner repair them based on status and title.

[edit]
DELETE files FROM files INNER JOIN node n ON files.nid = n.nid WHERE n.status = 0;
[/edit]

(Puzzle for this particular site is why they created duplicate node entries in the first place)

clemens.tolboom’s picture

A problem with #11 is that the nids of the unpublished and published nodes are not next to or near each other; they differ by between 94 and 220. But the published nodes all have consecutive nids.

select f.fid, g.fid, (g.fid - f.fid) fdiv, f.filepath, g.filepath FROM files f INNER JOIN files g ON f.filepath = g.filepath WHERE f.fid <> g.fid ORDER BY g.fid;

I should analyze the created and updated times. Maybe next time as I still have to upgrade a site :)

giorgio79’s picture

Priority: Normal » Major
Status: Active » Needs work
Issue tags: +D7 upgrade path

Appreciated, Clemens: your solution at #5 allowed me to execute the system update, so far so good. I have other errors now with other modules, but hopefully I am a step closer to a functioning D7 site :)

markwk’s picture

Running into the same error. I'd say this is major.

clemens.tolboom’s picture

@markwk: it is quite useless just saying me too :)

Please tell us more about your particular site. And ... :)

[Stock response from Dreditor templates and macros.]

Please update the issue summary by editing the issue. This helps people to understand this issue quicker.

markwk’s picture

@clemens.tolboom: it's even more "quite useless" to say what is useless or useful for someone else :P

I'm upgrading a site to D7 that started as a hybrid vBulletin/D5 setup and is now on D6. It has lots of files, and due to the nature of uploads quite a few have the same names with different cases. I'm currently just deleting the duplicates in the DB so I can move forward with other upgrade issues.

I don't really feel like digging into a core issue/patch to solve this, so I'm revising things on my end so the Drupal upgrade can proceed.

I haven't gotten around to coding a solution, but since it involves only 60-100 files, I'll probably add two more steps to the bash script:
1. take the duplicate files and append _1 to the file, so where filename.doc and FileName.doc are creating problems, the more recent one will become Filename_1.doc.
2. rename the filepath with the revised file name.

Heine’s picture

@markwk, this issue is not about files that only differ in case. Issue #966210: DB Case Sensitivity: system_update_7061() fails on inserting files with same name but different case deals with those.

I too encountered this on a D6-D7 upgrade of a site that was upgraded from D5. The issue seems to be caused by node revisions created on the D5 site. Even though the revisions in my case point to the same filepath, they are all linked to different fids via the Upload module's DB table.

If others can confirm, and we can indeed pinpoint Upload as the culprit, a proper course of action would be to find all fids pointing to the same filepath, keep one record, and update the upload table to replace the spurious fids with the kept one (see the sketch below). A good place to do so would IMO be a new hook_update_N in a future D6 release.
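
A rough SQL sketch of that approach, run against the D6 database before the upgrade (files_fid_map is a made-up scratch table name; back up first). It only remaps core Upload references, other modules that store fids need the same treatment, and it could still collide with the (vid, fid) primary key of {upload} if a single revision references two of the duplicates:

CREATE TABLE files_fid_map AS
SELECT d.fid AS old_fid, k.kept_fid AS new_fid
FROM files d
INNER JOIN (
  SELECT filepath, MIN(fid) AS kept_fid
  FROM files
  GROUP BY filepath
  HAVING COUNT(*) > 1
) k ON k.filepath = d.filepath
WHERE d.fid > k.kept_fid;

UPDATE upload u
INNER JOIN files_fid_map m ON u.fid = m.old_fid
SET u.fid = m.new_fid;

DELETE FROM files WHERE fid IN (SELECT old_fid FROM files_fid_map);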

boran’s picture

I don't see how we can do a generic fix, unless one decides to take the first file, ignore the duplicates and print a warning.
The impact of ignoring the duplicates will be site-specific, i.e. it could make little difference or it could mean content is lost.
However, by not providing an automated fix, sysadmins are obliged to dive into the SQL and make the fixes manually. Although they lose time, they are at least more aware of the data-loss risks. This issue does not seem to affect that many sites? (It would be nice if one could see the number of followers of this thread :)

Heine’s picture

@boran, if the cause of this is the D5 upload module (and not the filesystem itself), it should be possible to provide a generic fix.

boran’s picture

Ah. In my case there was only D6, no D5.

scott.whittaker’s picture

I'm also having this issue as a result of upgrading a large site from Drupal 5 to 6 and finally 7. I have over 9,000 files with duplicate names, between 2 and 2,500 copies of each, leaving a total of over 28,000 conflicted files.

None of these are due to revisions. Simply deleting references to copies is not an option since that will result in thousands of pages with missing content. That only leaves a few options, none of them good:

1. Change the definition of the file_managed table to allow the uri to be non-unique. That would be quick and easy, but I have no idea what problems that might cause down the line.

2. Create a table mapping duplicate file paths to the first fid with that path, reduce the file_managed table and work through all fields that reference files, replacing the fids. This might also have issues down the line since multiple content items will be referencing the same file in the table, and who knows what will happen if one of those content items deletes a referenced file?

3. Make sure I have a copy of all the referenced files and make a script to duplicate and rename all the referenced files, then update the files table with the new filenames and filepaths before migrating again. This would probably be the safest but will result in thousands of duplicate files, not to mention being time consuming to implement.

Any advice?

Heine’s picture

Scott, can you check if those files were managed by the upload module?

clemens.tolboom’s picture

@Heine How could one check for that? In #17 you mentioned the upload module. Could you elaborate a little on that? I.e., some SQL example?

Jumoke’s picture

Scott, I am upgrading a site from D5 and have run into the same issue with multiple duplicate files. What was your resolution?

James Tipler’s picture

I hope I'm not about to post this twice, post failed last time.

This script resolves the issue by passing over the database, generating symlinks for each duplicate file, and then updating the "filepath" attribute, so that when the URIs are generated, they all point to unique locations. Not an ideal solution, but if you have fewer than a couple of thousand duplicates it only causes 2 or 3 megabytes of fluff in the file system.

I was a bit lazy and created it with absolute paths, so you'll need to adjust it a bit to suit your specific Drupal installation.

Obviously this assumes a Linux based install.

Hope that helps someone.

function clean_this_mess_up($username, $password, $database){

	mysql_connect("localhost", $username, $password);
	mysql_select_db($database);
	$i = 0;

	while(1 == 1){

	$duplicatePath = "duplicate_files_".$i;
	$result = mysql_query("SELECT filepath, min(fid) fid FROM files GROUP BY filepath HAVING count(*)>1");

	if(mysql_num_rows($result) == "0"){
		echo "No duplicates left. \n";
		return true;
		}

	mkdir("/local/www/drupal/".$duplicatePath, 0755, true);
	$i++;

	while ($row = mysql_fetch_array($result)) {
		echo $row['filepath']."\n";

		$path = $row['filepath'];
		$pos = strrpos($path, "/");
		$dir = substr($path, 0, $pos);

		if(!file_exists("/local/www/drupal/".$path)){
			if(!is_dir("/local/www/drupal/".$duplicatePath."/".$dir)){
				mkdir("/local/www/drupal/".$duplicatePath."/".$dir, 0755, true);
				}
			touch("/local/www/drupal/".$duplicatePath."/".$path);
    			}

		else{

		echo $duplicatePath."/".$dir."\n";
		
		if(!is_dir("/local/www/drupal/".$duplicatePath."/".$dir)){
			mkdir("/local/www/drupal/".$duplicatePath."/".$dir, 0755, true);
			}
			
			$exec = "ln -s \"/local/www/drupal/$path\" \"/local/www/drupal/$duplicatePath/$path\" \n";
			exec ( $exec );
			}
		}

	mysql_free_result($result);

	$exec = "chown -R www-data:www-data \"/local/www/drupal/$duplicatePath\" ";

	exec ( $exec );

	$result = mysql_query("CREATE TABLE temp_concat AS SELECT min(fid) fid FROM files GROUP BY filepath HAVING count(*)>1;");
	$result = mysql_query('UPDATE files SET filepath = CONCAT("'.$duplicatePath.'/'.'",filepath) WHERE fid IN (SELECT fid FROM temp_concat);');
	$result = mysql_query('drop table temp_concat;');

	}

}
Rhino’s picture

Is there no better solution for this? Like Scott, I have several thousand files and have moved up from Drupal 4.7 to 5 to 6, and now I want to upgrade to 7. Should I even bother? What happens between 7 and 8?

Danny Englander’s picture

@James Tipler - I am curious about your code in #25, where do you put that and at what point does it get run? Thanks.

splash112’s picture

Had the same problem upgrading a module

Could a solution be something like:

      $file['uri'] = $scheme . str_replace($basename, '', $file['uri']);
      $file['uri'] = file_stream_wrapper_uri_normalize($file['uri']);

      $duplicate = db_select('file_managed', 'f')
        ->fields('f', array('fid'))
        ->condition('uri', $file['uri'])
        ->execute()
        ->fetchField();
      
      if ($duplicate && $duplicate != $file['fid']) {
        $file['fid'] = $duplicate;
      }

At least the upgrade now works and the file can be associated with the right entity.

fgm’s picture

Same problem here, on a site upgraded from 4.7 to 5 then 6, and now upgrading to 7. After looking into it a bit, I think it happens because in earlier versions upload could allow different revisions of the same node to use the same uploaded file, whereas more recent versions (D6) renamed duplicate files. In my case, I notice the entries in {upload} have the same description, belong to the same nid, and different file objects have been created: notice how the first entry is for fid 129 and all the later ones for fid 134 bound to the same actual file path. (note: file and site names anonymized)

fid vid description list nid weight
129 786 bar.jpg 0 217 0
134 792 bar.jpg 0 217 0
134 795 bar.jpg 0 217 0
134 796 bar.jpg 0 217 0
134 799 bar.jpg 0 217 0
134 803 bar.jpg 0 217 0
134 813 bar.jpg 0 217 0
134 815 bar.jpg 0 217 0

In turn, just before update 7061, this gives these entries in {files}

fid uid filename filepath filemime filesize status timestamp
129 57 bar.jpg sites/foo/files/bar.jpg image/jpeg 36131 1 0
134 57 bar.jpg sites/foo/files/bar.jpg image/jpeg 36131 1 0

And the duplicate happens because the update tries to insert an entry for fid 134 in {file_managed} after it has first inserted this row:

File ID uid filename uri filemime filesize status timestamp
129 57 bar.jpg public://sites/foo/files/bar.jpg image/jpeg 36131 1 0

But, obviously, this URI would be a duplicate of the original one.

Considering this, I think there cannot be a database-only solution: any completely correct solution should work around the different file reuse cases by replicating reused files, renaming the copies as needed, referencing the updated file name instead of the original one, and only then applying the DB update strategy.

In my case, this happened for only one fid, so I just removed the offending node revision before running update 7061, removed fid 129 manually from files and upload, and all went like a charm.
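
For reference, the manual cleanup described above boils down to something like the following, using the fids from the tables shown (a hedged example; adjust to your own data and back up first; the stale node revision itself was removed separately, e.g. through the revisions UI):

DELETE FROM upload WHERE fid = 129;
DELETE FROM files WHERE fid = 129;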

David_Rothstein’s picture

Priority: Major » Critical

This sounds like a critical bug to me.

flaviovs’s picture

This is how I managed to workaround this issue and upgrade a site from D6 to D7:

  1. Still in D6, install and enable FileField
  2. Create a FileField field called "field_attachment" in some content type in which uploads are enabled. Set up the field to have unlimited values, description and list settings, just to make it like the D6 core Upload module
  3. Add the same field to all other content types where uploads are enabled
  4. Run the drush script available at https://drupal.org/node/656204#comment-3632546. This script migrates files from the Upload module to FileFields. There may be other ways of doing it, but the script seems to work well
  5. Disable and uninstall the Upload module
  6. Run D7 update.php. Now the upgrade should not generate any error or exception related to this issue

After migrating to D7, don't forget to perform the D6 CCK to D7 fields migration.

flaviovs’s picture

WRT the last post, one should note that the presented workaround only allows upgrading from D6 to D7. When the CCK migration is run, migrating the (FileField) fields to core files will abort if the non-unique files are not addressed.

iamEAP’s picture

Status: Needs work » Active

Resetting to "active," since there's no actual patch for review.

Before anyone else spends time looking for solutions, we actually need to better understand the root of the problem; and more importantly, because this issue is now blocking D8, determine whether or not Drupal core itself is responsible for the duplicate file path entries. The best comment I've seen so far in this vein was #17.

The issue seems to be caused by node revisions created on the D5 site. Even though the revisions in my case point to the same filepath, they are all linked to different fids via the Upload module's DB table.

Next step should be to verify that a D5 site with revisions and files generates duplicate entries. Then, upgrade to D6 and verify that the duplicate entries persist. If that's the case, we can push this back to Drupal 6 and write the proposed update function. If not, we either continue searching for a core problem that produces the issue, or we concede that the problem is coming from an API-incompatible contrib module and close this.

iamEAP’s picture

Installed D5, enabled node revisions, created a node with a file uploaded. Created a new revision of the node (which creates a new file revision). Created a new revision of the node with the file deleted, and then created a final revision of the node, uploading the exact same file/filename (which creates a new file entry, appending _0 to the file path).

Then, upgraded to D6. There is no duplicate entry created in the files table.

Conclusion: it's probably not a core process that produces this, at least with the last D5 and the latest D6.

Unless we find a common thread that's core-related, this may be a won't fix.

flaviovs’s picture

I experienced this issue when doing a D6->D7 upgrade of a site that never had a D5 version, so I doubt that the issue is D5-related.

However, the site makes heavy use of revisions (via the Revisioning module), so maybe there are other revision-related use cases that trigger the bug?

jmrivero’s picture

#7 worked to complete the upgrade... almost.
The double grouping keeps files with the same name but a different path in the table, so I used this instead:

CREATE TABLE tmp1 SELECT * FROM files GROUP BY filename;
CREATE TABLE tmp2 SELECT * FROM tmp1 GROUP BY filepath;
DROP TABLE files;
ALTER TABLE tmp2 RENAME TO files;

And then cleared the system_update table:

delete FROM system_update_7061;
jmrivero’s picture

Issue summary: View changes

Added another analysis for sites originating in D5 and upgrading to D7

mheinke’s picture

Experiencing this exact issue with a site upgrade from 6.x to 7.x. Like others, it's impossible for me to just delete duplicates. Is there a solution to this?

mheinke’s picture

I got around this by doing a fresh install of Drupal 7, inserting the Drupal 6 content tables, using CCK Migrate to migrate content types, then Migrate D2D to port content.

Not ideal, and it took about 7-8 hours, but it worked.

Pinolo’s picture

Having a limited set of duplicated filepaths (~200), I created a modified version of the script proposed in #25. Instead of creating symlinks, it copies the relevant files, adding a suffix to the filename and updating the files table accordingly.
Also, the script is now a bit more portable, while still being far from efficient…

function clean_this_mess_up($username, $password, $database) {
  $thisroot = $_SERVER['DOCUMENT_ROOT'] . '/';
  mysql_connect("localhost", $username, $password);
  mysql_select_db($database);
  $i = 0;
  while (1 == 1) {
    $result = mysql_query("SELECT filepath, min(fid) AS fid FROM files GROUP BY filepath HAVING count(*)>1");
    if (mysql_num_rows($result) == "0") {
      echo "No duplicates left. \n";
      return TRUE;
    }
    $i++;
    $c = 0;
    while ($row = mysql_fetch_array($result)) {
      $path = $row['filepath'];
      $pos = strrpos($path, "/");
      $dir = substr($path, 0, $pos + 1);
      $fname = substr($path, $pos + 1);
      $epos = strrpos($fname, ".");
      $fname_only = substr($fname, 0, $epos);
      $fname_ext = substr($fname, $epos);
      $fname_new = $fname_only . '_fclean_' . $c . $fname_ext;
      $path_new = $dir . $fname_new;
      echo $path_new . "\n";
      copy($thisroot . $path, $thisroot . $path_new);
      $result = mysql_query("UPDATE files SET filepath = '" . $path_new . "' WHERE fid = " . $row['fid'] . " LIMIT 1");
      $c++;
    }
    mysql_free_result($result);
  }
  echo $c . " \n";
}
mjcarter’s picture

The script in #39 works great.

I have made some improvements with regard to error handling and behaviour if there are problems (such as if the file referenced in the DB doesn't exist) as well as upgrading the script to use the mysqli database layer.

I also added the necessary function call to invoke the script at the bottom - this is where you should edit in your database details.

<?php
function clean_this_mess_up($server, $username, $password, $database) {
  $thisroot = $_SERVER['DOCUMENT_ROOT'] . '/';
  $mysqli = new mysqli($server, $username, $password, $database);
  
  echo '<h1>Scanning for duplicate file entries...</h1>';

  $i = 0;
  while (1 == 1) {
    $result = $mysqli->query("SELECT filepath, min(fid) AS fid FROM files GROUP BY filepath HAVING count(*)>1");
	if (!$result) {
		throw new Exception("Database Error [{$this->database->errno}] {$this->database->error}");
	}
    if (mysqli_num_rows($result) == "0") {
      echo "No duplicates left. \n";
      return TRUE;
    }
    $i++;
    $c = 0;
    while ($row = $result->fetch_assoc()) {
		if(!$result) {
			throw new Exception("Database Error [{$this->database->errno}] {$this->database->error}");
		}
		$path = $row['filepath'];
		echo 'Duplicate file in DB found at ' . $path . '';
		$pos = strrpos($path, "/");
		$dir = substr($path, 0, $pos + 1);
		$fname = substr($path, $pos + 1);
		$epos = strrpos($fname, ".");
		$fname_only = substr($fname, 0, $epos);
		$fname_ext = substr($fname, $epos);
		$fname_new = $fname_only . '_fclean_' . $c . $fname_ext;
		$path_new = $dir . $fname_new;
		
		if(file_exists($thisroot . $path)) {
			$result_update = $mysqli->query("UPDATE files SET filepath = '" . $path_new . "' WHERE fid = " . $row['fid'] . " LIMIT 1");
			if (!$result_update) {
				throw new Exception("Database Error [{$this->database->errno}] {$this->database->error}");
			} else {
				copy($thisroot . $path, $thisroot . $path_new);
				echo ' <b>New deduplicate URL</b>: ' . $path_new . '<br><br>';
			} 
		}
		  else {
			echo 'The file ' . $thisroot . $path . ' <b>Not found</b> - nothing changed <br><br> <h1>Fix file not found error and retry</h1>';
			return TRUE;
		}
    }
    mysqli_free_result($result);
  }
  echo $c . " \n";
}
clean_this_mess_up('SERVER', 'USERNAME' ,'PASSWORD' ,'DATABASE');
?>
Trejkaz’s picture

The additional suggestion in #36 nearly worked for us. We did execute all the statements in #36:

CREATE TABLE tmp1 SELECT * FROM files GROUP BY filename;
CREATE TABLE tmp2 SELECT * FROM tmp1 GROUP BY filepath;
DROP TABLE files;
ALTER TABLE tmp2 RENAME TO files;
DELETE FROM system_update_7061;

But we also found that the file_managed table had partial updates in it, so we also had to run:

delete from file_managed;

And on the next attempt, the migration finally worked.

So I think there are a number of bugs here:

  1. When duplicate entries exist for some reason, the migration isn't doing the right thing.
  2. When the migration does not complete successfully, partial updates are left in the file_managed and system_update_7061 tables which are not cleaned up on the next run.
  3. Drupal is not doing migrations in a transaction (it is best practice to roll back all partial updates in a failed migration, and transactions are the obvious way to do that).
jessZ’s picture

I am having this issue. I'm not much of a code person and I could live with re-uploading some of the files, but there are over 1000 nodes with attachments. How can I instruct the ISP/tech people doing the upgrade to run a script to resolve the following error and get past update 7061?

PDOException: SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '70' for key 'PRIMARY': INSERT INTO {system_update_7061} (vid) SELECT DISTINCT u.vid AS vid FROM {upload} u; Array ( ) in system_update_7061()

Tried the script but same error.

jessZ’s picture

SELECT count(*), min(fid), max(fid), filename FROM files GROUP BY filename HAVING count(*) > 1 ORDER BY count(*)
I only found 3 x2 (6 files with the same filepath) and fewer than 25 x2 with the same filename. How can I group via columns in phpMyAdmin for a fast delete?

I deleted all the duplicate files, but still get the same error. Any other suggestions for workarounds?

• Failed: PDOException: SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '70' for key 'PRIMARY': INSERT INTO {system_update_7061} (vid) SELECT DISTINCT u.vid AS vid FROM {upload} u; Array ( ) in system_update_7061() (line 2804 of /home/reimagi8/public_html/new/modules/system/system.install).

jessZ’s picture

AHA! Clear the system_update table!
delete FROM system_update_7061;

new message!
The following updates returned messages
system module
Update #7061

Upload module has been migrated to File module.

Now let's go see what I got.

Yes! Now I can see the PDF attachments. Now on to the Image module headache.

kbahey’s picture

This is a modified version of the script in #40 with some fixes.
First, a fix to handle UTF8 in file names, and not aborting when accessing a file as stored in the database.
Second, a fix to rename the files with the unique fid, rather than repeatedly adding _fclean_ until the length of the column is exhausted and the name gets truncated.

I also removed the HTML tags, so it runs from the command line.

Ideally, this script should be Drupalized, using the proper DB API, rather than accessing MySQL directly. But that is for another day ...

<?php
# Change these values 
define('DB_HOST', 'localhost');
define('DB_USER', 'dbuser');
define('DB_PASS', 'dbpass');
define('DB_NAME', 'dbname');
define('DOC_PATH', '/var/www/');

define('DRUPAL_ROOT', DOC_PATH);
require_once DRUPAL_ROOT . '/includes/unicode.inc';

function image_duplicates() {
  $thisroot = DOC_PATH;
  $mysqli = new mysqli(DB_HOST, DB_USER, DB_PASS, DB_NAME);
 
  $result = $mysqli->query('SET NAMES utf8');

  echo "Scanning for duplicate file entries...\n";
 
  $i = 0;
  while (1 == 1) {
    $result = $mysqli->query("SELECT filepath, min(fid) AS fid FROM files GROUP BY filepath HAVING count(*)>1");
    if (!$result) {
        throw new Exception("Database Error [{$mysqli->errno}] {$mysqli->error}");
    }
    if (mysqli_num_rows($result) == "0") {
      echo "No duplicates left. \n";
      return TRUE;
    }
    $i++;

    $c = 0;
    while ($row = $result->fetch_assoc()) {
        if(!$result) {
            throw new Exception("Database Error [{$mysqli->errno}] {$mysqli->error}");
        }
        $path = drupal_convert_to_utf8($row['filepath'], 'utf-8');
        echo 'Duplicate file in DB found at ' . $path . "\n";

        $rename_token = '_rename_';

        $pos = strrpos($path, "/");
        $dir = substr($path, 0, $pos + 1);
        $fname = substr($path, $pos + 1);
        $epos = strrpos($fname, ".");
        $fname_only = substr($fname, 0, $epos);
        $fname_ext = substr($fname, $epos);
        $fname_new = $fname_only . '_' . $row['fid'] . $fname_ext;
        $path_new = $dir . $fname_new;
       
        if(file_exists($thisroot . $path)) {
            $result_update = $mysqli->query('UPDATE files SET filepath = "' . $path_new . '" WHERE fid = ' . $row['fid'] . " LIMIT 1");
            if (!$result_update) {
                throw new Exception("Database Error [{$mysqli->errno}] {$mysqli->error}");
            } else {
                copy($thisroot . $path, $thisroot . $path_new);
                echo ' New deduplicate URL: ' . $path_new . "\n";
            }
        }
          else {
            echo 'File not found: ' . $thisroot . $path . " -- nothing changed. Fix file not found error and retry\n";
            return TRUE;
        }
    }
    mysqli_free_result($result);
  }
  echo $c . " \n";
}

image_duplicates();
David_Rothstein’s picture

Issue tags: +migration

Does this issue affect the Drupal 6-to-8 migration path too?

Given the nature of it (data in Drupal 6 databases that doesn't work with the uniqueness assumptions of later Drupal versions), I would expect that it does, but I haven't tested it.

dawehner’s picture

@amateesc and I discussed this and we are wondering why we can't fix the problem on the DB layer and "ignore" where it's coming from.

From reading the issue, it's clear you have some duplicate filepaths inside the {files} table. We should be able to
group them and then update the upload module's table to point to just the first fid for each of them. That should be pretty reasonable and doable.

Do people think that this is enough to fix the critical upgrade path issue? Yes, it's not clear why exactly this happens, but at least it fixes the issue.

fgm’s picture

IIRC this appears only on D6 sites upgraded from D5, when the file attachments were changed from being per-node to being per-revision, causing inconsistent behaviors for nodes where files changed over the course of revisions.

C-Logemann’s picture

@fgm I am currently working on a customer upgrade project with duplicate paths. This was not upgraded from D5.

dawehner’s picture

@C_Logemann
Can you think of another reason, maybe the usage of some filefield module, why you ended up with those duplicated paths?

C-Logemann’s picture

@dawehner
Especially in this case it is possible that the old programmers did something wrong when they imported existing data from an old custom CMS. But I didn't have access to the old import code to check.

In my current test upgrade process I run some MySQL commands to solve the problem. When I found this issue I thought there was maybe something wrong with D6, like the duplicate entries in the search table which sometimes break an import in MySQL?

boran’s picture

My original post was for a D6 site, which had never been on D5. There was little custom programming. I no longer have the site or DB to experiment with though.

Fabianx’s picture

I agree with dawehner: it's the references that need to be updated when de-duplicating the {files} table, not the files themselves.

David_Rothstein’s picture

So that would basically be the first option listed in the issue summary:

- if there are duplicates, only upgrade the first one, reference that, and issue a warning to the user. The user will need to manually update any non-core modules that might reference the duplicate file entries.

With the downside mentioned there.

(Basically, system_update_7061() is trying to preserve the existing file IDs, but it can't if we throw away some.)

But I guess that might still be preferable to the alternative of creating lots of copies of the same file... And we could consider keeping a record somewhere of the old file IDs and which new ID they point to, in case any other code out there needs it.

jelo’s picture

My site was not upgraded from D5 to D6. My problem might have come from the module filefield_sources. A feature of that module was that you could reference existing files from any other nodes. I am not sure how it was implemented, but it sounds like a logical candidate to have stored a duplicate file path.

dawehner’s picture

FileSize
2.31 KB

This so far is just some example data which could look like the problematic data, right?

Fabianx’s picture

Status: Active » Needs review

Testbot: Go!

Yes, example data looks great!

Status: Needs review » Needs work

The last submitted patch, 56: 1260938-56.patch, failed testing.

Status: Needs work » Needs review

Fabianx queued 56: 1260938-56.patch for re-testing.

Status: Needs review » Needs work

The last submitted patch, 56: 1260938-56.patch, failed testing.

dawehner’s picture

Status: Needs work » Needs review
FileSize
4.92 KB
7.66 KB
  • The previous test data was a little bit broken (fixed that so we now have an entry in {node} as well).
  • Wrote a really naive version of the update code. It's a little bit tricky to do both the {files} and node revision updates right without adding a new update function in between.
  • Expanded the test coverage to ensure that things work as expected

The last submitted patch, 61: 1260938-fail.patch, failed testing.

Fabianx’s picture

#61: Without looking too closely at the current patch, I think it should be a two-step process and certainly be a pre-step to doing any import / conversion.

This can purely work on the D6 data structures still:

a) Create a mapping table from the duplicates (that is always useful later if you need to clean up more data manually) and remove the duplicates; now {files} is sane.
b) Clean up field records => now those are sane too, and they map to the right de-duplicated {files} rows again based on the mapping table.

No need to deal with it in the middle of the conversion, this could be one of the first steps really.

dawehner’s picture

FileSize
8.01 KB

I'd prefer not to have to introduce a new update method in the middle of it, given that this will trigger at least one new schema update for all the D7 sites out there.

Given that, the easiest way for #63 would be to not make it a two-step process but simply fill up that mapping table over time.

Status: Needs review » Needs work

The last submitted patch, 65: 1260938-65.patch, failed testing.

dawehner’s picture

Status: Needs work » Needs review
FileSize
8.05 KB

Back to green; I was not able to use DBTNG correctly before.

PS: Uploading a patch against D7 using git diff 8.0.x might take you a while.

Status: Needs review » Needs work

The last submitted patch, 67: 1260938-67.patch, failed testing.

dawehner’s picture

Status: Needs work » Needs review
FileSize
8.05 KB
548 bytes

PHP 5.4 does not have short array syntax.

fgm’s picture

@dawehner, yes it does : http://php.net/manual/fr/migration54.new-features.php

But you probably meant some earlier version, since D7 does not require PHP 5.4 (only recommends it), only 5.2.5.

dawehner’s picture

@dawehner, yes it does : http://php.net/manual/fr/migration54.new-features.php

Ehe, yeah, I wanted to express that versions before 5.4 did not have the feature yet.

@fgm
Do you think this simple patch makes kinda sense?

neclimdul’s picture

Patch makes some sense to me but we don't ever use the mapping table. is that on purpose?

fgm’s picture

/me cringes seeing a db_query in a for loop, but.... I'm not sure of the whole logic; it seems the goal is to find duplicate fids for filepaths, but wouldn't it be possible to just query for duplicates all at once, like SELECT count(fid), filepath FROM {files} GROUP BY 2 HAVING count(fid) > 1, to avoid having such a loop?

neclimdul’s picture

+++ b/modules/system/system.install
@@ -2806,12 +2806,71 @@ function system_update_7061(&$sandbox) {
+        $kept_fid = array_shift($fids);

You could use a HAVING like that to limit the iterations, but you'd still need a select in the for loop.

dawehner’s picture

/me cringes seeing a db_query in a for loop, but.... I'm not sure of the whole logic; it seems the goal is to find duplicate fids for filepaths, but wouldn't it be possible to just query for duplicates all at once, like SELECT count(fid), filepath FROM {files} GROUP BY 2 HAVING count(fid) > 1, to avoid having such a loop?

Well, we certainly want to avoid running just one query, because that won't scale. Iterating over fids makes it easier. Feel free to improve it, though.

Patch makes some sense to me but we don't ever use the mapping table. is that on purpose?

Well, Fabianx asked for one

dawehner’s picture

Thank you again @neclimdul for sharing the database dump. Sadly it does not contain the problems described as part of the issue.
In particular, that dump did not contain some of the problems described pretty well in #29, as there weren't enough entries in {upload},
though I have seen the corresponding data in {files}; see this query:

SELECT files.filepath, COUNT(files.fid) FROM files GROUP BY files.filepath HAVING count(files.fid) > 1;

1.pdf	2
2.pdf	2
3.xlsx	2
4.xlsx	2
5.jpg 	2
6.jpg 	2
7.jpg 	2
8.jpg 	13
9.gif 	4
10.jpg	2
11.JPG	7
12.jpg	2
13.jpg	2
14.jpg	6

Aside from that, the patch fixes the problem that at least @fgm has, so it can get in ...

vijaycs85’s picture

  1. +++ b/modules/system/system.install
    @@ -2806,12 +2806,71 @@ function system_update_7061(&$sandbox) {
    +    $table = array(
    +      'fields' => array(
    +        'old_fid' => array('type' => 'int'),
    +        'new_fid' => array('type' => 'int'),
    +        ),
    +      'primary key' => array('old_fid'),
    +    );
    +    db_create_table('system_update_7061_fid_mapping', $table);
    

    We might want to make sure that the table doesn't exist?

  2. +++ b/modules/system/system.install
    @@ -2806,12 +2806,71 @@ function system_update_7061(&$sandbox) {
    +  $fid_filepath = db_query_range('SELECT fid, filepath FROM {files} WHERE fid > :lastfid ORDER BY FID ASC', 0, $limit, array(':lastfid' => $sandbox['last_fid_processed']))
    +    ->fetchAll();
    ...
    +      $fids = db_query('SELECT fid FROM {files} WHERE filepath = :filepath ORDER BY fid ASC', array(':filepath' => $filepath))
    +        ->fetchCol();
    

    I don't have any solution, but it looks like a lot of queries. Isn't there any other way to find all the duplicates, increment progress, and delete from files?

dawehner’s picture

FileSize
8.14 KB
1021 bytes

Thank you a lot!!

We might want to make sure that the table doesn't exist?

Can't be a bad thing.

I don't have any solution, but it looks like a lot of queries. Isn't there any other way to find all the duplicates, increment progress, and delete from files?

Yeah, that is a bit unfortunate, though on the other hand the site won't be online while executing them.

I don't have any solution, but it looks like a lot of queries. Isn't there any other way to find all the duplicates, increment progress, and delete from files?

How worried are you about that? This way it's at least easier to read, IMHO.

Berdir’s picture

Status: Needs review » Needs work
  1. +++ b/modules/system/system.install
    @@ -2806,12 +2806,73 @@ function system_update_7061(&$sandbox) {
    +    // There is a chance that we have multiple file entries pointing to the same
    +    // physical file, see https://www.drupal.org/node/1260938
    +    // Therefore create batch requests to update those files.
    

    Nitpick: needs a . or , or something after the issue link as the next line starts a new sentence.

    Could also remove the "Therefore" here and below as well I think.

  2. +++ b/modules/system/system.install
    @@ -2806,12 +2806,73 @@ function system_update_7061(&$sandbox) {
    +  $fid_filepath = db_query_range('SELECT fid, filepath FROM {files} WHERE fid > :lastfid ORDER BY FID ASC', 0, $limit, array(':lastfid' => $sandbox['last_fid_processed']))
    +    ->fetchAll();
    +  if (!empty($fid_filepath)) {
    +    foreach ($fid_filepath as $object) {
    

    Naming is a bit weird here, e.g. $object. That's usually $rows and $row or so?

    I also don't really like temporary variables like $filepath, seems easy enough to just inline that in the one place it is used? Maybe also use $duplicate_fids or so to separate it from $fid?

    I was also writing a very long comment about what happens when we delete files and then progress will never reach max and we'd never finish. But the !empty() check prevents that....

    However, something else can happen. If you have exactly 100 rows (or 200, 300, .. and no deletions), then #finished will be 1 and update.php will consider this batch function to be complete and won't call it again?

    You might be able to simulate this in the test by setting batch size to count($files)?

    So we need to make sure to never set #finished to 1, just .99 or something?

dawehner’s picture

Status: Needs work » Needs review
FileSize
8.21 KB
2.71 KB

I also don't really like temporary variables like $filepath, seems easy enough to just inline that in the one place it is used? Maybe also use $duplicate_fids or so to separate it from $fid?

Fixed

However, something else can happen. If you have exactly 100 rows (or 200, 300, .. and no deletions), then #finished will be 1 and update.php will consider this batch function to be complete and won't call it again?

So $sandbox['max'] is the following bit:

    $sandbox['max'] = db_query("SELECT COUNT(*) FROM {system_update_7061}")->fetchField()
       + db_query('SELECT COUNT(*) FROM {files}')->fetchField();

so even if we have that many rows, max will still be much higher

Berdir’s picture

Status: Needs review » Reviewed & tested by the community
+++ b/modules/system/system.install
@@ -2832,36 +2833,34 @@ function system_update_7061(&$sandbox) {
           ->fields(array('old_fid', 'new_fid'));
-        foreach ($fids as $fid) {
+        foreach ($duplicate_fids as $fid) {
           $insert->values(array('old_fid' => $fid, 'new_fid' => $kept_fid));

Nice, I only noticed now that this was overriding $fid... would have been a fun bug ;)

Right, now I understand the max change, nice trick ;)

I haven't manually tested this, but it looks good to me after an extensive review and we have tests. Let's get this in.

David_Rothstein’s picture

Status: Reviewed & tested by the community » Needs review

I think given the number of actual sites affected by this bug, the patch could use some testing on real sites.

+      $duplicate_fids = db_query('SELECT fid FROM {files} WHERE filepath = :filepath ORDER BY fid ASC', array(':filepath' => $row->filepath))
....
+        db_update('upload')
+          ->fields(array('fid' => $kept_fid))
+          ->condition('fid', $duplicate_fids, 'IN')
+          ->execute();

Is that safe? The {upload} table in Drupal 6 has a primary key on vid + fid, so if the same node revision (vid) referenced two files with the same filepath, this looks like it would blow up. (I have no idea if that could have happened in reality, but given the way this issue has gone it wouldn't be shocking at all.)
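
A hedged query to check whether a given D6 database actually contains that situation (one revision attached, via {upload}, to two different fids that share a filepath):

SELECT u.vid, f.filepath, COUNT(DISTINCT u.fid) AS fids
FROM upload u
INNER JOIN files f ON f.fid = u.fid
GROUP BY u.vid, f.filepath
HAVING COUNT(DISTINCT u.fid) > 1;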

Overall it seems this code might be a lot simpler if we integrated it with the main update rather than running it out in front; then we should avoid risks like the above. Would be great to have the simplest possible change in this patch given how many problems this particular update function has had over the years.

+// Under some configurations we end up having two fids with the same filepath,
+//see https://www.drupal.org/node/1260938.

Minor - code style/spacing is wrong on the second line.

David_Rothstein’s picture

FileSize
9.39 KB
3.7 KB

In other words, I'm wondering if this test (which should fail) needs to pass.

Status: Needs review » Needs work

The last submitted patch, 83: 1260938-83.patch, failed testing.

David_Rothstein’s picture

Status: Needs work » Needs review
FileSize
9.14 KB
3.5 KB

Actually I think that test is wrong - Drupal 7's File module doesn't prevent a node revision from referencing the same file more than once so I guess if you are in that situation in Drupal 6, you should see separate attachments on the node revision after the upgrade, not just one.

Interdiff is against #80. This should still fail, though, for the same reasons.

Status: Needs review » Needs work

The last submitted patch, 85: 1260938-85.patch, failed testing.

David_Rothstein’s picture

Status: Needs work » Needs review
FileSize
8.64 KB

And does the fix actually have to involve anything more than this? This is the "integrate it with the main update rather than running it out in front" approach I mentioned in #82. It passes for me locally.

No interdiff, since essentially all the code is moved compared to the earlier patches.

David_Rothstein’s picture

As written this is not very efficient though - definitely more database queries than necessary (even for sites that have no duplicate files).

If we care, we could try to build a list of which file IDs actually have duplicates once up front (and store it in the $sandbox), then ensure we only do the extra stuff once for each of those.

David_Rothstein’s picture

Actually I just realized something. The upgrade path never actually deletes the Drupal 6 {files} table. I think it's deliberately left behind so that contrib modules can use the data in it. So we don't need to introduce a new "system_update_7061_fid_mapping" table after all. All the information anyone needs about the duplicate files is already in the {files} table.

Removing that, combined with a query out front that builds a list of duplicate filepaths, should make this a really simple patch.

I will write a new patch and post it, assuming it works.

David_Rothstein’s picture

FileSize
7.49 KB

It seemed to work. The new patch just changes system.install, but now it requires only a few lines of code.

fgm’s picture

Following @dawehner's request off-issue, I tried to reproduce the problem I had two years ago, but could not find a DB backup still containing it: no way to ascertain whether this specific problem has been fixed, sorry about that.

steinmb’s picture

I also tried to dig back through backups but was unable to find a valid D6 DB backup, though I'll do some more digging.

neclimdul’s picture

The latest patch regresses system_update_7061_fid_mapping, which might be needed by things like filefield migrations.

David_Rothstein’s picture

@neclimdul, see my explanation in #89 for that. To get the mapping, they can essentially just do the same few lines of code that core is doing here, so it didn't seem worth it to me to cache it in a separate database table anymore.
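
For anyone who does need the mapping, those "same few lines" amount to roughly this query against the leftover {files} table on the upgraded D7 database (a hedged sketch mirroring the GROUP BY the patch uses; the lowest fid per filepath is the one the upgrade kept):

SELECT d.fid AS old_fid, k.kept_fid AS new_fid
FROM files d
INNER JOIN (
  SELECT filepath, MIN(fid) AS kept_fid
  FROM files
  GROUP BY filepath
  HAVING COUNT(*) > 1
) k ON k.filepath = d.filepath
WHERE d.fid <> k.kept_fid;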

webchick’s picture

I think one of the problems we're running into here regarding real-world usage is a lot of people will use https://www.drupal.org/project/migrate_d2d for major version upgrades, so if they hit that issue they probably went a different route.

The fact that two of the people previously having this issue no longer do and can't find a backup old enough to show the problem might demonstrate that this does not actually need to be critical. :P Or at least, that it doesn't need to hold up D8, since obviously there is a workaround.

dawehner’s picture

Thank you david for picking this up!

(I have no idea if that could have happened in reality, but given the way this issue has gone it wouldn't be shocking at all.)

You never know :)

Overall it seems this code might be a lot simpler if we integrated it with the main update rather than running it out in front; then we should avoid risks like the above. Would be great to have the simplest possible change in this patch given how many problems this particular update function has had over the years.

Totally agree. I just went with the version where I was able to parse the flow during the update, and I certainly don't like the old version of the patch.

  1. +++ b/modules/simpletest/tests/upgrade/upgrade.upload.test
    @@ -64,12 +64,32 @@ class UploadUpgradePathTestCase extends UpgradePathTestCase {
    +    $this->assertEqual(0, db_query("SELECT COUNT(*) FROM {file_managed} GROUP BY uri HAVING COUNT(fid) > 1")->fetchField());
    

    Good idea to ensure that.

  2. +++ b/modules/system/system.install
    @@ -2803,6 +2803,12 @@ function system_update_7061(&$sandbox) {
    +    $sandbox['duplicate_filepath_fids_to_use'] = db_query("SELECT filepath, MIN(fid) FROM {files} GROUP BY filepath HAVING COUNT(*) > 1")->fetchAllKeyed();
    

    Such a lookup could potentially return a lot of data. Should we maybe do that inline?

  3. +++ b/modules/system/system.install
    @@ -2832,6 +2838,16 @@ function system_update_7061(&$sandbox) {
    +      // If this file has a duplicate filepath, replace it with the first file
    +      // that has the same filepath.
    +      if (isset($sandbox['duplicate_filepath_fids_to_use'][$file['filepath']]) && $record->fid != $sandbox['duplicate_filepath_fids_to_use'][$file['filepath']]) {
    +        $file = db_select('files', 'f')
    +          ->fields('f', array('fid', 'uid', 'filename', 'filepath', 'filemime', 'filesize', 'status', 'timestamp'))
    +          ->condition('f.fid', $sandbox['duplicate_filepath_fids_to_use'][$file['filepath']])
    +          ->execute()
    +          ->fetchAssoc();
    +      }
    

    Oh wonderful, this is indeed much easier to understand!

David_Rothstein’s picture

@webchick as far as I can see from reading this thread, the only workarounds people have come up with involve writing custom SQL queries or custom PHP scripts to massage their Drupal 6 site's data into a format that won't crash. If that's enough to downgrade a bug from critical, we wouldn't have too many critical bugs :)

Since this is an issue with the Drupal 6 database, anyone using migrate_d2d would presumably face something like it (although the symptoms might look different?). I don't see code in that module for dealing with it.

David_Rothstein’s picture

FileSize
7.78 KB
928 bytes

@neclimdul, see my explanation in #89 for that. To get the mapping, they can essentially just do the same few lines of code that core is doing here

However, since neither I nor anyone else actually knew that core leaves behind the {files} table for this purpose, it wouldn't be a bad idea to have a code comment here explaining that. Tried adding one in the attached.

+++ b/modules/simpletest/tests/upgrade/upgrade.upload.test
@@ -64,12 +64,32 @@ class UploadUpgradePathTestCase extends UpgradePathTestCase {
+    $this->assertEqual(0, db_query("SELECT COUNT(*) FROM {file_managed} GROUP BY uri HAVING COUNT(fid) > 1")->fetchField());

Good idea to ensure that.

Thanks, though I'm pretty sure that was code that you wrote :)

Such a lookup could potentially return a lot of data. Should we maybe do that inline?

Well, how big do you think it would have to be before it's a problem? I think it could be large, but not incredibly large. It is a one-dimensional array and I think the maximum number of elements is half the number of rows in the {files} table (so if the site has 1,000,000 files this array could have up to 500,000 elements). That maximum would presumably require a completely pathological setup to achieve, and probably it would be much smaller for real sites. Overall it doesn't sound like a huge problem to me for this to wind up in the batch sandbox (or in memory all at once), but I could be convinced otherwise.

David_Rothstein’s picture

the only workarounds people have come up with involve writing custom SQL queries or custom PHP scripts to massage their Drupal 6 site's data into a format that won't crash.

I take that back, #38 mentions using Migrate D2D. I think with that solution it won't crash the site, though I'm still skeptical that it actually migrates the duplicate references correctly - I would guess it probably winds up dropping some data. Anyway, it's a whole other ballgame :)

catch’s picture

@David I've also not been able to find anything in migrate to handle this, so I opened #2504553: Ensure that Drupal 6 duplicate file names are correctly migrated for the core 6-8 migration which will also have to deal with this issue.

catch’s picture

On finding a database dump, I believe Drupal.org ran into this.

#966210: DB Case Sensitivity: system_update_7061() fails on inserting files with same name but different case made the column binary (case sensitive) and was worked on by drumm as part of the Drupal.org upgrade, because it ran into this issue.

#1691438: Regression: {file_managed}.filename should NOT be binary (and case-sensitive by default) reverted that patch, because it broke filename searches or similar.

Which left this issue to fix the duplicates (whether due to case insensitivity or not).

So if there is a Drupal.org Drupal 6 database dump still around, that should hit the error when running update.php to Drupal 7.

scorchio’s picture

I don't know whether this will help anyone working on the issue, but I'm actually working on a real-world D6 -> D7 migration which triggers this error. I will test #98 ASAP and update.

catch’s picture

@scorchio that would be incredibly helpful - if we're able to confirm this successfully fixes the issue on a real database with the problem, that should allow the patch to get committed to 7.x.

flaviovs’s picture

I am not so sure that fixing this in the D7 upgrade script is the right approach. There is no reliable way to map duplicate fids by just looking at {files}. We have many contrib modules out there using managed files, not to mention user-developed custom modules.

Ignoring duplicate fids means ignoring data. Surely, this data is hidden in the leftover {files} table (which, for me, is a bug by itself on the D7 upgrade script, but that's another issue), but until this data is also migrated by the corresponding module, it is ignored data -- and information will be missing on your site (i.e. it will be broken).

Yes, we can change "all" modules to handle entries in the leftover {files} table. This may look like a straightforward process, but the devil is in the details... For instance, the Upload module is gone, so who is going to fix fids that were left behind in the {upload} table? Also, we should ask ourselves: at this point in the Drupal timeline, should we bloat D7 modules' update scripts to handle (invalid) duplicate files that may be present in {files}? Can we fix all modules? I do not think so.

IMHO, it is a D6 (and below?) bug, so it deserves to be fixed there.

That said, and since I still have a very long list of D6 sites to upgrade to D7, I developed the Duplicate file fixer module, which de-dups entries by generating copies of duplicated files (well, it actually hard-links the files, so disk space is not duplicated). It is a straightforward and pragmatic approach that, at the end of the day, will let you upgrade to D7 without much hassle or loss of data (hopefully).

The module is still, and probably always will be, in alpha, so back up before running it. The module has some heuristics to show more details about a duplicate file (such as checking for node attachments and FileField fields) that may let you fix things by hand, but if you can think of a check that can be done against other modules, open an issue and explain your case. Of course, feedback is more than welcome.

TL;DR version: if "Update #7061" is failing on you, go to https://www.drupal.org/project/d6dupfix and check out a module that may help you.

David_Rothstein’s picture

@scorchio that would definitely be very helpful if you can do that and report back here!

Testing with a Drupal.org database, as @catch suggested, would be a good idea too (in retrospect Drupal.org is basically the poster-child site for running into an issue like this)... but maybe harder to actually arrange in practice.

I think with one manual test on a real site we're definitely in good shape here. Especially since the newer patch is simpler than the previous ones (and looks like it has less of a chance of accidentally breaking something else as a side effect) I would feel better about committing it with less testing than before. But knowing that it actually works on at least one real site and doesn't cause any bad side effects would definitely be helpful.

David_Rothstein’s picture

@flaviovs, thanks for posting that module.

I think we do need some fix for this in Drupal 7 core. I'm not sure if this was actually a bug in Drupal 5 core (and Drupal 6 contrib) or just the way they worked... but either way nothing prevented it so we need to have the code do something to handle this during the upgrade.

On the risk of data loss, I took a look at the Drupal 6 files table and this doesn't seem like a big concern for many of the columns since they are properties of the file itself. I could imagine it being an issue for columns like 'uid' and 'filename' (and maybe 'status', although hopefully in practice none of these duplicates are temporary files). In comparison to a fatal error on upgrade it seems like a lesser concern :) Ultimately we have to do something to fix this, and the alternative of having lots of versions of the same file doesn't sound good either - especially if the site ever installed a module like https://www.drupal.org/project/file_entity which would expose all of them in the user interface.

Perhaps instead of keeping the record from the first file we should keep the last? In other words, change MIN to MAX in this:

+    $sandbox['duplicate_filepath_fids_to_use'] = db_query("SELECT filepath, MIN(fid) FROM {files} GROUP BY filepath HAVING COUNT(*) > 1")->fetchAllKeyed();

That might be better if we expect that in a typical case the newer record is associated with a newer node revision and therefore the most up-to-date representation of the file's actual current properties.
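
For concreteness, the changed line would look like this (a sketch; if we make this change, the fids asserted in the upgrade test would also need to be adjusted to match):

    // Keep the newest duplicate record instead of the oldest one.
    $sandbox['duplicate_filepath_fids_to_use'] = db_query("SELECT filepath, MAX(fid) FROM {files} GROUP BY filepath HAVING COUNT(*) > 1")->fetchAllKeyed();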

flaviovs’s picture

David, see my comment #35. I faced this problem on a site that never had a D5 version, so this is definitely an issue that pops up by itself on D6.

As for the loss of information, perhaps I didn't make myself clear. The information that is going to be missing is the information in the ignored rows of the {files} table. Granted, the underlying data is still there, but unless it gets migrated somehow (NB: and before a conscientious DBA drops the leftover table after the upgrade is "done"), it is lost in the sense that it will not appear on your site.

It does not help to get MAX(fid) instead of MIN(fid).

Let me show you an example to illustrate the issue better: in one of the D6 sites I had to migrate, there were a couple of duplicated entries.

mysql> create unique index test_dups on files (filepath);
ERROR 1062 (23000): Duplicate entry 'sites/default/files/file.jpg' for key 'test_dups'

Now, let's try to see what's on {files} about this entry:

mysql> select * from files where filepath = 'sites/default/files/file.jpg';
+-----+-----+----------+------------------------------+------------+----------+--------+------------+
| fid | uid | filename | filepath                     | filemime   | filesize | status | timestamp  |
+-----+-----+----------+------------------------------+------------+----------+--------+------------+
| 510 |   1 | file.jpg | sites/default/files/file.jpg | image/jpeg |  4222986 |      1 | 1283285068 |
| 530 |   1 | file.jpg | sites/default/files/file.jpg | image/jpeg |   134293 |      1 | 1283349155 |
+-----+-----+----------+------------------------------+------------+----------+--------+------------+
2 rows in set (0.00 sec)

(NB: for some unknown reason, filesize differs.)

Now, after some investigation, I found that those fids are managed by an image field called "field_foo". So, let's try to see where those files are used:

mysql> select * from content_field_foo where field_foo_fid in (510, 530) \G
********************** 1. row **********************
           vid: 112151
           nid: 112147
         delta: 0
 field_foo_fid: 510
field_foo_list: 1
field_foo_data: a:3:{s:11:"description";s:0:"";s:3:"alt";s:0:"";s:5:"title";s:0:"";}
********************** 2. row **********************
           vid: 112152
           nid: 112148
         delta: 11
 field_foo_fid: 530
field_foo_list: 1
field_foo_data: a:3:{s:11:"description";s:0:"";s:3:"alt";s:0:"";s:5:"title";s:0:"";}
2 rows in set (0.00 sec)

As you can see, the same physical file has two entries in {files}, but those entries point to two separate nodes. We're going to have problems here if you do not migrate all entries: e.g. if you migrate only fid = 510, then everything will appear OK when you visit node/112147, but when you go to node/112148, your image file won't be there.

✦ ✦ ✦

The Duplicate file fixer (DFF) module takes the approach that, since technically two rows in {files} are meant to be two separate files -- but aren't being kept that way due to a bug somewhere -- it makes sense to fix the issue by duplicating the files under different names and de-duping the entries. Sure, we all know it is the same file, but why bother? In our case, we certainly have a bunch of image and PDF files that content managers keep uploading again and again to the same site to use in different contexts, so we are screwed already. Well... DFF hard-links the files, so at least disk space is not wasted.

Yes, this is definitely a hack, but I am afraid that it is the most sensible thing to do.

David_Rothstein’s picture

@flaviovs, as far as I can see, the D6->D7 core upgrade only moves over files that are controlled by core (i.e. those that were added to the site by the core Upload module).

So in your example, neither the file with fid 510 nor the one with 530 will be moved from the {files} table to the {file_managed} table. They are both left behind, and it's up to the CCK/Filefield/Imagefield upgrade path to decide how to move them over, along with all other files that are associated with a CCK field. That behavior is the same with or without the patch in this issue.

Fabianx’s picture

Issue tags: +D8 Accelerate

#107: You are right that Duplicate File Fixer is the right approach for a contrib, one-size-fits-all solution.

However, core cannot and will not support the CCK upgrade path; it will only fix the {upload} table upgrade path.

We still need to provide a patch here so that people can upgrade their sites.

Compromise proposal:

- What about we show a message when duplicate files have been detected:

Duplicate file xyz had duplicate records 1,4,6 and was reduced to 15. [warning]

There had been duplicate files in the {files} table. If you only used the core {upload} module those have been fixed automatically, if you used contrib modules like the contributed file_field or image_field modules, please see {hand book page}. [warning]

In the handbook page we can then legitimately point to Duplicate file fixer as another possibility that can be run pre-upgrade...

Thoughts?

flaviovs’s picture

So in your example, neither the file with fid 510 nor the one with 530 will be moved from the {files} table to the {file_managed} table. They are both left behind, and it's up to the CCK/Filefield/Imagefield upgrade path to decide how to move them over, along with all other files that are associated with a CCK field. That behavior is the same with or without the patch in this issue.

David, I do not think that letting non-core modules do a low-level upgrade from legacy {files} entries to core tables is a good idea. IMHO, {files} must be migrated by the core update, with all fids kept unchanged. Only then should contrib/custom modules update their tables. Since fids don't change, they don't even need to touch the core tables (just use file_usage_add() to signal that their files are being used).
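
As a rough sketch of that idea (assuming, unlike what the current upgrade actually does, that every D6 fid had survived unchanged into {file_managed}), a contrib update hook would then only need something like the following; the {mymodule_attachment} table and its columns are hypothetical:

function mymodule_update_7001() {
  // Register this module's file references without touching {file_managed}.
  $result = db_query("SELECT nid, fid FROM {mymodule_attachment}");
  foreach ($result as $record) {
    if ($file = file_load($record->fid)) {
      // Record the usage so the file is not treated as unused.
      file_usage_add($file, 'mymodule', 'node', $record->nid);
    }
  }
}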

Maybe that's the difference in POV that makes us disagree. From my POV, if we ignore some fids, how can a contrib module upgrade?

(...)
- What about we show a message when duplicate files have been detected:

Duplicate file xyz had duplicate records 1,4,6 and was reduced to 15. [warning]

There had been duplicate files in the {files} table. If you only used the core {upload} module those have been fixed automatically, if you used contrib modules like the contributed file_field or image_field modules, please see {hand book page}. [warning]
(...)

Sounds good to me. The important thing here is to pause and warn the user that dups were found, and that the upgrade process is unable to handle all cases. We definitely do not want an upgrade that leaves a site with random files "missing" without a warning!

David_Rothstein’s picture

@flaviovs, I think you should create another issue if you want to propose that core migrate files it doesn't control. Again, the patch here is not changing that one way or another.

A drupal_set_message() that indicates the duplicate files sounds OK as there is some small possibility that the data from duplicate files isn't combined correctly (see my comment in #106). The message could get long though; maybe it should just be in the return value of the update function instead? Also since we're not touching the contrib-controlled files at all I don't think it needs to mention duplicates found there. I.e. if this part of the suggested message were 100% true:

If you only used the core {upload} module those have been fixed automatically,

...then there wouldn't be a point in having a message at all. But as I pointed out above, there is a chance that the upgrade can't fix them perfectly.

I don't see it as a blocker for this issue personally, but if someone wants to work on a message and a handbook page it sounds reasonable.
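
If someone does pick up the message as a follow-up, here is a sketch of what it could look like inside system_update_7061() (wording and placement are illustrative only, not part of the patch):

// After the duplicate map has been built in the sandbox.
if (!empty($sandbox['duplicate_filepath_fids_to_use'])) {
  drupal_set_message(t('@count file paths had more than one record in the Drupal 6 files table; only one record per path was migrated.', array('@count' => count($sandbox['duplicate_filepath_fids_to_use']))), 'warning');
}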

David_Rothstein’s picture

Also any thoughts on my question in #106 about switching from MIN to MAX?

In other words, if the Drupal 6 data looks like this:

mysql> select fid,uid,filename,filepath from files where filepath = 'sites/default/files/file.jpg';
+-----+-----+--------------+------------------------------+
| fid | uid | filename     | filepath                     |
+-----+-----+--------------+------------------------------+
| 510 |   1 | file.jpg     | sites/default/files/file.jpg |
| 530 |   2 | file-new.jpg | sites/default/files/file.jpg |
+-----+-----+--------------+------------------------------+

Should the migrated file in Drupal 7 actually use uid 2 and filename "file-new.jpg" (based on the idea that the most recent entry is more likely to be the up-to-date one)?

Fabianx’s picture

Status: Needs review » Needs work

#110: We cannot stop easily, but we can show the warning at least after the upgrade is done.

We do leave {files} alone, we only change {files_managed}. Also you need to make a backup before upgrading D6 anyway and we can add to the handbook page.

So the message should be fine, changing to CNW for that change.

flaviovs’s picture

@flaviovs, I think you should create another issue if you want to propose that core migrate files it doesn't control.

Well... {files} is a core table used to control all files in a Drupal site. Drupal gives fids to contrib so that they can store and refer to managed files later, so I do not agree with the "files it doesn't control" part.

But...

Migrating {files} is what I thought core already did. I was inspecting the upgrade routines and got a shocking surprise -- it does not. So my POV (although technically sound, IMHO) does not apply, and there's no room to change this at this point in time. So let's move on here...

Also any thoughts on my question in #106 about switching from MIN to MAX?
(...)
Should the migrated file in Drupal 7 actually use uid 2 and filename "file-new.jpg" (based on the idea that the most recent entry is more likely to be the up-to-date one)?

That's my point. You cannot safely migrate by just looking at {files}. It does not matter which entry you migrate; what matters is the entries that you do not migrate, because there may be some module, somewhere, which expects a fid that won't be there. The uid issue you pointed out just highlights the problem.

See my example above, which I consider more complex:

  • If you migrate file entry A, node X will be OK, but node Y will be broken.
  • If you migrate file entry B, node Y will be OK, but node X will be broken.

Fabianx’s picture

We look at nodes - if you read the patch.

We only take care of the files uploaded by the Upload module, as that is the only thing that Drupal 6 core supported.

Therefore I still think a quick warning (not error) is a good compromise.

pfrenssen’s picture

Assigned: Unassigned » pfrenssen

I'm going to dig into this fun issue :P

pfrenssen’s picture

I fired up a clean installation of Drupal 6.36 and can easily generate different entries with the same filepath in the files table:

  1. Go to /admin/user/settings and enable "Picture support".
  2. Go to /user/1/edit and upload a user picture.
  3. Go back to /user/1/edit and delete the user picture.
  4. Clear all files from the temporary folder.
  5. Go back to /user/1/edit and upload the same user picture again.

I now end up with two identical entries in the files table, only the fid and timestamp are different.

izmeez’s picture

@pfrenssen The results in #117 sound like a bug in D6, don't you think?

pfrenssen’s picture

Issue tags: +D8 Accelerate London

@izmeez, no, in Drupal 6 this was allowed, but not in Drupal 7, which is the reason we have this issue :)

I just provided that example to illustrate how easy it is to end up with a database in D6 that would expose this issue.

pfrenssen’s picture

Actually, the example that I provided puts duplicate file entries in the table, but these do not trigger the integrity constraint violation: the user pictures are updated in user_update_7012(), which gets its data from the {users} table, not the {files} table.

pfrenssen’s picture

I found a way to replicate it in D6 core with the Upload module too:

  1. Enable the Upload module.
  2. Create an article and attach an image to it.
  3. Remove the uploaded file from the sites/default/files folder.
  4. Create a second article and attach the same image to it.

The image will be present twice in the database with the same filepath. If the site is then upgraded to Drupal 7 the integrity constraint violation is thrown.

It is of course not recommended to delete uploaded files, but you can imagine this happening in reality, especially if the site administrators have access to the files on the production server, or if a module such as IMCE is installed.

pfrenssen’s picture

Tested the patch with this real-life D6 database and it solves the issue. The corrected file ID is visible in both nodes. The entries in the {file_managed} and {file_usage} tables are correct.

I reviewed the patch and it looks really solid. As mentioned by others, I'm also not too concerned that the queries are done in a loop. This is a batched operation, and the duplicated files can be expected to affect only a small percentage of the total number of files. This will have little impact on the total duration of the upgrade.

I also think that @flaviovs' concerns are not applicable in this context. The patch solves the problem responsibly: it only loops over the files from the Upload module and fixes any duplicates properly. Any duplicate files that might be relied upon by contrib modules are still intact in the legacy {files} table. Contrib modules can use this data in their own update hooks.

  1. +++ b/modules/simpletest/tests/upgrade/drupal-6.upload.database.php
    @@ -127,6 +127,38 @@ db_insert('files')->fields(array(
    +// On some Drupal 6 sites, more than one file can have the same filepath. See
    +// https://www.drupal.org/node/1260938.
    

    Better to use the @see syntax on a new line, so that this will be linked properly in the documentation on api.drupal.org.

  2. +++ b/modules/simpletest/tests/upgrade/upgrade.upload.test
    @@ -64,12 +64,32 @@ class UploadUpgradePathTestCase extends UpgradePathTestCase {
    +    // Ensure that duplicate filepaths are reduced.
    

    The 'reduced' seems to imply that the problem is improved but not completely solved. Maybe better to say "Ensure that filepaths are deduplicated".

  3. +++ b/modules/system/system.install
    @@ -2803,6 +2803,16 @@ function system_update_7061(&$sandbox) {
    +    $sandbox['duplicate_filepath_fids_to_use'] = db_query("SELECT filepath, MIN(fid) FROM {files} GROUP BY filepath HAVING COUNT(*) > 1")->fetchAllKeyed();
    

    I would indeed change this to use MAX() instead as was suggested by @David_Rothstein, so that we will have the filesize and timestamp of the latest version of the file, not of the original one. This will be more likely to be correct.

    If we change this, then the documentation of this line should also be updated, along with the fids in the test coverage.

dawehner’s picture

Status: Needs work » Needs review
FileSize
8.01 KB
3.57 KB

I would indeed change this to use MAX() instead as was suggested by @David_Rothstein, so that we will have the filesize and timestamp of the latest version of the file, not of the original one. This will be more likely to be correct.

That is a good point, IMHO. Let's ensure that we have the needed test coverage just to be sure it actually works that way.

David_Rothstein’s picture

FileSize
7.97 KB
2.62 KB
  1. +// @see https://www.drupal.org/node/1260938.
    

    I'm pretty sure @see is only for docblocks - see https://www.drupal.org/coding-standards/docs#see (or should I say "@see https://www.drupal.org/coding-standards/docs#see" ? :)

    For an inline code comment there's no reason @see would be used for anything on api.drupal.org, so it's better to have it be a readable sentence.

  2. +    // We use MAX(fid) to retrieve the latest filesize and timestamp.
    

    I think we actually just need to fix the documentation elsewhere (which is now incorrect) rather than add a new sentence.

I fixed both the above in the attached. The rest of the changes look good to me.

Fabianx’s picture

Status: Needs review » Reviewed & tested by the community

As there is consensus this is good enough for core, we should go ahead.

=> RTBC (yes!)

Can we still update the handbook page to at least mention the other module?

Thanks,

Fabian

pfrenssen’s picture

Assigned: pfrenssen » Unassigned

Changes look good, +1 to RTBC.

David_Rothstein’s picture

@Fabianx, so just to make sure, earlier you thought this needed some messaging to the user who is running update.php to inform them that duplicate files were replaced - do you not think this anymore?

Would still be great to see the final version tested on a real site that experienced this problem, but if that can't happen at some point we'll need to commit it without that.

Fabianx’s picture

I would still love a message and/or a handbook update, but neither is critical for me, and the majority thinks that core should fix what it supports and mainly inform contrib that they need to adjust their upgrade path, too.

e.g. CCK

So I think a message could be a follow-up, but at least a handbook update should be made, as it's otherwise very hard to find solutions to the problem.

scorchio’s picture

Sorry for not reporting back earlier - I had to apply a manual workaround to let the mentioned project continue. (BTW, thanks to webchick for pinging me.)

Today I tried to redo the upgrade from scratch using 7.38 + #124 but when I started to migrate the fields with content_migrate, I hit the same issue again. In the {files} table, I have entries like this:

mysql> SELECT filepath, fid FROM `files` group by filepath having count(filepath) > 1;
+----------------------------------------------------------------+------+
| filepath                                                       | fid  |
+----------------------------------------------------------------+------+
| sites/default/files/ecard-large/iStock_000003150555Medium.jpeg |  870 |
| sites/default/files/ecard-large/iStock_000005822894Medium.jpeg |  868 |
| sites/default/files/ecard-large/iStock_000005873270Medium.jpeg |  898 |
| sites/default/files/ecard-large/iStock_000008664370Medium.jpeg |  874 |
| sites/default/files/ecard-large/iStock_000010557900Medium.jpeg |  865 |
| sites/default/files/ecard-large/iStock_000010806700Medium.jpeg |  876 |
| sites/default/files/ecard-large/iStock_000010854956Medium.jpeg |  867 |
| sites/default/files/ecard-large/iStock_000011172968Medium.jpeg |  866 |
| sites/default/files/ecard-large/iStock_000011742381Medium.jpeg |  897 |
| sites/default/files/ecard-large/iStock_000012045994Medium.jpeg |  872 |
| sites/default/files/ecard-large/iStock_000012134200Medium.jpeg |  882 |
| sites/default/files/ecard-large/iStock_000012153967Medium.jpeg |  873 |
| sites/default/files/ecard-large/iStock_000012320370Medium.jpeg |  883 |
| sites/default/files/ecard-large/iStock_000012585730Medium.jpeg |  877 |
| sites/default/files/ecard-large/iStock_thumbgws_0.jpeg         | 1265 |
| sites/default/files/header_imgs/classes.jpg                    |  192 |
| sites/default/files/header_imgs/donate_3.jpg                   |  948 |
| sites/default/files/header_imgs/find_physician_2.jpg           | 1653 |
| sites/default/files/header_imgs/forphysicians_12.jpg           | 1177 |
| sites/default/files/header_imgs/online_1.jpg                   |  757 |
| sites/default/files/header_imgs/online_1_42.jpg                | 1542 |
| sites/default/files/header_imgs/online_1_43.jpg                | 1564 |
| sites/default/files/header_imgs/online_1_44.jpg                | 1565 |
| sites/default/files/header_imgs/online_1_45.jpg                | 1566 |
| sites/default/files/home_slideshow_large/joint-bg-2.jpg        |  225 |
| sites/default/files/march_maintenance_brochure.pdf             | 1141 |
| sites/default/files/related_uploads/Healthline_Winter_2010.pdf |  511 |
| sites/default/files/service-icons/icon_cancer.gif              |   55 |
| sites/default/files/service-icons/icon_cardio.gif              |   54 |
| sites/default/files/service-icons/icon_emergency.gif           |   59 |
| sites/default/files/service-icons/icon_joint.gif               |   56 |
| sites/default/files/service-icons/icon_spine.gif               |   57 |
| sites/default/files/service-icons/icon_women.gif               |   58 |
+----------------------------------------------------------------+------+

Take fid 192 for example:

SELECT vid, nid, field_header_image_fid FROM `content_field_header_image` WHERE `field_header_image_fid` = 192;
+-------+-------+------------------------+
| vid   | nid   | field_header_image_fid |
+-------+-------+------------------------+
| 36275 | 13759 |                    192 |
| 34153 | 13759 |                    192 |
| 89072 | 13759 |                    192 |
+-------+-------+------------------------+
3 rows in set (0,00 sec)

What am I missing here?

David_Rothstein’s picture

@scorchio and I talked about this on IRC earlier. The above is a CCK field, so it's migrated by CCK rather than by Drupal core - that's why the patch here doesn't fix it.

However, the above definitely does prove that CCK needs a similar fix. @scorchio filed #2529562: Filefield D6 -> D7 migration: Integrity constraint violation - Duplicate entry for key 'uri' and I think the existing issue #1176186: D6 -> D7 upgrade: Duplicate files cause Integrity constraint violation: 1062 Duplicate entry 'xx' for key 'PRIMARY' might be about this also.

webchick’s picture

Good catch. Thanks so much for doing the testing!!

However, I believe that since those issues are coming from a contrib module, they do not qualify as blockers for D8, yes?

flaviovs’s picture

@scorchio,

Sorry for not reporting back earlier - I had to apply a manual workaround to let the mentioned project continue. (BTW, thanks to webchick for pinging me.)

Today I tried to redo the upgrade from scratch using 7.38 + #124 but when I started to migrate the fields with content_migrate, I hit the same issue again. In the {files} table, I have entries like this:
(...)
What am I missing here?

As David said, this problem is generated by a CCK field, so you will have to wait for a patch to come up for that issue. The patch discussed here only migrates files managed by the D6 core Upload module.

You may want to take a look at the Duplicate file fixer module, which takes another approach. It lets you fix the duplicates manually, or can be used to de-dup the entries automatically in a module-agnostic way (disclaimer: I am the author of DFF).

In your case, by using DFF you will have easy access to the old node revisions that have duplicated file pointers, and you may choose to delete them if they are not needed anymore. If you want to keep them, the module can de-dup the entries by linking new files so that you can proceed with the upgrade.

webchick’s picture

@David_Rothstein, just curious if you think this one might make the 7.x bug fix release window for Sept. 2?

We'd all feel more comfortable with real-world site testing, but after publicizing this issue on Twitter, the one person we could find who a) reported hitting this in the real world and b) still had a backup recent enough to include the problem (others had long since used workarounds to bypass the issue) turns out not to actually have this specific issue (or rather they do, but with a non-core module). There are also 48 subscribers to this issue, presumably most of them with a vested interest in having this fixed, and no one else has spoken up yet to say they've tried the patch and it works, despite it being available for literally a month. :\ I'm kind of out of ideas on how to solicit more real-world testing.

There was testing done by @pfrenssen on a real D6 database back in #122, so we at least have some sort of believable evidence it works. Do you feel that's sufficient to move forward here?

drzraf’s picture

I was bitten by this bug. The root issue is that filenames were duplicated, and this was possible due to (I don't remember well) either non-uniqueness of the MySQL column or case sensitivity of the filename.
I once guessed these duplicated filenames were caused by image_attach, but I'm not sure about that either.

I gave up on this upgrade path long ago (although I spent hundreds of hours of my free time, over almost two years, for an NGO, trying to overcome all the issues related just to the D6-D7 upgrade path).
But with almost 40 modules it was just impossible. I kept the site running as best it could on this ageing D6, hoping for a better future.
I won't put a single finger into this Drupal-upgrade-hell again or look at an SQL dump unless I'm strongly confident that it will lead to a successful upgrade.
And if that time does not come I won't touch Drupal anymore until I'm forced to, and will move to something else. [The frustration over the time lost remains.]

Side note: the time it takes for maintainers/code authors to get an eye on bugs is the main issue here.
Many bugs were opened 2+ years ago, but the strategy was "ignore them so that their importance is reduced as workarounds are discovered".
Starting two (or more) years later to try to resolve them just makes you lose two or three times more time than if the issue had been considered in the first place.

good luck anyway

drzraf’s picture

FileSize
8.54 KB

anyway...
but I can't guarantee the correctness of the column joins

catch’s picture

@drzraf, people getting stuck on 6-7 upgrades is a big reason why #2030501: [meta] Ensure that Drupal 6 sites have a functional upgrade path to either Drupal 7 or 8 before Drupal 6 loses security support (and hence this issue) was made a release blocker for 8.x. It's also why the 6.x-8.x and 7.x-8.x upgrade paths will be using migrate rather than hook_update_N().

Thanks for the sample data.

David_Rothstein’s picture

@webchick: Yeah, I think this should definitely go into the next release either way (that's basically what I was aiming for in #127).

@drzraf: This issue was marked critical in June 2013 - if you ran into it before that you could have marked it critical earlier? As far as I know it is the last remaining critical issue with the Drupal 6-to-7 core upgrade path (and even then it only seems to affect a relatively small number of sites) but if there are others you know of that would qualify as critical you should definitely mark them as such. Unfortunately it sounds like you may have run into problems with contrib modules too, which is a whole other ballgame...

jelo’s picture

I commented in #55 that I had this issue. However, it may have been caused by filefield_sources. Anyway, I have a D6 backup that has duplicate file paths in the files table. When I run this query SELECT count(*), min(fid), max(fid), filepath FROM files GROUP BY filepath HAVING count(*) > 1 ORDER BY count(*); I get 42 duplicate files in the backup. I upgraded from 6.22 to 7.32 or something like that at that time.

So I went and rebuilt the old site and prepared everything for the upgrade (following the steps closely), which now puts me at an update from 6.36 to 7.38. As soon as I try to run the main update with update.php I run into this error:
PDOException: SQLSTATE[42S22]: Column not found: 1054 Unknown column 's.ssid' in 'where clause': SELECT u.*, s.* FROM {users} u INNER JOIN {sessions} s ON u.uid = s.uid WHERE s.ssid = :ssid; Array ( [:ssid] => xxxxxxxxxxxxxxxxx ) in _drupal_session_read() (line 92 of /includes/session.inc).

Is this a bug that was introduced? My D6 sessions table has a column called sid, but not ssid.

Jens

dawehner’s picture

I think you need to use the alternative way to get to /update.php:

- Open settings.php with a text editor.

- Find the line that says:
$update_free_access = FALSE;

- Change it into:
$update_free_access = TRUE;

jelo’s picture

@dawehner: that doesn't make a difference; same error. Because of this error, I cannot test the patch provided here. I may just have to log a new issue about the upgrade path and this error, which I did not encounter when upgrading the site from a previous 6.x version to a previous 7.x version.

webchick’s picture

The update_fix_d7_requirements() function should be taking care of this. The code is at http://cgit.drupalcode.org/drupal/tree/includes/update.inc?h=7.x#n763.

There do seem to be some other people reporting this ssid problem at #1008352: Upgrade D6.20 to D7 fails with PDO Exception. Comment #57 there especially looks like a good list. In general, it seems to happen if you've tried to re-do the D6 -> D7 upgrade multiple times without restoring from clean D6 backup in between each run. None of the logic in that function will fire if it thinks the system's already been upgraded (has a system.schema value >= 7000).
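
A quick way to check whether a database has already been partially upgraded (a sketch; run it from the Drupal 7 codebase, e.g. with drush php-eval, before retrying update.php):

// Anything >= 7000 means the D7 update code already considers the site upgraded,
// so restore a clean Drupal 6 backup before trying again.
$schema_version = db_query("SELECT schema_version FROM {system} WHERE name = 'system'")->fetchField();
print $schema_version >= 7000 ? "Partially upgraded - restore a clean D6 backup\n" : "Looks like an untouched D6 database\n";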

David_Rothstein’s picture

@jelo, thanks for trying to test this!

Are you running updates over an https:// URL, by any chance? Looking through #1008352: Upgrade D6.20 to D7 fails with PDO Exception it seems like that's the actual condition that triggers the bug. The two possible workarounds are:

  1. Run updates over http:// instead, or
  2. Set $update_free_access = TRUE in settings.php and log out of the Drupal 6 site before running the updates so that you run them as the anonymous user (just changing settings.php by itself isn't enough).

If this is the problem, we really should retitle the above issue to focus on it and up the priority to "Major". (I don't have time to test it right now, but I assume the fix would involve moving the code that's currently in update_fix_d7_requirements() to somewhere that runs a little earlier in the bootstrap.) But I don't believe it's related to this issue.

jelo’s picture

@webchick: the backup I am using has never been upgraded before and I get the error the first time hitting update.php.

David: that may be it. I did run it over https. I am on my way out on holiday. If I can find some time to fire up that old backup again, I will test it over http to check if I can get past this problem to test the patch from this issue.

webchick’s picture

Spoke to David today; he said there will not be a Sep. 2 bug fix release of D7. :( The earliest opportunity is now in October. He did say it could potentially be committed sooner than that, however.

David_Rothstein’s picture

Status: Reviewed & tested by the community » Fixed
Issue tags: +7.40 release notes

OK, people are really itching to get this committed and it has been RTBC for long enough. It would still be great to see real-world testing (feel free to post here with that even afterwards - would still be very helpful). But it looks like it's probably ready to go.

Therefore, committed to 7.x - thanks!

Above there was some discussion of having a handbook page about this issue (which could mention https://www.drupal.org/project/d6dupfix or other solutions as well). I'm not sure what the status of that is, but anyone should feel free to create one and if necessary we could link to it from somewhere within core (for example, a message displayed to sites that have this problem, as was suggested earlier).

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.