Currently, if a new file has the same name as an existing one, the filename is extended with a number, so "myname.png" becomes "myname_1.png", "myname_2.png", and so on up to "myname_n.png". In many cases users upload the same file several times. For example, if an online shop has n similar products, the same picture is often uploaded n times.
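To make the current behaviour concrete, here is a rough sketch of the renaming described above. It is Python rather than Drupal code, `unique_filename` is a hypothetical helper, and the counter starts at 1 only because the example above does; the real implementation may differ in its details.

```python
import os


def unique_filename(name, existing):
    """Return `name` if it is unused, else insert _1, _2, ... before the extension.

    `existing` is the set of filenames already present in the target directory.
    Hypothetical helper for illustration only, not a Drupal function.
    """
    if name not in existing:
        return name
    stem, ext = os.path.splitext(name)
    counter = 1
    while f"{stem}_{counter}{ext}" in existing:
        counter += 1
    return f"{stem}_{counter}{ext}"


# Uploading the same picture three times leaves three separately named copies.
stored = set()
for _ in range(3):
    stored.add(unique_filename("myname.png", stored))
```

Note that nothing in this scheme looks at the file's content, which is why identical uploads pile up under different names.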

Proposal: Add an option on FileField fields to check for duplicates. This would reduce disk space usage and speed up page loading for visitors. Maybe we could even compute an MD5 hash in JavaScript to check the file before uploading it?

Additional thoughts: To make things easier (access rights, searching across several tables), I would propose searching only within the filefields table.

Note: I'm not familiar with the D7 DB schema and couldn't check right now; the idea came up while working with D6 FileField.

Comments

yched

Component: field system » file.module

Re-categorizing

pjcdawkins

This would be useful for my purposes.

IMCE already has the option (under Common settings, at admin/config/media/imce) to replace existing files. I think it's based on the filename only. Checking only the filename would be crazy: it would produce far too many false positives (I can imagine multiple editors uploading different images all named logo_for_website.jpg).

I guess we could check the file hash (MD5, SHA-1, etc.) server-side after upload, and (seamlessly/transparently) just use the existing file if it's identical. We probably shouldn't attempt this at all for private files, but it sounds feasible for public managed files.
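As a minimal sketch of what comparing by hash rather than by filename could look like (Python for illustration, not Drupal code; `content_digest` and `write_temp` are hypothetical helpers, and SHA-256 is just one possible choice of algorithm):

```python
import hashlib
import os
import tempfile


def content_digest(path, algorithm="sha256", chunk_size=65536):
    """Hash a file's bytes in chunks so large uploads are not read into memory at once."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_temp(directory, name, data):
    """Demo helper: write `data` to directory/name and return the path."""
    path = os.path.join(directory, name)
    with open(path, "wb") as handle:
        handle.write(data)
    return path


with tempfile.TemporaryDirectory() as tmp:
    first = write_temp(tmp, "logo_for_website.jpg", b"image bytes A")
    renamed_copy = write_temp(tmp, "logo_copy.jpg", b"image bytes A")
    other = write_temp(tmp, "logo_for_website_1.jpg", b"image bytes B")

    # Same bytes under different names: the digests match.
    same_content = content_digest(first) == content_digest(renamed_copy)
    # Similar names but different bytes: the digests differ.
    different_content = content_digest(first) != content_digest(other)
```

The point of the demo is that the digest depends only on the bytes, so the logo_for_website.jpg false positive goes away while genuinely identical files are still detected.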

It sounds like we could only do it for file fields, not for IMCE etc, because IMCE can have directories reserved for particular users only.

There's already a File Hash module, which claims to save hashes for all managed files. I haven't tried it out.

By the way, the current FileField Sources module goes some way towards reducing duplicates, because it provides a UI for editors to choose existing files. In my experience, though, most editors can't be bothered: they just upload again, since it's just as quick.

pjcdawkins

So, there is the existing FILE_EXISTS_REPLACE "replace behaviour" for <a href="http://api.drupal.org/api/drupal/core%21includes%21file.inc/function/file_save_upload/8">file_save_upload()</a> and file_destination(), which is an option alongside FILE_EXISTS_RENAME and FILE_EXISTS_ERROR.

The problem is, FILE_EXISTS_REPLACE checks only the file URI, which is a more or less useless way of checking whether a file already exists. As far as I can see, it even re-uses a file ID without checking whether the old entry points to a file with the same domain (public/private), or whether the current user has view/edit access to that file.

These seem to be the options:

  • add a big patch to file.inc, with a new replace behaviour constant (e.g. FILE_HASH_EXISTS_REPLACE), and perhaps rename FILE_EXISTS_REPLACE to FILE_URI_EXISTS_REPLACE for clarity.
  • add a small patch to file.inc introducing some new hooks, allowing this behaviour to be implemented in a contrib module.

I think it needs to be the latter. The contrib module could depend on File Hash for calculating SHAs for uploaded and existing files. Obviously some kind of cron / batch operation would be needed for existing files. The workflow might be:

  1. File is uploaded.
  2. Calculate its hash (probably SHA).
  3. Compare hash with the hashes of existing files, filtering to only those files with the same access rights as those intended for the uploaded file.
    1. If hash isn't found elsewhere, save the uploaded file (as normal) to filesystem and database.
    2. If the hash is already tied to an existing file, validate that file (e.g. check its FS permissions are still good), and load that file's full details from DB.*
  4. Return $file object as normal.

* The original (oldest) file details will be loaded, assuming that the most authoritative owner of a file is the first person to have uploaded it. But there could be a log of ALL uploads, e.g. a simple database table containing (fid, uid, upload_time) columns, so you could even say how many times each file has been uploaded.
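To make the workflow easier to discuss, here is a rough in-memory sketch of it. Everything in it is an assumption for illustration: it is Python rather than Drupal code, `ManagedFile` and `FileStore` are hypothetical names, a plain "scope" string stands in for the access-rights filter, SHA-256 stands in for "probably SHA", and the validation of the existing file on disk (step 3.2) is left out.

```python
import hashlib
import time
from dataclasses import dataclass, field


@dataclass
class ManagedFile:
    fid: int
    uri: str
    scope: str   # e.g. "public" or "private"; stands in for access rights
    digest: str


@dataclass
class FileStore:
    files: dict = field(default_factory=dict)       # fid -> ManagedFile
    by_digest: dict = field(default_factory=dict)   # (scope, digest) -> fid of oldest file
    upload_log: list = field(default_factory=list)  # (fid, uid, upload_time) rows
    next_fid: int = 1

    def save_upload(self, data, uri, scope, uid):
        digest = hashlib.sha256(data).hexdigest()   # step 2: hash the upload
        fid = self.by_digest.get((scope, digest))   # step 3: look up within the same scope
        if fid is None:
            # Step 3.1: unknown content, so save it as a new file.
            fid = self.next_fid
            self.next_fid += 1
            self.files[fid] = ManagedFile(fid, uri, scope, digest)
            self.by_digest[(scope, digest)] = fid
        # Step 3.2: known content, so the oldest matching record is reused as-is.
        # Every upload is logged, including repeats of an existing file.
        self.upload_log.append((fid, uid, time.time()))
        return self.files[fid]                      # step 4: return the file record

    def upload_count(self, fid):
        """Number of times the file with this fid has been uploaded."""
        return sum(1 for row in self.upload_log if row[0] == fid)
```

For example, two users uploading the same bytes as public files would both get back the first user's record, while the same bytes uploaded as a private file would get a separate record because the scope differs.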

Any thoughts?

swentel

Version: 8.0.x-dev » 8.1.x-dev

Drupal 8.0.6 was released on April 6 and is the final bugfix release for the Drupal 8.0.x series. Drupal 8.0.x will not receive any further development aside from security fixes. Drupal 8.1.0-rc1 is now available and sites should prepare to update to 8.1.0.

Bug reports should be targeted against the 8.1.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.1.x-dev » 8.2.x-dev

Version: 8.2.x-dev » 8.3.x-dev

Version: 8.3.x-dev » 8.4.x-dev

Version: 8.4.x-dev » 8.5.x-dev

Version: 8.5.x-dev » 8.6.x-dev

Krzysztof Domański