Currently, if a new file has the same name as an existing one, the filename is extended by a number, so "myname.png" becomes "myname_1.png", "myname_2.png", ... "myname_n.png". In many cases users upload the same file several times. For example, if an online shop has n similar products, the same picture is often uploaded n times.
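To make the current renaming behaviour concrete, here is a minimal sketch in Python. It is not Drupal code; the function name `rename_on_collision` and the in-memory set of names are assumptions made for illustration only.

```python
import os

def rename_on_collision(filename, existing):
    """Return a name not in `existing`, appending _1, _2, ... before the extension.

    Hypothetical helper: it mimics the behaviour described above, not Drupal's code.
    """
    if filename not in existing:
        return filename
    stem, ext = os.path.splitext(filename)
    counter = 1
    while f"{stem}_{counter}{ext}" in existing:
        counter += 1
    return f"{stem}_{counter}{ext}"

# Uploading the same picture three times yields three separate files.
existing = set()
for _ in range(3):
    existing.add(rename_on_collision("myname.png", existing))
print(sorted(existing))  # ['myname.png', 'myname_1.png', 'myname_2.png']
```

Every repeated upload ends up stored under a new name, which is the duplication this issue is about.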
Proposal: Add an option on FileField fields to check for duplicates. This would reduce disk space usage and speed up page loading for visitors. Maybe we could even compute an MD5 hash in JavaScript to check the file before uploading it?
Additional thoughts: To make things easier (access rights, searching across several tables), I would propose searching only within the filefields table.
Note: I'm not familiar with the D7 DB schema and couldn't check right now; the idea came while working with D6 FileField.
Comments
Comment #1
yched commented: Re-categorizing
Comment #2
pjcdawkins commented: This would be useful for my purposes.
IMCE already has an option (under Common settings, at admin/config/media/imce) to replace existing files. I think it's based on the filename only. Checking only the filename would be crazy: it would produce far too many false positives (I can imagine multiple editors uploading different images all named logo_for_website.jpg).
I guess we could check the file hash (MD5, SHA-1, etc.) server-side after upload, and (seamlessly/transparently) just use the existing file if it's identical. We probably shouldn't attempt this at all for private files, but it sounds feasible for public managed files.
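A rough sketch of that server-side check, in Python rather than Drupal PHP. Everything named here (`ManagedFiles`, `save_upload`, the in-memory dictionary) is hypothetical, and SHA-256 is just one possible choice alongside the MD5 and SHA-1 mentioned above.

```python
import hashlib

class ManagedFiles:
    """Toy in-memory stand-in for a managed-file table (hypothetical, not Drupal's API)."""

    def __init__(self):
        self.by_hash = {}   # content hash -> fid of the first public upload
        self.next_fid = 1

    def _new_fid(self):
        fid = self.next_fid
        self.next_fid += 1
        return fid

    def save_upload(self, data, public=True):
        """Return (fid, reused). Private uploads are never deduplicated."""
        if not public:
            return self._new_fid(), False
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.by_hash:
            # Identical content already stored: reuse the existing file.
            return self.by_hash[digest], True
        fid = self._new_fid()
        self.by_hash[digest] = fid
        return fid, False

files = ManagedFiles()
first = files.save_upload(b"logo bytes")                      # new file
again = files.save_upload(b"logo bytes")                      # same content, reused
other = files.save_upload(b"different bytes")                 # different content
private = files.save_upload(b"logo bytes", public=False)      # private, never reused
print(first, again, other, private)
```

Because the lookup key is the content hash and not the name, two different images that are both called logo_for_website.jpg would still be stored separately.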
It sounds like we could only do it for file fields, not for IMCE etc, because IMCE can have directories reserved for particular users only.
There's already a File Hash module, which claims to save hashes for all managed files. I haven't tried it out.
By the way, the current FileField Sources module goes some way towards reducing duplicates, because it provides a UI for editors to choose existing files (although in my experience most can't be bothered; they just upload again, since it's just as quick).
Comment #3
pjcdawkins commented: So, there is the existing FILE_EXISTS_REPLACE "replace behaviour" for <a href="http://api.drupal.org/api/drupal/core%21includes%21file.inc/function/file_save_upload/8">file_save_upload()</a> and file_destination(), which is an option alongside FILE_EXISTS_RENAME and FILE_EXISTS_ERROR.
The problem is, FILE_EXISTS_REPLACE checks only the file URI, which is a more or less useless way of checking whether a file already exists. As far as I can see, it even re-uses a file ID without checking whether the old entry points to a file with the same domain (public/private), or whether the current user has view/edit access to that file.
These seem to be the options:
I think it needs to be the latter. The contrib module could depend on File Hash for calculating SHAs for uploaded and existing files. Obviously some kind of cron / batch operation would be needed for existing files. The workflow might be:
* The original (oldest) file details will be loaded, assuming that the most authoritative owner of a file is the first person to have uploaded it. But there could be a log of ALL uploads, e.g. a simple database table containing (fid, uid, upload_time) columns, so you could even say how many times each file has been uploaded.
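The upload log mentioned in the bullet above could look roughly like this. It is a sketch using Python's sqlite3 module, not a Drupal schema; the table name `upload_log` and the helper functions are assumptions, and only the (fid, uid, upload_time) columns come from the proposal.

```python
import sqlite3

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE upload_log ("
    " fid INTEGER NOT NULL,"
    " uid INTEGER NOT NULL,"
    " upload_time INTEGER NOT NULL)"
)

def log_upload(fid, uid, upload_time):
    """Record one upload attempt, including repeats of an existing file."""
    conn.execute("INSERT INTO upload_log VALUES (?, ?, ?)", (fid, uid, upload_time))

def upload_count(fid):
    """How many times this file has been uploaded."""
    row = conn.execute(
        "SELECT COUNT(*) FROM upload_log WHERE fid = ?", (fid,)
    ).fetchone()
    return row[0]

def original_uploader(fid):
    """uid of the earliest upload, treated as the authoritative owner."""
    row = conn.execute(
        "SELECT uid FROM upload_log WHERE fid = ? ORDER BY upload_time ASC LIMIT 1",
        (fid,),
    ).fetchone()
    return row[0] if row else None

log_upload(7, 1, 100)   # user 1 uploads the file first
log_upload(7, 2, 200)   # user 2 uploads the same content later
log_upload(7, 1, 300)   # user 1 uploads it again
print(upload_count(7), original_uploader(7))
```

The file entry itself would still point at the oldest upload; the log only adds the history of who uploaded it again and when.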
Any thoughts?
Comment #4
swentel commented: Marked #206782: Prevent Duplicate File Uploads as duplicate
Comment #11
Krzysztof Domański: Filenames now include the _NUMBER if renamed by file_save_upload() due to FILE_EXISTS_RENAME