Problem/Motivation

Drupal will upload files into one directory by default. Users can affect this by changing configuration of file/image fields, but real-world experience shows that they don't. Uploading thousands or millions of files into one directory is a very bad idea, which can have performance implications too. There are reports of sites going down just as a result of this issue.

Proposed resolution

Let's change default location configuration variable in file/image fields. This variable is currently empty, which results in all files being uploaded in a root folder (usually public://). Since use of tokens is supported we can provide configuration that will dynamically adapt for different fields.

"field/[field-storage:name]/[date:custom:Y]/[date:custom:m]" is suggested.

Remaining tasks

- review patch
- review change record

User interface changes

None.

API changes

None.

Original issue summary by @meba

It is a well known fact that if you upload thousands of files into a single directory you will have performance issues on certain systems such as Cloud systems using GlusterFS, NFS or other shared filesystems or simply slow hard drives. The reason being that list operations on that single directory take too much overhead over network.

The solution typically is to ensure there is no more than a couple of thousands of files in a single directory in your files/. This can be easily achieved through dividing directories by date:

30/12/2013/file.jpg
29/12/2013/file2.jpg

https://drupal.org/project/filefield_paths provides this functionality but many people do not install it when they create their website and migration is complicated.

I propose that we automatically add date based tokens to all file system paths in Drupal core and allow modules such as FileField Paths to override these tokens to provide advanced functionality.

Beta phase evaluation

Reference: https://www.drupal.org/core/beta-changes
Issue category Task because it's not really new functionality, but it's not really a bug as nothing is broken.
Issue priority Normal because it's not a blocker for anything.
Prioritized changes The main goal of this issue is usability.
Disruption This issue should have no disruption for any existing APIs, modules, or sites.
Files: 
CommentFileSizeAuthor
#64 interdiff.txt849 bytesrteijeiro
#64 2128055_65.patch6.26 KBrteijeiro
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 90,551 pass(es).
[ View ]
#56 2128055_55.patch6.29 KBslashrsm
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 90,098 pass(es).
[ View ]
#56 interdiff.txt949 bytesslashrsm
#52 2128055_52.patch6.28 KBslashrsm
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 90,153 pass(es).
[ View ]
#52 interdiff.txt1.28 KBslashrsm
#50 interdiff.txt3.57 KBslashrsm
#50 2128055_50.patch6.16 KBslashrsm
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 88,792 pass(es).
[ View ]
#44 interdiff.txt3.75 KBslashrsm
#44 2128055_44.patch4.55 KBslashrsm
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 87,931 pass(es).
[ View ]
#42 2128055_42.patch1.09 KBslashrsm
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] 87,926 pass(es), 2 fail(s), and 0 exception(s).
[ View ]
#29 added_data_pattern_in_destination_folder-2128055_28.patch1.02 KBstijntilleman
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] Unable to apply patch added_data_pattern_in_destination_folder-2128055_28.patch. Unable to apply patch. See the log in the details link for more information.
[ View ]
#28 Added_data_pattern_in_destination_folder-2128055_2.patch1005 bytesestoyausente
FAILED: [[SimpleTest]]: [MySQL] Setup environment: Test cancelled by admin prior to completion.
[ View ]
#25 Added_data_pattern_in_destination_folder-2128055-25.patch1021 bytesestoyausente
FAILED: [[SimpleTest]]: [MySQL] Setup environment: Test cancelled by admin prior to completion.
[ View ]

Comments

attiks’s picture

I don't think it has to be that many levels, we could use 20131230 so we only create one extra sub directory.

In the past we used something like substr(md5($file_name), 0, 4);, but maybe a date is easier for all use cases.

Dries’s picture

Yesterday, a big Drupal site was down for several hours because Drupal uploaded hundreds of thousands of files into a single directory. The amount of files caused severe problems for the underlying filesystem, leading to data corruption. We should have safe defaults, so I propose providing a mechanism in core to prevent this kind of data corruption.

Dave Reid’s picture

I think just [current-date:custom:Y]/[current-date:custom:m] (e.g. 2013/11) would be good.

attiks’s picture

#3 We did a project last month where we imported 30.000 files, so Y/m isn't enough, can we settle for Ymd (without /)?

pdrake’s picture

One problem with using dates in the URL for files is that users may assume that it is a timestamp of sorts for the file, which may lead to confusion about the freshness of content. The date in the URL has no relation to the creation time of the content itself, just the date on which the file was uploaded (or perhaps the date on which it was migrated). Additionally, as attiks said, certain situations (such as content migration) may result in the creation of too many files on a given day.

Crell’s picture

I agree that we should have a dynamic default, but it should not be date-based. I would recommend something like the node type or field type (or both). That makes it very easy and predictable, but still helps to reduce the "crapload of files" problem. It doesn't help if you're going to have 300,000 nodes of the same type, all with files attached, but if you're building that scale site you should set up custom rules anyway. (Perhaps better help text in the latter case could be helpful?)

meba’s picture

You would be surprised about the amount of sites that are not "enterprise" but have this many files.

Let's discuss *how* to make this happen and choose a pattern later. Right now we agree on the need for the pattern, we have the ability to choose it later.

brantwynn’s picture

I agree with #6 that sorting by date makes things less searchable. I am less concerned about whats suggested in #5 since we are going to have a confusing URL in making things granular in most cases. I would suggest a pattern of [entity]/[bundle]/[field]/[id]/my_snarky_meme.gif for a file. This allows developers to easily find files and should prevent too many files from being stored in a single directory. The edge case would be a node with a multifield with millions of attachments to that field. It doesn't seem a likely scenario but its possible someone could do it.

Dave Reid’s picture

We already have a way to make this happen - there is a setting in the file field configuration, but it's default is nothing, which means all files get dumped into the public directory. Discussing how to make this happen means choosing a pattern now.

Dave Reid’s picture

@brantwynn: Always consider that adding files to new content means we don't have context such as ID. Also, we'd need to fix that the token replacement when the file uploaded has no context related to it's parent entity at all (see file_field_widget_uri() and nothing that calls this function uses the $data parameter.

brantwynn’s picture

@Dave Reid Could we get around this by storing newly saved files in a tmp directory and then moving them after the entity is saved?

Dave Reid’s picture

@brantwynn: Sure, it would require some major changes to file fields, but it might be worth a try in order to get that context. I'm sure the maintainers of filefield paths would appricate that work going into core.

slashrsm’s picture

I think we should think about this out of field context. We can create default pattern for fields, but we should not forget that files can come from other sources (there is support for WYSIWYG in D8 now, modules like Media, etc.).

Dave Reid’s picture

@slashrsm Indeed. Also images could via contrib be attached to multiple entities, so I think the same cons for a date approach also apply to an entity/ID approach.

meba’s picture

Any pattern that is not going to be based on either semi-random / date information will fail the same way as no pattern. Remember - most sites have one major content type and one major image field.

Let's summarize:

  • Field based token: Simple and understandable to end users but has potential issues with performance
  • Date based token: Might confuse users who are unsure whether this is date of being uploaded, updated or anything else. Migrate module might complicate things if many images uploaded at the same time
  • Random (think first couple of letters of md5 of filename): Performant, does not confuse users as date does but I wonder if it will still confuse users because searching a filesystem will now be harder if I SSH in. People will ask "What are those strange characters in my URL?"
attiks’s picture

There;s another use case, people using IMCE plugin for wysiwyg, they probably will have a hard time finding their images. No idea if these type of module still exists in Drupal 8, but in general if we start moving images, we will make it hard for all modules trying to access the file system directly.

Crell’s picture

A value unique to the entity isn't going to help much. Instead of 500,000 image files in one directory, you get a directory with 500,000 directories in it, each of which has one file. This is not an improvement.

I think we need to be careful of our goal here; it shouldn't be to "solve" the performance problem. It should be to mitigate it. Whatever we do, we're going to have patterns that result in epic fail in some use cases. We should try to have a default pattern that isn't going to epic fail in the most common use cases, and *suggests to users that they can setup their own rules* based on their use case. ("Teachable moment", as they say in the Ed biz.)

slashrsm’s picture

It looks that we have two options here:

- We use something file-related for sub-folder token. Upload date seems the most appropriate in this group, but it has some obvious problems. Pros: semantically nice urls, easy to understand, seen elsewhere (Wordpress?); Cons: possibility of migration problems (not if timestamp is also migrated over), may cause confusion among users and less experienced developers.
- Pseudo-random sub-folder token. I personally like the approach that hash_wrapper takes, since it ensures good dispersion by design (seen it in action on a site with tens millions of files and it works great). We could also use parts of UUID or filesize's modulo. Those two methods also ensure quite good dispersion of files and seem easier to calculate (compared to hashing) and probably a bit easier to understand (compared to hash_wrapper). Pros: good dispersion in most use-cases, no problems with migrations; Cons: harder to understand, does not provide nice/semantic URLs, harder to directly browse file system.

I completely agree with @Crell about the "teachable moment". People should be aware about implications of huge file libraries and not simply follow defaults we provide. While this might work for smaller sites it is definitely not acceptable for high-end projects.

Fidelix’s picture

How about these:

{field_machinename}/{year}-{weekoftheyear}
{field_machinename}/{year}-{dayoftheyear}
{field_machinename}/{year}-{month}-{day}/

Crell’s picture

field name/year-month seems like a reasonable "common default case" to me. (Most sites won't need per-day, I suspect.)

Fidelix’s picture

I proposed "week of the year" because it won't be recognizable by the average user looking at the folders thinking that's the image timestamp, neither it's too random/dirty like a MD5 or another type of hash.

But yeah, it we take that aside then field_name/year-month is good enough, I think.

askibinski’s picture

I would prefer not using field/entity machine names in the paths because files can be uploaded from outside a field (wysiwyg editor) or media module.

Year + Month + Day (eg. /2014/01/21/) should probably be enough for most cases.

slashrsm’s picture

How do we want to implement that? As as default configuration on image/file fields or on a lower level (more general approach)?

Crell’s picture

Just a default on newly-created file fields, IMO. KISS. This is not quite a Novice issue, but not far from it.

estoyausente’s picture

Issue tags:+D8SVQ, +#SprintWeekend2014
StatusFileSize
new1021 bytes
FAILED: [[SimpleTest]]: [MySQL] Setup environment: Test cancelled by admin prior to completion.
[ View ]

Something like this?

I put the next pattern: {field_machinename}/{year}/{month}/

If somewhere needs a better implementation for this topic because he has thousand of files in a month... If you want I can try other implementation. I accepts suggestion.

stijntilleman’s picture

Shouldn't it be the other way around?

If the user set an upload location you add /{year}/{month} to that location. The way I understand this issue is that if nothing was entered by the user the token will automatically filled with {field_machinename}/{year}/{month}. That way the files root will not be cluttered with files.

estoyausente’s picture

@stijntilleman mmm... I'm not sure. It's posible, yes. I'm going to change and test again.
Do you think that is a correct pattern?

estoyausente’s picture

StatusFileSize
new1005 bytes
FAILED: [[SimpleTest]]: [MySQL] Setup environment: Test cancelled by admin prior to completion.
[ View ]

I change the patch like #26 told me.

stijntilleman’s picture

StatusFileSize
new1.02 KB
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] Unable to apply patch added_data_pattern_in_destination_folder-2128055_28.patch. Unable to apply patch. See the log in the details link for more information.
[ View ]

@estoyausente I tryed applying your patch but it didn't work. I manually created the patch and this one should apply.

Crell’s picture

Hm, this is doing the work at runtime. I was thinking we just set the pattern by default in filefield, so it shows in the UI. That helps suggest to people that it's changeable.

Also, remember to set the issue to "needs review" when you post a patch so that the testbot can find it.

estoyausente’s picture

Status:Active» Needs review

Thanks @stijntilleman (I'm novice and sometimes have problem with the patchs ^^).

@Cell done. Now it have need review status.

And... I'm not sure about your suggest. In the filetype fileld form, show an input with a pattern, for configure the fieldpattern? It sound great. But I'm not sure if I can do it. I Never work with form in D8. I will try it.

The last submitted patch, 28: Added_data_pattern_in_destination_folder-2128055_2.patch, failed testing.

The last submitted patch, 25: Added_data_pattern_in_destination_folder-2128055-25.patch, failed testing.

no longer here 793948’s picture

Has any decision been reached about how things will be done in D8?

- Pseudo-random sub-folder token. I personally like the approach that hash_wrapper takes, since it ensures good dispersion by design (seen it in action on a site with tens millions of files and it works great). We could also use parts of UUID or filesize's modulo. Those two methods also ensure quite good dispersion of files and seem easier to calculate (compared to hashing) and probably a bit easier to understand (compared to hash_wrapper). Pros: good dispersion in most use-cases, no problems with migrations; Cons: harder to understand, does not provide nice/semantic URLs, harder to directly browse file system.

I love this one! I think the cons are quite minor, as you should be handling files through Drupal and not directly through the filesystem. This way of doing it would also help to identify and not duplicate identical files, a nice benefit.

erwangel’s picture

From user's side experience, I can say that an alphabetical/lexicographical order on directory tree is more meaningful to a user than a day or any other number split, although this second option has some sens to news sites. I am operating both news and e-commerce sites.
- For news sites based on drupal I have this system :
$files/file-type/optional-content-type/year/month/day/filename.ext
- For user generated content sections like picture galleries I have this system :
$files/file-type/user/year/month/day/filename.ext
- For e-commerce sites based on Magento I use this architecture :
$files/optional-category-or-brand/firstletter/secondletter/filename.ext
where firstletter and secondletter derive from filename, i.e. "f" and "i" => f/i/filename.ext

So architecture make sens depending to the use case. We sometimes have to locate a file in a filesystem architecture and then it should be intuitive. There is no ideal situation, but we can give the choice in initial settings of uploading file path to use a numerical or an alphabetical subdivision system along with some tokens about file-type, user-name or content-type. I think this can be easy to configure and understand even to starters and it covers a large number of usage cases.

pbattino’s picture

Pardon the question, but why this is not considered a "D7 also" issue? I know nothing about the main structural differences between D7 and D8 as far as files are concerned, but having a look at this patch seems no D8 specific.

Is is possible to have this issues backported?

batigol’s picture

Heh, yea it was one of the biggest problem for me - multidomains and multilanguage markets site - real pain in the ass.

Media was totally out, I did use:

https://www.drupal.org/project/insert
https://www.drupal.org/project/imce_filefield
https://www.drupal.org/project/filefield_paths
https://www.drupal.org/project/filefield_sources
https://www.drupal.org/project/colorbox

and File (Field) Path settings was something... like this - [current-date:custom:Y-m]/[node:nid]-[node:language]-[node:domain:id]

kerios83’s picture

Please remember that some images need to used twice or more times on the same page.

alimac’s picture

Issue tags:-#SprintWeekend2014+SprintWeekend2014

Minor tag cleanup - please ignore.

Status:Needs review» Needs work

The last submitted patch, 29: added_data_pattern_in_destination_folder-2128055_28.patch, failed testing.

slashrsm’s picture

Status:Needs work» Needs review
StatusFileSize
new1.09 KB
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] 87,926 pass(es), 2 fail(s), and 0 exception(s).
[ View ]

Attached patch follows the suggestion that @Crell gave many times in this thread. It simply adds a default file directory pattern on a file/image field. It uses a workaround for field name replacement as we don't have a token for that in core. Implementing that in this patch would be a) completely out of scope and b) would make the patch much, much bigger. If the workaround is not acceptable I'd rather simplify the default pattern instead.

Status:Needs review» Needs work

The last submitted patch, 42: 2128055_42.patch, failed testing.

slashrsm’s picture

Status:Needs work» Needs review
StatusFileSize
new4.55 KB
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 87,931 pass(es).
[ View ]
new3.75 KB

Test fixes.

Crell’s picture

Status:Needs review» Needs work
+++ b/core/modules/file/src/Plugin/Field/FieldType/FileItem.php
@@ -269,6 +269,9 @@ public function getUploadLocation($data = array()) {
+    // There is no token for field name so we have to use a little workaround.
+    $destination = str_replace('[field_name]', $this->parent->getName(), $destination);

Let's go ahead and add the token if we can. This sort of code debt is a time bomb, and the sort of "good enough for now" that has gotten us in trouble before. :-) If we're going to do it, let's do it right.

I also don't really care for the year/month being in the default path. That feels unnecessary to me.

slashrsm’s picture

I agree. That is a nasty hack :).

How do we want to approach this? Add proper token support for fields in a separate issue and postpone this until that gets in? Which field-related tokens we want?

Crell’s picture

I'd have to see what field related tokens exist now; My knee-jerk thought is to add just what we need for this issue, in this issue. We'd also need a beta eval added...

slashrsm’s picture

I believe there are none at the moment. That was the main reason for the nasty hack in #44. We only need field name I believe.

Crell’s picture

Let's add it then. Hacks--

slashrsm’s picture

Status:Needs work» Needs review
StatusFileSize
new6.16 KB
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 88,792 pass(es).
[ View ]
new3.57 KB

Done.

Crell’s picture

Status:Needs review» Needs work
  1. +++ b/core/modules/file/src/Plugin/Field/FieldType/FileItem.php
    @@ -269,6 +269,8 @@ public function getUploadLocation($data = array()) {
    +    $data += ['field-storage' => $this->getFieldDefinition()->getFieldStorageDefinition()];
    +

    Do we still need this part now?

  2. +++ b/core/modules/file/src/Plugin/Field/FieldType/FileItem.php
    @@ -269,6 +269,8 @@ public function getUploadLocation($data = array()) {
    @@ -310,8 +312,16 @@ public static function generateSampleValue(FieldDefinitionInterface $field_defin

    @@ -310,8 +312,16 @@ public static function generateSampleValue(FieldDefinitionInterface $field_defin
         $random = new Random();
         $settings = $field_definition->getSettings();

    +    /**
    +     * Prepare upload destination.
    +     *
    +     * @see FileItem::getUploadLocation()
    +     */
    +    $destination = trim($settings['file_directory'], '/');
    +    $destination = \Drupal::token()->replace($destination, ['field-storage' => $field_definition->getFieldStorageDefinition()]);

    This may be out of scope as well, but... \Drupal:: inside a class is a code smell. Always, without exception. What's this method doing that it needs to run a token replacement? It says "generate sample value", which... shouldn't need a service for that, should it? Is that a pre-existing design flaw in the field system?

  3. +++ b/core/modules/file/src/Tests/FileFieldPathTest.php
    @@ -17,20 +17,28 @@ class FileFieldPathTest extends FileFieldTestBase {
    +      $this->container->get('date.formatter')->format(REQUEST_TIME, 'custom', 'Y') . '/' .
    +      $this->container->get('date.formatter')->format(REQUEST_TIME, 'custom', 'm') . '/' .

    Grab this service once and use it twice. Doesn't make a big difference in a test but it's a good habit to get into. :-)

slashrsm’s picture

Status:Needs work» Needs review
StatusFileSize
new1.28 KB
new6.28 KB
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 90,153 pass(es).
[ View ]

Thank you. Glad to see we're moving this into the right direction. :)

  1. Do we still need this part now?

    This was added with field tokens. Token service needs field storage definition in order to process them and here we're making sure the service is always there.

  2. This may be out of scope as well, but... \Drupal:: inside a class is a code smell. Always, without exception. What's this method doing that it needs to run a token replacement? It says "generate sample value", which... shouldn't need a service for that, should it? Is that a pre-existing design flaw in the field system?

    Since this is static context we can't inject. We could just skip token replacement (as we do now), but that assumes we'll never use field definition with non-empy "file_directoy" configuration variable in here. Sounds wrong to me.

  3. Grab this service once and use it twice. Doesn't make a big difference in a test but it's a good habit to get into. :-)

    Done. Also noticed unnecessary file_load() few lines above. Fixed that too.

Crell’s picture

Category:Feature request» Task
Issue summary:View changes
Status:Needs review» Reviewed & tested by the community

I'm not going to try and critique the design flaws of the token API at this time, so I guess let's move forward. :-) Thanks!

Adding beta evaluation, too. I don't think we need a change notice for this.

willzyx’s picture

+++ b/core/modules/field/field.module
@@ -13,6 +13,7 @@
+use Drupal\Component\Utility\String;

@@ -339,3 +340,49 @@ function field_form_config_admin_import_form_alter(&$form, FormStateInterface $f
+          $replacements[$original] = $sanitize ? String::checkPlain($field_storage->getName()) : $field_storage->getName();

String::checkPlain was deprecated (will be removed before Drupal 8.0.). Can we start to use \Drupal\Component\Utility\SafeMarkup's methods?

alexpott’s picture

Status:Reviewed & tested by the community» Needs work
Issue tags:+Needs issue summary update, +Needs change record

Re #54 yep we shouldn't be adding new usages.

Also the issue summary could be updated to say what the patch is actually doing. Plus it might be good to have a site builder CR to tell people that this default is changing.

slashrsm’s picture

Status:Needs work» Needs review
StatusFileSize
new949 bytes
new6.29 KB
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 90,098 pass(es).
[ View ]

Fixed #54.

slashrsm’s picture

Issue summary:View changes
slashrsm’s picture

Status:Needs review» Reviewed & tested by the community
Issue tags:-Needs issue summary update, -Needs change record

Issue summary updated and change record draft created. This was technically already reviewed and was RTBC in the past.

Crell’s picture

I must have missed SafeMarkup becoming public in favor of String... Latest patch still RTBC IMO.

Berdir’s picture

@Crell: #2454447: Split Utility\String class to support PHP 7 (String is a reserved word)

Patch looks fine to me, would be useful to have a "by default" or so in the issue title/commit message I think.

slashrsm’s picture

@Berdir: like this?

alexpott’s picture

Title:Files should automatically be uploaded to directories based on a token» Files should be uploaded to directories based on tokens by default

@Berdir: like this?

alexpott’s picture

Status:Reviewed & tested by the community» Needs work
+++ b/core/modules/file/src/Plugin/Field/FieldType/FileItem.php
@@ -310,8 +312,16 @@ public static function generateSampleValue(FieldDefinitionInterface $field_defin
+    /**
+     * Prepare upload destination.
+     *
+     * @see FileItem::getUploadLocation()
+     */

This is an odd inline comment.

rteijeiro’s picture

Status:Needs work» Needs review
StatusFileSize
new6.26 KB
PASSED: [[SimpleTest]]: [PHP 5.4 MySQL] 90,551 pass(es).
[ View ]
new849 bytes

Fixed. Is this way correct? Could not find anything in [#1354]

toamit’s picture

I did not see this issue until now, so I'd request you folks to take a look at this proposal as well for file handling.
https://www.drupal.org/node/2485807

toamit’s picture

In current patch, which tokens will be available to user besides year/month. Will they include system wide tokens like [nid], [uid] etc?

Year token is mostly useless, as we will get handful of folders. Month gives us 12 folders, a bit of improvement, but does this really solve the original problem?
field-storage/[uid]/[nid] approach will be more scalable for authenticated uploads (if anonymous uploads are permitted that is a different can of worm)

Crell’s picture

#64 looks fine to me; not RTBCing myself because I'd prefer to have someone from token and/or entity API sign off on it. Who should we get for that?