It is well known that uploading thousands of files into a single directory causes performance problems on certain systems, such as cloud hosting that uses GlusterFS, NFS or other shared filesystems, or simply slow hard drives. The reason is that list operations on that single directory carry too much overhead, especially over the network.

The usual solution is to ensure that no single directory under files/ holds more than a couple of thousand files. This is easily achieved by dividing directories by date:

30/12/2013/file.jpg
29/12/2013/file2.jpg

https://drupal.org/project/filefield_paths provides this functionality, but many people do not install it when they create their website, and migrating later is complicated.

I propose that we automatically add date-based tokens to all file system paths in Drupal core and allow modules such as FileField Paths to override these tokens to provide advanced functionality.

Files:
Comment  File  Size  Author
#29  added_data_pattern_in_destination_folder-2128055_28.patch  1.02 KB  stijntilleman
     FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] Unable to apply patch added_data_pattern_in_destination_folder-2128055_28.patch.
#28  Added_data_pattern_in_destination_folder-2128055_2.patch  1005 bytes  estoyausente
     FAILED: [[SimpleTest]]: [MySQL] Setup environment: Test cancelled by admin prior to completion.
#25  Added_data_pattern_in_destination_folder-2128055-25.patch  1021 bytes  estoyausente
     FAILED: [[SimpleTest]]: [MySQL] Setup environment: Test cancelled by admin prior to completion.

Comments

attiks’s picture

I don't think it has to be that many levels; we could use 20131230 so we only create one extra subdirectory.

In the past we used something like substr(md5($file_name), 0, 4);, but maybe a date is easier for all use cases.
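
As an illustration of the hash-prefix idea, here is a minimal Python sketch of the same computation as substr(md5($file_name), 0, 4). The function names hashed_subdir and hashed_path are hypothetical and not part of any Drupal API. With four hexadecimal characters there are 16^4 = 65,536 possible subdirectories.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_subdir(file_name: str, prefix_len: int = 4) -> str:
    """Return the first `prefix_len` hex characters of the MD5 of the file name."""
    digest = hashlib.md5(file_name.encode("utf-8")).hexdigest()
    return digest[:prefix_len]


def hashed_path(file_name: str, prefix_len: int = 4) -> str:
    """Build '<hash prefix>/<file name>' relative to the files directory."""
    return str(PurePosixPath(hashed_subdir(file_name, prefix_len)) / file_name)
```

Because the prefix depends only on the file name, the same name always maps to the same subdirectory, which keeps lookups predictable without storing extra state.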

Dries’s picture

Yesterday, a big Drupal site was down for several hours because Drupal uploaded hundreds of thousands of files into a single directory. The number of files caused severe problems for the underlying filesystem, leading to data corruption. We should have safe defaults, so I propose providing a mechanism in core to prevent this kind of data corruption.

Dave Reid’s picture

I think just [current-date:custom:Y]/[current-date:custom:m] (e.g. 2013/11) would be good.
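
To make the date-based proposal concrete, here is a small Python sketch of what the Y/m tokens expand to. It assumes the PHP date letters Y and m correspond to strftime's %Y and %m; the function name dated_subdir is hypothetical.

```python
from datetime import date


def dated_subdir(upload_date: date, pattern: str = "%Y/%m") -> str:
    """Expand a strftime pattern into a subdirectory for the upload date."""
    return upload_date.strftime(pattern)


# Year/month, as in [current-date:custom:Y]/[current-date:custom:m].
print(dated_subdir(date(2013, 11, 5)))              # 2013/11
# A single-level variant with one directory per day.
print(dated_subdir(date(2013, 12, 30), "%Y%m%d"))   # 20131230
```

The single-level %Y%m%d form creates only one extra directory level, at the cost of one new directory per day.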

attiks’s picture

#3: We did a project last month where we imported 30,000 files, so Y/m isn't enough. Can we settle for Ymd (without /)?

pdrake’s picture

One problem with using dates in the URL for files is that users may assume that it is a timestamp of sorts for the file, which may lead to confusion about the freshness of content. The date in the URL has no relation to the creation time of the content itself, just the date on which the file was uploaded (or perhaps the date on which it was migrated). Additionally, as attiks said, certain situations (such as content migration) may result in the creation of too many files on a given day.

Crell’s picture

I agree that we should have a dynamic default, but it should not be date-based. I would recommend something like the node type or field type (or both). That makes it very easy and predictable, but still helps to reduce the "crapload of files" problem. It doesn't help if you're going to have 300,000 nodes of the same type, all with files attached, but if you're building that scale site you should set up custom rules anyway. (Perhaps better help text in the latter case could be helpful?)

meba’s picture

You would be surprised by the number of sites that are not "enterprise" but have this many files.

Let's discuss *how* to make this happen and choose a pattern later. Right now we agree on the need for a pattern; we can choose the pattern itself later.

brantwynn’s picture

I agree with #6 that sorting by date makes things less searchable. I am less concerned about what's suggested in #5, since making things granular will give us a confusing URL in most cases anyway. I would suggest a pattern of [entity]/[bundle]/[field]/[id]/my_snarky_meme.gif for a file. This allows developers to easily find files and should prevent too many files from being stored in a single directory. The edge case would be a node with a multi-value field that has millions of attachments. It doesn't seem a likely scenario, but it's possible someone could do it.

Dave Reid’s picture

We already have a way to make this happen: there is a setting in the file field configuration, but its default is empty, which means all files get dumped into the public directory. Discussing how to make this happen means choosing a pattern now.

Dave Reid’s picture

@brantwynn: Always consider that when files are added to new content we don't have context such as the ID. Also, we'd need to fix the fact that token replacement at upload time has no context about the file's parent entity at all (see file_field_widget_uri(); nothing that calls this function uses the $data parameter).

brantwynn’s picture

@Dave Reid Could we get around this by storing newly saved files in a tmp directory and then moving them after the entity is saved?
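
The "stage, then move" idea can be sketched in Python as follows. This is only an illustration under stated assumptions: stage_upload and finalize_upload are hypothetical helpers, the target layout follows the [entity]/[bundle]/[field]/[id] pattern suggested earlier in the thread, and nothing here reflects how Drupal's file fields actually work.

```python
import shutil
from pathlib import Path


def stage_upload(data: bytes, file_name: str, tmp_root: Path) -> Path:
    """Write the upload to a temporary location before the entity has an ID.

    Note: a real implementation would need to avoid name collisions here.
    """
    tmp_root.mkdir(parents=True, exist_ok=True)
    staged = tmp_root / file_name
    staged.write_bytes(data)
    return staged


def finalize_upload(staged: Path, files_root: Path, entity_type: str,
                    bundle: str, field: str, entity_id: int) -> Path:
    """Move the staged file into its final directory once the ID is known."""
    target_dir = files_root / entity_type / bundle / field / str(entity_id)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / staged.name
    shutil.move(str(staged), str(target))
    return target
```

The cost of this design is a second filesystem operation per upload plus cleanup of staged files that never get attached to a saved entity.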

Dave Reid’s picture

@brantwynn: Sure, it would require some major changes to file fields, but it might be worth a try in order to get that context. I'm sure the maintainers of FileField Paths would appreciate that work going into core.

slashrsm’s picture

I think we should think about this outside the field context. We can create a default pattern for fields, but we should not forget that files can come from other sources (there is support for WYSIWYG in D8 now, modules like Media, etc.).

Dave Reid’s picture

@slashrsm Indeed. Also, contrib modules could attach images to multiple entities, so I think the same cons for a date approach also apply to an entity/ID approach.

meba’s picture

Any pattern that is not based on either semi-random or date information will fail the same way as no pattern. Remember: most sites have one major content type and one major image field.

Let's summarize:

  • Field-based token: Simple and understandable to end users but has potential issues with performance
  • Date-based token: Might confuse users who are unsure whether this is the date of upload, update or anything else. The Migrate module might complicate things if many images are uploaded at the same time
  • Random (think first couple of letters of md5 of filename): Performant, does not confuse users as date does but I wonder if it will still confuse users because searching a filesystem will now be harder if I SSH in. People will ask "What are those strange characters in my URL?"
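
To compare the three options in the summary above, here is a rough Python simulation of the migration case raised earlier in the thread: 30,000 files, one field, all uploaded on the same day. All names (field_dir, date_dir, hash_dir, largest_directory, migrated_batch) and the field name field_image are hypothetical, and the two-character hash prefix is just one possible choice.

```python
import hashlib
from collections import Counter
from datetime import date


def field_dir(name: str, uploaded: date) -> str:
    """Field-based token: every file of the field shares one directory."""
    return "field_image"


def date_dir(name: str, uploaded: date) -> str:
    """Date-based token: one directory per upload month."""
    return uploaded.strftime("%Y/%m")


def hash_dir(name: str, uploaded: date) -> str:
    """Semi-random token: first two hex characters of the name's MD5 (256 buckets)."""
    return hashlib.md5(name.encode("utf-8")).hexdigest()[:2]


def largest_directory(files, strategy) -> int:
    """Return the file count of the fullest directory under a strategy."""
    counts = Counter(strategy(name, uploaded) for name, uploaded in files)
    return max(counts.values())


def migrated_batch(count: int, day: date):
    """Simulate a migration: `count` files that all arrive on the same day."""
    return [(f"image_{i}.jpg", day) for i in range(count)]


files = migrated_batch(30000, date(2013, 12, 30))
for strategy in (field_dir, date_dir, hash_dir):
    print(strategy.__name__, largest_directory(files, strategy))
```

In this scenario the field and date strategies both put all 30,000 files into one directory, while the hash strategy spreads them over up to 256 directories. With uploads spread evenly over many months, the date strategy would look much better than it does here.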

attiks’s picture

There's another use case: people using the IMCE plugin for WYSIWYG will probably have a hard time finding their images. No idea if this type of module still exists in Drupal 8, but in general, if we start moving images, we will make it hard for all modules that try to access the file system directly.

Crell’s picture

A value unique to the entity isn't going to help much. Instead of 500,000 image files in one directory, you get a directory with 500,000 directories in it, each of which has one file. This is not an improvement.

I think we need to be careful of our goal here; it shouldn't be to "solve" the performance problem. It should be to mitigate it. Whatever we do, we're going to have patterns that result in epic fail in some use cases. We should try to have a default pattern that isn't going to epic fail in the most common use cases, and *suggests to users that they can set up their own rules* based on their use case. ("Teachable moment", as they say in the Ed biz.)

slashrsm’s picture

It looks like we have two options here:

- We use something file-related for the sub-folder token. Upload date seems the most appropriate in this group, but it has some obvious problems. Pros: semantically nice URLs, easy to understand, seen elsewhere (WordPress?); Cons: possibility of migration problems (not if the timestamp is also migrated over), may cause confusion among users and less experienced developers.
- Pseudo-random sub-folder token. I personally like the approach that hash_wrapper takes, since it ensures good dispersion by design (I have seen it in action on a site with tens of millions of files and it works great). We could also use parts of the UUID or the file size modulo some number. Those two methods also ensure quite good dispersion of files and seem easier to calculate (compared to hashing) and probably a bit easier to understand (compared to hash_wrapper). Pros: good dispersion in most use cases, no problems with migrations; Cons: harder to understand, does not provide nice/semantic URLs, harder to directly browse the file system.
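
The two cheaper pseudo-random alternatives mentioned in the second option (part of a UUID, or file size modulo a bucket count) could look roughly like this in Python. The function names and the default of 100 buckets are assumptions for illustration only; the sketch does not reproduce how hash_wrapper works.

```python
import uuid


def uuid_subdir(file_uuid: uuid.UUID, prefix_len: int = 2) -> str:
    """Use the first hex characters of the file's UUID as the subdirectory."""
    return file_uuid.hex[:prefix_len]


def size_modulo_subdir(size_bytes: int, buckets: int = 100) -> str:
    """Use the file size modulo `buckets`, zero-padded to a fixed width."""
    width = len(str(buckets - 1))
    return str(size_bytes % buckets).zfill(width)


print(uuid_subdir(uuid.UUID("12345678-1234-5678-1234-567812345678")))  # 12
print(size_modulo_subdir(1005))                                        # 05
```

Both values are already known for a file at upload time, so no hashing of the name or content is needed. How evenly the size-based variant spreads files depends on the sizes that actually occur on a given site.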

I completely agree with @Crell about the "teachable moment". People should be aware of the implications of huge file libraries and not simply follow the defaults we provide. While the defaults might work for smaller sites, they are definitely not acceptable for high-end projects.

Fidelix’s picture

How about these:

{field_machinename}/{year}-{weekoftheyear}
{field_machinename}/{year}-{dayoftheyear}
{field_machinename}/{year}-{month}-{day}/
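
For reference, the three patterns above could expand as in the following Python sketch. The function names are hypothetical, and the week variant assumes ISO week numbering, which the comment does not specify.

```python
from datetime import date


def week_pattern(field: str, d: date) -> str:
    """{field_machinename}/{year}-{weekoftheyear}, using ISO year and week."""
    iso_year, iso_week, _ = d.isocalendar()
    return f"{field}/{iso_year}-{iso_week:02d}"


def day_of_year_pattern(field: str, d: date) -> str:
    """{field_machinename}/{year}-{dayoftheyear}, day padded to three digits."""
    return f"{field}/{d.year}-{d.timetuple().tm_yday:03d}"


def ymd_pattern(field: str, d: date) -> str:
    """{field_machinename}/{year}-{month}-{day}."""
    return f"{field}/{d.year}-{d.month:02d}-{d.day:02d}"


d = date(2014, 1, 21)
print(week_pattern("field_image", d))         # field_image/2014-04
print(day_of_year_pattern("field_image", d))  # field_image/2014-021
print(ymd_pattern("field_image", d))          # field_image/2014-01-21
```

One caveat with ISO weeks: days around New Year can belong to a week of the neighbouring year, so 2013-12-30 lands in 2014-01 under this sketch.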

Crell’s picture

field name/year-month seems like a reasonable "common default case" to me. (Most sites won't need per-day, I suspect.)

Fidelix’s picture

I proposed "week of the year" because the average user looking at the folders won't mistake it for the image timestamp, nor is it too random/dirty like an MD5 or another type of hash.

But yeah, if we set that aside then field_name/year-month is good enough, I think.

askibinski’s picture

I would prefer not using field/entity machine names in the paths because files can be uploaded from outside a field (wysiwyg editor) or media module.

Year + Month + Day (eg. /2014/01/21/) should probably be enough for most cases.

slashrsm’s picture

How do we want to implement that? As a default configuration on image/file fields or on a lower level (a more general approach)?

Crell’s picture

Just a default on newly-created file fields, IMO. KISS. This is not quite a Novice issue, but not far from it.

estoyausente’s picture

Issue tags: +D8SVQ, +#SprintWeekend2014
Status: new
File size: 1021 bytes
FAILED: [[SimpleTest]]: [MySQL] Setup environment: Test cancelled by admin prior to completion.

Something like this?

I used the following pattern: {field_machinename}/{year}/{month}/

If someone needs a better implementation because they have thousands of files in a month, I can try another implementation. Suggestions are welcome.

stijntilleman’s picture

Shouldn't it be the other way around?

If the user set an upload location, you add /{year}/{month} to that location. The way I understand this issue is that if nothing was entered by the user, the token will automatically be filled with {field_machinename}/{year}/{month}. That way the files root will not be cluttered with files.
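
As I read this comment, the suggested rule is: leave a user-supplied location alone, and only fall back to {field_machinename}/{year}/{month} when the setting is empty. A minimal Python sketch of that reading follows; upload_directory is a hypothetical name, and this is not the logic of any of the attached patches.

```python
from datetime import date


def upload_directory(configured: str, field_name: str, today: date) -> str:
    """Return the configured location, or a dated default if none was set."""
    configured = configured.strip("/")
    if configured:
        # The site builder chose a location: respect it unchanged.
        return configured
    # Nothing configured: keep the files root from filling up.
    return f"{field_name}/{today:%Y}/{today:%m}"


print(upload_directory("", "field_image", date(2014, 1, 21)))         # field_image/2014/01
print(upload_directory("photos/", "field_image", date(2014, 1, 21)))  # photos
```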

estoyausente’s picture

@stijntilleman mmm... I'm not sure. It's possible, yes. I'm going to change it and test again.
Do you think that is the correct pattern?

estoyausente’s picture

Status: new
File size: 1005 bytes
FAILED: [[SimpleTest]]: [MySQL] Setup environment: Test cancelled by admin prior to completion.

I changed the patch as #26 suggested.

stijntilleman’s picture

Status: new
File size: 1.02 KB
FAILED: [[SimpleTest]]: [PHP 5.4 MySQL] Unable to apply patch added_data_pattern_in_destination_folder-2128055_28.patch.

@estoyausente I tried applying your patch but it didn't work. I manually recreated the patch, and this one should apply.

Crell’s picture

Hm, this is doing the work at runtime. I was thinking we just set the pattern by default in filefield, so it shows in the UI. That helps suggest to people that it's changeable.

Also, remember to set the issue to "needs review" when you post a patch so that the testbot can find it.

estoyausente’s picture

Status: Active » Needs review

Thanks @stijntilleman (I'm a novice and sometimes have problems with patches ^^).

@Crell done. Now it has the "needs review" status.

And... I'm not sure about your suggestion. In the file field settings form, show an input with a pattern for configuring the field's path pattern? It sounds great, but I'm not sure if I can do it. I have never worked with forms in D8. I will try it.

The last submitted patch, 28: Added_data_pattern_in_destination_folder-2128055_2.patch, failed testing.

The last submitted patch, 25: Added_data_pattern_in_destination_folder-2128055-25.patch, failed testing.

no longer here 793948’s picture

Has any decision been reached about how things will be done in D8?

- Pseudo-random sub-folder token. I personally like the approach that hash_wrapper takes, since it ensures good dispersion by design (I have seen it in action on a site with tens of millions of files and it works great). We could also use parts of the UUID or the file size modulo some number. Those two methods also ensure quite good dispersion of files and seem easier to calculate (compared to hashing) and probably a bit easier to understand (compared to hash_wrapper). Pros: good dispersion in most use cases, no problems with migrations; Cons: harder to understand, does not provide nice/semantic URLs, harder to directly browse the file system.

I love this one! I think the cons are quite minor, as you should be handling files through Drupal and not directly through the filesystem. This way of doing it would also help to identify and not duplicate identical files, a nice benefit.
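
The deduplication benefit only follows if the token is derived from the file's content rather than its name, UUID or size alone. The thread does not say which is meant, so the following Python sketch is an assumption: content_path is a hypothetical helper that names a file after the SHA-256 of its bytes, so identical uploads map to the same path.

```python
import hashlib


def content_path(data: bytes, extension: str) -> str:
    """Derive a sharded path from the SHA-256 of the file content."""
    digest = hashlib.sha256(data).hexdigest()
    # Two levels of two hex characters each, then the full digest as the name.
    return f"{digest[:2]}/{digest[2:4]}/{digest}{extension}"


first = content_path(b"same bytes", ".jpg")
second = content_path(b"same bytes", ".jpg")
print(first == second)  # True: an identical upload resolves to the existing path
```

The trade-off is that the original file name no longer appears in the URL, which is the "harder to understand" con listed above.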

erwangel’s picture

From experience on the user's side, I can say that an alphabetical/lexicographical order in the directory tree is more meaningful to a user than a day or any other numeric split, although the second option makes some sense for news sites. I operate both news and e-commerce sites.
- For news sites based on Drupal I have this system:
$files/file-type/optional-content-type/year/month/day/filename.ext
- For user generated content sections like picture galleries I have this system :
$files/file-type/user/year/month/day/filename.ext
- For e-commerce sites based on Magento I use this architecture :
$files/optional-category-or-brand/firstletter/secondletter/filename.ext
where firstletter and secondletter derive from filename, i.e. "f" and "i" => f/i/filename.ext
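
The first-letter/second-letter split described above can be sketched in Python as follows. letter_subdir is a hypothetical name, and the handling of short names and non-alphanumeric characters (replaced by "_") is an assumption that the comment does not cover.

```python
def letter_subdir(file_name: str, pad: str = "_") -> str:
    """Return 'first/second' from the first two characters of the file name."""
    letters = [c if c.isalnum() else pad for c in file_name.lower()[:2]]
    # Pad names shorter than two characters so the depth stays constant.
    while len(letters) < 2:
        letters.append(pad)
    return "/".join(letters)


print(f"{letter_subdir('filename.ext')}/filename.ext")  # f/i/filename.ext
```

Unlike a hash prefix, this split is only as even as the file names themselves: if most names start with the same prefix, for example "img_", they all end up in the same directory.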

So which architecture makes sense depends on the use case. We sometimes have to locate a file in the filesystem, and then the layout should be intuitive. There is no ideal solution, but in the initial upload path settings we can offer a choice between a numerical and an alphabetical subdivision system, along with some tokens for file type, user name or content type. I think this would be easy to configure and understand even for beginners, and it covers a large number of use cases.

pbattino’s picture

Pardon the question, but why is this not considered a "D7 also" issue? I know nothing about the main structural differences between D7 and D8 as far as files are concerned, but from a look at this patch it does not seem D8-specific.

Is it possible to have this issue backported?

batigol’s picture

Heh, yeah, it was one of the biggest problems for me on a multi-domain, multi-language, multi-market site: a real pain in the ass.

Media was totally out, I did use:

https://www.drupal.org/project/insert
https://www.drupal.org/project/imce_filefield
https://www.drupal.org/project/filefield_paths
https://www.drupal.org/project/filefield_sources
https://www.drupal.org/project/colorbox

and File (Field) Path settings was something... like this - [current-date:custom:Y-m]/[node:nid]-[node:language]-[node:domain:id]

kerios83’s picture

Please remember that some images need to be used two or more times on the same page.

alimac’s picture

Issue tags: -#SprintWeekend2014, +SprintWeekend2014

Minor tag cleanup - please ignore.

Status: Needs review » Needs work

The last submitted patch, 29: added_data_pattern_in_destination_folder-2128055_28.patch, failed testing.