Problem/Motivation
- This issue has second most followers of all feature request.
- https://www.drupal.org/project/usage/transliteration had almost 300k reported installs during the Drupal 7 glory days
On Drupal sites where users are uploading files clean filenames are hard to achieve because Drupal accepts most filenames as valid input. However lots of Drupal installs override this functionality, for example Thunder has been shipping a similar solution for a very long time. As @dww writes in #132
Content creators name their files all kinds of weird and unfortunate things. Even if everything is in English, it's still great to convert spaces to dashes, remove weird punctuation, special characters, etc. Plus, many users find themselves in a mix of case-sensitive and case-insensitive filesystems, so it's great to have Drupal automatically convert everything to lowercase. I constantly build custom functionality into my sites to cleanup filenames on upload so as to prevent the editors from making a mess of things. It'd be much better if core did that for me automatically.
Follow-up to #1842718: Use new Transliteration functionality in core for machine names and #567832: Transliteration in core.
@catch mentioned in #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) we may want to transliterate filenames in core, so that we can have a database-level unique constraint on the URI field in the database. (However, this would presumably mean that transliteration of the field could not be optional).
Proposed resolution
Add a fieldset to the File system form (/admin/config/media/file-system) to opt-in for various kinds of filename sanitization:
- Transliterate the filename. Disabled by default, because it's not useful for some languages like Japanese.
- Replace whitespace with dash (
-) or underscore (_). - Replace any characters other than 0-9A-Za-z or
-or_or.with dash (-) or underscore (_). - Remove duplicate separator characters ie
--or..or__with be replaced with-or.or_. - Convert to lowercase (to prevent issues on case-insensitive file systems).
On all files uploaded through the API and UI (i.e. not aggregated JS/CSS files, etc), the filename will be sanitized in whatever ways the site is configured to enforce.
We listen to the \Drupal\file\Event\FileUploadSanitizeNameEvent and change the filename according the the configuration.
Note: These sanitization settings only impact new file uploads, all files already uploaded to a site are not affected.
Remaining tasks
Decide if all whitespace or only spaces are converted. Update code and UI text to match.All whitespace as of #159.Design and dispatch the appropriate event(s) where core's sanitization is done.Complete via #187.Decide if the FileUploadEvent should also let listeners munge the destination. If so, update the Event accordingly.No (via @alexpott in #192.2).Decide if we give the user enough feedback about renaming their file upon upload.No (via @alexpott in #203).If not, implement the appropriate feedback(Complete via #214).Finish cleaning up the UI text, settings names, etc:Done via #219.What's the right label for the 'strip' or 'remove' option for replace_non_alphanumeric? For now: "Option D: remove it entirely, via #219.- Strip -". Previous: "- Remove (do not replace) -". Other?Should the options for the two selects (replace_*) be capitalized?Sure. ;) Via #194Any other concerns/edits/fixes?
Finish writing Kernel test cases for (all?) the combinations of settings.Done via #213 and #219.Upload screenshot(s) of final settings UI and update the summary.Done via #219.Final code cleanup and fixes.No more @todo via #219.Update/fix the change record.Done via #221Add tests for REST integrationDecide if we're allowing transliteration as an option(!). See #272.Decide if we should let people pick a transliteration language. See #270 and #272.Usability review:Decide if we want a UI at all, or if sites should simply configure this via settings.php (like they do for the public + private files dir paths). See #272.- If yes to a UI:
Decide whether we want to be able to have different replacement characters for whitespace, non-alphanumeric characters, and transliteration unknown characters. See #180, #246, #258, #261, and others...If we have it as a separate / shared setting, decide if the "replacement character" setting should be "near" the checkboxes that provide replacement or at the end of the fieldset (see #263 for comparisons)Decide if "replace non-alphanumeric" should be "near" transliterate option. See #241 and #242.Decide if we should let people pick a transliteration language. See #270 and #272.
Avoid the announcement of redundant text by a screenreader (ref #343)- Final string/UI review + signoff
- Final reviews + RTBC.
- Commit.
User interface changes
New "Sanitize filenames" detail element with options on FileSystemForm (/admin/config/media/file-system).
Screenshots of the new settings UI :
Sanitize filenames element closed:

Sanitize filenames element open:

Options under "Replacement character" setting:

API changes
None
Data model changes
New configuration mapping file.settings:filename_sanitization, with options for:
- transliterate: Transliterate
- replace_whitespace: Replace whitespace
- replace_non_alphanumeric: Replace non-alphanumeric characters
- dedupe_separators: Remove duplicate dots (..), underscores (__) or dashes (--)
- lowercase: Convert to lowercase
Release notes snippet
| Comment | File | Size | Author |
|---|---|---|---|
| #404 | 2492171-404.patch | 28.11 KB | sokru |
| #403 | 2492171-nr-bot.txt | 86 bytes | needs-review-queue-bot |
| #394 | 2492171-394.patch | 28.22 KB | sokru |
| #386 | interdiff-385-386.txt | 905 bytes | jungle |
| #386 | interdiff-377-386.txt | 5.17 KB | jungle |
Issue fork drupal-2492171
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #1
neclimdulWhat's the status of this? Should #2504815: d6 to d8 migration throws integrity contraint warning with duplicate file paths expect a unique constraint to exist in the database target?
Comment #2
frobIs this going to make it for 8.0 or should this be moved to 8.1?
Comment #3
xjmYep, this is 8.1.x material at this point, so contrib could provide it for 8.0.x. Thanks!
Comment #4
andypostSuppose it's time, but maybe this blocked on #1925684: Provide a "original filename" field for file entity.
Comment #5
frobI wouldn't consider this blocked on that one. They should be able to be worked on in parallel.
Comment #6
decipheredIt's worth mentioning that this functionality is part of File (Field) Paths module, which has a D8 beta release.
Comment #7
jon@s commentedTrue but it seems like this functionality should be in core. This would also require the transliteration in core to support replacing spaces with a symbol which it, to my knowledge, currently does not. Also why the transliteration functionality in the File (Field) Paths module is limited compared to the old Transliteration module.
Comment #8
decipheredI'm not at all debating that it should be in core, but given @xjm's comment that it's not going to happen in 8.0.x, and that it needs to be dealt with in contrib, this is it being dealt with in contrib.
File (Field) Paths does also provide replacing of spaces with a symbol, as it has since inception (or not long afterwards) via the Pathauto integration.
I'm more than happy to remove the functionality from F(F)P as soon as it's no longer needed.
Comment #9
g089h515r806 commentedI have test with File (Field) Paths with Drupal8.0.1, it does not works at first. it works correctly at last.
I want to upload chinese pdf file using file field module.
Comment #10
decipheredThat seems like a poorly worded statement. But I will be happy to look into it further in the appropriate issue queue.
Comment #11
g089h515r806 commentedSorry for my English.
Comment #13
eigentor commentedI have tried filefield paths to solve the problem of transliterating filenames. For me it dit not work. Some comments for filefield paths:
Comment #14
imadalin commentedI don't think that the issue's topic is about File (Field) Paths and we should concentrate on implementing Transliteration for filenames manipulation in the Core.
Comment #15
pingwin4egLet's start with something.
Here's a patch using code similar to that from a transliteration module.
Comment #16
berdir"Something" looks like a good start to me, at least it doesn't break any existing tests. Next step would be tests with file names that trigger this.
Also wondering about how to introduce this exactly, is there any reason to make this a setting, so it could be disabled? It will be a behavior change, although I suspect a very welcome one for most (all?) sites?
Comment #17
skaughti'll be bold enough to say all sites should use.
The only other request I've ever had is about storing the original file name, that could be used for Title href. As people often name a file because they expect it would be saved. Same for Search API related issues. At this time, I haven't directly looked to see if the File API is yet doing that..
^I think there would be no need for an 'on/off' switch of transliteration if org. name is stored for about points.
$clean_filename = \Drupal::transliteration()->transliterate($file->getFilename(), 'en', '');I should think 'en' shouldn't be hardcoded. sites will not all be english first.
Comment #18
stefan.r commentedThis probably should be opt-in (at least for existing sites), so as to not change behavior.
Comment #19
skaughtComment #20
andypostMakes sense to have that configurable via container param
Comment #21
hgoto commentedI updated the patch #15.
1.
I replaced the hardcoded 'en' to
$langcodewith the following code:2.
I changed the
strtolower()toDrupal\Component\Utility\Unicode\Unicode::strtolower(). I consulted #1842718: Use new Transliteration functionality in core for machine names for this point. This may be unnecessary.3.
I introduced an option config to toggle the transliteration for the file names.
Probably it depends on the language that people want to make this setting opt-in or opt-out (of course, it shouldn't be changed for existing sites with update). In my small personal experience, the current transliteration function for Japanese is not very good and I'd like it to be off by default...
Tests are not yet added.
Comment #22
hgoto commentedComment #24
shadcn commentedPatch in #21 tested and works. Thanks.
Comment #25
dawehnerI'm wondering why this is part of
file_save_upload(), which is part of an API which is related to the form, not some actual underlying API level. IMHO you would like to have the same for potential files uploaded via RESt one day.Maybe it would be enough for now to move all that logic into its own API function and call it from here?
Comment #26
hgoto commentedThanks for your reviews. As dawehner told, surely it's better to make this logic an independent API function.
I tried creating a new method
getTransliteratedFilename()in\Drupal\file\Entity\Fileclass and moved this new transliteration logic into it. I'm not sure where we should put this API... I'd like this patch to be fully reviewed. Thank you.Comment #28
hgoto commentedThe patch #26 failed the auto test because I didn't change the code properly when I extracted the logic to a method. I'll try it again.
Comment #29
tobiberlinTested patch in #28 with a file named with spaces and German umlaut -> works
Comment #30
tobiberlinIf this is added to core it would be very helpful to offer an update function to transliterate the already uploaded files, wouldn't it?
Comment #31
berdirNope, that doesn't work. you don't know how and where those files are used, could be hardcoded in content. We can only offer it for new files.
Comment #32
tobiberlinIn my eyes this answer makes no sense as this patch is about using transliteration on images uploaded by a form. Whenever an image uploaded for a field for example is used somewhere hardcoded the same problem as described in #31 would appear when simply a new image is uploaded. Whenever an image path/ name is used hardcoded although the image can be changed by a form it is a wrong programming, isn't it?
Comment #33
ressaIt would be good to have transliteration of file names in Drupal 8 - and really great to get it into Drupal 8.3, if possible.
Comment #34
sense-designOne of the features why I have not yet completely switched to D8 for all my projects.
Would be great to have this in the upcoming 8.3 release.
Comment #36
hampercm commentedIn response to #30 and #31: perhaps add a drush command to transliterate existing files, if a site admin needs to.
Earlier code in this function does a setFilename() if the filename is changed (see code that handles insecure file extensions at line 818). I'd recommend doing that here, too.
This seems extraneous. Remove it?
Comment #37
kovtunos commentedUpdated the patch #28 to work with Drupal 8.3.0.
Changes: new PHP array syntax.
Comment #38
hgoto commentedComment #40
borisson_I tried this patch and it works, thanks!
I think the failure in #37 is unrelated, but this does need it's own tests. Since we're adding a new public function for the file name transliteration, all we'd need to do is call that function with different file names in the test, I think?
Comment #41
berdirWe can test it directly in a kernel test with diferent variations. I think we also need at least one integration test, upload an image with some spaces or so to make sure the function is actually called.
Comment #42
borisson_This patch adds the kernel tests. Not sure if the test-class is the correct one.
Comment #44
borisson_Ah, those migrate tests are related, I don't see how those should be fixed though.
Comment #45
hgoto commentedI believe the failed tests for #42 can be fixed just by providing the added configuration into
$expectedConfig['system.file']in the test classes.The test passed in my environment with this change and I'd like to see how the testbot reacts.
Comment #46
hgoto commentedOops, I'm sorry, the extension of the interdiff is wrong.
Comment #48
hgoto commentedThe tests passed. (The failure was just caused by the wrong extension of the interdiff.)
Comment #49
j-leeJust found a typo:
"options" should be "option"
The patch works for me. Thank you.
Comment #50
pascodiogo commentedShould not this transliteration be applied to the name displayed to the user?
Like this:
Without transliteration in the name displayed to the user:

With transliteration in the name displayed to the user:

Comment #51
hgoto commented@J-Lee, thank you for the review. I fixed the typo.
@PascoDiogo, in my opinion, this transliteration should not necessarily be applied to displayed names. I believe Transliteration module's project page would help to understand transliteration.
Comment #53
j-leeI'm not sure with #50. The original filename is replaced by the transliterated name. Why should a not existent filename displayed to the user? Isn't it confusing if you got a filename shown but the real filename is different?
Comment #54
tstoecklerShouldn't this use the file language instead?
Comment #55
frobAn argument could be made either way:
It is confusing to upload a file and see a different filename after uploading.
It is confusing to see the name of a file in the UI and then look for that file only to find out that it doesn't exist.
The first option is fixable by letting the user know that the filename will be changed upon upload for reasons. The second option would require adding a bit of UI to let the user know that the file name on the disk will be different than the one shown.
I lean toward the first option. Let the user know that the file name will be changed for reasons and then just display the actual file name.
Comment #56
skaught#53/#55
+1 option1.
the file description should include (like it does for file size restriction) a li child that outlines that filename will be cleanup.
This is similar to the fact that drupal adds '_0' ' _1' (file counts) if use keeps re-attaching the same file name.
i know i've had clients want to have file name as they have make them (ie: '2017 Q1 earning.xls'). but then would have a bit of overhead on tracking related file name changes are 'repairing' them for file downloads etc. same issue when a User re-attached a new version of that same file name (_0 _1 _2....)
Comment #57
frobKnowing the name of the actual file on the server is also necessary for complying with DMCA takedown notices and the like.
Comment #58
skaughti haven't looked over this whole issue in a while.
#4 andypost is correct.
Comment #59
pavelculacov commentedWhen it will be integrated in CORE?
Comment #60
seanb+1 for adding this to core. Not sure on postponing on #1925684: Provide a "original filename" field for file entity.. The feature is optional and the same problem (changing filenames) already occurs for duplicate filenames as mentioned in #56.
We could create a followup for the interface change mentioned in #56 "the file description should include (like it does for file size restriction) a li child that outlines that filename will be cleanup". It seems this particular part could turn into a huge bikeshed and I'm personally not convinced this is a big issue for most users. Hopefully, we don't have to block on that.
New patch to fix #54.
Comment #61
ndobromirov commentedThanks for the patch!
I've started using it on a project, but once I've enabled the configuration and exported it, the site installations started failing due to import of configuration issues. After some checks it turned out the config has stored 1 in the YML file, where the default value was false, so logically it should have stored true in there. Also hinted by the bool schema type.
I've changed the config manually in the YML file to true and installs passed.
I guess the fix should be to just type-cast the value, before setting it to storage from the form-submit handler.
This is also the fix implemented directly on the patch from #60. Diff between the two patch versions is provided.
I do not know should this regression needs any tests, so any help will be appreciated.
As I've done changed on the code still needs review :)
Comment #62
yan commentedPatch from #64 seems to work for me (although spaces are converted to underscores (_) which I think is not ideal).
Comment #63
pavelculacov commentedWork for me too always. :D
Comment #64
frob@yan, what isn't ideal about underscores?
Comment #65
millionleaves commented@frob
There is a difference from a SEO perspective. See #1 in the article linked below:
https://searchengineland.com/9-seo-quirks-you-should-be-aware-of-146465
Comment #66
frobThat article is from 2013 and the point you refer to is linked from 2005. Also, the files it is referring to are .html/.htm or in other words, that article is talking about the link to the content and not the link to the static asset.
SEO is constantly evolving and changing. Do you have a more current reference? I would argue that this should be configurable to make it long term SEO friendly, but I would also argue that should be contrib or another issue. I just don't want this issue to get side-tracked when it is so close.
From #61 Looks like it still needs work:
Comment #67
millionleaves commentedCurrent article from Google states:
"We recommend that you use hyphens (-) instead of underscores (_) in your URLs."
https://support.google.com/webmasters/answer/76329?hl=en
This is consistent with the earlier article I shared, and I don't see any distinction in either between html URLs and image URLs.
Another reason to use hyphens here is that the Transliteration module in D7 used hyphens for file names, so IMHO it is desirable to remain consistent with that unless the core transliteration functionality is also using underscores (although my argument in favour of
underscoreshyphens would remain).It appears that in this case, the use of an underscore is defined in one line of code in patch #61 that could be changed easily without affecting the patch overall.
Providing the option to choose the separator may be useful for some use cases, but agree that this would be better left for a separate issue for the core module rather than this patch.
Comment #68
imclean commented@millionleaves this is an interesting discussion as I've always preferred underscores in filenames while keeping hyphens in the rest of the URL. It turns out this method could be problematic due to underscores being a special character in computing.
Google's examples all use hyphens: https://support.google.com/webmasters/answer/114016?hl=en
This could be used to our advantage in certain cases. For example, "big_red_socks.jpg" you might want people to find only if they're looking for "big_red_socks" and not "big", "red" or "socks". Maybe not very often, but it's good to have that level of control.
Comment #69
frobYeah, it looks like it should be changed to a "-" character. The patch has always had this from #15
So that line needs to be changed and the issue from #61
Comment #70
frobMake the SEO change by rerolling the patch from 61, that leaves the issue mentioned in #61, can anyone review that?
Comment #71
borisson_This will also need fixes in the kernel-tests written for this issue, see
testFileNameTransliterationinmodules/file/tests/src/Kernel/SaveTest.php. Specifically the data-provider (provideFilenames) will need to be updated.Comment #72
panchoFixed provideFilenames().
Comment #73
frobPatch in #72 should resolve all remaining issues. All that is left is manual review.
Comment #74
seanbCould you provide an interdiff for #72?
Comment #75
Raja_Vijayakumar commentedI applied the patch#51 to my application. It only transliterating the filename in database but the file name still appears to be the original on UI for the uploaded file . any idea how do i fix the filename that appears on UI too.
Comment #76
j-leeSee comments #50, #51, #53, #55, #56 and #60.
This is a point of discussion but could/should a follow up issue.
Comment #77
orkutmuratyilmazanyone tested patch #72?
Comment #78
rang501 commentedConfirming that the patch in #72 is working.
Comment #79
eigentor commentedThe Patch #72 applies cleanly with Drupal 8.4.0.
The filenames for the uploaded files itself are correctly transliterated. It adds dashes for spaces.
The displayed filename are an issue for a followup, as I understand, so no surprise they are kept from the original filename.
Comment #80
rootworkComment #83
mcdruid commentedI think the test failures look unrelated to this patch - back to NR to try the tests again.
Noting also that this was mentioned as a step toward being able to remove the workaround for the file uri uniqueness constraint added in #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols), but it probably won't help much if it's configurable and opt-in. (The idea being that the uri field in the file_managed table could be set to ascii if that's all it ever contains, and as such there wouldn't be a problem with the key length when imposing a uniqueness constraint in the db schema).
Comment #84
mgiffordSetting it back to @rootwork's RTBC. As @mcdruid noted, "test failures look unrelated to this patch".
Comment #85
mgiffordComment #86
alexpottThis is a bit funny because calling getTransliteratedFilename() might not return a transliterated filename. Maybe it should be the caller's responsibility to check the config.
Adding a configuration key like this means we need an upgrade path for existing sites. Just an update in system.install to add the missing key with the default value.
Comment #87
Anonymous (not verified) commentedHi all.
First of all, well done and thank you very much for this patch. I applied it and it works a treat.
The "Enable filename transliteration" tickbox appeared in /admin/config/media/file-system
The uploaded file names are converted nicely. I tried with russian characters and romanian diacritics.
I have an issue/proposal related to it.
The file name shown in the admin UI is still the original one, which I think it can be confusing for many.
Do you think it would be a good idea to make it the same as the uploaded file name?
Thank you.
Andy
Comment #88
mcdruid commentedComment #89
yan commentedPatch from #72 worked for me for Drupal 8.5.1
Comment #90
ndobromirov commentedI would say that the update should set to FALSE for existing sites through the proposed update path in #86.
For new ones it should be TRUE by default, as file names transliteration is a generally a good thing to do on the web.
Comment #91
sopranos commentedI am trying to upload image with special characters in filename, Drupal does not budge (drupal 8)
any news?
Drupal 6 managed to upload it with no problem
I am talking about special slavic characters like this
srečna družina.jpg
Comment #92
ressaThe patch in #72 together with Drupal 8.6.x works for me. I applied the patch using the new command line tool to quickly install Drupal:
curl -sS https://ftp.drupal.org/files/projects/drupal-8.6.x-dev.zip --output drupal-8.6.x-dev.zipunzip drupal-8.6.x-dev.zip && rm drupal-8.6.x-dev.zipcd drupal-8.6.x-devwget -q -O - https://www.drupal.org/files/issues/use_new_transliteration-2492171-72.patch | git apply -php core/scripts/drupal quick-start demo_umamiAfter enabling filename transliteration under
admin/config/media/file-system, it seems to work perfectly:Original image name
01-a (SMALL!!!)wé↑s e3RQ#er24!"#øæå.jpgResulting filename, after upload
01-a-smallwes-e3rqer24oaea.jpgI also tried uploading a file named
srečna družina.jpg, and it was transliterated tosrecna-druzina.jpg, as expected.I agree with @ndobromirov:
Comment #93
chr.fritschSo, I
Comment #94
amateescu commentedI would expect this replacement to be done at a lower level, for example directly in
file_create_filename()because that would allow unmanaged files to take advantage of transliteration as well.Is there any reason for not doing that?
Comment #96
hgoto commentedSorry for interference. If the discussion for the default value of
filename_transliterationis not enough, I'd like it to be decided carefully.Surely I myself (= a Japanese user) am happy if the transliteration is disabled by default. And I guessed there would be several other languages like Japanese but I started to doubt there're many languages which have a problem with Drupal's transliteration rule like Japanese since I haven't found any opinion similar to mine in this page.
If the out-of-box experience of Drupal gets much better for the majority in case the file transliteration is enabled, it's acceptable for minority users whose language have trouble with the transliteration, even if they need to disable the function every time after installation, I think :)
Thanks for the consideration.
Comment #97
skaughtsee #18
update existing site and how that could effect expectations by those product owners should be included, regardless of language first install base.
Comment #98
alexpottFixing the tests
Comment #99
alexpottI think @amateescu might be correct in #94 that this should be done on a lower level. That way transliteration can apply to unmanaged file saves too.
file_create_filename()is not suitable because it is only called whenFILE_EXISTS_RENAMEis set. In fact there's stuff in that method that definitely should not be there. So I've added it tofile_destination().I also think having File::getFilename() and File::getTransliteratedFilename() is super confusing so I think we should avoid patches that do that.
I agree with the concerns raised about the filename on disk and the filename on the file entity not matching. I know it's an existing issue with the _1 etc added to duplicates but this makes it even more confusing in terms of File::getFilename() not matching the actual filename. I think we should consider either fixing that first or making it part of this issue.
Comment #100
alexpottThis sets the filename correctly in file_save_upload(). Needs tests.
Comment #102
chertzog+1 . This seems to work perfectly for me. Applied to 8.5.4
Comment #103
kirkkalaPatch in #100 works fine but causes issues when css/js aggregation is enabled and server cares about filename case sensitivity.
Modified that patch from @alexpott to come over the issue. Did not though think thoroughly if just .css, .js and .gz are fine to skip from transliteration.
Comment #104
kirkkalaOops, bad patch file in #103, sorry about that. Here's a patch that applies with the change that js/css filenames won't be transliterated.
Comment #106
alexpottFixing the test. I think this shows that the fix is on the right lines because the current expected message is
Raw "Field <em class="placeholder">test_file_field_1</em> can only hold 3 values but there were 6 uploaded. The following files have been omitted as a result: <em class="placeholder">text-0.txt, text-0.txt, text-0.txt</em>." foundI'm not sure about the fix added by #104 - I think we need a test for this situation and also to understand the problem more. This patch includes it and the interdiff is back to #100 so everyone can see what the fix is.
Comment #107
alexpott@kirkkala this is the change you added to fix CSS aggregation can you give some steps-to-reproduce the problem you are seeing?
Comment #108
alexpottI've tried to reproduce the issue detailed by @kirkkala by:
\Drupal::configFactory()->getEditable('system.file')->set('filename_transliteration', TRUE)->save();I've even set debug break points in file_destination() and as far as I can see this method is not called during CSS/JS aggregation.
Comment #109
arosboro commentedI had been on #2492171-100: Provide options to sanitize filenames (transliterate, lowercase, replace whitespace, etc) and experienced #2991606: Filename Transliteration enforces lowercase filenames which causes 404 errors with CSS/JS Aggregation turned on mentioned by @kirkkala in #104. I have written a test for that which I believe applies to #106
My solution was to change the AssetDumper service to ensure it creates a lowercase filename when filename_transliteration is enabled, so I have that in the wings as well. I am updating this issue with the test for that, which should check if #104 is a suitable solution as well.
See attached patch and interdiff.
Comment #110
alexpott@arosboro awesome! No idea why my debugging didn't work but your test totally proves the problem.
This should be discussed in a followup issue and not part of this patch.
Patch attached removes this and adds the test from #109.
Comment #111
alexpottSo here's a question can these rules futz with a name produced by
\Drupal\Component\Utility\Crypt::hashBase64()- I think it is possible because I think it is possible to generate a base64 encoded string with several consecutive non-alphabetical characters.I think we need to change how we're deciding to do transliteration and this should only occur on files uploaded by users and not generated files like CSS/JS aggregates.
Comment #113
alexpottSo I've moved the transliterating of filenames to
file_save_upload()as this is the method through which all files uploaded to the site are vaildated and processed. I don't think that filename transliteration should be part of the File entity as it is something that is done only when uploading the file to the server.Comment #114
arsoman commentedI used patch and its latest version worked nice! Thanks to the creator! Cheers!
Comment #115
pavelculacov commentedTested, worked
Comment #116
ressaIt works great, but I still don't understand why it is not enabled by default on a fresh installation ... Just to emphasize, I don't think it should be enabled for existing installations.
Comment #117
laravz commentedI'll be reviewing this at DrupalEurope
Comment #119
laravz commentedThe patch applied cleanly to a new 8.7.x and a 8.6.1. However, the test fails for 8.7.x. The file label cannot be found, because the waitForLink function returns Null.
Unexpected characters and sequences are properly being stripped. Languages like Japanese and Russian are being converted to a Latin alphabet.
Default is set to false, as discussed above. I think this is the logical decision, due to the issue with some languages and not everyone needing to translate filenames.
Files are being defaulted to false as discussed above.
Comment #120
dawehnerAfter discussion with @LaravZ I think personally think that we need to change
\Drupal\file\Plugin\rest\resource\FileUploadResourceas well, given that you expect transliteration as well.Maye we could extract all the transliteration logic into a helper function and call it from the upload resource.
Comment #121
andypost@dawehner maybe rest can have this optional/configurable per request option? makes sense to file separate issue
Comment #122
ivnishThanks! Patch is worked.
Comment #123
andypostI bet it needs usability review
Comment #124
sopranos commentedthis is default in new drupal? But I cant find the settings....
I went to content type story...than edit fields...and edit image field and I cant find the setting
Comment #125
ivnishsopranos, use patch from #113
Comment #126
lawxen commented#113 can't be applied to newest drupal8.7.x and drupal8.6.x any more, reroll it for drupal8.7.x(just file.module, and nothing functional change)
Comment #127
berdirNote that using this patch with the update function is a very bad idea because this adds a 87xx update function, which tells Drupal to ignore all future 86xx update functions.
It's not required anyway since FALSE/NULL is the implicit fallback, so I would suggest you provide an alternative patch without that update function that people can use if they are using the patch.
Comment #128
frobrerolled without update hook 8701
Comment #129
skaughtComment #131
drupalnesia commentedI think this additional feature should be in a module like D7 does in https://www.drupal.org/project/transliteration, because only few sites need special transliteration. While core only provide basic transliteration only.
Comment #132
dwwI disagree with #131. Content creators name their files all kinds of weird and unfortunate things. Even if everything is in English, it's still great to convert spaces to dashes, remove weird punctuation, special characters, etc. Plus, many users find themselves in a mix of case-sensitive and case-insensitive filesystems, so it's great to have Drupal automatically convert everything to lowercase. I constantly build custom functionality into my sites to cleanup filenames on upload so as to prevent the editors from making a mess of things. It'd be much better if core did that for me automatically.
In short: even if you don't think you want this, you probably do. That's a good reason for core to do it, instead of forcing site builders to look for a contrib solution to a problem they don't necessarily realize they have. :)
Cheers,
-Derek
Comment #133
berdirI think the problem is that transliteration only really makes sense for languages with latin characters.. if you try to use it with files that have chinese, arabic etc. characters than you just receive garbage.
That said, it is important to be configurable (maybe it would even need to be configurable by language, but that likely doesn't make much sense, as people might upload chinese filenames when adding an english translation or something) but I don't think it needs to be a separate module. It could maybe be a global setting provided by system.module that different things could check.. if you don't want to transliterate filenames, you probably also don't want to do it for pathauto aliases. But again, who knows, maybe someone does ;)
Comment #134
mfbI think we could say any site wants rules to cleanup file names. This would be removing undesirable stuff like whitespace, control characters, invalid bytes or extreme length from file names, and case and unicode normalization to avoid cross-file system issues. In Drupal this all got agglomerated into "transliteration", which is more clearly optional. There's nothing wrong with 🐈.txt or any other non-ASCII as a file name. The URL would be http://example.com/%f0%9f%90%88.txt which isn't super pretty but works fine.
Comment #135
frobAccording to the IS this has a technical limitation as well.
Look back 130 comments for the discussion as to why this should be in core. I am not trying to say we should have the discussion, rather I am saying that we already had the discussion.
We need a re-roll so tests pass again.
Comment #136
skaught#131 - i can't image a project not needing it. If you've never had a client/editor upload a file with a space in the name and then wonder why they can't embed a link in a WYSIWYG -- then you're luck to have such a tight scope on your projects.
🐈.txtThe basic point of clean names is ensure names are possible to type on a regular keyboard. Certainly MySQL storage of characters needs to match ability.
This base issue does outline having a setting. I'm not directly aware of where the patch is at this point -- hopefully it's not lost.
Also we had previous conversation here about tracking original names #4 has a sub issue.
Comment #137
skaughtEDIT
===
check for get('filename_transliteration') config exists.
Comment #138
dan_metille commented#132 I do not have much experience with chinese, arabic etc. characters, but for cyrillic characters, I very much rely on transliteration.
Comment #139
gpap commentedReroll for 8.6.7
UPDATE: This one messes up CSS/JS asset filenames when aggregation (preprocess) in enabled though. ¯\_(ツ)_/¯
Comment #140
skaughtComment #141
gpap commentedComment #142
alexpottRe #139
Yep this is why we can only do this for user uploaded files and not all files generated by Drupal.
Here's a reroll of #113 for 8.7.x with the update function re-instated and at least one of the tests fixed.
Comment #143
alexpottUpdated the issue summary.
Comment #144
alexpottOne thing I wonder is if this should be:
I wonder:
If we did the above suggestion then
system.file:filename_transliterationwould really only mean transliteration which seems a good thing. And the case-insensitive vs case-sensitive filesystems are a fact of like and so it seems worthwhile always doing. We problem should add a message a tell the user why this has occurred if it has though.Comment #145
panchoFor me lowercasing everything would be a much larger intrusion and annoyance than replacing spaces or consecutive hyphens (which both seems legitimate to do). Plus not many Drupal setups will be on a mixture of Windows and Linux machines.
Even then: be smart, not intrusive. Instead of enforcing lowercase, enforce case insensitivity.
Also, spaces are usually replaced by underscores. Multiple hyphens are rather hard to read, especially if some of them are actual hyphens, while others are whitespace replacements.
„Force lowercase“ (why not force uppercase?) and the whitespace replacement character at the very least need to be configurable. But then it might be something for contrib.
We might possibly want to only do a minimalistic
Drupal::transliteration()->transliterate()and leave the whole rest to contrib.Comment #146
daniel.bosenI changed the description a bit, since Thunder is not actually shipping this patch, but implemented a custom solution similar to this.
The reason for introducing a similar mechanism in the Thunder distribution were faulty http clients. E.G. the OkHttp client which is used in several mobile apps has a bug, where it double encodes urlencoded image names.
So technically only important for us is:
While reducing multiple consecutive characters or lowercasing is not required on production systems. But for developers, it might be interesting to also have the lowercasing, since Windows, as well as OSX, have case insensitive filesystems.
Comment #147
flocondetoileWhat about if we made configurable theses options, once the filename transliteration option is checked ? In this way, many use cases could be satisfied ?
At least :
- Replace whitespace with - or _ => select list, default is _
- Remove multiple consecutive non-alphabetical characters ie --- or ... or ___ with be replaced with - or . or _ => checkbox TRUE/FALSE, default is FALSE
- Force lowercase => checkbox TRUE/FALSE, default is FALSE
Comment #148
dwwAgreed with #147, except there should be a "don't replace" option (or something) in the select list for "Replace whitespace". Otherwise, yes, I'd love it if core provided these options, instead of punting this to contrib. We're so close here to something that will satisfy almost everyone's use-case.
Thanks,
-Derek
Comment #149
dwwp.s. The top-level checkbox could be "Transform filename", and if selected, reveals all the other settings:
[X] Transliterate
[ ] Force lowercase
...
The config setting IDs could be of the form:
filename_transform
filename_transform_transliterate
filename_transform_lowercase
filename_transform_convert_whitespace
filename_transform_only_alphanumeric //..._and_separators (?) ugh.
filename_transform_strip_multiple_punctuation // or ..._strip_multiple_separators (?)
...
Comment #150
pancho++ :)
++ :)
This would be awesome and definitely better than doing just a minimum and leaving the rest to contrib. I was just anxious we would force an overly strict ruleset on users who would in many cases feel disenfranchised. Now I know (and agree) we don't unnecessarily add clutter to the Core UI. However, I believe that this feature warrants some configurability.
Even if configurable, I still think that instead of lowercasing filenames, we should enforce case insensitivity in our files table. However that clearly needs to be investigated further.
So let's go forward as proposed by @flocondetoile and Derek. I hope noone's angry if for now I'm setting the issue to "needs work."
Comment #151
alexpottHere's a start.
Reviews of the UI text and config key names would be great.
We need more test coverage of how the sanitisations play together.
Comment #152
dww@alexpott re: #151:
Looks great, thanks! This is getting close. Agreed it needs more tests. Haven't manually tested, but here's what I see looking at the patch:
1. A few unrelated 80 char hunks at the front make it less readable as a patch.
2. We're inconsistent about "whitespace" vs. "spaces" in the config/variable names, UI text and how it actually works. Is this just for spaces or is it all whitespace (tabs, newlines, etc)? Most of it points towards all whitespace. I think we should standardize on that, expand the
str_replace()to actually work that way, fix the UI text to match, etc.3. Do we want to dispatch an event to let contrib easily continue to tweak the filename before we use it, or is that too much scope creep? How else do we expect contrib to extend this?
4. I don't see the "-- disabled --" pattern anywhere else in core. "- None -" seems much more common. If we keep "disabled" it should be "- Disabled -". Maybe "- Do not replace -"?
5.
Description repeats title and contains a typo. Maybe this:
? "...are allowed."?
6. I know I just suggested 'transliterate' should be an option, but this is still in the summary:
Do we want to force transliteration to help resolve @catch's concern, and make everything else optional? Or do we keep this a config knob and if people run into that problem, they have to turn this on? Seems non-ideal. Should we at least default to on? A bulk rename of already-uploaded files is way beyond scope for this. :/ Should we open a follow-up about that? Does it already exist?
Thanks!
-Derek
Comment #153
panchoSome more comments:
"sanitize_filenames" would be easier to grasp. We have "temporary_maximum_age", "default_scheme" and "path", but also "allow_insecure_uploads." On the other hand, this is a container, not a boolean. So yeah, maybe keep 'filename_sanitization.'
Maybe "strip_non-alphanumeric"?
"dedupe_separators" is enough. If they're not consecutive, they're no duplicates.
Maybe just "Filename sanitization" without "options"?
How about "Transliterate filenames"?
We're replacing not only characters that are exclusively Unicode, but all non-ASCII characters. So how about "Transliteration ensures filenames contain only ASCII characters. It is recommended to keep transliteration enabled."
"Strip non-alphanumeric characters except for dots (.), underscores (_) or dashes (-).'
"Remove duplicate dots (..), underscores (__) or dashes (--)"
"Convert filenames to all lowercase."
Comment #154
flocondetoileI think we need a #state on each option to hide them on the config form if the feature is not enable. This way, all the options will not be visible until the checkbox transliterate is checked.
Comment #155
dww+1 to all the notes in #153.
-1 to using
#statesfrom #154. None of these features depend on using "transliterate". That's simply one kind of sanitation a site might want. But you could still want to strip whitespace even if you're not transliterating. As @mfb pointed out at #134:.
If we wanted a global "turn on any of this stuff" option, it should be called "[X] Sanitize filename", not "Transliterate". But, I think that complication is unneeded. It's a small fieldset of distinct options, so let's leave them all visible by default.
Thanks,
-Derek
Comment #156
andypostNo reason for states, meantime makes sense to give a room for contrib to extend
At lease token and https://www.drupal.org/project/field_tokens
Comment #157
dwwRe-title, move component (since this is about more than transliteration now), and summary update to reflect the current state.
Comment #158
dwwA few more edits. Also fleshed out the Remaining tasks. Currently:
Getting close. ;)
Cheers,
-Derek
Comment #159
dww#152.1: Fixed.
#152.2: Fixed in favor of all whitespace, not just spaces. Crossed that off from the summary remaining TODO.
#152.3: Added @todo for now. ;)
#152.4: Fixed. Using '- None -'. While I was at it, I made the labels for the other choices readable. Very hard to tell - from _ in the previous version.
#152.5: Fixed (mostly via the text proposed in #153). Now:
Fixed all of #153, except I further simplified these:
Still needs work (for tests and other things), but back to NR for the test bot. I'm done for tonight, if anyone else wants to pick up the torch. ;)
Thanks!
-Derek
Comment #160
dwwWhoops, I forgot to update the settings in the test. ;) Sorry about that.
Comment #161
dwwOoof, I realized I partly renamed "remove_non_alphanumeric" to "strip_non_alphanumeric" so that the setting name matched the UI text, but I was inconsistent. This should actually work. ;) Sorry for the noise (and bot churn). My local testing world stopped working suddenly, and I don't have time to get that happy again right now.
Summary updates:
- Fixed typo in remaining tasks generating an empty todo.
- Add link to #159 about whitespace.
- Add settings label to each config key in the UI section.
Comment #165
dwwNot sure exactly what the bot is confused by, but this should work.
Comment #167
mcdruid commentedThanks for all the work on this. @dww you mentioned:
It would be good to remind ourselves what would need to happen to filenames (actually uris) in core in order for us to restore the constraint in the db. (Especially as the code-level constraint is not yet properly enforced - see #2869855: Race condition in file_save_upload causes data loss).
Perhaps that aspect of this change needs to split off into a separate issue? Ideally we wouldn't make that part of the filename processing optional, as otherwise we're stuck without the database-level constraint.
I'll try to have a closer look at this ASAP but wanted to bring this up so that it's not overlooked.
Comment #168
dwwSorry, my regexp for replacing all whitespace had a typo (
/svs\s). It also had a thinko, since I was telling it to replace 1 or more (via+) (since that's what my custom code usually does), but really we want individual replacement (since dedupe is now a separate setting). Got my local test world happy again, and at leastDrupal\Tests\file\Functional\SaveUploadTestnow passes locally, so I think the bot should be happy with this.For fun, I'm also attaching an interdiff relative to #151 to easily review everything I've done since @alexpott's last patch.
Re: #167: Yeah, sounds good. Thanks for the link and the info. Agreed that needs more thought, and perhaps should be handled separately.
Thanks,
-Derek
Comment #169
alexpottThis is the sanitization that scares me the most because it is the most potentially destructive. I'm my mind this only really makes sense if you have transliteration enabled too.
I think dispatching an even makes heaps of sense. And if we do core should use the event rather than putting all the code in here. What's nice about that is that it make it much easier to test - when you are only dealing with a filename.
Comment #170
krzysztof domański1. I added dataProvider. It's easier to add more test cases now.
2. I changed the name of test testTransliteration() -> testFilenameSanitization().
Comment #171
dwwNow with significantly more test coverage of the options. ;) Tentatively removing 'Needs tests' and crossing this off from Remaining tasks, although there are a few possible combinations we're not explicitly testing. SaveUploadTest all passes locally. Let's see what the bot says...
Comment #172
dwwUgh, whoops, cross-post. :/ Dang. I'll take stab at merging #170's dataProvider with all my new test cases. Stay tuned.
Comment #173
dwwDone. All passes locally.
Comment #174
dwwFixing more typos in the summary.
Re: #169.1: Yeah, it's a bit scary. In my custom implementations, I usually convert to _, not entirely strip. I wonder if we want a similar setting to the whitespace one that lets you decide what you want to happen for these. Or are you proposing we do away with this entirely? I don't love the idea of forcing transliterate to be on for this to happen. And there's now a test for what happens if you use this without transliteration (works as expected, even if it's scary). ;)
Re: #169.2: Great. :) I guess that means system module should listen to the event and respond based on the system.file settings (since system.module is providing the settings in the first place). Is that what you have in mind?
I'd still feel more comfortable with the Functional test coverage we now have. It's good to know the uploads actually work and the files are renamed. I'm less interested in unit tests that make sure
preg_replace()does what we tell it to. ;) But sure, I guess we could add those, too.Cheers,
-Derek
Comment #175
panchoVery nice. However,
1) why is Scandinavian "Å" transliterated to "oe", not to "Aa" or possibly "A"?
Not the right place to change any of these mappings, but at least we should have some tests proving we're correctly applying the langcode to trigger language-dependent overrides.
2) I'm tending to agree with @alexpott in #169 considering the "Strip non-alphanumeric" option as needlessly destructive if not coupled with a proper transliteration and actually applied after the transliteration. While the other options may be independent, I'd say these two should be coupled. We might want to move the "Remove whitespace" option to the top and have:
☒ Replace whitespace
☒ Transliterate
⮡ ☒ Strip non-alphanumeric
☒ Remove duplicate dots etc.
☐ Convert to lowercase
These could actually be sensible defaults.
Comment #176
dwwRe: #175.1: Totally out of scope for this issue, IMHO.
Re: #175.2: The order is fixed in the implementation. We're not letting people configure the order at all.
Transliteration always comes first, so I don't see any problem there.
I don't think the order of when we replace the whitespace matters, so long as that comes *before* the dedupe_separators transformation.
Personally, I'm *always* going to have [X] Transliterate enabled, so if y'all want to force the 'strip_non_alphanumeric' to require that, I won't complain. ;)
I don't think we can enable any of this by default, but I'll let a release manager decide that.
Cheers,
-Derek
Comment #177
dwwAlso, re: #175.1 -- you're seeing another manifestation of #2922638: No charset on response for patch & text files which is a d.o bug when viewing patch files. If you download the patch and look locally, you'll see it doesn't actually use Å at all...
Comment #178
pancho1) You're right in that "Transliterate" has to come before "Replacing whitespace", as it replaces quite some characters with whitespace, that might have to be stripped lateron.
2) Furthermore, "Replacing whitespace" definitely needs to happen before "Stripping non-alphanumeric," otherwise words are illegibly concatenated.
So "Strip non-alphanumeric" should basically only be available if both "Transliterate" and "Replace whitespace" are checked.
Both checked:
☒ Transliterate
☒ Replace whitespace
☐ Strip non-alphanumeric
☒ Remove duplicate dots etc.
☐ Convert to lowercase
Otherwise:
☐ Transliterate
☒ Replace whitespace
☒ Remove duplicate dots etc.
☐ Convert to lowercase
☒ Transliterate
☐ Replace whitespace
☒ Remove duplicate dots etc.
☐ Convert to lowercase
☐ Transliterate
☐ Replace whitespace
☒ Remove duplicate dots etc.
☐ Convert to lowercase
This also reduces the number of possible combinations from 120 to 20 which is way easier to test one by one.
3) When replacing whitespace, we might want to avoid adding the replacement character to the beginning or the end, so we would trim the filename first and only replace inner whitespace characters.
Comment #179
dwwI'll give this another push. I agree on
trim(). I think replace_non_alphanumeric (with a '' strip option) is better than forcing strip. I'm not totally sold this needs to require transliterate, but if that's the will of the masses, I'll make it so. ;) I disagree this *requires* that you already replaced whitespace. In fact, if you use replace vs. strip here, you don't even need the whitespace conversion at all. So I'm definitely not going to make this require whitespace. But yes, I'll keep it running in the same order.Stay tuned...
Comment #180
dwwRenamed
strip_non_alphanumerictoreplace_non_alphanumeric. The setting is now a select, similar to the one for whitespace, but it adds a '- Remove (do not replace) -' option. Probably needs UI/text review/help. ;)Fixed #169.1 + #178. The
replace_non_alphanumericonly appears (via#states) if the 'Transliterate' checkbox is checked. The code to enforce this only fires if 'transliterate' is enabled in the active config.Renamed
force_lowercasesetting tolowercaseand cleaned up some docs + text to match.Fixed the
replace_whitespacesetting to calltrim()(#178.3). Unfortunately,trim()is a bit confused if you have spaces "at the end" of your filename, because there's also the file extension itself. So, while we'll correctly trim 'something.txt ' right now, we don't do exactly what you expect with ' something .txt'. I added a test and a comment about this.Added some more test cases to touch
trim(), ignoringreplace_non_alphanumericiftransliterateisFALSE, etc.Uploading screenshots of current settings UI. Probably needs help all around, but I figure I'd get something up for now to make it easier to review.
Unassigning again.
Before I (or anyone) works on the event, it'd be nice to have direction on where all that should live. Who defines the event itself? Seems a bit weird for the Event class to live in system if it's being dispatched from file. I don't fully grok the relationship between file.module and system.module (e.g. why is system providing this settings form, not file)? Should system.module declare the listener that enforces all these settings?
Thanks!
-Derek
Comment #181
alexpott@dww er.. the config should so be in file.settings - my bad. I forgot about file module :)
Comment #182
dwwRe: #181: Well, system already provides FileSystemForm. We're just extending it. Ditto the tests. I think a complete refactoring of all that is out of scope here, no? I just wanted some direction on how it should work if we're going to go through the trouble to add and dispatch a new Event here. I'm not proposing we completely fix everything about system vs. file module in this issue. But I'd like to ensure that whatever new API we're adding isn't going to immediately need to be deprecated. ;)
Thanks,
-Derek
Comment #183
alexpottYeah but this functionality is only for uploaded files and that's enabled by the file module. So we should move it to the file module and use a form alter to put this on the system form - because we don't need another form.
Comment #184
dwwRe: #183: Hrm, no offense, but that seems ugly and weird. ;) There are already some things that live in file.settings (although I can't find anything in core that lets you set them). I don't want to add more to that schema, and form_alter a UI for part of it, etc. See also #2894194-14: Move 'temporary_maximum_age' setting from 'system.file' config to 'file.settings' config. I'm not sure untangling that entire mess makes sense as part of this issue.
Meanwhile, I'm still looking for input on the API and layout of the event + listener. ;) Since we're starting from scratch on that, I'm happy to layout the code and name things in an agreeable way from the start.
Thanks,
-Derek
p.s. Updated the summary's config mapping to match what the patch is now doing.
Comment #185
dwwOkay, maybe this is better. The patch only touches core/modules/file now (except core/modules/layout_builder/tests/src/FunctionalJavascript/InlineBlockPrivateFilesTest.php). It still seems a bit hacky, but so is having system.module responsible for any of this. ;)
It also means we no longer need those migrate-related test hunks either (since this is no longer in system settings). Those tests pass locally without the hunks, so I'm leaving them out. I don't see any tests trying to migrate file.settings, maybe I'm missing them.
The interdiff is nearly as large as the patch. Oh well. ;)
#181 and #183 are now addressed.
Comment #186
dwwMore IS tidy-up. Added these to the todo on finalizing the UI text:
thanks,
-Derek
Comment #187
dwwNow with a FileUploadEvent, subscriber (where file enforces the sanitization settings), dispatch at the right spot, etc. Updated summary accordingly.
Would love feedback on the event, its API (
get/setFilename()vs. just making$filenamepublic), if/how we should let subscribers munge the destination, too, if/how/where we should better document that this event is being dispatched, etc.Thanks!
-Derek
Comment #188
pingwin4eg@dww, nice work!
Do we actually need to "enforce" default config presence via
hook_update_N? I think it'd be better to check if config is set in our event class.This should use dependency injection.
If a transliteration is optional now, shouldn't we use functions with multibyte support instead of
preg_replace?Also there are few places where coding standards are not followed.
Comment #189
pancho1) Another thought why at least part of this should live in system: these sanitization options basically don't exclusively apply to filenames but may, possibly should be used for machine_names and possibly other string items.
2) re #180:
We might want to turn to a regex instead and make it strip all whitespace not only at beginning and end but also before/after any period "."
3) The latter should also apply to "replace non-alphanumeric characters." A file named 'Drupal (v8).jpg' shouldn't be sanitized to 'drupal-v8-.jpg'. If enabled, we should be stripping all non-alphanumeric at the beginning and end of each part of the filename, and only replace occurrences in the middle by the selected character.
4) For filenames (as well as machine_names) underscores and dashes aren't different to alphanumeric, while dots (.) bear special significance. I don't think we should replace anything by a dot. A file with the name "myßdoc" (without file extension) shouldn't be sanitized into "my.doc". We may preserve dots when replacing non-alphanumeric and we may remove dupes, but we shouldn't allow to replace anything (whitespace, non-alphanumeric) with a dot.
5) While unduping underscores (__) and dashes (--) may be a matter of preference (and usecase), duplicate dots (..) IMHO should be always deduped. They simply shouldn't exist in a filename in the first place. I would even do that silently, no matter what options are chosen.
Comment #190
alexpott@Pancho machine names have their own regex - I think by bringing up machines here we are making this more complex than it needs to be. Machine names and file names are different things.
Comment #191
pancho@alexpott I know machine_names have their own regex, but should they? How much are they different?
I'm not saying we necessarily need a UI for configuring machine_name sanitization (besides possibly transliteration), but under the hood, machine_names could actually use the whole stack of filename sanitization as follows:
Besides avoiding code duplication, this would be an improvement as the current machine_name regex regularly produces sub-par results which I end up editing manually. This clearly is beyond scope here, but all I'm saying is: some of this is not just file stuff, but system stuff.
Comment #192
alexpottI'm really torn here - between an event that has the whole file and one that is just about filename sanitization and just receives a filename rather than a file object. The problem with an event that has the entire File object is that then there are lots of things the subscribers can do that are out-of-place. For me given the security implications I think just firing an event with the name is a good idea. Way less complexity. Also given the security implications it is vital that the event subscribers do not change the extension of the file so we should code in some form of check for that. Or prevent it from occurring at all by only passing the part of the filename.
This can now move into a unit test because the event is unit testable. Which is great because then we don't have to reinstall the system 10 times as we do for this.
Comment #193
wim leers+1 for this functionality.
But if we do it, we should also do it in
FileUploadResource, so that API file uploads are treated the same. We already have explicit test coverage for exactly this sort of thing, which this issue could expand:\Drupal\Tests\jsonapi\Functional\FileUploadTest::testFileUploadUnicodeFilename().Comment #194
dww#186.1: I changed it to just "
- Strip -" for now. Further review and feedback on this welcome.#186.2: I went for yes. Capitalized the options.
#186.3: Still pending...
#188.1: I don't know. I'm just following the lead from @alexpott and moving his DB update function around.
#188.2: Fixed. Good catch! I DI'ed the config, but forgot about the transliteration service while moving this code around.
#188.3: Fixed. I hope. ;) Tests all still pass locally, so I think this is cool.
#188.other:
Where? I try to be pretty pedantic as I write code.
#189.1: No thanks, out of scope (per @alexpott in #190 and #192).
#189.2: Sort of fixed by having the event only touch the non-extension part of the filename, and gluing the extension back on after dispatch. See below. I think your proposal for trimming whitespace around dots makes this more complicated and error-prone.
#189.3: No thanks. That's overly complicating this. a file called
drupal-v8-.jpgis just fine with me. If you really care, you can now write your own event subscriber in custom or contrib! ;)#189.4: Sure, makes sense to me. I removed the
dot (.)option from both selects.#189.5: Hrm, not sure. I'd like someone else to weigh in on that before I change how it works. I don't think we should always and silently dedupe '..'. Seems like it should be a setting, and we already have a setting, so why mess with it?
#192.1: Agreed.
#192.2: Both fixed. The FileUploadEvent no longer takes a File object. Instead, it gets the filename (without extension, the "filename" part of what
pathinfo()returns), and a (read-only copy of the) extension. Plus the language ID (which we only have in the$file, but need fortransliterate(). This nicely solves the problem of trim() and filenames like 'something .jpg'. Makes the event more bounded in scope. There's only a setter for the filename. The extension and language ID are read-only via getters.I wonder if this event should be renamed to something like FileUploadSanitizeFilenameEvent to be even more self-documenting.
#192.3: Todo. I kinda like leaving at least a couple of these cases in here so that we know the whole thing works (integration), the event is dispatched, it actually renames the uploaded file, etc. If we leave a couple of the nasty cases in here, I'm okay trying every possible combination of settings as unit tests on the event. Cool?
#193: Thanks @wimleers! That's why I pinged you about this. ;) After a brief look, it seems like dispatching the same
FileUploadEvent(or whatever we rename it) at the start ofFileUploadResource::prepareFilename()is the right thing to do, right? If so, how do I get the correct language ID inside this resource? And if I need to inject the event dispatcher service, can I safely change the constructor of this thing without breaking BC? ;)Cheers,
-Derek
Comment #195
dwwThis seems cleaner. The event takes the full filename in the constructor, does the
pathinfo()itself, and provides agetFilenameWithExtension(). Makes it easier to dispatch the event in multiple places, as this patch starts doing to address #193 (although I left some @todo for my questions for @wim leer in #194). I wonder if all the rest/api tests will pass with this. ;)Still needs Unit tests for all the possible settings. Anyone else want to take a stab at that? ;)
Comment #197
alexpottNeed to remove the $extension.
Tricky to name this. But the now this only gets the name and is about names I think we need to reflect that.
SANITIZE_NAME?
Extra line here.
Missing doc block for the class. I think we could document that this event only allows changing of the filename and not the extension of anything else about the file. And then we could lose the "This value is passed for context".
Comment #198
dwwBah, sorry about that. I had confused my git checkout (forgot to re-add the file before the
git diff) and left out a final (very important!) edit. ;) This should pass. Unless REST hates me now. Let's see.Comment #199
dwwUgh, x-post w/ @alexpott..
#197.1: Already fixed in #198. ;)
#197.2: Right, in #194 I proposed "I wonder if this event should be renamed to something like FileUploadSanitizeFilenameEvent to be even more self-documenting." I'm going with FileUploadSanitizeNameEvent here. Is that too verbose or are you okay with it?
#197.3: If the name of the event class has "name" in it, maybe this is unneeded?
#197.4: Fixed.
#197.5: Fixed. How's this?
#197.6: Fair enough. We could revert the functional to a slimmed-down version of what I had in #171 (no provider) and flesh out full coverage of all the options via Unit tests (that probably want a provider). I'd love it for someone to give the unit tests a first stab, as I confess to not being sure the right way to go about that. Presumably we want a loop where we can twiddle the active config, dispatch the event with different initial filenames, and assert that the filename after the dispatch is what we expect.
#197.7: Yup, that's in the Remaining tasks in the IS as:
What do you have in mind? Does the dispatcher in file.module simply do a
drupal_set_message()if the pre vs. post sanitization filename is different?What's the right way (if any) to provide such feedback for the rest response? ;)
Comment #200
dwwAlso re #197.7: Or does that imply we want
FileUploadSanitizeNameEventto have knowledge of a message for the end user, and subscribers can set additional messages on the event as they munge things? That sounds complicated and scope-creepy. Can we do a single(or something) for now and punt anything more fancy to a follow-up?
Thanks,
-Derek
Comment #203
alexpottI think #200 is the minimum we should do - so let's start there. There's prior art...
\Drupal::messenger()->addStatus(t('For security reasons, your upload has been renamed to %filename.', ['%filename' => $file->getFilename()]));Comment #205
dwwFixed tests (I didn't finish the Event rename in
core/modules/file/src/Plugin/rest/resource/FileUploadResource.php).Fixed the CS bugs the bot is complaining about.
Fixed #197.7/#200/#203.
Crossing off a lot more of the remaining tasks. ;) Getting closer!
Comment #206
dwwA few minor cleanups and updates to the summary.
Comment #208
dwwWow, nice coverage... go API-first team. ;) Even though the Drupal UI doesn't seem to let you upload files without extensions, REST does, so we can't assume pathinfo() will return an extension. I think this should pass now (even though it's a bit more ugly).
Comment #209
alexpottI don't think this ternary is necessary because you've taken care of this case in the constructor. So
$this->filename . '.' . $this->extension;will work just fine.There's a little bit of complexity here. I wonder if we should only have a single message about renaming the file on upload.
The new name does not leave space for confusion - so whilst verbose and long I'm in favour.
Comment #210
dww#209.0: Except we do, b/c if you have a file without an extension ("foo"), your version of
getFilenameWithExtension()would return a trailing dot ("foo."). The alternative would be to test and store the extension with the dot, then this function could be simple, but then callers would get ".txt" when usinggetExtension(). Maybe that's ok, but it seemed like the raw extension without the dot would be a kinder return value forgetExtension().#209.1: Well, yeah, probably. ;) I wanted to avoid too much complexity, but you're right. @todo.
#209.2: Great.
Also updated IS to remove mention of replacing anything with a dot and other cleanups. Screenshots are still stale, but I want to wait for more feedback on that before I update them.
Comment #211
krzysztof domańskiMinor bugs if we add a file with an existing name.
1. Add file (e.g. "Zürich.png") twice. If file name exist then _N is added at the end (e.g. "Zürich_0.png", "Zürich_1.png"). If transliteration is enabled we get the correct result "Zurich_0.png" but message is:
"For security reasons, your upload has been renamed to Zurich.png."
Should be:
"For security reasons, your upload has been renamed to Zurich_0.png."
2. Enable "Remove duplicate dots (..), underscores (__) or dashes (--)". Upload twice any file with underscore at the end (e.g. "example_.png").
Result is "example__0.png" with duplicate underlines (__). Should be "example_0.png".
Comment #212
dwwTest cleanup:
- Restore basically the tests I added in #171, removing the functional test provider from #170, slim down a bit, and update everything to match the current settings names, etc. So, we still have functional integration tests for a few of the complex cases.
- Add a new Kernel test to do what I described near the end of #199. There's a provider. For each case, we set the file.settings config accordingly, create and dispatch the event, and assert that the event's final
getFilenameWithExtension()matches the expected filename. I couldn't quite grok how to do this as a Unit test, but as Kernel it seems easy-peasy. Runs very vast locally. ;) Seemed a bit wonky to insert this into one of the existing Kernel tests (e.g. @borisson_ in #42) so I went with a whole new class. Hope that's agreeable.If someone wants to get really nit-picky, we could now easily add enough cases to
SanitizeNameTest::provideFilenames()to cover every possible combination of settings.Re: #211.1: Thanks for testing. Yup, that's the @todo about the message from #209.1.
Re: #211.2: Interesting. That's because the deduplicate filename stuff happens after the filename sanitization. Which it has to, since we obviously want to check for filename collision on the sanitized name. I don't think we want to dispatch the event again and double-sanitize. I think we might have to live with that behavior. @alexpott: thoughts on this?
Comment #213
dwwEven more test cases for Kernel now. I think we now have plenty of tests. ;) Untagging and crossing that off from remaining tasks.
So, this really NW for the message cleanup (#209.1 et al), but then I think we're in the home stretch for final UI word-smithing...
Comment #214
dwwConsolidates and fixes the user-facing warning message. Should solve #209.1 and #211.1.
Added some assertions to the Functional tests to make sure we see the message as expected when filename sanitization happens.
Crossing this off from the remaining tasks.
Home stretch now! UI / text reviews please?
Thanks,
-Derek
Comment #215
alexpottI think we should set a flag here to use the security text for the rename and if it is not a security rename use different text. Maybe something like "Your upload has been renamed to %filename." to keep it simple.
I think there's a problem with the title being replace and then an option being strip. This is tricky.
Really nice idea having both the names and examples of the characters - makes it clearer.
I don't see how the language is being set here. Either we need to extract it somehow from the data being posted. As far as I can work it out the langcode used when creating files here will be
\Drupal::languageManager()->getDefaultLanguage()->getId()as this is what is set in \Drupal\Core\Field\Plugin\Field\FieldType\LanguageItem::applyDefaultValue()I dunno - it's not necessary for this code to test the transliteration service. But it might be neat to prove we pass the correct langcode to it.
This unit test is really nice work. Great to see the test coverage. We can go a step further and make this a unit test and mock the transliteration and config services and only call FileEventSubscriber::filenameSanitize(). But of course that'll tie the test a bit more to implementation. But this might be worth it given with already good functional test coverage.
Comment #216
dww#215.1: Done. Since there are already new strings here, translations are going to have to adjust, anyway.
#215.2: Agreed. That's why I flagged this multiple times for feedback. ;) It seems like the possible solutions are:
A) Fix the label of the thing to be more generic than "Replace". Perhaps:
"Transform non-alphanumeric characters:"
Options:
B) Change the "Strip" label to something else like "Remove (do not replace)" (like it used to be).
C) Live with it as-is. Stripping is a form of "replace", since we're replacing with an empty string. ;)
D) Other?
#215.3: Great, thanks. :)
#215.4: Thanks for looking deeper. Yeah, that seems likely to be the only way to do it. I was hoping @Wim Leers would answer about The Right Way(tm), but I'd be okay using the default language like you suggest.
On a related note, do you have opinions on changing the FileUploadResource constructor to properly DI the event dispatcher (and probably this language manager service)? I've seen various debates about the best way to handle this. Should we leave the constructor alone, add setters for these 2 new services, and call those setters in
create()? Or do we just change the constructor and let anyone who happens to be extending it change their code?#215.5: Yeah, I don't really care. I agree this test is not responsible for testing language-specific transliteration. I was only trying to be responsible and flagging it if someone does care.
#215.6: Glad you like the Kernel test. :) I don't see how all the additional hoop-jumping is "a step further". This is already testing what we want to test to make sure all these settings work as expected. Mocking other services seems like a lot of additional effort for no benefit. To speak frankly, this is exactly the kind of thing that burns me out and makes me not want to try contributing to core anymore. I've already spent way more time on this issue that I can ethically bill any clients for. I'm just trying to push it over the finish line so we can actually commit this, finally realize all the hours I and others have poured into this issue over the last 4 years, and move on with our lives. We've got solid / complete test coverage. Now you're suggesting I spend another few hours for no apparent benefit. No thanks. If you want to do so, knock yourself out. ;) But honestly, you have a lot more important and useful things you could be doing (RTBC queue, I'm looking at you) than "polishing screws" and further rearranging these tests for no actual benefit. Sorry if that sounds harsh, but I wanted to express my frustration in an honest and open way. I know you're looking out for the health of the code, and I respect that. I also want us to also look out for the health and well-being of our (shrinking?) contributor community and not make trying to improve core so hard and painful that fewer and fewer people are willing to try.
Comment #217
dwwFixes #215.4, #194/193, etc.
Slack thread w/ @gabesullice:
---
@gabesullice: AFAIK, no. But this reminds me of this issue: #3021155: Accept client-generated IDs (UUIDs) for file uploads too
@gabesullice: I think your request highlights that we need a way to specify field values on the upload `application/octet-stream` request
@dww: @gabesullice Thanks, makes sense. So I'll just inject the language manager and get the default. Meanwhile, preferences on the BC implications of changing the constructor to get the 2 new services?
@gabesullice: I think you mean in core. @wimleers would be the person for that.
@dww: Well, its easy to change core in this issue. :wink: The question is if anyone who might be extending FileUploadResource would complain. Thought you might know/care about that, too. If not, no worries, and I'll wait for wim. Thanks again!
@gabesullice: Ah, I misunderstood. I still don't have a preference. But you should probably be aware of this issue too then: #2940383: [META] Unify file upload logic of REST and JSON:API
@gabesullice: That will already change the constructor
@dww: Cool, sounds good. For now, I'll change the constructor in 2492171 and we can work out the collision if/when the first one lands.
---
Still really NW for #215.2 and other possible string review/fixes. But otherwise, now almost entirely free of @todo. ;)
Adding #2940383 to the top of related issues since apparently this now conflicts with that on constructor changes to
FileUploadResource. Also adding #3021155.Cheers,
-Derek
Comment #218
dwwBased on slack convo in #d9readiness, here's a new version of the FileUploadResource constructor change that gracefully triggers a deprecation error if you don't pass the new args, instead of outright crashing. Should ease any BC-breaking concerns.
Comment #219
dww@alexpott and I chatted in Slack. We decided option D for #215.2 is best: remove the 'Strip' option entirely. ;) It simplifies life, makes things more consistent, and is safer. If people want to dedupe, they have a setting for that. I noticed this now lets us construct a single array of options and reuse it for both settings inside the form_alter, which is nice. Updated all the tests to match, and removed all mention of "strip" from the patch (and the IS). If a site *really* wants to strip these, not replace, they can subscribe to the sanitize event and do whatever they want. ;)
Also quickly fixed #215.5. The Kernel test provider can now optionally define a language code. Added 3 cases to ensure we're using the right language-specific transliteration, and how to handle Ä: en (A) vs. de (Ae). Removed the @todo from the test.
Any other string concerns? Updated the IS screenshots. Crossed off a lot of stuff.
Comment #220
dwwMinor summary updates, crossing this off, too:
Final code cleanup and fixes.No more @todo via #219Only open tasks at this point are:
Can we reuse the same CR to talk about the new settings, the new Event, and the changes to the
FileUploadResourceconstructor?Thanks,
-Derek
Comment #221
dww@alexpott via slack said: "Yes I think so" to reusing the same CR for all of this.
So I converted the CR into this:
New filename sanitization settings during upload (via UI or REST), new sanitization Event, changes to FileUploadResource constructor
Untagging that and crossing it off the TODO.
Optimistically tagging this for an 8.7.0 string change. ;) Also crossed that off the list.
Comment #223
dwwWow, that's incredibly evil and weird. The complete disappearance of Ä does not happen when I run those Kernel tests locally. :/ WTF? Thoughts?
Meanwhile, this fixes the CS bug.
Also updating IS to reflect that
replace_non_alphanumericrequirestransliterate. Screenshot showing that.Comment #224
dwwThis also all passes locally:
Does the d.o bot like this better? Ugh. I don't like this, but I don't think it's my fault. ;) If the bot is now green, I'm willing to call this Good Enough(tm).
Comment #227
alexpottMaybe this is a
mb_ereg_replace()bug - we don't use that anywhere else in core. I'm guessing and this is just a guess - that the DrupalCI servers have mb_regex_encoding() set to something non standard causing the first byte to be treated as a BOM or something. Dunno.Comment #228
dwwThanks for looking! Good guess/thought. We'll see what the bot says. However, I'm skeptical this will solve it. Here are the first two test cases that failed in #224:
All those
' 'andFALSEoptions turn off the code paths touched by your interdiff.I'd be surprised if #227 makes a difference, although I'm happy to go with the
umodifier onpreg_replace()instead ofmb_ereg_replace()if that's what core does everywhere else.Comment #230
wim leersWow, incredible progress here! 🤯
Two more services injected. This is making the code even more complex :(
I'm surprised that nobody has mentioned #3021652: Deprecate upload-related functions file.inc and move to a file upload service yet. I've been told in #2940383: [META] Unify file upload logic of REST and JSON:API that that should wait for #3021652: Deprecate upload-related functions file.inc and move to a file upload service. That's definitely overlapping with this in scope.
Shouldn't
file_munge_filename()also be implemented as aFileUploadSanitizeNameEventevent subscriber?Comment #231
pingwin4egThe latest failing tests problem is with the
pathinfo().Docs says:
Comment #232
alexpott@pingwin4eg yes but we call set_locale() on every run - see core/tests/bootstrap.php and \Drupal\Core\DrupalKernel::bootEnvironment()
What we the different locales you used in #231?
Comment #233
alexpott@pingwin4eg what I see locally is:
Comment #234
pingwin4eg@alexpott Hmm, well that's weird, I have these:
Comment #235
alexpottHere's a fix - we've had this problem before :) See #278425: Using basename() is not locale safe
Comment #236
pingwin4egBTW IMHO the "Filename sanitization" fieldset in the config form should have some description that sanitization would apply when files are uploaded only, and config changes won't touch existing files.
Comment #237
alexpottEdge case fixing and testing.
Plus there's a PHP bug here - see https://bugs.php.net/bug.php?id=77239
Comment #238
laravz commentedThe second argument of basename() has default value "null". However, the patch sets an empty string when there's no extension in the function below:
$this->filename = $file_system->basename($filename, $this->extension !== '' ? '.' . $this->extension : '');Do we want to change this? The output should be the same in either case.
Comment #239
dww@alexpott + @pingwin4eg: Thanks for diagnosing the problem and fixing this!! Very very happy to wake up to a green test bot patch here. ;)
@wimleers:
#230.1: Okay, thanks for the sign-off.
#230.2: I hear you, but that feedback isn't really actionable. Yes, we have to inject two new services to make this work. The alternatives are:
A) Ignore REST and only sanitize filenames uploaded through the UI. I know you and many others won't like that.
B) Don't change
FileUploadResourcefor this to properly DI the services, and just call\Drupal::service(). ;) I know the DeprecationFrenzy crowd won't like that, either. ;)C) Embrace that Doing Things Right(tm) around here generally means making them more complex. :/
I'm grudgingly picking C for this. I don't see how A or B are viable. I don't know if there's option D that makes everyone happy. I doubt it.
Thanks for the link to #3021652: Deprecate upload-related functions file.inc and move to a file upload service. Agreed there's overlapping scope. I'm not a core file system maintainer, so I don't have all the open issues / efforts mapped in my mind. I'm just scratching an itch I have, and uncovering a lot of related pain. ;) I think if this issue gets RTBC and lands, we'll just have to re-shuffle this event dispatch in those other issue(s).
See #192.2 @alexpott wanted this whole event to not touch the extension, and that
file_munge_filename()was too important for security to allow this event to impact it at all. Hence, all the hoop-jumping withpathinfo()and only letting subscribers mess with the filename/basename (argh, I hate that all the terminology on this is inconsistent), and not the extension.Re: #236: Sure, makes sense. Added this:
How's that?
Re: #238: It doesn't really matter, since
FileSystemInterface::basename()is checkingso either
NULLor''would do. But sure, why not? Now setting asNULLif there's no extension.Other changes since #237:
1. I had a bit of a head-scratch WTF reading the new code in the event constructor to resolve #238, since
basename()is a weirdly-named method in ourFileSystemservice, and it doesn't actually do anything like what PHP'sbasename()orpathinfo()does. :( So, I added this code comment to help others understand what we're doing and why:2. Renamed the EventSubscriber method we call from
filenameSanitize()tosanitizeFilename()to put the verb first.3. Added an Á to the front of a few of the earlier test cases to make sure that works, too.
4. Restored the Ä at the front of the related test cases. Stripping that out at #224 was a desperate attempt to understand the problem, but we might as well have it back now that we know that was a red herring.
5. Moved the umlaut-tastic "No transformations" case to the top of that little block of test cases.
Okay, I think we're back to "review and sign-off on all the strings" as the only renaming TODO in here. Jah be praised!! ;)
Cheers,
-Derek
Comment #240
dwwFresh screenshots for the IS to aid in string / UI review, showing the fieldset description from #236.
Comment #241
krzysztof domańskiNow "Replace non-alphanumeric characters with the selected character" is visible if "Transliterate" is enabled but is behind the "Replace non-alphanumeric characters with the selected character".
Minor: Maybe it should be next to "Transliterate"? This change does not affect the result so change only to improve UX.
This configuration is not often changed so this order is not very important.
"Transliterate"
- "Replace non-alphanumeric characters with the selected character"
"Replace whitespace with the selected character"
"Remove duplicate dots (..), underscores (__) or dashes (--)"
"Convert to lowercase"
Comment #242
dwwRe: #241: Thanks for looking, but I disagree with your suggestion. Although we don't explicitly say it, the order of the settings on this form maps to the order that the transformations are done. If we move "replace non-alpha" before "replace whitespace", someone might justifiably wonder if the replace whitespace setting would do anything, since the "previous" setting would have already replaced whitespace (which is also non-alphanumeric).
Frankly, now that it's entirely replace with no 'strip' option at all, I'm less concerned that the non-alpha setting *has* to depend on transliterate. But I don't want to undo that work now, and @alexpott was concerned about it, so I'm leaving it as-is.
Maybe the newly added fieldset description should mention that transformations are done in the same order as the settings. But I think that's probably too much complication and implementation detail to subject end-users to.
Comment #243
dwwSide note: It'd probably be nice to fix #2396145: Option #description_display for form element fieldset is not changing anything and set that on this fieldset description, instead of leaving it at the end. Seems better for the description to come before all the options for this fieldset.
Comment #244
krzysztof domańskiAn important test case is missing. The file name containing only unknown characters.
This is security problem if transliteration is enabled. The new file begin with a dot is created.
Comment #245
dwwI don't know what, if anything, we should do about #244. It's not actually a "security problem", IMHO. But yeah, maybe we should have a fallback in
FileUploadSanitizeNameEvent::getFilenameWithExtension()that tries to do something else in this case. Something like:More or less.
I'd like feedback from @alexpott before we code that up and add your test case.
Comment #246
alexpott@Krzysztof Domański - yep this is why we need to change
$filename = $this->transliteration->transliterate($filename, $language_id, '');to replace unknown characters with something.Here's a solution that introduces a new setting - replacement character and turns the other replacement options into booleans. I feel that this is the most consistent approach:
I discussed this in slack with @dww who voiced some reservations and wanted UX feedback. I decided to not hide or disable the replacement character if none of the options that use it are ticked because the description takes care of this, hiding it from the user makes the behaviour of these options less knowable, and it does no harm.
Also I moved the fieldset description to the top of the fieldset because this information is important and it clashes with descriptions on the elements contained if placed there.
Comment #247
dwwMostly looks great. A few thoughts/nits:
I'd add a @todo here pointing to #2396145: Option #description_display for form element fieldset is not changing anything that all this could be accomplished with
'#description_display' => 'above',if that wasn't buggy.Maybe:
"Character used when replacing whitespace, replacing non-alphanumeric characters or transliterating an unknown character.'"
? I know we say "character" a lot in that description, but it seems to read better this way.
Should we also @see the places the event is being dispatched, or is that overkill?
I was always curious why we needed to change this test for all this to pass. Is this really still a thing?
Attached screenshots of the new UI to the summary. I think it's probably simpler than what we had before. Simple is generally better, especially for a UI. ;) But I left the pre-245 screenshots up for comparison, and added a remaining task on officially deciding which UI to move forward with.
Comment #248
dwwFixes #247 1 and 2.
3 and 4 are still pending review/reply, but they're probably not needed for commit.
Comment #249
pingwin4egFollowing notes are not really critical, so I'm not changing an issue status. They are more like questions)
- Multiple
getFilenameusages. Can we add a new var for new filename here instead?- Why to use
strcmp()instead of===?Can we add an empty string to the replacement options?
Can we change this line to simply
$this->filename = $filenameif it has no extension?Comment #250
alexpottRe #247
2. We can remove the first "Character" from this sentence too... to give it less character :)
3. I think this is unnecessary.
4. - this is because of
And the file in the test being renamed due to the name being used before.
Re #249
1. I think using getFileName() is fine - it's a micro-optimisation that leads to more mental work for the programmer. Using !== is fine though
2. I don't think so. Firstly this leads %20 in URLs and secondly we have to work out what that means for replacing whitespace.
3. Good point. In fact doing this makes me realise that using the file system service is unnecessary we can remove the suffix ourselves.
Comment #251
pingwin4eg@alexpott in the 2nd item I meant not the whitespace, but an empty string, which would simply remove unknown chars.
Comment #252
alexpottMade
SanitizeNameTesta unit test. @dww raised concerns that this would not have caught the pathinfo() PHP bug but I've written it in such a way that both\Drupal\file\EventSubscriber\FileEventSubscriber::sanitizeFilename()and\Drupal\file\Event\FileUploadSanitizeNameEvent::__construct()are covered and the test would catch that.\Drupal\Tests\file\Functional\SaveUploadTest::testSanitizationcovers the functional implementation and ensure thats the even is subscribed to properly.We need to add test coverage for the REST changes - added this to the task list.
Also I've changed the task
to decide whether we want to be able to have different replacement characters for whitespace and non-alphanumeric replacements - as that is the use-case we need to decide if we design for. Personally I'm not convinced about the use-case for different replacement characters.
Also fixed the deduping when there is a dot on the end and we have an extension - so
foo.....txtbecomesfoo.txtRe #251 this has been discussed and removed before - see #244. It is too destructive.
Comment #253
alexpottNow with REST test goodness.
Comment #254
alexpottSo adding the REST test shows me we have to defend against transliteration return an empty string.
Note that
iconv()is no better for transliterating this - it returns FALSE. These type of edge cases are definitely a reason to keep this the string manipulation as non-destructive as possible.Comment #257
alexpottNow we're fixing obscure REST bugs too! I think this fix is totally in-scope because with this change we are massively extending test coverage of unicode characters in filenames.
Comment #258
panchoWow, this is so fast, I had to merge manually a few times.
Implementation is awesome now, but let's not jump over functionality and UI considerations.
#246-1:
I see that we need a solution for unknown Unicode characters, but my first thought was: no. But then, I figured out that it doesn't mean forcing all-dashes or all-underscores on the user - if someone (or a certain language) uses a lot of hyphens that are important to distinguish from spaces, just use underscore as replacement character.
Finally, after a proper transliteration, non-alphanumeric characters will in most cases be stuff that should really get stripped, as replacing and thereby preserving it in a way, doesn't convey any information. Without a proper transliteration, this however may not be the case and some German or Hungarian or Czech people will get bad results. And if even whitespace replacing is off, it will only get worse. But in that case dashes won't help more than underscores vice versa. So I'd even go farther and say: if at least whitespace replacing is on (which would be very recommended), let's just strip the remaining non-alphanumeric characters and not replace them by anything else.
Well, the experienced Drupalista ticks from a list and is done. First impression is worse, and IMO it is worse than it was before, so tagging for usability review.
One aspect I'm concerned with, usability-wise, is the placement of the default replacement character select right at the bottom, underneath "convert to lowercase", so completely out of context. We should move it up, right below the relevant checkboxes, and have the last two checkboxes at the bottom. A long list of similar checkboxes are way harder to grasp than a more visually structured fieldset, even though visually neater.
We also need to make sure is the descriptions are as concise and helpful as possible. Once we figured out everything else, we should iterate over these one more time.
I'm adding another patch and a short screencast with a design proposal that follows my suggestions.
Comment #259
panchoBut even then, we're unfortunately still not quite there IMHO. Of course we can't support every use case. But I still see two major deficiencies:
1.) There will be quite some companies/organizations/people with filename schemes that use parentheses () and/or brackets [] which is relatively acceptable in a filename. With these being stripped, many will turn off pretty much all of this, even though we really want sitebuilders to do a proper sanitization.
2.) Same holds for umlauts and quite some other diacritics. Wikipedia allows almost everything, for example https://en.wikipedia.org/wiki/Die_Ärzte_(1986_album), Wikimedia has https://commons.wikimedia.org/wiki/File:Die_Ärzte2012.svg, or https://commons.wikimedia.org/wiki/File:¿Cómo_te_quedas%3F.jpg (note: the question mark is an exception to what is possible).
Now of course people can turn off transliteration, and many will. But then they almost have to turn off stripping/replacing non-alphanumeric, too. Which leaves them with no protection at all. So I think for all those who consider a full transliteration too strict, we would either need a reasonable subset of "Strip remaining non-alphanumeric characters" (case 1) resp. of transliteration (case 2) that at least filters out totally unacceptable characters. Possibly ISO/IEC 8859-1 (= CP-1252 = latin1) as softer transliteration, and US-ASCII minus some disallowed characters as a softer replace? Just a thought.
3.) re pix in #258: now there clearly are too many individual recommendations. Now that we're properly informing the sitebuilder about every option, I guess, we should leave them all out.
Comment #260
alexpottRe-uploading #257. @Pancho this is your second patch on this issue. Raising your concerns in comment rather than filing a patch and losing important test coverage would have been great. @dww and I have been working together on this issue and talking about in slack - please join us there and discuss before heading off in another direction.
I'm sorry @Pancho but removing test coverage like this is upsetting.
Also we're not
- @dww has raised this issue in the UX slack channel and has added screenshots all the time. The consideration is the replacing with nothing is not viable see all the previous comments - for example #244.
Of course we want to consider the UI. But please come up with your concerns about the current approach - but also please read the entire issue to make sure that you don't rehash old concerns that have been answered before.
Comment #261
pancho[x-post]
Fixed #259-3 and the non-alphanumeric description.
[Patches #257 and #260 are going to fail like the ones before, but should not fail worse.]
Comment #262
panchoRe #260:
If you prefer keeping everybody else out here and work everything out in a small circle, then please, just go ahead and piss off other contributors. I'm not gonna bother you on this issue anymore.
Stop. I said
It's not okay to suggest I would have skipped these 3 lines on purpose. I had to merge my patch manually several times.
Comment #263
alexpott@Pancho why do think I want to stop any contribution? I've tried my best to address all the feedback and concerns raised in this issue. But it would be great to talk before changing direction again. Especially when some of the points you raise have been raised and thought about before on the issue.
Here's my thoughts on #258.
Please supply example strings so that we can test to see what happens - real world examples would be great.
Sure we can move the select. But personally I find putting the select in the middle of the list of checkboxes hard to parse. See the two screenshots. This is because of how the select list gets a header.
Re #259
We're introducing an event that contrib and custom code can you to enforce whatever company policy you like - there's no way we can produce a 1 size fits all here.
If you want that do not turn on transliteration as you suggest.
Protection from what exactly? This feature is to allow you to standardise filenames on your site - it's not a security feature per se.
Comment #265
pancho1.)
Have they? I've been following this issue for quite some time.
2.)
Not as long as the technical implementation is moving so fast and I get the feeling you're already settled feature- and UI-wise.
3.)
Nope, but a reasonable standard that covers many if not most common usecases. German umlauts, French or Czech diacritics are common. Parentheses and brackets in filenames, too.
4.)
We're still posting a message:
5.)
Readding 2 missing tests and @alexpott's REST bugfix from #257 which he reposted in #260.
Unsubscribing now to no longer bother.
Comment #266
alexpott@Pancho the message
For security reasons, your upload has been renamed to %filename.it only emitted when the fiel is renamed for security reasons. I.e. when it contains an unsafe extension. if that's not the case it saysYour upload has been renamed to %filename.Thinking about this some more - I'm beginning to ponder if we're going about this all wrong. As @Pancho points out Wikipedia allows umlauts in its URLs. And if you want a file called '🙈🙈🙈.txt' should Drupal get in the way http://example.com/sites/all/files/🙈🙈🙈.txt is going to work just fine. So perhaps we need a re-think. Is using transliteration the correct approach? Maybe we could focus on things that cause known problems which seem to be:
If we add the event then contrib and custom can experiment with transliteration and work out the pitfalls of much there seem to be many. People have already raised concerns about Japanese and Chinese filenames with this turned on.
Comment #267
alexpottDecided to test some of the assumptions about turning on transliteration and non-European texts. It seems our transliteration library can cope with this.
Comment #270
alexpott@Pancho I am sorry that my actions have caused you to quit the issue and I'm sorry I got upset with tests going missing. It is hard when there are so many comments and patches to an issue. I like to post incremental progress so people can follow along and see how I've arrived at the patches I'm posting. Big bang patches make that more difficult. Your contributions are valuable and continue to inform my thinking on this issue.
@Pancho it is your comment about umlauts and other diacritics in #259 that has convinced me we've pursuing the wrong path here.
After discussing this with others including @Berdir (who maintains large multi-lingual sites) and @Wim Leers (who maintains the REST API) I'm more convinced that transliterating filenames is going to cause more problems than it solves. For example, the current implementation will do different transliterations depending on the language the site is currently in!?!? For example if the UI is in english
ÄbecomesAbut if it is German it becomesAe- this type of magic is very hard to explain to users. Also it prevents us from solving one of the problems transliteration was supposed to solve - ie. basically the same filenames but not.Therefore I think we should go with something along the lines of #266 and having the following options:
Open questions for me are:
Comment #271
frobI think #266 is on the right track. While @Pancho was a bit heavy handed in how suggestions where made, I think they are the right suggestions to make. Further enhancements belong in a contrib module, such as pathauto. It might be a good idea to add a test module to ensure that the event is sufficient for making those changes as designed.
One thing however, @alexpott I don't think it is helpful to constantly refer people to Slack. I know this stuff has been discussed, in this issue's comments even, but I am not going to get onto Slack to discuss this and if that is a requirement for my continued (marginal) contribution then my contribution will decrease even further. I understand the need to sometimes get on chat and discuss an issue quickly, that is fine and I have done that on IRC, but in the end the discussion needs to be resolved here. Otherwise we will be marginalizing a sub-set of our community.
Comment #272
dwwWow, 23 comments in short order. Not just the implementation moving fast here. ;) Alas it's gotten so triggered/charged.
#250 247.2: Great, love it. :)
#250.3: Love it! Very happy not to pass the file system service around. Way better. Yay.
#252 Unit test: that was easier than I thought it'd be. :) I'm down. Thanks.
UI: okay, sounds right.
Dedupe: Nice fix.
#257: Hah, nice! Yes to in scope.
#258:
Yes, indeed!
I'm still open to some variant of the earlier UI that allowed different replacement characters. I just worry it gets too complex and confusing.
I'm still not convinced the "replace other non-alpha" setting should require transliterate being on. I don't like forcing it to be "strip". I think replace is better.
I'm open to "strip" as a "replace" option, but that gets confusing. See above.
#259 Yes, contrib can now listen to this event and do what it needs. Yes, sympathetic to the UI concerns.
#260: For the record: I'm happy to discuss here, Slack, IRC, or potentially audio. I don't think it's a requirement to get things done, that's what we have the issue queues for.
It was an honest mistake. I easily could have done the same. Any of us can when patches are flying around like this. Let's give each other the benefit of the doubt and all try to stay calm and relaxed. Thanks.
#262: Alas. Wow, that escalated fast. Sorry you got triggered. I don't think anyone is trying to exclude anyone here. We're desperately trying to get more input on this issue so we can actually commit it and move on.
#265.1:
Yes, but you are rehashing various points we've raised and addressed already, so it gives the impression you haven't. Easy to do on huge complex issues like this. :/
#265.2
I don't think we're settled feature and UI wise, that's why all that is still in the remaining tasks section of the summary.
#265.4: Yeah, you misread the patch. The "for security" part is preserving an existing message. See above.
Alas. :( Ugh, sorry it came to that.
#270:
Indeed!
Argh! :) This issue started as "start using transliterate for filenames", and now y'all have convinced yourselves we don't need it at all. Fun.
Re: language - ugh, I was just trying to "be responsible" and "do the right thing" to keep all the i18n folks happy. I'd be fine with a setting for this to either pick a site-wide language for transliteration or to use the current language of the upload. Maybe that's too confusing.
I'm also ok for *none* of this to have a UI. If we tell people to configure their site manually via settings.php or their site config, that's fine with me. I'm happy to write reasonable defaults and comments about it, instead of crafting and arguing about the UI. We already tell people they have to define their paths this way. Happy to have them enable and tweak filename sanitization at the same time.
"Need" is a strong word ;) but "want" I think is "yes". Separate is good.
Again: need: no, want: yes. We already have it, why throw it out now? It's like our time and life energy is totally free and expendable.
#271: Agreed, thanks for voicing that.
I still think there needs to be an option to enable transliteration for sites that know they want it. I'm going to be an incredibly sad panda if I end up having to maintain a D8 contrib to add that, after coming this far to simply putting something in file.settings.yml for this from now on. Please don't completely change course from how this issue started.
So, maybe a no-UI file.settings.filename_sanitization array that lets people do what they want?
Back to UI review/debates?
Follow-up issue to do what this issue originally set out to do, 270+ comments ago?
I don't exactly want to re-re-upload pat h #257, but I don't want the d.o bot to be confused, and I don't want to add my own patch to the mix until we decide on a direction. Tentatively setting back to NR.
Agreed with 'Needs usability review' tag.
Thanks,
-Derek
Comment #273
dwwAttempt at collecting everything needed for the 'Needs usability review' into the summary. Pasting here as a ul in case that's easier to see:
Comment #274
dwwWhoops.
Comment #275
alexpottI've opened two new issues to try to split this up a bit and make it so this issue can focus on transliteration and maybe the the UI for sanitization options for the user.
Comment #276
panchoPostponing on #3032390: Add an event to sanitize filenames during upload. Let's get the event in first, then see how we want to use it.
Comment #278
nlisgo commentedThe most recent patch did not apply after upgrading to 8.6.13 so re-rolling.
To-do: reintroduce test adjustment to
/core/modules/file/tests/src/Functional/FileFieldWidgetTest.phpComment #279
nlisgo commentedThe above patch was for the current 8.6.x branch but it didn't apply to 8.6.13 so uploading a patch for that.
Comment #280
eric.guerin@ucsf.edu commented@nlisgo I tried patch 278 but the patch failed still. That patch also seems to have way more in it then it should for just the file sanitation. Such as the composer.json etc.
Comment #281
pavelculacov commented@nlisgo patch-279 working on Russian, Romanian language. Tested
Comment #282
nlisgo commented@gr8tkicks apologies. Obviously screwed something up on the re-roll for the dev branch. The #279 patch should apply to 8.6.13. As this is marked postponed I will not bother to submit a re-roll on the dev branch at this moment in time.
Comment #283
jcisio commentedWhile #279 applies on 8.6.13, it contains a few schema changes and hook_update_N (as in the latest patch). Here is an updated patch of #128 which doesn't have any hook_update_N functions (which could be easily in conflict with any future changes).
Comment #284
mitrpaka commentedUpdated patch from #283 to apply with 8.7.0 (released 1 day ago).
Comment #285
mitrpaka commentedFixed failing test case for transliteration.
Comment #286
andreasderijckeThe actual sanitization code is usefull for other (custom) modules.
I recommend making this available as a service.
Comment #287
Quicksaver commentedFOR MY CASE I had custom code that highly depended on the functionality of patch #279, so I made a direct re-roll of that to apply for 8.7.x. Once again, please note that this was for my specific case only. I'm making this patch available in case someone else might also benefit from it.
(I have also removed a couple of .orig files from it, seemed like leftovers from previous dev work.)
I have been trying to follow this discussion, but I admit I have now become completely lost. For instance, I don't understand why patch #283 (and all subsequent patches) was was based off of patch #128; a ton of work had been lost there. Before merging this in (eventually), it may be worth trying to retrace all the steps for the last few months and see what's missing and what's the current consensus for the currently applied functionality.
Comment #288
dwwThis issue is blocked on #3032390: Add an event to sanitize filenames during upload. None of the patches here are valid/relevant anymore. In the other issue, we've significantly moved past what's here. Please do not re-roll this or continue development. We need to put the energy into #3032390 instead.
Therefore, the summary (and scope) of this issue are now stale. Initial attempt to update the summary to point to #3032390. Adding that as the true parent issue for this (and moving the previous parent, #1842718: Use new Transliteration functionality in core for machine names as a related issue).
Re: #287 -- absolutely right. #283 and beyond confuse the issue and aren't helpful. Please stop. ;) Please help get #3032390 done, instead.
Thanks,
-Derek
Comment #289
jcisio commentedRe @dww @Quicksaver: as I explained in #283, that patch is supposed to be a safe patch to use with Composer, when the future of this issue is not clear. Sorry that it confused you.
Comment #291
avpadernoComment #292
frob@dww should this issue be closed as a duplicate if #3032390: Add an event to sanitize filenames during upload has superseded this issue? I thought that this issue was about building the sanitation service and admin UI and #3032390: Add an event to sanitize filenames during upload was about adding the events that would be needed for this issue to do the file name sanitation.
Comment #293
dww@frob re: #292
Sadly, no.
Correct. But we can't build the sanitation service and admin UI without the event that would be dispatched which will allow this to work. Hence postponed and "[PP-1]" in the title.
Re: #283 and other attempts to re-roll this to be "composer-ready". I feel your pain. I understand why you do that (I do it, too). But in this case, it's really confusing things. To give y'all a public place to put your re-rolls without causing any trouble in here, I just opened a child issue:
#3094052: [IGNORE] Composer-friendly patches for filename sanitation while #2492171 awaits #3032390
Please post composer-friendly re-rolls there, not here.
Thanks!
-Derek
p.s. Removing some now bogus tags.
Comment #294
nlisgo commentedThis is a re-roll of the patch in #285.
Comment #295
dww@nlisgo: Again, please post such patches at #3094052: [IGNORE] Composer-friendly patches for filename sanitation while #2492171 awaits #3032390 not here.
@all: I'm closing comments on this issue until the blocker is fixed. Apparently folks neither read the first sentence of the summary nor the comments, so the best solution is to disable commenting entirely. Meanwhile, #3094052 can be the playground for rerolls / backports / etc.
Thanks,
-Derek
Comment #296
dwwActually closing comments.
Comment #299
alexpottThe blocker is in.
Comment #300
alexpottHere's a reroll of #263 on top of #3032390: Add an event to sanitize filenames during upload.
We need to go through the discussions from around that time to ensure all opinions and questions are accounted for. #272 and #273 look like good places to start as they attempt to summarise the preceding discussion.
I rerolled #263 because it was the last green patch prior to 3032390 and has the most features / test coverage. The patch itself is not necessary what I think the final patch should be. We need a usability review and research into what we really should do. I've rerolled it to get the ball rolling again. Also I've used a merge request because we're near 300 comments and I don't want the pager :)
I asked @daniel.bosen from the Thunder team which thunder ships with functionality to transliterated filenames. #146 explains it pretty well. An http client that quite a few mobile apps use breaks when image file name have umlauts in them.
Comment #302
alexpottI've updated the issue summary to reflect more the current state.
Comment #303
frobWhat does the update function need to do? Set some configuration defaults?
Comment #304
berdirYes, exactly. I would say the update function should ensure that the current behavior of existing sites remains unchanged for BC, whereas we need to decide what the recommended/default behavior should be.
Comment #305
xmacinfo> whereas we need to decide what the recommended/default behavior should be.
What are the options for the “recommended/default behavior”?
Comment #307
oakulm commentedTried to apply patch patch at #294 to 8.9.x but patching fails. There seems to be similar work already in core https://www.drupal.org/node/2972665 but the settings are not visible in admin/config/media/file-system so this is really confusing.. Anything I can do to help nudge this forward? Tried to reroll but got stuck while trying to understand what's going on.
Comment #308
imclean commented@oakulm the link you provided is to the draft change record for this issue. It's what this issue is trying to achieve, it isn't in core yet.
Comment #309
AshM commentedComment #310
AshM commentedComment #311
gauravvvv commentedPatch #310, re-rolled. Attached interdiff for same.
Comment #312
philyAbove patch #311 seems to be working for me using Drupal 9.2.2 (English & French).
Thanks
Comment #313
sutharsan commentedNote that patches in #309 to #311 do NOT replace the Merge request 370 nor an addition to it. At best they are a summary. I recommend to ignore them.
Comment #314
sokru commentedChanging status to Needs work, because there's unresolved threads on merge request 370.
Comment #315
thetailwind commentedI've been looking into this fix. How would replacing a file in an existing media entity work? For example, with this module (which really should be core functionality) https://www.drupal.org/project/media_entity_file_replace?
Assume a dirty file has been uploaded to an entity. All URL references to said entity would be based off of the dirty filename. How would this behave after this functionality has been enabled but when a user uploads a replacement file?
Thanks!
Comment #317
kovtunos commentedPatch doesn't work for Drupal 9.3+
Can you please update it?
Comment #319
collinhaines commentedI pushed a couple commits to MR 370 to resolve the missing constructor parameter documentation and prevent PHP notices for the config array.
Did we ever decide on what the update hook is going to do? My assumption is to add the new default config values found in file.settings.yml to the database.
Comment #320
dajka commentedThank you all who are working on this!
If I were to use a patch in production, which (if any) do you recommend?
Comment #321
sutharsan commented@Dajka, using patches from an open issue is always at your own risk. No upgrade path between patches is or will be available an support is minimal or absent. Only use patches if you are comfortable with code and if you are prepared to solve any problem yourself if that may arise.
If you decide to use a patch, use the most mature that is available, usually the latest. If you apply patches to your code I highly recommend to let Composer (composer-patches plug-in) do it for you.
Comment #322
frobI would never use a patch from an issue that has been lingering for 7 years. You could try contrib. There is an old module "File field paths" that had this functionality in it.
Comment #323
dercheffeI prefer https://www.drupal.org/project/filename_transliteration instead of file field paths module.
Comment #326
ressaThanks for sharing that @dercheffe. Until this issue gets filename transliteration in core, here are two issues discussing the differences between Filename Transliteration and Transliterate filenames:
Comment #327
sokru commentedThe threads on MR (370) are now resolved, attached a reroll patch to 10.1.x branch. Tested the patch manually.
Comment #329
sokru commentedThis should fix the CSpell nags, coding standards and one mistake I made during the reroll.
Comment #330
sokru commentedMissing interdiff.
Comment #332
sokru commentedThis will fix the tests, BUT I had to alter the tests, since the tests assumed lowercase alteration could not be done without transliteration.
Snippet from core/modules/file/src/EventSubscriber/FileEventSubscriber.php
Also if the current way is desired workflow, I think we should use States API also for other options like this:

Transliteration unchecked:
Transliteration checked:

Comment #333
sokru commentedComment #334
sokru commented@catch commented on Slack that using the States API as described on #332 would be good approach, so updating the patch accordingly.
Comment #335
sokru commentedComment #336
sokru commentedCleaned up the Issue summary for the Usability review.
Comment #337
junglePer the tests, it's not perfect. Can we introduce
Symfony\Component\String\Slugger\AsciiSlugger?See https://symfony.com/doc/current/components/string.html#slug-emojis
Note: symfony/intl is required.
Comment #338
simohell commentedUsability review
We discussed this issue at #3335028: Drupal Usability Meeting 2023-01-27. That issue will have a link to a recording of the meeting.
For the record, the attendees at this usability meeting were @BlackBamboo, @benjifisher, @ckrina, @iszabo, @shaal, and @simohell.
In the review we agreed, that
Comment #339
sokru commented- Updated the interface based on usability review. Made the details open by default. Screenshots updated to summary.
- Based on usability review I also updated the code and tests so its possible to only do specific sanitizations (eg. only lowercase).
- Removed the mentioned PHP bug workaround on
FileEventSubscriber::sanitizeFilename, since its only on PHP 7.4, not 8.xFor a suggestion on #337 about using AsciiSlugger, I think it would be relevant if core would use it also on
\Drupal\Component\Transliteration. So suggesting leaving it to follow-up issue.Comment #340
smokris(Fixing screenshot filename in issue summary.)
Comment #341
sokru commentedFixing the CSpell issue.
Comment #342
rinku jacob 13 commentedHi @sokru , Reviewed the patch #341 for drupal version 10.1.x. It was successfully applied .Adding screenshots for the reference. Need RTBC+1

Comment #343
rkollerI've tested the patch in #341 on a drupal 10.1.x-dev install as well. overall the feature looks great! the only nitpick i would have is when you are using a screenreader (voiceover on macos 12.6.1 in my case). in my current verbosity text setting (see the linked video) i have activated the default option
some. if you switch to the optionallthe information becomes even more granular. but as you can see in the current video, with the optionsomeset, underscore gets announced twice for the first sentence and in the second second sentence three times. for sighted users i consider it a good thing to exemplify and illustrate what dot, underscore and dashes stand for, but in the aural interface it is sort of redundant for some or all characters depending on the set verbosity level. so i wonder if it would make sense to wrap the parenthesis with dots, underscores, and dashes in spans and hide them from the screenreader with something like aria-hidden="true"?Comment #344
jwilson3Text should be:
i.e., swap "Remove" with "Replace"
IMO, there is another feature that could be supported. "Replace sequences of dots, underscores and dashes with a single one". So that stuff like:
_-_would be simplified to-.Comment #345
pooja saraah commentedAddressed the comment on #344
Attached patch against Drupal 10.1.x
Comment #346
poker10 commentedPatch #345 does not apply.
I would also prefer not to change the quotes to double quotes (as there are single quotes used everywhere).
Comment #347
pooja saraah commentedAddressed the comment on #346
Attached patch against Drupal 10.1.x
Comment #348
pooja saraah commentedComment #349
Ratan Priya commentedAddressed #344.
Needs review.
Comment #350
sokru commentedRemoving the dot at end of title and addressing #343, will ask feedback from #ux about the change.
#344.2 "IMO, there is another feature that could be supported. "Replace sequences of dots, underscores and dashes with a single one". So that stuff like: _-_ would be simplified to -."
Not sure how common this replacement pattern is. IMHO I'd leave it to follow-up or to contrib (eg. "regexp filename transliteration") module.
Comment #351
andypost@sokru this kind of filters better leave for contrib a-la https://www.drupal.org/project/smiley
not sure potx can properly parse it as it contains attributes
Comment #352
sokru commented@andypost, but there's plenty of strings like
in core already. I tried exporting string with aria-hidden attribute with drupal/potx module and it exported the string nicely.
Comment #353
smustgrave commentedPatch no longer applies cleanly - Will move to NW for that.
error: patch failed: core/modules/file/file.post_update.php:13
error: core/modules/file/file.post_update.php: patch does not apply
but fixed manually for testing (didn't upload as I didn't want to take myself out of the review role)
This was tagged for issue rescope but that was 4 years ago, is that still needed? - Should be removed or update before moving to RTBC
Taking a look at the remaining tasks.
Has this been completed by UX team?
Manually testing
post_update hook ran without issue
Enabled all settings in File system.
On Drupal 10.1 with a Standard install
Edited Article content type image field to be unlimited
Uploaded these files and were transformed to
Test File!.png => test-file.png
Test File.png => test-file_0.png
Test ..--__.png => test-.-_.png
Test .. File.png => test-.-file.png
Switched to underscores vs dashes
Test File!.png => test_file.png
Test File.png => test_file_0.png
Test ..--__.png =>test_.-_.png May be a non issue but there was a dash left over
Test .. File.png => test_._file.png
Comment #354
ravi.shankar commentedAdded reroll of patch #350.
Comment #355
rkollerre #353 the results of the ux meeting were in #338 and the recommended changes were applied in #339
the only detail left is #343. i was unable to attend the ux meeting on that day but have taken a look at the issue afterwards and just noticed the redundancy i've provided a video and a suggestion for a potential fix for.
@sokru already asked over in the accessibility channel for maintainers feedback in that regard to make sure there aren't any downsides: https://drupal.slack.com/archives/C2ANFUGGG/p1677133509757059
Comment #356
smustgrave commentedCan the issue summary be updated then please
Comment #357
rkolleryou mean adding a task to the list or remaining tasks? something like
13. Avoid the announcement of redundant text by a screenreader (ref #343)? or would you recommend (also) another step?Comment #358
smustgrave commentedYup should be added to remaining tasks exactly how you said it, that's perfect.
Think the quickest fix would be to remove the examples of dots, underscores, and dashes completely but not sure if that's the desired fix.
Moving to NW as this fix needs to be done before review.
Comment #359
rkollerthanks i am always a bit hesitant updating issue summaries.
and removing those examples would have a downside for sighted users. by using those examples you are able to skim those punctuation character in a glimps of an eye instead of reading and processing their names. and it also provides a certain level of convenience for none native speakers who are using the admin interface in english. having examples for those punctuation characters is way more clear. if possible the micro copy for sighted users should be left as is.
and i've added the point to the issue summary.
Comment #360
rkollerwoops somehow switched to needs review by accident. setting back to needs work. sorry
Comment #361
jwilson3I wonder if this whole problem of accessibility, strings, and duplicate character handling could be simplified by following what filename_transliteration contrib module does:
No need for confusing explanations and examples that must be hidden in our Ui strings, like (—), (..), etc. And handles more cases ultimately resulting in easier-to-read file names.
Comment #362
smustgrave commentedThink the solution is either that or what rkroller suggested and wrap the parenthesis in aria hidden tag.
Comment #363
sokru commentedAssuming all new commits to 10.1. require construction property promotion per #3278431: Use PHP 8 constructor property promotion for existing code, I adjusted the patch accordingly.
I'd be fine with #361 approach, but based on core committers opinion on #215.4 (and points from #359), the current approach is desired, so marking part 14 done from remaining tasks.
Comment #364
avpadernoComment #365
sokru commentedFix CS issue.
Comment #366
rkollerthank you! I've only manually tested the changes for the remaining task #14 (am not a developer). the patch applies properly to 10.1.x-dev. I've attached a short video illustrating the applied change with aria-hidden in macos 12.6.1 and voiceover. the redundant announcement of punctuations go successfully drop/hidden.
and did another search in the documentation on d.o and according to https://www.drupal.org/docs/accessibility/hide-content-properly it is the only listed and recommended option for not visually hiding content but only hiding it for screenreaders. and since the attribute isn't used on a focusable element nor the spans have any child elements (https://developer.mozilla.org/en-US/docs/Web/Accessibility/ARIA/Attribut...) it looks good to me. i leave the status at needs review in case more code level review is necessary.
Comment #367
smustgrave commentedLeaving the needs issue rescope tag but since that was 4 years ago not sure if it's needed anymore.
Also not sure what
is? Maybe that's tied into announcement issue.
Functionality wise working as expected so moving forward.
Comment #368
jwilson3I'm not sure how #215.4 has any bearing on #361.
The relevant piece of #359 would still stand for the first of the two checkboxes. The suggestion is to remove it for the second ones (
(--)and friends). So to clarify #361, my proposal is to cover more than just 2 or more of the same character, and include two or more of any of the characters. In concrete terms, it means:Change the functionality of the second checkbox from "Replace duplicate dots (..), underscores (__) or dashes (--) with a single one" to "Simplify sequences of dots, underscores and dashes (eg
_-_,..,--) with the replacement character.".This would both:
1.- widen coverage of our naming cleanup for more of the cases like
_-_,_._, etc and (assuming underscore was set as the separator) would vastly improve messy filenames like:"Test - file.png" to "test_file.png" instead of "test_-_file.png"
"Test_-_file.png" to "test_file.png" instead of "test_-_file.png"
"Test .. File.png" to "test_file.png" instead of "test_._file.png" (mentioned in #353)
2.- Simplify the cognitive overload for sighted users scanning the page for relevant examples like
_-_.Comment #369
gauravvvv commentedUpdated attrition
Comment #370
sokru commentedUsing approach from #368, Please check the interdiff; a bit of code repetition, but makes the results less messy.
Comment #372
rkollerre #368 ahhhh now i understand. thanks for the clarification! you definitely have a point. having an example for each punctuation example for the first checkbox label is the correct approach as you stated. but having the same pattern for the second checkbox which i basically about sequences of punctuation characters that is definitely suboptimal. i agree.
1. with the latest patch the label got changed to
Simplify sequences of dots, underscores and dashes (eg _-_, .., --) with the replacement character.. i have to admit the label text posed a certain cognitive challenge to me as well. first dots, underscores and dashes got named in the label then within the parenthesis i get an eg and afterwards punctuations separated by commas. mentally my brain automatically tried to spot a pattern and draw the connection to the aforementioned dots, underscores and dashes.i wonder wouldn't it be way more clear to shorten the label to
Simplify sequences of dots, underscores and dashes with the replacement character.and add a description with one of the examples you've provided in #368. imho that might be way more straight forward and clear? something likeFor example "Test_-_file.png" changes to "test_file.png" instead of "test_-_file.png"2. and i think the aria-hidden should still be minded and kept in place for the second checkbox, label at least with the changes the latest patch introduced (in that case hide the parenthesis and it's content from screenreaders). the announcement for
(eg _-_, .., --)is confusing (see and hear in the attached video).Comment #373
johnpitcairn commentedStill seems slightly clumsy. How about:
"Reduce sequences of dots, underscores and dashes to a single replacement character."
Then provide an example in the description as mentioned above.
It might also be helpful to clarify that this occurs after an initial replacement of spaces.
Comment #374
sokru commentedThis just fixes the tests from #370.
Comment #375
kim.pepperLooks good, with extensive test coverage.
One nitpick:
Should this be named $configFactory?
Comment #376
dwwWe still need to decide if we're going to change the behavior and UI of the "Remove duplicates..." setting. I think I like the simplification, but it's going to take a bit of work to get it happy:
deduplicate_separatorsis no longer a very self-documenting config key for this setting. And its description needs help. 😉Moving back to NR to get more eyes on this, but probably NW for all the above if we agree on changing this setting and how it works.
p.s. Starting to cleanup credit in here, removing from a few pointless/unproductive rerolls that the author then abandoned.
Comment #377
sokru commented@rkoller found a small bug in Slack on the "Remove duplicates.." logic, this should address it with test coverage.
Comment #378
rkollerabout #367.2 and in reply to #373
I agree, my initial suggestion was sort of clumsy. But comparing your suggestion with the other labels the patch introduces i wonder if it would make sense to stick as close as possible to the pattern the previous checkbox label uses (Replace non-alphanumeric characters with the replacement character)?
Replace sequences of dots, underscores and/or dashes with the replacement character.For example "One _ test ._--_. file.png" changes to "One _ test - file.png"Replace none-alphanumeric characters with the replacement character) as close as possible. Additionally I've just added anand/orto point out that it also applies to combinations of those characters. I've reflected that also in the updated example within the description._remain untouched by the filter (as fixed in #377)Comment #379
smustgrave commentedFrom the slack channel seems there is still work to be done around the "eg" part.
Comment #380
jungleUpdated the label/title/text per #378
Fix the namespaces
Add an example per #378, but changed the naming.
Comment #381
jungleOops, "replace" should be uppercased.
1. Emojis are everywhere, and I'd like to have the option to transliterate them, as I commented in #337, but with the FileUploadSanitizeNameEvent added, it's easy to do in custom code.
2. Meanwhile, I saw extension is not available for altering. Is there a good reason for that? I have not found it with a quick look.
Comment #382
jungleSorry, wrong changes in #380.2
Comment #383
jungleThe FileUploadEvent class does not exist. Updating the comment.
Comment #384
jungleAfter reading the discussion in slack, which @dww pointed me to (thanks!), I am removing the #description. And let's do it in the follow-up with the javascript approach to make the example dynamic depending on the selected replacement character.
One suggestion from me is to move the
Replacement characterform element to the top beforededuplicate_separators, because in the label ofdeduplicate_separators, which mentioned replacement character, if the end users see the term Replacement character first which would make more sense.Tagging "Needs followup"
Comment #385
jungleAdjust the position of the Replacement character form element.
Comment #386
jungleUpdate the description of Transliterate to mention the use of the replacement character.
Comment #387
smustgrave commentedSo when I try to test the post_update hook I get
Anyone else getting that?
Comment #388
smustgrave commentedDisregard #387 got around that
post_update hook ran without issue
On Drupal 10.1 with a Standard install
Enabled all settings in File system.
Edited Article content type image field to be unlimited
Uploaded these files and were transformed to
new Test ..--__.png => new-test.png
new Test File!.png => new-test-file.png
new Test File.png => new-test-file_0.png
new_Test .. File.png => new_test-file.png (underscore not replaced)
Also when I upload the same file if it appends _0 should it not replace that underscore also?
Comment #389
sokru commentedThis is according the spec, the label says "Replace sequences of dots, underscores and/or dashes with the replacement character".
If this is desired, I'd leave this as a follow-up, since its handled on multiple places on core eg. in
_file_save_upload_from_form().Comment #390
smustgrave commentedAh makes sense. Then my questions there are answered.
Though I did want to ask. I was only able to replicate once but when I uploaded a file that had been removed from the sites folder. I kept getting an error that the file that was being renamed already existed. Technically it did in the database but not file system. Didn’t flag that because I think it’s user error but wanted to ask before I mark
Comment #392
nathan tsai commentedDrupal does not update the database when a file is deleted in the file system.
Thus, AFAIK, this is a user error.
Comment #393
smustgrave commentedPatch no longer applies.
Comment #394
sokru commentedRerolled the patch, although the patch from #386 still applies to 11.x, as this is categorized as a "feature request" I don't think this will be backported to 9.x or 10.x branch.
Comment #396
sokru commentedRandom test failure on #394, now it will pass.
Comment #397
smustgrave commentedApplied #394
Verified post update hook ran without issue
Verified UI appears as mentioned
Did the same file test as #388
Going to mark. Though shouldn't be merged until follow up from #384 is opened.
Comment #398
dwwI can't believe this is finally RTBC! 🎉 Thanks to everyone who picked up the torch after I was too burned out trying to keep running with it. 😅
Here's that follow-up: #3377577: Add JS to system_file_system_settings_form to dynamically update examples based on current settings
Comment #400
sokru commentedRandom test failure.
Comment #402
jwilson3The
Drupal\Tests\user\FunctionalJavascript\UserPermissionsTest::testPermissionCheckboxestest failure seems unrelated here.Comment #403
needs-review-queue-bot commentedThe Needs Review Queue Bot tested this issue. It no longer applies to Drupal core. Therefore, this issue status is now "Needs work".
This does not mean that the patch needs to be re-rolled or the MR rebased. Read the Issue Summary, the issue tags and the latest discussion here to determine what needs to be done.
Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.
Comment #404
sokru commentedJust a reroll from 394, no interdiff.
Comment #405
avpadernoComment #406
kim.pepper@apaderno why was this put back to needs review?
Comment #408
catchI wonder if we should ship configuration to enable this in the standard profile, or even on new sites in general with the update setting things to off. However the reasons given for defaulting the config to off by default for existing sites (and possibly new sites) are good ones.
Some of the config options like remove duplicates I'm not sure about - does that really need to be an option? However I'm conscious we're at over 400 comments and eight years already, these would be small changes to make in a minor release if we wanted to change our minds later.
Tagging for 10.2.0 release highlights.
Committed 56e1b48 and pushed to 11.x. Thanks!
Comment #409
andypostThank you!
Change record also needs to be published!
Comment #410
xmacinfoAmazing! Thank you to everyone who worked on this issue.
Comment #411
agoradesign commentedThe change is currently only commited to 11.x branch, but tagged for 10.2 release highlights - shouldn't it be cherry-picked to 10.x too?
Comment #412
catch10.2.x will be branched off 11.x, it is confusing but it saves retargeting MRs every six months, see https://www.drupal.org/about/core/blog/new-drupal-core-branching-scheme-...
Comment #413
agoradesign commentedoh thanks for this important info :) and hell yeah, it's great that this long awaited feature finally found its way into core - thanks!
Comment #414
agoradesign commentedoh thanks for this important info :) and hell yeah, it's great that this long awaited feature finally found its way into core - thanks!
Comment #415
wim leersAmazing work, @dww! 🤯👏
Comment #416
socialnicheguru commentedIf you find your way here and you are still on Drupal 8, 9, you can use https://www.drupal.org/project/transliterate_filenames
It also works on Drupal 10
Comment #417
sokru commentedCreated a follow-up: #3385777: Simplify & enable transliteration by default?, feel free to update it.
Comment #419
ressaGreat work everybody, thanks! It's so nice to see improvements such as this one, and all the other ones getting included in Drupal core, making it incrementally better and better. #229193: Incremental filter for permissions page and #2492171: Provide options to sanitize filenames (transliterate, lowercase, replace whitespace, etc) (this) are my two highlights in Drupal 10.2.
Shout-out to @smustgrave for help keeping Drupal core issues moving with his Needs Review Queue Initiative. It's having a big impact now, keeping Drupal core issues moving and getting committed. We are all reaping the fruits now.
I have even had @smustgrave appear in one of my own Drupal core issues, reviewing and helping get it committed. Maybe you have also had the pleasure of having him help out in one of your issues?
I think @smustgrave is having such a huge impact because he bridges the often separated roles of a Scrum Master and a coder, skillfully wearing both hats simultaneously, acting sort of like a "Scrum Coder" :)
Imagine the velocity we could reach, if even more people helped out in the Needs Review Queue Initiative ... Just spending ten minutes every other day could make a huge difference, and help keep the momentum in otherwise stalled issues.
See also Talking Drupal #426 - Needs Review Queue Initiative.
Comment #420
defcon0 commentedAt first a big thanks to this new feature. Made parts of our custom implementation obsolete :)
Only one thing bothers me: why is the replacement character only trimmed at the right side of the filename and not on the left one, as well?
https://github.com/drupal/core/blob/10.2.x/modules/file/src/EventSubscri...
I think filenames like -some-filename.png are not so nice. What has been the reason to implement it that way? Couldn't find info in this issue log.
Comment #421
damienmckennaPlease open a new issue for any additional refinements needed.