Problem/Motivation

On Drupal sites where users are uploading files clean filenames are hard to achieve because Drupal accepts most filenames as valid input. However lots of Drupal installs override this functionality, for example Thunder has been shipping a similar solution for a very long time. As @dww writes in #132

Content creators name their files all kinds of weird and unfortunate things. Even if everything is in English, it's still great to convert spaces to dashes, remove weird punctuation, special characters, etc. Plus, many users find themselves in a mix of case-sensitive and case-insensitive filesystems, so it's great to have Drupal automatically convert everything to lowercase. I constantly build custom functionality into my sites to cleanup filenames on upload so as to prevent the editors from making a mess of things. It'd be much better if core did that for me automatically.

Follow-up to #1842718: Use new Transliteration functionality in core for machine names and #567832: Transliteration in core.

@catch mentioned in #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) we may want to transliterate filenames in core, so that we can have a database-level unique constraint on the URI field in the database. (However, this would presumably mean that transliteration of the field could not be optional).

Proposed resolution

Add a fieldset to the File system form (/admin/config/media/file-system) to opt-in for various kinds of filename sanitization:

  • Transliterate the filename. Disabled by default, because it's not useful for some languages like Japanese.
  • Replace whitespace with dash (-) or underscore (_).
  • Replace any characters other than 0-9A-Za-z or - or _ or . with dash (-) or underscore (_).
  • Remove duplicate separator characters ie -- or .. or __ with be replaced with - or . or _.
  • Convert to lowercase (to prevent issues on case-insensitive file systems).

On all files uploaded through the API and UI (i.e. not aggregated JS/CSS files, etc), the filename will be sanitized in whatever ways the site is configured to enforce.

We listen to the \Drupal\file\Event\FileUploadSanitizeNameEvent and change the filename according the the configuration.

Note: These sanitization settings only impact new file uploads, all files already uploaded to a site are not affected.

Remaining tasks

  1. Decide if all whitespace or only spaces are converted. Update code and UI text to match. All whitespace as of #159.
  2. Design and dispatch the appropriate event(s) where core's sanitization is done. Complete via #187.
  3. Decide if the FileUploadEvent should also let listeners munge the destination. If so, update the Event accordingly. No (via @alexpott in #192.2).
  4. Decide if we give the user enough feedback about renaming their file upon upload. No (via @alexpott in #203). If not, implement the appropriate feedback (Complete via #214).
  5. Finish cleaning up the UI text, settings names, etc: Done via #219.
    • What's the right label for the 'strip' or 'remove' option for replace_non_alphanumeric? For now: "- Strip -". Previous: "- Remove (do not replace) -". Other? Option D: remove it entirely, via #219.
    • Should the options for the two selects (replace_*) be capitalized? Sure. ;) Via #194
    • Any other concerns/edits/fixes?
  6. Finish writing Kernel test cases for (all?) the combinations of settings. Done via #213 and #219.
  7. Upload screenshot(s) of final settings UI and update the summary. Done via #219.
  8. Final code cleanup and fixes. No more @todo via #219.
  9. Update/fix the change record. Done via #221
  10. Add tests for REST integration
  11. Decide if we're allowing transliteration as an option(!). See #272.
  12. Decide if we should let people pick a transliteration language. See #270 and #272.
  13. Usability review:
    • Decide if we want a UI at all, or if sites should simply configure this via settings.php (like they do for the public + private files dir paths). See #272.
    • If yes to a UI:
      • Decide whether we want to be able to have different replacement characters for whitespace, non-alphanumeric characters, and transliteration unknown characters. See #180, #246, #258, #261, and others...
      • If we have it as a separate / shared setting, decide if the "replacement character" setting should be "near" the checkboxes that provide replacement or at the end of the fieldset (see #263 for comparisons)
      • Decide if "replace non-alphanumeric" should be "near" transliterate option. See #241 and #242.
      • Decide if we should let people pick a transliteration language. See #270 and #272.
  14. Avoid the announcement of redundant text by a screenreader (ref #343)
  15. Final string/UI review + signoff
  16. Final reviews + RTBC.
  17. Commit.

User interface changes

New "Sanitize filenames" detail element with options on FileSystemForm (/admin/config/media/file-system).

Screenshots of the new settings UI :

Sanitize filenames element closed:

Transliteration is not enabled

Sanitize filenames element open:
Transliteration is enabled

Options under "Replacement character" setting:
Transliteration options

API changes

None

Data model changes

New configuration mapping file.settings:filename_sanitization, with options for:

  • transliterate: Transliterate
  • replace_whitespace: Replace whitespace
  • replace_non_alphanumeric: Replace non-alphanumeric characters
  • dedupe_separators: Remove duplicate dots (..), underscores (__) or dashes (--)
  • lowercase: Convert to lowercase

Release notes snippet

#2972665: New filename sanitization settings during upload (via UI or REST), new sanitization Event, changes to FileUploadResource constructor

CommentFileSizeAuthor
#404 2492171-404.patch28.11 KBsokru
#403 2492171-nr-bot.txt86 bytesneeds-review-queue-bot
#394 2492171-394.patch28.22 KBsokru
#386 interdiff-385-386.txt905 bytesjungle
#386 interdiff-377-386.txt5.17 KBjungle
#386 2492171-386.patch28.11 KBjungle
#385 2492171-replacement.png72.72 KBjungle
#385 interdiff-384-385.txt1.56 KBjungle
#385 interdiff-377-385.txt4.63 KBjungle
#385 2492171-385.patch27.99 KBjungle
#384 interdiff-383-384.txt670 bytesjungle
#384 interdiff-377-384.txt3.25 KBjungle
#384 2492171-384.patch27.99 KBjungle
#383 interdiff-382-383.txt660 bytesjungle
#383 interdiff-377-383.txt3.4 KBjungle
#383 2492171-383.patch28.09 KBjungle
#382 interdiff-381-382.txt1.4 KBjungle
#382 2492171-382.patch28.09 KBjungle
#382 interdiff-377-382.txt3.18 KBjungle
#381 interdiff-380-381.txt728 bytesjungle
#381 2492171-381.patch28.08 KBjungle
#380 interdiff-377-380.txt3.17 KBjungle
#380 2492171-380.patch28.08 KBjungle
#377 interdiff-376-377.txt1.63 KBsokru
#377 2492171-377.patch27.9 KBsokru
#376 2492171.374_376.interdiff.txt1.32 KBdww
#376 2492171-376.patch27.64 KBdww
#374 interdiff-370-374.txt752 bytessokru
#374 2492171-374.patch27.68 KBsokru
#372 for_example.mp4101.1 KBrkoller
#370 interdiff-365-370.txt4.45 KBsokru
#370 2492171-370.patch27.68 KBsokru
#366 aria-hidden.mp4571.76 KBrkoller
#365 interdiff-363-365.txt499 bytessokru
#365 2492171-365.patch26.71 KBsokru
#363 interdiff-354-363.txt4 KBsokru
#363 2492171-363.patch26.71 KBsokru
#354 reroll_diff.txt706 bytesravi.shankar
#354 2492171-354.patch27.18 KBravi.shankar
#350 2492171-350.patch27.19 KBsokru
#350 interdiff-349-350.txt1.16 KBsokru
#349 interdiff_341-349.txt679 bytesRatan Priya
#349 2492171-349.patch27.01 KBRatan Priya
#347 interdiff_345-347.txt559 bytespooja saraah
#347 2492171-347.patch30.21 KBpooja saraah
#345 interdiff_341-345.txt559 bytespooja saraah
#345 2492171-345.patch27.44 KBpooja saraah
#343 underscores.mp4644.12 KBrkoller
#342 Screenshot 2023-02-01 at 1.02.48 PM.png214.28 KBrinku jacob 13
#341 2492171-341.patch27.01 KBsokru
#341 interdiff-339-341.txt867 bytessokru
#339 2492171-details-closed.png19.87 KBsokru
#339 2492171-details-open.png193.25 KBsokru
#339 2492171-339.patch27.01 KBsokru
#339 interdiff-334-339.txt6.11 KBsokru
#335 2492171-transliteration-options.png139.5 KBsokru
#334 interdiff-332-334.txt2.01 KBsokru
#334 2492171-334.patch28.39 KBsokru
#332 interdiff-329-330.txt2.68 KBsokru
#332 2492171-330.patch27.81 KBsokru
#332 2492171-transliteration-is-not-enabled.png382.68 KBsokru
#332 2492171-transliteration-is-enabled.png868.27 KBsokru
#330 interdiff-327-329.txt9.96 KBsokru
#329 2492171-329.patch27.87 KBsokru
#327 2492171-327.patch27.67 KBsokru
#311 2492171-310_311.txt517 bytesgauravvvv
#311 2492171-311.patch3.83 KBgauravvvv
#310 2492171-295.patch3.83 KBAshM
#309 2492171-295.patch3.83 KBAshM
#294 2492171-294.patch9.95 KBnlisgo
#287 2492171-287.patch39.04 KBQuicksaver
#285 2492171-285.patch9.96 KBmitrpaka
#284 2492171-284.patch9.95 KBmitrpaka
#283 2492171-283.patch9.74 KBjcisio
#279 2492171-279.patch114.98 KBnlisgo
#278 2492171-278.patch5.85 MBnlisgo
#265 260-264_interdiff.txt1.11 KBpancho
#265 2492171-264.patch37.79 KBpancho
#261 filenames-select-260.png33.83 KBpancho
#261 filenames-open2-260.png31.94 KBpancho
#261 filenames-open1-260.png31.45 KBpancho
#261 filenames-closed-260.png23.29 KBpancho
#261 filenames-closed-260.png23.29 KBpancho
#261 257-260_interdiff.txt2.41 KBpancho
#261 2492171-260.patch37.21 KBpancho
#258 filenames-select.png35.01 KBpancho
#258 filenames-open2.png33.52 KBpancho
#258 filenames-open1.png32.62 KBpancho
#258 filenames-closed.png25.48 KBpancho
#258 254-257_interdiff.txt21.13 KBpancho
#258 2492171-257.patch37.43 KBpancho
#257 2492171-4-257.patch41.22 KBalexpott
#257 254-257-interdiff.txt680 bytesalexpott
#254 2492171-4-254.patch40.88 KBalexpott
#254 253-254-interdiff.txt1.79 KBalexpott
#253 2492171-4-253.patch40.32 KBalexpott
#253 252-253-interdiff.txt1.19 KBalexpott
#252 2492171-4-251.patch39.13 KBalexpott
#252 250-251-interdiff.txt3.85 KBalexpott
#250 2492171-4-250.patch38.3 KBalexpott
#250 248-250-interdiff.txt4.72 KBalexpott
#248 2492171.245_248.interdiff.txt1.06 KBdww
#248 2492171-248.patch38.58 KBdww
#247 2492171-245.settings-replacement-char-options.png49.66 KBdww
#247 2492171-245.settings.png148.1 KBdww
#246 2492171-4-245.patch38.52 KBalexpott
#246 239-245-interdiff.txt19.18 KBalexpott
#240 2492171-240.settings-replace-options.png112.53 KBdww
#240 2492171-240.settings-replace-nonalpha-depends-on-transliterate.png97.43 KBdww
#240 2492171-240.settings.png135.9 KBdww
#239 2492171.237_239.interdiff.txt4.76 KBdww
#239 2492171-239.patch36.85 KBdww
#237 2492171-4-237.patch36.5 KBalexpott
#237 235-237-interdiff.txt1.86 KBalexpott
#235 2492171-4-235.patch36.28 KBalexpott
#235 227-235-interdiff.txt3.77 KBalexpott
#227 2492171-4-227.patch35.78 KBalexpott
#227 224-227-interdiff.txt1.52 KBalexpott
#224 2492171.223_224.interdiff.txt1.19 KBdww
#224 2492171-224.patch35.67 KBdww
#223 2492171-223.settings-no-alphanumeric-replace-without-transliterate.png80.84 KBdww
#223 2492171.219_223.interdiff.txt442 bytesdww
#223 2492171-223.patch35.69 KBdww
#219 2492171-219.settings-replace-options.png136.7 KBdww
#219 2492171-219.settings.png119.36 KBdww
#219 2492171.218_219.interdiff.txt10.25 KBdww
#219 2492171-219.patch35.69 KBdww
#218 2492171.217_218.interdiff.txt2.43 KBdww
#218 2492171-218.patch35.23 KBdww
#217 2492171.216_217.interdiff.txt4.37 KBdww
#217 2492171-217.patch34.62 KBdww
#216 2492171.214_216.interdiff.txt3 KBdww
#216 2492171-216.patch31.21 KBdww
#214 2492171.213_214.interdiff.txt3.64 KBdww
#214 2492171-214.patch31.06 KBdww
#213 2492171.212_213.interdiff.txt3.78 KBdww
#213 2492171-213.patch29.74 KBdww
#212 2492171.208_212.interdiff.txt13 KBdww
#212 2492171-212.patch28.33 KBdww
#208 2492171.205_208.interdiff.txt1.24 KBdww
#208 2492171-208.patch23.11 KBdww
#205 2492171.199_205.interdiff.txt3.09 KBdww
#205 2492171-205.patch23.02 KBdww
#199 2492171.197_199.interdiff.txt8.08 KBdww
#199 2492171-199.patch22.25 KBdww
#198 2492171.195_197.interdiff.txt589 bytesdww
#198 2492171-197.patch21.74 KBdww
#195 2492171.194_195.interdiff.txt3.68 KBdww
#195 2492171-195.patch21.75 KBdww
#194 2492171.187_194.interdiff.txt9.52 KBdww
#194 2492171-194.patch20.72 KBdww
#187 2492171.185_187.interdiff.txt7.78 KBdww
#187 2492171-187.patch19.4 KBdww
#185 2492171.180_185.interdiff.txt13.92 KBdww
#185 2492171-185.patch14.95 KBdww
#180 2492171-180.settings-nonalpha-options.png70.2 KBdww
#180 2492171-180.settings.png121.91 KBdww
#180 2492171.173_180.interdiff.txt11.54 KBdww
#180 2492171-180.patch18.1 KBdww
#173 2492171.170_173.interdiff.txt2.8 KBdww
#173 2492171-173.patch16.68 KBdww
#171 2492171.168_169.interdiff.txt7.35 KBdww
#171 2492171-169.patch19.63 KBdww
#168 2492171.151_168.interdiff.txt9.69 KBdww
#168 2492171.165_168.interdiff.txt1.28 KBdww
#168 2492171-168.patch15.32 KBdww
#165 2492171.161_165.interdiff.txt454 bytesdww
#165 2492171-165.patch15.32 KBdww
#161 2492171.160_161.interdiff.txt3.32 KBdww
#161 2492171-161.patch15.32 KBdww
#160 2492171.159_160.interdiff.txt831 bytesdww
#160 2492171-160.patch15.33 KBdww
#159 2492171.151-159.interdiff.txt8.8 KBdww
#159 2492171-159.patch15.34 KBdww
#151 142-151-interdiff.txt11.82 KBalexpott
#151 2492171-3-151.patch15.91 KBalexpott
#142 2492171-3-142.patch12.17 KBalexpott
#142 113-142-interdiff.txt1.22 KBalexpott
#139 2492171-139.patch9.19 KBgpap
#128 2492171-128.patch10.96 KBfrob
#126 2492171-126.patch11.45 KBlawxen
#113 2492171-3-113.patch11.52 KBalexpott
#113 110-113-interdiff.txt10.53 KBalexpott
#110 2492171-2-110.patch12.52 KBalexpott
#110 108-110-interdiff.txt2.4 KBalexpott
#15 drupal-use_new_transliteration-2492171-15.patch1.34 KBpingwin4eg
#21 drupal-use_new_transliteration-2492171-20.patch4.17 KBhgoto
#21 interdiff-2492171-15-20.txt4.8 KBhgoto
#26 drupal-use_new_transliteration-2492171-26.patch5.13 KBhgoto
#26 interdiff-2492171-20-26.txt3.07 KBhgoto
#28 drupal-use_new_transliteration-2492171-28.patch5.13 KBhgoto
#28 interdiff-2492171-20-28.txt3.07 KBhgoto
#28 interdiff-2492171-26-28.txt380 byteshgoto
#37 core-file-system_file-upload-transliteration-2492171-37.patch5.08 KBkovtunos
#42 interdiff-2492171.txt1.54 KBborisson_
#42 use_new_transliteration-2492171-42.patch6.63 KBborisson_
#45 use_new_transliteration-2492171-45.patch8.14 KBhgoto
#45 interdiff-2492171-42-45.patch1.14 KBhgoto
#50 Transliteration.jpg18.74 KBpascodiogo
#50 Tr@nsliteratiön $.jpg18.85 KBpascodiogo
#51 use_new_transliteration-2492171-51.patch8.14 KBhgoto
#51 interdiff-2492171-45-51.txt526 byteshgoto
#60 use_new_transliteration-2492171-60.patch8.14 KBseanb
#60 interdiff-51-60.txt691 bytesseanb
#61 use_new_transliteration-2492171-61.patch8.15 KBndobromirov
#61 diff-between-the-patch-files.txt360 bytesndobromirov
#70 use_new_transliteration-2492171-70.patch8.15 KBfrob
#70 interdiff_61_70.txt120 bytesfrob
#72 use_new_transliteration-2492171-72.patch7.6 KBpancho
#93 interdiff-2492171-70-72.txt1.67 KBchr.fritsch
#93 interdiff-2492171-72-93.txt5.03 KBchr.fritsch
#93 2492171-93.patch8.23 KBchr.fritsch
#98 93-98-interdiff.txt529 bytesalexpott
#98 2492171-98.patch8.23 KBalexpott
#99 98-99-interdiff.txt7.52 KBalexpott
#99 2492171-99.patch9.21 KBalexpott
#100 99-100-interdiff.txt611 bytesalexpott
#100 2492171-100.patch9.63 KBalexpott
#103 transliterate-file-uploads-2492171-102.patch9.88 KBkirkkala
#104 transliterate-file-uploads-2492171-104.patch9.88 KBkirkkala
#106 100-106-interdiff.txt2.08 KBalexpott
#106 2492171-2-106.patch11.15 KBalexpott
#108 106-108-interdiff.txt1.55 KBalexpott
#108 2492171-2-108.patch10.89 KBalexpott
#109 interdiff-2492171-106-109.txt2.14 KBarosboro
#109 transliterate-file-uploads-2492171-109.patch13.29 KBarosboro

Issue fork drupal-2492171

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Comments

neclimdul’s picture

What's the status of this? Should #2504815: d6 to d8 migration throws integrity contraint warning with duplicate file paths expect a unique constraint to exist in the database target?

frob’s picture

Is this going to make it for 8.0 or should this be moved to 8.1?

xjm’s picture

Version: 8.0.x-dev » 8.1.x-dev
Status: Active » Postponed

Yep, this is 8.1.x material at this point, so contrib could provide it for 8.0.x. Thanks!

andypost’s picture

Status: Postponed » Active
Related issues: +#1925684: Provide a "original filename" field for file entity.

Suppose it's time, but maybe this blocked on #1925684: Provide a "original filename" field for file entity.

frob’s picture

I wouldn't consider this blocked on that one. They should be able to be worked on in parallel.

deciphered’s picture

It's worth mentioning that this functionality is part of File (Field) Paths module, which has a D8 beta release.

jon@s’s picture

It's worth mentioning that this functionality is part of File (Field) Paths module, which has a D8 beta release.

True but it seems like this functionality should be in core. This would also require the transliteration in core to support replacing spaces with a symbol which it, to my knowledge, currently does not. Also why the transliteration functionality in the File (Field) Paths module is limited compared to the old Transliteration module.

deciphered’s picture

I'm not at all debating that it should be in core, but given @xjm's comment that it's not going to happen in 8.0.x, and that it needs to be dealt with in contrib, this is it being dealt with in contrib.

File (Field) Paths does also provide replacing of spaces with a symbol, as it has since inception (or not long afterwards) via the Pathauto integration.

I'm more than happy to remove the functionality from F(F)P as soon as it's no longer needed.

g089h515r806’s picture

I have test with File (Field) Paths with Drupal8.0.1, it does not works at first. it works correctly at last.
I want to upload chinese pdf file using file field module.

deciphered’s picture

That seems like a poorly worded statement. But I will be happy to look into it further in the appropriate issue queue.

g089h515r806’s picture

Sorry for my English.

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.0-beta1 was released on March 2, 2016, which means new developments and disruptive changes should now be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

eigentor’s picture

I have tried filefield paths to solve the problem of transliterating filenames. For me it dit not work. Some comments for filefield paths:

  1. It seems to keep the original filename and create the new filename, which is some kind of alias. I think this is by design and in the sense of the module, but unneccessary for the simple goal of getting web-safe filenames.
  2. One has do the setting "transliterate filename" in every single filefield, maybe even instance of every filefield. That is quite some work and can cause oversights.
  3. There is a major problem with filefield paths: it cannot be uninstalled https://www.drupal.org/node/2685731 The module is not alone with this problem. The new uninstall system of Drupal 8 requires all database content of a module to be removed before you can uninstall it. If this content is not deletable by hand, there is a problem.
imadalin’s picture

I don't think that the issue's topic is about File (Field) Paths and we should concentrate on implementing Transliteration for filenames manipulation in the Core.

pingwin4eg’s picture

Status: Active » Needs review
StatusFileSize
new1.34 KB

Let's start with something.

Here's a patch using code similar to that from a transliteration module.

berdir’s picture

"Something" looks like a good start to me, at least it doesn't break any existing tests. Next step would be tests with file names that trigger this.

Also wondering about how to introduce this exactly, is there any reason to make this a setting, so it could be disabled? It will be a behavior change, although I suspect a very welcome one for most (all?) sites?

skaught’s picture

i'll be bold enough to say all sites should use.

The only other request I've ever had is about storing the original file name, that could be used for Title href. As people often name a file because they expect it would be saved. Same for Search API related issues. At this time, I haven't directly looked to see if the File API is yet doing that..

^I think there would be no need for an 'on/off' switch of transliteration if org. name is stored for about points.

$clean_filename = \Drupal::transliteration()->transliterate($file->getFilename(), 'en', '');
I should think 'en' shouldn't be hardcoded. sites will not all be english first.

stefan.r’s picture

This probably should be opt-in (at least for existing sites), so as to not change behavior.

skaught’s picture

Status: Needs review » Needs work
andypost’s picture

opt-in

Makes sense to have that configurable via container param

hgoto’s picture

StatusFileSize
new4.17 KB
new4.8 KB

I updated the patch #15.

1.

$clean_filename = \Drupal::transliteration()->transliterate($file->getFilename(), 'en', '');
I should think 'en' shouldn't be hardcoded. sites will not all be english first.

I replaced the hardcoded 'en' to $langcode with the following code:

      $langcode = \Drupal::languageManager()->getCurrentLanguage()->getId();

      // Transliterate and sanitize the destination filename.
      $filename = \Drupal::transliteration()->transliterate($file->getFilename(), $langcode, '');

2.

I changed the strtolower() to Drupal\Component\Utility\Unicode\Unicode::strtolower(). I consulted #1842718: Use new Transliteration functionality in core for machine names for this point. This may be unnecessary.

      // Force lowercase to prevent issues on case-insensitive file systems.
      $filename = Unicode::strtolower($filename);

3.

I introduced an option config to toggle the transliteration for the file names.

This probably should be opt-in (at least for existing sites), so as to not change behavior.

Probably it depends on the language that people want to make this setting opt-in or opt-out (of course, it shouldn't be changed for existing sites with update). In my small personal experience, the current transliteration function for Japanese is not very good and I'd like it to be off by default...

Tests are not yet added.

hgoto’s picture

Status: Needs work » Needs review

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.0-beta1 was released on August 3, 2016, which means new developments and disruptive changes should now be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

shadcn’s picture

Patch in #21 tested and works. Thanks.

dawehner’s picture

+++ b/core/modules/file/file.module
@@ -842,7 +842,28 @@ function file_save_upload($form_field_name, $validators = array(), $destination
+    if (\Drupal::config('system.file')->get('filename_transliteration')) {
+      $langcode = \Drupal::languageManager()->getCurrentLanguage()->getId();
+
+      // Transliterate and sanitize the destination filename.
+      $filename = \Drupal::transliteration()->transliterate($file->getFilename(), $langcode, '');
+
+      // Replace whitespace.
+      $filename = str_replace(' ', '_', $filename);
+      // Remove remaining unsafe characters.
+      $filename = preg_replace('![^0-9A-Za-z_.-]!', '', $filename);
+      // Remove multiple consecutive non-alphabetical characters.
+      $filename = preg_replace('/(_)_+|(\.)\.+|(-)-+/', '\\1\\2\\3', $filename);
+      // Force lowercase to prevent issues on case-insensitive file systems.
+      $filename = Unicode::strtolower($filename);
+    }

I'm wondering why this is part of file_save_upload(), which is part of an API which is related to the form, not some actual underlying API level. IMHO you would like to have the same for potential files uploaded via RESt one day.

Maybe it would be enough for now to move all that logic into its own API function and call it from here?

hgoto’s picture

StatusFileSize
new5.13 KB
new3.07 KB

Thanks for your reviews. As dawehner told, surely it's better to make this logic an independent API function.

Maybe it would be enough for now to move all that logic into its own API function and call it from here?

I tried creating a new method getTransliteratedFilename() in \Drupal\file\Entity\File class and moved this new transliteration logic into it. I'm not sure where we should put this API... I'd like this patch to be fully reviewed. Thank you.

hgoto’s picture

Status: Needs work » Needs review
StatusFileSize
new5.13 KB
new3.07 KB
new380 bytes

The patch #26 failed the auto test because I didn't change the code properly when I extracted the logic to a method. I'll try it again.

tobiberlin’s picture

Tested patch in #28 with a file named with spaces and German umlaut -> works

tobiberlin’s picture

If this is added to core it would be very helpful to offer an update function to transliterate the already uploaded files, wouldn't it?

berdir’s picture

Nope, that doesn't work. you don't know how and where those files are used, could be hardcoded in content. We can only offer it for new files.

tobiberlin’s picture

In my eyes this answer makes no sense as this patch is about using transliteration on images uploaded by a form. Whenever an image uploaded for a field for example is used somewhere hardcoded the same problem as described in #31 would appear when simply a new image is uploaded. Whenever an image path/ name is used hardcoded although the image can be changed by a form it is a wrong programming, isn't it?

ressa’s picture

It would be good to have transliteration of file names in Drupal 8 - and really great to get it into Drupal 8.3, if possible.

sense-design’s picture

One of the features why I have not yet completely switched to D8 for all my projects.
Would be great to have this in the upcoming 8.3 release.

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.0-alpha1 will be released the week of January 30, 2017, which means new developments and disruptive changes should now be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

hampercm’s picture

Status: Needs review » Needs work

In response to #30 and #31: perhaps add a drush command to transliterate existing files, if a site admin needs to.

+++ b/core/modules/file/file.module
@@ -842,7 +842,10 @@ function file_save_upload($form_field_name, $validators = array(), $destination
+    $filename = $file->getTransliteratedFilename();
+
+    $file->destination = file_destination($destination . $filename, $replace);

Earlier code in this function does a setFilename() if the filename is changed (see code that handles insecure file extensions at line 818). I'd recommend doing that here, too.

+++ b/core/modules/file/src/Entity/File.php
@@ -196,6 +197,38 @@ public function preSave(EntityStorageInterface $storage) {
+      // Remove multiple consecutive non-alphabetical characters.
+      $filename = preg_replace('/(_)_+|(\.)\.+|(-)-+/', '\\1\\2\\3', $filename);

This seems extraneous. Remove it?

kovtunos’s picture

Updated the patch #28 to work with Drupal 8.3.0.
Changes: new PHP array syntax.

hgoto’s picture

Status: Needs work » Needs review
borisson_’s picture

I tried this patch and it works, thanks!

I think the failure in #37 is unrelated, but this does need it's own tests. Since we're adding a new public function for the file name transliteration, all we'd need to do is call that function with different file names in the test, I think?

berdir’s picture

We can test it directly in a kernel test with diferent variations. I think we also need at least one integration test, upload an image with some spaces or so to make sure the function is actually called.

borisson_’s picture

Status: Needs work » Needs review
StatusFileSize
new1.54 KB
new6.63 KB

This patch adds the kernel tests. Not sure if the test-class is the correct one.

borisson_’s picture

Ah, those migrate tests are related, I don't see how those should be fixed though.

hgoto’s picture

Status: Needs work » Needs review
StatusFileSize
new8.14 KB
new1.14 KB

I believe the failed tests for #42 can be fixed just by providing the added configuration into $expectedConfig['system.file'] in the test classes.

       // temporary_maximum_age is not handled by the migration.
       'temporary_maximum_age' => 21600,
+      // filename_transliteration is not handled by migration.
+      'filename_transliteration' => FALSE,
     ],
     'system.image.gd' => [
       'jpeg_quality' => 75,

The test passed in my environment with this change and I'd like to see how the testbot reacts.

hgoto’s picture

Oops, I'm sorry, the extension of the interdiff is wrong.

hgoto’s picture

Status: Needs work » Needs review

The tests passed. (The failure was just caused by the wrong extension of the interdiff.)

j-lee’s picture

@@ -274,4 +275,37 @@ public static function baseFieldDefinitions(EntityTypeInterface $entity_type) {
     return $fields;
   }
 
+  /**
+   * Gets the transliterated filename.
+   *
+   * @return string
+   *   The transliterated filename. If the filename transliteration options is
+   *   disabled, the raw filename of the file.
+   */

Just found a typo:

If the filename transliteration options is disabled, the raw filename of the file.

"options" should be "option"

The patch works for me. Thank you.

pascodiogo’s picture

StatusFileSize
new18.74 KB
new18.85 KB

Should not this transliteration be applied to the name displayed to the user?

Like this:

     $file->destination = file_destination($destination . $filename, $replace);

+    $file->filename = $filename;

Without transliteration in the name displayed to the user:
Only local images are allowed.

With transliteration in the name displayed to the user:
Only local images are allowed.

hgoto’s picture

StatusFileSize
new8.14 KB
new526 bytes

@J-Lee, thank you for the review. I fixed the typo.

@PascoDiogo, in my opinion, this transliteration should not necessarily be applied to displayed names. I believe Transliteration module's project page would help to understand transliteration.

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.0-alpha1 will be released the week of July 31, 2017, which means new developments and disruptive changes should now be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

j-lee’s picture

I'm not sure with #50. The original filename is replaced by the transliterated name. Why should a not existent filename displayed to the user? Isn't it confusing if you got a filename shown but the real filename is different?

tstoeckler’s picture

+++ b/core/modules/file/src/Entity/File.php
@@ -274,4 +275,37 @@ public static function baseFieldDefinitions(EntityTypeInterface $entity_type) {
+      $langcode = $this->languageManager()->getCurrentLanguage()->getId();

Shouldn't this use the file language instead?

frob’s picture

Isn't it confusing if you got a filename shown but the real filename is different?

An argument could be made either way:

It is confusing to upload a file and see a different filename after uploading.
It is confusing to see the name of a file in the UI and then look for that file only to find out that it doesn't exist.

The first option is fixable by letting the user know that the filename will be changed upon upload for reasons. The second option would require adding a bit of UI to let the user know that the file name on the disk will be different than the one shown.

I lean toward the first option. Let the user know that the file name will be changed for reasons and then just display the actual file name.

skaught’s picture

#53/#55

+1 option1.
the file description should include (like it does for file size restriction) a li child that outlines that filename will be cleanup.

This is similar to the fact that drupal adds '_0' ' _1' (file counts) if use keeps re-attaching the same file name.

i know i've had clients want to have file name as they have make them (ie: '2017 Q1 earning.xls'). but then would have a bit of overhead on tracking related file name changes are 'repairing' them for file downloads etc. same issue when a User re-attached a new version of that same file name (_0 _1 _2....)

frob’s picture

Knowing the name of the actual file on the server is also necessary for complying with DMCA takedown notices and the like.

skaught’s picture

i haven't looked over this whole issue in a while.

#4 andypost is correct.

pavelculacov’s picture

When it will be integrated in CORE?

seanb’s picture

StatusFileSize
new8.14 KB
new691 bytes

+1 for adding this to core. Not sure on postponing on #1925684: Provide a "original filename" field for file entity.. The feature is optional and the same problem (changing filenames) already occurs for duplicate filenames as mentioned in #56.

We could create a followup for the interface change mentioned in #56 "the file description should include (like it does for file size restriction) a li child that outlines that filename will be cleanup". It seems this particular part could turn into a huge bikeshed and I'm personally not convinced this is a big issue for most users. Hopefully, we don't have to block on that.

New patch to fix #54.

ndobromirov’s picture

Thanks for the patch!

I've started using it on a project, but once I've enabled the configuration and exported it, the site installations started failing due to import of configuration issues. After some checks it turned out the config has stored 1 in the YML file, where the default value was false, so logically it should have stored true in there. Also hinted by the bool schema type.

I've changed the config manually in the YML file to true and installs passed.

I guess the fix should be to just type-cast the value, before setting it to storage from the form-submit handler.
This is also the fix implemented directly on the patch from #60. Diff between the two patch versions is provided.

I do not know should this regression needs any tests, so any help will be appreciated.

As I've done changed on the code still needs review :)

yan’s picture

Patch from #64 seems to work for me (although spaces are converted to underscores (_) which I think is not ideal).

pavelculacov’s picture

Work for me too always. :D

frob’s picture

@yan, what isn't ideal about underscores?

millionleaves’s picture

@frob

There is a difference from a SEO perspective. See #1 in the article linked below:

https://searchengineland.com/9-seo-quirks-you-should-be-aware-of-146465

frob’s picture

Status: Needs review » Needs work

That article is from 2013 and the point you refer to is linked from 2005. Also, the files it is referring to are .html/.htm or in other words, that article is talking about the link to the content and not the link to the static asset.

SEO is constantly evolving and changing. Do you have a more current reference? I would argue that this should be configurable to make it long term SEO friendly, but I would also argue that should be contrib or another issue. I just don't want this issue to get side-tracked when it is so close.

From #61 Looks like it still needs work:

site installations started failing due to import of configuration issues. After some checks it turned out the config has stored 1 in the YML file, where the default value was false, so logically it should have stored true in there. Also hinted by the bool schema type.

I've changed the config manually in the YML file to true and installs passed.

I guess the fix should be to just type-cast the value, before setting it to storage from the form-submit handler.
This is also the fix implemented directly on the patch from #60. Diff between the two patch versions is provided.

millionleaves’s picture

Current article from Google states:

"We recommend that you use hyphens (-) instead of underscores (_) in your URLs."

https://support.google.com/webmasters/answer/76329?hl=en

This is consistent with the earlier article I shared, and I don't see any distinction in either between html URLs and image URLs.

Another reason to use hyphens here is that the Transliteration module in D7 used hyphens for file names, so IMHO it is desirable to remain consistent with that unless the core transliteration functionality is also using underscores (although my argument in favour of underscores hyphens would remain).

It appears that in this case, the use of an underscore is defined in one line of code in patch #61 that could be changed easily without affecting the patch overall.

Providing the option to choose the separator may be useful for some use cases, but agree that this would be better left for a separate issue for the core module rather than this patch.

imclean’s picture

@millionleaves this is an interesting discussion as I've always preferred underscores in filenames while keeping hyphens in the rest of the URL. It turns out this method could be problematic due to underscores being a special character in computing.

Google's examples all use hyphens: https://support.google.com/webmasters/answer/114016?hl=en

This could be used to our advantage in certain cases. For example, "big_red_socks.jpg" you might want people to find only if they're looking for "big_red_socks" and not "big", "red" or "socks". Maybe not very often, but it's good to have that level of control.

frob’s picture

Yeah, it looks like it should be changed to a "-" character. The patch has always had this from #15

$clean_filename = str_replace(' ', '_', $clean_filename);

So that line needs to be changed and the issue from #61

frob’s picture

StatusFileSize
new8.15 KB
new120 bytes

Make the SEO change by rerolling the patch from 61, that leaves the issue mentioned in #61, can anyone review that?

borisson_’s picture

This will also need fixes in the kernel-tests written for this issue, see testFileNameTransliteration in modules/file/tests/src/Kernel/SaveTest.php. Specifically the data-provider (provideFilenames) will need to be updated.

pancho’s picture

Status: Needs work » Needs review
StatusFileSize
new7.6 KB

Fixed provideFilenames().

frob’s picture

Patch in #72 should resolve all remaining issues. All that is left is manual review.

seanb’s picture

Could you provide an interdiff for #72?

Raja_Vijayakumar’s picture

I applied the patch#51 to my application. It only transliterating the filename in database but the file name still appears to be the original on UI for the uploaded file . any idea how do i fix the filename that appears on UI too.

j-lee’s picture

See comments #50, #51, #53, #55, #56 and #60.

This is a point of discussion but could/should a follow up issue.

orkutmuratyilmaz’s picture

anyone tested patch #72?

rang501’s picture

Confirming that the patch in #72 is working.

eigentor’s picture

The Patch #72 applies cleanly with Drupal 8.4.0.
The filenames for the uploaded files itself are correctly transliterated. It adds dashes for spaces.
The displayed filename are an issue for a followup, as I understand, so no surprise they are kept from the original filename.

rootwork’s picture

Status: Needs review » Reviewed & tested by the community

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.0-alpha1 will be released the week of January 17, 2018, which means new developments and disruptive changes should now be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

mcdruid’s picture

Status: Needs work » Needs review

I think the test failures look unrelated to this patch - back to NR to try the tests again.

Noting also that this was mentioned as a step toward being able to remove the workaround for the file uri uniqueness constraint added in #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols), but it probably won't help much if it's configurable and opt-in. (The idea being that the uri field in the file_managed table could be set to ascii if that's all it ever contains, and as such there wouldn't be a problem with the key length when imposing a uniqueness constraint in the db schema).

mgifford’s picture

Setting it back to @rootwork's RTBC. As @mcdruid noted, "test failures look unrelated to this patch".

mgifford’s picture

Status: Needs review » Reviewed & tested by the community
alexpott’s picture

Status: Reviewed & tested by the community » Needs work
Issue tags: -Needs tests +Needs issue summary update
  1. +++ b/core/modules/file/src/Entity/File.php
    @@ -277,4 +278,36 @@
    +  public function getTransliteratedFilename() {
    +    // If the transliteration option is enabled, transliterate the filename.
    +    if (\Drupal::config('system.file')->get('filename_transliteration')) {
    

    This is a bit funny because calling getTransliteratedFilename() might not return a transliterated filename. Maybe it should be the caller's responsibility to check the config.

  2. +++ b/core/modules/system/config/install/system.file.yml
    @@ -3,3 +3,4 @@
    +filename_transliteration: false
    

    Adding a configuration key like this means we need an upgrade path for existing sites. Just an update in system.install to add the missing key with the default value.

  3. We still need a change record
  4. The issue summary could do with an update to explain the solution. I.e. why is it configurable. Why the default for new sites is false.
Anonymous’s picture

Hi all.
First of all, well done and thank you very much for this patch. I applied it and it works a treat.
The "Enable filename transliteration" tickbox appeared in /admin/config/media/file-system
The uploaded file names are converted nicely. I tried with russian characters and romanian diacritics.

I have an issue/proposal related to it.
The file name shown in the admin UI is still the original one, which I think it can be confusing for many.
Do you think it would be a good idea to make it the same as the uploaded file name?

Thank you.

Andy

mcdruid’s picture

Issue summary: View changes
yan’s picture

Patch from #72 worked for me for Drupal 8.5.1

ndobromirov’s picture

I would say that the update should set to FALSE for existing sites through the proposed update path in #86.

For new ones it should be TRUE by default, as file names transliteration is a generally a good thing to do on the web.

sopranos’s picture

I am trying to upload image with special characters in filename, Drupal does not budge (drupal 8)
any news?
Drupal 6 managed to upload it with no problem
I am talking about special slavic characters like this

srečna družina.jpg

ressa’s picture

The patch in #72 together with Drupal 8.6.x works for me. I applied the patch using the new command line tool to quickly install Drupal:

  1. curl -sS https://ftp.drupal.org/files/projects/drupal-8.6.x-dev.zip --output drupal-8.6.x-dev.zip
  2. unzip drupal-8.6.x-dev.zip && rm drupal-8.6.x-dev.zip
  3. cd drupal-8.6.x-dev
  4. wget -q -O - https://www.drupal.org/files/issues/use_new_transliteration-2492171-72.patch | git apply -
  5. php core/scripts/drupal quick-start demo_umami

After enabling filename transliteration under admin/config/media/file-system, it seems to work perfectly:

Original image name
01-a (SMALL!!!)wé↑s e3RQ#er24!"#øæå.jpg

Resulting filename, after upload
01-a-smallwes-e3rqer24oaea.jpg

I also tried uploading a file named srečna družina.jpg, and it was transliterated to srecna-druzina.jpg, as expected.

I agree with @ndobromirov:

For new ones it should be TRUE by default, as file names transliteration is a generally a good thing to do on the web.

chr.fritsch’s picture

Issue summary: View changes
Status: Needs work » Needs review
StatusFileSize
new1.67 KB
new5.03 KB
new8.23 KB

So, I

amateescu’s picture

+++ b/core/modules/file/file.module
@@ -989,7 +989,16 @@ function file_save_upload($form_field_name, $validators = [], $destination = FAL
-    $file->destination = file_destination($destination . $file->getFilename(), $replace);
+
+    // If the transliteration option is enabled, transliterate the filename.
+    if (\Drupal::config('system.file')->get('filename_transliteration')) {
+      $filename = $file->getTransliteratedFilename();
+    }
+    else {
+      $filename = $this->getFilename();
+    }
+    $file->destination = file_destination($destination . $filename, $replace);

I would expect this replacement to be done at a lower level, for example directly in file_create_filename() because that would allow unmanaged files to take advantage of transliteration as well.

Is there any reason for not doing that?

hgoto’s picture

Sorry for interference. If the discussion for the default value of filename_transliteration is not enough, I'd like it to be decided carefully.

Transliteration for filenames will be disabled by default, because it's not useful for some languages like Japanese.

Surely I myself (= a Japanese user) am happy if the transliteration is disabled by default. And I guessed there would be several other languages like Japanese but I started to doubt there're many languages which have a problem with Drupal's transliteration rule like Japanese since I haven't found any opinion similar to mine in this page.

If the out-of-box experience of Drupal gets much better for the majority in case the file transliteration is enabled, it's acceptable for minority users whose language have trouble with the transliteration, even if they need to disable the function every time after installation, I think :)

Thanks for the consideration.

skaught’s picture

For new ones it should be TRUE by default, as file names transliteration is a generally a good thing to do on the web.

Transliteration for filenames will be disabled by default, because it's not useful for some languages like Japanese.

see #18

This probably should be opt-in (at least for existing sites), so as to not change behavior

update existing site and how that could effect expectations by those product owners should be included, regardless of language first install base.

alexpott’s picture

Status: Needs work » Needs review
StatusFileSize
new529 bytes
new8.23 KB

Fixing the tests

alexpott’s picture

StatusFileSize
new7.52 KB
new9.21 KB

I think @amateescu might be correct in #94 that this should be done on a lower level. That way transliteration can apply to unmanaged file saves too. file_create_filename() is not suitable because it is only called when FILE_EXISTS_RENAME is set. In fact there's stuff in that method that definitely should not be there. So I've added it to file_destination().

I also think having File::getFilename() and File::getTransliteratedFilename() is super confusing so I think we should avoid patches that do that.

I agree with the concerns raised about the filename on disk and the filename on the file entity not matching. I know it's an existing issue with the _1 etc added to duplicates but this makes it even more confusing in terms of File::getFilename() not matching the actual filename. I think we should consider either fixing that first or making it part of this issue.

alexpott’s picture

StatusFileSize
new611 bytes
new9.63 KB

This sets the filename correctly in file_save_upload(). Needs tests.

chertzog’s picture

+1 . This seems to work perfectly for me. Applied to 8.5.4

kirkkala’s picture

Patch in #100 works fine but causes issues when css/js aggregation is enabled and server cares about filename case sensitivity.

Modified that patch from @alexpott to come over the issue. Did not though think thoroughly if just .css, .js and .gz are fine to skip from transliteration.

kirkkala’s picture

Oops, bad patch file in #103, sorry about that. Here's a patch that applies with the change that js/css filenames won't be transliterated.

Version: 8.6.x-dev » 8.7.x-dev

Drupal 8.6.0-alpha1 will be released the week of July 16, 2018, which means new developments and disruptive changes should now be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

alexpott’s picture

Status: Needs work » Needs review
StatusFileSize
new2.08 KB
new11.15 KB

Fixing the test. I think this shows that the fix is on the right lines because the current expected message is Raw "Field <em class="placeholder">test_file_field_1</em> can only hold 3 values but there were 6 uploaded. The following files have been omitted as a result: <em class="placeholder">text-0.txt, text-0.txt, text-0.txt</em>." found

I'm not sure about the fix added by #104 - I think we need a test for this situation and also to understand the problem more. This patch includes it and the interdiff is back to #100 so everyone can see what the fix is.

alexpott’s picture

+++ b/core/includes/file.inc
@@ -595,11 +595,20 @@
+  // Do not transliterate if the file to process is css, js or gz.
+  $pieces = explode('.', $destination);
+  if (in_array(array_pop($pieces), ['js', 'css', 'gz'])) {
+    $transliterate = FALSE;
+  }
+  else {
+    $transliterate = TRUE;
+  }
+

@kirkkala this is the change you added to fix CSS aggregation can you give some steps-to-reproduce the problem you are seeing?

alexpott’s picture

StatusFileSize
new1.55 KB
new10.89 KB

I've tried to reproduce the issue detailed by @kirkkala by:

  1. Using a case sensitive filesystem
  2. Installing standard
  3. Ensuring the CSS and JS aggregration is enabled
  4. Having the patch attached applied - it doesn't have @kirkkala's changes
  5. Running the following snippet against the site: \Drupal::configFactory()->getEditable('system.file')->set('filename_transliteration', TRUE)->save();
  6. Clearing the cache
  7. Refreshing a page view which will rebuild the aggregates
  8. Nothing is broken

I've even set debug break points in file_destination() and as far as I can see this method is not called during CSS/JS aggregation.

arosboro’s picture

I had been on #2492171-100: Provide options to sanitize filenames (transliterate, lowercase, replace whitespace, etc) and experienced #2991606: Filename Transliteration enforces lowercase filenames which causes 404 errors with CSS/JS Aggregation turned on mentioned by @kirkkala in #104. I have written a test for that which I believe applies to #106

My solution was to change the AssetDumper service to ensure it creates a lowercase filename when filename_transliteration is enabled, so I have that in the wings as well. I am updating this issue with the test for that, which should check if #104 is a suitable solution as well.

See attached patch and interdiff.

alexpott’s picture

StatusFileSize
new2.4 KB
new12.52 KB

@arosboro awesome! No idea why my debugging didn't work but your test totally proves the problem.

+++ b/core/includes/file.inc
@@ -583,12 +583,47 @@ function file_build_uri($path) {
+    // Force lowercase to prevent issues on case-insensitive file systems.
+    $basename = mb_strtolower($basename);

This should be discussed in a followup issue and not part of this patch.

Patch attached removes this and adds the test from #109.

alexpott’s picture

+++ b/core/includes/file.inc
@@ -583,12 +583,36 @@ function file_build_uri($path) {
+    $basename = \Drupal::transliteration()->transliterate($basename, $langcode, '');
+    // Replace whitespace.
+    $basename = str_replace(' ', '-', $basename);
+    // Remove remaining unsafe characters.
+    $basename = preg_replace('![^0-9A-Za-z_.-]!', '', $basename);
...
+    $basename = preg_replace('/(_)_+|(\.)\.+|(-)-+/', '\\1\\2\\3', $basename);

So here's a question can these rules futz with a name produced by \Drupal\Component\Utility\Crypt::hashBase64() - I think it is possible because I think it is possible to generate a base64 encoded string with several consecutive non-alphabetical characters.

I think we need to change how we're deciding to do transliteration and this should only occur on files uploaded by users and not generated files like CSS/JS aggregates.

alexpott’s picture

Status: Needs work » Needs review
StatusFileSize
new10.53 KB
new11.52 KB

So I've moved the transliterating of filenames to file_save_upload() as this is the method through which all files uploaded to the site are vaildated and processed. I don't think that filename transliteration should be part of the File entity as it is something that is done only when uploading the file to the server.

arsoman’s picture

I used patch and its latest version worked nice! Thanks to the creator! Cheers!

pavelculacov’s picture

Tested, worked

ressa’s picture

It works great, but I still don't understand why it is not enabled by default on a fresh installation ... Just to emphasize, I don't think it should be enabled for existing installations.

laravz’s picture

Assigned: Unassigned » laravz
Issue tags: +DrupalEurope2018

I'll be reviewing this at DrupalEurope

laravz’s picture

Assigned: laravz » Unassigned

The patch applied cleanly to a new 8.7.x and a 8.6.1. However, the test fails for 8.7.x. The file label cannot be found, because the waitForLink function returns Null.

  1. +++ b/core/modules/file/file.module
    @@ -989,7 +990,24 @@ function file_save_upload($form_field_name, $validators = [], $destination = FAL
    +      // Remove remaining unsafe characters.
    +      $filename = preg_replace('![^0-9A-Za-z_.-]!', '', $filename);
    +      // Remove multiple consecutive non-alphabetical characters.
    +      $filename = preg_replace('/(_)_+|(\.)\.+|(-)-+/', '\\1\\2\\3', $filename);
    

    Unexpected characters and sequences are properly being stripped. Languages like Japanese and Russian are being converted to a Latin alphabet.

  2. +++ b/core/modules/system/config/install/system.file.yml
    @@ -3,3 +3,4 @@ default_scheme: 'public'
    +filename_transliteration: false
    

    Default is set to false, as discussed above. I think this is the logical decision, due to the issue with some languages and not everyone needing to translate filenames.

  3. +++ b/core/modules/system/system.install
    @@ -2173,3 +2173,12 @@ function system_update_8501() {
    +
    +/**
    + * Set filename_transliteration config to the default value.
    + */
    +function system_update_8701() {
    +  \Drupal::configFactory()->getEditable('system.file')
    +    ->set('filename_transliteration', FALSE)
    +    ->save();
    +}
    

    Files are being defaulted to false as discussed above.

dawehner’s picture

After discussion with @LaravZ I think personally think that we need to change \Drupal\file\Plugin\rest\resource\FileUploadResource as well, given that you expect transliteration as well.
Maye we could extract all the transliteration logic into a helper function and call it from the upload resource.

andypost’s picture

@dawehner maybe rest can have this optional/configurable per request option? makes sense to file separate issue

ivnish’s picture

Thanks! Patch is worked.

andypost’s picture

Issue tags: +Needs screenshots

I bet it needs usability review

sopranos’s picture

this is default in new drupal? But I cant find the settings....
I went to content type story...than edit fields...and edit image field and I cant find the setting

ivnish’s picture

sopranos, use patch from #113

lawxen’s picture

StatusFileSize
new11.45 KB

#113 can't be applied to newest drupal8.7.x and drupal8.6.x any more, reroll it for drupal8.7.x(just file.module, and nothing functional change)

berdir’s picture

Note that using this patch with the update function is a very bad idea because this adds a 87xx update function, which tells Drupal to ignore all future 86xx update functions.

It's not required anyway since FALSE/NULL is the implicit fallback, so I would suggest you provide an alternative patch without that update function that people can use if they are using the patch.

frob’s picture

StatusFileSize
new10.96 KB

rerolled without update hook 8701

skaught’s picture

Status: Needs work » Needs review
drupalnesia’s picture

I think this additional feature should be in a module like D7 does in https://www.drupal.org/project/transliteration, because only few sites need special transliteration. While core only provide basic transliteration only.

dww’s picture

I disagree with #131. Content creators name their files all kinds of weird and unfortunate things. Even if everything is in English, it's still great to convert spaces to dashes, remove weird punctuation, special characters, etc. Plus, many users find themselves in a mix of case-sensitive and case-insensitive filesystems, so it's great to have Drupal automatically convert everything to lowercase. I constantly build custom functionality into my sites to cleanup filenames on upload so as to prevent the editors from making a mess of things. It'd be much better if core did that for me automatically.

In short: even if you don't think you want this, you probably do. That's a good reason for core to do it, instead of forcing site builders to look for a contrib solution to a problem they don't necessarily realize they have. :)

Cheers,
-Derek

berdir’s picture

I think the problem is that transliteration only really makes sense for languages with latin characters.. if you try to use it with files that have chinese, arabic etc. characters than you just receive garbage.

That said, it is important to be configurable (maybe it would even need to be configurable by language, but that likely doesn't make much sense, as people might upload chinese filenames when adding an english translation or something) but I don't think it needs to be a separate module. It could maybe be a global setting provided by system.module that different things could check.. if you don't want to transliterate filenames, you probably also don't want to do it for pathauto aliases. But again, who knows, maybe someone does ;)

mfb’s picture

I think we could say any site wants rules to cleanup file names. This would be removing undesirable stuff like whitespace, control characters, invalid bytes or extreme length from file names, and case and unicode normalization to avoid cross-file system issues. In Drupal this all got agglomerated into "transliteration", which is more clearly optional. There's nothing wrong with 🐈.txt or any other non-ASCII as a file name. The URL would be http://example.com/%f0%9f%90%88.txt which isn't super pretty but works fine.

frob’s picture

According to the IS this has a technical limitation as well.

@catch mentioned in #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) we may want to transliterate filenames in core, so that we can have a database-level unique constraint on the URI field in the database.

Look back 130 comments for the discussion as to why this should be in core. I am not trying to say we should have the discussion, rather I am saying that we already had the discussion.

We need a re-roll so tests pass again.

skaught’s picture

#131 - i can't image a project not needing it. If you've never had a client/editor upload a file with a space in the name and then wonder why they can't embed a link in a WYSIWYG -- then you're luck to have such a tight scope on your projects.

🐈.txt

The basic point of clean names is ensure names are possible to type on a regular keyboard. Certainly MySQL storage of characters needs to match ability.

This base issue does outline having a setting. I'm not directly aware of where the patch is at this point -- hopefully it's not lost.

Also we had previous conversation here about tracking original names #4 has a sub issue.

skaught’s picture

EDIT
===
check for get('filename_transliteration') config exists.

dan_metille’s picture

#132 I do not have much experience with chinese, arabic etc. characters, but for cyrillic characters, I very much rely on transliteration.

gpap’s picture

StatusFileSize
new9.19 KB

Reroll for 8.6.7

UPDATE: This one messes up CSS/JS asset filenames when aggregation (preprocess) in enabled though. ¯\_(ツ)_/¯

skaught’s picture

Status: Needs work » Needs review
gpap’s picture

Status: Needs review » Needs work
alexpott’s picture

Status: Needs work » Needs review
StatusFileSize
new1.22 KB
new12.17 KB

Re #139

This one messes up CSS/JS asset filenames when aggregation (preprocess) in enabled though. ¯\_(ツ)_/¯

Yep this is why we can only do this for user uploaded files and not all files generated by Drupal.

Here's a reroll of #113 for 8.7.x with the update function re-instated and at least one of the tests fixed.

alexpott’s picture

Issue summary: View changes

Updated the issue summary.

alexpott’s picture

+++ b/core/modules/file/file.module
@@ -1022,7 +1023,21 @@ function _file_save_upload_single(\SplFileInfo $file_info, $form_field_name, $va
+  $filename = $file->getFilename();
+  if ($system_file_config->get('filename_transliteration')) {
+    // Transliterate and sanitize the destination filename.
+    $filename = \Drupal::transliteration()
+      ->transliterate($filename, $file->language()->getId(), '');
+    // Replace whitespace.
+    $filename = str_replace(' ', '-', $filename);
+    // Remove remaining unsafe characters.
+    $filename = preg_replace('![^0-9A-Za-z_.-]!', '', $filename);
+    // Remove multiple consecutive non-alphabetical characters.
+    $filename = preg_replace('/(_)_+|(\.)\.+|(-)-+/', '\\1\\2\\3', $filename);
+    // Force lowercase to prevent issues on case-insensitive file systems.
+    $filename = mb_strtolower($filename);
+  }

One thing I wonder is if this should be:

  $filename = $file->getFilename();
  if ($system_file_config->get('filename_transliteration')) {
    // Transliterate and sanitize the destination filename.
    $filename = \Drupal::transliteration()->transliterate($filename, $file->language()->getId(), '');
  }
  // Force lowercase to prevent issues on case-insensitive file systems.
  $filename = mb_strtolower($filename);

I wonder:

  • Is whitespace replacing necessary and solving problems for anyone?
  • Is the "Remove remaining unsafe characters." really correct after transliteration?
  • The replace multiple consecutive non-alphabetical would annoy me as a user. If named a file I---like---a---hyphen.txt and the system renamed it I-like-a-hyphen.txt I'd be annoyed

If we did the above suggestion then system.file:filename_transliteration would really only mean transliteration which seems a good thing. And the case-insensitive vs case-sensitive filesystems are a fact of like and so it seems worthwhile always doing. We problem should add a message a tell the user why this has occurred if it has though.

pancho’s picture

For me lowercasing everything would be a much larger intrusion and annoyance than replacing spaces or consecutive hyphens (which both seems legitimate to do). Plus not many Drupal setups will be on a mixture of Windows and Linux machines.
Even then: be smart, not intrusive. Instead of enforcing lowercase, enforce case insensitivity.

Also, spaces are usually replaced by underscores. Multiple hyphens are rather hard to read, especially if some of them are actual hyphens, while others are whitespace replacements.

„Force lowercase“ (why not force uppercase?) and the whitespace replacement character at the very least need to be configurable. But then it might be something for contrib.

We might possibly want to only do a minimalistic Drupal::transliteration()->transliterate() and leave the whole rest to contrib.

daniel.bosen’s picture

Issue summary: View changes

I changed the description a bit, since Thunder is not actually shipping this patch, but implemented a custom solution similar to this.
The reason for introducing a similar mechanism in the Thunder distribution were faulty http clients. E.G. the OkHttp client which is used in several mobile apps has a bug, where it double encodes urlencoded image names.
So technically only important for us is:

  • Transliterate the filename.
  • Replace whitespace with -
  • Remove remaining characters other than 0-9A-Za-z or - or . or _

While reducing multiple consecutive characters or lowercasing is not required on production systems. But for developers, it might be interesting to also have the lowercasing, since Windows, as well as OSX, have case insensitive filesystems.

flocondetoile’s picture

Remaining tasks
Agree exactly what should occur when system.file:filename_transliteration is TRUE.

Apart from transliterating what else should we do? Currently we:

Replace whitespace with -
Remove remaining characters other than 0-9A-Za-z or - or . or _
Remove multiple consecutive non-alphabetical characters ie --- or ... or ___ with be replaced with - or . or _
Force lowercase to prevent issues on case-insensitive file systems.

What about if we made configurable theses options, once the filename transliteration option is checked ? In this way, many use cases could be satisfied ?
At least :

- Replace whitespace with - or _ => select list, default is _
- Remove multiple consecutive non-alphabetical characters ie --- or ... or ___ with be replaced with - or . or _ => checkbox TRUE/FALSE, default is FALSE
- Force lowercase => checkbox TRUE/FALSE, default is FALSE

dww’s picture

Agreed with #147, except there should be a "don't replace" option (or something) in the select list for "Replace whitespace". Otherwise, yes, I'd love it if core provided these options, instead of punting this to contrib. We're so close here to something that will satisfy almost everyone's use-case.

Thanks,
-Derek

dww’s picture

p.s. The top-level checkbox could be "Transform filename", and if selected, reveals all the other settings:

[X] Transliterate
[ ] Force lowercase
...

The config setting IDs could be of the form:

filename_transform
filename_transform_transliterate
filename_transform_lowercase
filename_transform_convert_whitespace
filename_transform_only_alphanumeric //..._and_separators (?) ugh.
filename_transform_strip_multiple_punctuation // or ..._strip_multiple_separators (?)
...

pancho’s picture

Status: Needs review » Needs work

What about if we made configurable theses options, once the filename transliteration option is checked ?

++ :)

Agreed with #147

++ :)

This would be awesome and definitely better than doing just a minimum and leaving the rest to contrib. I was just anxious we would force an overly strict ruleset on users who would in many cases feel disenfranchised. Now I know (and agree) we don't unnecessarily add clutter to the Core UI. However, I believe that this feature warrants some configurability.

Even if configurable, I still think that instead of lowercasing filenames, we should enforce case insensitivity in our files table. However that clearly needs to be investigated further.

So let's go forward as proposed by @flocondetoile and Derek. I hope noone's angry if for now I'm setting the issue to "needs work."

alexpott’s picture

Status: Needs work » Needs review
StatusFileSize
new15.91 KB
new11.82 KB

Here's a start.

Reviews of the UI text and config key names would be great.

We need more test coverage of how the sanitisations play together.

dww’s picture

Status: Needs review » Needs work

@alexpott re: #151:

Looks great, thanks! This is getting close. Agreed it needs more tests. Haven't manually tested, but here's what I see looking at the patch:

1. A few unrelated 80 char hunks at the front make it less readable as a patch.

2. We're inconsistent about "whitespace" vs. "spaces" in the config/variable names, UI text and how it actually works. Is this just for spaces or is it all whitespace (tabs, newlines, etc)? Most of it points towards all whitespace. I think we should standardize on that, expand the str_replace() to actually work that way, fix the UI text to match, etc.

3. Do we want to dispatch an event to let contrib easily continue to tweak the filename before we use it, or is that too much scope creep? How else do we expect contrib to extend this?

4. I don't see the "-- disabled --" pattern anywhere else in core. "- None -" seems much more common. If we keep "disabled" it should be "- Disabled -". Maybe "- Do not replace -"?

5.

+      '#title' => $this->t('Remove non-alphanumeric characters'),
+      '#default_value' => $config->get('filename_sanitization.remove_non_alpha_characters'),
+      '#description' => $this->t('Remove non-alphanumeric characters exception dots, underscores or dashes.'),

Description repeats title and contains a typo. Maybe this:

      '#title' => $this->t('Remove non-alphanumeric characters'),
      '#default_value' => $config->get('filename_sanitization.remove_non_alpha_characters'),
      '#description' => $this->t('Alphanumeric characters, dots, underscores and dashes are preserved.'),

? "...are allowed."?

6. I know I just suggested 'transliterate' should be an option, but this is still in the summary:

@catch mentioned in #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) we may want to transliterate filenames in core, so that we can have a database-level unique constraint on the URI field in the database. (However, this would presumably mean that transliteration of the field could not be optional).

Do we want to force transliteration to help resolve @catch's concern, and make everything else optional? Or do we keep this a config knob and if people run into that problem, they have to turn this on? Seems non-ideal. Should we at least default to on? A bulk rename of already-uploaded files is way beyond scope for this. :/ Should we open a follow-up about that? Does it already exist?

Thanks!
-Derek

pancho’s picture

Some more comments:

+    filename_sanitization:
+      type: mapping
+      label: 'Uploaded filename sanitization options'
+      mapping:

"sanitize_filenames" would be easier to grasp. We have "temporary_maximum_age", "default_scheme" and "path", but also "allow_insecure_uploads." On the other hand, this is a container, not a boolean. So yeah, maybe keep 'filename_sanitization.'

+        transliterate:
+          type: boolean
+          label: 'Transliterate'
+        replace_whitespace:
+          type: string
+          label: 'Replace whitespace'
+        remove_non_alpha_characters:
+          type: boolean
+          label: 'Remove non alphanumeric characters except dot, underscore and dash'

Maybe "strip_non-alphanumeric"?

+        dedupe_consecutive_separators:
+          type: boolean
+          label: 'Deduplicate consecutive dots, underscores and dashes'

"dedupe_separators" is enough. If they're not consecutive, they're no duplicates.

+        force_lowercase:
+          type: boolean
+          label: 'Force lowercase'
+    $form['filename_sanitization'] = [
+      '#type' => 'fieldset',
+      '#title' => $this->t('Filename sanitization options'),
+      '#tree' => TRUE,
+    ];

Maybe just "Filename sanitization" without "options"?

+    $form['filename_sanitization']['transliterate'] = [
+      '#type' => 'checkbox',
+      '#title' => $this->t('Enable filename transliteration'),
+      '#default_value' => $config->get('filename_sanitization.transliterate'),
+      '#description' => $this->t('Transliteration ensures that filenames do not contain unicode characters.'),
+    ];

How about "Transliterate filenames"?
We're replacing not only characters that are exclusively Unicode, but all non-ASCII characters. So how about "Transliteration ensures filenames contain only ASCII characters. It is recommended to keep transliteration enabled."

+    $form['filename_sanitization']['replace_whitespace'] = [
+      '#type' => 'select',
+      '#title' => $this->t('Replace spaces with the selected character'),
+      '#default_value' => $config->get('filename_sanitization.replace_whitespace'),
+      '#options' => [
+        ' ' => $this->t('-- disabled --'),
+        '-' => '-',
+        '_' => '_',
+        '.' => '.',
+      ],
+    ];
+
+    $form['filename_sanitization']['remove_non_alpha_characters'] = [
+      '#type' => 'checkbox',
+      '#title' => $this->t('Remove non-alphanumeric characters'),
+      '#default_value' => $config->get('filename_sanitization.remove_non_alpha_characters'),
+      '#description' => $this->t('Remove non-alphanumeric characters exception dots, underscores or dashes.'),
+    ];

"Strip non-alphanumeric characters except for dots (.), underscores (_) or dashes (-).'

+    $form['filename_sanitization']['dedupe_consecutive_separators'] = [
+      '#type' => 'checkbox',
+      '#title' => $this->t('Remove duplicate consecutive dots, underscores or dashes'),
+      '#default_value' => $config->get('filename_sanitization.remove_non_alpha_characters'),
+    ];

"Remove duplicate dots (..), underscores (__) or dashes (--)"

+    $form['filename_sanitization']['force_lowercase'] = [
+      '#type' => 'checkbox',
+      '#title' => $this->t('Convert all filenames to lowercase'),
+      '#default_value' => $config->get('filename_sanitization.force_lowercase'),
+    ];

"Convert filenames to all lowercase."

flocondetoile’s picture

I think we need a #state on each option to hide them on the config form if the feature is not enable. This way, all the options will not be visible until the checkbox transliterate is checked.

dww’s picture

+1 to all the notes in #153.

-1 to using #states from #154. None of these features depend on using "transliterate". That's simply one kind of sanitation a site might want. But you could still want to strip whitespace even if you're not transliterating. As @mfb pointed out at #134:

In Drupal this all got agglomerated into "transliteration", which is more clearly optional.

.

If we wanted a global "turn on any of this stuff" option, it should be called "[X] Sanitize filename", not "Transliterate". But, I think that complication is unneeded. It's a small fieldset of distinct options, so let's leave them all visible by default.

Thanks,
-Derek

andypost’s picture

No reason for states, meantime makes sense to give a room for contrib to extend

At lease token and https://www.drupal.org/project/field_tokens

dww’s picture

Title: Use new Transliteration functionality in core for file names » Provide options to sanitize filenames (transliterate, force lowercase, replace whitespace, etc)
Component: transliteration system » file system
Category: Task » Feature request
Issue summary: View changes
Issue tags: -Needs issue summary update +Needs tests

Re-title, move component (since this is about more than transliteration now), and summary update to reflect the current state.

dww’s picture

Issue summary: View changes

A few more edits. Also fleshed out the Remaining tasks. Currently:

  1. Decide if all whitespace or only spaces are converted. Update code and UI text to match.
  2. Design and dispatch the appropriate event(s) after (? and/or before?) core's sanitation is done.
  3. Decide if we give the user enough feedback about renaming their file upon upload. If not, implement the appropriate feedback.
  4. Finish cleaning up the UI text, settings names, etc.
  5. Final code cleanup and fixes.
  6. Finish writing test cases for (all?) the combinations of settings.
  7. Upload screenshot of final settings UI and add to the summary (@see @todo). ;)
  8. Final string/UI review + signoff. If we get this in by 8.7.0, tag for "String change in 8.7.0".
  9. Write change record.
  10. Final reviews + RTBC.
  11. Commit.

Getting close. ;)

Cheers,
-Derek

dww’s picture

Issue summary: View changes
Status: Needs work » Needs review
StatusFileSize
new15.34 KB
new8.8 KB

#152.1: Fixed.
#152.2: Fixed in favor of all whitespace, not just spaces. Crossed that off from the summary remaining TODO.
#152.3: Added @todo for now. ;)
#152.4: Fixed. Using '- None -'. While I was at it, I made the labels for the other choices readable. Very hard to tell - from _ in the previous version.
#152.5: Fixed (mostly via the text proposed in #153). Now:

      '#title' => $this->t('Strip non-alphanumeric characters'),
      '#description' => $this->t('Alphanumeric characters, dots (.), underscores (_) and dashes (-) are preserved.'),

Fixed all of #153, except I further simplified these:

      '#title' => $this->t('Transliterate'),
...
      '#title' => $this->t('Convert to lowercase'),

Still needs work (for tests and other things), but back to NR for the test bot. I'm done for tonight, if anyone else wants to pick up the torch. ;)

Thanks!
-Derek

dww’s picture

StatusFileSize
new15.33 KB
new831 bytes

Whoops, I forgot to update the settings in the test. ;) Sorry about that.

dww’s picture

Issue summary: View changes
StatusFileSize
new15.32 KB
new3.32 KB

Ooof, I realized I partly renamed "remove_non_alphanumeric" to "strip_non_alphanumeric" so that the setting name matched the UI text, but I was inconsistent. This should actually work. ;) Sorry for the noise (and bot churn). My local testing world stopped working suddenly, and I don't have time to get that happy again right now.

Summary updates:
- Fixed typo in remaining tasks generating an empty todo.
- Add link to #159 about whitespace.
- Add settings label to each config key in the UI section.

dww’s picture

Status: Needs work » Needs review
StatusFileSize
new15.32 KB
new454 bytes

Not sure exactly what the bot is confused by, but this should work.

mcdruid’s picture

Thanks for all the work on this. @dww you mentioned:

@catch mentioned in #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) we may want to transliterate filenames in core, so that we can have a database-level unique constraint on the URI field in the database. (However, this would presumably mean that transliteration of the field could not be optional).

It would be good to remind ourselves what would need to happen to filenames (actually uris) in core in order for us to restore the constraint in the db. (Especially as the code-level constraint is not yet properly enforced - see #2869855: Race condition in file_save_upload causes data loss).

Perhaps that aspect of this change needs to split off into a separate issue? Ideally we wouldn't make that part of the filename processing optional, as otherwise we're stuck without the database-level constraint.

I'll try to have a closer look at this ASAP but wanted to bring this up so that it's not overlooked.

dww’s picture

Status: Needs work » Needs review
StatusFileSize
new15.32 KB
new1.28 KB
new9.69 KB

Sorry, my regexp for replacing all whitespace had a typo (/s vs \s). It also had a thinko, since I was telling it to replace 1 or more (via +) (since that's what my custom code usually does), but really we want individual replacement (since dedupe is now a separate setting). Got my local test world happy again, and at least Drupal\Tests\file\Functional\SaveUploadTest now passes locally, so I think the bot should be happy with this.

For fun, I'm also attaching an interdiff relative to #151 to easily review everything I've done since @alexpott's last patch.

Re: #167: Yeah, sounds good. Thanks for the link and the info. Agreed that needs more thought, and perhaps should be handled separately.

Thanks,
-Derek

alexpott’s picture

  1. +++ b/core/modules/file/file.module
    @@ -1033,18 +1033,19 @@
    +  if ($sanitization_options['strip_non_alphanumeric']) {
         $filename = preg_replace('![^0-9A-Za-z_.-]!', '', $filename);
       }
    

    This is the sanitization that scares me the most because it is the most potentially destructive. I'm my mind this only really makes sense if you have transliteration enabled too.

  2. +++ b/core/modules/file/file.module
    @@ -1033,18 +1033,19 @@
    +  // @todo Dispatch event here?
    

    I think dispatching an even makes heaps of sense. And if we do core should use the event rather than putting all the code in here. What's nice about that is that it make it much easier to test - when you are only dealing with a filename.

krzysztof domański’s picture

StatusFileSize
new15.5 KB
new4.07 KB

1. I added dataProvider. It's easier to add more test cases now.

/**
   * Provides data for self::testFilenameSanitization().
   *
   * @return array
   *   An array of arrays, each containing the parameters for
   *   self::testFilenameSanitization().
   */
  public function providerTestFilenameSanitization() {
    return [
      'Test default options' => [
        'TEXT-œ.txt', 'TEXT-œ.txt', [
          FALSE, ' ', FALSE, FALSE, FALSE,
        ],
      ],
      'Test all options selected' => [
        'TEXT-œ.txt', 'text-oe.txt', [
          TRUE, '-', TRUE, TRUE, TRUE,
        ],
      ],
      'Test file name with unknown character' => [
        'S  Pace--🙈.txt', 's-pace-.txt', [
          TRUE, '-', TRUE, TRUE, TRUE,
        ],
      ],
    ];
  }

2. I changed the name of test testTransliteration() -> testFilenameSanitization().

dww’s picture

Issue summary: View changes
Issue tags: -Needs tests
StatusFileSize
new19.63 KB
new7.35 KB

Now with significantly more test coverage of the options. ;) Tentatively removing 'Needs tests' and crossing this off from Remaining tasks, although there are a few possible combinations we're not explicitly testing. SaveUploadTest all passes locally. Let's see what the bot says...

dww’s picture

Assigned: Unassigned » dww

Ugh, whoops, cross-post. :/ Dang. I'll take stab at merging #170's dataProvider with all my new test cases. Stay tuned.

dww’s picture

Assigned: dww » Unassigned
StatusFileSize
new16.68 KB
new2.8 KB

Done. All passes locally.

dww’s picture

Issue summary: View changes

Fixing more typos in the summary.

Re: #169.1: Yeah, it's a bit scary. In my custom implementations, I usually convert to _, not entirely strip. I wonder if we want a similar setting to the whitespace one that lets you decide what you want to happen for these. Or are you proposing we do away with this entirely? I don't love the idea of forcing transliterate to be on for this to happen. And there's now a test for what happens if you use this without transliteration (works as expected, even if it's scary). ;)

Re: #169.2: Great. :) I guess that means system module should listen to the event and respond based on the system.file settings (since system.module is providing the settings in the first place). Is that what you have in mind?

I'd still feel more comfortable with the Functional test coverage we now have. It's good to know the uploads actually work and the files are renamed. I'm less interested in unit tests that make sure preg_replace() does what we tell it to. ;) But sure, I guess we could add those, too.

Cheers,
-Derek

pancho’s picture

Very nice. However,

1) why is Scandinavian "Å" transliterated to "oe", not to "Aa" or possibly "A"?
Not the right place to change any of these mappings, but at least we should have some tests proving we're correctly applying the langcode to trigger language-dependent overrides.

2) I'm tending to agree with @alexpott in #169 considering the "Strip non-alphanumeric" option as needlessly destructive if not coupled with a proper transliteration and actually applied after the transliteration. While the other options may be independent, I'd say these two should be coupled. We might want to move the "Remove whitespace" option to the top and have:

☒ Replace whitespace
☒ Transliterate
⮡ ☒ Strip non-alphanumeric
☒ Remove duplicate dots etc.
☐ Convert to lowercase

These could actually be sensible defaults.

dww’s picture

Re: #175.1: Totally out of scope for this issue, IMHO.

Re: #175.2: The order is fixed in the implementation. We're not letting people configure the order at all.
Transliteration always comes first, so I don't see any problem there.

I don't think the order of when we replace the whitespace matters, so long as that comes *before* the dedupe_separators transformation.

Personally, I'm *always* going to have [X] Transliterate enabled, so if y'all want to force the 'strip_non_alphanumeric' to require that, I won't complain. ;)

I don't think we can enable any of this by default, but I'll let a release manager decide that.

Cheers,
-Derek

dww’s picture

Also, re: #175.1 -- you're seeing another manifestation of #2922638: No charset on response for patch & text files which is a d.o bug when viewing patch files. If you download the patch and look locally, you'll see it doesn't actually use Å at all...

pancho’s picture

1) You're right in that "Transliterate" has to come before "Replacing whitespace", as it replaces quite some characters with whitespace, that might have to be stripped lateron.

2) Furthermore, "Replacing whitespace" definitely needs to happen before "Stripping non-alphanumeric," otherwise words are illegibly concatenated.

So "Strip non-alphanumeric" should basically only be available if both "Transliterate" and "Replace whitespace" are checked.

Both checked:
☒ Transliterate
☒ Replace whitespace
☐ Strip non-alphanumeric
☒ Remove duplicate dots etc.
☐ Convert to lowercase

Otherwise:
☐ Transliterate
☒ Replace whitespace
☒ Remove duplicate dots etc.
☐ Convert to lowercase

☒ Transliterate
☐ Replace whitespace
☒ Remove duplicate dots etc.
☐ Convert to lowercase

☐ Transliterate
☐ Replace whitespace
☒ Remove duplicate dots etc.
☐ Convert to lowercase

This also reduces the number of possible combinations from 120 to 20 which is way easier to test one by one.

3) When replacing whitespace, we might want to avoid adding the replacement character to the beginning or the end, so we would trim the filename first and only replace inner whitespace characters.

dww’s picture

Assigned: Unassigned » dww
Status: Needs review » Needs work

I'll give this another push. I agree on trim(). I think replace_non_alphanumeric (with a '' strip option) is better than forcing strip. I'm not totally sold this needs to require transliterate, but if that's the will of the masses, I'll make it so. ;) I disagree this *requires* that you already replaced whitespace. In fact, if you use replace vs. strip here, you don't even need the whitespace conversion at all. So I'm definitely not going to make this require whitespace. But yes, I'll keep it running in the same order.

Stay tuned...

dww’s picture

Assigned: dww » Unassigned
Issue summary: View changes
Status: Needs work » Needs review
Issue tags: -Needs screenshots
StatusFileSize
new18.1 KB
new11.54 KB
new121.91 KB
new70.2 KB

Renamed strip_non_alphanumeric to replace_non_alphanumeric. The setting is now a select, similar to the one for whitespace, but it adds a '- Remove (do not replace) -' option. Probably needs UI/text review/help. ;)

Fixed #169.1 + #178. The replace_non_alphanumeric only appears (via #states) if the 'Transliterate' checkbox is checked. The code to enforce this only fires if 'transliterate' is enabled in the active config.

Renamed force_lowercase setting to lowercase and cleaned up some docs + text to match.

Fixed the replace_whitespace setting to call trim() (#178.3). Unfortunately, trim() is a bit confused if you have spaces "at the end" of your filename, because there's also the file extension itself. So, while we'll correctly trim 'something.txt ' right now, we don't do exactly what you expect with ' something .txt'. I added a test and a comment about this.

Added some more test cases to touch trim(), ignoring replace_non_alphanumeric if transliterate is FALSE, etc.

Uploading screenshots of current settings UI. Probably needs help all around, but I figure I'd get something up for now to make it easier to review.

Unassigning again.

Before I (or anyone) works on the event, it'd be nice to have direction on where all that should live. Who defines the event itself? Seems a bit weird for the Event class to live in system if it's being dispatched from file. I don't fully grok the relationship between file.module and system.module (e.g. why is system providing this settings form, not file)? Should system.module declare the listener that enforces all these settings?

Thanks!
-Derek

alexpott’s picture

@dww er.. the config should so be in file.settings - my bad. I forgot about file module :)

dww’s picture

Re: #181: Well, system already provides FileSystemForm. We're just extending it. Ditto the tests. I think a complete refactoring of all that is out of scope here, no? I just wanted some direction on how it should work if we're going to go through the trouble to add and dispatch a new Event here. I'm not proposing we completely fix everything about system vs. file module in this issue. But I'd like to ensure that whatever new API we're adding isn't going to immediately need to be deprecated. ;)

Thanks,
-Derek

alexpott’s picture

Yeah but this functionality is only for uploaded files and that's enabled by the file module. So we should move it to the file module and use a form alter to put this on the system form - because we don't need another form.

dww’s picture

Re: #183: Hrm, no offense, but that seems ugly and weird. ;) There are already some things that live in file.settings (although I can't find anything in core that lets you set them). I don't want to add more to that schema, and form_alter a UI for part of it, etc. See also #2894194-14: Move 'temporary_maximum_age' setting from 'system.file' config to 'file.settings' config. I'm not sure untangling that entire mess makes sense as part of this issue.

Meanwhile, I'm still looking for input on the API and layout of the event + listener. ;) Since we're starting from scratch on that, I'm happy to layout the code and name things in an agreeable way from the start.

Thanks,
-Derek

p.s. Updated the summary's config mapping to match what the patch is now doing.

dww’s picture

Title: Provide options to sanitize filenames (transliterate, force lowercase, replace whitespace, etc) » Provide options to sanitize filenames (transliterate, lowercase, replace whitespace, etc)
Issue summary: View changes
StatusFileSize
new14.95 KB
new13.92 KB

Okay, maybe this is better. The patch only touches core/modules/file now (except core/modules/layout_builder/tests/src/FunctionalJavascript/InlineBlockPrivateFilesTest.php). It still seems a bit hacky, but so is having system.module responsible for any of this. ;)

It also means we no longer need those migrate-related test hunks either (since this is no longer in system settings). Those tests pass locally without the hunks, so I'm leaving them out. I don't see any tests trying to migrate file.settings, maybe I'm missing them.

The interdiff is nearly as large as the patch. Oh well. ;)

#181 and #183 are now addressed.

dww’s picture

Issue summary: View changes

More IS tidy-up. Added these to the todo on finalizing the UI text:

  • What's the right label for the 'strip' or 'remove' option for replace_non_alphanumeric?
  • Should the options for the two selects (replace_*) be capitalized?
  • Any other concerns/edits/fixes?

thanks,
-Derek

dww’s picture

Issue summary: View changes
StatusFileSize
new19.4 KB
new7.78 KB

Now with a FileUploadEvent, subscriber (where file enforces the sanitization settings), dispatch at the right spot, etc. Updated summary accordingly.

Would love feedback on the event, its API (get/setFilename() vs. just making $filename public), if/how we should let subscribers munge the destination, too, if/how/where we should better document that this event is being dispatched, etc.

Thanks!
-Derek

pingwin4eg’s picture

Status: Needs review » Needs work

@dww, nice work!

  1. +++ b/core/modules/file/file.install
    @@ -177,3 +177,18 @@ function file_update_8700() {
    +function file_update_8701() {
    +  \Drupal::configFactory()->getEditable('file.settings')
    +    ->set('filename_sanitization', [
    +      'transliterate' => FALSE,
    +      'replace_whitespace' => ' ',
    +      'replace_non_alphanumeric' => ' ',
    +      'dedupe_separators' => FALSE,
    +      'lowercase' => FALSE,
    +    ])
    +    ->save(TRUE);
    +}
    

    Do we actually need to "enforce" default config presence via hook_update_N? I think it'd be better to check if config is set in our event class.

  2. +++ b/core/modules/file/src/EventSubscriber/FileEventSubscriber.php
    @@ -0,0 +1,77 @@
    +      $filename = \Drupal::transliteration()->transliterate($filename, $file->language()->getId(), '');
    

    This should use dependency injection.

  3. +++ b/core/modules/file/src/EventSubscriber/FileEventSubscriber.php
    @@ -0,0 +1,77 @@
    +      $filename = preg_replace('!\s!', $sanitization_options['replace_whitespace'], trim($filename));
    

    If a transliteration is optional now, shouldn't we use functions with multibyte support instead of preg_replace?

Also there are few places where coding standards are not followed.

pancho’s picture

1) Another thought why at least part of this should live in system: these sanitization options basically don't exclusively apply to filenames but may, possibly should be used for machine_names and possibly other string items.

2) re #180:

Unfortunately, trim() is a bit confused if you have spaces "at the end" of your filename, because there's also the file extension itself. So, while we'll correctly trim 'something.txt ' right now, we don't do exactly what you expect with ' something .txt'.

We might want to turn to a regex instead and make it strip all whitespace not only at beginning and end but also before/after any period "."

3) The latter should also apply to "replace non-alphanumeric characters." A file named 'Drupal (v8).jpg' shouldn't be sanitized to 'drupal-v8-.jpg'. If enabled, we should be stripping all non-alphanumeric at the beginning and end of each part of the filename, and only replace occurrences in the middle by the selected character.

4) For filenames (as well as machine_names) underscores and dashes aren't different to alphanumeric, while dots (.) bear special significance. I don't think we should replace anything by a dot. A file with the name "myßdoc" (without file extension) shouldn't be sanitized into "my.doc". We may preserve dots when replacing non-alphanumeric and we may remove dupes, but we shouldn't allow to replace anything (whitespace, non-alphanumeric) with a dot.

5) While unduping underscores (__) and dashes (--) may be a matter of preference (and usecase), duplicate dots (..) IMHO should be always deduped. They simply shouldn't exist in a filename in the first place. I would even do that silently, no matter what options are chosen.

alexpott’s picture

@Pancho machine names have their own regex - I think by bringing up machines here we are making this more complex than it needs to be. Machine names and file names are different things.

pancho’s picture

@alexpott I know machine_names have their own regex, but should they? How much are they different?

I'm not saying we necessarily need a UI for configuring machine_name sanitization (besides possibly transliteration), but under the hood, machine_names could actually use the whole stack of filename sanitization as follows:

['transliterate' => TRUE,
'replace_whitespace' => '_',
'replace_non_alphanumeric' => '_',
'dedupe_separators' => TRUE,
'lowercase' => TRUE]

Besides avoiding code duplication, this would be an improvement as the current machine_name regex regularly produces sub-par results which I end up editing manually. This clearly is beyond scope here, but all I'm saying is: some of this is not just file stuff, but system stuff.

alexpott’s picture

  1. I don't find your argument compelling. There's no need for generic system code to do these things call transliterate in your code, do the str_replace().
  2. +++ b/core/modules/file/src/Event/FileUploadEvent.php
    @@ -0,0 +1,75 @@
    +class FileUploadEvent extends SymfonyEvent {
    

    I'm really torn here - between an event that has the whole file and one that is just about filename sanitization and just receives a filename rather than a file object. The problem with an event that has the entire File object is that then there are lots of things the subscribers can do that are out-of-place. For me given the security implications I think just firing an event with the name is a good idea. Way less complexity. Also given the security implications it is vital that the event subscribers do not change the extension of the file so we should code in some form of check for that. Or prevent it from occurring at all by only passing the part of the filename.

  3. +++ b/core/modules/file/tests/src/Functional/SaveUploadTest.php
    @@ -128,6 +135,116 @@ public function testNormal() {
    +  /**
    +   * Provides data for self::testFilenameSanitization().
    +   *
    +   * @return array
    +   *   An array of arrays, each containing the parameters for
    +   *   self::testFilenameSanitization().
    +   */
    +  public function providerTestFilenameSanitization() {
    

    This can now move into a unit test because the event is unit testable. Which is great because then we don't have to reinstall the system 10 times as we do for this.

wim leers’s picture

+1 for this functionality.

But if we do it, we should also do it in FileUploadResource, so that API file uploads are treated the same. We already have explicit test coverage for exactly this sort of thing, which this issue could expand: \Drupal\Tests\jsonapi\Functional\FileUploadTest::testFileUploadUnicodeFilename().

dww’s picture

Issue summary: View changes
Status: Needs work » Needs review
StatusFileSize
new20.72 KB
new9.52 KB

#186.1: I changed it to just "- Strip -" for now. Further review and feedback on this welcome.

#186.2: I went for yes. Capitalized the options.

#186.3: Still pending...

#188.1: I don't know. I'm just following the lead from @alexpott and moving his DB update function around.

#188.2: Fixed. Good catch! I DI'ed the config, but forgot about the transliteration service while moving this code around.

#188.3: Fixed. I hope. ;) Tests all still pass locally, so I think this is cool.

#188.other:

Also there are few places where coding standards are not followed.

Where? I try to be pretty pedantic as I write code.

#189.1: No thanks, out of scope (per @alexpott in #190 and #192).

#189.2: Sort of fixed by having the event only touch the non-extension part of the filename, and gluing the extension back on after dispatch. See below. I think your proposal for trimming whitespace around dots makes this more complicated and error-prone.

#189.3: No thanks. That's overly complicating this. a file called drupal-v8-.jpg is just fine with me. If you really care, you can now write your own event subscriber in custom or contrib! ;)

#189.4: Sure, makes sense to me. I removed the dot (.) option from both selects.

#189.5: Hrm, not sure. I'd like someone else to weigh in on that before I change how it works. I don't think we should always and silently dedupe '..'. Seems like it should be a setting, and we already have a setting, so why mess with it?

#192.1: Agreed.

#192.2: Both fixed. The FileUploadEvent no longer takes a File object. Instead, it gets the filename (without extension, the "filename" part of what pathinfo() returns), and a (read-only copy of the) extension. Plus the language ID (which we only have in the $file, but need for transliterate(). This nicely solves the problem of trim() and filenames like ' something .jpg'. Makes the event more bounded in scope. There's only a setter for the filename. The extension and language ID are read-only via getters.

I wonder if this event should be renamed to something like FileUploadSanitizeFilenameEvent to be even more self-documenting.

#192.3: Todo. I kinda like leaving at least a couple of these cases in here so that we know the whole thing works (integration), the event is dispatched, it actually renames the uploaded file, etc. If we leave a couple of the nasty cases in here, I'm okay trying every possible combination of settings as unit tests on the event. Cool?

#193: Thanks @wimleers! That's why I pinged you about this. ;) After a brief look, it seems like dispatching the same FileUploadEvent (or whatever we rename it) at the start of FileUploadResource::prepareFilename() is the right thing to do, right? If so, how do I get the correct language ID inside this resource? And if I need to inject the event dispatcher service, can I safely change the constructor of this thing without breaking BC? ;)

Cheers,
-Derek

dww’s picture

Issue summary: View changes
StatusFileSize
new21.75 KB
new3.68 KB

This seems cleaner. The event takes the full filename in the constructor, does the pathinfo() itself, and provides a getFilenameWithExtension(). Makes it easier to dispatch the event in multiple places, as this patch starts doing to address #193 (although I left some @todo for my questions for @wim leer in #194). I wonder if all the rest/api tests will pass with this. ;)

Still needs Unit tests for all the possible settings. Anyone else want to take a stab at that? ;)

alexpott’s picture

  1. +++ b/core/modules/file/src/Event/FileUploadEvent.php
    @@ -46,20 +46,19 @@
       public function __construct($filename, $extension, $language_id) {
    

    Need to remove the $extension.

  2. +++ b/core/modules/file/file.module
    @@ -1022,7 +1023,12 @@ function _file_save_upload_single(\SplFileInfo $file_info, $form_field_name, $va
    +  $event = new FileUploadEvent($file->getFilename(), $file->language()->getId());
    

    Tricky to name this. But the now this only gets the name and is about names I think we need to reflect that.

  3. +++ b/core/modules/file/file.module
    @@ -1022,7 +1023,12 @@ function _file_save_upload_single(\SplFileInfo $file_info, $form_field_name, $va
    +  $event_dispatcher->dispatch(FileUploadEvent::SANITIZE, $event);
    

    SANITIZE_NAME?

  4. +++ b/core/modules/file/file.module
    @@ -1739,3 +1749,85 @@ function _views_file_status($choice = NULL) {
     }
    +
    +
    +/**
    + * Implements hook_form_FORM_ID_alter().
    

    Extra line here.

  5. +++ b/core/modules/file/src/Event/FileUploadEvent.php
    @@ -0,0 +1,110 @@
    +
    +class FileUploadEvent extends SymfonyEvent {
    ...
    +   * This value is passed for context, but should not be modified by listeners.
    +   * The event dispatcher does not use this value, and there's no setter.
    ...
    +   * This value is passed for context, but should not be modified by listeners.
    +   * The event dispatcher does not use this value, and there's no setter.
    

    Missing doc block for the class. I think we could document that this event only allows changing of the filename and not the extension of anything else about the file. And then we could lose the "This value is passed for context".

  6. It's definitely worth having a couple of functional tests - but not with a data provider as well as a unit test which extensively tests our assumptions about how these options work together on decent examples.
  7. One thing I've been pondering is shouldn't we tell the user what has been done so they understand why the filename has been changed?
dww’s picture

Status: Needs work » Needs review
StatusFileSize
new21.74 KB
new589 bytes

Bah, sorry about that. I had confused my git checkout (forgot to re-add the file before the git diff) and left out a final (very important!) edit. ;) This should pass. Unless REST hates me now. Let's see.

dww’s picture

StatusFileSize
new22.25 KB
new8.08 KB

Ugh, x-post w/ @alexpott..

#197.1: Already fixed in #198. ;)

#197.2: Right, in #194 I proposed "I wonder if this event should be renamed to something like FileUploadSanitizeFilenameEvent to be even more self-documenting." I'm going with FileUploadSanitizeNameEvent here. Is that too verbose or are you okay with it?

#197.3: If the name of the event class has "name" in it, maybe this is unneeded?

#197.4: Fixed.

#197.5: Fixed. How's this?

/**
 * An event during file upload that lets subscribers sanitize the filename.
 *
 * The original file extension is preserved, and when dispatching this event,
 * the caller should use self::getFilenameWithExtension() to get the full
 * filename. self::getFilename() and self::setFilename() only operate on the
 * part of the filename without the extension (what pathinfo() returns via
 * PATHINFO_FILENAME). The original extension and language ID of the uploaded
 * file are passed as context, but cannot be modified by subscribers.
 *
 * @see pathinfo()
 * @see _file_save_upload_single()
 * @see \Drupal\file\Plugin\rest\resource\FileUploadResource::prepareFilename()
 */

#197.6: Fair enough. We could revert the functional to a slimmed-down version of what I had in #171 (no provider) and flesh out full coverage of all the options via Unit tests (that probably want a provider). I'd love it for someone to give the unit tests a first stab, as I confess to not being sure the right way to go about that. Presumably we want a loop where we can twiddle the active config, dispatch the event with different initial filenames, and assert that the filename after the dispatch is what we expect.

#197.7: Yup, that's in the Remaining tasks in the IS as:

Decide if we give the user enough feedback about renaming their file upon upload. If not, implement the appropriate feedback.

What do you have in mind? Does the dispatcher in file.module simply do a drupal_set_message() if the pre vs. post sanitization filename is different?

What's the right way (if any) to provide such feedback for the rest response? ;)

dww’s picture

Also re #197.7: Or does that imply we want FileUploadSanitizeNameEvent to have knowledge of a message for the end user, and subscribers can set additional messages on the event as they munge things? That sounds complicated and scope-creepy. Can we do a single

drupal_set_message(t('Your file has been renamed from %old to %new due to site sanitization rules.', [...]));

(or something) for now and punt anything more fancy to a follow-up?

Thanks,
-Derek

alexpott’s picture

I think #200 is the minimum we should do - so let's start there. There's prior art... \Drupal::messenger()->addStatus(t('For security reasons, your upload has been renamed to %filename.', ['%filename' => $file->getFilename()]));

dww’s picture

Issue summary: View changes
Status: Needs work » Needs review
StatusFileSize
new23.02 KB
new3.09 KB

Fixed tests (I didn't finish the Event rename in core/modules/file/src/Plugin/rest/resource/FileUploadResource.php).

Fixed the CS bugs the bot is complaining about.

Fixed #197.7/#200/#203.

Crossing off a lot more of the remaining tasks. ;) Getting closer!

dww’s picture

Issue summary: View changes

A few minor cleanups and updates to the summary.

dww’s picture

Status: Needs work » Needs review
StatusFileSize
new23.11 KB
new1.24 KB

Wow, nice coverage... go API-first team. ;) Even though the Drupal UI doesn't seem to let you upload files without extensions, REST does, so we can't assume pathinfo() will return an extension. I think this should pass now (even though it's a bit more ugly).

alexpott’s picture

+++ b/core/modules/file/src/Event/FileUploadSanitizeNameEvent.php
@@ -86,13 +86,13 @@
   public function getFilenameWithExtension() {
-    return $this->filename . '.' . $this->extension;
+    return $this->filename . (!empty($this->extension) ? ('.' . $this->extension) : '');
   }

I don't think this ternary is necessary because you've taken care of this case in the constructor. So $this->filename . '.' . $this->extension; will work just fine.

  1. +++ b/core/modules/file/file.module
    @@ -1028,6 +1028,13 @@
    +    // @todo: It's not necessarily "security reasons", but reusing this message
    +    // means there's no new string to translate for this.
    +    \Drupal::messenger()->addStatus(t('For security reasons, your upload has been renamed to %filename.', ['%filename' => $event->getFilenameWithExtension()]));
    

    There's a little bit of complexity here. I wonder if we should only have a single message about renaming the file on upload.

  2. +++ b/core/modules/file/src/Plugin/rest/resource/FileUploadResource.php
    @@ -12,6 +12,7 @@
    +use Drupal\file\Event\FileUploadSanitizeNameEvent;
    

    The new name does not leave space for confusion - so whilst verbose and long I'm in favour.

dww’s picture

Issue summary: View changes

#209.0: Except we do, b/c if you have a file without an extension ("foo"), your version of getFilenameWithExtension() would return a trailing dot ("foo."). The alternative would be to test and store the extension with the dot, then this function could be simple, but then callers would get ".txt" when using getExtension(). Maybe that's ok, but it seemed like the raw extension without the dot would be a kinder return value for getExtension().

#209.1: Well, yeah, probably. ;) I wanted to avoid too much complexity, but you're right. @todo.

#209.2: Great.

Also updated IS to remove mention of replacing anything with a dot and other cleanups. Screenshots are still stale, but I want to wait for more feedback on that before I update them.

krzysztof domański’s picture

Minor bugs if we add a file with an existing name.

1. Add file (e.g. "Zürich.png") twice. If file name exist then _N is added at the end (e.g. "Zürich_0.png", "Zürich_1.png"). If transliteration is enabled we get the correct result "Zurich_0.png" but message is:
"For security reasons, your upload has been renamed to Zurich.png."

Should be:
"For security reasons, your upload has been renamed to Zurich_0.png."

2. Enable "Remove duplicate dots (..), underscores (__) or dashes (--)". Upload twice any file with underscore at the end (e.g. "example_.png").

Result is "example__0.png" with duplicate underlines (__). Should be "example_0.png".

dww’s picture

Issue summary: View changes
StatusFileSize
new28.33 KB
new13 KB

Test cleanup:

- Restore basically the tests I added in #171, removing the functional test provider from #170, slim down a bit, and update everything to match the current settings names, etc. So, we still have functional integration tests for a few of the complex cases.

- Add a new Kernel test to do what I described near the end of #199. There's a provider. For each case, we set the file.settings config accordingly, create and dispatch the event, and assert that the event's final getFilenameWithExtension() matches the expected filename. I couldn't quite grok how to do this as a Unit test, but as Kernel it seems easy-peasy. Runs very vast locally. ;) Seemed a bit wonky to insert this into one of the existing Kernel tests (e.g. @borisson_ in #42) so I went with a whole new class. Hope that's agreeable.

If someone wants to get really nit-picky, we could now easily add enough cases to SanitizeNameTest::provideFilenames() to cover every possible combination of settings.

Re: #211.1: Thanks for testing. Yup, that's the @todo about the message from #209.1.

Re: #211.2: Interesting. That's because the deduplicate filename stuff happens after the filename sanitization. Which it has to, since we obviously want to check for filename collision on the sanitized name. I don't think we want to dispatch the event again and double-sanitize. I think we might have to live with that behavior. @alexpott: thoughts on this?

dww’s picture

Issue summary: View changes
Issue tags: -Needs tests
StatusFileSize
new29.74 KB
new3.78 KB

Even more test cases for Kernel now. I think we now have plenty of tests. ;) Untagging and crossing that off from remaining tasks.

So, this really NW for the message cleanup (#209.1 et al), but then I think we're in the home stretch for final UI word-smithing...

dww’s picture

Issue summary: View changes
StatusFileSize
new31.06 KB
new3.64 KB

Consolidates and fixes the user-facing warning message. Should solve #209.1 and #211.1.
Added some assertions to the Functional tests to make sure we see the message as expected when filename sanitization happens.
Crossing this off from the remaining tasks.

Home stretch now! UI / text reviews please?

Thanks,
-Derek

alexpott’s picture

  1. +++ b/core/modules/file/file.module
    @@ -1001,7 +1004,6 @@ function _file_save_upload_single(\SplFileInfo $file_info, $form_field_name, $va
    -      \Drupal::messenger()->addStatus(t('For security reasons, your upload has been renamed to %filename.', ['%filename' => $file->getFilename()]));
    
    @@ -1061,6 +1068,18 @@ function _file_save_upload_single(\SplFileInfo $file_info, $form_field_name, $va
    +  // If the filename has been modified, let the user know.
    +  if (strcmp($file->getFilename(), $original_filename)) {
    +    // @todo: It might not necessarily be "security" reasons, it could just be
    +    // the filename sanitization settings. But not changing this string means we
    +    // don't have to break existing translations.
    +    \Drupal::messenger()->addStatus(t('For security reasons, your upload has been renamed to %filename.', ['%filename' => $file->getFilename()]));
    +  }
    +
    

    I think we should set a flag here to use the security text for the rename and if it is not a security rename use different text. Maybe something like "Your upload has been renamed to %filename." to keep it simple.

  2. +++ b/core/modules/file/file.module
    @@ -1739,3 +1758,84 @@ function _views_file_status($choice = NULL) {
    +    '#title' => t('Replace non-alphanumeric characters with the selected character'),
    ...
    +      '' => t('- Strip -'),
    

    I think there's a problem with the title being replace and then an option being strip. This is tricky.

  3. +++ b/core/modules/file/file.module
    @@ -1739,3 +1758,84 @@ function _views_file_status($choice = NULL) {
    +    '#description' => t('Alphanumeric characters, dots (.), underscores (_) and dashes (-) are preserved.'),
    ...
    +    '#title' => t('Remove duplicate dots (..), underscores (__) or dashes (--)'),
    

    Really nice idea having both the names and examples of the characters - makes it clearer.

  4. +++ b/core/modules/file/src/Plugin/rest/resource/FileUploadResource.php
    @@ -458,6 +459,14 @@ protected function validate(FileInterface $file, array $validators) {
    +    // @todo: Need to set the right language ID here, not sure where to get it.
    +    $event = new FileUploadSanitizeNameEvent($filename, 'en');
    

    I don't see how the language is being set here. Either we need to extract it somehow from the data being posted. As far as I can work it out the langcode used when creating files here will be \Drupal::languageManager()->getDefaultLanguage()->getId() as this is what is set in \Drupal\Core\Field\Plugin\Field\FieldType\LanguageItem::applyDefaultValue()

  5. +++ b/core/modules/file/tests/src/Kernel/SanitizeNameTest.php
    @@ -0,0 +1,154 @@
    +    // @todo: Maybe the language ID should be supplied by the provider?
    

    I dunno - it's not necessary for this code to test the transliteration service. But it might be neat to prove we pass the correct langcode to it.

  6. +++ b/core/modules/file/tests/src/Kernel/SanitizeNameTest.php
    @@ -0,0 +1,154 @@
    +  public function provideFilenames() {
    

    This unit test is really nice work. Great to see the test coverage. We can go a step further and make this a unit test and mock the transliteration and config services and only call FileEventSubscriber::filenameSanitize(). But of course that'll tie the test a bit more to implementation. But this might be worth it given with already good functional test coverage.

dww’s picture

StatusFileSize
new31.21 KB
new3 KB

#215.1: Done. Since there are already new strings here, translations are going to have to adjust, anyway.

#215.2: Agreed. That's why I flagged this multiple times for feedback. ;) It seems like the possible solutions are:

A) Fix the label of the thing to be more generic than "Replace". Perhaps:
"Transform non-alphanumeric characters:"
Options:

  • - None -
  • Remove
  • Replace with dash (-)
  • Replace with underscore (_)

B) Change the "Strip" label to something else like "Remove (do not replace)" (like it used to be).

C) Live with it as-is. Stripping is a form of "replace", since we're replacing with an empty string. ;)

D) Other?

#215.3: Great, thanks. :)

#215.4: Thanks for looking deeper. Yeah, that seems likely to be the only way to do it. I was hoping @Wim Leers would answer about The Right Way(tm), but I'd be okay using the default language like you suggest.

On a related note, do you have opinions on changing the FileUploadResource constructor to properly DI the event dispatcher (and probably this language manager service)? I've seen various debates about the best way to handle this. Should we leave the constructor alone, add setters for these 2 new services, and call those setters in create()? Or do we just change the constructor and let anyone who happens to be extending it change their code?

#215.5: Yeah, I don't really care. I agree this test is not responsible for testing language-specific transliteration. I was only trying to be responsible and flagging it if someone does care.

#215.6: Glad you like the Kernel test. :) I don't see how all the additional hoop-jumping is "a step further". This is already testing what we want to test to make sure all these settings work as expected. Mocking other services seems like a lot of additional effort for no benefit. To speak frankly, this is exactly the kind of thing that burns me out and makes me not want to try contributing to core anymore. I've already spent way more time on this issue that I can ethically bill any clients for. I'm just trying to push it over the finish line so we can actually commit this, finally realize all the hours I and others have poured into this issue over the last 4 years, and move on with our lives. We've got solid / complete test coverage. Now you're suggesting I spend another few hours for no apparent benefit. No thanks. If you want to do so, knock yourself out. ;) But honestly, you have a lot more important and useful things you could be doing (RTBC queue, I'm looking at you) than "polishing screws" and further rearranging these tests for no actual benefit. Sorry if that sounds harsh, but I wanted to express my frustration in an honest and open way. I know you're looking out for the health of the code, and I respect that. I also want us to also look out for the health and well-being of our (shrinking?) contributor community and not make trying to improve core so hard and painful that fewer and fewer people are willing to try.

dww’s picture

Fixes #215.4, #194/193, etc.

Slack thread w/ @gabesullice:
---
@gabesullice: AFAIK, no. But this reminds me of this issue: #3021155: Accept client-generated IDs (UUIDs) for file uploads too
@gabesullice: I think your request highlights that we need a way to specify field values on the upload `application/octet-stream` request
@dww: @gabesullice Thanks, makes sense. So I'll just inject the language manager and get the default. Meanwhile, preferences on the BC implications of changing the constructor to get the 2 new services?
@gabesullice: I think you mean in core. @wimleers would be the person for that.
@dww: Well, its easy to change core in this issue. :wink: The question is if anyone who might be extending FileUploadResource would complain. Thought you might know/care about that, too. If not, no worries, and I'll wait for wim. Thanks again!
@gabesullice: Ah, I misunderstood. I still don't have a preference. But you should probably be aware of this issue too then: #2940383: [META] Unify file upload logic of REST and JSON:API
@gabesullice: That will already change the constructor
@dww: Cool, sounds good. For now, I'll change the constructor in 2492171 and we can work out the collision if/when the first one lands.
---

Still really NW for #215.2 and other possible string review/fixes. But otherwise, now almost entirely free of @todo. ;)

Adding #2940383 to the top of related issues since apparently this now conflicts with that on constructor changes to FileUploadResource. Also adding #3021155.

Cheers,
-Derek

dww’s picture

StatusFileSize
new35.23 KB
new2.43 KB

Based on slack convo in #d9readiness, here's a new version of the FileUploadResource constructor change that gracefully triggers a deprecation error if you don't pass the new args, instead of outright crashing. Should ease any BC-breaking concerns.

dww’s picture

Issue summary: View changes
StatusFileSize
new35.69 KB
new10.25 KB
new119.36 KB
new136.7 KB

@alexpott and I chatted in Slack. We decided option D for #215.2 is best: remove the 'Strip' option entirely. ;) It simplifies life, makes things more consistent, and is safer. If people want to dedupe, they have a setting for that. I noticed this now lets us construct a single array of options and reuse it for both settings inside the form_alter, which is nice. Updated all the tests to match, and removed all mention of "strip" from the patch (and the IS). If a site *really* wants to strip these, not replace, they can subscribe to the sanitize event and do whatever they want. ;)

Also quickly fixed #215.5. The Kernel test provider can now optionally define a language code. Added 3 cases to ensure we're using the right language-specific transliteration, and how to handle Ä: en (A) vs. de (Ae). Removed the @todo from the test.

Any other string concerns? Updated the IS screenshots. Crossed off a lot of stuff.

dww’s picture

Issue summary: View changes

Minor summary updates, crossing this off, too:

Final code cleanup and fixes. No more @todo via #219

Only open tasks at this point are:

  • Final string/UI review + signoff. If we get this in by 8.7.0, tag for "String change in 8.7.0".
  • Update/fix the change record.
  • Final reviews + RTBC.
  • Commit.

Can we reuse the same CR to talk about the new settings, the new Event, and the changes to the FileUploadResource constructor?

Thanks,
-Derek

dww’s picture

@alexpott via slack said: "Yes I think so" to reusing the same CR for all of this.

So I converted the CR into this:

New filename sanitization settings during upload (via UI or REST), new sanitization Event, changes to FileUploadResource constructor

Untagging that and crossing it off the TODO.

Optimistically tagging this for an 8.7.0 string change. ;) Also crossed that off the list.

dww’s picture

Issue summary: View changes
Status: Needs work » Needs review
StatusFileSize
new35.69 KB
new442 bytes
new80.84 KB

Wow, that's incredibly evil and weird. The complete disappearance of Ä does not happen when I run those Kernel tests locally. :/ WTF? Thoughts?

Meanwhile, this fixes the CS bug.

Also updating IS to reflect that replace_non_alphanumeric requires transliterate. Screenshot showing that.

dww’s picture

StatusFileSize
new35.67 KB
new1.19 KB

This also all passes locally:

Drupal\Tests\file\Kernel\SanitizeNameTest                     21 passes

Does the d.o bot like this better? Ugh. I don't like this, but I don't think it's my fault. ;) If the bot is now green, I'm willing to call this Good Enough(tm).

alexpott’s picture

Status: Needs work » Needs review
StatusFileSize
new1.52 KB
new35.78 KB

Maybe this is a mb_ereg_replace() bug - we don't use that anywhere else in core. I'm guessing and this is just a guess - that the DrupalCI servers have mb_regex_encoding() set to something non standard causing the first byte to be treated as a BOM or something. Dunno.

dww’s picture

Thanks for looking! Good guess/thought. We'll see what the bot says. However, I'm skeptical this will solve it. Here are the first two test cases that failed in #224:

      'Transliterate via en (not de), no other transformations' => [
        'Ö Ü Å Ø äöüåøhello.txt',
        'O U A O aouaohello.txt',
        [TRUE, ' ', ' ', FALSE, FALSE],
      ],
      'Transliterate via de (not en), no other transformations' => [
        'Ö Ü Å Ø äöüåøhello.txt',
        'Oe Ue A O aeoeueaohello.txt',
        [TRUE, ' ', ' ', FALSE, FALSE], 'de',
      ],

All those ' ' and FALSE options turn off the code paths touched by your interdiff.

I'd be surprised if #227 makes a difference, although I'm happy to go with the u modifier on preg_replace() instead of mb_ereg_replace() if that's what core does everywhere else.

wim leers’s picture

Wow, incredible progress here! 🤯

  1. I think using the default language probably indeed makes sense. REST was sadly never architected with translations in mind, which is why #2135829: EntityResource: translations support is still open. WRT translations, Drupal is definitely not yet "API-First".
  2. +++ b/core/modules/file/src/Plugin/rest/resource/FileUploadResource.php
    @@ -124,6 +127,20 @@ class FileUploadResource extends ResourceBase {
    +  /**
    +   * The event dispatcher to dispatch the filename sanitize event.
    +   *
    +   * @var \Symfony\Component\EventDispatcher\EventDispatcherInterface
    +   */
    +  protected $eventDispatcher;
    +
    +  /**
    +   * Language manager for retrieving the default langcode during upload.
    +   *
    +   * @var \Drupal\Core\Language\LanguageManagerInterface
    +   */
    +  protected $languageManager;
    

    Two more services injected. This is making the code even more complex :(

    I'm surprised that nobody has mentioned #3021652: Deprecate upload-related functions file.inc and move to a file upload service yet. I've been told in #2940383: [META] Unify file upload logic of REST and JSON:API that that should wait for #3021652: Deprecate upload-related functions file.inc and move to a file upload service. That's definitely overlapping with this in scope.

    Shouldn't file_munge_filename() also be implemented as a FileUploadSanitizeNameEvent event subscriber?

pingwin4eg’s picture

The latest failing tests problem is with the pathinfo().

mike@mikeshiyan ~/code/drutest (8.7.x *+=) $ drush ev 'var_dump(pathinfo("Ü Å Ø äöüåøhello.txt"));'
array(4) {
  ["dirname"]=>
  string(1) "."
  ["basename"]=>
  string(26) " Å Ø äöüåøhello.txt"
  ["extension"]=>
  string(3) "txt"
  ["filename"]=>
  string(22) " Å Ø äöüåøhello"
}
mike@mikeshiyan ~/code/drutest (8.7.x *+=) $ php -r 'var_dump(pathinfo("Ü Å Ø äöüåøhello.txt"));'
array(4) {
  ["dirname"]=>
  string(1) "."
  ["basename"]=>
  string(28) "Ü Å Ø äöüåøhello.txt"
  ["extension"]=>
  string(3) "txt"
  ["filename"]=>
  string(24) "Ü Å Ø äöüåøhello"
}

Docs says:

Caution pathinfo() is locale aware, so for it to parse a path containing multibyte characters correctly, the matching locale must be set using the setlocale() function.

alexpott’s picture

@pingwin4eg yes but we call set_locale() on every run - see core/tests/bootstrap.php and \Drupal\Core\DrupalKernel::bootEnvironment()

What we the different locales you used in #231?

alexpott’s picture

@pingwin4eg what I see locally is:

> drush ev 'var_dump(setlocale(LC_ALL, \'0\')); var_dump(pathinfo("Ü Å Ø äöüåøhello.txt"));'
string(1) "C"
array(4) {
  ["dirname"]=>
  string(1) "."
  ["basename"]=>
  string(28) "Ü Å Ø äöüåøhello.txt"
  ["extension"]=>
  string(3) "txt"
  ["filename"]=>
  string(24) "Ü Å Ø äöüåøhello"
}
pingwin4eg’s picture

@alexpott Hmm, well that's weird, I have these:

mike@mikeshiyan ~/code/drutest (8.7.x *+=) $ drush ev 'var_dump(setlocale(LC_ALL, "0"));'
string(1) "C"
mike@mikeshiyan ~/code/drutest (8.7.x *+=) $ php -r 'var_dump(setlocale(LC_ALL, "0"));'
string(170) "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=C;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C"
alexpott’s picture

Status: Needs work » Needs review
StatusFileSize
new3.77 KB
new36.28 KB

Here's a fix - we've had this problem before :) See #278425: Using basename() is not locale safe

pingwin4eg’s picture

BTW IMHO the "Filename sanitization" fieldset in the config form should have some description that sanitization would apply when files are uploaded only, and config changes won't touch existing files.

alexpott’s picture

StatusFileSize
new1.86 KB
new36.5 KB

Edge case fixing and testing.

Plus there's a PHP bug here - see https://bugs.php.net/bug.php?id=77239

laravz’s picture

The second argument of basename() has default value "null". However, the patch sets an empty string when there's no extension in the function below:
$this->filename = $file_system->basename($filename, $this->extension !== '' ? '.' . $this->extension : '');

Do we want to change this? The output should be the same in either case.

dww’s picture

StatusFileSize
new36.85 KB
new4.76 KB

@alexpott + @pingwin4eg: Thanks for diagnosing the problem and fixing this!! Very very happy to wake up to a green test bot patch here. ;)

@wimleers:
#230.1: Okay, thanks for the sign-off.

#230.2: I hear you, but that feedback isn't really actionable. Yes, we have to inject two new services to make this work. The alternatives are:
A) Ignore REST and only sanitize filenames uploaded through the UI. I know you and many others won't like that.
B) Don't change FileUploadResource for this to properly DI the services, and just call \Drupal::service(). ;) I know the DeprecationFrenzy crowd won't like that, either. ;)
C) Embrace that Doing Things Right(tm) around here generally means making them more complex. :/

I'm grudgingly picking C for this. I don't see how A or B are viable. I don't know if there's option D that makes everyone happy. I doubt it.

Thanks for the link to #3021652: Deprecate upload-related functions file.inc and move to a file upload service. Agreed there's overlapping scope. I'm not a core file system maintainer, so I don't have all the open issues / efforts mapped in my mind. I'm just scratching an itch I have, and uncovering a lot of related pain. ;) I think if this issue gets RTBC and lands, we'll just have to re-shuffle this event dispatch in those other issue(s).

Shouldn't file_munge_filename() also be implemented as a FileUploadSanitizeNameEvent event subscriber?

See #192.2 @alexpott wanted this whole event to not touch the extension, and that file_munge_filename() was too important for security to allow this event to impact it at all. Hence, all the hoop-jumping with pathinfo() and only letting subscribers mess with the filename/basename (argh, I hate that all the terminology on this is inconsistent), and not the extension.

Re: #236: Sure, makes sense. Added this:

    '#description' => t('These settings only apply to new files as they are uploaded. Changes here do not affect existing file names.'),

How's that?

Re: #238: It doesn't really matter, since FileSystemInterface::basename() is checking

    if ($suffix) {

so either NULL or '' would do. But sure, why not? Now setting as NULL if there's no extension.

Other changes since #237:

1. I had a bit of a head-scratch WTF reading the new code in the event constructor to resolve #238, since basename() is a weirdly-named method in our FileSystem service, and it doesn't actually do anything like what PHP's basename() or pathinfo() does. :( So, I added this code comment to help others understand what we're doing and why:

+    // The second argument is a suffix to be stripped, in this case, the
+    // filename extension (including the dot) so that we more closely mimic what
+    // pathinfo() would return via PATHINFO_FILENAME if it wasn't buggy.

2. Renamed the EventSubscriber method we call from filenameSanitize() to sanitizeFilename() to put the verb first.

3. Added an Á to the front of a few of the earlier test cases to make sure that works, too.

4. Restored the Ä at the front of the related test cases. Stripping that out at #224 was a desperate attempt to understand the problem, but we might as well have it back now that we know that was a red herring.

5. Moved the umlaut-tastic "No transformations" case to the top of that little block of test cases.

Okay, I think we're back to "review and sign-off on all the strings" as the only renaming TODO in here. Jah be praised!! ;)

Cheers,
-Derek

dww’s picture

Fresh screenshots for the IS to aid in string / UI review, showing the fieldset description from #236.

krzysztof domański’s picture

Now "Replace non-alphanumeric characters with the selected character" is visible if "Transliterate" is enabled but is behind the "Replace non-alphanumeric characters with the selected character".

Minor: Maybe it should be next to "Transliterate"? This change does not affect the result so change only to improve UX.

This configuration is not often changed so this order is not very important.

"Transliterate"
- "Replace non-alphanumeric characters with the selected character"
"Replace whitespace with the selected character"
"Remove duplicate dots (..), underscores (__) or dashes (--)"
"Convert to lowercase"

dww’s picture

Re: #241: Thanks for looking, but I disagree with your suggestion. Although we don't explicitly say it, the order of the settings on this form maps to the order that the transformations are done. If we move "replace non-alpha" before "replace whitespace", someone might justifiably wonder if the replace whitespace setting would do anything, since the "previous" setting would have already replaced whitespace (which is also non-alphanumeric).

Frankly, now that it's entirely replace with no 'strip' option at all, I'm less concerned that the non-alpha setting *has* to depend on transliterate. But I don't want to undo that work now, and @alexpott was concerned about it, so I'm leaving it as-is.

Maybe the newly added fieldset description should mention that transformations are done in the same order as the settings. But I think that's probably too much complication and implementation detail to subject end-users to.

dww’s picture

Side note: It'd probably be nice to fix #2396145: Option #description_display for form element fieldset is not changing anything and set that on this fieldset description, instead of leaving it at the end. Seems better for the description to come before all the options for this fieldset.

krzysztof domański’s picture

An important test case is missing. The file name containing only unknown characters.

+++ b/core/modules/file/tests/src/Kernel/SanitizeNameTest.php
@@ -174,6 +174,11 @@ public function provideFilenames() {
         'foo..0',
         [FALSE, ' ', ' ', TRUE, FALSE],
       ],
+      'Only unknown characters' => [
+        '🙈🙈🙈.txt',
+        '.txt',
+        [TRUE, ' ', ' ', FALSE, FALSE],
+      ],
     ];
   }

This is security problem if transliteration is enabled. The new file begin with a dot is created.

Forbidden
You don't have permission to access /files/issues/2019-02-05/.txt on this server.
dww’s picture

I don't know what, if anything, we should do about #244. It's not actually a "security problem", IMHO. But yeah, maybe we should have a fallback in FileUploadSanitizeNameEvent::getFilenameWithExtension() that tries to do something else in this case. Something like:

if ($this->filename === '') {
  return '-' . ($this->extension !== '' ? '.' . $this->extension : '');
}

More or less.

I'd like feedback from @alexpott before we code that up and add your test case.

alexpott’s picture

StatusFileSize
new19.18 KB
new38.52 KB

@Krzysztof Domański - yep this is why we need to change $filename = $this->transliteration->transliterate($filename, $language_id, ''); to replace unknown characters with something.

Here's a solution that introduces a new setting - replacement character and turns the other replacement options into booleans. I feel that this is the most consistent approach:

  • The use-case for replacing spaces with _ and non alphanumeric with - is very small
  • If we went for the default unknown character (?) this could work with replace non-alphanumeric but is that was not selected '🙈🙈🙈.txt' would become '???.txt' and I don't think that that is desired because ? has special meaning in URLs so this would increase escape characters.
  • For me it makes the UI and effects very consistent - you tick from a list of actions which you want and then you select the character to use for replacements or when transliteration meets an unknown character.

I discussed this in slack with @dww who voiced some reservations and wanted UX feedback. I decided to not hide or disable the replacement character if none of the options that use it are ticked because the description takes care of this, hiding it from the user makes the behaviour of these options less knowable, and it does no harm.

Also I moved the fieldset description to the top of the fieldset because this information is important and it clashes with descriptions on the elements contained if placed there.

dww’s picture

Issue summary: View changes
StatusFileSize
new148.1 KB
new49.66 KB

Mostly looks great. A few thoughts/nits:

  1. +++ b/core/modules/file/file.module
    @@ -1737,3 +1763,92 @@ function _views_file_status($choice = NULL) {
    +  // Place a helpful description at the top of the fieldset. Prevents it
    +  // clashing with the description on the final form element.
    +  $form['filename_sanitization']['help'] = [
    +    '#type' => 'markup',
    +    '#prefix' => '<div class="description">',
    +    '#suffix' => '</div>',
    +    '#markup' => t('These settings only apply to new files as they are uploaded. Changes here do not affect existing file names.'),
    +  ];
    

    I'd add a @todo here pointing to #2396145: Option #description_display for form element fieldset is not changing anything that all this could be accomplished with '#description_display' => 'above', if that wasn't buggy.

  2. +++ b/core/modules/file/file.module
    @@ -1737,3 +1763,92 @@ function _views_file_status($choice = NULL) {
    +    '#description' => t('Character used when replacing whitespace, replacing non-alphanumeric or transliterating an unknown character.'),
    

    Maybe:

    "Character used when replacing whitespace, replacing non-alphanumeric characters or transliterating an unknown character.'"

    ? I know we say "character" a lot in that description, but it seems to read better this way.

  3. +++ b/core/modules/file/src/EventSubscriber/FileEventSubscriber.php
    @@ -0,0 +1,86 @@
    +  /**
    +   * Sanitize the filename of a file being uploaded.
    +   *
    +   * @param \Drupal\file\Event\FileUploadSanitizeNameEvent $event
    +   *   File upload sanitize name event.
    +   *
    +   * @see file_form_system_file_system_settings_alter()
    +   */
    +  public function sanitizeFilename(FileUploadSanitizeNameEvent $event) {
    

    Should we also @see the places the event is being dispatched, or is that overkill?

  4. +++ b/core/modules/layout_builder/tests/src/FunctionalJavascript/InlineBlockPrivateFilesTest.php
    @@ -139,7 +139,7 @@ public function testPrivateFiles() {
         $assert_session->pageTextContains('You are not authorized to access this page');
     
         $this->drupalGet('node/2/layout');
    -    $file4 = $this->createPrivateFile('drupal.txt');
    +    $file4 = $this->createPrivateFile('drupal_4.txt');
         $this->addInlineFileBlockToLayout('The file', $file4);
         $this->assertSaveLayout();
     
    

    I was always curious why we needed to change this test for all this to pass. Is this really still a thing?

Attached screenshots of the new UI to the summary. I think it's probably simpler than what we had before. Simple is generally better, especially for a UI. ;) But I left the pre-245 screenshots up for comparison, and added a remaining task on officially deciding which UI to move forward with.

dww’s picture

StatusFileSize
new38.58 KB
new1.06 KB

Fixes #247 1 and 2.

3 and 4 are still pending review/reply, but they're probably not needed for commit.

pingwin4eg’s picture

Following notes are not really critical, so I'm not changing an issue status. They are more like questions)

  1. +++ b/core/modules/file/file.module
    @@ -1061,6 +1072,21 @@ function _file_save_upload_single(\SplFileInfo $file_info, $form_field_name, $va
    +  // Update the filename with any changes as a result of the FileUploadEvent
    +  // (sanitization) or renaming due to an existing file.
    +  $file->setFilename(\Drupal::service('file_system')->basename($file->destination));
    +
    +  // If the filename has been modified, let the user know.
    +  if (strcmp($file->getFilename(), $original_filename)) {
    

    - Multiple getFilename usages. Can we add a new var for new filename here instead?
    - Why to use strcmp() instead of ===?

  2. +++ b/core/modules/file/file.module
    @@ -1739,3 +1765,94 @@ function _views_file_status($choice = NULL) {
    +  $form['filename_sanitization']['replacement_character'] = [
    +    '#type' => 'select',
    +    '#title' => t('Replacement character'),
    +    '#default_value' => $config->get('filename_sanitization.replacement_character'),
    +    '#options' => [
    +      '-' => t('Dash (-)'),
    +      '_' => t('Underscore (_)'),
    +    ],
    +    '#description' => t('Character used when replacing whitespace, replacing non-alphanumeric characters or transliterating unknown characters.'),
    +  ];
    

    Can we add an empty string to the replacement options?

  3. +++ b/core/modules/file/src/Event/FileUploadSanitizeNameEvent.php
    @@ -0,0 +1,126 @@
    +    $this->filename = $file_system->basename($filename, $this->extension !== '' ? '.' . $this->extension : NULL);
    

    Can we change this line to simply $this->filename = $filename if it has no extension?

alexpott’s picture

StatusFileSize
new4.72 KB
new38.3 KB

Re #247
2. We can remove the first "Character" from this sentence too... to give it less character :)
3. I think this is unnecessary.
4. - this is because of

  // Update the filename with any changes as a result of the FileUploadEvent
  // (sanitization) or renaming due to an existing file.
  $file->setFilename(\Drupal::service('file_system')->basename($file->destination));

And the file in the test being renamed due to the name being used before.

Re #249
1. I think using getFileName() is fine - it's a micro-optimisation that leads to more mental work for the programmer. Using !== is fine though
2. I don't think so. Firstly this leads %20 in URLs and secondly we have to work out what that means for replacing whitespace.
3. Good point. In fact doing this makes me realise that using the file system service is unnecessary we can remove the suffix ourselves.

pingwin4eg’s picture

@alexpott in the 2nd item I meant not the whitespace, but an empty string, which would simply remove unknown chars.

alexpott’s picture

Issue summary: View changes
StatusFileSize
new3.85 KB
new39.13 KB

Made SanitizeNameTest a unit test. @dww raised concerns that this would not have caught the pathinfo() PHP bug but I've written it in such a way that both \Drupal\file\EventSubscriber\FileEventSubscriber::sanitizeFilename() and \Drupal\file\Event\FileUploadSanitizeNameEvent::__construct() are covered and the test would catch that. \Drupal\Tests\file\Functional\SaveUploadTest::testSanitization covers the functional implementation and ensure thats the even is subscribed to properly.

We need to add test coverage for the REST changes - added this to the task list.

Also I've changed the task

Decide which settings UI we like better: #245 or what we had before.

to decide whether we want to be able to have different replacement characters for whitespace and non-alphanumeric replacements - as that is the use-case we need to decide if we design for. Personally I'm not convinced about the use-case for different replacement characters.

Also fixed the deduping when there is a dot on the end and we have an extension - so foo.....txt becomes foo.txt

Re #251 this has been discussed and removed before - see #244. It is too destructive.

alexpott’s picture

Issue summary: View changes
StatusFileSize
new1.19 KB
new40.32 KB

Now with REST test goodness.

alexpott’s picture

StatusFileSize
new1.79 KB
new40.88 KB

So adding the REST test shows me we have to defend against transliteration return an empty string.

drush ev "var_dump(\Drupal::service('transliteration')->transliterate('✓'));"
string(0) ""

Note that iconv() is no better for transliterating this - it returns FALSE. These type of edge cases are definitely a reason to keep this the string manipulation as non-destructive as possible.

alexpott’s picture

Status: Needs work » Needs review
StatusFileSize
new680 bytes
new41.22 KB

Now we're fixing obscure REST bugs too! I think this fix is totally in-scope because with this change we are massively extending test coverage of unicode characters in filenames.

pancho’s picture

Issue tags: +Needs usability review
StatusFileSize
new37.43 KB
new21.13 KB
new25.48 KB
new32.62 KB
new33.52 KB
new35.01 KB

Wow, this is so fast, I had to merge manually a few times.
Implementation is awesome now, but let's not jump over functionality and UI considerations.

#246-1:

The use-case for replacing spaces with _ and non alphanumeric with - is very small

I see that we need a solution for unknown Unicode characters, but my first thought was: no. But then, I figured out that it doesn't mean forcing all-dashes or all-underscores on the user - if someone (or a certain language) uses a lot of hyphens that are important to distinguish from spaces, just use underscore as replacement character.

Finally, after a proper transliteration, non-alphanumeric characters will in most cases be stuff that should really get stripped, as replacing and thereby preserving it in a way, doesn't convey any information. Without a proper transliteration, this however may not be the case and some German or Hungarian or Czech people will get bad results. And if even whitespace replacing is off, it will only get worse. But in that case dashes won't help more than underscores vice versa. So I'd even go farther and say: if at least whitespace replacing is on (which would be very recommended), let's just strip the remaining non-alphanumeric characters and not replace them by anything else.

For me it makes the UI and effects very consistent - you tick from a list of actions which you want and then you select the character to use for replacements or when transliteration meets an unknown character.

Well, the experienced Drupalista ticks from a list and is done. First impression is worse, and IMO it is worse than it was before, so tagging for usability review.

One aspect I'm concerned with, usability-wise, is the placement of the default replacement character select right at the bottom, underneath "convert to lowercase", so completely out of context. We should move it up, right below the relevant checkboxes, and have the last two checkboxes at the bottom. A long list of similar checkboxes are way harder to grasp than a more visually structured fieldset, even though visually neater.

We also need to make sure is the descriptions are as concise and helpful as possible. Once we figured out everything else, we should iterate over these one more time.

I'm adding another patch and a short screencast with a design proposal that follows my suggestions.

closed
open1
open2
select

pancho’s picture

But even then, we're unfortunately still not quite there IMHO. Of course we can't support every use case. But I still see two major deficiencies:

1.) There will be quite some companies/organizations/people with filename schemes that use parentheses () and/or brackets [] which is relatively acceptable in a filename. With these being stripped, many will turn off pretty much all of this, even though we really want sitebuilders to do a proper sanitization.

2.) Same holds for umlauts and quite some other diacritics. Wikipedia allows almost everything, for example https://en.wikipedia.org/wiki/Die_Ärzte_(1986_album), Wikimedia has https://commons.wikimedia.org/wiki/File:Die_Ärzte2012.svg, or https://commons.wikimedia.org/wiki/File:¿Cómo_te_quedas%3F.jpg (note: the question mark is an exception to what is possible).

Now of course people can turn off transliteration, and many will. But then they almost have to turn off stripping/replacing non-alphanumeric, too. Which leaves them with no protection at all. So I think for all those who consider a full transliteration too strict, we would either need a reasonable subset of "Strip remaining non-alphanumeric characters" (case 1) resp. of transliteration (case 2) that at least filters out totally unacceptable characters. Possibly ISO/IEC 8859-1 (= CP-1252 = latin1) as softer transliteration, and US-ASCII minus some disallowed characters as a softer replace? Just a thought.

3.) re pix in #258: now there clearly are too many individual recommendations. Now that we're properly informing the sitebuilder about every option, I guess, we should leave them all out.

alexpott’s picture

StatusFileSize
new41.22 KB
new41.22 KB

Re-uploading #257. @Pancho this is your second patch on this issue. Raising your concerns in comment rather than filing a patch and losing important test coverage would have been great. @dww and I have been working together on this issue and talking about in slack - please join us there and discuss before heading off in another direction.

+++ b/core/modules/file/tests/src/Unit/SanitizeNameTest.php
@@ -70,149 +67,118 @@
-      '✓ unicode' => [
-        '✓.txt',
-        '-.txt',
-        [TRUE, '-', FALSE, FALSE, FALSE, FALSE],
+        [TRUE, '-', FALSE, FALSE],
       ],
-      'Multiple ✓ unicode' => [
-        '✓✓✓.txt',
-        '---.txt',
-        [TRUE, '-', FALSE, FALSE, FALSE, FALSE],
-      ],

I'm sorry @Pancho but removing test coverage like this is upsetting.

Also we're not

let's not jump over functionality and UI considerations.

- @dww has raised this issue in the UX slack channel and has added screenshots all the time. The consideration is the replacing with nothing is not viable see all the previous comments - for example #244.

Of course we want to consider the UI. But please come up with your concerns about the current approach - but also please read the entire issue to make sure that you don't rehash old concerns that have been answered before.

pancho’s picture

Status: Needs review » Needs work
StatusFileSize
new37.21 KB
new2.41 KB
new23.29 KB
new23.29 KB
new31.45 KB
new31.94 KB
new33.83 KB

[x-post]
Fixed #259-3 and the non-alphanumeric description.

closed
open1
open2
select

[Patches #257 and #260 are going to fail like the ones before, but should not fail worse.]

pancho’s picture

Re #260:

@Pancho this is your second patch on this issue. [...] @dww and I have been working together on this issue and talking about in slack - please join us there and discuss before heading off in another direction.

If you prefer keeping everybody else out here and work everything out in a small circle, then please, just go ahead and piss off other contributors. I'm not gonna bother you on this issue anymore.

I'm sorry @Pancho but removing test coverage like this is upsetting.

Stop. I said

Wow, this is so fast, I had to merge manually a few times.

It's not okay to suggest I would have skipped these 3 lines on purpose. I had to merge my patch manually several times.

alexpott’s picture

Issue summary: View changes
Status: Needs work » Needs review
StatusFileSize
new41.22 KB
new41.22 KB
new128.17 KB
new174.38 KB

@Pancho why do think I want to stop any contribution? I've tried my best to address all the feedback and concerns raised in this issue. But it would be great to talk before changing direction again. Especially when some of the points you raise have been raised and thought about before on the issue.

Here's my thoughts on #258.

Finally, after a proper transliteration, non-alphanumeric characters will in most cases be stuff that should really get stripped, as replacing and thereby preserving it in a way, doesn't convey any information. Without a proper transliteration, this however may not be the case and some German or Hungarian or Czech people will get bad results. And if even whitespace replacing is off, it will only get worse. But in that case dashes won't help more than underscores vice versa. So I'd even go farther and say: if at least whitespace replacing is on (which would be very recommended), let's just strip the remaining non-alphanumeric characters and not replace them by anything else.

Please supply example strings so that we can test to see what happens - real world examples would be great.

One aspect I'm concerned with, usability-wise, is the placement of the default replacement character select right at the bottom, underneath "convert to lowercase", so completely out of context. We should move it up, right below the relevant checkboxes, and have the last two checkboxes at the bottom. A long list of similar checkboxes are way harder to grasp than a more visually structured fieldset, even though visually neater.

Sure we can move the select. But personally I find putting the select in the middle of the list of checkboxes hard to parse. See the two screenshots. This is because of how the select list gets a header.

Re #259

There will be quite some companies/organizations/people with filename schemes that use parentheses () and/or brackets [] which is relatively acceptable in a filename. With these being stripped, many will turn off pretty much all of this, even though we really want sitebuilders to do a proper sanitization.

We're introducing an event that contrib and custom code can you to enforce whatever company policy you like - there's no way we can produce a 1 size fits all here.

Same holds for umlauts and quite some other diacritics. Wikipedia allows almost everything, for example

If you want that do not turn on transliteration as you suggest.

Which leaves them with no protection at all.

Protection from what exactly? This feature is to allow you to standardise filenames on your site - it's not a security feature per se.

pancho’s picture

StatusFileSize
new37.21 KB
new2.41 KB
new23.29 KB
new23.29 KB
new31.45 KB
new31.94 KB
new33.83 KB
new37.79 KB
new1.11 KB

1.)

Especially when some of the points you raise have been raised and thought about before on the issue.

Have they? I've been following this issue for quite some time.

2.)

Please supply example strings so that we can test to see what happens - real world examples would be great.

Not as long as the technical implementation is moving so fast and I get the feeling you're already settled feature- and UI-wise.

3.)

We're introducing an event that contrib and custom code can you to enforce whatever company policy you like - there's no way we can produce a 1 size fits all here.

Nope, but a reasonable standard that covers many if not most common usecases. German umlauts, French or Czech diacritics are common. Parentheses and brackets in filenames, too.

4.)

it's not a security feature per se

We're still posting a message:

'For security reasons, your upload has been renamed to %filename.'

5.)
Readding 2 missing tests and @alexpott's REST bugfix from #257 which he reposted in #260.
Unsubscribing now to no longer bother.

alexpott’s picture

@Pancho the message For security reasons, your upload has been renamed to %filename. it only emitted when the fiel is renamed for security reasons. I.e. when it contains an unsafe extension. if that's not the case it says Your upload has been renamed to %filename.

Thinking about this some more - I'm beginning to ponder if we're going about this all wrong. As @Pancho points out Wikipedia allows umlauts in its URLs. And if you want a file called '🙈🙈🙈.txt' should Drupal get in the way http://example.com/sites/all/files/🙈🙈🙈.txt is going to work just fine. So perhaps we need a re-think. Is using transliteration the correct approach? Maybe we could focus on things that cause known problems which seem to be:

If we add the event then contrib and custom can experiment with transliteration and work out the pitfalls of much there seem to be many. People have already raised concerns about Japanese and Chinese filenames with this turned on.

alexpott’s picture

      'Chinese' => [
        '顶顶顶顶.txt',
        'dingdingdingding.txt',
        [TRUE, '-', FALSE, FALSE, FALSE, FALSE],
      ],
      'Chinese 2' => [
        '例.txt',
        'li.txt',
        [TRUE, '-', FALSE, FALSE, FALSE, FALSE],
      ],
      'Arabic' => [
        'مثال.txt',
        'mthal.txt',
        [TRUE, '-', FALSE, FALSE, FALSE, FALSE],
      ],

Decided to test some of the assumptions about turning on transliteration and non-European texts. It seems our transliteration library can cope with this.

alexpott’s picture

@Pancho I am sorry that my actions have caused you to quit the issue and I'm sorry I got upset with tests going missing. It is hard when there are so many comments and patches to an issue. I like to post incremental progress so people can follow along and see how I've arrived at the patches I'm posting. Big bang patches make that more difficult. Your contributions are valuable and continue to inform my thinking on this issue.

@Pancho it is your comment about umlauts and other diacritics in #259 that has convinced me we've pursuing the wrong path here.

After discussing this with others including @Berdir (who maintains large multi-lingual sites) and @Wim Leers (who maintains the REST API) I'm more convinced that transliterating filenames is going to cause more problems than it solves. For example, the current implementation will do different transliterations depending on the language the site is currently in!?!? For example if the UI is in english Ä becomes A but if it is German it becomes Ae - this type of magic is very hard to explain to users. Also it prevents us from solving one of the problems transliteration was supposed to solve - ie. basically the same filenames but not.

Therefore I think we should go with something along the lines of #266 and having the following options:

  • Replacement character - either - an underscore or a dash
  • Replace whitespace - boolean
  • Replace reserved characters - boolean
  • De-duplicate separators - boolean
  • Convert to lowercase - boolean

Open questions for me are:

  • Do we need both "Replace whitespace" and "Replace reserved characters" - or can we combine. For me they should be separate and we should consider enabling the "Replace whitespace" on new sites and this requirement is very normal.
  • Do we really need the deduplicating option? I'm not sure.
frob’s picture

I think #266 is on the right track. While @Pancho was a bit heavy handed in how suggestions where made, I think they are the right suggestions to make. Further enhancements belong in a contrib module, such as pathauto. It might be a good idea to add a test module to ensure that the event is sufficient for making those changes as designed.

One thing however, @alexpott I don't think it is helpful to constantly refer people to Slack. I know this stuff has been discussed, in this issue's comments even, but I am not going to get onto Slack to discuss this and if that is a requirement for my continued (marginal) contribution then my contribution will decrease even further. I understand the need to sometimes get on chat and discuss an issue quickly, that is fine and I have done that on IRC, but in the end the discussion needs to be resolved here. Otherwise we will be marginalizing a sub-set of our community.

dww’s picture

Status: Needs work » Needs review

Wow, 23 comments in short order. Not just the implementation moving fast here. ;) Alas it's gotten so triggered/charged.

#250 247.2: Great, love it. :)
#250.3: Love it! Very happy not to pass the file system service around. Way better. Yay.

#252 Unit test: that was easier than I thought it'd be. :) I'm down. Thanks.
UI: okay, sounds right.
Dedupe: Nice fix.

#257: Hah, nice! Yes to in scope.

#258:

Wow, this is so fast, I had to merge manually a few times.
Implementation is awesome now, but let's not jump over functionality and UI considerations.

Yes, indeed!
I'm still open to some variant of the earlier UI that allowed different replacement characters. I just worry it gets too complex and confusing.
I'm still not convinced the "replace other non-alpha" setting should require transliterate being on. I don't like forcing it to be "strip". I think replace is better.
I'm open to "strip" as a "replace" option, but that gets confusing. See above.

#259 Yes, contrib can now listen to this event and do what it needs. Yes, sympathetic to the UI concerns.

#260: For the record: I'm happy to discuss here, Slack, IRC, or potentially audio. I don't think it's a requirement to get things done, that's what we have the issue queues for.

I'm sorry @Pancho but removing test coverage like this is upsetting.

It was an honest mistake. I easily could have done the same. Any of us can when patches are flying around like this. Let's give each other the benefit of the doubt and all try to stay calm and relaxed. Thanks.

#262: Alas. Wow, that escalated fast. Sorry you got triggered. I don't think anyone is trying to exclude anyone here. We're desperately trying to get more input on this issue so we can actually commit it and move on.

#265.1:

I've been following this issue for quite some time.

Yes, but you are rehashing various points we've raised and addressed already, so it gives the impression you haven't. Easy to do on huge complex issues like this. :/

#265.2

Not as long as the technical implementation is moving so fast and I get the feeling you're already settled feature- and UI-wise.

I don't think we're settled feature and UI wise, that's why all that is still in the remaining tasks section of the summary.

#265.4: Yeah, you misread the patch. The "for security" part is preserving an existing message. See above.

Unsubscribing now to no longer bother.

Alas. :( Ugh, sorry it came to that.

#270:

@Pancho I am sorry that my actions have caused you to quit the issue and I'm sorry I got upset with tests going missing. It is hard when there are so many comments and patches to an issue. I like to post incremental progress so people can follow along and see how I've arrived at the patches I'm posting. Big bang patches make that more difficult. Your contributions are valuable and continue to inform my thinking on this issue.

Indeed!

After discussing this with others including @Berdir (who maintains large multi-lingual sites) and @Wim Leers (who maintains the REST API) I'm more convinced that transliterating filenames is going to cause more problems than it solves. For example, the current implementation will do different transliterations depending on the language the site is currently in!?!? For example if the UI is in english Ä becomes A but if it is German it becomes Ae - this type of magic is very hard to explain to users. Also it prevents us from solving one of the problems transliteration was supposed to solve - ie. basically the same filenames but not.

Argh! :) This issue started as "start using transliterate for filenames", and now y'all have convinced yourselves we don't need it at all. Fun.

Re: language - ugh, I was just trying to "be responsible" and "do the right thing" to keep all the i18n folks happy. I'd be fine with a setting for this to either pick a site-wide language for transliteration or to use the current language of the upload. Maybe that's too confusing.

I'm also ok for *none* of this to have a UI. If we tell people to configure their site manually via settings.php or their site config, that's fine with me. I'm happy to write reasonable defaults and comments about it, instead of crafting and arguing about the UI. We already tell people they have to define their paths this way. Happy to have them enable and tweak filename sanitization at the same time.

Do we need both "Replace whitespace" and "Replace reserved characters" - or can we combine. For me they should be separate and we should consider enabling the "Replace whitespace" on new sites and this requirement is very normal.

"Need" is a strong word ;) but "want" I think is "yes". Separate is good.

Do we really need the deduplicating option? I'm not sure.

Again: need: no, want: yes. We already have it, why throw it out now? It's like our time and life energy is totally free and expendable.

#271: Agreed, thanks for voicing that.

I still think there needs to be an option to enable transliteration for sites that know they want it. I'm going to be an incredibly sad panda if I end up having to maintain a D8 contrib to add that, after coming this far to simply putting something in file.settings.yml for this from now on. Please don't completely change course from how this issue started.

So, maybe a no-UI file.settings.filename_sanitization array that lets people do what they want?

Back to UI review/debates?

Follow-up issue to do what this issue originally set out to do, 270+ comments ago?

I don't exactly want to re-re-upload pat h #257, but I don't want the d.o bot to be confused, and I don't want to add my own patch to the mix until we decide on a direction. Tentatively setting back to NR.

Agreed with 'Needs usability review' tag.

Thanks,
-Derek

dww’s picture

Issue summary: View changes

Attempt at collecting everything needed for the 'Needs usability review' into the summary. Pasting here as a ul in case that's easier to see:

  • Decide if we're allowing transliteration as an option(!). See #272.
  • Decide if we should let people pick a transliteration language. See #270 and #272.
  • Usability review:
    • Decide if we want a UI at all, or if sites should simply configure this via settings.php (like they do for the public + private files dir paths). See #272.
    • If yes to a UI:
      • Decide whether we want to be able to have different replacement characters for whitespace, non-alphanumeric characters, and transliteration unknown characters. See #180, #246, #258, #261, and others...
      • If we have it as a separate / shared setting, decide if the "replacement character" setting should be "near" the checkboxes that provide replacement or at the end of the fieldset (see #263 for comparisons)
      • Decide if "replace non-alphanumeric" should be "near" transliterate option. See #241 and #242.
      • ...
  • Final string/UI review + signoff...
  • ...
dww’s picture

Issue summary: View changes

Whoops.

alexpott’s picture

I've opened two new issues to try to split this up a bit and make it so this issue can focus on transliteration and maybe the the UI for sanitization options for the user.

pancho’s picture

Title: Provide options to sanitize filenames (transliterate, lowercase, replace whitespace, etc) » [PP-1] Provide options to sanitize filenames (transliterate, lowercase, replace whitespace, etc)
Status: Needs review » Postponed

Postponing on #3032390: Add an event to sanitize filenames during upload. Let's get the event in first, then see how we want to use it.

nlisgo’s picture

StatusFileSize
new5.85 MB

The most recent patch did not apply after upgrading to 8.6.13 so re-rolling.

To-do: reintroduce test adjustment to /core/modules/file/tests/src/Functional/FileFieldWidgetTest.php

nlisgo’s picture

StatusFileSize
new114.98 KB

The above patch was for the current 8.6.x branch but it didn't apply to 8.6.13 so uploading a patch for that.

eric.guerin@ucsf.edu’s picture

@nlisgo I tried patch 278 but the patch failed still. That patch also seems to have way more in it then it should for just the file sanitation. Such as the composer.json etc.

pavelculacov’s picture

@nlisgo patch-279 working on Russian, Romanian language. Tested

nlisgo’s picture

@gr8tkicks apologies. Obviously screwed something up on the re-roll for the dev branch. The #279 patch should apply to 8.6.13. As this is marked postponed I will not bother to submit a re-roll on the dev branch at this moment in time.

jcisio’s picture

StatusFileSize
new9.74 KB

While #279 applies on 8.6.13, it contains a few schema changes and hook_update_N (as in the latest patch). Here is an updated patch of #128 which doesn't have any hook_update_N functions (which could be easily in conflict with any future changes).

mitrpaka’s picture

StatusFileSize
new9.95 KB

Updated patch from #283 to apply with 8.7.0 (released 1 day ago).

mitrpaka’s picture

StatusFileSize
new9.96 KB

Fixed failing test case for transliteration.

andreasderijcke’s picture

The actual sanitization code is usefull for other (custom) modules.
I recommend making this available as a service.

Quicksaver’s picture

StatusFileSize
new39.04 KB

FOR MY CASE I had custom code that highly depended on the functionality of patch #279, so I made a direct re-roll of that to apply for 8.7.x. Once again, please note that this was for my specific case only. I'm making this patch available in case someone else might also benefit from it.

(I have also removed a couple of .orig files from it, seemed like leftovers from previous dev work.)

I have been trying to follow this discussion, but I admit I have now become completely lost. For instance, I don't understand why patch #283 (and all subsequent patches) was was based off of patch #128; a ton of work had been lost there. Before merging this in (eventually), it may be worth trying to retrace all the steps for the last few months and see what's missing and what's the current consensus for the currently applied functionality.

dww’s picture

This issue is blocked on #3032390: Add an event to sanitize filenames during upload. None of the patches here are valid/relevant anymore. In the other issue, we've significantly moved past what's here. Please do not re-roll this or continue development. We need to put the energy into #3032390 instead.

Therefore, the summary (and scope) of this issue are now stale. Initial attempt to update the summary to point to #3032390. Adding that as the true parent issue for this (and moving the previous parent, #1842718: Use new Transliteration functionality in core for machine names as a related issue).

Re: #287 -- absolutely right. #283 and beyond confuse the issue and aren't helpful. Please stop. ;) Please help get #3032390 done, instead.

Thanks,
-Derek

jcisio’s picture

Re @dww @Quicksaver: as I explained in #283, that patch is supposed to be a safe patch to use with Composer, when the future of this issue is not clear. Sorry that it confused you.

avpaderno’s picture

Issue summary: View changes
frob’s picture

@dww should this issue be closed as a duplicate if #3032390: Add an event to sanitize filenames during upload has superseded this issue? I thought that this issue was about building the sanitation service and admin UI and #3032390: Add an event to sanitize filenames during upload was about adding the events that would be needed for this issue to do the file name sanitation.

dww’s picture

@frob re: #292

should this issue be closed as a duplicate if #3032390: Add an event to sanitize filenames during upload has superseded this issue?

Sadly, no.

I thought that this issue was about building the sanitation service and admin UI and #3032390 was about adding the events that would be needed for this issue to do the file name sanitation.

Correct. But we can't build the sanitation service and admin UI without the event that would be dispatched which will allow this to work. Hence postponed and "[PP-1]" in the title.

Re: #283 and other attempts to re-roll this to be "composer-ready". I feel your pain. I understand why you do that (I do it, too). But in this case, it's really confusing things. To give y'all a public place to put your re-rolls without causing any trouble in here, I just opened a child issue:

#3094052: [IGNORE] Composer-friendly patches for filename sanitation while #2492171 awaits #3032390

Please post composer-friendly re-rolls there, not here.

Thanks!
-Derek

p.s. Removing some now bogus tags.

nlisgo’s picture

StatusFileSize
new9.95 KB

This is a re-roll of the patch in #285.

dww’s picture

@nlisgo: Again, please post such patches at #3094052: [IGNORE] Composer-friendly patches for filename sanitation while #2492171 awaits #3032390 not here.

@all: I'm closing comments on this issue until the blocker is fixed. Apparently folks neither read the first sentence of the summary nor the comments, so the best solution is to disable commenting entirely. Meanwhile, #3094052 can be the playground for rerolls / backports / etc.

Thanks,
-Derek

dww’s picture

Actually closing comments.

alexpott’s picture

The blocker is in.

alexpott’s picture

Issue summary: View changes
Status: Needs work » Needs review

Here's a reroll of #263 on top of #3032390: Add an event to sanitize filenames during upload.

We need to go through the discussions from around that time to ensure all opinions and questions are accounted for. #272 and #273 look like good places to start as they attempt to summarise the preceding discussion.

I rerolled #263 because it was the last green patch prior to 3032390 and has the most features / test coverage. The patch itself is not necessary what I think the final patch should be. We need a usability review and research into what we really should do. I've rerolled it to get the ball rolling again. Also I've used a merge request because we're near 300 comments and I don't want the pager :)

I asked @daniel.bosen from the Thunder team which thunder ships with functionality to transliterated filenames. #146 explains it pretty well. An http client that quite a few mobile apps use breaks when image file name have umlauts in them.

alexpott’s picture

Issue summary: View changes

I've updated the issue summary to reflect more the current state.

frob’s picture

What does the update function need to do? Set some configuration defaults?

berdir’s picture

Yes, exactly. I would say the update function should ensure that the current behavior of existing sites remains unchanged for BC, whereas we need to decide what the recommended/default behavior should be.

xmacinfo’s picture

> whereas we need to decide what the recommended/default behavior should be.

What are the options for the “recommended/default behavior”?

Version: 9.2.x-dev » 9.3.x-dev

Drupal 9.2.0-alpha1 will be released the week of May 3, 2021, which means new developments and disruptive changes should now be targeted for the 9.3.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

oakulm’s picture

Tried to apply patch patch at #294 to 8.9.x but patching fails. There seems to be similar work already in core https://www.drupal.org/node/2972665 but the settings are not visible in admin/config/media/file-system so this is really confusing.. Anything I can do to help nudge this forward? Tried to reroll but got stuck while trying to understand what's going on.

imclean’s picture

@oakulm the link you provided is to the draft change record for this issue. It's what this issue is trying to achieve, it isn't in core yet.

AshM’s picture

StatusFileSize
new3.83 KB
AshM’s picture

StatusFileSize
new3.83 KB
gauravvvv’s picture

StatusFileSize
new3.83 KB
new517 bytes

Patch #310, re-rolled. Attached interdiff for same.

phily’s picture

Above patch #311 seems to be working for me using Drupal 9.2.2 (English & French).
Thanks

sutharsan’s picture

Note that patches in #309 to #311 do NOT replace the Merge request 370 nor an addition to it. At best they are a summary. I recommend to ignore them.

sokru’s picture

Status: Needs review » Needs work
Issue tags: -Needs issue summary update

Changing status to Needs work, because there's unresolved threads on merge request 370.

thetailwind’s picture

I've been looking into this fix. How would replacing a file in an existing media entity work? For example, with this module (which really should be core functionality) https://www.drupal.org/project/media_entity_file_replace?

Assume a dirty file has been uploaded to an entity. All URL references to said entity would be based off of the dirty filename. How would this behave after this functionality has been enabled but when a user uploads a replacement file?

Thanks!

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.0-rc1 was released on November 26, 2021, which means new developments and disruptive changes should now be targeted for the 9.4.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

kovtunos’s picture

Patch doesn't work for Drupal 9.3+
Can you please update it?

collinhaines made their first commit to this issue’s fork.

collinhaines’s picture

I pushed a couple commits to MR 370 to resolve the missing constructor parameter documentation and prevent PHP notices for the config array.

Did we ever decide on what the update hook is going to do? My assumption is to add the new default config values found in file.settings.yml to the database.

dajka’s picture

Thank you all who are working on this!
If I were to use a patch in production, which (if any) do you recommend?

sutharsan’s picture

@Dajka, using patches from an open issue is always at your own risk. No upgrade path between patches is or will be available an support is minimal or absent. Only use patches if you are comfortable with code and if you are prepared to solve any problem yourself if that may arise.
If you decide to use a patch, use the most mature that is available, usually the latest. If you apply patches to your code I highly recommend to let Composer (composer-patches plug-in) do it for you.

frob’s picture

I would never use a patch from an issue that has been lingering for 7 years. You could try contrib. There is an old module "File field paths" that had this functionality in it.

dercheffe’s picture

I prefer https://www.drupal.org/project/filename_transliteration instead of file field paths module.

Version: 9.4.x-dev » 9.5.x-dev

Drupal 9.4.0-alpha1 was released on May 6, 2022, which means new developments and disruptive changes should now be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.5.x-dev » 10.1.x-dev

Drupal 9.5.0-beta2 and Drupal 10.0.0-beta2 were released on September 29, 2022, which means new developments and disruptive changes should now be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

ressa’s picture

Thanks for sharing that @dercheffe. Until this issue gets filename transliteration in core, here are two issues discussing the differences between Filename Transliteration and Transliterate filenames:

sokru’s picture

Status: Needs work » Needs review
StatusFileSize
new27.67 KB

The threads on MR (370) are now resolved, attached a reroll patch to 10.1.x branch. Tested the patch manually.

sokru’s picture

StatusFileSize
new27.87 KB

This should fix the CSpell nags, coding standards and one mistake I made during the reroll.

sokru’s picture

StatusFileSize
new9.96 KB

Missing interdiff.

Status: Needs review » Needs work

The last submitted patch, 329: 2492171-329.patch, failed testing. View results

sokru’s picture

Status: Needs work » Needs review
StatusFileSize
new868.27 KB
new382.68 KB
new27.81 KB
new2.68 KB

This will fix the tests, BUT I had to alter the tests, since the tests assumed lowercase alteration could not be done without transliteration.

Snippet from core/modules/file/src/EventSubscriber/FileEventSubscriber.php

  public function sanitizeFilename(FileUploadSanitizeNameEvent $event) {
    // Don't reset the filename if transliteration is off, let others handle it.
    $transliterate = $this->config->get('filename_sanitization.transliterate');
    if (!$transliterate) {
      return;
    }
    ...

Also if the current way is desired workflow, I think we should use States API also for other options like this:
Transliteration unchecked:

Transliteration checked:
Transliteration is enabled

sokru’s picture

Issue tags: +Needs usability review
sokru’s picture

StatusFileSize
new28.39 KB
new2.01 KB

@catch commented on Slack that using the States API as described on #332 would be good approach, so updating the patch accordingly.

sokru’s picture

Issue summary: View changes
StatusFileSize
new139.5 KB
sokru’s picture

Issue summary: View changes

Cleaned up the Issue summary for the Usability review.

jungle’s picture

+++ b/core/modules/file/tests/src/Unit/SanitizeNameTest.php
@@ -0,0 +1,228 @@
+        'S  Pácê--táb#	#--🙈.jpg',
+        's-pace-tab-.jpg',
...
+        '🙈🙈🙈.txt',
+        '---.txt',

Per the tests, it's not perfect. Can we introduce Symfony\Component\String\Slugger\AsciiSlugger?

  $slugger = new AsciiSlugger();
  $slugger = $slugger->withEmoji();
  // S-Pace-tab-see-no-evil-monkey
  $slugger->slug('S  Pácê--táb#	#--🙈', '-', 'en');
  // S-Pace-tab-singe-ne-rien-voir
  $slugger->slug('S  Pácê--táb#	#--🙈', '-', 'fr');
  // S-Pace-tab-fei-li-wu-shi
  $slugger->slug('S  Pácê--táb#	#--🙈', '-', 'zh');
  // see-no-evil-monkeysee-no-evil-monkeysee-no-evil-monkey
  $slugger->slug('🙈🙈🙈', '-', 'en');
  // singe-ne-rien-voirsinge-ne-rien-voirsinge-ne-rien-voir
  $slugger->slug('🙈🙈🙈', '-', 'fr');
  // fei-li-wu-shi-fei-li-wu-shi-fei-li-wu-shi
  $slugger->slug('🙈🙈🙈', '-', 'zh');

See https://symfony.com/doc/current/components/string.html#slug-emojis

You cannot use the "Symfony\Component\String\Slugger\AsciiSlugger::withEmoji()" method as the "symfony/intl" package is not installed. Try running "composer require symfony/intl".

Note: symfony/intl is required.

simohell’s picture

Issue tags: -Needs usability review

Usability review

We discussed this issue at #3335028: Drupal Usability Meeting 2023-01-27. That issue will have a link to a recording of the meeting.

For the record, the attendees at this usability meeting were @BlackBamboo, @benjifisher, @ckrina, @iszabo, @shaal, and @simohell.

In the review we agreed, that

  • The setting for Transliterate should not set other options' visibility as there is no obvious hierarchy. Instead it is recommended to use an HTML details-element to contain all the settings.
  • The title would be better as a slightly shorter version "Sanitize filenames".
  • Removing duplicate dots etc. should state the precise action. We assume the actual action is "Replace duplicate dots (..), underscores (__) or dashes (--) with a single one."
  • Sticking to a single replacement character keeps the settings simple and therefore usable. The settings are easier to read if the replacement character is in the end.
sokru’s picture

Issue summary: View changes
StatusFileSize
new6.11 KB
new27.01 KB
new193.25 KB
new19.87 KB

- Updated the interface based on usability review. Made the details open by default. Screenshots updated to summary.
- Based on usability review I also updated the code and tests so its possible to only do specific sanitizations (eg. only lowercase).
- Removed the mentioned PHP bug workaround on FileEventSubscriber::sanitizeFilename, since its only on PHP 7.4, not 8.x

For a suggestion on #337 about using AsciiSlugger, I think it would be relevant if core would use it also on \Drupal\Component\Transliteration. So suggesting leaving it to follow-up issue.

smokris’s picture

Issue summary: View changes

(Fixing screenshot filename in issue summary.)

sokru’s picture

StatusFileSize
new867 bytes
new27.01 KB

Fixing the CSpell issue.

rinku jacob 13’s picture

StatusFileSize
new214.28 KB

Hi @sokru , Reviewed the patch #341 for drupal version 10.1.x. It was successfully applied .Adding screenshots for the reference. Need RTBC+1
sanitize

rkoller’s picture

StatusFileSize
new644.12 KB

I've tested the patch in #341 on a drupal 10.1.x-dev install as well. overall the feature looks great! the only nitpick i would have is when you are using a screenreader (voiceover on macos 12.6.1 in my case). in my current verbosity text setting (see the linked video) i have activated the default option some. if you switch to the option all the information becomes even more granular. but as you can see in the current video, with the option some set, underscore gets announced twice for the first sentence and in the second second sentence three times. for sighted users i consider it a good thing to exemplify and illustrate what dot, underscore and dashes stand for, but in the aural interface it is sort of redundant for some or all characters depending on the set verbosity level. so i wonder if it would make sense to wrap the parenthesis with dots, underscores, and dashes in spans and hide them from the screenreader with something like aria-hidden="true"?

jwilson3’s picture

Status: Needs review » Needs work
  1. +++ b/core/modules/file/file.module
    @@ -1331,3 +1331,80 @@ function file_get_file_references(FileInterface $file, FieldDefinitionInterface
    +    '#title' => t('Remove duplicate dots (..), underscores (__) or dashes (--) with a single one'),
    

    Text should be:

    "Replace duplicate dots (..), underscores (__) or dashes (--) with a single one."
    

    i.e., swap "Remove" with "Replace"

  2. IMO, there is another feature that could be supported. "Replace sequences of dots, underscores and dashes with a single one". So that stuff like: _-_ would be simplified to -.

pooja saraah’s picture

StatusFileSize
new27.44 KB
new559 bytes

Addressed the comment on #344
Attached patch against Drupal 10.1.x

poker10’s picture

Patch #345 does not apply.

I would also prefer not to change the quotes to double quotes (as there are single quotes used everywhere).

-    '#title' => t('Remove duplicate dots (..), underscores (__) or dashes (--) with a single one'),
+    '#title' => t("Replace duplicate dots (..), underscores (__) or dashes (--) with a single one."),
pooja saraah’s picture

StatusFileSize
new30.21 KB
new559 bytes

Addressed the comment on #346
Attached patch against Drupal 10.1.x

pooja saraah’s picture

Status: Needs work » Needs review
Ratan Priya’s picture

StatusFileSize
new27.01 KB
new679 bytes

Addressed #344.
Needs review.

sokru’s picture

Issue summary: View changes
StatusFileSize
new1.16 KB
new27.19 KB

Removing the dot at end of title and addressing #343, will ask feedback from #ux about the change.

#344.2 "IMO, there is another feature that could be supported. "Replace sequences of dots, underscores and dashes with a single one". So that stuff like: _-_ would be simplified to -."

Not sure how common this replacement pattern is. IMHO I'd leave it to follow-up or to contrib (eg. "regexp filename transliteration") module.

andypost’s picture

@sokru this kind of filters better leave for contrib a-la https://www.drupal.org/project/smiley

+++ b/core/modules/file/file.module
@@ -1372,12 +1372,12 @@
-    '#description' => t('Alphanumeric characters, dots (.), underscores (_) and dashes (-) are preserved.'),
+    '#description' => t('Alphanumeric characters, dots <span aria-hidden="true">(.)</span>, underscores <span aria-hidden="true">(_)</span> and dashes <span aria-hidden="true">(-)</span> are preserved.'),

not sure potx can properly parse it as it contains attributes

sokru’s picture

@andypost, but there's plenty of strings like

Permissions <span class="visually-hidden">for @module</span>

in core already. I tried exporting string with aria-hidden attribute with drupal/potx module and it exported the string nicely.

smustgrave’s picture

Status: Needs review » Needs work
Issue tags: +Needs Review Queue Initiative

Patch no longer applies cleanly - Will move to NW for that.

error: patch failed: core/modules/file/file.post_update.php:13
error: core/modules/file/file.post_update.php: patch does not apply

but fixed manually for testing (didn't upload as I didn't want to take myself out of the review role)

This was tagged for issue rescope but that was 4 years ago, is that still needed? - Should be removed or update before moving to RTBC

Taking a look at the remaining tasks.

Final string/UI review

Has this been completed by UX team?

Manually testing
post_update hook ran without issue
Enabled all settings in File system.
On Drupal 10.1 with a Standard install
Edited Article content type image field to be unlimited
Uploaded these files and were transformed to
Test File!.png => test-file.png
Test File.png => test-file_0.png
Test ..--__.png => test-.-_.png
Test .. File.png => test-.-file.png

Switched to underscores vs dashes
Test File!.png => test_file.png
Test File.png => test_file_0.png
Test ..--__.png =>test_.-_.png May be a non issue but there was a dash left over
Test .. File.png => test_._file.png

ravi.shankar’s picture

Status: Needs work » Needs review
StatusFileSize
new27.18 KB
new706 bytes

Added reroll of patch #350.

rkoller’s picture

re #353 the results of the ux meeting were in #338 and the recommended changes were applied in #339

the only detail left is #343. i was unable to attend the ux meeting on that day but have taken a look at the issue afterwards and just noticed the redundancy i've provided a video and a suggestion for a potential fix for.

@sokru already asked over in the accessibility channel for maintainers feedback in that regard to make sure there aren't any downsides: https://drupal.slack.com/archives/C2ANFUGGG/p1677133509757059

smustgrave’s picture

Can the issue summary be updated then please

rkoller’s picture

you mean adding a task to the list or remaining tasks? something like 13. Avoid the announcement of redundant text by a screenreader (ref #343)? or would you recommend (also) another step?

smustgrave’s picture

Status: Needs review » Needs work

Yup should be added to remaining tasks exactly how you said it, that's perfect.

Think the quickest fix would be to remove the examples of dots, underscores, and dashes completely but not sure if that's the desired fix.

Moving to NW as this fix needs to be done before review.

rkoller’s picture

Issue summary: View changes
Status: Needs work » Needs review

thanks i am always a bit hesitant updating issue summaries.

and removing those examples would have a downside for sighted users. by using those examples you are able to skim those punctuation character in a glimps of an eye instead of reading and processing their names. and it also provides a certain level of convenience for none native speakers who are using the admin interface in english. having examples for those punctuation characters is way more clear. if possible the micro copy for sighted users should be left as is.

and i've added the point to the issue summary.

rkoller’s picture

Status: Needs review » Needs work

woops somehow switched to needs review by accident. setting back to needs work. sorry

jwilson3’s picture

I wonder if this whole problem of accessibility, strings, and duplicate character handling could be simplified by following what filename_transliteration contrib module does:

Reduce duplicate separators - Sequences of underscores, dashes, and periods are simplified to a single character.

No need for confusing explanations and examples that must be hidden in our Ui strings, like (—), (..), etc. And handles more cases ultimately resulting in easier-to-read file names.

smustgrave’s picture

Think the solution is either that or what rkroller suggested and wrap the parenthesis in aria hidden tag.

sokru’s picture

Issue summary: View changes
Status: Needs work » Needs review
StatusFileSize
new26.71 KB
new4 KB

Assuming all new commits to 10.1. require construction property promotion per #3278431: Use PHP 8 constructor property promotion for existing code, I adjusted the patch accordingly.

I'd be fine with #361 approach, but based on core committers opinion on #215.4 (and points from #359), the current approach is desired, so marking part 14 done from remaining tasks.

avpaderno’s picture

Status: Needs review » Needs work
sokru’s picture

Status: Needs work » Needs review
StatusFileSize
new26.71 KB
new499 bytes

Fix CS issue.

rkoller’s picture

StatusFileSize
new571.76 KB

thank you! I've only manually tested the changes for the remaining task #14 (am not a developer). the patch applies properly to 10.1.x-dev. I've attached a short video illustrating the applied change with aria-hidden in macos 12.6.1 and voiceover. the redundant announcement of punctuations go successfully drop/hidden.
and did another search in the documentation on d.o and according to https://www.drupal.org/docs/accessibility/hide-content-properly it is the only listed and recommended option for not visually hiding content but only hiding it for screenreaders. and since the attribute isn't used on a focusable element nor the spans have any child elements (https://developer.mozilla.org/en-US/docs/Web/Accessibility/ARIA/Attribut...) it looks good to me. i leave the status at needs review in case more code level review is necessary.

smustgrave’s picture

Status: Needs review » Reviewed & tested by the community

Leaving the needs issue rescope tag but since that was 4 years ago not sure if it's needed anymore.

Also not sure what

Final string/UI review

is? Maybe that's tied into announcement issue.

Functionality wise working as expected so moving forward.

jwilson3’s picture

I'd be fine with #361 approach, but based on core committers opinion on #215.4 (and points from #359), the current approach is desired

I'm not sure how #215.4 has any bearing on #361.

The relevant piece of #359 would still stand for the first of the two checkboxes. The suggestion is to remove it for the second ones ((--) and friends). So to clarify #361, my proposal is to cover more than just 2 or more of the same character, and include two or more of any of the characters. In concrete terms, it means:

Change the functionality of the second checkbox from "Replace duplicate dots (..), underscores (__) or dashes (--) with a single one" to "Simplify sequences of dots, underscores and dashes (eg _-_, .., --) with the replacement character.".

This would both:

1.- widen coverage of our naming cleanup for more of the cases like _-_, _._, etc and (assuming underscore was set as the separator) would vastly improve messy filenames like:

"Test - file.png" to "test_file.png" instead of "test_-_file.png"
"Test_-_file.png" to "test_file.png" instead of "test_-_file.png"
"Test .. File.png" to "test_file.png" instead of "test_._file.png" (mentioned in #353)

2.- Simplify the cognitive overload for sighted users scanning the page for relevant examples like _-_.

gauravvvv’s picture

Updated attrition

sokru’s picture

StatusFileSize
new27.68 KB
new4.45 KB

Using approach from #368, Please check the interdiff; a bit of code repetition, but makes the results less messy.

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 370: 2492171-370.patch, failed testing. View results

rkoller’s picture

StatusFileSize
new101.1 KB

re #368 ahhhh now i understand. thanks for the clarification! you definitely have a point. having an example for each punctuation example for the first checkbox label is the correct approach as you stated. but having the same pattern for the second checkbox which i basically about sequences of punctuation characters that is definitely suboptimal. i agree.

1. with the latest patch the label got changed to Simplify sequences of dots, underscores and dashes (eg _-_, .., --) with the replacement character.. i have to admit the label text posed a certain cognitive challenge to me as well. first dots, underscores and dashes got named in the label then within the parenthesis i get an eg and afterwards punctuations separated by commas. mentally my brain automatically tried to spot a pattern and draw the connection to the aforementioned dots, underscores and dashes.
i wonder wouldn't it be way more clear to shorten the label to Simplify sequences of dots, underscores and dashes with the replacement character. and add a description with one of the examples you've provided in #368. imho that might be way more straight forward and clear? something like For example "Test_-_file.png" changes to "test_file.png" instead of "test_-_file.png"

2. and i think the aria-hidden should still be minded and kept in place for the second checkbox, label at least with the changes the latest patch introduced (in that case hide the parenthesis and it's content from screenreaders). the announcement for (eg _-_, .., --) is confusing (see and hear in the attached video).

johnpitcairn’s picture

Still seems slightly clumsy. How about:

"Reduce sequences of dots, underscores and dashes to a single replacement character."

Then provide an example in the description as mentioned above.

It might also be helpful to clarify that this occurs after an initial replacement of spaces.

sokru’s picture

StatusFileSize
new27.68 KB
new752 bytes

This just fixes the tests from #370.

kim.pepper’s picture

Looks good, with extensive test coverage.

One nitpick:

+++ b/core/modules/file/src/EventSubscriber/FileEventSubscriber.php
@@ -0,0 +1,110 @@
+    protected ConfigFactoryInterface $config,

Should this be named $configFactory?

dww’s picture

Status: Needs work » Needs review
Issue tags: -Needs issue rescope
StatusFileSize
new27.64 KB
new1.32 KB
  • Removing 'Needs issue rescope' tag I previously added when this was blocked on other things. 😅
  • Fixed #375.

We still need to decide if we're going to change the behavior and UI of the "Remove duplicates..." setting. I think I like the simplification, but it's going to take a bit of work to get it happy:

  1. deduplicate_separators is no longer a very self-documenting config key for this setting. And its description needs help. 😉
  2. The UI text for the settings form.
  3. The actual implementation of this setting in FileEventSubscriber::sanitizeFilename()
  4. The Unit test coverage.
  5. New screenshots in the summary for the new UI
  6. Update the text of the proposed resolution + UI sections in the summary.
  7. Update the change record

Moving back to NR to get more eyes on this, but probably NW for all the above if we agree on changing this setting and how it works.

p.s. Starting to cleanup credit in here, removing from a few pointless/unproductive rerolls that the author then abandoned.

sokru’s picture

StatusFileSize
new27.9 KB
new1.63 KB

@rkoller found a small bug in Slack on the "Remove duplicates.." logic, this should address it with test coverage.

rkoller’s picture

about #367.2 and in reply to #373

I agree, my initial suggestion was sort of clumsy. But comparing your suggestion with the other labels the patch introduces i wonder if it would make sense to stick as close as possible to the pattern the previous checkbox label uses (Replace non-alphanumeric characters with the replacement character)?

Replace sequences of dots, underscores and/or dashes with the replacement character.
For example "One _ test ._--_. file.png" changes to "One _ test - file.png"

  • As already mentioned I've tried to match the pattern of the previous sentence (Replace none-alphanumeric characters with the replacement character) as close as possible. Additionally I've just added an and/or to point out that it also applies to combinations of those characters. I've reflected that also in the updated example within the description.
  • I've slightly changed the description. I've removed the "instead of"-part at the end and added "One_" at the beginning to demonstrate that single characters like _ remain untouched by the filter (as fixed in #377)
smustgrave’s picture

Status: Needs review » Needs work

From the slack channel seems there is still work to be done around the "eg" part.

jungle’s picture

Status: Needs work » Needs review
StatusFileSize
new28.08 KB
new3.17 KB
  1. +++ b/core/modules/file/config/schema/file.schema.yml
    @@ -39,7 +39,7 @@ file.settings:
    -          label: 'Remove consecutive dots, underscores and dashes'
    +          label: 'replace sequences of dots, underscores and/or dashes with the replacement character'
    
    +++ b/core/modules/file/file.module
    @@ -1377,8 +1377,9 @@ function file_form_system_file_system_settings_alter(array &$form, FormStateInte
    +    '#title' => t('Replace sequences of dots, underscores and/or dashes with the replacement character'),
    
    +++ b/core/modules/file/tests/src/Unit/SanitizeNameTest.php
    @@ -236,7 +236,7 @@ public function provideFilenames() {
    +      'Test transliteration, replace (-), replace sequences of dots, underscores and/or dashes with the replacement character' => [
    

    Updated the label/title/text per #378

  2. +++ b/core/modules/file/file.module
    @@ -1342,7 +1342,7 @@ function file_get_file_references(FileInterface $file, FieldDefinitionInterface
    - * @see \Drupal\file\Event\FileUploadEvent
    + * @see \Drupal\core\file\Event\FileUploadEvent
    
    +++ b/core/modules/file/tests/src/Unit/SanitizeNameTest.php
    @@ -39,7 +39,7 @@ class SanitizeNameTest extends UnitTestCase {
    -   * @covers \Drupal\file\Event\FileUploadSanitizeNameEvent::__construct
    +   * @covers \Drupal\core\file\Event\FileUploadSanitizeNameEvent::__construct
    

    Fix the namespaces

  3. +++ b/core/modules/file/file.module
    @@ -1377,8 +1377,9 @@ function file_form_system_file_system_settings_alter(array &$form, FormStateInte
    +    '#description' => t('For example, "a _ file ._--_. name.txt" changes to "a _ file - name.txt".'),
    

    Add an example per #378, but changed the naming.

jungle’s picture

StatusFileSize
new28.08 KB
new728 bytes
+++ b/core/modules/file/config/schema/file.schema.yml
@@ -24,6 +24,28 @@ file.settings:
+          label: 'replace sequences of dots, underscores and/or dashes with the replacement character'

Oops, "replace" should be uppercased.

1. Emojis are everywhere, and I'd like to have the option to transliterate them, as I commented in #337, but with the FileUploadSanitizeNameEvent added, it's easy to do in custom code.

Event subscribers can modify and sanitize the filename (but not extension) before the file is saved.

2. Meanwhile, I saw extension is not available for altering. Is there a good reason for that? I have not found it with a quick look.

jungle’s picture

StatusFileSize
new3.18 KB
new28.09 KB
new1.4 KB

Sorry, wrong changes in #380.2

jungle’s picture

StatusFileSize
new28.09 KB
new3.4 KB
new660 bytes
+++ b/core/modules/file/file.module
@@ -1332,3 +1332,81 @@ function file_get_file_references(FileInterface $file, FieldDefinitionInterface
+ * These settings are enforced during upload by the FileEventSubscriber that
+ * listens to the FileUploadEvent::UPLOAD event.

The FileUploadEvent class does not exist. Updating the comment.

jungle’s picture

Issue tags: +Needs followup
StatusFileSize
new27.99 KB
new3.25 KB
new670 bytes
+++ b/core/modules/file/file.module
@@ -1377,8 +1377,9 @@ function file_form_system_file_system_settings_alter(array &$form, FormStateInte
+    '#description' => t('For example, "a _ file ._--_. name.txt" changes to "a _ file - name.txt".'),

After reading the discussion in slack, which @dww pointed me to (thanks!), I am removing the #description. And let's do it in the follow-up with the javascript approach to make the example dynamic depending on the selected replacement character.

One suggestion from me is to move the Replacement character form element to the top before deduplicate_separators , because in the label of deduplicate_separators, which mentioned replacement character, if the end users see the term Replacement character first which would make more sense.

Tagging "Needs followup"

jungle’s picture

StatusFileSize
new27.99 KB
new4.63 KB
new1.56 KB
new72.72 KB

@sokru: Personally I’d prefer option-1.png (first one), since “Transliterate” also uses replacement character.

Adjust the position of the Replacement character form element.

jungle’s picture

StatusFileSize
new28.11 KB
new5.17 KB
new905 bytes

Update the description of Transliterate to mention the use of the replacement character.

smustgrave’s picture

So when I try to test the post_update hook I get

 Unable to decode output into JSON: Syntax error                                                                                                                                                                                      
                                                                                                                                                                                                                                       
  TypeError: ArrayObject::__construct(): Argument #1 ($array) must be of type array, bool given in ArrayObject->__construct() (line 15 of /var/www/html/vendor/consolidation/output-formatters/src/StructuredData/AbstractListData.ph  
  p).                                                                                                                                                                                                                                  
               

Anyone else getting that?

smustgrave’s picture

Status: Needs review » Needs work

Disregard #387 got around that

post_update hook ran without issue

On Drupal 10.1 with a Standard install
Enabled all settings in File system.
Edited Article content type image field to be unlimited

Uploaded these files and were transformed to
new Test ..--__.png => new-test.png
new Test File!.png => new-test-file.png
new Test File.png => new-test-file_0.png
new_Test .. File.png => new_test-file.png (underscore not replaced)

Also when I upload the same file if it appends _0 should it not replace that underscore also?

sokru’s picture

new_Test .. File.png => new_test-file.png (underscore not replaced)

This is according the spec, the label says "Replace sequences of dots, underscores and/or dashes with the replacement character".

Also when I upload the same file if it appends _0 should it not replace that underscore also?

If this is desired, I'd leave this as a follow-up, since its handled on multiple places on core eg. in _file_save_upload_from_form().

smustgrave’s picture

Ah makes sense. Then my questions there are answered.

Though I did want to ask. I was only able to replicate once but when I uploaded a file that had been removed from the sites folder. I kept getting an error that the file that was being renamed already existed. Technically it did in the database but not file system. Didn’t flag that because I think it’s user error but wanted to ask before I mark

Version: 10.1.x-dev » 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch, which currently accepts only minor-version allowed changes. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

nathan tsai’s picture

Status: Needs work » Needs review

Technically it did in the database but not file system.

Drupal does not update the database when a file is deleted in the file system.

Thus, AFAIK, this is a user error.

smustgrave’s picture

Status: Needs review » Needs work

Patch no longer applies.

sokru’s picture

Status: Needs work » Needs review
StatusFileSize
new28.22 KB

Rerolled the patch, although the patch from #386 still applies to 11.x, as this is categorized as a "feature request" I don't think this will be backported to 9.x or 10.x branch.

Status: Needs review » Needs work

The last submitted patch, 394: 2492171-394.patch, failed testing. View results

sokru’s picture

Status: Needs work » Needs review

Random test failure on #394, now it will pass.

smustgrave’s picture

Status: Needs review » Reviewed & tested by the community

Applied #394

Verified post update hook ran without issue
Verified UI appears as mentioned
Did the same file test as #388

Uploaded these files and were transformed to
new Test ..--__.png => new-test.png
new Test File!.png => new-test-file.png
new Test File.png => new-test-file_0.png
new_Test .. File.png => new_test-file.png (underscore not replaced)

Going to mark. Though shouldn't be merged until follow up from #384 is opened.

dww’s picture

Issue tags: -Needs followup

I can't believe this is finally RTBC! 🎉 Thanks to everyone who picked up the torch after I was too burned out trying to keep running with it. 😅

Here's that follow-up: #3377577: Add JS to system_file_system_settings_form to dynamically update examples based on current settings

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 394: 2492171-394.patch, failed testing. View results

sokru’s picture

Status: Needs work » Reviewed & tested by the community

Random test failure.

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 394: 2492171-394.patch, failed testing. View results

jwilson3’s picture

Status: Needs work » Reviewed & tested by the community

The Drupal\Tests\user\FunctionalJavascript\UserPermissionsTest::testPermissionCheckboxes test failure seems unrelated here.

needs-review-queue-bot’s picture

Status: Reviewed & tested by the community » Needs work
StatusFileSize
new86 bytes

The Needs Review Queue Bot tested this issue. It no longer applies to Drupal core. Therefore, this issue status is now "Needs work".

This does not mean that the patch needs to be re-rolled or the MR rebased. Read the Issue Summary, the issue tags and the latest discussion here to determine what needs to be done.

Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.

sokru’s picture

Status: Needs work » Reviewed & tested by the community
StatusFileSize
new28.11 KB

Just a reroll from 394, no interdiff.

avpaderno’s picture

Status: Reviewed & tested by the community » Needs review
kim.pepper’s picture

Status: Needs review » Reviewed & tested by the community

@apaderno why was this put back to needs review?

  • catch committed 56e1b484 on 11.x
    Issue #2492171 by dww, alexpott, sokru, jungle, hgoto, Pancho,...
catch’s picture

Status: Reviewed & tested by the community » Fixed
Issue tags: +10.2.0 release highlights, +Needs followup

I wonder if we should ship configuration to enable this in the standard profile, or even on new sites in general with the update setting things to off. However the reasons given for defaulting the config to off by default for existing sites (and possibly new sites) are good ones.

Some of the config options like remove duplicates I'm not sure about - does that really need to be an option? However I'm conscious we're at over 400 comments and eight years already, these would be small changes to make in a minor release if we wanted to change our minds later.

Tagging for 10.2.0 release highlights.

Committed 56e1b48 and pushed to 11.x. Thanks!

andypost’s picture

Thank you!

Change record also needs to be published!

xmacinfo’s picture

Amazing! Thank you to everyone who worked on this issue.

agoradesign’s picture

The change is currently only commited to 11.x branch, but tagged for 10.2 release highlights - shouldn't it be cherry-picked to 10.x too?

catch’s picture

10.2.x will be branched off 11.x, it is confusing but it saves retargeting MRs every six months, see https://www.drupal.org/about/core/blog/new-drupal-core-branching-scheme-...

agoradesign’s picture

oh thanks for this important info :) and hell yeah, it's great that this long awaited feature finally found its way into core - thanks!

agoradesign’s picture

oh thanks for this important info :) and hell yeah, it's great that this long awaited feature finally found its way into core - thanks!

wim leers’s picture

Amazing work, @dww! 🤯👏

socialnicheguru’s picture

If you find your way here and you are still on Drupal 8, 9, you can use https://www.drupal.org/project/transliterate_filenames

It also works on Drupal 10

sokru’s picture

Issue tags: -Needs followup

Created a follow-up: #3385777: Simplify & enable transliteration by default?, feel free to update it.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

ressa’s picture

Great work everybody, thanks! It's so nice to see improvements such as this one, and all the other ones getting included in Drupal core, making it incrementally better and better. #229193: Incremental filter for permissions page and #2492171: Provide options to sanitize filenames (transliterate, lowercase, replace whitespace, etc) (this) are my two highlights in Drupal 10.2.

Shout-out to @smustgrave for help keeping Drupal core issues moving with his Needs Review Queue Initiative. It's having a big impact now, keeping Drupal core issues moving and getting committed. We are all reaping the fruits now.

I have even had @smustgrave appear in one of my own Drupal core issues, reviewing and helping get it committed. Maybe you have also had the pleasure of having him help out in one of your issues?

I think @smustgrave is having such a huge impact because he bridges the often separated roles of a Scrum Master and a coder, skillfully wearing both hats simultaneously, acting sort of like a "Scrum Coder" :)

Imagine the velocity we could reach, if even more people helped out in the Needs Review Queue Initiative ... Just spending ten minutes every other day could make a huge difference, and help keep the momentum in otherwise stalled issues.

See also Talking Drupal #426 - Needs Review Queue Initiative.

defcon0’s picture

At first a big thanks to this new feature. Made parts of our custom implementation obsolete :)

Only one thing bothers me: why is the replacement character only trimmed at the right side of the filename and not on the left one, as well?

https://github.com/drupal/core/blob/10.2.x/modules/file/src/EventSubscri...

I think filenames like -some-filename.png are not so nice. What has been the reason to implement it that way? Couldn't find info in this issue log.

damienmckenna’s picture

Please open a new issue for any additional refinements needed.