Follow-up to #1842718: Use new Transliteration functionality in core for machine names

As a follow-up to #567832: Transliteration in core, we now have a Transliteration class in Core.

@catch mentioned in #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) we may want to transliterate filenames in core, so that we can have a database-level unique constraint on the URI field in the database.

CommentFileSizeAuthor
#72 use_new_transliteration-2492171-72.patch7.6 KBPancho
#70 interdiff_61_70.txt120 bytesfrob
#70 use_new_transliteration-2492171-70.patch8.15 KBfrob
#61 diff-between-the-patch-files.txt360 bytesndobromirov
#61 use_new_transliteration-2492171-61.patch8.15 KBndobromirov
#60 interdiff-51-60.txt691 bytesseanB
#60 use_new_transliteration-2492171-60.patch8.14 KBseanB
#51 interdiff-2492171-45-51.txt526 byteshgoto
#51 use_new_transliteration-2492171-51.patch8.14 KBhgoto
#50 Tr@nsliteratiön $.jpg18.85 KBPascoDiogo
#50 Transliteration.jpg18.74 KBPascoDiogo
#45 interdiff-2492171-42-45.patch1.14 KBhgoto
#45 use_new_transliteration-2492171-45.patch8.14 KBhgoto
#42 use_new_transliteration-2492171-42.patch6.63 KBborisson_
#42 interdiff-2492171.txt1.54 KBborisson_
#37 core-file-system_file-upload-transliteration-2492171-37.patch5.08 KBkovtunos
#28 interdiff-2492171-26-28.txt380 byteshgoto
#28 interdiff-2492171-20-28.txt3.07 KBhgoto
#28 drupal-use_new_transliteration-2492171-28.patch5.13 KBhgoto
#26 interdiff-2492171-20-26.txt3.07 KBhgoto
#26 drupal-use_new_transliteration-2492171-26.patch5.13 KBhgoto
#21 interdiff-2492171-15-20.txt4.8 KBhgoto
#21 drupal-use_new_transliteration-2492171-20.patch4.17 KBhgoto
#15 drupal-use_new_transliteration-2492171-15.patch1.34 KBpingwin4eg
Members fund testing for the Drupal project. Drupal Association Learn more

Comments

neclimdul’s picture

What's the status of this? Should #2504815: d6 to d8 migration throws integrity contraint warning with duplicate file paths expect a unique constraint to exist in the database target?

frob’s picture

Is this going to make it for 8.0 or should this be moved to 8.1?

xjm’s picture

Version: 8.0.x-dev » 8.1.x-dev
Status: Active » Postponed

Yep, this is 8.1.x material at this point, so contrib could provide it for 8.0.x. Thanks!

andypost’s picture

Status: Postponed » Active
Related issues: +#1925684: Provide a "original filename" field for file entity.

Suppose it's time, but maybe this blocked on #1925684: Provide a "original filename" field for file entity.

frob’s picture

I wouldn't consider this blocked on that one. They should be able to be worked on in parallel.

Deciphered’s picture

It's worth mentioning that this functionality is part of File (Field) Paths module, which has a D8 beta release.

Jon@s’s picture

It's worth mentioning that this functionality is part of File (Field) Paths module, which has a D8 beta release.

True but it seems like this functionality should be in core. This would also require the transliteration in core to support replacing spaces with a symbol which it, to my knowledge, currently does not. Also why the transliteration functionality in the File (Field) Paths module is limited compared to the old Transliteration module.

Deciphered’s picture

I'm not at all debating that it should be in core, but given @xjm's comment that it's not going to happen in 8.0.x, and that it needs to be dealt with in contrib, this is it being dealt with in contrib.

File (Field) Paths does also provide replacing of spaces with a symbol, as it has since inception (or not long afterwards) via the Pathauto integration.

I'm more than happy to remove the functionality from F(F)P as soon as it's no longer needed.

g089h515r806’s picture

I have test with File (Field) Paths with Drupal8.0.1, it does not works at first. it works correctly at last.
I want to upload chinese pdf file using file field module.

Deciphered’s picture

That seems like a poorly worded statement. But I will be happy to look into it further in the appropriate issue queue.

g089h515r806’s picture

Sorry for my English.

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.0-beta1 was released on March 2, 2016, which means new developments and disruptive changes should now be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

eigentor’s picture

I have tried filefield paths to solve the problem of transliterating filenames. For me it dit not work. Some comments for filefield paths:

  1. It seems to keep the original filename and create the new filename, which is some kind of alias. I think this is by design and in the sense of the module, but unneccessary for the simple goal of getting web-safe filenames.
  2. One has do the setting "transliterate filename" in every single filefield, maybe even instance of every filefield. That is quite some work and can cause oversights.
  3. There is a major problem with filefield paths: it cannot be uninstalled https://www.drupal.org/node/2685731 The module is not alone with this problem. The new uninstall system of Drupal 8 requires all database content of a module to be removed before you can uninstall it. If this content is not deletable by hand, there is a problem.
imadalin’s picture

I don't think that the issue's topic is about File (Field) Paths and we should concentrate on implementing Transliteration for filenames manipulation in the Core.

pingwin4eg’s picture

Status: Active » Needs review
FileSize
1.34 KB

Let's start with something.

Here's a patch using code similar to that from a transliteration module.

Berdir’s picture

"Something" looks like a good start to me, at least it doesn't break any existing tests. Next step would be tests with file names that trigger this.

Also wondering about how to introduce this exactly, is there any reason to make this a setting, so it could be disabled? It will be a behavior change, although I suspect a very welcome one for most (all?) sites?

SKAUGHT’s picture

i'll be bold enough to say all sites should use.

The only other request I've ever had is about storing the original file name, that could be used for Title href. As people often name a file because they expect it would be saved. Same for Search API related issues. At this time, I haven't directly looked to see if the File API is yet doing that..

^I think there would be no need for an 'on/off' switch of transliteration if org. name is stored for about points.

$clean_filename = \Drupal::transliteration()->transliterate($file->getFilename(), 'en', '');
I should think 'en' shouldn't be hardcoded. sites will not all be english first.

stefan.r’s picture

This probably should be opt-in (at least for existing sites), so as to not change behavior.

SKAUGHT’s picture

Status: Needs review » Needs work
andypost’s picture

opt-in

Makes sense to have that configurable via container param

hgoto’s picture

I updated the patch #15.

1.

$clean_filename = \Drupal::transliteration()->transliterate($file->getFilename(), 'en', '');
I should think 'en' shouldn't be hardcoded. sites will not all be english first.

I replaced the hardcoded 'en' to $langcode with the following code:

      $langcode = \Drupal::languageManager()->getCurrentLanguage()->getId();

      // Transliterate and sanitize the destination filename.
      $filename = \Drupal::transliteration()->transliterate($file->getFilename(), $langcode, '');

2.

I changed the strtolower() to Drupal\Component\Utility\Unicode\Unicode::strtolower(). I consulted #1842718: Use new Transliteration functionality in core for machine names for this point. This may be unnecessary.

      // Force lowercase to prevent issues on case-insensitive file systems.
      $filename = Unicode::strtolower($filename);

3.

I introduced an option config to toggle the transliteration for the file names.

This probably should be opt-in (at least for existing sites), so as to not change behavior.

Probably it depends on the language that people want to make this setting opt-in or opt-out (of course, it shouldn't be changed for existing sites with update). In my small personal experience, the current transliteration function for Japanese is not very good and I'd like it to be off by default...

Tests are not yet added.

hgoto’s picture

Status: Needs work » Needs review

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.0-beta1 was released on August 3, 2016, which means new developments and disruptive changes should now be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

arshadcn’s picture

Patch in #21 tested and works. Thanks.

dawehner’s picture

+++ b/core/modules/file/file.module
@@ -842,7 +842,28 @@ function file_save_upload($form_field_name, $validators = array(), $destination
+    if (\Drupal::config('system.file')->get('filename_transliteration')) {
+      $langcode = \Drupal::languageManager()->getCurrentLanguage()->getId();
+
+      // Transliterate and sanitize the destination filename.
+      $filename = \Drupal::transliteration()->transliterate($file->getFilename(), $langcode, '');
+
+      // Replace whitespace.
+      $filename = str_replace(' ', '_', $filename);
+      // Remove remaining unsafe characters.
+      $filename = preg_replace('![^0-9A-Za-z_.-]!', '', $filename);
+      // Remove multiple consecutive non-alphabetical characters.
+      $filename = preg_replace('/(_)_+|(\.)\.+|(-)-+/', '\\1\\2\\3', $filename);
+      // Force lowercase to prevent issues on case-insensitive file systems.
+      $filename = Unicode::strtolower($filename);
+    }

I'm wondering why this is part of file_save_upload(), which is part of an API which is related to the form, not some actual underlying API level. IMHO you would like to have the same for potential files uploaded via RESt one day.

Maybe it would be enough for now to move all that logic into its own API function and call it from here?

hgoto’s picture

Thanks for your reviews. As dawehner told, surely it's better to make this logic an independent API function.

Maybe it would be enough for now to move all that logic into its own API function and call it from here?

I tried creating a new method getTransliteratedFilename() in \Drupal\file\Entity\File class and moved this new transliteration logic into it. I'm not sure where we should put this API... I'd like this patch to be fully reviewed. Thank you.

Status: Needs review » Needs work

The last submitted patch, 26: drupal-use_new_transliteration-2492171-26.patch, failed testing.

hgoto’s picture

The patch #26 failed the auto test because I didn't change the code properly when I extracted the logic to a method. I'll try it again.

tobiberlin’s picture

Tested patch in #28 with a file named with spaces and German umlaut -> works

tobiberlin’s picture

If this is added to core it would be very helpful to offer an update function to transliterate the already uploaded files, wouldn't it?

Berdir’s picture

Nope, that doesn't work. you don't know how and where those files are used, could be hardcoded in content. We can only offer it for new files.

tobiberlin’s picture

In my eyes this answer makes no sense as this patch is about using transliteration on images uploaded by a form. Whenever an image uploaded for a field for example is used somewhere hardcoded the same problem as described in #31 would appear when simply a new image is uploaded. Whenever an image path/ name is used hardcoded although the image can be changed by a form it is a wrong programming, isn't it?

ressa’s picture

It would be good to have transliteration of file names in Drupal 8 - and really great to get it into Drupal 8.3, if possible.

sense-design’s picture

One of the features why I have not yet completely switched to D8 for all my projects.
Would be great to have this in the upcoming 8.3 release.

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.0-alpha1 will be released the week of January 30, 2017, which means new developments and disruptive changes should now be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

hampercm’s picture

Status: Needs review » Needs work

In response to #30 and #31: perhaps add a drush command to transliterate existing files, if a site admin needs to.

+++ b/core/modules/file/file.module
@@ -842,7 +842,10 @@ function file_save_upload($form_field_name, $validators = array(), $destination
+    $filename = $file->getTransliteratedFilename();
+
+    $file->destination = file_destination($destination . $filename, $replace);

Earlier code in this function does a setFilename() if the filename is changed (see code that handles insecure file extensions at line 818). I'd recommend doing that here, too.

+++ b/core/modules/file/src/Entity/File.php
@@ -196,6 +197,38 @@ public function preSave(EntityStorageInterface $storage) {
+      // Remove multiple consecutive non-alphabetical characters.
+      $filename = preg_replace('/(_)_+|(\.)\.+|(-)-+/', '\\1\\2\\3', $filename);

This seems extraneous. Remove it?

kovtunos’s picture

Updated the patch #28 to work with Drupal 8.3.0.
Changes: new PHP array syntax.

hgoto’s picture

Status: Needs work » Needs review

Status: Needs review » Needs work
borisson_’s picture

I tried this patch and it works, thanks!

I think the failure in #37 is unrelated, but this does need it's own tests. Since we're adding a new public function for the file name transliteration, all we'd need to do is call that function with different file names in the test, I think?

Berdir’s picture

We can test it directly in a kernel test with diferent variations. I think we also need at least one integration test, upload an image with some spaces or so to make sure the function is actually called.

borisson_’s picture

This patch adds the kernel tests. Not sure if the test-class is the correct one.

Status: Needs review » Needs work

The last submitted patch, 42: use_new_transliteration-2492171-42.patch, failed testing.

borisson_’s picture

Ah, those migrate tests are related, I don't see how those should be fixed though.

hgoto’s picture

I believe the failed tests for #42 can be fixed just by providing the added configuration into $expectedConfig['system.file'] in the test classes.

       // temporary_maximum_age is not handled by the migration.
       'temporary_maximum_age' => 21600,
+      // filename_transliteration is not handled by migration.
+      'filename_transliteration' => FALSE,
     ],
     'system.image.gd' => [
       'jpeg_quality' => 75,

The test passed in my environment with this change and I'd like to see how the testbot reacts.

hgoto’s picture

Oops, I'm sorry, the extension of the interdiff is wrong.

Status: Needs review » Needs work

The last submitted patch, 45: interdiff-2492171-42-45.patch, failed testing.

hgoto’s picture

Status: Needs work » Needs review

The tests passed. (The failure was just caused by the wrong extension of the interdiff.)

J-Lee’s picture

@@ -274,4 +275,37 @@ public static function baseFieldDefinitions(EntityTypeInterface $entity_type) {
     return $fields;
   }
 
+  /**
+   * Gets the transliterated filename.
+   *
+   * @return string
+   *   The transliterated filename. If the filename transliteration options is
+   *   disabled, the raw filename of the file.
+   */

Just found a typo:

If the filename transliteration options is disabled, the raw filename of the file.

"options" should be "option"

The patch works for me. Thank you.

PascoDiogo’s picture

FileSize
18.74 KB
18.85 KB

Should not this transliteration be applied to the name displayed to the user?

Like this:

     $file->destination = file_destination($destination . $filename, $replace);

+    $file->filename = $filename;

Without transliteration in the name displayed to the user:
Only local images are allowed.

With transliteration in the name displayed to the user:
Only local images are allowed.

hgoto’s picture

@J-Lee, thank you for the review. I fixed the typo.

@PascoDiogo, in my opinion, this transliteration should not necessarily be applied to displayed names. I believe Transliteration module's project page would help to understand transliteration.

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.0-alpha1 will be released the week of July 31, 2017, which means new developments and disruptive changes should now be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

J-Lee’s picture

I'm not sure with #50. The original filename is replaced by the transliterated name. Why should a not existent filename displayed to the user? Isn't it confusing if you got a filename shown but the real filename is different?

tstoeckler’s picture

+++ b/core/modules/file/src/Entity/File.php
@@ -274,4 +275,37 @@ public static function baseFieldDefinitions(EntityTypeInterface $entity_type) {
+      $langcode = $this->languageManager()->getCurrentLanguage()->getId();

Shouldn't this use the file language instead?

frob’s picture

Isn't it confusing if you got a filename shown but the real filename is different?

An argument could be made either way:

It is confusing to upload a file and see a different filename after uploading.
It is confusing to see the name of a file in the UI and then look for that file only to find out that it doesn't exist.

The first option is fixable by letting the user know that the filename will be changed upon upload for reasons. The second option would require adding a bit of UI to let the user know that the file name on the disk will be different than the one shown.

I lean toward the first option. Let the user know that the file name will be changed for reasons and then just display the actual file name.

SKAUGHT’s picture

#53/#55

+1 option1.
the file description should include (like it does for file size restriction) a li child that outlines that filename will be cleanup.

This is similar to the fact that drupal adds '_0' ' _1' (file counts) if use keeps re-attaching the same file name.

i know i've had clients want to have file name as they have make them (ie: '2017 Q1 earning.xls'). but then would have a bit of overhead on tracking related file name changes are 'repairing' them for file downloads etc. same issue when a User re-attached a new version of that same file name (_0 _1 _2....)

frob’s picture

Knowing the name of the actual file on the server is also necessary for complying with DMCA takedown notices and the like.

SKAUGHT’s picture

i haven't looked over this whole issue in a while.

#4 andypost is correct.

Regnoy’s picture

When it will be integrated in CORE?

seanB’s picture

+1 for adding this to core. Not sure on postponing on #1925684: Provide a "original filename" field for file entity.. The feature is optional and the same problem (changing filenames) already occurs for duplicate filenames as mentioned in #56.

We could create a followup for the interface change mentioned in #56 "the file description should include (like it does for file size restriction) a li child that outlines that filename will be cleanup". It seems this particular part could turn into a huge bikeshed and I'm personally not convinced this is a big issue for most users. Hopefully, we don't have to block on that.

New patch to fix #54.

ndobromirov’s picture

Thanks for the patch!

I've started using it on a project, but once I've enabled the configuration and exported it, the site installations started failing due to import of configuration issues. After some checks it turned out the config has stored 1 in the YML file, where the default value was false, so logically it should have stored true in there. Also hinted by the bool schema type.

I've changed the config manually in the YML file to true and installs passed.

I guess the fix should be to just type-cast the value, before setting it to storage from the form-submit handler.
This is also the fix implemented directly on the patch from #60. Diff between the two patch versions is provided.

I do not know should this regression needs any tests, so any help will be appreciated.

As I've done changed on the code still needs review :)

yan’s picture

Patch from #64 seems to work for me (although spaces are converted to underscores (_) which I think is not ideal).

Regnoy’s picture

Work for me too always. :D

frob’s picture

@yan, what isn't ideal about underscores?

millionleaves’s picture

@frob

There is a difference from a SEO perspective. See #1 in the article linked below:

https://searchengineland.com/9-seo-quirks-you-should-be-aware-of-146465

frob’s picture

Status: Needs review » Needs work

That article is from 2013 and the point you refer to is linked from 2005. Also, the files it is referring to are .html/.htm or in other words, that article is talking about the link to the content and not the link to the static asset.

SEO is constantly evolving and changing. Do you have a more current reference? I would argue that this should be configurable to make it long term SEO friendly, but I would also argue that should be contrib or another issue. I just don't want this issue to get side-tracked when it is so close.

From #61 Looks like it still needs work:

site installations started failing due to import of configuration issues. After some checks it turned out the config has stored 1 in the YML file, where the default value was false, so logically it should have stored true in there. Also hinted by the bool schema type.

I've changed the config manually in the YML file to true and installs passed.

I guess the fix should be to just type-cast the value, before setting it to storage from the form-submit handler.
This is also the fix implemented directly on the patch from #60. Diff between the two patch versions is provided.

millionleaves’s picture

Current article from Google states:

"We recommend that you use hyphens (-) instead of underscores (_) in your URLs."

https://support.google.com/webmasters/answer/76329?hl=en

This is consistent with the earlier article I shared, and I don't see any distinction in either between html URLs and image URLs.

Another reason to use hyphens here is that the Transliteration module in D7 used hyphens for file names, so IMHO it is desirable to remain consistent with that unless the core transliteration functionality is also using underscores (although my argument in favour of underscores hyphens would remain).

It appears that in this case, the use of an underscore is defined in one line of code in patch #61 that could be changed easily without affecting the patch overall.

Providing the option to choose the separator may be useful for some use cases, but agree that this would be better left for a separate issue for the core module rather than this patch.

imclean’s picture

@millionleaves this is an interesting discussion as I've always preferred underscores in filenames while keeping hyphens in the rest of the URL. It turns out this method could be problematic due to underscores being a special character in computing.

Google's examples all use hyphens: https://support.google.com/webmasters/answer/114016?hl=en

This could be used to our advantage in certain cases. For example, "big_red_socks.jpg" you might want people to find only if they're looking for "big_red_socks" and not "big", "red" or "socks". Maybe not very often, but it's good to have that level of control.

frob’s picture

Yeah, it looks like it should be changed to a "-" character. The patch has always had this from #15

$clean_filename = str_replace(' ', '_', $clean_filename);

So that line needs to be changed and the issue from #61

frob’s picture

Make the SEO change by rerolling the patch from 61, that leaves the issue mentioned in #61, can anyone review that?

borisson_’s picture

This will also need fixes in the kernel-tests written for this issue, see testFileNameTransliteration in modules/file/tests/src/Kernel/SaveTest.php. Specifically the data-provider (provideFilenames) will need to be updated.

Pancho’s picture

Fixed provideFilenames().

frob’s picture

Patch in #72 should resolve all remaining issues. All that is left is manual review.

seanB’s picture

Could you provide an interdiff for #72?

Raja_Vijayakumar’s picture

I applied the patch#51 to my application. It only transliterating the filename in database but the file name still appears to be the original on UI for the uploaded file . any idea how do i fix the filename that appears on UI too.

J-Lee’s picture

See comments #50, #51, #53, #55, #56 and #60.

This is a point of discussion but could/should a follow up issue.

Orkut Murat Yılmaz’s picture

anyone tested patch #72?

rang501’s picture

Confirming that the patch in #72 is working.

eigentor’s picture

The Patch #72 applies cleanly with Drupal 8.4.0.
The filenames for the uploaded files itself are correctly transliterated. It adds dashes for spaces.
The displayed filename are an issue for a followup, as I understand, so no surprise they are kept from the original filename.

rootwork’s picture

Status: Needs review » Reviewed & tested by the community

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.0-alpha1 will be released the week of January 17, 2018, which means new developments and disruptive changes should now be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 72: use_new_transliteration-2492171-72.patch, failed testing. View results

mcdruid’s picture

Status: Needs work » Needs review

I think the test failures look unrelated to this patch - back to NR to try the tests again.

Noting also that this was mentioned as a step toward being able to remove the workaround for the file uri uniqueness constraint added in #1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols), but it probably won't help much if it's configurable and opt-in. (The idea being that the uri field in the file_managed table could be set to ascii if that's all it ever contains, and as such there wouldn't be a problem with the key length when imposing a uniqueness constraint in the db schema).

mgifford’s picture

Setting it back to @rootwork's RTBC. As @mcdruid noted, "test failures look unrelated to this patch".

mgifford’s picture

Status: Needs review » Reviewed & tested by the community
alexpott’s picture

Status: Reviewed & tested by the community » Needs work
Issue tags: -Needs tests +Needs issue summary update
  1. +++ b/core/modules/file/src/Entity/File.php
    @@ -277,4 +278,36 @@
    +  public function getTransliteratedFilename() {
    +    // If the transliteration option is enabled, transliterate the filename.
    +    if (\Drupal::config('system.file')->get('filename_transliteration')) {
    

    This is a bit funny because calling getTransliteratedFilename() might not return a transliterated filename. Maybe it should be the caller's responsibility to check the config.

  2. +++ b/core/modules/system/config/install/system.file.yml
    @@ -3,3 +3,4 @@
    +filename_transliteration: false
    

    Adding a configuration key like this means we need an upgrade path for existing sites. Just an update in system.install to add the missing key with the default value.

  3. We still need a change record
  4. The issue summary could do with an update to explain the solution. I.e. why is it configurable. Why the default for new sites is false.