Problem/Motivation

While working on #3516706: [PP-1] Disallow dangerous filenames e.g. command injection characters we discovered that invalid UTF-8 characters can cause preg_replace() to return NULL. You need to call preg_last_error() or preg_last_error_msg() to see that it caused a PREG_BAD_UTF8_ERROR.

This error is not currently checked and a $filename of NULL is then passed to subsequent validation checks. These will fail, but the reason isn't obvious. For example, you might get an 'invalid extension error' which is confusing to users.

Steps to reproduce

Proposed resolution

In Drupal\file\EventSubscriber\FileEventSubscriber::sanitizeFilename() check for invalid UTF-8 characters before doing any other validation checks and throw a Drupal\Core\File\Exception\FileException if invalid. This will get caught and handled in file_save_upload().

Remaining tasks

User interface changes

Introduced terminology

API changes

Data model changes

Release notes snippet

Issue fork drupal-3562543

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Comments

kim.pepper created an issue. See original summary.

kim.pepper’s picture

Status: Active » Needs review

Ready for review.

kim.pepper’s picture

Ran a test-only job to show the failure.

smustgrave’s picture

Status: Needs review » Reviewed & tested by the community
Issue tags: +Needs Review Queue Initiative, +Bug Smash Initiative

Pretty simple review :)

Test-only was already ran here https://git.drupalcode.org/issue/drupal-3562543/-/jobs/7621295 which shows the test coverage.

Changes in core/modules/file/src/EventSubscriber/FileEventSubscriber.php look good and even better have a simple comment added to further explain.

LGTM!

quietone made their first commit to this issue’s fork.

quietone’s picture

Status: Reviewed & tested by the community » Needs work

I applied the suggestions and now the Kernel tests are failing.

kim.pepper’s picture

They look unrelated:
core/modules/package_manager/tests/src/Kernel/PhpTufValidatorTest.php

Version: 11.x-dev » main

Drupal core is now using the main branch as the primary development branch. New developments and disruptive changes should now be targeted to the main branch.

Read more in the announcement.

kim.pepper’s picture

Status: Needs work » Needs review

Seemed like a random fail. Rebased on main and back to green.

smustgrave’s picture

Status: Needs review » Reviewed & tested by the community

Rebased and test failure appears to be random /builds/core/modules/layout_builder/tests/src/FunctionalJavascript/InlineBlockTest.php:528.

Feedback appears to be addressed on this one though.

dww’s picture

Left a few concerns/musings on the MR. Not worth downgrading to NR over them, but wanted to at least ask before this is committed.

Thanks,
-Derek

p.s. I thought I was going to have to point out #3545276: expectDeprecation() is deprecated and friends, but I realized this is an exception test, not a deprecation.

sivaji_ganesh_jojodae made their first commit to this issue’s fork.

alexpott’s picture

Status: Reviewed & tested by the community » Needs work

I was about to commit this and think I wondered whether throwing an exception here is the correct place. I think we should be adding this check to \Drupal\file\Validation\Constraint\UploadedFileConstraintValidator so we can output a message to the user in the form rather than triggering a 500.

I think we need to leave the exception in as this will cover places not using the \Drupal\file\Upload\FileUploadHandler but adding this as a violation via the constraint will be a make nicer UI.

alexpott’s picture

Another thought that adds a layer of complexity was does transliteration do with these characters if it is enabled? It does...

    // Ensure the string is valid UTF8 for preg_split(). Unknown characters will
    // be replaced by a question mark.
    $string = mb_convert_encoding($string, 'UTF-8', 'UTF-8');

I wonder if that offers us a way forward rather than rejecting the file - if transliteration is not enabled we add similar logic to \Drupal\file\EventSubscriber\FileEventSubscriber::sanitizeFilename() instead of throwing an exception.

alexpott’s picture

We'd have to do something like...

     // Replace question marks with a unique hash if necessary. This because
      // mb_convert_encoding() replaces all invalid characters with a question
      // mark.
      if ($replacement != '?' && str_contains($filename, '?')) {
        $hash = hash('sha256', $filename);
        $filename = str_replace('?', $hash, $filename);
      }

      // Ensure the string is valid UTF8 for preg_split(). Unknown characters will
      // be replaced by a question mark.
      $filename = mb_convert_encoding($filename, 'UTF-8', 'UTF-8');

      // Use the provided unknown character instead of a question mark.
      if ($replacement != '?') {
        $filename = str_replace('?', $replacement, $filename);
        // Restore original question marks if necessary.
        if ($hash !== FALSE) {
          $filename = str_replace($hash, '?', $filename);
        }
      }

Inspired by \Drupal\Component\Transliteration\PhpTransliteration::transliterate() ... maybe we could refactor this into public static method on \Drupal\Component\Transliteration\PhpTransliteration so we don;t have to duplicate code.

shivansh_0909 made their first commit to this issue’s fork.

smustgrave’s picture

@shivansh_0909 fixes should continue in the existing MR please.

shivansh_0909’s picture

I have opened a merge request to improve the UTF-8 exception message assertion.

Merge request: https://git.drupalcode.org/project/drupal/-/merge_requests/15138

Happy to make any changes if needed. Thanks!

dww’s picture

And LLM use needs to be disclosed.

And fixes should actually address the given feedback.

shivansh_0909’s picture

I have addressed the UTF-8 exception message assertion and fixed all PHPCS and PHPStan issues. The pipeline is now passing. Ready for review.

dww’s picture

@shivansh_0909 - at this point, it’s not clear if you’re a human or an LLM agent. These “contributions” are totally unhelpful for this issue. You need to read what we’ve said and act accordingly. I’ve closed your MR. If you actually want to help, you’ll have to read and understand comments 13, 14 and 15, then make the suggested changes and push them to the existing MR. If you continue to push commits to new branches that aren’t directly addressing those comments, we’re going to assume you’re using automated tools to generate slop and waste our time, and will likely request that your account be banned.

alexpott’s picture

Also before we do #13 or #15 we need to have a discussion about what is the correct approach. Currently we have:

  1. Validate in UploadedFileConstraintValidator and continue to throw an exception in FileEventSubscriber::sanitizeFilename()
  2. Replace non-unicode characters with the replacement character when transliteration is disabled in FileEventSubscriber::sanitizeFilename()

We need to decide what to do before we start coding.

I think because of the existing behaviour of transliterate and our approach of manipulating the filename in FileEventSubscriber::sanitizeFilename() we should do that rather than error. The question is how and when. Should we only do this if we've not transliterated and 'filename_sanitization.replace_whitespace' or 'filename_sanitization.deduplicate_separators' is enabled... as that is the only way an error is going to be caused. Or should we add a new event listener that does this sanitisation at the very beginning of the process so all other event listeners can be sure the filename is valid unicode. Not sure.

shivansh_0909’s picture

@dww Hey, thanks for the feedback and sorry for the confusion earlier I realize I jumped in without fully understanding the ongoing discussion and should’ve contributed to the existing MR instead of opening a new one. I’ll go through comments 13–15 properly to understand the approach being discussed before making any further contributions. Appreciate the guidance.