Files can be uploaded via Drupal with funky filenames, like this one: user.module.patch @@ ^^ asdf . asdf . However, when you do that and try and download the file, Drupal doesn't know what to do. Not to mention, across platforms, weird characters and spaces can cause other problems, most notably when trying to send a link to a particular file to download.

Therefore, filenames in Drupal should be cleaned taking out these unwanted characters. I have a pretty solid function right now to do this and I'll adapt it to Drupal very soon. Just want to get the basic issue created so I can put it in my to-do bin.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

m3avrck’s picture

Version: x.y.z » 7.x-dev

Still an issue :-)

Here is the function I have been using...

function nameToSafe($name, $maxlen=64) {
  $noalpha = 'ÁÉÍÓÚÝáéíóúýÂÊÎÔÛâêîôûÀÈÌÒÙàèìòùÄËÏÖÜäëïöüÿÃãÕõÅåÑñÇç@°ºª';
  $alpha   = 'AEIOUYaeiouyAEIOUaeiouAEIOUaeiouAEIOUaeiouyAaOoAaNnCcaooa';

   $name = substr($name, 0, $maxlen);
   $name = strtr($name, $noalpha, $alpha);
   // not permitted chars are replaced with "_"
   return preg_replace('/[^a-zA-Z0-9,._\+\()\-]/', '_', $name);
} 

Now, yes, it does have it's implications and isn't a patch for Drupal.

But something along these lines, and/or taking a file name and making an MD5/related-hash would be ideal. Serving those crappy file names can lead to all sorts of issues, from file system errors (strange characters, etc...), quirks when serving files in the browser, and related.

Adding something like this wouldn't be all that hard and with the new file system objects this makes even more sense to further sanitize things.

drewish’s picture

+1 from me, i wish it was in core.

killes@www.drop.org’s picture

FileSize
0 bytes

hmmm.

I've created a file ÁÉÍÓÚÝáéíóúýÂÊÎÔÛâêîôûÀÈÌÒÙàèìòùÄËÏÖÜäëïöüÿÃãÕõÅåÑñÇç@°ºª.txt locally and trying to upload it.

killes@www.drop.org’s picture

The filename already gets butchered.

(yes, the file was empty)

m3avrck’s picture

Is it butchered or the filesystem couldn't understand those UTF8 characters and chopped them off regardless?

m3avrck’s picture

Title: File names should be cleansed of weird characters and spaces » file names should be cleansed of non-alphanumeric characters
JirkaRybka’s picture

If going to translate alphanumeric-like characters to plain ASCII equivalents (which I think is good, allowing non-english names to stay more or less intact), then we should go through UTF-8 specification, and add *all* the characters.

For my native Czech language, I'm missing these: čďěňřšťůžČĎĚŇŘŠŤŮŽ Polish have more like 期ƌŹ (which is incomplete list limited by my Czech keyboard - there should be aeAE with bottom hangings, lL with slash over, and zZ with single dot above), Slovak another ones ŕľŔĽ, and probably even more exists around the world (I seem able to type at least āēīōū, and remember that Turkish have some really interesting ones).

dww’s picture

Just a note to the people in this thread: don't forget that attaching files to issues doesn't (yet) use the same code as core's upload.module. So, "testing" these problems by means of replies to this issue aren't particularly helpful, at least not until we're done converting project_issue to use comment.module and comment_upload.module (http://drupal.org/node/18920) and that is installed on d.o.

m3avrck’s picture

JirkaRybka, you do have a good point. But then what about Greek or Japanese characters? This could be a nightmare :-)

Therefore, I propose simply MD5() the filename in question and writing it to the server. That would take care of all these odd use cases at once and be much faster and cleaner.

dww’s picture

md5 hashes for cache ids is one thing. but, for filenames, it's quite another... :( unlike cache ids, filenames are user-visible. imagine how much fun it'd be trying to download patches from the d.o issue queue when doing drupal development, and trying to keep track of which patch went with which issue and which bug you were in the middle of testing, etc. Ugh.

Or, what about people who attach .pdf or .png files to events on their sites, e.g. the poster for an upcoming gig. I don't want that to automatically be translated from "2007-09-22-gig-poster.pdf" to "04b569e506a3b07161ee84c49b442732.pdf" so that everyone who downloads the attachment is left with "garbage files" on their computer, not something they'd actually want to print and use.

Great big -1 to that proposal, sorry to say...

m3avrck’s picture

Standing on the shoulders of giants...

http://farm1.static.flickr.com/53/146365348_eaf376c891.jpg?v=0

I think they know a thing or two about uploaded files...

Perhaps an option to enable this? Or a variable that can be set? The use case for enabling this is quite high, while in some cases, it might be better to have it off.

webchick’s picture

I agree with Derek. We shouldn't ever needlessly screw with users' data. A file name is part of that data. The only time we intervene here is when a file would cause a collision, and that's to avoid data loss.

dww’s picture

@m3avrck: The difference is that with flickr, from the UI and user experience perspective, the file you attach *is* the node. You upload it and *expect* it to get its own unique ID, which you then can add your own title, comments, and tags (metadata) on to. In Drupal, attachments are parts of the metadata that you add to nodes (or comments), and the filenames can (and usually are) in some way part of that metadata...

I guess I could stand it if there was a setting (which defaulted to off) to md5 hash all uploaded filenames (though you'd still have to retain the extension, plus whatever extension munging happens for security). But, I definitely don't think that scrambling the filenames should be the default behavior. I'd rather see a different solution to this problem that didn't involve data loss (or at least significantly reduced it by some form of reasonable transliteration), instead of punting to the admins with a setting. Another variable is both code bloat, and admission of defeat that we can't come up with a good solution so we force the admins to choose a lesser of 2 evils. As with elections, once you agree to go along with a choice of 2 evils, you've already lost. ;) Let's actually come up with something worth voting for...

dww’s picture

For example, Gabor suggested this: http://pecl.php.net/package/translit -- definitely looks promising.

JirkaRybka’s picture

Yes, that definitely looks promising. But I didn't propose do go as far as Greek and Japanese, which are completely idependent alphabets - I just wanted to have all characters, which already ARE latin (someone say English) characters, only with additional marks on them.

Another approach would be to make the search/replace strings translatable through t(), so that translators may provide character lists for their language, so covering the characters most likely to occur on the particular (localized) site.

JirkaRybka’s picture

FileSize
4.64 KB

I'm attaching improved version of the name-cleaning function.

- Now supports all UTF-8 characters, which are VISUALLY fitting to be replaced with english-chars. I know this is not perfect, but since filenames will be always a bit distorted by this cleaning, I'm not going into the unsafe world of multi-character transcriptions. My version converts only just characters, which look significantly like the replacement, and so would look silly if replaced by just underscores. This replacement works pretty well for Czech, and I believe it's OK everywhere excepting non-latin alphabets, where there's nothing to lose anyway.

- This feature, if integrated into Drupal, should be configurable (to be used or not). Somewhere it will help, removing problematic characters (files with weird chars may be indeed difficult to download, and from my experience there are even chars making a file completely impossible to open in Windows!), but somewhere (Greek, Arabic, Hebrew, Russian...) it'll turn the language completely into a mess, so admin might want to disable it.

- Unfortunately, I had to work around limitations: strtr() doesn't work on multi-byte characters on my install (PHP 4.4.4 with mbstring), so I treated the replacement as string-to-string which works always fine (even without mbstring, I believe). Without this, even european characters were horribly messed up. I'm also compensating for one function not being supported in PHP4, still supported by Drupal. The resulting code is maybe a bit less performant, but as this function will be only used once in already-very-slow upload process, I believe it's OK.

- I also modified the whole a bit for better code readability.

- If going to view/edit the code, consider that this is UTF-8 textfile with *VERY* wide character set used, so some systems might even fail to view it properly, let alone edit. I'm using a Linux desktop, where UTF-8 is native encoding (Gnome, Gedit).

All work left is to integrate the function into Drupal ;-)

sun’s picture

Title: file names should be cleansed of non-alphanumeric characters » Filenames should be cleaned of non-alphanumeric characters

Image and Imagefield also suffer from malformed filenames. Images, or files in general, are neither viewable nor downloadable (see http://drupal.org/node/168059).

At least, reserved URI characters need to be cleaned out (see RFC2396: 2.2. Reserved Characters). I'd vote to perform this cleanup in file.inc for all files, not in upload.module.

Regarding non-latin characters (including spaces) in filenames, AFAIK it would be sufficient, if url() would escape them properly using hexadecimal sequences. That's at least, what RFC tells, if I understand it correctly.

Furthermore, with parts of i18n in D6 core, I would call this a critical bug. Of course, it's a fault of the user trying to upload a filename that won't be loadable. But who expects a user to know RFC 2396 ?

JirkaRybka’s picture

Can someone on Windows check this: If filename is hexadecimal-escaped, AND you work with it on Windows configured to some other language (charset) than the one used in filename (you know, Windows convert everything from UTF-8 into it's native encoding - at least XP do, I'm even unable to see my own Czech chars on Drupal.org under XP, because the site claims to be English language and XP behves accordingly, but unfortunately my access to that machine is too limited for testing)... So, can someone check if it's possible to handle hex-escaped filename under Windows running different locale?

I seem to remember weird issues on Win with 'unprintable' chars in filenames coming from a Linux-burned CDs - files shown in directories, but impossible to open or handle in any other way. Does IE ensure the filenames are clean for Windows limited charset?

drewish’s picture

obviously it'd be best to fix this in core but i did want to point out a work around to this for contrib modules. create a validator function that you pass to file_save_upload() which takes $file by ref and alters the destination. something along the lines of:

function contrib_file_name_fixer(&$file) {
  $file->filename = preg_replace('/[^a-zA-Z0-9,._\+\()\-]/', '_', $file->filename);
  $file->destination = file_destination(file_create_path($file->filename), FILE_EXISTS_RENAME);
  return array();
}

you'd want to try a little harder to transliterate the the name but that should allow you to change the final filename to be safe.

smk-ka’s picture

There is a new module that takes care of transliteration and cleaning of filenames:

Transliterate filenames

It is based on the PHP UTF-8 project (though heavily modified), supporting transliteration of the full Unicode range, and is able to process generic file uploads.

drewish’s picture

that module doesn't seem like it does much on its own... i guess other modules would have to support it?

smk-ka’s picture

No support required, it cleans the filenames of any file uploads in a generalized way, by manipulating the global $_FILES array. That way the filename has already sanitized when it is about to be processed by whatever module.

sun’s picture

...and it works like a breeze with all file-related modules I was able to test so far (including upload, image, attachment, imagefield) in Drupal 5 and 4.7. D6 should work, too.

I'd suggest to mark this and all other related issues on d.o as fixed now. file_translit might be moved into core someday. Until then, it will serve in contrib. I'll add an note to Inline and Img_Assist projects.

JirkaRybka’s picture

I would like the credentials more re-usable, though, than just filenames. Particularly, I need something like that in the Pathauto module, non-english paths are all underscores there, giving no SEO effect.

JirkaRybka’s picture

Sorry - my mistake: I overlooked that Pathauto already have a sort of transliteration. But still, it might be good for everyone to have one solution for all modules.

sun’s picture

@JirkaRybka: Transliteration of paths/urls and tokens is another issue and probably a feature. As mentioned before in #17, global transliteration of all urls is not needed if reserved characters were replaced and non-ascii-characters were properly escaped (which also allows urls to be ranked better in search engines). Thus, url() would have to support transliteration optionally as a site setting. Of course, if file_translit works out for filenames, it could do the same for urls. However, it is based on the php utf-8 project and there are several other transliteration approaches that may work better for certain languages, f.e. Chinese. If you'd ask me, we should allow any module to alter any url before it is returned via module_invoke_all(). However, I repeat, that's another issue.

JirkaRybka’s picture

Ok with me, I just wonder if it's a good idea for core to provide some raw string-transliteration function, as to avoid code duplication in modules.

webchick’s picture

I agree w/ JirkaRybka... a "cleansing" string has a lot of uses outside of files. So it'd be nice if it were called something like drupal_sanitize_utf8_string() or something.

sun’s picture

Title: Filenames should be cleaned of non-alphanumeric characters » Files/URLs are not accessible if they contain reserved or non-ascii characters
Version: 7.x-dev » 6.x-dev
Component: upload.module » base system
Category: feature » bug

First of all, this issue rather applies to non-English sites only.
So, I'd suggest to let this whole thing evolve in contrib:

1) rename 'file_translit' module to 'sanitize_utf8' module.

2) add support for utf8 character escaping to this module (to support URLs).

3) regarding URL support, add support for internationalized domain names (IDNs) to this module.

4) add two functions that other modules can invoke to either escape or transliterate a passed string.

5) optional: Enhance the module to support different transliteration projects, so the existing php utf-8 would become one of the choices. (see reason in my previous comments).

6) finally, add a module hook to core's url() function, either by calling module_invoke('sanitize_utf8', 'url', $url) directly or by calling module_invoke_all('url', $url) (which could be a performance issue if we do not cache the list of implementations via module_hook() in a static variable).

Since this is a bug, we should try to get 6) into D6.

smk-ka has a wide experience on this complex topic, so it would be great if he would point out a direction to go.

smk-ka’s picture

Some notes:
D's whole workflow actually is fully UTF-8 aware. This means there should be no missing or bad escaping 'bug' regarding url generation, as urlencode() already takes care of (ie. escapes) invalid characters. The real problem is that PHP's file functions and/or the underlying OS are not able to create filenames that contain UTF-8 sequences.

To test create a text file and save it with a name that contains characters that are outside of your system's native (OEM) encoding. If you're on a latin system use cyrillic characters, for example: русском.txt. Attach it to a node and check your files folder, where the file has been created but with UTF-8 sequences showing up in your OS' native encoding (at least on Windows).

Interestingly, if we turn on private files then the file is written *and* read by PHP, thus the effect is nixed and the file can be successfully downloaded.

With an approach similar to private files, it should be possible to store the file using a MD5 hash of its name, but transmit it back to the client with its original name, since the two operations are completely independent form each other and the original filename would still be in the database.

smk-ka’s picture

Oh, and the original poster mentions a file named user.module.patch @@ ^^ asdf . asdf that should cause problems. Since this file doesn't contain UTF-8 sequences, I'm not having any problems downloading it with Firefox. I suppose he's using or referring to Internet Explorer, which is (not surprisingly) a minefield of it's own. That browser seems to have a lot more problems with URIs containing anything else than pure ASCII.

smk-ka’s picture

Ouch, I just saw urls to (public) files are not generated by the url() function, instead they're manually built and are *not* urlencoded. That means the bug sun was talking about actually exists (see file_create_url()).

smk-ka’s picture

Status: Active » Needs review
FileSize
1.68 KB

Attached patch fixes the url generation.

Explanation: urlencode() is to be used to escape query parameters. However, with clean urls enabled, query parameters are becoming part of the path, but paths need to be rawurlencoded(). File urls are always paths, so we need a flag in case clean urls are disabled.

drewish’s picture

Version: 6.x-dev » 6.0-beta2

no comments on the patch but i wanted to point back up the issue and mention that i proposed adding transliteration to core and got shot down: http://drupal.org/node/110972 i'd suggest reviewing that issue before assuming you'll have much luck getting such a feature committed to core.

steinmb’s picture

Version: 6.0-beta2 » 6.x-dev

Hmmm I have tested with 26.1.2008 version of 6.x dev and it works just fine.
- Drupal running on OS X 10.5.1 in MAMP
- Verified that files are correctly created in /files dir and load in Safari 3.x, FF 2.x and FF 3 beta2.
- OK. Create file with Norwegian special characters: "stein magne bjørklund.jpg"
- OK. Create file with special characters: "user.module.patch @@ ^^ asdf . asdf . "

marcingy’s picture

Have only tested this for handling spaces but it does that job well. +1

c960657’s picture

Status: Needs review » Closed (duplicate)

Marking as duplicate of #238299: file_create_url should return valid URLs. That bug has a patch (feel free to try it out and post comments) as well as a test case.