Previously, filtering of translation strings was removed from core in this issue: http://drupal.org/node/97640

However, the resulting possibility that translation strings could contain dangerous script, etc, has not been addressed since.

Rather than filtering on input (which was clearly broken) why not simply validate using the suggested code from Steven:

#33
Steven - December 19, 2006 - 19:09

Where translations come from is a matter of trust... holding this patch because the necessary CVS scanning infrastructure is in place is kind of silly. That's like saying we should keep module installation broken because someone might post an insecure one.

Whether or not a module can 'taint core' is stupid. A module could easily leave behind PHP code in a node or block to do its bidding after it's been uninstalled too, or a module could even insert malicious translation strings itself on install.

As I already said before, the following test can be used to detect possibly bad translation strings. It should not have any false positives. But it is only a test, not a transformation, as it destroys valid HTML.

if (decode_entities($value) != decode_entities(filter_xss_admin($value))) { /* possibly bad value */ }

Either we build a scanning script for cvs, or we add this check to Drupal core when importing. I cannot guarantee that this will never give any false positives, although for the typical cases of translatable strings, it should not be a problem. It should certainly not let any XSS through.

However, I believe we should treat this as a separate issue. This patch itself fixes a bug when importing and editing locales, namely that some strings are irreversibly corrupted.

In the unlikely case that the input strings fails this validation, the translator can probably work around by using an entity, right?

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

pwolanin’s picture

Status: Active » Needs review
FileSize
2.48 KB

Here's a quick patch for 7.x. I tested the UI part, but not the import part yet.

same patch applies with offset to 6.x. A backport for 5.x would be needed.

Gerhard Killesreiter’s picture

Looks good but needs some testing with more than one translation.

Dries’s picture

pwoladin, how about a simpletest test case?

pwolanin’s picture

@Dries - I'm not sure about the import part - can certainly do one for the UI.

pwolanin’s picture

one problem - I get a huge # of exceptions for the existing locale test:

87 passes, 0 fails, 810 exceptions

Appears to be due to a stale static variable, since the exceptions all relate to a missing custom language that I created via the UI t.language = 'en-US

pwolanin’s picture

separate issue for the test exceptions: http://drupal.org/node/276153

note, we may want to restrict the list of tags further than filter_xss_admin allows . For example <img> tags are common xss vectors for IE. Is there any reason a translator would include images or tables, etc?

a possible list:

'a', 'b', 'big', 'del', 'em', 'i', 'ins', 'pre', 'q', 'small', 'span', 'strong', 'sub', 'sup', 'tt', 'ol', 'ul', 'li', 'p', 'br'

Obviously, 'a' can be problematic too, but I think there are a number of t() texts that use it.

pwolanin’s picture

patch with test and modified tag list. Not sure how to check the import - probably that should be an additional test.

pwolanin’s picture

ok, silly not to factor out that common list of tags into a single function. Attached does so and is more expansive than the list above - includes almost all tags from filter_xss_admin to avoid false positives.

Damien Tournoud’s picture

Hum. I don't see the point in validating user input here. Doesn't Drupal assume that the administrator (which is the only one that can import or edit translations by default) is always fully trusted?

Moreover, the easier attack vector on translation files seems to me not from translation strings, but directly on the plural formula, that is evaluated as PHP code without any sanitation.

pwolanin’s picture

No, I don't assume that someone uploading a translation is fully trusted, or someone on a large site whose only authority is to translate strings should be able to hijack the site. Clearly some administrative permissions require full trust- but translating the interface?

pwolanin’s picture

still applies with offset

meba’s picture

I don't like this line:

form_set_error('translations', t('The string you submitted is invalid - it may contain unsafe HTML.'));

Which string? This may be confusing not only for end users. Can you include the string with check_plain (so it's safe :)? Or at least some unique information about the string so you are able to identify it.

pwolanin’s picture

sure, good idea.

meba’s picture

FileSize
9.75 KB

I was a little bit confusing in my last message. The line I was reffering to was correct. However, the same applies for importing the PO file.

Attaching a patch which remembers strings that were skipped and presents them to the user.

However: marking this as Code needs work because during testing, I noticed that this string gets skipped:

Custom title for the block. (Empty string will use block default title, <none> will remove the title, text will cause block to use specified title.)

Due to the tag <none> so I guess the approach with tags whitelisting may not be enough. Anyway, I've (I hope that correctly) grepped Czech translation of core, which is 100 % complete, for HTML tags and here we go:

&lt;a&gt; &lt;abbr&gt; &lt;acronym&gt; &lt;br&gt; &lt;code&gt; &lt;description&gt; &lt;em&gt; &lt;h1&gt; &lt;li&gt; &lt;link&gt; &lt;none&gt; &lt;ol&gt; &lt;p&gt; &lt;pre&gt; &lt;span&gt; &lt;strong&gt; &lt;sub&gt; &lt;sup&gt; &lt;ul&gt;

But there may be much more tags in contrib...

meba’s picture

Status: Needs review » Needs work

Oops, sorry. Here is the list:

<a> <abbr> <acronym> <br> <code> <description> <em> <h1> <li> <link> <none> <ol> <p> <pre> <span> <strong> <sub> <sup> <ul>
Gábor Hojtsy’s picture

1. I think we'd better do a list of HTML tags appearing in Drupal core and contrib at this point and discuss support for those. I would say it is unlikely that more will appear. We clearly need the a tag, since inline HTML links are explicitly suggested for easier translations. meba has a great list for core. I've heard he might be doing a full contrib scan as well, so it would be great to see the bigger picture.

2. While I share Damien Tournoud's concerns on the plural formula which is converted and then evaluated as PHP code, we should tighten up where we can. We can certainly improve on validation of the plural forumla, since we know about the few characters which are allowed. So we can quite clearly define the possible parts of a plural formula and lock is down hopefully enough. That should be another issue. I've opened it at http://drupal.org/node/323155 with detailed background info and suggestions.

3. This issue is not only to be backported to 5.x and 6.x, it is also relevant for l10n_client and l10n_server, since those should also stop people submitting translations which are not valid. This whole situation of being worries about translations arose from the more commonplace translation tools, which allow anyone to contribute not just people with vetted CVS accounts.

Gábor Hojtsy’s picture

Ok, http://drupal.org/node/323155 turned out to be a non-issue / false alarm. We should be fully back at dealing with locale input sanitization on the HTML :).

Jose Reyero’s picture

I think the HTML tags whiltelist is the way to go.

Just an idea to consider: What if we get out the html tags from the English string, and use that very same tags for validation of the translated string? Is there any case where we may need different tags?

Also, better if we have a reusable validation function, so it can be used by other module/s too like the l10n collection.

meba’s picture

FileSize
1.43 KB

Attaching a list of tags I found in contrib. There are some very interesting tags :)

Please note that if a tag has several attributes, there can be any combination of them, I joined them together.

Gábor Hojtsy’s picture

Oh, wow, <?php !invoke; print l($node-> is clearly a WTF, agreed. As for the others, there are mostly acceptable tags, but there are suspicious tags, like <description> and <destination>, which seem like indicating an inline tag requirement in t()-s where some "random" XML format is generated, not HTML. How do we deal with that? We suggest people using inline tags in t() to avoid breaking it down to untranslatable half-sentences and words. This is going to be quite interesting... Hm.

meba’s picture

The PHP comes from:

!invoke; print l($node->title, $node->field_file_field[0]['filepath'], array('class' => 'media'));

jquery_media: 6.x-1.0 (1), 6.x-1.1 (1)

meba’s picture

All unusual tags:

<description>

The parent website's description; comes from the <description> element in the feed.

Used in Drupal core, og_aggregator

<destination>, <command>, <classes>:

Usage: drush [options] sync <source> <destination>

Drush module obviously :)

<basic>:

<basic validation="">

Seems like Views module

<input>:
Several modules

<netnews>

...               generated for this Netnews Link. The file will be written to
               modules/netnews/<netnews link="" name="">.trace ...

Netnews module

<script>
Several modules

<use>:

<use sitewide setting>

FB module

<current>:

<current>Failed to load user from facebook fbu=%fbu

Facebook

<hx>:

Create table of Contents based on <hx></hx> tags.

<empty>:
CCK, Fieldreference, uc_cart_widget

Gábor Hojtsy’s picture

Jose: validating against the source string does not really work when you import a .po file since then you have source and target strings as "user input", so the source can be tainted as well. When does that show up anywhere on the UI then is good question though.

Gábor Hojtsy’s picture

meba: these look like things we can tell people to just move out of t(), since they are not stuff to actually translate. Then we could use the rest of the list for the whitelisting filtering code.

pwolanin’s picture

It seems that there should never be literal brackets (< or >) in translatable strings - they need to be written as &lt;

Gábor Hojtsy’s picture

Erm, no. There should be HTML links for example in translatable text to preserve context for the text of the link for translation.

pwolanin’s picture

new issue to fix <none> and the like in core: http://drupal.org/node/329998

Note - this is done correctly in block module itself, it's just the install file's schema description that is wrong.

pwolanin’s picture

@Gabor - to clarify my comment about brackets - I meant that if the brackets were part of the text (not actual tags) they should be written that way.

pwolanin’s picture

Status: Needs work » Needs review
FileSize
6.42 KB

Since #329998: Remove all unescaped <> chars from t() strings in core was committed, here's a re-roll of the patch from #8 fixing the error text as suggested to include the escaped string which was rejected.

pwolanin’s picture

whoops - in a stroke of irony, the new assertions in locale.test are missing t().

Also added additional test for the validation of strings when imported from a file, and refactored the checks into a single helper function that can be re-used elsewhere.

Gábor Hojtsy’s picture

Status: Needs review » Needs work

Looks good to me. Passes test. It does validate strings on UI submission, file import and provides an API function which modules such as l10n_server and l10n_client can use to do the same. I was thinking of two remaining things:

- It might be better to put the validation into _locale_import_one_string_db() since that is the last stop before the database. That would help people reuse that function to get prevented.
- _locale_import_one_string_db() might be improved then to also count 'refused' strings, which would be another possibility in $report, so that we can report refused strings on import as well. That would help people debug bad translations. Some modules might still use disallowed tags, so that would help us migrate to a safer localization environment with contributed modules. So in short, we should not be entirely silent about dropped strings. Let the user debug (and know we protected them :).

pwolanin’s picture

@Gabor - I think for D7 that would be a good idea- but I want something we can backport to 6 quickly - is something with that level of API change going to be ok for D6?

Gábor Hojtsy’s picture

It is a security improvement, so I think a little bit of API change is acceptable in this case. It is really just:

- a new key in the $report array, existing ones are not changed
- a new message added to the number of things imported (I'd prefer to concatenate a new sentence, instead of modifying the existing one, so that we keep the existing); maybe modify the existing one in Drupal 7, add a new string on Drupal 6

Otherwise I have no issues with the patch, so keep up the good work!

pwolanin’s picture

ok, how about this?

pwolanin’s picture

here's a better patch using format_plural() and a more robust test for form errors.

Gábor Hojtsy’s picture

Status: Needs work » Needs review

Looks great. The only point I have to say now, is to move the comment from the top of locale_string_is_safe() to the phpdoc above it, so it appears better in API docs. That is minor, so setting back for review. As long as tests pass, I am all for this patch.

pwolanin’s picture

Status: Needs review » Needs work
FileSize
12.41 KB

ok, moved comments.

pwolanin’s picture

Status: Needs work » Needs review
pwolanin’s picture

further improved code comment.

Dries’s picture

Status: Needs review » Fixed

Committed to CVS HEAD. Thanks!

pwolanin’s picture

Version: 7.x-dev » 6.x-dev
Status: Fixed » Patch (to be ported)

Thanks, Dries. We also need to backport this for 6.x and 5.x

pwolanin’s picture

Here's an intial 6.x backport (not tested yet).

pwolanin’s picture

Status: Patch (to be ported) » Needs review
pwolanin’s picture

a 5.x backposrt as well (needs testing)

Gábor Hojtsy’s picture

Version: 6.x-dev » 5.x-dev

Thanks, committed to 6.x. Since the 5.x is a bit different, it probably needs some more testing there.

hass’s picture

There is a minor typo in the committed code becuase. http://cvs.drupal.org/viewvc.py/drupal/drupal/includes/locale.inc?r1=1.1...

Damien Tournoud’s picture

This caused issues with i18n "block translation" feature, where a lot of markup should be allowed... Please see #352121: locale_string_is_safe() is not suitable for user provided text in non-default textgroups.

Gábor Hojtsy’s picture

TR’s picture

Status: Needs review » Closed (won't fix)

#40 says this was fixed in Drupal 7.x and #45 says this was fixed in Drupal 6.x. The issue remains open only to backport the change to Drupal 5.x. Since Drupal 5.x is no longer supported, I'm marking this issue as won't fix.