Of course, this requires the mbstring extension, but the last time I checked, UTF-8 handling was completely hosed without it, anyway.

(Patch is actually against D7, but I hope will apply against D8 as well.)

Comments

pillarsdotnet’s picture

Status: Active » Needs review

Do we have a D8 testbot yet?

pillarsdotnet’s picture

Version: 8.x-dev » 7.x-dev
StatusFileSize
new543 bytes

Let's test against the D7 testbot, then switch back to D8.

pillarsdotnet’s picture

Version: 7.x-dev » 8.x-dev
pillarsdotnet’s picture

Version: 8.x-dev » 7.x-dev

Hmm.. Maybe I've got to leave it at 7.x-dev until the testbot actually tests?

Status: Needs review » Needs work

The last submitted patch, check_plain-mbstring.patch, failed testing.

pillarsdotnet’s picture

Version: 7.x-dev » 8.x-dev
heine’s picture

Can you describe the problem background? Why do you have invalid bytesequences in your strings?

pillarsdotnet’s picture

Title: What is the proper way to sanitize untrusted user input? » Eliminate "Invalid multibyte sequence" errors from check_plain()
Category: support » feature
Status: Active » Needs work

It varies. For the getid3 module, it was because the author's name (as encoded in his sample jpeg file) contained invalid characters. Sometimes it comes from user input (usually spam), or from erroneous web requests that are currently handled by the Search404 module. The problem is, get_plain is used to sanitize text from untrusted sources, yet it produces runtime errors when those untrusted sources prove ... untrustworthy. The above patch works for me. I realize that my version handles invalid input gracefully whereas the original function is required to abort with a runtime error, but I leave it to your judgment which is the better strategy.

heine’s picture

Status: Needs work » Closed (won't fix)

FYI there's a "do not use" remark on the ENT_IGNORE documentation.

Charset conversion should happen upstream from check_plain. Explicit warnings (that should go to the log only if Logging and errors is set) are preferred over silent fails.

pillarsdotnet’s picture

Category: feature » support
Status: Closed (won't fix) » Active

Yes, I'm aware of both the PHP documentation and Drupal policy. I believe that my code usage is safe and sane, but I understand if you disagree.

Since it appears that I'm going nowhere with the suggestion that Drupal gracefully handle errors, here is a new proposal:

If you can provide example code that would demonstrate to module authors the proper way to sanitize untrusted user input without the possibility of producing errors in the logs, I promise to file a bug report against every D7 contrib module that uses check_plain instead of following your example.

pillarsdotnet’s picture

Title: Eliminate "Invalid multibyte sequence" errors from check_plain() » What is the proper way to sanitize untrusted user input?
pillarsdotnet’s picture

Title: Eliminate "Invalid multibyte sequence" errors from check_plain() » What is the proper way to sanitize untrusted user input?
Category: feature » support
Status: Needs work » Active

FYI, the currently available documentation suggests that check_plain is the proper way to sanitize raw user input strings, even though it breaks when the text is not valid UTF-8:

From http://api.drupal.org/api/drupal/includes--common.inc/group/sanitization/7

Functions to sanitize values.

See http://drupal.org/writing-secure-code for information on writing secure code.

From http://drupal.org/writing-secure-code

No piece of user-submitted content should ever be placed as-is into HTML.

  • Use check_plain or theme('placeholder') for plain text.

See how to handle text in a secure fashion for more details.

From how to handle text in a secure fashion

When outputting plain-text, you need to pass it through check_plain() before it can be put inside HTML. This will convert quotes, ampersands and angle brackets into entities, causing the string to be shown literally on screen in the browser.

Most themeable functions and APIs take HTML for their arguments, and there are a few that automatically sanitize text by first passing it through check_plain():

pillarsdotnet’s picture

Component: base system » documentation
heine’s picture

Component: documentation » base system

check_plain escapes plaintext UTF-8 data for use in HTML. It does not convert between encoding, nor does it break on invalid bytes; it reports the issue and returns an empty string (which is great).

The getid3 library has a known bug where it misaligns fieldboundaries on read which causes it to misinterpret the specified encoding.

If you run with the above patch, your site is wide open to XSS; use the bitwise OR operator to combine flags.

pillarsdotnet’s picture

Version: 8.x-dev » 7.x-dev
Category: support » bug
Status: Active » Needs review
StatusFileSize
new543 bytes

Feeding your suggested change to the testbot...

Meanwhile, you have neatly sidestepped my proposal. May I conclude that there is no possible way, according to Drupal, so handle user input without the possibility of errors?

check_plain escapes plaintext UTF-8 data for use in HTML. It does not convert between encoding, nor does it break on invalid bytes; it reports the issue and returns an empty string (which is great).

I'm not aware of any standard that requires, e.g., TEXTAREA input from a browser to be UTF-8. Deleting an entire string on the basis of one malformed character and logging the action as an error for administrative review doesn't seem so "great" to me.

heine’s picture

You can use drupal_validate_utf8 to check if a certain string is valid utf-8. If you don't know the input encoding, you are pretty much toast.

heine’s picture

I'm not aware of any standard that requires, e.g., TEXTAREA input from a browser to be UTF-8.

There is. It is called the HTML 4.01 Specification .

Status: Needs review » Needs work

The last submitted patch, check_plain-mbstring.patch, failed testing.

pillarsdotnet’s picture

Version: 8.x-dev » 7.x-dev
Category: task » bug
Status: Active » Needs work
I'm not aware of any standard that requires, e.g., TEXTAREA input from a browser to be UTF-8.

There is. It is called the HTML 4.01 Specification .

I just read through the HTML 4.01 Specification. I believe you are referring to the following:

From http://www.w3.org/TR/html401/interact/forms.html#h-17.13.3.3

The form data set is then encoded according to the "http://www.w3.org/TR/html401/interact/forms.html#form-content-type">content type specified by the "http://www.w3.org/TR/html401/interact/forms.html#adef-enctype" class="noxref">enctype
attribute of the "einst">FORM element.

And from http://www.w3.org/TR/html401/interact/forms.html#adef-enctype

enctype = "http://www.w3.org/TR/html401/types.html#type-content-type">content-type "http://www.w3.org/TR/html401/types.html#case-insensitive">[CI]
This attribute specifies the content type used to submit the form to the server (when the value of method is "post"). The default value for this attribute is "application/x-www-form-urlencoded". The value "multipart/form-data" should be used in combination with the INPUT element, type="file".
accept-charset = charset list [CI]
This attribute specifies the list of character encodings for input data that is accepted by the server processing this form. The value is a space- and/or comma-delimited list of charset values. The client must interpret this list as an exclusive-or list, i.e., the server is able to accept any single character encoding per entity received.

The default value for this attribute is the reserved string "UNKNOWN". User agents may interpret this value as the character encoding that was used to transmit the document containing this FORM element.

Also from http://www.w3.org/TR/html401/charset.html#idx-character_encoding-7

User agents may provide a mechanism that allows users to override incorrect "charset" information. However, if a user agent offers such a mechanism, it should only offer it for browsing and not for editing, to avoid the creation of Web pages marked with an incorrect "charset" parameter.

I suppose the above may be interpreted as a requirement that all user-agents must be able to translate user-input into UTF-8, but even so, I don't think it is to Drupal's credit if it throws away input from user-agents that are not fully conformant. And as an administrator who routinely checks logs for errors, I certainly don't want to hear about every non-conformant user-agent that happens to hit my site.

pillarsdotnet’s picture

Version: 7.x-dev » 8.x-dev
Category: bug » task
Status: Needs work » Active
pillarsdotnet’s picture

Title: What is the proper way to sanitize untrusted user input? » Drupal provides no error-free way to sanitize text from untrusted sources.
Version: 7.x-dev » 8.x-dev
Category: bug » task
Status: Needs work » Active
heine’s picture

FAPI forms specify an accept-charset of UTF-8 since #194652: Specify accept-charset="UTF-8" attribute for all forms. If you get data from another source, (file, webservice, whatever) it is your duty to convert to UTF-8 before working with it.

I don't understand the 'sanitize' part of the title. drupal_convert_to_utf8 can be used to convert to UTF-8. You must know what encoding the source string is in though; If you don't know anything about the string you receive, all mb_detect_encoding can do is guess, which is exceptionally difficult on short strings.

pillarsdotnet’s picture

I don't understand the 'sanitize' part of the title.

I want a way to remove or translate non-conformant characters without throwing away the entire string and without generating error-log output.

drupal_convert_to_utf8 can be used to convert to UTF-8.

Are you saying that the following code will meet my requirements?

/**
 * Remove or translate invalid characters prior to filtering with check_plain().
 */
function drupal_sanitize_string($untrusted_string) {
  return drupal_convert_to_utf8($untrusted_string,'utf-8');
}

If so, then I just need to replace every instance of check_plain($foo) with check_plain(drupal_sanitize_string($foo)).

pillarsdotnet’s picture

You must know what encoding the source string is in though; If you don't know anything about the string you receive, all mb_detect_encoding can do is guess, which is exceptionally difficult on short strings.

I don't know anything about any string that is received from an untrusted source. Just because a user-agent claims that it sent UTF-8 doesn't mean it actually sent UTF-8. It might have lied. And I don't want my logs filled with runtime warnings generated by the assumption that user-agents never lie.

heine’s picture

Status: Active » Closed (won't fix)

We tell the browsers what they must send. If the useragent lies, or the other party (webservice) sends in an unknown encoding, you must throw away the data. If you have a string in an unknown encoding, you can't really go over the bytes saying, nah, we'll remove this, keep that and expect that the remaining bytes that happen to be valid UTF-8 sequences make _sense_.

In so far as I can see you're trying to deal with a non-issue (non-conformant browsers), a bug in getid3 and some other issue you haven't described.

pfrenssen’s picture

In case this may help others that are struggling with bad characters in strings. I've had the same problem with strings containing binary characters (binary zeroes in my case).

Try: $var = filter_var($var, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_LOW);

Or, sanitize an entire multidimensional array with:

$array = array(); // some big ass Drupal array
array_walk_recursive($array, 'filter_binary_data');

function filter_binary_data(&$var) {
  $var = filter_var($var, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_LOW);
}