Closed (won't fix)
Project:
Drupal core
Version:
8.0.x-dev
Component:
base system
Priority:
Minor
Category:
Task
Assigned:
Unassigned
Reporter:
Created:
7 Jan 2011 at 18:30 UTC
Updated:
29 Jul 2014 at 19:16 UTC
Jump to comment: Most recent file
Comments
Comment #1
pillarsdotnet commentedDo we have a D8 testbot yet?
Comment #2
pillarsdotnet commentedLet's test against the D7 testbot, then switch back to D8.
Comment #3
pillarsdotnet commentedComment #4
pillarsdotnet commentedHmm.. Maybe I've got to leave it at 7.x-dev until the testbot actually tests?
Comment #6
pillarsdotnet commentedComment #7
heine commentedCan you describe the problem background? Why do you have invalid bytesequences in your strings?
Comment #8
pillarsdotnet commentedIt varies. For the getid3 module, it was because the author's name (as encoded in his sample jpeg file) contained invalid characters. Sometimes it comes from user input (usually spam), or from erroneous web requests that are currently handled by the Search404 module. The problem is, get_plain is used to sanitize text from untrusted sources, yet it produces runtime errors when those untrusted sources prove ... untrustworthy. The above patch works for me. I realize that my version handles invalid input gracefully whereas the original function is required to abort with a runtime error, but I leave it to your judgment which is the better strategy.
Comment #9
heine commentedFYI there's a "do not use" remark on the ENT_IGNORE documentation.
Charset conversion should happen upstream from check_plain. Explicit warnings (that should go to the log only if Logging and errors is set) are preferred over silent fails.
Comment #10
pillarsdotnet commentedYes, I'm aware of both the PHP documentation and Drupal policy. I believe that my code usage is safe and sane, but I understand if you disagree.
Since it appears that I'm going nowhere with the suggestion that Drupal gracefully handle errors, here is a new proposal:
If you can provide example code that would demonstrate to module authors the proper way to sanitize untrusted user input without the possibility of producing errors in the logs, I promise to file a bug report against every D7 contrib module that uses check_plain instead of following your example.
Comment #11
pillarsdotnet commentedComment #12
pillarsdotnet commentedFYI, the currently available documentation suggests that check_plain is the proper way to sanitize raw user input strings, even though it breaks when the text is not valid UTF-8:
From http://api.drupal.org/api/drupal/includes--common.inc/group/sanitization/7
From http://drupal.org/writing-secure-code
From how to handle text in a secure fashion
Comment #13
pillarsdotnet commentedComment #14
heine commentedcheck_plain escapes plaintext UTF-8 data for use in HTML. It does not convert between encoding, nor does it break on invalid bytes; it reports the issue and returns an empty string (which is great).
The getid3 library has a known bug where it misaligns fieldboundaries on read which causes it to misinterpret the specified encoding.
If you run with the above patch, your site is wide open to XSS; use the bitwise OR operator to combine flags.
Comment #15
pillarsdotnet commentedFeeding your suggested change to the testbot...
Meanwhile, you have neatly sidestepped my proposal. May I conclude that there is no possible way, according to Drupal, so handle user input without the possibility of errors?
I'm not aware of any standard that requires, e.g., TEXTAREA input from a browser to be UTF-8. Deleting an entire string on the basis of one malformed character and logging the action as an error for administrative review doesn't seem so "great" to me.
Comment #16
heine commentedYou can use drupal_validate_utf8 to check if a certain string is valid utf-8. If you don't know the input encoding, you are pretty much toast.
Comment #17
heine commentedThere is. It is called the HTML 4.01 Specification .
Comment #19
pillarsdotnet commentedI just read through the HTML 4.01 Specification. I believe you are referring to the following:
From http://www.w3.org/TR/html401/interact/forms.html#h-17.13.3.3
And from http://www.w3.org/TR/html401/interact/forms.html#adef-enctype
Also from http://www.w3.org/TR/html401/charset.html#idx-character_encoding-7
I suppose the above may be interpreted as a requirement that all user-agents must be able to translate user-input into UTF-8, but even so, I don't think it is to Drupal's credit if it throws away input from user-agents that are not fully conformant. And as an administrator who routinely checks logs for errors, I certainly don't want to hear about every non-conformant user-agent that happens to hit my site.
Comment #20
pillarsdotnet commentedComment #21
pillarsdotnet commentedComment #22
heine commentedFAPI forms specify an accept-charset of UTF-8 since #194652: Specify accept-charset="UTF-8" attribute for all forms. If you get data from another source, (file, webservice, whatever) it is your duty to convert to UTF-8 before working with it.
I don't understand the 'sanitize' part of the title. drupal_convert_to_utf8 can be used to convert to UTF-8. You must know what encoding the source string is in though; If you don't know anything about the string you receive, all mb_detect_encoding can do is guess, which is exceptionally difficult on short strings.
Comment #23
pillarsdotnet commentedI want a way to remove or translate non-conformant characters without throwing away the entire string and without generating error-log output.
Are you saying that the following code will meet my requirements?
If so, then I just need to replace every instance of
check_plain($foo)withcheck_plain(drupal_sanitize_string($foo)).Comment #24
pillarsdotnet commentedI don't know anything about any string that is received from an untrusted source. Just because a user-agent claims that it sent UTF-8 doesn't mean it actually sent UTF-8. It might have lied. And I don't want my logs filled with runtime warnings generated by the assumption that user-agents never lie.
Comment #25
heine commentedWe tell the browsers what they must send. If the useragent lies, or the other party (webservice) sends in an unknown encoding, you must throw away the data. If you have a string in an unknown encoding, you can't really go over the bytes saying, nah, we'll remove this, keep that and expect that the remaining bytes that happen to be valid UTF-8 sequences make _sense_.
In so far as I can see you're trying to deal with a non-issue (non-conformant browsers), a bug in getid3 and some other issue you haven't described.
Comment #26
pfrenssenIn case this may help others that are struggling with bad characters in strings. I've had the same problem with strings containing binary characters (binary zeroes in my case).
Try:
$var = filter_var($var, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_LOW);Or, sanitize an entire multidimensional array with: