Error message encoding issues with multibyte characters [#2261795]

In _linkchecker_status_handling(), error messages and response codes are assumed to always be ISO-8859-1. I've experienced issues with error messages that contain multibyte characters (specifically Arabic text): the error message is output as a string of nonsense characters.

The source encoding should be automatically detected if possible using mb_detect_encoding().

Comment	File	Size	Author
#1	mb_error_encoding-2261795-1.patch	1.08 KB	ben.kyriakou

Comments

Comment #1

ben.kyriakou commented 8 May 2014 at 11:04

File	Size
mb_error_encoding-2261795-1.patch	1.08 KB

Added a patch to fix this issue: if mb_detect_encoding() is available, it is used first to detect the source encoding. If it is not available, the code falls back to ISO-8859-1.
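The approach described above can be sketched roughly as follows. This is not the code from the attached patch; the function name example_detect_source_encoding() and the candidate list are assumptions made for illustration. The function_exists() guard covers installations without the mbstring extension.

```php
<?php
/**
 * Hypothetical helper (not the patch's code): guess the source encoding of
 * an error message, falling back to ISO-8859-1.
 *
 * The candidate list is an assumption; adjust it to the encodings you expect.
 */
function example_detect_source_encoding($message, array $candidates = array('UTF-8', 'ISO-8859-1')) {
  if (function_exists('mb_detect_encoding')) {
    // Strict mode: only accept an encoding in which the whole string is valid.
    $detected = mb_detect_encoding($message, $candidates, TRUE);
    if ($detected !== FALSE) {
      return $detected;
    }
  }
  // mbstring is missing or nothing matched: keep the previous assumption.
  return 'ISO-8859-1';
}

// A lone 0xE9 byte is not valid UTF-8, so this is reported as ISO-8859-1
// whether or not mbstring is installed.
echo example_detect_source_encoding("caf\xE9"), "\n";
```

Note that detection is only a guess: many byte strings are valid in more than one encoding, so the order of the candidate list matters.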

Comment #2

ben.kyriakou commented 8 May 2014 at 11:05

Status: Active » Needs review

Comment #3

hass commented 9 May 2014 at 16:42

Status: Needs review » Needs work

Have you seen how https://api.drupal.org/api/drupal/includes%21unicode.inc/function/drupal... works? Your solution may work when the mb functions are available, but otherwise it will fail because these functions do not exist... :-(((

Comment #4

Richard Damon commented 11 May 2014 at 02:49

Looking at the HTTP headers of the returned page (in particular the Content-Type: header) should tell you the encoding of the page. (If it is omitted, the encoding can be assumed to be ISO-8859-1; if it is different, it should be specified.) There is no need to "guess" the encoding. I suppose adding a guess for the case where the header doesn't define it might make sense for some "broken" sites.

Comment #5

hass commented 11 May 2014 at 18:52

It's not about a header. It's your local server's error message that has an unknown encoding, and the MySQL update then does not save the error message. I have not yet found any way to detect the encoding of this string. :-((( In my case the system was German, which means it was not ISO for sure. I had a core case about this too, without any result at all.

Maybe it is easier to open a PHP bug report in the hope of finding the root cause.

Comment #6

Richard Damon commented 11 May 2014 at 20:50

I am sure the database error occurs because the data is being sent to a text field marked as UTF-8 while the data is not encoded as UTF-8 and thus contains illegal byte sequences. Converting from ISO-8859-1 "works" because any byte stream is a valid ISO-8859-1 character stream, so the result will always be valid UTF-8.
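A small illustration of that point, as a sketch only: the helper names below are hypothetical, and the conversion is written by hand so that it does not depend on the iconv or mbstring extensions. It shows that a byte string that is not valid UTF-8 becomes valid after an ISO-8859-1 conversion, and that text which was already UTF-8 (here the Arabic word for "error") also stays valid but no longer matches the original bytes, which would explain the nonsense characters reported in the issue.

```php
<?php
/**
 * Hypothetical helper: treat every byte as an ISO-8859-1 character and
 * re-encode it as UTF-8, without relying on any extension.
 */
function example_latin1_to_utf8($bytes) {
  $out = '';
  $length = strlen($bytes);
  for ($i = 0; $i < $length; $i++) {
    $code = ord($bytes[$i]);
    // Bytes below 0x80 are ASCII; the rest become two-byte UTF-8 sequences.
    $out .= $code < 0x80 ? $bytes[$i] : chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
  }
  return $out;
}

/**
 * Hypothetical helper: PCRE's /u modifier rejects invalid UTF-8 subjects.
 */
function example_is_valid_utf8($string) {
  return preg_match('//u', $string) === 1;
}

$latin1 = "caf\xE9";                   // "café" in ISO-8859-1.
$arabic = "\xD8\xAE\xD8\xB7\xD8\xA3";  // Arabic "error", already UTF-8.

var_dump(example_is_valid_utf8($latin1));                          // bool(false)
var_dump(example_is_valid_utf8(example_latin1_to_utf8($latin1)));  // bool(true)
var_dump(example_is_valid_utf8(example_latin1_to_utf8($arabic)));  // bool(true)
var_dump(example_latin1_to_utf8($arabic) === $arabic);             // bool(false)
```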

The real solution is to look at the response headers from the HTTP request, which should contain a Content-Type: header (at least if the page is not encoded as ISO-8859-1), telling the encoding of the data on the page. If it didn't have it, then the page couldn't have any character not in ISO-8859-1. (This is a standard header used in many internet protocols to define encodings).

Generally, the basic level transfer routines do NOT check this header and convert the data, as they leave that for the final client (you may WANT the original data for some reason). I often will add a wrapper around the basic routines that parses some of the basic headers and normalizes the data (converting all data into UTF-8, for instance, regardless of the original character set).

In the response object, these headers will be placed in the headers member as an associative array, so $result->headers['content-type'] will have the header, which will normally include a field with the character encoding. You should be able to use this encoding as the source encoding for your conversion (instead of just always using ISO-8859-1).
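For the case where a remote response is actually available, reading the charset from the Content-Type value might look like the sketch below. The function name, the regular expression, and the stand-in $result object are assumptions for illustration; they are not part of Link checker or Drupal core.

```php
<?php
/**
 * Hypothetical helper: extract the charset parameter from a Content-Type
 * header value, or return a default when none is given.
 */
function example_charset_from_content_type($content_type, $default = 'ISO-8859-1') {
  // Matches e.g. "text/html; charset=utf-8" or 'text/html; charset="utf-8"'.
  if (preg_match('/;\s*charset\s*=\s*"?([^";\s]+)"?/i', $content_type, $matches)) {
    return strtoupper($matches[1]);
  }
  return $default;
}

// Stand-in for a response object; a real one would come from the HTTP request.
$result = new stdClass();
$result->headers = array('content-type' => 'text/html; charset=utf-8');

$source_encoding = isset($result->headers['content-type'])
  ? example_charset_from_content_type($result->headers['content-type'])
  : 'ISO-8859-1';
echo $source_encoding, "\n";
```

This only helps when a response with headers exists; it does not cover messages generated locally without any remote answer.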

Comment #7

hass commented 13 May 2014 at 09:16

The real solution is to look at the response headers from the HTTP request, which should contain a Content-Type: header (at least if the page is not encoded as ISO-8859-1), telling the encoding of the data on the page. If it didn't have it, then the page couldn't have any character not in ISO-8859-1. (This is a standard header used in many internet protocols to define encodings).

It looks like you have not understood the root cause. The message comes from your LOCAL PHP machine, as it has NOT received any answer from a remote host. As far as I know, there is no header with the data type.