In _linkchecker_status_handling(), error messages and response codes are always assumed to be ISO-8859-1. I've run into problems with error messages that contain multibyte characters (specifically Arabic text): the message is output as a string of nonsense characters.

The source encoding should be automatically detected if possible using mb_detect_encoding().

Comment | File | Size | Author
#1 | mb_error_encoding-2261795-1.patch | 1.08 KB | ben.kyriakou

Comments

ben.kyriakou’s picture


Added a patch to fix this issue: if available, the source encoding is first detected using mb_detect_encoding(). If not available, it falls back to ISO-8859-1.
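For readers without the patch at hand, the approach can be sketched roughly as follows. This is not the code from the attached patch: the function name example_error_to_utf8() and the candidate list are assumptions made for illustration. Detection with mb_detect_encoding() is heuristic, so the sketch uses strict mode and only trusts a positive UTF-8 match, falling back to ISO-8859-1 otherwise.

```php
<?php
/**
 * Hypothetical helper for illustration; not the code from the attached patch.
 *
 * Returns $message as UTF-8, or FALSE if no conversion function is available.
 */
function example_error_to_utf8($message, array $candidates = array('UTF-8')) {
  // Default assumption when nothing can be detected.
  $source = 'ISO-8859-1';

  // mbstring is an optional extension, so guard the call.
  if (function_exists('mb_detect_encoding')) {
    // Strict mode: returns FALSE when the bytes match none of the candidates.
    $detected = mb_detect_encoding($message, $candidates, TRUE);
    if ($detected !== FALSE) {
      $source = $detected;
    }
  }

  // Already UTF-8: nothing to convert.
  if ($source === 'UTF-8') {
    return $message;
  }

  // Convert with whichever extension is present.
  if (function_exists('mb_convert_encoding')) {
    return mb_convert_encoding($message, 'UTF-8', $source);
  }
  if (function_exists('iconv')) {
    return iconv($source, 'UTF-8', $message);
  }
  return FALSE;
}
```

With mbstring loaded, a message that is already valid UTF-8 passes through unchanged, while bytes that are not valid UTF-8 are treated as ISO-8859-1 and converted.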

ben.kyriakou’s picture

Status: Active » Needs review
hass’s picture

Status: Needs review » Needs work

Have you seen how https://api.drupal.org/api/drupal/includes%21unicode.inc/function/drupal... works? Your solution may work when the mb functions are available, but otherwise it will fail because these functions do not exist... :-(((

Richard Damon’s picture

The HTTP headers for the returned page (in particular the Content-Type: header) should tell you the encoding for the page. (If the encoding is omitted, it can be assumed to be ISO-8859-1, but if it is different, it should be specified.) There is no need to "guess" the encoding. I suppose adding a guess when the page doesn't define it might make sense for some "broken" sites.

hass’s picture

It's not about a header. It's your local server's error message that has an unknown encoding, and the MySQL update then fails to save the error message. I have not found any way to detect the encoding of this string yet. :-((( In my case the system was German, and this means not ISO for sure. I had a core case about this too, without any result at all.

It may be easier to open a PHP bug report in the hope of finding the root cause.

Richard Damon’s picture

I am sure the database error occurs because the data is being sent to a text field marked as UTF-8 while the data itself is not encoded as UTF-8 and thus contains illegal byte sequences. Converting from ISO-8859-1 "works" because any byte stream is a valid ISO-8859-1 character stream, so the result will always be valid UTF-8.
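A small sketch of that point, assuming the mbstring extension is loaded; the function name example_misread_as_latin1() is made up for illustration. It takes bytes that are really UTF-8 Arabic, converts them as if they were ISO-8859-1, and shows that the result is valid UTF-8 (so the database accepts it) but no longer the original text.

```php
<?php
/**
 * Hypothetical helper: treats $bytes as ISO-8859-1 and converts to UTF-8.
 * Assumes the mbstring extension is available.
 */
function example_misread_as_latin1($bytes) {
  return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

if (function_exists('mb_convert_encoding')) {
  // An Arabic word, already encoded as UTF-8 (three 2-byte characters).
  $utf8_arabic = "\xD8\xAE\xD8\xB7\xD8\xA3";

  // Every input byte maps to some Latin-1 character, so this never fails.
  $misread = example_misread_as_latin1($utf8_arabic);

  // The result is well-formed UTF-8.
  var_dump(mb_check_encoding($misread, 'UTF-8'));
  // But it is not the original text: each byte became its own character.
  var_dump($misread === $utf8_arabic);
}
```

The conversion succeeds on any input, which is why the symptom is garbled output rather than an error.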

The real solution is to look at the response headers from the HTTP request, which should contain a Content-Type: header (at least if the page is not encoded as ISO-8859-1), telling the encoding of the data on the page. If it didn't have it, then the page couldn't have any character not in ISO-8859-1. (This is a standard header used in many internet protocols to define encodings).

Generally, the basic-level transfer routines do NOT check this header and convert the data, as they leave that to the final client (you may WANT the original data for some reason). I will often add a wrapper around the basic routines that parses some of the basic headers and normalizes the data (converting all data into UTF-8, for instance, regardless of the original character set).

In the response object, these headers are placed in the headers member as an associative array, so $result->headers['content-type'] will hold the header, which will normally include a field with the character encoding. You should be able to use this encoding as the source encoding for your conversion (instead of just always using ISO-8859-1).
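Pulling the charset out of a Content-Type value could look roughly like this. The function name example_charset_from_content_type() is hypothetical, and the sketch only applies when a response with headers was actually received; it cannot help when no Content-Type value exists at all.

```php
<?php
/**
 * Hypothetical helper: extracts the charset parameter from a Content-Type
 * header value, e.g. "text/html; charset=UTF-8".
 *
 * Returns $default when no charset parameter is present.
 */
function example_charset_from_content_type($content_type, $default = 'ISO-8859-1') {
  // Match ";charset=value", with optional whitespace and optional quotes.
  if (preg_match('/;\s*charset\s*=\s*"?([^";\s]+)"?/i', (string) $content_type, $matches)) {
    return $matches[1];
  }
  return $default;
}

// Example values; a real value would come from the response headers.
var_dump(example_charset_from_content_type('text/html; charset=UTF-8'));
var_dump(example_charset_from_content_type('text/html'));
```

The returned name can then be passed as the source encoding to whatever conversion routine is in use, instead of a hard-coded ISO-8859-1.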

hass’s picture

The real solution is to look at the response headers from the HTTP request, which should contain a Content-Type: header (at least if the page is not encoded as ISO-8859-1), telling the encoding of the data on the page. If it didn't have it, then the page couldn't have any character not in ISO-8859-1. (This is a standard header used in many internet protocols to define encodings).

It looks like you have not understood the root cause. The message comes from your LOCAL PHP machine, as it has NOT received any answer from a remote host. As far as I know, there is no header stating the encoding in that case.