I have been processing web pages from RSS sources without problems.
Usually, those pages have set the charset in a meta tag like this:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
In others, however, there is no meta tag declaring the charset; the charset is defined only in the HTTP header:
Screenshot 01

I downloaded an article from one of those sites to investigate the problem.
Then I moved it to localhost and imported it using the XPath HTML parser.
The word Oñati is not displayed correctly:
Screenshot 02
Then I modified the HTML and added the meta tag with the charset. After importing again, the word was displayed correctly:
Screenshot 01

Does Drupal ignore the HTTP header charset? Internet browsers don't.
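
For illustration, here is a minimal Python sketch of the lookup order a client could use: the charset from the HTTP Content-Type header first, then a meta tag in the markup, then a default. Browsers generally give the header precedence over the meta tag. The function names and the UTF-8 default are assumptions made for this sketch; this is not how Feeds or Drupal is implemented.

```python
import re
from email.message import Message


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, None when absent


def charset_from_meta(html_bytes):
    """Return the charset declared in a <meta> tag near the top, or None."""
    match = re.search(rb'<meta[^>]+charset=["\']?([\w-]+)',
                      html_bytes[:2048], re.I)
    return match.group(1).decode("ascii").lower() if match else None


def detect_charset(content_type, html_bytes, default="utf-8"):
    """Header first, then meta tag, then the (assumed) default."""
    return (charset_from_header(content_type)
            or charset_from_meta(html_bytes)
            or default)
```

A site that sends `Content-Type: text/html; charset=utf-8` without any meta tag is resolved by the first step, which is the case described in this issue.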

Thank you all.


Comments


MegaChriz’s picture

Status: Active » Closed (duplicate)

Parsers by default expect the source to be UTF-8. If the charset is anything else, it will probably not be correctly parsed. There is already an issue open about this: #1220606: Add support for encoding conversions for any parser, so I'm closing this issue as a duplicate of that one.

I think the reason why setting the metatag helps here is that the XPath HTML parser uses Tidy to clean up the HTML.

Gotz84’s picture

But as I said in my question, the HTML is supposedly in UTF-8.
I got in touch with the site's webmaster, and he told me they send the charset in the HTTP headers, not in the page as a meta tag.
So, does Drupal ignore the header data (see the first image in my question)?

Thank you @MegaChriz

MegaChriz’s picture

Status: Closed (duplicate) » Active

Okay, that sounds like a different issue than #1220606: Add support for encoding conversions for any parser, so reopening this one.

I've glanced through the code of both http_request_get() and drupal_http_request(). The only specific thing I saw about encoding is the following line in http_request_get():

curl_setopt($download, CURLOPT_ENCODING, '');

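According to http://php.net/manual/en/function.curl-setopt.php, an empty string here means that all supported encodings are accepted. Note that this option controls the Accept-Encoding header, that is, content encodings such as gzip, not the character set. I don't know enough about cURL to know whether it does any character set conversion.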

You say you downloaded the file and moved it to localhost before importing it. Could it be that the character encoding is converted by your editor or your localhost server? Maybe it helps if you specify the encoding explicitly in your editor? Some editors allow you to set the encoding of a file.
I'm just guessing here what the cause of the issue could be. Content in UTF-8 should get imported just fine.
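
To make the distinction concrete, here is a small Python sketch (not Feeds or cURL code) that separates the two layers: the content encoding (gzip, negotiated through Accept-Encoding) and the character set used to turn the bytes into text. The function names are hypothetical.

```python
import gzip


def encode_response(text, charset="utf-8"):
    """Server side: apply the charset first, then the content encoding."""
    body = text.encode(charset)   # character set layer
    return gzip.compress(body)    # content-encoding layer (gzip)


def decode_response(wire, charset):
    """Client side: undo gzip, then decode with the charset in use."""
    body = gzip.decompress(wire)  # this is all that Accept-Encoding covers
    return body.decode(charset)   # the charset must come from elsewhere


wire = encode_response("Oñati")
print(decode_response(wire, "utf-8"))   # Oñati
print(decode_response(wire, "cp1252"))  # OÃ±ati
```

Undoing the compression returns the original bytes either way; the text only comes out right when the correct character set is applied afterwards.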

MegaChriz’s picture

To come back to your question: I don't know whether Drupal ignores the header in which the character set is specified, but I didn't find anything specific in the code that checks that header. It may be that cURL or PHP does something with that header, but I don't know anything about that.

Gotz84’s picture

Hello @MegaChriz

Sorry for the delay, I have been away for a week.

I downloaded the example news article using wget and moved it to localhost. Downloading with curl gives the same result.
Running the file -bi command in the terminal reports that the files are in UTF-8:
text/html; charset=utf-8
However, when I open the article on localhost with Firefox, the server doesn't send any charset information in the header:
Content-Type text/html
Since there is also no meta tag declaring UTF-8, the article is displayed with garbled characters.

On the original site, by contrast, the header shown in Firefox includes the charset:
Content-Type text/html; charset=utf-8
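
The garbling described here can be reproduced with a short Python sketch. It assumes the consumer falls back to Windows-1252 when no charset is declared; the actual fallback depends on the browser or parser, so treat that choice as an assumption.

```python
def misdecode(text, actual="utf-8", assumed="cp1252"):
    """Encode text with its real charset, then decode it with the wrong one."""
    return text.encode(actual).decode(assumed)


def repair(garbled, actual="utf-8", assumed="cp1252"):
    """Reverse the wrong decoding step to recover the original text."""
    return garbled.encode(assumed).decode(actual)


print(misdecode("Oñati"))   # OÃ±ati
print(repair("OÃ±ati"))     # Oñati
```

The bytes on disk are valid UTF-8 in both cases; only the charset the consumer assumes differs.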

The example article is here
Thank you

MegaChriz’s picture

Category: Bug report » Support request
Status: Active » Fixed

I have been able to reproduce the issue! Well, at first I didn't, but after fiddling somewhat with the parser settings, I finally got to the same result.

How I reproduced the issue

I could reproduce the issue using the following steps:

  1. Created importer, set XPaths:
    • Context: //body
    • Title: //h1
    • Body: //span[contains(concat(' ', @class, ' '), ' herria ')]
  2. On the parser settings, activated debug options 'Use Tidy' and - underneath 'Debug query' - 'Title' and 'Body'.
  3. Made the text field 'Tidy encoding' empty (it was 'UTF8' by default).

How to fix

To fix the issue, set the 'Tidy encoding' option to 'UTF8'. See also attached image.

Consider using the Feeds extensible parsers module in the future, which is the successor of the Feeds XPath Parser module. It also has a nicer UI.

Feel free to reopen this issue if you still encounter the same problem after having set the tidy encoding option to 'UTF8'.

Gotz84’s picture

Hello @MegaChriz,

In my case, it doesn't fix the problem.
Is the title correctly encoded in the debug query output when you import the feed?
I ran a simple test importing only this article, and nothing changed.
I enabled Tidy and set the encoding to UTF8 as in your picture (this is the default value).
I enabled debugging for the title and it was shown as:
Herriko bi puntutan trafiko neurketak egin ditu Oñatiko Udalak

I don't know why this fixes the problem for you but not for me.

Gotz84’s picture

Status: Fixed » Active

MegaChriz’s picture

Status: Active » Postponed (maintainer needs more info)

Can you provide an export of your importer? Maybe there is another configuration error in there. Or maybe it is a bug with the Xpath XML parser.

Gotz84’s picture

I made this simple test that imports only the title:

$feeds_importer = new stdClass();
$feeds_importer->disabled = FALSE; /* Edit this to true to make a default feeds_importer disabled initially */
$feeds_importer->api_version = 1;
$feeds_importer->id = 'goiena_test';
$feeds_importer->config = array(
  'name' => 'goiena_test',
  'description' => '',
  'fetcher' => array(
    'plugin_key' => 'FeedsHTTPFetcher',
    'config' => array(
      'auto_detect_feeds' => 0,
      'use_pubsubhubbub' => 0,
      'designated_hub' => '',
      'request_timeout' => '',
      'auto_scheme' => 'http',
      'accept_invalid_cert' => 1,
    ),
  ),
  'parser' => array(
    'plugin_key' => 'FeedsXPathParserHTML',
    'config' => array(
      'sources' => array(
        'xpathparser:0' => '//h1',
      ),
      'rawXML' => array(
        'xpathparser:0' => 0,
      ),
      'context' => '//html/body',
      'exp' => array(
        'errors' => 0,
        'tidy' => 1,
        'tidy_encoding' => 'UTF8',
        'debug' => array(
          'xpathparser:0' => 'xpathparser:0',
          'context' => 0,
        ),
      ),
      'allow_override' => 0,
    ),
  ),
  'processor' => array(
    'plugin_key' => 'FeedsNodeProcessor',
    'config' => array(
      'expire' => '-1',
      'author' => '1',
      'authorize' => 1,
      'mappings' => array(
        0 => array(
          'source' => 'xpathparser:0',
          'target' => 'title',
          'unique' => 1,
        ),
      ),
      'insert_new' => '1',
      'update_existing' => '0',
      'update_non_existent' => 'skip',
      'input_format' => 'plain_text',
      'skip_hash_check' => 0,
      'bundle' => 'art_culo',
      'language' => 'und',
    ),
  ),
  'content_type' => '',
  'update' => 0,
  'import_period' => '-1',
  'expire_period' => 3600,
  'import_on_create' => 1,
  'process_in_background' => 1,
);

MegaChriz’s picture

Project: Feeds » Feeds XPath Parser
Version: 7.x-2.x-dev » 7.x-1.x-dev

Thanks for the importer export. Alas, I'm still not able to reproduce your issue. Not even on a clean install. See attached image:

When searching for similar issues on the web, I came across this (the XPath parser uses the DOMXPath class for executing xpath queries):
http://stackoverflow.com/questions/8993747/php-domxpath-encoding

So it sounds like there are differences between our environments that cause the issue to appear in your case but not in mine. You could try the XPath HTML parser from Feeds extensible parsers to see if that improves anything.
The HTML page in question is a bit malformed, however:

Anyway, this seems more like a Feeds XPath Parser issue, so I'm moving this issue to that queue.
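
One workaround consistent with the observation earlier in this thread (adding the meta tag made the import work) is to inject a charset declaration into the markup before handing it to the parser. The following Python sketch only illustrates that idea; the function name is hypothetical and it assumes the charset is already known, for example from the HTTP header. It is not code from Feeds or Feeds XPath Parser.

```python
import re

# Same form as the meta tag quoted at the top of this issue.
META = b'<meta http-equiv="content-type" content="text/html; charset=%s">'


def ensure_meta_charset(html_bytes, charset="utf-8"):
    """Add a charset meta tag unless the document already declares one."""
    if re.search(rb'<meta[^>]+charset=', html_bytes, re.I):
        return html_bytes  # already declared, leave the markup untouched
    tag = META % charset.encode("ascii")
    # Insert right after the opening <head> tag when there is one.
    patched, count = re.subn(rb'(<head(?:\s[^>]*)?>)',
                             lambda m: m.group(1) + tag,
                             html_bytes, count=1, flags=re.I)
    # Without a <head> element, fall back to prepending the tag.
    return patched if count else tag + html_bytes
```

The declaration is added only when none is present, so pages that already carry a meta tag pass through unchanged.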

Gotz84’s picture


Hello @MegaChriz,

I tried to use Feeds Extensible Parsers without luck.
I selected the HTML Xpath parser as the parser, but then xpathparser is missing from the source list in the mappings.

First I made a backup of the database, then I tried several things:
- Clear caches
- Delete all previous feed importers
- Deactivate Feeds XPath Parser
- Deactivate all Feeds-related modules and then reactivate them
...
But it is still missing.

These are the modules we use and their versions (maybe there are conflicts):
Feed modules list

MegaChriz’s picture

When going from the Feeds XPath Parser module to the Feeds Extensible Parsers module, you have to redefine your mappings, as the sources will be named differently. Instead of xpathparser:0, xpathparser:1, etc. you can define the source names yourself on the parser settings (where you configure the xpaths).

Gotz84’s picture

Thank you @MegaChriz,
I thought the process was the same: first choose a bundle type, then define the mappings, and finally configure the XPath queries.
Now I see that I have to configure the XPath queries first and define the mappings last.

I tried to run a test, but I'm having problems getting the lang attribute of the html element with the HTML Xpath parser.
With the previous parser I had no problems:
Previous parser lang attribute
But now it's empty:
Extensible parser lang attribute

The XPath value is ancestor::html/@lang, but I also tried setting the context to html and using only @lang, etc., without success.
Am I doing something wrong, or is it a bug?

(By the way, I am now testing with another site that has the same problem, because Goiena has since added the meta tag with the charset attribute.)

MegaChriz’s picture

I'm not sure how to select an attribute on the root element either. Maybe this is a bug. If I save the DOMNode object that is passed to FeedsExXpathDomXpath::evaluate() using $document->saveXML($context_node); (sorry for the technical terms), then the html tag no longer has attributes.

Gotz84’s picture

OK, so I'll have to wait for a solution to one of the two problems!
Thank you for everything, @MegaChriz

Gotz84’s picture

@MegaChriz,
Should I open a new issue about the html tag attributes bug in Feeds Extensible Parsers, or can this issue be used for both problems?
Thank you

MegaChriz’s picture

@Gotz84
I think it should be a new issue. This issue should focus on fixing the encoding problem. For this problem, I do think that http://stackoverflow.com/questions/8993747/php-domxpath-encoding could be the same issue and therefore may help to come to a solution.

Perhaps this issue could be fixed with a change in Feeds (namely by creating a solution for #1220606: Add support for encoding conversions for any parser), but I'm not sure about that yet.

Gotz84’s picture

Hello @MegaChriz,
would you mind checking whether you can get the title of this article correctly?
I think the document is in UTF-8; however, the charset is set to "Windows-1252".
Although it is set to Windows-1252, browsers display the text correctly.

I tried both the "HTML Xpath parser" and the "XPath HTML parser", with and without Tidy, and with the source encoding set to utf-8, windows-1252, and so on.
I had no luck.
The title I always get is
La 37ª edición de la Regata Ingenieros Deusto ya tiene su cartel anunciador

Instead of
La 37ª edición de la Regata Ingenieros Deusto ya tiene su cartel anunciador

Can I do something? I wrote to them to ask whether setting the charset to "Windows-1252" was a mistake.
Thank you

I forgot the link to the article, sorry:
http://www.deusto.es/cs/Satellite/deusto/es/universidad-deusto/vive-deus...(Universidad+de+Deusto-Notas+de+Prensa+de+Deusto)
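
When a page declares one charset but appears to be stored in another, as suspected here, a defensive approach is to try strict UTF-8 decoding first and use the declared charset only if that fails. The Python sketch below illustrates the idea; the function name is hypothetical, and neither Feeds nor the parsers mentioned in this issue are known to do this.

```python
def decode_html(raw, declared="windows-1252"):
    """Prefer strict UTF-8; fall back to the declared charset on failure."""
    try:
        # Text that really is Windows-1252 and contains accented characters
        # is very unlikely to also be valid UTF-8, so success here is a
        # strong hint that the declaration is wrong.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(declared, errors="replace")


title = "La 37ª edición"
print(decode_html(title.encode("utf-8")))   # La 37ª edición
print(decode_html(title.encode("cp1252")))  # La 37ª edición
```

Both byte sequences decode to the same title, whichever of the two charsets the page was actually stored in.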

leolandotan’s picture

Hi guys,

I'm also having this issue, specifically the one @Gotz84 is experiencing. I'm fetching from a Spanish RSS feed.

So far I haven't found a solution.