At the moment this module only handles ISO-8859-1 documents well. The problem has two sides:
1) For UTF-8, PHP's DOMDocument requires the "content-type" meta tag to come first in the head section in order to decode the document correctly (see the comments on the PHP manual site).
2) There are a lot of different 8-bit charsets on the web, and DOMDocument can't understand them at all.
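
To illustrate point 1, here is a minimal sketch using plain DOMDocument, outside Feeds and QueryPath. The sample markup and the variable names are invented for the example, and what the undeclared case returns may differ between libxml2 versions, so treat it as a way to check the behaviour on your own setup rather than as a reference.

```php
<?php
// Sketch only: the markup and variable names below are made up for illustration.
// "caf\xC3\xA9" is the UTF-8 byte sequence for "café".
$body = "<html><head><title>caf\xC3\xA9</title></head><body></body></html>";

// Keep libxml quiet about any markup problems while we experiment.
$previous = libxml_use_internal_errors(TRUE);

// Case 1: no charset declaration, so libxml has to guess the encoding.
$plain = new DOMDocument();
$plain->loadHTML($body);
$plainTitle = $plain->getElementsByTagName('title')->item(0)->textContent;

// Case 2: the content-type meta tag is the first thing inside <head>,
// so the charset is known before any non-ASCII text is read.
$meta = '<meta http-equiv="content-type" content="text/html; charset=utf-8">';
$declared = new DOMDocument();
$declared->loadHTML(str_replace('<head>', '<head>' . $meta, $body));
$declaredTitle = $declared->getElementsByTagName('title')->item(0)->textContent;

libxml_clear_errors();
libxml_use_internal_errors($previous);

// Compare the two results; the undeclared one may come back garbled.
var_dump($plainTitle === $declaredTitle);
```

If the two titles differ on your system, the document is being decoded with the wrong charset whenever the declaration is missing or comes too late.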

Below is a small, dirty patch that helps me work with "windows-1251" encoded documents. It finds the document charset, converts the document to UTF-8 (through drupal_convert_to_utf8(), which uses iconv when it is available) and moves the "content-type" meta tag to the first position in the "head" section. Maybe it can help you with these problems, but I'm sure you can do it in a better way. :-)

--- FeedsQueryPathParser.inc.orig       2010-10-22 00:50:39.000000000 +0400
+++ FeedsQueryPathParser.inc    2010-10-26 20:21:50.000000000 +0400
@@ -13,12 +13,23 @@
    * Implementation of FeedsParser::parse().
    */
   public function parse(FeedsImportBatch $batch, FeedsSource $source) {
-    $batch->setTitle(trim(qp($batch->getRaw(), 'title')->text()));
+    $doc = @qp($batch->getRaw(), NULL, array('ignore_parser_warnings' => TRUE));
+
+    // Convert document to UTF-8
+    $ContentType = qp($doc, 'meta[http-equiv="content-type"]');
+    if ($ContentType->hasAttr('content') && preg_match('/charset=([-\w]*)/i', $ContentType->attr('content'), $matches)) {
+      $ContentType->attr('content', preg_replace('/charset=([-\w]*)/i', 'charset=utf-8', $ContentType->attr('content')));
+      qp($doc, 'meta[http-equiv="content-type"]')->remove();
+      qp($doc, 'head')->prepend($ContentType->html());
+      $doc = qp(drupal_convert_to_utf8(utf8_decode($doc->html()), $matches[1]), NULL, array('ignore_parser_warnings' => TRUE));
+    }

     $this->source_config = $source->getConfigFor($this);
     $this->rawXML = array_keys(array_filter($this->source_config['rawXML']));

-    foreach (qp($batch->getRaw(), $this->source_config['context']) as $child) {
+    $batch->setTitle(trim(qp($doc, 'title')->text()));
+
+    foreach (qp($doc, $this->source_config['context']) as $child) {
       $parsed_item = array();
       foreach ($this->source_config['sources'] as $source => $query) {
         $parsed_item[$source] = $this->parseSourceElement($child, $query, $source);
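
For readers who want to try the idea outside Drupal, the three steps the patch performs (find the charset, convert to UTF-8, put the meta tag first) can be sketched on the raw HTML string. This is not what the patch does internally: the patch works on the QueryPath document and calls drupal_convert_to_utf8(), while the function below, example_html_to_utf8(), is a hypothetical helper that uses regular expressions and iconv() directly and assumes the iconv extension is installed.

```php
<?php
/**
 * Hypothetical helper, not part of Feeds or QueryPath: re-encode an HTML
 * string to UTF-8 and make its content-type meta tag the first child of <head>.
 */
function example_html_to_utf8($html) {
  $meta_pattern = '/<meta[^>]+http-equiv=["\']?content-type["\']?[^>]*>/i';

  // Step 1: find the declared charset; leave the document alone if there is none.
  if (!preg_match($meta_pattern, $html, $meta)
      || !preg_match('/charset=([-\w]+)/i', $meta[0], $charset)) {
    return $html;
  }

  // Step 2: convert the raw bytes to UTF-8 unless they already are UTF-8.
  if (strcasecmp($charset[1], 'utf-8') !== 0) {
    $converted = @iconv($charset[1], 'UTF-8', $html);
    if ($converted === FALSE) {
      // Unknown or unsupported charset: give up rather than corrupt the text.
      return $html;
    }
    $html = $converted;
  }

  // Step 3: drop the old meta tag and insert a UTF-8 one right after <head>.
  $html = preg_replace($meta_pattern, '', $html, 1);
  $utf8_meta = '<meta http-equiv="content-type" content="text/html; charset=utf-8" />';
  return preg_replace('/<head(\s[^>]*)?>/i', '$0' . $utf8_meta, $html, 1);
}

// Usage: a windows-1251 page whose meta tag comes after the title.
$page = "<html><head><title>\xCF\xF0\xE8\xE2\xE5\xF2</title>"
  . '<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">'
  . '</head><body></body></html>';
$fixed = example_html_to_utf8($page);
```

Working on the string before parsing avoids the round trip through the DOM, at the cost of relying on a regular expression to locate the meta tag, so unusual markup may slip through.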

P.S. This module is awesome! :-)

Comments

skylord:

BTW, the 'ignore_parser_warnings' => TRUE option is strongly recommended when parsing HTML. htmlqp() in QueryPath 2.1 uses it by default.
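
For comparison, a similar effect can be had with plain DOMDocument by asking libxml to collect errors internally. This is only a rough analogue and makes no claim about how QueryPath implements its option; the broken markup is invented for the example.

```php
<?php
// Sketch only: plain DOMDocument, not QueryPath; the broken markup is invented.
$broken = '<html><body><p>Unclosed paragraph<div>Stray </span> tag</body></html>';

// Collect libxml's complaints internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(TRUE);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($broken);

// The collected problems stay available for logging if they are wanted.
$problems = libxml_get_errors();
libxml_clear_errors();

// Restore whatever error handling was in place before.
libxml_use_internal_errors($previous);

$paragraphs = $dom->getElementsByTagName('p')->length;
```

The document is still parsed as far as libxml can recover it, which is usually what you want for real-world HTML.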

twistor:

Assigned: Unassigned » twistor
Status: Active » Fixed

Sorry for the ridiculous delay; this hasn't been much on my radar. I've committed the patch: I haven't run into the problem myself, and the new code only runs if you do anyway. Thanks!

http://drupal.org/cvs?commit=495210

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.