For now, this module only handles ISO-8859-1 documents well. The problem has two sides:
1) With UTF-8, PHP's DOMDocument requires the "content-type" meta tag to be the first element in the head section in order to decode the document correctly (see the comments on the PHP manual site).
2) There are a lot of different 8-bit charsets on the web, and DOMDocument can't understand them at all.
Below is a small, dirty patch that helps me work with "windows-1251" encoded documents. It finds the document charset, converts the file to UTF-8 with iconv, and moves the "content-type" meta tag to be first in the "head" section. Maybe it can help you with these problems, but I'm sure you can do it in a better way... :-)
--- FeedsQueryPathParser.inc.orig 2010-10-22 00:50:39.000000000 +0400
+++ FeedsQueryPathParser.inc 2010-10-26 20:21:50.000000000 +0400
@@ -13,12 +13,23 @@
* Implementation of FeedsParser::parse().
*/
public function parse(FeedsImportBatch $batch, FeedsSource $source) {
- $batch->setTitle(trim(qp($batch->getRaw(), 'title')->text()));
+ $doc = @qp($batch->getRaw(), NULL, array('ignore_parser_warnings' => TRUE));
+
+ // Convert document to UTF-8
+ $ContentType = qp($doc, 'meta[http-equiv="content-type"]');
+ if ($ContentType->hasAttr('content') && preg_match('/charset=([-\w]*)/i', $ContentType->attr('content'), $matches)) {
+ $ContentType->attr('content', preg_replace('/charset=([-\w]*)/i', 'charset=utf-8', $ContentType->attr('content')));
+ qp($doc, 'meta[http-equiv="content-type"]')->remove();
+ qp($doc, 'head')->prepend($ContentType->html());
+ $doc = qp(drupal_convert_to_utf8(utf8_decode($doc->html()), $matches[1]), NULL, array('ignore_parser_warnings' => TRUE));
+ }
$this->source_config = $source->getConfigFor($this);
$this->rawXML = array_keys(array_filter($this->source_config['rawXML']));
- foreach (qp($batch->getRaw(), $this->source_config['context']) as $child) {
+ $batch->setTitle(trim(qp($doc, 'title')->text()));
+
+ foreach (qp($doc, $this->source_config['context']) as $child) {
$parsed_item = array();
foreach ($this->source_config['sources'] as $source => $query) {
$parsed_item[$source] = $this->parseSourceElement($child, $query, $source);
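
For anyone who wants to try the same idea outside of QueryPath, here is a rough sketch in plain PHP of the three steps the patch takes (find the charset, convert to UTF-8 with iconv, put the "content-type" meta tag first in "head"). The function name example_html_to_utf8() is made up for illustration and is not part of Feeds or QueryPath. It assumes the iconv extension is available and uses regular expressions instead of a real parser, so treat it as a starting point only.

```php
<?php
/**
 * Hypothetical helper (not part of Feeds or QueryPath): normalize an HTML
 * string to UTF-8 so that DOMDocument has a chance to decode it correctly.
 */
function example_html_to_utf8($html) {
  // Look for a declared charset, e.g.
  // <meta http-equiv="content-type" content="text/html; charset=windows-1251">.
  if (!preg_match('/<meta[^>]+charset=["\']?([-\w]+)/i', $html, $matches)) {
    // No charset declared: leave the document untouched.
    return $html;
  }
  $charset = $matches[1];

  // Convert the raw bytes unless the document already declares UTF-8.
  if (strcasecmp($charset, 'utf-8') !== 0) {
    $converted = @iconv($charset, 'UTF-8', $html);
    if ($converted === FALSE) {
      // Unknown charset or invalid bytes: give up rather than corrupt the text.
      return $html;
    }
    $html = $converted;
  }

  // Remove the old charset declaration(s) and put a single UTF-8
  // "content-type" meta tag directly after the opening <head> tag.
  $html = preg_replace('/<meta[^>]+charset=[^>]*>/i', '', $html);
  $meta = '<meta http-equiv="content-type" content="text/html; charset=utf-8">';
  return preg_replace('/<head\b[^>]*>/i', '$0' . $meta, $html, 1);
}
```

The result can then be handed to qp() or DOMDocument::loadHTML() as usual. Documents without any charset declaration are returned unchanged, which is the same behaviour as the patch above.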
PS: This module is awesome! :-)
Comments
Comment #1
skylord commented
BTW, "'ignore_parser_warnings' => TRUE" is strongly needed when parsing HTML. htmlqp() in version 2.1 of QueryPath uses it by default.
Comment #2
twistor commented
Sorry for the ridiculous delay, this hasn't been on my radar much. I haven't run into this problem myself, but I've committed the patch, since the conversion will only run if you do hit it anyway. Thanks!
http://drupal.org/cvs?commit=495210