Hi,
I've got a site running Drupal 4.6 and I'm trying to get it to parse the Russian-language feed from the BBC
(feed URL: http://newsrss.bbc.co.uk/rss/russian/russia/rss.xml).
This feed is encoded in windows-1251 rather than the UTF-8 that Drupal expects. At first the error I was getting was:
Unsupported encoding 'windows-1251'. Please install iconv, GNU recode or mbstring for PHP.
Then I installed iconv, recompiled PHP (--with-iconv), and started getting this error:
Could not convert XML encoding 'windows-1251' to UTF-8
I found a similar thread about import.module from a few years ago:
http://drupal.org/node/3359
It suggests this code:
  if ($encoding != "utf-8") {
    $data = iconv($encoding, "utf-8", $data);
    $encoding = "utf-8";
  }
I tried that but couldn't get it to work ($encoding wasn't set), so with help from Google (and this link:
http://minutillo.com/steve/weblog/2004/6/17/php-xml-and-character-encodi... )
I added this:
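  // Escape the first "?" so that the XML declaration is matched literally.
  $rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?>/m';
  if (preg_match($rx, $data, $m)) {
    // Lowercase so the value compares cleanly against "utf-8".
    $encoding = strtolower($m[1]);
  } else {
    $encoding = "utf-8";
  }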
and that appears to work, at least partially - I've stopped getting the encoding errors, I now get:
Failed to parse RSS feed BBC: no element found at line 23.
There is an element at line 23, though. Prior to being reencoded, it's a title tag.
Has anyone got any ideas on how to progress further?
BTW: I've added this code at line 456, as part of the aggregator_parse_feed function in this file/version:
/* $Id: aggregator.module,v 1.233.2.8 2005/12/05 08:56:48 dries Exp $ */
Cheers,
David
Comment | File | Size | Author |
---|---|---|---|
#3 | common.inc.patch_2.txt | 542 bytes | magico |
Comments
Comment #1
budda commented: This report is for Aggregator.module, not aggregator_node.module.
Comment #2
magico commented: I confirm that this bug only happens in 4.6.x.
Comment #3
magico commented: The problem was in the function drupal_xml_parser_create, which I corrected based on the 4.7.x version. That resolved the problem, and it's now possible to import the feeds you mentioned.
Here is the patch for it!
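The patch itself is the attachment listed above and is not reproduced here. As a rough sketch of the general technique discussed in this thread (read the encoding from the XML declaration, convert the data to UTF-8 with iconv, then rewrite the declaration so it no longer claims the old encoding), something like the following could work. The function names are hypothetical and this is not Drupal's actual code; it assumes the iconv extension is installed.

```php
<?php
// Hypothetical helpers for illustration only; not taken from Drupal.

/**
 * Return the encoding named in the XML declaration, or 'utf-8' if none.
 */
function sketch_detect_xml_encoding($data) {
  // The "?" is escaped so that the declaration is matched literally.
  if (preg_match('/^<\?xml[^>]+encoding=["\']([^"\']+)["\']/', $data, $m)) {
    return strtolower($m[1]);
  }
  return 'utf-8';
}

/**
 * Convert a feed to UTF-8 and make its XML declaration agree.
 * Returns the converted string, or FALSE if iconv() fails.
 */
function sketch_feed_to_utf8($data) {
  $encoding = sketch_detect_xml_encoding($data);
  if ($encoding == 'utf-8') {
    return $data;
  }
  $out = iconv($encoding, 'utf-8', $data);
  if ($out === FALSE) {
    return FALSE;
  }
  // Rewrite the declaration so the parser is not told that the
  // (now UTF-8) data is still in the original encoding.
  return preg_replace('/^(<\?xml[^>]+encoding=)["\'][^"\']+["\']/', '$1"utf-8"', $out, 1);
}
```

The last step matters because converting the bytes without updating the declaration leaves the document describing itself incorrectly, which may be one reason a parser still fails after a successful iconv() call.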
Comment #4
Steven commented: Committed to 4.6, thanks!
Comment #5
(not verified) commented:
Comment #6
mgifford commented: I don't know why a better solution for this isn't to just check the encoding while the feeds are being converted. I added this:
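  function aggregator_save_item($edit) {
    // Added: normalise the text fields to UTF-8 before the item is saved.
    $edit['title'] = mb_convert_encoding($edit['title'], "UTF-8", "auto");
    $edit['description'] = mb_convert_encoding($edit['description'], "UTF-8", "auto");
    $edit['author'] = mb_convert_encoding($edit['author'], "UTF-8", "auto");
    // ... rest of the original function unchanged ...
  }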
Seems to work fine in recent releases of 4.6 & 4.7.
A number of feeds out there that I need to access claim to be UTF-8 but still aren't actually using UTF-8 characters.
Mike
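One caveat with the "auto" source encoding: as far as I know, mbstring expands it to a detection list that depends on the mbstring.language setting, and the default list does not include windows-1251, so a feed like the BBC Russian one may not be detected correctly. A minimal alternative sketch, assuming the mbstring extension is available, is to leave strings that are already valid UTF-8 alone and convert everything else from an explicitly named fallback encoding. The function name and the default fallback below are hypothetical choices for illustration.

```php
<?php
// Hypothetical helper; the fallback encoding is an assumption that
// would need to be chosen per feed.
function sketch_ensure_utf8($text, $fallback = 'Windows-1251') {
  // Already valid UTF-8: return untouched to avoid double conversion.
  if (mb_check_encoding($text, 'UTF-8')) {
    return $text;
  }
  // Otherwise treat the bytes as being in the fallback encoding.
  return mb_convert_encoding($text, 'UTF-8', $fallback);
}
```

This is a heuristic, not a guarantee: a short string in a legacy encoding can occasionally happen to be valid UTF-8, in which case it would be passed through unconverted.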
Comment #7
magico commented: Do not reopen old, solved issues without a reason.
To propose new code, create a new issue.