Hi,

I've got a site running Drupal 4.6 and I'm trying to get it to parse the Russian-language feed from the BBC
(feed URL: http://newsrss.bbc.co.uk/rss/russian/russia/rss.xml).

This feed is encoded in windows-1251 rather than the UTF-8 that Drupal expects. At first the error I was getting was:
Unsupported encoding 'windows-1251'. Please install iconv, GNU recode or mbstring for PHP.
Then I installed iconv, recompiled PHP (--with-iconv), and started getting this error instead:
Could not convert XML encoding 'windows-1251' to UTF-8

I found a similar thread about import.module from a few years ago:
http://drupal.org/node/3359

It suggests this code:
if ($encoding != "utf-8") {
  $data = iconv($encoding, "utf-8", $data);
  $encoding = "utf-8";
}

I tried that but couldn't get it to work ($encoding wasn't set), so with help from Google (and this link:
http://minutillo.com/steve/weblog/2004/6/17/php-xml-and-character-encodi... )

I added this:
// Read the encoding from the XML declaration ("?" must be escaped in the pattern).
$rx = '/<\?xml.*?encoding=[\'"](.*?)[\'"].*?\?>/';
if (preg_match($rx, $data, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "utf-8";
}

and that appears to work, at least partially - I've stopped getting the encoding errors, I now get:
Failed to parse RSS feed BBC: no element found at line 23.
There is an element at line 23, though. Prior to being reencoded, it's a title tag.
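
One thing that may be worth checking (an assumption, not something confirmed for this feed): after the data is converted, the XML declaration still names windows-1251, so the declared encoding no longer matches the bytes. A minimal sketch of converting the data and rewriting the declaration together is below. feed_to_utf8() is a hypothetical helper name, not a function in aggregator.module, and it assumes PHP was built with iconv.

```php
<?php
/**
 * Sketch only: convert an XML document to UTF-8 and keep its declaration
 * in sync. feed_to_utf8() is a hypothetical helper, not Drupal API.
 * Returns the converted string, or FALSE if iconv cannot convert it.
 */
function feed_to_utf8($data) {
  // XML defaults to UTF-8 when no encoding is declared.
  $encoding = 'utf-8';

  // Read the encoding from the XML declaration, if there is one.
  if (preg_match('/^<\?xml[^>]+encoding=[\'"]([^\'"]+)[\'"]/', $data, $m)) {
    $encoding = strtolower($m[1]);
  }

  if ($encoding != 'utf-8') {
    $converted = iconv($encoding, 'utf-8', $data);
    if ($converted === FALSE) {
      return FALSE;
    }
    // Rewrite the declaration so it matches the converted bytes.
    $data = preg_replace('/^(<\?xml[^>]+encoding=)[\'"][^\'"]+[\'"]/', '$1"utf-8"', $converted);
  }
  return $data;
}

// Usage: a tiny windows-1251 document (the bytes spell a Russian word).
$xml = "<?xml version=\"1.0\" encoding=\"windows-1251\"?>\n<title>\xCF\xF0\xE8\xE2\xE5\xF2</title>";
$out = feed_to_utf8($xml);
```

The result can then be handed to the parser as UTF-8. Whether this removes the "no element found" error for the BBC feed is untested here.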

Has anyone got any ideas on how to progress further?

BTW: I've added this code at line 456, as part of the aggregator_parse_feed function in this file/version:
/* $Id: aggregator.module,v 1.233.2.8 2005/12/05 08:56:48 dries Exp $ */

Cheers,
David

Comment  File                    Size       Author
#3       common.inc.patch_2.txt  542 bytes  magico

Comments

budda

Project: Aggregator Node » Drupal core
Version: master » 4.6.8
Component: Code » aggregator.module

This report is for aggregator.module, not aggregator_node.module.

magico

Version: 4.6.8 » 4.6.9
Status: Needs work » Active

I confirm that this bug only happens in 4.6.x

magico

Status: Active » Needs review
File size: 542 bytes

The problem was in the function drupal_xml_parser_create, which I corrected based on the version in 4.7.x.
That resolved the problem, and it is now possible to import the feeds you mentioned.

Here is the patch for it!

Steven

Status: Needs review » Fixed

Committed to 4.6, thanks!

Anonymous

Status: Fixed » Closed (fixed)

mgifford

Version: 4.6.9 » 4.7.3
Status: Closed (fixed) » Needs review

I don't know why a better solution for this isn't to just check the encoding while the feed items are being saved. I added this at the top of aggregator_save_item():

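function aggregator_save_item($edit) {
  // Convert the text fields to UTF-8, letting mbstring guess the source encoding.
  $edit['title'] = mb_convert_encoding($edit['title'], "UTF-8", "auto");
  $edit['description'] = mb_convert_encoding($edit['description'], "UTF-8", "auto");
  $edit['author'] = mb_convert_encoding($edit['author'], "UTF-8", "auto");

  // ... rest of the original function unchanged ...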

Seems to work fine in recent releases of 4.6 & 4.7.

A number of feeds out there that I need to access claim to be UTF-8 but still aren't using UTF-8 characters.
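
A note on the "auto" setting: as far as I know, it expands to a detection list that depends on the mbstring language setting, so it may not recognise single-byte encodings such as windows-1251. A more conservative sketch is to leave strings that are already valid UTF-8 alone and convert the rest from a known fallback encoding. field_to_utf8() and its $fallback parameter are hypothetical names, and the windows-1251 default is only an assumption that suits the Russian feed discussed above.

```php
<?php
/**
 * Sketch only: return $text as UTF-8.
 * field_to_utf8() is a hypothetical helper; requires the mbstring extension.
 */
function field_to_utf8($text, $fallback = 'Windows-1251') {
  // Already valid UTF-8: converting again would corrupt the text.
  if (mb_check_encoding($text, 'UTF-8')) {
    return $text;
  }
  // Otherwise assume the fallback encoding and convert from it.
  return mb_convert_encoding($text, 'UTF-8', $fallback);
}

// Usage: the same Russian word, once as windows-1251 bytes and once as UTF-8.
$from_cp1251 = field_to_utf8("\xCF\xF0\xE8\xE2\xE5\xF2");
$from_utf8 = field_to_utf8("\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82");
```

The trade-off is that the fallback has to be chosen per feed; a string in some other single-byte encoding would be converted without error but come out wrong.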

Mike

magico

Status: Needs review » Closed (fixed)

Do not reopen old, solved issues without a reason.
To propose new code, create a new issue.