I'm getting ‚ or ’ characters in my feed. I understand this is a character encoding issue.

However, the feed I am pulling from is declared as "iso-8859-1". Is this not an acceptable character set? If not, is there a way to convert it using Feeds?

For example: http://www.medworm.com/rss/userss.php?qu=PCOS&journals=on

<?xml version="1.0" encoding="iso-8859-1"?>
<!-- generator="FeedCreator 1.7.2" -->
<rss version="2.0">

The output looks like this (in a view): http://www.pcosvancouver.com/research

My basic feed importer is set as:
HTTP Fetcher
Download content from a URL. - Auto detect feeds
Common syndication parser - Parse XML feeds in RSS 1, RSS 2 and Atom format.
Node processor
Text Format - Full HTML
Replace existing nodes
Nodes Never Expire

Thank you for any assistance. I am not sure if this is a bug, or just a noob error.

Comments

peem83’s picture

I have the same issue when importing nodes. My CSV contains 'ë' characters. When I change the CSV file's encoding, the import creates nodes with empty titles.

HunterElliott’s picture

I believe that for characters outside the plain ASCII range (accented letters, curly quotes and so on) to import properly, you must save or export your feed as a UTF-8 file.

As an example, export something from Excel that has these characters as a regular CSV, then reopen the file in Excel. You'll see they're all garbage characters now. Then export the original file as a Unicode Text file; your characters should show properly.

(you can also just open these exported files in Notepad or some other plain-text editor)
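
If re-saving the file by hand is not practical, the same conversion can be scripted before the import. The following is a minimal sketch, not part of Feeds: the function name csv_to_utf8 is hypothetical, and the default source encoding (Windows-1252) is only an assumption about how the file was saved.

```php
<?php
/**
 * Returns $data as UTF-8, converting from $from only when needed.
 *
 * Hypothetical helper for illustration. $from is an assumption about how the
 * file was saved and must be adjusted to the real source encoding.
 */
function csv_to_utf8($data, $from = 'Windows-1252') {
  // Empty input and bytes that already form valid UTF-8 are returned as-is,
  // so running the helper twice does not double-encode.
  if ($data === '' || preg_match('/^./us', $data) === 1) {
    return $data;
  }
  $converted = iconv($from, 'UTF-8', $data);
  // iconv() returns FALSE on failure; fall back to the untouched input.
  return $converted === FALSE ? $data : $converted;
}

// "Zoë" saved as Windows-1252: ë is the single byte 0xEB.
$legacy = "title\nZo\xEB\n";
$utf8 = csv_to_utf8($legacy);
// $utf8 now holds ë as the two UTF-8 bytes 0xC3 0xAB.
```

Within a Drupal 7 site, drupal_convert_to_utf8() in includes/unicode.inc performs the same kind of conversion, so a standalone helper like this is mainly useful for preparing files outside Drupal.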

colle901’s picture

For CSV files, converting to UTF-8 is simple to do and works for me. However, is there a solution for XML feeds from external sites, where I have no control over the supplied character encoding?

xaqrox’s picture

Status: Active » Closed (duplicate)
xaqrox’s picture

Issue summary: View changes

Thank you message added

erwangel’s picture

Issue summary: View changes
Status: Closed (duplicate) » Active

I'm reopening this issue because the issue it was closed as a duplicate of is only about CSV import, whereas this problem is generic to all importers.

Here is my case:
symptom: the same as the one that initiated the issue (accented characters like "é", "ù", etc. come out as "Ã©", "Ã¹" after a Feeds import)
collateral problem: Feeds Tamper could not preg_match strings (filter words)
cause/origin: the incoming feed declared the encoding iso-8859-1 (<?xml version="1.0" encoding="iso-8859-1"?>) while the server's HTTP header said utf-8 (Content-Type: application/rss+xml; charset=utf-8)
solution: convert the feed to UTF-8 if its real encoding is not already UTF-8, and correct the declared encoding to match
So in common_syndication_parser.inc, I added the following at the top of the parse function:

function common_syndication_parser_parse($string) {
  // Get the encoding announced in the XML declaration (default: UTF-8).
  // The "?" of "<?xml" must be escaped, otherwise it makes the "<" optional.
  $rx = '/<\?xml[^>]*encoding=[\'"]([^\'"]+)[\'"][^>]*\?>/i';
  if (preg_match($rx, $string, $m)) {
    $encoding = strtoupper($m[1]);
  }
  else {
    $encoding = 'UTF-8';
  }

  // Get the "real" encoding: 'UTF-8' if the bytes are valid UTF-8, else FALSE.
  $mb_encoding = mb_detect_encoding($string, 'UTF-8', TRUE);

  // $encoding was upper-cased above, so compare against 'UTF-8'.
  if ($encoding != 'UTF-8') {

    // Only convert the string if needed.
    if ($mb_encoding != 'UTF-8') {
      $string = iconv($encoding, 'UTF-8', $string);
    }

    // Change the declared encoding to UTF-8, in the XML declaration only.
    $string = preg_replace('/(<\?xml[^>]*encoding=[\'"])' . preg_quote($encoding, '/') . '([\'"])/i', '${1}UTF-8${2}', $string, 1);
  }

  @ $xml = simplexml_load_string($string, NULL, LIBXML_NOERROR | LIBXML_NOWARNING | LIBXML_NOCDATA);

Discussion: this worked for me, but it is probably not the best place to do it, as it only fixes the "Common syndication parser". Perhaps a better place is the http_request_get function in http_request.inc, after the headers are read (line 200). Also, mb_detect_encoding will not always give the right result.
Here is a similar problem, with a patch that was committed to an old version of Drupal's common.inc: "Error when importing non utf-8 feeds".
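
The idea of trusting the HTTP header over the XML declaration could be sketched roughly as below. This is only an illustration under stated assumptions, not the code used by Feeds: the function names feed_charset_from_header, feed_declared_encoding and feed_to_utf8 are hypothetical, the caller is assumed to have the raw Content-Type header value at hand, and letting the header win over the declaration is a design assumption rather than something confirmed in this thread.

```php
<?php
/**
 * Extracts the charset from a Content-Type header value, or NULL if absent.
 */
function feed_charset_from_header($content_type) {
  if (preg_match('/charset\s*=\s*"?([\w.:-]+)"?/i', (string) $content_type, $m)) {
    return strtoupper($m[1]);
  }
  return NULL;
}

/**
 * Extracts the encoding named in the XML declaration, or NULL if absent.
 */
function feed_declared_encoding($xml) {
  if (preg_match('/^\s*<\?xml[^>]*\bencoding\s*=\s*[\'"]([^\'"]+)[\'"]/i', $xml, $m)) {
    return strtoupper($m[1]);
  }
  return NULL;
}

/**
 * Returns the feed as UTF-8 with a matching XML declaration.
 *
 * Assumption: the HTTP header charset wins over the XML declaration, and
 * UTF-8 is the fallback when neither is present.
 */
function feed_to_utf8($xml, $content_type = NULL) {
  $declared = feed_declared_encoding($xml);
  $source = feed_charset_from_header($content_type);
  if ($source === NULL) {
    $source = ($declared !== NULL) ? $declared : 'UTF-8';
  }

  if ($source !== 'UTF-8') {
    $converted = @iconv($source, 'UTF-8', $xml);
    if ($converted === FALSE) {
      // Unknown charset or failed conversion: leave the feed untouched
      // rather than relabel bytes that were never converted.
      return $xml;
    }
    $xml = $converted;
  }

  // Make the declaration agree with the bytes, which are now UTF-8.
  if ($declared !== NULL && $declared !== 'UTF-8') {
    $xml = preg_replace('/^(\s*<\?xml[^>]*\bencoding\s*=\s*[\'"])[^\'"]+/i', '${1}UTF-8', $xml, 1);
  }
  return $xml;
}

// The case described above: UTF-8 bytes, but a declaration claiming iso-8859-1.
$feed = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?><rss><title>caf\xC3\xA9</title></rss>";
$fixed = feed_to_utf8($feed, 'application/rss+xml; charset=utf-8');
```

In this example the bytes are left alone and only the declaration is rewritten, so the XML parser no longer re-decodes UTF-8 as iso-8859-1. When no header charset is available, the sketch falls back to the declared encoding and converts the bytes instead.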

babusaheb.vikas’s picture

I used Feeds version 7.x-2.0-alpha8 and it works for me.
Try it with Feeds 7.x-2.0-alpha8; I hope that solves your problem.

MegaChriz’s picture

Title: ‚ or ’ character in Feed » Add support for encoding conversions for any parser

Retitling.

@erwangel
Regarding the comment in #6, have you tested if this is still an issue in the latest dev of Feeds?

MegaChriz’s picture

Marked #1916100: Font error as a duplicate.

erwangel’s picture

@MegaChriz #7: sorry for the late answer. Yes, it works fine for me now (7.x-2.0-beta2)!