I'm using Drupal 7 (7.12 for now) with the Feeds module (7.x-2.0-alpha4+40-dev) and the "7.x-1.0-beta3+4-dev" version of feeds_xpathparser. When trying to import a SlideShare RSS feed (http://www.slideshare.net/rss/user/RealDolmen/presentations), I got the following errors:

    * CData section not finished <div style="width:425px" id="__ss_7979215"> <stron on line 450. Error code: 63
    * PCDATA invalid Char value 11 on line 450. Error code: 9
    * Opening and ending tag mismatch: embed line 449 and a on line 450. Error code: 76
    * Opening and ending tag mismatch: item line 427 and strong on line 450. Error code: 76
    * Opening and ending tag mismatch: channel line 3 and div on line 450. Error code: 76
    * Sequence ']]>' not allowed in content on line 450. Error code: 62
    * internal error on line 450. Error code: 1
    * Extra content at the end of the document on line 450. Error code: 5
    * Exception: There was an error parsing the XML document. in FeedsXPathParserXML->setup() (line 33 of modules\contrib\feeds_xpathparser\FeedsXPathParserXML.inc).

This is apparently caused by the character with code 11, a vertical tab (no idea how it got in there). For now we've worked around it by applying the following patch to FeedsXPathParserXML.inc:

    $doc = new DOMDocument();
    $use = $this->errorStart();
+    $raw = str_replace(chr(11), '', $raw); // PATCH :: there's a problem with this strange character (char: 11) in CDATA...
    $success = $doc->loadXML($raw);
    unset($raw);

Can you apply the patch or do you have a better way to avoid problems with these "low number" character codes?


Comments

twistor

Title: PCDATA invalid Char value 11 » Strip ASCII characters below 32
Category: bug » feature
Priority: Major » Normal

It's a bug in the XML, not in the module.

I point this out because I have deliberately avoided adding workarounds for other people's broken XML.

On the bright side, the parser supports Tidy if you have it installed. I tested it, and it fixes this specific problem.

BenVercammen

Enabling the PHP Tidy extension (in php.ini) and checking the "Use Tidy" option (in the "XPath Parser Settings" > "Debug Options" group) does indeed solve the problem. Thanks!

Still, the XML comes from SlideShare, and I can't shake the feeling that I shouldn't be the one "jumping through hoops" to fix it.

Anyway, would it be possible to add more information to the current error message, or a suggestion to try the Tidy cleanup? I can imagine it's hard for non-developers to find out what's wrong. (It took me a while to figure it out as well.)

Tharna

There are characters that are valid UTF-8 but aren't valid in XML. This patch removes those characters from the loaded XML data.
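
For readers without the attached patch, the idea can be sketched roughly as follows. This is not the patch itself: the function name `example_strip_invalid_xml_chars()` is hypothetical, and the character ranges are the ones the XML 1.0 specification allows (tab, line feed, carriage return, and everything from U+0020 upward except surrogates, U+FFFE and U+FFFF).

```php
<?php

/**
 * Hypothetical helper (not part of feeds_xpathparser): removes characters
 * that are valid UTF-8 but not allowed in an XML 1.0 document.
 */
function example_strip_invalid_xml_chars($raw) {
  // Keep only the XML 1.0 "Char" ranges; everything else is dropped.
  $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
  $clean = preg_replace($pattern, '', $raw);
  // With the /u modifier, preg_replace() returns NULL if $raw is not valid
  // UTF-8. In that case hand back the input unchanged.
  return $clean === NULL ? $raw : $clean;
}

// Usage: a CDATA section containing the vertical tab (char 11) from the report.
$raw = '<item><![CDATA[Slide' . chr(11) . 'Share]]></item>';
$doc = new DOMDocument();
$success = $doc->loadXML(example_strip_invalid_xml_chars($raw));
```

Unlike the single `str_replace(chr(11), ...)` in the original post, this covers every control character below 32 that XML forbids, while leaving tabs and line breaks alone.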

twistor

Project: Feeds XPath Parser » Feeds Extensible Parsers
Component: Code » XML parser

This is still a valid feature request. Moving it to the correct issue queue.

twistor

Status: Active » Needs review

This strips null bytes, since those break XML decoding.
All other invalid characters should be ignored, since we now set DOMDocument::recover = TRUE.

This was also fixed in feeds_xpathparser.
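
A minimal sketch of that approach, assuming a standalone function rather than the module's actual parser class (the name `example_load_lenient_xml()` is hypothetical and the committed code may differ):

```php
<?php

/**
 * Hypothetical helper: loads XML leniently, roughly as described above.
 *
 * Returns a DOMDocument on success, or FALSE if nothing could be parsed.
 */
function example_load_lenient_xml($raw) {
  // Strip null bytes up front, since they break XML decoding.
  $raw = str_replace("\0", '', $raw);

  // Collect libxml errors internally instead of emitting PHP warnings.
  $previous = libxml_use_internal_errors(TRUE);

  $doc = new DOMDocument();
  // Ask libxml to try to recover from malformed input.
  $doc->recover = TRUE;
  $success = $doc->loadXML($raw);

  // Restore the previous error handling state.
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  return $success ? $doc : FALSE;
}

// Usage: a stray null byte no longer prevents the document from loading.
$doc = example_load_lenient_xml("<item>Slide\0Share</item>");
```

How much of a badly malformed document survives in recover mode is up to libxml, so it is worth checking the imported values rather than assuming the whole feed was read.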

twistor

With a valid test.

  • twistor committed 42dae37 on 7.x-1.x
    Issue #1517642 by twistor, Tharna: Strip ASCII characters below 32
    
twistor

Status: Needs review » Fixed

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

MegaChriz

This has been ported by @Thangaraj Moorthi to D8, so crediting him here.