I'm using Drupal 7 (7.12 for now) and the Feeds module (7.x-2.0-alpha4+40-dev) along with the "7.x-1.0-beta3+4-dev" version of feeds_xpathparser. When trying to import a SlideShare RSS feed (http://www.slideshare.net/rss/user/RealDolmen/presentations) I got the following errors:
* CData section not finished <div style="width:425px" id="__ss_7979215"> <stron on line 450. Error code: 63
* PCDATA invalid Char value 11 on line 450. Error code: 9
* Opening and ending tag mismatch: embed line 449 and a on line 450. Error code: 76
* Opening and ending tag mismatch: item line 427 and strong on line 450. Error code: 76
* Opening and ending tag mismatch: channel line 3 and div on line 450. Error code: 76
* Sequence ']]>' not allowed in content on line 450. Error code: 62
* internal error on line 450. Error code: 1
* Extra content at the end of the document on line 450. Error code: 5
* Exception: There was an error parsing the XML document. in FeedsXPathParserXML->setup() (line 33 of modules\contrib\feeds_xpathparser\FeedsXPathParserXML.inc).
This is apparently caused by the character with code 11, a vertical tab, which XML 1.0 does not allow (no idea how it got in there). For now we've worked around it by applying the following patch to FeedsXPathParserXML.inc:
$doc = new DOMDocument();
$use = $this->errorStart();
+ $raw = str_replace(chr(11), '', $raw); // PATCH :: there's a problem with this strange character (char: 11) in CDATA...
$success = $doc->loadXML($raw);
unset($raw);
Can you apply the patch or do you have a better way to avoid problems with these "low number" character codes?
Comment | File | Size | Author |
---|---|---|---|
#6 | feeds_ex-strip-invalid-chars-1517642-6.patch | 1.32 KB | twistor |
#5 | feeds_ex-strip-invalid-chars-1517642-5.patch | 1.32 KB | twistor |
#3 | strip_invalid_utf8_chars-1517642-3.patch | 619 bytes | Tharna |
Comments
Comment #1
twistor commented:
It's a bug in the XML, not in the module.
I make this point because I have avoided adding things to fix people's XML.
On the bright side, there is support for Tidy, if you install it. I tested and it will fix this specific problem.
Comment #2
BenVercammen commented:
Enabling the PHP Tidy module (in php.ini) and checking the "Use Tidy" option (in the "XPath Parser Settings" > "Debug Options" group) does indeed solve the problem. Thanks!
Still, it's an XML feed from SlideShare, and I can't shake the feeling that I shouldn't be the one "jumping through hoops" to fix it.
Anyways, is it possible to add some more information to the current error message, or a suggestion to use the Tidy cleanup? I can imagine it's hard to find out what's wrong for non-developers... (It took me a while to figure it out as well)
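For readers unfamiliar with the Tidy route, the sketch below shows roughly what such a cleanup step could look like with PHP's tidy extension. The function name `example_tidy_xml()` and the option values are assumptions for illustration, not the module's actual code. When the extension is not loaded, the input is returned unchanged.

```php
<?php
/**
 * Hypothetical helper: runs raw XML through Tidy before it is parsed.
 */
function example_tidy_xml($raw) {
  // The tidy extension is optional in PHP, so check for it first.
  if (!function_exists('tidy_repair_string')) {
    return $raw;
  }
  // Assumed option set: treat input and output as XML, do not wrap lines.
  $config = array(
    'input-xml' => TRUE,
    'output-xml' => TRUE,
    'wrap' => 0,
  );
  $repaired = tidy_repair_string(trim($raw), $config, 'utf8');
  // Fall back to the original string if Tidy could not repair it.
  return is_string($repaired) ? $repaired : $raw;
}
```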
Comment #3
Tharna commented:
There are characters that are valid UTF-8 but aren't valid in XML. This patch removes those characters from the loaded XML data.
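A minimal sketch of that idea, assuming the input is valid UTF-8 and that the character ranges permitted by XML 1.0 are the ones to keep. The function name is hypothetical, and this is not the code from the attached patch.

```php
<?php
/**
 * Hypothetical helper: removes characters that XML 1.0 does not allow.
 */
function example_strip_invalid_xml_chars($raw) {
  // XML 1.0 allows tab, line feed, carriage return, and the ranges
  // U+0020-U+D7FF, U+E000-U+FFFD and U+10000-U+10FFFF. Everything else,
  // including the vertical tab (character 11), is removed.
  $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
  $clean = preg_replace($pattern, '', $raw);
  // preg_replace() returns NULL on failure, for example when the input
  // is not valid UTF-8. In that case the input is returned untouched.
  return $clean === NULL ? $raw : $clean;
}
```

With this approach the vertical tab from the original report is dropped, while tabs, line feeds, and carriage returns are kept.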
Comment #4
twistor (as a volunteer) commented:
This is still a valid feature request. Moving to the correct issue.
Comment #5
twistor (as a volunteer) commented:
This strips null bytes since those break XML decoding.
All other invalid characters should be ignored since we've added DOMDocument::recover = TRUE.
This was also fixed in feeds_xpathparser.
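The combination described here (drop null bytes, then let libxml recover from the remaining problems) might look like the following sketch. The function name and the return shape are assumptions for illustration, not the code from the patch.

```php
<?php
/**
 * Hypothetical helper: parses XML leniently and reports libxml errors.
 *
 * Returns array($doc, $errors), where $doc is a DOMDocument or NULL.
 */
function example_load_xml_leniently($raw) {
  // Null bytes are removed up front, before libxml sees the string.
  $raw = str_replace("\0", '', $raw);

  // Collect libxml errors instead of emitting PHP warnings.
  $previous = libxml_use_internal_errors(TRUE);

  $doc = new DOMDocument();
  // Ask libxml to keep parsing after errors where it can.
  $doc->recover = TRUE;
  $success = $doc->loadXML($raw);

  $errors = libxml_get_errors();
  libxml_clear_errors();
  // Restore the previous error handling mode.
  libxml_use_internal_errors($previous);

  return array($success ? $doc : NULL, $errors);
}
```

Returning the collected errors alongside the document lets the caller decide whether to log them or show them to the user.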
Comment #6
twistor (as a volunteer) commented:
With a valid test.
Comment #8
twistor (as a volunteer) commented
Comment #11
MegaChriz (as a volunteer) commented:
This has been ported by @Thangaraj Moorthi to D8, so crediting him here.