As reported in this forum topic (near the bottom), PHP5 cannot parse UTF-8 encoded XML feeds that start with the so-called "byte order mark" (BOM), the bytes EF BB BF that most Microsoft apps fondly prefix to UTF-8 encoded files as a signature.

E.g. http://msdn.microsoft.com/rss.xml

The XML spec allows a BOM in UTF-x encoded feeds. PHP4's parser is smart enough to strip it, while PHP5 reports an "Empty document" error. The attached patch explicitly strips the BOM if present.
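
For readers without the patch at hand, here is a minimal sketch of the idea. It is not the code in bom.patch (which, per the comments, uses a regexp); the function name example_strip_utf8_bom() is hypothetical, and it uses a plain substr() comparison instead.

```php
<?php
// Hypothetical helper for illustration; not the code from bom.patch.
function example_strip_utf8_bom($data) {
  // The UTF-8 BOM is the three bytes EF BB BF at the very start of the data.
  if (substr($data, 0, 3) === "\xEF\xBB\xBF") {
    // Cast to string because older PHP versions return FALSE from substr()
    // when nothing follows the BOM.
    return (string) substr($data, 3);
  }
  // No BOM: hand the data back unchanged.
  return $data;
}
```

The stripped data would then be passed to the XML parser as usual. Only a leading BOM is removed; the same three bytes elsewhere in the feed are left alone.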

Note that even after this patch, Drupal still doesn't parse XML 100% according to spec... most notably, the following situations will fail:
- XML requires that any parser handle UTF-16 encoded XML... PHP's XML parser doesn't support this, so we would need to check for the UTF-16 BOMs (little and big endian), strip them out, then convert to UTF-8.
- XML says that external encoding information (like the HTTP Content-Type: text/xml; charset=utf-8) takes precedence over the encoding="" stuff inside the document. We currently don't check the HTTP headers at all in aggregator.module or allow the passing of external encoding info to drupal_xml_parser_create().
- XML says that if the detected and declared encoding are not equal, an error should be thrown.
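
To make the first two points concrete, here is a rough sketch of what the extra handling might look like. Both function names are hypothetical and nothing like this exists in Drupal; the conversion assumes the mbstring extension is available, and the header parsing is a simple regexp, not a full Content-Type parser.

```php
<?php
// Hypothetical helpers for illustration; neither exists in Drupal.

/**
 * Converts UTF-16 data that starts with a BOM to UTF-8, dropping the BOM.
 * Assumes the mbstring extension is loaded.
 */
function example_utf16_to_utf8($data) {
  $bom = substr($data, 0, 2);
  if ($bom === "\xFF\xFE") {
    // Little-endian BOM.
    return mb_convert_encoding((string) substr($data, 2), 'UTF-8', 'UTF-16LE');
  }
  if ($bom === "\xFE\xFF") {
    // Big-endian BOM.
    return mb_convert_encoding((string) substr($data, 2), 'UTF-8', 'UTF-16BE');
  }
  // No UTF-16 BOM: return the data untouched.
  return $data;
}

/**
 * Extracts the charset parameter from a Content-Type header value.
 * Returns NULL when the header carries no charset.
 */
function example_content_type_charset($content_type) {
  if (preg_match('/;\s*charset\s*=\s*"?([^\s";]+)"?/i', $content_type, $matches)) {
    return strtolower($matches[1]);
  }
  return NULL;
}
```

Even with something like this in place, a converted document may still declare encoding="UTF-16" in its XML declaration, which would probably need rewriting too, and the third point (raising an error on a mismatch) is not covered at all.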

In theory I could cook up a patch to make Drupal's parser 100% compliant, but aside from this issue it handles pretty much every feed out there. I very much doubt anyone would make a UTF-16 encoded feed, as it would certainly break every other PHP-based parser out there. Same for external encoding information: it's just not used, as no-one out there supports it. Heck, even MagpieRSS, the most popular parsing library, didn't support encodings at all not so long ago.

The only argument in favor is "standards compliance", but I'm reluctant to write code that will not be executed except by some weird masochist who wants to break XML parsers.

Attachment: bom.patch (1.26 KB), by Steven

Comments

Steven commented:

Title: Make Drupal parse XML 100% according to specs » Fix XML UTF-8 bom issue. Parse according to specs or not?

Better title.

Steven commented:

And for those who feel the need to induce their brain to seep out of their ears and run off to a dark and safe place, here's the relevant portion of the XML specs:
http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info

Morbus Iff commented:

Patch looks good to me. I had to do this once for AmphetaDesk many many moons ago, as versions of expat prior to 1.95.2 had the same problem (in that case, they'd cause the script to segfault, not just fail to parse). The regexp I ended up using is identical to the one in this patch. +1.

Dries commented:

Committed to HEAD and DRUPAL-4-6.

Anonymous commented: