It looks like the feed parser has a problem with the following RSS file.

The errors:

Dec 23 13:17:02 halk drupal: http://animals.m2osw.com|1230067022|aggregator|<hidden IP>|http://animals.m2osw.com/cron.php||0||The feed from National Geographic News seems to be broken, due to an error "XML_ERR_NAME_REQUIRED" on line 98.
Dec 23 13:17:02 halk drupal: http://animals.m2osw.com|1230067022|aggregator|<hidden IP>|http://animals.m2osw.com/cron.php||0||The feed from National Geographic News seems to be broken, due to "200 feed not parseable".

Line 98 is a <title> tag with a lone & character...

I'm attaching the file that causes problems (renamed .txt from .rss so it can be attached here).

The file is taken from: http://news.nationalgeographic.com/index.rss

Thank you.
Alexis Wilke

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

philsward’s picture

When is this going to be fixed? I would "assume" this would be a fairly simple problem to overcome, but I don't know since I'm not a PHP programmer...

A similar issue has been placed at: http://drupal.org/node/275567 which provides a small workaround for this issue.

The "XML_ERR_NAME_REQUIRED" problem seems to be caused by the "&" character within the xml feed. I'm not 100% sure, but I believe it is limited to the <title></title> area of the feed...

Is there anyway to have the aggregator do a check for the & when parsing the .xml file / rss feed and put something in it's place such as &amp; or and?

The issue: http://drupal.org/node/275567 posted above has been around since July of 08 which outlines the same problem and I am surprised this is still in the queue since it affects 5, 6 and probably 7, but I suppose it is a "minor" issue with all things considered...

AlexisWilke’s picture

philsward,

Thank you for your support. The fix in #275567: The feed from Your Site seems to be broken, because of error "XML_ERR_NAME_REQUIRED" on line 97. would work great. I will add that to my core modules. 8-)

The problem, I think, is that they consider a lone & a bug in the XML and thus in the source, not in Drupal. But I consider this a lack of support for broken feed that would otherwise work just fine. It is as if your browser was to refuse a page because there are totally broken tags. 99% of the time, your browser will manage just fine, so should Drupal with broken feeds.

Thank you.
Alexis Wilke

AlexisWilke’s picture

FileSize
547 bytes

There is a patch so you can easily fix your version too. 8-)

Note also that it is possible that the & appears in the body text too.

AlexisWilke’s picture

Okay. The old patch works in most cases, but today I ran in a stream with two words glued to an ampersand:

blah&foo

So I wrote a better implementation that will check the validity of all characters and entities. Those references have a very specific syntax defined in the XML 1.0 documentation. I use that definition to check the ampersand usage. If it breaks, then I replace the ampersand by &amp; and leave the rest alone.

Whether the entity is known, that I do not check, but so far I never had an improperly named entity.

Thank you.
Alexis Wilke

Damien Tournoud’s picture

Status: Active » Fixed

I see no reason to make the aggregator dumber. Please ask the original site to fix its feed.

Damien Tournoud’s picture

Status: Fixed » Closed (won't fix)
AlexisWilke’s picture

> I see no reason to make the aggregator dumber. Please ask the original site to fix its feed.

I though I would get that answer.

I'm sure some people will be using my patch since it is just not possible to ask really dumb people (all those millions with broken RSS feeds) to fix them.

Now I'm sad to see that Drupal will not even work as well as Outlook.

Thank you for your answer, after 6 months of waiting...
Alexis Wilke

gradrupal’s picture

The feed from another seems to be broken, because of error "XML_ERR_NAME_REQUIRED" on line 7.
Downlaoded and installed aggregator 6.1 and poormans cron, forgive me, I am a beginner to this, can use ftp and filezilla, but am no php man or xhtml, and cant understand what to do with your patch; tried a google news http, and got the above message, can you help me install your patch, gra.

njardim’s picture

This issue is still arround in version 7!

Wake up people!!!!!

We need better products!!!

njardim’s picture

AlexisWilke: Can you please post the aggregator.module file or a link to it?

Thanks

AlexisWilke’s picture

njardim,

Good idea! 8-)

You will find it here:

http://www.m2osw.com/doc-drupal-m2osw-aggregator

And I put info about all the things that I fixed and never posted here because when you read the answer in #5, I don't see the point in talking with the author of this dumb module.

Thank you.
Alexis

njardim’s picture

Alexis: Do you have a version of the patched aggregator.module for drupal 7 core?

Thanks,

njardim

AlexisWilke’s picture

njardin,

No. I'm not working with D7... Sorry.

Alexis

njardim’s picture

Tremendous effort from your part. And it should be appreciated.

Maybe you should join the core development team someday.

Regards,

njardim

lordzik’s picture

Hello,
i've tried both patches but still get an error.
Try to test your solution with this RSS (polish ministry of healthcare):
http://www.mz.gov.pl/wwwmz/rss

Is it because polish national font in title? Or "-" sign??

Thanx in advance

lordzik’s picture

Status: Closed (won't fix) » Needs work
AlexisWilke’s picture

Interesting, that RSS uses ISO-8859-2 and breaks FireFox 3.x RSS reader.

I'm pretty sure that the encoding is the problem because I think we expect UTF-8. Otherwise I can retrieve the file with wget and looking at the XML code... It seems just fine.

Now, if you read #5, I think you'll have to ask them to fix their RSS feed... 8-) Good luck with that!

I'll see whether I can get an easy fix for that one.

Thank you.
Alexis

edit:

Okay, found the problem of this one. The ISO-8859-2 is converted by Drupal to a UTF-8 file. So We're good on that one. And the conversion works just fine.

The problem left was a spurious < character (in "pojemności < 50 mg"). It should have been &lt; instead of a bare <. I added a statement in my version of the aggregator. Please, try with version 1.1 as posted here:

https://secure.m2osw.com/doc-drupal-m2osw-aggregator

File attachments are at the bottom of the page.

Dave Reid’s picture

Status: Needs work » Closed (won't fix)

Quite simply, the feed does not validate and that's where the problem is.
http://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Fnews.nationalgeo...

lordzik’s picture

Status: Closed (won't fix) » Needs work

Great! I've updated the module and now that RSS is aggregated just fine.
Thank you very much!

One thing that bothers me is that after module update Drupal says that i have to run update.php. So i've run it and among 6 SQL updates to aggregate module i get one error:

user warning: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'TYPE text' at line 1 query: ALTER TABLE i6sj_aggregator_item ALTER COLUMN link TYPE text in /path/to/drupal6/modules/aggregator/aggregator.install on line 350.

My table is like this:
mysql> desc i6sj_aggregator_item;
+-------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+----------------+
| iid | int(11) | NO | PRI | NULL | auto_increment |
| fid | int(11) | NO | MUL | 0 | |
| title | varchar(255) | NO | | | |
| link | varchar(255) | NO | | | |
| author | varchar(255) | NO | | | |
| description | longtext | NO | | NULL | |
| timestamp | int(11) | YES | MUL | NULL | |
| guid | varchar(255) | YES | | NULL | |
| nid | int(11) | YES | | NULL | |
+-------------+--------------+------+-----+---------+----------------+

What's the problem?

Best regards

Damien Tournoud’s picture

@lordzik: the problem is blindly downloading and installing a module from an untrusting source. This module has not been peer-reviewed and will not receive any security update.

lordzik’s picture

Hmmmm, so that problem with RSS parsing won't be fixed in core branch??
I understand that some RSS sources does not validate to anything but an application have to handle such problems. Even if you assume that input from user is no longer than 15 characters you still make some checks and validation to prevent problems :)

AlexisWilke’s picture

lordzik,

I told you already... Core says: "the RSS provider has to fix his feed". Go talk to your government about it! 8-)

The link can be more than 255 characters, so the update attempts to fix the problem by changing the field to type TEXT. I'm not too sure why MySQL doesn't like it. I use PostgreSQL.

Btw, the warning in #20 can be ignored. Very frankly, all the 3rd party modules are uploaded without "peer review". So they are no better than my download. If someone detects a security issue, they can contact me at once, just like the security team from here (who 99% of the time does not answer their emails.)

Anyway, I'm glad it works for you. That's a relief that someone uses my code changes!

Thank you.
Alexis

lordzik’s picture

@Alexis Wilke
I don't like the way "core" acts in this case because you can't change the world - but can adapt your code to handle it...

Can you make a new patch for core module?
And did you submit other patches for core module that you've included in your version of the module?

Again, thank you for your help.

ps for core developers: Drupal 5 did parse RSS from #15 just fine! Problem appeard after site upgrade from D5 to D6.

lordzik’s picture

And one more thing - you probably can handle an error in update process by checking if db_type is pgsql ;)

AlexisWilke’s picture

lordzik,

I do agree that the answer in #5 is not valid. Most RSS I have found, be it from large companies like National Geographic, CNN, or that Polish website, have some problems. At times, the problem is temporary (like that < problem, once that item is too old, it wouldn't appear in the RSS and thus not cause any problem.)

I have not submitted any more patches in regard to aggregator for 2 reasons:

1) From what I understand my patches for the aggregator are absolutely not welcome

2) My version is already so different that it would be very difficult for me to submit a patch (their loss, but you can use my tarball as is!)

However, I have been keeping up with all the updates from Drupal.org which is 0 updates for the last 10 releases or so. In other words, the Drupal Core version of Aggregator is perfect, isn't it?

Thank you.
Alexis

Status: Needs work » Closed (outdated)

Automatically closed because Drupal 6 is no longer supported. If the issue verifiably applies to later versions, please reopen with details and update the version.