It looks like the feed parser has a problem with the following RSS file.
The errors:
Dec 23 13:17:02 halk drupal: http://animals.m2osw.com|1230067022|aggregator|<hidden IP>|http://animals.m2osw.com/cron.php||0||The feed from National Geographic News seems to be broken, due to an error "XML_ERR_NAME_REQUIRED" on line 98.
Dec 23 13:17:02 halk drupal: http://animals.m2osw.com|1230067022|aggregator|<hidden IP>|http://animals.m2osw.com/cron.php||0||The feed from National Geographic News seems to be broken, due to "200 feed not parseable".
Line 98 is a <title> tag with a lone & character...
I'm attaching the file that causes problems (renamed .txt from .rss so it can be attached here).
The file is taken from: http://news.nationalgeographic.com/index.rss
Thank you.
Alexis Wilke
Comment | File | Size | Author |
---|---|---|---|
#4 | aggregator-fix_ampersand-6.x.patch | 1.64 KB | AlexisWilke |
#3 | aggregator-6.x-ampersand.patch | 547 bytes | AlexisWilke |
index.txt | 12.48 KB | AlexisWilke |
Comments
Comment #1
philsward CreditAttribution: philsward commentedWhen is this going to be fixed? I would "assume" this would be a fairly simple problem to overcome, but I don't know since I'm not a PHP programmer...
A similar issue has been placed at: http://drupal.org/node/275567 which provides a small workaround for this issue.
The "XML_ERR_NAME_REQUIRED" problem seems to be caused by the "&" character within the xml feed. I'm not 100% sure, but I believe it is limited to the
<title></title>
area of the feed...Is there anyway to have the aggregator do a check for the & when parsing the .xml file / rss feed and put something in it's place such as
&
orand
?The issue: http://drupal.org/node/275567 posted above has been around since July of 08 which outlines the same problem and I am surprised this is still in the queue since it affects 5, 6 and probably 7, but I suppose it is a "minor" issue with all things considered...
Comment #2
AlexisWilke CreditAttribution: AlexisWilke commentedphilsward,
Thank you for your support. The fix in #275567: The feed from Your Site seems to be broken, because of error "XML_ERR_NAME_REQUIRED" on line 97. would work great. I will add that to my core modules. 8-)
The problem, I think, is that they consider a lone & a bug in the XML and thus in the source, not in Drupal. But I consider this a lack of support for broken feed that would otherwise work just fine. It is as if your browser was to refuse a page because there are totally broken tags. 99% of the time, your browser will manage just fine, so should Drupal with broken feeds.
Thank you.
Alexis Wilke
Comment #3
AlexisWilke CreditAttribution: AlexisWilke commentedThere is a patch so you can easily fix your version too. 8-)
Note also that it is possible that the & appears in the body text too.
Comment #4
AlexisWilke CreditAttribution: AlexisWilke commentedOkay. The old patch works in most cases, but today I ran in a stream with two words glued to an ampersand:
blah&foo
So I wrote a better implementation that will check the validity of all characters and entities. Those references have a very specific syntax defined in the XML 1.0 documentation. I use that definition to check the ampersand usage. If it breaks, then I replace the ampersand by & and leave the rest alone.
Whether the entity is known, that I do not check, but so far I never had an improperly named entity.
Thank you.
Alexis Wilke
Comment #5
Damien Tournoud CreditAttribution: Damien Tournoud commentedI see no reason to make the aggregator dumber. Please ask the original site to fix its feed.
Comment #6
Damien Tournoud CreditAttribution: Damien Tournoud commentedComment #7
AlexisWilke CreditAttribution: AlexisWilke commented> I see no reason to make the aggregator dumber. Please ask the original site to fix its feed.
I though I would get that answer.
I'm sure some people will be using my patch since it is just not possible to ask really dumb people (all those millions with broken RSS feeds) to fix them.
Now I'm sad to see that Drupal will not even work as well as Outlook.
Thank you for your answer, after 6 months of waiting...
Alexis Wilke
Comment #8
gradrupal CreditAttribution: gradrupal commentedThe feed from another seems to be broken, because of error "XML_ERR_NAME_REQUIRED" on line 7.
Downlaoded and installed aggregator 6.1 and poormans cron, forgive me, I am a beginner to this, can use ftp and filezilla, but am no php man or xhtml, and cant understand what to do with your patch; tried a google news http, and got the above message, can you help me install your patch, gra.
Comment #9
njardim CreditAttribution: njardim commentedThis issue is still arround in version 7!
Wake up people!!!!!
We need better products!!!
Comment #10
njardim CreditAttribution: njardim commentedAlexisWilke: Can you please post the aggregator.module file or a link to it?
Thanks
Comment #11
AlexisWilke CreditAttribution: AlexisWilke commentednjardim,
Good idea! 8-)
You will find it here:
http://www.m2osw.com/doc-drupal-m2osw-aggregator
And I put info about all the things that I fixed and never posted here because when you read the answer in #5, I don't see the point in talking with the author of this dumb module.
Thank you.
Alexis
Comment #12
njardim CreditAttribution: njardim commentedAlexis: Do you have a version of the patched aggregator.module for drupal 7 core?
Thanks,
njardim
Comment #13
AlexisWilke CreditAttribution: AlexisWilke commentednjardin,
No. I'm not working with D7... Sorry.
Alexis
Comment #14
njardim CreditAttribution: njardim commentedTremendous effort from your part. And it should be appreciated.
Maybe you should join the core development team someday.
Regards,
njardim
Comment #15
lordzik CreditAttribution: lordzik commentedHello,
i've tried both patches but still get an error.
Try to test your solution with this RSS (polish ministry of healthcare):
http://www.mz.gov.pl/wwwmz/rss
Is it because polish national font in title? Or "-" sign??
Thanx in advance
Comment #16
lordzik CreditAttribution: lordzik commentedComment #17
AlexisWilke CreditAttribution: AlexisWilke commentedInteresting, that RSS uses ISO-8859-2 and breaks FireFox 3.x RSS reader.
I'm pretty sure that the encoding is the problem because I think we expect UTF-8. Otherwise I can retrieve the file with wget and looking at the XML code... It seems just fine.
Now, if you read #5, I think you'll have to ask them to fix their RSS feed... 8-) Good luck with that!
I'll see whether I can get an easy fix for that one.
Thank you.
Alexis
edit:
Okay, found the problem of this one. The ISO-8859-2 is converted by Drupal to a UTF-8 file. So We're good on that one. And the conversion works just fine.
The problem left was a spurious < character (in "pojemności < 50 mg"). It should have been < instead of a bare <. I added a statement in my version of the aggregator. Please, try with version 1.1 as posted here:
https://secure.m2osw.com/doc-drupal-m2osw-aggregator
File attachments are at the bottom of the page.
Comment #18
Dave ReidQuite simply, the feed does not validate and that's where the problem is.
http://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Fnews.nationalgeo...
Comment #19
lordzik CreditAttribution: lordzik commentedGreat! I've updated the module and now that RSS is aggregated just fine.
Thank you very much!
One thing that bothers me is that after module update Drupal says that i have to run update.php. So i've run it and among 6 SQL updates to aggregate module i get one error:
user warning: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'TYPE text' at line 1 query: ALTER TABLE i6sj_aggregator_item ALTER COLUMN link TYPE text in /path/to/drupal6/modules/aggregator/aggregator.install on line 350.
My table is like this:
mysql> desc i6sj_aggregator_item;
+-------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+----------------+
| iid | int(11) | NO | PRI | NULL | auto_increment |
| fid | int(11) | NO | MUL | 0 | |
| title | varchar(255) | NO | | | |
| link | varchar(255) | NO | | | |
| author | varchar(255) | NO | | | |
| description | longtext | NO | | NULL | |
| timestamp | int(11) | YES | MUL | NULL | |
| guid | varchar(255) | YES | | NULL | |
| nid | int(11) | YES | | NULL | |
+-------------+--------------+------+-----+---------+----------------+
What's the problem?
Best regards
Comment #20
Damien Tournoud CreditAttribution: Damien Tournoud commented@lordzik: the problem is blindly downloading and installing a module from an untrusting source. This module has not been peer-reviewed and will not receive any security update.
Comment #21
lordzik CreditAttribution: lordzik commentedHmmmm, so that problem with RSS parsing won't be fixed in core branch??
I understand that some RSS sources does not validate to anything but an application have to handle such problems. Even if you assume that input from user is no longer than 15 characters you still make some checks and validation to prevent problems :)
Comment #22
AlexisWilke CreditAttribution: AlexisWilke commentedlordzik,
I told you already... Core says: "the RSS provider has to fix his feed". Go talk to your government about it! 8-)
The link can be more than 255 characters, so the update attempts to fix the problem by changing the field to type TEXT. I'm not too sure why MySQL doesn't like it. I use PostgreSQL.
Btw, the warning in #20 can be ignored. Very frankly, all the 3rd party modules are uploaded without "peer review". So they are no better than my download. If someone detects a security issue, they can contact me at once, just like the security team from here (who 99% of the time does not answer their emails.)
Anyway, I'm glad it works for you. That's a relief that someone uses my code changes!
Thank you.
Alexis
Comment #23
lordzik CreditAttribution: lordzik commented@Alexis Wilke
I don't like the way "core" acts in this case because you can't change the world - but can adapt your code to handle it...
Can you make a new patch for core module?
And did you submit other patches for core module that you've included in your version of the module?
Again, thank you for your help.
ps for core developers: Drupal 5 did parse RSS from #15 just fine! Problem appeard after site upgrade from D5 to D6.
Comment #24
lordzik CreditAttribution: lordzik commentedAnd one more thing - you probably can handle an error in update process by checking if db_type is pgsql ;)
Comment #25
AlexisWilke CreditAttribution: AlexisWilke commentedlordzik,
I do agree that the answer in #5 is not valid. Most RSS I have found, be it from large companies like National Geographic, CNN, or that Polish website, have some problems. At times, the problem is temporary (like that < problem, once that item is too old, it wouldn't appear in the RSS and thus not cause any problem.)
I have not submitted any more patches in regard to aggregator for 2 reasons:
1) From what I understand my patches for the aggregator are absolutely not welcome
2) My version is already so different that it would be very difficult for me to submit a patch (their loss, but you can use my tarball as is!)
However, I have been keeping up with all the updates from Drupal.org which is 0 updates for the last 10 releases or so. In other words, the Drupal Core version of Aggregator is perfect, isn't it?
Thank you.
Alexis