By Steve Dondley on
To my dismay, I learned that Google has changed the format of its RSS news feeds. A link to a news article that was once:
news.google.com/news/url?...4456b3208be04&cid=1106225330&ei=cmdXROaFBYqcHY3bnakG
becomes
news.google.com/news/url?...4456b3208be04&cid=1106225330&ei=1mZXRKfuC8WeHNmLhKAH
a few minutes later. Notice the difference in the last part of the URL (the ei parameter). Perhaps Google is doing this on purpose?
At any rate, this is causing many duplicate entries in the aggregator, because the same article arrives with a new link each time. I think the only solution is to hack Drupal to ignore the last few characters of the URL.
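To illustrate the idea of ignoring the changing tail of the URL, here is a minimal sketch in Python rather than in Drupal's PHP. It assumes that ei is the only query parameter that varies between fetches, which is a guess based on the two links above. The function name normalize_link and the example.com links are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: "ei" is the only query parameter that varies between fetches.
VOLATILE_PARAMS = {"ei"}

def normalize_link(url):
    """Return url with the volatile query parameters removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in VOLATILE_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Hypothetical links, shaped like the two above but with made-up values.
first = "http://news.example.com/news/url?id=abc&cid=1106225330&ei=cmdXROaF"
later = "http://news.example.com/news/url?id=abc&cid=1106225330&ei=1mZXRKfu"

# Both links reduce to the same string, so they can be compared as one article.
print(normalize_link(first) == normalize_link(later))
```

Comparing normalized links instead of raw links would treat both fetches as the same article, at the risk of merging links that really do differ only in that parameter.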
Comments
Problem with Drupal?
I took a close look at Google's RSS feed. I see the following in there:
<guid isPermaLink="false">tag:news.google.com,2005:cluster=41ef6ba2</guid>
Perhaps this is what should be used?
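A rough sketch of what using the guid could look like, written in Python and not taken from aggregator.module. The feed fragment is made up (only the guid value comes from Google's feed above), and the helper name item_key is hypothetical. It prefers the guid as the identity of an item and falls back to the link when a feed has no guid.

```python
import xml.etree.ElementTree as ET

def item_key(item):
    """Prefer the item's <guid>; fall back to its <link> when there is none."""
    guid = item.findtext("guid")
    if guid and guid.strip():
        return guid.strip()
    return (item.findtext("link") or "").strip()

# Hypothetical feed fragment: the same story fetched twice, with a different link each time.
FEED = """<rss version="2.0"><channel>
<item><title>Story</title><link>http://news.example.com/a?ei=111</link>
<guid isPermaLink="false">tag:news.google.com,2005:cluster=41ef6ba2</guid></item>
<item><title>Story</title><link>http://news.example.com/a?ei=222</link>
<guid isPermaLink="false">tag:news.google.com,2005:cluster=41ef6ba2</guid></item>
</channel></rss>"""

seen = set()
unique_items = []
for item in ET.fromstring(FEED).iter("item"):
    key = item_key(item)
    if key not in seen:  # skip anything whose key was already stored
        seen.add(key)
        unique_items.append(item)

print(len(unique_items))
```

Because both items share one guid, only the first is kept even though their links differ.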
--
Get better help from Drupal's forums and read this.
Info on GUID
http://blogs.law.harvard.edu/tech/rss#ltguidgtSubelementOfLtitemgt
yes same problems with google news and aggregator.module
i've been trying to hack the code in the aggregator.module to compare the current aggregator_item title with other titles in an effort to avoid duplicate titles (regardless of url, link), but i know very little php or mysql basically, so it's muddle and bumble along and test.
i had thought this was the section of code to change, around line 894
i had thought a check for $title = $feed['title'] around the save might avoid duplicates but i didn't get it to work.
another option would be a "clean aggregator_items and related aggregator_category_items" routine that would work like the timestamp flush routine. around line 839, something like...
if it were in that section of the code, it MIGHT be something VAGUELY like...but i'm sure this is not right...and it would be time-expensive to run for every line of the aggregator data...
again, i'm groping in the dark, but still groping. i'd love help.
it strikes me that perhaps an aggcleaner module might be a better approach, running off poormanscron or cron, and certainly if this title compare were to become part of the aggregator.module there should be an option to turn it on or off for those who don't need it.
the best place for the routine is probably outside of the iterative loops if possible, perhaps after a new feed has come in and been processed but then there'd need to be a loop anyway to process it all. ok i'm rambling.
thanks to anyone who takes an interest in this.
blake
Patch posted
I posted a patch on this here:
http://drupal.org/node/61433
In addition to applying the patch, you have to create a new field in the aggregator_item table called "guid". It's a varchar with a length of 255.
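For anyone unsure how to add the field, the sketch below tries the change out against a throwaway SQLite database from Python. The three-column table is a cut-down stand-in and not the real aggregator_item schema. The ALTER TABLE statement is the part that matters; I would expect MySQL to accept the same statement, but check it against your own installation (for example, if your tables use a prefix).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumption: a cut-down stand-in for the aggregator_item table, not its real schema.
conn.execute("CREATE TABLE aggregator_item (iid INTEGER PRIMARY KEY, title TEXT, link TEXT)")

# The new field: a varchar with a length of 255.
conn.execute("ALTER TABLE aggregator_item ADD COLUMN guid VARCHAR(255)")

# List the column names to confirm that the field was added.
columns = [row[1] for row in conn.execute("PRAGMA table_info(aggregator_item)")]
print(columns)
```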
thanks...and...
i've put in your patch and made the db change and it is TONS better. thank you.
however, i am still getting some duplicates for some reason, likely that the guid changed somehow. this is nowhere near the number i got before.
so if there's anyone out there who can help me know what i need to do in order to clean my aggregator_item and aggregator_category_item tables of all duplicates based on title, keeping those with no duplicates and the latest timestamped of those that are duplicates, i'd really love to figure this out.
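to make the goal concrete, here is a sketch of that cleanup, tried against a throwaway SQLite database from Python and not against a real Drupal install. the tables are simplified stand-ins (only iid, title, timestamp and cid columns are assumed) and the rows are made up. it keeps the newest row for each title and deletes the older ones from both tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: simplified stand-ins for the two tables, with made-up rows.
conn.executescript("""
CREATE TABLE aggregator_item (iid INTEGER PRIMARY KEY, title TEXT, timestamp INTEGER);
CREATE TABLE aggregator_category_item (iid INTEGER, cid INTEGER);
INSERT INTO aggregator_item VALUES (1, 'Story A', 100), (2, 'Story A', 200), (3, 'Story B', 150);
INSERT INTO aggregator_category_item VALUES (1, 10), (2, 10), (3, 10);
""")

# Walk the items newest first; the first row seen for a title is the one to keep.
rows = conn.execute(
    "SELECT iid, title FROM aggregator_item ORDER BY timestamp DESC, iid DESC"
).fetchall()
seen_titles = set()
stale = []
for iid, title in rows:
    if title in seen_titles:
        stale.append((iid,))  # an older duplicate of a title already kept
    else:
        seen_titles.add(title)

# Remove the stale items from both tables so no category rows are left orphaned.
conn.executemany("DELETE FROM aggregator_category_item WHERE iid = ?", stale)
conn.executemany("DELETE FROM aggregator_item WHERE iid = ?", stale)
conn.commit()

remaining = [r[0] for r in conn.execute("SELECT iid FROM aggregator_item ORDER BY iid")]
print(remaining)
```

matching on title alone would also remove different articles that happen to share a headline, so something like this should be tried on a backup first.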
i'll say for myself that i've been studying mysql and php diligently and looking at lots of drupal code, but i'm not there yet.
thanks for the patch, and thanks in advance to anyone who educates me a bit. i know how to execute mysql and open the db in php, and i know how to send some queries, but some i don't follow...especially those where %s and %d are used...i've searched for info on this on the web and will be hitting the bookstores when i get enough money for one of those php mysql books. what do you recommend?
thanks,
blake
Please check your database
Can you please look in the database and see if the guid did change? Also, have you double-checked to make sure the duplicates are in fact duplicates? Maybe the headlines are identical but go to different publications (like on a wire story).
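%s is just a format code for a string. %d is a format code for an integer (a whole decimal number). These are part of Drupal's database interface, which uses printf-style codes and fills them in from the arguments passed after the query string; they are not particular to PHP or MySQL.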
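As a small illustration of those format codes, here is the same kind of substitution in Python, whose % operator happens to use the same %s and %d codes. The query text, the fid column, and the values are only for illustration and are not copied from aggregator.module.

```python
# Hypothetical query text; the table and column names are only for illustration.
template = "SELECT iid FROM aggregator_item WHERE title = '%s' AND fid = %d"

# %s takes the string and %d takes the integer, in the order they are given.
query = template % ("Some headline", 7)
print(query)
```

A real database layer also escapes the values before substituting them. Plain string formatting like this shows only the substitution and is not safe for untrusted input.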
How/where do I install patch?
Sorry for the question but not sure if I add it to the aggregator module, or what??
Jim