The aggregator should either remove all URLs that are local to the exporting system or fix them. We get a lot of 404s on d.o for such URLs.

Comments

brianV’s picture

Assigned: Unassigned » brianV

I have a patch in the works to fix this - marking as assigned so we don't duplicate the same work.

Going to sleep on what I have done, and do some tweaking & testing tomorrow.

brianV’s picture

Status: Active » Needs review
StatusFileSize
new2.38 KB

Patch attached.

This adds a new function to aggregator.parser.inc which runs two preg_replace()'s against the item's description looking for relative links in href or src attributes, and substitutes either the domain, or the domain + path into the links as necessary.

My only concern is that thousands of feed items could be imported in a cron run in some rare cases. Running two regexes against each item could add up to a fair amount of processing time.

Please review, and send your comments. This is my first core patch, so pardon any 'newbie' mistakes :p

brianV’s picture

StatusFileSize
new2.51 KB

Ehh, because this is a core patch, attaching a copy generated from Drupal root

brianV’s picture

StatusFileSize
new2.53 KB

Updated version to comply with Drupal's coding standards.

keith.smith’s picture

Status: Needs review » Needs work

Note that comments must begin with sentence case and end in a period, per the coding standards.

brianV’s picture

Status: Needs work » Needs review
StatusFileSize
new2.54 KB

New version, with proper comment formatting.

brianV’s picture

Something similar to this could also be very useful when generating a RSS feed in Drupal. Having relative paths in a RSS feed serves no real purpose when imported on the other end.

Perhaps the function I wrote should be made more general and added to the main API, so that it can be called when an RSS feed is being generated. Thoughts?

Status: Needs review » Needs work

The last submitted patch failed testing.

brianV’s picture

Status: Needs work » Needs review
StatusFileSize
new2.12 KB

Revision 4, fixes a type (oops!)

alex_b’s picture

It would be great to get a test for this functionality... : )

I'm still a bit torn between putting this cleanup step on the input (like the patch here) or on the output (a cached filtering process in a preprocess function which has the additional benefit of being overrideable). But as long as #14104: Aggregator should support filter.module does not move, I am fine with leaving this cleanup step on the input side.

brianV’s picture

Title: Strip or fix all local URLs » Aggregator: Convert all relative URLs to absolute URLs in feed items

alex - perhaps a combination of the two may be of some value.

This code to fix the URLs is specific to Aggregator's needs. I would leave it on the input side, but still push toward allowing aggregator to conform regular input filter system on it's output.

Also, I am not sure how filter.module works, but you need to know the item's URL in order to convert the links from relative to absolute. Can this be done in a filter? Or do the filter hooks only take the body of the content you want to change?

Renaming issue title for better clarity.

brianV’s picture

StatusFileSize
new2.74 KB

I just realized I posted the last patch in the wrong format. Re-upping in the correct format.

cburschka’s picture

Status: Needs review » Needs work

Very important fix!

This function needs a unit test case...

Some minor points on the comments:

Convert all relative links contained in the src and href elements within

Strictly speaking, href and src are attributes, while a and img are elements.

This is the link to the item

All other parameters are described without a complete sentence, so "this is" should be removed for consistency.

jeffschuler’s picture

BrianV, re: your comment #7, just a heads up to this issue from 2005: #29023: This feed is valid, but may cause problems for some users.

brianV’s picture

Issue tags: +Needs tests

Oops - thought this was a different issue. The tag still applies, though

brianV’s picture

Status: Needs work » Needs review

New revision has tests and coding style fixes pointed out by Arancayter.

ngaur’s picture

Looks promising.

For my purposes, I'm not in a position to correct URLs at the receiving end. I need to be able to publish RSS feeds that will work for sites which don't implement this sort of URL correction.

Also I need D6 implementation, but presumably that will come.

brianV’s picture

@ngaur,

Please open a new issue for 'Converting relative URLs to absolute URLs in output RSS feeds'. I will take a look at it when I have a chance.

catch’s picture

Status: Needs review » Needs work

Looks like #16 was supposed to come with a patch.

brianV’s picture

Grrr... there *WAS* a patch. It had all the the unit tests and everything. Guess I have to do it again from scratch....

ngaur’s picture

What part of drupal should such a ticket relating to feed generation be filed against?

alex_b’s picture

ngaur: Drupal project, component node.module

grendzy’s picture

momardieng’s picture

@brianV,

Was a D6 version of this patch ever posted? I looked everywhere on this site but could not find one. I would really love to get my hands on it as I am having this issue with feeds coming from several sites that I really need (for example http://www.aps.sn/aps.php?page=backend&id_secteur=28). Thanks!

brianV’s picture

Momar,

Unfortunately, this was only ever patched against D7 (HEAD). If and when a patch for this is accepted, it would only then be backported to D6.

I never ended up finishing it off - I had a version at one point with tests, that I suspect would have been just about RTBC-quality. Unfortunately, for some unknown reason, it disappeared from post #16, and I lost my local copy as I reverted my local working copy (since I thought the version in the issue was safely here!).

I never got up the heart to rewrite the tests for this issue as it took a few hours of work. TBH, I think tests are really all that are needed at this point to bring this one to RTBC.

boris mann’s picture

Just found this issue from http://chrisshattuck.com/blog/save-planet-well-drupal-planet-actually-dr... -- would be nice to be able to consume the RSS feeds that Drupal produces without errors.

gábor hojtsy’s picture

@BrianV: can you move this patch over to #88183: Relative URLs in feeds should be converted to absolute ones that this was marked a duplicate of? Thanks!

kurtzhong’s picture

Status: Closed (duplicate) » Needs review
Issue tags: -Needs tests

Status: Needs review » Needs work
Issue tags: +Needs tests

The last submitted patch, aggregator-fix-relative-links-rev-5.patch, failed testing.

symbios_zi’s picture

how to fix this problem in D6? the actual problem

jvieille’s picture

Please, a D6 fix!

Thanks

brianV’s picture

Status: Needs work » Closed (duplicate)

This patch is a duplicate. re-marking as such.