Problem/Motivation

  1. The root cause of this mess: the fact that Drupal generates file URLs without knowledge about the context those URLs will be used in. But the place where a file URL is generated in may itself not know the context either! For example, an image may be rendered into HTML and then that HTML may happen to be sent in an e-mail (or RSS), and therefore the URL should be absolute. But that was impossible to know at the time when the image template was being rendered (for example when rendering an article to be sent via e-mail, the rendering of the article is not at all different: it's still rendered to HTML).
  2. Any time you're sending HTML in a non-web context (for example: e-mail, RSS), you have to make sure all URLs are absolute. But file_create_url() is only invoked on files referenced by code. A complete solution would already have to make all relative URLs absolute anyway, including those in user-generated content.

Proposed resolution

The only sensible solution is to solve the problem at its root. Write a response event subscriber (a response filter in Symfony terminology) that updates all relative URLs in RSS responses to be absolute.

Remaining tasks

  • Test coverage.
  • Reviews.

User interface changes

None.

API changes

None.

Data model changes

None.

Original issue summary

The feeds Drupal creates include node->teaser or node->body (depending on configuration) in the description fields items. This text can contain relative links to other content on the same domain. Many aggregators, including Drupal's own and bloglines.com, can not handle these relative links. The RSS specification specifies that all links should be absolute.

The attached patch fixes the feed descriptions in format_rss_items (common.inc). It adds a function named 'filter_absolute_urls' to common.inc, which can also be used by aggregator.module and other modules that need absolute urls (such as modules that send out emails).

There are alternative approaches to this fix. One could be to use nodeapi 'rss items' hook, using a contrib module. However in the current implementation of this hook in node.module, modules can not modify the body and/or teaser of items. The cleanfeeds module uses this approach, and needs a patch for node.module to get it to work.

Another approach would be to enable the filter system for feeds, and make a filter which creates absolute urls. Having a 'filter_absolute_urls' function in common.inc would still be very usefull in this scenario.

The 'filter_absolute_urls' function can probably be made a little bit more efficient; I currently have two regexps, one for relative urls ("path/to/somewhere") and one for urls relative to the domain root ("/path/to/somewhere").

CommentFileSizeAuthor
#119 88183-2-119.patch16.11 KBalexpott
#119 113-119-interdiff.txt1.6 KBalexpott
#113 interdiff.txt2.04 KBwim leers
#113 88183-113.patch16.53 KBwim leers
#110 no-patch-rss.xml_.txt2.62 KBalexpott
#110 with-patch-rss.xml_.txt2.39 KBalexpott
#110 Screen Shot 2016-05-27 at 10.22.30.png557.16 KBalexpott
#104 88183-104.patch16.58 KBwim leers
#71 interdiff-88183-71.txt3.38 KBkylebrowning
#71 88183-71.patch9.47 KBkylebrowning
#65 interdiff-dbg.txt918 byteswim leers
#65 rss_relative_to_absolute-88183-65-debug.patch10.43 KBwim leers
#65 rss_relative_to_absolute-88183-65.patch10.29 KBwim leers
#61 interdiff.txt5.48 KBwim leers
#61 rss_relative_to_absolute-88183-61.patch10.29 KBwim leers
#59 rss_relative_to_absolute-88183-59.patch9.3 KBwim leers
#59 rss_relative_to_absolute-88183-59-tests_only_FAIL.patch2.29 KBwim leers
#57 interdiff.txt993 byteswim leers
#57 rss_relative_to_absolute-88183-57.patch7.06 KBwim leers
#54 rss_relative_to_absolute-88183-100.patch6.13 KBwim leers
#46 relative-urls-in-feeds-46.patch2.77 KBmarvil07
#45 detect-xmlbase.patch591 bytesmarvil07
#41 aggregator-fix-relative-links-88183-rev8.patch4.78 KBbrianV
#39 aggregator-fix-relative-links-88183-rev7.patch2.18 KBbrianV
#38 aggregator-fix-relative-links-88183-rev6.patch2.19 KBbrianV
#36 aggregator-fix-relative-links-rev-5.patch2.74 KBbrianV
#26 drupal_force_absolute_urls_2.patch1.67 KBahoeben
#24 drupal_force_absolute_urls_1.patch1.71 KBahoeben
#21 drupal_force_absolute_urls_0.patch1.63 KBahoeben
#20 drupal_force_absolute_urls.patch1.63 KBahoeben
#14 absolute-urls-88183-14.patch1.4 KBchx
#11 filter_absolute_urls_0.patch1.68 KBwim leers
filter_absolute_urls.patch1.28 KBahoeben
#72 88183-72.patch9.47 KBkylebrowning
#72 interdiff-88183-72.txt3.38 KBkylebrowning
#73 rss_relative_to_absolute-88183-73.patch10.31 KBwim leers
#73 interdiff.txt937 byteswim leers
#74 rss_relative_to_absolute-88183-74.patch10.78 KBwim leers
#74 interdiff.txt3.14 KBwim leers
#77 rss_relative_to_absolute-88183-77.patch11.42 KBwim leers
#77 interdiff.txt6.61 KBwim leers
#78 rss_relative_to_absolute-88183-78.patch14.25 KBwim leers
#78 interdiff.txt3.54 KBwim leers
#83 88183-83.patch13.64 KBalexpott
#84 83-84-interdiff.txt2.72 KBalexpott
#84 88183-84.patch13.58 KBalexpott
#89 88183-89.patch13.96 KBwim leers
#89 interdiff.txt1.89 KBwim leers
#95 0001-Re-roll-89.patch14.65 KBkylebrowning
#96 88183-96.patch16.24 KBwim leers
#96 interdiff.txt5.13 KBwim leers
#100 88183-100.patch16.8 KBwim leers
#100 interdiff.txt3.47 KBwim leers
#101 Screen Shot 2016-05-25 at 19.56.55.png136.84 KBwim leers
#102 88183-102.patch16.82 KBwim leers
#102 interdiff.txt4.79 KBwim leers
#104 interdiff.txt2.5 KBwim leers

Comments

ednique’s picture

Same problem here...

when content is published on other sites, links are wrong as they don't contain a full url...

ahoeben’s picture

I guess this issue is not getting much attention because its against 4.7.x as opposed to CVS.

I'll see if I can find time to install CVS and see if the problem persists in the current CVS. Nevertheless, this is an ugly issue in 4.7.x too, because the aggregator in Drupal is not even displaying its Drupal generated feeds properly.

magico’s picture

Version: 4.7.3 » 5.x-dev

1. create a node
2. insert a relative link (eg. href="node/add")
3. access the rss.xml
4. links are still relative and must e absolute

dmitrig01’s picture

Version: 5.x-dev » 6.x-dev
Category: bug » feature
Priority: Normal » Critical
phillipadsmith’s picture

This is a pretty major issue. Any update on it?

Anonymous’s picture

The added function needs some phpDOC comments. This hasn't seen the light of day because the status is set to patch (code needs work) meaning that someone has reviewed the code and decided that it needed work. Others will not look until it is patch (code needs review). So once you've made the changes, change the status.

phillipadsmith’s picture

Just pointing out that it seems like a relatively big issue to leave open. The RSS spec asks for absolute URLs and Drupal is producing relative ones -- so Drupal RSS feeds won't validate (or, in some cases, work at all).

FYI: to those that are really stuck with this problem, you can set the $base_url in settings.php and that will add a absolute path to all of your RSS feed links.

Phillip.

bessone’s picture

Version: 6.x-dev » 5.2
Category: feature » bug

I have copied this function on my common.inc (drupal 5.2) and after add the call on format_rss_item function in that way:

$output .= ' <description>'. check_plain(filter_absolute_urls($description)) ."</description>\n";

and works!!! Now no problem with feed and feedburner!

Thanks!!!

Saluti
BES

Anonymous’s picture

@ahoeben: Are you planning to provide a modified patch to provide the comments?

drumm’s picture

Version: 5.2 » 6.x-dev

I would like to see this get into 6.x first. Here are problems I see:
- Function documentation is incomplete. Need more @param and @return.
- Second return statement is never reached.

wim leers’s picture

StatusFileSize
new1.68 KB

Reroll, with some comments.

wim leers’s picture

Assigned: Unassigned » wim leers
Status: Needs work » Needs review
steven jones’s picture

Status: Needs review » Needs work

The token amount of documentation is just not enough.

Rename the parameter $url to $html, as you're not passing in one url at a time to this function, you're passing in html.

Change the documentation to something like:

@param $html
A block of html containing relative and absolute urls in 'src' and 'href' attributes
@param $absolute_url
An absolute URL that points to this Drupal website.

Is the second parameter really needed for this function? why can't we just use url('')?

chx’s picture

Assigned: wim leers » chx
Status: Needs work » Needs review
StatusFileSize
new1.4 KB

Cleanup and comments. Changed the name of the function -- you must not call this filter* something as it's outside of filter module namespace.

snufkin’s picture

Status: Needs review » Reviewed & tested by the community

tested with relative and absolute paths (e.g.: /node/add/story, node/add/story, fullurl.com/node/add/story), all three gives the same, correct results in the feed as absolute url.

dmitrig01’s picture

does it work relatively in a subdirectory?
If your install is in drupal and you link to /drupal/node/add does it work?

snufkin’s picture

my drupal installation was located under http://localhost/projects/drupal/drupal6 and it worked

ahoeben’s picture

Thanks for picking this up! Had to take a break from Drupal for a while, sorry 'bout disappearing... Patch looks good, works for me.

gábor hojtsy’s picture

Status: Reviewed & tested by the community » Needs review

Conceptual issues:

- I can't sew how does this handle some situations, like if I have a link from http://example.com/news/local to "worldwide", which is a relative link to "http://example.com/news/worldwide".
- I see that the second preg matches a link to '/', but I found it hard to come up with examples myself for cases when the first matches or when the second matches. In other words: why do we need two different matchers?
- This function only matches lower case attribute names, which might not be that big of a problem for those following XHTML rules, but frankly, not everyone cares. I am not too comfortable with doing a case insensitive match, because it is a lot sloooooooower, then what we have now, but this still need to be on our mind.

Cosmetic issues:

- "html" should be "HTML" in comments
- The function does not concentrate on "a href", but rather "... href", so this should be corrected. Other tags have href attributes, so this is not exclusive to "a".

ahoeben’s picture

StatusFileSize
new1.63 KB

Reroll of my original patch, with cosmetic changes and documentation.

The second argument to the function is useful to have:

  • The ' http://example.com/news/local to "worldwide" ' scenario: Relative urls in a Drupal page are not relative to $base_url. It's bad practice to use relative urls in Drupal pages (we should use relative-to-the-document-root urls instead), but some people still do. Having the link to the drupal page as a 'relative' instead of $base_url at least makes the rss feed behave the same as the original document.
  • When the function is used in an aggregator module with incoming relative links, the links are relative to the source specified in the incoming feed, not to the Drupal site.
ahoeben’s picture

StatusFileSize
new1.63 KB

woops

chx’s picture

,$base_document misses a space.

ahoeben’s picture

Before I roll a new patch for that space, any more opinions on whether or not we should do case insensitive pregs instead? I am partial to case insensitivity, since the result of this function will likely be cached anyway...

ahoeben’s picture

StatusFileSize
new1.71 KB

New patch

  • Combines two regexps in one preg_replace call
  • Makes regexps case sensitive for people using HreF= or SRC=
  • Fixes relative links to http://example.com/node/1 (makes them relative to http://example.com/node)
  • Adds missing space (thanks chx ;-)

DrupalTestbedBot tested ahoeben's patch (http://drupal.org/files/issues/drupal_force_absolute_urls_1.patch), the patch failed. For more information visit http://testing.drupal.org/node/130

ahoeben’s picture

StatusFileSize
new1.67 KB

Thank you, mister DrupalTestbedBod. Is this patch better?

dries’s picture

Version: 6.x-dev » 7.x-dev
Category: bug » feature
Status: Needs review » Needs work

This isn't a bug, IMO. I'm not a big fan of regex'ing things into place.

The XML specification provides a clean solution for this; the xml:base attribute. The feeds generated by Drupal support include this attribute. What needs fixing is the aggregator module, not the feed generation.

In documentation, we write 'URL', not 'url'.

ahoeben’s picture

Point taken for outgoing RSS feeds.

However, there are still situations where the suggested 'drupal_force_absolute_urls' function is useful; One such situation would be for applying the xml:base attribute to relative links on incoming RSS feeds.

IMHO it is still a critical bug for 6.x that Drupal can not correctly aggregate its own feeds. I'm a bit reluctant to mark this as an 'aggregator.module' bug, because there are other situations where the same problem exists and where xml:base is not a solution. Links in outgoing emails for example.

wolfderby’s picture

Version: 7.x-dev » 5.x-dev

subscribing

oh I just changed the version because I thought people wouldn't be working on it, if it's not the current version. n00b mistake apparently sorry :(

this is beyond my current capabilities and I hope it gets fixed!

wim leers’s picture

Version shouldn't have been changed.

chx’s picture

Version: 5.x-dev » 7.x-dev
Assigned: chx » Unassigned

Is this still a problem? I bet it is! Someone please reroll it and add a test.

grendzy’s picture

Version: 7.x-dev » 8.x-dev
Component: base system » aggregator.module

Per Dries' comment #27, moving to aggregator.module. And, features are really 8.x now. Though if someone wants to try to argue that not implementing the aforementioned "xml:base attribute" part of the RSS spec is a bug, this might have a shot at 7.x. (This is frustrating, not a day goes by without broken links and images on Drupal Planet.)

Also marked #395764: Aggregator: Convert all relative URLs to absolute URLs in feed items as a duplicate.

gábor hojtsy’s picture

Version: 8.x-dev » 7.x-dev

Oh, its totally a bug. Even to be backported to Drupal 6 if at all possible.

gábor hojtsy’s picture

Category: feature » bug
gábor hojtsy’s picture

#395764: Aggregator: Convert all relative URLs to absolute URLs in feed items has a working patch for Drupal 7 that only needs tests. Asked @brianV to move the patch over.

brianV’s picture

StatusFileSize
new2.74 KB

As requested by Gabor, here is the patch from the other issue.

Note that this patch applied cleanly in March, 2009, so is going to need to be updated to current HEAD. Also, it needs a test case.

I would appreciate if someone else take it over from here, as I don't have the free time to focus on it for the next while.

sun’s picture

Priority: Critical » Major
+++ modules/aggregator/aggregator.parser.inc	24 Mar 2009 06:34:05 -0000
@@ -325,3 +330,40 @@ function aggregator_parse_w3cdtf($date_s
+ * Convert all relative links contained in the src and href elements within ¶
+ * a feed item's description to absolute links.

1) Trailing white-space.

2) See http://drupal.org/node/1354 for code documentation standards.

+++ modules/aggregator/aggregator.parser.inc	24 Mar 2009 06:34:05 -0000
@@ -325,3 +330,40 @@ function aggregator_parse_w3cdtf($date_s
+  $item_url_parts = parse_url($link);

You likely want to use drupal_parse_url(), which probably simplifies this code a lot.

Powered by Dreditor.

brianV’s picture

Status: Needs work » Needs review
StatusFileSize
new2.19 KB

I found some time after all to produce an updated patch w/ sun's feedback. Note that there is still no test case.

brianV’s picture

Since I know it's only a matter of time until it's pointed out - rev7 is attached which fixes the extra whitespace...

brianV’s picture

Status: Needs review » Needs work

Setting to CNW. I realize that I forgot about the case where things are relative to the root - they should be against the working dir.

brianV’s picture

Status: Needs work » Needs review
StatusFileSize
new4.78 KB

Here is a new revision. I went back to the old parse_url() logic, as that drupal_parse_url() doesn't give the base url of the link, which is needed in one of the two use cases.

brianV’s picture

*facepalm*

Ignore that last patch. Sorry about wasting your time and attention. I'll have a new roll on Monday when I'm a little less burnt out.

sun.core’s picture

Priority: Major » Normal

This bug seems to exists for a long time already, so the major priority isn't qualified.

+++ aggregator-fix-relative-links-88183-rev7.patch
@@ -0,0 +1,56 @@
+ ¶
++    // Try to replace all relative links with absolute links.

1) Trailing white-space.

2) Duplicate + signs indicate a wrongly re-rolled patch.

Powered by Dreditor.

jody lynn’s picture

Version: 7.x-dev » 8.x-dev
marvil07’s picture

Status: Needs review » Needs work
StatusFileSize
new591 bytes

The XML specification provides a clean solution for this; the xml:base attribute.

Documentation for that on w3c. It define it in a way that the URLs should be resolved with the nearest parent which define the attribute.

So, I guess the idea here would be to use the last patch, but use:

$xml_base = isset($channel['XML:BASE']) ? $channel['XML:BASE'] : $item['link'];
aggregator_convert_relative_links($xml_base, $item['description']);

instead of

aggregator_convert_relative_links($item['link'], $item['description']);

The patch with this comment adds the definition of xml:base to the channel array during xml parsing, but only for the top level.
It identifies top level by asking the element name: rss, feed(atom) or rdf:RDF(rdf).
To actually do it to support the full specification, we would need to ask for current attributes of "description" elements, and if not defined ask to all parents until find it or reach the root.
I'm not sure if we want to do that here(sounds more like a feed/xml parser library feature), or this approach would be enough, but it would solve at least some of the problem.

So, we anyway will need to do regex's to replace the links(so we do it once), but with the attached patch we are also trying to use xml:base if defined.

marvil07’s picture

Status: Needs work » Needs review
StatusFileSize
new2.77 KB

This patch fixes the re-roll on #41 and adds what is mentioned on #45.

wim leers’s picture

Status: Needs review » Needs work

I'm afraid this needs a test case? I think I also found an edge case that is not yet properly supported, but which should NOT prevent this patch from going in, because this is going to fix 99.5% of all cases already, which is better than where we're at today.

+++ b/core/modules/aggregator/aggregator.parser.incundefined
@@ -327,3 +340,42 @@ function aggregator_parse_w3cdtf($date_str) {
+  // Replace all relative links not starting with '/'. Absolute links are not
+  // matched, since ':' is not in the capture pattern.

Some CMSes use the colon (':') in their URLs, e.g. for filtering search results: http://example.com/magicalSearchPage:filter:foo.

corbacho’s picture

#46
That regex will fail also, because it will capture protocol-less URLs, when it shouldn't
<img src="//drupal.org/sites/all/themes/bluecheese/images/btn-search.png" />

This type of URLs are mostly useful in these cases, when the content goes outside of your domain (like me reading drupal planet news in Newsblur reader), and you don't know if the RSS-reader domain is using http or https.

( In a positive note... I don't think many modules use this type of URLs for images)

Final note: I haven't noticed before Drupal Planet broken images because Google Reader fixes on-the-fly the "relative path" issues, replacing them with absolute paths. But G.Reader will die soon.. and the alternatives out there are not that smart. (I verified Newsblur and feedly)

twistor’s picture

+++ b/core/modules/aggregator/aggregator.parser.incundefined
@@ -327,3 +340,42 @@ function aggregator_parse_w3cdtf($date_str) {
+  // Create parts to use in the replacement patterns.
+  $item_url_parts = parse_url($link);
+  $item_base_url  = $item_url_parts['scheme'] . '://' . $item_url_parts['host'];

parse_url() returns FALSE on failure.

scheme might not be set.

need to handle port.

protocol relative urls.

+++ b/core/modules/aggregator/aggregator.parser.incundefined
@@ -327,3 +340,42 @@ function aggregator_parse_w3cdtf($date_str) {
+    $item_path_parts = pathinfo($item_url_parts['path']);

pathinfo($item_url_parts['dirname'], PATHINFO_DIRNAME);

+++ b/core/modules/aggregator/aggregator.parser.incundefined
@@ -327,3 +340,42 @@ function aggregator_parse_w3cdtf($date_str) {
+  // Replace all links starting with '/'.
+  $regexp      = "/\b(href|src)(\s*=\s*['\"])([\/][A-Za-z0-9\/]*)(['\"])/";
+  $description = preg_replace($regexp, '$1$2' . $item_base_url . '$3$4', $description);
+
+  // Replace all relative links not starting with '/'. Absolute links are not
+  // matched, since ':' is not in the capture pattern.
+  $regexp      = "/\b(href|src)(\s*=\s*['\"])([^\/][A-Za-z0-9\/]*)(['\"])/";

These regexps will over match. We need at least the <> in there.
Also, 'codebase' and 'code' attributes can contain urls.

Is there are reason that we're not using the dom here?

twistor’s picture

Also 'user' and 'pass' should be supported in parse_url().

and ipv6 hosts with something like:

if (preg_match('/!^[\da-f]*:[\da-f.:]+$!ui/', $parts['host'])) {
  $url .= '[' . $parts['host'] . ']';
}
john franklin’s picture

Those regexes are missing all the special characters that can be in a URL, including underscore, dash and period.

I think the regex needs to be closer to this to get 99.5% coverage.

  $regexp      = "/\b(href|src)(\s*=\s*['\"])([\/][A-Za-z0-9\-\._]+\/?[A-Za-z0-9_\.\-\?\+\/~=&#;,]*)(['\"])/";

Edit: The added back in uppercase letters.

fomenkoandrey’s picture

a similar problem in Drupal 8.0.3
all is well in the feed.
but when displaying the contents of the feed on other site - no images.
relative link does not work on another site, images are not displayed.
do for so many years and did not fix this bug? Is there a way to solve the problem in Drupal 8?

wim leers’s picture

Title: Relative urls in feeds » Relative URLs in feeds should be converted to absolute ones
Version: 8.0.x-dev » 8.1.x-dev
Component: aggregator.module » request processing system
Assigned: Unassigned » wim leers
Priority: Normal » Major
Issue summary: View changes
Status: Needs work » Active
Related issues: +#1494670: References to CSS, JS, and similar files should be root-relative URLs: avoids mixed content warnings & fewer bytes to send

As #52 just showed, this is also still a problem in Drupal 8.

In fact, #1494670: References to CSS, JS, and similar files should be root-relative URLs: avoids mixed content warnings & fewer bytes to send made all file URLs relative, and in doing so, made this problem more prominent. But it's always been a problem, because we never solved it on the right level. See #1494670-121: References to CSS, JS, and similar files should be root-relative URLs: avoids mixed content warnings & fewer bytes to send and later. Let me in particular repeat the most relevant bit of my analysis in #1494670-143: References to CSS, JS, and similar files should be root-relative URLs: avoids mixed content warnings & fewer bytes to send:

any time you're sending HTML in a non-web context (for example: e-mail, RSS), you have to make sure all URLs are absolute. But file_create_url() is only invoked on files referenced by code. A complete solution would already have to make all relative URLs absolute anyway, including those in user-generated content. Note that this has been a bug in Drupal since 2006: #88183: Relative URLs in feeds should be converted to absolute ones.

So let's fix this once and for all.

wim leers’s picture

wim leers’s picture

Issue tags: +Needs tests
wim leers’s picture

Priority: Major » Critical

Letting this inherit the critical priority from #1494670: References to CSS, JS, and similar files should be root-relative URLs: avoids mixed content warnings & fewer bytes to send — up to the committers to decide otherwise.

wim leers’s picture

The last submitted patch, 54: rss_relative_to_absolute-88183-100.patch, failed testing.

wim leers’s picture

Issue tags: -Needs tests
StatusFileSize
new2.29 KB
new9.3 KB

And here's test coverage. The test-only patch also is the interdiff.

wim leers’s picture

Issue summary: View changes
wim leers’s picture

StatusFileSize
new10.29 KB
new5.48 KB

Berdir is going to take on #2704597: Relative URLs in mails should be converted to absolute ones and asked me to move the key logic he'll need there into the Html utility class. That will make his job easier over there.

Status: Needs review » Needs work

The last submitted patch, 61: rss_relative_to_absolute-88183-61.patch, failed testing.

The last submitted patch, 59: rss_relative_to_absolute-88183-59-tests_only_FAIL.patch, failed testing.

The last submitted patch, 59: rss_relative_to_absolute-88183-59.patch, failed testing.

wim leers’s picture

Status: Needs work » Needs review
StatusFileSize
new10.29 KB
new10.43 KB
new918 bytes

Rebased #61.

#59's non-test-only patch should have passed, I'm 99% certain this is due to a small mistake in the test and the fact that testbot installs Drupal in a subdirectory. So I added a debug patch.

The last submitted patch, 65: rss_relative_to_absolute-88183-65.patch, failed testing.

Status: Needs review » Needs work

The last submitted patch, 65: rss_relative_to_absolute-88183-65-debug.patch, failed testing.

eric_a’s picture

--- a/core/lib/Drupal/Component/Utility/Html.php
+++ b/core/lib/Drupal/Component/Utility/Html.php
@@ -1,6 +1,7 @@
 <?php
 
 namespace Drupal\Component\Utility;
+use Symfony\Component\HttpFoundation\Request;
 
 /**
  * Provides DOMDocument helpers for parsing and serializing HTML strings.

If symfony/http-foundation is going to be a drupal/core-utility component dependency then core/lib/Drupal/Component/Utility/composer.json needs an update now that #2337283: Add a composer.json file to every component got into 8.1.x.

twistor’s picture

  1. +++ b/core/lib/Drupal/Component/Utility/Html.php
    @@ -37,6 +38,15 @@ class Html {
    +   * All attributes that may contain URIs.
    +   *
    +   * @see https://developer.mozilla.org/en-US/docs/Web/HTML/Attributes
    +   *
    +   * @var string[]
    +   */
    +  protected static $uriAttributes = ['href', 'poster', 'src', 'cite'];
    

    Do we want to match against the appropriate tag names?

  2. +++ b/core/lib/Drupal/Component/Utility/Html.php
    @@ -402,4 +412,33 @@ public static function escape($text) {
    +  public static function transformRelativeUrlsToAbsolute($html, Request $request) {
    +    $html_dom = Html::load($html);
    +    $xpath = new \DOMXpath($html_dom);
    +
    +    // Update all relative URLs to absolute URLs in the given HTML.
    +    foreach (static::$uriAttributes as $attr) {
    +      foreach ($xpath->query("//*[starts-with(@$attr, '/') and not(starts-with(@$attr, '//'))]") as $node) {
    +        $node->setAttribute($attr, $request->getSchemeAndHttpHost() . $node->getAttribute($attr));
    +      }
    +    }
    +    return Html::serialize($html_dom);
    

    This should just pass in the $base_url as string or Psr\Http\Message\UriInterface, rather than a full Request object. It removes the dependency on Symfony, and makes it usable generically. Also why are we skipping other types of relative URLs that may appear in body content? I think that would also fix the bug about sub-directories. We need something like Psr\Http\Message\UriInterface::resolve().

dawehner’s picture

+++ b/core/lib/Drupal/Component/Utility/Html.php
@@ -402,4 +412,33 @@ public static function escape($text) {
+  public static function transformRelativeUrlsToAbsolute($html, Request $request) {
+    $html_dom = Html::load($html);

it is a component so it should not depend on another code, just pass in the schema and host as variable.

kylebrowning’s picture

StatusFileSize
new3.38 KB
new9.47 KB

This attempts to fix #70, I'm assuming we meant base_root instead of $base_url, but an easy change if not.

Did not attempt to fix #69.1 though.

kylebrowning’s picture

StatusFileSize
new9.47 KB
new3.38 KB

Lost a parenthesis, whoops.

wim leers’s picture

Status: Needs work » Needs review
StatusFileSize
new10.31 KB
new937 bytes
  1. +++ b/core/lib/Drupal/Component/Utility/Html.php
    @@ -422,20 +421,20 @@ public static function escape($text) {
    +   * @param string $base_root
    +   *   The base url.
    

    Inconsistent.

  2. +++ b/core/lib/Drupal/Core/EventSubscriber/RssResponseRelativeUrlFilter.php
    @@ -31,7 +30,8 @@ public function onResponse(FilterResponseEvent $event) {
    +    global $base_root;
    

    From injected data to a global. This is very bad.

  3. +++ b/core/lib/Drupal/Core/EventSubscriber/RssResponseRelativeUrlFilter.php
    @@ -41,13 +41,13 @@ public function onResponse(FilterResponseEvent $event) {
    +   * @param string $base_root
    +   *   The base root URL.
    

    Inconsistently inconsistent…


So, first, let's make the patch green. Then let's address the concerns raised (which I totally agree with). Interdiff relative to #65.

wim leers’s picture

StatusFileSize
new10.78 KB
new3.14 KB

#68-#72: nice, I move one small subset into a component at Berdir's request and I get loads and loads of feedback.

#68: Indeed, fixed.

#69:

  1. Why would we want to match against tag names too? That'd make this needlessly complex. These attributes exist on multiple tags. They always have the same meaning. Therefore this is sufficient, and simpler.
  2. The reason the current patch only does root-relative URLs is because that is what is necessary for #1494670: References to CSS, JS, and similar files should be root-relative URLs: avoids mixed content warnings & fewer bytes to send.

    RE: "other types of relative URLs": no, we don't need/want to update protocol-relative URLs.
    No, supporting protocol-relative URLs wouldn't have fixed the bug about subdirectories.
    Yes, something like \GuzzleHttp\Psr7\Uri::resolve() could be nice, but I've yet to see relative URLs using periods in Drupal. And no, that doesn't live on PSR's UriInterface, it's something specific to Guzzle's implementation of that interface.

#70: done.
#71/#72: cleaned up your implementation.


Having written all this made me realize something, see next comment.

wim leers’s picture

StatusFileSize
new10.77 KB
new1.13 KB

D'oh, syntax error.

The last submitted patch, 74: rss_relative_to_absolute-88183-74.patch, failed testing.

wim leers’s picture

StatusFileSize
new11.42 KB
new6.61 KB

So Twistor makes a great overall point in #69: if we have this in the Html component, it should truly be generic. But it's not actually generic right now.

However, he's actually wrong. And I'm wrong too! We're both wrong :)

I'll explain why.


The current patch is actually broken. Despite passing tests. Because the test I wrote is wrong.

The whole reason we're doing this is because of #1494670: References to CSS, JS, and similar files should be root-relative URLs: avoids mixed content warnings & fewer bytes to send. That made it so that pretty much all file URLs are made root-relative using file_url_transform_relative().

But the test makes it so that <a href="/foo"> on a site http://example.com/drupal gets changed to <a href="http://example.com/drupal/foo. But actually, /foo was not pointing inside Drupal in the first place.
More concrete example: file_url_transform_relative(file_create_url('public://llama.jpg')) generates /sites/default/llama.jpg if Drupal runs at http://example.com, and generates /drupal/sites/default/llama.jpg if Drupal runs at http://example.com/drupal. That's why it's a root-relative URL: it means it's relative to the site's root. So we only need to prepend the scheme and host, not also the base path.

So the changes I made in #73 were wrong, I should've fixed the tests instead.


Therefore, this patch:

  1. fixes the test coverage, and is now hopefully much, much more clear
  2. renames Html::transformRelativeUrlsToAbsolute() to Html::transformRootRelativeUrlsToAbsolute(), which causes it to now actually match its documentation
  3. adds detailed documentation to Html::transformRootRelativeUrlsToAbsolute() explaining why relative URLs other than root-relative URLs are not being transformed
  4. changes the parameter it accepts (and its assertions) from $base_url to $scheme_and_host

Next: unit test coverage.

wim leers’s picture

StatusFileSize
new14.25 KB
new3.54 KB

And here's unit test coverage.

wim leers’s picture

Assigned: wim leers » Unassigned
xjm’s picture

Priority: Critical » Major

I don't think this meets the definition of a critical bug. It is however a major bug with feed functionality.

alexpott’s picture

Priority: Major » Critical

Discussed with @catch. Unlike the similar #2704597: Relative URLs in mails should be converted to absolute ones this issue, which was made way worse by #1494670: References to CSS, JS, and similar files should be root-relative URLs: avoids mixed content warnings & fewer bytes to send is entirely core's problem and in my mind is critical bug - images in displayed in feeds no longer work and core is just doing it wrong.

alexpott’s picture

Status: Needs review » Needs work
Issue tags: +Needs reroll

Needs a reroll.

alexpott’s picture

Status: Needs work » Needs review
Issue tags: -Needs reroll
StatusFileSize
new13.64 KB
alexpott’s picture

StatusFileSize
new2.72 KB
new13.58 KB

A few tidy ups.

alexpott’s picture

+++ b/core/lib/Drupal/Component/Utility/Html.php
@@ -37,6 +37,15 @@ class Html {
+   * @see https://developer.mozilla.org/en-US/docs/Web/HTML/Attributes
+   *
+   * @var string[]
+   */
+  protected static $uriAttributes = ['href', 'poster', 'src', 'cite'];

We've not included code, codebase, data, and manifest. Should we document why they are not here?

wim leers’s picture

#84: thanks for fixing all those nits! All look great.

#85:

We've not included code, codebase, data, and manifest. Should we document why they are not here?

Sure. Proposal:

* - The attributes 'code' and 'codebase' are omitted, because they only exist for the <applet> tag. The time of Java applets has passed.
* - The 'manifest' attribute is omitted because it only exists for the <html> tag. That tag only makes sense in a HTML-served-as-HTML context, in which case relative URLs are guaranteed to work.

data exists for <object> and I guess that still is a thing that exists and should work. Of course, 99.99% of the time, it's for e.g. YouTube embeds. But, you never know. So, shall we add support for that one after all?

alexpott’s picture

The time of Java applets has passed.

+1 :)

I like all the suggestion in #86 including including the data attribute.

wim leers’s picture

Assigned: Unassigned » wim leers

Great :) Will reroll then!

wim leers’s picture

Assigned: wim leers » Unassigned
StatusFileSize
new13.96 KB
new1.89 KB

Voilà.

dawehner’s picture

Status: Needs review » Needs work
  1. +++ b/core/modules/file/src/Tests/FileFieldRSSContentTest.php
    diff --git a/core/modules/node/src/Tests/NodeRSSContentTest.php b/core/modules/node/src/Tests/NodeRSSContentTest.php
    index 27870bd..d1ce240 100644
    
    index 27870bd..d1ce240 100644
    --- a/core/modules/node/src/Tests/NodeRSSContentTest.php
    
    --- a/core/modules/node/src/Tests/NodeRSSContentTest.php
    +++ b/core/modules/node/src/Tests/NodeRSSContentTest.php
    
    +++ b/core/modules/node/src/Tests/NodeRSSContentTest.php
    @@ -60,4 +62,47 @@ function testNodeRSSContent() {
     
    +  /**
    +   * Tests relative, root-relative, protocol-relative and absolute URLs.
    +   */
    +  public function testUrlHandling() {
    +    // Only the plain_text text format is available by default, which escapes
    

    Its weird that we test generic core functionality as part of node module. This IMHO belongs somewhere else

  2. +++ b/core/tests/Drupal/Tests/Component/Utility/HtmlTest.php
    @@ -320,4 +320,68 @@ public function testSerialize() {
    +    // All possible attributes that support URIs.
    +    $attributes = ['href', 'poster', 'src', 'cite', 'data'];
    

    What about srcset?

alexpott’s picture

alexpott’s picture

Priority: Critical » Major

Discussed with @catch, @xjm, @effulgentsia, @Cottser and @webchick. They've convinced me that this is not a critical bug. The reasoning is that the feed is still functional and the lack of absolute links is results is just a visual issue and the link to the site providing the feed will still work.

I think the patch is close - once we've got all the html attributes then we should be good.

kylebrowning’s picture

@alexpott, so to be clear, we need any html attribute that has a src attribute?

alexpott’s picture

kylebrowning’s picture

StatusFileSize
new14.65 KB

@alexpott Right I see that, but are formactions really going to come through RSS feeds? Can you be more specific on which html attributes are actually missing, versus needed, it seems the html4 ones are covered.

At any rate, here is a re-roll.

wim leers’s picture

Status: Needs work » Needs review
StatusFileSize
new16.24 KB
new5.13 KB

#90.1: I agree with that, but it's out of scope here. There is no pure RSS test! If there were, I'd be adding this to that. We shouldn't overhaul general RSS test coverage here. This is the best place.

#90.2: dammit, what a great find! :(


#95: there's something wrong with that reroll, there's a syntax error in there, in a part that didn't conflict. After fixing that syntax error, it's still failing, so there's another bit wrong. Rerolled it myself, now the problem is gone. Easier than hunting it down :) Probably the cat's fault! :D

Hence interdiff relative to #89.


So, NW for:

  1. action: <form action>
  2. formaction: <button formaction
  3. icon: <command icon, but per https://developer.mozilla.org/en-US/docs/Web/HTML/Element/command, this is effectively gone
  4. srcset: <img srcset>

Addressed all of this.

kylebrowning’s picture

Status: Needs review » Reviewed & tested by the community

Thanks Wim, just tested this too.

The last submitted patch, 71: 88183-71.patch, failed testing.

alexpott’s picture

Status: Reviewed & tested by the community » Needs work
  1. +++ b/core/tests/Drupal/Tests/Component/Utility/HtmlTest.php
    @@ -321,4 +321,79 @@ public function testSerialize() {
    +    $attributes = ['href', 'poster', 'src', 'cite', 'data', 'action', 'formaction', 'srcset'];
    +    $attributes = ['srcset'];
    

    We need to remove the second line here.

  2. +++ b/core/tests/Drupal/Tests/Component/Utility/HtmlTest.php
    @@ -321,4 +321,79 @@ public function testSerialize() {
    +            "$tag_name, $attribute, $base_path: root-relative" => ["<$tag_name $attribute=\"http://example.com{$base_path}already-absolute 200w, {$base_path}root-relative 300w\">root-relative test</$tag_name>", 'http://example.com', "<$tag_name $attribute=\"http://example.com{$base_path}already-absolute 200w, http://example.com{$base_path}root-relative 300w\">root-relative test</$tag_name>"],
    +            "$tag_name, $attribute, $base_path: protocol-relative" => ["<$tag_name $attribute=\"http://example.com{$base_path}already-absolute 200w, //example.com{$base_path}root-relative 300w\">root-relative test</$tag_name>", 'http://example.com', FALSE],
    +            "$tag_name, $attribute, $base_path: absolute" => ["<$tag_name $attribute=\"http://example.com{$base_path}already-absolute 200w, http://example.com{$base_path}root-relative 300w\">root-relative test</$tag_name>", 'http://example.com', FALSE],
    ...
    +          "$tag_name, $attribute, $base_path: root-relative" => ["<$tag_name $attribute=\"{$base_path}root-relative\">root-relative test</$tag_name>", 'http://example.com', "<$tag_name $attribute=\"http://example.com{$base_path}root-relative\">root-relative test</$tag_name>"],
    +          "$tag_name, $attribute, $base_path: protocol-relative" => ["<$tag_name $attribute=\"//example.com{$base_path}root-relative\">root-relative test</$tag_name>", 'http://example.com', FALSE],
    +          "$tag_name, $attribute, $base_path: absolute" => ["<$tag_name $attribute=\"http://example.com{$base_path}root-relative\">root-relative test</$tag_name>", 'http://example.com', FALSE],
    

    It looks as though the test text was intended to vary but everythign says root-relative

  3. +++ b/core/tests/Drupal/Tests/Component/Utility/HtmlTest.php
    @@ -321,4 +321,79 @@ public function testSerialize() {
    +          continue;
    +        }
    

    What's wrong with an else? Or considering it is different just doing srcset outside the loop?

wim leers’s picture

Status: Needs work » Reviewed & tested by the community
StatusFileSize
new16.8 KB
new3.47 KB

#99:

  1. D'OH!!!!!!! That was for debugging purposes… So glad you caught that!
  2. Oh, hah! The tests actually arecorrect, it's just the path that always contained the same string, which was indeed very misleading. Fixed. Verify with git diff --color-words if you would like to convince yourself. See attached screenshot for that.
  3. Nothing wrong with an else, I just wanted to reduce the interdiff size, and it stood out more, stressing it was the exceptional case.

    I think it may indeed be cleaner to pull it out of the loop.

Doing point 3 separately to make the steps/interdiffs simpler.

wim leers’s picture

Status: Reviewed & tested by the community » Needs review
StatusFileSize
new136.84 KB

Accidentally RTBC'd #100, sorry.

And here's the screenshot I promised.

wim leers’s picture

StatusFileSize
new16.82 KB
new4.79 KB

This addresses point 3.

Ready to be RTBC'd again IMHO.

Status: Needs review » Needs work

The last submitted patch, 102: 88183-102.patch, failed testing.

wim leers’s picture

Status: Needs work » Needs review
StatusFileSize
new2.5 KB
new16.58 KB

PHPStorm--

(It was doing very strange stuff with reformatting code whenever I typed a curly brace — yes, really, I couldn't believe it either! — and this caused me to mess a few things up :/)

kylebrowning’s picture

Status: Needs review » Reviewed & tested by the community
alexpott’s picture

alexpott’s picture

alexpott’s picture

+++ b/core/modules/taxonomy/src/Tests/RssTest.php
@@ -105,7 +105,7 @@ function testTaxonomyRss() {
-    $this->assertRaw('<rss version="2.0"', "Feed page is RSS.");
+    $this->assertTrue(!empty($this->cssSelect('rss[version="2.0"]')), "Feed page is RSS.");

+++ b/core/modules/views/src/Tests/Wizard/BasicTest.php
@@ -80,7 +80,7 @@ function testViewsWizardAndListing() {
-    $this->assertRaw('<rss version="2.0"');
+    $this->assertTrue(!empty($this->cssSelect('rss[version="2.0"]')));

Why are these changes necessary to fix the bug?

wim leers’s picture

@alexpott Because it's brittle parsing. Look at the raw output. Because it is processed by DOMDocument, it adds some top-level attributes:

<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0" xml:base="http://tres/taxonomy/term/1">

These are completely harmless, but because we're doing raw string comparison instead of properly parsing XML, our tests fail.

(This is why we should never use assertRaw().)

alexpott’s picture

Issue summary: View changes
StatusFileSize
new557.16 KB
new2.39 KB
new2.62 KB

So looking at a before and after of the frontpage rss.xml with a single node with a body with some links and an image... Here is a git diff --color-words output...

The files used to generate this are also attached.

So there are some interesting changes...

  1. I think we should adjust the views templates to remove the space between utf-8" and ?> - that can be done in a followup.
  2. The change to the description looks okay - it is empty.
  3. The change to ordering of the rss tag looks fine too
  4. The change of escaping a double quote to having a double quote is interesting. I think this is okay and we result in less bytes being sent but we should think about it.
  5. We should open a followup to fix the about link - it needs to be absolute too... about="/user/1" is not canonical

Leaving at rtbc but more opinions about the change to escaping would be nice.

alexpott’s picture

Status: Reviewed & tested by the community » Needs review

So I think we should have the followups for #110.1 and #110.5 open before commit - also I think we need at at least a comment on the issue about the change to escaping and why it is okay. I think it is as I said before more opinions are welcome.

kylebrowning’s picture

@alexpott, any reason the #110.5 shouldn't be handled in this issue?

wim leers’s picture

StatusFileSize
new16.53 KB
new2.04 KB

Agreed that #110.5 can be done here. Done.

#110.1 is so incredibly minor. Those spaces are not meaningful anyway. If @alexpott still wants that, I'd be happy to create a follow-up though.

brianV’s picture

Status: Needs review » Reviewed & tested by the community

Follow up issue for #110.1 created: https://www.drupal.org/node/2772887

#113 looks good to me, so setting to RTBC.

alexpott’s picture

Status: Reviewed & tested by the community » Needs review

So we still haven't had any comment about the escaping change shown in #110. I think it is an okay consequence since the " is not a special character in XML with a field and therefore we're sending less bytes which is a good thing. But it'd be nice for someone else to consider implications before committing this patch.

wim leers’s picture

I also don't see any problems in &quot; becoming ". I think this is the sort of change you have to roll out and then see in the wild if there are any real-world problems with it or not. There's such a broad variety in RSS readers, that it's impossible to predict if this will cause problems or not.

The most reassuring thing I can think of, is testing with https://validator.w3.org/feed/. No difference there.

kylebrowning’s picture

Anyone we know we can task with this review?

twistor’s picture

Status: Needs review » Reviewed & tested by the community

I don't think there's a problem with converting &quot; to ". In fact, I think that's more standard than escaping it.

I could list a bunch of popular feeds that leave quotes alone, here's two:

alexpott’s picture

StatusFileSize
new1.6 KB
new16.11 KB
+++ b/core/lib/Drupal/Core/EventSubscriber/RssResponseRelativeUrlFilter.php
@@ -0,0 +1,71 @@
+  protected function rssRel2abs($rss_markup, Request $request) {
...
+    // Invoke htmlRel2abs() on all HTML content embedded in this RSS feed.
+    foreach ($rss_dom->getElementsByTagName('description') as $node) {
+      $html_markup = $node->nodeValue;
+      if (!empty($html_markup)) {
+        $node->nodeValue = Html::transformRootRelativeUrlsToAbsolute($html_markup, $request->getSchemeAndHttpHost());

I think our standards prefer long descriptive names over short and incorrect english ones... ie 2 != to :)

Plus obviously htmlRel2abs() does not exist anymore.

Given this is a minor refactor leaving at RTBC.

alexpott’s picture

Status: Reviewed & tested by the community » Fixed

My contributions to this issue have been all nits and coding standards. It's great to see this fixed - and nice to see that Drupal 8's architecture makes fixing this type of thing a bit easier.

Committed and pushed 417d9006232b6746c0716bac4601afcfd2a19310 to 8.2.x and da4b6e6 to 8.1.x. Thanks!

  • alexpott committed 417d900 on 8.2.x
    Issue #88183 by Wim Leers, ahoeben, brianV, alexpott, kylebrowning,...

  • alexpott committed da4b6e6 on 8.1.x
    Issue #88183 by Wim Leers, ahoeben, brianV, alexpott, kylebrowning,...
wim leers’s picture

Yay!

alexpott’s picture

Issue tags: +8.2.0 release notes

10 year old bugs that affect people's Drupal planet feeds feels worthy of a release note mention :)

wim leers’s picture

#124: I was secretly hoping for that :D

  • alexpott committed 417d900 on 8.3.x
    Issue #88183 by Wim Leers, ahoeben, brianV, alexpott, kylebrowning,...

  • alexpott committed 417d900 on 8.3.x
    Issue #88183 by Wim Leers, ahoeben, brianV, alexpott, kylebrowning,...
ahoeben’s picture

Aww, I had an alarm set to celebrate the 10th anniversary of this issue and everything... ;-)

Yay, it's fixed!

fomenkoandrey’s picture

works!

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

Anonymous’s picture

Version: 8.1.x-dev » 7.x-dev
Status: Closed (fixed) » Active

Can this be backported to Drupal 7?

("No" is a fine answer... but it would be nice to have a definitive one, either way.)

Thanks! (I hope changing to version 7/active is not incorrect here!)

wim leers’s picture

@ahoeben Hah! :D :D

alexpott’s picture

Version: 7.x-dev » 8.2.x-dev
Status: Active » Closed (fixed)

@ahoeben - so in order to backport this - under the new policy we open a new issue linked to this one. However I think backporting might prove tricky given all the previous discussion.

berdir’s picture

Hm. I noticed that having twig debug enabled combined with this results in *very* weird stuff. Basically, all you get is page with the opening <?xml version="1.0"?> and nothing else. And a bunch of parsing errors when you check the logs/another page.

It was always a bit weird with that (displaying xml with a template is pretty crazy :)) but this made it quite a bit worse.

berdir’s picture

And based on my tests, this also caused a small behavior change: double quotes are no longer encoded. I used to look for the raw string '<a href="' to make sure that things are escaped exactly once (remember how often we broke this? ;)), but now it is ''<a href="'.

I suppose that's not wrong but it is different, so I thought it's worth mentioning.

berdir’s picture

twistor’s picture

@berdir, the quote encoding issue was addressed from #110 on.

james.williams’s picture

Feeds coming from the render cache don't get their relative URLs converted - so I've opened #2959134: Relative URLs in cached feeds should be converted to absolute ones.

keithdoyle9’s picture

Is there an issue already submitted about having absolute URLs in templates be absolute instead of relative? All my nodes currently have relative URLs for canonical links which are invalid.

james.williams’s picture

@keithdoyle9 this ticket was very specifically about URLs in feeds. I think you're looking for #2738373: Canonical link should be absolute.