Motivation

Protocol-relative URLs solve the issue with mixed-mode (HTTPS- and HTTP-facing) Drupal instances and filter caching, without breaking (if readers are well-behaved) links in external feeds (such as RSS feeds from nodes).

It'd be extremely useful if there were a setting for Pathologic to rewrite all otherwise-absolute URLs pointing internally to use protocol-relative absolute URLs instead.

Case example

Consider a Drupal site at example.com/ that responds to both HTTP and HTTPS; images and links are inserted by administrators on the HTTPS version, causing them to be absolutely linked to https://example.com/foo/bar. If that link were changed to "//example.com/foo/bar," visitors on the HTTP side would see links to "http://example.com/foo/bar," but admin would still see/click on links to "https://example.com/foo/bar."

Our environment is set up exactly in this manner, so the functionality is necessary for us (so far, we've been getting by with just disabling absolute URLs and having Pathologic output relative URLs only, but this breaks links in feeds).

Proposed resolution

Attached is a patch that theoretically adds this functionality via a "Modify absolute URLs to be protocol-relative" configuration checkbox (positioned just after the "Output full absolute URLs" checkbox). For the time being, I added a compatibility setting of "FALSE" to the option for the test suite; a real test case should be written, which I haven't done.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Garrett Albright’s picture

I like this idea. I'm afraid I can't use your patch directly because of style issues (} else {, $var1 = $var2 = "value") and I think radio buttons would work better than double check boxes on the form, but I'll use it as inspiration to get this feature into the module. Thanks!

Garrett Albright’s picture

Please give the just-released Beta 5 release a try and let me know if it works for you.

jay.dansand’s picture

Oops! Here's a version of the patch that fixes the style problems. That's what I get for trying to multitask projects.

Anyway, that was an impressively quick turnaround, thank you! The only potential issue I see is in the new conversion function:

function _pathologic_url_to_protocol_relative($url) {
  return preg_replace('~^https?://~', '//', $url);
}

2 issues, one of which is probably only an academic concern (I also don't know if we should make it case-insensitive by passing the "i" pattern modifier to preg_replace(). Couldn't hurt?)

Protocols other than HTTP (an academic issue that probably doesn't matter)

This won't work with schemes other than HTTP(S), though that probably doesn't matter (I don't know if we should worry about Drupal sites running on gopher, for instance). Technically, we could support any arbitrary scheme; see RFC3986, particularly sections 3.1 (scheme definition), 3.3 (path determination), and especially 5.2.2 (algorithm for constructing target based on [relative] reference - this is the thing that makes protocol-relative URLs actually work).

The original patch tried to use Drupal's $base_url to accommodate any scheme Drupal runs on (hence all the REGEXP overkill). If we choose to support any scheme instead of just HTTP(S), it should be noted that not all schemes have a path component starting with "//" so REGEXP/strpos() searches for "//" should be converted to searching for ":" instead.

URLs that shouldn't be rewritten (this may actually be a real concern)

Not all "http://" returned by url() should be replaced with "//". hook_url_outbound_alter() lets modules modify this; the example usage changes feed URLs to point off-site. It's possible that the external server doesn't listen to HTTPS (or HTTP?) requests, so we shouldn't rewrite these URLs arbitrarily. The $base_url-derived REGEXP hooey in the original patch tried to account for this, and only changed internally-pointing absolute URLs. I think this could be an important case to handle.

I'm probably way overthinking this, and it doesn't really matter. Just a thought. Again, thanks for the awesomely fast turnaround!

Garrett Albright’s picture

With regards to your first concern, though I think it's important to maintain Gopher compatibility (tongue firmly in cheek), I think those sorts of edge cases are where you'd be better off using one of the other protocol output methods.

As for your second one, I can see what you're getting at - the code needs to ensure that the URL created still refers to "this" server before it tries to rewrite it. Hmm.

jay.dansand’s picture

For the first point, wow... talk about coincidences. I had no idea you made Drupher; that's hilarious. I agree they're all edge cases, though with the REGEXP work already in place (see below) it's free to support. I don't care; I just work in higher ed. so purely academic problems are the norm :)

For the second issue, I think that's where we still need to do some kind of comparison against $base_url; it lets Drupal determine "this is where I think I am running from" and we are sure to only modify HTTP(S) references back to the Drupal base. Anything pointing elsewhere would be untouched.

I've just merged the $base_url-derived REGEXP from the original patch with the new function in beta5. It's not pretty, but it works: anything starting with $base_url is replaced with a protocol-relative version of $base_url. I did modify the REGEXP to only support HTTP(S) as well, though we can easily revert that if desired.

Garrett Albright’s picture

I took another approach to it; here's the diff. No new release yet, though - hoping to get #1672430: File paths prepended with ?q= when clean URLs are disabled addressed first. Thanks for your feedback with this.

Garrett Albright’s picture

Status: Needs review » Closed (fixed)

7.x-2.0 released with this feature, so closing this issue.