For sites that use a mixed HTTP/HTTPS setup, protocol-relative URLs (via Pathologic) are a logical choice. However, it appears there are some conditions under which those URLs can be captured in the field cache in a format that is not compatible with mixed HTTP/HTTPS requests.

Consider the following steps:

  1. The server is set up for mixed HTTP/HTTPS on the standard ports 80/443, and some pages on the site do not force HTTPS for anonymous page views.
  2. Pathologic is set up in a text format and configured to generate proto-relative internal URLs.
  3. The input format above is active on the body field for node content.
  4. A node representing /page-a has a body field that contains an internal link to /page-b.
  5. The field cache is cold (clear all caches).
  6. A request is made to http://www.mydomain.com:80/page-a -- Note that the port is explicitly declared in the request (this is the key to the bug).

Though that last step may seem far-fetched, it looks like some spiders may actually append port 80 to the domain (regardless of the referral they are following) for some strange reason. I can confirm this from our observations on at least one production site.

When these conditions come together, the request in the last step will result in a base_url of http://www.mydomain.com:80. This means that Pathologic's call to url() (with the "absolute" option) for the /page-b href will produce http://www.mydomain.com:80/page-b. When this is ultimately converted into a proto-relative URL, we end up with //www.mydomain.com:80/page-b stored in the field cache. Later, if this link is followed over a secure connection, the client will try https://www.mydomain.com:80/page-b, which will result in a broken link, security warnings, and, in some clients, potentially even blacklisting behavior.
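To make the failure mode concrete, the transformation chain can be sketched as plain string operations (this is an illustration in Python, not Pathologic's actual code; the domain and paths are the placeholders from the steps above):

```python
# Sketch of how an explicit :80 in the request poisons the cached URL.
# Not Pathologic's real code -- just the string transformations involved.

base_url = 'http://www.mydomain.com:80'    # derived from the spider's request
absolute = base_url + '/page-b'            # roughly what url(..., absolute) yields

# The protocol-relative conversion strips the scheme but keeps the port.
proto_relative = absolute[len('http:'):]   # '//www.mydomain.com:80/page-b'

# Later, a browser on an HTTPS page resolves the cached link like this:
resolved = 'https:' + proto_relative       # port 80 over TLS: broken link
```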

The tricky thing here is that scheme-relative URLs can, technically, define a port. Furthermore, there is nothing wrong with explicitly declaring a port for any request (the port may only be omitted when the scheme's standard port is implied). So to address this, it probably does not make sense to unconditionally strip all ports from proto-relative URLs produced by Pathologic. However, we may want to consider stripping the "standard" ports (80 and 443) if they are found, perhaps in _pathologic_url_to_protocol_relative().
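The conditional stripping idea could look roughly like the sketch below (Python with urllib, purely to illustrate; the real fix would of course live in the module's PHP, and the userinfo handling is an assumption on my part):

```python
from urllib.parse import urlsplit, urlunsplit

def strip_default_port(url):
    """Drop :80/:443 from a scheme-relative URL; leave other ports alone."""
    # urlsplit treats '//host:port/path' as a netloc-bearing URL.
    parts = urlsplit(url)
    if parts.port in (80, 443):
        netloc = parts.hostname
        if parts.username:  # preserve any (rare) user:password component
            password = ':' + parts.password if parts.password else ''
            netloc = parts.username + password + '@' + netloc
        parts = parts._replace(netloc=netloc)
    return urlunsplit(parts)
```

Only the two default ports are removed; a non-standard port such as :8080 passes through untouched.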

I'd be happy to get a patch going for this, but I think it might warrant some discussion first.


Comments

rjacobs created an issue. See original summary.

rjacobs’s picture

Priority: Normal » Major

Note also that I found it tricky to replicate that last step (#6) locally when debugging. It seems that almost all clients (browsers, wget, curl, etc.) will strip the :80 from the URL for HTTP requests no matter what you do. I ended up needing to use telnet to replicate a spider's request with the explicit port 80 declared (and thus replicating the problematic behavior we were seeing on a production site):

$ telnet www.host.com 80
GET /page-b HTTP/1.1
Host: www.host.com:80
Connection: close
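For the record, the same request can also be scripted rather than typed into telnet. The sketch below just builds the raw request bytes (actually sending them over a socket is omitted so nothing here touches the network; the host name is the placeholder from above):

```python
def raw_get(host, path, port=80):
    """Build a raw HTTP/1.1 GET that declares the port in the Host header,
    the way the misbehaving spiders apparently do."""
    return (
        f'GET {path} HTTP/1.1\r\n'
        f'Host: {host}:{port}\r\n'
        'Connection: close\r\n'
        '\r\n'
    ).encode('ascii')

request = raw_get('www.host.com', '/page-b')
# Writing `request` to socket.create_connection(('www.host.com', 80))
# reproduces what the telnet session above does by hand.
```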

I have not formally tested this against 7.x-3.x, though a quick scan of the code in the repo leads me to conclude that this same issue is present there as well.

Also, I'm marking this "major". It may not be a commonly encountered problem, but the ramifications are significant. Furthermore, it's hard to mitigate, as one cannot control what requests will be coming in from spiders when the field cache is cold.

rjacobs’s picture

Status: Active » Needs review
FileSize
2.23 KB

Here's a patch. It basically just adds a filter to _pathologic_url_to_protocol_relative():

    // Also strip the port if it's explicitly declared in the URL. We only do
    // this for default ports that accompany http/https schemes as they can
    // cause conflicts for scheme-relative URLs.
    // @see https://www.drupal.org/node/2547185
    $url = preg_replace('~:(80|443)([^0-9@][^@]*)?$~', '$2', $url);

By this point in the process (after url() is called) all colons should already be encoded, with the exception of the port delimiter and the user/pass delimiter. So the regex just tries to deal with those cases by stripping the port if it's 80 or 443 (but not any other number), without touching any user:password parts. This seems cleaner than formally parsing the URL yet again. I'm not sure if a user:password component (which is super-rare for http/https anyway) could even make its way through url() on an internal path, but I figured the filter should be aware of it.
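For anyone reviewing the pattern, its behavior translates directly to other PCRE-compatible engines; here it is exercised in Python (same pattern, with `$2` written as `\2`):

```python
import re

def strip_port(url):
    # Same pattern as the patch: only default ports 80/443 are removed,
    # and the [^@]* guard refuses to match if an '@' follows, which keeps
    # user:password components intact.
    return re.sub(r':(80|443)([^0-9@][^@]*)?$', r'\2', url)

cases = {
    '//www.mydomain.com:80/page-b':   '//www.mydomain.com/page-b',
    '//www.mydomain.com:443/page-b':  '//www.mydomain.com/page-b',
    # Non-default port: [^0-9@] refuses the trailing digits, so no match.
    '//www.mydomain.com:8080/page-b': '//www.mydomain.com:8080/page-b',
    # ':80' here is a password, and the following '@' blocks the match.
    '//user:80@www.mydomain.com/page-b': '//user:80@www.mydomain.com/page-b',
}
```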

Some rudimentary test coverage is also included.