My site runs on Acquia Cloud behind their Varnish setup and uses CDN 7.x-2.7 with Amazon CloudFront, with Far Future Expiration disabled and the CDN configured to serve only PDF files. Occasionally, and seemingly at random, my top-level domain starts responding with a Varnish-cached 301 redirect, and every browser reports "Too many redirects". This only happens for the non-HTTPS URL of the TLD, e.g. http://www.example.com

Here's a curl -I during the issue:

curl -I http://www.example.com
HTTP/1.1 301 Moved Permanently
Age: 288
Content-Type: text/html; charset=UTF-8
Date: Mon, 29 Feb 2016 21:10:29 GMT
Link: <http://www.example.com/>; rel="canonical"
Location: http://www.example.com/
Server: nginx
Vary: Accept-Encoding
Via: 1.1 varnish
X-AH-Environment: prod
X-Cache: HIT
X-Cache-Hits: 2162
X-Drupal-Cache: MISS
X-Request-ID: v-30c92772-df28-11e5-910a-22000bde070b
X-Varnish: 1697788944 1697780962
Connection: keep-alive

And an nscurl performed at the same time:

nscurl http://www.example.com
Load failed with error: Error Domain=NSURLErrorDomain Code=-1007 "too many HTTP redirects" UserInfo={NSUnderlyingError=0x7f9cf2c41830 {Error Domain=kCFErrorDomainCFNetwork Code=-1007 "(null)"}, NSErrorFailingURLStringKey=http://www.example.com/, NSErrorFailingURLKey=http://www.example.com/, NSLocalizedDescription=too many HTTP redirects}

The problem lasts for a max of about 10 minutes then goes away for a few more days before it pops up again. A search of my project reveals that the CDN SEO settings are the only non-core place where rel="canonical" appears so it seems the most likely culprit. It looks like varnish is caching the 301 redirect that is sent back to the Amazon Cloudfront user-agent but I reserve the right to be completely wrong. Any thoughts?

Attached files:
#2: 2678374-2.patch (856 bytes), by Wim Leers

Comments

jamesrward created an issue. See original summary.

Wim Leers’s picture

Status: Active » Needs review
Related issues: +#1060358: CDN and SEO

Thanks for this very clear bug report!

Vary: User-Agent

I have a suspicion. #1060358: CDN and SEO has zero mention of Vary. In the response you showed, there is this:

Vary: Accept-Encoding

… but ever since the SEO feature, we're actually varying the response by the User-Agent header too.

So the CDN module should emit:

Vary: User-Agent

(And then whichever thing is adding Accept-Encoding would still add it, thus resulting in Vary: Accept-Encoding, User-Agent overall.)

However, there's a problem with doing that: there are so very many different user agents out there that it will cause super inefficient caching. See http://www.rimmkaufman.com/blog/vary-user-agent/30112012/ for example.

So, this is off the table.

Cache-Control: no-cache

This could work. This would tell Varnish not to cache it. It would also tell the CDN to not cache it. Which means a CDN won't cache our 301s, and thus not lighten the load as much anymore in case of bots requesting all pages via the CDN domain, but that's not something that can be fully solved at the CDN level.

Can you try the attached patch?

jamesrward’s picture

Thanks for the fast response. Any thought on how best to test this? First I'm trying to recreate the problem without the patch before I test with the patch.

I tried clearing my Varnish cache then running:

watch -d "curl -I -s http://www.example.com"

Then hitting example.com repeatedly with a spoofed user-agent in the CDN user agents list. While my spoofed user agent keeps getting the Too many redirects error back I can't seem to get varnish to cache it.

The curl command keeps responding with a 200 and a varnish cache miss. I guess if this was easier to trigger it would be happening more often on production so that's a good thing. Any idea where the varnish cache edge-case might be hiding?

Worst case we'll just run with the patch in production but with the intermittent nature of the issue it will be like the rock that keeps tigers away if the issue never pops up again.

jamesrward’s picture

I've got the issue recreated. Will report back with steps to reproduce once I re-test to confirm which steps were necessary. Don't waste any time/thought on this for now :)

Wim Leers’s picture

Thanks, and keep me posted!

jamesrward’s picture

Ok here is how I am reproducing consistently. Feel free to suggest a less crazy way of accomplishing this if you see one.

1.) Fire up watch to keep an eye on the response to curl -I requests for your homepage. Note that depending on your setup you may need /frontpage at the end of this. If you test with a fresh D7 install you definitely will.

watch -n 5 -d "curl -I http://www.example.com"

2.) Clear varnish caches

3.) Clear all caches with drush cc all

4.) As soon as you see X-Drupal-Cache: MISS start spamming your homepage with spoofed requests from Amazon Cloudfront. I used Firefox with User Agent Switch add-on for this. I found Chrome was far less reliable and seems to be doing some extra response caching.

5.) Your curl -I should start responding with a 301 and an X-Varnish hit (two numbers after X-Varnish). Open up another browser (Chrome was weird here too) and enjoy your broken homepage.
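The check in step 5 can also be scripted instead of eyeballed. What follows is only a minimal sketch in Python, assuming the response headers look like the curl -I output in the original report. The function name is_cached_redirect is hypothetical, and the sample header values are copied from that report.

```python
def is_cached_redirect(status, headers):
    """Return True when a response looks like a 301 served from Varnish's cache.

    `headers` is a dict mapping response header names to their values.
    """
    # Header names are case-insensitive, so normalise them first.
    h = {name.lower(): value for name, value in headers.items()}
    # A Varnish cache hit carries two transaction IDs in X-Varnish (this
    # request plus the one that populated the cache); a miss carries one.
    varnish_hit = len(h.get("x-varnish", "").split()) == 2
    # The response in the original report also has an explicit X-Cache header.
    x_cache_hit = h.get("x-cache", "").strip().upper() == "HIT"
    return status == 301 and (varnish_hit or x_cache_hit)


# Header values taken from the curl -I output in the original report.
broken = {
    "X-Cache": "HIT",
    "X-Varnish": "1697788944 1697780962",
    "Location": "http://www.example.com/",
}
print(is_cached_redirect(301, broken))                       # True
print(is_cached_redirect(200, {"X-Varnish": "1697788944"}))  # False
```

Feeding it the status code and headers of each polled response flags the moment the cached 301 appears.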

In my case the issue persists for 900 seconds (15 minutes) but I'm pretty sure varnish is getting this from the Cache-Control settings in Drupal. Which of course means this patch should do the trick.

Testing the patch with this scenario now and will report back.

jamesrward’s picture

#2 is passing this test with flying colors so far. I'll keep hammering at it after lunch but it's looking good so far.

jamesrward’s picture

I've thrown everything I can think of at this patch and it's looking good to me. We're going to get this on production and I will report back if we see any odd behaviour but this looks RTBC to me.

Thanks!

jamesrward’s picture

I may have spoken too soon. With this patch in place the CDN mirror of my site, example.cloudfront.net/, is completely browseable with no redirects back to www.example.com. I assume this is not the intended behaviour and that the CDN should 301 back to www.example.com on every page. Looks like the no-cache directive is keeping the 301 off the CDN entirely.

jamesrward’s picture

Status: Needs review » Needs work

Marking as needs work as the proposed solution didn't work and we had another too many redirects issue recently. Should I be asking Acquia to add a rule to their VCL to prevent caching requests from Amazon Cloudfront? What if we have the 301 redirect go somewhere other than the homepage so it's not as visible or give it some kind of Varnish cache busting query string?

Wim Leers’s picture

Should I be asking Acquia to add a rule to their VCL to prevent caching requests from Amazon Cloudfront?

Not caching any requests if the user agent is Amazon CloudFront seems wrong too: it's fine for Varnish to cache static assets.

What if we have the 301 redirect go somewhere other than the homepage so it's not as visible or give it some kind of Varnish cache busting query string?

I don't see how that would solve it either.


I wonder if we should change the implementation of the SEO feature to deny CDN user agents access to anything except the public files directory and any of the profiles/*, misc/*, sites/* directories. That would perhaps fix it more reliably, and would make for simpler VCL rules.

Wim Leers’s picture

Category: Bug report » Support request

Marking this as a support request for now because it looks like nobody else has this problem.

Wim Leers’s picture

Title: Too many redirect cached by varnish » SEO (duplicate content prevention) causing redirect loop in combination with Varnish?

And better title.

Wim Leers’s picture

Category: Support request » Bug report
Priority: Normal » Major

I was thinking about this last night, and I'm pretty sure I've figured it out.

#2 was mostly right, but the conclusion was wrong. The problem is indeed that we don't have Vary: User-Agent, but sadly we cannot use that, since there's tens of thousands of variations of it. It'd kill caching. And for that very reason, most reverse proxies don't respect it anyway.
But, this is plain wrong:
Cache-Control: no-cache

This could work. This would tell Varnish not to cache it. It would also tell the CDN to not cache it. Which means a CDN won't cache our 301s, and thus not lighten the load as much anymore in case of bots requesting all pages via the CDN domain, but that's not something that can be fully solved at the CDN level.

The reason it's wrong: because the non-CDN-User-Agent responses will still end up being cached, which means Varnish will cache it, and then it'll end up in the CDN anyway.

i.e. what happens is this:

  1. End user: HTTP GET http://example.com/blog/llamas-are-cool — this hits Varnish. Empty cache, so Varnish requests it from the web server. The web server responds, with a response that has Cache-Control: public, max-age=X. Varnish caches the response and then relays it.
  2. CDN: HTTP GET http://example.com/blog/llamas-are-cool — this hits Varnish, and it's cached already.
  3. Result: http://cdn.example.com/blog/llamas-are-cool is serving duplicate content.

The reason the patch in #2 did seem to work at first is that you were explicitly testing this, which is why this was the request order:

  1. CDN: HTTP GET http://example.com/blog/llamas-are-cool — this hits Varnish. Empty cache, so Varnish requests it from the web server. The web server responds with a 301 response that has Cache-Control: no-cache. Varnish does not cache the response, and just relays it.
  2. End user: HTTP GET http://example.com/blog/llamas-are-cool — this hits Varnish. Empty cache, so Varnish requests it from the web server. The
  3. Result: http://cdn.example.com/blog/llamas-are-cool is doing a redirect to http://example.com/blog/llamas-are-cool.
    Yay!
    But…
  4. End user: HTTP GET http://example.com/blog/llamas-are-cool — this hits Varnish. Empty cache, so Varnish requests it from the web server. The web server responds, with a response that has Cache-Control: public, max-age=X. Varnish caches the response and then relays it.
  5. CDN: HTTP GET http://example.com/blog/llamas-are-cool — this hits Varnish, and it's cached already.
  6. Result: http://cdn.example.com/blog/llamas-are-cool is serving duplicate content.

i.e. as long as Varnish is only getting CDN requests, it will work fine. But as soon as a regular end user has requested a page, then it'll be cached in Varnish, which means it'll eventually make its way onto the CDN too.
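The effect of request order can be reproduced with a toy model. This is only a sketch in Python, not the module's or Varnish's actual code: origin() is a hypothetical stand-in for Drupal plus the CDN module, UrlKeyedProxy is a hypothetical reverse proxy whose cache key ignores the User-Agent, and the second scenario assumes the 301 is cacheable (that is, without the no-cache patch from #2).

```python
CDN_UA = "Amazon CloudFront"


def origin(url, user_agent):
    """Stand-in for the web server: redirect CDN user agents, serve everyone else."""
    if CDN_UA in user_agent:
        return (301, url)  # redirect back to the canonical URL
    return (200, "page body")


class UrlKeyedProxy:
    """Reverse proxy whose cache key is the URL only (User-Agent is ignored)."""

    def __init__(self):
        self.cache = {}

    def get(self, url, user_agent):
        if url not in self.cache:
            self.cache[url] = origin(url, user_agent)
        return self.cache[url]


url = "http://example.com/blog/llamas-are-cool"

# Order 1: end user first, then the CDN. The CDN receives the cached page,
# so the CDN domain ends up serving duplicate content.
proxy = UrlKeyedProxy()
proxy.get(url, "Firefox")
cdn_sees = proxy.get(url, CDN_UA)
print(cdn_sees)   # (200, 'page body')

# Order 2: CDN first, then an end user. The user receives the cached 301,
# which points at the URL they just requested: a redirect loop.
proxy = UrlKeyedProxy()
proxy.get(url, CDN_UA)
user_sees = proxy.get(url, "Firefox")
print(user_sees)  # (301, 'http://example.com/blog/llamas-are-cool')
```

Either way, whichever kind of client populates the cache first decides what everyone else receives until the entry expires.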

Wim Leers’s picture

Title: SEO (duplicate content prevention) causing redirect loop in combination with Varnish? » SEO (duplicate content prevention) causing redirect loop in combination with Varnish

So, the issue title is actually spot-on.

The reason this doesn't happen with Drupal's built-in page cache, which is also a dumb reverse proxy like Varnish (well, much dumber actually, but I digress), is that the CDN module implements hook_boot(), which allows logic to run even for cached pages.

The rudimentary answer is therefore: you have to duplicate the logic of _cdn_seo_should_redirect() in Varnish if you want to prevent duplicate content from appearing on your CDN.

Now working on a better answer.

Wim Leers’s picture

There are basically two solutions, neither of which is automatic. So I'm afraid you have to either:

  1. Create the necessary matching logic for your reverse proxy setup (Varnish or otherwise).
  2. Ensure that requests by the CDN are cached *separately* by your reverse proxy. This effectively implements a very limited Vary: User-Agent. (In other words: let your reverse proxy normalize the User Agent header into two categories: CDN user agent and everything else. Then use this normalized value as a key for looking up cached responses.)
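The second option can be illustrated with a small model. This is a sketch in Python under stated assumptions, not a reverse proxy configuration: origin(), normalize_user_agent() and TwoBucketProxy are hypothetical names, and "Amazon CloudFront" stands in for whatever CDN user agents are configured in the module.

```python
CDN_UA = "Amazon CloudFront"


def origin(url, user_agent):
    """Stand-in for the web server: redirect CDN user agents, serve everyone else."""
    if CDN_UA in user_agent:
        return (301, url)
    return (200, "page body")


def normalize_user_agent(user_agent):
    """Collapse every possible User-Agent value into one of two buckets."""
    return "CDN" if CDN_UA in user_agent else "other"


class TwoBucketProxy:
    """Reverse proxy whose cache key is (URL, normalized user agent)."""

    def __init__(self):
        self.cache = {}

    def get(self, url, user_agent):
        key = (url, normalize_user_agent(user_agent))
        if key not in self.cache:
            self.cache[key] = origin(url, user_agent)
        return self.cache[key]


url = "http://example.com/blog/llamas-are-cool"
proxy = TwoBucketProxy()
user_first = proxy.get(url, "Mozilla/5.0 Firefox")
cdn_second = proxy.get(url, CDN_UA)
user_again = proxy.get(url, "Mozilla/5.0 Chrome")
print(user_first[0], cdn_second[0], user_again[0])  # 200 301 200
print(len(proxy.cache))                             # 2
```

The cache holds at most two entries per URL regardless of how many distinct browsers visit, which is what keeps this cheaper than a real Vary: User-Agent.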

The logic in the current module is sound, and we don't need that Cache-Control: no-cache header either. The only thing missing is documentation. We could add an automated test for this to hook_requirements() though.

Wim Leers’s picture

Title: SEO (duplicate content prevention) causing redirect loop in combination with Varnish » SEO (duplicate content prevention) causing redirect loop in combination with reverse proxy between CDN and web server

jamesrward’s picture

Great stuff Wim. Thanks for persevering on this.

Wim Leers’s picture

Title: SEO (duplicate content prevention) causing redirect loop in combination with reverse proxy between CDN and web server » Document that SEO (duplicate content prevention) causes redirect loop in combination with reverse proxy between CDN and web server
Category: Bug report » Task
Status: Needs work » Needs review
Related issues: +#2708771: Port SEO (duplicate content prevention) to 8.x-3.x

#2708771: Port SEO (duplicate content prevention) to 8.x-3.x, the D8 port of this functionality, already includes #16 in the documentation.

I think this addition to the README would make sense:

6) If your site is behind a reverse proxy such as Varnish, so that your stack
   looks like: CDN <-> reverse proxy <-> web server, then you need to take extra
   measures if you want to prevent duplicate content showing up on the CDN. See
   https://www.drupal.org/node/2678374#comment-11278951 for details.

  • Wim Leers committed 4f39792 on 7.x-2.x
    Issue #2678374 by Wim Leers, jamesrward: Document that SEO (duplicate...

Wim Leers’s picture

Status: Needs review » Fixed

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

das-peter’s picture

Component: Origin Pull mode » Code

Just in case someone stumbles over this, here's a snippet for Varnish 4:

vcl 4.0;
sub vcl_hash {
  # Add CDN specific cache as documented here:
  # https://www.drupal.org/node/2678374#comment-11278951
  # Compare the request user agent to the CDN user agent - if matching we change
  # the hash
  if (req.http.user-agent ~ "Amazon CloudFront") {
    hash_data("CDN");
  }
  return (lookup);
}

standingtall’s picture

For Nginx (particularly if you are using perusio config).

These lines ensure that a separate cache is maintained for the CDN user agent.

}

## This is very important CDN module issue https://www.drupal.org/node/2678374
set $cdn 'NO';

if ($http_user_agent ~* "CloudFront") {
  set $cdn 'CDN';
}
fastcgi_cache_key $uri$request_method$is_args$args$cdn;

}

pattersonc’s picture

I think it's worth pointing out that the Varnish (v4) snippet above (#23) calls return (lookup) which will abort the execution of the default vcl_hash routine which contains logic for hashing host and url. I believe the goal is to simply add "CDN" to the hash_data and continue on with the default vcl implementation.

Reference: https://varnish-cache.org/docs/4.0/users-guide/vcl-hashing.html

sub vcl_hash {
  # Add CDN specific cache as documented here:
  # https://www.drupal.org/node/2678374#comment-11278951
  # Compare the request user agent to the CDN user agent - if matching we change
  # the hash
  if (req.http.user-agent ~ "Amazon CloudFront") {
    hash_data("CDN");
  }
}
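
To see why the early return matters, here is a toy model of how the cache key gets assembled. It is a sketch in Python, assuming the built-in Varnish 4 vcl_hash hashes the URL and then the Host header (falling back to the server IP when Host is absent). The function cache_key and its return_lookup_early flag are hypothetical and only mimic the two VCL variants.

```python
def cache_key(req, return_lookup_early):
    """Collect the strings that would be fed to hash_data() for one request."""
    parts = []
    # Custom vcl_hash: mark requests coming from the CDN.
    if "Amazon CloudFront" in req.get("user-agent", ""):
        parts.append("CDN")
    if return_lookup_early:
        # return (lookup) in the custom VCL: the built-in vcl_hash never runs.
        return tuple(parts)
    # Built-in vcl_hash: URL first, then Host (or the server IP if Host is missing).
    parts.append(req["url"])
    parts.append(req.get("host", "<server.ip>"))
    return tuple(parts)


browser_home = {"url": "/", "host": "www.example.com", "user-agent": "Firefox"}
browser_blog = {"url": "/blog/llamas-are-cool", "host": "www.example.com",
                "user-agent": "Firefox"}

# With the early return, two different pages share one cache key.
print(cache_key(browser_home, True) == cache_key(browser_blog, True))    # True
# Without it, URL and Host are part of the key and the pages stay separate.
print(cache_key(browser_home, False) == cache_key(browser_blog, False))  # False
```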