The current cache headers are suboptimal for cases where Drupal is running behind reverse proxy caches.

This patch tries to implement a more reasonable Cache-Control and Expires headers.

It does the following:

- Instead of having Expires in 1978, it would be set to the current time + the number of seconds defined by the cache_lifetime variable.

- Instead of Cache-Control must-revalidate, it sets it to public and with a max-age to what cache_lifetime says it should be.

It would be nice if this can be tested on s.d.o to see if it makes a difference.

P.S. The current Expires and Cache-Control headers are intimately tied to Drupal caching. This may not be the best case in some cases (e.g. reverse proxy), since it still does all the processing for generating the cached items, storing them in the database, and clearing them later form the database. At some future point we may want to de-couple these two things (Drupal's database cache, vs. reverse proxy). The new reverse_proxy setting can be checked, and all Drupal's database cache bypassed in this case, but proper Cache-Control and Expires headers sent.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

kbahey’s picture

Here is the relevant section in RFC 2616 on w3c web site.

simon rawson’s picture

Have not tested the patch as I am not running 6 yet...

In principle this patch gets a big thumbs up from me, it is a very desirable feature. (One drupal site I maintain runs behind a reverse proxy and current headers make that worthless.)

But it is such a fundamental issue (and even Dries has mentioned it before) that I'm convinced it can't be this easy to fix... surely...

kbahey’s picture

This patch is a definite improvement, but not The Ultimate Solution (tm).

The Ultimate Solution (tm) will be to decouple the Cache-Control headers from Drupal's internal cache.

However, the code freeze is upon us (2 days), and I'd rather get this in for Drupal 6 than nothing. The Ultimate Solution (tm) will require a lot of refactoring, and there is no time for that now.

douggreen’s picture

FileSize
1.09 KB

I have no comment on the patch, yet, but here's a minor re-roll to meet string coding standards.

moshe weitzman’s picture

i looked into this. i wonder if max_age is too high. it should be no greater than time when we will flush this item. but figuring out that time seems complicated. see top of cache_get().

pwolanin’s picture

Version: 6.x-dev » 7.x-dev
Robin Monks’s picture

Status: Needs review » Reviewed & tested by the community

Patch applies and appears to work properly. Bootstrap and sessions simpetests pass without fail. RTBC.

Robin

Robin Monks’s picture

FileSize
1.04 KB

Reroll that only fixes the offset and fuzz when applying the patch.

Robin

Dries’s picture

I wonder why this is an improvement. Did we measure/benchmark it?

Robin Monks’s picture

I believe this is something primarily useful in a cluster/proxy environment. I can't give you figures based on what this would do there, that said I can try benchmarking this on a drupal install that is not run through a proxy and see if it makes any difference, if that would be useful?

Robin

kbahey’s picture

@Dries

Currently, when running behind a reverse proxy, like we do on d.o, we have no caching of HTML at all. Only media files and css, and js gets cached. This is because our cache headers are faulty and forces the request to get sent from the reverse proxy to the web server no matter what.

This is based on observations by Narayan Newton on d.o itself. He blogged about it here http://staff.osuosl.org/~nnewton/?q=node/34. I discussed the matter with him and this patch was the result.

If we want the reverse proxy to take the load off the web server, then we need this patch. Obviously, everyone does want that.

We can get Narayan to test it on scratch.d.o if need be.

Dries’s picture

Status: Reviewed & tested by the community » Needs review

I think it would be useful to test this patch on d.o (or scratch.d.o) for a bit and measure the impact. I'd like to better understand the implications / gain. I'll mark it as 'code needs review' until we have some more data. Looks like a good patch though.

Dries’s picture

I wonder if this patch was applied to drupal.org because I have serious caching issues with d.o now.

killes@www.drop.org’s picture

I am running this patch for about 12h now on d.o. There is one error report from user mikl who says that he gets the frontpage for anonymous although he is logged in.

killes@www.drop.org’s picture

I was actually able to reproduce the problem. If I do a "hard reload" then the problem goes away.

Robin Monks’s picture

Is it just me, or does Drupal.org seem faster :)

Login bug is an issue though :(

Robin

killes@www.drop.org’s picture

This also happens with other pages if they are already in Squid. Obviously, Squid cannot distinguish between users with a valid logged in session and ones without.

Robin Monks’s picture

As I recall, it should be possible to tell squid to not use it's cache if a certain cookie is present. Perhaps this functionality could be configured?

Robin

killes@www.drop.org’s picture

I've disabled the patch on d.o again, it was too confusing for people

Dries’s picture

Status: Needs review » Needs work

I had the same problem described above. I was logged in but browsing pages as an anonymous user. In other words, this patch needs more work.

kbahey’s picture

Without some intelligence in the proxy to differentiate between logged in and anonymous users, there is no solution to this.

For example, if the proxy can check a certain cookie, and we can add a cookie such as DRUPAL_LOGGED_IN and set it to FALSE or TRUE.

I am not sure how to make Squid aware of this, and let it do a pass through the requests when this is set.

Can anyone with Squid knowledge advise on whether this is doable and how to do it?

kbahey’s picture

Using Varnish, the config will be something like this:

  if (req.http.Cookie && req.http.Cookie ~ "DRUPAL_LOGGED_IN=TRUE") {
    pass;
  }  

  # Remove the "Cookie:" header from the request.
  remove req.http.cookie;

So, requests that have this cookie set will be passed through the proxy without any caching.

Anonymous users will be served from the cache.

The code to add the cookie is trivial. Boost does something similar already.

kbahey’s picture

Status: Needs work » Needs review
FileSize
2.91 KB

Here is an updated patch, based on discussions with killes and webchick.

It does the following:

1. Checks if the reverse_proxy variable is set.
2. If so, it will:
a. Set a new cookie called DRUPAL_LOGGED_IN and sets it to 1 if the user is logged in, or to 0 otherwise.
b. Output the correct headers that causes the page to be cached

The default.settings.php contains directives for Squid and Varnish that check the above cookie and pass thru the request if the cookie is set to 1. Otherwise, the user is served from the cache.

@killes, if you can apply this to d.o, as well as the Squid changes, and then see if there is any difference from before.

Dries’s picture

killes, make sure to not only test it, but to actually measure impact, please.

webchick’s picture

Subscribing.

There are some minor code style inconsistencies (in 7.x we concatenate with ' . $var rather than '. $var, and there should be periods at the end of all inline comments) and I'd like to see the legibility of the settings.php snippets improved (is that Squid stuff all on one line, or..?) but on visual inspection this seems to address the concerns I brought up in #drupal-infrastructure.

kbahey’s picture

FileSize
2.72 KB

This is a reroll against today's changes, and also contains one important change:

Instead of relying on a cookie, which has to go on the browser and be deleted/expired, it uses a custom header that the reverse proxy would understand and pass through the request.

The header is:

X-Drupal-Logged-In: Yes

Squid's configuration would be like so:

acl drupal_logged_in rep_header "^X-Drupal-Logged-In: Yes$"
cache deny drupal_logged_in

Squid can even delete the header if so desired.

That makes things simpler.

kbahey’s picture

FileSize
2.67 KB

Some minor changes, most importantly is removing time() as the rest of core does now.

kbahey’s picture

FileSize
3.65 KB

After thinking about this for a while, it seems that a cookie is indeed needed. This is because there are two things that happen:

Request: from browser to server, and with the proxy in between. The proxy has to be told that this request should not be server from its own cache, but should be passed through to the server. There is no way to do this except with a cookie, since we can't just send headers from a browser.

Reply: from server to browser, with proxy in between. The server has to tell the proxy not to cache a certain request, because it is from a logged in user. So, again a cookie is checked for.

The attached patch implements that.

kbahey’s picture

FileSize
6.28 KB

OK, this is one is now tested with Squid, and is proven to work.

Anonymous users will have pages served from the reverse proxy, which in turn gets its pages from the drupal cache (db, memcache, whatever).

Logged in users will not see any of the cached pages.

We should see an improvement on drupal.org if this is applied, by having a lot of hits on pages served from Squid.

Two points, for future consideration, and should not concern us now:

First a minor thing: when an authenticated user requests a page (say node/123), then this causes the cached copy in squid to expire, and not be replaced by the one served to the authenticated user. This is what we want to happen so as not to show anonymous users pages that are customized for authenticated users. But, it means that whenever an authenticated user requests a page, there will be no cached copy of it for the anonymous users. This is either a squid limitation, as it does not allow a transparent pass through, or that there is an option for it in the configuration that I have yet to discover.

Second, we should have a future improvement to decouple reverse proxy caching from Drupal caching. Right now, we need Drupal caching to be on in order to have the reverse proxy cache save pages from Drupal. It would be nice to have a situation where there is no Drupal cache, and only a reverse proxy cache.

Right now, the patch is as good as it gets. Let us try it.

kbahey’s picture

And here are the benchmarks. Using siege I ran 10 concurrent users hammering the site (two paths, node and node/1) as fast as possible for one minute ...

D7, no proxy

Response time:                  0.03 secs
Transaction rate:             329.66 trans/sec
Concurrency:                    9.95
Successful transactions:       19872
Longest transaction:            0.29
Shortest transaction:           0.00

D7, with reverse proxy

Response time:                  0.02 secs
Transaction rate:             598.66 trans/sec
Concurrency:                    9.98
Successful transactions:       36063
Longest transaction:            0.03
Shortest transaction:           0.01

So, 329 requests/second without this patch and a reverse proxy, and 598 requests/second with. Not shabby at all!

The reverse proxy is a lowly PII 350MHz with 384MB of RAM by the way, and has its CPU saturated when I run the tests, so things can be even faster on a more modern machine.

Damien Tournoud’s picture

First a minor thing: when an authenticated user requests a page (say node/123), then this causes the cached copy in squid to expire, and not be replaced by the one served to the authenticated user.

That's quite a huge point. Seeing that, I'm not sure this patch would have any effect on d.o.

Other strong point against this: this change will also make other proxies in the chain (ie. enterprise gateways, ...) cache some pages if there are viewed by anonymous users. Users behind proxies will experience caching issues when they login.

kbahey’s picture

Clarification on the previous benchmark. The first one (without the reverse proxy) was hitting the Drupal server directly. The second one was hitting the reverse proxy. In both cases the patch was applied.

Here is another set of benchmarks with and without the patch, both via the proxy. You can see the big difference all the same.

Thru the proxy, without the patch

Transactions                   12504 hits
Availability:                 100.00 %
Elapsed time:                  60.10 secs
Response time:                  0.05 secs
Transaction rate:             208.05 trans/sec
Concurrency:                    9.99
Successful transactions:       12504
Failed transactions:               0
Longest transaction:            0.19
Shortest transaction:           0.01

Thru the proxy, with the patch

Transactions:                  34498 hits
Availability:                 100.00 %
Elapsed time:                  60.40 secs
Response time:                  0.02 secs
Transaction rate:             571.16 trans/sec
Concurrency:                    9.98
Successful transactions:       34498
Failed transactions:               0
Longest transaction:            0.07
Shortest transaction:           0.01
kbahey’s picture

@Damien

I think it will help with d.o. The frequently accessed pages though will not benefit a lot, but other will. This is mostly because of how Squid works. If we can make it do a passthru without affecting its cache, then we have a solution. Otherwise, the first request an anonymous user makes to a page that has been expired from Squid by an authenticated user, will go to memcache, our primary cache on d.o, and then Squid will be primed again. No need to execute the entire page in Drupal again.

As for the second point, I am not sure this will be an issue. Most large sites use reverse proxies and that does not cause issues for people. I tried two different browsers and was logged in in one of them and logged off in the other, and things worked fine.

Testing this on d.o is what we need to do before it goes into core D7. We will find out for sure if there are issues this way.

c960657’s picture

+  if (variable_get('reverse_proxy', 0)) {
+    // Optimize the headers for anonymous users when we have a reverse proxy such as Squid
+    $lifetime = variable_get('cache_lifetime', 300);
+    header('Expires: ' . gmdate('D, d M Y H:i:s', $cache->created + $lifetime) . ' GMT');
+    $age = abs($_SERVER['REQUEST_TIME'] - $cache->created);
+    header('Cache-Control: public, max-age=' . $age);

AFAICT this will not work for users behind yet another proxy, i.e. not only the reverse proxy but also e.g. a company firewall.

If an anonymous user loads the page, it will be cached in the company firewall. If the same user then logs in and requests the same page again, the company firewall doesn't know that it need to fetch the page but just returns the old copy (unless the reverse proxy is configured to change the Cache-Control header into must-revalidate or similar).

An alternative way of solving the problem is adding a Vary: Cookie header. This solves the problem not only for reverse proxies, but for all proxies that may sit between the user and the web server (in general, cache-friendly headers are useful to both reverse proxies and "regular" proxies). However, in order to make anonymous users get cached pages from the proxy, you need to make sure that they don't have different cookies set, i.e. only authenticated users should have a session cookie. I don't know if this is likely to happen in Drupal.

kbahey’s picture

The code you pointed to only gets triggered when the site is being browsed by an anonymous user, AND page caching is on AND a reverse proxy is configured.

(unless the reverse proxy is configured to change the Cache-Control header into must-revalidate or similar).

That is exactly what happens once a user becomes logged in. Drupal sends all the requests as must-revalidate, and using acl's on Squid to check the cookie, the page is not cached on the reverse proxy anymore.

Re: the vary/cookie thing (here is the section in RFC2616), the patch causes the DRUPAL_LOGGED_IN cookie to be set when someone is logged in and destroyed when someone logs off. So it seems like we just need to add that header, if I understand that correctly. We can set it when someone is logged in, and unset it if they are not. How about that?

I will change the patch and test this when I have the time, unless someone else does it.

kbahey’s picture

I am not behind a forward proxy, so can't tell if it is any different in that regard.

But, I did some testing:

First I added the Vary: Cookie header only when the user is logged in (i.e. Cache-Control is must-revalidate). I can't see any difference in this case. In fact, I noticed that the patch as it is does NOT suffer from the concern quoted in #31. The page is in the cache and the cache is just bypassed when the user is logged in (via the new DRUPAL_LOGGED_IN cookie), once the user is logged off, the first request to page is served from cache already, and there is no need to go to the origin server (Drupal).

Just after the logout (an authenticated session ends), the next request is served from the cache without a miss first to go to the origin server.

1220914980.326     12 192.168.0.2 TCP_MISS/302 616 GET http://x.example.com/logout - FIRST_UP_PARENT/192.168.0.254 text/html
1220914980.366      5 192.168.0.2 TCP_IMS_HIT/304 379 GET http://x.example.com/ - NONE/- text/html

Next, I added the Vary: Cookie to the case when user is logged off and page cache is on, and reverse proxy is on. I saw that it requires more page reloads to get the page in Squid's cache that way.

1220915282.657    518 192.168.0.2 TCP_MISS/302 616 GET http://x.example.com/logout - FIRST_UP_PARENT/192.168.0.254 text/html
1220915282.704     25 192.168.0.2 TCP_MISS/200 2397 GET http://x.example.com/ - FIRST_UP_PARENT/192.168.0.254 text/html
1220915287.693     14 192.168.0.2 TCP_IMS_HIT/304 378 GET http://x.example.com/ - NONE/- text/html

So, unless there is a change in behavior when behind a proxy, I don't see this header as adding any value. In fact, it makes us lose some performance.

c960657’s picture

That is exactly what happens once a user becomes logged in. Drupal sends all the requests as must-revalidate, and using acl's on Squid to check the cookie, the page is not cached on the reverse proxy anymore.

The problem is with pages that were cached by external proxies before the user logged in. When the user logs in, he is still served the cached pages by the external proxy, i.e. the request never reaches the web server or the reverse proxy.

Unless you add a Vary header to anonymous pageviews, the external proxy uses the URL as cache key. It doesn't know whether the user is logged in or not, so if a give cached page may be served to anonymous users, it may be served to authorized users as well.

kbahey’s picture

FileSize
6.33 KB

Unless you add a Vary header to anonymous pageviews, the external proxy uses the URL as cache key. It doesn't know whether the user is logged in or not, so if a give cached page may be served to anonymous users, it may be served to authorized users as well.

Thanks for the info.

Now, we set two cookies: one is the SESSxxxxxxxxxx and other is DRUPAL_LOGGED_IN. The first is the PHP session cookie, and the xxx part is the md5 of the $base_url, so unique to the site. The contents vary for each user.

Anyway, here is a patch with the Vary Header when reverse proxy is on and the user is anonymous. It has some effect on throughput, but still not bad at all.

Tested using concurrent users hammering the site for 1 minute, using 25 unique URLs (nodes) and the home page.

no vary header

Transactions:                   33555 hits
Response time:                  0.02 secs
Transaction rate:             558.97 trans/sec
Successful transactions:       33555
Longest transaction:            0.31
Shortest transaction:           0.01

With vary header for anonymous

Transactions:                   31524 hits
Response time:                  0.02 secs
Transaction rate:             528.13 trans/sec
Successful transactions:       31524
Longest transaction:            0.10
Shortest transaction:           0.01

Can you please test it if you are behind a corporate proxy?

c960657’s picture

I just installed Siege. Though it claims support for cookies, in my tests it didn't send the session cookie back to the server (I added the following to the bottom of index.php: file_put_contents('/tmp/foo.log', var_export(apache_request_headers(), 1) . "\n" . var_export(apache_response_headers(), 1) . "\n\n", FILE_APPEND);).

Did you manage to make Siege send cookies? If yes, how many parallel sessions (i.e. distinct cookies jars) are you using? The Vary header makes Squid maintain one copy of the page per session, so if all siege requests send the same cookie or no cookie at all, it doesn't really give a realistic picture.

kbahey’s picture

I used 10 concurrent users for the tests, and I blogged about it at: increasing Drupal's speed Squid caching reverse proxy.

Yes, siege says it supports cookies. But as you just found out, it does not.

I used this in index.php: print ip_address() . ':' . session_name() . ':' . var_export($_COOKIE, TRUE);

This is from Firefox, when logged in:

192.168.0.222:SESSedbd0a768e12356a6a5255dcc9b09ea7:array ( 'SESSedbd0a768e12356a6a5255dcc9b09ea7' => '75b1ee123817ae876877f584c355bdf7', 'DRUPAL_LOGGED_IN' => 'Y', )

And then when I log off, the other cookie goes away and the session cookie changes the value:

192.168.0.222:SESSedbd0a768e12356a6a5255dcc9b09ea7:array ( 'SESSedbd0a768e12356a6a5255dcc9b09ea7' => 'a5d23d671fc2db800cf53d22e98e6ba8', )

But from siege, I get:
192.168.0.111:SESSedbd0a768e12356a6a5255dcc9b09ea7:array ()

Bad siege, no cookie!

But even if this works, one copy of the page per cookie beats the idea of generating the same page and caching it for anonymous users. Anonymous users will each have a different cookie value, and that will cause Drupal to serve the page from its own cache for each anonymous user.

I guess we need to rethink alternatives to the Vary: Cookie idea. Darn corporate proxies ...

c960657’s picture

Anonymous users will each have a different cookie value, [...]

I am not sure this is necessary. Unless you write to $_SESSION in anonymous sessions, I don't think you need to set cookies for anonymous users. But I don't have much experience with PHP sessions or knowledge about Drupal's session handling, so I don't know whether this is feasible.

kbahey’s picture

FileSize
6.3 KB

But the fact is, we do maintain sessions for anonymous users in Drupal using separate cookie values.

The reason is to allow for things like filling forms, purchasing from a store (cart), and the like.

Go ahead and try it. Add this line print_r($_COOKIE[session_name()]); at the end of index.php, make sure the page cache is disabled, and then visit the site from 2 different browsers (e.g. Opera and Firefox), and watch the different values at the bottom of the page.

This is a design thing in Drupal and we can't just rip it out now.

Anyway, I am attached a rerolled patch that does not use the Vary header anymore.

c960657’s picture

But the fact is, we do maintain sessions for anonymous users in Drupal using separate cookie values. The reason is to allow for things like filling forms, purchasing from a store (cart), and the like.

When an anonymous user has put something in a shopping cart, $_SESSION is no longer empty. I am not suggesting that Drupal should refrain from setting the session cookie for anonymous users - only as long as $_SESSION is empty. Do you think that will work?

BTW what prevents a page from being cached if it contains e.g. the contents of the user's shopping cart? Do you have an example of a shopping cart module or similar that can be used for testing anonymous sessions?

kbahey’s picture

FileSize
5.17 KB

This is a Drupal 6 version of the patch in #42, for those who want to try it on live sites already in production.

Please report back how it worked for you.

Status: Needs review » Needs work

Patch failed to apply. More information can be found at http://testing.drupal.org/node/14272. If you need help with creating patches please look at http://drupal.org/patch/create

kbahey’s picture

Version: 7.x-dev » 6.x-dev
FileSize
5.17 KB

Setting correct version, so test bot does not balk.

kbahey’s picture

Version: 6.x-dev » 7.x-dev
FileSize
6.3 KB

Back to 7.x with the correct patch for that branch.

kbahey’s picture

Status: Needs work » Needs review
FileSize
4.68 KB

Reroll for latest 7.x, another patch caused conflicts. Now applies cleanly.

Status: Needs review » Needs work

Patch failed to apply. More information can be found at http://testing.drupal.org/node/14438. If you need help with creating patches please look at http://drupal.org/patch/create

Patch failed to apply. More information can be found at http://testing.drupal.org/node/14437. If you need help with creating patches please look at http://drupal.org/patch/create

Patch failed to apply. More information can be found at http://testing.drupal.org/node/14437. If you need help with creating patches please look at http://drupal.org/patch/create

kbahey’s picture

Status: Needs work » Needs review

OK, bot, don't get mad. Here it is with the missing files ...

Hope you are satisfied now ...

kbahey’s picture

FileSize
6.32 KB
Robin Monks’s picture

*Robin gets out his can of bot-b-gone

c960657’s picture

Status: Needs review » Needs work

This is rather difficult to test. The reverse proxy doesn't alter the request or response headers, so if I enable the reverse_proxy setting and access the webserver through a regular forward proxy (but without a reverse proxy), everything should still work, right?

+ *   acl cookie_logged_in_set rep_header Set-Cookie DRUPAL_LOGGED_IN=Y
+ *   cache deny cookie_logged_in
+ *   acl cookie_logged_in req_header Cookie DRUPAL_LOGGED_IN=Y
+ *   cache deny cookie_logged_in_set

The "acl cookie_logged_in" line should precede the "cache deny cookie_logged_in" line, otherwise Squid barfs on start-up.

+ header('Cache-Control: public, max-age=' . $age);
Instead of $age I believe you should use max(0, $lifetime - $age). $age is the current age, but max-age is used to specify the maximum allowed age before the cache is considered stale.

The current patch uses different code paths whether or not a reverse proxy is used. For instance, If-Modified-Since is only supported when not using a reverse proxy, though a reverse proxy may benefit from this as well. In general, if you send out cache-friendly headers, they will help both reverse and forward proxies. For instance, if we send out Cache-Control: public when using a reverse proxy, we might as well send the same header when not using a reverse proxy.

But the fact is, we do maintain sessions for anonymous users in Drupal using separate cookie values. [...] This is a design thing in Drupal and we can't just rip it out now.

Well, we could try :-) But this is the subject of #201122: Drupal should support disabling anonymous sessions.

I haven't given it too much thought, but I think we either have to send Vary: Cookie or make the reverse proxy strip some of the response headers so that other proxies (that are unaware of the DRUPAL_LOGGED_IN semantics) don't cache too much.

kbahey’s picture

Status: Needs work » Needs review
FileSize
6.43 KB

Thank you for review. I have changed the code and added working examples to settings.php.

Here is the patch.

c960657’s picture

I have found a way to reproduce the problem that I described in comments #34 and #37. You need access to a regular (forward) proxy.

  1. Enable $conf['reverse_proxy'].
  2. Log in on your site using Firefox.
  3. Enable caching in Normal mode, set the minimum cache lifetime to e.g. 1 hour and clear Drupal's cache.
  4. Enable Firebug or LiveHTTPHeaders, so that you can watch the HTTP request headers sent from your browser.
  5. Follow a link to the front page (i.e. do not load the front page using the browser's Reload button).
  6. Copy the request headers to a text editor - they should look something like this:
    GET / HTTP/1.1
    Host: example.org
    User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Language: da,en;q=0.7,en-us;q=0.3
    Accept-Encoding: gzip,deflate
    Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
    Keep-Alive: 15
    Connection: keep-alive
    Referer: http://example.org/
    Cookie: SESS0f09ac24ff1447c1bedb7e0bf0a93252=e3ca6c523775e8592401b4718fb7e636; has_js=1
    
  7. Using a telnet client, connect to a regular (forward) proxy server and send the request. Notice that the response is an HTML page without login fields, i.e. this is an authorized page view.
  8. Remove the Cookie line, and send the request again. Notice that the response is now an HTML page with login fields, i.e. this is an anonymous page view.
  9. Send the request (without Cookie) again and notice that the response includes a Cache-Control: public header, i.e. this is a hit from Drupal's cache.
  10. Send the request (without Cookie) again and notice that Squid (if that is what you are using) indicates, that this is a hit from Squid's cache (my Squid sends an X-Cache: HIT header).
  11. Send the request again, this time with the original Cookie line. Notice that the response is still a hit from Squid's cache and that the HTML page has login fields, i.e. we get a cached page, even though we sent a cookie belonging to an authorized user.
  12. Send the request again, but this time add an additional Cache-Control: no-cache. Notice that the response is an HTML page without login fields. This step is just to verify that the Cookie is still valid, i.e. that the previous request should have returned an authorized page as well.
kbahey’s picture

I don't see the DRUPAL_LOGGED_IN=Y cookie. This means that you are not using the patch, or something else is wrong.

Also, I have found that clearing Squid's cache, restarting it, and monitoring the access.log that Squid generates is very helpful in tracking what is happening, so please provide those.

c960657’s picture

Sorry, I must have messed up something when writing up the recipe. But I just ran the tests again twice, and it is just the example headers in step 6 that are wrong. The problem still occcurs when Cookie is e.g. DRUPAL_LOGGED_IN=Y; SESS0f09ac24ff1447c1bedb7e0bf0a93252=1b3e96500ffb2e6bf42e96fec0c614d2; has_js=1.

markus_petrux’s picture

I might be wrong but if it is required for the proxy cache to know about the DRUPAL_LOGGED_IN cookie, this may affect other proxy caches that may live in the middle, between the user and the site's reverse proxy.

ie. when first anonymous user requests frontpage on site with enabled reverse proxy, response will include max-age headers and DRUPAL_LOGGED_IN, but if a proxy cache in the middle knows nothing about DRUPAL_LOGGED_IN cookie and the response does not include Vary header, then it will cache the page for every other user visiting the site behind said proxy cache in the middle ¿?

kbahey’s picture

@markus_pertrux

Can the Vary specify a specific cookie? The issue is that we have two cookies, one is the session cookie, and its name will vary from site to site, and the its contents will vary from one user to the other. The other cookie, DRUPAL_LOGGED_IN, is constant, and we can use it to disable caching by downstream proxies. But how do I specify only that cookie?

markus_petrux’s picture

I'm afraid it is not possible. AFAICT, you cannot tell the user agent to include a particular HTTP header that can be configured server side. I think the only one is the Cookie header. So that is what you have to use in Vary header as a whole.

References:
13.6 Caching Negotiated Responses / 14.44 Vary
http://www.ietf.org/rfc/rfc2616.txt (HTTP 1.1 Specs)

An additional consideration regarding the Vary header is that it might be required to negotiate with user agents different cached versions when content is Gzipped (Accept-Encoding). There may be buggy user agents out there that may display garbage if you Gzip responses because they don't send Accept-Encoding header properly, and you may wish to include User-agent header in Vary header, if that happens to a significant part of your user base. Example:

http://httpd.apache.org/docs/2.0/mod/mod_deflate.html

kbahey’s picture

@markus_petrux

Here is an idea. Instead of Vary-ing by cookie, we can vary by a header.

So, if we serve the page with X-Drupal-Logged-In instead of setting a cookie, that should work with the reverse proxy as well as any forward proxies in the way. It should also eliminate the need for the cookie and changes to Squid's reverse proxy as well.

Thoughts?

markus_petrux’s picture

The problem is that you need to tell the user agent to return back the "X-Drupal-Logged-In" so the proxy can see if something in the cache matches what the user agent is actually requesting.

AFAIK, the only header that the server can tell the user agent to return back is the Cookie. And the HTTP Specs do not support substrings in Vary header fields. ie. Vary header fields ought to be headers the user agent sends, mayching as a whole, not partially. :(

kbahey’s picture

Status: Needs review » Needs work

Setting this to code needs work, since we cannot control how many proxies are in the way, and we have no way to handle them cleanly so far.

markus_petrux’s picture

One possible method to achieve this would be to include Cookie in Vary header, but it may only worth if most of the visits are anonymous AND cookies are not sent for them, ie. no session for anonymous visits by default, unless absolutely necessary.

But proxy-caches will get a lot of dups since every logged in user will have it's own version of each page cached (this affects all proxies in the way) because of cookies, specially session cookies.

If the site does not have logged in users, then it would be reasonable to use this kind of headers optimizations. That could be implemented with a settings.php option, maybe.

kbahey’s picture

But proxy-caches will get a lot of dups since every logged in user will have it's own version of each page cached (this affects all proxies in the way) because of cookies, specially session cookies.

That is unrealistic, since it defeats the whole point of having a reverse proxy: not having to go back to the origin server so much and keeping one copy of a page and serving it FAST to many users, saving resources.

markus_petrux’s picture

If you want to Gzip pages, compression can be done by web server or proxy server. If compression is done by the proxy, then it will spend a lot of time compressing pages for each single request, so it might be more efficient to perform compression on the web server. In that case, the proxy will have to keep several versions of the same page, one for each combination of requested URL + client Accept-Encoding headers.

When compression is done by the proxy, it needs less cache storage, but it costs more CPU, and it may not perform as better as compression implemented at web server level.

So it depends...

mikl’s picture

Concerning Siege and Cookies – I had no problem when I used something like this:

siege -H "Cookie: SESSbd24548940bff1baaedcaaaaab2ac30=aaaaaaaaaa156a24992b5becea78fbb532e;" -c 50 "http://dpekspres.dk" -b -t30s

The headers going out of Drupal-issue is a hard nut to crack. Is it possible that we could send out a special header – X-Drupal-Cache – for Varnish/Squid/whatever that would override the standard expires header? That would keep us out of external proxies and then we could have a rule like:

if (req.http.Cookie ~ "drupal_logged_in") { pass; }

I think that's the most sensible approach, if it's possible, because as I see it, there's no way to avoid the external proxy mess without having
"Expires: Sun, 19 Nov 1978 05:00:00 GMT" or something similar.

I'm going to look into it on the Varnish-side, but if you have any good ideas, I'd love to hear them :)

http://varnish.projects.linpro.no/wiki/VCLExampleLongerCaching

kbahey’s picture

The -H argument will cause all users in siege to be logged in as the same user, not as different users, can cause excessive contention and does not simulate real life loads.

If we use a X-Drupal-Cache or something similar, we need to come up with rules that would handle it in Squid as well.

tomchuk’s picture

It seems to me that the solution may not be something that can be accomplished from within Drupal, but rather through overwriting headers at the reverse proxy. The patch already assumes that the reverse proxy needs to be configured to pass through on the presence of a cookie, why not also require that Cache-Control and Expires headers be replaced at the reverse proxy with what Drupal has been sending all along?

If the reverse proxy strips the Cache-Control and Expires headers that Drupal sends (with this patch) and sends Cache-Control: no-cache no-store and Expires: Sun, 19 Nov 1978 05:00:00 GMT, forward proxies between the client and reverse proxy will always hit the reverse proxy on every request.

I'm currently using this with Akamai Site Delivery in front of a Drupal 5 site with great success. I've got the ttl determined by path and file extension (in Akamai config), am sending a cookie to pass through logged in users and am sending no-cache with all html pages sent from Akamai.

One thing that also needs to be considered is pages that vary for anonymous users - One drupal_set_message served to an anonymous user and things start to get a little funny. In my index.php I've got the following after the call to menu_execute_active_handler():

global $user;
if (!$user->uid && count(drupal_get_messages(NULL, FALSE)) != 0){
  header("Expires: Sun, 19 Nov 1978 05:00:00 GMT");
  header("Cache-Control: no-cache, no-store");
}

Barring any displayed errors generated in theme_page(), this makes sure that the reverse proxy (Akamai in my case) does not cache the page with the message. I'm not sure how applicable this is to D7, but something to consider.

One final problem (in D5 at least): The ttl for images in my files directory is 7 days. Because the cache key is the URL, and the image file name doesn't change when a user changes their picture, there were all sorts of issues. I solved by issuing a SOAP request to Akamai in hook_user to flush that file (and it's imagecached versions), but that solution is very specific to my situation.

kbahey’s picture

A generic solution using standard cache headers is preferred, because it means it would work with many proxies and not just one.

However, so far we have not been able to come with such a solution, due to the propagation of headers to downstream proxies.

Making the reverse proxy strip the cache headers is a good idea, but it means that there has to be a solution for each specific proxy (one for Squid, one for Varnish, one for Akamai, ...etc.).

Drupal.org is using Squid, and if you can come up with the directives to strip those headers, it will help a lot.

tomchuk’s picture

I'm not a squid expert by any means. But the following may work...

If you're doing clean urls by rewriting with squid:

acl dynamic_content urlpath_regex index\.php$
header_access Cache-Control deny dynamic_content
header_replace Cache-Control no-store, no-cache

A little trickier if you're doing the rewriting on the origin:

# some sort of regex to match where static content lives
acl static_content urlpath_regex -i ^/(files|includes|misc|modules|profiles|scripts|sites|themes)

# Default - make the client/forward proxy hit squid every time
header_access Cache-Control deny all
header_replace Cache-Control must-revalidate no-cache no-store

# Let the client/forward proxy static content for a day
header_access Cache-Control deny static_content
header_replace Cache-Control public, max-age=86400

Again, I'm a total Squid noob, and am doing this from the documentation, chances are it's not going to work ;)

kbahey’s picture

I am trying with this configuration:

# Drupal specific stuff
acl admin urlpath_regex ^/admin
cache deny admin

# some sort of regex to match where static content lives
acl static_content urlpath_regex -i ^/(files|includes|misc|modules|profiles|scripts|sites|themes)

# Default - make the client/forward proxy hit squid every time
header_access Cache-Control deny all
header_replace Cache-Control must-revalidate no-cache no-store

# Let the client/forward proxy static content for a day
header_access Cache-Control deny static_content
header_replace Cache-Control public, max-age=86400

But the origin server gets hit (so it is a miss for Squid), and here are the headers send when I am not logged in and hit a regular URL (/user/register in this case). Note the duplicate Cache header.

  HTTP/1.0 200 OK
  Date: Sun, 19 Oct 2008 02:46:59 GMT
  Server: Apache
  X-Powered-By: PHP/5.2.4-2ubuntu5.3
  Set-Cookie: SESS06cf1d6bed47474747efb410c408b0a73=89bedcdd8a3e16917d5ffb4eed615b63; expires=Tue, 11 Nov 2008 06:20:19 GMT; path=/; domain=.blah.example.com
  Expires: Sun, 19 Nov 1978 05:00:00 GMT
  Last-Modified: Sun, 19 Oct 2008 02:46:59 GMT
  Cache-Control: public, max-age=86400
  Cache-Control: public, max-age=86400 <<<<< =====NOTE THIS DUPLICATE
  X-Generator: Drupal 7 (http://drupal.org)
  Content-Length: 5632
  Content-Type: text/html; charset=utf-8
  X-Cache: MISS from localhost <<<<< ===== AND THE MISS HERE
  X-Cache-Lookup: MISS from localhost:80
  Via: 1.0 localhost:80 (squid/2.6.STABLE18)
  Connection: keep-alive

I will try any other suggestions you have ...

tomchuk’s picture

Ugh, sorry to waste your time, as I said, total squid noob. I've been playing with a set of virtual machines trying to get this to work with no luck - host -> vm1 squid forward proxy -> vm2 squid reverse proxy -> vm3 apache. I've done this on Akamai, done it on Varnish, but no luck on Squid. Maybe someone with a little more experience with Squid can lend a hand.

mikl’s picture

Hey tomchuk, if you have managed to get it working on Varnish, I'd love to see your VCL for it. I'm finding it a bit difficult to figure out :)

robinbowes’s picture

Khalid,

I arrived here via your 2bits.com article: http://2bits.com/articles/increasing-drupals-speed-squid-caching-reverse...

What's the current status of this issue?

R.

kbahey’s picture

The current status is that the patch does work, and does speed up things and reduce server load. However, it affects any mid stream proxies (e.g. corporate or ISP caching proxies).

So, we cannot find a way to say "cache for anonymous only, but not logged in users" in a generic enough way that would not affect those proxies in the middle.

If there are any ideas, hints, please let us know.

Swampcritter’s picture

Is it possible to get this patch to work under D5 as well?

kbahey’s picture

It is more work. It requires another feature which is only in Drupal 6.x and above (managing IP addresses for a reverse proxy).

Not impossible, but more work.

mfb’s picture

Subscribing because I very much like the title of this issue, though I'd prefer just not setting a session cookie when it's not needed, if possible (#201122: Drupal should support disabling anonymous sessions).

sdboyer’s picture

Subscribe...at chx's suggestion, I've been redoing drupal's session handling in OO (#335411: Switch to Symfony2-based session handling) and though nothing about that helps the specific problems here, it's becoming increasingly clear that a layer of abstraction which keeps us from having to load bootstrap.inc down with conditionals wouldn't be a bad thing at all. So I'll keep my eye out for this, and maybe try to find time to throw another set of eyes at squid configurations.

klavs’s picture

Title: Implement better cache headers for reverse proxies » Suggestion for more optimal solution for working with reverse proxies

Hi guys,

I've read this thread with great interest and wanted to add my suggestion to a more optimal solution, which exploits that the reverse proxies is at the control of the site-owner.
The suggested solution, is something I helped design and implement at several large sites in Denmark - after 9/11. It is the solution which delivers the best performance.

I hope you like it:
http://drupal.org/node/346347

kbahey’s picture

Title: Suggestion for more optimal solution for working with reverse proxies » Implement better cache headers for reverse proxies

Don't change issue titles without a good reason.

klavs’s picture

My bad - I didn't realize it wasn't just a comment subject. I'll wake up now :)

klavs’s picture

And here's a talk on OSCOM - how to deliver dynamic content via ESI
http://oscom.org/events/oscom4/proposals/esi.html

Upayavira’s picture

Maybe I'm being simple here, and someone can correct me, but for my usecase, I implemented a simple module that is working a treat, let's call it reverseproxy for the sake of argument:

function reverseproxy_init() {
  global $user;
  if (!$user->uid) {
    $cache_lifetime = variable_get('cache_lifetime', 5*60);
    drupal_set_header("Expires: " . gmdate("D, d M Y H:i:s", time() + $cache_lifetime) . " GMT");
    drupal_set_header("Cache-Control: public, max-age=" . $cache_lifetime);
  }
}

At present, it will cause anonymous pages to be cached, and logged in users' pages to not be cached. I'm sure there are improvements to that logic for anonymous pages viewed by logged in users, etc, but it seems to me that the required effect can be achieved with a simple module, rather than a nasty patch to bootstrap.inc.

What am I missing?

kbahey’s picture

That approach works and has the advantage of not having to patch core.

But, it still does not solve the issue of intermediate proxies (e.g. corporate firewalls, ISP caching proxies, ...etc.)

If an intermediate proxy caches the page and then someone tries to log in, they may get served a page from that proxy's own cache, and the user may not even be able to login.

It also suffers in that your own proxy cannot differentiate between logged in and not logged in users as well.

These are the hard issues to solve and stalling this patch.

David Strauss’s picture

Can you change the cookie names to something like "LOGGED_IN" instead of "DRUPAL_LOGGED_IN". Drupal is a registered trademark, and we should avoid naming functions, identifiers, etc. after it. We should follow the Firefox model and keep branding at arm's length from functionality.

kbahey’s picture

Naming functions "Drupal" is not an issue at all. We had this discussion before, and there is no branding issue. After all inside Drupal itself there are a dozen functions named drupal_something(). In this case it describes what the cookie is.

This patch does work fine, but we have the intermediate proxy issue mentioned above that needs to be solved. Can people focus their effort on getting this solved? If it does get solved, then we have a chance of pushing this in core for 7.x and everyone benefits.

David Strauss’s picture

After all inside Drupal itself there are a dozen functions named drupal_something()..

Most of those happened before Dries took an active stance on registering and defending his Drupal trademark. The existence of existing use in the code base is therefore not a good reason to continue expanding the practice. Encumbering free, open-source software with trademarks used in the functional operation of the code is generally a bad thing.

Dries has said:

I'd expect that trademark law wins from a license (it is a law after all), and that I'd be able to force them to rename their functions.

The GPL v3 also maintains:

Notwithstanding any other provision of this License, for material you add to a covered work, you may (if authorized by the copyright holders of that material) supplement the terms of this License with terms: [...] Declining to grant rights under trademark law for use of some trade names, trademarks, or service marks

Drupal isn't in the spirit of being free software if you have to remove the trademarks in ways that break the software and other software interfacing with yours in order to use the right to fork. While we could spend lots of time with lawyers determining that how it's OK (or having the lawyers write a trademark policy such that it is OK) to use "Drupal" in functional aspects without explicit trademark licensing, I would rather stop using "Drupal" in functional code and avoid the problem entirely.

kbahey’s picture

The patch here is intended for/destined to core, so having a cookie or function name Drupal is OK, simply because core uses it. See Moshe's comment here.

Also, Drupal is GPL v2+, not GPL v3 yet. So the section you mention is not relevant.

I don't see a need to rename it so far.

If you want to rename it for a derivative product (e.g. Pressflow), then feel free to do so.

Again, let us focus on the intermediate proxy issue and get it working, then we can quibble about renaming it BLORT_LOGIN or something.

Dries’s picture

I'll discuss this with my lawyer. We're actively working on the trademark policy so I've pointed him to this thread. Meanwhile, I suggest we don't worry about it too much and that we proceed as normal. My personal take is that using the word 'Drupal' in Drupal code shouldn't be a problem but we'll check that with reality (i.e. the law).

I'm pretty sure that they use the word 'Linux' in the Linux code base, that they use the word 'Mozilla' in the Mozilla code base, that they use the word 'Debian' in the Debian code base, and so on. In other words, I think we might be looking for problems where no problem exists. We're doing what every other Open Source project on the planet is doing.

netaustin’s picture

FileSize
3.16 KB

We can't solve this problem for everyone, so let's just make it easier for developers to solve it for themselves without hacking core.

Perhaps implementing reverse proxy awareness directly in Drupal isn't the best approach. What if we were to simply allow modules to interfere with caching expiration time via a simple new hook. This would take the hard problem--attempting to deal with all proxies at once--out of Drupal's hands and put it into the hands of people who are writing modules to interact with caching proxies.

I don't think the "who knows which proxies we're facing" problem can be solved, or even should be solved in Drupal core. I think that problem should be left to individual webmasters, who have much more power over the situation. For example--a reasonable solution is to direct all authenticated traffic to an alternate domain. I do this by redirecting user logins, which *always* skip proxies because they use POST rather than GET. This has the added benefit of taking the proxy completely out of the picture for authenticated users. It also raises an additional SEO problem of making all content theoretically available on two domains, though you can solve that with a rewrite rule that reads a similar cookie to DRUPAL_LOGGED_IN, and if it's not set, permanently redirects users to the canonical, unauthenticated URL. This also requires changing the cookie url in settings.php.

Sounds complicated, right? Well, that's the point. There may be simpler solutions than what I've done, but this topic is too specialized and too complicated a problem for Drupal Core to try to solve, and most *good* solutions will involve more than just Drupal Core by itself. And even if this is built in, you would still need to enable mod_expires and edit .htaccess to allow your reverse proxy to cache static assets too.

What's more, a robust caching solution would require a module that can talk directly to the CDN to purge content when it changes.

Additionally, I think if we're going to allow control over the lifetime for all pages, we should allow it for each individual page. This way, your site could have lots of content at / and a store at /store, and everything under / could be cached, except /store, which can't.

As to the cookie situation: Remember that the client can get that cookie by interacting through the proxy via POST or through pages or AJAX requests that aren't cached. So a session cookie can still be meaningful, and it would be very problematic for the cache to serve a cached session cookie (welcome to someone else's shopping cart).

So I'm opposed to Vary: Cookie, I think that the "bypass cache" method should be left up to the developer who knows the whole story, and I think that the lifetime should be controllable per-url, not per-site.

So the attached patch (based on kbahey's but without the Vary details) adds a new hook: hook_expires. It expects a unix timestamp as the return value, and the lowest number wins. The scenario I'm imagining is a CDN module with an aggressive caching strategy and a custom helper module which overrides the CDN module for specific paths.

This is my first foray into Drupal Core, so apologies in advance if I did anything wrong.

netaustin’s picture

FileSize
3.2 KB

Use this patch instead. Fixes a comment styling issue and an inaccuracy.

seo services’s picture

yes, you are right, we can not solve this problem, because problem is in code. I also try to solve this in subdirectory of website http://www.aeesoft.com but could not solve. waiting for a good solver......

netaustin’s picture

FileSize
3.92 KB

Created with diff -up per webchick's request.

netaustin’s picture

FileSize
3.95 KB

Response to webchick's review. Use this.

moshe weitzman’s picture

`function_exists('module_implements')` - i think you want to use if (!defined('MAINTENANCE_MODE')). thats used all over core now.

can you rename to hook_http_expires(). we have far too many caches that could be confused here.

i have no opinion on how solvable this can and should be in core. this patch has stalled though, so maybe a fresh approach is needed.

netaustin’s picture

Status: Needs work » Needs review
FileSize
6.83 KB

Here's an updated patch which addresses the following:

  • Moshe's request to rename the hook to hook_http_expires.
  • Moshe's request to not use function_exists().
  • webchick's request to provide unit tests for this patch.

I've added two assertions to BootstrapPageCacheTestCase::testPageCache() in bootstrap.test and a very short example of hook_http_expires to system_test.module. All bootstrap tests pass for me. I ran the other tests, and though there were some failures, they occured with or without this patch installed.

markus_petrux’s picture

I like this new approach to allow external modules deal with this. ie. allow site administrator provide the solution that works for them.

Though, I would suggest to allow the external module generate any header. For example, I may need to use s-maxage, or any other thing my proxy needs.

Aside, AFAICT, when using max-age, expires is not necessary.

netaustin’s picture

I think you are right that max-age renders expires unnecessary. But some proxies and CDNs need one or the other specifically--this is an area of HTTP where strange behavior is ubiquitous. This article suggests that it's best to use both if you can, or just max-age if you can't. I'd prefer to keep both in, though if we're going to get rid of one, it should certainly be Expires.

I think that for every other HTTP header, drupal_set_header() will still suffice. But expiration needs some more help, because you could potentially have lots of different modules interested in it. It's also far, far simpler to allow a module to return "seconds to live" than force it to calculate and serve the correct Expires and max-age header properly, and we need to do some math against the age of the cache anyways.

markus_petrux’s picture

Though, the HTTP/1.1 spec says "The max-age directive takes priority over Expires". See "13.2.4 Expiration Calculations" in http://www.ietf.org/rfc/rfc2616.txt

Maybe using both could even fool certains reverse cache engines.

The hook could be passed a little more information about the page so the implementation could decide which headers to generate. The computations of max-age just need to be coded once, within the hook or by Drupal core alternative, so no matter how complex it is, IMHO, it should be done by one or the other.

markus_petrux’s picture

Related thread "[squid-users] Expires: vs. Cache-Control: max-age":

http://www.mail-archive.com/squid-users@squid-cache.org/msg58850.html

netaustin’s picture

Right. If a consensus emerges, I would be altogether in favor of eliminating the Expires header altogether, though I would want to eliminate it the old Expires header too.

Any information that a module needs to determine caching, they can get using other functions (arg() and user_access() might be useful, or a module could store max-age on a per-node basis). The only information that a module can't get from the environment is whether the hook was called in the "serving from cache" state or the "serving fresh" state, e.g. drupal_page_headers() or drupal_page_cache_headers(), so that's what I've passed this hook.

markus_petrux’s picture

Here's an interesting tutorial that shows one possible way to deal with what kbahey wanted to do with DRUPAL_LOGGED_IN cookie and squid:

http://www.howtoforge.com/how-to-set-up-a-caching-reverse-proxy-with-squ...

This is an example where we may need to send s-maxage. But there may be more configurations that one may need to do. For example, one thing I may need to compute the expiration time, is the date the page was cached, or maybe it would be nice to have the possibility to alter the page content. That way, one could replace pieces of the cached page that need to be more dynamic than the whole page. In fact, a hook here opens a lot of possibilities.

If there was such a hook with more information at hand, and able to alter the content of the page and the headers, then something like the following would be easier to implement:

http://groups.drupal.org/node/12951#comment-42034

I'm doing that on a D5 site.

netaustin’s picture

Sure, but that takes us back to the original battle, how to handle edge cases, for instance where there are forward proxies between the reverse proxy and the user. I contend that should be left to the developer implementing the caching solution. I'm totally open to altering the patch to be something along the lines of the following:

function my_module_headers_alter($header_data, &$content, $is_cached) {
  $content = nl2br(strip_tags($content)); // What I would like to do :-)
  $header_data['Cache-Control']['max-age'] = 86400;
  $header_data['Cache-Control']['s-maxage'] = 43200; // Precedence over s-maxage.
  $header_data['raw']['X-Written-By'] = "Austin"; // Could also be done by drupal_set_header()
  return $header_data;
}

Now we're stepping on drupal_set_header() though, which would suffice perfectly well for one-line headers without the complexity of Cache-control.

If this is what you're after, and I can get a few more people to agree that a more flexible hook is the desired solution, I will prepare a patch, alter the unit tests for it, and submit it again.

I'll also factor out the Expires header--you're right, with max-age=0, it does not need to be there.

One thing, though--I would rather use the most restrictive return value for max_age than the value of the last module that ran if that's amenable to you. Chances are, 98% of Drupal sites will not implement hook_headers_alter, and 98% of those that do implement hook_headers_alter will only implement it in one module. But still!

markus_petrux’s picture

Sweet. That's more or less what I had in mind. A couple of things here:

1) If you wanted to compute max-age the way you did in the patch you would probably need the date the page was cached. So the hook would need an additional paramenter for that, or maybe the whole $cache object.

2) Since this may become a function that can only be implemented by a single module. Maybe the function name could be configured in settings.php, so it can be changed at will. For example, this is how you can define a custom cache engine, by providing a calue for $conf['cache_inc'] in settings.php (another example is $conf['page_cache_fastpath']). There could be a $conf['page_cache_headers'] thing or whatever the name fits better as far as it allows site admins alter the headers and/or content of pages. That way we don't have the overhead of using module_implements() compared to the call to a single function (the one provided by Drupal, or a custom one).

Dries’s picture

Please don't commit this patch yet. Want to read up on this a bit. Thanks. Keep working on it though.

webchick’s picture

No worries from me. A fair portion of this is currently zooming way over my head. ;)

Just giving netaustin some tips on how to make patches the Drupal Way (tm). :)

kbahey’s picture

Status: Needs review » Needs work

I very much support the approach in #107. This makes it possible to add things or remove things.

We need it to be able to totally override headers from core for it to be useful.

It will not solve the intermediate proxy issues, but once this avenue opens up, more experimentation and adventures will happen, and soon we will see innovation in contrib that may solve this issue.

So, I guess it is patch needs work.

netaustin’s picture

Awesome. I will have the updated patch uploaded by the end of business tomorrow, EST (GMT 2200 or so).

kbahey’s picture

netaustin

I recommend that you create a new issue for the "contrib-modifiable" headers, since the followup emails on this issue are too long, and your patch will not solve creverse proxy caching on its own, but pave the way for it.

So, go ahead and create a new issue and provide the link here, then we can start fresh in your new issue.

c960657’s picture

FileSize
7.66 KB

This patch implements the Vary approach that was discussed earlier. This was made possible with the landing of #201122: Drupal should support disabling anonymous sessions.

The intention behind the patch is to make the default headers more cache-friendly, but in a way that is transparent to the user, i.e. users shouldn't have to hit CTRL+F5 in order to get fresh pages (at least not if their browser and proxy server are configured in compliance with HTTP/1.1).

For now an s-maxage Cache-Control directive is added based on cache_lifetime. Perhaps this should be a setting of its own.

I have done some testing with Squid. I learned that ETags are a rather delicate matter. For instance, if you use page_compression, you should send different ETags to clients that do and do not support compression. The ETag should also change if the page changes due to language negotiation (Accept-Language), but I didn't implement that, because Drupal will soon drop the support for serving multiple languages on the same URL using language negotiation (see #339958: Cached pages returned in wrong language when browser language used).

The patch breaks some tests that are based on the assumption that the precense on an ETag header is an indication that the page was served from Drupal's cache, but this assumption no longer holds (i.e. this is a problem with the existing test, not with this patch). Also, the patch should be accompanied by tests, but I wont write those before there is general consensus that this approach is sensible.

I support the idea of allowing the headers to be overwritten using hooks. However, the default set of headers should be as cache-friendly as possible. So I don't think the two ideas are conflicting.

c960657’s picture

Status: Needs work » Needs review
FileSize
14.79 KB

Now with updated tests.

andreiashu’s picture

subscribing

Dries’s picture

This patch looks sane at first glance.

- Did you do some benchmarking without Squid (and with and without this patch)? While the tests succeed, I want to make sure that we didn't broke performance. Unlikely, I know.

- Does this warrant a CHANGELOG.txt entry? Will this result in significant performance improvements when a Proxy cache is enabled?

- The extra documentation in drupal_page_header() is very helpful. Is there an RFC or authoritative document that you could refer to from the code comments? That would help future discussions.

- I was not fully up-to-speed on the Vary header. It might be useful to reference the proper section in RFC2616 along with a one line summary. Maybe: "The Vary header is used to indicates the set of request-header fields that fully determines whether a cache is permitted to use the response to reply to a subsequent request without revalidation."

- I'm curious to learn how this behaves with HTTP/1.0 clients (vs HTTP/1.1 clients). Based on what I've seen and read, I don't think this is a showstopper but I'm not 100% confident yet.

Good work!

David Strauss’s picture

Will this result in significant performance improvements when a Proxy cache is enabled?

A single Squid server can handle thousands of requests per second. Varnish is even better. The benefit is staggering.

Dries’s picture

Status: Needs review » Needs work

@David, I understand the benefits of a reverse proxy cache. ;-) I was wondering about the impact of this particular patch; they make the cache headers better. I was just wondering how much better that was and if it is worth mentioning in the CHANGELOG.txt.

c960657’s picture

Status: Needs work » Needs review
FileSize
17.02 KB

I did some rudimentary performance testing using siege, and it didn't reveal any significant change in performance when not using a proxy.

The patch will allow for a big performance improvement for sites with mostly anonymous users, but it requires that people set up a reverse proxy, so mentioning this in CHANGELOG.txt is probably a good idea. I added a mention.

I reworked the comment for drupal_page_header(). My understanding of the issue described in the comment has changed a bit since I uploaded vary-2.patch. AFAICT the problem is actually a bug in Firefox. I added a reference to the Bugzilla bug for Firefox and to the relevant section in the RFC.

I added the suggested explanation of the Vary header.

Good point about HTTP/1.0 clients. I added an explicit Expires header in the past for all anonymous requests (authenticated requests already has this). This prevents caching in HTTP/1.0 clients (including proxies). HTTP/1.1 clients should ignore this when Cache-Control:max-age=x is specified. Squid 2.6 appears to support this properly.

Damien Tournoud’s picture

+  // Make HTTP proxies save one copy of this page for each different value of
+  // the Accept-Language request header.
+  header('Vary: Accept-Language', TRUE);

This should probably be a drupal_set_header(), right? We want this information to be cached. Also, why the $replace = TRUE?

c960657’s picture

$replace = TRUE is a bug. I'll fix that in the next reroll.

I am not sure the header() call belongs in drupal_set_header(), though. It would be best if it is only set for URLs that may do language negotiation using the Accept-Language request header. Whether this happends depends on a number of things in language_initialize(), and I'd rather not duplicate this logic in drupal_set_header(). But I don't have strong feelings about this.

Dries’s picture

I believe there is an active discussion in doing away with the multiple output languages per URL thing (i.e. each language would get its own unique URL).

markus_petrux’s picture

Isn't it possible to do this adding hooks or something similar to allow each particular site modify the headers? It looks like this was the approach before #114. :-/

c960657’s picture

FileSize
25.18 KB

@dries:
I think you are referring to #339958: Cached pages returned in wrong language when browser language used. For now I removed the changes to language.inc, because caching of pages using language negotiation is broken with or without HTTP caching. We can revisit this when either issue lands.

@markus_petrux:
Modules can override the default headers in hook_init() or in the menu callback. Cache-Control, Expires and Last-Modified can simply be replaced with new values.
Modifying Vary is a bit tricky, though. Replacing the whole header is not an option due to Vary: Accept-Encoding in drupal_page_cache_header(), and appending other fields to Vary is not really compatible with Drupal's cache, so the only thing modules can change without breaking stuff is whether Cookie is present in the Vary header or not. As a simple fix for this (without introducing callbacks or output buffering for all requests), I added support for a “magic” parameter for drupal_set_header, "no-vary-cookie". I admit it is not a very elegant solution, but I felt that adding a hook simply to allow modifying a header that generally shouldn't be modified too much was a bit overkill. Does this solution allow you to send the kind of headers you had in mind?
I removed Vary: Cookie from drupal_page_header(), because authenticated requests aren't cached anyway.
The idea of using output buffering for all requests is probably worth pursuing, though I would rather do this in a separate issue.

Dries’s picture

Before adding the magic parameter, I'd like to understand if there is a valid use-case for this. It sounds like a recipe for introducing bugs.

c960657’s picture

The use case for omitting Vary: Cookie is if you have a page that looks identical for all users, authenticated and anonymous. This could be an RSS feed, or a regular page that has no visual indication of whether a user is logged in or not (e.g. on a site where the only registered user is the webmaster). Omitting the Vary header will allow better caching down the line. Some proxies (including older versions of Squid) have limited support for Vary and are “reluctant” to cache URLs with this header set (see this). Also, if your anonymous users for some reason have other cookies set on your domain (e.g. cookies from Google Analytics and the like), and a page does not depend on cookies (including the session cookie), you will get better caching in proxies without a Vary: Cookie.

Internet Explorer has some issues with Vary. I am not too familiar with these, but here is what I could find:

When Vary is sent for files that open in external applications, e.g. if you are generating an Excel or PDF file on the fly (see this). In this case you should omit Vary and instead control caching using only Expires and Cache-Control. This only a problem with dynamically generated files – drupal_page_cache_header() is not invoked when downloading uploaded files.

Also, a Vary header completely prevents IE6 from caching pages, so IE6 users visiting the same page twice will have to download the whole page twice. I don't know whether this is acceptable? IE7 users will get the same amount of caching as in D6 AFAICT from this.

* * *

So while this patch offers better performance when used in connection with proxies, and maintains status quo for most users without a proxy, it does hurt local caching in IE6, and it may increase the possibility that module developers trigger weird IE bugs in complex scenarious (like dynamically generated content) ... :-(

markus_petrux’s picture

hmm... AFAICT, Vary header works ok with IE6.

For example, see this site. Here we use "Vary: Accept-Encoding,User-Agent", and works nicely with IE6, Firefox, Opera, and so on...

BTW, the "Vary: Cookie" will take the session cookie into account, so it will cache the same page for every single user.

In regards to #124/#125: agree that hooks here complicate things, but maybe it could be implemented in a way that the drupal provided functions can be overridden, like the cache and sessions related stuff. That makes it easier to adapt for particular site needs without patching core.

c960657’s picture

In regards to #124/#125: agree that hooks here complicate things, but maybe it could be implemented in a way that the drupal provided functions can be overridden, like the cache and sessions related stuff.

I am not sure in what you would like to achieve with this that you cannot do using drupal_set_header(). Could you give some examples?

markus_petrux’s picture

I would like to have the possibility to do things like this:

http://groups.drupal.org/node/12951#comment-42082

ie. I may need to alter not only the header, but also the logic.

Also, PHP does not offer a method to remove a HTTP header that's been set with header(). You can however use header('foo: '), but that's also sent as an empty header, which makes it look somehow hack-ish.

c960657’s picture

I'm sorry, but I'm not sure I follow you. I hope you will elaborate with some concrete examples.

The code in #12951 seems mostly related to the built-in cache, not HTTP proxy caches. HTTP proxies are somewhat limited compared to Drupal's cache. For instance, you cannot flush the proxy cache on demand. Improving Drupal's built-in cache may be relevant, but I think it is outside the scope of this bug.

The current patch does allow you to specify different HTTP headers for different types of content, e.g. Cache-Control: max-age=60 for the front page, and Cache-Control: max-age=900 for individual node pages. You can do that by calling drupal_set_header() in hook_boot(). The header is saved in the cache, though, so all requests for a specific URL get identical responses, including the headers. Is that a problem? I guess I can move things around, so that you can override the cached headers from hook_boot().

Could you give an example of a situation, where you would you like to remove a header? (BTW, PHP 5.3 has a new function, header_remove()).

markus_petrux’s picture

I would just send the headers that I really need. PHP 5.3 is not stable for production.

I can only give you an example for a site running D5. Here, I'm bypassing Drupal page cache, I use mine, and besides setting the headers I need, I'm altering some cached pages for logged in users before they are sent to the browser.

With this patch, I could alter the headers in hook_boot(), but I could not alter the page contents printed by drupal_page_cache_header(). If it was possible to define the name of the functions involved here, then I could just use my functions, and there I could just send the headers I need, and alter cached pages before sending them. That would allow me to inject different content for blocks, etc.

c960657’s picture

FileSize
55.24 KB

I have now added a way to replace or unset headers before they are sent. E.g. drupal_set_header('Cache-Control', 'public') will replace the header, and drupal_set_header('Cache-Control') will prevent the header from being added.

Allowing headers to be unset rather than just replaced required quite a lot of code. I'm not sure it's worth the added complexity, since you can usually emulate unsetting a header by replacing it with something else (e.g. unsetting Cache-Control or setting it to public has a similar effect). Opinions are welcome.

As an added bonus, the request that fills the cache now issues the same headers as the following hits that are served from the cache.

The latest edition of the patch hasn't been thoroughly tested, but comments on the approach would be appreciated.

Altering the page before returning it from the cache may be useful, but I think it is outside the scope of this issue.

Status: Needs review » Needs work

The last submitted patch failed testing.

c960657’s picture

Status: Needs work » Needs review
FileSize
54.22 KB

Reroll.

Markus, does the patch give you the wanted flexibility?

Status: Needs review » Needs work

The last submitted patch failed testing.

c960657’s picture

Status: Needs work » Needs review
FileSize
54.22 KB

Reroll.

@markus_petrux: Does this address your concerns?

If people think this patch touches too much, we can also proceed with the simpler approach in vary-4.patch.

markus_petrux’s picture

Unfortunately, at this moment I'm 100% involved in a big Drupal 6 project, so I can only browse the code here. Thanks a lot for taking my concerns into account. It's much appreciated. From the looks of it, I would say yes, it covers the need to alter the headers, and it looks good to me, but I cannot confirm how well it works because lack of time to build a test environment for D7 and test this carefully.

Dries’s picture

I like this patch a lot. Good work all.

The one thing I'm not 100% happy with is the fact that the Vary-header stuff is somewhat hidden in code. It might be OK if the default behavior results in maximum performance, but if not, many people are never going to know/learn about it. I don't think I have a solution, but it would be nice if we could expose it some more. It is certainly not a requirement -- but let's discuss it before I commit this patch. Is this something that might have to bubble up to settings.php or even the Performance configuration page? Your call ...

I think we can proceed without a deep dive from markus -- if he identifies and issue we can follow-up on it later.

c960657’s picture

Vary: Cookie is necessary if the same URL looks different depending on $user (e.g. if all pages contain a login form that is only displayed to anonymous users, or if authenticated users have Edit links displayed on node pages). On sites where all users are anonymous except admin users, it would allow for better HTTP caching if the Vary: Cookie header was omitted.

When an authenticated user hits a public page through a reverse proxy (e.g. node/123 after pressing Save on node/123/edit), he might get the cached version. If the cache allows it, he can flush the cache using CTRL+F5. This easily gets annoying, especially if he prefers to use e.g. SimpleMenu. A better option is to create a DNS alias, e.g. admin.example.org, that bypasses the reverse proxy that is used for www.example.org, or simply does not omit Vary: Cookie.

Omitting Vary: Cookie can be considered a “super-aggressive” cache mode that may give an extra performance benefit in certain specialized scenarios, but also may cause too much caching with confusing results if the site owner doesn't know what he is doing. It is probably best left as an option that can only be specified using settings.php (similar to the settings suggested in #370454: Simplify page caching). I'll look into that.

c960657’s picture

FileSize
56.4 KB

Now with a omit_vary_cookie variable that is documented in default.settings.php.

Status: Needs review » Needs work

The last submitted patch failed testing.

c960657’s picture

FileSize
56.58 KB

Reroll.

c960657’s picture

Status: Needs work » Needs review
BartVB’s picture

A suggestion; maybe it's an idea to figure out how other projects have solved this problem? Drupal is not the first that wants to interoperate properly with a reverse proxy :D

A good example would be MediaWiki, here's a nice writeup about their caching headers:

http://www.harmfrielink.nl/wiki/index.php/MediaWiki_Caching

Also they solved the problem with external proxies that cache anoymous pages which cause problems when the users logs in by sending:

Cache-Control: private, s-maxage=0, max-age=0, must-revalidate

Especially the 'private' part is clever. It also might be smart to get someone from the Squid project involved? They know quite a bit more about caching headers and (reverse) proxies than we do.

Edit: Currently trying to do too much in too little time. Overlooked the fact that this patch is already in 'review' stage... The use of ETag is pretty clever, can't is also be used to get rid of the whole 'Vary: cookie' problem? If the ETag includes the logged-in/session state the cookie doesn't really matter. This would result in much more fine grained control over the proxies in the chain.

c960657’s picture

MediaWiki also uses Vary: Cookie - this is the key part of this patch. The approach suggested in this issue is along the lines of what MediaWiki does (though I may have missed some important points). So it is good to see that we are on the right track :-)

One thing that is different in the approach described on that page is that Squid is configured to modify the Cache-Control header so that no browsers or downstream proxies may use a locally cached copy of a page without revalidating the page with the reverse proxy. I'm not sure I understand the purpose of this. If the page is allowed to be cached in the reverse proxy, why not allow it to be cached elsewhere too?

For authenticated users, Drupal sends a bunch of Cache-Control directives in drupal_page_header(), though not 'private' (this isn't changed by the patch). According to the RFC, the directives currently used in Drupal have the same effect as the use of 'private' AFAICT, but it may be (?) that the support for these directives in various browsers may differ. However, this only affects caching in the browser's local cache and is thus outside the scope of this issue. I don't understand why you think that 'private' is particularly clever. Does it offer something that isn't handled by the patch?

I don't understand your suggestion about using ETag to get rid of Vary. With Vary, anonymous users can get pages from the proxy without hitting the webserver at all. Without Vary, how is the proxy to know how to tell the difference between anonymous and authenticated users, and only send cached content to the former?

Dries’s picture

I did another deep dive on this and everything looks good. I've now spent hours reviewing this patch, and I think it does the right thing. If we need to tweak a few things going forward, this patch implements a base system with good APIs that enable further tweaking or adjusting of the actual headers. We can now make incremental changes on top of this system/patch.

I've decided to proceed and commit this patch to CVS HEAD.

I'm setting this patch to 'code needs works' so we can update the upgrade guidelines in the drupal.org handbook. Please mark this issue 'fixed' once the API changes (e.g. drupal_set_header()) have been documented.

Great work, c960657!

berenddeboer’s picture

A comment on use of the "private" parameter: you don't want that in this case. Squid reverse proxy is a shared cache and it is perfectly OK for that shared cache to cache pages for authenticated users.

With Vary: Cookie the reverse proxy will never ever cache a page for anonymous users. Every user gets their own cookie with Drupal, so every request has its own cookie and will never match an existing, otherwise perfectly valid response. I'm not sure, but I think Squid might already do Vary: Cookie by default, as cookies basically make every page unique. That's one of the problems with cookies, you can't do reverse proxy very well. And that's where HTTP authentication comes in which nicely solves that.

Damien Tournoud’s picture

@berenddeboer: we don't send cookies for anonymous users any more, except when strictly needed. This way, pages can be properly cached by any proxy server in between the user and the server.

c960657’s picture

Priority: Normal » Critical
FileSize
5.54 KB

Follow-up: The patch that was committed broke private downloads due to the new header handling - raising to critical. The problem is fixed by this patch.

The change to default.settings.php wasn't committed for some reason (I assume this wasn't intentional). This is also included in this patch.

c960657’s picture

BTW I discovered the problem using the tests for private downloads in the patch for #284899: Drupal url problem with clean urls.

Dries’s picture

Can we have a test for private downloads?

c960657’s picture

Private downloads are tested in FileDownloadTest::testPrivateFileTransfer(). The hook_file_download() implementation for this test was properly fixed by this patch that was committed. The problem is with the implementations in user.module and upload.module.

There is no test for the implementation of hook_file_download() in user.module, and in fact AFAICT the hook is never invoked, because profile pictures no longer work using private files, due to a regression from #357403: Move user picture to managed files. I will file an issue for this later.

There is also no test for the implementation of hook_file_download() in upload.module. I'll look into that.

c960657’s picture

FileSize
9.25 KB

I added a test for private files in upload.module and filed #443200: User pictures does not work with private files.

Status: Needs review » Needs work

The last submitted patch failed testing.

c960657’s picture

Status: Needs work » Needs review
FileSize
11.38 KB

The previous patch accidentally included part of the fix for #443200 but not the required changes to user.test.

This updated patch also includes a change to page_set_cache() and drupal_set_header() so they do a better job of preserving the original case of the header names.

Drupal_is_amazing’s picture

I'm a little new to Drupal, so most of the above 150+ comments are hard to grasp.

What's the status of the attempt to let drupal use reverse proxies?
Does it work on D6, or only D7? Is this something I should use on a production site?

And if so, could I ask for quick overview of how I'd go from 6.10 to a drupal site that can make use of the reverse proxy?

Many Thanks!

Dries’s picture

Status: Needs review » Needs work

Committed. Thanks. Marking to 'code needs work' until documentation is updated.

mikl’s picture

Amazing work you've done here, c960657, I look forward to playing with it some time :)

To answer #157 by "Drupal_is_amazing", the answer is that this the work that’s gone on here is only for Drupal 7. It is still possible to make Drupal 6 use reverse proxies without this patch (you can more or less just send Vary: Cookie yourself, but that is not officially supported.

So while possible, be prepared to spend a good deal of time testing and patching to make it work :)

c960657’s picture

Status: Needs work » Fixed

I added a description of the API changes to http://drupal.org/node/224333.

David Strauss’s picture

For anyone interested, I am maintaining a backport of this to Drupal 5 and Drupal 6.

kbahey’s picture

David

Like we do with other backports that will not make it to official releases, please post the patches in a new issue set to wont fix, and link to it from here.

David Strauss’s picture

@kbahey I don't intend to post patches here. All that ever results in is nagging every minor release about updating the patch. People can build a patch for the latest Drupal core by installing Bazaar and running "bzr diff --old=bzr://vcs.fourkitchens.com/drupal/6 --new=bzr://vcs.fourkitchens.com/pressflow/6-patches/reverse-proxy-support > reverse_proxy.patch" or the same with "5" instead of "6" for the respective Drupal core version.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

awm’s picture

I know I am late for the party here but is this committed to drupal 6?
And another question: does this work with varnish regardless of the varnish ttl?
And for drupal8 is this irrelevant?

klausi’s picture

Issue summary: View changes

If you struggle with the Browser logout behaviour documented on drupal_page_header() after the Drupal 7.33 update here is the follow-up issue to remove ETag and Last-Modified headers: #2381839: Changed date format for Last-Modified header breaks caching for Varnish/Nginx configs.