Before gzipping a response, this module looks at $_SERVER['HTTP_ACCEPT_ENCODING'] to check whether the requesting client actually supports gzip, per its Accept-Encoding header. This is good: it makes sure the few clients out there that don't support gzip aren't left out in the cold.

However, this logic does not account for the caching mechanisms of Drupal itself, or of CDNs Drupal may be placed behind. If a response altered by gzip_html is cacheable in the page cache, that cache entry will currently be served to all subsequent requests regardless of their Accept-Encoding headers.

That means the following scenario can play out, at the level of Drupal or of a CDN: the first client to request a not-yet-cached page supports gzip, so the response is cached in the gzipped form this module produces. A later client that does not support gzip then requests the same page and gets the gzipped response anyway.

The reverse holds true too: if the client that triggers generation of the page cache entry does not support gzip, everyone who gets the cached page afterward will get the un-gzipped version. Not as critical, but the performance/bandwidth savings that could have been gained are lost.

Comments

bvoynick created an issue.

bvoynick’s picture

At the CDN level, I think this is a pretty simple fix. Any response this module compresses should get the response header Vary: Accept-Encoding, which tells downstream caches like CDNs that there are different versions of the page depending on the Accept-Encoding request header. It can be done with one line:
$response->setVary('Accept-Encoding', false);

The Drupal side is thornier. All responses that carry Drupal's cacheability metadata and that are eligible for compression (ignoring, for this purpose, the Accept-Encoding/HTTP_ACCEPT_ENCODING condition) need a cache context added to the mix, so that clients that support gzip get a different cache entry than those that don't.

Using Drupal's built-in "headers" cache context, this does the trick:

if (method_exists($response, 'addCacheableDependency')) {
  // Vary Drupal's cache entries on the client's Accept-Encoding header.
  $cache_context = new \Drupal\Core\Cache\CacheableMetadata();
  $cache_context->addCacheContexts(['headers:Accept-Encoding']);
  $response->addCacheableDependency($cache_context);
}

BUT I believe this code introduces something of a Denial of Service vulnerability and should not be committed as-is.

The problem is that Drupal core's headers cache context is a pretty naive/trusting implementation: any value the client cares to send is respected and results in a new cache entry. An attacker trying to overload a server running the above code could build an HTTP client that sends a randomized garbage string as its Accept-Encoding header. Continually receiving novel values for that header, this module would effectively force Drupal to always go through expensive rendering instead of pulling from cache. Drupal 8's dynamic caching of individual render elements might reduce the impact of this, but assembling the final output from individual render item caches is surely still more expensive than pulling a completed response off the shelf.

Some CDNs would mitigate this by normalizing the Accept-Encoding header to one of the few legitimate values. But I don't know that all CDNs do, and there will be sites that have no CDN at all.

I think the solution is to write a custom cache context that would only ever have two values: gzip supported, or not. That way the module does not open any wider a DoS window than it has to.
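A minimal sketch of what such a two-valued context could look like (the class name, namespace, and service id are hypothetical, not anything this module ships):

```php
<?php

namespace Drupal\gzip_html\Cache;

use Drupal\Core\Cache\CacheableMetadata;
use Drupal\Core\Cache\Context\CacheContextInterface;
use Symfony\Component\HttpFoundation\RequestStack;

/**
 * Cache context that only distinguishes "client supports gzip" from "not".
 */
class AcceptEncodingGzipCacheContext implements CacheContextInterface {

  /**
   * @var \Symfony\Component\HttpFoundation\RequestStack
   */
  protected $requestStack;

  public function __construct(RequestStack $request_stack) {
    $this->requestStack = $request_stack;
  }

  public static function getLabel() {
    return t('Accept-Encoding: gzip supported');
  }

  public function getContext() {
    $accept = $this->requestStack->getCurrentRequest()->headers->get('Accept-Encoding', '');
    // Collapse arbitrary client input down to exactly two possible values,
    // so a garbage header cannot multiply cache entries.
    return (stripos($accept, 'gzip') !== FALSE) ? 'gzip' : 'none';
  }

  public function getCacheableMetadata() {
    return new CacheableMetadata();
  }

}
```

It would be registered in the module's services file as a service tagged cache.context, with an id along the lines of cache_context.accept_encoding_gzip, and then added to responses as a cache context the same way headers:Accept-Encoding was above.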

bvoynick’s picture

Playing with this more, I've realized I didn't understand some things about Drupal's page caching system and this module's place in the ecosystem.

Drupal's Internal Page Cache module, which caches entire responses for anonymous users, ignores cache contexts. I had Internal Page Cache enabled, and this was the root of the problem. With Internal Page Cache uninstalled, the problem is solved on the Drupal side; there is no need to add an additional cache context to the response object.

Of course, uninstalling Internal Page Cache also creates much more server load unless the site sits behind a caching proxy like a CDN.

At the least, module documentation (and ideally messaging in Drupal too) should alert the user to the interaction between this module and Internal Page Cache.

A more ambitious course would be to investigate switching this module from an event subscriber to a middleware, so that it can modify responses after the internal page cache acts, and/or insert cache handling before the internal page cache is even asked to serve the response: https://www.drupal.org/docs/8/api/middleware-api/overview
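As a rough sketch of the middleware idea (service and class names hypothetical): core registers its page_cache middleware with priority 200, and a middleware with a higher priority wraps it, so it sees responses after the internal page cache has produced them, including cache hits.

```yaml
services:
  http_middleware.gzip_html:
    class: Drupal\gzip_html\StackMiddleware\GzipHtmlMiddleware
    tags:
      # Core's http_middleware.page_cache has priority 200. A higher
      # priority places this middleware outside it, so compression can
      # happen after the page cache serves (or stores) a response.
      - { name: http_middleware, priority: 250 }
```

The trade-off, as noted below by Berdir as well, is that compressing outside the page cache means the stored copy is uncompressed and the compression work happens on every request.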

Berdir’s picture

> BUT I believe this code introduces something of a Denial of Service vulnerability and should not be committed as-is.

Drive by review/comment:

* You don't need anything as fancy as this. The regular page cache always caches on the full query string, so any unique combination of query parameters will already bypass the internal/anonymous page cache. Dynamic page cache will only vary by certain (or all) query parameters if instructed to do so via cache contexts, but even then, all it takes is a page with a pager and you can count to infinity with ?page=99999 or whatever.

* Cache contexts actually don't work on the internal/anonymous page cache, as you found out. To make it work with page cache, you would need to either customize \Drupal\page_cache\StackMiddleware\PageCache::getCacheId (with a subclass and a service provider alter), or run this as a middleware before/after page cache, so that page cache always caches the non-compressed version and you then either compress or not. Of course, that would mean doing the compression on every cache hit. A third option would be a response policy that disables caching on requests that don't support gzip; how many such requests there are depends on your site.
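The response policy option could be sketched roughly like this (class name and namespace are hypothetical); such a class would be registered as a service tagged page_cache_response_policy:

```php
<?php

namespace Drupal\gzip_html\PageCache;

use Drupal\Core\PageCache\ResponsePolicyInterface;
use Symfony\Component\HttpFoundation\Request;
use Symfony\Component\HttpFoundation\Response;

/**
 * Denies page caching when the client does not accept gzip.
 */
class DenyNonGzipClients implements ResponsePolicyInterface {

  public function check(Response $response, Request $request) {
    $accept = $request->headers->get('Accept-Encoding', '');
    if (stripos($accept, 'gzip') === FALSE) {
      // Don't let a rare non-gzip client populate the page cache with an
      // uncompressed copy; render uncached for that client instead.
      return static::DENY;
    }
  }

}
```

The cost of this approach scales with how many non-gzip clients a given site actually sees, which is exactly the caveat raised above.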