Updated: Comment #0

Problem/Motivation

Over at #2167039: Regression: page cache tags broken, also: some routes incorrectly use _controller -> No page object in template variables -> Fatal in seven_preprocess_page(), we fixed broken page cache tags. As described in #26 of that issue, the different execution order forced us into a certain corner. The solution ended up being setting a response header that contained the cache tags. At first in non-serialized form, but eventually in serialized form.

Therefor, we now have a cache_tags header whose value looks like this:

a:8:{s:9:"node_view";b:1;s:17:"node_view_article";b:1;s:4:"node";a:1:{i:1;s:1:"1";}s:4:"user";a:1:{i:1;s:1:"1";}s:13:"filter_format";s:10:"basic_html";s:13:"taxonomy_term";a:2:{i:1;s:1:"1";i:2;s:1:"2";}s:18:"taxonomy_term_view";b:1;s:4:"menu";a:4:{s:5:"tools";s:5:"tools";s:6:"footer";s:6:"footer";s:4:"main";s:4:"main";s:7:"account";s:7:"account";}}

That's a lot of useless data right there: not only is it useless for 99.99% of clients, it's also quite a bit of data on the wire (that gzips relatively poorly, too).

Proposed resolution

As described in the original issue that added cache tags (#636454: Cache tag support), the potential there to leverage cache tags to have both comprehensive and simple purging of reverse proxies like Varnish… is very appealing. Drupal 8 shipping with native support for this would encourage shared hosting environments to provide reverse proxies, which would result in faster Drupal sites for the hosting client, and less cost for the hosting provider. (Gandi Simple Hosting is the one example I'm aware of that does this already.)

Therefor I propose to:

  1. keep this header, but just make it useful for reverse proxies, while still also being used by Drupal's page cache. In other words: Drupal's page cache then acts like a built-in reverse proxy.)
  2. Another advantage is that it becomes trivial to write tests that check whether a certain cache tag has bubbled up the page correctly.

The equivalent of the above header would be a X-Drupal-Cache-Tags header with the following value:

node_view:1 node_view_article:1 node:1 user:1 filter_format:basic_html taxonomy_term:1 taxonomy_term:2 taxonomy_term_view:1 menu:tools menu:footer menu:main menu:account

Remaining tasks

None.

User interface changes

None.

API changes

A new header is available: X-Drupal-Cache-Tags.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Wim Leers’s picture

Status: Active » Needs review
FileSize
6.09 KB

And patch.

Status: Needs review » Needs work

The last submitted patch, 1: cache_tags_header-2217371-1.patch, failed testing.

larowlan’s picture

Could be a stupid question, but could we hash it to save information leaking, might end up smaller too.

dawehner’s picture

  1. +++ b/core/lib/Drupal/Core/EventSubscriber/HtmlViewSubscriber.php
    @@ -93,6 +93,36 @@ public function onHtmlPage(GetResponseForControllerResultEvent $event) {
    +   * @see \Drupal\Core\Cache\DatabaseBackend::flattenTags()
    

    Is tere any reason we don't simply make that method a public static one?

  2. +++ b/core/modules/system/lib/Drupal/system/Tests/Cache/PageCacheTagsIntegrationTest.php
    @@ -66,24 +66,24 @@ function testPageCacheTags() {
         $this->verifyPageCacheTags('node/' . $node_1->id(), array(
    ...
         $this->verifyPageCacheTags('node/' . $node_2->id(), array(
    

    Can't we just remove the identical and replace it with equal now? Is there any reason that the order of cache tags would matter?

Wim Leers’s picture

Title: Expose cache tags to proxies: X-Drupal-Cache-Tags header (by just slightly changing an existing header) » Expose cache tags to reverse proxies: X-Drupal-Cache-Tags header (by just slightly changing an existing header)
Status: Needs work » Needs review
FileSize
7.26 KB
6.82 KB

#3: There is no sensitive information in there. Plus, the same information can be deduced from the HTML. Not a stupid question though :)

#4: Thanks for the review — both great points! However, in PHP var_dump(array('a', 'b') == array('b', 'a')) will actually output FALSE. So we'd have to resort to sorting first. Which is what I did. We might as well have kept assertIdentical() then, but I dropped that, at your request.

Also fixed test failures.

Status: Needs review » Needs work

The last submitted patch, 5: cache_tags_header-2217371-5.patch, failed testing.

Wim Leers’s picture

Status: Needs work » Needs review
FileSize
7.97 KB
763 bytes

This should make it green.

Damien Tournoud’s picture

At this point, we might as well use this as the ETag, this gives us client-side validation while still allowing more intelligent agents to take additional decisions.

catch’s picture

With the current implementation, the Etag would only change if the actual list of tags themselves change (i.e. a page no longer has node 2 on it, but node 3 instead) - since we're (rightly I think) not including the tag invalidation counts in the header.

If we did include the invalidation counts in the headers, using the ETag gets more interesting in terms of client side invalidation though.

For the use case of clearing varnish as originally discussed, the idea has been to directly invalidate via regexp on the headers. In that case you don't need to store tag invalidations, because the cache backend is just purging when a cache tag is updated - nothing local about cache tags needs to be kept at all (for the page cache at least).

Wim Leers’s picture

#8: Very interesting! "cache tags on the page" would indeed be a very large part of the puzzle to generate a reliable ETag.

However, let us please not expand the scope to include that. Let's do that in a follow-up issue :)

Damien Tournoud’s picture

I thought this was a list of tag fingerprints. We have this readily available, so we should just use it.

nielsvm’s picture

Agreeing with @Wim on the scope here. External caches can just store the header and outgoing calls (e.g. from purge) can instruct Varnish (or a CDN) to wipe regex based on the cache tag string.

Getting this in would be king!

Damien Tournoud’s picture

I do understand where this wants to go, and it would be trivial to implement a Varnish ban rule based on the cache tag. What I'm discussing is that (1) this is at odd with standard HTTP caching mechanisms, so non-Drupal-specific caching agents are not going to be able to leverage it, (2) this introduces an inconsistency between client-side invalidation (which relies on ETag only) and intermediate agent invalidation.

Wim Leers’s picture

this is at odd with standard HTTP caching mechanisms

It's just a custom header. It's not at odds with HTTP caching at all. Every HTTP client will just ignore it.

so non-Drupal-specific caching agents are not going to be able to leverage it

True, but that's also not the goal.

this introduces an inconsistency between client-side invalidation (which relies on ETag only) and intermediate agent invalidation.

There is no inconsistency. Again, this is just a custom header. It serves a very different purpose. ETag is for invalidating a single HTTP resource (it is associated with a single URL). This custom X-Drupal-Cache-Tags header can be used for invalidating all affected HTTP resources. Of course, in a non-standard way. Because there is no standard for it yet. That's why it's prefixed with X-Drupal. Once there is a standard for this, we can move to that :)

(Mark Nottingham has proposed Linked Cache Invalidation — also see the RFC, but that hasn't gone anywhere. Plus, it still requires the implementation to know every URL where a thing (entity in Drupal) appears. So it still would be less complete.)

Gábor Hojtsy’s picture

Wim Leers asked me to review this patch. The thing that stands out as very odd to me is that there is no encapsulation for this functionality. Logic related to it appears oddly as several places in the codebase.

catch’s picture

I thought this was a list of tag fingerprints. We have this readily available, so we should just use it.

This isn't readily available. To make it available we'd have to store either an increment or timestamp of the last time any particular cache tag is invalidated. Just the list of tags isn't enough to know if an item is valid or not.

The database backend does store that information, but it's an implementation detail and not imposed by the API. Right now it would be valid to have a 'varnish' backend for the page cache that does nothing when items are stored (i.e. similar to pressflow/7.x external cache), and invalidates via purge when items are cleared. Even using the 'internal' page cache, something like redis could do direct invalidation of cache items rather than tracking information about tag invalidations and comparing on read.

Damien Tournoud’s picture

@catch: That sounds like a good reason to revise the API. I think #918538: Decouple cache tags from cache bins is moving in this direction, but I haven't looked very closely.

Wim Leers’s picture

FileSize
11.93 KB
7.4 KB

#15: dawehner asked in #4 to do it this way, to avoid code duplication.
How about this alternative way of avoiding code duplication then: remove both the DB and memory cache back-ends' flattenTags() methods and have a generic one at Cache::flattenTags()?

catch’s picture

@Damien that tries to clear up the relationship between bin storage and tags (and I liked your last suggestion on there), however there are two different and equally valid approaches to tag invalidation, direct clearing vs. comparison on get.

ETags relies on comparison on GET (in fact that's exactly what what an ETag is for), but I don't think we want to enforce one or the other approach on the tag storage.

One possibility would be to have two different interfaces so that implementations can indicate which they do, and then use it for the ETag or not depending on the answer.

Status: Needs review » Needs work

The last submitted patch, 18: cache_tags_header-2217371-17.patch, failed testing.

Wim Leers’s picture

Status: Needs work » Needs review
FileSize
12.73 KB
866 bytes

Green again.

Berdir’s picture

Hm, the flattenTags implementation so far was an implementation detail, a different backend implementation might want to use a different one. But I guess they can still decide to do so.

Wim Leers’s picture

#22: Precisely.

It's either this, or duplicate the flattenTags() code, which dawehner was against. I'm fine with either, as long as we make a decision and not bikeshed it :) I personally have a preference to duplicate the code, precisely to avoid the slight concern being raised here by both Gábor and Berdir.

catch’s picture

Also this adds a cross-dependency between the implementations and the Cache class. It's a lot better than adding it to the interface and enforcing the format but still feels a bit odd. What about putting that in a trait?

Wim Leers’s picture

Heh! In my mind, traits are very general purpose things. Here, we deal with specific purpose things. At any point in time, any of the cache backends may want to change how they store cache tags. It just happens to be that X-Drupal-Cache-Tags wants to output cache tags is similar to what the current cache backends are doing for storing tags.
I think we should just "duplicate" the code. We're bikeshedding about something that's not worth bikeshedding about: some 15 LoC that just happen to be very similar.

Can everybody live with that?

Wim Leers’s picture

FileSize
7.31 KB
8.93 KB

Here's a reroll that does precisely that: everything back to the old (pretty much exactly like #1), but now with a HtmlViewSubscriber::flattenCacheTags() method that returns a string of space-separated flattened cache tags. So it's less similar to DatabaseBackend::flattenTags() than before.

(Hence the interdiff is bigger than the patch itself.)

Berdir’s picture

+++ b/core/includes/common.inc
@@ -3241,7 +3241,7 @@ function drupal_page_set_cache(Response $response, Request $request) {
       ),
-      'tags' => array('content' => TRUE) + drupal_cache_tags_page_get($response),
+      'tags' => explode(' ', $response->headers->get('X-Drupal-Cache-Tags')),
       'expire' => Cache::PERMANENT,

I don't think that's correct?

This will then look like array('node:1', 'node:2') but what we want is array(array('node' => array(1, 2)?

It will probably work right now as the flatten in the database backend does essentially the same, but as you said yourself, we shouldn't rely on that.

Wim Leers’s picture

FileSize
8.3 KB
3.39 KB

You're right: it does exploit an implementation detail.

This reroll fixes that.

catch’s picture

+++ b/core/lib/Drupal/Core/EventSubscriber/HtmlViewSubscriber.php
@@ -77,7 +77,7 @@ public function onHtmlPage(GetResponseForControllerResultEvent $event) {
+        $response->headers->set('X-Drupal-Cache-Tags', $this->convertCacheTagsToHeader($tags));

Should this be static::convertTagsToHeader() now?

Berdir’s picture

+++ b/core/lib/Drupal/Core/EventSubscriber/HtmlViewSubscriber.php
@@ -129,4 +129,36 @@ protected function flattenCacheTags(array $tags) {
 
+  /**
+   * 'Flattens' a tags array into an array of strings.
+   *

No it doesn't :)

Wim Leers’s picture

#29: it's possible, but unnecessary: static:: is only necessary if you want to use LSB to determine the method to be called.

Wim Leers’s picture

FileSize
8.32 KB
806 bytes

#30: AAAAARRGH! I blame the endless misery with local test running for lacking self-review :(

dawehner’s picture

This overly looks good.

+++ b/core/includes/common.inc
@@ -3241,7 +3241,7 @@ function drupal_page_set_cache(Response $response, Request $request) {
+      'tags' => Drupal\Core\EventSubscriber\HtmlViewSubscriber::convertHeaderToCacheTags($response->headers->get('X-Drupal-Cache-Tags')),

I wonder why we can't just use an use statement instead. This would be a bit nicer to read here.

Wim Leers’s picture

Sure, I can do that.

Any other concerns?

dawehner’s picture

Not from my side.

Wim Leers’s picture

FileSize
8.62 KB
1.11 KB

Alright, then this will hopefully be RTBC :)

(Addresses #33.)

Status: Needs review » Needs work

The last submitted patch, 36: cache_tags_header-2217371-36.patch, failed testing.

catch’s picture

#31 I know it's not necessary, but $this->convertCacheTagsToHeader($tags)); made me go to check if the method was static or not (since the reverse is called statically elsewhere), whereas I'd otherwise have saved the check. I can't really find anything else to complain about though :)

Wim Leers’s picture

dawehner’s picture

You could argue that helper methods should not live on subscribers, but I don't care.

Wim Leers’s picture

Status: Needs work » Needs review
FileSize
8.62 KB
990 bytes

As per #38 and asking in IRC: changing from $this->staticMethod() to static::staticMethod().

#40: My thoughts exactly.

Berdir’s picture

Status: Needs review » Reviewed & tested by the community

Yes, I prefer this as well.

Looks good to me.

Header HEAD:
cache_tags:a:9:{s:9:"node_view";b:1;s:4:"node";a:1:{i:1;s:1:"1";}s:4:"user";a:1:{i:1;s:1:"1";}s:4:"file";a:1:{i:1;s:1:"1";}s:9:"file_view";b:1;s:13:"taxonomy_term";a:5:{i:1;s:1:"1";i:2;s:1:"2";i:3;s:1:"3";i:4;s:1:"4";i:5;s:1:"5";}s:18:"taxonomy_term_view";b:1;s:13:"filter_format";s:10:"basic_html";s:4:"menu";a:5:{s:5:"tools";s:5:"tools";s:6:"footer";s:6:"footer";s:4:"main";s:4:"main";s:7:"account";s:7:"account";s:5:"admin";s:5:"admin";}}

Header with patch:
X-Drupal-Cache-Tags:node_view:1 node:1 user:1 file:1 file_view:1 taxonomy_term:1 taxonomy_term:2 taxonomy_term:3 taxonomy_term:4 taxonomy_term:5 taxonomy_term_view:1 filter_format:basic_html menu:tools menu:footer menu:main menu:account menu:admin content:1

Much better :)

nielsvm’s picture

Very exciting that this just went RTBC. I just validated that this will work 100% when using Varnish as external cache for instance, I will soon build a proof of concept setup based on my work on the purge module for D8.

As can be seen here: https://www.varnish-software.com/static/book/Cache_invalidation.html#ban...

It becomes very trivial to write a VCL that accepts HTTP BAN requests based on X-Drupal-Cache-Tags, e.g.: ban req.http.host ~ "example.com" && obj.http.X-Drupal-Cache-Tags ~ /<REGEX based on req.http.X-Ban-Tag goes here>/

Varnish'es BAN lurker will then simply iterate the heap with cached items, match each header against the added BAN rule and remove the items as requested. The next time the item is requested Varnish will go back to the backend (Drupal) and restore a fresher copy of the page in cache. Thus changing any cache tag in Drupal can result in massive (and efficient) BAN's on the server side and thus very effective external cache invalidation.

Big big big wins here, very nice.

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 41: cache_tags_header-2217371-41.patch, failed testing.

Wim Leers’s picture

Status: Needs work » Reviewed & tested by the community
FileSize
8.62 KB

Chasing HEAD.

catch’s picture

Status: Reviewed & tested by the community » Fixed

Created #2225765: Consider using cache tags (and cache contexts) for the ETag as follow-up.

Committed/pushed to 8.x, thanks!

  • Commit 0b6e47e on 8.x by catch:
    Issue #2217371 by Wim Leers: Expose cache tags to reverse proxies: X-...
Wim Leers’s picture

Issue tags: -sprint

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.