Updated: Comment #0
Problem/Motivation
Over at #2167039: Regression: page cache tags broken, also: some routes incorrectly use _controller -> No page object in template variables -> Fatal in seven_preprocess_page(), we fixed broken page cache tags. As described in #26 of that issue, the different execution order forced us into a certain corner. The solution ended up being setting a response header that contained the cache tags. At first in non-serialized form, but eventually in serialized form.
Therefor, we now have a cache_tags
header whose value looks like this:
a:8:{s:9:"node_view";b:1;s:17:"node_view_article";b:1;s:4:"node";a:1:{i:1;s:1:"1";}s:4:"user";a:1:{i:1;s:1:"1";}s:13:"filter_format";s:10:"basic_html";s:13:"taxonomy_term";a:2:{i:1;s:1:"1";i:2;s:1:"2";}s:18:"taxonomy_term_view";b:1;s:4:"menu";a:4:{s:5:"tools";s:5:"tools";s:6:"footer";s:6:"footer";s:4:"main";s:4:"main";s:7:"account";s:7:"account";}}
That's a lot of useless data right there: not only is it useless for 99.99% of clients, it's also quite a bit of data on the wire (that gzips relatively poorly, too).
Proposed resolution
As described in the original issue that added cache tags (#636454: Cache tag support), the potential there to leverage cache tags to have both comprehensive and simple purging of reverse proxies like Varnish… is very appealing. Drupal 8 shipping with native support for this would encourage shared hosting environments to provide reverse proxies, which would result in faster Drupal sites for the hosting client, and less cost for the hosting provider. (Gandi Simple Hosting is the one example I'm aware of that does this already.)
Therefor I propose to:
- keep this header, but just make it useful for reverse proxies, while still also being used by Drupal's page cache. In other words: Drupal's page cache then acts like a built-in reverse proxy.)
- Another advantage is that it becomes trivial to write tests that check whether a certain cache tag has bubbled up the page correctly.
The equivalent of the above header would be a X-Drupal-Cache-Tags
header with the following value:
node_view:1 node_view_article:1 node:1 user:1 filter_format:basic_html taxonomy_term:1 taxonomy_term:2 taxonomy_term_view:1 menu:tools menu:footer menu:main menu:account
Remaining tasks
None.
User interface changes
None.
API changes
A new header is available: X-Drupal-Cache-Tags
.
Comment | File | Size | Author |
---|---|---|---|
#45 | cache_tags_header-2217371-45.patch | 8.62 KB | Wim Leers |
Comments
Comment #1
Wim LeersAnd patch.
Comment #3
larowlanCould be a stupid question, but could we hash it to save information leaking, might end up smaller too.
Comment #4
dawehnerIs tere any reason we don't simply make that method a public static one?
Can't we just remove the identical and replace it with equal now? Is there any reason that the order of cache tags would matter?
Comment #5
Wim Leers#3: There is no sensitive information in there. Plus, the same information can be deduced from the HTML. Not a stupid question though :)
#4: Thanks for the review — both great points! However, in PHP
var_dump(array('a', 'b') == array('b', 'a'))
will actually outputFALSE
. So we'd have to resort to sorting first. Which is what I did. We might as well have keptassertIdentical()
then, but I dropped that, at your request.Also fixed test failures.
Comment #7
Wim LeersThis should make it green.
Comment #8
Damien Tournoud CreditAttribution: Damien Tournoud commentedAt this point, we might as well use this as the
ETag
, this gives us client-side validation while still allowing more intelligent agents to take additional decisions.Comment #9
catchWith the current implementation, the Etag would only change if the actual list of tags themselves change (i.e. a page no longer has node 2 on it, but node 3 instead) - since we're (rightly I think) not including the tag invalidation counts in the header.
If we did include the invalidation counts in the headers, using the ETag gets more interesting in terms of client side invalidation though.
For the use case of clearing varnish as originally discussed, the idea has been to directly invalidate via regexp on the headers. In that case you don't need to store tag invalidations, because the cache backend is just purging when a cache tag is updated - nothing local about cache tags needs to be kept at all (for the page cache at least).
Comment #10
Wim Leers#8: Very interesting! "cache tags on the page" would indeed be a very large part of the puzzle to generate a reliable
ETag
.However, let us please not expand the scope to include that. Let's do that in a follow-up issue :)
Comment #11
Damien Tournoud CreditAttribution: Damien Tournoud commentedI thought this was a list of tag fingerprints. We have this readily available, so we should just use it.
Comment #12
nielsvm CreditAttribution: nielsvm commentedAgreeing with @Wim on the scope here. External caches can just store the header and outgoing calls (e.g. from purge) can instruct Varnish (or a CDN) to wipe regex based on the cache tag string.
Getting this in would be king!
Comment #13
Damien Tournoud CreditAttribution: Damien Tournoud commentedI do understand where this wants to go, and it would be trivial to implement a Varnish ban rule based on the cache tag. What I'm discussing is that (1) this is at odd with standard HTTP caching mechanisms, so non-Drupal-specific caching agents are not going to be able to leverage it, (2) this introduces an inconsistency between client-side invalidation (which relies on
ETag
only) and intermediate agent invalidation.Comment #14
Wim LeersIt's just a custom header. It's not at odds with HTTP caching at all. Every HTTP client will just ignore it.
True, but that's also not the goal.
There is no inconsistency. Again, this is just a custom header. It serves a very different purpose.
ETag
is for invalidating a single HTTP resource (it is associated with a single URL). This customX-Drupal-Cache-Tags
header can be used for invalidating all affected HTTP resources. Of course, in a non-standard way. Because there is no standard for it yet. That's why it's prefixed withX-Drupal
. Once there is a standard for this, we can move to that :)(Mark Nottingham has proposed Linked Cache Invalidation — also see the RFC, but that hasn't gone anywhere. Plus, it still requires the implementation to know every URL where a thing (entity in Drupal) appears. So it still would be less complete.)
Comment #15
Gábor HojtsyWim Leers asked me to review this patch. The thing that stands out as very odd to me is that there is no encapsulation for this functionality. Logic related to it appears oddly as several places in the codebase.
Comment #16
catchThis isn't readily available. To make it available we'd have to store either an increment or timestamp of the last time any particular cache tag is invalidated. Just the list of tags isn't enough to know if an item is valid or not.
The database backend does store that information, but it's an implementation detail and not imposed by the API. Right now it would be valid to have a 'varnish' backend for the page cache that does nothing when items are stored (i.e. similar to pressflow/7.x external cache), and invalidates via purge when items are cleared. Even using the 'internal' page cache, something like redis could do direct invalidation of cache items rather than tracking information about tag invalidations and comparing on read.
Comment #17
Damien Tournoud CreditAttribution: Damien Tournoud commented@catch: That sounds like a good reason to revise the API. I think #918538: Decouple cache tags from cache bins is moving in this direction, but I haven't looked very closely.
Comment #18
Wim Leers#15: dawehner asked in #4 to do it this way, to avoid code duplication.
How about this alternative way of avoiding code duplication then: remove both the DB and memory cache back-ends'
flattenTags()
methods and have a generic one atCache::flattenTags()
?Comment #19
catch@Damien that tries to clear up the relationship between bin storage and tags (and I liked your last suggestion on there), however there are two different and equally valid approaches to tag invalidation, direct clearing vs. comparison on get.
ETags relies on comparison on GET (in fact that's exactly what what an ETag is for), but I don't think we want to enforce one or the other approach on the tag storage.
One possibility would be to have two different interfaces so that implementations can indicate which they do, and then use it for the ETag or not depending on the answer.
Comment #21
Wim LeersGreen again.
Comment #22
BerdirHm, the flattenTags implementation so far was an implementation detail, a different backend implementation might want to use a different one. But I guess they can still decide to do so.
Comment #23
Wim Leers#22: Precisely.
It's either this, or duplicate the
flattenTags()
code, which dawehner was against. I'm fine with either, as long as we make a decision and not bikeshed it :) I personally have a preference to duplicate the code, precisely to avoid the slight concern being raised here by both Gábor and Berdir.Comment #24
catchAlso this adds a cross-dependency between the implementations and the Cache class. It's a lot better than adding it to the interface and enforcing the format but still feels a bit odd. What about putting that in a trait?
Comment #25
Wim LeersHeh! In my mind, traits are very general purpose things. Here, we deal with specific purpose things. At any point in time, any of the cache backends may want to change how they store cache tags. It just happens to be that
X-Drupal-Cache-Tags
wants to output cache tags is similar to what the current cache backends are doing for storing tags.I think we should just "duplicate" the code. We're bikeshedding about something that's not worth bikeshedding about: some 15 LoC that just happen to be very similar.
Can everybody live with that?
Comment #26
Wim LeersHere's a reroll that does precisely that: everything back to the old (pretty much exactly like #1), but now with a
HtmlViewSubscriber::flattenCacheTags()
method that returns a string of space-separated flattened cache tags. So it's less similar toDatabaseBackend::flattenTags()
than before.(Hence the interdiff is bigger than the patch itself.)
Comment #27
BerdirI don't think that's correct?
This will then look like array('node:1', 'node:2') but what we want is array(array('node' => array(1, 2)?
It will probably work right now as the flatten in the database backend does essentially the same, but as you said yourself, we shouldn't rely on that.
Comment #28
Wim LeersYou're right: it does exploit an implementation detail.
This reroll fixes that.
Comment #29
catchShould this be static::convertTagsToHeader() now?
Comment #30
BerdirNo it doesn't :)
Comment #31
Wim Leers#29: it's possible, but unnecessary:
static::
is only necessary if you want to use LSB to determine the method to be called.Comment #32
Wim Leers#30: AAAAARRGH! I blame the endless misery with local test running for lacking self-review :(
Comment #33
dawehnerThis overly looks good.
I wonder why we can't just use an use statement instead. This would be a bit nicer to read here.
Comment #34
Wim LeersSure, I can do that.
Any other concerns?
Comment #35
dawehnerNot from my side.
Comment #36
Wim LeersAlright, then this will hopefully be RTBC :)
(Addresses #33.)
Comment #38
catch#31 I know it's not necessary, but
$this->convertCacheTagsToHeader($tags));
made me go to check if the method was static or not (since the reverse is called statically elsewhere), whereas I'd otherwise have saved the check. I can't really find anything else to complain about though :)Comment #39
Wim Leers36: cache_tags_header-2217371-36.patch queued for re-testing.
Comment #40
dawehnerYou could argue that helper methods should not live on subscribers, but I don't care.
Comment #41
Wim LeersAs per #38 and asking in IRC: changing from
$this->staticMethod()
tostatic::staticMethod()
.#40: My thoughts exactly.
Comment #42
BerdirYes, I prefer this as well.
Looks good to me.
Header HEAD:
cache_tags:a:9:{s:9:"node_view";b:1;s:4:"node";a:1:{i:1;s:1:"1";}s:4:"user";a:1:{i:1;s:1:"1";}s:4:"file";a:1:{i:1;s:1:"1";}s:9:"file_view";b:1;s:13:"taxonomy_term";a:5:{i:1;s:1:"1";i:2;s:1:"2";i:3;s:1:"3";i:4;s:1:"4";i:5;s:1:"5";}s:18:"taxonomy_term_view";b:1;s:13:"filter_format";s:10:"basic_html";s:4:"menu";a:5:{s:5:"tools";s:5:"tools";s:6:"footer";s:6:"footer";s:4:"main";s:4:"main";s:7:"account";s:7:"account";s:5:"admin";s:5:"admin";}}
Header with patch:
X-Drupal-Cache-Tags:node_view:1 node:1 user:1 file:1 file_view:1 taxonomy_term:1 taxonomy_term:2 taxonomy_term:3 taxonomy_term:4 taxonomy_term:5 taxonomy_term_view:1 filter_format:basic_html menu:tools menu:footer menu:main menu:account menu:admin content:1
Much better :)
Comment #43
nielsvm CreditAttribution: nielsvm commentedVery exciting that this just went RTBC. I just validated that this will work 100% when using Varnish as external cache for instance, I will soon build a proof of concept setup based on my work on the purge module for D8.
As can be seen here: https://www.varnish-software.com/static/book/Cache_invalidation.html#ban...
It becomes very trivial to write a VCL that accepts HTTP BAN requests based on X-Drupal-Cache-Tags, e.g.:
ban req.http.host ~ "example.com" && obj.http.X-Drupal-Cache-Tags ~ /<REGEX based on req.http.X-Ban-Tag goes here>/
Varnish'es BAN lurker will then simply iterate the heap with cached items, match each header against the added BAN rule and remove the items as requested. The next time the item is requested Varnish will go back to the backend (Drupal) and restore a fresher copy of the page in cache. Thus changing any cache tag in Drupal can result in massive (and efficient) BAN's on the server side and thus very effective external cache invalidation.
Big big big wins here, very nice.
Comment #45
Wim LeersChasing HEAD.
Comment #46
catchCreated #2225765: Consider using cache tags (and cache contexts) for the ETag as follow-up.
Committed/pushed to 8.x, thanks!
Comment #48
Wim Leers