Problem/Motivation
#2697795: [meta] Cache storage in database can fill the disk, especially via 404s points out that 404 responses are cached indefinitely.
When using the database cache backend, or another backend that does not limit storage space via LRU or similar, this can lead to cache filling in the case of either a large number of real 404s or a mis-behaving crawler.
Proposed resolution
Set a shorter TTL for 404 responses in page_cache module, so that they get expired after time rather than only via cache tags.
Consider whether to do this for all cache backends, or only those that don't support LRU or other space limitation - this would require a way for cache backends to indicate whether they do this though, which we don't have yet.
Remaining tasks
User interface changes
API changes
Data model changes
| Comment | File | Size | Author |
|---|---|---|---|
| #51 | 2699613-51.patch | 5.56 KB | alexpott |
| #51 | 42-51-interdiff.txt | 4.43 KB | alexpott |
| #42 | 2699613-42.patch | 5.58 KB | alexpott |
| #42 | 38-42-interdiff.txt | 5.19 KB | alexpott |
| #38 | 2699613-38.patch | 5.02 KB | alexpott |
Comments
Comment #2
catchComment #3
catchHere's a patch. Does not take much to do this, but I didn't add any configuration.
This could potentially happen in a patch release
Comment #5
fabianx commentedI think this should be configurable at least via settings.php.
Also s/greater/lesser/?
Comment #6
catchYeah I couldn't decide and changed it more than once ;)
Settings.php seems like an OK place to add the configuration yeah.
Comment #7
catchPostponed #2526150: Database cache bins allow unlimited growth: cache DB tables of gigabytes! on this, so bumping it to critical.
Comment #8
catchHere it is with settings. Didn't look into the test failure yet.
Comment #10
alexpottFixed the test - the problem is with how
$response->getMaxAge()works. I've also move the logic out of set() since I'm not sure it belongs there. Plus I think if the ttl is set to FALSE we shouldn't make any attempt to cache a 404. I also think we should debate whether or not making Drupal cache 404s is correct - since some form of cache busting always possible if we are.I was also wondering whether 404s should be a cache redirect to a canonical 404 page - but maybe that would give us more problems.
Comment #11
wim leers#10: that reminds me of the
4xx-responsecache tag, see #2472281: 404/403 responses for non-existing nodes are cached in Page Cache/reverse proxy, are not invalidated when the node is created.Comment #12
catch@alexpott so the use case for caching 404s at all is if you have a bad link somewhere (especially an image tag or similar), which isn't handled by the fast_404 handling - then the caching avoids having to build that page every time.
For me personally, I really dislike sites that redirect to a 404 page - if I've mis-typed a URL for example then look up at the address bar to see /404
Comment #13
catchDon't think that's right. If you're using page cache and no reverse proxy, you expect some page caching despite a 0 max_age.
Going from permament down to an hour is a much smaller change than going from permanent down to 0.
Comment #14
borisson_I think this is related to #2472335: Support an independent max-age for 4xx responses , while that had a different reasoning, I think the result of that issue is similar to the patch in here.
Comment #15
alexpottIt discussion with @catch I think we should consider calling this negative cache... see https://en.wikipedia.org/wiki/Negative_cache and http://www.squid-cache.org/Doc/config/negative_ttl/
Also I think rather then futzing stuff in PageCache we should consider setting an expires header in 404 generation.
Comment #16
alexpottAlso re the redirect idea - I didn't mean to do a page redirect. What I was trying to say is that with 404s we cache the same page over and over again - so should we try to only store one 404 page?
Comment #17
catchPer #2699627: url.path cache context for breadcrumbs is unnecessarily granular the breadcrumbs can be different. Not that I think that's a good feature (could live without breadcrumbs altogether except on menchi katsu and kaki fry ;)) but currently we provide those if the path parents are valid paths - such as admin/config/development/abc
Similarly search_404 means different 404 pages for similar reasons.
So we'd either need to special case the page cache (but somehow allow search_404 to override the behaviour), or we'd need to generically support cache contexts - however that'll have significant overhead for cache retrieval including on non-404s.
I don't like using the expires header in page_cache much in general - it's almost deprecated at this point in favour of max age. I do think negative_cache is a good idea though - that's exactly what this is for, and we could then apply it to all url.path* cache contexts when there's a 404.
Comment #18
alexpottAfter discussing with @catch here's an approach that renames the setting
negative_cache_ttland uses it for 403 and 404s.Comment #19
catchWhy max age here but expires if it's not a 404?
Also should we be checking for ::isClientError()rather than specifically 403/404?
Comment #20
alexpottComment #21
catchlol. I think I must have been projecting my dislike of the expires header, but we should probably use getExpires() so the only functional change is that we cache for an hour, or less if it was already going to be less.
Comment #22
alexpottActually I think it is a bug that we're not using getMaxAge everywhere here - because that falls back to
return $this->getExpires()->format('U') - $this->getDate()->format('U');Comment #23
alexpottDiscussed some more with @catch. The expires check in page cache is super confusing. By ignoring max-age PageCache is not a http compatible cache implementation and in fact it can be advantageous that it is not. It allows people to use a varnish implementation that does not support cache tags and have a small max-age whilst storing a permanently cache page in PageCache and clearing it using cache tags. Apparently the expires checking was implemented in #566494: Running cron in a shutdown function breaks the site?!?!
Therefore removing the max-age and not implementing an expires check for 4xx responses.
Comment #24
wim leers+1
Adding a new setting… shouldn't this belong on
\Drupal\system\Controller\Http4xxController::on404()in combination with #2352009: Bubbling of elements' max-age to the page's headers and the page cache?Of course, #2352009 will be hard to achieve due to BC. But I'd urge us to have a @todo here pointing to that issue, stating that this should be removed when we do that.
Why TTL here, why not max-age?
Comment #25
alexpott@Wim Leers / #24
Basically page cache using max-age is separate and different issue.
Comment #26
catchYes page cache's use and non-use of max-age and use of expires is very confusing, but if we ignore the 'Expires' handling which was added for a very specific and unrepresentative use case and which is more or less undocumented (not to mention the expires header being near-deprecated in general), then we're left with it ignoring those values. I think we should leave all of this to #2352009: Bubbling of elements' max-age to the page's headers and the page cache, don't feel like it needs a @todo - that issue will completely change the implementation/functionality.
#23 looks like a good improvement on what I had in #10.
Comment #27
wim leers#25: Excellent explanations! I'd like to see point 1 made more explicit then, because the setting you're adding in
default.settings.phpis not in any way explained to be Page Cache-specific. I'd like both the docs and the name to convey this. I think that'd make it non-confusing/non-ambiguous.So, for the setting name:
s/negative_cache_ttl/page_cache_negative_cache_ttl/
or
s/negative_cache_ttl/page_cache_negative_ttl/
or
s/negative_cache_ttl/page_cache_4xx_ttl/
Comment #28
catchThe only thing with the rename is I think this is a concept we could use elsewhere.
For example render cache could set a shorter TTL for all cache entries written on 4* responses. I haven't opened an issue for that yet - not sure if it's a good idea, but don't think we'd want two different settings for it - more page_cache would end up piggybacking on the render cache setting (even if it's introduced here).
Comment #29
wim leersBut how do you know if something written to render cache (or any other cache for that matter) for a 4xx response is actually only relevant for 4xx responses? 4xx responses may just as well use cache/compute things that are relevant in other situations as well.
Comment #30
catchWe could add the negative_ttl to cache metadata from the route/url cache contexts on 4** where we know it's only going to be relevant there. This might help with the url cache context on form actions for example.
Even if we did it to all items, the worst that happens is the item expires and gets cached again - either by another 4** or permanently.
Comment #31
wim leersSounds interesting. I can't quite imagine how that'd work though.
Fine by me. But can we then at least put that rationale in the documentation (or the change record)?
Comment #32
catchAdded some extra docs, not sure if that's enough to clarify though. Also it's really at this point clarifying the existing design of the page cache rather than the behaviour added by the patch.
Comment #33
catchOpened #2714227: Use negative_ttl for render cache/url cache context on 404s.
Comment #34
alexpottDiscussed with @catch, @effulgentsia, @xjm and @webchick. We decided to keep this critical since it is an avenue for DDOS and could result in data loss.
Comment #35
berdirWondering if it would be worth it to still respect the Expires header, and only do the default 403/404 logic if no specific header is set.
But probably not..
Would it make sense to also respect this for the max-age header that we send out?
I guess that could be a non-critial follow-up?
Not sure about our guideline for using or not using REQUEST_TIME.
We started using $_SERVER['REQUEST_TIME'] in some cases, see e.g. \Drupal\Core\Cache\MemoryBackend::getRequestTime() and \Drupal\Core\Path\AliasManager::getRequestTime().
AFAIK, that is to make unit testing easier, especially if we need to fake the current time. Doesn't apply here I think.
Wondering how to test the default expiration time.
Building the page cache cid for a given page is fairly easy, we could load the cache entry manually and check the expire timestamp.
Comment #36
berdirAh I see, #2472335: Support an independent max-age for 4xx responses is the issue for the max-age thing.
Comment #37
catchI think only #3 and #4 aren't covered by the other issue.
I opened #2717207: Add a time service to provide consistent interaction with time()/microtime() and superglobals for #3. Agreed it doesn't matter here but the global constant won't be available to components in a subtree split and no point defining our own if PHP's got one. Would rather do that all at once and add a DrupalCS rule for it though, then deprecate the constant.
#4 ought to be doable.
Comment #38
alexpottHere's the requested test.
Comment #39
berdirThank you, I think this is good to go then?
Comment #40
wim leersI think this looks good now, but there's too many nits left to be fixed on commit, so moving back to NW. Sorry.
This is a bit misleading: some external proxies (e.g. Varnish) and some CDNs (e.g. Fastly and CloudFlare) also support cache tags. So they can use the same strategy.
Nit: s/ttl/TTL/
s/expires date/'expire' field on the cache item/
AFAIK we always call this
$cache_item?s/ot/to/
I don't understand this "multiple of 10" stuff?
s/you//
s/To disable caching/To disable caching of client error responses/
Comment #41
mpdonadioWouldn't `$request->server->get('REQUEST_TIME')` make more sense here since it is available to avoid the global? Also a few lines down.
Comment #42
alexpottre #40 / @Wim Leers (thanks for the review)
re #41/ @mpdonadio (thanks for the review) - sure why not - we have the request.
I also improved a variable name.
Comment #43
catchI'm not qualified to RTBC the whole patch since I worked on it, but I think I can RTBC the interdiff relative to #38 which was already RTBCed.
Assigning to @effulgentsia since ideally neither me nor Alex Pott commit this one and he said he'd try to look at it.
Comment #45
catchRe-test was fine, looks like a random fail.
Comment #46
effulgentsia commentedCan 403 responses fill a cache? How?
I read the comments in this issue, such as #15 and #28, but I'm not convinced this is a great name, because:
4xx-responsecache tag that we're able to clear whenever something changes that might affect that response. So using a parallel name might actually create a more harmful implication in people's minds than a helpful association.Unassigning from myself pending feedback on the above.
Comment #47
catch#1 - let's say you have a site with a million users, and anonymous users don't have permission to view user profiles. Someone could try to visit all 1m profiles via a script and fill your cache. Regular browsing and search engine crawling would not hit those profiles because they're not linked from anywhere.
You could argue that a site where anonymous users can see the profiles could also get the cache filled, but at least in that case the caching is likely to have a higher hit rate. So I think the comment is valid but interested what you think after the explanation.
#2 - I don't have a strong opinion on this, are you suggesting $unbounded_cache_ttl? Or we could do $client_error_cache_ttl (although that makes it sound like we're controlling the client like max_age which we're not).
Comment #48
catchBack to Alex - could use concrete suggestions for something other than $negative_cache_ttl
Comment #49
berdirI think a more interesting scenario is the one from the meta issue. Migrating an existing site to Drupal 8, and possibly not migrating all features/content.
And then google wakes up and notices the big change and starts to checks all the old URL's. We have tens of thousands of requests due to that.
Comment #50
effulgentsia commentedRe #47.1: Yep, thanks for that example.
Re #47.2: My current thinking is:
4xx_cache_ttl.unbounded_url_cache_ttl.Comment #51
alexpottI'm ok 4xx_cache_ttl - does what it says on the tin. However $4xx_cache_ttl does not work as a php variable so I guess we should go with 'cache_ttl_4xx'. I introduced the idea of negative_cache_ttl in #15. I still think 404 is a failure and this is a negative cache but the name is still meaningful.
In discussion with @dawhener we realised that we are using
time(),REQUEST_TIMEand$request->server->get('REQUEST_TIME');- that's insane. So We've standardized on$request->server->get('REQUEST_TIME');Comment #52
dawehnerThings are quite inconsistent / unreadable here. Let's use one variable for a consistent time and not deal with relativity theory.
Comment #53
dawehnerThank you alex!
Comment #54
mpdonadioThank you. :)
Comment #55
catch$cache_ttl_4xx is fine with me, and thanks for the request time standardization. See #2717207: Add a time service to provide consistent interaction with time()/microtime() and superglobals for more request time fun.
Comment #56
effulgentsia commentedThanks. I'm pretty happy with this patch now, and am tempted to commit to 8.2.x. However,
I'm nervous about committing this to 8.1. #26 is a reasonable comment, but are we sure we want to apply that reasoning to the production branch? Especially if there's a tag released between this getting committed and #2352009: Bubbling of elements' max-age to the page's headers and the page cache?
Comment #57
catch#2352009: Bubbling of elements' max-age to the page's headers and the page cache I definitely wouldn't put into a patch release, possibly not even a minor without a configuration option.
This issue I can see either way for 8.1.x or not, but it'd be good to get it into 8.2.x in case anyone is testing that branch, even if it waits longer for 8.1.x
Comment #58
effulgentsia commentedTicking credit boxes.
Comment #60
effulgentsia commentedPushed to 8.2.x.
Setting to NR for 8.1.x. Per #56, I think for that branch we might want to add the Expires header checking for the 4xx responses as well, for BC for the 0-2 production sites that might be relying on it. If we agree to do that, might make sense to add that to 8.2 as well to keep them in sync, unless #2352009: Bubbling of elements' max-age to the page's headers and the page cache lands before then.
Comment #61
catchI don't think we should add any more support for the Expires header than we already have - the behaviour is completely undocumented.
Comment #62
alexpottI agree with @catch - especially considering max-age is set later on (if configured) and preferred by browsers and reverse proxies.
Comment #63
alexpottTicked credit box for @effulgentsia - since his reviews have definitely influenced the direction of this patch.
Comment #64
effulgentsia commentedOk, but currently in 8.1, an Expires header can control the expiration of all page cache entries: whether 2xx or 4xx. Cherry-picking #59 to 8.1 would mean that a production site would see the Expires header no longer controlling the expiration of 4xx page cache entries. So this isn't about adding "any more" support. It's about whether we're ok with removing existing functionality in a patch release.
Yes, but developers can read code. If as a developer, you've had a reason to want an expiration for a given page cache entry, you surely would have been able to step through the code and end up finding the PageCache::fetch() method, and seen not only the code for how
$expireis determined, but even the code comment ofGet the time expiration from the Expires header.Yes, a developer finding the above would surely be puzzled/annoyed that you need max-age for browsers and reverse proxies and Expires for Drupal's page_cache.module. And like with other things one gets puzzled/annoyed by, would eventually accept that that's how Drupal core works, and provide both headers in order to control both things. How does changing page_cache.module to no longer obey a shorter Expires header than $cache_ttl_4xx for 4xx responses in a patch release help those developers? Especially, when fixing page_cache.module to use the max-age header is not yet committed to 8.1, and per #57 probably won't be.
Comment #65
catchOK to put it differently, I'm personally not going to work on adding a bc-layer for the Expires header for 8.1.x, I'd be open to moving this back to 8.2.x and marking it fixed there as an alternative.
If someone else wants to pick this up to add a bc-layer for Expires to 8.1.x then they should feel free to, but let's not then forward-port that to 8.2.x
Comment #66
alexpottSo let's do the simple thing and mark this fixed. I'm slightly concerned about 8.1.x sites being vulnerable to this but if it becomes an issue we can revisit the decision. I think one thing that speaks for ignoring the the expires header is what is the worst that can happen? Nothing it going to be broken.
Comment #67
effulgentsia commentedIf an 8.1 site currently has some particular URL that returns a 4xx response with a short Expires header (e.g., <1 hour) for anonymous users, and that response then changes after that expiration time (e.g., maybe becomes not a 4xx due to some time-sensitive condition), with no corresponding clearing of the general
4xx-responsecache tag or any other cache tag occurring (because if the condition is based on time, then you don't need a cache tag, right), then currently, such a site works as designed, whereas cherry picking #59 to 8.1.x would make such a URL return its old 4xx response for the global $cache_ttl_4xx duration, even if that's longer than the URL-specific time-based condition. This might be an edge case that 0 production sites currently have, but we don't know if that's the case. If such a site has this edge case, then for that site, this would appear as something broken.Happy to leave this fixed for 8.2 though, and only revisit the BC requirement for 8.1 if we decide we need this in 8.1.
Comment #69
effulgentsia commentedRetroactively unassigning myself (I should have done that in #60).
Also tagging for a release notes mention.