Problem/Motivation
The new router supports (or will support when it's all in place) different responses for the same uri based on accept headers. This is also the current design for the serialization work. However Drupal's page cache only caches by path, not accept headers.
Ideally we want to be able to do full caching in reverse proxies and/or the internal page cache regardless of the response (as you currently can for RSS albeit at a different path). We also don't want to serve incorrect responses if HTML gets in the page cache and it bypasses the routing system.
Proposed resolution
Throughout the comment chain below, we have identified the following proposed solutions:
@crell recommends in comment #3
Should be solved by using the mime type in the cache key, inside HttpCache (or Varnish if you're using that).
@beejeebus recommends in comment #5
...we have to normalise the Accept header before it gets to the backend, and include it in the cache key.
@greg.1.anderson in comment #16 (improvement idea)
The cache key should contain a hash of the data type that was stored in the cache, not the set of all types that were requested. If someone used wget to request just "text/html", it would not make sense to generate a different cache key than the request shown for the browsers listed in #13.
This is hard because we haven't bootstrapped modules at the time the cache key is generated and checked.
If a module needs different "granularity" of normalization (e.g. to return a different type for "application/xhtml+xml" than for "text/html"), then it would need to somehow specify in some way that xhtml is significantly different than html, and this would have to happen on a path-specific basis.
@larowlan in comment #42
Provided a patch that still fails testing, but seems to get close to a solution.
@effulgentsia in comment #44
The reason [the patch in #42] fails is that formats are currently registered in KernelEvents::REQUEST subscribers, but currently, page cache is pre-kernel. Also mentioned in #1984766-3: Start relying on Request/Response objects for cache handling
Remaining tasks
1. getting the patch in #42 to work correctly and pass testing
2. from comment #9: "we start hosting a sane sample config on Drupal.org (if we don't already) and we can include well-documented normalization routines there"
3. from comment #9: "Make page caching work with accept header-based routing"
User interface changes
n/a
API changes
drupal_page_get_cache()
has changed to take a Request object, the $check_only param was removed.
the Request object is required to build a page cache cid from drupal_page_cache_get_cid().
nothing in core used the $check_only flag.
Related Issues
- #339958: Cached pages returned in wrong language when browser language used
- #1944472: Add generic content handler for returning dialogs
- #1505080: [META] Content Negotiation for Accept Headers
- #1984766: Start relying on Request/Response objects for cache handling
- #1984766-3: Start relying on Request/Response objects for cache handling
Original report by catch
Additional info from original report, not included in summary:
Someone is bound to point out that there is #1597696: Consider whether HttpCache offers any significant benefit over the existing page cache in the queue, but I see no sign of HttpCache supporting this either, i.e.
<?php
return $this->keyCache[$request] = 'md'.sha1($request->getUri());
?>
Opening this as critical, however it might be postponed on this feature actually being implemented.
I'm not sure which issue it is that actually does this, there is #1505080: [META] Content Negotiation for Accept Headers but this felt a bit off-topic in that issue. However if that's the one we could also just make sure that any patch that introduces this comes with tests to ensure this particular feature still works.
Comment | File | Size | Author |
---|---|---|---|
#79 | 1855260-79.patch | 7.71 KB | Anonymous (not verified) |
#73 | 1855260-73.patch | 7.68 KB | Anonymous (not verified) |
#68 | 1855260-68.patch | 7.01 KB | Anonymous (not verified) |
#63 | 1855260-63.patch | 6.71 KB | Anonymous (not verified) |
#61 | 1855260-61-test-only.patch | 2.92 KB | Anonymous (not verified) |
Comments
Comment #1
catchHmm #1834288: RESTfully request an entity with JSON-LD serialized response went in recently, but there's no mention of page caching in that issue, so this might be bug instead.
No time to test at the moment but will do so another time if no-one beats me to it.
Comment #2
catchVery old, similar-ish bug, except in that case skipping the page cache was OK as a fix: #339958: Cached pages returned in wrong language when browser language used.
With reverse proxies, I'm not sure how this can work now.
We're going to be normalizing the accept header in PHP, which means what the reverse proxy sees for accept, vs. what the routing system sees for accept is going to be different no?
Comment #3
Crell CreditAttribution: Crell commentedI agree that this should be solved by using the mime type in the cache key, inside HttpCache (or Varnish if you're using that). Whether we do that now or wait for the HttpCache pipeline to be completed, I'm not sure. I'm inclined to wait to reduce the overall amount of work that needs to be done.
Comment #4
catchAlso I have zero idea how this is supposed to work with reverse proxies. If the accept header normalization is done in PHP, then short of implementing identical logic in varnish or nginx what do you do?
Comment #5
Anonymous (not verified) CreditAttribution: Anonymous commentedspoke to crell about this a bit.
as far as the http cache layer is concerned, the routing stuff means we may, or may not, get a different response from the backend based on the Accept header. so i think that means we have to normalise the Accept header before it gets to the backend, and include it in the cache key. make sense?
(we do something like this at $dayjob with the User-Agent and Accept-Encoding headers - normalise in nginx, so we can use them in the nginx cache without having a bazillion different keys when we're only varying on 'gzip/not gzip' and 'mobile/desktop'.)
Comment #6
chx CreditAttribution: chx commentedI recommend not worrying about nginx :) Just because PHP is able to do Accept massaging doesn't mean we can't implement it in nginx and a) add the massaged variable to the proxy_cache_key b) pass the massaged version on with proxy_set_header. That really is of no concern. (Heck we can farm it out to a PHP upstream to do the Accept massaging only if that's what we want; after all nginx can do everything; but that's not speedy.)
Comment #7
catchYeah I was thinking about this and it'd probably fine. The main thing is either having upstream override the header, or making sure that headers are normalized reliably in both PHP and upstream, but it should be doable either way. I'm mildly concerned about people running varnish who don't set this up then file bug reports but that just means a change notice somewhere to point them to.
Comment #8
Crell CreditAttribution: Crell commentedYou already have to have a Drupal-customized Varnish configuration to use it. Most people I know borrow the one from 4 Kitchens. It's probably time we start hosting a sane sample config on Drupal.org (if we don't already) and we can include well-documented normalization routines there.
Comment #9
YesCT CreditAttribution: YesCT commentedThe issue could use updating by someone who understands the discussion.
Especially identifying remaining tasks.
It sounds like from #8 that one of the tasks should be
Are there other concrete tasks identified?
Also, is this a [no patch] issue?
If the proposed solution is the issue title: "Make sure page caching works with accept header-based routing" is that to be done along the way, or at a certain point in the release cycle? Via manual testing (would need "steps to test" to help folks do that)?
Maybe a title like "Make page caching work with accept header-based routing" is better and a patch would do that...
Comment #10
larowlanRelated #1944472: Add generic content handler for returning dialogs
Comment #11
msonnabaum CreditAttribution: msonnabaum commentedPerhaps I'm missing something, but does this patch not solve the issue? Seems pretty simple to me.
Comment #12
msonnabaum CreditAttribution: msonnabaum commentedj/k
Comment #13
msonnabaum CreditAttribution: msonnabaum commentedAlso, in my local testing, chrome, safari, firefox, iOS safari, and chrome on iOS all send these for Accept:
[
"text/html",
"application/xhtml+xml",
"application/xml",
"*/*"
]
So I'm not sure how much normalization actually has to be done.
Comment #14
effulgentsia CreditAttribution: effulgentsia commentedtagging
Comment #16
greg.1.anderson CreditAttribution: greg.1.anderson commentedThe cache key should contain a hash of the data type that was stored in the cache, not the set of all types that were requested. If someone used wget to request just "text/html", it would not make sense to generate a different cache key than the request shown for the browsers listed in #13.
This is hard because we haven't bootstrapped modules at the time the cache key is generated and checked. c.f. #5, above: seems the best thing to do is to normalize the accept headers at the time the cache key is generated, and make sure that the modules get the normalized accept headers, not the full accept headers. If a module needs different "granularity" of normalization (e.g. to return a different type for "application/xhtml+xml" than for "text/html"), then it would need to somehow specify in some way that xhtml is significantly different than html, and this would have to happen on a path-specific basis.
Comment #17
larowlanTagging
Comment #18
msonnabaum CreditAttribution: msonnabaum commentedHere's a better approach. Didnt see this method in the symfony 2.0 docs I was looking at.
Still broken because it doesn't handle setting yet.
Comment #19
msonnabaum CreditAttribution: msonnabaum commentedActually, not better, because getRequestFormat just returns html unless you set it…
Comment #20
greg.1.anderson CreditAttribution: greg.1.anderson commentedActually, #18 is moving in the right direction. We just need to run something like this first:
Drupal can later use getRequestFormat to determine what data type to build and return, and it will be consistent with what was done at cache time. All we need is the magic part. If we just used vget to fetch a table that could be uses on a per-path basis (maybe using a regex to match paths?), we could iterate through the available path / type mappings and return TRUE if the type and path both match. The variable could be rebuilt by a function that modules could hook to insert additional path/type mappings.
By default, the table would contain just "all paths" == "text/html". If nothing is set in the look, then getRequestFormat already defaults to "html". (Maybe we need to map "text/html" to "html", or set the default to "text/html" before running the loop.)
Comment #21
msonnabaum CreditAttribution: msonnabaum commentedHere's what I was going for previously.
Comment #22
msonnabaum CreditAttribution: msonnabaum commentedAnd here's the setting bit.
Comment #23
greg.1.anderson CreditAttribution: greg.1.anderson commentedNifty. Seems it would be good to make getContentType call setRequestFormat. Of course, getContentType also needs to work for types other than html, but it is a good start.
Comment #24
msonnabaum CreditAttribution: msonnabaum commentedYes, I agree, I think setRequestFormat is the right method to call, but I figured that should be changed in a larger content negotiation issue.
Comment #25
effulgentsia CreditAttribution: effulgentsia commentedI think #12 would be an acceptable solution to the (Drupal) part of this issue that's prioritized as critical, especially given #13.
I think #16 would be a nice improvement, maybe even worthy of a "major" priority. To do that, we have a 'content_negotiation' service registered in core.services.yml. As long as when the drupal_page_get_cache() is running (or if/when HttpCache replaces it), we have access to the DIC in order to get the content_negotiation service, and are able to get a $request object to pass to it, then we can call
Drupal::service('content_negotiation')->getContentType($request)
to get the cache key we want. Per the issue summary, our implementation of that service will be improved in #1505080: [META] Content Negotiation for Accept Headers.Once #1505080: [META] Content Negotiation for Accept Headers is in, we could in theory do that by collecting all routes that match the path, taking a union of their allowed formats, and passing that union to
bestMatch()
. However, that would require running a bunch of routing code, even on cache hits. I think it would be better to instead get an ordered array of possible matches, among all the formats supported by the site, and cache on that without regard to path-specific restrictions. I think the extra bit of cache duplication from that is lower cost than a much slower cache key generation process.Yeah, per #8, this is relatively straightforward to do for proxies that can be configured (Varnish, nginx, etc.). The annoying elephant in the room is Akamai, which, according to http://www.rimmkaufman.com/blog/vary-user-agent/30112012/, will flat out refuse to cache any page that uses
Vary: Accept
(or any other header other than Accept-Encoding). My current thought on that though is to leave that to contrib to solve. Such a contrib module would need to implement whatever the D8 equivalents of hook_url_inbound_alter() and hook_url_outbound_alter() will end up being (plus we'll need to expose a client-side equivalent hook in the Drupal.url() JS function) to implement whatever strategy it wants for adding format-specific information to the URL. Brainstorming some names for such a module:http_legacy
,unrest
,retro_urls
:)Comment #26
effulgentsia CreditAttribution: effulgentsia commented#25 was an xpost. Looks like #22 already connects to the content_negotiation service. Cool.
Comment #27
msonnabaum CreditAttribution: msonnabaum commentedAkamai is already incompatible with drupal because they ignore Vary: Cookie. I dont think this is worth considering.
Comment #28
msonnabaum CreditAttribution: msonnabaum commentedI should clarify, they are incompatible by default, but you can do whatever you need with a custom configuration from them as I recall.
Comment #29
msonnabaum CreditAttribution: msonnabaum commentedComment #31
Crell CreditAttribution: Crell commentedDoesn't this imply that the key we use is "html" or "json_hal", not the actual mime type? Which, in turn, means an inconsistency with how a reverse proxy will cache it?
That doesn't seem wise.
Also, we shouldn't be using request_uri(). That should really be deprecated. I also don't understand why you're recreating the request object here rather than just using Drupal::request().
Comment #32
msonnabaum CreditAttribution: msonnabaum commentedUsing Drupal::service('content_negotiation')->getContentType() is essentially the normalization we were looking for. I'm using the same logic here to key the cache that we are to determine what response to send. I don't see anything wrong there.
I recreated the request because the headers weren't available unless I called createFromGlobals(). No idea why.
Removing request_uri is out of scope for this issue.
Comment #33
Crell CreditAttribution: Crell commentedI mean the thing you get back from request_uri() can be gotten off of the request object.
If the headers are not available, that's a bug that should be investigated, not hacked around.
Comment #34
msonnabaum CreditAttribution: msonnabaum commentedFair enough, I started #1984766: Change notice: Start relying on Request/Response objects for cache handling to track those changes.
Comment #35
catchWhat about Accept headers sent by RSS feed readers and crawlers? Those are a lot less consistent than browsers usually. I wasn't able to find a handy list on the internet but it looks to me like both Drupal 6 and Drupal 7 aggregator module doesn't set any accept header when requesting feeds at all, for example.
Comment #36
Anonymous (not verified) CreditAttribution: Anonymous commentedi don't think that is a caching problem, but a routing one.
as far as i understand it, if a client asking for RSS doesn't set the right Accept headers, D8 will either serve them html instead of rss, or will return a 404.
Comment #37
catchWell if ten different RSS readers visit the same feed URL with ten different sets of headers, how many unique cache IDs will they generate.
That's also a fun routing issue though if turns out like that.
Comment #38
greg.1.anderson CreditAttribution: greg.1.anderson commented#22 does not suffer from the problem described in #37 (although #12 did). The existing function getContentType() used in #22 is very similar to the proposal in #20, except it is currently just a stub function that handles only html. As mentioned in #24, that will be addressed in a separate issue -- but I don' think there is an existing issue for it yet.
Comment #39
effulgentsia CreditAttribution: effulgentsia commentedIt currently handles html, drupal_ajax, ajax, and iframeupload. Once #1959574: Remove the deprecated Drupal 7 Ajax API is in, I'm not entirely sure what the future of ajax and iframeupload will be. #1944472: Add generic content handler for returning dialogs adds drupal_dialog and drupal_modal. In HEAD, we also have a HalSubscriber that adds 'hal_json'. Other request subscribers can add more. But maybe KernelEvents::REQUEST is the wrong place to add these if they're needed by the page caching layer?
Isn't that #1505080: [META] Content Negotiation for Accept Headers, or are you referring to some other work that's needed beyond the scope of that issue?
Why is that a problem? So Drupal's page cache can benefit from a more efficient cache key, while a reverse proxy would use a less efficient cache key, unless someone configures it with some normalization logic based on knowledge of what Drupal modules are in use. Why is it necessary for Drupal and the reverse proxy to have identical normalization?
In the Drupal page cache, with #22, 1. After #1505080: [META] Content Negotiation for Accept Headers, possibly several, depending on how we want to implement fallbacks, but likely not many. In the reverse proxy, depends on what normalization logic it's configured with.
Comment #40
greg.1.anderson CreditAttribution: greg.1.anderson commentedAgree with #39; thanks for the corrections.
Comment #41
larowlanHere's some tests now #1944472: Add generic content handler for returning dialogs is in - interdiff is the test-only patch.
Note that patch from 22 does not make the test pass so def needs work.
Comment #42
larowlanFixes the debug in the test and add missing blank line at end of yml file.
Interdiff against #22 is test only file.
Again will still fail but at least we have a test now.
Comment #44
effulgentsia CreditAttribution: effulgentsia commentedThanks for the test! The reason it fails is that formats are currently registered in KernelEvents::REQUEST subscribers, but currently, page cache is pre-kernel. I mentioned that in #1984766-3: Change notice: Start relying on Request/Response objects for cache handling as well. I'm not yet clear on whether changes related to that should be discussed/done here, in that issue, or in some new issue.
Comment #45
catchComment #46
jaredsmith CreditAttribution: jaredsmith commentedThis issue could use an updated issue summary, based on the templates at http://drupal.org/issue-summaries.
Comment #47
rainbreaw CreditAttribution: rainbreaw commented@chuchunaku and @rainbreaw are working on summarizing this at the d8 sprint at drupalcon
Comment #48
rainbreaw CreditAttribution: rainbreaw commented@chuchunaku and @rainbreaw have finished/posted our effort at a summary
Comment #49
rainbreaw CreditAttribution: rainbreaw commentedupdating tag to remove "needs issue summary"
Comment #50
catchClosed #1973206: ViewSubscriber with page caching - Denial of Service (SA-CONTRIB-2013-042) as duplicate.
Comment #51
catchComment #52
chx CreditAttribution: chx commentedJust remove the router and be done.
Comment #53
catchIt's as much a REST module issue (the only user of this so far) as it is the router, see the Drupal 7 SA linked earlier.
Comment #54
larowlanthe dialog controller uses accept based routing too.
Comment #55
Anonymous (not verified) CreditAttribution: Anonymous commentedreviving this.
Comment #56
Damien Tournoud CreditAttribution: Damien Tournoud commentedGiven the current state of "Content negotiation" in core, #55 is probably enough.
Comment #57
Anonymous (not verified) CreditAttribution: Anonymous commentedyeah, i am just dumbfounded at our use of accept headers with the ajax dialog controllers, but we can leave that ugly alone and still have the basics working.
Comment #58
klausiI was hoping for a cleaner page caching solution in D8, but it seems that this is the best we can do right now.
This should have a test case that shows that we have actually fixed something.
Comment #59
dawehnerI am wondering whether there should be also the query parameters be involved
Comment #60
Damien Tournoud CreditAttribution: Damien Tournoud commented@dawehner:
getUri()
returns the absolute URL of the request, including scheme, host, possible base path, path info, and query string.Comment #61
Anonymous (not verified) CreditAttribution: Anonymous commentedadded a test.
Comment #62
msonnabaum CreditAttribution: msonnabaum commentedMissing a @param doc for drupal_page_get_cache, but as soon as that's fixed, I think this is RTBC.
Comment #63
Anonymous (not verified) CreditAttribution: Anonymous commentedadded missing @param...
Comment #64
msonnabaum CreditAttribution: msonnabaum commentedComment #65
Damien Tournoud CreditAttribution: Damien Tournoud commentedThis is as smart as our content negotiation, so +1 from me too.
Comment #65.0
Damien Tournoud CreditAttribution: Damien Tournoud commentedIssue summary from @chuchunaku and @rainbreaw at drupalcon sprint
Comment #66
catchOther places in core we use $cid_parts, this is only internal so doesn't matter but it stuck out.
I'm still disgusted that we have a rule not to use md5() in core, but this breaks that horrible rule...
This is encouraging but could we also check that the content is correct?
It'd also be good to add a test case for variations on accept header, i.e. chrome sends today:
I found this old dodgy header from Safari, is that a cache hit or a new cache entry?
That and even worse examples on http://www.gethifi.com/blog/browser-rest-http-accept-headers
Comment #67
Crell CreditAttribution: Crell commentedWe really ought to be honoring the Vary header properly rather than just hard coding the Accept header. Our routing system currently supports routing by a whole lot of things. Our cache system should as well.
Comment #68
Anonymous (not verified) CreditAttribution: Anonymous commented#66 - i've added everything except the tests on accept header. this new code doesn't do anything smart with accept headers, it asks for a normalised value from our existing content negotiation code. if we need more tests for the content negotiation bits, can we do it in another issue?
#67 - i have no helpful response to that, so i'll just pretend it doesn't exist.
Comment #70
Anonymous (not verified) CreditAttribution: Anonymous commented#68: 1855260-68.patch queued for re-testing.
Comment #71
alexpottIf we going to invoke the no md5() in core rule then we have to do something about Symfony, Guzzle, Twig, Zend Feed, Assetic, PHPUnit. Just saying :)
Comment #72
Crell CreditAttribution: Crell commentedalexpott: All the more reason we should move those out of the repository and pull them in through Composer properly. ;-)
As long as we're touching this line, can we switch it to use \Drupal::cache() instead? One less function call, one more autoloadable class call. This is a good thing.
(I'm tempted to ask to make all of this functionality a classed object, actually...)
MENU_CALLBACK menu items no longer exist. They should be routes only. Please don't add more things we just need to remove later. :-)
As above, this should be a proper controller. And in that case it can take the Request as an injected value rather than using \Drupal.
At least the way this is structured now, changing how the cid is calculated to be more robust shouldn't touch anything else. So I'm OK with this for now, with the comments above.
Comment #73
Anonymous (not verified) CreditAttribution: Anonymous commentedupdated patch as per review in #72.
Comment #74
Crell CreditAttribution: Crell commentedThinking about it... This should really be 2 separate controller methods, and 2 separate routes. That is what we want to encourage people to do anyway, so it's a more accurate integration test.
Is it feasible to move drupal_page_cache_get_cid(), drupal_page_set_cache(), and drupal_page_get_cache() to a classed object, specified via Settings? That would allow the code to be autoloaded and allow sites to swap out the cid logic themselves for whatever else they decide it should be.
Comment #75
Anonymous (not verified) CreditAttribution: Anonymous commentedi really don't want to do the two separate routes/controllers thing. we are not testing the routing here - that's an implementation detail that the cache system does not care about. we are testing 'the same URI returns different content based on Accept', not 'our routing system can use different controllers based on Accept'. this is much like me pushing back on catch asking for a bunch of different Accept headers, because that would be testing the content negotiation code, not the cache system.
also, i agree with wanting to allow the cid generation to be alterable/pluggable, but was hoping to do that in a follow up. this late in the cycle, small things are easier to land...
update: i created #2062463: allow page cache cid to be alterable for the cid generation bit.
Comment #76
msonnabaum CreditAttribution: msonnabaum commentedI agree about routing. It's a test, nothing matters outside of the code under test.
And yes, let's please do the conversion to an object/service in a followup since that's pure refactoring. Let's just get this in.
Comment #77
Crell CreditAttribution: Crell commentedOK, incremental it is.
Comment #78
alexpottThis can be removed form drupal_page_get_cache() since you've removed the check only functionality.
Otherwise looks good to me and manual testing confirms it's working as expected.
Comment #79
Anonymous (not verified) CreditAttribution: Anonymous commentednice catch, only change is to update for #78.
Comment #80
Damien Tournoud CreditAttribution: Damien Tournoud commentedLooks good to me too.
Comment #81
catchThis still feels incredibly fragile overall, but the patch itself is fine now.
Committed/pushed to 8.x.
Comment #82.0
(not verified) CreditAttribution: commentedadded api change.
also, whiskey is nice.