We noticed on several of our sites that the page_cache table was steadily growing to hundreds of megabytes, containing thousands of rows. We didn't understand why as these sites were not very content heavy and should be very easily cacheable. The example site we used for debugging contains 589 pages and no authenticated users. So we expected to see 589 page_cache entries. Perhaps a few more caused by views pagers, exposed filters and so on. But certainly no more then 1.000.
Problem 1:
After some investigation we noticed a new page_cache entry was created for every unique URL, regardless wether or not that URL actually creates a different output... This is caused by adding query parameters to the URL.
As an example I created a simple bash script that requests url's with a variable id: http://example.tld/?id=x and loops over x from 1 to 10.000. Sure enough the page_cache generated 10.000 cache entries, resulting in 1.5GB of mysql data in the page_cache table.
This should not happen as the cache contexts for the response on this page does not contain the "url.query_args" context. So the cache system should know that the "id" query parameter does not result in any change and should not cause more then 1 entry in the cache_page.
Is this normal behaviour? I could find some related issues, but no clear description as to why this happens, nor a solution:
#2877498: Dynamic cache does not respect query parameters
#2062463: allow page cache cid to be alterable
#2662196: Cache route by Uri and not just Query+Path
I think this could potentially be exploited to overload websites or even crash mysql database or other caching software?
Problem 2:
If we would fix problem 1, a second problem still pops up. Because in our example site we have some modules active, some configuration, some blocks, some real content, ... Mainly a search block is placed in the header of our site on all pages. That in turn creates a form on our website and somehow that seems to add "url.query_args" to the cache context for all pages. So even if the cache id for the page_cache would filter on only query args passed in the cache context, this behaviour would still add all query parameters causing every query parameter to generate a new page_cache entry.
So we did some searching and found that Drupal core adds "url.query_args" in a few places where it should not be needed at all... See the listing below of what I could find.
Why is this implemented like this?
Probably lots of contrib modules already "rely" on this "bazooka"-behaviour...?
core/lib/Drupal/Core/Form/FormBuilder.php:
/**
* #lazy_builder callback; renders a form action URL.
*
* @return array
* A renderable array representing the form action.
*/
public function renderPlaceholderFormAction() {
return [
'#type' => 'markup',
'#markup' => $this->buildFormAction(),
'#cache' => ['contexts' => ['url.path', 'url.query_args']],
];
}
modules/user/src/Form/UserPasswordForm.php:
public function buildForm(array $form, FormStateInterface $form_state) {
...
$form['#cache']['contexts'][] = 'url.query_args';
return $form;
}
modules/views/views.theme.inc:
function template_preprocess_views_mini_pager(&$variables) {
...
// This is based on the entire current query string. We need to ensure
// cacheability is affected accordingly.
$variables['#cache']['contexts'][] = 'url.query_args';
modules/views/src/Plugin/views/style/Table.php:
public function getCacheContexts() {
$contexts = [];
foreach ($this->options['info'] as $field_id => $info) {
if (!empty($info['sortable'])) {
// The rendered link needs to play well with any other query parameter
// used on the page, like pager and exposed filter.
$contexts[] = 'url.query_args';
break;
}
}
return $contexts;
}
Issue fork drupal-3011426
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #2
idebr commentedComment #3
weseze commentedIs this something that would need to be fixed? Or is it expected behaviour?
I can commit some time if needed.
Comment #4
dwwProblem #1 definitely seems like a bug if it's true, but I suspect the real issue is actually problem #2. If you're sure the cache contexts don't include query_args, there shouldn't be unique cache entries if you're only varying query_args. But it seems like the problem is that query_args is getting added for you, against your will. ;) I agree it's a problem that core does this "bazooka" behavior, but I'm not enough of an expert on the whole D8 caching world, so I can't really say if/how we can fix this. Curious to hear what the caching subsystem maintainers have to say.
Meanwhile, I removed one of the duplicate links to other issues from the summary, and converted the rest to use the [#nid] syntax (e.g.
#2877498: Dynamic cache does not respect query parameters) so folks can see the titles / status without having to click through.Comment #5
weseze commented@dww: problem #1 is definitely true. Event tried on a clean vanilla Drupal install where the cache context contained no trace of query_args. Still created a cache entry varied on each query parameter.
Adding in the right query_args cache context (problem #2) is probably easily fixed, making the cache subsystem work with them will probably be the hard part? (if my findings are correct though)
But before committing my time to create patches, I would need to know that is indeed not intended behaviour or I'm not doing something completely wrong with my cache setup ;)
Comment #6
Pascal- commentedWe're experiencing a similar (or the same?) issue with several of our websites.
Any (temp) fix for this? For now we're doing a manual cache rebuild every 2 days ...
Comment #7
weseze commentedOther the shutting down page caching, or patching core, I don't see any solution...
We also have a few sites that are building up massive caches from time to time.
Having a regular cron task and some finetuned settings might fix this, see: https://www.drupal.org/node/2891281 Although it does not fix the problem, just clears the cache on your cron run...
Comment #8
ndobromirov commentedYou can always use MemCached, Redius or any other evicting cache storage.
Comment #9
weseze commented@ndobromirov: that does not fix the issue here...
Comment #10
chi commentedRe: Problem 1
I think it is by design. Request URI is hard coded in cache ID used by Page Cache module. So setting "url.query_args" cache context makes no sense as it more specific. I suppose that's implemented this way because there is no way to determine a precise cache context from incoming request. So that Page Cache module relies on the hard coded cache context which is made of HTTP host, URI and request format. I think that's a correct approach. After all URI is a Uniform Resource Identifier, why shoud a Drupal site serve the same content for different URIs?
Note that the context does not include cookies so that Page Cache does not serve pages for authenticated users.
Re: Problem 2
Search form does not include "url.query_args" context for me. Also I think blocks are typically rendered in cache placeholder so they are not cached on page level. Pagination is by it's nature is depending on "page" query parameter and when used with views exposed filters or sortable tables it should also depend on other query parameters as well.
Anyway I propose creating a separate ticket for each issue where some Drupal element declares too general page context. Because it's a completely different problem and it affects Dynamic Page Cache as well.
Comment #11
chi commentedThat's a known issue. I believe that affects earlier Drupal versions as well. Note that even we fix somehow the problem #1 it will not protect a site from that issue because the are many pages that really depend on "url.query_args" cache context. If someone wants to fill cache tables on a site he will just pick one of such pages as base URL. Basically any page with pagination can be used that way.
curl http://example.com/some-view/?{1..1000}And besides, there is another way to populate the database with a garbage. When DB Log is enabled any anonymous user can leave messages there by visiting non existing pages
curl http://example.com/hi_admin_let's_put_some_garbage_to_your_databaseor by producing errors on a site with specially crafted URLs likehttp://example.com?page[]. Furthermore there is way to put a few thousands DB log entities at once with a single HTTP request (#2864933: ContextualController::render() should validate incoming IDs).Comment #12
Pascal- commentedA workaround that "fixed" the issue for me, is disabling the cache for views with large amounts of data.
Both in D7 and D8 I had views that were only accessible to admins with large amounts of data, those were causing the database to fill rapidly.
In the views settings, I've disabled the cache and that fixed it.
Comment #18
gagarine commentedComment #20
ressaI tried this (I am using Facets module, which requires this) but the View is still getting cached, and the
cache_pagetable is getting views entries, which it should not.I then tried disabling cache for everything under Performance (
/admin/config/development/performance), setting "Caching | Browser and proxy cache maximum age" to<no caching>but even then, some entries were made in the cache table ...I tried searching for
drupal views "Caching:None" ignoredand found the page Disable the cache for a specific view and tried uninstalling Internal Page Cache (page_cache) and Internal Dynamic Page Cache (dynamic_page_cache) modules, and finally, the View was not getting cached. But caching was pretty much disabled for all pages then.Luckily, the CacheExclude module was also mentioned on that page. I tried it, and it works perfectly: After adding the view paths
/mypathand/mypath/*to exclude also sub-pages, the view is not getting cached at all.Comment #21
prudloff commentedWe also got bitten by this on a website where bot requests expanded massively the size of the SQL database.
The database_cache_max_rows setting helps with this but it is not enforced in real time.
Our current solution to this is :
Comment #22
gpotter commentedLike @prudloff, we ran into the same issue.
The basic example is any controller that returns a render array. If a bot hits that route with unique query strings, it will create a new cache_page record, if that bot hammers on the page consistently, you end up with tens or even hundreds of gigabytes of data because the entire page is stored with the same content. This is namely the Internal Page Cache module that is causing the issue.
"database_cache_max_rows" doesnt work like prudloff mentioned because those max rows are only restricted on a cron run. So by the time the cron runs the site dies because there are too many records to restrict and delete old records.
We had similar solution ideas with Redis cache, or rate limits. Rate limits is a bit concerning of a solution as it could potentially rate limit a legitimate web crawler.
The primary issue on our server is the site would die on cron runs, so we clear cache before a cron run nightly. A cache clear runs at a good speed versus the db row constrain from cron. Probably because a cache clear is a simple quick truncate of the cache tables.
Comment #23
gagarine commentedThis issue has been publicly known for years and still poses a serious risk.
In my opinion, the core problem is that adding irrelevant query parameters still returns a valid page. Instead, Drupal should return a 400 (Bad Request) response and avoid caching the page when “fake” query parameters are injected.
As it stands, adding arbitrary query strings (e.g., ?test, ?foo=bar) bypasses the page cache and triggers full page generation. An attacker can exploit this to cause a denial-of-service (DoS) by flooding the site with requests using unique query parameters, quickly overwhelming the backend.
What’s worse is that Drupal treats these URLs as legitimate and includes them—with the fake query strings—in the HTML it returns. The content is therefor not identical of the page requested on the normal URL.
This should absolutely be treated as a critical security issue.
Comment #24
cilefen commentedIf this is critical, are you saying that #2526150: Database cache bins allow unlimited growth: cache DB tables of gigabytes! does not work, or that it is ineffective?
I completely understand that even with row-limiting, an attacker can essentially replace useful cached objects with useless ones, and that Drupal could do better here. But can those outcomes render a site unusable or cause database deadlocks? Possibly.
Comment #25
catch@cilefen
It's known to be ineffective because it relies on cron trimming the table size. This means that if the table is growing to ten times the row limit in-between cron runs, you not only have tables growing very large in-between cron runs, but also the cron logic has to delete a lot of rows. It doesn't mean the logic is 'bad' as such, it's just not a replacement for a 'real' cache backend like Redis or Memcache which can prune items more efficiently, both the pruning and the deciding which ones to prune.
@gagarine
This would require an 'allowed query parameters' API. But that API would have to validate both the parameter names and the values. And it couldn't just do that, because ?pager is a valid value and would have an integer range of 0-PHP_MAX_INT (or some arbitrary configured value). And unless it could be reliably implemented to validate the query parameters on a route-specific basis (which may not be the controller, it could be any block, JavaScript, 'destination' handling for a form etc.), then that still allows more than enough combinations of query parameters to create a very, very large number of cache entries. It would require more knowledge but it wouldn't prevent anything. And it wouldn't be possible to issue a 400 until Drupal 12 because loads of contrib modules would break.
One possibility here is to just completely prevent caching if there are any query parameters in the URL, those requests would have to rely on dynamic_page_cache instead.
Where are the query strings included in the HTML returned by the page?
Comment #27
catchMR will fail at least one method in PageCacheTest if not elsewhere, because that test method relies on caching pages with query parameters. It would have to be switched to using key value or something.
Comment #28
gagarine commentedCould the value be validated by the module that say they need this parameters? The validation may require a couple of DB queries, but it could be much lighter than building a full page.
The full URL is added in form action and in the pages's JS.
If this was not the case, we could use a checksum to avoid duplication in cache. But adding a parameter will still allow to bypass the cache.
Comment #29
catchFor views, which is the most likely way to end up with a pager query parameter, you would need to build and execute the entire view to validate the parameter to find out how many possible pages there might be - pretty much have built the page at that point.
That's by design - query parameters can be used to pre-fill form values and other things which affect how the form is built, also for the ?destination parameter.
#2503429: [PP-*] Allow both AJAX and non-AJAX forms to POST to dedicated URLs might allow the form action to be consistent each time, but that issue has had very little progress in the past 9 years because it will be incredibly hard to implement.