Problem/Motivation
From the DrupalCon Barcelona Hard Problems Meeting on performance:
- ImageStyle/derivative stuff aggregate generation - remove state gets, file_exists(), cold cache memory/cpu
- Cron subscriber: state get/config get on EVERY request, we can only get rid of the config get by moving it into a module; we’ll still need a config get, but then you can at least avoid the config get if you uninstall that module https://www.drupal.org/node/2507031
- Alex: can’t we cache that?
- Berdir: No, wasn’t able to do so even after 6 months of trying. I used a cache collector, but that didn’t work out.
- Patch: https://www.drupal.org/node/2575105
Proposed resolution
The state service / \Drupal\Core\State\State extends \Drupal\Core\Cache\CacheCollector. This results in less queries to the database as a single cache entry stores commonly accessed state items.
Most code will not need to make any changes.
NOTE: due to a mistake proper discussion of the issue takes place in #2 -> #35 and #65 and after.
Remaining tasks
Decide on approach. See #136 and #137
User interface changes
None.
API changes
\Drupal\Core\State\State::__construct now requires a cache backend and the lock service to be injected. Support for calling this without those services is removed in Drupal 10.
Data model changes
None.
Release notes snippet
A new $settings['state_cache'] setting has been added, which should be set to TRUE to improve performance, but can be set to FALSE if any modules are storing large amounts of data or frequently changing data in the state store. In Drupal 11, the setting will be removed and the state cache will be permanently enabled. See the change record for more information.
| Comment | File | Size | Author |
|---|---|---|---|
| #144 | interdiff-133-142.txt | 800 bytes | quietone |
Issue fork drupal-2575105
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
- 2575105-use-cache-collector
changes, plain diff MR !5636
Comments
Comment #2
berdirFirst patch. Manual testing is working fine, lets see how many tests are failing.
Tests *will* fail due to stale caches in the parent site.
Comment #4
berdirThat's not a lot of fails and only within tests in almost all cases.
Weird, almost makes me wonder if this is actually working ;)
Comment #6
berdirIt is working, the reason we don't have test fails is because we already have a static cache that we invalidate after every web request.
The race condition that I was fighting with a long time ago in the original issue seems to be gone, yay. To be sure, I'll do 10 or so test runs after the sprints are over.
I did some profiling, but the absolute numbers don't really change that much, due to mysql cache and so on. But we are trading 5 db queries on the frontpage for uid 1 with one call to apc (I'm using the bootstrap bin now, not 100% sure about that).
The risk of course is that we end up with a lot of data in the cache collector that we don't really need on most requests, especially if state is used incorrectly.
We already have quite a few big variables in there, like the admin route list (which is ~80 with just core and 200+ on my site) and the theme extension objects.
I also found #2154461: Consider persistent caching and/or CacheCollector for state system again, where we thought about using the cache separately for each item. With bootstrap/fast chained, that might also be a valid option and less risky if state is mis-used. Thoughts?
Comment #7
berdirI did the profiling (as I wrote, not much to see, so no stats)
Comment #8
wim leersLooks wonderfully simple!
Is some of the State used even on non-HTML responses? If it is, then I think
@cache.bootstrapis fine. Otherwise, I think introducing@cache.statemay make sense?Interesting. Let's document why.
I think the original was correct.
Comment #9
wim leersSo over there, this was said:
It seems like the potential for damage is relatively big. If even a single contrib/custom module creates big State entries (think hundreds of kilobytes or maybe even megabytes), then every page is slowed down massively. With the chained inconsistent consistent cache backend that uses APCu, it seems like it may be better to have multiple small I/Os instead.
Comment #10
berdirFixed the nitpicks.
On separate cache entries vs. cache collector and which cache bin to use, I discussed that with different people including @alex.pott, @Wim Leers and @dawehner.
The thing is that it is a) easy to use a different cache bin simply by changing service arguments and b) also possible to write a different implementation and use that instead.
We want to optimize for the 90% use case, where we assume that people will end up using really problematic modules since contrib modules should hopefully get fix and improved by someone if they do it wrong and many might also not have APCu. It's also possible to do it badly with separate cache entries, if you don't have APCu and are doing many different state get calls per page.
Therefore, the decision is to use the cache collector approach with the bootstrap cache bin, since even fast chained has a single cache get to check if the cache isn't stale, so re-using that makes sense.
We will look into moving some parts away from state even, e.g. the non admin routes since that is now cached and not needed on every or even many requests, but that's a different issue.
Comment #12
wim leersThis is why it failed :)
Once that's fixed, this is ready.
Comment #13
berdir...
Comment #14
wim leersImproved the IS.
Regarding profiling: see #6.
Comment #15
wim leersComment #16
berdirTo provide some more numbers, I also checked a production site in new relic, and over the last 7 days, new relic reports ~10 queries to key_value on a node page, most if not all (on that version) are state, and one query takes about 0.7ms. It also reports ~1200 queries/minute against that table, so most of them should go away with caching this.
Comment #19
wim leersI asked testbot 3083 to be killed.
Comment #20
catchWe're still using state in https://api.drupal.org/api/drupal/core!modules!field!field.purge.inc/fun...
I'd be happier if we moved that to k/v first, or at least made an explicit decision that deleted fields in the bootstrap bin is OK.
Comment #21
wim leersSo, catch, I think you're saying that the rationale given in #10 by Berdir is insufficiently convincing?
i.e. you feel that the "deleted fields state" must be fixed before this can go in, because it could pollute things too much. But like Berdir said in #10 already, we already have the
routing.non_admin_routesstate entry, which we should also remove.Deleting fields feels like a relative edge case, and can be refactored/improved independently of this issue. Furthermore, if I read https://api.drupal.org/api/drupal/core%21modules%21field%21field.purge.i... correctly, then it looks like this can only ever temporarily "polluate" state: while the field is being purged. So it's also self-recovering. Which — if correct — is another reason to do that in a separate issue that this issue does not need to be blocked on?
This inclines me to RTBC it again. Feel free to re-un-RTBC — I may be missing something :)
Comment #22
catch@Wim I think it's convincing for contrib, but only if core sets a good example as to what's in state vs. not.
We could remove the caching of non-admin routes and just get it from this cache instead.
Comment #24
fabianx commentedI have a way to resolve this issue in a BC-compatible way with clear expecations set for contrib:
- Create a decorator, which decorates the state service (instead of changing it)
- Call this decorator state.cached
- Use state.cached instead of state for core services that only store small things
Profit!
Comment #25
fabianx commentedComment #28
berdirHere's a reroll for now. Will try the separate service idea but that will result in a *way* larger patch, with more conflict potentation.
Comment #30
berdirLets try again for 8.3.x
Comment #31
berdirComment #33
fabianx commentedNice work! :)
Comment #34
berdirI've been thinking about that state.cached idea, but I'm not convinced. How do we keep those services in sync? What if someone writes a key directly through state and then we don't see that in state.cache and don't invalidate the cache?
I was thinking if we could maybe instead introduce a setting (container parameter?) to enable or disable caching completey.
That said, I looked at some of our sites and it looks all pretty OK to me. What's interesting is that the by far greatest key in there is locale.translation_status. And that's a list of information for each project and it's only something that's needed in the backend while checking/updating locale data. And we save it per project. So it seems like it should be a separate key value collection IMHO. I'll open an issue for that. Not sure if that is an API or not, but IMHO not. There are API functions to get the data.
The total lenght of all keys is 144340 on one project, with 37 keys, the biggest are:
Another project is 136532 total length, 87 keys, biggest keys:
Comment #35
berdir#2834580: Move locale.translation_status state key to a separate key/value collection
Just because the 2-3 projects that I checked don't have big data in there of course doesn't mean that there are no sites that do. We could make my suggested setting opt-in or even add a requirements hook or so if it uses the default backend to warn about big data in it.
Comment #39
catchThis got committed by mistake here:
But as far as I can tell, the State cache item never gets set. Moving to critical bug for a revert.
Comment #40
catchPatch doesn't reverse apply any more, here's a revert patch based on #30.
Comment #42
berdirHu, that's weird, wasn't aware that we did that accidently, sorry about providing a messed up patch in the other issue :)
> 18:50:39 Fatal error: Uncaught Error: Cannot access parent:: when current class scope has no parent in /var/www/html/core/lib/Drupal/Core/State/State.php:101
Comment #43
catchOne line wasn't fully reverted.
Comment #44
catchBetter when you create the patch properly too.
Comment #49
amateescu commentedThis patch is causing bugs like #3007827: The state cache is not cleared when setting a state, here's a revert patch created manually.
Comment #50
amateescu commentedThe revert works fine, let's get it in so we can continue working on the original patch.
Comment #53
catchCommitted/pushed the revert to 8.7.x and 8.6.x.
Back to CNW.
Comment #54
alexpottThe revert has broken PHP 5 testing :(
core/modules/system/tests/src/Functional/System/ShutdownFunctionsTest.phpnow errors withComment #55
alexpottThis fixes it. It might be a separate issue that this revert has exposed. Before the revert there are two callbacks before the callbacks start being called. _system_test_first_shutdown_function and the lock release callback. After the revert there's only _system_test_first_shutdown_function - in PHP 5 seems to be a bug with a foreach where the array being looped over changes. See https://3v4l.org/le78R
The fix is to use while and next for a more reliable process - see https://3v4l.org/clIom
Comment #56
andypostIt also needs to prevent serialization
it brings state here, so needs __sleep()/__wakeup() to prevent static cache from serialization
Comment #59
xjmBaffled as to how the recent commit exposed this, but since HEAD is busted, I just went ahead and committed #55. Leaving this open for #56 and for followup investigation or a proper fix since the patch in #55 changes code that hasn't been touched in a year. Maybe it should be opened as a separate issue?
Comment #60
alexpott@xjm yes it it baffling. I think it's another thing we broke when we converted code for the
each()deprecation. But the break was hidden because other changes including this one meant that we had more than just the_system_test_first_shutdown_functionin$callbacks. This meant when_system_test_first_shutdown_functionadds yet another callback we weren't at the end of the loop and the test didn't break. When this issue was reverted it became the only callback and therefore it broke as per https://3v4l.org/le78RI ponder given the random set of things that has happened to this issue if we should close it and start with a fresh issue.
Since we're now back to where we were before this got erroneously committed as part of another issue (see #39) setting the issue metadata back to where it is.
Comment #61
alexpottI've opened #3008140: Improve shutdown function testing to add test coverage so there's additional test coverage so we don't get caught out by #55 again.
Comment #65
berdirLets see if we can revive this. Maybe we should do a new issue as this is indeed pretty confusing with the reverts and bugfixes being committed there.
Comment #66
berdirAdding BC.
Comment #67
berdirI suspect that the CLI problems reported by #3007827: The state cache is not cleared when setting a state are related to #2272685: Incorrect handling of invalid cache items in apcu backend. cache collector is the only thing in core to actually rely on cache invalidation and the existing ones probably don't use a fast chained cache backend. And cache invalidation with the apcu backend is a mess, to put it carefully.
As a workaround, switching to a different cache backend that doesn't use fast chained. means an extra cache get from the database/redis/..., but state is fairly frequently invalidated, compared to other bootstrap things, so might actually not be a bad idea anyway.
Comment #69
berdirWell, that is stupid. I forgot the call to persist() even though I likely wrote that documentation myself that you have to do that (although I don't remember why we don't do it by default. I was 7 years ago...).
While the apcu thing didn't help, this is the real problem. I'll see if I can somehow write a better test for this.
Comment #70
berdirHere is a unit test that ensures that you can update values that already are in the cache. I realized that there are two conditions for this bug to happen that might be common during cron jobs but not so much in tests:
a) The value must already be in the cache, so that we don't trigger \Drupal\Core\State\State::resolveCacheMiss() which would mark it for writing, whether or not it changes during the request
b) At least one other miss/set/delete must happen in the same request, otherwise \Drupal\Core\Cache\CacheCollector::$keysToPersist is empty, the cache had been invalidated, nothing is written and it will build it again from scratch on the next request.
I'm testing both cases for b), to also make sure that the cache is updated if nothing else changes.
Attaching a version of the unit test with commented out persist to show that it catches the bug and an interdiff with just the test.
Comment #71
berdirA patch for Drupal 8.9.
Comment #74
berdirComment #75
berdirnamespace here is wrong.
Comment #76
berdirFixed that.
Comment #77
alexpottThis change feels quite impactful. Is there not an argument for adding another tag - something like
needs late destructionI worry that something is contrib is expecting to come after this. Like will this break supercache? http://grep.xnddx.ru/search?text=onKernelTerminate&filename=
The format for these is @trigger_error('Calling ' . __METHOD__ . '() without the $image_factory argument is deprecated in drupal:9.1.0 and is required in drupal:10.0.0. See https://www.drupal.org/node/3173719', E_USER_DEPRECATED);
Comment #79
berdir1. Trying to track down why exactly I made the change and -100, but it was in the first patch already, I guess I didn't pick -100 for a reason. You're right, this is a tricky change.
That said, your search term only finds methods that happen to have the same name, but there are plenty of other subscribers if you search for http://grep.xnddx.ru/search?text=KernelEvents%3A%3ATERMINATE&filename=. I would assume that most of those don't have dependencies between themself, but who knows.
instead of a second tag, maybe an attribute on the tag could work too.
I did debug again why I did this and the reason is \Drupal\automated_cron\EventSubscriber\AutomatedCron, that also has a prio of 100. So we could also "fix" at least core by setting automated cron to 101. The question is what's more likely to break?
a) Something not fully working because it calls state *after* \Drupal\Core\EventSubscriber\KernelDestructionSubscriber and then at least caching of that is messed up (it would never end up in the cache if only called there)
b) Something has to be called *after* this or another destruct event.
I'm honestly unsure. I think a) probably has a smaller impact, if something later on indeed uses state, worst case is that it won't benefit from the new cache. But it might be hard to improve that then in the affected custom/contrib code. b) has higher risk of breaking something but more "potential" improvement.
Given that, for minimal BC impact, we could set automated cron to 101 and discuss moving the kernel destruct to later by default or offering a late destruct option in a follow-up?
Comment #80
alexpottI think changing automated cron to 101 is a good idea. And opening the follow-up to discuss late destruction / changing the priority.
Comment #81
alexpottPatch addresses #80, #79 and #77.
Comment #82
alexpottCreated the change record https://www.drupal.org/node/3177901 and updated issue summary.
Comment #83
berdirOne thing that we haven't discussed anymore is the risk of possibly very large state items. Before that was a minor issue as you only had to load it when you used it. But now it will sooner or later end up the cache and we have to load it on every page.
I guess at a minimum, we should document in the change record that state should be used only for smaller stuff. The issue summary originally used a state.cached service, so the collector usage would be fully optional, but that would require additional changes and at least writes to regular state would also need to invalidate the cache.
One alternative idea I just had was to skip caching large entries automatically, we'd just ignore them and don't call persist() on them, just pass them through. Basically a strlen() > 10/100kb or so? I think the overhead of that would be fairly minimal.
Comment #84
alexpottSeems like a good idea. Maybe this could be baked into the cache collector pattern? Like isn't this true for all use-cases?
Comment #85
andypostComment #86
daffie commented@Berdir said:
@alexpott responded:
Comment #87
berdirI thought about that, but that's not really possible right now. Subclasses are responsible right now to call persist(), and within persist, we no longer know the size. At the same time, it is a bit awkward because writing the test made me realize that we can also not call invalidateCache() then. Because if an item is big *and* being set, we would invalidate the cache for no reason. IMHO also shows again that this whole thing with forcing clients to call persist just doesn't really seem like the best idea :)
So I think we need to do it in State, but then we can also set it to 10k characters whereas other implementations might want to use a higher limit (could be done through a property).
Also did some slight refactoring to have less duplication by reusing the setMultiple()/deleteMultiple methods.
Maybe create a follow-up to automatically persist() including the a max size implementation that could be set/overridden in the subclass?
Comment #88
berdirSomething failed very badly :)
Comment #89
berdirhuh, so you are telling me that not everything in state is a string? But that assumption worked so well in my unit test... (Never trust a unit test)
Not sure if there is a faster way to get the size of an array or object than serializing it. This means an extra serialize call which is not so great. Only on a set/cache miss, but still.
Comment #90
jibranIsn't discover and default two separate bins?
Comment #91
berdirYes, leftover, I switched over to default to avoid bugs with the apcu backend, forgot to update it for the BC fallback.
Comment #92
berdirFixed that.
Comment #93
jibranThanks, a couple of more observations:
Comment #94
berdir> Do we want to collect some before and after stats here?
We have this in production on lots of sites, some of them very large, I don't have any exact numbers, but between this and #3131585: Performance regression caused by using the last installed entity definitions you can expect that queries against key_value will basically vanish for regular requests unless you have some other module that queries key value directly. And since I fixed the cache inconsistency issue, we haven't had any issues with it.
Comment #95
alexpottThis bit feels really tricky. The most likely stuff to fall into this size restriction is the entity type info... here's the rows greater than 10000 on a real site I'm working on which has plenty of custom entity types:
/me wonders how often routing.non_admin_routes is used...
Comment #96
berdir@alexpott: This issue/limit only affects state, other key value collections aren't cached, not like this.
Comment #97
berdirBut still, thinking about it more, "Only on a set/cache miss, but still." means that if you have something above the defined size, it will be a cache miss every time it is requested.
And your non admin routes thing is large and I think used a lot. inspecting, I can see that it has its own cache wrapper around state, so it's actually not that much of a problem: \Drupal\Core\Routing\RoutePreloader::onRequest. Still, the state cache means we could drop the extra cache wrapper. But only if we _don't_ limit the size. Tricky :)
Might be worth inspecting a bit what's in there though. I wondered before if the implementation there is too simplistic, I'm pretty sure there are a _ton_ of routes in there that are actually never or hardly ever needed for link access/generation.
Comment #98
catchCould we cache the size that we calculate in the cache collector, with a special format or whatever so it's distinguishable from actual data? Then rather than recalculating each time we'd just know we have to look it up. Then it really would only be done on cache misses.
Comment #99
berdirHm. interesting idea, but I think tricky with the current setup. I think that could only be done in the actual logic that does write the cache, but then we would have to do it in the CacheCollector base class (or duplicate it into State). But as mentioned, the decision *if* something should be persisted is done in the subclasses, including immediately invalidating the cache in case of a set(), so if we'd change persist() to ignore known-large-items, we'd still end up invalidating the cache.
Comment #100
alexpott#3120301: RoutePreloader: prevent preloading of routes generated by JSON:API fixes the massive route preload entry for me. So now this issue will add a double cache.
Comment #101
berdirGiven my thoughts about the length checking overhead on every load possibly being worse than just caching it unless it is _really_ large, but then you might have different issues anyway, I'm going back to an earlier patch (#81).
> So now this issue will add a double cache.
Given that the related issue is now fixed. lets fix that problem by just removing the separate RoutePreloader cache then? That means an even bigger benefit, because we save a separate cache get.
For a second, I was worried about the routes cache tag there, but that's not an issue at all, the state cache already invalidates itself if it changes. That made me wonder if we should optimize to not write the still fairly large state key if it didn't change (chances of that on cache clears and redeploys seems large and we likely already have it loaded/in cache now), but that's out of scope.
I did have another look around some of our larger sites and since we don't use jsonapi, the routes preload entry is managable even without that core key, the biggest thing I found was a system.js_cache_files state entry with 87kb, that's an array with 620 entries. The same site has one instance of state misuse where it uses it as sort of a cache with dynamic (date) keys, so there are in total 6k keys and close to 1M of data in there, but the cache collector approach will ensure that it doesn't need to load that much. The current cache entry length there is 150kb as a serialized string.
Comment #104
berdirThis seems to be blocked on #2363351: Cached services can't be used in service providers/modifiers now.
Comment #105
catchComment #106
berdirReroll.
Comment #109
quietone commentedFound this because this is an issue that has already been committed but still open. In fact I see two commits.
I agree. (From #65)
Not that I recommend continuing in this issue, but if it makes sense to do so, can the Issue Summary be brought up to date?
Comment #112
berdirReroll for D10.
Comment #113
pradhumanjain2311 commentedTry to fix cs errors.
Comment #114
berdirAnother reroll.
Note: The conflicting commit has been reverted on 10.0, use the patch above there, this is for 10.1.
Comment #116
berdirSorry for the noise, another conflict on 10.0.
Comment #117
berdirAnd a reroll for 10.1, same conflict for 10.0 and 10.1, but the patches are elswhere not the same. #2363351: Cached services can't be used in service providers/modifiers is close now, once that's in I will check if there are any test fails left.
Comment #119
mathilde_dumond commentedversion of the patch from #106 without the tests, to have a patch that applies on D9. test changes are excluded to avoid rerolls for D9
Comment #120
mathilde_dumond commentedsorry about the empty patch, here is again patch 106 with no test
Comment #121
berdirAnother reroll. There's no point in updating the deprecation message versions until the blocker is resolved.
Comment #123
berdirAnother reroll. 10.1 is busy.
Comment #125
berdirNote for people following along here. #2363351: Cached services can't be used in service providers/modifiers is now a hard requirement to use this on 10.1 due to the new DevelopmentCompilerPass that use state. I'm not sure if state is the correct thing to do use there but for now, you need the current patch from that other issue to be able to use this.
Comment #127
berdirHere's a combined patch with the one from the issue. I'm seeing some very weird things on 10.1 with Symfony 6.3 with this (something about Ghost/Lazy services and missing properties). Lets see what testbot thinks.
Comment #128
berdirCan reproduce the errors also with just core on update.php. It's going through DevelopmentSettingsPass as expected, that's not visible in the backtrace though. Error looks like this:
Error: Call to a member function insert() on null in Drupal\Core\Lock\DatabaseLockBackend->acquire() (line 71 of core/lib/Drupal/Core/Lock/DatabaseLockBackend.php).
Somehow it doesn't work to properly initialize a Ghost object so early and then it explodes.
Workaround is to not use the state service there but use the key value store directly and bypass caching and CacheCollector stuff. Not pretty but avoids the issue and actually kind of makes sense as we are not interested in persisting those values in the cache.
(edit: wrong patch name, but the contents should be correct)
Comment #131
berdirAh, much better. Fixing that unit test, then this should green, but still blocked on the cached services issue, even the workaround for DevelopmentSettingsPass as there are some tests that break.
Comment #132
smustgrave commentedBased on your last comment this is postponed #2363351: Cached services can't be used in service providers/modifiers
Comment #133
berdirThat is in, rerolled. Conflicted on the deprecated state stuff.
Comment #134
smustgrave commentedDeprecation messages need updating. Same for CR please
Comment #135
quietone commentedAll I did was to update two trigger_error messages.
Comment #136
berdirI'm slightly torn on this. I strongly believe that this has considerable performance improvements, we're seeing large reductions of query counts with this on production sites.
That said, there are definitely cases where cache is _not_ required and it's wasteful to load things on every request that are only needed on cron jobs or even cache rebuils. See the changes in #128. I'm wondering if we should instead introduce a new service that wraps state and make caching an opt-in. It would implement the same service, so it's a drop-in replacement. Not sure about \Drupal::state(), could be a new method as we want to promote it IMHO, or a new parameter.
That would deal with concerns about sites/custom code that stores a lot of data in state that is not frequently. Thoughts?
The downside is that if a key is inconsistently used, sometimes with cached state, sometimes not would result in cache invalidation issues. That's technically also the case when bypassing the state service and using keyvalue directly with the current patch.
Comment #137
catchShould we be moving some keys out of state to a separate collection? I was thinking about the recent development settings page since that's only accessed on container rebuilds. Not sure what that collection would be called though, like state-2...
We're comparing this to Drupal 7 variables which loaded the entire table of state + config on every request so we're already quite a bit more efficient, but yeah it does seems a waste.
Comment #138
smustgrave commentedSo this issue doesn't stall should this be NW for #137?
Comment #139
catch@smustgrave if we did #136 it would be needs work for that, but if we do #137 it would not, instead we'd open a spin-off and go ahead here.
Comment #140
smustgrave commentedTo keep the issue from stalling going to mark it then. If we need a follow up for #136 can do that for everyone here.
Comment #142
quietone commentedI am reviewing issues that I have bookmarked that were committed and then reverted.
As I read #136 and #137 there is not yet agreement on the approach so setting this to needs review for a decision on that point.
There are failing tests and this should be converted to an MR. I will do that now.
Comment #144
quietone commentedThere was a conflict when applying the patch so here is the interdiff.
I am also tagging no-needs-review-bot to emphasize that a decision should be made before working on the MR.
Comment #145
smustgrave commentedQuestion who can make the decision though?
Comment #146
smustgrave commented@berdir or @catch could you make the calls the remaining tasks? Not sure who can.
Comment #147
catch@berdir's idea is to opt-in individual keys for caching.
My idea is two have two services/collections - one for cached state things, one for infrequent.
I think two collection is more robust - no chance of something being requested both cached and uncached, but we need to name it, and we also need to decide whether the new or old collection is the one that gets the caching.
Once possibility would be state vs. state_cached (or state.cached) and move the frequent stuff over to it.
Comment #148
berdirI don't think I was thinking of doing it opt-in per key, I also assumed we have two state services, not two collections though. Although I suppose that does kind of make it per-key.
I'm not sure I get how exactly we would "move" stuff over. Wouldn't that essentially be an API change? Lets take the maintenance check for example. If we just switch core over, then other modules and for example drush that check the old one will be broken as that will just not be there anymore. We'd need a BC layer on the non-cached state (which I assume would be the current state collection) that redirects keys to the cached version, so we need to somehow give it a list of all those keys, which sounds complicated.
The thing is, we already have non-cached collections, that's every other collection stored in keyvalue storage. If you don't want to have it cached, use \Drupal::keyValue('yourmodule')->get('foo'), that's just a few extra characters (DI is a bit more involved with the factory). The only reason we'd bother with a cached and non-cached state version is it wasn't cached until now and there might be sites out there that have all kinds of things in there due to custom and maybe some contrib modules using it ways it's not really meant to be used.
Of course "the way it's meant to be used" is kind of being changed here, no existing documentation clearly says it should only be used for small bits of information and limited cases, it's just somewhat implied by the examples. Also, the fact that it uses the key value store and is "just" a specific key value collection and the key value API itself doesn't seem to be actually documented anywhere (except the query to get all keys on the documentation page). Neither on https://api.drupal.org/api/drupal/core%21core.api.php/group/info_types/10 nor is there a key value API on https://www.drupal.org/docs/develop/drupal-apis, and https://www.drupal.org/docs/8/api/state-api/overview (which I think is very confusing, especially the reset part).
Anyway, a counter-proposal/idea to the two different things idea:
* We introduce a setting that in D10 defaults to FALSE, so you have to opt-in to get the cache, with a requirements check that points out if the setting is not explicitly set and recommends to either set it to TRUE or FALSE explicitly
* in D11+, we switch the default to TRUE.
* We improve the state documentation to make clear in which cases it should be used and when it's better to use your own key value collection
* Consider moving some examples in core to a different key value collection to have fewer bad examples, coming back to the BC concerns though. I don't think we need to provide BC for example for the twig development mode setting though?
(Also setting to needs work so that @smustgrave no longer needs to be sad about this being stuck in the needs review queue ;))
Comment #149
berdirif we add the setting, what are we going to do with this bit? drop the cache, sites with disabled cache will just get an extra query? Add extra complexity and do a check for the cache setting, only doing it if global state cache is disabled?
Comment #150
catchThis sounds good and I also don't think we need to provide BC for that one (hopefully none of the ones we move, but definitely not that one).
The $settings opt-in also sounds like a good plan.
#149 I could go for either of those approaches, not sure which is best, but at least it'll be temporary.
Comment #151
berdirIt's a bit ugly, but I implemented the setting now in State, have to override a bunch of methods to avoid unnecessary cache get/set/invalidate calls.
This should have an effect on the new performance tests in core, haven't looked much at them yet, I guess that's going to change now.
Comment #152
catchFrom https://git.drupalcode.org/project/drupal/-/merge_requests/5636/pipeline... looks like we're saving approx 10 database queries on nearly every page request, which is great and what I'd expect based on how many state checks we do usually.
Also there's a possibility it's a fluke since occasionally we get runs this fast already, but the full test run was 7m53 seconds with all functional tests finishing under 4 minutes - it may be enough of a change that it's making test runs faster overall. Shouldn't get too excited until that's happened a couple more times though.
Yeah I think that's fine - we can add a change record and the specifics of whether and when keys are set aren't part of the public API, obviously if contrib is actually relying on something a lot we'd need to think about it more carefully, but we can do that case-by-case.
Comment #153
berdir> From https://git.drupalcode.org/project/drupal/-/merge_requests/5636/pipeline... looks like we're saving approx 10 database queries on nearly every page request, which is great and what I'd expect based on how many state checks we do usually.
Yes, saw that too. I was confused about the increase of queries in that umami test though, I need to look at what that exactly does, I wonder if there's some timing issues/state cache race conditions as it does multiple requests in quick succession.
Reproducing that manually in blackfire with a user without any permissions in umami gives me _very_ different numbers and what I expect. -4 state queries, +1 cache_default query, 33/3 in total. And not 17/18.
I think now that it's opt in, it would make more sense to move this to cache_discovery so we use the fast-chained backend for it? That should save another query then.
Comment #154
longwaveChecking on two separate production databases that I work on:
Looks like simple_sitemap should be storing whatever that data is somewhere else.
Comment #155
catchOne thing to note is that cache requests in the performance tests include those from chained fast, so if blackfire is ignoring those, it'll be a huge difference.
Seems sensible to me.isn't it using bootstrap already? If so that's already using chained fast.Comment #156
catchRefreshVariablesTrait calls $state->resetCache() which in the MR calls clear() which resets both the persistent and static caches, it needs to be ::reset() instead. Pushed a commit. Test should fail in the right direction now.
Comment #157
catch#156 makes the change in the performance test look right to me, however we're seeing other test failures in runs which aren't consistent. I think the issue is that any test using state to track.. well.. state between requests, now needs to either switch to WaitTerminateTestTrait or to use the key value service directly without the state wrapper - because the cache will be updated in terminate. Pushed a commit for LanguageNegotiationContentEntityTest which was failing locally for me.
Before I found this, I thought I found a race condition in the state implementation - the cache entries were being invalidated before key value is written to. That's not the cause of the test failures after all, but I still think it's a valid change.
Comment #158
catchGot to a green test run, but had to do some horrible things in
LanguageNegotiationContentEntityTestand some of the test failures are random so could still be a couple of false positives.I think the proper approach to those is going to be switching from state to raw key value in the test modules/test classes, if we do that we can probably undo TerminateWaitTestTrait. However this at least confirms that all of the test failures are due to test and tested site going out of sync when explicitly using the state API to track/alter things in the test/test modules themselves, and not real bugs caused by the change at all.
Comment #159
smustgrave commentedAny recommendation for reviewing this one?
Comment #160
catchI think the main remaining things are:
1. We should open issues for taking things out of state and into key/value directly which are accessed extremely rarely. The new development settings stuff that's only on container rebuild / form save are the obvious ones, there are probably more though.
Also an issue against simple_sitemap would make sense too.
2. Decide whether the changes in core/modules/language/tests/src/Functional/LanguageNegotiationContentEntityTest.php are acceptable here with a follow-up to move the test modules to use key/value, or whether we should refactor those test modules to use key/value in this issue. This also goes for the various test modules used in the tests that are getting
WaitTerminateTestTraitadded. I could go either way here - refactoring the test modules adds scope, but it'll be less churn because we then wouldn't be adding WaitTerminateTestTrait and removing it again.Tagging needs follow-up since we need that for #1 even if we need to make a decision on #2.
Comment #161
berdirDid a rebase after the query logging changes. Ignored all changes to StandardPerformanceTest and OpenTelemetryAuthenticatedPerformanceTest, that will need to be redone completely.
Couldn't run either test on my current system, both fail with "Test was run in child process and ended unexpectedly", will need try somewhere else.
Comment #162
kristiaanvandeneyndeI'll take care of the performance stuff real quick
Comment #163
kristiaanvandeneyndeWent green on my local (the performance stuff)
Comment #164
berdirThanks, that looks great.
One thing I find a bit interesting is that the cache get count goes up. But we're also removing a cache get here in RoutePreloader, does that not run on these requests? Maybe because all these test scenarios actually hit dynamic page cache.
Would be kind of nice to have a test that doesn't use that as that's hiding _a lot_ of stuff, but I guess it could also get a bit overwhelming, that's going to be a lot of queries I guess. and not in scope of this issue.
Comment #165
catch@berdir see #3414373: Route preloading doesn't run on dynamic page cache hits.
There are some cold/lukewarm cache tests in the Umami performance tests which don't do query assertions yet, I've been hesitant to add assertions to them while we had so many random test failures, but we should probably try that now. This isn't the same as no dynamic page cache at all though so we still might want that too.
Comment #166
kristiaanvandeneyndeI would assume the extra get is coming from the collector going to the cache once to get all of the state stuff, whereas before the state stuff were non-cache-related queries, which showed up in the list of individual queries we're tracking.
Quick guess: The cache get that is saved in RoutePreloader shows up in those test cases where the cache get count does not go up by one. The ones where it does go up, RoutePreloader doesn't save a cache get and state introduces one extra get. Just a guess, though.
Edit: Cross-posted with @catch and my guess seems to match what he's saying.
Comment #167
andypostFixed tests after #3113971: Replace REQUEST_TIME in services
Comment #168
kristiaanvandeneyndeLast fail is Drupal\Tests\media\FunctionalJavascript\MediaSourceOEmbedVideoTest::testMediaOEmbedVideoSource, I'm not getting any wiser from the artifacts. I'm not seeing anything related to media in the changes either.
Comment #169
catchWorth running the job again to see if it's an existing random. I don't remember that one specifically but there's been a lot of random media test failures.
Comment #170
andypostThe test reported already but only once in #2829040-188: [meta] Known intermittent, random, and environment-specific test failures and error is different, so maybe it's real issue with redirect?
Comment #171
andypostit's because of error
The provided URL does not represent a valid oEmbed resource.for https://vimeo.com/7073899Comment #172
andyposttraced down the test locally (reproducible) til
\Drupal\media\OEmbed\ResourceFetcher::fetchResource()cURL error 28: Operation timed out after 5003 milliseconds with 0 bytes received (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for http://172.30.1.2/media_test_oembed/resource?url=https://vimeo.com/7073899&altered=1Comment #173
kristiaanvandeneyndeWhy can't I do a fast-forward pull on this branch? Happened twice now already 🤔
Re #172, are we sending a GET request to Vimeo every time we run our tests? I'm not sure how happy Vimeo would be with us doing so :)
Comment #174
andypost@kristiaanvandeneynde no, there's mapping in state to return local file and somehow this request is failed.
I used to add debug print and I see that request failed before the controller returns result(
Comment #175
steinmb commentedVery interesting work. Xhprof lead me to this issue, as I am tracking down a performance issues in a D10.2.x —site.
This is information that not often change, tough by reading through the issue and try to understand what is going on, I would say, this data should not be stored here but the site custom module. From my understanding is this data that not often change.
Comment #176
kristiaanvandeneyndeAh yes, now that you mention it: ResourceController::setResourceUrl() uses state.
Then the failure makes slightly more sense, we're changing how state works and now a test using state is failing.
Comment #177
longwaveRebased, I expect the performance tests to fail as I couldn't figure out what the new expected counts were supposed to be from the conflicts.
Also reverted the AutomatedCron event subscriber priority back to 100 and the docs change in core.api.php, I think now that
needs_destructionis handled after all terminate events (following #3412556: Allow needs_destruction services to run on page cache hits) that all terminate event subscribers are safe to use state.@kristiaanvandeneynde fast forward pulls do not work when the branch has been rebased and force pushed, most MR contributors tend to use the rebase method instead of merging in 11.x as merge commits are sometimes tricky to work with and it makes the set of commits in the MR cleaner.
Comment #178
longwaveMaybe a can of worms but is this another use for a feature flag module over settings.php? We could enable the feature flag module on new installs but existing sites can opt in over time, and eventually we can deprecate/remove it?Changed my mind, this is more low level and belongs in settings.php.
Comment #179
longwaveMediaSourceOEmbedVideoTest sets state during the test and then expects the site under test to read that, but now we can't (easily) invalidate the site's state cache so the test doesn't have the effect it thinks it does. To solve this I switched the test ResourceController to use key value instead of state.
Also updated the performance test counts, and updated the change record a bit.
Comment #180
longwaveExtensionList::getPathNames()appears to wrap cache around state, we can perhaps drop the cache wrapper in a followup.Comment #181
kristiaanvandeneyndeIn general this looks really good! I'm wondering why we're adding the WaitTerminateTestTrait without calling its method, though? I couldn't find anything in core that automatically calls that method if it's present either.
Comment #182
longwaveRemoved WaitTerminateTestTrait seemingly with no harm done.
Comment #183
longwaveRemoved the cache wrapper from ExtensionList, hoping this will remove another cache get.
edit: it did not, but it should still save some cache accesses.
Comment #184
kristiaanvandeneyndeJust went over all the changes again and they look good to me. I would just like to see a follow-up so we don't forget to change the state_cache setting in D11.
Also wondering if we need a deprecation notice for dropping the $cache argument from RoutePreloader, although I couldn't find any other example of that in core or the documentation. The docs describe what to do when A, B, C becomes A, C; but not what to do if it becomes A, B. I would err on the side of adding a trigger_error still, using func_get_args().
I have no other objections. Good job on figuring out MediaSourceOEmbedVideoTest, by the way! Setting to NW for the 2 points above but feel free to ping me when those are addressed and I'll gladly set this to RTBC.
Comment #185
longwaveThanks for reviewing, opened #3436954: Remove state_cache setting in Drupal 11.
#160 asked for some other followups, so leaving the tag in place for now.
Will add a deprecation for the RoutePreloader argument, and also change LanguageNegotiationContentEntityTest to use keyvalue in this issue, given we have precedent in the ResourceController for the OEmbed test.
Comment #186
longwaveComment #187
longwaveAlso opened #3436960: simple_sitemap.queue_stashed_results should be in keyvalue instead of state, unsure if @catch wants to open any more followups or not - leaving the tag in place for now.
Comment #188
longwaveAdded a release note given we are adding a new setting here.
The tests failed a couple of times but I think I was just unlucky with random fails - they just passed again so hopefully this is all good now.
Comment #189
catchTagging as a release highlight, because along with other changes in Drupal 10.3 we're dramatically decreasing the database queries on all page requests, can go in a general 'performance improvements' paragraph.
I opened #3436971: Move the TestWaitTerminate middleware to be added conditionally which I'd been thinking about for a while, but seeing the query in the diff reminded me of it.
Now that tests using state have been updated to k/v here, I think that's everything I was worried about in #160 so removing the tag.
Comment #190
kristiaanvandeneyndeRTBC for me, should fix the 2 drupal:10.0.0 references to drupal:10.3.0, though. Holding off on those grounds.
Comment #191
longwaveFixed the deprecation version, don't know what I was thinking there.
In a followup I think we should add a note to https://api.drupal.org/api/drupal/core%21core.api.php/group/info_types/10 and perhaps https://www.drupal.org/docs/develop/drupal-apis/state-api/state-api-over... mentioning that the state store is meant for frequently used and infrequently changed small pieces of data, and the key-value store should be used for larger or more frequently changing data.
Comment #192
kristiaanvandeneyndeAgreed. Code looks good to me now, so RTBC.
Comment #193
catchOne more follow-up, I opened #3437162: Move twig_debug and other development toggles into raw key/value.
Comment #195
catchReally nice to be able to commit this one, we've come a long way since variable_get().
I had to resolve a merge conflict on 10.3.x, it was trivial, let's see whether I did I right or not...
Committed/pushed to 11.x and cherry-picked to 10.3.x, thanks!
Comment #198
berdirWow. Very tempted to insert the "It's been 84 years..." meme. This took a while, thanks for the help to push this over the finish line.
I've clarified and expanded the change record: https://www.drupal.org/node/3177901, specifically around the current opt-in/opt-out behavior and how to check for problematic usages of the state API.
I've also added a short paragraph on recommended/correct usage of the state API and I created a follow-up issue to improve API docs as well as the documentation on drupal.org: #3437176: Improve documentation of the state and key value API
Comment #199
longwaveThis feels like a good heuristic:
Wonder if we should add a hook_requirements status check for this to help people before Drupal 11 enables this for good?
Comment #200
berdirI just made those numbers up, felt like a nice, even numbers, not sure about them yet, especially the length one.
There is already a hook_requirements() warning, the problem is that there is no API to get that information, we would need to check the used key value implementation and then hardcode them or something like that, was not sure about the putting that complexity in there. What meant to do is link the change record there in that message so that people can read through that if they have questions. Maybe we can do a follow-up commit or issue for that?
Comment #201
jurgenhaasQuestion: we have a sub-class of State in ECA, how should we declare the constructor and the service, if we want to support both Drupal 10.2 and 10.3? This seems impossible with this change, or am I missing something?
Comment #202
catchI found https://git.drupalcode.org/project/eca/-/commit/3bb22ea5820fd81f72e46c44...
Since this is a subclass with its own service definition, why pass the extra services into your own constructor for both 10.2 and 10.3 - they won't be used in 10.2 but that seems OK?
Comment #203
jurgenhaasUnfortunately not. We have an extra service to inject and our service declaration looks like this:
So, for 10.3 the constructor needs to look like this:
but for 10.2 it needs to look like that:
Comment #204
catch@jurgenhaas why can't you inject cache and lock into 10.2?
Comment #205
jurgenhaas@catch ah, you mean I could update the service declaration there to mimic what 10.3 is going to do in the parent service? They should work, I'll give that a try in #3437321: Add support for cache collector for state in core
Comment #206
longwaveSpotted two issues here that deserve followups:
1. The new constructor calls the CacheCollector constructor with a fixed cache key; this will need to be changed in any subclasses, so we should move this to an overridable class constant.
2. We inject the bootstrap cache bin in services.yml but the BC fallback uses the discovery bin.
Comment #207
longwave#206.1 can actually just be solved by overwriting the property, but opened #3437347: State cache collector uses a fixed cache and collection ID anyway.
#206.2 is a bug so opened #3437344: State cache bin is inconsistently chosen
Comment #208
catchThis is leading to random test failures, I opened #3438424: [random test failures] Race condition in state when individual keys are set with an empty cache, no MR yet.
Comment #210
klemendev commentedWe have observed that on some of our websites
SELECT name, LENGTH(value) FROM key_value WHERE collection = 'state' ORDER BY LENGTH(value) DESC LIMIT 100;yields length numbers in orders of 200k, which is not ok according to https://www.drupal.org/node/3177901.
The vast majority of the length comes from drupal_js_cache_files.
https://www.drupal.org/node/3177901 recommends optimizing if
but how can we optimize something that originates from the core?
Comment #211
kristiaanvandeneyndeCould you file a new issue about this please? If core is setting that much data through drupal_js_cache_files, that definitely needs a dedicated issue.
Comment #212
berdirCore removed any usage of the string system.drupal_js_cache_files 12 years ago when it was renamed to system.js_cache_files. And that too was deprecated and removed in in 10.2 or 10.1.
Whatever you have there is probably not core or at least not actively used anymore. Try searching your code base for it, if you don't find anything, you can delete those records. FWIW, since this is a cache collector, if nobody *uses* those state keys, they will also not end up in the cache.
Comment #213
klemendev commentedNo occurrence of drupal_js_cache_files in the codebase, it seems this is stale stuff. The website we spotted this on used to run on all versions starting D7, so maybe it is an artifact from these times.
Comment #214
quietone commented