Problem/Motivation

After a full cache clear, Drupal has to build up various caches before it can serve a request. Element info, route preloading, menu links, library info, various plugin registries, theme registry, twig templates, CSS assets (#1014086: Stampedes and cold cache performance issues with css/js aggregation) etc.

This both takes a long time, and massively increases the memory usage of a request.

Steps to reproduce

drush cr, load a heavy-ish page, watch that spinny wheel / check mem_get_usage()

Proposed resolution

PHP 8.1 adds Fibers https://php.watch/versions/8.1/fibers / https://wiki.php.net/rfc/fibers

In DrupalKernel::handle(), execute HttpKernerl::handleRequest()in a fiber. This allows any code, anywhere in core, to suspend itself using Fiber::suspend(), the place we suspend is for now after entering lock wait or getting a cache miss in particular services - this should only be done for cache items that are used to build most pages - plugins and registries, but not render caching or anywhere like that. But it could also be after executing an async query once we have a database driver that supports it.

At the same time, services can tag themselves using the cache_prewarmable service tag and implement PreWarmableInterface. This is not mutually exclusive with calling suspend on a cache miss, some services can do both.

When a fiber has suspended, DrupalKernel is then able to call the cache_prewarmer service, which gets a list of tagged service IDs. It calls the ::prewarm() method of one of these services at random, then resumes the original fiber and the request continues as normal. The prewarm method will usually just call a different method on the service, this needs to be one that both checks and sets the cache - we always want to assume it might have been set before we got there.

If the fiber gets suspended again later in the request, we'll go back to the cache prewarmer service and pick a different one.

This is complementary to #3377570: Add PHP Fibers support to BigPipe and in general is designed to speed up the part of request handling that precedes having bigpipe placeholders to render, but since different pages hit different caches at different times, sometimes the 'same' Fiber::suspend() call will result in going back to DrupalKernel and warming a cache, sometimes it will result in moving onto a different BigPipe placheholder. This is fine and by design, the code inside the fiber doesn't care what happens in between suspending itself and being resumed, it's just anticipating that something might be able to happen and then bigpipe and the prewarmer service take care of what that 'something' is.

The big advantage of Fibers is that as long as the code executing a fiber handles them correctly, and as long as the code that might be executing inside a fiber checks its inside a fiber first before suspending, they don't need to know anything about each other at all. The 'parent' starts and resumes, and the 'child' suspends, and it doesn't need to be any more entangled than that.

Here's a very simplified visual indication of what stampedes look like in terms of service cache get misses, building, and setting:

Each letter represents a service.
Aget (cache get)
A-set (build and cache set).

Extra hyphens means a longer period of time building the cache item.

C and D are nested services with one calling the other, equivalent to the views examples above. When C is warm, D doesn't get requested.

When we have one request, it just builds each service sequentially.

AgetA-set[NO-PRE-WARMING-]BgetB--------setCgetDgetD-setC--setEgetE----setFget-Fset

In HEAD, when we get a stampede. All the requests build and set each cache sequentially. The requests with Z and F at the end represent different pages (node vs. front vs. user or whatever), which may have different caches to build.

[NO-PRE-WARMING-] is a space filler because suspend and resume take up space on the line..

[NO-PRE-WARMING-]
[SUSPEND][RESUME]
<code>
AgetA-set[NO-PRE-WARMING-]BgetB--------setCgetDgetD-setC--setEgetE----setFgetF-set
AgetA-set[NO-PRE-WARMING-]BgetB--------setCgetDgetD-setC--setEgetE----setFgetF-set
AgetA-set[NO-PRE-WARMING-]BgetB--------setCgetDgetD-setC--setEgetE----setFgetF-set
AgetA-set[NO-PRE-WARMING-]BgetB--------setCgetDgetD-setC--setEgetE----setZgetZ-set
AgetA-set[NO-PRE-WARMING-]BgetB--------setCgetDgetD-setC--setEgetE----setZgetZ-set

With the patch, if we get two requests at the same time, we start to see a benefit (without any async):

AgetA-set[SUSPEND]BgetB--------set[RESUME]BgetCgetDgetC--setEgetFgetF-set
AgetA-set[SUSPEND]DgetD-set[RESUME]BgetCgetDgetC--setEgetE---setFgetF-set

Now with five requests, it's making much more difference:

AgetA-set[SUSPEND]BgetB--------set[RESUME]BgetCgetEgetZget
AgetA-set[SUSPEND]FgetF-set[RESUME]BgetCgetEgetZgetZ-set
AgetA-set[SUSPEND]FgetF-set[RESUME]BgetCgetEgetFget
AgetA-set[SUSPEND]CgetDgetD-setC--Set[RESUME]CgetEgetFget
AgetA-set[SUSPEND]EgetE---set[RESUME]BgetCgetEgetFget

Note also that because B and D take different amounts of time, in this case even though Z isn't prewarmed, the two requests that need it get there at different times, so only one builds it and the other gets a cache hit - this is an effect of the offsetting/shuffling that the prewarmer does. Say one prewarmable cache takes 800ms to build and the other takes 8ms, the one that's done with prewarming after 8ms will get a lot more done before the 800ms prewarming is finished.

Remaining tasks

Once there are enough prewarmable services defined, we should be able to manually test this by clearing the cache then using ab -n1 -c10, probably with the standard profile, to see if there's a measurable performance improvement.

Once the API is in core, we can potentially add full async cache prewarming support to drush. I think drush could then build all the prewarmable caches at the same time immediately after a cache clear.
https://github.com/drush-ops/drush/issues/5724

User interface changes

API changes

Adds PreWarmableInterface

Data model changes

Release notes snippet

Issue fork drupal-3257725

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Comments

catch created an issue. See original summary.

catch’s picture

catch’s picture

Issue summary: View changes
effulgentsia’s picture

If a fiber loads a PHP class into memory (e.g., via PHP's native reflection), does that memory get freed when the fiber is terminated? Or does PHP have a single memory pool for loaded classes?

catch’s picture

Issue summary: View changes

@effulgentsia according to the RFC:

Each fiber is allocated a separate C stack and VM stack on the heap. The C stack is allocated using mmap if available, meaning physical memory is used only on demand (if it needs to be allocated to a stack value) on most platforms. Each fiber stack is allocated a maximum of 8M of memory by default, settable with an ini setting fiber.stack_size. Note that this memory is used for the C stack and is not related to the memory available to PHP code. VM stacks for each fiber are allocated in a similar way to generators and use a similar amount of memory and CPU. VM stacks are able to grow dynamically, so only a single VM page (4K) is initially allocated.

Unless I've completely misunderstood, it'll have it's entire own PHP memory allocation separate from the main thread (i.e. it's genuine concurrency even though it doesn't allow for simultaneous execution).

andypost’s picture

FTR core could require 8.1.2 to make work on more CPU architectures as https://github.com/php/php-src/commit/069bbf3e8009795bfeec2d46cd4bb39d69...

longwave’s picture

https://stackoverflow.com/questions/70659739/php-fibers-and-memory implies that classes are loaded into a shared memory space when fibers are used.

catch’s picture

#3257725-7: Add a cache prewarm API. Yes, that's discouraging. It might be it introduces more options later, but it's not a quick fix.

I wonder if we can do something with i/o still, for example interleaving file scanning/loading and parsing, but even thinking through it, it gets complex quickly and the gains don't seem like much.

aaronmchale’s picture

Would be nice if PHP had an elegant way to spawn a child process, you can kind of do it using exec, but I doubt that would be acceptable nor reliable.

larowlan’s picture

catch’s picture

The problem with exec() is it's not available on most shared hosts, but maybe there are more options now:

pcntl_fork on php-fpm (but not mod_php)
exec() on mod_php (but not most shared hosts)
async http request (to something like /prewarm.php) (also not on some hosts that disallow self-http requests)

Symfony process, amphp and ReactPHP all try to abstract these away (and I think all three are doing so via Fibers too now) - although since (I assume) they're all designed for cli usage, I'm not sure they have a single API that handles falling back to any one of the above.

But maybe we could have a service in the container, configurable via settings.php, that tries to figure out what's available if nothing's configured and with a final fallback to just doing things inline.

catch’s picture

Title: Use Fibers to offload discovery builds » Use Fibers to offload discovery builds during stampedes
Issue summary: View changes

After writing up #3377570: Add PHP Fibers support to BigPipe I have a clearer idea of how this can be useful.

There are two situations where a full cache clear is annoying:

1. You're doing development, you change something, you clear cache, the page takes 25 seconds to load, especially if you're experiencing #3043330: Reduce the number of field blocks created for entities (possibly to zero).

2. You do a deployment on a busy site and there is a stampede where multiple requests hit empty caches. No page can be served until all the required caches have been built, if requests hit at the same time, you can end up rebuilding multiple expensive things multiple times each. Worst case your webserver is completely maxed out with requests, since they're all so individually slow none of them are completing before new ones come in.

For very, very expensive operations we have used the lock system - for example the theme registry and router builder both ensure that only one request can build them at a time, other requests have to wait for the lock to be released then they should be able to get the already-cached thing from cache. There might be other expensive-to-build caches where we could apply the same thing.

This is where it gets fun :)

First of all, these registries aren't built up-front, they're generally requested at the last minute - i.e. we don't try to build the theme registry until a theme hook is called, which is usually during rendering.

However something like the router is needed right at the beginning of the request, until we have the route, we can't execute the controller, and it's the controller that will determine what exactly the rest of the request needs.

But we know that on most sites if the router needs to be rebuilt, then the theme registry, field plugins, block plugins, library info and etc. are also going to need to be rebuilt very soon. Since all of these are both persistently and statically cached, if we just request them, it's either going to trigger a rebuild, or be nearly-free.

This means we want to execute routing in a Fiber - the obvious place to do this to subclass Symfony's RoutingSubscriber and do it there. (We could also do it in an event subscriber just before RoutingSubscriber, but it would probably mean running some aspects of routing twice if we did that.

We'd only want to run this logic if the router is likely to need rebuilding, and the way to determine that is to find out if the container has been rebuilt in this request. This means adding a protected property + getter to DrupalKernel so we can check if that's the case.

Once routing is inside a Fiber, we can then call Fiber::suspend() in RouteBuilder::rebuild():

   if (!$this->lock->acquire('router_rebuild')) {
      if (Fiber::getCurrent()) {
        Fiber::suspend();
      }
      // Wait for another request that is already doing this work.
      // We choose to block here since otherwise the routes might not be
      // available, resulting in a 404.
      $this->lock->wait('router_rebuild');
      return FALSE;
    }

Now, instead of calling sleep(), we can do other things, like build the theme registry, library info, plugin registries etc. we'd need a way to determine which of these to do, but maybe it can be something like a container parameter that's a list of callbacks you can add to. Or an event to subscribe to. If we apply the lock pattern to more of them, then whenever you reach something that another request is handling, you'd skip to the next one, and if it's already been done, then it's return from cache and be finished.

Before:

Request 1 starts to rebuild the router.
Requests 2, 3, 4, and 5 $lock->wait() until the router is rebuilt.

After:
Request 1 runs routing inside a fiber, routing acquires a lock.
Request 2 also runs routing in side a fiber, the router build suspends the fiber when it can't acquire a lock, starts to build the theme registry (would need to be the non-theme specific bit of the theme registry but we already build and cache that separately so could add a dedicated method for it, or build the registry for the default theme.)
Request 3 theme registry rebuild suspends when it can't acquire lock, starts to build the library info cache.
Request 4 suspends when it can't acquire a lock for the library info cache, starts to build the field formatter plugin registry.
etc. etc.

In between each possible thing to rebuild, we can restart the routing fiber again, and if routing completes, continue with the request - that way we're only running things when the router rebuild is actually running.

I haven't yet figured out how to use this to speed up the individual development environment case. For example we could run the database queries in MatcherDumper::dump() async and do other things in between, but we don't want to, because we need the entire operation to finish as quickly as possible inside a transaction so we can release the lock and other requests can successfully get routed.

One place that async might work for single requests though is YAML discovery and parsing - i.e. if we async discover the files, parse them as they come in, then we'd be interleaving file i/o intensive tasks (directory traversal and file_get_contents()) with CPU-intensive stasks (YAML parsing) rather than discovering all the files, then parsing all the files. But... that is probably going to be quite hard to implement.

effulgentsia’s picture

#12 is a great analysis, thank you! But if the use-case it addresses is multiple parallel requests, then what's the value that Fibers bring?

After:
Request 1 runs routing inside a fiber, routing acquires a lock.
Request 2 also runs routing in side a fiber, the router build suspends the fiber when it can't acquire a lock, starts to build the theme registry (would need to be the non-theme specific bit of the theme registry but we already build and cache that separately so could add a dedicated method for it, or build the registry for the default theme.)
Request 3 theme registry rebuild suspends when it can't acquire lock, starts to build the library info cache.
Request 4 suspends when it can't acquire a lock for the library info cache, starts to build the field formatter plugin registry.
etc. etc.

Do we need fibers for this? Can request 2 realize that it can't acquire a router rebuilding lock, then proceed to building the theme registry, and after building the theme registry check again if the router has been rebuilt (by request 1), etc.? Or is there something special about fibers that either makes this function better or make the code simpler?

catch’s picture

Do we need fibers for this?
Can request 2 realize that it can't acquire a router rebuilding lock, then proceed to building the theme registry, and after building the theme registry check again if the router has been rebuilt (by request 1), etc.?

There are two reasons, one is immediately useful, one is only potentially/speculatively useful.

If we zoom out slightly from the previous code example:

  public function rebuild() {
    if ($this->building) {
      throw new \RuntimeException('Recursive router rebuild detected.');
    }

    if (!$this->lock->acquire('router_rebuild')) {
      if (Fiber::getCurrent()) {
        Fiber::suspend();
      }
      // Wait for another request that is already doing this work.
      // We choose to block here since otherwise the routes might not be
      // available, resulting in a 404.
      $this->lock->wait('router_rebuild');
      return FALSE;
    }

    $this->building = TRUE;

    $collection = new RouteCollection();
    foreach ($this->getRouteDefinitions() as $routes) {

That Fiber::suspend() is right in the guts of the routing system, and it would take us back out to RoutingSubscriber, (or an event subscriber that runs just before RoutingSubscriber if we don't mind an extra cache get or adding a static cache or so). Once it's tried to prewarm something else, it would then $fiber->resume() which would take us to the code point immediately after Fiber::suspend() - we'd see if the lock has been released in the meantime and continue as if nothing had happened in between.

If we didn't use a fiber, we'd have to embed the prewarming logic right inside RouterBuilder::rebuild() itself. This would mean adding a prewarming API to RouterBuilder, vs. adding a prewarming API that incorporates the router builder. So it only helps from a code organisation point of view to keep the prewarming logic a bit more decoupled. This also means we could have done this years ago!

The potential/speculative reason is that if we find something that can be pre-warmed which can also use an async API, then this trick would work to speed things up within a single request. The problem with that, is that here we care about real performance as much as perceived performance. We want the actual router rebuild to finish as soon as possible so that other requests get a cache hit and not a lock wait, and that's also the case for the theme registry and most/all other things we'd add in here. So we wouldn't want to do something like fire off the query that writes back to the router table async, Fiber::suspend(), then check if it's finished and release the lock, because if what we do in Fiber::suspend() takes longer than the query does to come back, we've extended the overall Router::rebuild() time which could mean more requests hitting the lock.

This is different from #3377570: Add PHP Fibers support to BigPipe where it's OK if any individual placeholder takes a fractionally longer wall time to finish, because another one was rendering in the meantime - none of them are going to be blocking other HTTP requests from completing and the important thing for us is that the overall completion time is faster rather than any individual part.

However, still getting my head around this, so there might be an async use case after all, and soon as we have even one of those, it'll be able to make use of the logic. Like say a contrib module adds something to be prewarmed that requires a long database query, it could implement that with an async query and Fiber::suspend(), and it'll all work as long as we've executed it inside a fiber in the first place.

Another example might be if drush implemented an extreme version of this on cli - i.e. a drush cr-prewarm. This would clear the cache, then immediately start to rebuild the various registries, but because it'd be cli, it could use amphp (or drush's own child process logic) to do each one fully async, simultaneously build the router and the theme registry at the same time. But to allow that, it's better that the prewarming logic in core is encapsulated as much as possible, otherwise you wouldn't be able to independently router rebuild and theme registry rebuild without one triggering the other.

neclimdul’s picture

I feel like I need to dig in more and play with fibers to see how this is useful. I was asking the same question of "Why don't we just fire an event and let the inherit parallelization work any pre-warming without the fiber complexity" and I'm not sure I understand how Fibers helps still. Either way I like the idea and look forward to seeing where it goes.

catch’s picture

@neclimdul I think the tl;dr is if it's the only thing we do that's different, it's extremely marginal here, but it opens up possibilities later on.

I have a green patch on #3377570: Add PHP Fibers support to BigPipe now fwiw (although that is also not taking advantage of async yet).

catch’s picture

Status: Active » Needs review
StatusFileSize
new4.84 KB

Working code on #3377570: Add PHP Fibers support to BigPipe helped me think through this a bit too. Uploading a proof of concept.

This doesn't add a prewarm API, it hardcodes the theme registry directly into a decorated router_listener even listener, but it does implement the fibers stuff.

I was wrong about the theme registry, while it does use a lock, it only uses it to ensure atomic writes to the cache, not to block/wait when building the cache.

However, the principle is the same and there's another (better) reason to use fibers I hadn't thought of above.

The advantage is that the theme registry has a two-stage cache build, first it builds a cache item with only the module theme hooks, then it build the theme-specific registry on top of that.

So if we're inside ::build() and we get a cache hit on the module registry, then either we just haven't built the full registry for this specific theme yet, or another process is about to beat us to it.

This means we can add the Fiber::suspend() at that completely arbitrary point, and it'll jump back to the RouterListener, which can just go back to worrying about routing. If we need to build the rest of the theme registry alter, that'll happen (inside or outside of a fiber later on).

If we didn't use fibers and used some kind of prewarmer wrapper, then we could potentially do things like check for locks and cache items, but we'd have to do so outside the class in question with internal knowledge of it. So what we get is the ability to pause execution of the prewarming code paths at any particular point.

Also if we do have another service that lock waits like the router, then you could also suspend() in those services as soon as you're unable to acquire a lock - since we don't want to wait for those services to build, we're only have to wait for the router itself. This would also let you move on to another service each time one suspends.

donquixote’s picture

If we didn't use a fiber, we'd have to embed the prewarming logic right inside RouterBuilder::rebuild() itself. This would mean adding a prewarming API to RouterBuilder, vs. adding a prewarming API that incorporates the router builder.

Not necessarily.
We could instead add this to the locking mechanism itself.

So instead of $lock->wait(), we call $lock->waitAndBeUseful().
Or better, keep the $lock->wait(), but do something useful in that time.

I am not saying that this is better than fibers, perhaps there are still other reasons for using them.

catch’s picture

So instead of $lock->wait(), we call $lock->waitAndBeUseful().
Or better, keep the $lock->wait(), but do something useful in that time.

I am not saying that this is better than fibers, perhaps there are still other reasons for using them.

I like this idea. It seems a lot better to me than having to decorate RouterListener which is a bit of boilerplate, forgot it was final so we can't just subclass it. Feels like a new method to me (or a new parameter on the existing method), since we have some very low level usage of the lock API such as in #3375959: Add a way to delay executions in test runner until terminate event completed in the child site.

The other advantage is that we could intersperse work with polling - so poll the lock, do something, poll the lock, do something else, the proof of concept I've got above just does one single thing if it fails to aquire a lock, which is still better than nothing but it gives us more options.

One issue it might introduce is recursive calls - i.e. two systems use ::lockWaitAndBeUseful() and one ends up calling the other, in the second one, we wouldn't want to restart the 'being useful' game but instead back out immediately and try to do something else. But since it's all in the lock backend it could short-circuit that and immediately return from a nested call. So probably easy enough to deal with.

I'm also trying to think through how we could organise the 'do something'.

Let's say we have:
Router rebuild
Theme registry rebuild
Library info rebuild

Ideally we want it to look like:
Request A: actually rebuilding the router
Request B: theme registry rebuild while waiting
Request C: library info rebuild while waiting

And not:

Request A: actually rebuilding the router
Request B: theme registry rebuild while waiting
Request C: theme registry rebuild while waiting

This would be achievable by using the lock pattern more, i.e.

Request C: try to acquire router lock, try to acquire theme registry lock, actually rebuild the library info cache.

Either way I think we need a list of callbacks from the container, not something like the hook or event API where we expect to run all of them in an explicit order usually.

A new interface + tagged services might fit it better. The new interface would also allow the service to do something different while it's waiting in a ::preWarm() method - for example the theme registry could explicitly just rebuild the module theme hook cache entry and not any theme-specific bits, and it could wrap just that bit in a lock but not the whole operation.

catch’s picture

OK there is a real (not just preference) issue with trying do the prewarming from within the lock system, even if we added a dedicated method:

This is at the top of RouteBuilder::rebuild():

  /**
   * {@inheritdoc}
   */
  public function rebuild() {
    if ($this->building) {
      throw new \RuntimeException('Recursive router rebuild detected.');
    }

This means that if any module in any prewarming service triggers routing, we'd hit that recursion protection. An example could be someone adding a routed library URL in hook_library_info_alter().

But this has given me a couple of different ideas, going to try to get a proof of concept patch going along with the explanation though.

donquixote’s picture

It is a strange concept in Drupal, but we can actually catch exceptions :)
But, if the router rebuild is protected by a lock, we will hit the lock before we hit the recursion exception.

To make this idea work, we would have to introduce a lock id for every cache warming task.

There should be a registry of cache warming tasks with dependencies.
This also allows to register sub-tasks, e.g. to collect routing information of views module and store them in an intermediate cache.
Tagged services could work for this, yes.

During the "do something useful", if one cache warming task is blocked by a lock, we should skip it and move on.
The first idea would be that the lock id is registered with the cache warming task. But this could be weird if we call it from outside the cache warming, and it needs to acquire the same lock.

What we could do instead is start the cache warming task as a fiber from within lock wait.
Then within the fiber, when we hit a lock, we stop the fiber.

So, lock wait would be like this:
- If we are _not_ within a fiber, we look for cache warming tasks, and start them as fibers.
- If we are already within a fiber, we stop the fiber.

Feels like a new method to me (or a new parameter on the existing method), since we have some very low level usage of the lock API such as in #3375959: Add a way to delay executions in test runner until terminate event completed in the child site.

Why is this a concern?
For low level usage, I think we optimize for the case when the lock is not blocked.
Any overhead only occurs if the lock is blocked.
But maybe I am missing something.
Perhaps it is just safer to not always run expensive tasks when something is locked.

Another option would be different lock handlers. One that does cache warming tasks during sleep, another that does not.

catch’s picture

Version: 10.0.x-dev » 11.x-dev
StatusFileSize
new4.3 KB

OK this is basically the same approach as before, but with the code moved to completely different and IMO more appropriate places.

The Fiber::suspend() logic moves to LockBackendAbstract::wait() - see the code comments in there.

I've also added a non-locking Fiber:suspend() implementation to the theme registry

Instead of decorating RouterListener, I stuck the entirety of HttpKernel::handle() inside a Fiber, this means any Fiber::suspend() during handling of a request can potentially make use of the cache prewarming logic (if it bubbles up to this point).

Status: Needs review » Needs work

The last submitted patch, 22: 3257725-20.patch, failed testing. View results

catch’s picture

I've added the lock and theme registry changes over there too #3377570: Add PHP Fibers support to BigPipe because they're equally applicable, and that issue has a clearer path forwards so far I think. If they land over there, it'll just mean a smaller patch here.

One issue with testing this is we do our absolute best with the router unlike discovery caches, to rebuild it in the request that forces the rebuild - either in drupal_flush_all_caches() directly or at the end of a request that triggers rebuildNeeded. This means you can't just clear caches then hit a site with ab, because clearing caches will rebuild the router.

catch’s picture

StatusFileSize
new1.06 KB
new5.36 KB

Test failure is just because the call stack is longer.

catch’s picture

StatusFileSize
new1.54 KB
new15.57 KB

We want to know which services to apply this to - which ones take a long time to build and so should Fiber::suspend() when they get a cache miss, and which ones are good candidates for prewarming because no matter how long they take, they're likely to block the request being served.

I applied the following patch to the database cache backend - timers.txt, this produces the report cache.txt, which I ran on the standard front page, after a cache clear (and after deleting the file so it doesn't include caches already being built by drush cr), it gives us:

1. The order in which caches are requested.
2. When the caches are set, and a time in ms between get and set.

Truncated version with the ones we should look at:

routing.non_admin_routes and element_info are very early.

get: routing.non_admin_routes
set: routing.non_admin_routes time: 0.93
route_provider.route_load:0e574a6220b0c807e7d410c61f43c6f02773621676f881cbb160db210b856d32827b71a046ee16fa2dcc0352a44d773c067bdc3531bcb3960a66710d128298a9
set: route_provider.route_load:0e574a6220b0c807e7d410c61f43c6f02773621676f881cbb160db210b856d32827b71a046ee16fa2dcc0352a44d773c067bdc3531bcb3960a66710d128298a9 time: 1.92
get: theme.active_theme.olivero
set: theme.active_theme.olivero time: 0.64
get: element_info_build:olivero
get: element_info
set: element_info time: 2.64
set: element_info_build:olivero time: 19.18

views_data:en sets off a whole palindrome of cache gets and sets on an empty cache, the intermediate ones are good candidates for prewarming like entity_field_map and the bundle field definitions.

get: views:access
get: views_data:node_field_data:en
get: views_data:en
get: entity_type_definitions.installed
get: block_content.field_storage_definitions.installed
set: block_content.field_storage_definitions.installed time: 1.51
get: entity_base_field_definitions:block_content:en
get: field_types_plugins
get: typed_data_types_plugins
set: entity_base_field_definitions:block_content:en time: 4.48
get: entity_field_storage_definitions:block_content:en
set: entity_field_storage_definitions:block_content:en time: 4.44
get: entity_base_field_definitions:user:en
set: entity_base_field_definitions:user:en time: 0.65
get: comment.field_storage_definitions.installed
set: comment.field_storage_definitions.installed time: 1.03
get: entity_base_field_definitions:comment:en
set: entity_base_field_definitions:comment:en time: 0.62
get: entity_field_storage_definitions:comment:en
set: entity_field_storage_definitions:comment:en time: 2.33
get: entity_base_field_definitions:node:en
set: entity_base_field_definitions:node:en time: 0.78
get: entity_field_map
get: entity_bundle_info:en
get: entity_base_field_definitions:contact_message:en
set: entity_base_field_definitions:contact_message:en time: 0.63
get: entity_base_field_definitions:file:en
set: entity_base_field_definitions:file:en time: 0.72
get: entity_base_field_definitions:menu_link_content:en
set: entity_base_field_definitions:menu_link_content:en time: 0.89
get: entity_base_field_definitions:path_alias:en
get: entity_base_field_definitions:shortcut:en
set: entity_base_field_definitions:shortcut:en time: 0.76
get: entity_base_field_definitions:taxonomy_term:en
set: entity_base_field_definitions:taxonomy_term:en time: 0.81
set: entity_field_map time: 19.09
get: file.field_storage_definitions.installed
set: file.field_storage_definitions.installed time: 1.27
get: entity_field_storage_definitions:file:en
set: entity_field_storage_definitions:file:en time: 0.48
get: node.field_storage_definitions.installed
set: node.field_storage_definitions.installed time: 1.27
get: entity_field_storage_definitions:node:en
set: entity_field_storage_definitions:node:en time: 2.67
get: search_plugins
get: taxonomy_term.field_storage_definitions.installed
set: taxonomy_term.field_storage_definitions.installed time: 1.46
get: entity_field_storage_definitions:taxonomy_term:en
set: entity_field_storage_definitions:taxonomy_term:en time: 0.55
get: user.field_storage_definitions.installed
set: user.field_storage_definitions.installed time: 1.41
get: entity_field_storage_definitions:user:en
set: entity_field_storage_definitions:user:en time: 2.9
get: entity_bundle_field_definitions:block_content:basic:en
set: entity_bundle_field_definitions:block_content:basic:en time: 4.09
get: entity_bundle_field_definitions:comment:comment:en
set: entity_bundle_field_definitions:comment:comment:en time: 13.31
get: entity_bundle_field_definitions:node:article:en
set: entity_bundle_field_definitions:node:article:en time: 45.53
get: entity_bundle_field_definitions:node:page:en
set: entity_bundle_field_definitions:node:page:en time: 8.72
get: entity_bundle_field_definitions:user:user:en
set: entity_bundle_field_definitions:user:user:en time: 5.87
set: views_data:en time: 333.14


local action plugins and local task plugins, and block plugins:
<code>
get: local_action_plugins:en
set: local_action_plugins:en time: 1.37
get: local_task_plugins:en:view.frontpage.page_1
get: local_task_plugins:en
get: entity_form_mode_info:en
set: entity_form_mode_info:en time: 3.66
get: entity_view_mode_info:en
set: entity_view_mode_info:en time: 4.21
set: local_task_plugins:en time: 20.59
set: local_task_plugins:en:view.frontpage.page_1 time: 26.83
get: shortcut.field_storage_definitions.installed
set: shortcut.field_storage_definitions.installed time: 0.76
get: entity_bundle_field_definitions:shortcut:default:en
set: entity_bundle_field_definitions:shortcut:default:en time: 2.54
get: library_info:olivero
set: library_info:olivero time: 1.57
get: block_plugins
set: block_plugins time: 111
catch’s picture

Status: Needs work » Needs review
StatusFileSize
new10.99 KB
new16.04 KB

Added an API, and it seems to be working.

Here's the before from above:

get: routing.non_admin_routes
set: routing.non_admin_routes time: 0.93
route_provider.route_load:0e574a6220b0c807e7d410c61f43c6f02773621676f881cbb160db210b856d32827b71a046ee16fa2dcc0352a44d773c067bdc3531bcb3960a66710d128298a9
set: route_provider.route_load:0e574a6220b0c807e7d410c61f43c6f02773621676f881cbb160db210b856d32827b71a046ee16fa2dcc0352a44d773c067bdc3531bcb3960a66710d128298a9 time: 1.92

As you can see, it gets the non admin routes and sets that, then it gets route_provider.route_load.hash (which is the cache of the preloaded routes query) and sets it.

This is the after:

get: routing.non_admin_routes
set: routing.non_admin_routes time: 0.43
get: route_provider.route_load:0e574a6220b0c807e7d410c61f43c6f02773621676f881cbb160db210b856d32827b71a046ee16fa2dcc0352a44d773c067bdc3531bcb3960a66710d128298a9

so far the same, but now it changes:

get: entity_field_map
get: entity_type
get: entity_bundle_info:en
get: entity_base_field_definitions:block_content:en
get: field_types_plugins
get: typed_data_types_plugins
get: module_implements
get: hook_info
set: hook_info time: 0.17
[snip]
get: entity_base_field_definitions:user:en
set: entity_base_field_definitions:user:en time: 0.74
set: entity_field_map time: 109.13
get: route_provider.route_load:0e574a6220b0c807e7d410c61f43c6f02773621676f881cbb160db210b856d32827b71a046ee16fa2dcc0352a44d773c067bdc3531bcb3960a66710d128298a9
set: route_provider.route_load:0e574a6220b0c807e7d410c61f43c6f02773621676f881cbb160db210b856d32827b71a046ee16fa2dcc0352a44d773c067bdc3531bcb3960a66710d128298a9 time: 1.56

So what it's done is try to get the route preload cache, got a miss, built the entity_field_map instead, gone back to get the route preload cache, still got a miss, built it and cached it.

Then when you get to the views_data caching later in the request, entity_field_map isn't in the middle of it at all, because it's already statically cached from the pre warming.

And when it's set, the timing looks like this:

views_data:en time: 197.16

This is compared to more like 330ms above, which is consistent with entity_field_map taking 109ms. by itself.

catch’s picture

StatusFileSize
new12.44 KB

Try to make cspell happy.

andypost’s picture

StatusFileSize
new2.7 KB
new12.4 KB

Fix CS

andypost’s picture

StatusFileSize
new409 bytes
new12.4 KB

and one more

Status: Needs review » Needs work

The last submitted patch, 30: 3257725-30.patch, failed testing. View results

catch’s picture

Status: Needs work » Needs review
StatusFileSize
new503 bytes
new12.5 KB

Prewarmer service now needs an alias for autowiring.

andypost’s picture

StatusFileSize
new12.47 KB

@catch looks you're picked not #30 patch, here's interdiff from #32 added and tests should be green

catch’s picture

Issue summary: View changes

Thanks @andypost!

Updating the issue summary.

catch’s picture

StatusFileSize
new1.39 KB
new12.81 KB

Small improvement to CachePrewarmer - keep track of what we've already prewarmed so we don't run the same one twice. Should be built off #33 this time.

andypost’s picture

  1. +++ b/core/lib/Drupal/Core/PreWarm/CachePreWarmer.php
    @@ -11,6 +11,16 @@
    +   * @var array
    +   */
    +  protected array $calledServices = [];
    

    could use @var string[]

  2. +++ b/core/lib/Drupal/Core/PreWarm/CachePreWarmer.php
    @@ -20,9 +30,13 @@ public function prewarmCaches() {
    +    if ($candidates) {
    +      $key = array_rand($candidates);
    

    needs a comment why random service is used?
    This way it can use to run some tasks twice

catch’s picture

So #27 shows that we can take a cache rebuild that is usually nested instead the views data rebuild, prewarm it earlier in the request, and then save that time later in the request when we actually build the views data cache. That shows this working in principle.

I think there might be a way to show this working in practice, but it's going to be fairly hard because it relies on concurrent requests to do its work.

Something like this:

1. Add PreWarmableInterface to 4-5 other services, ideally we want more that are 50-100ms+. We could also find one more earlyish cache miss to add a Fiber::suspend() to.

2. Use the tool which I'd normally tell people not to use for core performance testing - ab. What we want is just to hit the site with about 10 requests like ab -c10 -n1 immediately after a drush cr. In the 'before' case they will all individually build all of the caches, in the after case, they should divide some of the cache building up between them. If we manage to find 500ms of cache building to divide up, we could see as much as 400ms improvement. This might be a big enough change to see consistently with ab. We'd need to run the before and after tests 3 or more times each to get a baseline. We can't increase -n because all the subsequent requests will be cache hits and see no changes, but running multiple times should be enough hopefully. If it's not visible, we can artificially slow down one or two of the rebuilds with sleep(1) to simulate a 200+ module site instead of a 20+ one.

#36 feedback is good - I'll add a longer comment. Need to add a lot of high level docs in a few places still too.

neclimdul’s picture

  1. +++ b/core/lib/Drupal/Core/DrupalKernel.php
    @@ -701,7 +701,18 @@ public function handle(Request $request, $type = self::MAIN_REQUEST, $catch = TR
    +        $this->container->get('cache_prewarmer')->preWarmCaches();
    
    @@ -717,6 +728,20 @@ public function handle(Request $request, $type = self::MAIN_REQUEST, $catch = TR
    +  /**
    +   * Prewarms caches for services that support it.
    +   *
    +   * When handling a request with cold caches, some services lock expensive
    +   * operations to prevent two processes handling them at the same time. When
    +   * this happens, further requests may go into a LockInterface::wait() pattern
    +   * which will usually sleep until the lock is released, in the hope that the
    +   * other request has cached in the meantime ready for it to use.  This method
    +   * allows us to do something else during that wait time instead of sleeping.
    +   */
    +  protected function prewarmCaches(): void {
    +    $this->container->get('cache_prewarmer')->prewarmCaches();
    +  }
    +
    

    Did you mean to call this new helper in the request changes?

  2. +++ b/core/lib/Drupal/Core/Lock/LockBackendAbstract.php
    @@ -37,18 +37,32 @@ public function wait($name, $delay = 30) {
    +    // If executing inside a fiber, then suspending the fiber implicitly waits
    +    // for whatever the parent process does before it is resumed again. Try
    +    // that so that other tasks can continue, before sleeping only if necessary.
    +    // @see Drupal\Core\DrupalKernel::prewarmCaches() for an example which takes
    +    // explicit advantage of this behaviour.
    +    if (\Fiber::getCurrent() !== NULL) {
    +      $sleep = 0;
    +      \Fiber::suspend();
    +    }
    +    else {
    +      // Begin sleeping at 25ms.
    +      $sleep = 25000;
    +    }
    +
    

    Would this be a little less complex if it was directly in the loop or looked more like this?

      $wait = function () use ($sleep) {
        if (\Fiber::getCurrent()) {
          \Fiber::suspend();
        }
        else {
          usleep($sleep);
        }
      };
    
      // In loop
      $wait();
    
  3. +++ b/core/lib/Drupal/Core/Lock/LockBackendAbstract.php
    @@ -37,18 +37,32 @@ public function wait($name, $delay = 30) {
    -    // Begin sleeping at 25ms.
    -    $sleep = 25000;
         while ($delay > 0) {
           // This function should only be called by a request that failed to get a
    -      // lock, so we sleep first to give the parallel request a chance to finish
    -      // and release the lock.
    -      usleep($sleep);
    +      // lock, so if  we haven't already suspended a fiber, sleep first to give
    +      // the parallel request a chance to finish and release the lock.
    +      if ($sleep > 0) {
    +        usleep($sleep);
    +      }
           // After each sleep, increase the value of $sleep until it reaches
           // 500ms, to reduce the potential for a lock stampede.
           $delay = $delay - $sleep;
    

    If sleep is 0, is there a possibility this becomes a hot loop waiting for the lock to release?

catch’s picture

Did you mean to call this new helper in the request changes?

I actually meant to get rid of the helper because it's a one-liner now there's a service.

Would this be a little less complex if it was directly in the loop or looked more like this?

Not sure the end point but probably yes. I think it should be in the loop, and we could probably do something like call the prewarmer when we start the loop, then check if the lock is available, then sleep, every time so that if we're waiting for a long time, more prewarming can happen. That would then get rid of all the $sleep = 0 logic altogether.

catch’s picture

Title: Use Fibers to offload discovery builds during stampedes » Add a cache prewarm API and use it to distribute cache rebuids after cache clears / during stampedes
StatusFileSize
new10.29 KB
new16.7 KB

Hopefully addressing #36 and #38 and also adding a lot of docs.

neclimdul’s picture

Great! Making a lot more sense.

+++ b/core/lib/Drupal/Core/Lock/LockBackendAbstract.php
@@ -45,16 +45,27 @@ public function wait($name, $delay = 30) {
+      if (\Fiber::getCurrent() !== NULL) {
+        \Fiber::suspend();
+      }

Because it keeps popping into my head I'm going to throw this out there. Are these methods being static(global) going to be a problem? We won't know where this is called so we don't know if we're triggering the pre-warm or some other fiber. Does it matter? Could this trigger un-intended behaviors in modules down the line if they use fibers?

catch’s picture

StatusFileSize
new16.58 KB

Actually one more change.

Moving this out of RouteProvider::preload() into RoutePreLoader::onRequest() for two reasons:

1. Because there are two cache gets and sets involved, we can Fiber::suspend() immediately after the first (extremely cheap to build because it just copies over from what's in state) cache is set, that way there is no checking cache again, just suspend and move on. This conflicts a bit with #2575105: Use cache collector for state but we could always move it back to RouteProvider::preload() once that's in, or somewhere else I haven't thought of.

2. The event listener is much more closely tied to where this happens during request execution, and because it's an event listener it's more swappable/modifiable.

catch’s picture

We won't know where this is called so we don't know if we're triggering the pre-warm or some other fiber. Does it matter? Could this trigger un-intended behaviors in modules down the line if they use fibers?

It both matters a lot and shouldn't matter at all!

Since we already have another implementation in the works, #3377570: Add PHP Fibers support to BigPipe, this is a real consideration, but they're complementary with each other.

The DrupalKernel::handleRequest() call is acting as a 'top level' Fiber, it's giving anything that might be running in the critical path, especially early request handling, the opportunity to call Fiber::suspend() when it thinks the request might be better off doing something else instead of continuing. And the role of the cache prewarmer is to provide that 'something else' in the absence of anything else.

DrupalKernel::handleRequest() doesn't execute the actual prewarming API in a Fiber though, so anything calling Fiber::getCurrent() will get NULL and just keep going instead of suspending (or Fibers will throw an exception if they don't check Fiber::getCurrent() properly but that'd be a straightforward bug).

This means DrupalKernel::handleRequest() is always going to get a suspension, try to prewarm, resume the fiber, and maybe it'll try to do the same again, with a different PreWarmableInterface implementation if it gets suspended later in the request, but it'll always resume once it's done it's thing (unless there's an exception thrown or fatal error) and allow the rest to continue as normal. This is why the while() loop.

The BigPipe implementation is much later in the request and will be happening a level underneath the DrupalKernel one, and it is looping over multiple different callbacks (bigpipe placeholders) and resuming them if they suspend, until they're all done. So for the BigPipe case there is an array of Fibers involved, all at the same level of execution, moving from one to the other, unlike DrupalKernel which is only dealing with one at a time.

It is very possible that some code that thinks it's likely to be called by the Drupal::Kernel::handleRequest() Fiber (which would include anything we add here until the other issue lands) might actually end up being called by the BigPipe Fibers once they're both in core.

However, this should still work nicely!

For example, let's say we have two placeholders,

Placeholder A is a views listing block, so it has to execute the listing query first (async, when that's added), and calls Fiber::suspend() before waiting for the query to be returned.

Placeholder B just gets the current user and renders their username.

Either the username placeholder or the views listing block could be called first depending on the page layout/block configuration.

Username first:

If the theme registry is empty when we go to render the username, then code added here could end up calling Fiber::suspend() with the expectation that the cache prewarming API will be called.

Now while the username placeholder is suspended, BigPipe would move onto the views listing block, which executes the async query, and suspend itself before waiting for the async query to come back.

Then, BigPipe moves back to the username placeholder. The theme registry checks the property and cache get again when it's resumed, just in case it got built while it was suspended. If it doesn't do this, there was no point in suspending because it would end up doing the work anyway.

There are two possibilities when we get back to the username placeholder:

1. While we were in Views, some other process built the theme registry cache for us. Now we didn't have to build the theme registry cache at all, even though it was empty when we tried to get it, although the dedicated cache prewarming API wasn't involved here (at least in this request), it ended up being a byproduct of the BigPipe implementation instead. But broadly the same thing we wanted to happen happened.

2. It's still a cache miss for the theme registry (because actually no other requests are happening, or they didn't finish while we were in Views), so we build it ourselves. But!!! we're now building the theme registry cache after the views async query was executed, and before views checks if it returned, so it's filling in that time that would otherwise be spent waiting for the query to execute.

Now whether #1 or #2 was true, when we go back to Views, it will now wait for the async query to return (minus the time we spent rendering the username), then when it does, or if it already has, it'll render (and the theme registry cache is warm by this point too).

When the views listing block is first, it's slightly different, but ends up in the same place:

1. Views executes the async query and suspends.
2. Username placeholder ends up in a theme registry cache miss and suspends.
3. Views is resumed and waits for the async query to come back and renders. Now because Views is rendering a template, it gets the theme registry cache miss itself. This is a separate call, so the theme registry suspends here too, in the same place. This is the second suspension from the views placeholder via different code paths.
4. BigPipe resumes the username placeholder, it already suspended from the theme registry before, so it resumes from there, and either gets a cache hit this time or builds it. Then the username gets rendered and it's done.
5. Bigpipe resumes the views placeholder for a second time, it gets a cache hit for the theme registry this time, and it renders.

In both cases, the views async query ends up getting fired before the theme registry is built, and the theme registry is only built once. We do get two or three cache misses for the theme registry before we actually try to build it, but cache gets are cheap (and cheaper than locking which is our other option) and they'll only happen on a miss.

Now let's say there's a custom module that knows nothing about any of the above, it has a list of RSS feeds, and it gets the most recent item from each feed, and renders it in a block. Because BigPipe is installed, this gets executed in a BigPipe placeholder, which is now inside a Fiber, which is itself inside the DrupalKernel Fiber.

The custom feeds block module also wants to use concurrency, so it uses Guzzle async to fire off requests to all of the RSS feeds at once, and then it keeps resuming (and suspending) Fibers. When any async RSS feed response is returned from a fiber, it parses the RSS and puts it into an array, then resumes the other Fibers to see if another RSS feed as come back, until they're all done. This is now three levels of Fibers - the RSS block Fiber, inside the BigPipe Fiber, inside the DrupalKernel::handleRequest() Fiber.

Because the RSS block is both executing the Fibers and also controls the code that is suspending those Fibers, all the calls should stay within that context this time - i.e. it's a closed loop of Fiber execution and suspension. It will request all the RSS feeds, eventually get all the RSS feeds, then finish. Then BigPipe will move onto the next placeholder and it has no idea that all of this has happened.

So it should all work, but there are various considerations:

1. Code that is executing Fibers and doesn't control all the the call chain below it, needs to make sure that no matter how many times the fiber is suspended, that it eventually gets resumed (if it cares about it finishing at all).

while ($fiber::isSuspended())
$fiber->resume()
}

2. Code that is suspending a Fiber needs to make sure it will eventually stop suspending, - i.e. we need to be very careful in lock wait that we don't end up in an infinite loop.

(these following two are specific to the implementation here

3. Code that is suspending a fibre after a cache miss in the hope that something else will build it, needs to check the cache again when it's resumed, so it actually benefits when it's been built, and doesn't just build it again regardless.

4. Implementations of PrewarmableInterface need to make sure they're safe whether routing has happened or not, whether there's a session or not, whether there's a request or not, because these especially have no idea where they'll get called from - but this is not particularly specific to Fibers at all but to the general problem of trying to find something to keep busy with at any arbitrary point in the request.

catch’s picture

StatusFileSize
new7.56 KB
new24 KB

Adding element info and all of the views plugin managers. They do not take long individually, 2-3ms each, although it would be longer with more modules.

: views:sort
set: views:sort time: 2.6

get: views:filter
set: views:filter time: 2.54

get: views_data:views:en
set: views_data:views:en time: 0.79
get: views:area
set: views:area time: 1.75

get: views_data:node:en
set: views_data:node:en time: 0.8
get: views:pager
set: views:pager time: 1.54
get: views:query
set: views:query time: 2.46

get: views:style
set: views:style time: 2.4

However cumulatively, as noted above, views_data is about 300ms with the standard profile and they are all nested inside there. So by taking them out, we're dividing up a 300ms-to-build cache item into over a dozen independently prewarmable items. We've already split out 100ms from entity_field_map, this might be another 50ms or so.

More advantages to this:

1. The more prewarmable things that the prewarmer can potentially call out to, the less chance of duplicates.
2. The more heterogeneity of prewarmable things, the more variation in how long they each take to build there is, and the more likelihood that requests stop trying to do the same thing at once. So adding quick-to-build caches alongside some slow-to-build ones will mix things up a bit.

catch’s picture

Another one: ViewsData itself (calling ::getALL()), however this is a tricky one because it stores $this->currentLanguage() in state. So we'd need to refactor that a bit to make it pre-routing safe. One option would be to not assign the language to a property, but get it from the language manager each time.

Also in case it comes up, it doesn't matter if we tag one thing as cache_prewarmable, then we also tag something it calls as cache_prewarmable - the main thing here is building these caches out of the regular execution order.

catch’s picture

StatusFileSize
new850 bytes
new24.83 KB

Fixing some of those test failures.

catch’s picture

Started looking at ViewsData and realised it shouldn't be caching by current language at all. Not a massive issue unless you have several languages installed, but still a waste, opened #3380145: ViewsData should not cache by language. Once that lands it'd be a very straightforward addition here.

catch’s picture

Issue summary: View changes
catch’s picture

Issue summary: View changes

Trying to add a visual (ASCII-diagram) representation of how this affects performance during stampedes.

catch’s picture

Issue summary: View changes
catch’s picture

Issue summary: View changes
catch’s picture

Issue summary: View changes
catch’s picture

Issue summary: View changes
catch’s picture

StatusFileSize
new614 bytes
new25.15 KB

Fixing the unused use statement.

Writing up the issue summary more, another thought occurred. So far I could see drush implementing cache prewarming as a post drush cr task - but drush cr is only one of the ways that caches get cleared, a very common one is running updates, installing a module via the UI etc.

So what if we did the following:

1. drupal_flush_all_caches() calls CachePrewarmer::setCachesCleared() - this just changes a boolean property on CachePreWarmer to TRUE
2. CachePreWarmer registers itself as a post-response task
3. If the boolean is set, instead of trying to randomly prewarm one thing, in this listener it tries to prewarm everything (still at random).
4. Because post response is properly non-blocking after #3295790: Post-response task running (destructable services) are actually blocking; add test coverage and warn for common misconfiguration, the next page will still get requested as quickly as before, but now with whatever caches warmed that the post response task has managed to warm up for it before it requests them.

This would speed up a lot of local development tasks, possibly also user experience in the installer, if it works.

It should a small amount of code to write this, then some manual testing to see if it works (and also see whether it causes DrupalCI to blow up).

Status: Needs review » Needs work

The last submitted patch, 54: 3257725.patch, failed testing. View results

borisson_’s picture

Status: Needs work » Reviewed & tested by the community

That is a known random, back to rtbc.

catch’s picture

Status: Reviewed & tested by the community » Needs review

RTBC would be nice but it was at needs review ;)

catch’s picture

Issue summary: View changes
Issue tags: +Needs manual testing, +Needs change record
catch’s picture

StatusFileSize
new27.6 KB

This implements DestructableInterface and adds a couple of extra methods.

neclimdul’s picture

+++ b/core/lib/Drupal/Core/Render/ElementInfoManager.php
@@ -82,6 +83,13 @@ public function __construct(\Traversable $namespaces, CacheBackendInterface $cac
+  public function preWarm(): void {
+    $this->getDefinitions();

+++ b/core/modules/views/src/Plugin/ViewsHandlerManager.php
@@ -133,4 +134,11 @@ public function getFallbackPluginId($plugin_id, array $configuration = []) {
+  public function preWarm(): void {
+    $this->getDefinitions();

+++ b/core/modules/views/src/Plugin/ViewsPluginManager.php
@@ -41,4 +42,11 @@ public function __construct($type, \Traversable $namespaces, CacheBackendInterfa
+  public function preWarm(): void {
+    $this->getDefinitions();

I see a trait forming :-D

Also, probably not RTBC with the text dump in the database backend...

g089h515r806’s picture

Great work.

I have a small question:

+      if (in_array($this->bin, ['cache_bootstrap', 'cache_data', 'cache_default', 'cache_discovery'])) {
+        file_put_contents('/tmp/cache.txt', 'get: ' . $cid . "\n", FILE_APPEND);
+      }
+      Timer::start($cid);

For '/tmp/cache.txt' file directory.
Does it support windows OS? Do we need support Windows OS at here?

neclimdul’s picture

The point is kinda moot because that is there to debug cache triggering. See the last line of my previous comment.

That said for reference both yes and no. The directory separators likely would work. The bigger problem would be the existence and writ-ability of /tmp which assumes a UNIX FHS that would probably fail. If this was intended for real usage we'd use \Drupal::service('file_system')->getTempDirectory();.

catch’s picture

StatusFileSize
new1.26 KB
new38.67 KB
new43.48 KB
new26.59 KB

Did some testing with the destructable changes to make sure it's doing what it's supposed to.

I applied 'debug.txt' - it's similar to the cache logging debug that keeps creeping into the patch above , but adds $_SERVER['REQUEST_METHOD'] and skips the timer.

Then with Umami installed, I submitted the 'clear all caches' button on admin/config/development/performance (after deleting cache.txt to make sure it starts empty).

Two resulting file attached - before.txt is with the debug and HEAD, after.txt is with the latest patch applied.

In before.txt, you can see that the POST request already builds some caches - this is from route rebuilding mainly I think, then there's a hard switch to the GET request which then handles all the cache misses from then on.

In after.txt, it is working! The POST requests is still setting cache items after the GET request starts.

Then if you find something like 'entity_field_map', in before.txt it only appears as a cache get/set in the GET request.

In after.txt', the get/set is in the POST request and the the GET request only has a cache get - so it's successfully prewarmed before the GET request needed it.

I also think this validates randomising the services - if we tried to order things by the earliest-needed-service or similar it might not be quick enough to beat other requests coming in that also need that service quickly, but if we build something that takes 100ms later in the request, then it still saves that time when everything else gets there. We also couldn't guarantee order anyway because it depends on the site.

I did some quick experimentation with ab, and at least on my machine, at anything more than ab -c2 -n2, I start getting page cache hits back. So the idea about hitting it with -c10 -n10 to get a proper stampede situation won't work with unmodified core. We could probably do something like artificially extend the time it takes to build entity_field_map or similar by adding usleep(500) on a cache miss, didn't try that yet.

Re-uploading the #59 patch just without the cruft in it too.

Here's a snapshot of after.txt, you can see the mix of POST and GET which each cache set and get demonstrating they're happening in parallel.

POST set: entity_bundle_field_definitions:node:article:en
POST set: last_write_timestamp_cache_discovery
POST get: entity_bundle_field_definitions:node:page:en
POST set: entity_bundle_field_definitions:node:page:en
POST set: last_write_timestamp_cache_discovery
POST get: entity_bundle_field_definitions:node:recipe:en
POST set: entity_bundle_field_definitions:node:recipe:en
POST set: last_write_timestamp_cache_discovery
POST set: views_data:en
GET get: entity_type
GET get: core.extension.list.theme
POST set: views_data:block_content:en
POST get: views_data:content_moderation_state:en
GET get: theme.active_theme.claro
POST set: views_data:content_moderation_state:en
POST get: views_data:file_managed:en
POST set: views_data:file_managed:en
GET get: element_info_build:claro
GET get: element_info
POST get: views_data:media:en
POST set: views_data:media:en
GET set: element_info
GET set: last_write_timestamp_cache_discovery
GET set: element_info_build:claro
GET set: last_write_timestamp_cache_discovery
GET get: route_provider.route_load:        

Unlike the early bootstrap stampede logic, this doesn't require fibers at all because there's no risk of recursion, so we could have done this years ago as soon as we added DestructableInterface.

The trait is a good idea, will look at adding that with next round of changes. It'd be usable in contrib too.

catch credited Fabianx.

catch’s picture

Status: Needs review » Needs work
Issue tags: -Needs manual testing

Marked #2496933: Centralize/optimize stampede protection/locking (aka work while we sleep) as duplicate and crediting Fabianx - much of this is the same as what we came up with eight years ago!

Untagging for needs manual testing since #63 demonstrates this working.

  1. +++ b/core/lib/Drupal/Core/Lock/LockBackendAbstract.php
    @@ -45,16 +45,27 @@ public function wait($name, $delay = 30) {
    +      // Check if we're executing inside a Fiber. If so, before sleeping,
    +      // suspend the fiber in case some other code can run in the meantime. By
    +      // the time that code has finished running, the lock may already be
    +      // available.
    +      // @see Drupal\Core\Prewarm\CachePrewarmer
    +      if (\Fiber::getCurrent() !== NULL) {
    +        \Fiber::suspend();
    +      }
    +      if ($this->lockMayBeAvailable($name)) {
    +        // No longer need to wait.
    +        return FALSE;
    +      }
    +      // If the lock is still not available, it's possible that the parent
    +      // process immediately resumed the Fiber we're running in, so sleep
    +      // to avoid a lock stampede.
    

    Self review - do we need to calculate how long we were suspended for and then do $delay - $time_suspended here?

  2. +++ b/core/lib/Drupal/Core/PreWarm/CachePreWarmer.php
    @@ -0,0 +1,88 @@
    +    $candidates = $this->serviceIds;
    +    while ($candidates) {
    +      $key = array_rand($candidates);
    +      unset($candidates[$key]);
    

    We should move the randomization inline comment from preWarmOneCache() to the class overview so it covers both methods.

  3. +++ b/core/lib/Drupal/Core/PreWarm/PreWarmableInterface.php
    @@ -0,0 +1,32 @@
    +/**
    + * Interface for cache prewarmers.
    + *
    

    This is wrong, it's an interface for services with prewarmable caches.

  4. +++ b/core/modules/views/src/Plugin/ViewsHandlerManager.php
    @@ -133,4 +134,11 @@ public function getFallbackPluginId($plugin_id, array $configuration = []) {
     
    +  /**
    +   * {@inheritdoc}
    +   */
    +  public function preWarm(): void {
    +    $this->getDefinitions();
    +  }
    +
    

    Still need the trait.

catch’s picture

Status: Needs work » Needs review
StatusFileSize
new26.91 KB

This should address everything except #65.1

We might be able to address #65.1 by using the Timer class, but it adds a dependency to the trait.

catch’s picture

StatusFileSize
new26.95 KB

Making cspell happy.

Status: Needs review » Needs work

The last submitted patch, 67: 3257725-67.patch, failed testing. View results

catch’s picture

Status: Needs work » Needs review
Issue tags: +Needs architectural review

Random test failure.

#65.1 I am still thinking about and might have an idea, but that's not related to the new API here, so reviews of everything else would be good.

smustgrave’s picture

Status: Needs review » Needs work

Just RTBC #3377570: Add PHP Fibers support to BigPipe which led me here. Think it would be nice to get this in before 10.2 launches. Can the change record be written? Then could mark RTBC.

bradjones1’s picture

Is a CR the only thing that remains here? @catch marked this needs architectural review?

catch’s picture

Status: Needs work » Needs review
Issue tags: -Needs change record

Added a change record. Yeah I think given this adds an API it could use some review of whether the API is OK, @neclimdul started this but hasn't confirmed he's happy with it yet.

smustgrave’s picture

Status: Needs review » Reviewed & tested by the community

Have pinged @ neclimdul on slack, but to keep the issue from stalling going to mark. CR reads well, added 2 code blocks to just show some examples.

larowlan’s picture

Status: Reviewed & tested by the community » Needs work
Issue tags: -Needs architectural review

This looks great, thanks for all the detailed information in the issue summary and comments.
Took a while to read, but I learnt a lot, very appreciated.

Looking at the patch, its a lot simpler than I was expecting after reading the issue, so that is always a great sign.

I've really only got some style suggestions.

I'm going to remove the tag.

  1. +++ b/core/lib/Drupal/Core/DrupalKernel.php
    @@ -701,7 +701,20 @@ public function handle(Request $request, $type = self::MAIN_REQUEST, $catch = TR
    +      $fiber = new \Fiber(function () use ($request, $type, $catch) {
    +        return $this->getHttpKernel()->handle($request, $type, $catch);
    ...
    +      $fiber->start();
    

    we can use an arrow function here to avoid the need for use

    
    $fiber = new \Fiber(fn () => $this->getHttpKernel()->handle($request, $type, $catch));
    
    
  2. +++ b/core/lib/Drupal/Core/PreWarm/CachePreWarmer.php
    @@ -0,0 +1,106 @@
    +    while ($candidates) {
    +      $key = array_rand($candidates);
    

    micro-optimisation, but if we used shuffle instead, we'd not need to call array_rand in each iteration

    $candidates = $this->serviceIds;
    shuffle($candidates);
    foreach ($candidates as $serviceId) {
      $service = $this->classResolver->getInstanceFromDefinitions($serviceId);
      $service->preWarm();
    }
    
    
  3. +++ b/core/lib/Drupal/Core/PreWarm/CachePreWarmerInterface.php
    @@ -0,0 +1,72 @@
    + * to be served faster, we want to divide up different cache building between
    

    %s/we want to divide up different cache building between/we want to divide up cache building between different
    ?

  4. +++ b/core/lib/Drupal/Core/PreWarm/CachePreWarmerInterface.php
    @@ -0,0 +1,72 @@
    + * where any service can define itself as prewarmable with a common method to
    

    nit: I think we can split this with a full stop after PreWarmableInterface then Any and remove the where

  5. +++ b/core/lib/Drupal/Core/PreWarm/CachePreWarmerInterface.php
    @@ -0,0 +1,72 @@
    +  public function preWarmOneCache();
    ...
    +  public function preWarmAllCaches();
    

    we should probably add the :void here too

  6. +++ b/core/lib/Drupal/Core/Routing/RouteProvider.php
    @@ -225,14 +225,38 @@ public function preLoadRoutes($names) {
    +        if (\Fiber::getCurrent() !== NULL) {
    +          \Fiber::suspend();
    +        }
    +        // Check for the cache item again in case it was set while we
    +        // were suspended.
    

    we're getting some pretty deep nesting here.

    The only thing that happens outside the if code is

    $this->serializedRoutes += $routes;
    So perhaps we can bring that into some of the if clauses, return early and avoid the nesting?

catch’s picture

Status: Needs work » Needs review

Moved to an MR - it also needed a re-roll for dictionary.txt changes.

Review points all look good so have incorporated into the MR.

smustgrave’s picture

Status: Needs review » Needs work

Have not reviewed but MR seems to have some test failures.

catch’s picture

Status: Needs work » Needs review
smustgrave’s picture

Status: Needs review » Reviewed & tested by the community

Appears points in #74 have been addressed.

alexpott’s picture

Status: Reviewed & tested by the community » Needs work

I've added some questions / points to the MR to address.

catch’s picture

Status: Needs work » Needs review

Moved the post-response stuff to a follow-up, addressed some points, answered others.

catch’s picture

Title: Add a cache prewarm API and use it to distribute cache rebuids after cache clears / during stampedes » Add a cache prewarm API and use it to distribute cache rebuilds after cache clears / during stampedes
smustgrave’s picture

Status: Needs review » Needs work

Seems to have 1 thread open.

catch’s picture

Status: Needs work » Needs review

Removed the rebase/merge cruft.

smustgrave’s picture

Status: Needs review » Needs work

Was going to check the threads but seems to have test failures. Reran too.

catch’s picture

Title: Add a cache prewarm API and use it to distribute cache rebuilds after cache clears / during stampedes » [PP-1] Add a cache prewarm API and use it to distribute cache rebuilds after cache clears / during stampedes

Just seen @donquixote's feedback four months later, wish d.o would update issues when there are comments on merge requests, but I guess we just need the gitlab issues migration to happen.

I've repled to a couple, also going to postpone this on #3394423: Adopt the Revolt event loop for async task orchestration - it is not really hard-blocked, but if we're going to do that, it makes sense to build this on top of that API and I know @KingDutch is already looking at doing that.

Kingdutch made their first commit to this issue’s fork.

catch’s picture

Title: [PP-1] Add a cache prewarm API and use it to distribute cache rebuilds after cache clears / during stampedes » Add a cache prewarm API and use it to distribute cache rebuilds after cache clears / during stampedes

I've revived the original MR here but without any Fibers or revolt support - just adding the prewarm API itself and the implementations.

I think we should then move the revolt implementation into a new issue - it would depend on this issue and the revolt library, but be a much smaller MR to review.

Applied a couple of suggestions on the MR but haven't addressed all the feedback yet.

catch’s picture

Title: Add a cache prewarm API and use it to distribute cache rebuilds after cache clears / during stampedes » Add a cache prewarm API
catch’s picture

Issue summary: View changes

I've opened #3483239: Use revolt to prewarm caches during lock waits for the revolt implementation. And rebased the original MR here to neither use raw Fibers nor revolt - it now just adds the basic API and prewarm implementations without trying to integrate with anything. Hoping that will make each issue a bit easier to review.

catch’s picture

Another reason things to split things here - we may want to add the new interface, and maybe the implementations, to 10.4, so that contrib can implement its own prewarming with the interface in place, but we might not want to backport all of the actual async work to 10.4 given the main consumer will be core for quite a while.

bradjones1’s picture

Issue summary: View changes

I'm a little confused about your note in #92 about fibers - the request handling is still wrapped in a Fiber, but this is only to provide some back-up functionality if there is a suspended fiber in the code path? You mentioned removing Fiber-related code so just trying to figure out where this landed.

catch’s picture

re #93, that was an oversight., removed that hunk too - similar logic will be handled by the revolt event loop issue.

catch’s picture

Status: Needs work » Needs review

Tried to address all the feedback on the MR and added a kernel test.

catch changed the visibility of the branch 3257725-revolt-cache-prewarm to hidden.

kristiaanvandeneynde’s picture

MR generally looks good to me, but after reading the interface docs, I'm still not sure if people will fully know what to do when they need prewarmable dependencies in their prewarmable service. Maybe a code example could help here?

E.g. Service A is required by service B, both can be prewarmed. Due to the randomization, service B is requested first. You mention Fibers in the docs, perhaps showing people how to properly call for A from B using Fibers could be a good code example?

catch’s picture

E.g. Service A is required by service B, both can be prewarmed. Due to the randomization, service B is requested first.

In this case, service A would call service B, and warm the caches of both - unless another request has already warmed the cache for service B before it gets there, in which case it will get a cache hit. There's nothing special to do.

Fibers won't help here unless something was happening async, but even then it's not strictly relevant to the double-cache-warm situation.

Not sure how best to reflect this in the docs.

edit: also removed the Fibers mention because that will all be handled in the Revolt issue, took all the calling of the API out of this MR to make it more reviewable but forgot to remove that paragraph.

needs-review-queue-bot’s picture

Status: Needs review » Needs work
StatusFileSize
new90 bytes

The Needs Review Queue Bot tested this issue. It no longer applies to Drupal core. Therefore, this issue status is now "Needs work".

This does not mean that the patch necessarily needs to be re-rolled or the MR rebased. Read the Issue Summary, the issue tags and the latest discussion here to determine what needs to be done.

Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.

catch’s picture

Status: Needs work » Needs review

Merged in 11.x (rebase was not happy at all).

kristiaanvandeneynde’s picture

To me this is good to go but let's give it 48 more hours for @johnwebdev to reply? I can see their point that randomizing the services in the constructor means we cannot know what the order is from within a test. So, for instance, we can't test whether the preWarmOneCache() method really grabs the last one.

The question is, do we have to?

gapple’s picture

I think the order the provided services are warmed is an internal detail that shouldn't be tested, just that one or all are warmed as expected, and no caches are warmed twice.

kristiaanvandeneynde’s picture

Running tests again as something seems to have gone wrong. If they go green I'm fine to RTBC. The final two threads on the MR are more discussion rather than requested changes.

kristiaanvandeneynde’s picture

Status: Needs review » Needs work

1) /builds/issue/drupal-3257725/core/tests/Drupal/Tests/Core/PreWarm/CachePreWarmerTest.php:59
Creation of dynamic property MockObject_PreWarmableInterface_db43a4f0::$warmed is deprecated

catch’s picture

Can't think of a way to workaround the dynamic property issue when using a mock of the interface, but we could define a real class and use that in the test instead?

kristiaanvandeneynde’s picture

Agreed, a test class with a public property would work here.

gapple’s picture

Status: Needs work » Needs review

I changed to use SplObjectStorage for tracking calls to each mock service.

needs-review-queue-bot’s picture

Status: Needs review » Needs work
StatusFileSize
new90 bytes

The Needs Review Queue Bot tested this issue. It no longer applies to Drupal core. Therefore, this issue status is now "Needs work".

This does not mean that the patch necessarily needs to be re-rolled or the MR rebased. Read the Issue Summary, the issue tags and the latest discussion here to determine what needs to be done.

Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.

gapple’s picture

Status: Needs work » Needs review

Merged 11.x

kristiaanvandeneynde’s picture

Seems to fail on random stuff, running tests again.

alexpott’s picture

Seems to fail on random stuff, running tests again.

We have to be careful on this one because doesn't this introduce some randomness into how caches are rebuilt?

catch’s picture

We have to be careful on this one because doesn't this introduce some randomness into how caches are rebuilt?

It will eventually, but the current MR only adds the API and not any actual cache prewarming anywhere.

#3483239: Use revolt to prewarm caches during lock waits is where all the indeterminacy will come in. #3392035: Add post-response cache prewarming after full cache clears would also add some.

daffie’s picture

Status: Needs review » Needs work

I have read the IS and most of the comments on this issue.
We are only adding the CachePreWarming service, but we are not going to use it.
The solution in the MR looks like the right one. Only as we are not going to use it we cannot be sure that it is the right solution.
When we are going to use the service we might find that this solution is not the right one and we need to change it.
Therefore it would be sensible to mark the new service as @internal. Changing the status to NW for this.

Everything else is to me RTBC.

catch’s picture

Status: Needs work » Needs review

Earlier iterations of this MR did have usages of the new system but I split them out to their own issues to try to keep this one reviewable.

I've added @internal to both the service class and CachePreWarmerInterface.

kristiaanvandeneynde’s picture

Status: Needs review » Reviewed & tested by the community

LGTM and I agree with @daffie that we want to preserve the right to make changes as we get more metrics. Test changes also make sense.

alexpott’s picture

Status: Reviewed & tested by the community » Fixed

Committed 9525aa6 and pushed to 11.x. Thanks!

  • alexpott committed 9525aa6d on 11.x
    Issue #3257725 by catch, gapple, andypost, kingdutch, alexpott,...
chi’s picture

It worth noting that prewarming data cached in APCu backend does not make mush sense.

catch’s picture

@chi almost nothing is stored purely in the apcu backend, it is mostly in chained fast which combines apcu and a consistent backend (database, redis, memcache). The one exception is the file cache but that is a level below all the things that can be prewarmed as a result of this MR.

chi’s picture

Right. That note was mostly for custom projects that override cache backends like follows.

$settings['cache']['bins']['bootstrap'] = 'cache.backend.apcu';
$settings['cache']['bins']['config'] = 'cache.backend.apcu';
$settings['cache']['bins']['container'] = 'cache.backend.apcu';
$settings['cache']['bins']['discovery'] = 'cache.backend.apcu';
$settings['cache']['bins']['menu'] = 'cache.backend.apcu';
$settings['cache']['bins']['toolbar'] = 'cache.backend.apcu';
gapple’s picture

Would prewarming eventually be a replacement for hook_rebuild() when it's no longer @internal, or will they have different use cases?

catch’s picture

@gapple if hook_rebuild() is being used for pure cache pre-warming (e.g. something that will still be populated on a cache miss) then yes it could switch to this.

The differences will be the drush parallel cache warming, and if there's a stampede, the ability to warm different caches at random during lock wait or while waiting for async database queries to come back. Whereas hook_rebuild() has to go linerlarly through whatever is put in it.

None of these are implemented yet but see related issues.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.