Problem/Motivation
menu_rebuild() / Router::rebuild() has additional race conditions which aren't accounted for by the lock.
1. Processes that fail to acquire a lock, have the menu_rebuild_needed variable / state and no menu masks in memory. If they subsequently call menu_get_item() / any function on the router that checks for rebuild, that will trigger yet another menu_rebuild() / Router::rebuild().
2. Additionally, there's a window between variable_initialize() / state() retrieval and menu_get_item() / Router::[public] where another process may have fully completed a menu_rebuild() / Router::rebuild() and freed the lock. In this case we don't need to rebuild the menu, but we do anyway.
#1 is a serious race condition - it can cause menu_rebuild() / Router::rebuild() to occur over and over again - as one process acquires the lock, other processes lock_wait() / Lock::wait(), then rebuild, then acquire the lock, then more process lock_wait() / Lock::wait() etc. - proper lock stampede.
#2 is more of an optimization to make the current stampede protection more robust.
Both problems exist in 8.x too despite the code having changed a lot. They have been shown in above code with the / for D8 as the alternative implementation.
Its really just the naming that has changed.
Proposed resolution
Move the logic away from a lazy-rebuild model into a explicit rebuild model, which means, there is no global (across requests) flag, to trigger a rebuild.
Instead the router is marked as to be rebuild only in the current process. At the end of the request ($kernel->terminate()) the rebuild is triggered.
That way we have an important performance improvement (or at least avoiding regressions) with this patch:
- No more lock stampede where multiple requests try to rebuild routes (this reduces problems of things like #941970: Only set router rebuild needed when something related to routing actually changes
This leads to the following additional results:
- A Drupal standard installation will be 30% slower since multiple route rebuilds are triggered that are not necessary. See the follow-up #2451665: Don't rebuild the route on ModuleInstaller::install() (30% installer speedup) to address this.
Remaining tasks
(done #203) Agree that the performance improvements are worth it.
(todo) #211 @alexpott "Looks like the following CR will need to be revised https://www.drupal.org/node/2185947 - I think once this lands it should just point to the new CR..."
User interface changes
none.
API changes
RouteBuilderIndicator + Interface + service is removed.
Original report
Everytime menu_rebuild() is called, we get a crazy amount of waits on INSERT INTO menu_router and a bunch of DELETE FROM menu_router. I believe there is some sort of race condition for rebuilding menus. The only way I can kill it is to kill off a bunch of processes. I would imagine that setting a lock for the first user to rebuild that table would be a simple enough fix, but I wanted to open a ticket and see if anyone else had come up with a solution first.
Please help, this is crashing our site: http://www.divx.com/ every time we update a menu or clear the cache.
Thanks!
Comment | File | Size | Author |
---|---|---|---|
#228 | optimize_the_route-356399-228.patch | 67.97 KB | sidharrell |
#219 | interdiff.txt | 10.32 KB | klausi |
#218 | 356399.patch | 68.05 KB | klausi |
#204 | 356399-204.patch | 59.34 KB | dawehner |
#197 | 356399.patch | 64.75 KB | klausi |
Comments
Comment #1
slantview CreditAttribution: slantview commentedSo here is what I have been able to discover so far
menu_rebuild is called.
menu_rebuild clears the cache. Now our 400+ requests per second all decide they need to rebuild the menu_router table since they couldn't pull the menu from cache.
everybody gets in on the party at line 2199 and starts to rebuild.
at line 2288 they all DELETE FROM {menu_router} simultaneously and now menu_router is empty... sort of.
whoever is first starts to INSERT around line 2370 and now we have DELETE FROM and INSERT INTO happening at the same time.
the end result is 600-700 connections stuck in the db trying to figure out how to serve their request and a bunch of waiting processes on the web servers.
now we stack up the web requests until our load balancer starts dumping them into our sorry server and everybody gets our beautiful site maintenance page.
so my theory is that we need to lock before we rebuild and make sure we don't clear cache until the initial menu is built form whoever has the lock so that we can keep serving the stale menu cache until the menu is done rebuilding.
i'm working on code for this right now.
anyone else got an idea?
Comment #2
slantview CreditAttribution: slantview commentedAlso, there is another ticket at #238760: menu_router table truncated and site goes down that claims that this is fixed in Drupal 6.3, but we are on Drupal 6.8 and this is still broken.
Comment #3
Damien Tournoud CreditAttribution: Damien Tournoud commentedThis have been identified a long time ago, and I believe this is one of the biggest scalability issue in Drupal 6 at this time. The solution is indeed to implement a locking framework, as I suggested long ago in #251792: Implement a locking framework for long operations.
Comment #4
pwolanin CreditAttribution: pwolanin commentedIs this really a dup? It sounds like a bug related to the fix we put in for the cases where the rebuild failed. This does indeed seem pretty critical. We also know that using drupal variable is broken (also subject to race conditions) in terms of a locking/semaphore.
Comment #5
David StraussSubscribe.
Comment #6
robertDouglass CreditAttribution: robertDouglass commentedSubscribe.
Comment #7
slantview CreditAttribution: slantview commentedThe fix for divx.com was to switch the menu_router table to MyISAM and to use a locking mechanism for this rebuild. I believe that this issue is dependent on #251792: Implement a locking framework for long operations. I don't see how you can really fix this issue without it.
Comment #8
donquixote CreditAttribution: donquixote commentedI really don't see the need for the ridiculously high number of queries happening during one menu_rebuild.
I would propose to rewrite the menu_rebuild function, see
Rewrite the function menu_rebuild
Basic idea:
- One big query to load the menu_links and menu_router tables into php arrays.
- Do some computations.
- Write the changes to the DB, if any.
EDIT:
That's too simple, because menu_links can be too big for memory. The trick is to load only a part of menu_links into memory, then it should work.
Comment #9
jmpalomar CreditAttribution: jmpalomar commentedSubscribe.
Comment #10
pwolanin CreditAttribution: pwolanin commentedThese are the 2 relevant issues I think would help with the problems:
#251792: Implement a locking framework for long operations
#193366: execute various rebuilds when modules are submitted
Please contribute to those.
Comment #11
catchDuplicate of #512962: Optimize menu_router_build() / _menu_router_save().
Comment #12
Josh Waihi CreditAttribution: Josh Waihi commentedI'm reopening this issue because it seems the "duplicate" has gone off track from what this issue was about and this is for Drupal 7 rather than Drupal 6.
I believe the core issue around menu_rebuild is its poor safeguards from rebuilding the menu when it doesn't need to. When menu_rebuild() is called it tries to acquire a lock. If it can't acquire the lock. It waits, assumes the menu has been rebuilt and returns. It misses completing two extra tasks. One is to set menu_rebuild_needed to FALSE in the local process (using $conf) to prevent menu_rebuild being called again without menu_rebuild_needed explicitly being set correctly. The second is to reload menu_masks which may have changed during the menu rebuild.
Failure to do either of these two things will force another menu_rebuild attempt on the same process if menu_get_item is called again. This can often be the case for taxonomy menus or rendering the output of l() among other things. It can lead to many attempts in a single process trying to rebuild the menu and each time, it may encounter a lock, wait, then continue. I've seen this utilize usleep for almost a minute in XHProf data. The more traffic, the more this race condition is experienced and the longer it takes for the lock to eventually free up.
This patch fixes this issue, though I can imagine there may be some contention around the workaround for menu_masks.
Comment #14
Josh Waihi CreditAttribution: Josh Waihi commentedLets hope this patch works.
Comment #15
mgiffordComment #17
mgiffordIs this also going to be a problem in Drupal 8?
Whats the best way to test this new patch?
Code looks simple enough and well documented.
Comment #18
newtoidJust had a previously patched and working site lock up and stop. Drupal 7.31.
Comment #19
mgifford@newtoid - Were you using the patch in #14?
In which case, this patch may not be sufficient....
Comment #20
newtoid@mgifford Thanks for the reply. Yes that was the patch... Is there more?
Comment #21
Fabianx CreditAttribution: Fabianx commented+1 Lovely find for a menu_rebuild_needed stampede.
This can take a site down ...
Comment #22
catchWe may have the same problem in 8.x yes.
If the state value has already been checked during the request, this won't reset it, so another call to rebuildIfNeeded() would attempt a rebuild again.
Caching was added to the state system in #1786490: Add caching to the state system.
Menu masks also got ported to 8.x, but that is now an internal implementation detail of the database router, so not sure the best way to reset it. Easiest would be to re-init the state system if we get to this condition.
As Josh anticipated I'm not sure about the workaround for menu_masks in the 7.x patch - what about forcing a variable_init() again instead?
Comment #23
tim.plunkettSee also #2230091: Route rebuilding is not guaranteed to finish in time for the next request
Comment #24
Fabianx CreditAttribution: Fabianx commentedThere is another problem here and here it gets tricky:
Under certain conditions e.g. using memcache or a MEMORY table for semaphore, the lock can be released _before_ the transaction is actually committed.
So what should happen is:
This pseudo-code should take care of all cases I can think of.
Comment #25
catchUpdated patch based on #24, hoping the comments are descriptive enough about the various cases.
Moving to 7.x temporarily for the bot.
Comment #26
catchComment #30
catchgrumble grumble... sorry for the noise.
Comment #32
catchComment #33
catchOK cleaned up now it's green, apologies again for the noise.
Comment #34
Fabianx CreditAttribution: Fabianx commentedThese two code snippets look almost the same:
Would it be possible to consolidate in a helper:
_menu_try_rebuild() or _menu_check_rebuild()
Even though comments would need to be a little more generic and a choice be taken regarding isset() vs. empty(), which would probably be good anyway.
Comment #35
catchUpdated the issue summary and a bit more cleanup.
Comment #36
catchWith the helper suggested in #34.
Comment #37
catchMissed an underscore.
Comment #38
Fabianx CreditAttribution: Fabianx commentedCan we reverse logic?
Feels strange ...
Comment is misleading ...
We still need to return FALSE here after the check.
Comment #41
catchComment #42
catchFixed the comment pointed out in #38 too.
Comment #43
Fabianx CreditAttribution: Fabianx commentednit - I don't think we need to remove this comments.
They are still valid.
nit - I don't think we need to remove whitespace here.
nit - This is left over cruft.
We are at nits, this is RTBC for 7.x in general.
Comment #44
catchFixed the nits.
Comment #45
Fabianx CreditAttribution: Fabianx commentedRTBC once tests pass!
Then we need to move to 8.x, get work there done and in a few weeks / months can finally get this into D7.
I really wished the backport policy would not apply to criticals ...
Comment #46
catchUp to 8.x.
Comment #47
catchComment #48
mikeytown2 CreditAttribution: mikeytown2 commentedLooks like in D8 this is in RouteBuilder::rebuild().
Comment #49
dawehnerIN D8 the logic is a bit different for the menu masks. We nowhere actually check that the state exists.
Comment #50
Fabianx CreditAttribution: Fabianx commentedThis now needs a reroll:
https://www.drupal.org/node/2352641
But should also be easier.
We need to always do two checks basically:
- 1. After the lock - DONE! :)
- 2. And then also in rebuildIfNeeded() we should check after the first fast state check if we really still need to rebuild ...
This can happen if a request comes in very late and the rebuild is almost finished, but not quite.
Comment #51
dawehner@Fabianx
Is there a reason you want to do that logic inside
rebuildIfNeeded()
and not inrebuild()
? Couldn't there be codewhich calls rebuild() itself?
Comment #53
Fabianx CreditAttribution: Fabianx commentedYes, there could be ...
should do the trick.
Comment #54
Fabianx CreditAttribution: Fabianx commentedShouldn't the state now use the new route.builder replacement state thingy, where you also do the setRebuildNeeded() part?
I mean: Instead of $this->state ...
Comment #55
nlisgo CreditAttribution: nlisgo commentedAttempting to restore a working patch.
Comment #56
nlisgo CreditAttribution: nlisgo commentedComment #57
nlisgo CreditAttribution: nlisgo commentedComment #58
nlisgo CreditAttribution: nlisgo commentedAdd method doc
Comment #59
Fabianx CreditAttribution: Fabianx commentedNow we only need #53 added and we are done :)
Thanks so much for your work!
Comment #60
nlisgo CreditAttribution: nlisgo commentedThe suggested fix in #53 causes all of the tests in RouteBuilderTest to fail.
Comment #61
nlisgo CreditAttribution: nlisgo commentedComment #63
Fabianx CreditAttribution: Fabianx commentedFascinating ...
That might be due to the state service resetting the flag before calling rebuild?
Will take a look at the patch in context!
Thanks for your work!
Comment #64
catchComment #65
Fabianx CreditAttribution: Fabianx commentedFrom all that I can see the patch should be correct. Weird ...
Comment #66
nlisgo CreditAttribution: nlisgo commentedMaybe the tests are broken :)
Comment #67
BerdirMy guess is this breaks manually called rebuild() calls. See the first failing test, CommentDefaultFormatterCacheTagsTest, calls rebuild() manually.
it is a kernel test, so I guess nobody triggers rebuildNeeded(), which means that it will not actually do anything.
This seems like a problem? If I manually, explicitly call rebuild(), then it should just do it and not do some checks if it's needed?
Comment #68
Fabianx CreditAttribution: Fabianx commented@Berdir: Doh, right!
Yeah, we need to move that logic to rebuildIfNeeded() instead.
Thanks!
Comment #69
BerdirMoved it, also did some renaming and documentation improvements. Still don't like the naming there, I'd rather have a specific method for this on the router builder indicator instead of a protected wrapper. And if posssible, avoid talking about cache, that seems like an implementation detail. Thoughts?
Comment #70
Fabianx CreditAttribution: Fabianx commentedMaybe move the isRebuildNeededUncached as isRebuildNeededNow or such to the Router builder indicator itself?
Comment #71
xjmSo the issue summary describes the D7 issue but not the issue in D8. @catch suggests in #22 that this is equally a problem in Drupal 8.
Comment #72
Fabianx CreditAttribution: Fabianx commentedComment #73
Fabianx CreditAttribution: Fabianx commentedI am putting to RTBC tentatively:
- The naming is not perfect
- The location might be better suited to have the isRebuildNeededCached() in the Route Builder indicator.
But besides that (which someone needs to take a decision on), this is ready.
So it would be great to get some core committer feedback.
Comment #74
catchI think it makes sense to move this to the route builder indicator, seems a better spot for it and looks like both Fabianx and Berdir feel like that as well.
Apart from isRouterRebuildNeededNow() I can only think of isRouterRebuildReallyNeeded() or isRouterRebuildStillNeeded() which seem OK for accuracy but the former is veering close to a tongue twister.
Comment #75
Fabianx CreditAttribution: Fabianx commentedCNW based on #74, lets move it to the route builder indicator and choose a good name.
Comment #76
Fabianx CreditAttribution: Fabianx commentedTagging novice - if someone wants to give that rather simple re-factor a try.
Comment #77
catchI'm leaning towards
isRouterRebuildStillNeeded()
a bit - what we're doing there is checking if the router rebuild is still needed since the last time it was checked.Comment #78
xjmComment #79
martin107 CreditAttribution: martin107 commentedLooking at the RouterBuilderIndicatorInterface
isRebuildNeed()
setRebuildDone()
it is clear the "rebuild" is acting on the Router
otherwise these should be isRouterRebuildNeeded and setRouterRebuildDone
So I have shortened the last suggestion to isRebuildStillNeeded()
and the function call will look like this
if ($this->routeBuilderIndicator->isRebuildStillNeeded()) {
Comment #80
martin107 CreditAttribution: martin107 commentedComment #81
dawehnerJust decided to move this critical a little bit further.
Ups, I worked on it in the meantime as well, but also fixed the unit test and made a small readability improvement:
This can be just
return $this->isRebuildNeeded()
I do agree with
, given that its more describing what is going on.
Another idea I had was
but I didn't like the non-boolean aspect of it.
Comment #82
martin107 CreditAttribution: martin107 commentedNovice tasks complete.
Comment #84
Fabianx CreditAttribution: Fabianx commentedI don't think we need to keep the helper function ...
Can use directly:
$this->state->resetCache();
Or do we still need this from the outside?
Isn't that twice the same test?
Comment #85
dawehnerThank you @Fabianx for the quick review!
1-3
You are absolutly right. This function is introduced by this patch, so its not called from anywhere in HEAD.
I can't think of a valid usecase, its a pure implementation detail that state has static caching, so in case someone
has that edge case, that someone could reset the static caching of
$state.
No its, not. Let's have a look at a bit more context:
So we call
rebuildIfNeeded()
3 times, with one time having the need for a real rebuild, and once not.Comment #86
BerdirThis looks a lot cleaner than previous patches, thanks!
Looks good to go to me, not 100% sure on the name and the documentation of the new method but have no better ideas, so let's get some feedback from @alexpott or @catch :)
Comment #87
catchI came up with isRebuildStillNeeded() so I'm fine with the name.
Patch looks good in general now.
Leaving RTBC - I shouldn't commit this one since I did a decent amount of work on the original 7.x patch here.
Comment #88
Dries CreditAttribution: Dries commentedThe code looks a bit funny:
That code path could effectively translate into:
Which calls into question why have and call a function whose result we won't believe. At a minimum, I think we need documentation as to why this is necessary. Ideally, we'd able to make the code more intuitive.
I also think the newly introduced recursion in
rebuild()
could be a bad implementation; if there are many processes, the recursion could really blow up the call stack with nested calls. If a process is never able to get the lock (possible at least in theory), it would keep recursing forever until the call stack runs out of memory. I'm used to seeing this kind of locking implemented using a while-loop to avoid the recursion.Comment #89
dawehner@berdir, @xjm, @dawehner discussed this issue and how do document the behaviour what is going on.
The problems mentioned in your last part of the review won't appear, given that its much easier to break with too many mysql
connections for example.
Assigning to @effulgentsia to review our new docs.
Comment #90
Anonymous (not verified) CreditAttribution: Anonymous commentedi don't like this patch.
- the state flag is written to outside the lock, so the process rebuilding the router info can unset the flag just after another process sets it, with undefined results for an undefined period
- the first thing we should try to do is make the rebuild less expensive. at a first approximation, caching the .yml info seems like a fairly obvious candidate
- we're still taking the likely most frequent case (readers) and mashing them against the lock system. seems like it would be better to make the writers fight for the lock, because they are (99% of the time) a) less frequent, and b) more expensive requests
Comment #91
catchDiscussed #90 a bit with beejeebus in irc.
Switching rebuilds to happen on write (and keeping the lock protection) seems like we should definitely try to do it for 8.x
I'm not sure it''ll be possible to backport that to 7.x - variable_set('menu_rebuild_needed') is a part of the API. However we have #44 running successfully in production on the site where we ran into this, and it's fine if we end up with two different solutions in two different versions.
Comment #92
Anonymous (not verified) CreditAttribution: Anonymous commentedhere's a rough patch that caches the yml stuff in getRouteDefinitions(), and makes the module and theme systems call ->rebuild() instead of setting a flag.
on my VM, i see RouteBuilder::getRouteDefinitions() go from ~70ms to ~15ms. would love to see someone else run this and insert some simple microtime measurements in getRouteDefinitions() and report back.
should i open a separate issue for caching the .yml stuff separate from the other changes?
Comment #93
effulgentsia CreditAttribution: effulgentsia commentedGiven #90-#92, unassigning myself for now. Please reassign if there's something you'd like my input on, though.
Comment #94
dawehner@beejeebus
Note: Its not only the parsing of the YML files, but we also have to dump the entries into the database,
and rebuild parts of the menu trees, as they become potentially invalid ...
We certainly need numbers to be able to judge something here.
Here is a an example run, which mostly had a router rebuild. Note: This is a crapy slow machine in general.
Out of those the event dispatcher dominates (... this is mostly spend in rebuild the menu links and rebuild the routes of views).
There is also quite some costs to dump the routes for itself (
Drupal\Core\Routing\MatcherDumper::dump
).Comment #95
dawehner.
Comment #96
Anonymous (not verified) CreditAttribution: Anonymous commentedok, lets see how much i broke.
this removes Drupal\Core\Routing\RouteBuilderIndicator and all usage of it, updates the RouteBuilder and ThemeManager unit tests.
Comment #98
Anonymous (not verified) CreditAttribution: Anonymous commentedok, this patch rips out all usage of rebuildIfNeeded, and replaces those calls with rebuild. mostly. there seem to be places where we do a dance around rebuildIfNeeded() where we can just remove it. i hope. see the changes in RouteProvider::getRouteCollectionForRequest().
looks like RouterRebuildSubscriber can just die, but i left it in for now.
Comment #100
Anonymous (not verified) CreditAttribution: Anonymous commentedthis patch fixes the install issues with #98, and replaces Drupal::service('router.builder_indicator')->setRebuildNeeded() with Drupal::service('router.builder')->rebuild().
lulz at menu_ui.install, that may be able to just die, left it as is for now.
Comment #105
Anonymous (not verified) CreditAttribution: Anonymous commentedoookay, removed a few ->rebuild() calls that happened on every request, and the RouterRebuildSubscriber().
also made sure to depend on system and create the router table in some of our franken-unit-tests, which were ostensibly doing route stuff without it.
Comment #107
Anonymous (not verified) CreditAttribution: Anonymous commentedremove references to RouterRebuildSubscriber(), and some leftover rubbish in MenuRouterRebuildSubscriber.
Comment #108
Anonymous (not verified) CreditAttribution: Anonymous commentedwhile i wait for that, i'm poking at the next bits that may be cacheable in RouteBuilder::rebuild().
the theory behind caching the yml is that (aside from dev mode) it shouldn't change outside of module install / uninstall / update, which has well known places to invalidate the cache.
once the yml parsing is replace by a cache->getMultiple(), the next thing that shows up as slow is this OO goop:
i have a perhaps-wishful-thinking hunch that this should have the same lifecycle. is that remotely right? can the return value of this change outside of module install / update / uninstall events?:
Comment #110
Anonymous (not verified) CreditAttribution: Anonymous commentedadded a bunch of boring test stuff to fix tests that faux-exercise the route rebuild code paths without creating a route table. lulz.
also fix missing use for RouteBuilder in ThemeController.
i think at least KernelTestBaseTest will still fail.
Comment #112
dawehnerIn general its important to keep in mind the original issue which basically introduced the indicator: #2164367: Rebuild router as few times as possible per request
Comment #113
catchI did a quick bit of archaeology on the menu_rebuild_needed variable, this race condition was rediscovered by Fabianx and me on a live Drupal 7 site recently so the Drupal 7 issue here is extremely real - it's not introduced by the 8.x listener. I've also seen the symptoms of it before in NewRelic without having actually tracked it down on other sites (since it always eventually self-repairs after a few minutes and various contrib modules rebuild menus sometimes).
The variable was originally added in #202955: Access denied after install - menu_rebuild() calls - this was to allow a menu rebuild to be triggered when in 'maintenance mode' (where you might not have a complete module list) so that it ran outside of maintenance mode. Not an optimization then, just a workaround.
Then field UI started to use it to optimize test runs since menu_rebuild() was getting triggered multiple times per request.
Then we've gradually added it elsewhere up until #2164367: Rebuild router as few times as possible per request which standardized that pattern.
So I think it'd be a good change to switch to setting a flag (in memory only, not state) that triggers a menu rebuild at the end of request, and that will satisfy not wanting to rebuild more than once per request. Where the rebuild has to happen mid-request, can just trigger that directly.
For Drupal 7 I think we'll have to leave the variable in, and probably use the 7.x version of this patch, but perhaps we could also set $conf['menu_rebuild_needed'] and add something in drupal_exit() looking for it in many cases that core uses this, to try to reduce the chances of a read stampede there as well.
Comment #114
Anonymous (not verified) CreditAttribution: Anonymous commentedi think it's worth adding back a flag to rebuild on shutdown.
we should be able to do that in a non-racy, per-process way.
Comment #115
Anonymous (not verified) CreditAttribution: Anonymous commentedthis should fix most of the fails due to the router table either not being there or already being there.
Comment #116
Anonymous (not verified) CreditAttribution: Anonymous commentedadded setRebuildNeeded() to route builder interface, made RouteBuilder implement destructable interface, call setRebuildNeeded() from non-module|theme places instead of rebuild.
should get us back to one rebuild per request most of the time.
Comment #119
Anonymous (not verified) CreditAttribution: Anonymous commentedthis one might be green.
Comment #120
Anonymous (not verified) CreditAttribution: Anonymous commentedComment #124
Anonymous (not verified) CreditAttribution: Anonymous commentedmkay, so views created/altered programmatically within the test parent process, then accessed in a child via a http request, throw errors. because that sort of code is racy and evil.
Comment #126
Anonymous (not verified) CreditAttribution: Anonymous commentedaaaaaand more rebuilds in tests.
Comment #127
Anonymous (not verified) CreditAttribution: Anonymous commentedok, now this is green, and has the 'do one rebuild for all non-module event' code in it, passing back to effulgentsia for a review.
Comment #128
dawehnerNote: one reason why we had that extra service is documented in https://www.drupal.org/node/2352641#comment-9236537 but this is no longer the case.
Note: router.builder would be a perfect candidate for #1973618: DIC: Lazy instantiation of service dependencies (ProxyManager for "proxy services") ... we don't want to load the lock router.dumper etc. ... on every non-page-cache request
It is sad that we can't just use the extension cache tag here.
Couldn't this result in a race condition, that something wants to rebuild, and so the local tasks end up with old tasks?
What about name it "getStaticRouteDefinitions" so its a little clearer, what is going on?
... just curious here ... don't we just need the router in every kernel test now?
Comment #129
Anonymous (not verified) CreditAttribution: Anonymous commented128.1 - should we add a note to that issue? or open a separate follow up?
128.2 - discussed this with dawehner in IRC. i'd like to keep the per-module clearing, because it will lead to better hit rates. however, i agree with him that it's kinda ugly - better implementation ideas most welcome.
128.3 - yes, but this is "normal". this patch brings the router builder cache back in line with most others - there's a window where a reader can get the "old" router information, just like every system all throughout Drupal with a cache. the "next" request, however, will get the new data. i don't think there's anything we need to do here - i think that's a better approach than blocking all requests and hammering the lock system.
128.4 - i like that suggestion, will put that in the next reroll.
128.5 - maybe? i'm not sure, to be honest i was just banging away at the frankentest stuff to make it go green.
Comment #130
Anonymous (not verified) CreditAttribution: Anonymous commenteddiscussed the caching yaml part of this with berdir, i think we can remove it and rely on the more general yaml caching here being added in #2395143: YAML parsing is very slow, cache it with FileCache.
Comment #131
dawehner@beejeebus
What should we do about this issue?
The patch before you got involved solved a problem, and especially would allow us to get a fix for that problem into Drupal 7.
We can easily open up a new issue discussing your ideas.
Comment #132
catchThe earlier version of the patch is still right for 7.x - we still have to support menu_rebuild_needed there whatever happens.
I'm fine with the end-of-request write in Drupal 8 though, definitely simplifies things. YAML caching definitely should not happen here though.
If necessary we could fork the issue between 7 and 8 since they'll be completely different patches.
Comment #133
Anonymous (not verified) CreditAttribution: Anonymous commentedsorry for blocking progress.
related, i still seem to have assign-issue karma when i shouldn't. i'm not in maintainers.txt, so that should be removed, thanks.
Comment #134
dawehnerDoes someone agree that we should go with the previous solution, which at least opens up solving the D7 problem as well
and then try out the ideas of @beejeebus in another issue?
Comment #135
catchI'd be tempted to split the issue - move this to 7.x (and note it's applied and working on the 7.x client site where we rediscovered this) then open another issue for beejeebus' approach for 8.x (still critical).
If it turns out the end of request rebuild doesn't work we've still got the 8.x patch from here, but looking through the commit history for the current situation we got here almost entirely by accident so stepping back from that seems good.
Comment #136
dawehnerLet's branch of the D7 bit: #2425259: Router rebuild lock_wait() condition can result in rebuild later in the request (race condition)
Comment #137
dawehnerSo for now this is just a reroll, but I have a couple of ideas how to improve things.
Note: One thing which should certianly try is to mark the RouteBuilder service as lazy. You should almost never
Comment #138
BerdirWrong patch, this is the system_path stuff :)
Comment #139
dawehnerHa, I was indeed wondering about the filesize for a quick moment, but then simply stopped complaining about it.
Comment #141
dawehnerLet's first get the patch green again.
Comment #143
BerdirThis could be problematic.
I recently got rid of a forced router rebuild after every module in my install profile (because there was an old call to it on default_content, it resulted in a 30% performance improvement: https://github.com/larowlan/default_content/pull/32#issuecomment-73381037). And that was *with* my ApcFileCache, that caches parsed yaml files. The expensive part is dynamic stuff like field_ui and views.
Yes, this only rebuilds at the end of the request if I understand correctly, but if you use the UI installer, that's still a lot of rebuilds.
If we use the ApcFileCache for the parsing then we can drop it from here as suggested by @catch and don't need manual cache clears.
I'm not sure why some places call setRebuildNeeded() and most now call rebuild().
That we need this in so many kernel tests is annoying, what exactly is triggering this?
Shouldn't a views save do this automatically if needed?
Hm, the method still exists here, left-over or do we really still have both?
Comment #144
dawehnerThis particular example is trigged by the rebuild in
ThemeHandler
Well, in HEAD we call
setRebuildNeeded()
inView::postSave() => views_invalidate_cache()
...The problem for these tests are probably now the one which lead to the architecture we have now. #2164367: Rebuild router as few times as possible per request
So for example in that case we save the view, which does not trigger the rebuild. In a normal request we potentially end up in a race condition that the terminate is not done yet, but another query is already executed in the test process.
Note: This is just a bunch of obvious fixes ... nothing serious.
Comment #146
dawehnerFixed the remaining failures, and addressed #1 and #2
Comment #148
dawehnerReroll.
Comment #150
dawehnerLet's see whether things pass again when we rebuild the router again in the installer.
Comment #152
sidharrell CreditAttribution: sidharrell commentedrolled #150 against current HEAD.
Comment #154
sidharrell CreditAttribution: sidharrell commentedmove along, nothing to see here.
I did try to make an interdiff 150-152 but it broke.
Comment #156
sidharrell CreditAttribution: sidharrell commentedfound the source of the fatal error.
Comment #158
sidharrell CreditAttribution: sidharrell commentedfixed some fails
Comment #160
dawehner@sidharrell
I think we should avoid rebuilding the routes completly in submit handlers, are you sure that we need to add them for those cases?
Comment #161
sidharrell CreditAttribution: sidharrell commentedin field_ui_entity_form_mode_presave and field_ui_entity_view_mode_presave? I changed it there cause it kept the FieldUIRouteTest from blowing up. I have no idea whether it is or is not the correct thing to do. I'm anxiously awaiting any kind of feedback, cause I know I have tons to learn about the whole system.
I was surprised that the change the other way in ThemeHandler didn't blow anything up.
Comment #162
Fabianx CreditAttribution: Fabianx commentedSo just a quick understanding of this patch:
- We have 'eventual consistency' taking into account that requests coming while the rebuilding request is running see the old router.
- If there is no route stored at all, the route is build just once by whoever gets the lock - so no build stampede.
- We have almost direct consistency for when the ->rebuild() is called directly, requests see the new route once the rebuild is finished.
Works for me!
Comment #163
sidharrell CreditAttribution: sidharrell commentedtalking over this with @dawehner in irc, we think the rebuildIfNeeded function on RouteBuilder can go away.
After talking, got to thinking:
It's not used anywhere in all of core and it really doesn't make sense with this internal change, to have the API oriented in such a way as to have client code be concerned at all with whether or not the routing table needs to be rebuilt RIGHT NOW. The client code should just maybe notify that they made a change that requires a rebuild. When that rebuild occurs, client don't care.
The form submit rebuild for the field_ui test it's probably just best to eat right now. I'm much more concerned with the results of Drupal\Tests\Core\Routing\RouteBuilderTest and Drupal\system\Tests\Routing\RouteProviderTest. Since this whole change is for a routing issue, those really should (if TDD really is the bee's knees) go to the heart of the issue.
Comment #164
Fabianx CreditAttribution: Fabianx commented#163: Fair, what about:
->rebuild()
->rebuildNow() // for tests and special cases.
instead?
With the understanding that as soon as an url() is created the router is rebuild, else just as a thing in the request.
Comment #165
sidharrell CreditAttribution: sidharrell commentedI was actually looking at rebuild() earlier when banging my head against it, and it doesn't check for whether a rebuild is needed or not, it just does it, afaict. So a rebuildNow isn't necessary, client code can just call rebuild().
I may be (probably am) wrong, but I think I understand a little better now than when I started. The idea is two-fold.
Step 1 is to lower the number of times the routing table is rebuilt on each request, for a performance improvement. And that lowers the probablility of a race condition, but does not altogether eliminate it, which is why we need:
Step 2, fix the possibility of a race condition by some kind of locking mechanism.
Step 2 is where I get kinda fuzzy and have to consult wikipedia. Being a WP dev, we don't concern ourselves with such things.
"A race condition occurs when two or more threads can access shared data and they try to change it at the same time"
So each request-response lives in it's own thread, right? So are we talking about two apache(or nginx, whatevs) threads, each bootstrapping and executing, and what exactly is the shared data? The routing table stored in the DB, object cacheing? Or am I being to literal and detail oriented, and need to just let it go to the 'cache' or some other such abstraction?
Comment #166
Fabianx CreditAttribution: Fabianx commentedWe already have 2) the locking is in place and working fine. The race condition happened because we had a state() before that was set outside the race condition.
This eliminates state() as its only set for the current request, so should be fine.
My suggestion was to rename setRebuildNeeded() to rebuild() and rebuild() to rebuildNow(), but thats not important.
Comment #167
dawehnerIn case we do that, we should be aware, that existing code calling to
rebuild()
might do it on purpose, so its an API break which is IMHO more complex to debug than just a simple renamed function.Comment #168
sidharrell CreditAttribution: sidharrell commentedI'm officially stumped on the remaining fails. I'll keep poking at it, and following the issue, but I'm gonna try to find some lower hanging fruit.
Comment #169
dawehnerSome work on the failures.
Comment #170
dawehnerDoes that mean modules which run stuff which potential could need a new router, like
MenuLInkContent::preSave()
should callRouteBuilder::rebuildIfNeeded()
as well as default_content?With that we would move the lazyness from the Route Provider into the calling code, which is argueably bad, but at least inserts don't happen often.
Just to be clear, the current patch indeed gets rid of the cache again and instead rely on the cache of
YamlDiscovery
Well, places which previously called
->setRebuildNeeded
should be still able to callsetRebuildNeeded
if we apply the logic mentioned above. Its tricky, but I think its okayish to apply additional forced rebuilds in tests.The rebuild in
ModuleInstaller::install()
as far as I understand?MatcherDumper
could be silent if the table does not exist.Note: #2371709: Move the on-demand-table creation into the database API would have solved that nicely, even if hidden.
Well, code still wants to trigger a rebuild at the end of the request, doesn't it?
As maybe written already, it just sets the router to be rebuilt, which happens at the end of the request, which simply doesn't apply in a kernel test or a test in general.
Comment #171
Fabianx CreditAttribution: Fabianx commentedI don't think this is possible for the installer calling router->setRebuild() several times, but then the wrong ModuleDirectories are 'cached' in the yamlDiscovery object.
The helper is nice though.
If we think it helps a lot could compare before after moduleDirs and only create a new instance if that has changed ...
--
Overall the patch looks good.
Comment #172
dawehnerLet's ensure that we don't make the installer slower as before.
What do you mean with possible? It happens at least the for UI installer already, because for every next batch request you have in HEAD the router marked,
and so we do a rebuild. Note: This helper method exists already in HEAD and was only introduced for better testability.
In real life a module install triggers a container rebuild, which triggers a new instance of the router builder anyway, so you afaik can never have wrong data as part of the
YamlDiscovery
object.Comment #174
Fabianx CreditAttribution: Fabianx commentedOh, that is a good reasoning. As long as the module handler is not synthetic, which I am sure it is not, this will work fine.
Thanks for the explanation!
Comment #175
dawehnerLet's fix the test failures and try to get rid of as many of the
RouteBuilder->rebuild()
calls. On top,keep the existing
rebuildIfNeeded
calls and don't replace it withrebuild()
Comment #176
klausiIndentation should be 2 spaces.
is this meant to be the PHP object destructor __destruct()? Then the @inheritdoc does not really fit?
So we don't have any test coverage for this yet?
Unrelated change?
doc block missing, indentation should be 2 spaces.
same here, need @inheritdoc
And here.
Comment #177
dawehnerThank you for your quick review!!
I'm sorry I should have just seen it
We don't want to use the
__destruct()
as it has been causing issues in the past. Instead we use the\Drupal\Core\DestructableInterface
which provides thedestruct()
method.Comment #178
dawehnerThank you for klausi for helping on the documentation of the
destruct()
method.Comment #181
klausiWhy is this test method removed completely? It should be modified to remove the routeBuilderIndicator, but otherwise stay the same?
And did you forget to remove the RouteBuilderIndicator class and RouteBuilderIndicatorInterface ? I don't think it makes sense to keep them for API compatibility, since they are now dead classes?
Otherwise looks almost ready.
Comment #182
klausiComment #183
dawehnerYou are absolute right, this was probably removed temporarily because setRebuildIfNeeded was nota always there.
Good point. Let's remove it!
Comment #184
jibranI think the only thing remaining here is profiling. Everything else look good.
Comment #185
dawehnerPlease keep it on needs review, otherwise people don't review it.
Comment #186
dawehnerThese are tests running standard install.
Without patch
With patch
Tried to adapt the code a bit, see interdiff, but this did not helped at all:
Comment #187
klausiklausi opened a new pull request for this issue.
Comment #188
klausi@dawehner: Hm, so that looks bad. What exactly did you test? The whole install process? with drush si?
Experimenting a bit here with only setting the rebuild flag during install, let's see what goes wrong.
Comment #190
dawehnerYeah, those are basic
drush si
calls ... with standard ... but you have to keep in mind, by doing it explicit, the other requests hitting the site,won't have to deal with it any longer, which makes it easier for sites to recover.
@berdir suggested a different idea to test this.
Constantly request Drupal with for example ab.
While doing that enable a module or save a view, and keep looking at the load.
On HEAD, the lock should block those requests, with the patch, the rebuilding should be done explicitly, and the other requests just hit the previous router.
Ideally I think the logic could be:
But I think doing that right, is tricky.
Comment #191
klausiklausi pushed some commits to the pull request.
For an interdiff please see the list of recent commits.
Comment #193
dawehnerOh wow. @klausi, do you think its okay to require people to rebuild the router manually, in case they do something with the router in their
hook_install()
code?Reverted a hunk made by klausi, which make a) the patch size bigger and b) broke the unit tests. I don't think we need this change.
Another try to run the installer (
time drush si -y
):Head
Patch
Comment #195
dawehnerThis feels like a random test failures, let's reupload the patch.
Comment #197
klausiklausi pushed some commits to the pull request.
For an interdiff please see the list of recent commits.
Comment #198
klausiOK, so as dawehner said I implemented a behavior change with this patch: routes are not automatically rebuild during the module install process.
That way we have two important performance improvements (or at least avoiding regressions) with this patch:
* No more lock stampede where multiple requests try to rebuild routes.
* No multiple route rebuilding during module installation
Disadvantages of this patch:
* Edge case: modules that rely on updated route information in hook_modules_installed() will have to call \Drupal::service('router.builder')->rebuildIfNeeded(); themselves.
* Some test cases need to call \Drupal::service('router.builder')->rebuild() themselves to make sure that the route information is up to date before they make requests.
* "drush en mymodule" is broken now, because drush will also have to do a \Drupal::service('router.builder')->rebuild(); ==> but this can easily be fixed in drush.
Comment #199
dawehnerUpdated the issue summary.
I think solving the big problem of the lock is IMHO worth the disadvantages described in the issue summary.
Comment #200
dawehnerFor benchmarking I tried the following approach.
In one terminal run:
ab -n 1000 -c 10 http://d8.dev/
In another terminal run:
drush ev '$view = \Drupal\views\Entity\View::load("frontpage"); $view->save();'
in HEADor
drush ev '$view = \Drupal\views\Entity\View::load("frontpage"); $view->save(); \Drupal::service("router.builder")->rebuild(); '
with the branch here.Sadly both lead to a apache segfault.
Comment #201
dawehnerAlright, this time on another machine.
Went with
ab -n 100 -c 4 http://d8.dev
(no xdebug)HEAD, without a view save: ~6 seconds taken for the tests
HEAD, with view saving: ~9 seconds taken for the tests
PATCH, without a view save: ~6 seconds taken for the tests
PATCH, with view save: ~7 seconds taken for the tests.
Interesting is probably the standard derivation for HEAD with view save and PATCH with view save, as it shows, how slow the individual requests are:
Note: The
$view->save()
triggers not only a router rebuild, but also some other cache rebuilds.Comment #202
Wim LeersWow, an 80% reduction in standard deviation!
Seems like some awesome improvements are happening here :)
Comment #203
catchI'd prefer to make the opt-in route rebuilding a follow-up even if this makes things slower initially.
It seems worth looking at, but prior to that change this issue is exchanging an issue that brings production sites down in Drupal 7 at least for a slower installer, which is an OK trade-off.
If we want to make the API change to speed up the installer and multiple-module installs generally, that could also be a good change but seems separate.
Comment #204
dawehnerFair. Took my old patch and applied some doc changes on top of it.
Let's discuss that on #2451665: Don't rebuild the route on ModuleInstaller::install() (30% installer speedup)
Comment #205
klausiOk, let's continue with the 30% installer performance regression here and work on that in #2451665: Don't rebuild the route on ModuleInstaller::install() (30% installer speedup).
RTBC, assuming the bot comes back green.
Comment #206
dawehnerI guess we no longer needs profiling, we know its slower :P
Comment #207
Fabianx CreditAttribution: Fabianx commentedComment #208
JulienThomas CreditAttribution: JulienThomas commentedSorry for the "out of topic" question as I go back on the Drupal7 port. Is the speed issue only related to Drupal 8 or does it apply to 7 too?
Comment #209
catchIt's only relevant to the Drupal 8 patch.
Comment #210
JulienThomas CreditAttribution: JulienThomas commentedThanks. when reviewing code I was thinking that but wanted confirmation. Great!
Comment #211
alexpottLooks like the following CR will need to be revised https://www.drupal.org/node/2185947 - I think once this lands it should just point to the new CR...
Also I think we need a CR for this change.
Comment #212
alexpottAlso I'm not certain that we should be removing the lightweight indicator service see #2352641: Break router.builder dependency - are we sure that this is the right thing to do?
Comment #213
alexpottMisspelling of indicates
Comment #218
klausiklausi pushed some commits to the pull request.
For an interdiff please see the list of recent commits.
Comment #219
klausiRegarding #2352641: Break router.builder dependency: that issue was about a service dependency circle between route builder and route provider - this patch changes route provider to not depend on route builder at all. We just forgot to remove that dependency, I did that now with the latest patch, yay!
The change notice at https://www.drupal.org/node/2185947 will only have to be revised slightly, it basically stays the same.
The CR for this issue should just mention that the RouteBuilderIndicator service is removed and that router rebuild now are never triggered at the beginning of requests when routes are looked up.
Attached is also an interdiff to dawehner's patch before (for the Github haters :-P )
Comment #220
Anonymous (not verified) CreditAttribution: Anonymous commentedre. #212 - yes, i do think removing that code is the right thing to do. i'm happy to work on a CR for this patch, given i pushed for the change in direction at #90.
i think this is ready, but i've been away from it for ages, so i'll leave the RTBC for someone else.
Comment #221
dawehnerJust wrote one, feel free to improve it.
https://www.drupal.org/node/2452735
Comment #222
klausiCR looks good, made a small improvement.
@dawehner: can you sanity-check my patch and then set this to RTBC again?
Comment #223
klausiComment #224
Anonymous (not verified) CreditAttribution: Anonymous commenteddawehner++
/me dares to be brave
Comment #225
dawehnerGood improvements!
Went through the patch with storm and it looked still fine.
Comment #226
YesCT CreditAttribution: YesCT commented.
Comment #228
sidharrell CreditAttribution: sidharrell commentedstraight reroll of 219 against head.
Comment #229
sidharrell CreditAttribution: sidharrell commentedsorry, meant #218.
Comment #230
alexpottThis issue addresses a critical bug and is allowed per https://www.drupal.org/core/beta-changes. Committed 40d57b3 and pushed to 8.0.x. Thanks!
Comment #231
alexpottPublished the CR. Can someone update https://www.drupal.org/node/2185947.
Comment #233
klausiYay! Updated the old change record at https://www.drupal.org/node/2185947
Unpostponed #2451665: Don't rebuild the route on ModuleInstaller::install() (30% installer speedup).
Comment #234
Fabianx CreditAttribution: Fabianx commentedYes! Thanks for getting this in!
Comment #235
dawehnerAdding the related issue
Comment #237
apaderno