TL;DR
Rules 7.x-2.x has a bug whereby if there are no rules configured rules_init will continually try to rebuild its cache for the "init" event.
A very simple change to the logic prevents this from happening (the change is to specifically check for a cache miss for the whitelist):
$whitelist = rules_get_cache('rules_event_whitelist');
- if ((empty($whitelist) || isset($whitelist[$event_name])) && $event = rules_get_cache('event_' . $event_name)) {
+ if ((($whitelist === FALSE) || isset($whitelist[$event_name])) && $event = rules_get_cache('event_' . $event_name)) {
klausi reviewed this in #14 and set it to RTBC.
Tests pass, and performance is significantly better with the patch applied to a vanilla D7 site with the rules module enabled but no rules configured.
Remaining tasks
Review and commit the simple patch from #8 (it still applies to 7.x-2.x as at 2017-10-10).
Original issue summary
Problem/Motivation
It appears that cache rebuilding is broken in the latest version of Rules, causing significant database strain and performance degradation on every (or nearly every) page request when the module is enabled, regardless of whether any rules are actually enabled, and apart from cron activity (cache flushing, etc.).
(Sample data below shows an increase of over 200ms to baseline page request times after enabling the Rules module on one production site--an increase that was sustained over several hours).
The sample data suggests that the excess time is being spent attempting to update the cache_rules table, and that the problem starts in Rules' hook_init() implementation--which is why every page request has the potential to be affected.
Related Issues
Many users have reported a large number of "cache rebuild lock hit" errors when using the latest version of Rules. These errors are likely symptomatic of the underlying problem that something in Rules' cache (re)building process is broken. Please see the list of linked issues at right.
My data shows that the core issue is "every page request" in scope, rather than limited to lock handling or stampeded protection following a cache flush. While every page request may not generate "lock hit" errors, it appears that Rules is hammering the rules_cache table on every (or nearly every) page request.
Sample Data and Case in Point
After enabling the latest stable version of Rules on one site, my watchdog was instantly inundated with hundreds of "Cache rebuild lock hit" messages. I also discovered a large performance degradation via New Relic site monitoring (average server-side transaction time went from under 500ms to over 800ms per page request). I then upgraded to the latest Rules dev build, which suppresses the watchdog messages... but site performance degradation remained. This was without actually enabling any rules--just turning on the module.
After a bit of investigation I discovered that the problem seems to be rules_init(), and rules cache handling.
Other issue threads have attempted to address the watchdog errors and improve Rules' stampede protection, but I think we are on the wrong track there: it's fine to try to improve Rules stampede protection, deadlock handling, and memcached support, but the more significant question (as some users have pointed out) is: why are we running into stampedes and deadlocks on every single page request, when no rules are even enabled on a site?? The core issue, from my investigation, is that something in rules_init() related to cache building is fundamentally broken.
It seems that somewhere during rules_init() Rules attempts to rebuild cache_rules, possibly multiple times, on every single page request.
Here are some screenshots from New Relic, based on severaal hours of a large production site running (right-lick and open image in new tab for full-resolution screenshots):
Here we have Rules taking more average time per page request than any other module--by a landslide. This is with no actual rules enabled, and on a site with hundreds of other modules enabled:
It looks like cache_rules update query is the issue:
Query trace showing that the issue goes back to rules_init() and RulesEventSet::RebuildEventCache
And here's my grand and expensive cache_rules table, containing one single entry and no rules enabled!
Additional Notes
* This issue occurred on my site after upgrading Rules from version 7.x-2.3 to 7.x-2.9. I had had the module disabled for some time before that, so cannot be sure whether this issue was present in the previous version (7.x-2.3) or not.
Because I upgraded the module rather than enabling it from scratch, there is a slight chance that the upgrade process could be to blame here (i.e. rules.install, and the various update functions present there).
* I normally use memcached and the memcache_storage module to handle most of my site's cache tables, but disabled memcached handling of cache_rules in order to troubleshoot this issue.
Remaining tasks
- Confirm root issue / performance degradation.
- Determine when this degradation was introduced by inspecting/testing previous versions of the Rules module.
- Inspect Rules' hook_init() implementation to see where things are going wrong / why Rules is attempting to rebuild the cache table so frequently.
- Fix the problem and close related issues.
Comment | File | Size | Author |
---|---|---|---|
#15 | chrome_2017-02-15_13-34-18.png | 252.65 KB | JordanMagnuson |
#15 | chrome_2017-02-15_13-31-09.png | 236.57 KB | JordanMagnuson |
#15 | chrome_2017-02-15_13-27-19.png | 152.5 KB | JordanMagnuson |
#8 | rules_init_and_cache-2851567-8.patch | 836 bytes | mcdruid |
| |||
cache_rules_table.png | 119.75 KB | JordanMagnuson |
Comments
Comment #2
JordanMagnuson CreditAttribution: JordanMagnuson commentedComment #3
JordanMagnuson CreditAttribution: JordanMagnuson commentedComment #4
botrisThanks for taking the time to generate such an extensive issue.
My two cents, some observations, no conclusions :)
- not sure hook_init is broken just because the cache rebuild originates there.
- according to #2 in #2406863: stampede protection should be flexible ("Cache rebuild lock hit" watchdog message) this bug introduced in 2.8
- in 2.8 this was committed: #2190553: Add locking to cache rebuild to avoid stampeding amongst others.
- #2190553 mentions
- That Rules is triggered (too) early in bootstrap is also here: [https://www.drupal.org/node/2324587] (which went into 2.8 as well.
So current bug (cache rebuild lock hit) is here to to prevent a possible bigger bug, "site completely unresponsive"?
On volatile cache:
As per #2190553
This would explain why in some use cases the use of volatile cache triggers this error.
This is confirmed by:
#2406863 #42
#2406863 #102
#2406863 #104
So the use of volatile cache is not the cause of this problem but could trigger it because of the early call.
Comment #5
botrisComment #6
mcdruidThanks for creating a new issue. I think I've found at least one of the causes of the seemingly excessive cache rebuilds.
The
hook_init
in rules:...calls
rules_invoke_event
for the "init" event. That function does this:So that's (most likely) two calls to
rules_get_cache
which means with a cold cache there may be a stampede as both of these will getFALSE
back from thecache_get
, and proceed to try to rebuild caches.Even without a cold cache, this can still cause problems. The reason is that rules will only create a cache entry for
event_init
if the site is configured with at least one rule which will fire on the init event.If you don't have any rules associated with that event, every time
hook_init
is called (effectively every bootstrap), rules tries to get the cache entry forevent_init
, gets a cache miss and ends up trying to rebuild its event cache; this is inrules_get_cache
:So with a cold cache,
rebuildEventCache
will be called twice as part ofhook_init
(once for each of the calls torules_get_cache
inrules_invoke_event
).Looking at what that method does:
If the site doesn't have any rules for the
init
event though, thatrules_set_cache
will not be called for "event_init
", as no$sets['init']
element will have been created by the previous loop.That means that next time
rules_init
is called, rules will try to load the "init
" event from cache, find there is no cache entry (none was ever created because there are no rules for that event), andrebuildEventCache
will be called again.I think this is why simply enabling the rules module introduces a performance degradation even when no rules are configured - in fact, it may be that having no rules (especially for the init event) is worse for performance than if you do have at least one rule; that way a cache item will be created.
I believe it's also why with rules-7.x-2.9 it's apparently so common to hit the stampede problem with lots of "
Cache rebuild lock hit: event_init
" messages being logged; any site with no rules for the init event will be trying to rebuild that cache on every bootstrap.I'm having a think about how this could best be resolved, but wanted to get these findings into this new issue.
Comment #7
mcdruidActually it looks like a significant portion of the problem I described in my previous comment applies mainly to the situation when the rules module is enabled, but no rules are configured. Otherwise the
rules_event_whitelist
should minimise unnecessary rebuilding of the rule event cache(s) AFAICS.So it might be worth trying to address that situation in the module's code, but as things stand it probably makes sense not to enable the rules module if you don't want to use it (seems silly, but not sure if it's enabled by certain distributions etc...)
Might help to explain e.g. #2630636: Rules-enabled site requests/sec hit
Comment #8
mcdruidI initially looked at avoiding the endless calls to
rebuildEventCache
by doing something like this:That means the next time rules invokes the init event, it doesn't try to get the event from cache and in so doing trigger the rebuild.
However, a much simpler way to achieve the same thing is to specifically check for a cache miss for the whitelist, rather than checking whether the whitelist is empty:
That means that if there are no event rules configured and therefore the whitelist is (legitimately) empty, rules doesn't try to rebuild the event cache over and over again.
If there's a cache miss for the whitelist though, the event cache will be rebuilt (which is correct).
This seems to make quite a difference to performance on a site with the rules module enabled but no rules configured.
Testing on a vanilla D7 site using ab with 10000 requests, 5 concurrent, clearing cache before each run. (Actual numbers not important, this is a local docker set up).
Before the patch:
After the patch:
The patch seems to apply to both 7.x-2.x and the 7.x-2.9 release.
Comment #9
mcdruidComment #10
botrisImpressive stats. How does that relate to when it's not empty though?
I've found that the the problem get's worse on pages where more rules are triggered.
Comment #11
mcdruidMy findings, and the patch, are really for a very specific scenario - the problem that arises when the rules module is enabled but no rules are configured.
I think that's worth fixing by itself, and I'd argue that we should keep this issue specific to that.
Comment #12
JordanMagnuson CreditAttribution: JordanMagnuson commented@mcdruid: thanks for your analysis, and the patch. Will take a look when I have a moment, and report back.
(Regarding the silliness of enabling the Rules module on a site that has no rules active: I agree that it's "silly," but the fact is that site admins may enable/disable various rules over time, and potentially forget to disable the module during times when no rules are active. And more significantly, there are a number of modules out there that have an (unfortunate) hard dependency on Rules. Personally, I don't think that any module should incorporate Rules as a hard dependency (its should always be about optional integration paths, IMO), but the fact remains. At the end of the day, I think these hard dependencies are themselves silly, and that yes, it is silly to have the module enabled without any rules active, but I also think that it is equally silly that the Rules module would implode in the case where it is enabled but not being used...)
Comment #13
mcdruidThanks @JordanMagnuson.
I certainly didn't mean to be rude about the "silliness" of having rules enabled with no (reaction / event) rules configured. It's easy to imagine lots of people would ask "why would you do that?", but I think your example of other modules (or perhaps features) declaring a dependency on rules (so the module cannot be disabled / uninstalled) is spot on - I'm sure there are hundreds of production sites in that situation.
If such a simple change can have a dramatic effect on the performance cost of the rules module when it's enabled but not (yet) set up to react to any events, I definitely think it's worth doing.
Comment #14
klausiReviewed the patch and that totally makes sense!
Comment #15
JordanMagnuson CreditAttribution: JordanMagnuson commented@mcdruid: your analysis makes sense to me. I had a chance to test your patch on my production site yesterday, and the patch successfully stopped the cache_rules update queries from hammering the database like they had been (with no rules active). Thanks for your efforts!
Rules still managed to take up a decent chunk of time thanks to all the cache get queries in hook_init (Rules is now #6 on my list of most time-consuming modules), but I suspect that that impact can be brought to a reasonable level by handing cache_rules back over to memcached (some users have claimed that cache_rules should not be handled by memcached, but I have not seen much evidence to support that--thoughts?).
Screens of current impact on my largest production site (with cache_rules still handled by db):
Comment #16
MustangGB CreditAttribution: MustangGB commentedSo should #2406863: stampede protection should be flexible ("Cache rebuild lock hit" watchdog message) be closed as a duplicate now, or is there further investigation required?
Comment #17
mcdruidI think they are two separate issues; this one turned out to be specific to rules_init when there are no event rules configured.
The other is still an issue all of its own.
Comment #18
hgoto CreditAttribution: hgoto at Studio Umi commentedI also met this problem and the patch #8 works well. Thank you.
Comment #19
mcdruidComment #21
fagothx, the patch from #8 looks great. If I committed it.