TL;DR

Rules 7.x-2.x has a bug whereby if there are no rules configured rules_init will continually try to rebuild its cache for the "init" event.

A very simple change to the logic prevents this from happening (the change is to specifically check for a cache miss for the whitelist):

     $whitelist = rules_get_cache('rules_event_whitelist');
-    if ((empty($whitelist) || isset($whitelist[$event_name])) && $event = rules_get_cache('event_' . $event_name)) {
+    if ((($whitelist === FALSE) || isset($whitelist[$event_name])) && $event = rules_get_cache('event_' . $event_name)) {

klausi reviewed this in #14 and set it to RTBC.

Tests pass, and performance is significantly better with the patch applied to a vanilla D7 site with the rules module enabled but no rules configured.

Remaining tasks

Review and commit the simple patch from #8 (it still applies to 7.x-2.x as at 2017-10-10).


Original issue summary

Problem/Motivation

It appears that cache rebuilding is broken in the latest version of Rules, causing significant database strain and performance degradation on every (or nearly every) page request when the module is enabled, regardless of whether any rules are actually enabled, and apart from cron activity (cache flushing, etc.).

(Sample data below shows an increase of over 200ms to baseline page request times after enabling the Rules module on one production site--an increase that was sustained over several hours).

The sample data suggests that the excess time is being spent attempting to update the cache_rules table, and that the problem starts in Rules' hook_init() implementation--which is why every page request has the potential to be affected.

Many users have reported a large number of "cache rebuild lock hit" errors when using the latest version of Rules. These errors are likely symptomatic of the underlying problem that something in Rules' cache (re)building process is broken. Please see the list of linked issues at right.

My data shows that the core issue is "every page request" in scope, rather than limited to lock handling or stampeded protection following a cache flush. While every page request may not generate "lock hit" errors, it appears that Rules is hammering the rules_cache table on every (or nearly every) page request.

Sample Data and Case in Point

After enabling the latest stable version of Rules on one site, my watchdog was instantly inundated with hundreds of "Cache rebuild lock hit" messages. I also discovered a large performance degradation via New Relic site monitoring (average server-side transaction time went from under 500ms to over 800ms per page request). I then upgraded to the latest Rules dev build, which suppresses the watchdog messages... but site performance degradation remained. This was without actually enabling any rules--just turning on the module.

After a bit of investigation I discovered that the problem seems to be rules_init(), and rules cache handling.

Other issue threads have attempted to address the watchdog errors and improve Rules' stampede protection, but I think we are on the wrong track there: it's fine to try to improve Rules stampede protection, deadlock handling, and memcached support, but the more significant question (as some users have pointed out) is: why are we running into stampedes and deadlocks on every single page request, when no rules are even enabled on a site?? The core issue, from my investigation, is that something in rules_init() related to cache building is fundamentally broken.

It seems that somewhere during rules_init() Rules attempts to rebuild cache_rules, possibly multiple times, on every single page request.

Here are some screenshots from New Relic, based on severaal hours of a large production site running (right-lick and open image in new tab for full-resolution screenshots):

Here we have Rules taking more average time per page request than any other module--by a landslide. This is with no actual rules enabled, and on a site with hundreds of other modules enabled:

It looks like cache_rules update query is the issue:

Query trace showing that the issue goes back to rules_init() and RulesEventSet::RebuildEventCache

And here's my grand and expensive cache_rules table, containing one single entry and no rules enabled!

Additional Notes

* This issue occurred on my site after upgrading Rules from version 7.x-2.3 to 7.x-2.9. I had had the module disabled for some time before that, so cannot be sure whether this issue was present in the previous version (7.x-2.3) or not.

Because I upgraded the module rather than enabling it from scratch, there is a slight chance that the upgrade process could be to blame here (i.e. rules.install, and the various update functions present there).

* I normally use memcached and the memcache_storage module to handle most of my site's cache tables, but disabled memcached handling of cache_rules in order to troubleshoot this issue.

Remaining tasks

  1. Confirm root issue / performance degradation.
  2. Determine when this degradation was introduced by inspecting/testing previous versions of the Rules module.
  3. Inspect Rules' hook_init() implementation to see where things are going wrong / why Rules is attempting to rebuild the cache table so frequently.
  4. Fix the problem and close related issues.
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

JordanMagnuson created an issue. See original summary.

JordanMagnuson’s picture

Issue summary: View changes
JordanMagnuson’s picture

Issue summary: View changes
botris’s picture

Thanks for taking the time to generate such an extensive issue.

My two cents, some observations, no conclusions :)
- not sure hook_init is broken just because the cache rebuild originates there.
- according to #2 in #2406863: stampede protection should be flexible ("Cache rebuild lock hit" watchdog message) this bug introduced in 2.8
- in 2.8 this was committed: #2190553: Add locking to cache rebuild to avoid stampeding amongst others.
- #2190553 mentions

Currently it's possible that extensive use of rules leads to stampeding because of the time it requires to build the full rules cache. Since the rules cache is required very early (hook_init()) even the usage of enhanced caching mechanisms can't cover all cases.
This can lead to serious trouble that can make a site completely unresponsive because all resources are used by the incoming requests to build the rules cache.

- That Rules is triggered (too) early in bootstrap is also here: [https://www.drupal.org/node/2324587] (which went into 2.8 as well.

So current bug (cache rebuild lock hit) is here to to prevent a possible bigger bug, "site completely unresponsive"?

On volatile cache:
As per #2190553

Since the rules cache is required very early (hook_init()) even the usage of enhanced caching mechanisms can't cover all cases.

This would explain why in some use cases the use of volatile cache triggers this error.
This is confirmed by:
#2406863 #42
#2406863 #102
#2406863 #104
So the use of volatile cache is not the cause of this problem but could trigger it because of the early call.

botris’s picture

mcdruid’s picture

Thanks for creating a new issue. I think I've found at least one of the causes of the seemingly excessive cache rebuilds.

The hook_init in rules:

  // Enable event invocation once hook_init() was invoked for Rules.
  rules_event_invocation_enabled(TRUE);
  rules_invoke_event('init');

...calls rules_invoke_event for the "init" event. That function does this:

  // We maintain a whitelist of configured events to reduces the number of cache
  // reads. If the whitelist is empty we proceed and it is rebuilt.
  if (rules_event_invocation_enabled()) {
    $whitelist = rules_get_cache('rules_event_whitelist');
    if ((empty($whitelist) || isset($whitelist[$event_name])) && $event = rules_get_cache('event_' . $event_name)) {
      $event->executeByArgs($args);
    }
  }

So that's (most likely) two calls to rules_get_cache which means with a cold cache there may be a stampede as both of these will get FALSE back from the cache_get, and proceed to try to rebuild caches.

Even without a cold cache, this can still cause problems. The reason is that rules will only create a cache entry for event_init if the site is configured with at least one rule which will fire on the init event.

If you don't have any rules associated with that event, every time hook_init is called (effectively every bootstrap), rules tries to get the cache entry for event_init, gets a cache miss and ends up trying to rebuild its event cache; this is in rules_get_cache:

    if ($get = cache_get($cid . $cid_suffix, 'cache_rules')) {
      $cache[$cid] = $get->data;
    }
    else {

      // ...snip... stampede protection, rebuilding of the "data" $cid and component cache

      elseif (strpos($cid, 'event_') === 0 || $cid == 'rules_event_whitelist') {
        $cache[$cid] = FALSE;
        RulesEventSet::rebuildEventCache();
      }

So with a cold cache, rebuildEventCache will be called twice as part of hook_init (once for each of the calls to rules_get_cache in rules_invoke_event).

Looking at what that method does:

    // Add all rules associated with this event to an EventSet for caching.
    $rules = rules_config_load_multiple(FALSE, array('plugin' => 'reaction rule', 'active' => TRUE));

    foreach ($rules as $name => $rule) {

      // ...snip... creates a $sets array if there are any $rules to iterate through

    }

    // Create cache items for all created sets.
    foreach ($sets as $event_name => $set) {
      $set->sortChildren();
      $set->optimize();
      // Allow modules to alter the cached event set.
      drupal_alter('rules_event_set', $event_name, $set);
      rules_set_cache('event_' . $event_name, $set);
    }
    // Cache a whitelist of configured events so we can use it to speed up later
    // calls. See rules_invoke_event().
    rules_set_cache('rules_event_whitelist', array_flip(array_keys($sets)));
  }

If the site doesn't have any rules for the init event though, that rules_set_cache will not be called for "event_init", as no $sets['init'] element will have been created by the previous loop.

That means that next time rules_init is called, rules will try to load the "init" event from cache, find there is no cache entry (none was ever created because there are no rules for that event), and rebuildEventCache will be called again.

I think this is why simply enabling the rules module introduces a performance degradation even when no rules are configured - in fact, it may be that having no rules (especially for the init event) is worse for performance than if you do have at least one rule; that way a cache item will be created.

I believe it's also why with rules-7.x-2.9 it's apparently so common to hit the stampede problem with lots of "Cache rebuild lock hit: event_init" messages being logged; any site with no rules for the init event will be trying to rebuild that cache on every bootstrap.

I'm having a think about how this could best be resolved, but wanted to get these findings into this new issue.

mcdruid’s picture

Actually it looks like a significant portion of the problem I described in my previous comment applies mainly to the situation when the rules module is enabled, but no rules are configured. Otherwise the rules_event_whitelist should minimise unnecessary rebuilding of the rule event cache(s) AFAICS.

So it might be worth trying to address that situation in the module's code, but as things stand it probably makes sense not to enable the rules module if you don't want to use it (seems silly, but not sure if it's enabled by certain distributions etc...)

Might help to explain e.g. #2630636: Rules-enabled site requests/sec hit

mcdruid’s picture

I initially looked at avoiding the endless calls to rebuildEventCache by doing something like this:

@@ -807,6 +807,11 @@ class RulesEventSet extends RulesRuleSet {
       drupal_alter('rules_event_set', $event_name, $set);
       rules_set_cache('event_' . $event_name, $set);
     }
+
+    if (empty($sets)) {
+      // Put something into the whitelist so rebuild is not called because the list is empty.
+      $sets['no-event-rules-configured'] = 'no-event-rules-configured';
+    }
     // Cache a whitelist of configured events so we can use it to speed up later
     // calls. See rules_invoke_event().
     rules_set_cache('rules_event_whitelist', array_flip(array_keys($sets)));

That means the next time rules invokes the init event, it doesn't try to get the event from cache and in so doing trigger the rebuild.

However, a much simpler way to achieve the same thing is to specifically check for a cache miss for the whitelist, rather than checking whether the whitelist is empty:

@@ -1020,10 +1020,10 @@ function rules_invoke_event() {
   $event_name = $args[0];
   unset($args[0]);
   // We maintain a whitelist of configured events to reduces the number of cache
-  // reads. If the whitelist is empty we proceed and it is rebuilt.
+  // reads. If the whitelist is not in the cache we proceed and it is rebuilt.
   if (rules_event_invocation_enabled()) {
     $whitelist = rules_get_cache('rules_event_whitelist');
-    if ((empty($whitelist) || isset($whitelist[$event_name])) && $event = rules_get_cache('event_' . $event_name)) {
+    if ((($whitelist === FALSE) || isset($whitelist[$event_name])) && $event = rules_get_cache('event_' . $event_name)) {
       $event->executeByArgs($args);
     }
   }

That means that if there are no event rules configured and therefore the whitelist is (legitimately) empty, rules doesn't try to rebuild the event cache over and over again.

If there's a cache miss for the whitelist though, the event cache will be rebuilt (which is correct).

This seems to make quite a difference to performance on a site with the rules module enabled but no rules configured.

Testing on a vanilla D7 site using ab with 10000 requests, 5 concurrent, clearing cache before each run. (Actual numbers not important, this is a local docker set up).

Before the patch:

Requests per second:    87.71 [#/sec] (mean)
Time per request:       57.004 [ms] (mean)
Time per request:       11.401 [ms] (mean, across all concurrent requests)

After the patch:

Requests per second:    210.14 [#/sec] (mean)
Time per request:       23.794 [ms] (mean)
Time per request:       4.759 [ms] (mean, across all concurrent requests)

The patch seems to apply to both 7.x-2.x and the 7.x-2.9 release.

mcdruid’s picture

Status: Active » Needs review
botris’s picture

Impressive stats. How does that relate to when it's not empty though?
I've found that the the problem get's worse on pages where more rules are triggered.

mcdruid’s picture

My findings, and the patch, are really for a very specific scenario - the problem that arises when the rules module is enabled but no rules are configured.

I think that's worth fixing by itself, and I'd argue that we should keep this issue specific to that.

JordanMagnuson’s picture

@mcdruid: thanks for your analysis, and the patch. Will take a look when I have a moment, and report back.

(Regarding the silliness of enabling the Rules module on a site that has no rules active: I agree that it's "silly," but the fact is that site admins may enable/disable various rules over time, and potentially forget to disable the module during times when no rules are active. And more significantly, there are a number of modules out there that have an (unfortunate) hard dependency on Rules. Personally, I don't think that any module should incorporate Rules as a hard dependency (its should always be about optional integration paths, IMO), but the fact remains. At the end of the day, I think these hard dependencies are themselves silly, and that yes, it is silly to have the module enabled without any rules active, but I also think that it is equally silly that the Rules module would implode in the case where it is enabled but not being used...)

mcdruid’s picture

Thanks @JordanMagnuson.

I certainly didn't mean to be rude about the "silliness" of having rules enabled with no (reaction / event) rules configured. It's easy to imagine lots of people would ask "why would you do that?", but I think your example of other modules (or perhaps features) declaring a dependency on rules (so the module cannot be disabled / uninstalled) is spot on - I'm sure there are hundreds of production sites in that situation.

If such a simple change can have a dramatic effect on the performance cost of the rules module when it's enabled but not (yet) set up to react to any events, I definitely think it's worth doing.

klausi’s picture

Title: rules_init() and cache rebuilding are broken ("cache rebuild lock hit" root issue) » rules_init() and cache rebuilding are broken when no events are configured
Status: Needs review » Reviewed & tested by the community

Reviewed the patch and that totally makes sense!

JordanMagnuson’s picture

@mcdruid: your analysis makes sense to me. I had a chance to test your patch on my production site yesterday, and the patch successfully stopped the cache_rules update queries from hammering the database like they had been (with no rules active). Thanks for your efforts!

Rules still managed to take up a decent chunk of time thanks to all the cache get queries in hook_init (Rules is now #6 on my list of most time-consuming modules), but I suspect that that impact can be brought to a reasonable level by handing cache_rules back over to memcached (some users have claimed that cache_rules should not be handled by memcached, but I have not seen much evidence to support that--thoughts?).

Screens of current impact on my largest production site (with cache_rules still handled by db):

MustangGB’s picture

So should #2406863: stampede protection should be flexible ("Cache rebuild lock hit" watchdog message) be closed as a duplicate now, or is there further investigation required?

mcdruid’s picture

I think they are two separate issues; this one turned out to be specific to rules_init when there are no event rules configured.

The other is still an issue all of its own.

hgoto’s picture

I also met this problem and the patch #8 works well. Thank you.

mcdruid’s picture

Issue summary: View changes

  • fago committed 83d72b2 on 7.x-2.x authored by mcdruid
    Issue #2851567 by mcdruid, JordanMagnuson, botris: rules_init() and...
fago’s picture

Status: Reviewed & tested by the community » Fixed

thx, the patch from #8 looks great. If I committed it.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.