Problem description

This module uses the Redis KEYS command to perform cache wildcard deletions. As the KEYS documentation states, it should not be used in production environments. On a Redis instance holding around 1.5M keys, we are experiencing website freezes on cache_menu bin flushes with key counts ranging from 500K to 800K. It also happens when flushing other cache bins, for example during database updates. Redis freezes while running KEYS to fetch the matching keys for deletion, and the entire site slows to a crawl or even halts completely if the freeze lasts long enough (we've seen freezes of 20 to 40 seconds).

It's also the first item under the @todo section in lib/Redis/Cache/Base.php :)

Solution

The alternative is the SCAN command, which returns a cursor that you keep calling repeatedly to iterate over the matching keys. Done this way, Redis can keep accepting commands and does not block. However, there are some differences:
The operation is not atomic: SCAN provides a cursor and the eval'ed script loops until no keys are left. It can take slightly longer under heavy load, but Redis no longer freezes until it returns from enumerating keys.
To be replication-safe, it requires switching from script replication to script *effects* replication, which is available in Redis >= 3.2. More info in antirez's post.
Warning: this method might also cause trouble with script runs lasting longer than lua-time-limit (which defaults to 5s), because Redis answers 'BUSY' to other clients past that limit. This is still being worked out.
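To make the difference concrete, here is a rough Python simulation of the cursor loop against an in-memory stand-in; FakeRedis, delete_by_prefix_scan and all other names here are hypothetical illustrations, not the module's actual code:

```python
import fnmatch

class FakeRedis:
    """In-memory stand-in. scan() walks a stable snapshot of the keyspace
    one window at a time, returning 0 as the cursor when done, like Redis."""
    def __init__(self, keys):
        self.keys = dict.fromkeys(keys, 1)
        self._snapshot = sorted(self.keys)

    def scan(self, cursor, match, count):
        window = self._snapshot[cursor:cursor + count]
        batch = [k for k in window
                 if k in self.keys and fnmatch.fnmatch(k, match)]
        cursor += count
        return (0 if cursor >= len(self._snapshot) else cursor), batch

    def delete(self, *keys):
        for k in keys:
            self.keys.pop(k, None)

def delete_by_prefix_scan(r, prefix, count=100):
    """Call SCAN repeatedly until the cursor comes back as 0, deleting each
    returned batch. Between calls the server can serve other clients, which
    is exactly what the blocking KEYS variant cannot do."""
    deleted, cursor = 0, 0
    while True:
        cursor, batch = r.scan(cursor, match=prefix + "*", count=count)
        if batch:
            r.delete(*batch)
            deleted += len(batch)
        if cursor == 0:
            return deleted
```

The trade-off described above is visible here: the work is split into many short calls instead of one long one, so total time can grow while worst-case pause time shrinks.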

Implementation

It has been implemented alongside the current behaviour to preserve compatibility: the module works as before unless you explicitly turn this feature on.
Also, if you accidentally turn the feature on against a server that does not support effects replication (< 3.2), it is simply ignored, since the script always checks the server version first and falls back to the normal behaviour.
To add the feature, these are the main changes:

  • Add the feature toggle option in the configuration page.
  • Add a new parameter to the Lua script which uses the new SCAN code for deletion.
  • Add the scan on/off parameter to the eval() call in the cache backend implementation, according to the configuration.

The Lua script has grown to accommodate the following new capabilities:

  • Checks second parameter (ARGV[1]) to enable SCAN.
  • Checks for server support for switching to effects replication with the redis.replicate_commands() call.
  • Loops performing SCAN/DEL until no more keys match.
  • Script supports specifying SCAN's COUNT parameter as ARGV[2], which defaults to 100. You can change this value from the configuration page.
  • It outputs information and events to Redis' logs. You can silence them by raising the server's minimum log level. Logging everything at LOG_DEBUG level would render the logging useless on high-volume setups, since it would be lost among lots of items per second. Information sent to the log:
    • Script invocation started, with requested keys argument (notice).
    • If SCAN mode requested, say so, along server's returned version string and COUNT parameter (verbose).
    • If SCAN mode requested and not supported, emit a warning message (warning).
    • If using KEYS command, say so before and after calling KEYS to allow timing (notice).
    • Finished message with mode (SCAN/KEYS) and deleted keys count (notice).
  • Minor: use the variable name _ instead of i for unused variables, per convention.

Comments

amontero created an issue. See original summary.

amontero’s picture

Status: Active » Needs review
memtkmcc’s picture

Status: Needs review » Reviewed & tested by the community

This works absolutely amazingly, thank you!

We have already implemented this patch in the BOA stack:

https://github.com/omega8cc/boa/commit/5fdc45f28c21360a63b0036e01426c5dd...
https://github.com/omega8cc/redis/commit/fceaef574a629fb716ef6c91db5eba6...

amontero’s picture

FileSize
9.52 KB
1 KB

@memtkmcc: Glad to hear that! I assume you mean in production already. Could you please give some figures on your setup? (instances, keys, etc.)

Also, patch reroll with minor fixes:

  • Add a missing "Using KEYS command to delete." log notice message for volatile key deletes (thanks Ferran)
  • Fix wrong indentation

Attached also interdiff.

memtkmcc’s picture

@amontero We have added this to BOA head, and it will be included in the upcoming stable release this week, so we don't have data from production servers yet. We stress-tested it on typically slow mass cache-clear tasks on sites managed in Aegir, and it worked beautifully. We are happy to share results from production machines once it is deployed.

cafuego’s picture

Thank you @amontero!

We installed this patch on our production systems yesterday and cron stopped failing! Yay!

Previously, a cache clear in cron would error out after a minute and the drush cron task (and connection to redis) would then hang forever.

With the patch, cron takes a few seconds and no longer errors out.

Additionally, because of the errors previously, we ended up with a redis filled with 700k mostly stale entries, whereas now we have 250k of current-ish ones. I'm sure that will help performance too (no stats on that yet, though).

attiks’s picture

#6 Nice job! Works like a charm

pounard’s picture

Why didn't I see that in my issue list? Seems good, I will review and test this when I have some spare time, thank you very much!

memtkmcc’s picture

After some testing in production we have mixed results. On at least four different servers, using SCAN instead of KEYS managed to almost freeze the Redis server at totally random moments (typically every few hours). It delayed connections to the point where Redis basically stopped working, which triggered our fallback to the SQL cache, which of course causes high load spikes, exhausts the allowed DB connection limits, etc. Note that we use a fairly moderate configuration, where all cache tables are delegated to Redis and the allowed memory limit per Redis instance is only 2 GB. Really weird, but using SCAN on some systems actually made things much, much worse. Still investigating why exactly this happens.

pounard’s picture

One very good reason for NOT applying the patch right now. If you have any more details, please share them!

kirkby’s picture

After applying patch in production, we have had the same experience as @memtkmcc

The slowlog would reveal lots of

 4) 1) "EVAL"
       2) "redis.log(redis.LOG_NOTICE, \"Delete by prefix requested for '\" .. ARGV[1] .. \"'.\")\nlocal deleted_keys_count = 0\n\nlocal scan_dele... (1787 more bytes)"

with response times around 0.9 seconds per delete, rendering the system completely unresponsive.

The bigger the database, the longer the response time. A FLUSHDB would temporarily solve the problem.

After disabling it, response times were reduced by a factor of 8 for these deletes (still pretty slow...). Of course, having a lot of "Delete by prefix" calls is probably not a good idea under any circumstances; however, this patch made things worse for us.

amontero’s picture

In order to address pounard's legitimate concerns, I'll clearly label the feature as experimental to warn people and put a link here in the description field to get as much feedback as possible. UX suggestions welcome.

Now on to @memtkmcc's and @kirkby's issues:
Although longer deletion times are expected, since the Lua script yields CPU to Redis' main event loop to allow it to continue serving requests, Redis freezing at all is not expected. The problem I'm trying to address here is cache wildcard deletes with huge key counts (800K). A frozen Redis means that all webserver threads stall at the bootstrap phase and an avalanche of web requests starts to pile up. For those scenarios, it's better for the deleting request to take longer but allow others to proceed, even at a slower speed, to avoid web requests being killed due to PHP's max_execution_time.

It is normal for the SCAN method to trigger SLOWLOG entries, since Redis has a low (configurable) threshold for them, as opposed to KEYS, which is an atomic, stop-the-world operation that does not trigger slowlog entries, IIRC. Can you please provide machine specs, versions, configuration (at least anything changed from the defaults) and logs of delete operations? With Redis' logs we'll be able to diagnose how much time per key count it's taking, with both the KEYS and SCAN commands. Increase Redis' server logging level to LOG_DEBUG, if possible. I also recommend monitoring Redis' performance metrics to check whether there are hidden metrics that can give us any clue to the cause (I recommend Redsmin).

Possible causes of varying results which come to mind (suggestions welcome):

  • Perhaps multiple deletions are concurrent, or there are higher than expected deletion times with some other root cause. Since Redis is basically single-threaded, if several deletes are being performed at the same time, CPU can easily max out. Having frequent cache wildcard deletions is not common, but I've experienced it in the past due to buggy custom/contrib code (actually, the new Lua script logging provided us the evidence, and it's available even for the KEYS method).
  • Persistence settings. We're using Redis with no persistence at all, and persistence can interact notably with deletions.
  • The server kernel's THP settings, if disk writes trigger due to persistence settings while deletions are in progress. Lots of memory pages will get dirtied, which is bad for Redis' copy-on-write forking and is known to cause latency spikes.

I'll be rolling out a new patch shortly. Please give me as much info as you can, and I'll do my best to find where the problem is for those of you for whom it is not working; if new information appears that I can use to improve the code, I'll happily include it.

@pounard: Perhaps not all scenarios can take advantage of it (especially those with short deletion times and shorter Redis freezes, where it would not make as much difference, other than taking longer to delete the keys). How about the approach of releasing the feature incrementally in the 7.x-dev branch, *clearly* labeling it as experimental? Some people can already benefit from it, and we can put it in the hands of more people brave enough to "try devel" to get more feedback.

amontero’s picture

Issue summary: View changes
Status: Reviewed & tested by the community » Needs review
FileSize
11.86 KB
8.45 KB

Due to the lack of feedback, I've added (well, actually exposed) a new configuration setting which might help solve the issues found.

The SCAN command accepts a COUNT argument specifying the number of elements returned by every SCAN call. Being able to fine-tune this setting may change behaviour, either by increasing the value (thus needing fewer calls to SCAN and hopefully finishing sooner) or by lowering it (perhaps yielding more time to Redis' main loop, at the cost of more calls and more time to complete). Although the Lua script already had this parameter, there was no way to change it from outside the code. This patch exposes the configuration setting and passes it when calling the Lua script.
Since the Redis service might be configured differently on each setup, your mileage may vary, and AFAIK there is no rule of thumb that applies to everyone, so making the setting tunable is the best solution I can come up with for now. I have no other info to figure out the culprit for the problems experienced here. Feedback is much appreciated.

Patch reroll. Changes:

  • Add parameter to Lua deletion script invocation which will be used as SCAN's COUNT argument (the functionality is already implemented in the Lua code).
  • Add a new constant with previous default value for COUNT arg.
  • UI: Improve feature description text: mark it clearly as experimental and link to current issue.
  • UI: Expose above COUNT argument to make it tunable.
  • Increase log verbosity level from LOG_DEBUG to LOG_VERBOSE for some messages. No longer using LOG_DEBUG, since you would get an absurd amount of info on production systems.
kirkby’s picture

Hi again, didn't notice your request for feedback until now.

It will take me some time to collect some systematic feedback, but I will try to get around to it.
However, one thing that I should mention right away is that our Redis instance was running on an absurdly overpowered physical host, with the recommended THP setting and no persistence, which suggests that this is not related to insufficient specs on the host.

Another quick observation: we actually re-patched the patch to remove the version checking in the Lua script. This part:

  local server_info = redis.call("INFO", "SERVER")
  local version_string = string.match(server_info, "redis_version:([%d.]*)")
  local version_major, version_minor = 0, 0

  version_major, version_minor = string.match(version_string, "(%d+).(%d+)")
  version_major = tonumber(version_major)
  version_minor = tonumber(version_minor)
  redis.log(redis.LOG_DEBUG, "Non-blocking delete requested. " ..
    "Server version string: " .. version_string)

  if (version_major > 3) or (version_major == 3 and version_minor >= 2) then

Since we are already running the required Redis version, this code seems rather inefficient. There is no reason to check the Redis version on every single call; it will probably not change between calls. Unfortunately, I have no data on the effect of this change.

I will try to collect some more data for you, and maybe publish our own patch for inspiration. Right now, we are not using the patch in production. We eliminated most of the delete-by-prefix calls from Drupal, so things are running pretty stably right now.

Thanks for your effort.

pounard’s picture

@pounard: Perhaps not all scenarios can take advantage of it (especially those with short deletion times and shorter Redis freezes, where it would not make as much difference, other than taking longer to delete the keys). How about the approach of releasing the feature incrementally in the 7.x-dev branch, *clearly* labeling it as experimental? Some people can already benefit from it, and we can put it in the hands of more people brave enough to "try devel" to get more feedback.

That's fine for me. I'll have to do some reviews of the patch, and I'll try it too. You should also add the corresponding unit tests (no need to write new tests, just declare a new class with the correct parameters to have the tests run with it).

amontero’s picture

Issue summary: View changes

News on the possible causes of varying results:
As the inline comments in the bundled Redis 3.2 configuration file state, if the deletion script runs for longer than the 'lua-time-limit' config setting, Redis will start answering a 'BUSY' error to other calling processes. Although it might accept connections and commands, this causes them to fail and most surely throw an exception. Try issuing 'CONFIG SET lua-time-limit 10000' to raise it to 10 seconds, or whatever suits you.
As I see it now, the possible solutions are:

  • Increase the limit ourselves from the script: not very friendly/considerate to the site owner/admin unless we make it opt-in. We could also reset it to the previous value after we're done; however, if the script dies, we will have permanently changed the configuration.
  • Return before reaching the limit and handle it client-side, perhaps sending further deletion calls until all keys are deleted; but we can hit PHP's max_execution_time as well.
  • Accept that some limit is OK and document/warn/recommend appropriately so the site admin configures the limit before enabling the feature.
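For what it's worth, the second option could be sketched roughly like this; everything here is a hypothetical Python simulation against a plain dict, not the module's code:

```python
def delete_some(store, prefix, budget):
    """Server-side step: remove at most `budget` matching keys, then return
    the number removed so the caller knows whether to continue."""
    victims = [k for k in list(store) if k.startswith(prefix)][:budget]
    for k in victims:
        del store[k]
    return len(victims)

def delete_by_prefix_resumable(store, prefix, budget=1000, max_rounds=10000):
    """Client-side loop: invoke short-lived steps until nothing is left, so
    no single server call comes near lua-time-limit."""
    total = 0
    for _ in range(max_rounds):
        n = delete_some(store, prefix, budget)
        total += n
        if n == 0:
            return total
    raise RuntimeError("did not finish; raise budget or max_rounds")
```

As noted above, the client-side loop then becomes the part at risk of PHP's max_execution_time, so the budget would need tuning against both limits.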

Any other? Which one sounds better to you?

@kirkby: Checking the server version on each call is necessary because the script would break when doing a SCAN/DEL on versions prior to 3.2. Since Redis can be upgraded at any time, it's the most cautious way to proceed, IMO. Having to configure your Redis version, or relying on the site admin not to toggle the feature on servers that don't support it, adds complexity and opens the door to trouble. I haven't measured it, but the check should run very fast anyway. I don't think the execution-time improvement from removing it justifies the added hassle of re-patching the module.
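As a side note for anyone reimplementing the check: comparing the major/minor pair as a tuple avoids a subtle trap, since testing major and minor independently (major >= 3 and minor >= 2) would wrongly reject versions such as 4.0 or 4.1. A rough Python sketch of the gate follows; the function name is made up, and the real script does this in Lua against the INFO SERVER output:

```python
import re

def supports_effects_replication(info_server_text):
    """True when redis_version >= 3.2, the first release that offers
    redis.replicate_commands() (script effects replication)."""
    m = re.search(r"redis_version:(\d+)\.(\d+)", info_server_text)
    if not m:
        return False  # be conservative when INFO output is unparseable
    # Tuple comparison handles 4.0, 5.x, ... correctly.
    return (int(m.group(1)), int(m.group(2))) >= (3, 2)
```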

@pounard: Happy to hear that. Note taken for the tests, but postponing until @kirkby (or anyone else) sends a bit more feedback.

Fabianx’s picture

---

While I 100% agree that this needs to be fixed and that KEYS is problematic, the SCAN implementation in a script has many drawbacks as well.

I am just outlining an idea that I had in hopes that others may find it useful:

If the list of wildcard-expired cache entry prefixes is known (cache_rules:*, cache:entity_info:*, etc.), then it is possible to create a LOOKUP table within Redis.

The trick is to use redis ZSETs.

Once the cache gets cleared for a known prefix (a single KEYS command usually does not lock things down; several in a row do), the following happens in pseudo-code, here using cache_rules as the example:

- clear("cache_rules:*")
- ZSET cache_cids_cache_rules => [ 'REDIS_INTERNAL_INIT' => 1 ]

Now a cache_set is happening:

cache_set("cache_rules:foo", "bar");

internally after the cache_set (or before) we do:

ZSET cache_cids_cache_rules => [ 'foo' => 1 ]

Our set now contains:

cache_cids_cache_rules => [ 'REDIS_INTERNAL_INIT' => 1, 'foo' => 1 ]

For a longer key, we would store the whole cid-suffix:

cache_set("cache_rules:bar:baz:foo", "bar");

ZSET cache_cids_cache_rules => [ 'bar:baz:foo' => 1 ]

Now when the cache_clear_all wildcard clear comes again:

- clear("cache_rules:*")

Within the script we check for an entry in cache_cids_cache_rules and if the 'REDIS_INTERNAL_INIT' is set to 1, then we get the remaining keys out of the entry and clear only those and reset the entry back to just the INIT.

If the 'REDIS_INTERNAL_INIT' is not set to 1 (e.g. it expired or was never called), then KEYS is used as fallback like now.

The advantage of this approach is that it "learns" by observing frequent cache_clear_all's, and it could also be threshold-based (e.g. only start recording after 5 cache_clear_all's), though a whitelist would also work.

This would not solve the occasional cache_clear_all, e.g. for files after a file deletion, but it would solve the very frequent ones; for cases like file deletion a static configuration map could work, since we know core's cache_clear_all usage in general.

This puts more effort on SET for the high-frequency cache_clear_all cache IDs, but way less effort on expiration, while still having the same atomicity level as reached now.

The reason this works is that the ZSET is either fully expired or not at all, and as such we are able to know (unlike with memcache) whether the system is in a coherent state.

--

One of the best properties of this system is that if cache_clear_all() is called in really quick succession, e.g. in entity_defaults_rebuild(), which rebuilds all rules and clears the cache e.g. 200 times in a row, the effort remains minimal.

And thus the cases that lock Redis down are efficiently solved, while the effort for non-locking occasional clears stays the same as before.

--

We can certainly open another issue for that, but I wanted to present this solution as an alternative to the above here for all to see.
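For readers who want to play with the idea, the bookkeeping can be simulated with plain Python sets; this is only a toy model of the pseudo-code above, with made-up names, not a patch:

```python
INIT = "REDIS_INTERNAL_INIT"

class LookupTableCache:
    """Toy model of the ZSET lookup-table idea: after the first wildcard
    clear of a prefix, every set() records the cid suffix, so the next
    clear can delete exactly those keys instead of scanning everything."""
    def __init__(self):
        self.data = {}    # cid -> value
        self.lookup = {}  # bin prefix -> set of recorded suffixes (plus INIT)

    def set(self, cid, value):
        self.data[cid] = value
        prefix, _, suffix = cid.partition(":")
        table = self.lookup.get(prefix)
        if table is not None and INIT in table:
            table.add(suffix)  # cheap extra write on the hot set path

    def clear_prefix(self, prefix):
        table = self.lookup.get(prefix)
        if table is not None and INIT in table:
            # Fast path: delete only the recorded cids, no KEYS/SCAN needed.
            for suffix in table - {INIT}:
                self.data.pop(prefix + ":" + suffix, None)
        else:
            # Fallback, as the module behaves today: full wildcard walk.
            for cid in [c for c in self.data if c.startswith(prefix + ":")]:
                del self.data[cid]
        self.lookup[prefix] = {INIT}  # start recording for the next round
```

In real Redis the lookup table would be a ZSET with an expiry, so the INIT marker and the recorded suffixes expire together, which is the coherence property described above.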

showrx’s picture

Version: 7.x-3.x-dev » 7.x-3.17
Assigned: amontero » showrx
FileSize
5.4 KB

We're running this patch against 7.x-3.17. SCAN is used when redis_delete_prefix_scan_enabled is true and "script effects" replication is available (Redis 3.2.0+). There's also an optional redis_delete_prefix_scan_count to adjust scan COUNT.

showrx’s picture

#20 doesn't do it, since EVAL itself is blocking, so I rewrote it using the Redis client's scan() function...

  • Added deleteByPrefixUsingScan(). To enable, set redis_delete_scan_enabled to TRUE.
  • Two optional settings: redis_delete_scan_count (default 100) and redis_delete_batch_size (default 20). redis_delete_batch_size is the pipelined delete batch size.
  • Also eliminates duplicate deletions within a single execution using static-variable caching.
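A rough Python sketch of what such a client-side SCAN with pipelined batch deletes and static dedup could look like; all names and the fake client are hypothetical, and the real patch drives the PHP Redis client instead:

```python
class FakeClient:
    """Tiny in-memory stand-in for a Redis client."""
    def __init__(self, keys):
        self.keys = set(keys)

    def scan_iter(self, match, count=100):
        # Snapshot, so deleting while iterating is safe (as SCAN guarantees).
        prefix = match.rstrip("*")
        return [k for k in sorted(self.keys) if k.startswith(prefix)]

    def delete(self, *keys):
        for k in keys:
            self.keys.discard(k)

def chunked(items, size):
    """Yield lists of at most `size` items, one per pipelined DEL round-trip."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

_deleted_prefixes = set()  # stands in for the patch's static-variable cache

def delete_by_prefix_using_scan(client, prefix, scan_count=100, batch_size=20):
    """Skip prefixes already handled in this execution, then SCAN client-side
    and DEL in pipeline-sized batches."""
    if prefix in _deleted_prefixes:
        return 0
    _deleted_prefixes.add(prefix)
    deleted = 0
    for batch in chunked(client.scan_iter(match=prefix + "*", count=scan_count),
                         batch_size):
        client.delete(*batch)
        deleted += len(batch)
    return deleted
```

The batch size bounds how many keys each DEL round-trip touches, which is the knob redis_delete_batch_size exposes in the description above.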

djdevin’s picture

Thanks for the patch, this helped out a lot.

With 500k+ keys we had to increase the scan count to 1000 to prevent cache checks/clears from taking much longer than they should. Increasing it to 10000 caused slowlog SCAN entries to appear. But at least nothing was blocked any more except the current request.

So not a perfect solution, but much better. As said in #19, without some sort of lookup cache, Drupal has to scan over all those keys to delete them. In our case we have a lot of sites using one big Redis instance, and from what I've read this isn't best practice. But there's no way I'm spinning up that many Redis instances on AWS, because it would be insanely cost-prohibitive. So memcache is unfortunately looking like a better option purely for cache purposes, even though I much prefer Redis clustering.

pounard’s picture

We use the Redis module configured for proxy-assisted sharding, even when there's a single instance, and we configure the Redis server itself to evict keys using its own LRU mechanism. In case you don't know, the Redis module sets all of its own keys as volatile, with a few months of lifetime for permanent keys, allowing Redis to evict them when it needs to reclaim memory. This way we don't have to clear caches at all. If you're working with many sites sharing the same Redis instances, use the module this way; explicit cache clears have never been recommended with high data volumes.

cafuego’s picture

Version: 7.x-3.17 » 7.x-3.x-dev

Putting the version back to 7.x-3.x, as you always patch against HEAD and not a release. The patch still applies to 7.x-3.18, but I think it may need a logic check so it keeps working as advertised with the changes from https://www.drupal.org/node/3074189

t14’s picture

The patch in comment #21 worked very well in production on a high-traffic site with a larger than normal number of keys.

The key (no pun intended) to getting this right is adjusting the configurable options that come with the patch, 'redis_delete_batch_size' and 'redis_delete_scan_count'.

'redis_delete_scan_count' is, from what I have experienced, the most important. The default value of 10 will probably fix your problem; adjusting it to a bigger count will help improve the speed of cache clears and also improve performance, especially if your PHP max_execution_time is low. If you set the value too high, you will end up with the same issues as when using the KEYS command.

I used the Lua script below to fill my Redis db with wildcard keys (matching the number I have in prod) and then ran patch #21, tweaking the config values "redis_delete_scan_count" and "redis_delete_batch_size" until I found good values.

for i = 1, <AMOUNT_OF_WILDCARD_KEYS_TO_ADD_REDIS>, 1 do
    redis.call("SET", "<CACHE_PREFIX>:cache:"..i.."*", i)
end

Assuming you save the Lua script above with the file name redis_import_keys.lua, you can run the following command to fill your Redis db:

redis-cli --eval redis_import_keys.lua

I think this should be merged. I'm surprised that this module relies on the KEYS command when it is clearly stated in the Redis docs that you shouldn't use KEYS in production. (Other than that, thanks for all the hard work; I know that most maintainers are volunteering their own time :) )

ron_s’s picture

Patch #21 still applies cleanly to the latest 7.x-3.19 version.

powrsurg’s picture

So I want to put out there that the patch in #21 dramatically helped us, for the most part.

I've reviewed how many keys we've had open, and I've seen us in the 2.5 million key range. Our system was always locking up with Redis warnings (our Redis, as an FYI, was an AWS ElastiCache managed service, not something running on localhost) until the flush completed whenever we flushed the Drupal cache. This solution helped us for the most part.

We found that even after upping the number of deletions and the scan count, it was taking a long time and consuming a ton of memory, more than our typical server needs. We had to add some lines so that deleteByPrefixUsingScan could extend its allocated memory, and set drupal_set_time_limit to 0 so that it had unlimited time to delete everything.

With this the site never became unusable even when we were flushing cache. Overall this was an improvement for us.

powrsurg’s picture

UPDATE: Even after applying this update, updating something in Views and saving the view seemed to cause a total Redis meltdown. I'm not sure whether Views bypasses this code path or what.