Just dropping a note that the performance of 1.7 is still a problem. We went live this morning on Economist.com with this version but had to rollback to 1.5 immediately.

I would love to provide more information, but all I can give you right now is that the number of connections to the master DB reaches it's limit and the site goes down, this is presumably caused by cache misses triggering a lot of extra load.

Going back to 6.x-1.5 fixes the issue.

It could of course be due to something in the configuration. We need to reproduce the issue in a stage environment

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

catch’s picture

The performance issues in 1.6 compared to 1.5 were down to a high number of multigets being run for wildcard clear for validation, which I'm very confident was massively improved in 1.7. This wouldn't have any effect on genuine cache misses for cache_get(), only the performance of memcache module itself, so I'm wondering if either as you say you ended up in a cold start situation for some reason, or alternatively the upgrade triggered a notice and flooded watchdog.

Mark Theunissen’s picture

We restart the memcached daemon for all deployments, so we're pretty used to coming up cold, if that's what you mean. I totally understand what you're saying - the performance hit should not have been on the master DB. I can't really explain it.

What do you mean triggered a notice? Our watchdog goes to syslog, not the DB, so it's unlikely that would have been the problem.

At some point we'll dedicate some time to load testing and see if we can reproduce the issue.

rwohleb’s picture

I possibly ran into the same issue on our production servers this morning. The connection counts just soared, taking the site offline. We were able to capture a process dump at the time and it appears there was a stampede of updates on 'memcache_wildcard_flushes' in the variable table. Considering the flush handling was a big change between 1.5 and 1.7 this might be the same issue you were having.

Luckily our DB recovered, but now I'm looking for a long term fix. Storing a semaphore in memcache could be the answer, or even APC or core lock.inc, but I wanted to touch base with the rest of the community around this issue.

premanup’s picture

I'm using
memcache.inc 1.7
pressflow 6.19
php-fpm 5.2.14
memcached 1.0.2
libmemcached 0.44
libevent 1.4

Devel module provides this info:
Page execution time was 1988.62 ms. Executed 222 queries in 169.01 milliseconds.
So PHP execution time is very high. It takes ten times more time then mysql queries does.

XHProf module shows 636,862 function calls for single page view.
With memcache turned off it takes only 75,700 calls.

So after few days very high CPU usage I decide downgrade memcache from 1.7 to 1.5 version. Version 1.5 still provide overall ~ 75,000 function calls for one single page rendering and nicely reduces DB queries execution.

catch’s picture

variable_set() stampede sounds like a likely culprit to me, if that's the case we should do #926062: Avoid variable_set() on bin and wildcard flushes, which is also what rwohleb suggested. Will try to get a patch up there.

Jeremy’s picture

Catch has provided a patch that could fix this, is anyone able to apply #926062: Avoid variable_set() on bin and wildcard flushes to 1.7 and see if it resolves the above reported performance regression? (The patch in update #4 in that ticket is for 6.x-1.7)

premanup’s picture

I've just quick tested this patch against 6.x-1.x-dev version memcache. But as I can see thanks to XHProf module the overall function calls for a single page view has the same level as original 1.7 version does. It is lead to very high CPU usage by PHP proccesses.

premanup’s picture

FileSize
28.7 KB
23.98 KB
24.08 KB

Please take a look attached munin statistics graphs of CPU usage for a day after changing the version 1.7 -> 1.5.
On cpubyuser-day.png image: www-data is PHP-FPM processes.
But the number of mysql queries is little higher for this day.
I hope this data will usefull to resolve this problem. If not I'm ready to provide more info.

catch’s picture

premanup - please post more details as to where the number of function calls are in xhprof. Just a screenshot or two from the ui (with particularly bad cases highlighted) would be helpful.

rwohleb’s picture

We had another variable_set stampede this morning. I now have http://drupal.org/node/926062#comment-3730384 in production so we'll see what happens. Considering the gap between occurrences it'll be a few days.

premanup’s picture

premanup’s picture

@Catch - I made 2 screenshots from XHProf statistic PHP function calls:

1. original 1.5 version
2. last dev version with patch applied from http://drupal.org/node/926062#comment-3730384

premanup’s picture

@Catch - one more screenshot sorted by count of calls.

catch’s picture

@premanup - so the dev screenshot has _theme_process_registry(), but the first screenshot doesn't. Is there any chance one of this is on an install where the theme registry is set to rebuild on every request? Or this may have just been a cache miss. But nothing here points to an issue with memcache itself at all.

premanup’s picture

@Catch - Yes, I agreed. But no chance... This is exacty the same site (files and DB). The only difference is content of folder /sites/all/modules/memcache. I repeated tests a few times. So I have no idea how it possible.

rwohleb’s picture

Just a quick update: I've had no further issue so far while running http://drupal.org/node/926062#comment-3730384 in production.

catch’s picture

Priority: Normal » Major
Status: Active » Needs review
FileSize
897 bytes

Found one possible cause for this, at least if you have frequent cron runs.

Between 1.5 and 1.7, memcache started respecting cache_clear_all(NULL, $bin); whereas previously it had ignored it. system_cron() does this for every bin returned by hook_flush_caches() - so a cron run every five minutes, means all caches cleared every five minutes.

Here's a patch, hoping that people having issues will test it, I'll also be doing so on a high traffic site shortly.

Bumping this to major, since even if this isn't the cause of the high cpu usage others have been seeing, it's still a regression.

premanup’s picture

Status: Needs review » Needs work
FileSize
19.44 KB

@catch - I've just tested patch. Unfortunately, it have not improve performance in my case. See screenshot. So cron is not the reason. Cron was not called during the tests at all.

geerlingguy’s picture

Sub.

mikeytown2’s picture

subscribe; variable in the cache table/bin is getting written to a lot. I went as far as removing variable caching in core and strongarm in order to fix an issue we where having (yes problem is gone now). Looking at downgrading now and undo-ing the hacks to see if that fixes the issue.

catch’s picture

So I'm pretty sure this is some kind of interaction with other modules, since I have 1.7 on a client site and that's doing fine once #17 is applied.

One thing that's new here is that the module does a few more variable_set() than it did with 1.4 for keeping track of wildcard clears, it might be worth adding some logging in there to see if that's gone up, this would explain why the variable cache is being rebuilt so often if so.

Jeremy’s picture

Status: Needs work » Active

Thanks Nat, patch applied in 6.x-dev:
http://drupal.org/cvs?commit=465602

And 7.x-dev:
http://drupal.org/cvs?commit=465608

Leaving bug report open to gather further data points as people are still reporting regressions even with this patch applied.

catch’s picture

Thanks Jeremy.

@premanup - could you repeat your test and debug http://api.drupal.org/api/drupal/includes--theme.inc/function/_theme_loa... and _theme_save_registry()

Whether it's always a cache miss, whether the cache gets set at all etc. etc.

Nick Lewis’s picture

I'm found that disabling the wildcard logic brings my cache hit rate up from 30% to 95%! Using the 2.2.2 memcache library. Either i have something seriously misconfigured, or the wildcard logic is 100% broken for certain versions.

premanup’s picture

@catch - I've just repeated tests using last memcache-6.x-1.x-dev version. No change. The function _theme_process_registry is called 72(!) times on every page view. This problem disappears as I setup memcache-6.x-1.5.

I've tried to investigate this by switching off some modules one by one. And I was surprised. The key module is Lightbox2 6.x-1.11. I repeated a few times for sure. No difference which version of memcache module is used. If Lightbox2 is turned off then _theme_process_registry is not called on every page view. And if memcache module is turned off then Lightbox2 module also works without problem.

So this bug appears as you setup memcache module version > 1.5 and Lightbox2 module simultaneously.

Jeremy’s picture

@catch: this drupal.org infra issue may be related (and is testing out your cron changes now):
#978310: drupal.org forms expire way too early

bleen’s picture

subscribing

catch’s picture

@Nick Lewis - did you try the patch at #22 and/or latest dev?

@premanup, that's odd, couldn't see anything in the lightbox code that would affect this.

Nick Lewis’s picture

@catch no sorry - the issue was listed as 6.x-1.7 i'll try that tomorrow.
@premanup, @catch I ran into the same problem with lightbox2. Basically when you have a lot of image presets, lightbox2 pollutes the theme registry with as much as 700k data!!! (yeah, i was surprised too).

Can't cache anything larger than meg...

I worked around the issue by removing lightbox2's evil auto theme function stuff.

longwave’s picture

@Nick Lewis: cache hit ratio is no longer an indicator of performance when the wildcard logic is enabled; it is expected that there will be a large number of misses due to the need to check prefixes for previous wildcard clears, most of which are unlikely to exist.

So does this performance issue only occur when the theme registry (or some other "mandatory" cache item) grows to over 1Mb and hence cache set always fails? Should we split larger items into chunks and recombine them on cache get?

Jeremy’s picture

I don't see any reason 6.x-1.5 would be handling large objects differently than 6.x-1.7. We should tackle >1MB objects in a different issue, imo, as this issue is regarding a reported performance regression.

slantview’s picture

The 1MB limit is imposed by memcached daemon. Please see the FAQ for memcached: What_is_the_maximum_data_size_you_can_store?_(1_megabyte)

However, you can recompile memcached to support larger sizes by changing the POWER_BLOCK in slab.c as also stated in the FAQ:

If you really do want to store items larger than 1MB, you can recompile memcached with an edited slabs.c:POWER_BLOCK value, or use the inefficient malloc/free backend. Other suggestions include a database, MogileFS, etc.

Jeremy’s picture

@slantview, please see #31 -- let's move any discussion of the 1MB limit to another issue.

longwave’s picture

@Jeremy: There was a lightbox2 security release the same week that memcache 6.x-1.6 was released. According to #409354-22: Performance issues with large number imagecache presets and content types the lightbox2 performance problems appeared in that release, and the previous stable release of lightbox2 was 20 months old - so I'm wondering if this is the root cause of the issue, not the wildcard caching. I guess the only way to figure this out is to test both versions of memcache against both versions of lightbox2, and see which combinations have problems.

slantview’s picture

@Jeremy - np. I was just pointing out it's an issue of the server and has nothing to do with memcache module.

Nick Lewis’s picture

@longview - here's where i'm coming from: in a production environment involving 8 webservers a downgrade to 1.5 to 1.7 magically solved our biggest performance problems (cache hit went from 33 to 78 percent). I can actually see the moment we downgraded in the nagios graphs: the load numbers go instantly lower, and the website death spikes disappear.

I did a diff of versions, and the multiget/wildcard system is the glaring difference. In my own time, I wanted to know why and found that limiting wildcard functionality to environments that contained a class called "memcached" magically increased the hit rate from 33% to 95%. My working theory is that $memcache->get(array()) does not work the same as $memcached->getMulti(array()).

I'm still looking into this; I don't *know* this is the problem. I just know that the multiget and wildcard system needs a close look so we can rule it out.

ball.in.th’s picture

subscribing

ball.in.th’s picture

I believe the patch in #22 makes memcache's cache_clear_all different from core. In core, cache_clear_all(NULL, $bin) would delete all CACHE_TEMPORARY cache entries, e.g., cached pages for cache_page bin. With #22, it doesn't do anything.

mikeytown2’s picture

Not clearing the page cache would be bad; one would get missing css/js errors.

catch’s picture

memcache has always treated CACHE_TEMPORARY entries as having an explicit expiry of one month. Core is at fault for not setting a TTL here, see #891600: Page and block caches are cleared the same way on cron as on content change. The deletion of everything was only introduced with the wildcard logic, and was mistaken as far as I can tell. The page cache shouldn't end up with missing css/js in most cases since on rebuilds only files that are older than one month are deleted (by default).

ball.in.th’s picture

I see -- memcache defines CACHE_TEMPORARY to be 1 month.

With the new wildcard logic in memcache, perhaps CACHE_TEMPORARY could now be treated as a special wildcard. memcache's cache_clear_all(NULL, $bin) could then delete all CACHE_TEMPORARY as in core.

Nick Lewis’s picture

@catch - discovered something interesting, that's going to be unfortunately hard to track down. Running both memcache 1.7 and 1.x-dev I get 90% hit rate on a new install with devel generated content. The results are the same for core drupal and pressflow I'd add. Doing the same on a big complicated panels based site I still get no change, still 30%. So your idea that it was related to a module might be correct. The only other explanation I can think of is the number of cache_gets on page may have something to do with it.

As I said before, removing wildcard functionality resolves the problem in the second case, but i'm at a loss as to why at this point because apparently the wildcard stuff works under certain circumstances and configurations...

mikeytown2’s picture

@catch
in 6 it still wipes out the js & css dirs
drupal_clear_css_cache, drupal_clear_js_cache
#721400: Order JS files according to weight, don't change filenames for aggregated JS/CSS hasn't been back ported to 6.x - drupal_delete_file_if_stale

Edit: looks like drupal_flush_all_caches includes cache_page as part of the tables to kill. So I hope this is a mute point.

bleen’s picture

#42: .... according to #30 (above) the cache hit ratio is not an indicator of performance with the new wildcard logic

catch’s picture

@Nick Lewis - if you're running into something similar to premanup's issue, then you should check if cache_set() is being called on any of those pages - if it's not, then you are not getting a real cache miss in cache_get(). If it is, then you should be able to track it down from there via the bin and cid.

torgosPizza’s picture

Subscribing. Thanks for everyone's effort on this; I think this is what contributed to a major downtime event for us a few weeks ago.

catch’s picture

@Nick Lewis - that's helpful.

I didn't look yet, but when was the support for memcache vs. memcached introduced?

Seems like there's three things at least going on:

1. cache_clear_all() on cron (committed to dev, not in 1.7)
2. People upgrading lightbox and memcache at the same time, and hitting the lightbox bug.
3. memcache vs. memcached

nmonterroso’s picture

subscribing

abx’s picture

Just tested version 1.8. It's better than 1.6 and 1.7 but still... it's slow compare to 1.5. I noticed that Page Execution Time is quite high in every pages compare to version 1.5. So, I have to go back to 1.5 again :(

catch’s picture

Status: Active » Postponed (maintainer needs more info)

If you're still having trouble with this, please update this issue with the following information:

1. Try upgrading to the latest 6.x-1.x dev release and see if that improves memcache hit rates - a couple of optimizations just went in.
2. Please post whether you are using the MemCache or MemCached extension, and which version of the library (and of libmemcache or libmemcached too).

I'm marking this as 'needs more info' until there are some more actionable bug reports here.

catch’s picture

Status: Postponed (maintainer needs more info) » Fixed

I'm marking this fixed, both 1.8 and the new 1.9 release candidate have several improvements over 1.7.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.