Performance of 6.x-1.7 still degraded compared to 6.x-1.5 [#958930]

Comment	File	Size	Author
#18	958930-wildcard_patch.png	19.44 KB	premanup
#17	958930-wildcard.patch	897 bytes	catch
#13	xhprof_-memcache-6.x-1.x-dev-patch2.png	47.38 KB	premanup
#12	xhprof_-memcache-6.x-1.5.png	55.76 KB	premanup
#12	xhprof_-memcache-6.x-1.x-dev-patch.png	54.77 KB	premanup
#11	xhprof_-memcache-6.x-1.5.png	55.76 KB	premanup
#11	xhprof_-memcache-6.x-1.x-dev-patch.png	54.77 KB	premanup
#8	cpu-day.png	24.08 KB	premanup
#8	cpubyuser-day.png	23.98 KB	premanup
#8	mysql_queries-day.png	28.7 KB	premanup

Comment #1

he/him

English

CreditAttribution: catch commented 15 November 2010 at 03:35

The performance issues in 1.6 compared to 1.5 were down to a high number of multigets being run for wildcard clear for validation, which I'm very confident was massively improved in 1.7. This wouldn't have any effect on genuine cache misses for cache_get(), only the performance of memcache module itself, so I'm wondering if either as you say you ended up in a cold start situation for some reason, or alternatively the upgrade triggered a notice and flooded watchdog.

Log in or register to post comments

Comment #2

Mark Theunissen CreditAttribution: Mark Theunissen commented 15 November 2010 at 21:14

We restart the memcached daemon for all deployments, so we're pretty used to coming up cold, if that's what you mean. I totally understand what you're saying - the performance hit should not have been on the master DB. I can't really explain it.

What do you mean triggered a notice? Our watchdog goes to syslog, not the DB, so it's unlikely that would have been the problem.

At some point we'll dedicate some time to load testing and see if we can reproduce the issue.

Log in or register to post comments

Comment #3

rwohleb

he/him

English

CreditAttribution: rwohleb commented 17 November 2010 at 17:09

I possibly ran into the same issue on our production servers this morning. The connection counts just soared, taking the site offline. We were able to capture a process dump at the time and it appears there was a stampede of updates on 'memcache_wildcard_flushes' in the variable table. Considering the flush handling was a big change between 1.5 and 1.7 this might be the same issue you were having.

Luckily our DB recovered, but now I'm looking for a long term fix. Storing a semaphore in memcache could be the answer, or even APC or core lock.inc, but I wanted to touch base with the rest of the community around this issue.

Log in or register to post comments

Comment #4

premanup CreditAttribution: premanup commented 18 November 2010 at 23:00

I'm using
memcache.inc 1.7
pressflow 6.19
php-fpm 5.2.14
memcached 1.0.2
libmemcached 0.44
libevent 1.4

Devel module provides this info:
Page execution time was 1988.62 ms. Executed 222 queries in 169.01 milliseconds.
So PHP execution time is very high. It takes ten times more time then mysql queries does.

XHProf module shows 636,862 function calls for single page view.
With memcache turned off it takes only 75,700 calls.

So after few days very high CPU usage I decide downgrade memcache from 1.7 to 1.5 version. Version 1.5 still provide overall ~ 75,000 function calls for one single page rendering and nicely reduces DB queries execution.

Log in or register to post comments

Comment #5

catch

he/him

English

CreditAttribution: catch commented 19 November 2010 at 02:17

variable_set() stampede sounds like a likely culprit to me, if that's the case we should do #926062: Avoid variable_set() on bin and wildcard flushes, which is also what rwohleb suggested. Will try to get a patch up there.

Log in or register to post comments

Comment #6

Jeremy CreditAttribution: Jeremy commented 19 November 2010 at 12:15

Catch has provided a patch that could fix this, is anyone able to apply #926062: Avoid variable_set() on bin and wildcard flushes to 1.7 and see if it resolves the above reported performance regression? (The patch in update #4 in that ticket is for 6.x-1.7)

Log in or register to post comments

Comment #7

premanup CreditAttribution: premanup commented 19 November 2010 at 22:10

I've just quick tested this patch against 6.x-1.x-dev version memcache. But as I can see thanks to XHProf module the overall function calls for a single page view has the same level as original 1.7 version does. It is lead to very high CPU usage by PHP proccesses.

Log in or register to post comments

Comment #8

premanup CreditAttribution: premanup commented 19 November 2010 at 22:28

File	Size
mysql_queries-day.png	28.7 KB
cpubyuser-day.png	23.98 KB
cpu-day.png	24.08 KB

Please take a look attached munin statistics graphs of CPU usage for a day after changing the version 1.7 -> 1.5.
On cpubyuser-day.png image: www-data is PHP-FPM processes.
But the number of mysql queries is little higher for this day.
I hope this data will usefull to resolve this problem. If not I'm ready to provide more info.

Log in or register to post comments

Comment #9

catch

he/him

English

CreditAttribution: catch commented 24 November 2010 at 04:58

premanup - please post more details as to where the number of function calls are in xhprof. Just a screenshot or two from the ui (with particularly bad cases highlighted) would be helpful.

Log in or register to post comments

Comment #10

rwohleb

he/him

English

CreditAttribution: rwohleb commented 24 November 2010 at 15:59

We had another variable_set stampede this morning. I now have http://drupal.org/node/926062#comment-3730384 in production so we'll see what happens. Considering the gap between occurrences it'll be a few days.

Log in or register to post comments

Comment #11

premanup CreditAttribution: premanup commented 26 November 2010 at 08:59

File	Size
xhprof_-memcache-6.x-1.x-dev-patch.png	54.77 KB
xhprof_-memcache-6.x-1.5.png	55.76 KB

Log in or register to post comments

Comment #12

premanup CreditAttribution: premanup commented 26 November 2010 at 09:03

File	Size
xhprof_-memcache-6.x-1.x-dev-patch.png	54.77 KB
xhprof_-memcache-6.x-1.5.png	55.76 KB

@Catch - I made 2 screenshots from XHProf statistic PHP function calls:

1. original 1.5 version
2. last dev version with patch applied from http://drupal.org/node/926062#comment-3730384

Log in or register to post comments

Comment #13

premanup CreditAttribution: premanup commented 26 November 2010 at 10:03

File	Size
xhprof_-memcache-6.x-1.x-dev-patch2.png	47.38 KB

@Catch - one more screenshot sorted by count of calls.

Log in or register to post comments

Comment #14

catch

he/him

English

CreditAttribution: catch commented 26 November 2010 at 17:05

@premanup - so the dev screenshot has _theme_process_registry(), but the first screenshot doesn't. Is there any chance one of this is on an install where the theme registry is set to rebuild on every request? Or this may have just been a cache miss. But nothing here points to an issue with memcache itself at all.

Log in or register to post comments

Comment #15

premanup CreditAttribution: premanup commented 26 November 2010 at 17:34

@Catch - Yes, I agreed. But no chance... This is exacty the same site (files and DB). The only difference is content of folder /sites/all/modules/memcache. I repeated tests a few times. So I have no idea how it possible.

Log in or register to post comments

Comment #16

rwohleb

he/him

English

CreditAttribution: rwohleb commented 30 November 2010 at 19:34

Just a quick update: I've had no further issue so far while running http://drupal.org/node/926062#comment-3730384 in production.

Log in or register to post comments

Comment #17

catch

he/him

English

CreditAttribution: catch commented 8 December 2010 at 03:46

Priority:	Normal	» Major
Status:	Active	» Needs review

File	Size
958930-wildcard.patch	897 bytes

Found one possible cause for this, at least if you have frequent cron runs.

Between 1.5 and 1.7, memcache started respecting cache_clear_all(NULL, $bin); whereas previously it had ignored it. system_cron() does this for every bin returned by hook_flush_caches() - so a cron run every five minutes, means all caches cleared every five minutes.

Here's a patch, hoping that people having issues will test it, I'll also be doing so on a high traffic site shortly.

Bumping this to major, since even if this isn't the cause of the high cpu usage others have been seeing, it's still a regression.

Log in or register to post comments

Comment #18

premanup CreditAttribution: premanup commented 8 December 2010 at 09:26

Status:

Needs review

» Needs work

File	Size
958930-wildcard_patch.png	19.44 KB

@catch - I've just tested patch. Unfortunately, it have not improve performance in my case. See screenshot. So cron is not the reason. Cron was not called during the tests at all.

Log in or register to post comments

Comment #19

geerlingguy CreditAttribution: geerlingguy commented 8 December 2010 at 15:53

Sub.

Log in or register to post comments

Comment #20

mikeytown2 CreditAttribution: mikeytown2 commented 13 December 2010 at 21:19

subscribe; variable in the cache table/bin is getting written to a lot. I went as far as removing variable caching in core and strongarm in order to fix an issue we where having (yes problem is gone now). Looking at downgrading now and undo-ing the hacks to see if that fixes the issue.

Log in or register to post comments

Comment #21

catch

he/him

English

CreditAttribution: catch commented 14 December 2010 at 08:51

So I'm pretty sure this is some kind of interaction with other modules, since I have 1.7 on a client site and that's doing fine once #17 is applied.

One thing that's new here is that the module does a few more variable_set() than it did with 1.4 for keeping track of wildcard clears, it might be worth adding some logging in there to see if that's gone up, this would explain why the variable cache is being rebuilt so often if so.

Log in or register to post comments

Comment #22

Jeremy CreditAttribution: Jeremy commented 16 December 2010 at 19:22

Status:

Needs work

» Active

Thanks Nat, patch applied in 6.x-dev:
http://drupal.org/cvs?commit=465602

And 7.x-dev:
http://drupal.org/cvs?commit=465608

Leaving bug report open to gather further data points as people are still reporting regressions even with this patch applied.

Log in or register to post comments

Comment #23

catch

he/him

English

CreditAttribution: catch commented 20 December 2010 at 15:21

Thanks Jeremy.

@premanup - could you repeat your test and debug http://api.drupal.org/api/drupal/includes--theme.inc/function/_theme_loa... and _theme_save_registry()

Whether it's always a cache miss, whether the cache gets set at all etc. etc.

Log in or register to post comments

Comment #24

Nick Lewis CreditAttribution: Nick Lewis commented 21 December 2010 at 22:13

I'm found that disabling the wildcard logic brings my cache hit rate up from 30% to 95%! Using the 2.2.2 memcache library. Either i have something seriously misconfigured, or the wildcard logic is 100% broken for certain versions.

Log in or register to post comments

Comment #25

premanup CreditAttribution: premanup commented 21 December 2010 at 23:27

@catch - I've just repeated tests using last memcache-6.x-1.x-dev version. No change. The function _theme_process_registry is called 72(!) times on every page view. This problem disappears as I setup memcache-6.x-1.5.

I've tried to investigate this by switching off some modules one by one. And I was surprised. The key module is Lightbox2 6.x-1.11. I repeated a few times for sure. No difference which version of memcache module is used. If Lightbox2 is turned off then _theme_process_registry is not called on every page view. And if memcache module is turned off then Lightbox2 module also works without problem.

So this bug appears as you setup memcache module version > 1.5 and Lightbox2 module simultaneously.

Log in or register to post comments

Comment #26

Jeremy CreditAttribution: Jeremy commented 22 December 2010 at 01:26

@catch: this drupal.org infra issue may be related (and is testing out your cron changes now):
#978310: drupal.org forms expire way too early

Log in or register to post comments

Comment #27

bleen CreditAttribution: bleen commented 22 December 2010 at 01:39

subscribing

Log in or register to post comments

Comment #28

catch

he/him

English

CreditAttribution: catch commented 22 December 2010 at 02:22

@Nick Lewis - did you try the patch at #22 and/or latest dev?

@premanup, that's odd, couldn't see anything in the lightbox code that would affect this.

Log in or register to post comments

Comment #29

Nick Lewis CreditAttribution: Nick Lewis commented 22 December 2010 at 02:36

@catch no sorry - the issue was listed as 6.x-1.7 i'll try that tomorrow.
@premanup, @catch I ran into the same problem with lightbox2. Basically when you have a lot of image presets, lightbox2 pollutes the theme registry with as much as 700k data!!! (yeah, i was surprised too).

Can't cache anything larger than meg...

I worked around the issue by removing lightbox2's evil auto theme function stuff.

Log in or register to post comments

Comment #30

longwave

he/him

English

UK

CreditAttribution: longwave commented 22 December 2010 at 02:55

@Nick Lewis: cache hit ratio is no longer an indicator of performance when the wildcard logic is enabled; it is expected that there will be a large number of misses due to the need to check prefixes for previous wildcard clears, most of which are unlikely to exist.

So does this performance issue only occur when the theme registry (or some other "mandatory" cache item) grows to over 1Mb and hence cache set always fails? Should we split larger items into chunks and recombine them on cache get?

Log in or register to post comments

Comment #31

Jeremy CreditAttribution: Jeremy commented 22 December 2010 at 03:12

I don't see any reason 6.x-1.5 would be handling large objects differently than 6.x-1.7. We should tackle >1MB objects in a different issue, imo, as this issue is regarding a reported performance regression.

Log in or register to post comments

Comment #32

slantview CreditAttribution: slantview commented 22 December 2010 at 03:22

The 1MB limit is imposed by memcached daemon. Please see the FAQ for memcached: What_is_the_maximum_data_size_you_can_store?_(1_megabyte)

However, you can recompile memcached to support larger sizes by changing the POWER_BLOCK in slab.c as also stated in the FAQ:

If you really do want to store items larger than 1MB, you can recompile memcached with an edited slabs.c:POWER_BLOCK value, or use the inefficient malloc/free backend. Other suggestions include a database, MogileFS, etc.

Log in or register to post comments

Comment #33

Jeremy CreditAttribution: Jeremy commented 22 December 2010 at 03:27

@slantview, please see #31 -- let's move any discussion of the 1MB limit to another issue.

Log in or register to post comments

Comment #34

longwave

he/him

English

UK

CreditAttribution: longwave commented 22 December 2010 at 03:35

@Jeremy: There was a lightbox2 security release the same week that memcache 6.x-1.6 was released. According to #409354-22: Performance issues with large number imagecache presets and content types the lightbox2 performance problems appeared in that release, and the previous stable release of lightbox2 was 20 months old - so I'm wondering if this is the root cause of the issue, not the wildcard caching. I guess the only way to figure this out is to test both versions of memcache against both versions of lightbox2, and see which combinations have problems.

Log in or register to post comments

Comment #35

slantview CreditAttribution: slantview commented 22 December 2010 at 03:57

@Jeremy - np. I was just pointing out it's an issue of the server and has nothing to do with memcache module.

Log in or register to post comments

Comment #36

Nick Lewis CreditAttribution: Nick Lewis commented 22 December 2010 at 08:56

@longview - here's where i'm coming from: in a production environment involving 8 webservers a downgrade to 1.5 to 1.7 magically solved our biggest performance problems (cache hit went from 33 to 78 percent). I can actually see the moment we downgraded in the nagios graphs: the load numbers go instantly lower, and the website death spikes disappear.

I did a diff of versions, and the multiget/wildcard system is the glaring difference. In my own time, I wanted to know why and found that limiting wildcard functionality to environments that contained a class called "memcached" magically increased the hit rate from 33% to 95%. My working theory is that $memcache->get(array()) does not work the same as $memcached->getMulti(array()).

I'm still looking into this; I don't *know* this is the problem. I just know that the multiget and wildcard system needs a close look so we can rule it out.

Log in or register to post comments

Comment #37

ball.in.th CreditAttribution: ball.in.th commented 22 December 2010 at 18:26

subscribing

Log in or register to post comments

Comment #38

ball.in.th CreditAttribution: ball.in.th commented 22 December 2010 at 19:04

I believe the patch in #22 makes memcache's cache_clear_all different from core. In core, cache_clear_all(NULL, $bin) would delete all CACHE_TEMPORARY cache entries, e.g., cached pages for cache_page bin. With #22, it doesn't do anything.

Log in or register to post comments

Comment #39

mikeytown2 CreditAttribution: mikeytown2 commented 22 December 2010 at 19:28

Not clearing the page cache would be bad; one would get missing css/js errors.

Log in or register to post comments

Comment #40

catch

he/him

English

CreditAttribution: catch commented 23 December 2010 at 02:07

memcache has always treated CACHE_TEMPORARY entries as having an explicit expiry of one month. Core is at fault for not setting a TTL here, see #891600: Page and block caches are cleared the same way on cron as on content change. The deletion of everything was only introduced with the wildcard logic, and was mistaken as far as I can tell. The page cache shouldn't end up with missing css/js in most cases since on rebuilds only files that are older than one month are deleted (by default).

Log in or register to post comments

Comment #41

ball.in.th CreditAttribution: ball.in.th commented 23 December 2010 at 03:22

I see -- memcache defines CACHE_TEMPORARY to be 1 month.

With the new wildcard logic in memcache, perhaps CACHE_TEMPORARY could now be treated as a special wildcard. memcache's cache_clear_all(NULL, $bin) could then delete all CACHE_TEMPORARY as in core.

Log in or register to post comments

Comment #42

Nick Lewis CreditAttribution: Nick Lewis commented 23 December 2010 at 03:31

@catch - discovered something interesting, that's going to be unfortunately hard to track down. Running both memcache 1.7 and 1.x-dev I get 90% hit rate on a new install with devel generated content. The results are the same for core drupal and pressflow I'd add. Doing the same on a big complicated panels based site I still get no change, still 30%. So your idea that it was related to a module might be correct. The only other explanation I can think of is the number of cache_gets on page may have something to do with it.

As I said before, removing wildcard functionality resolves the problem in the second case, but i'm at a loss as to why at this point because apparently the wildcard stuff works under certain circumstances and configurations...

Log in or register to post comments

Comment #43

mikeytown2 CreditAttribution: mikeytown2 commented 23 December 2010 at 03:40

@catch
in 6 it still wipes out the js & css dirs
drupal_clear_css_cache, drupal_clear_js_cache
#721400: Order JS files according to weight, don't change filenames for aggregated JS/CSS hasn't been back ported to 6.x - drupal_delete_file_if_stale

Edit: looks like drupal_flush_all_caches includes cache_page as part of the tables to kill. So I hope this is a mute point.

Log in or register to post comments

Comment #44

bleen CreditAttribution: bleen commented 23 December 2010 at 14:15

#42: .... according to #30 (above) the cache hit ratio is not an indicator of performance with the new wildcard logic

Log in or register to post comments

Comment #45

catch

he/him

English

CreditAttribution: catch commented 23 December 2010 at 14:19

@Nick Lewis - if you're running into something similar to premanup's issue, then you should check if cache_set() is being called on any of those pages - if it's not, then you are not getting a real cache miss in cache_get(). If it is, then you should be able to track it down from there via the bin and cid.

Log in or register to post comments

Comment #46

torgosPizza

he/him

English

Portland, OR

CreditAttribution: torgosPizza commented 6 January 2011 at 19:44

Subscribing. Thanks for everyone's effort on this; I think this is what contributed to a major downtime event for us a few weeks ago.

Log in or register to post comments

Comment #47

catch

he/him

English

CreditAttribution: catch commented 7 January 2011 at 06:48

@Nick Lewis - that's helpful.

I didn't look yet, but when was the support for memcache vs. memcached introduced?

Seems like there's three things at least going on:

1. cache_clear_all() on cron (committed to dev, not in 1.7)
2. People upgrading lightbox and memcache at the same time, and hitting the lightbox bug.
3. memcache vs. memcached

Log in or register to post comments

Comment #48

nmonterroso CreditAttribution: nmonterroso commented 24 January 2011 at 18:45

subscribing

Log in or register to post comments

Comment #49

abx CreditAttribution: abx commented 2 February 2011 at 06:06

Just tested version 1.8. It's better than 1.6 and 1.7 but still... it's slow compare to 1.5. I noticed that Page Execution Time is quite high in every pages compare to version 1.5. So, I have to go back to 1.5 again :(

Log in or register to post comments

Comment #50

catch

he/him

English

CreditAttribution: catch commented 19 April 2011 at 04:35

Status:

Active

» Postponed (maintainer needs more info)

If you're still having trouble with this, please update this issue with the following information:

1. Try upgrading to the latest 6.x-1.x dev release and see if that improves memcache hit rates - a couple of optimizations just went in.
2. Please post whether you are using the MemCache or MemCached extension, and which version of the library (and of libmemcache or libmemcached too).

I'm marking this as 'needs more info' until there are some more actionable bug reports here.