Problem/Motivation
Adding lots of objects in APC user cache causes issues for concurrent test bot runs.
Potential but discarded resolutions
Restart Apache as part of the test setup or teardown. This will clear APC/APCu. We already clear APC/APCu before every test suite run see #2257359: Testbots should clear apc user cache on initialization. See #45 for an explanation of why this isn't an option.
Call This can't work, as this script isn't run in the Apache-space, so it does not have access to the same cache.apc_clear_cache('user') in run-tests.sh.
Create a standalone script w/ an acccess check to call Since there isn't a parent site, there is no hash salt to adequately protect a GET w/ a token+timestamp.apc_clear_cache('user').
Use While APC will self heal, this has the potential to cause cache thrashing, as APC doesn't implement a LRU strategy, and will dump the entire cache when it is full with this setting. This is the current settingapc.user_ttl = 0 on the TestBot instances.
Proposed resolution
Refactor APC usages so that a common prefix is used for cache object where that makes sense. For example, the class map and caches of Yaml files. The APC user cache is already cleared prior to each test suite run.
Remaining tasks
Commit.
User interface changes
None
API changes
Add \Drupal\Core\Site\Settings::getApcuPrefix()
Previous issue summary
I have been having random test errors today with HEAD on TestBot, in addition to some local testing I had been doing (which I didn't take notes on):
In https://www.drupal.org/node/2371491#comment-9849759, the same comment only patch was uploaded four times (different filenames each time). The patch was tested against the same codebase, but one failed in \Drupal\Tests\simpletest\Functional\BrowserTestBaseTest
Drupal\Tests\simpletest\Functional\BrowserTestBaseTest::testGoTo GuzzleHttp\Exception\ConnectException: cURL error 28
bot 2818
In https://www.drupal.org/node/2371491#comment-9849555, the same same comment only patch was uploaded twice. One passed, one failed in \Drupal\locale\Tests\LocaleUpdateTest with an exception in FileStorage.
mkdir(): File exists Warning FileStorage.php 171 Drupal\Component\PhpStorage\FileStorage->createDirectory()
bot 2843
In https://www.drupal.org/node/2371491#comment-9849129, there was a failure in \Drupal\file\Tests\FileManagedFileElementTest
bot 2843
In https://www.drupal.org/node/2371491#comment-9848437, there was a failure in \Drupal\simpletest\Tests\SimpleTestBrowserTest. The other fail was fixed in #2474835: Random test fail in PageCacheTest::testPageCacheAnonymous403404.
bot 2783
In https://qa.drupal.org/pifr/test/1030863, failures in \Drupal\file\Tests\RemoteFileSaveUploadTest at lines lines 70, 74, 309, and 309, and in \Drupal\simpletest\Tests\SimpleTestBrowserTest, at line 105. Bot #2813.
| Comment | File | Size | Author |
|---|---|---|---|
| #66 | 2474909-lucky.66.patch | 4.04 KB | alexpott |
| #66 | 52-66-interdiff.txt | 4.45 KB | alexpott |
| #64 | 2474909-test-only.64.patch | 33.43 KB | alexpott |
| #64 | 63-64-interdiff.txt | 1000 bytes | alexpott |
| #63 | 2474909-test-only.patch | 33.33 KB | mpdonadio |
Comments
Comment #1
mpdonadioComment #2
znerol commentedhttps://qa.drupal.org/pifr/test/1030288 is interesting: cURL error 28: See http://curl.haxx.se/libcurl/c/libcurl-errors.html .
Comment #3
znerol commentedyields
So there seems to be a standard timeout of 15 seconds on my machine. I cannot find a place (ini option) where this could be changed. Funny enough the error code/message is different if the timeout option was set (28 with the option, 7 without).
Comment #4
znerol commentedComment #5
mpdonadioAntother data point, Another data point: https://qa.drupal.org/pifr/test/1030863. Failure in \Drupal\file\Tests\RemoteFileSaveUploadTest and \Drupal\simpletest\Tests\SimpleTestBrowserTest in one run, but passed in another.
Comment #6
xjmComment #7
mpdonadioComment #8
znerol commentedThis one is extremely weird: https://qa.drupal.org/pifr/test/1031223
Cannot redeclare class drupal\core\render\element\operations PHP Fatal error Operations.phpExcept that the patch does not touch that area at all.
(10:07:06 AM) berdir: znerol: Dom__: I'm worried that we have a problem with the apc classloader.. that we're not restarting apache and end up with inconsistencies in the cache when patches add/change classesComment #9
znerol commentedPossibly related: #2474077: FileCacheFactory::setPrefix() is called even when no hash_salt is present on the site
Comment #10
xjmAlso WRT the APC classloader, I saw several on 2833 fail with:
Maximum execution time of 300 seconds exceeded PHP Fatal error ApcClassLoader.phpComment #11
xjmBumping to critical; this has gotten pretty disruptive.
Comment #12
berdirAPC cache size might be too small or extremely fragmented. Every test has a different prefix for the classloader, so we have millions of records in there.
Workaround might be to restart apache as part of each test run. Or only use the apc classloader by default when the apcu extension is enabled.
Comment #13
berdirWe need to get the "testbot guys" in here, they should check the output of the apc.php statistics file on the server...
Comment #14
xjmIt also appears we've severely regressed the test suite run times; possibly related?
Comment #15
mpdonadioOr clear apc_cache_clear('user') as part of testbot initialization. The timing of this does seem to correspond with the APC classloader changes. Were people at DDD having random fails over the weekend?
Comment #16
alexpottI reverted #2395143-203: YAML parsing is very slow, cache it with FileCache - I think we only started to see this after that patch went it in. The APCu classloader went in really early in the devdays week.
Comment #17
mpdonadio#14, if the problems are causing timeouts and cache churning, then that could be the cause of the long run times. I have noticed that some runs take the normal 20ish minutes, but others can take about about hour now.
I edited the title as the cause could be #2395143: YAML parsing is very slow, cache it with FileCache or #2296009: Use APC Classloader by default (when available), both of which are using APC.
The general problem, I think, is that APC isn't the best cache. The last time I checked, APC (not 100% sure about APCu) doesn't employ a LRU strategy, so you either need to set the TTL to 0 (which drops the whole cache) or you need to add some logic to handle a failed cache set. I have seen full/fragmented APC cause some seriously weird problems, when it can't GC enough data.
A short term solution could be to set apc.user_ttl=0 on the TestBots. If APC fills up, then it will just erase the whole cache. Or, bump up the APC apc.shm_size now that user cache is being used more.
A long term solution is trickier. It would likely involve setting a site-wide APC prefix, storing it in settings.php, and then using an APCIterator() to delete the entries during a TestBot teardown.
Comment #18
alexpottIncreasing the
apc.shm_sizewon't be that effective because because we key everything using the hash salt. Combined with the fact that we run concurrent tests this means any attempt to nuke the user cache with something like:won't work.
Comment #19
alexpottWell we can do something like this in run-tests.sh in the part that is only run in the process that loops to run all the tests in separate and concurrent processes.
Comment #20
berdirExcept that doesn't run in the apache process, so it can't clear the cache there.
Comment #21
alexpott@Berdir I keep on forgetting that... thanks!
Comment #22
mpdonadioComment #23
alexpottThe current APC settings on a testbot:
Comment #24
alexpottNo resolutions left :(
Comment #25
wim leersQuoting myself at #2395143-208: YAML parsing is very slow, cache it with FileCache:
Comment #26
wim leersD'oh, that's what #24 just said: that is already happening. But if that is already happening, how is it possible that we fragment the data in APCu so very badly? How can there even be different prefixes in there within a single test run? Or is the problem that for a single test suite run, we reinstall Drupal so many times, each with different prefixes, that despite the restarts, it is is still a problem?
Comment #27
alexpottre #26 Yep within a test suite run - we key by hash salt which will be different for each install.
Comment #28
alexpottThe only way to solve this is for web tests to set a predictable prefix and do an additional request in tearDown to clear the apc user cache using this key.
Comment #29
alexpottHere is a patch that empties the user cache for a specific simpletest webtestbase run. We don;t need to worry about KernelTestBase because these are cli tests and therefore APC cache is only persistent for the duration of that particular test.
Ran into some oddities about using apc_delete with the APCiterator - it looks like only 939 entries can be cleared at once for some reason. This will also affect our APCuBackend deletion.
Comment #30
alexpottImproved docs and security... it relies on drupal_valid_ua to permit access to the route to clear APC user cache.
Comment #31
wim leersSo we basically make the prefix longer, so that we now have a controllable per-site prefix.
(One could argue we used to have "drupal." as a prefix, but deleting by that prefix would also delete the caches of other Drupal instances on the same shared host. Correct?)
Sorry: s/standardised/standardized/
APC or APCu?
Nit: double space.
Nit: not enough cursing.
s/rebuild/rebuilt/
This is dangerous: it means anybody can invalidate the APCu cache. Too easy to attack.
Comment #32
mpdonadio#23, those look like the settings for the CLI version of PHP. Is the Apache version the same? I also don't see how this can be a problem if
apc.user_ttl = 0.I don't think we have gotten true confirmation on #25. If Apache really is restarting after every run then the APC cache should be clearing, so the clear in the teardown should be unnecessary.
It almost seems like that in concurrent testbot runs, two instances are getting the same the hash/prefix and are stepping on each other.
Just a random note, that
apc.gc_ttl = 3600is likely too high, if we are restarting Apache and test runs average about 20-25 min, then expirable entries many never get GCed.Comment #33
dawehnerI'm curious why this should belong into the DrupalKernel and not into Settings? There could be Settings::getApcuPrefix?
I think we don't want to allow everyone to clear those?
Comment #34
alexpott@mpdonadio apc cache is being cleared before every suite run... that includes hundreds on web tests. And yes the user_ttl is just the default - see https://php.net/manual/en/apc.configuration.php
The problem we are having is being caused by the cache thrashing of thousands of concurrent inserts and evictions.
Comment #35
alexpott@dawehner and @Wim Leers re security.
In order to clear the apc caches we have to have valid test ua and it only clears cache entries with that in the key name. So I think the danger is eliminated. I will add a comment to the routing file.
Comment #36
alexpott@Wim Leers / #31
drupal.because concurrent tests and security@dawehner - nice idea putting the static method in settings. Done.
I could only prove this patch worked through manual testing so I guess other should to.
Comment #37
alexpottSome improvements as to when we actually need to do stuff and added a test - which also shows that the route is quite secure.
Comment #38
wim leers#36.1: yep, I got that, I was just explicitly stating why that wasn't possible. Sorry for not being more clear.
#36.4: @alexpott+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comment #40
mpdonadioHere are the two reverted patches from #2395143: YAML parsing is very slow, cache it with FileCache reapplied to HEAD along with the patch in #37. If TestBot looks slow tonight, I will queue this up a bunch of times on my IGNORE issue and see what happens.
Comment #41
Aki Tendo commentedBeware newer APCu...
Comment #42
mpdonadioPer @Mixologic in IRC, Apache does not restart between test runs.
Comment #43
mpdonadioComment #45
alexpottRe the Apache restart thing - it is still not a solution because the APC caches are cleared by the test runner in a web request. And clearing the caches is why you are suggesting restarting Apache -which is actually impossible anyway because the test running done by pifr appears to take place within a web request - so a restart would kill the test run.
Comment #46
mpdonadioComment #48
alexpottSo this seems to have the affect on making the testbots crawl :(. i'm guessing this is because doing an APCIterator on an APC cache that is getting 5000 inserts per second slows everything right down.
From https://php.net/manual/en/class.apciterator.php.
So the latest patch makes the chunk size 1 to see if it's the locks that make cause this.
Comment #50
alexpottLocal testing has shown that #48 is still going to be super slow.
So yet another idea... we can use a consistent prefix for the class loader and file cache (when the patch is re-committed) on the testbots because code is not moving.
Running the config group of tests...
Without the patch attached
Cached Variables 27807 ( 26.5 MBytes)
Hits: 254278 (90.1%)
With the patch
Cached Variables 4968 ( 12.0 MBytes)
Hits: 274107 (98.2%)
So this patch will lead to considerably less items in the APC user cache for a full run of the testbot suite.
Comment #52
alexpottRemoving some cruft
Comment #57
catchWith #52 should we still be trying the APCIterator for items in the cache backend?
The number of items to delete should be less, and the number of inserts while it's happening should also be less.
Comment #58
alexpottSo here's a patch that kind of combines #52 and #48 to remove APC backends. With this patch after running the config group of tests:
Cached Variables: 1926 ( 1.2 MBytes)
Hits: 291333 (98.3%)
So way less leftover stuff in APC... hopefully the bots like it.
Comment #62
alexpottThe patch in #58 doesn't appear to work. Both test suite runs with it did not complete. Using APCIterator we the APC cache is being assaulted by the the concurrent test runs seems to cause a serious slowdown.
I think the patch in #52 is the way to go. Re-uploading so it is the most recent patch on the issue.
Comment #63
mpdonadioHere are the two reverted patches from #2395143: YAML parsing is very slow, cache it with FileCache (2395143-185.patch + 2395143-190-followup.patch) + the patch from #62 (2474909-alt-approach.52_0.patch) for testing purposes. If TestBot is slow tonight (and I remember), I will run this a bunch of times on my IGNORE issue as a test.
Comment #64
alexpott@mpdonadio the previous file cache patch needs to be slightly adjusted to work as we expect.
With the patch in #63 after running the config group of tests:
Cached Variables: 8202 ( 17.3 MBytes)
Hits: 279313 (97.1%)
With the patch attached:
Cached Variables: 5410 ( 12.8 MBytes)
Hits: 285497 (98.1%)
Comment #65
alexpottI think we have a problem though because with the current approach if you had two drupal installs on the same machine say beta9 and HEAD you couldn't run simpletest on both of them at the same time without cache collisions :(
Comment #66
alexpottThe patch attached solves the problem outlined in #65.
Cached Variables: 4968 ( 12.1 MBytes)
Hits: 274427 (98.2%)
So no difference to #52
Comment #67
dawehnerI'm curious whether we should store the result, given that we at least will use it probably twice.
Comment #68
catchSo I'm still a bit concerned that if a machine does dozens/hundreds of test runs without a restart, we'll end up with a full APC(u) cache again from the cache backends. It'll take longer, but could still happen over time.
What about the following:
From ApcuBackend:
Add a configuration option here to treat set a defined ttl other than 0 for CACHE_PERMANENT in Settings.
So on the test bots, we could set that to ten minutes.
That could also be useful on real sites - you might want to set it to a day or a week to ensure stale items don't sit around there too.
Comment #69
mpdonadioIf
apc.user_ttl = 0, then I am not sure if this will matter. If there is an attempt to write to the cache, and it is full or too fragmented, APC will clear its contents (potentially leading to cache thrashing). From http://php.net/manual/en/apc.configuration.php#ini.apc.user-ttlMy read of this is that the second and third sentences contradict each other, but experience show the later happens.
If we start setting TTLs then APC will honor them , and since it isn't LRU, it will just not store new entries, which will lead to cache misses until the GC can remove expired entried.
However, if there is enough APC memory and we have short TTLs on the apc_store() calls, then it could help performance on test runs (no thrashing). This is probably a profile-and-see situation.
Comment #70
alexpottThe problem is that APC does not get expunged until you read that entry...
Comment #71
catchI've seen fragmented-but-not-full APC caches reject writes with the 'cannot allocate memory to pool' error. In that case it doesn't expunge the whole cache. But then I'd assume it doesn't expunge individual items past their ttl in that case either, so yeah doesn't help much.
Comment #72
alexpottSo shall we go forward with this patch and then fix up #2395143: YAML parsing is very slow, cache it with FileCache to use it?
This patch massively reduces that number of entries in a testbots APC user cache and increases the hit rate. All other ideas we've come up with have flaws at this point.
Comment #73
catchIf the bots are definitely clearing the cache after each test run, let's go ahead and do this. Can always roll it back again.
I still think a configurable ttl could be useful for some situations but that can happen elsewhere.
Comment #74
alexpottRe-titling to fit the patch and fixing the issue summary. We no longer have random fails. That was fixed by reverting #2395143: YAML parsing is very slow, cache it with FileCache.
Comment #75
alexpottComment #76
catchWe should have a change notice for this since it could also be useful to set this on multisite installs.
I asked alexpott about #67 and he mentioned we'd need to add a static property to settings and also key by arguments. That doesn't seem worth it for < 10 calls so I think we could leave it for now and add later if it turns out to be a real issue.
Tentatively marking RTBC - no commits today since we're in beta freeze so time to knock back if it's not. This would be good to get in early after the beta in case it breaks things again.
Comment #77
webchickI know we're not yet at the end of our Wednesday commit freeze but there's some action happening in #drupal-testing atm and this would really help. I looked through the recent bug reports for D8 https://www.drupal.org/project/issues/drupal?text=&status=Open&prioritie... and nothing in there seemed like it would require a paper bag release.
Committed and pushed to 8.0.x. Thanks!
Comment #79
fabianx commentedRTBC + 1, really nice patch!