Problem/Motivation

Adding lots of objects in APC user cache causes issues for concurrent test bot runs.

Potential but discarded resolutions

Restart Apache as part of the test setup or teardown. This will clear APC/APCu. We already clear APC/APCu before every test suite run see #2257359: Testbots should clear apc user cache on initialization. See #45 for an explanation of why this isn't an option.

Call apc_clear_cache('user') in run-tests.sh. This can't work, as this script isn't run in the Apache-space, so it does not have access to the same cache.

Create a standalone script w/ an acccess check to call apc_clear_cache('user'). Since there isn't a parent site, there is no hash salt to adequately protect a GET w/ a token+timestamp.

Use apc.user_ttl = 0 on the TestBot instances. While APC will self heal, this has the potential to cause cache thrashing, as APC doesn't implement a LRU strategy, and will dump the entire cache when it is full with this setting. This is the current setting

Proposed resolution

Refactor APC usages so that a common prefix is used for cache object where that makes sense. For example, the class map and caches of Yaml files. The APC user cache is already cleared prior to each test suite run.

Remaining tasks

Commit.

User interface changes

None

API changes

Add \Drupal\Core\Site\Settings::getApcuPrefix()

Previous issue summary

I have been having random test errors today with HEAD on TestBot, in addition to some local testing I had been doing (which I didn't take notes on):

In https://www.drupal.org/node/2371491#comment-9849759, the same comment only patch was uploaded four times (different filenames each time). The patch was tested against the same codebase, but one failed in \Drupal\Tests\simpletest\Functional\BrowserTestBaseTest
Drupal\Tests\simpletest\Functional\BrowserTestBaseTest::testGoTo GuzzleHttp\Exception\ConnectException: cURL error 28
bot 2818

In https://www.drupal.org/node/2371491#comment-9849555, the same same comment only patch was uploaded twice. One passed, one failed in \Drupal\locale\Tests\LocaleUpdateTest with an exception in FileStorage.
mkdir(): File exists Warning FileStorage.php 171 Drupal\Component\PhpStorage\FileStorage->createDirectory()
bot 2843

In https://www.drupal.org/node/2371491#comment-9849129, there was a failure in \Drupal\file\Tests\FileManagedFileElementTest
bot 2843

In https://www.drupal.org/node/2371491#comment-9848437, there was a failure in \Drupal\simpletest\Tests\SimpleTestBrowserTest. The other fail was fixed in #2474835: Random test fail in PageCacheTest::testPageCacheAnonymous403404.
bot 2783

In https://qa.drupal.org/pifr/test/1030863, failures in \Drupal\file\Tests\RemoteFileSaveUploadTest at lines lines 70, 74, 309, and 309, and in \Drupal\simpletest\Tests\SimpleTestBrowserTest, at line 105. Bot #2813.

Comments

mpdonadio’s picture

Issue summary: View changes
znerol’s picture

https://qa.drupal.org/pifr/test/1030288 is interesting: cURL error 28: See http://curl.haxx.se/libcurl/c/libcurl-errors.html .

CURLE_OPERATION_TIMEDOUT (28)

Operation timeout. The specified time-out period was reached according to the conditions.

znerol’s picture

$t0 = time();

$ch1 = curl_init();
curl_setopt($ch1, CURLOPT_URL, "http://1.1.1.1/");
curl_setopt($ch1, CURLOPT_HEADER, 0);
#curl_setopt($ch1, CURLOPT_CONNECTTIMEOUT, 1);
#curl_setopt($ch1, CURLOPT_CONNECTTIMEOUT_MS, 1000);
curl_exec($ch1);
var_dump(curl_errno($ch1));
var_dump(curl_error($ch1));
curl_close($ch1);

$t1 = time();

print ("t1-t0: " . ($t1 - $t0) . "\n");

$ch2 = curl_init();
curl_setopt($ch2, CURLOPT_URL, "http://1.1.1.1/");
curl_setopt($ch2, CURLOPT_HEADER, 0);
curl_setopt($ch2, CURLOPT_CONNECTTIMEOUT, 1);
curl_setopt($ch2, CURLOPT_CONNECTTIMEOUT_MS, 1000);
curl_exec($ch2);
var_dump(curl_errno($ch2));
var_dump(curl_error($ch2));
curl_close($ch2);

$t2 = time();
print ("t2-t1: " . ($t2 - $t1) . "\n");

yields

php curl-test.php 
int(7)
string(58) "Failed to connect to 1.1.1.1 port 80: Connection timed out"
t1-t0: 15
int(28)
string(44) "Connection timed out after 1001 milliseconds"
t2-t1: 1

So there seems to be a standard timeout of 15 seconds on my machine. I cannot find a place (ini option) where this could be changed. Funny enough the error code/message is different if the timeout option was set (28 with the option, 7 without).

znerol’s picture

(21:31:21) mpdonadio: znerol: i have had about 25 local SimpleBrowserTest passes locally, so this could suggest a concurrency or testbot problem ... not seeing anything in the git logs to suggest a change, though the apc classloader gave me pause ... apc does crazy things when it runs out of memory
(21:32:24) znerol: mpdonadio: I've been experimenting with curl
(21:33:26) znerol: mpdonadio: there was http://drupal.org/node/2461985 which added CURLOPT_TIMEOUT in addition to CURLOPT_TIMEOUT_MS for backward compatibility
(21:33:29) Druplicon: http://drupal.org/node/2461985 => Update Guzzle to latest release #2461985: Update Guzzle to latest release => 11 comments, 1 IRC mention
(21:34:05) znerol: mpdonadio: also I tried to figure out whether that value maybe could bleed over into simpletest
(21:34:12) znerol: mpdonadio: but I think we can exclude that
(21:34:22) znerol: mpdonadio: once curl_close is called, the options are gone
(21:35:03) znerol: mpdonadio: do you happen to know whether they need to restart the webserver for a new testrun?
(21:35:47) znerol: mpdonadio: if it does not restart fast enough when other tests are running, then maybe this could become a problem?
(21:36:27) znerol: mpdonadio: or is there only one run-tests.sh per machine?
(21:36:37) mpdonadio: znerol: not sure ... having trouble raising anyone to answer testbot questions
mpdonadio’s picture

Antother data point, Another data point: https://qa.drupal.org/pifr/test/1030863. Failure in \Drupal\file\Tests\RemoteFileSaveUploadTest and \Drupal\simpletest\Tests\SimpleTestBrowserTest in one run, but passed in another.

xjm’s picture

Issue summary: View changes
mpdonadio’s picture

Issue summary: View changes
znerol’s picture

This one is extremely weird: https://qa.drupal.org/pifr/test/1031223

Cannot redeclare class drupal\core\render\element\operations PHP Fatal error Operations.php

Except that the patch does not touch that area at all.

(10:07:06 AM) berdir: znerol: Dom__: I'm worried that we have a problem with the apc classloader.. that we're not restarting apache and end up with inconsistencies in the cache when patches add/change classes

xjm’s picture

Also WRT the APC classloader, I saw several on 2833 fail with:
Maximum execution time of 300 seconds exceeded PHP Fatal error ApcClassLoader.php

xjm’s picture

Title: Random Test Failures with Unknown Cause » Random Test Failures (possibly related to APC classloader)
Priority: Major » Critical

Bumping to critical; this has gotten pretty disruptive.

berdir’s picture

APC cache size might be too small or extremely fragmented. Every test has a different prefix for the classloader, so we have millions of records in there.

Workaround might be to restart apache as part of each test run. Or only use the apc classloader by default when the apcu extension is enabled.

berdir’s picture

We need to get the "testbot guys" in here, they should check the output of the apc.php statistics file on the server...

xjm’s picture

It also appears we've severely regressed the test suite run times; possibly related?

mpdonadio’s picture

Or clear apc_cache_clear('user') as part of testbot initialization. The timing of this does seem to correspond with the APC classloader changes. Were people at DDD having random fails over the weekend?

alexpott’s picture

I reverted #2395143-203: YAML parsing is very slow, cache it with FileCache - I think we only started to see this after that patch went it in. The APCu classloader went in really early in the devdays week.

mpdonadio’s picture

Title: Random Test Failures (possibly related to APC classloader) » Random Test Failures (possibly related to APC)

#14, if the problems are causing timeouts and cache churning, then that could be the cause of the long run times. I have noticed that some runs take the normal 20ish minutes, but others can take about about hour now.

I edited the title as the cause could be #2395143: YAML parsing is very slow, cache it with FileCache or #2296009: Use APC Classloader by default (when available), both of which are using APC.

The general problem, I think, is that APC isn't the best cache. The last time I checked, APC (not 100% sure about APCu) doesn't employ a LRU strategy, so you either need to set the TTL to 0 (which drops the whole cache) or you need to add some logic to handle a failed cache set. I have seen full/fragmented APC cause some seriously weird problems, when it can't GC enough data.

A short term solution could be to set apc.user_ttl=0 on the TestBots. If APC fills up, then it will just erase the whole cache. Or, bump up the APC apc.shm_size now that user cache is being used more.

A long term solution is trickier. It would likely involve setting a site-wide APC prefix, storing it in settings.php, and then using an APCIterator() to delete the entries during a TestBot teardown.

alexpott’s picture

Increasing the apc.shm_size won't be that effective because because we key everything using the hash salt. Combined with the fact that we run concurrent tests this means any attempt to nuke the user cache with something like:

diff --git a/core/modules/simpletest/src/TestBase.php b/core/modules/simpletest/src/TestBase.php
index 9a38774..1c24100 100644
--- a/core/modules/simpletest/src/TestBase.php
+++ b/core/modules/simpletest/src/TestBase.php
@@ -1235,6 +1235,15 @@ private function prepareEnvironment() {
     ));
 
     drupal_set_time_limit($this->timeLimit);
+
+    // Clear APC user cache to ensure mulitple test runs on the same testbot do
+    // not fragment the cache.
+    if (function_exists('apc_clear_cache')) {
+      apc_clear_cache('user');
+    }
+    if (function_exists('apcu_clear_cache')) {
+      apcu_clear_cache();
+    }
   }
 
   /**

won't work.

alexpott’s picture

Status: Active » Needs review
StatusFileSize
new549 bytes

Well we can do something like this in run-tests.sh in the part that is only run in the process that loops to run all the tests in separate and concurrent processes.

berdir’s picture

Except that doesn't run in the apache process, so it can't clear the cache there.

alexpott’s picture

@Berdir I keep on forgetting that... thanks!

mpdonadio’s picture

Issue summary: View changes
alexpott’s picture

Issue summary: View changes

The current APC settings on a testbot:

 sudo -u www-data php -i | grep apc
/etc/php5/cli/conf.d/20-apc.ini,
apc
MMAP File Mask => /tmp/apc.UOaRHi
apc.cache_by_default => On => On
apc.canonicalize => On => On
apc.coredump_unmap => Off => Off
apc.enable_cli => On => On
apc.enabled => On => On
apc.file_md5 => Off => Off
apc.file_update_protection => 2 => 2
apc.filters => no value => no value
apc.gc_ttl => 3600 => 3600
apc.include_once_override => Off => Off
apc.lazy_classes => Off => Off
apc.lazy_functions => Off => Off
apc.max_file_size => 1M => 1M
apc.mmap_file_mask => /tmp/apc.UOaRHi => /tmp/apc.UOaRHi
apc.num_files_hint => 1000 => 1000
apc.preload_path => no value => no value
apc.report_autofilter => Off => Off
apc.rfc1867 => On => On
apc.rfc1867_freq => 0 => 0
apc.rfc1867_name => APC_UPLOAD_PROGRESS => APC_UPLOAD_PROGRESS
apc.rfc1867_prefix => upload_ => upload_
apc.rfc1867_ttl => 3600 => 3600
apc.serializer => default => default
apc.shm_segments => 1 => 1
apc.shm_size => 300M => 300M
apc.shm_strings_buffer => 4M => 4M
apc.slam_defense => On => On
apc.stat => On => On
apc.stat_ctime => Off => Off
apc.ttl => 0 => 0
apc.use_request_time => On => On
apc.user_entries_hint => 4096 => 4096
apc.user_ttl => 0 => 0
apc.write_lock => On => On
alexpott’s picture

Issue summary: View changes

No resolutions left :(

wim leers’s picture

Quoting myself at #2395143-208: YAML parsing is very slow, cache it with FileCache:

* Consider restarting apache after/before a completed test run, to drop out any cached information between test runs.

Almost exactly a year ago, @rfay said this over at #1809248-27: Enable apc.enable_cli php.ini setting on testbots:

I think we've had a restart after every test in there for a long time.

… so … did this change?

wim leers’s picture

D'oh, that's what #24 just said: that is already happening. But if that is already happening, how is it possible that we fragment the data in APCu so very badly? How can there even be different prefixes in there within a single test run? Or is the problem that for a single test suite run, we reinstall Drupal so many times, each with different prefixes, that despite the restarts, it is is still a problem?

alexpott’s picture

re #26 Yep within a test suite run - we key by hash salt which will be different for each install.

alexpott’s picture

The only way to solve this is for web tests to set a predictable prefix and do an additional request in tearDown to clear the apc user cache using this key.

alexpott’s picture

StatusFileSize
new5.21 KB

Here is a patch that empties the user cache for a specific simpletest webtestbase run. We don;t need to worry about KernelTestBase because these are cli tests and therefore APC cache is only persistent for the duration of that particular test.

Ran into some oddities about using apc_delete with the APCiterator - it looks like only 939 entries can be cleared at once for some reason. This will also affect our APCuBackend deletion.

alexpott’s picture

StatusFileSize
new4.55 KB
new5.96 KB

Improved docs and security... it relies on drupal_valid_ua to permit access to the route to clear APC user cache.

wim leers’s picture

Status: Needs review » Needs work
  1. +++ b/core/lib/Drupal/Core/DrupalKernel.php
    @@ -251,7 +251,7 @@ public static function createFromRequest(Request $request, $class_loader, $envir
    -      $prefix = 'drupal.' . hash('sha256', 'drupal.' . Settings::getHashSalt());
    
    @@ -1338,4 +1338,28 @@ protected function addServiceFiles($service_yamls) {
    +    $prefix = 'drupal.';
    +    if ($apc_prefix = Settings::get('apc_prefix')) {
    +       $prefix .= $apc_prefix . '.';
    +    }
    +    return $prefix . $identifier . hash('sha256', $prefix . $identifier . Settings::getHashSalt());
    

    So we basically make the prefix longer, so that we now have a controllable per-site prefix.

    (One could argue we used to have "drupal." as a prefix, but deleting by that prefix would also delete the caches of other Drupal instances on the same shared host. Correct?)

  2. +++ b/core/lib/Drupal/Core/DrupalKernel.php
    @@ -1338,4 +1338,28 @@ protected function addServiceFiles($service_yamls) {
    +   * A standardised prefix is useful to allow visual inspection of an APC cache.
    

    Sorry: s/standardised/standardized/

    APC or APCu?

  3. +++ b/core/modules/system/src/Controller/SystemController.php
    @@ -456,4 +457,37 @@ public static function setLinkActiveClass(array $element, array $context) {
    +      // No APC class  or non alphanumeric characters in the prefix.
    

    Nit: double space.

  4. +++ b/core/modules/system/src/Controller/SystemController.php
    @@ -456,4 +457,37 @@ public static function setLinkActiveClass(array $element, array $context) {
    +    // APCIterator only deletes 939 items in one go. Not sure why but it is
    +    // repeatable.
    

    Nit: not enough cursing.

  5. +++ b/core/modules/system/src/Controller/SystemController.php
    @@ -456,4 +457,37 @@ public static function setLinkActiveClass(array $element, array $context) {
    +    // Do nothing else so the APC classloader is not rebuild this request.
    

    s/rebuild/rebuilt/

  6. +++ b/core/modules/system/system.routing.yml
    @@ -474,3 +474,12 @@ system.entity_autocomplete:
    +system.apc_clear:
    +  path: '/admin/config/development/testing/apc_clear'
    +  defaults:
    +    _controller: '\Drupal\system\Controller\SystemController::apcClear'
    +  requirements:
    +    _access: 'TRUE'
    

    This is dangerous: it means anybody can invalidate the APCu cache. Too easy to attack.

mpdonadio’s picture

#23, those look like the settings for the CLI version of PHP. Is the Apache version the same? I also don't see how this can be a problem if apc.user_ttl = 0.

I don't think we have gotten true confirmation on #25. If Apache really is restarting after every run then the APC cache should be clearing, so the clear in the teardown should be unnecessary.

It almost seems like that in concurrent testbot runs, two instances are getting the same the hash/prefix and are stepping on each other.

Just a random note, that apc.gc_ttl = 3600 is likely too high, if we are restarting Apache and test runs average about 20-25 min, then expirable entries many never get GCed.

dawehner’s picture

  1. +++ b/core/lib/Drupal/Core/Cache/ApcuBackendFactory.php
    @@ -34,7 +34,7 @@ class ApcuBackendFactory implements CacheFactoryInterface {
       public function __construct($root, CacheTagsChecksumInterface $checksum_provider) {
    -    $this->sitePrefix = Crypt::hashBase64($root . '/' . conf_path());
    +    $this->sitePrefix = DrupalKernel::getApcuPrefix('cache_backend');
         $this->checksumProvider = $checksum_provider;
    

    I'm curious why this should belong into the DrupalKernel and not into Settings? There could be Settings::getApcuPrefix?

  2. +++ b/core/modules/system/system.routing.yml
    @@ -474,3 +474,12 @@ system.entity_autocomplete:
    +  requirements:
    +    _access: 'TRUE'
    

    I think we don't want to allow everyone to clear those?

alexpott’s picture

@mpdonadio apc cache is being cleared before every suite run... that includes hundreds on web tests. And yes the user_ttl is just the default - see https://php.net/manual/en/apc.configuration.php

The problem we are having is being caused by the cache thrashing of thousands of concurrent inserts and evictions.

alexpott’s picture

@dawehner and @Wim Leers re security.

+++ b/core/modules/system/src/Controller/SystemController.php
@@ -456,4 +457,37 @@ public static function setLinkActiveClass(array $element, array $context) {
+  public function apcClear() {
+    $prefix = drupal_valid_test_ua();
+    if (!class_exists('\APCIterator') || !$prefix || !ctype_alnum($prefix)) {
+      // No APC class  or non alphanumeric characters in the prefix.
+      throw new AccessDeniedHttpException();
+    }

In order to clear the apc caches we have to have valid test ua and it only clears cache entries with that in the key name. So I think the danger is eliminated. I will add a comment to the routing file.

alexpott’s picture

Status: Needs work » Needs review
Issue tags: +Needs manual testing
StatusFileSize
new5.97 KB
new6.63 KB

@Wim Leers / #31

  1. We can't clear by drupal. because concurrent tests and security
  2. Fixed and specified APC user cache so it is nice for both 5.4 and 5.5+
  3. Fixed
  4. Agreed this is llama poop
  5. Fixed
  6. Added docs to the routing file why I think this is acceptable

@dawehner - nice idea putting the static method in settings. Done.

I could only prove this patch worked through manual testing so I guess other should to.

alexpott’s picture

Issue tags: -Needs manual testing
StatusFileSize
new3.8 KB
new9.07 KB

Some improvements as to when we actually need to do stuff and added a test - which also shows that the route is quite secure.

wim leers’s picture

#36.1: yep, I got that, I was just explicitly stating why that wasn't possible. Sorry for not being more clear.

#36.4: @alexpott+++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The last submitted patch, 29: 2474909.29.patch, failed testing.

mpdonadio’s picture

StatusFileSize
new39.87 KB

Here are the two reverted patches from #2395143: YAML parsing is very slow, cache it with FileCache reapplied to HEAD along with the patch in #37. If TestBot looks slow tonight, I will queue this up a bunch of times on my IGNORE issue and see what happens.

Aki Tendo’s picture

mpdonadio’s picture

Issue summary: View changes

Per @Mixologic in IRC, Apache does not restart between test runs.

mpdonadio’s picture

Issue summary: View changes

The last submitted patch, 36: 2474909.36.patch, failed testing.

alexpott’s picture

Re the Apache restart thing - it is still not a solution because the APC caches are cleared by the test runner in a web request. And clearing the caches is why you are suggesting restarting Apache -which is actually impossible anyway because the test running done by pifr appears to take place within a web request - so a restart would kill the test run.

mpdonadio’s picture

Issue summary: View changes

The last submitted patch, 37: 2474909.37.patch, failed testing.

alexpott’s picture

Issue summary: View changes
StatusFileSize
new2.41 KB
new9.17 KB

So this seems to have the affect on making the testbots crawl :(. i'm guessing this is because doing an APCIterator on an APC cache that is getting 5000 inserts per second slows everything right down.

The APCIterator class makes it easier to iterate over large APC caches. This is helpful as it allows iterating over large caches in steps, while grabbing a defined number of entries per lock instance, so it frees the cache locks for other activities rather than hold up the entire cache to grab 100 (the default) entries. Also, using regular expression matching is more efficient as it's been moved to the C level.

From https://php.net/manual/en/class.apciterator.php.

So the latest patch makes the chunk size 1 to see if it's the locks that make cause this.

Status: Needs review » Needs work

The last submitted patch, 48: 2474909.48.patch, failed testing.

alexpott’s picture

Status: Needs work » Needs review
StatusFileSize
new3.61 KB

Local testing has shown that #48 is still going to be super slow.

So yet another idea... we can use a consistent prefix for the class loader and file cache (when the patch is re-committed) on the testbots because code is not moving.

Running the config group of tests...

Without the patch attached

Cached Variables 27807 ( 26.5 MBytes)
Hits: 254278 (90.1%)

With the patch

Cached Variables 4968 ( 12.0 MBytes)
Hits: 274107 (98.2%)

So this patch will lead to considerably less items in the APC user cache for a full run of the testbot suite.

The last submitted patch, 40: 2474909-test-only.patch, failed testing.

alexpott’s picture

StatusFileSize
new1.28 KB
new2.53 KB

Removing some cruft

The last submitted patch, 50: 2474909-alt-approach.49.patch, failed testing.

The last submitted patch, 19: 2474909.19.patch, failed testing.

isntall queued 19: 2474909.19.patch for re-testing.

The last submitted patch, 30: 2474909.30.patch, failed testing.

catch’s picture

With #52 should we still be trying the APCIterator for items in the cache backend?

The number of items to delete should be less, and the number of inserts while it's happening should also be less.

alexpott’s picture

StatusFileSize
new5.92 KB
new8.23 KB

So here's a patch that kind of combines #52 and #48 to remove APC backends. With this patch after running the config group of tests:
Cached Variables: 1926 ( 1.2 MBytes)
Hits: 291333 (98.3%)

So way less leftover stuff in APC... hopefully the bots like it.

Status: Needs review » Needs work

The last submitted patch, 58: 2474909-alt-approach.58.patch, failed testing.

Status: Needs work » Needs review

Status: Needs review » Needs work

The last submitted patch, 58: 2474909-alt-approach.58.patch, failed testing.

alexpott’s picture

Status: Needs work » Needs review
StatusFileSize
new2.53 KB

The patch in #58 doesn't appear to work. Both test suite runs with it did not complete. Using APCIterator we the APC cache is being assaulted by the the concurrent test runs seems to cause a serious slowdown.

I think the patch in #52 is the way to go. Re-uploading so it is the most recent patch on the issue.

mpdonadio’s picture

StatusFileSize
new33.33 KB

Here are the two reverted patches from #2395143: YAML parsing is very slow, cache it with FileCache (2395143-185.patch + 2395143-190-followup.patch) + the patch from #62 (2474909-alt-approach.52_0.patch) for testing purposes. If TestBot is slow tonight (and I remember), I will run this a bunch of times on my IGNORE issue as a test.

alexpott’s picture

StatusFileSize
new1000 bytes
new33.43 KB

@mpdonadio the previous file cache patch needs to be slightly adjusted to work as we expect.

With the patch in #63 after running the config group of tests:
Cached Variables: 8202 ( 17.3 MBytes)
Hits: 279313 (97.1%)

With the patch attached:
Cached Variables: 5410 ( 12.8 MBytes)
Hits: 285497 (98.1%)

alexpott’s picture

I think we have a problem though because with the current approach if you had two drupal installs on the same machine say beta9 and HEAD you couldn't run simpletest on both of them at the same time without cache collisions :(

alexpott’s picture

StatusFileSize
new4.45 KB
new4.04 KB

The patch attached solves the problem outlined in #65.

Cached Variables: 4968 ( 12.1 MBytes)
Hits: 274427 (98.2%)

So no difference to #52

dawehner’s picture

+++ b/core/lib/Drupal/Core/Site/Settings.php
@@ -148,4 +149,31 @@ public static function getHashSalt() {
+    if (static::get('apcu_ensure_unique_prefix', TRUE)) {
+      return 'drupal.' . $identifier . '.' . hash_hmac('sha256', $identifier, static::get('hash_salt', $root . '/' . $site_path));
+    }
+    return 'drupal.' . $identifier . '.' . Crypt::hashBase64($root . '/' . $site_path);

I'm curious whether we should store the result, given that we at least will use it probably twice.

catch’s picture

The patch in #58 doesn't appear to work. Both test suite runs with it did not complete. Using APCIterator we the APC cache is being assaulted by the the concurrent test runs seems to cause a serious slowdown.

So I'm still a bit concerned that if a machine does dozens/hundreds of test runs without a restart, we'll end up with a full APC(u) cache again from the cache backends. It'll take longer, but could still happen over time.

What about the following:

From ApcuBackend:

/ apc_store()'s $ttl argument can be omitted but also set to 0 (zero),
    // in which case the value will persist until it's removed from the cache or
    // until the next cache clear, restart, etc. This is what we want to do
    // when $expire equals CacheBackendInterface::CACHE_PERMANENT.
    if ($expire === CacheBackendInterface::CACHE_PERMANENT) {
      $expire = 0;
    }
    apc_store($this->getApcuKey($cid), $cache, $expire);

Add a configuration option here to treat set a defined ttl other than 0 for CACHE_PERMANENT in Settings.

So on the test bots, we could set that to ten minutes.

That could also be useful on real sites - you might want to set it to a day or a week to ensure stale items don't sit around there too.

mpdonadio’s picture

If apc.user_ttl = 0, then I am not sure if this will matter. If there is an attempt to write to the cache, and it is full or too fragmented, APC will clear its contents (potentially leading to cache thrashing). From http://php.net/manual/en/apc.configuration.php#ini.apc.user-ttl

apc.user_ttl integer

The number of seconds a cache entry is allowed to idle in a slot in case this cache entry slot is needed by another entry. Leaving this at zero means that APC's cache could potentially fill up with stale entries while newer entries won't be cached. In the event of a cache running out of available memory, the cache will be completely expunged if ttl is equal to 0. Otherwise, if the ttl is greater than 0, APC will attempt to remove expired entries.

My read of this is that the second and third sentences contradict each other, but experience show the later happens.

If we start setting TTLs then APC will honor them , and since it isn't LRU, it will just not store new entries, which will lead to cache misses until the GC can remove expired entried.

However, if there is enough APC memory and we have short TTLs on the apc_store() calls, then it could help performance on test runs (no thrashing). This is probably a profile-and-see situation.

alexpott’s picture

The problem is that APC does not get expunged until you read that entry...

When specifying a ttl (Time-To-Live), you are allowed to use negative values. This causes a stored entry to be invalidated immediately, but note that it will not physically be removed until you read (eg. apc_fetch or apc_exists) it

catch’s picture

I've seen fragmented-but-not-full APC caches reject writes with the 'cannot allocate memory to pool' error. In that case it doesn't expunge the whole cache. But then I'd assume it doesn't expunge individual items past their ttl in that case either, so yeah doesn't help much.

alexpott’s picture

So shall we go forward with this patch and then fix up #2395143: YAML parsing is very slow, cache it with FileCache to use it?

This patch massively reduces that number of entries in a testbots APC user cache and increases the hit rate. All other ideas we've come up with have flaws at this point.

catch’s picture

If the bots are definitely clearing the cache after each test run, let's go ahead and do this. Can always roll it back again.

I still think a configurable ttl could be useful for some situations but that can happen elsewhere.

alexpott’s picture

Title: Random Test Failures (possibly related to APC) » Allow Simpletest to use the same APC user cache prefix so that tests can share the classmap and other cache objects
Issue summary: View changes
Issue tags: +blocker

Re-titling to fit the patch and fixing the issue summary. We no longer have random fails. That was fixed by reverting #2395143: YAML parsing is very slow, cache it with FileCache.

alexpott’s picture

Issue summary: View changes
catch’s picture

Status: Needs review » Reviewed & tested by the community

We should have a change notice for this since it could also be useful to set this on multisite installs.

I asked alexpott about #67 and he mentioned we'd need to add a static property to settings and also key by arguments. That doesn't seem worth it for < 10 calls so I think we could leave it for now and add later if it turns out to be a real issue.

Tentatively marking RTBC - no commits today since we're in beta freeze so time to knock back if it's not. This would be good to get in early after the beta in case it breaks things again.

webchick’s picture

Status: Reviewed & tested by the community » Fixed

I know we're not yet at the end of our Wednesday commit freeze but there's some action happening in #drupal-testing atm and this would really help. I looked through the recent bug reports for D8 https://www.drupal.org/project/issues/drupal?text=&status=Open&prioritie... and nothing in there seemed like it would require a paper bag release.

Committed and pushed to 8.0.x. Thanks!

  • webchick committed 19f7e41 on 8.0.x
    Issue #2474909 by alexpott, mpdonadio, znerol, catch, Wim Leers, Berdir...
fabianx’s picture

RTBC + 1, really nice patch!

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.