Allow Simpletest to use the same APC user cache prefix so that tests can share the classmap and other cache objects [#2474909]

Problem/Motivation

Adding lots of objects in APC user cache causes issues for concurrent test bot runs.

Potential but discarded resolutions

~~Restart Apache as part of the test setup or teardown. This will clear APC/APCu.~~ We already clear APC/APCu before every test suite run see #2257359: Testbots should clear apc user cache on initialization. See #45 for an explanation of why this isn't an option.

~~Call apc_clear_cache('user') in run-tests.sh.~~ This can't work, as this script isn't run in the Apache-space, so it does not have access to the same cache.

~~Create a standalone script w/ an acccess check to call apc_clear_cache('user').~~ Since there isn't a parent site, there is no hash salt to adequately protect a GET w/ a token+timestamp.

~~Use apc.user_ttl = 0 on the TestBot instances.~~ While APC will self heal, this has the potential to cause cache thrashing, as APC doesn't implement a LRU strategy, and will dump the entire cache when it is full with this setting. This is the current setting

Proposed resolution

Refactor APC usages so that a common prefix is used for cache object where that makes sense. For example, the class map and caches of Yaml files. The APC user cache is already cleared prior to each test suite run.

Remaining tasks

Commit.

User interface changes

None

API changes

Add \Drupal\Core\Site\Settings::getApcuPrefix()

Previous issue summary

I have been having random test errors today with HEAD on TestBot, in addition to some local testing I had been doing (which I didn't take notes on):

In https://www.drupal.org/node/2371491#comment-9849759, the same comment only patch was uploaded four times (different filenames each time). The patch was tested against the same codebase, but one failed in \Drupal\Tests\simpletest\Functional\BrowserTestBaseTest
Drupal\Tests\simpletest\Functional\BrowserTestBaseTest::testGoTo GuzzleHttp\Exception\ConnectException: cURL error 28
bot 2818

In https://www.drupal.org/node/2371491#comment-9849555, the same same comment only patch was uploaded twice. One passed, one failed in \Drupal\locale\Tests\LocaleUpdateTest with an exception in FileStorage.
mkdir(): File exists Warning FileStorage.php 171 Drupal\Component\PhpStorage\FileStorage->createDirectory()
bot 2843

In https://www.drupal.org/node/2371491#comment-9849129, there was a failure in \Drupal\file\Tests\FileManagedFileElementTest
bot 2843

In https://www.drupal.org/node/2371491#comment-9848437, there was a failure in \Drupal\simpletest\Tests\SimpleTestBrowserTest. The other fail was fixed in #2474835: Random test fail in PageCacheTest::testPageCacheAnonymous403404.
bot 2783

In https://qa.drupal.org/pifr/test/1030863, failures in \Drupal\file\Tests\RemoteFileSaveUploadTest at lines lines 70, 74, 309, and 309, and in \Drupal\simpletest\Tests\SimpleTestBrowserTest, at line 105. Bot #2813.

Comment	File	Size	Author
#66	2474909-lucky.66.patch	4.04 KB	alexpott
#66	52-66-interdiff.txt	4.45 KB	alexpott
#64	2474909-test-only.64.patch	33.43 KB	alexpott
#64	63-64-interdiff.txt	1000 bytes	alexpott
#63	2474909-test-only.patch	33.33 KB	mpdonadio
#62	2474909-alt-approach.52.patch	2.53 KB	alexpott
#58	2474909-alt-approach.58.patch	8.23 KB	alexpott
#58	52-58-interdiff.txt	5.92 KB	alexpott
#52	2474909-alt-approach.52.patch	2.53 KB	alexpott
#52	50-52-interdiff.txt	1.28 KB	alexpott
#50	2474909-alt-approach.49.patch	3.61 KB	alexpott
#48	2474909.48.patch	9.17 KB	alexpott
#48	37-48-interdiff.txt	2.41 KB	alexpott
#40	2474909-test-only.patch	39.87 KB	mpdonadio
#37	2474909.37.patch	9.07 KB	alexpott
#37	36-37-interdiff.txt	3.8 KB	alexpott
#36	2474909.36.patch	6.63 KB	alexpott
#36	30-36-interdiff.txt	5.97 KB	alexpott
#30	2474909.30.patch	5.96 KB	alexpott
#30	29-30-interdiff.txt	4.55 KB	alexpott
#29	2474909.29.patch	5.21 KB	alexpott
#19	2474909.19.patch	549 bytes	alexpott

Comments

Comment #1

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 21 April 2015 at 01:56

Issue summary:

View changes

Comment #2

znerol commented 21 April 2015 at 18:49

https://qa.drupal.org/pifr/test/1030288 is interesting: cURL error 28: See http://curl.haxx.se/libcurl/c/libcurl-errors.html .

CURLE_OPERATION_TIMEDOUT (28)

Operation timeout. The specified time-out period was reached according to the conditions.

Comment #3

znerol commented 21 April 2015 at 19:40

$t0 = time();

$ch1 = curl_init();
curl_setopt($ch1, CURLOPT_URL, "http://1.1.1.1/");
curl_setopt($ch1, CURLOPT_HEADER, 0);
#curl_setopt($ch1, CURLOPT_CONNECTTIMEOUT, 1);
#curl_setopt($ch1, CURLOPT_CONNECTTIMEOUT_MS, 1000);
curl_exec($ch1);
var_dump(curl_errno($ch1));
var_dump(curl_error($ch1));
curl_close($ch1);

$t1 = time();

print ("t1-t0: " . ($t1 - $t0) . "\n");

$ch2 = curl_init();
curl_setopt($ch2, CURLOPT_URL, "http://1.1.1.1/");
curl_setopt($ch2, CURLOPT_HEADER, 0);
curl_setopt($ch2, CURLOPT_CONNECTTIMEOUT, 1);
curl_setopt($ch2, CURLOPT_CONNECTTIMEOUT_MS, 1000);
curl_exec($ch2);
var_dump(curl_errno($ch2));
var_dump(curl_error($ch2));
curl_close($ch2);

$t2 = time();
print ("t2-t1: " . ($t2 - $t1) . "\n");

yields

php curl-test.php 
int(7)
string(58) "Failed to connect to 1.1.1.1 port 80: Connection timed out"
t1-t0: 15
int(28)
string(44) "Connection timed out after 1001 milliseconds"
t2-t1: 1

So there seems to be a standard timeout of 15 seconds on my machine. I cannot find a place (ini option) where this could be changed. Funny enough the error code/message is different if the timeout option was set (28 with the option, 7 without).

Comment #4

znerol commented 21 April 2015 at 19:41

(21:31:21) mpdonadio: znerol: i have had about 25 local SimpleBrowserTest passes locally, so this could suggest a concurrency or testbot problem ... not seeing anything in the git logs to suggest a change, though the apc classloader gave me pause ... apc does crazy things when it runs out of memory
(21:32:24) znerol: mpdonadio: I've been experimenting with curl
(21:33:26) znerol: mpdonadio: there was http://drupal.org/node/2461985 which added CURLOPT_TIMEOUT in addition to CURLOPT_TIMEOUT_MS for backward compatibility
(21:33:29) Druplicon: http://drupal.org/node/2461985 => Update Guzzle to latest release #2461985: Update Guzzle to latest release => 11 comments, 1 IRC mention
(21:34:05) znerol: mpdonadio: also I tried to figure out whether that value maybe could bleed over into simpletest
(21:34:12) znerol: mpdonadio: but I think we can exclude that
(21:34:22) znerol: mpdonadio: once curl_close is called, the options are gone
(21:35:03) znerol: mpdonadio: do you happen to know whether they need to restart the webserver for a new testrun?
(21:35:47) znerol: mpdonadio: if it does not restart fast enough when other tests are running, then maybe this could become a problem?
(21:36:27) znerol: mpdonadio: or is there only one run-tests.sh per machine?
(21:36:37) mpdonadio: znerol: not sure ... having trouble raising anyone to answer testbot questions

Comment #5

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 21 April 2015 at 19:49

Antother data point, Another data point: https://qa.drupal.org/pifr/test/1030863. Failure in \Drupal\file\Tests\RemoteFileSaveUploadTest and \Drupal\simpletest\Tests\SimpleTestBrowserTest in one run, but passed in another.

Comment #6

xjm

she/her

English

commented 21 April 2015 at 19:58

Issue summary:

View changes

Comment #7

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 21 April 2015 at 20:20

Issue summary:

View changes

Comment #8

znerol commented 22 April 2015 at 08:13

This one is extremely weird: https://qa.drupal.org/pifr/test/1031223

Cannot redeclare class drupal\core\render\element\operations PHP Fatal error Operations.php

Except that the patch does not touch that area at all.

(10:07:06 AM) berdir: znerol: Dom__: I'm worried that we have a problem with the apc classloader.. that we're not restarting apache and end up with inconsistencies in the cache when patches add/change classes

Comment #9

znerol commented 22 April 2015 at 08:13

Comment #10

xjm

she/her

English

commented 22 April 2015 at 08:32

Also WRT the APC classloader, I saw several on 2833 fail with:
Maximum execution time of 300 seconds exceeded PHP Fatal error ApcClassLoader.php

Comment #11

xjm

she/her

English

commented 22 April 2015 at 08:33

Title:	Random Test Failures with Unknown Cause	» Random Test Failures (possibly related to APC classloader)
Priority:	Major	» Critical

Bumping to critical; this has gotten pretty disruptive.

Comment #12

berdir

German

Switzerland

commented 22 April 2015 at 08:52

APC cache size might be too small or extremely fragmented. Every test has a different prefix for the classloader, so we have millions of records in there.

Workaround might be to restart apache as part of each test run. Or only use the apc classloader by default when the apcu extension is enabled.

Comment #13

berdir

German

Switzerland

commented 22 April 2015 at 08:53

We need to get the "testbot guys" in here, they should check the output of the apc.php statistics file on the server...

Comment #14

xjm

she/her

English

commented 22 April 2015 at 09:21

It also appears we've severely regressed the test suite run times; possibly related?

Comment #15

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 22 April 2015 at 11:36

Or clear apc_cache_clear('user') as part of testbot initialization. The timing of this does seem to correspond with the APC classloader changes. Were people at DDD having random fails over the weekend?

Comment #16

alexpott

he/they

English

🇪🇺🌍

commented 22 April 2015 at 12:36

I reverted #2395143-203: YAML parsing is very slow, cache it with FileCache - I think we only started to see this after that patch went it in. The APCu classloader went in really early in the devdays week.

Comment #17

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 22 April 2015 at 15:35

Title:

Random Test Failures (possibly related to APC classloader)

» Random Test Failures (possibly related to APC)

#14, if the problems are causing timeouts and cache churning, then that could be the cause of the long run times. I have noticed that some runs take the normal 20ish minutes, but others can take about about hour now.

I edited the title as the cause could be #2395143: YAML parsing is very slow, cache it with FileCache or #2296009: Use APC Classloader by default (when available), both of which are using APC.

The general problem, I think, is that APC isn't the best cache. The last time I checked, APC (not 100% sure about APCu) doesn't employ a LRU strategy, so you either need to set the TTL to 0 (which drops the whole cache) or you need to add some logic to handle a failed cache set. I have seen full/fragmented APC cause some seriously weird problems, when it can't GC enough data.

A short term solution could be to set apc.user_ttl=0 on the TestBots. If APC fills up, then it will just erase the whole cache. Or, bump up the APC apc.shm_size now that user cache is being used more.

A long term solution is trickier. It would likely involve setting a site-wide APC prefix, storing it in settings.php, and then using an APCIterator() to delete the entries during a TestBot teardown.

Comment #18

alexpott

he/they

English

🇪🇺🌍

commented 22 April 2015 at 20:54

Increasing the apc.shm_size won't be that effective because because we key everything using the hash salt. Combined with the fact that we run concurrent tests this means any attempt to nuke the user cache with something like:

diff --git a/core/modules/simpletest/src/TestBase.php b/core/modules/simpletest/src/TestBase.php
index 9a38774..1c24100 100644
--- a/core/modules/simpletest/src/TestBase.php
+++ b/core/modules/simpletest/src/TestBase.php
@@ -1235,6 +1235,15 @@ private function prepareEnvironment() {
     ));
 
     drupal_set_time_limit($this->timeLimit);
+
+    // Clear APC user cache to ensure mulitple test runs on the same testbot do
+    // not fragment the cache.
+    if (function_exists('apc_clear_cache')) {
+      apc_clear_cache('user');
+    }
+    if (function_exists('apcu_clear_cache')) {
+      apcu_clear_cache();
+    }
   }
 
   /**

won't work.

Comment #19

alexpott

he/they

English

🇪🇺🌍

commented 22 April 2015 at 22:27

Status:

Active

» Needs review

Status	File	Size
new	2474909.19.patch	549 bytes

Well we can do something like this in run-tests.sh in the part that is only run in the process that loops to run all the tests in separate and concurrent processes.

Comment #20

berdir

German

Switzerland

commented 22 April 2015 at 22:39

Except that doesn't run in the apache process, so it can't clear the cache there.

Comment #21

alexpott

he/they

English

🇪🇺🌍

commented 22 April 2015 at 22:58

@Berdir I keep on forgetting that... thanks!

Comment #22

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 23 April 2015 at 00:11

Issue summary:

View changes

Comment #23

alexpott

he/they

English

🇪🇺🌍

commented 23 April 2015 at 07:58

Issue summary:

View changes

The current APC settings on a testbot:

 sudo -u www-data php -i | grep apc
/etc/php5/cli/conf.d/20-apc.ini,
apc
MMAP File Mask => /tmp/apc.UOaRHi
apc.cache_by_default => On => On
apc.canonicalize => On => On
apc.coredump_unmap => Off => Off
apc.enable_cli => On => On
apc.enabled => On => On
apc.file_md5 => Off => Off
apc.file_update_protection => 2 => 2
apc.filters => no value => no value
apc.gc_ttl => 3600 => 3600
apc.include_once_override => Off => Off
apc.lazy_classes => Off => Off
apc.lazy_functions => Off => Off
apc.max_file_size => 1M => 1M
apc.mmap_file_mask => /tmp/apc.UOaRHi => /tmp/apc.UOaRHi
apc.num_files_hint => 1000 => 1000
apc.preload_path => no value => no value
apc.report_autofilter => Off => Off
apc.rfc1867 => On => On
apc.rfc1867_freq => 0 => 0
apc.rfc1867_name => APC_UPLOAD_PROGRESS => APC_UPLOAD_PROGRESS
apc.rfc1867_prefix => upload_ => upload_
apc.rfc1867_ttl => 3600 => 3600
apc.serializer => default => default
apc.shm_segments => 1 => 1
apc.shm_size => 300M => 300M
apc.shm_strings_buffer => 4M => 4M
apc.slam_defense => On => On
apc.stat => On => On
apc.stat_ctime => Off => Off
apc.ttl => 0 => 0
apc.use_request_time => On => On
apc.user_entries_hint => 4096 => 4096
apc.user_ttl => 0 => 0
apc.write_lock => On => On

Comment #24

alexpott

he/they

English

🇪🇺🌍

commented 23 April 2015 at 08:21

Issue summary:

View changes

No resolutions left :(

Comment #25

wim leers

Ghent 🇧🇪🇪🇺

commented 23 April 2015 at 08:48

Quoting myself at #2395143-208: YAML parsing is very slow, cache it with FileCache:

* Consider restarting apache after/before a completed test run, to drop out any cached information between test runs.

Almost exactly a year ago, @rfay said this over at #1809248-27: Enable apc.enable_cli php.ini setting on testbots:

I think we've had a restart after every test in there for a long time.

… so … did this change?

Comment #26

wim leers

Ghent 🇧🇪🇪🇺

commented 23 April 2015 at 08:50

D'oh, that's what #24 just said: that is already happening. But if that is already happening, how is it possible that we fragment the data in APCu so very badly? How can there even be different prefixes in there within a single test run? Or is the problem that for a single test suite run, we reinstall Drupal so many times, each with different prefixes, that despite the restarts, it is is still a problem?

Comment #27

alexpott

he/they

English

🇪🇺🌍

commented 23 April 2015 at 08:53

re #26 Yep within a test suite run - we key by hash salt which will be different for each install.

Comment #28

alexpott

he/they

English

🇪🇺🌍

commented 23 April 2015 at 08:56

The only way to solve this is for web tests to set a predictable prefix and do an additional request in tearDown to clear the apc user cache using this key.

Comment #29

alexpott

he/they

English

🇪🇺🌍

commented 23 April 2015 at 11:15

Status	File	Size
new	2474909.29.patch	5.21 KB

Here is a patch that empties the user cache for a specific simpletest webtestbase run. We don;t need to worry about KernelTestBase because these are cli tests and therefore APC cache is only persistent for the duration of that particular test.

Ran into some oddities about using apc_delete with the APCiterator - it looks like only 939 entries can be cleared at once for some reason. This will also affect our APCuBackend deletion.

Comment #30

alexpott

he/they

English

🇪🇺🌍

commented 23 April 2015 at 11:43

Status	File	Size
new	29-30-interdiff.txt	4.55 KB
new	2474909.30.patch	5.96 KB

Improved docs and security... it relies on drupal_valid_ua to permit access to the route to clear APC user cache.

Comment #31

wim leers

Ghent 🇧🇪🇪🇺

commented 23 April 2015 at 12:01

Status:

Needs review

» Needs work

+++ b/core/lib/Drupal/Core/DrupalKernel.php
@@ -251,7 +251,7 @@ public static function createFromRequest(Request $request, $class_loader, $envir
-      $prefix = 'drupal.' . hash('sha256', 'drupal.' . Settings::getHashSalt());

@@ -1338,4 +1338,28 @@ protected function addServiceFiles($service_yamls) {
+    $prefix = 'drupal.';
+    if ($apc_prefix = Settings::get('apc_prefix')) {
+       $prefix .= $apc_prefix . '.';
+    }
+    return $prefix . $identifier . hash('sha256', $prefix . $identifier . Settings::getHashSalt());

So we basically make the prefix longer, so that we now have a controllable per-site prefix.

(One could argue we used to have "drupal." as a prefix, but deleting by that prefix would also delete the caches of other Drupal instances on the same shared host. Correct?)

+++ b/core/lib/Drupal/Core/DrupalKernel.php
@@ -1338,4 +1338,28 @@ protected function addServiceFiles($service_yamls) {
+   * A standardised prefix is useful to allow visual inspection of an APC cache.

Sorry: s/standardised/standardized/

APC or APCu?

+++ b/core/modules/system/src/Controller/SystemController.php
@@ -456,4 +457,37 @@ public static function setLinkActiveClass(array $element, array $context) {
+      // No APC class  or non alphanumeric characters in the prefix.

Nit: double space.

+++ b/core/modules/system/src/Controller/SystemController.php
@@ -456,4 +457,37 @@ public static function setLinkActiveClass(array $element, array $context) {
+    // APCIterator only deletes 939 items in one go. Not sure why but it is
+    // repeatable.

Nit: not enough cursing.

+++ b/core/modules/system/src/Controller/SystemController.php
@@ -456,4 +457,37 @@ public static function setLinkActiveClass(array $element, array $context) {
+    // Do nothing else so the APC classloader is not rebuild this request.

s/rebuild/rebuilt/

+++ b/core/modules/system/system.routing.yml
@@ -474,3 +474,12 @@ system.entity_autocomplete:
+system.apc_clear:
+  path: '/admin/config/development/testing/apc_clear'
+  defaults:
+    _controller: '\Drupal\system\Controller\SystemController::apcClear'
+  requirements:
+    _access: 'TRUE'

This is dangerous: it means anybody can invalidate the APCu cache. Too easy to attack.

Comment #32

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 23 April 2015 at 12:43

#23, those look like the settings for the CLI version of PHP. Is the Apache version the same? I also don't see how this can be a problem if apc.user_ttl = 0.

I don't think we have gotten true confirmation on #25. If Apache really is restarting after every run then the APC cache should be clearing, so the clear in the teardown should be unnecessary.

It almost seems like that in concurrent testbot runs, two instances are getting the same the hash/prefix and are stepping on each other.

Just a random note, that apc.gc_ttl = 3600 is likely too high, if we are restarting Apache and test runs average about 20-25 min, then expirable entries many never get GCed.

Comment #33

dawehner

German

commented 23 April 2015 at 12:58

+++ b/core/lib/Drupal/Core/Cache/ApcuBackendFactory.php
@@ -34,7 +34,7 @@ class ApcuBackendFactory implements CacheFactoryInterface {
   public function __construct($root, CacheTagsChecksumInterface $checksum_provider) {
-    $this->sitePrefix = Crypt::hashBase64($root . '/' . conf_path());
+    $this->sitePrefix = DrupalKernel::getApcuPrefix('cache_backend');
     $this->checksumProvider = $checksum_provider;

I'm curious why this should belong into the DrupalKernel and not into Settings? There could be Settings::getApcuPrefix?

+++ b/core/modules/system/system.routing.yml
@@ -474,3 +474,12 @@ system.entity_autocomplete:
+  requirements:
+    _access: 'TRUE'

I think we don't want to allow everyone to clear those?

Comment #34

alexpott

he/they

English

🇪🇺🌍

commented 23 April 2015 at 12:59

@mpdonadio apc cache is being cleared before every suite run... that includes hundreds on web tests. And yes the user_ttl is just the default - see https://php.net/manual/en/apc.configuration.php

The problem we are having is being caused by the cache thrashing of thousands of concurrent inserts and evictions.

Comment #35

alexpott

he/they

English

🇪🇺🌍

commented 23 April 2015 at 13:02

@dawehner and @Wim Leers re security.

+++ b/core/modules/system/src/Controller/SystemController.php
@@ -456,4 +457,37 @@ public static function setLinkActiveClass(array $element, array $context) {
+  public function apcClear() {
+    $prefix = drupal_valid_test_ua();
+    if (!class_exists('\APCIterator') || !$prefix || !ctype_alnum($prefix)) {
+      // No APC class  or non alphanumeric characters in the prefix.
+      throw new AccessDeniedHttpException();
+    }

In order to clear the apc caches we have to have valid test ua and it only clears cache entries with that in the key name. So I think the danger is eliminated. I will add a comment to the routing file.

Comment #36

alexpott

he/they

English

🇪🇺🌍

commented 23 April 2015 at 13:19

Status:	Needs work	» Needs review
Issue tags:		+Needs manual testing

Status	File	Size
new	30-36-interdiff.txt	5.97 KB
new	2474909.36.patch	6.63 KB

@Wim Leers / #31

We can't clear by drupal. because concurrent tests and security
Fixed and specified APC user cache so it is nice for both 5.4 and 5.5+
Fixed
Agreed this is llama poop
Fixed
Added docs to the routing file why I think this is acceptable

@dawehner - nice idea putting the static method in settings. Done.

I could only prove this patch worked through manual testing so I guess other should to.

Comment #37

alexpott

he/they

English

🇪🇺🌍

commented 23 April 2015 at 13:41

Issue tags:

-Needs manual testing

Status	File	Size
new	36-37-interdiff.txt	3.8 KB
new	2474909.37.patch	9.07 KB

Some improvements as to when we actually need to do stuff and added a test - which also shows that the route is quite secure.

Comment #38

wim leers

Ghent 🇧🇪🇪🇺

commented 23 April 2015 at 13:50

#36.1: yep, I got that, I was just explicitly stating why that wasn't possible. Sorry for not being more clear.

#36.4: @alexpott+++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Comment #39

23 April 2015 at 14:26

The last submitted patch, 29: 2474909.29.patch, failed testing.

Comment #40

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 23 April 2015 at 14:38

Status	File	Size
new	2474909-test-only.patch	39.87 KB

Here are the two reverted patches from #2395143: YAML parsing is very slow, cache it with FileCache reapplied to HEAD along with the patch in #37. If TestBot looks slow tonight, I will queue this up a bunch of times on my IGNORE issue and see what happens.

Comment #41

Aki Tendo commented 23 April 2015 at 15:13

Beware newer APCu...

Comment #42

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 23 April 2015 at 17:51

Issue summary:

View changes

Per @Mixologic in IRC, Apache does not restart between test runs.

Comment #43

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 23 April 2015 at 17:54

Issue summary:

View changes

Comment #44

23 April 2015 at 22:18

The last submitted patch, 36: 2474909.36.patch, failed testing.

Comment #45

alexpott

he/they

English

🇪🇺🌍

commented 23 April 2015 at 22:55

Re the Apache restart thing - it is still not a solution because the APC caches are cleared by the test runner in a web request. And clearing the caches is why you are suggesting restarting Apache -which is actually impossible anyway because the test running done by pifr appears to take place within a web request - so a restart would kill the test run.

Comment #46

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 23 April 2015 at 23:24

Issue summary:

View changes

Comment #47

24 April 2015 at 01:43

The last submitted patch, 37: 2474909.37.patch, failed testing.

Comment #48

alexpott

he/they

English

🇪🇺🌍

commented 24 April 2015 at 11:11

Issue summary:

View changes

Status	File	Size
new	37-48-interdiff.txt	2.41 KB
new	2474909.48.patch	9.17 KB

So this seems to have the affect on making the testbots crawl :(. i'm guessing this is because doing an APCIterator on an APC cache that is getting 5000 inserts per second slows everything right down.

The APCIterator class makes it easier to iterate over large APC caches. This is helpful as it allows iterating over large caches in steps, while grabbing a defined number of entries per lock instance, so it frees the cache locks for other activities rather than hold up the entire cache to grab 100 (the default) entries. Also, using regular expression matching is more efficient as it's been moved to the C level.

From https://php.net/manual/en/class.apciterator.php.

So the latest patch makes the chunk size 1 to see if it's the locks that make cause this.

Comment #49

24 April 2015 at 14:13

Status:

Needs review

» Needs work

The last submitted patch, 48: 2474909.48.patch, failed testing.

Comment #50

alexpott

he/they

English

🇪🇺🌍

commented 24 April 2015 at 14:24

Status:

Needs work

» Needs review

Status	File	Size
new	2474909-alt-approach.49.patch	3.61 KB

Local testing has shown that #48 is still going to be super slow.

So yet another idea... we can use a consistent prefix for the class loader and file cache (when the patch is re-committed) on the testbots because code is not moving.

Running the config group of tests...

Without the patch attached

Cached Variables 27807 ( 26.5 MBytes)
Hits: 254278 (90.1%)

With the patch

Cached Variables 4968 ( 12.0 MBytes)
Hits: 274107 (98.2%)

So this patch will lead to considerably less items in the APC user cache for a full run of the testbot suite.

Comment #51

24 April 2015 at 14:29

The last submitted patch, 40: 2474909-test-only.patch, failed testing.

Comment #52

alexpott

he/they

English

🇪🇺🌍

commented 24 April 2015 at 14:52

Status	File	Size
new	50-52-interdiff.txt	1.28 KB
new	2474909-alt-approach.52.patch	2.53 KB

Removing some cruft

Comment #53

24 April 2015 at 14:53

The last submitted patch, 50: 2474909-alt-approach.49.patch, failed testing.

Comment #54

24 April 2015 at 17:56

The last submitted patch, 19: 2474909.19.patch, failed testing.

Comment #55

24 April 2015 at 18:00

isntall queued 19: 2474909.19.patch for re-testing.

Comment #56

24 April 2015 at 18:05

The last submitted patch, 30: 2474909.30.patch, failed testing.

Comment #57

catch

he/him

English

commented 24 April 2015 at 19:39

With #52 should we still be trying the APCIterator for items in the cache backend?

The number of items to delete should be less, and the number of inserts while it's happening should also be less.

Comment #58

alexpott

he/they

English

🇪🇺🌍

commented 24 April 2015 at 21:06

Status	File	Size
new	52-58-interdiff.txt	5.92 KB
new	2474909-alt-approach.58.patch	8.23 KB

So here's a patch that kind of combines #52 and #48 to remove APC backends. With this patch after running the config group of tests:
Cached Variables: 1926 ( 1.2 MBytes)
Hits: 291333 (98.3%)

So way less leftover stuff in APC... hopefully the bots like it.

Comment #59

25 April 2015 at 00:08

Status:

Needs review

» Needs work

The last submitted patch, 58: 2474909-alt-approach.58.patch, failed testing.

Comment #60

25 April 2015 at 11:07

Status:

Needs work

» Needs review

alexpott queued 58: 2474909-alt-approach.58.patch for re-testing.

Comment #61

25 April 2015 at 12:00

Status:

Needs review

» Needs work

The last submitted patch, 58: 2474909-alt-approach.58.patch, failed testing.

Comment #62

alexpott

he/they

English

🇪🇺🌍

commented 25 April 2015 at 13:06

Status:

Needs work

» Needs review

Status	File	Size
new	2474909-alt-approach.52.patch	2.53 KB

The patch in #58 doesn't appear to work. Both test suite runs with it did not complete. Using APCIterator we the APC cache is being assaulted by the the concurrent test runs seems to cause a serious slowdown.

I think the patch in #52 is the way to go. Re-uploading so it is the most recent patch on the issue.

Comment #63

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 25 April 2015 at 17:11

Status	File	Size
new	2474909-test-only.patch	33.33 KB

Here are the two reverted patches from #2395143: YAML parsing is very slow, cache it with FileCache (2395143-185.patch + 2395143-190-followup.patch) + the patch from #62 (2474909-alt-approach.52_0.patch) for testing purposes. If TestBot is slow tonight (and I remember), I will run this a bunch of times on my IGNORE issue as a test.

Comment #64

alexpott

he/they

English

🇪🇺🌍

commented 25 April 2015 at 19:14

Status	File	Size
new	63-64-interdiff.txt	1000 bytes
new	2474909-test-only.64.patch	33.43 KB

@mpdonadio the previous file cache patch needs to be slightly adjusted to work as we expect.

With the patch in #63 after running the config group of tests:
Cached Variables: 8202 ( 17.3 MBytes)
Hits: 279313 (97.1%)

With the patch attached:
Cached Variables: 5410 ( 12.8 MBytes)
Hits: 285497 (98.1%)

Comment #65

alexpott

he/they

English

🇪🇺🌍

commented 25 April 2015 at 19:17

I think we have a problem though because with the current approach if you had two drupal installs on the same machine say beta9 and HEAD you couldn't run simpletest on both of them at the same time without cache collisions :(

Comment #66

alexpott

he/they

English

🇪🇺🌍

commented 25 April 2015 at 20:45

Status	File	Size
new	52-66-interdiff.txt	4.45 KB
new	2474909-lucky.66.patch	4.04 KB

The patch attached solves the problem outlined in #65.

Cached Variables: 4968 ( 12.1 MBytes)
Hits: 274427 (98.2%)

So no difference to #52

Comment #67

dawehner

German

commented 27 April 2015 at 05:45

+++ b/core/lib/Drupal/Core/Site/Settings.php
@@ -148,4 +149,31 @@ public static function getHashSalt() {
+    if (static::get('apcu_ensure_unique_prefix', TRUE)) {
+      return 'drupal.' . $identifier . '.' . hash_hmac('sha256', $identifier, static::get('hash_salt', $root . '/' . $site_path));
+    }
+    return 'drupal.' . $identifier . '.' . Crypt::hashBase64($root . '/' . $site_path);

I'm curious whether we should store the result, given that we at least will use it probably twice.

Comment #68

catch

he/him

English

commented 27 April 2015 at 13:35

The patch in #58 doesn't appear to work. Both test suite runs with it did not complete. Using APCIterator we the APC cache is being assaulted by the the concurrent test runs seems to cause a serious slowdown.

So I'm still a bit concerned that if a machine does dozens/hundreds of test runs without a restart, we'll end up with a full APC(u) cache again from the cache backends. It'll take longer, but could still happen over time.

What about the following:

From ApcuBackend:

/ apc_store()'s $ttl argument can be omitted but also set to 0 (zero),
    // in which case the value will persist until it's removed from the cache or
    // until the next cache clear, restart, etc. This is what we want to do
    // when $expire equals CacheBackendInterface::CACHE_PERMANENT.
    if ($expire === CacheBackendInterface::CACHE_PERMANENT) {
      $expire = 0;
    }
    apc_store($this->getApcuKey($cid), $cache, $expire);

Add a configuration option here to treat set a defined ttl other than 0 for CACHE_PERMANENT in Settings.

So on the test bots, we could set that to ten minutes.

That could also be useful on real sites - you might want to set it to a day or a week to ensure stale items don't sit around there too.

Comment #69

mpdonadio

he/him

English

Philadelphia/PA/USA (UTC-5)

commented 27 April 2015 at 14:14

If apc.user_ttl = 0, then I am not sure if this will matter. If there is an attempt to write to the cache, and it is full or too fragmented, APC will clear its contents (potentially leading to cache thrashing). From http://php.net/manual/en/apc.configuration.php#ini.apc.user-ttl

apc.user_ttl integer

The number of seconds a cache entry is allowed to idle in a slot in case this cache entry slot is needed by another entry. Leaving this at zero means that APC's cache could potentially fill up with stale entries while newer entries won't be cached. In the event of a cache running out of available memory, the cache will be completely expunged if ttl is equal to 0. Otherwise, if the ttl is greater than 0, APC will attempt to remove expired entries.

My read of this is that the second and third sentences contradict each other, but experience show the later happens.

If we start setting TTLs then APC will honor them , and since it isn't LRU, it will just not store new entries, which will lead to cache misses until the GC can remove expired entried.

However, if there is enough APC memory and we have short TTLs on the apc_store() calls, then it could help performance on test runs (no thrashing). This is probably a profile-and-see situation.

Comment #70

alexpott

he/they

English

🇪🇺🌍

commented 27 April 2015 at 14:36

The problem is that APC does not get expunged until you read that entry...

When specifying a ttl (Time-To-Live), you are allowed to use negative values. This causes a stored entry to be invalidated immediately, but note that it will not physically be removed until you read (eg. apc_fetch or apc_exists) it

Comment #71

catch

he/him

English

commented 27 April 2015 at 15:34

I've seen fragmented-but-not-full APC caches reject writes with the 'cannot allocate memory to pool' error. In that case it doesn't expunge the whole cache. But then I'd assume it doesn't expunge individual items past their ttl in that case either, so yeah doesn't help much.

Comment #72

alexpott

he/they

English

🇪🇺🌍

commented 28 April 2015 at 19:10

So shall we go forward with this patch and then fix up #2395143: YAML parsing is very slow, cache it with FileCache to use it?

This patch massively reduces that number of entries in a testbots APC user cache and increases the hit rate. All other ideas we've come up with have flaws at this point.

Comment #73

catch

he/him

English

commented 28 April 2015 at 19:50

If the bots are definitely clearing the cache after each test run, let's go ahead and do this. Can always roll it back again.

I still think a configurable ttl could be useful for some situations but that can happen elsewhere.

Comment #74

alexpott

he/they

English

🇪🇺🌍

commented 29 April 2015 at 14:39

Title:	Random Test Failures (possibly related to APC)	» Allow Simpletest to use the same APC user cache prefix so that tests can share the classmap and other cache objects
Issue summary:	View changes
Issue tags:		+blocker

Re-titling to fit the patch and fixing the issue summary. We no longer have random fails. That was fixed by reverting #2395143: YAML parsing is very slow, cache it with FileCache.

Comment #75

alexpott

he/they

English

🇪🇺🌍

commented 29 April 2015 at 14:43

Issue summary:

View changes

Comment #76

catch

he/him

English

commented 29 April 2015 at 15:11

Status:

Needs review

» Reviewed & tested by the community

We should have a change notice for this since it could also be useful to set this on multisite installs.

I asked alexpott about #67 and he mentioned we'd need to add a static property to settings and also key by arguments. That doesn't seem worth it for < 10 calls so I think we could leave it for now and add later if it turns out to be a real issue.

Tentatively marking RTBC - no commits today since we're in beta freeze so time to knock back if it's not. This would be good to get in early after the beta in case it breaks things again.

Comment #77

webchick

she/they

English

Vancouver 🇨🇦

commented 29 April 2015 at 22:09

Status:

Reviewed & tested by the community

» Fixed

I know we're not yet at the end of our Wednesday commit freeze but there's some action happening in #drupal-testing atm and this would really help. I looked through the recent bug reports for D8 https://www.drupal.org/project/issues/drupal?text=&status=Open&prioritie... and nothing in there seemed like it would require a paper bag release.

Committed and pushed to 8.0.x. Thanks!

Comment #78

29 April 2015 at 22:10

webchick committed 19f7e41 on 8.0.x

Issue #2474909 by alexpott, mpdonadio, znerol, catch, Wim Leers, Berdir...

Comment #79

fabianx commented 29 April 2015 at 22:43

RTBC + 1, really nice patch!

Comment #80

13 May 2015 at 22:44

Status:

Fixed

» Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.