See the continuation of the discussion in the #2934002: APCu cache backend can have unreasonable number of entries during testing or multi-site

Problem/Motivation

Wide use of a unique prefix can lead to problem with memory, if tests are run with a concurrency higher than 1.

Exception: Warning: apcu_store(): Unable to allocate memory for pool.
Symfony\Component\ClassLoader\ApcClassLoader->findFile()() (Line: 128)

Examples:

Another kind of error that you can get because of this 'CI aborted' due to 'Build timed out (after 110 minutes)'. This happens when the tests simply hang because of a lack of resources, collisions, or apcu fragmentaions.

Proposed resolution

1. Use significantly more an more resources (see #29, #34- #36, #2930022-106)

2. Not use unique prefix.
We have already done this in #2474909: Allow Simpletest to use the same APC user cache prefix so that tests can share the classmap and other cache objects, but then canceled because of suspicion of random fails (#2749955: Random fails in UpdatePathTestBase tests). Later the culprit of random fails was detected in gc.

See also #2930022: Testing fails 'CI aborted' and 'apcu memory' with with a lot of research and tonal pictures statistical charts.

Remaining tasks

User interface changes

API changes

Data model changes

Original post

Exception: Warning: apcu_store(): Unable to allocate memory for pool.
Symfony\Component\ClassLoader\ApcClassLoader->findFile()() (Line: 128)

Examples:
https://www.drupal.org/pift-ci-job/819166
https://www.drupal.org/pift-ci-job/820576
https://www.drupal.org/pift-ci-job/821247
https://www.drupal.org/pift-ci-job/822016
https://www.drupal.org/pift-ci-job/822256
https://www.drupal.org/pift-ci-job/822949
https://www.drupal.org/pift-ci-job/823749
https://www.drupal.org/pift-ci-job/823996
https://www.drupal.org/pift-ci-job/824108
https://www.drupal.org/pift-ci-job/824147
https://www.drupal.org/pift-ci-job/824864
https://www.drupal.org/pift-ci-job/826294
https://www.drupal.org/pift-ci-job/826932
https://www.drupal.org/pift-ci-job/827140
https://www.drupal.org/pift-ci-job/830325
https://www.drupal.org/pift-ci-job/831905
https://www.drupal.org/pift-ci-job/831952
https://www.drupal.org/pift-ci-job/831958
https://www.drupal.org/pift-ci-job/832566
https://www.drupal.org/pift-ci-job/833314
https://www.drupal.org/pift-ci-job/833405
https://www.drupal.org/pift-ci-job/833456
https://www.drupal.org/pift-ci-job/834821

A couple of issues which might help to determine problem:

A couple php bugs with apcu:

Issue for testing any ideas: #2930022: Testing fails 'CI aborted' and 'apcu memory'

CommentFileSizeAuthor
#77 2926309-51.patch3.98 KBmpdonadio
#66 Screen Shot 2018-01-03 at 11.41.03 AM.png188.96 KBMixologic
#63 2926309-51.patch3.98 KBalexpott
#61 2926309-different-update-tests-1000-with-patch.patch3.08 KBAnonymous (not verified)
#61 2926309-different-update-tests-1000-as-is.patch666 bytesAnonymous (not verified)
#61 2926309-61-update-test-as-is.patch627 bytesAnonymous (not verified)
#60 2926309-57.patch3.98 KBalexpott
#60 2926309-60-update-test.patch4.59 KBalexpott
#57 2926309-57.patch3.98 KBalexpott
#57 2926309-57-update-test.patch4.57 KBalexpott
#57 2762549-apcu-117.patch599 bytesalexpott
#52 interdiff-43-51.txt629 bytesAnonymous (not verified)
#52 2926309-51.patch3.98 KBAnonymous (not verified)
#43 interdiff-41-43.txt2.09 KBAnonymous (not verified)
#43 2926309-43.patch3.98 KBAnonymous (not verified)
#42 2926309-42-test-only.patch2.82 KBmpdonadio
#41 2930022-126.patch2.43 KBAnonymous (not verified)
#36 APC entrieshint 54 min 40 sec.png205.02 KBMixologic
#36 APC Defaults 31 min 20 sec.png208.82 KBMixologic
#29 Screen Shot 2017-12-21 at 1.19.18 PM.png208.04 KBMixologic
#25 concurrent-requests-over-testrun-40852.png53.29 KBtacituseu
#25 tests-completed-over-time-40852.png18.83 KBtacituseu
#25 apcu-free-over-testrun-40852.png28.14 KBtacituseu
#23 834888-40783-0h47m-time-track-sorted.csv_.txt272.25 KBtacituseu
#23 834262-40472-1h24m-time-track-sorted.csv_.txt272.18 KBtacituseu
#21 shuffle-vs-usual.png33.05 KBAnonymous (not verified)
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Anonymous’s picture

vaplas created an issue. See original summary.

Anonymous’s picture

I'm not sure that this is Drupal's problem and maybe this happens rarely. So Normal priority. While issue just for notification and collection of additional info about this problem.

Anonymous’s picture

Issue summary: View changes
cilefen’s picture

Title: Random fail due to problem with memory » Random fail due to APCu not being able to allocate memory
tacituseu’s picture

Issue summary: View changes
Anonymous’s picture

Issue summary: View changes
xjm’s picture

Priority: Normal » Critical

This is happening heaps (pun) today, so promoting to critical. I don't think it's a core issue so we should probably move it to the appropriate queue, but tracking here for now.

https://www.drupal.org/pift-ci-job/824165

Anonymous’s picture

Thank you, @dawehner! Perhaps one of these is really helpful in eliminating the random fails through the more strong control of limit. Added to IS.

+ to links on php bugs with apcu.

+ issue for testing ideas - #2930022: Testing fails 'CI aborted' and 'apcu memory'

I tried to test one of the REST tests (UserJsonAnonTest), where the problem often appears. But 1500 runs showed nothing.

However, the testdrive for Drupal\system\Tests\Module\InstallUninstallTest, showed a strange result. In all launches it went into space, and both times after 124 passes:
https://dispatcher.drupalci.org/job/drupal_patches/40289/console
https://dispatcher.drupalci.org/job/drupal_patches/40307/console

Anonymous’s picture

Hmm, interesting moment: phpunit version of InstallUninstallTest has the same problem: 124 passes and 'Build timed out'. But unlike the simplest, it seems this is really the time limit.
https://dispatcher.drupalci.org/job/drupal_patches/40321/console
The difference between the last passes and 'Time out' only 14 minutes (in comparison with 1 hour for simpletest, see #9 console logs).


This also means that the simpletest performs the InstallUninstallTest in two quicker, than phpunit? o_O
gambry’s picture

Anonymous’s picture

Issue summary: View changes

Added #11 to examples in IS + few more.

Also added #2704571: Add an APCu classloader with a single entry to list issues which might help to determine problem.

Additional notices:

While the earliest case, when it was seen, is #2843139-100: EntityResource: Provide comprehensive test coverage for File entity, and tighten access control handler (2017-11-26, Sunday). But this is not the reason, because before commited this issue took a couple of days, during which the error appeared in other issues, like #2346893-242: Duplicate AJAX wrapper around a file field (2017-11-27).


Nearest commits near this time:
2017-11-24: #2800873: Add XML GET REST test coverage, work around XML encoder quirks - increases the number of http requests during the testing.
2017-11-22: #2868035: Test that all core content+config entity types have functional REST test coverage - installing many modules.
Can these two changes increase the likelihood of a mini-DoS attack during parallel testing on CI? See also #2704571: Add an APCu classloader with a single entry and other issues about apcu.


Most often going beyond the memory effect on next groups:
  • Module.Module
  • Rest.Rest
  • Routing.Routing
  • Search.Search

But maybe, after shuffle tests, the statistics will change. I tried to run a set of failed tests together, but all green (#2930022-7: Testing fails 'CI aborted' and 'apcu memory').

Anonymous’s picture

Issue summary: View changes

News:

@tacituseu scanned the memory (2930022-8), the results do not show anything illegal.

He also pointed to changes in the environment (23 Nov, 30 Nov, 12 Dec, 13 Dec), which could also affect the failures(/fixes?). See more in 2930022-11

An earlier fall was detected 2017-11-24: https://www.drupal.org/pift-ci-job/819166

Anonymous’s picture

Anonymous’s picture

Last news:

D34dMan’s picture

Issue summary: View changes

Adding more example to the report from this issue #2920963: Remove the migrate.migration prefix from core test migrations

Anonymous’s picture

@D34dMan, thank you!

Also, it seems, that @tacituseu found a weak place:
const DEFAULT_MAX_ROWS = 5000
if it is increased, testing is accelerated (#2930022-32)

We also noticed a strange acceleration of tests in https://dispatcher.drupalci.org/job/drupal_patches/40783/console (28 min 59 sec for non-js part of tests). But the re-testing was not so fast (see #2930022-25 and #2930022-33)

Anonymous’s picture

Few new statistics from #2930022: Testing fails 'CI aborted' and 'apcu memory':



#2930022-35: Testing fails 'CI aborted' and 'apcu memory'
+++ b/core/scripts/run-tests.sh
@@ -142,6 +142,7 @@
+shuffle($test_list);
  • Run/1: (No-JS) Test run duration: 29 min 27 sec
  • Run/2: (No-JS) Test run duration: 27 min 38 sec
  • Run/3: (No-JS) Test run duration: 29 min 7 sec

Shuffle tests gives consistently fast performance test. Perhaps this is due to a more uniform distribution of requests from the REST-tests, that drain some resources CI. In this case, we can wait #2910883: Move all entity type REST tests to the providing modules.


#1596472: Replace hard coded static cache of entities with cache backends one more useful issue, where last patch from @catch:
https://dispatcher.drupalci.org/job/drupal_patches/40846/console
(No-JS) Test run duration: 31 min 50 sec
But I can not argue, that the timing of the No-JS tests depends on the patches. Because at the current moment this time often fluctuates. Because:

So, it would be great to know the opinion from other developers, especially from the CI team. Perhaps they have long since discovered the cause of the fails, and our research of black box in #2930022: Testing fails 'CI aborted' and 'apcu memory' does not make any sense)

tacituseu’s picture

That DEFAULT_MAX_ROWS was just a stab at figuring out what is slowing the PHP 7 testing down, but it went in 12 September 2017 so I don't think it is likely.
Randomizing order in #2930022-35: Testing fails 'CI aborted' and 'apcu memory' on the other hand seems to be consistent, and gives around 30 minute reduction in test time for PHP 7.
Will try to narrow down which tests take so long by processing the logs.
Also #2930022-25: Testing fails 'CI aborted' and 'apcu memory' shows that both good and bad runs have the same memory usage (~100MB max to serve a page and 35MB left in APCU cache).

Anonymous’s picture

@tacituseu, thank you as always! Yeah. These tests can not be relied on, how many interesting theories they have already broken!

#2930022-38: Testing fails 'CI aborted' and 'apcu memory' now one rest-test (selected randomly) with 31 concurency. An it freezed after 558 runs (12 min after start). Those, perhaps the problem is not in heavy or long tests. Just some kind of resource leak, that accumulates on the CI during an avalanche of queries?

Anonymous’s picture

FileSize
33.05 KB

Until the problem is resolved, we can just use shuffle tests (#2930022-35: Testing fails 'CI aborted' and 'apcu memory') like quick workaround.

Here is a comparison shuffle vs usual (3 latest from https://www.drupal.org/pift-ci-job/837913).

  • x - time
  • y - number of tests completed

shuffle-vs-usual
An average of 30 minutes of shuffle tests, against 40-60 minutes of usual tests. Therefore, not only more stable, but also faster.

gambry’s picture

Is the graph saying there are around 175 tests heavily slowing down the testing, from 12-15 minutes after the testing has started?
Can they be the source of the issue (it's out of memory after all...)?

tacituseu’s picture

Processed test durations attached.
In fast run (#25) Drupal\system\Tests\Module\InstallUninstallTest takes 10 minutes, in the slow one (#8) it lingers for 47 minutes and there's plenty of slow Drupal\Tests\rest\Functional\* tests (they are at least 10x slower vs fast run).

tacituseu’s picture

There also seems to be an issue with ApcuBackend assuming APCU storage works like Memcached (LRU eviction) and filling up (see #2930022-73: Testing fails 'CI aborted' and 'apcu memory').
Edit: nevermind that, the fail here was in my data processing skills, graph actually shows that when the APCU cache fills up it is flushed completely.

tacituseu’s picture

APCU free over testrun:
apcu-free-over-testrun-40852.png
@vaplas completion script for the testrun:
tests-completed-over-time-40852.png
# of requests per second over testrun:
concurrent-requests-over-testrun-40852.png

moshe weitzman’s picture

graph actually shows that when the APCU cache fills up it is flushed completely.

That sounds like a really risky property for a cache backend. I dont think I would use it.

Anonymous’s picture

#22: @gambry, you read the diagram absolutely right. As already said @tacituseu, we have a large drawdown in the rest-tests period. We also have a simple test to send requests on node create (#2930022-53: Testing fails 'CI aborted' and 'apcu memory'), which leads to the same effect. So, the bottleneck not in rest-tests, but somewhere nearby.

#24: I like the analogy with the limited size of the hash table, in which there are always delays due to the large number of collisions during the generation of keys. But perhaps your assumption is more true.

#25: The first graph shows that the memory is constantly striving to end but resets before the critical zero? And in the central case it is not happen. In this case, we may need some more stable reset-trigger after the each of tests?

tacituseu’s picture

Re #26: trying to figure out the exact mechanism, those apcu_store errors do seem to pop out when the cache is nearly exhausted and there are many requests in flight at the same time (e.g. for test run in #25 at the second dip when it has 35MB free - request #33909, timestamp 20171216001325).

Re #27: could be something to it, digging into apcu-5.1.8.tgz, check apc.entries_hint, also some info in apcu-5.1.8\TECHNOTES.txt, will keep trying to characterize its content.

Mixologic’s picture

So, something is kinda terrible with how we're using apcu, at least on the testbots.

In the php7/sqlite environment, if I bump the cache to 8gb of memory so that we can fit everything without ever having to evict, the test time goes up to Test run duration: 54 min 34 sec

With it set to 2GB, we see anything from 27-32 minutes typcially

If I *shut it off entirely*, it goes down to Test run duration: 22 min 10 sec

So, these 'random errors' are race conditions in apcu that manifest when you try and have 32 processes cramming 4.5 million objects into it and never do any cleaning of all the cached data in there.

Is there any way that we can empty all the cached data for a test when that test is complete?

mpdonadio’s picture

APCu not being a real LRU cache make it horrible to deal with.

Is there any way that we can empty all the cached data for a test when that test is complete?

The main challenge with this is that the backend doesn't keep track of what $bin:$cid got set throughout the lifetime of a request. So the potential solution would be to

- add a property which would track $bin:$cid
- set/delete would have to do the housekeeping
- we would have to add a __destruct() to iterate and apcu_delete($bin:$cid)

We would probably also want to subclass ApcuBackend to ApcuLifetimeBackend for BC, and because we really only want to do this for tests (and someone may find a use for this outside of testing, similar to the in-memory one).

catch’s picture

Re #26: trying to figure out the exact mechanism, those apcu_store errors do seem to pop out when the cache is nearly exhausted and there are many requests in flight at the same time (e.g. for test run in #25 at the second dip when it has 35MB free - request #33909, timestamp 20171216001325).

IME you hit APCu errors not due to it emptying, but due to it not emptying when it's fragmented.

Let's say you have a 1mb item to cache, and two slots with 500kb each. You now cannot write the item (hence the unable to allocate error), but also APC isn't 'full' so it doesn't clear out.

@Mixologic is right though. The way we use APCu in core is fine for single site situations, since the cache size is finite. It should never be used for multiple sites on one server and that's what we're doing with the test bot.

We could possibly do something in functional tests to removeBin() from child sites on test completion. Might help a bit. Or we might need to not test with APCu except for actual cache tests.

mpdonadio’s picture

IME you hit APCu errors not due to it emptying, but due to it not emptying when it's fragmented.

IIRC, that is only the case when you are using TTLs. With "permanent" entries, it should dump. I have an old, old issue in the PHP issue queue trying to get clarification on the exact strategy that it uses, but I don't think it ever got answered. My understanding is my memory (ha!) of reading the code.

I think disabling APCu for tests and seeing what the impact on run times is the best option right now; my guts says that removeBin() on teardown may cause different problems.

tacituseu’s picture

@vaplas was right in #27.2 and much earlier in #23, the slowdowns are all from apc.entries_hint defaulting to 4K and the tests cramming there 5M of variables, confirmed by @Mixologic #106.
Also see graphs he made in #111, they correlate test completion with free cache.

There are also some logs to give you idea what's in the cache over time (#105:1, #105:2 - script by @msonnabaum from here).
There are a lot of 1K bootstrap:user_permissions_hash:authenticated,ayl5fozd style entries, which leads to Drupal\Core\Session\PermissionsHashGenerator::generate().

Still not sure what does the actual flushes, tried to catch it in #93 (log), for all I know it might even by done outside of Drupal's cache system (there's apcu_clear_cache() call in Doctrine\Common\Cache\ApcuCache (doFlush()).
Then there are also
Drupal\Component\FileCache\ApcuFileCacheBackend, Symfony\Component\ClassLoader\ApcClassLoader that seem to have no concept of flushing/gc.

Drupal\Core\Cache\ApcuBackend used to set $ttl at some time but it was removed in #2581395: Incorrect expiration in APCUBackend.

I'm sure the apc.entries_hint needs adjusting in the test environments, maybe even apc.ttl (see #109), or at least some requirement check (does ApcuBackend make an assumption that it will be set some specific way ? 0 or fixed time).
But it feels like ApcuBackend needs to do something too.

Also for some tries of cleaning after tests: #89, #107, #111, #113, still WIP.

tacituseu’s picture

Re #32: @Mixologic did that in #98, they run much faster without APCu ;).

From apcu-5.1.8\INSTALL:

apc.ttl                 The number of seconds a cache entry is allowed to
                            idle in a slot in case this cache entry slot is 
                            needed by another entry.  Leaving this at zero
                            means that your cache could potentially fill up
                            with stale entries while newer entries won't be
                            cached.  
                            (Default: 0) 

says noting of 'emergency dumping', still checking code.

apc.smart  If you begin to get low on resources, an expunge of the cache
           is performed if it is less than half full. This is not always
           a suitable way of determining if an expunge of the cache
           should be per
           apc.smart allows you to set a runtime configuration value which
           is used to determine if an expunge should be run
           if (available_size < apc.smart * requested_size)
             apc_cache_expunge()
           (Default: 0) 

apcu-5.1.8\TECHNOTES.txt:

While walking the linked list we also check to see if the cache has a TTL defined.
If while walking the linked list we see a slot that has expired, we remove it
since we are right there looking at it.  This is the only place we remove stale
entries unless the shared memory segment fills up and we force a full expunge via
apc_cache_expunge().  apc_cache_expunge() walks all slots attempting deletion, how
deletion occurs depends on runtime parameters, see INSTALL for runtime parameter
configuration details. 
tacituseu’s picture

Now for the code part, TLDR: it does nuke it.

apcu-5.1.8\apc_cache_api.h:

/* {{{ apc_cache_default_expunge
* Where smart is not set:
*  Where no ttl is set on cache:
*   1) Perform cleanup of stale entries
*   2) Expunge if available memory is less than sma->size/2
*  Where ttl is set on cache:
*   1) Perform cleanup of stale entries
*   2) If available memory if less than the size requested, run full expunge
*
* Where smart is set:
*  Where no ttl is set on cache:
*   1) Perform cleanup of stale entries
*   2) Expunge is available memory is less than size * smart
*  Where ttl is set on cache:
*   1) Perform cleanup of stale entries
*   2) If available memory if less than the size requested, run full expunge
*
* The TTL of an entry takes precedence over the TTL of a cache
*/
PHP_APCU_API void apc_cache_default_expunge(apc_cache_t* cache, size_t size); 
/*
* apc_cache_real_expunge: trashes the whole cache
*
* Note: it is assumed you have a write lock on the header when you enter real expunge
*/
PHP_APCU_API void apc_cache_real_expunge(apc_cache_t* cache); 

apcu-5.1.8\apc_cache.c:

PHP_APCU_API void apc_cache_default_expunge(apc_cache_t* cache, size_t size) 
{
...
	/* make suitable selection */
	suitable = (cache->smart > 0L) ? (size_t) (cache->smart * size) : (size_t) (cache->sma->size/2); 
...
	/* perform expunge processing */
	if(!cache->ttl) {

		/* check it is necessary to expunge */
		if (available < suitable) {
			apc_cache_real_expunge(cache);
		}
	} else {
	...
	}
}

The triggers are inside apcu-5.1.8\apc_sma.c (notes after ###):

PHP_APCU_API void* apc_sma_api_malloc_ex(apc_sma_t* sma, zend_ulong n, zend_ulong fragment, zend_ulong* allocated) { 
...
    off = sma_allocate(SMA_HDR(sma, sma->last), n, fragment, allocated);
    
    ### first tries to expunge last segment
    if(off == -1) {
        /* retry failed allocation after we expunge */
        WUNLOCK(&SMA_LCK(sma, sma->last));
		sma->expunge(
			*(sma->data), (n+fragment));  
        WLOCK(&SMA_LCK(sma, sma->last));
        off = sma_allocate(SMA_HDR(sma, sma->last), n, fragment, allocated);
    }
... ### skipped one going over all other segments
    ### if that didn't help it just nukes it anyway
    /* I've tried being nice, but now you're just asking for it */
    if(!nuked) {
        sma->expunge(*(sma->data), (n+fragment));
        nuked = 1;
        goto restart;
    }
...
}

Can also be verified experimentally as it keeps the count under 'Cache full count'

apcu-5.1.8\apc.php:
'<tr class=tr-1><td class=td-0>Cache full count</td><td>{$cache['expunges']}</td></tr>',

apcu-5.1.8\apc_cache.c:
add_assoc_double(&info, "expunges", (double)cache->header->nexpunges);
so @Mixologic could confirm that it is the case by catching the same apc.php stats screenshot as in #29 but with default settings.

Edit: confirmed (last column is apcu_cache_info(TRUE)['expunges']).

Mixologic’s picture

I would love to shut of APCu on the testbots.

A not insignificant amount of time is spent waiting for memory locks.

I've uploaded the status of the APC cache directly after running the default, which took 31 min, 20 to run just the phptests on php7/sqlite.

I then changed just the apc.entries_hint so that it had more than 4096 buckets to use, but that made things worse: 54 min 40 sec.

Finally, I just flat out shut it off.

14 min 14 sec.

Now that we're paying by the minute, I can pretty much quantify this to saving 0.12 per test, which sure would be nice to get that back. (not sure the impact it would have on the php5.5/5.6 envs, but probably similar)

Anonymous’s picture

14 min 14 sec.

It sounds very tempting! I do not use apcu, so for me this is an excellent suggestion. But it seems in drupal there is a lot of code and issues about apcu. Perhaps, keeping it also makes sense. Can we implement proposals from @catch and @mpdonadio?

  • flag for enable APCu for special tests
  • marks apcu cache by test
  • clear apcu cache after each of these tests

Also, looks like @tacituseu found a way to reduce tests to 21 min 51 sec (from 51 min) via
$settings['apcu_ensure_unique_prefix'] = FALSE; (#124).
Or like build custom version of prefix for CI #120.

tacituseu’s picture

The PHP 7 & SQLite 3.8 of #120 does 14 min 1 sec.

dawehner’s picture

$settings['apcu_ensure_unique_prefix'] = FALSE;

This is actually a nice find, given that #2474909: Allow Simpletest to use the same APC user cache prefix so that tests can share the classmap and other cache objects was introduced only for this reason in the first place :)

tacituseu’s picture

And then removed by #2749955: Random fails in UpdatePathTestBase tests.
Edit:
Making ApcClassLoader shared across the tests lowers the amount of APCu cache entries over the test-run from 5M to 1M (hits/misses = 39, right before the expunge), with ApcuFileCacheBackend added it gets reduced to 300K (hits/misses = 117, charts courtesy of @vaplas).

Anonymous’s picture

Here is the implementation of the #124 from @tacituseu.

The patch reverts http://cgit.drupalcode.org/drupal/commit/?id=44aa63b #2749955: Random fails in UpdatePathTestBase tests. So, we tested this:

  • with a variety of update tests in #127 - ok.
  • with memory scan in #133 - ok (around 400K cache entries and slightly lower hits/misses ratio (93), it's little more than 300K, but not bad too)
  • Then I went into the jungle of tests only one UpdatePathTestBaseTest with many runs, but perhaps this is another story.
mpdonadio’s picture

So, if we want to go with #41

  1. +++ b/core/lib/Drupal/Core/Test/FunctionalTestSetupTrait.php
    @@ -41,6 +41,13 @@
    +   * Setting unique apcu prefix. See Settings::getApcuPrefix().
    +   *
    +   * @var bool
    +   */
    

    Nits, but first line should be one sentence, and should use a @see

  2. +++ b/core/tests/Drupal/Tests/Core/Site/SettingsTest.php
    @@ -117,7 +117,7 @@ public function testSerialize() {
    -    $settings = new Settings(['hash_salt' => 123]);
    +    $settings = new Settings(['hash_salt' => 123, 'apcu_ensure_unique_prefix' => TRUE]);
    

    Nice, esp when you see this in context.

Since this is the third(?) time we have had to tweak this, I think we should expand the comments on Settings::getApcuPrefix() about what is going on, and reference #2474909: Allow Simpletest to use the same APC user cache prefix so that tests can share the classmap and other cache objects, #2749955: Random fails in UpdatePathTestBase tests, and this issue.

And kudos to @valpas and @tacituseu on this. I have run into some other instances where caching slowed things down. The profiling / charts are awesome, and are really needed to draw real conclusions.

Just adding the patch w/ shuffles to see if there are any side effects.

Anonymous’s picture

@mpdonadio, thanks for the review, advice and kind words! I tried to do #42 points. But with a little difference.

#42.1: Done + explain why FALSE by default
#42.3: Bit changed description of Settings::getApcuPrefix(): WebTestBase -> tests. + only one @see on current issue. Two other issues i added to IS and Related issues list. Therefore, i hope, they will be found in the study, so we do not need to list them all in @see.

Anonymous’s picture

Issue summary: View changes
dawehner’s picture

One important result of all this amazing research is that real sites also potentially has issues with that. Unlike the initial benchmarks we did (clear all cache, just access the frontpage), real sites tend to have actually a lot of more code involved over time.

I'm curious whether @berdir made any research on the same topic as well. They have D8 production sites since a long time.

Anonymous’s picture

@dawehner, thanks for good point, as always. It will be great to get the best apcu settings by default for real sites. But maybe we can still hotfix the problem only for CI, so as not to interfere with the other issues.

#2749955-122: Random fails in UpdatePathTestBase tests @catch:

-    $settings['settings']['apcu_ensure_unique_prefix'] = (object) [
-      'value' => FALSE,
-      'required' => TRUE,
-    ];
Summarizing discussion in irc - I've committed #118 to 8.1.x and 8.2.x as an interim step.

As has already been noted by @tacituseu, the random fail occurred due to gc. So, now we just revert this iterim step.

Perhaps after a more thorough analysis we will find another culprit for the problem, but until this happens, it would be great to make more stable and quick testing.

catch’s picture

Status: Needs review » Reviewed & tested by the community

Marking #43 RTBC, it's a shame we already thought about this, removed it as a hotfix, then forgot about it until now but all the work here has been really useful regardless. I think we should open a (major) follow-up to discuss this more so we don't lose all the information here.

Anonymous’s picture

catch’s picture

Whoops I just nearly cross-posted because I'd opened #2934002: APCu cache backend can have unreasonable number of entries during testing or multi-site. Let's mark one as duplicate (I haven't done so yet though).

Anonymous’s picture

O, no problem!) I closed my version #2933998: Discover the best APCu settings. Perhaps later we will convert it into an additional research by plan from @tacituseu.

larowlan’s picture

+++ b/core/lib/Drupal/Core/Test/FunctionalTestSetupTrait.php
@@ -41,6 +41,18 @@
+   * Wide use of a unique prefix can lead to problem with memory, if tests are

nit: => problems, can be fixed on commit

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 52: 2926309-51.patch, failed testing. View results

Anonymous’s picture

#53: unrelated global 'possibly out of free disk space' on DrupalCI. Except one https://www.drupal.org/pift-ci-job/849151 (it is still run, https://dispatcher.drupalci.org/job/drupal_patches/42455/console).

Anonymous’s picture

Now DrupalCI. But 1 random fail in JS test. Move it to #2934064: Random fail in JS-tests due to unstable pressButton(). We already have same case in #2924201: Resolve random failure in LayoutBuilderTest so that it can be added to HEAD with unstable pressButton() method. So, perhaps it is unrelated fail too)

alexpott’s picture

Just a note that we stopped WTB sharing the same APC key in #2749955: Random fails in UpdatePathTestBase tests. So we have to be wary that doing this will introduce other random fails.

alexpott’s picture

Let's see if we still have random fails in the update tests...

The last submitted patch, 57: 2762549-apcu-117.patch, failed testing. View results
- codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

The last submitted patch, 57: 2926309-57-update-test.patch, failed testing. View results
- codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

alexpott’s picture

Anonymous’s picture

@alexpott, fair point. We also tested *Update* tests. See #126-#141. Unfortunately, this patch does not bring for them an improvement. APCu memory ends up as quickly as without a patch.

It seems, the hang occurs in checkForMetaRefresh(), when 30 tests endlessly waiting in Batch queue, because 1 test is frozen due to lack of resources.


Also an interesting point, this happens precisely with the repeated execution of one *Update* test. If you run 1000 different *Update* tests, then everything is ok.
Let's see if the *Update* tests are run better without the patch (as-is)

The last submitted patch, 60: 2926309-60-update-test.patch, failed testing. View results

Anonymous’s picture

After double check #60/#61: with same apcu prefix *1 Update test* works really worse, than with unique prefix, when 31 concurency and 100 runs.

Perhaps this is due to the increased number of collisions during the multiple execution of one test. Because:

  1. *Different Update tests* (#61) work approximately identically (30-60 seconds difference on 1000 runs).
  2. All tests with unique prefix for *update* and for *update and installer* tests (#146) work absolutely identically (1-5 seconds different on 3000 tests).

Therefore, it may be worth continuing apcu research with multiple run single test, but commit #63, since testing various tests is clearly a priority.

Anonymous’s picture

Also about #52 fail in JS-Test. Many checks confirmed, that it is random fail due to pressButton():

I also reproduced this fail locally with phantomjs 2.1.1 (But I have not yet picked up a successful way to quick reproduce it)

Mixologic’s picture

I ran the patch in #52 and APC still overflowed once.

I bumped the apc.entries_hint to 500,000 to give it a better subdivision, and bumped the memory to 3GB to see if I could avoid an overflow. Attached is the final apc dashboard, so no overflows, but it still needs about 2.4 GB of memory.

Im excited to get this patch in to reduce random fails, as well as a significant performance boost for the testbots, which translates into cost savings for the DA.

It would still be nice to see if its possible to clear out a prefix at the end of a test, but that could mean more overhead than it's worth.

Mixologic’s picture

To clarify some things, I'll bump APC's memory to 3GB as well as the entries hint to help reduce testbot errors. On one hand, that might hide issues that real world users have if their APC memory is lower than required, but given that we're not really mimicing real world scenarios with the way we run the tests, I think its better to err on the side of not having tests fail randomly.

  • larowlan committed 9d552ca on 8.5.x
    Issue #2926309 by vaplas, alexpott, mpdonadio, tacituseu, Mixologic:...
larowlan’s picture

Version: 8.5.x-dev » 8.4.x-dev
Issue tags: +8.4.4 Commit freeze

Committed 9d552ca and pushed to 8.5.x.

Leaving at RTBC and changing to 8.4.x.

Will cherry-pick to 8.4.x tomorrow after commit freeze for 8.4.4 is over.

Tagging with my made-up tag until such time.

tacituseu’s picture

@Mixologic: at the same time it can also be reduced in php-cli.ini's, as the run-tests.sh side is hardly even using it (tens of MB max).

Anonymous’s picture

#69: Many thanks! If possible, it would be nice to move @tacituseu on the first place + add credit to @dawehner and @catch, they did reviews and contributions in other issues (like #2704571: Add an APCu classloader with a single entry and #1596472: Replace hard coded static cache of entities with cache backends).

#66: yep, by logs apcu overflowed near 18 minutes (charts). Perhaps we will come to the mechanism for cleaning the cache after each test in follow-up #49 from @catch. While just watching the result of #69.

  • PHP 5.5 & PostgreSQL - seems neutral to #69. Perhaps it is so slow that apcu manages to self-regulate :)
  • PHP 7 & SQLite - in two faster after #69
tacituseu’s picture

@vaplas: first place is right the way it is :) awesome work.

larowlan’s picture

Added catch and dawehner to issue credits

andypost’s picture

What is follow-up issue to research cleaning cache by prefix? Should not be so hard to make it in tearDown()

catch’s picture

We can look at it in #2934002: APCu cache backend can have unreasonable number of entries during testing or multi-site or possibly make a child issue of that one?

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 63: 2926309-51.patch, failed testing. View results

mpdonadio’s picture

Status: Needs work » Reviewed & tested by the community
FileSize
3.98 KB

Just a reupload so we have runs against 8.4.x.

  • larowlan committed 527e762 on 8.4.x
    Issue #2926309 by vaplas, alexpott, mpdonadio, tacituseu, Mixologic,...
larowlan’s picture

Status: Reviewed & tested by the community » Fixed

Cherry-picked as 527e762 and pushed to 8.4.x

Made the commit message for the cherry-pick reference @catch and @dawehner as per #71

larowlan’s picture

Issue tags: -8.4.4 Commit freeze

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

xjm’s picture