See the continuation of the discussion in #2934002: APCu cache backend can have unreasonable number of entries during testing or multi-site
Problem/Motivation
Wide use of a unique prefix can lead to memory problems if tests are run with a concurrency higher than 1.
Exception: Warning: apcu_store(): Unable to allocate memory for pool.
Symfony\Component\ClassLoader\ApcClassLoader->findFile()() (Line: 128)
Examples:
- https://www.drupal.org/pift-ci-job/819166
- https://www.drupal.org/pift-ci-job/820576
- https://www.drupal.org/pift-ci-job/821247
- and many many others recently :(
Another kind of error that you can get because of this is 'CI aborted' due to 'Build timed out (after 110 minutes)'. This happens when the tests simply hang because of a lack of resources, collisions, or APCu fragmentation.
Proposed resolution
1. Use significantly more and more resources (see #29, #34-#36, #2930022-106)
2. Do not use a unique prefix.
We already did this in #2474909: Allow Simpletest to use the same APC user cache prefix so that tests can share the classmap and other cache objects, but it was then reverted on suspicion of causing random fails (#2749955: Random fails in UpdatePathTestBase tests). Later the culprit of the random fails was found in gc.
See also #2930022: Testing fails 'CI aborted' and 'apcu memory', with a lot of research, pictures and statistical charts.
Remaining tasks
User interface changes
API changes
Data model changes
Original post
Exception: Warning: apcu_store(): Unable to allocate memory for pool.
Symfony\Component\ClassLoader\ApcClassLoader->findFile()() (Line: 128)
Examples:
https://www.drupal.org/pift-ci-job/819166
https://www.drupal.org/pift-ci-job/820576
https://www.drupal.org/pift-ci-job/821247
https://www.drupal.org/pift-ci-job/822016
https://www.drupal.org/pift-ci-job/822256
https://www.drupal.org/pift-ci-job/822949
https://www.drupal.org/pift-ci-job/823749
https://www.drupal.org/pift-ci-job/823996
https://www.drupal.org/pift-ci-job/824108
https://www.drupal.org/pift-ci-job/824147
https://www.drupal.org/pift-ci-job/824864
https://www.drupal.org/pift-ci-job/826294
https://www.drupal.org/pift-ci-job/826932
https://www.drupal.org/pift-ci-job/827140
https://www.drupal.org/pift-ci-job/830325
https://www.drupal.org/pift-ci-job/831905
https://www.drupal.org/pift-ci-job/831952
https://www.drupal.org/pift-ci-job/831958
https://www.drupal.org/pift-ci-job/832566
https://www.drupal.org/pift-ci-job/833314
https://www.drupal.org/pift-ci-job/833405
https://www.drupal.org/pift-ci-job/833456
https://www.drupal.org/pift-ci-job/834821
A couple of issues which might help to determine the problem:
- #2765271: Rationalize use of the 'discovery' cache bin, since it's stored in the limited size APCu by default
- #2800605: Warn/inform users when the hosting environment has a too low limit of APCU cache
- #2832450: Multilingual config cached in "config" cache bin; quickly reaches APCu memory limits
- #2704571: Add an APCu classloader with a single entry
A couple of PHP bugs with apcu:
- https://bugs.php.net/bug.php?id=75176 (see also #2930022-4: Testing fails 'CI aborted' and 'apcu memory')
- https://bugs.php.net/bug.php?id=58982
Issue for testing any ideas: #2930022: Testing fails 'CI aborted' and 'apcu memory'
Comments
Comment #1
Anonymous (not verified)
vaplas created an issue. See original summary.
Comment #2
Anonymous (not verified)
I'm not sure that this is Drupal's problem, and maybe it happens rarely, so Normal priority. For now this issue is just for notification and for collecting additional info about the problem.
Comment #3
Anonymous (not verified)
Comment #4
cilefen (at Institute for Advanced Study)
Comment #5
tacituseu
Comment #6
Anonymous (not verified)
Comment #7
xjm
This is happening heaps (pun) today, so promoting to critical. I don't think it's a core issue so we should probably move it to the appropriate queue, but tracking here for now.
https://www.drupal.org/pift-ci-job/824165
Comment #8
dawehner
Here are a couple of issues which might help to determine problem:
#2765271: Rationalize use of the 'discovery' cache bin, since it's stored in the limited size APCu by default
#2800605: Warn/inform users when the hosting environment has a too low limit of APCU cache
#2832450: Multilingual config cached in "config" cache bin; quickly reaches APCu memory limits
Comment #9
Anonymous (not verified)
Thank you, @dawehner! Perhaps one of these will really help eliminate the random fails through stricter control of the limit. Added to the IS.
+ links to PHP bugs with apcu.
+ an issue for testing ideas - #2930022: Testing fails 'CI aborted' and 'apcu memory'
I tried to test one of the REST tests (UserJsonAnonTest), where the problem often appears, but 1500 runs showed nothing.
However, the test drive for Drupal\system\Tests\Module\InstallUninstallTest showed a strange result. In both launches it hung, and both times after 124 passes:
https://dispatcher.drupalci.org/job/drupal_patches/40289/console
https://dispatcher.drupalci.org/job/drupal_patches/40307/console
Comment #10
Anonymous (not verified)
Hmm, an interesting point: the phpunit version of InstallUninstallTest has the same problem: 124 passes and then 'Build timed out'. But unlike the simpletest version, it seems this really is the time limit.
https://dispatcher.drupalci.org/job/drupal_patches/40321/console
The difference between the last passes and the 'Time out' is only 14 minutes (in comparison with 1 hour for simpletest, see the #9 console logs).
Does this also mean that simpletest performs InstallUninstallTest twice as fast as phpunit? o_O
Comment #11
gambry
Here is another fail from https://www.drupal.org/project/drupal/issues/2627512
https://www.drupal.org/pift-ci-job/833314
Comment #12
Anonymous (not verified)
Added #11 to the examples in the IS + a few more.
Also added #2704571: Add an APCu classloader with a single entry to the list of issues which might help to determine the problem.
Additional notes:
The earliest case where it was seen is #2843139-100: EntityResource: Provide comprehensive test coverage for File entity, and tighten access control handler (2017-11-26, Sunday). But this is not the cause, because that issue took a couple of days before it was committed, during which the error appeared in other issues, like #2346893-242: Duplicate AJAX wrapper around a file field (2017-11-27).
Nearest commits around this time:
2017-11-24: #2800873: Add XML GET REST test coverage, work around XML encoder quirks - increases the number of http requests during the testing.
2017-11-22: #2868035: Test that all core content+config entity types have functional REST test coverage - installing many modules.
Can these two changes increase the likelihood of a mini-DoS attack during parallel testing on CI? See also #2704571: Add an APCu classloader with a single entry and other issues about apcu.
Running out of memory most often affects the following groups:
But maybe, after shuffling the tests, the statistics will change. I tried to run a set of failed tests together, but all were green (#2930022-7: Testing fails 'CI aborted' and 'apcu memory').
Comment #13
Anonymous (not verified)
News:
@tacituseu scanned the memory (2930022-8); the results do not show anything abnormal.
He also pointed to changes in the environment (23 Nov, 30 Nov, 12 Dec, 13 Dec), which could also affect the failures(/fixes?). See more in 2930022-11
An earlier fail was detected on 2017-11-24: https://www.drupal.org/pift-ci-job/819166
Comment #14
Anonymous (not verified)
Therefore, if the fail is no longer reproduced, then maybe it was a bug with a memory leak in one of the libraries like:
And it was fixed via 'apt-get update':
http://cgit.drupalcode.org/drupalci_environments/commit/?id=a69d7fb
http://cgit.drupalcode.org/drupalci_environments/commit/?id=9a94149
http://cgit.drupalcode.org/drupalci_environments/commit/?id=b5e7b34
or by restoring the previous packer config for drupalci:
http://cgit.drupalcode.org/drupalci_environments/commit/?id=cc04980
or I don't know what it was)
Comment #15
Anonymous (not verified)
Latest news: the 'apcu memory' fail is intertwined with the 'CI aborted' fail.
Comment #16
D34dMan (as a volunteer)
Adding one more example to the report, from issue #2920963: Remove the migrate.migration prefix from core test migrations
Comment #17
Anonymous (not verified)
@D34dMan, thank you!
Also, it seems that @tacituseu found a weak spot:
const DEFAULT_MAX_ROWS = 5000
If it is increased, testing speeds up (#2930022-32).
We also noticed a strange speed-up of tests in https://dispatcher.drupalci.org/job/drupal_patches/40783/console (28 min 59 sec for the non-JS part of the tests). But the re-test was not as fast (see #2930022-25 and #2930022-33).
Comment #18
Anonymous (not verified)
A few new statistics from #2930022: Testing fails 'CI aborted' and 'apcu memory':
https://dispatcher.drupalci.org/job/drupal_patches/40790/console
(No-JS) Test run duration: 46 min 17 sec
https://dispatcher.drupalci.org/job/drupal_patches/40815/console
(No-JS) Test run duration: 1 hour 1 min
So it is possible that it has an effect on CI testing. See also 2526150-271 / 2526150-275.
#2930022-35: Testing fails 'CI aborted' and 'apcu memory'
Shuffling the tests gives consistently fast test runs. Perhaps this is due to a more uniform distribution of requests from the REST tests, which drain some CI resources. In this case, we can wait for #2910883: Move all entity type REST tests to the providing modules.
#1596472: Replace hard coded static cache of entities with cache backends is one more useful issue, with the latest patch from @catch:
https://dispatcher.drupalci.org/job/drupal_patches/40846/console
(No-JS) Test run duration: 31 min 50 sec
But I cannot claim that the timing of the No-JS tests depends on the patches, since at the moment this time often fluctuates. Because:
So, it would be great to hear the opinion of other developers, especially the CI team. Perhaps they discovered the cause of the fails long ago, and our black-box research in #2930022: Testing fails 'CI aborted' and 'apcu memory' does not make any sense)
Comment #19
tacituseu
That DEFAULT_MAX_ROWS was just a stab at figuring out what is slowing the PHP 7 testing down, but it went in 12 September 2017 so I don't think it is likely.
Randomizing order in #2930022-35: Testing fails 'CI aborted' and 'apcu memory' on the other hand seems to be consistent, and gives around 30 minute reduction in test time for PHP 7.
Will try to narrow down which tests take so long by processing the logs.
Also #2930022-25: Testing fails 'CI aborted' and 'apcu memory' shows that both good and bad runs have the same memory usage (~100MB max to serve a page and 35MB left in APCU cache).
Comment #20
Anonymous (not verified)
@tacituseu, thank you as always! Yeah. These tests cannot be relied on; how many interesting theories they have already broken!
#2930022-38: Testing fails 'CI aborted' and 'apcu memory' now runs one REST test (selected randomly) with a concurrency of 31. And it froze after 558 runs (12 min after start). So perhaps the problem is not in heavy or long tests, just some kind of resource leak that accumulates on the CI during an avalanche of queries?
Comment #21
Anonymous (not verified)
Until the problem is resolved, we can just use shuffled tests (#2930022-35: Testing fails 'CI aborted' and 'apcu memory') as a quick workaround.
Here is a comparison of shuffled vs usual (3 latest from https://www.drupal.org/pift-ci-job/837913).
An average of 30 minutes for shuffled tests, against 40-60 minutes for usual tests. So they are not only more stable, but also faster.
Comment #22
gambry
Is the graph saying there are around 175 tests heavily slowing down the testing, from 12-15 minutes after the testing has started?
Can they be the source of the issue (it's out of memory after all...)?
Comment #23
tacituseu
Processed test durations attached.
In fast run (#25) Drupal\system\Tests\Module\InstallUninstallTest takes 10 minutes, in the slow one (#8) it lingers for 47 minutes and there's plenty of slow Drupal\Tests\rest\Functional\* tests (they are at least 10x slower vs fast run).
Comment #24
tacituseu
There also seems to be an issue with ApcuBackend assuming APCU storage works like Memcached (LRU eviction) and filling up (see #2930022-73: Testing fails 'CI aborted' and 'apcu memory').
Edit: nevermind that, the fail here was in my data processing skills, graph actually shows that when the APCU cache fills up it is flushed completely.
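To make the difference between the two behaviours concrete, here is a toy model in Python (so it runs without the APCu extension). It is only a sketch: capacity is counted in entries rather than bytes, and the class names and the `expunges` counter are hypothetical, not APCu or Drupal code.

```python
from collections import OrderedDict


class LruCache:
    """Memcached-style model: evict the least recently used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def store(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        elif len(self.items) >= self.capacity:
            self.items.popitem(last=False)  # drop only the oldest entry
        self.items[key] = value


class ExpungeAllCache:
    """Model of what the graph shows: wipe everything once the cache is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = {}
        self.expunges = 0  # hypothetical counter of complete flushes

    def store(self, key, value):
        if key not in self.items and len(self.items) >= self.capacity:
            self.items.clear()  # complete flush instead of evicting one entry
            self.expunges += 1
        self.items[key] = value


lru = LruCache(capacity=3)
wipe = ExpungeAllCache(capacity=3)
for i in range(4):
    lru.store(f"key{i}", i)
    wipe.store(f"key{i}", i)

print(sorted(lru.items))   # ['key1', 'key2', 'key3']
print(sorted(wipe.items))  # ['key3']
print(wipe.expunges)       # 1
```

With the LRU model the fourth write costs one old entry; with the flush-all model it costs every warm entry, which every concurrent test then has to rebuild.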
Comment #25
tacituseu
APCU free over testrun:
@vaplas completion script for the testrun:
# of requests per second over testrun:
Comment #26
moshe weitzman
That sounds like a really risky property for a cache backend. I don't think I would use it.
Comment #27
Anonymous (not verified)
#22: @gambry, you read the diagram absolutely right. As @tacituseu already said, we have a large slowdown during the REST tests period. We also have a simple test that sends requests on node create (#2930022-53: Testing fails 'CI aborted' and 'apcu memory'), which leads to the same effect. So the bottleneck is not in the REST tests, but somewhere nearby.
#24: I like the analogy with a hash table of limited size, in which there are always delays due to the large number of collisions during key generation. But perhaps your assumption is closer to the truth.
#25: The first graph shows that the memory constantly tends toward exhaustion but resets before reaching the critical zero? And in the central case this does not happen. In that case, maybe we need a more stable reset trigger after each test?
Comment #28
tacituseu
Re #26: trying to figure out the exact mechanism, those apcu_store errors do seem to pop out when the cache is nearly exhausted and there are many requests in flight at the same time (e.g. for test run in #25 at the second dip when it has 35MB free - request #33909, timestamp 20171216001325).
Re #27: could be something to it, digging into apcu-5.1.8.tgz, check apc.entries_hint, also some info in apcu-5.1.8\TECHNOTES.txt, will keep trying to characterize its content.
Comment #29
Mixologic
So, something is kinda terrible with how we're using apcu, at least on the testbots.
In the php7/sqlite environment, if I bump the cache to 8gb of memory so that we can fit everything without ever having to evict, the test time goes up to
Test run duration: 54 min 34 sec
With it set to 2GB, we see anything from 27-32 minutes typically
If I *shut it off entirely*, it goes down to
Test run duration: 22 min 10 sec
So, these 'random errors' are race conditions in apcu that manifest when you try and have 32 processes cramming 4.5 million objects into it and never do any cleaning of all the cached data in there.
Is there any way that we can empty all the cached data for a test when that test is complete?
Comment #30
mpdonadio
APCu not being a real LRU cache makes it horrible to deal with.
The main challenge with this is that the backend doesn't keep track of what $bin:$cid got set throughout the lifetime of a request. So the potential solution would be to
- add a property which would track $bin:$cid
- set/delete would have to do the housekeeping
- we would have to add a __destruct() to iterate and apcu_delete($bin:$cid)
We would probably also want to subclass ApcuBackend to ApcuLifetimeBackend for BC, and because we really only want to do this for tests (and someone may find a use for this outside of testing, similar to the in-memory one).
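The three steps above could look roughly like this. It is a minimal Python sketch of the idea only: a plain dict stands in for the shared APCu storage, `close()` stands in for `__destruct()`, and `TrackingBackend` with all of its methods is a hypothetical name, not an existing Drupal class.

```python
class TrackingBackend:
    """Hypothetical wrapper that remembers every key it wrote to a shared store."""

    def __init__(self, store, bin_name):
        self.store = store        # stand-in for the shared APCu storage
        self.bin_name = bin_name
        self.written = set()      # the "$bin:$cid" keys set by this instance

    def _key(self, cid):
        return f"{self.bin_name}:{cid}"

    def set(self, cid, value):
        key = self._key(cid)
        self.store[key] = value
        self.written.add(key)     # housekeeping on set

    def get(self, cid):
        return self.store.get(self._key(cid))

    def delete(self, cid):
        key = self._key(cid)
        self.store.pop(key, None)
        self.written.discard(key)  # housekeeping on delete

    def close(self):
        """Stand-in for __destruct(): remove everything this instance wrote."""
        for key in self.written:
            self.store.pop(key, None)
        self.written.clear()


shared = {"other_site:config:a": 1}
backend = TrackingBackend(shared, "test123:discovery")
backend.set("plugins", ["x"])
backend.set("routes", ["y"])
backend.delete("routes")
backend.close()
print(shared)  # {'other_site:config:a': 1}
```

Entries written by other sites survive the cleanup, because only keys recorded in `written` are removed.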
Comment #31
catch
IME you hit APCu errors not due to it emptying, but due to it not emptying when it's fragmented.
Let's say you have a 1mb item to cache, and two slots with 500kb each. You now cannot write the item (hence the unable to allocate error), but also APC isn't 'full' so it doesn't clear out.
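The same example as a small Python check. This is a simplified model that assumes an item needs one contiguous free block; `can_allocate` is a hypothetical helper, and 1mb is taken as 1000 KB for round numbers.

```python
def can_allocate(free_blocks_kb, size_kb):
    """An item fits only if a single contiguous free block is large enough."""
    return any(block >= size_kb for block in free_blocks_kb)


free_blocks_kb = [500, 500]  # two separate free slots
item_kb = 1000               # the item to cache

print(sum(free_blocks_kb) >= item_kb)         # True: enough free memory in total
print(can_allocate(free_blocks_kb, item_kb))  # False: no single slot is big enough
```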
@Mixologic is right though. The way we use APCu in core is fine for single site situations, since the cache size is finite. It should never be used for multiple sites on one server and that's what we're doing with the test bot.
We could possibly do something in functional tests to removeBin() from child sites on test completion. Might help a bit. Or we might need to not test with APCu except for actual cache tests.
Comment #32
mpdonadio
IIRC, that is only the case when you are using TTLs. With "permanent" entries, it should dump. I have an old, old issue in the PHP issue queue trying to get clarification on the exact strategy that it uses, but I don't think it ever got answered. My understanding is my memory (ha!) of reading the code.
I think disabling APCu for tests and seeing what the impact on run times is, is the best option right now; my gut says that removeBin() on teardown may cause different problems.
Comment #33
tacituseu
@vaplas was right in #27.2 and much earlier in #23, the slowdowns are all from apc.entries_hint defaulting to 4K and the tests cramming there 5M of variables, confirmed by @Mixologic #106.
Also see graphs he made in #111, they correlate test completion with free cache.
There are also some logs to give you idea what's in the cache over time (#105:1, #105:2 - script by @msonnabaum from here).
There are a lot of 1K bootstrap:user_permissions_hash:authenticated,ayl5fozd style entries, which leads to Drupal\Core\Session\PermissionsHashGenerator::generate().
Still not sure what does the actual flushes, tried to catch it in #93 (log), for all I know it might even be done outside of Drupal's cache system (there's apcu_clear_cache() call in Doctrine\Common\Cache\ApcuCache (doFlush()).
Then there are also Drupal\Component\FileCache\ApcuFileCacheBackend, Symfony\Component\ClassLoader\ApcClassLoader that seem to have no concept of flushing/gc.
Drupal\Core\Cache\ApcuBackend used to set $ttl at some time but it was removed in #2581395: Incorrect expiration in APCUBackend.
I'm sure the apc.entries_hint needs adjusting in the test environments, maybe even apc.ttl (see #109), or at least some requirement check (does ApcuBackend make an assumption that it will be set some specific way ? 0 or fixed time).
But it feels like ApcuBackend needs to do something too.
Also for some tries of cleaning after tests: #89, #107, #111, #113, still WIP.
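A back-of-the-envelope check of the "4K slots vs 5M variables" point, as a Python sketch. It assumes a chained hash table with a fixed number of slots and takes 4K to mean 4096; it models lookup chain length only, not locking or memory use. The helper names and the generated keys are hypothetical.

```python
import math
import zlib


def average_chain_length(entries, slots):
    """Mean number of entries per slot if keys spread evenly."""
    return entries / slots


def longest_chain(keys, slots):
    """Place keys into a fixed number of slots and return the longest chain."""
    counts = [0] * slots
    for key in keys:
        counts[zlib.crc32(key.encode()) % slots] += 1
    return max(counts)


# 5M entries against 4096 slots.
print(round(average_chain_length(5_000_000, 4096)))  # 1221

# Scaled-down simulation: 50,000 hypothetical keys, same 4096 slots.
keys = [f"prefix{i}:bootstrap:item" for i in range(50_000)]
print(longest_chain(keys, 4096) >= math.ceil(50_000 / 4096))  # True
```

So on average each lookup would have to walk a chain of over a thousand entries, under these assumptions.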
Comment #34
tacituseu
Re #32: @Mixologic did that in #98, they run much faster without APCu ;).
From apcu-5.1.8\INSTALL:
says nothing of 'emergency dumping', still checking code.
apcu-5.1.8\TECHNOTES.txt:
Comment #35
tacituseu
Now for the code part, TLDR: it does nuke it.
apcu-5.1.8\apc_cache_api.h:
apcu-5.1.8\apc_cache.c:
The triggers are inside apcu-5.1.8\apc_sma.c (notes after ###):
Can also be verified experimentally as it keeps the count under 'Cache full count'
apcu-5.1.8\apc.php:
'<tr class=tr-1><td class=td-0>Cache full count</td><td>{$cache['expunges']}</td></tr>',
apcu-5.1.8\apc_cache.c:
add_assoc_double(&info, "expunges", (double)cache->header->nexpunges);
so @Mixologic could confirm that it is the case by catching the same apc.php stats screenshot as in #29 but with default settings.
Edit: confirmed (last column is apcu_cache_info(TRUE)['expunges']).
Comment #36
Mixologic
I would love to shut off APCu on the testbots.
A not insignificant amount of time is spent waiting for memory locks.
I've uploaded the status of the APC cache directly after running the default, which took 31 min, 20 to run just the phptests on php7/sqlite.
I then changed just the apc.entries_hint so that it had more than 4096 buckets to use, but that made things worse: 54 min 40 sec.
Finally, I just flat out shut it off.
Now that we're paying by the minute, I can pretty much quantify this to saving 0.12 per test, which sure would be nice to get that back. (not sure the impact it would have on the php5.5/5.6 envs, but probably similar)
Comment #37
Anonymous (not verified)
It sounds very tempting! I do not use apcu, so for me this is an excellent suggestion. But it seems there is a lot of code and many issues about apcu in drupal, so perhaps keeping it also makes sense. Can we implement the proposals from @catch and @mpdonadio?
Also, it looks like @tacituseu found a way to reduce the tests to 21 min 51 sec (from 51 min) via $settings['apcu_ensure_unique_prefix'] = FALSE; (#124).
Or by building a custom version of the prefix for CI (#120).
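A rough Python sketch of why dropping the unique prefix shrinks the cache. It only counts distinct keys when every test site caches the same class map; the prefixes, class names and site count are made up for illustration, and this is not how Drupal actually builds its prefix.

```python
def cache_keys(site_prefixes, class_names):
    """Distinct keys written when each site caches the class map under its prefix."""
    return {f"{prefix}:{name}" for prefix in site_prefixes for name in class_names}


class_names = [f"Drupal\\Core\\Class{i}" for i in range(1000)]  # hypothetical classes
test_sites = 32                                                  # hypothetical count

unique = cache_keys([f"test{n}" for n in range(test_sites)], class_names)
shared = cache_keys(["shared"] * test_sites, class_names)

print(len(unique))  # 32000
print(len(shared))  # 1000
```

With unique prefixes the entry count grows with the number of test sites; with a shared prefix it stays at the size of one class map.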
Comment #38
tacituseu
The PHP 7 & SQLite 3.8 of #120 does 14 min 1 sec.
Comment #39
dawehner
This is actually a nice find, given that #2474909: Allow Simpletest to use the same APC user cache prefix so that tests can share the classmap and other cache objects was introduced only for this reason in the first place :)
Comment #40
tacituseu
And then removed by #2749955: Random fails in UpdatePathTestBase tests.
Edit:
Making ApcClassLoader shared across the tests lowers the amount of APCu cache entries over the test-run from 5M to 1M (hits/misses = 39, right before the expunge), with ApcuFileCacheBackend added it gets reduced to 300K (hits/misses = 117, charts courtesy of @vaplas).
Comment #41
Anonymous (not verified)
Here is the implementation of #124 from @tacituseu.
The patch reverts http://cgit.drupalcode.org/drupal/commit/?id=44aa63b (#2749955: Random fails in UpdatePathTestBase tests). So, we tested this:
Comment #42
mpdonadio
So, if we want to go with #41
Nits, but first line should be one sentence, and should use a @see
Nice, esp when you see this in context.
Since this is the third(?) time we have had to tweak this, I think we should expand the comments on Settings::getApcuPrefix() about what is going on, and reference #2474909: Allow Simpletest to use the same APC user cache prefix so that tests can share the classmap and other cache objects, #2749955: Random fails in UpdatePathTestBase tests, and this issue.
And kudos to @vaplas and @tacituseu on this. I have run into some other instances where caching slowed things down. The profiling / charts are awesome, and are really needed to draw real conclusions.
Just adding the patch w/ shuffles to see if there are any side effects.
Comment #43
Anonymous (not verified)
@mpdonadio, thanks for the review, advice and kind words! I tried to address the #42 points, but with a small difference.
#42.1: Done + explained why FALSE by default
#42.3: Slightly changed the description of Settings::getApcuPrefix(): WebTestBase -> tests. + only one @see, to the current issue. The two other issues I added to the IS and the Related issues list. So I hope they will be found by anyone researching this, and we do not need to list them all in @see.
Comment #44
Anonymous (not verified)
Comment #45
dawehner
One important result of all this amazing research is that real sites also potentially have issues with that. Unlike the initial benchmarks we did (clear all cache, just access the frontpage), real sites tend to have a lot more code involved over time.
I'm curious whether @berdir did any research on the same topic as well. They have had D8 production sites for a long time.
Comment #46
Anonymous (not verified)
@dawehner, thanks for the good point, as always. It would be great to get the best default apcu settings for real sites. But maybe we can still hotfix the problem only for CI, so as not to interfere with the other issues.
#2749955-122: Random fails in UpdatePathTestBase tests @catch:
As has already been noted by @tacituseu, the random fail occurred due to gc. So, now we just revert this interim step.
Perhaps after a more thorough analysis we will find another culprit for the problem, but until this happens, it would be great to make testing more stable and quick.
Comment #47
catch
Marking #43 RTBC, it's a shame we already thought about this, removed it as a hotfix, then forgot about it until now but all the work here has been really useful regardless. I think we should open a (major) follow-up to discuss this more so we don't lose all the information here.
Comment #48
Anonymous (not verified)
Follow-up done: #2933998: Discover the best APCu settings
Comment #49
catch
Whoops I just nearly cross-posted because I'd opened #2934002: APCu cache backend can have unreasonable number of entries during testing or multi-site. Let's mark one as duplicate (I haven't done so yet though).
Comment #50
Anonymous (not verified)
Oh, no problem!) I closed my version, #2933998: Discover the best APCu settings. Perhaps later we will convert it into additional research following the plan from @tacituseu.
Comment #51
larowlan
nit: => problems, can be fixed on commit
Comment #52
Anonymous (not verified)
@larowlan, thanks for the review! Done.
+ Added the follow-up from #49 to the IS head.
Comment #54
Anonymous (not verified)
#53: unrelated global 'possibly out of free disk space' on DrupalCI. Except one https://www.drupal.org/pift-ci-job/849151 (it is still running, https://dispatcher.drupalci.org/job/drupal_patches/42455/console).
Comment #55
Anonymous (not verified)
Now DrupalCI. But 1 random fail in a JS test. Moved it to #2934064: Random fail in JS-tests due to unstable pressButton(). We already have the same case in #2924201: Resolve random failure in LayoutBuilderTest so that it can be added to HEAD with unstable pressButton() method. So, perhaps it is an unrelated fail too)
Comment #56
alexpott
Just a note that we stopped WTB sharing the same APC key in #2749955: Random fails in UpdatePathTestBase tests. So we have to be wary that doing this will introduce other random fails.
Comment #57
alexpott
Let's see if we still have random fails in the update tests...
Comment #60
alexpott
lol the test has moved...
Comment #61
Anonymous (not verified)
@alexpott, fair point. We also tested the *Update* tests. See #126-#141. Unfortunately, this patch does not bring an improvement for them. APCu memory runs out as quickly as without the patch.
It seems the hang occurs in checkForMetaRefresh(), when 30 tests wait endlessly in the Batch queue because 1 test is frozen due to lack of resources.
Also an interesting point: this happens precisely with the repeated execution of one *Update* test. If you run 1000 different *Update* tests, then everything is ok.
Let's see if the *Update* tests run better without the patch (as-is)
Comment #63
alexpott
So we're not seeing any super often random fails because of reverting #2749955: Random fails in UpdatePathTestBase tests - so +1 to going forward with #52. Re-uploading so that that is the last patch on issue.
Comment #64
Anonymous (not verified)
After double-checking #60/#61: with the same apcu prefix, *1 Update test* really works worse than with a unique prefix, at 31 concurrency and 100 runs.
Perhaps this is due to the increased number of collisions during the multiple execution of one test. Because:
Therefore, it may be worth continuing the apcu research with multiple runs of a single test, but committing #63, since testing various tests is clearly the priority.
Comment #65
Anonymous (not verified)
Also about the #52 fail in the JS test. Many checks confirmed that it is a random fail due to pressButton():
I also reproduced this fail locally with phantomjs 2.1.1 (but I have not yet found a reliable way to reproduce it quickly)
Comment #66
Mixologic
I ran the patch in #52 and APC still overflowed once.
I bumped the apc.entries_hint to 500,000 to give it a better subdivision, and bumped the memory to 3GB to see if I could avoid an overflow. Attached is the final apc dashboard, so no overflows, but it still needs about 2.4 GB of memory.
I'm excited to get this patch in to reduce random fails, as well as a significant performance boost for the testbots, which translates into cost savings for the DA.
It would still be nice to see if it's possible to clear out a prefix at the end of a test, but that could mean more overhead than it's worth.
Comment #67
Mixologic
To clarify some things, I'll bump APC's memory to 3GB as well as the entries hint to help reduce testbot errors. On one hand, that might hide issues that real world users have if their APC memory is lower than required, but given that we're not really mimicking real world scenarios with the way we run the tests, I think it's better to err on the side of not having tests fail randomly.
Comment #69
larowlan
Committed 9d552ca and pushed to 8.5.x.
Leaving at RTBC and changing to 8.4.x.
Will cherry-pick to 8.4.x tomorrow after commit freeze for 8.4.4 is over.
Tagging with my made-up tag until such time.
Comment #70
tacituseu
@Mixologic: at the same time it can also be reduced in php-cli.ini's, as the run-tests.sh side is hardly even using it (tens of MB max).
Comment #71
Anonymous (not verified)
#69: Many thanks! If possible, it would be nice to move @tacituseu to first place + add credit to @dawehner and @catch, they did reviews and contributions in other issues (like #2704571: Add an APCu classloader with a single entry and #1596472: Replace hard coded static cache of entities with cache backends).
#66: yep, according to the logs apcu overflowed at around 18 minutes (charts). Perhaps we will arrive at a mechanism for cleaning the cache after each test in follow-up #49 from @catch. For now, just watching the result of #69.
Comment #72
tacituseu
@vaplas: first place is right the way it is :) awesome work.
Comment #73
larowlan
Added catch and dawehner to issue credits
Comment #74
andypost
What is the follow-up issue to research cleaning the cache by prefix? It should not be so hard to do it in tearDown()
Comment #75
catch
We can look at it in #2934002: APCu cache backend can have unreasonable number of entries during testing or multi-site or possibly make a child issue of that one?
Comment #77
mpdonadio
Just a reupload so we have runs against 8.4.x.
Comment #79
larowlan
Cherry-picked as 527e762 and pushed to 8.4.x
Made the commit message for the cherry-pick reference @catch and @dawehner as per #71
Comment #80
larowlan
Comment #82
xjm