Problem/Motivation
Random test failures due to segmentation faults have been observed in numerous core tests.
The test runner reports the failure in a class that changes from time to time as new code is committed. Previous classes include:
\Drupal\node\Tests\Views\FrontPageTest::testCacheTagsWithCachePluginTime()
\Drupal\editor\Tests\EditorSecurityTest::testEditorXssFilterOverride()
\Drupal\datetime_range\Tests\DateRangeFieldTest::testDatelistWidget()
\Drupal\locale\Tests\LocaleUpdateTest::testEnableUninstallModule()
\Drupal\views_ui\Tests\HandlerTest::testErrorMissingHelp()
Drupal\Tests\toolbar\Functional\ToolbarCacheContextsTest::testToolbarCacheContextsCaller()
However, since the test that reports the failure from the segfault changes unpredictably, do not rely on this list. To check whether a fail is a segfault: If the test says "Fatal error" with no other information, click "View results on dispatcher", then "Console Output", then "Full log", and search on the page for "segmentation fault".
Proposed resolution
For PHP 5.6 we have clear evidence that there are problems garbage collecting whilst building the container. Compare https://dispatcher.drupalci.org/job/drupal_patches/14991/consoleText and https://www.drupal.org/pift-ci-job/667697.
Therefore we are disabling garbage collection during container rebuilds for PHP 5. As far as we can tell this works around a known PHP 5 bug that will not be fixed: https://bugs.php.net/bug.php?id=72286. The potential impact is an increase in memory usage; for a single container rebuild this has been calculated to be 1.5 MB.
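The approach can be sketched in a few lines. This is a minimal illustration of the workaround described above, not the actual core patch; the function name is hypothetical:

```php
<?php

// Minimal sketch (not the actual core patch) of the workaround described
// above: suspend garbage collection while the container is compiled, then
// restore the previous state. The version check limits it to PHP 5, where
// https://bugs.php.net/bug.php?id=72286 will not be fixed.
function compile_container_without_gc(callable $compile)
{
  // Save whether GC was on so we never force-enable it on environments
  // that deliberately run with zend.enable_gc=0.
  $gcWasEnabled = gc_enabled();
  if (PHP_MAJOR_VERSION < 7 && $gcWasEnabled) {
    gc_disable();
  }
  try {
    return $compile();
  }
  finally {
    if (PHP_MAJOR_VERSION < 7 && $gcWasEnabled) {
      gc_enable();
    }
  }
}
```

The try/finally ensures GC is restored even if compilation throws; the trade-off, as noted above, is roughly 1.5 MB of extra peak memory per rebuild on PHP 5.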
There are other segfaults, which will be tracked in #2223459: Segmentation fault, Out of memory in LocaleUpdateTest and others and #2876895: Intermittent segfaults on PHP 7.1.
Remaining tasks
User interface changes
None
API changes
None
Data model changes
None
Original issue summary
testEditorXssFilterOverride
fail: [Completion check] Line 418 of core/modules/editor/src/Tests/EditorSecurityTest.php:
The test did not complete due to a fatal error. Unknown
fail: [run-tests.sh check] Line 0 of :
FATAL Drupal\editor\Tests\EditorSecurityTest: test runner returned a non-zero error code (139).
- https://www.drupal.org/pift-ci-job/618661
- https://www.drupal.org/pift-ci-job/615117
- https://www.drupal.org/pift-ci-job/619448
- https://www.drupal.org/pift-ci-job/615775
- https://www.drupal.org/pift-ci-job/612834
Looks like we may have a new one on php5.6:
https://dispatcher.drupalci.org/job/php5.6_mysql5.5/2134/artifact/jenkin...
Both:
https://dispatcher.drupalci.org/job/php5.6_mysql5.5/2134/
and
https://dispatcher.drupalci.org/job/php5.6_mysql5.5/2140/
are having segfaults in what looks to be the same location in the code:
https://dispatcher.drupalci.org/job/php5.6_mysql5.5/2134/artifact/jenkin...
https://dispatcher.drupalci.org/job/php5.6_mysql5.5/2140/artifact/jenkin...
NULL pointer dereference in garbage collector triggered in Symfony\Component\DependencyInjection\Compiler\ServiceReferenceGraph, possibly related to #2603152: Fix PHP 7 testbot failures.
Comment | File | Size | Author
---|---|---|---
#56 | 2859704-56.patch | 1.43 KB | alexpott
#56 | 55-56-interdiff.txt | 457 bytes | alexpott
#55 | 2859704-55.patch | 1.88 KB | alexpott
#55 | 49-55-interdiff.txt | 1012 bytes | alexpott
#49 | 2859704-39.patch | 1.35 KB | alexpott
Comments
Comment #1
Anonymous (not verified) CreditAttribution: Anonymous commented
vaplas created an issue. See original summary.
Comment #2
Anonymous (not verified) CreditAttribution: Anonymous commented
New case: https://www.drupal.org/pift-ci-job/619448.
Comment #3
catch
Duplicate of #2857843: Random fail in Drupal\KernelTests\Core\Entity\ContentEntityChangedTest::testChanged.
Comment #4
Anonymous (not verified) CreditAttribution: Anonymous commented
Yes, it looks like a duplicate, because I took the mentioned issue as a pattern and forgot to change the name of the method (testChanged to testEditorXssFilterOverride). Sorry for the misunderstanding.
Comment #5
Anonymous (not verified) CreditAttribution: Anonymous commented
Comment #6
xjm
It's a segfault, so it probably has nothing to do with the particular test and may appear to be in a different test tomorrow.
Comment #7
xjm
Comment #8
xjm
Comment #11
xjm
Comment #12
xjm
Comment #13
tacituseu CreditAttribution: tacituseu commented
https://www.drupal.org/pift-ci-job/615775
https://www.drupal.org/pift-ci-job/618661
https://www.drupal.org/pift-ci-job/619448
all have to do with
ServiceReferenceGraph
same as those reported by @Mixologic. https://www.drupal.org/pift-ci-job/618711 looks clean.
Comment #14
Anonymous (not verified) CreditAttribution: Anonymous commented
Exactly, in its place should be https://www.drupal.org/pift-ci-job/618661. IS updated.
Comment #15
xjm
#2853905: Figure out why there were segfaults in every test run on PHP 7.1.x-dev is no longer occurring, so that issue has been closed, but adding a reference here just in case.
In general, there seem to be a lot of segfaults and stability problems related to garbage collection in our tests (or at least on CI). Adding a few other related issues as well.
Comment #16
xjm
PHP 7.1 had a segfault here: https://www.drupal.org/pift-ci-job/626032
So it's not 5.6 only.
Comment #17
tacituseu CreditAttribution: tacituseu commented
That PHP 7.1 trace from #16 looks like a different bug/trigger.
Comment #18
cilefen CreditAttribution: cilefen commented
@alexpott, @catch, @cottser, @xjm and I considered this issue and agree it is appropriately a critical-priority bug, because intermittent (seemingly random) test failures interrupt the work of project contributors.
Comment #19
Wim Leers
I'm the editor.module maintainer. Let me know how I can help. Since this is only occasionally happening, and when it does, it's a segfault… I don't feel like there's much I can do right now.
Comment #20
tacituseu CreditAttribution: tacituseu commented
There seem to be two triggers:
1. Inside Symfony\Component\DependencyInjection\Compiler\ServiceReferenceGraph::connect():
https://www.drupal.org/pift-ci-job/615775
https://www.drupal.org/pift-ci-job/618661
https://www.drupal.org/pift-ci-job/619448
2. Inside Drupal\Component\Utility\NestedArray::mergeDeepArray() in FrontPageTest:
https://www.drupal.org/pift-ci-job/662744
https://www.drupal.org/pift-ci-job/662998
https://www.drupal.org/pift-ci-job/663460
https://www.drupal.org/pift-ci-job/663495
They are getting more frequent.
Crashes are inside the garbage collector around gc_zval_check_possible_root()/gc_collect_cycles(); both sometimes result in "zend_mm_heap corrupted".
Edit:
A third trigger might be inside system_update_8004() in UpdatePathTestBase, from old issue #2828559: UpdatePathTestBase tests randomly failing, which was just worked around with gc_disable(); see dumps in #114 and #115.
Comment #21
Anonymous (not verified) CreditAttribution: Anonymous commented
Looks like twin prime segfaults :)
#19: fair remark; since #6 it is no longer associated with editor.module.
Comment #22
tacituseu CreditAttribution: tacituseu commented
https://www.drupal.org/pift-ci-job/663713 is something else again; seen it too in https://www.drupal.org/pift-ci-job/664432, triggered in Symfony\Component\DependencyInjection\Compiler\ResolveDefinitionTemplatesPass::process().
Comment #23
mpdonadio
Not sure if this adds more info, but I saw this again:
https://dispatcher.drupalci.org/job/drupal_patches/14492/console
Comment #24
mpdonadio
And again in https://dispatcher.drupalci.org/job/drupal_patches/14505/consoleText
Comment #25
Mixologic
@mpdonadio: that's actually something strangely broken with the testbots. They are *supposed* to remove the core files once they do a gdb backtrace (because they are huge). But something has recently made them stay in place, so when a new container tries to backtrace those core files, none of the memory exists anymore.
I'm going to have to add some more output to see what's going on. 'sudo rm -rf $core_file' seems to have some kind of issue.
Comment #26
gambry
I've been redirected here from #2876281: \Drupal\node\Tests\Views\FrontPageTest::testCacheTagsWithCachePluginTime fails randomly, closed as a duplicate of #2848133: Random segfaults in Drupal\node\Tests\Views\FrontPageTest.testCacheTagsWithCachePluginTime, itself closed as a duplicate of this issue.
However, I don't actually see the Drupal\node\Tests\Views\FrontPageTest.testCacheTagsWithCachePluginTime test mentioned anywhere, so I agree with @alexpott (on #2876281) that this issue "should have a list of classes so we don't create dupes."
This just happened on #2786577: The Views integration Datetime Range fields should extend the views integration for regular Datetime fields:
\Drupal\node\Tests\Views\FrontPageTest::testCacheTagsWithCachePluginTime
https://www.drupal.org/pift-ci-job/666090
From all the reports in here, it looks like the affected classes/tests are:
- \Drupal\node\Tests\Views\FrontPageTest::testCacheTagsWithCachePluginTime
- Editor.Drupal\editor\Tests\EditorSecurityTest::testEditorXssFilterOverride
- Views_ui.Drupal\views_ui\Tests\HandlerTest::testErrorMissingHelp
- Drupal\Tests\toolbar\Functional\ToolbarCacheContextsTest::testToolbarCacheContextsCaller
(However, the last one has a different error output from the rest, so I don't know if it can be considered a symptom of the same issue.)
Comment #27
alexpott
Editing the summary so that we get an idea of the tests where this is occurring and so people can search and find this issue.
Comment #28
alexpott
#2488350-35: Switch to a memory cache backend during installation seems to up the rate of segfaults in Drupal\node\Tests\Views\FrontPageTest::testCacheTagsWithCachePluginTime. Some of the core dumps show the segfaults occurring whilst doing garbage collection. One of the times we create thousands of objects that would need to be garbage collected is container building. It looks like disabling garbage collection makes things more reliable - see #2488350-37: Switch to a memory cache backend during installation.
The patch attached is the one that seems to be making the difference.
Comment #29
mpdonadio
This aligns with the related GC issue: an underlying problem with PHP memory management that we are just bringing to the surface, since update does a rebuild. I wonder if an explicit GC run after the enable would help with memory swell during the rebuild. Since requests aren't persistent, I'm not sure this matters too much.
Comment #30
xjm
A "list of classes" is not actually going to help much because it's just one class at a time, which changes somewhat arbitrarily as we commit stuff. I think the list in the summary is misleading, because it's probably not going to be any of those classes again.
Rather than listing classes, the correct thing to do is to simply check the console log for the segfault. See step #5 in the summary of #2829040: [meta] Known intermittent, random, and environment-specific test failures for brief instructions. The results will always look something like:
When you see a fatal with no additional details, go to the console full log and look for "segmentation fault". I'll update the summary with this information.
All that said, I've been thinking about creating this Twitter account for like four months: https://twitter.com/drupalsegfault :P
Hopefully it doesn't get to make that many tweets now that we have some debugging.
Comment #31
xjm
If it's helpful, I can provide a list of, and links to, every segfault in HEAD over the past several months, but it'd be a bit tedious and time-consuming trawling through my email. If that would be useful it's probably more efficient for Mixologic to grep logs or whatnot for it.
The pattern of clusters of one test failing at a time, then changing to something else, is pretty consistent in the HEAD fails. I bet it's different for patches because (of course) the patches are changing different things each time and so the conditions that lead to the segfault change. I have also suspected it may not even actually be happening in the test that reports the fail, but somewhere else due to concurrency or whatnot.
Comment #32
alexpott
@xjm I'm seeing \Drupal\node\Tests\Views\FrontPageTest::testCacheTagsWithCachePluginTime() all over the place at the moment, and I was the second person to create a duplicate issue.
The patch attached increases memory usage during a minimal install, but comparing #2488350-35: Switch to a memory cache backend during installation and #2488350-37: Switch to a memory cache backend during installation, it appears to fix the PHP 5.6 problem we have. The evidence is clear: the patch in comment 35 failed 8 out of 13 times (on PHP 5.6) and the patch in comment 37 passed 9 out of 9 times. I think the installer patch is provoking the problem because it adds a service modifier, which causes additional passes over the container.
Looking at the installs (which do plenty of container rebuilds) to see how this impacts memory usage.
Before
Minimal:
Standard:
After
Minimal:
Standard:
Comment #33
tacituseu CreditAttribution: tacituseu commented
Looked over quite a few for a pattern; the only thing in common I can see is that they fail in code paths/methods that manipulate large arrays or produce large numbers of objects.
The segfault is triggered by a NULL pointer dereference inside the garbage collector, in branches of the code responsible for reclaiming memory: when GC runs out of root buffer space (#define GC_ROOT_BUFFER_MAX_ENTRIES 10000) it tries to reclaim some by calling gc_collect_cycles().
That doesn't necessarily mean the bug is inside one of those methods; as @xjm points out, it could be anywhere in the code. Something leaves a buggy object/array dormant in the buffer; then, when the buffer goes over the limit and GC tries to reclaim memory, it walks over the buggy thing and segfaults.
As the code base grows and tests get more complicated, it will start getting triggered at more and more code paths.
Listing specific tests might be of use if there's some obvious thing they have in common, like some service they all use that has the buggy code.
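The root-buffer mechanism described in #33 can be reproduced in isolation with plain PHP (no Drupal code). Each destroyed cycle adds a possible root to the buffer; once the buffer fills (fixed at GC_ROOT_BUFFER_MAX_ENTRIES = 10000 on PHP 5 and early PHP 7; the buffer became growable in PHP 7.3), the engine invokes gc_collect_cycles() implicitly, which is exactly the code path that segfaults on the affected PHP 5.6 builds. A sketch:

```php
<?php

// Each iteration creates a self-referencing object and drops the last
// external reference, leaving a "possible root" in the GC buffer.
function make_cycles($count)
{
  for ($i = 0; $i < $count; $i++) {
    $a = new stdClass();
    $a->self = $a; // refcount stays > 0 through the cycle
    unset($a);     // now only a cycle collector can free the object
  }
}

gc_enable();
// Crossing the 10000-root threshold forces at least one implicit
// gc_collect_cycles() run on builds with a fixed-size buffer.
make_cycles(15000);
// Whatever is still buffered can be collected explicitly; the return
// value is the number of collected cycles.
$collected = gc_collect_cycles();
```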
Comment #34
alexpott
For me, #33 is exactly why container building is the most likely culprit. It creates thousands of objects. It also explains why update.php was sensitive - plenty of container rebuilds occur during it.
Comment #35
tacituseu CreditAttribution: tacituseu commented
Surprised #32 helps; from http://php.net/manual/en/features.gc.collecting-cycles.php:
Looked over old logs; Drupal\node\Tests\Views\FrontPageTest seems to take under a minute per run, so maybe an x40 patch would be useful, as it is the most common failure at the moment.
Edit:
What it could mean is that it isn't the first gc_collect_cycles() that fails but the n-th one, and the preceding one is the one that corrupts. Since with the patch none will occur when rebuilding the container, it could imply that the corrupting one used to happen there.
Alternatively it might mean the buffer is full of "safe" objects from container rebuilds, and the "troubled" one that happens later isn't recorded due to lack of space.
Comment #36
alexpott
So for me container rebuilds are totally suspect for causing garbage collection segfaults.
Using the script below to target just a container rebuild. gc_collect_cycles() returns the number of zvals freed.
With patch
Without patch
So the patch is effectively isolating a place where we create a prodigious number of objects and containing its effects.
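The script referenced in #36 did not survive the export of this page. Below is a hypothetical reconstruction of the kind of probe it describes: drain the root buffer, run a rebuild, then ask gc_collect_cycles() how much the rebuild left behind. Here $rebuild_container is a stand-in callable, not a real Drupal API:

```php
<?php

// Hypothetical reconstruction of the probe described in #36; the original
// script was lost. $rebuild_container stands in for a Drupal kernel
// container rebuild and is not a real API.
function measure_gc_after(callable $rebuild_container)
{
  $before = gc_collect_cycles(); // drain the root buffer first
  $rebuild_container();
  // Only cycles produced by the rebuild itself remain to be collected.
  $after = gc_collect_cycles();
  return array($before, $after);
}

// Example with a toy "rebuild" that leaves cyclic garbage behind.
list($before, $after) = measure_gc_after(function () {
  for ($i = 0; $i < 100; $i++) {
    $a = new stdClass();
    $a->self = $a;
  }
});
// In #36/#45 the numbers for a real container rebuild were int(21399)
// without the patch and int(4018) with it.
```

var_dump($before) and var_dump($after) would produce the int(...) lines quoted in #45.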
Comment #37
xjm
@alexpott: Here's the list of segfaults in HEAD over the past four months (may have missed a few). Draw your own conclusions. To me it's pretty clear that it moves around and clusters in different tests as different things are committed.
Comment #38
alexpott
@xjm thanks for the list. It's really useful to help avoid duplicates.
Here's a patch that only disables GC when building the container. Running the script in #36 shows the same improvement, with garbage collection doing far less work: with the patch there are 4018 objects to garbage collect after a container rebuild; without the patch there are 21399.
I'm also pretty sure that we can remove the gc_disable() from update.php with this patch, but I think we should investigate that in #2842393: Discover why gc_disable() improves update.php stability.
Comment #39
alexpott
@catch suggested some improvements to the comment so that it is not PHP 5.6-centric. The reduction in the number of objects to garbage collect happens regardless of PHP version.
Also removed a test I was playing around with trying to make the segfaults occur.
Comment #40
catch
Patch looks great. Since we have a little bit of time before the next patch release I'd be happy committing this to both 8.4.x and 8.3.x - the only way we can tell conclusively whether this helps test stability is if the rate of segfaults goes down. It would also be nice to remove the one-off fix in update.php, but agreed on doing that in its own issue - the fewer variables at once the better.
Bumping to RTBC.
Comment #41
Gábor Hojtsy
I agree this looks like it makes a ton of sense based on the above analysis.
Comment #42
tacituseu CreditAttribution: tacituseu commented
The difference in the number of zvals freed by gc_collect_cycles() is due to http://php.net/manual/en/features.gc.collecting-cycles.php:
Comment #43
Gábor Hojtsy
@tacituseu: hm, if this causes memory leaks then that does not sound RTBC to me?!
Comment #44
tacituseu CreditAttribution: tacituseu commented
It's more of an increased memory usage in this case; the memory should be freed at the end of the request. But the way this patch differs from the workaround in update.php is that it isn't limited to the test environment (the code in update.php is behind drupal_valid_test_ua()).
Comment #45
alexpott
So, re the additional memory used: as #32 shows, during a standard install we actually don't use more memory. We do during a Drush minimal install, but not more than the 64 MB target.
Profiling just a rebuild shows us this:
HEAD
int(0)
int(21399)
default: 23.50 MiB - 625 ms
Patch
int(0)
int(4018)
default: 25.00 MiB - 618 ms
This is for a single rebuild, which is generally the case when installing a module via the UI. Personally I think this is an acceptable trade-off for something that only occurs during installation, extension changes and full cache clears. Also, regarding targeting only DrupalCI in update.php: the problem with that is it assumes the segfaults only affect DrupalCI. Yes, it's doing things - especially the number of installations and extension changes - that no normal site is doing, but it is entirely possible that other sites are triggering segmentation faults here too. We know that PHP garbage collection runs out of root buffer space at 10,000 objects in a normal PHP build. We know that rebuilding the standard container - regardless of DrupalCI or not - will create 20,000+ objects.
Comment #46
tacituseu CreditAttribution: tacituseu commented
@alexpott: speaking of segfaults, I gave the testbot a bit of a headache (see #2876705: No space left on device / broken test runners), sorry.
Comment #47
alexpott
So let's try that here :) It might break another bot but it is worth it.
Comment #48
tacituseu CreditAttribution: tacituseu commented
Must be the issue @Mixologic mentioned in #25.
Comment #49
alexpott
So the test-only part of #47 completely cratered the test bot - see https://dispatcher.drupalci.org/job/drupal_patches/14991/
The patch in #47 is way, way more green than that.
I discussed whether or not we should add a drupal_valid_test_ua() check to the code - both of us are -1 on it, because whilst we are seeing this only in the testbot environment, it is highly likely that it affects real sites too. For me the additional memory cost is worth the stability.
Re-uploading #39.
Comment #50
tacituseu CreditAttribution: tacituseu commented
#47 has no segfaults; it's unlikely to finish, though, only because it has a second pass under concurrency=1 as PHPUnit-FunctionalJavascript.
Comment #51
mpdonadio
I've been staring at this for a bit. Why do we need to save off whether GC is enabled here or not?
I trawled through the PHP issue queue. I think this is the bug we are hitting: https://bugs.php.net/bug.php?id=72286. It may be worthwhile to add it to the comment. Triggered a PHP 7 test.
Comment #52
tacituseu CreditAttribution: tacituseu commented
@mpdonadio: so it won't be forced back on by gc_enable() later on environments that mean to have it disabled.
From https://bugs.php.net/bug.php?id=72286:
Comment #54
tacituseu CreditAttribution: tacituseu commented
#47 timed out; results:
274 passes of \Drupal\node\Tests\Views\FrontPageTest for 8.3.x, 278 for 8.4.x, without even a single segfault.
Control: the test-only patch had a ~20% failure rate (out of 50 tests) before the test bot gave up.
Comment #55
alexpott
Adding to the comment as per #51.
So we could add a version compare and only do this for PHP 5. Then we'd have no memory impact on PHP 7, and there are no reports of segfaults here on PHP 7. That seems like a good idea given that https://bugs.php.net/bug.php?id=72286 is a known PHP 5.6 issue and there will not be a fix.
Comment #56
alexpott
Removing the test.
Comment #57
mpdonadio
I like where this patch ended up, and am glad an upstream problem was identified (even though it won't be fixed) so there is less guessing at the cause.
Comment #59
xjm
"Cratering the bot" with a timeout but no segfault doesn't seem to demonstrate the fix to me, just that we exceeded the timeout. I can see RTBCing the patch for other reasons if we can't reliably prove a fix with CI, but I want to make sure we're not RTBCing on incorrect info.
See #16 where I retitled the issue because I saw a segfault on PHP7. If @tacituseu is correct and a different bug caused that segfault, then at least we need to open a followup for the 7.1 segfault if it still exists. InstallUninstallTest has segfaulted on at least 5.5 and 7.1.
Also, in general, remember that the relative abundance of reports on 5.6 will always be higher because it is used for the primary test environment and so runs probably hundreds of times more tests than the other PHP versions.
Setting NR for (at a minimum) confirmation that neither #47 nor the statement about PHP7 change the RTBC, and to get a followup for the "other" segfault if there are two as @tacituseu has suggested a couple times.
Comment #60
tacituseu CreditAttribution: tacituseu commented
@xjm: here's the "cratering" log: https://dispatcher.drupalci.org/job/drupal_patches/14991/consoleText. It is the same patch as in #47, just without the workaround; it made so many core dumps that the test bot ran out of space. #47 timed out because the number of repetitions was too high (100 would pass).
Re #16: the 7.1 failure is in the unserializer; I didn't see a single one like that in the 5.5-5.6 traces.
Edit: I wouldn't run #47 on 5.6 without the patch until #25 is resolved, as it disrupts other issues since the instance stays alive for some time afterwards and tries to schedule other patches.
Comment #61
alexpott
Thanks @tacituseu. So this is clearly RTBC: timing out on DrupalCI with no segfaults proves the fix, because without it we break a testbot and cause subsequent test runs to fail, whereas with the fix the testbot continues to work.
Comment #62
alexpott
Wrt the PHP 7.1 segfault observed in #16, perhaps we should just leave this issue open after the commit and monitor things for a while? If we've seen no new segfault after two weeks then we can close it.
The alternative is to commit the patch without the PHP version check, because whilst we've not got a test case showing PHP 7 failing like PHP 5, there's not much harm other than increased memory usage when doing a container rebuild, which is not exactly the most performant part of the code base anyway.
Comment #63
alexpott
I'm in favour of committing #56 and leaving the issue open. The nature of these faults is that they are intermittent; there's a lot of work and information contained in this issue, and we only have good evidence that it fixes one particularly common case. There's a good chance it will fix the other PHP 5 segfaults, but we cannot be sure. Also, there's nothing to stop us removing the version check at a later date if we can prove this makes PHP 7 more stable too.
Comment #64
catch
For me, I'd commit this, mark it fixed, and open a major task to remove the version check.
Garbage collection between PHP 5 and 7 is completely different, so I don't think we'll get the same segfault as such, but it still might be worth removing for other reasons.
Comment #65
xjm
Thanks @tacituseu, that's really helpful!
Comment #66
alexpott
Comment #67
alexpott
Comment #70
catch
Opened #2876895: Intermittent segfaults on PHP 7.1 to keep going on the PHP 7.1 segfaults.
Committed/pushed to 8.4.x and cherry-picked to 8.3.x, thanks!
Comment #71
Wim Leers
Wow, impressive research here!
Comment #73
xjm