Problem/Motivation
After #2749955: Random fails in UpdatePathTestBase tests, we are continuing to see random fails in UpdatePathTestBase tests. For example, from #2627512-139: Datetime Views plugins don't support timezones:
https://www.drupal.org/pift-ci-job/532197
https://www.drupal.org/pift-ci-job/532126
https://www.drupal.org/pift-ci-job/532196
https://www.drupal.org/pift-ci-job/532198
When you look at these reports, there isn't a clear symptom other than
fail: [Browser] Line 248 of core/modules/system/src/Tests/Update/UpdatePathTestBase.php: GET http://localhost/checkout/update.php/start?id=2&op=do_nojs returned 0 (0 bytes).
fail: [Other] Line 263 of core/modules/system/src/Tests/Update/UpdatePathTestBase.php: Schema key block.block.bartik_login:settings.cache failed with: missing schema
at some point in the output.
Given the recent work that has gone into test isolation (need links to the related issues) for APCu, database, and filesystem collisions during parallel tests, these new fails are all the more baffling.
Proposed resolution
Until this can be isolated, I think we need to consider two radical options
- Disable parallel tests
- Force a fresh testbot for each full run
that should(?) point us in the right direction.
Comments
Comment #2
mpdonadio
I guess another possibility is to run all tests (with parallelism) except for UpdatePathTestBase tests, and then run those at the end in serial?
Comment #3
xjm
@alexpott, @effulgentsia, @cilefen, @catch, and I agreed that this is a critical issue since it's a random test failure for core.
Comment #4
xjm
Comment #5
Anonymous (not verified) CreditAttribution: Anonymous commented
Sorry if this information is useless, but I had a similar example: https://www.drupal.org/pift-ci-job/480599.
Comment #6
xjm
Comment #7
amateescu CreditAttribution: amateescu for Pfizer, Inc. commented
At @alexpott's suggestion, here's a patch which will show us exactly which updates didn't run.
Comment #8
amateescu CreditAttribution: amateescu for Pfizer, Inc. commented
Better one.
Comment #10
amateescu CreditAttribution: amateescu for Pfizer, Inc. commented
More better :D
Comment #12
amateescu CreditAttribution: amateescu for Pfizer, Inc. commented
We need to reset the static cache for drupal_get_installed_schema_version() now.
Comment #17
amateescu CreditAttribution: amateescu for Pfizer, Inc. commented
Missed a spot.
Comment #18
hchonov
We've noticed in #2248983: Define the revision metadata base fields in the entity annotation in order for the storage to create them only in the revision table that the problem is reproducible with PHP 5.5 but not with PHP 7 when we put our update logic in View::preSave and trigger a re-save of all views through a hook_post_update. However, the problem does not occur if we instead directly update the configs in hook_update_N.
So currently we have a patch there which fails constantly rather than randomly.
Comment #19
hchonov
The random fails in our case were most probably caused by a PHP 5.5 bug with "foreach by reference", as shown in the interdiff from #2248983-176: Define the revision metadata base fields in the entity annotation in order for the storage to create them only in the revision table, where the patch finally passed once I modified the array values directly instead of using "foreach by reference".
Comment #20
Anonymous (not verified) CreditAttribution: Anonymous commented
The regexp
foreach.+=>\s*&
found about 50 cases in core. We can try to replace them, right?
Edit: and 1 case in vendor: /vendor/guzzlehttp/guzzle/src/UriTemplate.php
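To make the hazard concrete, here is a minimal sketch (hypothetical array, not actual core code) of the pattern that regexp matches, and the direct-modification rewrite described in #19:
<?php
$weights = ['foo' => 1, 'bar' => 2];

// Pattern matched by the regexp above: foreach by reference. Under the
// PHP 5.5 bug discussed in #19 this could misbehave and cause random fails.
foreach ($weights as $key => &$weight) {
  $weight += 10;
}
// The dangling reference must be unset, or a later foreach over the same
// array will silently overwrite the last element.
unset($weight);

// Equivalent loop that modifies the array values directly by key.
foreach ($weights as $key => $weight) {
  $weights[$key] = $weight + 10;
}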
Comment #21
alexpott
Might have worked this out :) We have protection in WebTestBase to prevent following too many redirects... the problem is that this is set to 5, and sometimes (depending on load) the update batch can't be completed in 5 refreshes.
Comment #22
alexpott
I think the issue summary has it wrong btw... it is quoting the second fail - the first fail from all of the tests is:
Comment #23
alexpott
So #21 is not right - the batch system uses meta refreshes, which are handled separately :( But still, I think we should be focusing on the 0 bytes, not the schema stuff.
Looking at #17.
This change is a shame. Any ideas how we can get around doing that?
Comment #24
alexpott
I cannot see any reason why drupal_get_installed_schema_version() is not using drupal_static() - let's use that so we can reset it.
I think we should commit this so we have additional information about what is going on here. At the moment we are flying blind.
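For illustration, a minimal sketch of what switching the function to drupal_static() might look like (assuming the schema versions still come from the system.schema key/value store; the committed patch may differ):
<?php
function drupal_get_installed_schema_version($module, $reset = FALSE, $array = FALSE) {
  // Use drupal_static() instead of a function-local static so tests can
  // clear the cache with drupal_static_reset().
  $versions = &drupal_static(__FUNCTION__, []);
  if ($reset) {
    $versions = [];
  }
  if (!$versions) {
    $versions = \Drupal::keyValue('system.schema')->getAll();
  }
  if ($array) {
    return $versions;
  }
  return isset($versions[$module]) ? $versions[$module] : SCHEMA_UNINSTALLED;
}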
Comment #25
alexpott
So the static was added in #35924: Make update.php work with modules, before drupal_static() existed, and it just never got updated.
Comment #26
alexpott
Comment #27
amateescu CreditAttribution: amateescu for Pfizer, Inc. commented
The interdiff from #24 makes perfect sense, why didn't I think of that? :P
+1 from me!
Comment #28
catch
This looks good, I don't think the drupal_static() is a problem for 8.2.x.
Comment #29
alexpott
I've committed #24 to 8.2.x and 8.3.x because without this extra info we're never going to find out what is happening.
Committed and pushed 5e5ef24 to 8.3.x and 607ae50 to 8.2.x. Thanks!
Comment #32
alexpott
Okay, we've had the first fail with the new code in place: https://www.drupal.org/pift-ci-job/548483. So more information...
So all the hook_update_N()s have run but some of the post updates have not... actually, it is not some! It is ALL of the post updates!
Comment #33
Berdir
Are we sure it is only post updates? Post update functions don't run if any of the normal update functions fail.
The entity field definition differences are also strange. Those should be fixed by actual update functions, not by post update functions, so that seems like another indication that not all update functions were executed?
Comment #34
alexpott
In between running hook_update_N() and post updates we call drupal_flush_all_caches() to ensure that the environment is clean before allowing full API use in post updates. I wonder if there is a problem when we do a refresh for the batch immediately after drupal_flush_all_caches(). So things work if a post update is run, but not if none is.
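For reference, the sequencing described above, reduced to pseudocode (run_hook_updates() and run_post_updates() are hypothetical stand-ins for the real batch operations; drupal_flush_all_caches() is the real API):
<?php
run_hook_updates();        // Batch: all pending hook_update_N() implementations.
drupal_flush_all_caches(); // Rebuild caches/container so post updates can
                           // safely use the full Drupal API.
run_post_updates();        // Batch: all pending hook_post_update_NAME() implementations.
// The suspicion in #34: a batch refresh that lands immediately after the
// flush, when no post update remains to run, may hit a broken state.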
Comment #35
mpdonadio
#34, what happens in this scenario if two updates are running at the same time because of test concurrency?
Comment #36
alexpott
@mpdonadio I'm not sure concurrency has much to do with this - there are separate URLs, separate DBs, separate everythings... however it is worth testing this...
Comment #37
mpdonadio
I wonder if this is related to #2607572: Simpletest UI throwing 500 errors during tests, where the batch tries to run another set but the queue is actually empty, so it never actually finishes properly.
Comment #38
alexpott
One thought is that automated_cron is installed by system_update_8013() and there is nothing to prevent cron from running whilst the updates are still running. This seems a bit unexpected to me. Also, given that cron runs on request termination, it could be running at the same time as drupal_flush_all_caches(). However, if this was the cause I'd expect the missing updates to be more random :(
Still, I'm pretty sure that cron shouldn't run if there are outstanding database updates.
Comment #39
mpdonadio
Two quick hacks to check the theory in #38.
Comment #40
alexpott
Well, we're not putting the site into maintenance mode :) And since #565154: cron.php will run regardless of whether the site is in maintenance mode or not, we actually prevent cron from running in maintenance mode when it is run via the cron URL, so... let's do updates properly :)
I propose that if this patch is green we commit it. Because we have no reliable way of causing the error yet, we can't prove that anything fixes it - only that it does not.
Comment #41
alexpott
Oops, $this->assert(200) is just silly :) - also, let's preserve the value just in case people are testing that updating does not change this value.
Comment #42
mpdonadio
#40, confused here; DbUpdateController::triggerBatch() puts the site in maintenance mode if it isn't already:
and then AutomatedCron::onTerminate() doesn't have any checks. TBH, I was just checking maintenance_mode because I was too lazy to set/clear a state variable before/after updates in DbUpdateController to test the theory in #38 about cron running during updates and triggering a cache clear in the middle.
Comment #44
alexpott
@mpdonadio - ah, you are right, the checks are built into the cron route rather than the cron service. I think the checks should be in the cron service and not the cron controller.
Comment #45
mpdonadio
So, the cron route in system.routing.yml is wired up to CronController::run(), which has no check at all for maintenance_mode (not sure when that diverged from #565154: cron.php will run regardless of whether the site is in maintenance mode or not.; I can't fully trace out the history of cron.php getting converted to the current route implementation).
Here is the access check in Cron::run() along with #41. This should prevent cron from running in maintenance_mode from anywhere (the route, the form, the event, the tests).
We may want to explicitly set a lock around the update and check that instead (or both?).
Side note in CronController:
Since Cron::run() can return FALSE, we may want a followup here to check the return code and return a 503 if FALSE. There are similar checks when cron is run manually (e.g., CronForm::submitForm()).
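A minimal sketch of the guard being described for the cron service (assuming maintenance mode is read from the state service, as elsewhere in core; this is not the final patch):
<?php
public function run() {
  // Short-circuit the cron service itself so every entry point is covered:
  // the route, CronForm, the AutomatedCron terminate event, and tests.
  if ($this->state->get('system.maintenance_mode')) {
    return FALSE;
  }
  // ... existing cron processing ...
}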
Comment #47
mpdonadio
Well, that was embarrassing.
Comment #48
mpdonadio
Fails: https://www.drupal.org/pift-ci-job/550289
Oddly, it has the zero-byte symptom, but also something I don't recall seeing before.
Going to trigger a few more runs.
Comment #49
hchonov
We've noticed some strange test fails in our setups: they occurred when running tests through run-tests.sh, but were not reproducible when running tests through the Simpletest UI. The same happens, for example, with #2837707-7: EntityChangedConstraintValidator should be retrieving the latest changed time through an entity query instead by loading the unchanged entity as well. The test fails with PHP 5.6 on the test bot and on my machine locally through run-tests.sh, but passes when I execute it through the Simpletest UI.
Comment #50
xjm
There's a new variety of this fail, possibly just the same bug reported in a new way with the new CI deployment:
https://www.drupal.org/pift-ci-job/556488
Comment #51
xjm
So a number of people have said they saw this fail a lot more frequently today on patches, to the point that they thought HEAD was broken. 8.3.x HEAD failed with different variants of it twice in a row before passing on a third retest.
Comment #52
alexpott
https://dispatcher.drupalci.org/job/php5.5_mysql5.5/4637/artifact/jenkin... shows that a segfault occurred on the recent fails. A segfault explains two things: the 0 byte response and the complete lack of error info - which has always been the case with this one.
Comment #53
mpdonadio
#52, is there a core dump to go along with it so we can see if there is a stack trace, or is it long gone?
Comment #54
Wim Leers
Is this issue also the one that explains the near-constant failures in Drupal\views\Tests\Update\ArgumentPlaceholderUpdatePathTest since yesterday, for example https://www.drupal.org/pift-ci-job/557113?
Comment #55
dawehner
Note: this failure should be fixed by #2784159: Remove CURL timeout in BTB
Comment #56
Wim Leers
catch reverted a patch at #2834291-39: Add a SQL index for entity types that are using EntityPublishedInterface as a best-guess effort, but it didn't help, unfortunately.
Comment #57
mpdonadio
#55, currently that isn't applied to 8.2.x. But yeah, I think the error in #48 is a pretty clear indication that this is the 30-second problem in Guzzle.
Comment #58
dawehner
I am a bit confused about those changes, as ideally update.php toggles maintenance mode for us already, doesn't it?
Comment #59
catch
The fails in #47 look to me like they might be different random test failures. If that's the case, we should consider committing this at least as an interim improvement. Posting the same patch again and setting off some test runs. And it definitely makes sense in itself regardless.
Comment #61
mpdonadio
In #59 it looks like none of the fails have the 0-byte thing, which suggests #2784159: Remove CURL timeout in BTB may be the true root cause.
Comment #62
hchonov
Has anyone considered converting the update path tests to PHPUnit tests to see if the random failures still occur?
Comment #63
tacituseu CreditAttribution: tacituseu commented
error.log from #59 (https://www.drupal.org/pift-ci-job/558039) has:
in it
Comment #64
catch
The migrate_drupal_ui fails like https://www.drupal.org/pift-ci-job/558033 are with the old WebTestBase, not BrowserTestBase, so they can't really be related.
Looks like at least some are failing with the 0 byte response too: https://www.drupal.org/pift-ci-job/558039
Comment #65
hchonov
What if we disable OPcache just to ensure it is not causing the problems? I am not sure whether Drupal CI is using mod_php.
Comment #67
Mixologic
Re #64: that fail, at least, is showing zend_mm_heap corrupted in the error log, along with three Apache core dumps: https://dispatcher.drupalci.org/job/default/284689/artifact/jenkins-defa...
Comment #68
mpdonadio
Have we tried the USE_ZEND_ALLOC=0 trick in the Apache environment?
Comment #69
tacituseu CreditAttribution: tacituseu commented
This might be an issue with peak memory usage under high concurrency: a couple of memory-intensive tests run near each other (UpdatePathTestBaseFilledTest, UpdatePathRC1TestBaseFilledTest); add to that the update batches spawned from them and the usage of zlib, and Apache could run out of available memory while trying to create/send a response, causing a segfault (so ultimately an mpm_prefork/MaxRequestWorkers sizing issue).
Due to the high concurrency, tests do not finish in deterministic order, which would explain why sometimes the peaks miss each other and the tests pass.
Traced the source of the spikes to Drupal\system\Tests\Update\UpdatePathTestBase::setUp(), which gets called in a loop from Drupal\simpletest\TestBase::run(). So it is called 3 times for UpdatePathTestBaseFilledTest (testUpdatedSite, testDatabaseLoaded, testUpdateHookN), which makes it even worse.
Attached a crude patch to get some data on memory usage for each test; hope the parsers won't mind the abuse of #debug.
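The attached patch presumably does something along these lines (a hypothetical sketch, not the actual patch):
<?php
protected function tearDown() {
  // Emit peak memory for each test into the test output so concurrent
  // spikes can be correlated across the run; debug() routes through
  // simpletest's message handling, hence the "abuse of #debug".
  debug(memory_get_peak_usage(TRUE), 'bytes');
  parent::tearDown();
}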
Comment #71
tacituseu CreditAttribution: tacituseu commented
Now with PHPUnit support. Usage: check the console output on the dispatcher; 'messages' == 'bytes'.
Comment #72
tacituseu CreditAttribution: tacituseu commented
Comment #74
alexpott
This will reduce the chances of concurrency issues on the testbot, no? It has concurrency 31.
Comment #75
tacituseu CreditAttribution: tacituseu commented
@alexpott: Yes, it was about increasing the chances of a good pass, so we get data from all the tests.
What I was after was finding out the approximate max memory usage of 31 concurrently running tests.
In short: PHP 5.5: 1.82GB, PHP 5.6: 1.83GB, PHP 7: 1.55GB. Note this is with concurrency forced to 8; a normal run would have a different (higher) characteristic.
Details in the attached spreadsheet.
Comment #76
tacituseu CreditAttribution: tacituseu commented
Now to get one from a normal (concurrency: 31) pass.
Comment #77
tacituseu CreditAttribution: tacituseu commented
The new one keeps a log that allows reconstructing what is executing at the same time; it also covers update.php.
Comment #79
tacituseu CreditAttribution: tacituseu commented
Once more.
Comment #80
tacituseu CreditAttribution: tacituseu commented
Comment #82
Berdir
Is it correct that this only happens with 5.5?
What if we switch the default test branch to 7 until this is resolved at least? It looks like more patches fail with this than don't at the moment, which makes it extremely annoying to work on core atm.
Comment #83
catch
I tried to change the branch testing default, but the option is greyed out for me - I haven't changed this since the UI change, so I'm not sure if it's me or the UI...
Comment #84
Mixologic
Sorry, looks like we have a bug in the QA page. I'll ping @drumm about it on Monday.
Comment #85
claudiu.cristea
Is there no way to temporarily increase the memory of the instances and see if the failure goes away?
Comment #86
tacituseu CreditAttribution: tacituseu commented
Results from #79 give 2.13GB for PHP 5.5 and 8.3.x. I noticed there's also a 2GB APCu cache defined in the container; adding it as a data point.
Comment #87
tacituseu CreditAttribution: tacituseu commented
Comment #88
Wim Leers
I think this is more accurate at this point.
Perhaps more importantly, I found a contrib module with the same problem: the CDN module, which I maintain! It does daily re-testing of the PHP7 branch. As such, it noticed that since December 20, it's failing 100% of the time in its update path test! See https://www.drupal.org/pift-ci-job/565105. Despite no commits since December 2.
Screenshot for posterity:
What's so great about this? Several things:
That should all be quite helpful in pinpointing the root cause.
Comment #89
Wim Leers
Comment #90
Wim Leers
So I wonder if the CDN update path test and the failing update path tests in core are doing something particular that some other update path tests are not doing.
The fixture that the (sole) CDN update path test is using does not contain any serialized PHP. In fact, it might very well be the simplest of all fixtures. Yet it still causes this failure.
It also appears that it's failing with a far more useful error message:
This immediately points to the same area of code that @tacituseu traced it to in #69!
Comment #91
Berdir
@Wim: That doesn't look like the same error to me; it clearly says that it is missing a file.
The reason for that, apparently, is that CI uses Composer to install modules now, and it does so in modules/contrib. You have a hardcoded relative path to the database dump; that doesn't work. You need a drupal_get_path() or something there, or (as that will likely not work at that point) some lower-level discovery API or some dynamic logic to find the modules folder. Keep in mind that technically, modules can still be in sites/all/modules/contrib or site-specific folders too.
Comment #92
alexpott
Thinking about things: the bots have the following config:
So we're going to opcache the uncompressed database file for no good reason. Maybe it just breaks.
Comment #93
alexpott
I'm not sure why enable_cli is actually on - I can't remember if this helps or not. But if it was off we'd not cache these files.
Comment #94
alexpott
Comment #95
alexpott
So I think we can exclude opcaching of the database dumps as the issue, because we have a fail in #94 - https://www.drupal.org/pift-ci-job/565630
Comment #96
alexpott
Also, thinking about it some more, it can't be the extraction that is the problem, because it is failing whilst running the updates.
Comment #97
alexpott
Updating the issue title since #88 is a red herring and the fails, fortunately, are not happening 100% of the time.
Comment #98
hchonov
So if I understand correctly, what @tacituseu has shown is:
- concurrency=8: memory consumption for PHP 5.5 = 1.82GB and all the tests pass
- concurrency=31: memory consumption for PHP 5.5 = 2.13GB and the update tests fail
This sounds like some 2GB memory limit, and also that either
a) Drupal increased its memory usage or now has new tests that require much more memory,
or
b) a server setup change introduced this memory limit.
So in order to get the tests passing again and let work on the other issues continue, we should probably set the concurrency to 8 as a temporary workaround and investigate the problem here further.
Comment #99
hchonov
Also, it probably makes sense to execute each failed test again with concurrency=1 to ensure we are not looking at tests failing due to memory limits.
Comment #100
tacituseu CreditAttribution: tacituseu commented
PHP 7 failures seem to be a recent and different issue.
With regard to my memory usage hunt, I expected it to be much worse.
The bots are running on cc2.8xlarge AWS instances, and from the testbot code it doesn't look like the containers are run with memory constraints, so that is unlikely to be the reason (it still might be for local/Vagrant testing due to:
v.customize [ "modifyvm", :id, "--memory", "4096" ]
).
Comment #101
tacituseu CreditAttribution: tacituseu commented
As for #98 and 1.82GB vs 2.13GB:
The 1.82GB figure was just an approximation made by summing the memory usage of tests finishing together; there are plenty of small tests that come and go quite quickly and lower the figure.
The 2.13GB figure takes into account which tests are actually running together (the $children array in simpletest_script_execute_batch()). At the end of the run there is a bunch of tests in it that are longer-running/taking more resources:
Comment #102
claudiu.cristea
If we know the date when the failures started to occur, let's add a patch that reverts all changes since that date and see if the failures persist. If yes, then it's something with the infrastructure.
Comment #103
xjm
@claudiu.cristea, these fails have been happening for most of 2016 at least. One difference is that they seem to be happening more frequently now.
Comment #104
mpdonadio
How long would test runs take if we eliminate concurrency as a stop-gap? Or maybe drop it to something really low? There are a lot of patches that won't stay RTBC, and this has really become problematic for some issues. Being selfish here, but I feel like I lost a lot of time out of the 8.2.x cycle from not being able to get patches in.
Comment #106
Mixologic
Sorry. Testbots are configured to abort after 1:50 to trap hung builds.
I've been working on new container builds and should have the entire process reconstructed by next week - from that point forward, troubleshooting environment issues will be much, much easier to manage.
Comment #107
hchonov
What about running the update tests with concurrency=1 and all the others with concurrency=31?
Comment #108
Mixologic
I don't think that's going to work. run-tests.sh calls itself, and the tests are executed in an arbitrary order, so there may already be 27 tests running when the update test gets executed.
In addition, it appears as though that patch revealed something that *should* be failing the test. The log for that passing test has a fatal error in run-tests.sh:
https://dispatcher.drupalci.org/job/default/293122/consoleFull
So it looks like drupalci isn't catching fatals in run-tests.sh. Bummer.
Comment #109
Mixologic
Been working on new containers and the container build process; some input/+1s etc. would be helpful here: #2841441: Migrating to new php container versions
Comment #110
hchonov
All the tests should pass now that we force garbage collection in the setup of the update tests.
Comment #111
catch
https://www.drupal.org/pift-ci-job/567196 looks the same to me.
We've switched the default patch testing environment to PHP 5.6 to try to stop the issue queue falling over all the time.
Comment #112
hchonov
So it was just luck that the patch was green in #2816097-50: IGNORE patch testing issue :(
Comment #113
hchonov
Comment #114
Mixologic
Okay, so I now have drupalci set so that the host AMI allows core dumps, and the containers are being started with the proper ulimits. I still need to have them drop the cores in the right place, and to add the php-src so we have zbacktrace as well as apache-dbg, but at least we're making core files when I can catch them.
I *did* catch one for this pgsql/mysql 5.5 fail: https://dispatcher.drupalci.org/job/php5.5_postgres9.1/1670/
Hopefully that gives us something to go on. The container build process has run into some snags that I have to work out (notably, 5.6.29 had a core dump, 2 failed tests, and took 1hr 20min).
Comment #115
Mixologic
Got it to fail again; looks like it's the exact same spot (this time on MySQL and not pgsql):
https://dispatcher.drupalci.org/job/php5.5_mysql5.5/4714/
PHP 5.5, MySQL 5.5
Comment #116
tacituseu CreditAttribution: tacituseu commented
Looks like http://git.php.net/?p=php-src.git;a=commitdiff;h=f29c98c1289b00e0dbb58631df6a5e006f5311d1, which is a non-changelog-listed part of 5.5.24, referenced by https://support.zend.com/hc/en-us/articles/205304527-Composer-Crashes-Under-Zend-Server-8-PHP-5-6
Comment #117
tacituseu CreditAttribution: tacituseu commented
Comment #118
catch
This needs a comment, but very nice find, and I think we should commit it to start stabilising things.
Comment #119
xjm
I queued more runs to demonstrate with reasonable statistical certainty that it is actually improving things over the baseline. Also, here is a control patch to compare it to the current stability on CI.
Comment #120
tacituseu CreditAttribution: tacituseu commented
I only meant the patch as a confirmation of the problem, which is a NULL pointer dereference in the garbage collector.
Maybe there is some other solution:
- figure out which update produces so many objects, and then either refactor it, force the batch to run it in its own request, or call gc_disable() only for it?
- does it warrant raising the minimum required PHP version?
Also, one of the postgres fails (https://www.drupal.org/pift-ci-job/567927) is another issue, #2157927: Intermittent test fails in LocaleUpdateTest::testUpdateImportSourceRemote(); the others are different, but there are no segfaults.
Comment #121
Mixologic
Looks like that one fail on MySQL 5.5 in the last battery on #117 is just another example of a different intermittent failure: #2806697: Random fail for AlreadyInstalledException
Comment #122
xjm
@tacituseu, really great work tracking this down. With the additional test runs I think we can actually say with some confidence that you've found the source of a bug that has eluded us since last June. (The fails on #117 are other known fails, so we exclude them from the count.)
I actually would consider committing the patch from #117, or something like it, as a hotfix. The risk is maybe that update processes on real sites would run out of memory? Could we just disable garbage collection only in update tests somehow?
Comment #123
xjm
Comment #124
tacituseu CreditAttribution: tacituseu commented
#122: should be possible by checking the User-Agent, from Drupal\simpletest\WebTestBase::curlInitialize():
Comment #125
catch
Yes, we still have drupal_valid_test_ua() and could use that. Given this is a PHP bug, I'd prefer the hotfix to trying to refactor an update to not trigger it - we don't know what contrib updates will do, and they can run tests as well.
Comment #126
catch
Here it is with the user agent check and a comment.
Comment #127
alexpott
I think it is more likely that this will help real sites than cause memory issues. The big question for me is what could be making this an effective fix, i.e. where we are generating so many objects that need cleaning up. Config schema generation might be one such place.
See the following links for numbers on turning off gc in Composer...
Issue in composer: https://github.com/composer/composer/pull/3482#issuecomment-65199153
Solution and people's replies: https://github.com/composer/composer/commit/ac676f47f7bbc619678a29deae09...
Comment #129
alexpott
Basically, I think if gc_disable() helps the testbot then it is going to help everyone.
Comment #130
alexpott
We need to call gc_collect_cycles() before calling gc_disable(); see https://github.com/composer/composer/blob/b2efcb1078b8abfb9f20ae228dc255...
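Putting #117, #124/#125, and #130 together, the hotfix has roughly this shape (a sketch, not the committed diff):
<?php
// In update.php, for test requests only: collect any existing garbage
// cycles first, then disable the cycle collector so the PHP 5.5 GC bug
// cannot segfault mid-update.
if (drupal_valid_test_ua()) {
  gc_collect_cycles();
  gc_disable();
}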
Comment #131
alexpott
After thinking about this some more, I think we should proceed with a quick fix for the testbot and then open a follow-up to get some real numbers about the effect of this.
Comment #132
alexpott
Committed and pushed 528e307 to 8.3.x and e690435 to 8.2.x. Thanks!
Comment #135
xjm
Comment #136
xjm
@alexpott posted #2842393: Discover why gc_disable() improves update.php stability.
Comment #137
penyaskito
I'm still suffering from this issue in the Lingotek module. See https://www.drupal.org/pift-ci-job/571581 as an example of another UpdatePathTestBase-based test failing. Configured to use 8.2.x + PHP 7.
Comment #138
alexpott
@penyaskito I don't think those fails are the same - firstly, it does not look random. There's no 0 byte request and there is no segmentation fault.
Comment #139
penyaskito
@alexpott You're right, I found my issue. Sorry for the annoyance.
Comment #141
xjm