Problem/Motivation
With GitLab CI, overall test runs complete within around 15-18 minutes. However, if one test is individually very slow, it can block the pipeline from completing, sometimes for minutes after all other tests have completed. If we can refactor these individual blocking tests, we may be able to get test runs well below 15 minutes. This should also reduce GitLab hosting costs, since it will reduce the time that any particular test pod is reserved.
We can also link to GitLab configuration changes from this issue, but the focus is on tests for now, since at least a couple of them will obscure other improvements.
Proposed resolution
Identify bottlenecks in the pipelines and fix them, whether containers, pipeline configuration, or specific long-running tests:
#3387706: Don't make other tests depend on PHPUnit
#3386479: Copy less files around in ComponentsIsolatedBuildTest
#3386458: Add GenericModuleTestBase and use it to test general module things
#3387737: Split PHP image into php(cli/apache) and yarn(node/nightwatch)
#3371963: Update Nightwatch to 3.x
#3387117: Enable distributed caching in GitLab Runner
#3388375: Run nightwatch tests in parallel
#3388365: Distribute @group #slow tests between test runners and mark more tests
#3389281: [meta] Refactor ultra-slow tests
Issue fork drupal-3386474
- 3386474-multi-install (MR !9125)
- 3386474-installer (MR !9441)
- 3386474-meta-speed-up (MR !4747)
- 3386474-new-approached-printstats (MR !4849)
- 3386474-new-approaches (MR !4802)
- 3386474-omnibus2 (MR !6271)
- 3386474-headers-dev-images (MR !7745)
- 3386474-11 (MR !8137)
- 3388135-mysql8 (MR !4822)
- 3386474-jit (MR !7756)
- 3386474-new-pipeline-world (MR !8952)
- 2024-07-testing (MR !8873)
- with-times-multi (MR !9191)
Comments
Comment #2
catch
Comment #3
catch
Comment #5
catch
Combining #3386448: Get rid of InstallUninstallTest and #3386479: Copy less files around in ComponentsIsolatedBuildTest to see what the effect is.
Comment #6
catch
https://git.drupalcode.org/project/drupal/-/jobs/82114 - 7m49s down from over 10 minutes.
https://git.drupalcode.org/project/drupal/-/jobs/82113 - 8m21s down from over 10 minutes.
Possible next candidates:
Drupal\Tests\config_translation\Functional\ConfigTranslation appears to take a long time to finish.
I've made some modifications to the pipeline config again:
1. Removed parallel from kernel tests, because the three kernel test runners often end up waiting for GitLab pods, which doesn't help them finish any quicker.
2. Reduced functional test concurrency to six. Without InstallUninstallTest, functional tests can finish quicker than, or in about the same time as, the nightwatch, functional javascript, and build tests.
By doing #1 and #2, the overall pipeline looks like the following:
1. composer and yarn
2. Linting, unit tests, kernel tests
3. All other test types.
We use exactly 10 jobs for #3, which maxes out the concurrency limit but also minimises queuing time. However, it's only an improvement on the current GitLab MR once those two tests are dealt with, so marking PP-3.
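To illustrate why going past the concurrency limit causes queueing, here is a minimal sketch in Python. It assumes every job is submitted at the same moment and that a queued job starts as soon as any slot frees up; the function name and the 5-minute durations are hypothetical, not measurements from the pipeline.

```python
import heapq


def pipeline_wall_time(job_minutes, concurrency_limit):
    """Return when the last job finishes if all jobs are submitted at time 0.

    Jobs start in list order; a job beyond the limit waits for the
    earliest-finishing running job to free its slot.
    """
    finish_times = []  # min-heap of finish times for the occupied slots
    for minutes in job_minutes:
        if len(finish_times) < concurrency_limit:
            start = 0.0  # a free slot is available immediately
        else:
            start = heapq.heappop(finish_times)  # queue until a slot frees up
        heapq.heappush(finish_times, start + minutes)
    return max(finish_times, default=0.0)


# Hypothetical durations: ten equal jobs fit in one wave, an eleventh must queue.
print(pipeline_wall_time([5] * 10, concurrency_limit=10))
print(pipeline_wall_time([5] * 11, concurrency_limit=10))
```

In this simplified model the eleventh job doubles the wall time, which is the reason for capping the last stage at exactly ten jobs.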
Comment #7
longwave
As noted in #3386076: GitLab CI integration for core, the Ruby gem parallel_tests has a useful technique we could borrow: it logs test runtimes and then uses that data in future runs to group tests into sets with roughly equal runtimes.
Comment #8
catch
The last commits set phpunit parallel to 2 and reduce the functional tests parallel by one.
This doesn't improve the overall run time: 14m30s is a touch slower, but probably within the margin of error. However, if we're able to resolve the Kubernetes pod availability issues, it might speed things up by compressing the blocking unit tests step so that it definitely finishes alongside the lint jobs.
Comment #9
catch
The issue with the parallel_tests approach for us with the current setup is that we're only parallelizing functional tests (and in this issue, unit tests, but those are pretty well matched for times already). The other test types take about as long as the parallelized functional tests, and we can't go over ten concurrent jobs or we end up with queueing.
If we changed the approach and parallelized all functional, functional javascript, and build tests together (possibly including kernel tests too), then it might be a further improvement though.
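For reference, the runtime-based grouping idea from #7 can be sketched roughly as below. This is not how parallel_tests or run-tests.sh actually implements it; it is a greedy "longest test first, into the least-loaded group" heuristic written in Python, and the test names and runtimes are made up for illustration.

```python
import heapq


def group_by_runtime(runtimes, group_count):
    """Split tests into group_count lists with roughly equal total runtime.

    runtimes maps a test name to its recorded duration from a previous run.
    """
    groups = [[] for _ in range(group_count)]
    # Min-heap of (total runtime so far, group index).
    loads = [(0.0, index) for index in range(group_count)]
    heapq.heapify(loads)
    # Place the slowest tests first so the short ones can even things out.
    for name, seconds in sorted(runtimes.items(), key=lambda kv: kv[1], reverse=True):
        load, index = heapq.heappop(loads)
        groups[index].append(name)
        heapq.heappush(loads, (load + seconds, index))
    return groups


# Hypothetical recorded runtimes in seconds.
recorded = {"TestA": 10, "TestB": 6, "TestC": 5, "TestD": 4, "TestE": 1}
for group in group_by_runtime(recorded, 2):
    print(group, sum(recorded[name] for name in group))
```

The same grouping could in principle cover functional, functional javascript, and build tests together, provided runtimes from a previous run are stored somewhere the next run can read them.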
Comment #10
andypost
Comment #11
catch
Comment #12
longwave
Comment #14
catch
13 minutes.
https://git.drupalcode.org/project/drupal/-/pipelines/20725
With the install/uninstall test and the nightwatch composer test optimisations applied, the bottleneck moves to functional javascript tests.
With parallel 2 these take 9 and 7 minutes respectively: https://git.drupalcode.org/project/drupal/-/jobs/93117 https://git.drupalcode.org/project/drupal/-/jobs/93118. This is not really an improvement on parallel 1, so there is probably another test that individually takes a very long time.
However, nightwatch doesn't use parallel or run-tests.sh and takes about 8 minutes too. We'd need to run those tests in parallel to get under this, but that seems possible: https://nightwatchjs.org/guide/running-tests/parallel-running.html
Comment #15
mstrelan commented
#3371963: Update Nightwatch to 3.x might help too. The release notes suggest it may speed up tests by up to 25%.
Comment #16
andypost
Making CI images smaller makes jobs start faster, so I started to optimize the images in #3387737: Split PHP image into php(cli/apache) and yarn(node/nightwatch)
Comment #17
mstrelan commented
Is it worth having another image without apache, for composer, yarn, unit and kernel tests? Or does the overhead of separate images make it not worthwhile?
Comment #18
catch
I have the same question as #17.
There's also an explicit 'start apache' step in the pipeline, which we can remove at least for unit/kernel tests. It saves a few seconds at most but is worth cleaning up, though only if we keep the same images.
Comment #19
catch
Comment #20
catch
Comment #22
andypost
The MySQL images have had no updates in the last 2 years, so I filed #3388135: Fix base image for Mysql 8.0/5.7, where a few layers are cut off.
Comment #23
andypost
Optimizing the image (the first layer is php, and apache is based on php) as drupalci/php-8.2-cli cut the size by more than 60%, from 500MB to 188MB, compared to drupalci/php-8.2-apache.
By the way, I can't find any metrics about pod start-up, so how to measure the effect is an open question.
Looking for reviews and ideas in #3387737-11: Split PHP image into php(cli/apache) and yarn(node/nightwatch) and the related MR.
Comment #24
catch
#3388375: Run nightwatch tests in parallel should remove the nightwatch bottleneck.
Also incorporating #3388365: Distribute @group #slow tests between test runners and mark more tests.
Comment #25
catch
10m27s with the current patch set: https://git.drupalcode.org/project/drupal/-/pipelines/21810
Comment #26
catch
Current state of the combined MR has us at around 11 minutes: https://git.drupalcode.org/project/drupal/-/pipelines/22188
Comment #27
andypost
Gonna explore JIT+preload:
#3388646: Add option to enable JIT for CI PHP images
#3108687-36: Compatibility with/take advantage of code preloading in PHP 7.4
#2946472-31: Attempt to autostart chromedriver for selenium tests
Comment #28
longwave
Is it possible to skip the ESLint step if no JS files have changed, or the stylelint step if no CSS files have changed?
edit: rules:changes looks like it should be able to do this: https://docs.gitlab.com/ee/ci/yaml/#ruleschanges - probably worth exploring this in its own issue.
Comment #29
andypost
An MR can contain more than one commit, so checking for files in all commits is tricky.
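One possible way around the multi-commit concern, outside of rules:changes, is to compare the branch against its merge base rather than inspecting commits one by one. The sketch below is only an illustration in Python: the function names and the example base ref are hypothetical, and it assumes the target branch is available in the local clone. The three-dot form of git diff compares the merge base with the branch head, so every commit on the branch is covered by a single comparison.

```python
import subprocess
from pathlib import PurePosixPath


def changed_files(base_ref, head_ref="HEAD"):
    """List files changed on head_ref since it diverged from base_ref."""
    # "base...head" diffs the merge base against head, covering all commits.
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base_ref}...{head_ref}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def should_run(files, suffixes):
    """Return True if any changed file has one of the given suffixes."""
    return any(PurePosixPath(path).suffix in suffixes for path in files)


# Hypothetical usage inside a lint job, with "origin/11.x" as an assumed base:
#   if should_run(changed_files("origin/11.x"), {".js"}): run ESLint
```

A real check would also need to treat changes to the linter configuration itself as a reason to run the step, which this sketch does not handle.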
Comment #30
andypost
JIT does not affect PHPUnit, but there are some failures in kernel tests: https://git.drupalcode.org/issue/drupal-3386474/-/jobs/104981
Comment #33
catch
The biggest thing I'm learning here is that once some of the issues already opened are applied, there is a hard bottleneck of specific long-running tests. These break any attempts to optimize concurrency or use parallel runners, and skew perceived performance when CPUs are constrained or concurrency is reduced, because they always take a long time to run.
Opened #3389281: [meta] Refactor ultra-slow tests.
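The bottleneck can be put in numbers with a simple lower bound: however the tests are distributed, a run cannot finish faster than its single slowest test, nor faster than the total work divided by the number of runners. A minimal sketch in Python, with hypothetical durations in minutes:

```python
def makespan_lower_bound(test_minutes, runners):
    """Best possible wall time for the given tests on `runners` parallel runners."""
    if not test_minutes:
        return 0.0
    # A single test cannot be split across runners, so it sets a floor.
    slowest = max(test_minutes)
    # Perfectly even distribution of all the work sets the other floor.
    even_split = sum(test_minutes) / runners
    return max(slowest, even_split)


# Hypothetical suite: one 9-minute test and thirty 1-minute tests.
suite = [9] + [1] * 30
for runners in (2, 4, 8, 16):
    print(runners, makespan_lower_bound(suite, runners))
```

Under these assumed numbers, adding runners stops helping once the even split drops below 9 minutes, so only refactoring the slow test itself moves the floor.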
Comment #34
catch
Nothing individually committable from here any more, just merging all the current work into it to see overall progress.
Comment #35
mstrelan commented
Not sure how hard this would be to split up, but I'm wondering if phpstan (and phpcs) can start while the yarn job is still running. Yarn consistently takes 20-30 seconds longer than composer, and phpstan is the longest job in the lint stage by well over a minute.
Comment #36
catch
I am pretty sure that was working until we moved linting to the parent jobs.
The same issue applies to unit, kernel, and functional tests in the child jobs: they could also start as soon as composer finishes.
They only depend on composer, not yarn, so it's not clear to me what changed.
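The difference between waiting for the whole previous stage and waiting only for the jobs actually needed can be sketched as a small dependency graph. The Python below is only a model: the durations in seconds are hypothetical (chosen so that yarn runs 25 seconds longer than composer, in line with #35), and the function name is made up.

```python
def finish_times(durations, needs):
    """Compute when each job finishes if it starts once all its needs are done."""
    finished = {}

    def finish_of(job):
        if job not in finished:
            # A job with no needs starts at time 0.
            start = max((finish_of(dep) for dep in needs.get(job, [])), default=0)
            finished[job] = start + durations[job]
        return finished[job]

    for job in durations:
        finish_of(job)
    return finished


# Hypothetical durations in seconds.
durations = {"composer": 60, "yarn": 85, "phpstan": 150}

# Stage ordering: phpstan waits for every job in the previous stage.
by_stage = finish_times(durations, {"phpstan": ["composer", "yarn"]})
# Explicit needs: phpstan waits only for composer.
by_needs = finish_times(durations, {"phpstan": ["composer"]})

print(by_stage["phpstan"], by_needs["phpstan"])
```

Since phpstan is the longest job in the lint stage, the seconds saved at its start carry straight through to the end of that stage in this model.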
Comment #37
catch
11 minutes, 1 second:
https://git.drupalcode.org/project/drupal/-/pipelines/24323
Comment #38
catch
@mstrelan, good point, it used to work like that. Issue here: #3390073: Use 'needs' instead of 'dependencies' to speed up gitlab CI jobs and run phpstan first
Comment #39
naveenvalecha
Comment #44
andypost
2 unit tests and a few functional tests fail if PHP JIT is enabled in Apache: #3388646-3: Add option to enable JIT for CI PHP images
Comment #47
mstrelan commented
Added #3454092: Convert WebAssertTest to a Unit test as a child issue.
Comment #48
catch
Opened #3462762: [meta] Core test suite performance to have more of a proper meta for test suite performance improvements. This was more of a sandbox issue, so it is very messy.