Problem/Motivation
Drupal CI test runs do something like this: lint -> unit -> kernel -> functional -> functional javascript -> nightwatch.
Each test group is executed with maximum concurrency, but the groups have to be executed one after the other.
gitlab has jobs rather than single script, this allows different test types to run in parallel.
#3386076: GitLab CI integration for core is structured so that CI pipelines run in two stages.
First, a lint step, with cspell, phpstan etc.
Second, all of the various test types are executed in their own jobs.
Unit tests are a dependency for the other test types, so if unit tests fail,the functional, functional javascript, nightwatch etc. don't run.
Since the lint step blocks the test step, this means you have in effect three steps, each of which must finish before the first starts:
lint -> unit -> other test types
With this configuration, a full pipeline run is around 32 minutes.
This issue is to try to improve on that 32 minute test time, as well as allowing unit tests to run independent of linting, so that they always return even if there's a typo.
This results in:
lint + phpunit -> other test types - takes a minute or two off the pipeline.
Additionally, functional tests even at 32 concurrency were taking over 20 minutes to complete. If we chunk those into eight groups (and also build tests into two groups for the same reason), we end up with various run times of between about 4 and 11 minutes. The maximum concurrency is ten, so when a 4 minute job finishes, it frees up a runner/pod to complete another short job while the 11 minute ones are still running.
We have tweaked the order of test execution to functional -> build -> javascript -> nightwatch to support starting the longest running tests first.
With these changes (all just changes to the YAML config plus a small supporting change to run-tests.sh), pipelines complete in around 18 minutes.
Steps to reproduce
Proposed resolution
Remaining tasks
User interface changes
API changes
Data model changes
Release notes snippet
Issue fork drupal-3386091
Show commands
Start within a Git clone of the project using the version control instructions.
Or, if you do not have SSH keys set up on git.drupalcode.org:
Comments
Comment #2
catchThis is already done with --with-unit-tests in the core template, closing as by design. Doesn't need its own group
Comment #3
catchActually, re-opening with a slightly different purpose.
If we skip the linting tests for unit tests, but add it to all the other types of tests, the following happens:
1. Unit tests run parallel to linting, this removes a 2 minute blocking step in between linting and all other test types so theoretically might reduce overal pipeline time from ~32 minutes down to ~30 minutes.
2. Unit tests will run despite failures in linting tests - this means if you've made a typo in a code comment and also broken a unit test, you'll get the unit test feedback back alongside the linting errors.
All the other test types continue to rely on both unit tests and linting so we don't do expensive test runs for things that are nearly guaranteed to fail.
Comment #5
catchThe MR is based off the main gitlab MR, this is the actual change for this issue: https://git.drupalcode.org/issue/drupal-3386091/-/commit/8cfe4fc5fed6bc6...
Comment #6
catchComment #7
smustgrave commentedShould TESTSDIRECTORY be added to dictionary?
Or phpcs ignore for that file?
Comment #13
catchTried allowing kernel tests to run parallel to unit tests, it didn't improve things so reverted it.
Small tidy-up merging with-composer into with-composer-cache because they were identical.
Last pipeline run was 18 minutes, 5 seconds: https://git.drupalcode.org/project/drupal/-/pipelines/18810
Anything else we've tried is not getting us below 18 minutes, some are neutral, some make it worse, so I think this issue has got as far as it can without more intrusive changes.
The next stage would be #3386217: \Drupal\Tests\Core\Test\PhpUnitCliTest requires a database and webserver - move to a functional test to make the unit test run shorter - that in turn would start all the other test types faster.
Retitling again and updating the issue summary.
We keep seeing the same nightwatch failure over and over on the pipelines, that is the only test type we're not messing around with, so probably not introduced here, but needs its own issue at least.
Comment #14
catchAlso I think we're allowed to bump the priority. gitlabci already completes nearly twice as quickly as DrupalCI, this makes it three times as quick as DrupalCI.
Comment #16
catchKernel parallel x 2 gets us down to 17 minutes 24 seconds. https://git.drupalcode.org/project/drupal/-/pipelines/18837 this is because the kernel test runner frees up to pick up the last of the functional tests.
Which gives me the idea to up the parallezation of kernel tests and run them alongside linting again - idea is to max out the 10 concurrent jobs during both the linting and test stages. About to push that.
Comment #18
catchOK 17m49s now: https://git.drupalcode.org/project/drupal/-/pipelines/18855 margin with error with the previous attempt which was 17m30s.
The trade-off is that kernel x 4 can result in one or two of the linting steps queueing, delaying the start of the longest javascript/functional jobs by a minute, so we really are at the limit at 17m30s.
If we can somehow optimize the yarn build step, that would start a lot of things earlier, even 30seconds could shave off 1-2 minutes by freeing up runners. Maybe the cloud caching stuff will help.
The advantage of this approach though is that kernel test failures will report at the same time as phpunit failures and cs failures, all within about 4-5 minutes of pushing to a branch. We still block the properly expensive test runs on unit tests and linting though.
Also now the kernel tests are reporting back quicker, we could consider adding *with-kernel-tests to the more expensive tests without a speed penalty.
Comment #19
catchOK even at ten parallel for functional tests, the bottleneck is still the first group of functional tests - there must be some tests in there that are so slow they create a hard floor even while other tests are spread out. So we definitely have reached the limit of overall run times here, but the kernel tests happening earlier isn't affecting the speed of the whole run, so I think that's a good change.
Stopping for now - IMO this is mergeable either as a follow-up to the initial gitlab issue or we could merge it back in there and commit them together.
Comment #22
fjgarlin commentedGiven that #3386076: GitLab CI integration for core was (and still is) a first proposal and that this issue is built on top of that one, I think the changes from this MR should go directly there and continue the conversation and improvements there.
The MR for #3386076: GitLab CI integration for core already has some feedback and it'd be great to not have two diverging proposals.
The changes to leverage parallelism in this issue are great and the fact that we are already below the 20 minute mark is amazing.
I'm going to mark it RTBC but really aiming for the changes to go directly to #3386076: GitLab CI integration for core and then we can mark this issue as fixed.
The nightwatch issue seems intermittent but we can perhaps try to fix it in the other issue too if it keeps on appearing.
Comment #23
fjgarlin commentedComment #24
catchMerged over there, marking this as duplicate.