Problem/Motivation

https://dri.es/what-it-costs-to-run-drupal-infrastructure

I’ve been contemplating the issue of DA budget and infrastructure costs.

In a Slack thread in #drupal-infrastructure discussing that post, @drumm said:

GitLab CI is the real cost, no one else does 1/10th of the amount we provide for free

I’ve been on a kick to reduce waste via the gitlab_templates world, but that’s only for contrib. Unsurprisingly, the vast bulk of CI usage is from core. Here are the top projects circa January 2025 (the most recent data I have access to):

Graph of GitLab CI usage across projects circa January 2025

Yes, there is still a lot of room for incremental improvement: skipping known flakey tests to avoid having to re-run so many pipelines, converting Functional to Kernel or Unit, etc. But a radical idea came to me that I think we should explore:

What if we completely disable running any automatic pipelines on each push to a core branch?

Allow me to explain. Every commit we might push comes from an issue that already ran numerous pipelines, and the last pipeline must pass before we even consider merging it. Why re-run a whole other pipeline on the push? We’d still have daily jobs that would alert us if a missing rebase allowed 1 change to “break HEAD” due to interaction with another push that just happened. Worst case, we’d know about this within 24 hours. Meanwhile, our core developer base seems active enough that Slack usually lights up very quickly if HEAD is failing tests for any reason. Finally, I know we’d be running every possible pipeline configuration before we actually tag another release.

So I assert that the chance of shipping a core release with a regression due to not running the pipeline after every push is 0%. Meanwhile, it seems like this would cut the total CI usage for "native core pipelines" approximately in half with a single easy change. No other optimization is going to get anywhere close to that. However, since I don't have access to the underlying numbers, I have no idea if this "approximately cut in half" is at all true, nor if cutting that value in half is even a meaningful improvement compared to all the issue forks.

I asked in the core subsystem maintainers' Slack channel about this idea to "read the temperature" and gauge interest. Generally, folks thought it was worth exploring, but we lack the info needed to make an informed decision. So I'm opening this public issue to explore it.

Steps to reproduce

Proposed resolution

Modify which branches run automatic pipelines "on-commit". Rely on nightly pipelines (and the crowd-sourced core contributor community) to alert us if we "break HEAD" because different commits on the same day break each other, for example due to missing rebases.

  1. Skipping all of the frequent random test failures from #3600653: Temporarily skip failing tests, round 2
  2. Release branches - 1) no daily pipeline, 2) run all environments on push
  3. Development branches - 1) No on push pipelines, 2) daily pipeline using all environments at ~0400 UTC 3) extra daily pipeline on weekdays UTC, 1200 UTC, using the current on-push matrix.
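
A minimal sketch of the schedule in the list above, written as a Python decision function rather than as real .gitlab-ci.yml rules. The function name, the trigger and matrix labels, and the branch-name pattern are all hypothetical; the actual change lives in the MR and may classify branches differently (for example with an explicit variable instead of a pattern).

```python
import re
from typing import Optional

# Assumption: release branches look like "11.3.x" or "10.6.x";
# everything else ("main", "11.x") is treated as a development branch.
RELEASE_BRANCH = re.compile(r"^\d+\.\d+\.x$")


def is_release_branch(branch: str) -> bool:
    return bool(RELEASE_BRANCH.match(branch))


def matrix_for(branch: str, trigger: str) -> Optional[str]:
    """Return which environment matrix to run, or None to skip the pipeline.

    trigger is one of the hypothetical labels "push", "daily" (~0400 UTC)
    or "weekday_midday" (1200 UTC, weekdays only).
    """
    if is_release_branch(branch):
        # Release branches: all environments on push, no scheduled pipeline.
        return "all_environments" if trigger == "push" else None
    # Development branches: nothing on push, full matrix daily,
    # current on-push matrix for the extra weekday run.
    if trigger == "daily":
        return "all_environments"
    if trigger == "weekday_midday":
        return "on_push_matrix"
    return None
```

For example, `matrix_for("main", "push")` returns `None`, while `matrix_for("11.3.x", "push")` returns `"all_environments"`.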

Issue to fix the tests skipped in this round: #3600655: [meta] Fix and re-enable tests skipped for random failures, round 2

Remaining tasks

  1. Get real info on how much CI time we're using on:
    1. Scheduled pipelines
    2. on-commit pipelines when pushing changes to core branches (what this issue aims to eliminate)
    3. Total CI usage from all core issue forks
  2. Assess if eliminating on-commit pipelines would be the massive improvement I think it might be, or if everything is dwarfed by issue forks.
  3. Done. Risks from release management perspective. See #6
    • On commit test runs usually find the following types of errors
      • something that breaks a specific database driver
      • commit of an MR that has a very stale test run that was green at the time but would have failed if run again
      • same thing but two up to date MRs that cause each other to fail without merge conflicts
      • cherry picks to another branch that didn't get a test run.
    • If this change is made, then on release day release managers would have to remember to manually trigger a branch run if something for the release was just committed.
  4. If this is likely to result in significant savings, open an MR against main to alter the .gitlab-ci world to make it so. Done: MR !15152
  5. Decide how far to backport the change (if we get this far)

Estimates on the number of pipelines over a 6 month period: Comments #25 - #30

User interface changes

Introduced terminology

API changes

Data model changes

Release notes snippet

Issue fork drupal-3580398


    Comments

    dww created an issue. See original summary.

    dww’s picture

    PS we could move the entire “on_push” configuration to “on_tag” to ensure full pipelines run whenever release managers push a tag, but before they create a release node.

    dww’s picture

    Title: Consider not running core GitLab pipelines on-commit » Consider not running core GitLab pipelines on_push
    mstrelan’s picture

    Most of the issue fork pipelines run against main. I think we would still need on-commit pipelines for other branches.

    dww’s picture

    Most of the issue fork pipelines run against main.

    Definitely.

    I think we would still need on-commit pipelines for other branches.

    Not so sure. 😅 Between nightly builds and on_tag pipelines, I still think we could skip on_push on non-main branches, too.

    But let's see what the numbers reveal. Maybe this is a drop in the bucket compared to issue forks, and we need to stay focused on the sorts of efforts I linked in related issues.

    catch’s picture

    So 99% of the on-commit runs don't find any issues. The ones that do tend to be:

    - something that breaks a specific database driver
    - commit of an MR that has a very stale test run that was green at the time but would have failed if run again
    - same thing but two up to date MRs that cause each other to fail without merge conflicts
    - cherry picks to 11.x that didn't get a test run. I did this yesterday by cherry picking a unit test change that triggered a deprecation on phpunit 10.

    For everything except the database driver/11.x backports, MR runs in issue forks tend to find out that head is broken very quickly. Because of the disruption I tend to get pinged (if I'm the culprit or if the culprit isn't online) pretty quickly.

    The main advantage of the branch runs then is to confirm that it is indeed head that is broken.

    For database drivers and 11.x it can take longer to spot that head is broken anyway. Mostly because of the frequency of random test failures meaning that email alerts are unreliable.

    So short version is, I think we could trial this.

    The one bigger downside would be on release days but we could manually trigger a branch run if we've just committed something for the release.

    dww’s picture

    Issue summary: View changes

    Fantastic, thanks for that update! Really helpful and encouraging to hear. I'm crossing off "Explore other risks from release management perspective..." from remaining tasks.

    The one bigger downside would be on release days but we could manually trigger a branch run if we've just committed something for the release.

    Indeed. And/or what I proposed at #2:

    we could move the entire “on_push” configuration to “on_tag” to ensure full pipelines run whenever release managers push a tag, but before they create a release node.

    catch’s picture

    The tag runs fail every time because we have tests that don't like running against a tag. That is potentially fixable one by one but very hard to enforce over time.

    quietone’s picture

    Issue summary: View changes

    I've summarized catch's response to the issue summary. It is a nice list of what on-commit finds and the impact on release managers.

    dww’s picture

    Status: Active » Needs review

    Guess it's inherently impossible to test this via an MR. 😉 But I left the pipeline running to make sure I didn't break anything else. Obviously needs careful review before we try it.

    dww’s picture

    Issue summary: View changes
    catch’s picture

    Thinking about it more and briefly discussing with @quietone in slack, I think the main thing I'd want to go along with this is skipping all of the frequent random test failures from #2829040: [meta] Known intermittent, random, and environment-specific test failures. People are understandably reluctant to skip tests, but if we want to know when we've actually newly broken something, it's more important that a pipeline failure is a surprise rather than immediately triggering a hunt for which job to re-run. Especially with re-runs being broken for everyone in MAINTAINERS.txt at the moment.

    e.g. #3285193: Temporarily skip random test failures that hide real test failures, part 4 where we did this previously, and #3267247: [meta] Fix and re-enable tests skipped for random failures where we unskipped them again. I don't think there's a current issue/MR to do that for the current batch but could have missed one.

    That would independently be a quality of life improvement for most contributors, and we can skip the tests just at the point before the random test failure so we don't completely lose the test coverage that actually passes.

    --

    Also one possible way to roughly gauge the impact of this without the DA having to provide it:

    A branch run looks like this: https://git.drupalcode.org/project/drupal/-/pipelines/774226/ that took 20 minutes wall time. I don't know how to convert that to CI/compute minutes though, and the total time report in https://git.drupalcode.org/project/drupal/-/pipelines/774226/test_report looks broken to me.

    This month we had:

    git log --pretty=oneline --since="one month ago" | wc -l
    215
    

    215 commits to main.

    git log --pretty=oneline --since="one month ago" | wc -l
    159
    

    159 commits to 11.x

    git log --pretty=oneline --since="one month ago" | wc -l
    36
    

    36 commits to 11.3.x

    git log --pretty=oneline --since="one month ago" | wc -l
    12
    

    12 commits to 10.6.x

    We could count the total requested CPUs in gitlab-ci.yml for jobs, then multiply that by the minutes. If we remove the lint + unit test jobs from the count, that will compensate for the fact they don't run on the different database pipelines. Then we have a CPU request * walltime calculation per branch run. Even that won't be accurate because jobs finish at different times, but most individual core test jobs run 2-4 minutes these days so it wouldn't be completely inaccurate, probably better than what we get from the gitlab UI by the looks of things.
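
A rough sketch of the CPU request * wall time estimate described above. The per-job CPU requests here are made-up placeholders, not values read from gitlab-ci.yml; only the 20 minute wall time and the monthly commit counts come from this comment.

```python
def cpu_minutes_per_branch_run(cpu_requests, wall_minutes):
    """Approximate CPU-minutes as (sum of requested CPUs) * wall time.

    This overestimates, because jobs that finish early stop using
    their CPUs before the pipeline as a whole is done.
    """
    return sum(cpu_requests) * wall_minutes


# Commits in the last month, from the git log counts above.
COMMITS_LAST_MONTH = {"main": 215, "11.x": 159, "11.3.x": 36, "10.6.x": 12}

# HYPOTHETICAL: 20 test jobs requesting 8 CPUs each. Replace with the
# real requests counted from gitlab-ci.yml.
PLACEHOLDER_CPU_REQUESTS = [8] * 20
WALL_MINUTES = 20  # wall time of the linked branch run

per_run = cpu_minutes_per_branch_run(PLACEHOLDER_CPU_REQUESTS, WALL_MINUTES)
per_branch = {b: n * per_run for b, n in COMMITS_LAST_MONTH.items()}

for branch, minutes in per_branch.items():
    print(f"{branch}: ~{minutes} CPU-minutes of on-commit runs per month")
```

The absolute numbers are only as good as the placeholder CPU requests; the ratios between branches depend only on the commit counts.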

    This also suggests to me that if we can identify release branches and run the on-commit tests on those, that would require a fraction of the CI time compared to main/11.x branch runs, but also strike out the release day problem from the issue summary.

    catch’s picture

    One thing that has changed significantly in the past five years compared to the five previous years, is when issue pipelines used to take 55 minutes to run, it was a very time consuming process to re-run tests prior to commit. I don't always remember to do that when the last run is very old, but it's now very easy to type /rebase in the comment box on the MR and get a fresh run in less time than it usually takes to do a last pass over the MR and figure out issue credit. Also pretty common for people to manually rebase MRs if they notice they're 150 commits behind HEAD or similar. We might end up doing a bit more than we do now if we go ahead here, but I don't think it would happen dramatically more than it already does.

    catch’s picture

    I'm +1 to trialling this on the non-release branches, tagging for RM review so the other release managers get a chance to chime in here.

    nicxvan’s picture

    I am all for this.

    I also think just skipping the flaky tests will have a huge impact, the number of tests I rerun just for flaky tests is very high.

    godotislate’s picture

    Re: #15, I'm +1 as well for trying out the non-release branches.

    kentr’s picture

    It was suggested in #2829040-235: [meta] Known intermittent, random, and environment-specific test failures to put flakey tests in their own group (job?).

    Could we run those first and use Gitlab workflow rules to abort if they fail?

    catch’s picture

    @kentr I think that would make them even more disruptive than they are now - e.g. you wouldn't be able to see any more results until you've passed the gauntlet of the flakiest tests all passing at the same time.

    For on-commit and scheduled tests, what we really need is that when it fails, we know it's because we committed something bad recently, not clicking through and seeing it's blockfilteruitest failing for the 2,000th time in three years.

    quietone’s picture

    Issue summary: View changes

    I've updated the issue summary to include current proposal of

    1. Skipping all of the frequent random test failures from
    2. Trialing this on the non-release branches,

    How long would the trial last?

    catch’s picture

    I think we could try it for six months with the option to revert earlier if it causes noticeable problems before then. Noticeable problems would be main or 11.x broken for more than a day, or more often (we already break them occasionally but find out very quickly usually).

    If it doesn't cause noticeable problems in six months then that's probably enough to stick with it - can always reverse that decision later.

    catch’s picture

    Only a theory but we regularly get gitlab queues backing up, delaying branch creation, MR pipelines and everything else, and this seems to coincide with commits to core.

    Today we committed four issues to main between 11:03 and 11:28, and they were all backported to 11.x and 11.4.x, so that's 12 on-commit pipelines in the space of 25 minutes.

    If we trialled this, only the 11.4 commits would have triggered on-commit pipelines, so the load would drop to 1/3rd. Commits to 11.3 and 10.6 would also trigger pipelines but that's less frequent and also desired anyway.

    godotislate’s picture

    I also have the impression that gitlab queues get backed up at times coinciding with a series of core commits. Still +1 for a trial.

    catch’s picture

    Summarising some notes from https://drupal.slack.com/archives/CGKLP028K/p1781003925872199 where we are trying to diagnose the gitlab sidekiq (probably) queue getting backed up for 30-60 minutes at a time.

    Looking at an on commit pipeline. This one is 36 jobs https://git.drupalcode.org/project/drupal/-/pipelines/842881 for the main pipeline (couple of these are manual so slightly less actually run), + 21 jobs for each environment. We run three non-default environments for on-commit jobs. So that's 36 + 63 = 99 jobs listed, or roughly 96 that actually run, per commit, per branch.

    So when we committed and backported four issues to main, 11.x and 11.4.x in 30 minutes, that's 96 jobs * 3 branches * 4 commits = ~1152 jobs in half an hour.

    Given those branches were main, 11.x and 11.4.x, this change would mean that instead of 1152 jobs in half an hour, we'd do closer to 400 for four commits, so 1/3rd the workload when we backport to a release branch, and no jobs at all when we don't.
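
The burst arithmetic above, as a small sketch. The figure of 96 jobs per commit per branch is taken from this comment; the function name is just for illustration.

```python
JOBS_PER_COMMIT_PER_BRANCH = 96  # jobs that actually run, per the count above


def jobs_in_burst(commits: int, branches_with_on_commit: int) -> int:
    """Jobs queued when `commits` issues land on each of the given branches."""
    return JOBS_PER_COMMIT_PER_BRANCH * commits * branches_with_on_commit


# Four issues committed to main and backported to 11.x and 11.4.x.
current = jobs_in_burst(commits=4, branches_with_on_commit=3)
# With the change, only the release branch (11.4.x) runs on-commit.
proposed = jobs_in_burst(commits=4, branches_with_on_commit=1)

print(current, proposed, proposed / current)
```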

    For comparison, project_analysis does about 50 jobs to analyze 7,000 modules, then another 20-50 to post issues. So about 100 jobs max in total per run.

    edit: counted the jobs wrong in the gitlab UI in the first attempt at this comment, fixed since.

    catch’s picture

    Discussed this a bit with @longwave in slack based on the above and also looked at commit rate against the various active branches.

    Background is:

    On-commit pipelines we do default + 3 extra environments. Daily pipelines we do default + 6 extra environments.

    So four environments for on commit and 7 environments for daily. Let's call them 'environment pipelines', there is probably a proper gitlab term which I can't remember.

    On the 10.6.x branch, we've made exactly 100 commits to that branch in the past six months:

    git log --pretty=oneline --since="six months ago" | wc -l
    100
    

    183 daily+ 100 on-commit

    (183 * 7=1281) + (100 * 4= 400) = 1681 environment pipelines.

    On the 11.3.x branch, we've made 249 commits in the past six months

    183 daily + 249 on-commit.

    (183 * 7 = 1281) + (249 * 4 = 996) = 2277 environment pipelines.

    For those release branches, we can compare to if we just ran every pipeline on commit and dropped the daily job altogether:

    10.6.x

    100 commits * 7 = 700 (compared to 1681)

    11.3.x

    249 commits * 7 = 1743 (compared to 2277).

    So if we had run all environments for all commits for the two release branches, we would have run 2443 environment pipelines instead of 3958 and got better feedback.

    That's ~1515 fewer environment pipelines with more instant feedback - something like a 1/3rd saving in CI time for better results.
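
A short sketch that recomputes the release branch comparison from the inputs given in this comment (183 days, 7 daily environments, 4 on-commit environments, and the two commit counts). The helper names are illustrative only.

```python
DAYS = 183           # six months of daily pipelines
DAILY_ENVS = 7       # default + 6 extra environments
ON_COMMIT_ENVS = 4   # default + 3 extra environments
RELEASE_COMMITS = {"10.6.x": 100, "11.3.x": 249}


def current_total(commits: int) -> int:
    """Daily full-matrix runs plus on-commit runs with the smaller matrix."""
    return DAYS * DAILY_ENVS + commits * ON_COMMIT_ENVS


def full_on_commit_total(commits: int) -> int:
    """No daily run; every commit runs all environments."""
    return commits * DAILY_ENVS


now = sum(current_total(c) for c in RELEASE_COMMITS.values())
proposed = sum(full_on_commit_total(c) for c in RELEASE_COMMITS.values())

print(f"current: {now}, all environments on commit: {proposed}, saved: {now - proposed}")
```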

    On the other hand.

    For main, we have made 924 commits in the past six months.
    (183 * 7=1281) + (924 * 4 = 3696) = 4977

    For 11.x, we've made 763 commits in the past six months.

    (183 * 7 = 1281) + (763*4=3052) = 4333

    Add them together and you get 9310.

    @longwave suggested two different ideas in slack:

    1. Using a delay + cancellation for on-commit jobs to try to run on-commit pipelines every x hours.

    2. A more simple version of that: drop the on-commit jobs but make the daily jobs twice-daily.

    If we had twice-daily pipelines for main and 11.x and no on-commit pipelines, that's

    183 * 7 * 2 * 2 in six months, so 5124 environment pipelines, compared to 9310.

    For the approximately 1/3rd of commits that make it back to release branches, we'd still have an on-commit pipeline for those anyway with the above plan. For the other 2/3rds, if we don't find out before, we'd know we broke head in < 12 hours.
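
The same kind of check for the development branches, comparing the current setup with twice-daily pipelines and no on-commit runs. The inputs are the figures quoted above; variable names are illustrative.

```python
DAYS = 183
FULL_ENVS = 7        # environments in the daily pipeline
ON_COMMIT_ENVS = 4   # environments in the on-commit pipeline
DEV_COMMITS = {"main": 924, "11.x": 763}

# Current: one daily full run plus an on-commit run for every commit.
current = sum(DAYS * FULL_ENVS + c * ON_COMMIT_ENVS for c in DEV_COMMITS.values())

# Proposed: two full runs per day on each branch, nothing on commit.
twice_daily = DAYS * FULL_ENVS * 2 * len(DEV_COMMITS)

print(current, twice_daily, round(twice_daily / current, 2))
```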

    catch’s picture

    Refinement of #25.

    Most of our commits happen on weekdays, so we could try something like this for development branches (11.x and main):

    Keep the daily scheduled pipeline more or less as is, at say 4am UTC.

    Add a 'twice daily weekday pipeline' that runs the current on-push matrix at 12pm and 8pm UTC.

    That would be 7 * 7 + 10 * 4 = 89 environment pipelines per week. Over 26 weeks that works out as 2314 per branch, so 2314 * 2 = 4628 for both branches - less CI time than if we run the daily job twice per day every day, but more of a safety net during the week.
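
The weekly schedule above can be totalled with a few lines. The 26-week window is an assumption standing in for "six months", and the parameter for the number of extra weekday runs is there only to make it easy to try variations.

```python
FULL_ENVS = 7       # environments in the daily scheduled pipeline
ON_PUSH_ENVS = 4    # environments in the current on-push matrix
WEEKS = 26          # assumption: roughly six months
DEV_BRANCHES = 2    # main and 11.x


def weekly_env_pipelines(extra_weekday_runs_per_day: int = 2) -> int:
    """Environment pipelines per branch per week under the weekday schedule."""
    daily = 7 * FULL_ENVS  # one full run every day
    weekday_extra = 5 * extra_weekday_runs_per_day * ON_PUSH_ENVS
    return daily + weekday_extra


six_month_total = weekly_env_pipelines() * WEEKS * DEV_BRANCHES
print(weekly_env_pipelines(), six_month_total)
```

Calling `weekly_env_pipelines(1)` gives the figure for a single extra weekday run instead of two.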

    quietone’s picture

    If I understand correctly, the last proposal is to make changes for 11.x and main only. The changes are 1) stop on_push pipelines 2) add a pipeline that runs twice a day on weekdays UTC using the current on-push matrix.

    If that is right, it is worth trying. Thanks for working through the numbers.

    catch’s picture

    @quietone there would be a separate change to expand the matrix of tests for on-push pipelines for release branches and get rid of the daily jobs entirely, it's a bit lost in #25 nearer the beginning.

    I think we can split this into two separate issues - changing the on-commit matrix for release branches (might need a RELEASE_BRANCH=1 in gitlab.yml which we update when branching a new release branch), then continuing here for the development branches.

    quietone’s picture

    Lost idea.

    Is this the intention?
    Release branches - 1) no daily pipeline, 2) run all environments on push
    Development branches - 1) No on push pipelines, 2) daily pipeline using all environments at ~0400 UTC 3) Twice daily pipeline on weekdays UTC, 1200 and 2000 UTC, using the current on-push matrix.

    For weekdays, instead of 3 pipeline runs on weekdays can it be reduced to 2, 1 on all environments and 1 on the on-push matrix? They could be at the times suggested for the twice daily, 1200 and 2000. Just that one would be full.

    catch’s picture

    @quietone

    Yes that sounds right and I think twice daily would be fine. Once we've made all the code changes to support this we'll be able to vary the schedule easily via gitlab without core changes. We could also open a follow-up for @longwave's rate limiting idea which will be harder to do but give us more flexibility than the scheduling.

    quietone’s picture

    Issue summary: View changes

    Just updating the issue summary with the latest decision

    quietone’s picture

    Issue summary: View changes
    quietone’s picture

    Created an issue to skip tests and a followup to fix the ones skipped in this round.
    #3600653: Temporarily skip failing tests, round 2
    #3600655: [meta] Fix and re-enable tests skipped for random failures, round 2

    Most of the release managers have commented in agreement. And this is a trial, so I am removing the tag.