Overview

https://git.drupalcode.org/project/gitlab_templates/-/work_items/3572380 was recently merged for all contrib modules using the default DA templates. The Canvas implementation extends the default templates and adds its own jobs (which is totally fine); however, some of those jobs always run, no matter what.

Checking the usage over the last week alone, Canvas is on par with Drupal core:
[Screenshot: CI usage]

This is probably due to the volume of issues and MRs being worked on each day, so again, the above is not a bad thing, but it is something to think about.

Lastly, I see that there is already a "*manual-rule", which is exactly the approach taken in gitlab_templates.

Proposed resolution

I think the idea would be to review which jobs that currently do not use the "*manual-rule" could start using it.
I don't know enough about the whole pipeline to know which job is relevant where, so I'll let the people who know suggest possible improvements, if any.
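
As a rough sketch (all names below are made up; the actual *manual-rule definition and its conditions live in gitlab_templates and may differ), gating one of the extra jobs behind such a rule could look like this:

.manual-rule: &manual-rule
  when: manual
  allow_failure: true

extra-validation-job:          # hypothetical custom job added on top of the templates
  stage: validate
  script:
    - npm run validate:extra   # hypothetical command
  rules:
    - *manual-rule             # job appears in the pipeline but only runs when triggered by hand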

User interface changes

If implemented, some pipeline jobs would need to be manually triggered.

Comments

fjgarlin created an issue. See original summary.

fjgarlin’s picture

Also, seeing the per-job breakdown (see the images), there are many cypress and playwright jobs that reach the 30 min timeout, so these jobs are likely to be run again (either automatically or manually). Is there a way to parallelize these jobs so they take max 10-15 min?
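
One option might be GitLab's parallel keyword combined with Playwright's --shard flag; this is a sketch only, with an assumed job name:

playwright-tests:      # hypothetical job name
  parallel: 6          # GitLab starts 6 copies of this job
  script:
    # GitLab exposes CI_NODE_INDEX and CI_NODE_TOTAL to parallel jobs;
    # Playwright then splits the spec files across the shards.
    - npx playwright test --shard=$CI_NODE_INDEX/$CI_NODE_TOTAL

Cypress has no built-in equivalent of --shard, so splitting it would mean an explicit spec list per shard or an external orchestrator.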

penyaskito made their first commit to this issue’s fork.

penyaskito’s picture

One impediment we have is GitLab's lack of support for exclude/negation rules.

e.g. we could easily reduce the number of jobs with something like:

> if ONLY packages/* change, skip X jobs.

GitLab supports:
> if packages/* change, run X jobs.

Flipping the condition would force us to maintain (and hardcode) an allowlist, meaning every new file type or directory could become a silent trap where a job that shouldn't be skipped gets skipped.
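
To make the limitation concrete (the job name and paths are illustrative):

# Supported today: run the job when anything under packages/ changes.
e2e-tests:
  rules:
    - changes:
        - packages/**/*

# Not supported: "skip the job when ONLY packages/* changed". There is no
# negated `changes:` rule, so inverting the logic means hand-maintaining an
# allowlist of every other path that should still trigger the job.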

penyaskito’s picture

Component: … to be triaged » Project management
Status: Active » Needs review

Needs review for !944; that's an easy win both for team speed and for reducing the number of timeouts (aka people hitting retry).

Can't claim this would help with the DA spend though, as there will be extra jobs per MR. Not sure which one outweighs the other.

fjgarlin’s picture

1 job at 30 minutes equals 6 jobs at 5 minutes in total runner time, but the former gets "retried" if it goes over the 30-minute limit, so the split is still an improvement. Every bit counts.

fjgarlin’s picture

wim leers’s picture

wim leers’s picture

Status: Needs review » Reviewed & tested by the community

RTBC'd the first MR.

  • penyaskito committed 464e13b6 on 1.x
    chore: #3585979 CI: Use 6 shards for playwright to avoid 30m timeouts...
penyaskito’s picture

Status: Reviewed & tested by the community » Needs work

Merged !944. Leaving as NW as there's more to fix here.

penyaskito’s picture

wim leers’s picture

Assigned: Unassigned » justafish

I am not convinced that the script in !995 really helps: it's a single line we're expected to modify. I defer to @justafish.

wim leers’s picture

Status: Needs work » Needs review
wim leers’s picture

wim leers’s picture

After #3529128: Speed up PHPUnit on CI; stop relying on drupal.org composer template lands, I think we should consider making:

  1. PHPUnit (11.2) run on all DBs
  2. PHPUnit (11.3) run only on SQLite

Or vice versa.

Thoughts?

EDIT: to clarify, I mean for merge commits (aka push to 1.x). Currently both 11.2 and 11.3 run all tests on all 4 DBs.
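
Roughly, with assumed job and variable names (not the actual canvas CI config):

phpunit-11-2:
  parallel:
    matrix:
      - _TARGET_DB_TYPE: [mysql, mariadb, postgres, sqlite]   # all databases

phpunit-11-3:
  variables:
    _TARGET_DB_TYPE: sqlite                                   # a single database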

wim leers’s picture

@justafish's #3586660: CI: Add additional caching to GitLab CI pipelines should also make a difference here: same # of CI jobs, but they should run faster.
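
For illustration only (the paths and key are assumptions, not what #3586660 actually implements), job-level caching in GitLab CI looks roughly like:

cache:
  key: "$CI_COMMIT_REF_SLUG"   # one cache per branch
  paths:
    - vendor/                  # Composer dependencies
    - node_modules/            # npm dependencies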

wim leers’s picture

Assigned: justafish » Unassigned
Status: Needs review » Postponed (maintainer needs more info)
File attached: 4.69 MB (new)

@fjgarlin: Yesterday's last commit had an absolutely terrifying CI pipeline. Why? See attached GIF. Pipeline URL: https://git.drupalcode.org/project/canvas/-/pipelines/808898

In part due to our PHPUnit CI jobs taking >30 mins. That's fixed as of today: #3529128: Speed up PHPUnit on CI; stop relying on drupal.org composer template landed.

However, it also appears to be in part due to infra instability. I've seen this happen many times in the past. But perhaps now is the time to investigate it? It manifests like this:

………
  - Drupal\Tests\canvas\Unit\DataType\ComponentInputsTest
  - Drupal\Tests\canvas\Unit\UiFixturesValidationTest
Test run started:
  Thursday, April 30, 2026 - 00:36
Test summary
------------
WARNING: Event retrieved from the cluster: 0/6 nodes are available: 2 node(s) were unschedulable, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
WARNING: Event retrieved from the cluster: 0/7 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 2 node(s) were unschedulable, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/7 nodes are available: 7 Preemption is not helpful for scheduling.
WARNING: Event retrieved from the cluster: The node was low on resource: memory. Threshold quantity: 100Mi, available: 102296Ki. Container helper was using 503912Ki, request is 0, has larger consumption of memory. Container database was using 193996Ki, request is 0, has larger consumption of memory. Container build was using 8420012Ki, request is 0, has larger consumption of memory. Container chrome was using 1748Ki, request is 0, has larger consumption of memory. Container selenium was using 246008Ki, request is 0, has larger consumption of memory. 
WARNING: Event retrieved from the cluster: Container runtime did not kill the pod within specified grace period.
WARNING: Event retrieved from the cluster: error killing pod: [failed to "KillContainer" for "build" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillContainer" for "chrome" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillContainer" for "helper" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillContainer" for "database" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillContainer" for "selenium" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "70149220-9ffa-4954-b4cb-4795fbd8400c" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
Uploading artifacts for failed job
00:00
Cleaning up project directory and file based variables
00:00
ERROR: Job failed (system failure): pod "gitlab-runner/runner-s8ex1x2yj-project-19391-concurrent-11-ug7cktfg" is disrupted: reason "TerminationByKubelet", message "The node was low on resource: memory. Threshold quantity: 100Mi, available: 102296Ki. Container helper was using 503912Ki, request is 0, has larger consumption of memory. Container database was using 193996Ki, request is 0, has larger consumption of memory. Container build was using 8420012Ki, request is 0, has larger consumption of memory. Container chrome was using 1748Ki, request is 0, has larger consumption of memory. Container selenium was using 246008Ki, request is 0, has larger consumption of memory. "

https://git.drupalcode.org/project/canvas/-/jobs/9597857

Any idea what's going on? Is this a known d.o GitLab CI problem? Is it something Canvas is doing?

fjgarlin’s picture

Oh wow, that is indeed horrifying! However, the timestamps coincide with yesterday's security update to the CI runners (https://drupal.slack.com/archives/C51GNJG91/p1777507813295379) so I think, in a way, it was just bad timing.

If this happens again at a time when there aren't any updates happening, then we definitely need to investigate. But from the above output, it seems that the jobs were running and the containers were simply killed, which most likely led to the jobs being re-run (if not automatically, then manually, hence the fully green pipeline).

If this was postponed just based on this, I guess it can be unpostponed.

wim leers’s picture