Problem/Motivation

~
Opening this as an alternative to #3387117: Enable distributed caching in GitLab Runner.

gitlab has an API for fetching artifacts from arbitrary branches and jobs and it recommends this over using the distributed cache, which it looks like may not be shared between issue repositories and the main project (and also isn't enabled on Drupal gitlab yet).

phpstan is the longest job in the lint step, taking 1m51s (and also requiring a lot of cpu), whereas the others complete in less than a minute.

If we can cut phpstan jobs to under one minute, that will potentially mean all pipeline runs return a minute faster.

What this MR does:

1. Configures the phpstan tmp directory so that it's within the workspace
2. Sets phpstan artifacts to always upload.
3. Downloads the phpstan result cache using curl from the gitlab API. This is to avoid using 'needs:' which results in a hard failure if the file isn't available, whereas we need the cache get to fail silently so phpstan can just run without the cache if it's not there. This is to prevent a circular dependency from the job to itself.

Steps to reproduce

https://git.drupalcode.org/project/drupal/-/jobs/2191580 instead of finishing in about 1m44s, finished in 55s. Because phpstan is the slowest lint step, this has the potential to reduce overall pipeline time (currently around 7m30s+ down to about 6m40+

Proposed resolution

Remaining tasks

Follow-ups are open to apply the same pattern to cspell and eslint. We should do phpcs too.

User interface changes

API changes

Data model changes

Release notes snippet

CommentFileSizeAuthor
#30 3462763-nr-bot.txt90 bytesneeds-review-queue-bot

Issue fork drupal-3462763

Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Comments

catch created an issue. See original summary.

catch’s picture

Issue summary: View changes

catch’s picture

Status: Active » Closed (won't fix)

No this is a dead-end - no way to share artificacts between different pipelines of the same branch, let alone different branches.

https://docs.gitlab.com/ee/ci/caching/

catch’s picture

Status: Closed (won't fix) » Needs work
catch’s picture

Getting closer.

I had to move the phpstan tmpDir into core so it can be used as an artifact, with the associated .gitignore etc. cspell cache is already in the core directory so hopefully this is fine.

Temporarily set the ref to point to the MR branch, that lets us create artifacts in one pipeline and download them in the next to simulate what would happen once this is committed. Also added --vvv to the phpstan command so it tells us if it gets cache hits and misses.

https://git.drupalcode.org/project/drupal/-/jobs/2190395 shows a cache hit.

Here's a cache hit with Drupal.php changed https://git.drupalcode.org/project/drupal/-/jobs/2190459

catch’s picture

To make this optionally get the cache, we'd need https://gitlab.com/gitlab-org/gitlab/-/issues/389050 which doesn't exist. But I think we can get it in the script section with curl via the gitlab API instead per https://docs.gitlab.com/ee/ci/jobs/job_artifacts.html. This might be the only way to avoid a circular dependency between the job and itself...

bbrala’s picture

Interesting to just do it manually... Keeping an eye on this one.

catch’s picture

Status: Needs work » Needs review

Here's a successful run using curl against the gitlab API, with the ref temporarily changed to the merge request ref (because HEAD doesn't store the phpstan result cache as an artifact yet).

https://git.drupalcode.org/project/drupal/-/jobs/2191580

Given that works, this ought to work once it's merged and tries to fetch from HEAD instead.

If we wanted to test it a bit more for robustness on gitlab, one way would be to make a commit that only stores the artifacts and doesn't include the curl script. Then this MR would be able to download the real artifact from a branch run. Happy to do that if someone is happy to review/RTBC/commit the extra MR either from this issue or a spin-off.

However, since the script can fail and the job still passes, this should be low risk now.

catch’s picture

Issue summary: View changes

Ready for review, updated the issue summary with the eventual approach. A lot of twists and turns in the commit hisory because I was learning as I go, but it's not that complicated at the end of it all.

catch’s picture

One more commit (for now) to reduce the number of CPUs requested to 4 - we don't need 16 CPUs on a cache hit and this should mean less time waiting for pods to be available.

The latest jobs will be cache misses because it's looking for artifacts from HEAD that don't exist, but even with 4 cpus they still complete in less than two minutes, so we're talking a 10-20s slowdown on cache misses vs. a 1 minute speed up on cache hits, with one quarter the requested CPUs - although it could be more than 4 CPUs got allocated, don't think that's possible to see. https://git.drupalcode.org/project/drupal/-/jobs/2191864

Note that when there's a cache miss phpstan complains about the file being corrupted (because it's whatever gitlab sends as a 404 response), but the job passes and you can see something went wrong, so for me at least, that's not bad.

catch’s picture

Issue summary: View changes
catch’s picture

One thing we could potentially do here but which I haven't is to have a specific job for warming this cache instead of using the job itself.

- run the job only on commit runs
- potentially run phpstan, cspell, anything else we want to generate artifact caches for
- in the normal job, don't upload the cache as an artifact, only download it from gitlab

This would mean duplicating most of the phpstan job into a second job, but it would result in less bandwidth and storage usage.

bbrala’s picture

You could possible scope the caching with usage of the variables of gitlab and add it to the phpstan job id think. That menas no extra job, but only cache on commit.

catch’s picture

I couldn't work out how to do what with artifacts: when but there might be a way yeah.

Added an extra MR branch here with an extra job for comparison in the meantime.

I don't have a strong opinion between the two, just trying to provide options and see the first sub-7 minute core pipeline run.

catch’s picture

Issue summary: View changes
catch’s picture

Title: Try to store the phpstan result cache as an artifact » Use artifacts to share the phpstan result cache from core to MRs
bbrala’s picture

I looked at using artifacte: when: and indeed, seems pretty hard to get it to do the work there.

I specific cache job is a possibility, but thinking about this, this kinda is a bandaid until we can just use real caches by gitlab? I ended up not caring to much if it is seperate or not. Its a little more controlled seperately which is good. Might make the move to gitlab cache a little easier since we only need to change very little in the main job.

I end up with preferring a seperate job. But the preference is kinda light.

andypost’s picture

Nice to see this working!

btw both cache and artifacts has TTL so some fallback when cache is missing still required.
so instead of || true for curl if could use condition to print error message and re-init cache

bbrala’s picture

Isnt the fallback pretty much automatic, since if its not available it will just run without?

catch’s picture

Yes #21 is right, and the MR demonstrates this because we don't have the artifacts being created in core yet, so the cache is always empty (except when I hacked it to point to the MR instead).

catch’s picture

https://gitlab.com/gitlab-org/gitlab/-/issues/269404 suggests that forks can't use the cache from the upstream project, assume those forks are the same as our issue forks (but maybe they're not?), then this might not be a stop-gap but the only way to share the cache between core and all the issue forks.

catch’s picture

Put this into a sandbox MR, hacking it to get the cache from the MR itself, and got a test run of 7m10s, I think the shortest we've seen prior to that was about 7m25s. My hope is that this will get us under 7 minutes, which has been elusive so far, but 7m10s is in the right direction at least.

https://git.drupalcode.org/project/drupal/-/pipelines/231155

catch’s picture

Opened #3463949: [PP-1] Use artifacts to share the cspell cache between runs for cspell. Using the cache the entire cspell job completes in about a minute which means we can remove the git diff logic entirely. Postponed on this one because it uses the same lint cache warming job as this.

Definitely leaning towards the separate cache warming job now because it makes it a lot clearer where the cache actually gets created and pulled from.

catch changed the visibility of the branch 3462763-try-to-store to hidden.

catch’s picture

Issue summary: View changes
bbrala’s picture

Yeah I'm also leaning towards a separate job for preparing the caches. Although it might cost you a little in feedback after commit.

catch’s picture

Although it might cost you a little in feedback after commit.

This should hopefully be OK because the on-commit jobs (at least as the current MR stands) will continue to run the separate lint jobs, they'll just run this extra cache warming job on top. So if we introduce a lint failure, we should still be able to see what went wrong in those individual jobs. However it definitely introduces some level of uncertainty because any jobs that don't run on MRs are hard to test - should possibly add a when: manual to make that easier?

needs-review-queue-bot’s picture

Status: Needs review » Needs work
StatusFileSize
new90 bytes

The Needs Review Queue Bot tested this issue. It no longer applies to Drupal core. Therefore, this issue status is now "Needs work".

This does not mean that the patch necessarily needs to be re-rolled or the MR rebased. Read the Issue Summary, the issue tags and the latest discussion here to determine what needs to be done.

Consult the Drupal Contributor Guide to find step-by-step guides for working with issues.

catch’s picture

Status: Needs work » Needs review

Rebased and merged #3463949: [PP-1] Use artifacts to share the cspell cache between runs into here because the cspell changes are very small relative to the MR size, and I think it's easier to see how it'll work with multiple jobs when there's two examples in here.

catch changed the visibility of the branch 3462763-cache-warming-job to hidden.

catch’s picture

Set up an MR in a sandbox issue to change the cache target to point back to the MR's own jobs - this is because the actual change here restricts the artifact download to only core branches, which means the cache is always empty until we actually commit this, and I've made enough mess in the branches on this issue without introducing an extra one here.

phpstan: https://git.drupalcode.org/project/drupal/-/jobs/2245377
cspell: https://git.drupalcode.org/project/drupal/-/jobs/2245381

And the actual cache warming job itself. https://git.drupalcode.org/project/drupal/-/jobs/2245376

The cache warming job also fetches its own cache and uses that if it can, because this will save CPU cycles on the commit runs - then it'll write a new cache based on the changes. We don't have to do this, but it'll mean the updated files are available much quicker. If the cache is empty, phpstan complains that the cache file is corrupted (because it's a gitlab 404 response instead) but then runs without it and generates a new one. cspell just silently runs without cache if it can't find a cache file or if it's wrong, and generates a new cache file too. So the job can't break itself (afaict), it will just either use the cache or not.

We will probably want a follow-up to remove the verbose output from the jobs, but I think it's worth keeping that until we see it actually working after commit.

Would still be useful to confirm that $CI_PROJECT_ID points to core's project ID (59858) for merge requests in issues/branches from non-core committers. If it doesn't, we can set our own variable in gitlabci.yml hard-coding it or something.

bbrala’s picture

Trying again. My 2 gitlab comments got eaten.

Can we make a reference of the cachefetch script lines? That cleans the jobs themselves up, better elreadability and no duplication?

Secondly, is cspell caching strategy content optimal? Id say metadata is fine and way less compute/io. Might not matter much since I remember the storage is on a memdisk.

catch’s picture

Metadata is the default so at least by accident I tried that first. However it uses mtime which is a no go, so had to change it to content.

The cache fetch lines are unique to each job so not sure that will help us much? Each only specifies the files it needs.

bbrala’s picture

Might have missed something then. Thought the cache fetch for the phpstan job was the same as the fetch fo r the warm job and assumed the same for cspell. Mightve missed a env var then that might be different.

You do fetch the cache in the warm job right?

catch’s picture

No it's me that missed it. Thought you meant sharing the command between phpstan and cspell, not between the cache warming job and the lint job. Good plan and will do when back at computer.

andypost’s picture

Maybe instead of custom scripting the glab ci a better option

catch’s picture

@andypost: afaict glab ci is intended for local use rather than on gitlab runners themselves, we'd have to install it in the gitlab images, and we'd also need to figure out authentication, if it supports job tokens, it doesn't mention that in the docs. Would only replace a couple of lines of YAML as well.

@bbrala: moved the cache fetching to references, that is indeed tidier. Also moved cspell command to references since that's identical too. phpstan command we do different things between the cache warming job and the two invocations in the job itself, so left that bit.

bbrala’s picture

Nice and clean like this. Changes look good except I haven't seen the cache run with the latest code so bit wear to rtbc.

andypost’s picture

@catch glab is limited in CI but because of absence of write permissions, the auth is glab auth login --token=$CI_JOB_TOKEN

catch’s picture

bbrala’s picture

Status: Needs review » Reviewed & tested by the community

Well, this seems great. Checked if cache is downloaded and used, all seems like it should be.

cpsell is 7 seconds cached, phpstan wat less.

Think this is great, don't think the glab cli is really needed right now. It will need a small change before merge, but feel this is all good.

Thanks catch for working on this <3

Added credits for catch, me and also andypost.

catch’s picture

It will need a small change before merge

Just to say the MR here doesn't have the hack to point the cache items to itself, those pipeline runs are from a branch on a different, sandbox issue so that this one stayed relatively clean and should be fine to merge as-is. Because the change isn't committed yet, we haven't seen a successful cache fetch from a core pipeline run, so will need to check it actually works as expected but that can only be done after merge.

@mstrelan I originally did that on purpose since they're both dealing with the version, but yeah let's move that comment back where it was. Done that, leaving RTBC because it's just moving a comment one line down.

Thanks for the reviews here, started work on #3463108: Use artifacts to share the eslint and styleint caches from core to MRs and opened #3464361: [CI] Use artifact caching for phpcs.

catch’s picture

Title: Use artifacts to share the phpstan result cache from core to MRs » Use artifacts to share the phpstan result and cspell caches from core to MRs

bbrala changed the visibility of the branch 3462763-test-mtime-optimization to hidden.

bbrala’s picture

Added followup issue to investigate using git-restore-mtime to optimize even more. Could mean we don't need strategy -> content but could use file metadata. Should theoretically be more performant.

  • alexpott committed c5d4f8a4 on 11.0.x
    Issue #3462763 by catch, bbrala, andypost: Use artifacts to share the...

  • alexpott committed c56ebfb7 on 11.x
    Issue #3462763 by catch, bbrala, andypost: Use artifacts to share the...
alexpott’s picture

Version: 11.x-dev » 11.0.x-dev
Status: Reviewed & tested by the community » Fixed

Committed c56ebfb and pushed to 11.x. Thanks!
Committed c5d4f8a and pushed to 11.0.x. Thanks!

  • alexpott committed 2cd8c6ec on 11.0.x
    Issue #3462763 follow-up by alexpott: Use artifacts to share the phpstan...
alexpott’s picture

This broke 11.0.x because of dictionary changes due to #3460114: Update JavaScript dependencies for Drupal 11.0-rc - that remade the dictionary on 11.x but did not account for differences on 11.0.x... to fix I remade the dictionary on 11.0.x and committed.

  • catch committed fffc68d7 on 11.0.x
    Issue #3462763 by catch, bbrala, andypost, alexpott: Use artifacts to...

  • catch committed fbe398b5 on 11.x
    Issue #3462763 by catch, bbrala, andypost, alexpott: Use artifacts to...
catch’s picture

Pushed a follow-up to also run this on the daily tests.

I've been seeing 404s for the artifacts on the 11.x branch, but correctly seeing them on the 11.0.x branch.

The gitlab API URL we're using provides 'the artifacts from the latest successful pipeline' - my theory is that if the latest successful pipeline doesn't actually run the job, then you end up with a 404, even if the last pipeline where the job ran was successful.

If so, this means at least for now, we need to switch the scheduled performance jobs to run on the 11.0.x branch, and ensure that the cache warming job runs on the daily pipeline - so that every 11.x pipeline runs this job.

Should find out whether the above is all correct within an hour or so.

catch’s picture

Confirmed. After disabling the 11.x performance test schedule (now moved to 11.0.x) and after a successful on-commit pipeline run, I got a cache hit on a core MR: https://git.drupalcode.org/project/drupal/-/jobs/2295859

  • catch committed 2ef15514 on 11.0.x
    Issue #3462763 by catch, bbrala, andypost, alexpott: Use artifacts to...

  • catch committed 20ea83cb on 11.x
    Issue #3462763 by catch, bbrala, andypost, alexpott: Use artifacts to...
catch’s picture

One more follow-up commit - after the above, I still haven't seen a cache hit on an MR that's not from a core developer. I wonder if it's the job token not being valid for branch pipelines when it's from a fork. However, @fjgarlin tested the URL logged out and was able to download the artifacts, which means we don't need the job token at all anyway because the API URL is completely publicly accessible anyway.

catch’s picture

Version: 11.0.x-dev » 10.4.x-dev
Status: Fixed » Patch (to be ported)

We should try to backport this to 10.4.x and 10.3.x, not because MR pipeline times matter that much on those branches, but it would help to keep the gitlab files in sync for other changes. There's other issues tagged for backport though we should get in first.

quietone made their first commit to this issue’s fork.

quietone’s picture

Made an MR for the 10.4 backport, MR 9353.

catch’s picture

Status: Patch (to be ported) » Needs review
smustgrave’s picture

Status: Needs review » Reviewed & tested by the community

Backport seems inline with 11.x

wim leers’s picture

WOWWWWWWWWW! 🤩🤩🤩 👏

Can we achieve the same for https://www.drupal.org/project/gitlab_templates, to accelerate contrib CI?

catch’s picture

@Wim the current solution only works for core scheduled/on-commit pipelines and MRs from committers so far. This is due to how both phpstan and all the yarn linters validate their cache files (mostly to do with absolute paths, but also what project ID points to etc.), #3470641: gitlab artifact caching doesn't work due to differences with project ID and build directories has details.

I think that contrib would have the same issue - i.e. caching working for scheduled/on-commit pipelines and committers but not anyone else, so we should probably get core 100% working before trying to apply the same pattern. However in theory it should be be possible to make it work the same way.

larowlan’s picture

Status: Reviewed & tested by the community » Needs work

I think there's some differences between what's in 11.x and what's in the backport. I'm not sure they're significant but would appreciate a second opinion

catch’s picture

Version: 10.4.x-dev » 11.0.x-dev
Status: Needs work » Fixed

Since we've missed multiple other gitlab backports between 11.x and 10.x, the effort of backporting everything now will probably be as much or more than dealing with any critical issues if they come up. So I'm going to just mark this fixed against 11.x.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.