Problem/Motivation

Now that we have an evals framework as of #3581832: Create an eval framework to determine if guidance updates are making things better or worse, let's document how to use it in practice.

Scenario 1: Before / after new rule

#3581669: Create scaffolding to automatically put markdown files in the right place added a "stub" skill for documentation, but #3581687: Guidance on how to write excellent documentation has a transcript of an interview with @eojthebrave at DrupalCon, which is surely better than this :) (at least, once we "skillify" it).

Compare/contrast "nothing" and "new rule," including how to go about writing an eval from scratch in the first place.

Scenario 2: Updates to an existing rule

#3581832: Create an eval framework to determine if guidance updates are making things better or worse brought in #3581672: Guidance on writing excellent automated tests, at least in part. I would like Claude to take another stab at it and then use the new evals framework in practice to see how the new version holds up.

Compare/contrast "old rule" and "new rule," including how to make amendments to evals.

Scenario 3: The agents keep messing something up and we want to add a new eval to catch it

For example, "stop writing Drupal 7 code" :P

Proposed resolution

Have Claude do this, and also write its own docs to help.


Comments

webchick created an issue. See original summary.

webchick’s picture

Issue summary: View changes

zorz made their first commit to this issue’s fork.

webchick’s picture

Assigned: webchick » Unassigned

Unassigning myself cos I don't think I could've come up with https://drupal.slack.com/archives/C0APH70JV18/p1775209263697149?thread_t... :D

zorz’s picture

Status: Active » Needs review
webchick’s picture

Ok this looks AMAZING! :D Thank you soooo much!!

In reviewing the code, I spun off a few other issues:

#3583192: Test eval "creating a new skill" scenario and #3583191: Test eval "editing an existing skill" scenario (also added as sub-issues here) with the idea to run this new framework against a couple of "real-world" scenarios from the issue queue.

#3583202: Explore provider-agnostic eval runner and #3583203: Expand eval runner to test both guidance AND models? as possible future directions. (Not pressing.)

We currently have a chicken/egg situation: eval/checker.py and this guidance on how to use it aren't in the source tree yet, so people won't be writing evals along with their guidance proposals.

I think I would recommend we merge this in, and then expand those other issues with eval coverage as a means to test that this is working as intended. Does that make sense to you?

zorz’s picture

Status: Needs review » Needs work
zorz’s picture

Thank you, webchick; I really enjoy collaborating with you. I am picking this up. I understand the momentum presses toward creating the eval runners first, and I totally agree. I will need to do some more work here in this issue in parallel with the ones you mention.

alex ua’s picture

Wanted to flag a natural upstream input source for Scenario 3 (catching regressions / new problems).

When an experienced developer corrects an AI agent during a live session — "that's not how cache tags work" or "that hook was replaced in Drupal 11" — that correction is a ready-made eval case waiting to happen. The problem is there's currently no standardized way to capture these corrections so they can feed into evals.

I've opened the issue for capturing expert corrections proposing a capturing-expert-corrections skill that logs corrections as structured JSONL. Each entry includes what was wrong, what's correct, which subsystem, and a failure classification.
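To make the shape of that log concrete, here is a minimal sketch of appending and reading structured JSONL correction entries. The field names (task, wrong, correct, subsystem, classification) are illustrative, based on the description above, not the proposed skill's actual schema.

```python
import json

# Hypothetical correction-log entry; field names are illustrative only.
entry = {
    "task": "Invalidate the render cache when a node is updated",
    "wrong": "cache_clear_all()",  # what the agent produced (Drupal 7 API)
    "correct": "Cache::invalidateTags(['node:' . $node->id()])",
    "subsystem": "cache",
    "classification": "stale-api",  # failure classification
}

def append_correction(path: str, entry: dict) -> None:
    """Append one correction as a single JSONL line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")

def load_corrections(path: str) -> list[dict]:
    """Read all logged corrections back as dicts."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

JSONL (one object per line) keeps appends cheap during a live session and makes the log trivially greppable by subsystem or classification later.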

The bridge to this issue's Scenario 3 is direct:

  1. Expert catches agent error → correction logged with classification
  2. Correction log entry becomes the basis for a new eval case
  3. Eval confirms the fix actually improves output
  4. Scenario 3 documentation can reference correction logs as the "how did we know we needed this eval?" origin story

This gives Scenario 3 a concrete, repeatable workflow instead of relying on someone manually noticing "the agent keeps doing X wrong." The correction log is the early warning system; evals are the regression gate.

Happy to help draft the Scenario 3 section with correction-log-driven examples if that would be useful.

zorz’s picture

Assigned: Unassigned » zorz
zorz’s picture

Assigned: zorz » Unassigned
Status: Needs work » Needs review
zorz’s picture

Alex that maps well to what compare.py already supports. A correction log entry becomes an evals.json case with must_contain/must_not_contain checks. Go ahead and draft the Scenario 3 section. I'll wire it into the comparison framework once #3582953: Document how to run evals in various scenarios merges.

If you want to take a look at the correction-to-eval flow, the CONTRIBUTING.md in [MR !2](https://git.drupalcode.org/project/ai_best_practices/-/merge_requests/2) has the "Adding a new eval from scratch" walkthrough.
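The mapping is mechanical enough to sketch. In this hedged example, the helper name and the case shape are assumptions; only the must_contain / must_not_contain check names come from this thread:

```python
# Hypothetical converter from a correction-log entry to an evals.json case.
# The entry field names (task/correct/wrong) are illustrative assumptions.
def correction_to_eval_case(entry: dict) -> dict:
    return {
        "prompt": entry["task"],              # replay the original task
        "must_contain": [entry["correct"]],   # the required fix
        "must_not_contain": [entry["wrong"]], # the regression being guarded
    }
```

A case produced this way fails against the old guidance (the bad pattern appears) and passes once the fix lands, which is exactly the before/after signal compare.py measures.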

alex ua’s picture

zorz, here you go!

Scenario 3: a live correction becomes a regression test

When an experienced contributor corrects an AI agent during a real Drupal session, treat that correction as the seed of a new behavioral eval. The goal is not just to note "the model got this wrong", but to convert the mistake into a repeatable case that fails before the guidance change and passes after it.

Workflow
  1. Capture the correction. Record the original task, the incorrect claim or code, the corrected answer, the subsystem, and a short note explaining why the original reasoning failed.
  2. Turn it into an evals.json case. Rephrase the real task as the eval prompt. Use must_contain_any for the required fix, must_not_contain for the bad pattern, and check_php_lint when the response generates PHP.
  3. Run the comparison. If this is an existing skill, compare the old version against the edited version with python3 evals/compare.py --skill ... --runs 3. If this is a new skill, compare a no-skill baseline against the skill with --no-baseline.
  4. Review the delta. A successful fix should improve the new case without regressing existing cases. Include the comparison output in the merge request so reviewers can see pass-rate delta, token usage, and cost.
What the eval should contain
  • Prompt: ask the model to do the same kind of task that triggered the correction
  • must_contain_any: the corrected API, cacheability metadata, or other required pattern
  • must_not_contain: the stale API, invalid assumption, or unsafe shortcut that was corrected
  • check_php_lint: enable when the model outputs PHP code
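A minimal sketch of how checks like those above could be applied to a model response. The case shape mirrors the field names used in this thread; the real evals.json schema and checker live in the project's evals/ directory, so treat this as an illustration, not the implementation:

```python
# Illustrative checker for must_contain_any / must_not_contain; the real
# logic lives in the project's evals tooling.
def passes(case: dict, response: str) -> bool:
    # At least one required pattern must appear.
    required = case.get("must_contain_any", [])
    if required and not any(s in response for s in required):
        return False
    # None of the forbidden patterns may appear.
    for s in case.get("must_not_contain", []):
        if s in response:
            return False
    return True

case = {
    "prompt": "Alter a form to add a custom submit handler.",
    "must_contain_any": ["hook_form_alter", "#submit"],
    "must_not_contain": ["drupal_get_form"],  # Drupal 7 API, removed in 8+
}
```

Note that must_not_contain is the teeth of a correction-driven eval: it is what makes the old, wrong behavior a hard failure rather than a soft quality judgment.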

The "Adding a new eval from scratch" walkthrough in CONTRIBUTING.md already covers the file layout and commands. This Scenario 3 section documents the missing part: how a real correction during live use becomes the next regression test.

This keeps Scenario 3 grounded in real failures from real contributor workflows while reusing the comparison framework that already exists in compare.py. The correction log is the intake layer; evals.json is the regression layer.

webchick’s picture

Component: Documentation » Evals
Priority: Normal » Major
Status: Needs review » Fixed

Scenario 3 is a great idea. Let's tackle that in #3583213: Guidance on capturing expert corrections to improve AI skills.

Meanwhile, it behooves us to get something in around eval checking (and instructions on how to do it) sooner rather than later, so I've merged in https://git.drupalcode.org/project/ai_best_practices/-/merge_requests/2 as a starting point.

Thank you SO much for this, @zorz!! Exciting. :D


zorz’s picture

Heads up on an isolation issue I found while building the documentation evals for #3583241.

claude -p (pipe mode) loads all enabled plugins, hooks, skills, and MCP servers from ~/.claude/settings.json by default. There is no visual indication this is happening since pipe mode has no UI. That means every eval run with compare.py so far included whatever plugins the person running it had installed as unmeasured context in both configs.

The fix is two flags added to the claude -p invocation:

  • --setting-sources "" blocks all user/project/local settings (plugins, hooks, skills, CLAUDE.md, rules)
  • --strict-mcp-config blocks all MCP servers
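For anyone wiring this into their own runner, here is a hedged sketch of assembling the isolated invocation. The two flags come from this comment; verify them against your installed Claude CLI version, since pipe-mode flags can change between releases:

```python
# Sketch: argv for an isolated `claude -p` eval run, per the flags above.
def isolated_claude_args(prompt: str) -> list[str]:
    return [
        "claude", "-p", prompt,
        "--setting-sources", "",  # block user/project/local settings
        "--strict-mcp-config",    # block all configured MCP servers
    ]
```

In a runner like compare.py, this argv could then be passed to subprocess.run() with a neutral working directory, matching the --cwd fix described below.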

The good news: A/B deltas were always clean because both configs got the same contamination. So comparative conclusions ("skill X adds Y% improvement") still hold. The bad news: absolute pass rates may have been inflated by plugin context that was silently injected.

For reference, when I ran the coding-standards evals in #3583192, I had 13 enabled plugins including one that injects 16 Drupal-specific skills. The Sonnet 24/24 result was real in terms of the delta (0% with or without the coding-standards skill), but the baseline was not a clean "no guidance" baseline.

The fix is included in MR !5 on #3583241. I also added a --cwd flag so the model runs from a neutral directory and cannot see the evals/ or skills/ directories in the repo.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.