Problem/Motivation

Now that we have an evals framework as of #3581832: Create an eval framework to determine if guidance updates are making things better or worse, let's document how to use it in practice.

Scenario 1: Before / after new rule

#3581669: Create scaffolding to automatically put markdown files in the right place added a "stub" skill for documentation, but #3581687: Guidance on how to write excellent documentation has a transcript of an interview with @eojthebrave at DrupalCon, which is surely better than this :) (at least, once we "skillify" it).

Compare/contrast "nothing" and "new rule," including how to go about writing an eval from scratch in the first place.

Scenario 2: Updates to an existing rule

#3581832: Create an eval framework to determine if guidance updates are making things better or worse brought in #3581672: Guidance on writing excellent automated tests, at least in part. I would like Claude to take another stab at it and then use the new evals framework in practice to see how the new version holds up.

Compare/contrast "old rule" and "new rule," including how to make amendments to evals.

Scenario 3: The agents keep messing something up and we want to add a new eval to catch it

For example, "stop writing Drupal 7 code" :P

Proposed resolution

Have Claude do this, and also write its own docs to help.


Comments

webchick created an issue. See original summary.

webchick’s picture

Issue summary: View changes

zorz made their first commit to this issue’s fork.

webchick’s picture

Assigned: webchick » Unassigned

Unassigning myself cos I don't think I could've come up with https://drupal.slack.com/archives/C0APH70JV18/p1775209263697149?thread_t... :D

zorz’s picture

Status: Active » Needs review
webchick’s picture

Ok this looks AMAZING! :D Thank you soooo much!!

In reviewing the code, I spun off a few other issues:

#3583192: Test eval "creating a new skill" scenario and #3583191: Test eval "editing an existing skill" scenario (also added as sub-issues here) with the idea to run this new framework against a couple of "real-world" scenarios from the issue queue.

#3583202: Explore provider-agnostic eval runner and #3583203: Expand eval runner to test both guidance AND models? as possible future directions. (Not pressing.)

We currently have a chicken/egg situation: eval/checker.py and this guidance on how to use it aren't in the source tree yet, so people won't be writing evals along with their guidance proposals.

I think I would recommend we merge this in, and then expand those other issues with eval coverage as a means to test that this is working as intended. Does that make sense to you?

zorz’s picture

Status: Needs review » Needs work
zorz’s picture

Thank you, webchick; I really enjoy collaborating with you. I am picking this up. I understand the momentum presses toward creating the eval runners first, and I totally agree. I will need to do some more work here in this issue in parallel with the ones you mention.

alex ua’s picture

Wanted to flag a natural upstream input source for Scenario 3 (catching regressions / new problems).

When an experienced developer corrects an AI agent during a live session — "that's not how cache tags work" or "that hook was replaced in Drupal 11" — that correction is a ready-made eval case waiting to happen. The problem is there's currently no standardized way to capture these corrections so they can feed into evals.

I've opened the issue for capturing expert corrections proposing a capturing-expert-corrections skill that logs corrections as structured JSONL. Each entry includes what was wrong, what's correct, which subsystem, and a failure classification.
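To make the shape of that log concrete, here is a minimal sketch of appending and reading structured JSONL correction entries. The field names (task, wrong, correct, subsystem, classification) are illustrative, based on the description above, not the proposed skill's actual schema.

```python
import json

# Hypothetical correction-log entry; field names are illustrative only.
entry = {
    "task": "Invalidate the render cache when a node is updated",
    "wrong": "cache_clear_all()",  # what the agent produced (Drupal 7 API)
    "correct": "Cache::invalidateTags(['node:' . $node->id()])",
    "subsystem": "cache",
    "classification": "stale-api",  # failure classification
}

def append_correction(path: str, entry: dict) -> None:
    """Append one correction as a single JSONL line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")

def load_corrections(path: str) -> list[dict]:
    """Read all logged corrections back as dicts."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

JSONL (one object per line) keeps appends cheap during a live session and makes the log trivially greppable by subsystem or classification later.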

The bridge to this issue's Scenario 3 is direct:

  1. Expert catches agent error → correction logged with classification
  2. Correction log entry becomes the basis for a new eval case
  3. Eval confirms the fix actually improves output
  4. Scenario 3 documentation can reference correction logs as the "how did we know we needed this eval?" origin story

This gives Scenario 3 a concrete, repeatable workflow instead of relying on someone manually noticing "the agent keeps doing X wrong." The correction log is the early warning system; evals are the regression gate.

Happy to help draft the Scenario 3 section with correction-log-driven examples if that would be useful.

zorz’s picture

Assigned: Unassigned » zorz
zorz’s picture

Assigned: zorz » Unassigned
Status: Needs work » Needs review
zorz’s picture

Alex that maps well to what compare.py already supports. A correction log entry becomes an evals.json case with must_contain/must_not_contain checks. Go ahead and draft the Scenario 3 section. I'll wire it into the comparison framework once #3582953: Document how to run evals in various scenarios merges.

If you want to take a look at the correction-to-eval flow, the CONTRIBUTING.md in [MR !2](https://git.drupalcode.org/project/ai_best_practices/-/merge_requests/2) has the "Adding a new eval from scratch" walkthrough.
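The mapping is mechanical enough to sketch. In this hedged example, the helper name and the case shape are assumptions; only the must_contain / must_not_contain check names come from this thread:

```python
# Hypothetical converter from a correction-log entry to an evals.json case.
# The entry field names (task/correct/wrong) are illustrative assumptions.
def correction_to_eval_case(entry: dict) -> dict:
    return {
        "prompt": entry["task"],              # replay the original task
        "must_contain": [entry["correct"]],   # the required fix
        "must_not_contain": [entry["wrong"]], # the regression being guarded
    }
```

A case produced this way fails against the old guidance (the bad pattern appears) and passes once the fix lands, which is exactly the before/after signal compare.py measures.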

alex ua’s picture

zorz, here you go!

Scenario 3: a live correction becomes a regression test

When an experienced contributor corrects an AI agent during a real Drupal session, treat that correction as the seed of a new behavioral eval. The goal is not just to note "the model got this wrong", but to convert the mistake into a repeatable case that fails before the guidance change and passes after it.

Workflow
  1. Capture the correction. Record the original task, the incorrect claim or code, the corrected answer, the subsystem, and a short note explaining why the original reasoning failed.
  2. Turn it into an evals.json case. Rephrase the real task as the eval prompt. Use must_contain_any for the required fix, must_not_contain for the bad pattern, and check_php_lint when the response generates PHP.
  3. Run the comparison. If this is an existing skill, compare the old version against the edited version with python3 evals/compare.py --skill ... --runs 3. If this is a new skill, compare a no-skill baseline against the skill with --no-baseline.
  4. Review the delta. A successful fix should improve the new case without regressing existing cases. Include the comparison output in the merge request so reviewers can see pass-rate delta, token usage, and cost.
What the eval should contain
  • Prompt: ask the model to do the same kind of task that triggered the correction
  • must_contain_any: the corrected API, cacheability metadata, or other required pattern
  • must_not_contain: the stale API, invalid assumption, or unsafe shortcut that was corrected
  • check_php_lint: enable when the model outputs PHP code
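A minimal sketch of how checks like those above could be applied to a model response. The case shape mirrors the field names used in this thread; the real evals.json schema and checker live in the project's evals/ directory, so treat this as an illustration, not the implementation:

```python
# Illustrative checker for must_contain_any / must_not_contain; the real
# logic lives in the project's evals tooling.
def passes(case: dict, response: str) -> bool:
    # At least one required pattern must appear.
    required = case.get("must_contain_any", [])
    if required and not any(s in response for s in required):
        return False
    # None of the forbidden patterns may appear.
    for s in case.get("must_not_contain", []):
        if s in response:
            return False
    return True

case = {
    "prompt": "Alter a form to add a custom submit handler.",
    "must_contain_any": ["hook_form_alter", "#submit"],
    "must_not_contain": ["drupal_get_form"],  # Drupal 7 API, removed in 8+
}
```

Note that must_not_contain is the teeth of a correction-driven eval: it is what makes the old, wrong behavior a hard failure rather than a soft quality judgment.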

The "Adding a new eval from scratch" walkthrough in CONTRIBUTING.md already covers the file layout and commands. This Scenario 3 section documents the missing part: how a real correction during live use becomes the next regression test.

This keeps Scenario 3 grounded in real failures from real contributor workflows while reusing the comparison framework that already exists in compare.py. The correction log is the intake layer; evals.json is the regression layer.

webchick’s picture

Component: Documentation » Evals
Priority: Normal » Major
Status: Needs review » Fixed

Scenario 3 is a great idea. Let's tackle that in #3583213: Guidance on capturing expert corrections to improve AI skills.

Meanwhile, it behooves us to get something in around eval checking (and instructions on how to do it) sooner rather than later, so I've merged in https://git.drupalcode.org/project/ai_best_practices/-/merge_requests/2 as a starting point.

Thank you SO much for this, @zorz!! Exciting. :D


zorz’s picture

Heads up on an isolation issue I found while building the documentation evals for #3583241.

claude -p (pipe mode) loads all enabled plugins, hooks, skills, and MCP servers from ~/.claude/settings.json by default. There is no visual indication this is happening since pipe mode has no UI. That means every eval run with compare.py so far included whatever plugins the person running it had installed as unmeasured context in both configs.

The fix is two flags added to the claude -p invocation:

  • --setting-sources "" blocks all user/project/local settings (plugins, hooks, skills, CLAUDE.md, rules)
  • --strict-mcp-config blocks all MCP servers
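For anyone wiring this into their own runner, here is a hedged sketch of assembling the isolated invocation. The two flags come from this comment; verify them against your installed Claude CLI version, since pipe-mode flags can change between releases:

```python
# Sketch: argv for an isolated `claude -p` eval run, per the flags above.
def isolated_claude_args(prompt: str) -> list[str]:
    return [
        "claude", "-p", prompt,
        "--setting-sources", "",  # block user/project/local settings
        "--strict-mcp-config",    # block all configured MCP servers
    ]
```

In a runner like compare.py, this argv could then be passed to subprocess.run() with a neutral working directory, matching the --cwd fix described below.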

The good news: A/B deltas were always clean because both configs got the same contamination. So comparative conclusions ("skill X adds Y% improvement") still hold. The bad news: absolute pass rates may have been inflated by plugin context that was silently injected.

For reference, when I ran the coding-standards evals in #3583192, I had 13 enabled plugins including one that injects 16 Drupal-specific skills. The Sonnet 24/24 result was real in terms of the delta (0% with or without the coding-standards skill), but the baseline was not a clean "no guidance" baseline.

The fix is included in MR !5 on #3583241. I also added a --cwd flag so the model runs from a neutral directory and cannot see the evals/ or skills/ directories in the repo.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.