Problem/Motivation
Now that we have an evals framework, as of #3581832: Create an eval framework to determine if guidance updates are making things better or worse, let's document how to use it in practice.
Scenario 1: Before / after new rule
#3581669: Create scaffolding to automatically put markdown files in the right place added a "stub" skill for documentation, but #3581687: Guidance on how to write excellent documentation has a transcript of an interview with @eojthebrave at DrupalCon, which is surely better than the stub :) (at least, once we "skillify" it).
Compare/contrast "nothing" and "new rule," including how to go about writing an eval from scratch in the first place.
Scenario 2: Updates to an existing rule
#3581832: Create an eval framework to determine if guidance updates are making things better or worse brought in #3581672: Guidance on writing excellent automated tests, at least in part. I would like Claude to take another stab at it and then use the new evals framework in practice to see how the new version holds up.
Compare/contrast "old rule" and "new rule," including how to make amendments to evals.
Scenario 3: The agents keep messing something up and we want to add a new eval to catch it
For example, "stop writing Drupal 7 code" :P
Proposed resolution
Have Claude do this, and also write its own docs to help.
Issue fork ai_best_practices-3582953
Comments
Comment #2
webchick commented

Comment #4
webchick commented
Unassigning myself cos I don't think I could've come up with https://drupal.slack.com/archives/C0APH70JV18/p1775209263697149?thread_t... :D
Comment #6
zorz commented

Comment #7
webchick commented
Ok, this looks AMAZING! :D Thank you soooo much!!
In reviewing the code, I spun off a few other issues:
#3583192: Test eval "creating a new skill" scenario and #3583191: Test eval "editing an existing skill" scenario (also added as sub-issues here) with the idea to run this new framework against a couple of "real-world" scenarios from the issue queue.
#3583202: Explore provider-agnostic eval runner and #3583203: Expand eval runner to test both guidance AND models? as possible future directions. (Not pressing.)
We currently have a chicken/egg situation where we don't have `eval/checker.py` and this guidance on how to use it in the source tree, and therefore people won't be writing evals along with their guidance proposals. I think I would recommend we merge this in, and then expand those other issues with eval coverage as a means to test that this is working as intended. Does that make sense to you?
Comment #8
zorz commented

Comment #9
zorz commented
Thank you, webchick; I really enjoy collaborating with you. I am picking this up. I understand that the momentum presses toward creating the eval runners first, and I totally agree. I will need to do some more work here in this issue in parallel with the ones you mention.
Comment #10
alex ua commented
Wanted to flag a natural upstream input source for Scenario 3 (catching regressions / new problems).
When an experienced developer corrects an AI agent during a live session — "that's not how cache tags work" or "that hook was replaced in Drupal 11" — that correction is a ready-made eval case waiting to happen. The problem is there's currently no standardized way to capture these corrections so they can feed into evals.
I've opened the issue for capturing expert corrections, proposing a `capturing-expert-corrections` skill that logs corrections as structured JSONL. Each entry includes what was wrong, what's correct, which subsystem, and a failure classification. The bridge to this issue's Scenario 3 is direct:
This gives Scenario 3 a concrete, repeatable workflow instead of relying on someone manually noticing "the agent keeps doing X wrong." The correction log is the early warning system; evals are the regression gate.
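A hedged sketch of what one such JSONL entry could look like. The field names below mirror the description above (what was wrong, what's correct, subsystem, failure classification), but they are illustrative; the actual schema is defined in the capturing-expert-corrections issue, not here.

```python
import json
import os
import tempfile

# Hypothetical correction-log entry; field names are illustrative only.
correction = {
    "wrong": "Called drupal_set_message(), which was removed in Drupal 9",
    "correct": "Use \\Drupal::messenger()->addStatus() instead",
    "subsystem": "messenger",
    "classification": "removed-api",
}

# JSONL means one JSON object per line, appended as corrections happen.
log_path = os.path.join(tempfile.mkdtemp(), "corrections.jsonl")
with open(log_path, "a") as log:
    log.write(json.dumps(correction) + "\n")

# Reading the log back for eval generation is one json.loads per line.
with open(log_path) as log:
    entries = [json.loads(line) for line in log]
```

The append-only JSONL shape matters: corrections can be logged mid-session with no coordination, and a later eval-generation pass can consume the whole file.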
Happy to help draft the Scenario 3 section with correction-log-driven examples if that would be useful.
Comment #11
zorz commented

Comment #12
zorz commented

Comment #13
zorz commented
Alex, that maps well to what `compare.py` already supports. A correction log entry becomes an `evals.json` case with `must_contain`/`must_not_contain` checks. Go ahead and draft the Scenario 3 section; I'll wire it into the comparison framework once #3582953: Document how to run evals in various scenarios merges.
If you want to take a look at the correction-to-eval flow, the CONTRIBUTING.md in [MR !2](https://git.drupalcode.org/project/ai_best_practices/-/merge_requests/2) has the "Adding a new eval from scratch" walkthrough.
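To illustrate the mapping described above, here is a hedged sketch of turning a logged correction into an `evals.json`-style case. The helper name and the exact case fields are hypothetical; the real case layout is documented in the CONTRIBUTING.md walkthrough.

```python
import json

# Hypothetical helper: map a correction-log entry onto an
# evals.json-style case with must_contain / must_not_contain checks.
# Treat the field names as a sketch, not the real schema.
def correction_to_eval_case(entry):
    return {
        "prompt": f"Review and fix this Drupal code:\n{entry['snippet']}",
        "must_contain": [entry["correct_pattern"]],
        "must_not_contain": [entry["wrong_pattern"]],
    }

entry = {
    "snippet": "drupal_set_message($msg);",
    "correct_pattern": "\\Drupal::messenger()",
    "wrong_pattern": "drupal_set_message(",
}
case = correction_to_eval_case(entry)
print(json.dumps(case, indent=2))
```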
Comment #14
alex ua commented
zorz, here you go!
Scenario 3: a live correction becomes a regression test
When an experienced contributor corrects an AI agent during a real Drupal session, treat that correction as the seed of a new behavioral eval. The goal is not just to note "the model got this wrong", but to convert the mistake into a repeatable case that fails before the guidance change and passes after it.
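A minimal sketch of that fail-before / pass-after idea, assuming plain-substring semantics for the `must_contain_any` / `must_not_contain` checks (the real logic lives in the eval checker; this is illustrative only):

```python
# Sketch of the pass/fail semantics, assuming substring matching:
# a response passes when at least one required pattern appears and
# no forbidden pattern does.
def passes(response: str, case: dict) -> bool:
    any_ok = any(p in response for p in case.get("must_contain_any", []))
    none_bad = not any(p in response for p in case.get("must_not_contain", []))
    return any_ok and none_bad

case = {
    # Hypothetical case distilled from a live correction.
    "must_contain_any": ["\\Drupal::messenger()"],
    "must_not_contain": ["drupal_set_message("],
}

before_fix = "drupal_set_message($msg);"               # should fail
after_fix = "\\Drupal::messenger()->addStatus($msg);"  # should pass
```

The point of the exercise is exactly this asymmetry: the same case must fail against the pre-guidance behavior and pass against the corrected behavior, or it is not testing anything.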
Workflow
1. Turn the correction into an `evals.json` case. Rephrase the real task as the eval prompt. Use `must_contain_any` for the required fix, `must_not_contain` for the bad pattern, and `check_php_lint` when the response generates PHP.
2. Run `python3 evals/compare.py --skill ... --runs 3`. If this is a new skill, compare a no-skill baseline against the skill with `--no-baseline`.

What the eval should contain
- `must_contain_any`: the corrected API, cacheability metadata, or other required pattern
- `must_not_contain`: the stale API, invalid assumption, or unsafe shortcut that was corrected
- `check_php_lint`: enable when the model outputs PHP code

The "Adding a new eval from scratch" walkthrough in `CONTRIBUTING.md` already covers the file layout and commands. This Scenario 3 section documents the missing part: how a real correction during live use becomes the next regression test.

This keeps Scenario 3 grounded in real failures from real contributor workflows while reusing the comparison framework that already exists in `compare.py`. The correction log is the intake layer; `evals.json` is the regression layer.

Comment #15
webchick commented
Scenario 3 is a great idea. Let's tackle that in #3583213: Guidance on capturing expert corrections to improve AI skills.
Meanwhile, it behooves us to get something in around eval checking (and instructions on how to do it) sooner rather than later, so I've merged in https://git.drupalcode.org/project/ai_best_practices/-/merge_requests/2 as a starting point.
Thank you SO much for this, @zorz!! Exciting. :D
Comment #17
zorz commented
Heads up on an isolation issue I found while building the documentation evals for #3583241.

`claude -p` (pipe mode) loads all enabled plugins, hooks, skills, and MCP servers from `~/.claude/settings.json` by default. There is no visual indication this is happening, since pipe mode has no UI. That means every eval run with `compare.py` so far included whatever plugins the person running it had installed as unmeasured context in both configs.

The fix is two flags added to the `claude -p` invocation:

- `--setting-sources ""` blocks all user/project/local settings (plugins, hooks, skills, CLAUDE.md, rules)
- `--strict-mcp-config` blocks all MCP servers

The good news: A/B deltas were always clean because both configs got the same contamination. So comparative conclusions ("skill X adds Y% improvement") still hold. The bad news: absolute pass rates may have been inflated by plugin context that was silently injected.
For reference, when I ran the coding-standards evals in #3583192, I had 13 enabled plugins including one that injects 16 Drupal-specific skills. The Sonnet 24/24 result was real in terms of the delta (0% with or without the coding-standards skill), but the baseline was not a clean "no guidance" baseline.
The fix is included in MR !5 on #3583241. I also added a `--cwd` flag so the model runs from a neutral directory and cannot see the evals/ or skills/ directories in the repo.
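Putting the fixes above together, an isolated invocation might be assembled like this. This sketch only builds the command and never executes it; the flag names are taken from the comment above, and the use of a temp directory as the working directory stands in for the `--cwd` behavior.

```python
import shlex
import tempfile

# Sketch of an isolated `claude -p` run, combining the two isolation
# flags with a neutral working directory. Built but not executed here.
neutral_cwd = tempfile.mkdtemp()  # model can't see evals/ or skills/
cmd = [
    "claude", "-p",
    "--setting-sources", "",  # block user/project/local settings
    "--strict-mcp-config",    # block all MCP servers
]
# An eval runner would then do something like:
#   subprocess.run(cmd, cwd=neutral_cwd, input=prompt, capture_output=True)
print(shlex.join(cmd))
```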