Problem/Motivation

#3582953 (Document how to run evals in various scenarios) adds a comparison script and documentation for several eval scenarios.

Let's test each of them against real-world scenarios from the issue queue.

**Scenario 2: Proving a new skill helps**

Test before/after on #3581705: Guidance on coding standards

Comment  File          Size      Author
#6       showinfo.gif  5.16 MB   pritam-osl
#5       brands.png    55.44 KB  pritam-osl
#5       drupal.png    24.2 KB   pritam-osl
#5       impact.png    36.96 KB  pritam-osl

Comments

webchick created an issue. See original summary.

zorz commented:

Assigned: Unassigned » zorz

zorz commented:

I ran compare.py --no-baseline against the coding-standards skill from #3581705 with 8 behavioral evals I wrote for this test. The first 5 cover basic patterns (2-space indentation, elseif, constructor DI, short array syntax, controller create pattern). The last 3 target patterns that contradict general PHP conventions, where I expected Sonnet to struggle:

  • B06 (uncuddled braces): Drupal puts elseif and else on their own line after the closing brace, not cuddled as } elseif. This contradicts PSR-12.
  • B07 ($this->t): Classes must use $this->t() via StringTranslationTrait, not the global t().
  • B08 (accessCheck): Entity queries must include ->accessCheck(), required since Drupal 9.2.
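
To make the three harder cases concrete, here is a minimal sketch of what string-level checks for B06 to B08 could look like. These are not the evals used in this test run; the function names and regular expressions are assumptions for illustration, and a real check would likely need a PHP parser to avoid false positives in comments and strings.

```python
import re

# Hypothetical checks, not the actual evals from this issue.
# B06: an else/elseif on the same line as the preceding closing brace.
CUDDLED_BRANCH = re.compile(r"\}[ \t]*(?:elseif|else)\b")
# B07: a bare t( call that is not a method call, variable, or longer name.
GLOBAL_T = re.compile(r"(?<![\w>:$])t\(")


def check_b06(php: str) -> bool:
    """Pass when no else/elseif is cuddled against a closing brace."""
    return CUDDLED_BRANCH.search(php) is None


def check_b07(php: str) -> bool:
    """Pass when the code never calls the global t() function."""
    return GLOBAL_T.search(php) is None


def check_b08(php: str) -> bool:
    """Pass when code that builds an entity query also calls accessCheck()."""
    uses_query = "getQuery(" in php or "entityQuery(" in php
    return (not uses_query) or "->accessCheck(" in php
```

The B08 check is deliberately coarse: it only asks whether `->accessCheck(` appears anywhere in output that builds a query, so it would miss a file where one query has the call and another does not.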

Sonnet results (3 runs, 8 cases)

Case               no skill     with skill      Delta
B01 (indent)       3/3 (100%)   3/3 (100%)         =
B02 (elseif)       3/3 (100%)   3/3 (100%)         =
B03 (DI)           3/3 (100%)   3/3 (100%)         =
B04 (arrays)       3/3 (100%)   3/3 (100%)         =
B05 (ctrl DI)      3/3 (100%)   3/3 (100%)         =
B06 (braces)       3/3 (100%)   3/3 (100%)         =
B07 ($this->t)     3/3 (100%)   3/3 (100%)         =
B08 (accessCheck)  3/3 (100%)   3/3 (100%)         =
TOTAL                  100%          100%        +0%
Avg cost           $0.0483      $0.0654       +35%

24/24 PASS on both configs. I also re-ran the baseline from an empty temp directory (no repo context, no CLAUDE.md) to rule out the model picking up hints from the project files. Same result: 8/8 PASS.
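
The comment does not say how the empty-directory run was launched. One way to sketch the idea, assuming the eval can be started as a subprocess, is to run the command with a fresh temporary directory as its working directory. `run_isolated` is a hypothetical helper, and the demo command is a stand-in rather than the real eval invocation.

```python
import subprocess
import sys
import tempfile


def run_isolated(cmd: list[str]) -> subprocess.CompletedProcess:
    """Run cmd with an empty temporary directory as its working directory.

    Hypothetical helper: nothing in the directory (no repo files, no
    CLAUDE.md) is available to the process via relative paths.
    """
    with tempfile.TemporaryDirectory() as empty_dir:
        return subprocess.run(
            cmd, cwd=empty_dir, capture_output=True, text=True, check=False
        )


if __name__ == "__main__":
    # Stand-in command that lists the working directory; replace it with
    # the real eval invocation, using absolute paths to any scripts.
    demo = run_isolated([sys.executable, "-c", "import os; print(os.listdir('.'))"])
    print(demo.stdout.strip())
```

This only controls the working directory. Configuration read from the home directory or environment variables would still be visible to the process.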

Sonnet already knows all 8 Drupal coding standard patterns, including the ones that contradict PSR-12. The skill adds 35% cost for zero quality improvement.
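
For reference, the arithmetic behind those two figures can be checked directly from the table. The helper names below are illustrative and not taken from compare.py.

```python
def pass_rate(passed: int, total: int) -> float:
    """Percentage of passing runs."""
    return 100.0 * passed / total


def relative_delta(baseline: float, candidate: float) -> float:
    """Percentage change of candidate relative to baseline."""
    return 100.0 * (candidate - baseline) / baseline


# Figures from the results table: 24/24 passes in both configurations,
# average cost $0.0483 without the skill and $0.0654 with it.
quality_delta = pass_rate(24, 24) - pass_rate(24, 24)
cost_delta = relative_delta(0.0483, 0.0654)

print(f"cost {cost_delta:+.0f}%, quality {quality_delta:+.0f}%")
# prints "cost +35%, quality +0%"
```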

I also found a bug in compare.py while testing: load_skill() could not handle directory-based skills (skills/{name}/SKILL.md). Fixed and pushed to #3582953.
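
The actual fix lives in #3582953 and is not reproduced here. As a rough sketch of the idea, a loader can try the directory-based layout first and fall back to a single file. The flat `skills/{name}.md` layout, the `skills_dir` parameter, and the error handling are all assumptions, not the real compare.py code.

```python
from pathlib import Path


def load_skill(name: str, skills_dir: Path) -> str:
    """Return the text of a skill, accepting either on-disk layout.

    Sketch only: the real compare.py signature and lookup order may differ.
    """
    candidates = [
        skills_dir / name / "SKILL.md",  # directory-based: skills/{name}/SKILL.md
        skills_dir / f"{name}.md",       # assumed flat layout: skills/{name}.md
    ]
    for path in candidates:
        if path.is_file():
            return path.read_text(encoding="utf-8")
    raise FileNotFoundError(f"No skill named {name!r} under {skills_dir}")
```

Raising when neither path exists keeps a typo in the skill name from silently turning a "with skill" run into a second baseline run.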

The "proving a new skill helps" workflow works. Coding standards is the wrong test case for Sonnet because it already has this knowledge. The value shows when testing genuine knowledge gaps (like RunTestsInSeparateProcesses in writing-automated-tests: 0% to 100%).

zorz commented:

Assigned: zorz » Unassigned
Status: Active » Needs review

pritam-osl commented:

Status  File        Size
new     impact.png  36.96 KB
new     drupal.png  24.2 KB
new     brands.png  55.44 KB

pritam-osl commented:

Status  File          Size
new     showinfo.gif  5.16 MB

webchick commented:

Status: Needs review » Fixed

"The skill adds 35% cost for zero quality improvement."

LOL, well THAT's no good. :D On the other hand, it is fantastic that we have a way to know this. :D

Thank you so much for testing (and for finding + fixing the directory bug!)

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.