Test eval "creating a new skill" scenario [#3583192]

Problem/Motivation

#3582953: "Document how to run evals in various scenarios" adds a comparison script and documentation for several eval scenarios.

Let's test each of them with real-world scenarios from the issue queue.

Scenario 2: Proving a new skill helps

Test before/after on #3581705: Guidance on coding standards

Comment	File	Size	Author
#6	showinfo.gif	5.16 MB	pritam-osl
#5	brands.png	55.44 KB	pritam-osl
#5	drupal.png	24.2 KB	pritam-osl
#5	impact.png	36.96 KB	pritam-osl

Comments

Comment #1

5 April 2026 at 19:33

webchick created an issue. See original summary.

Comment #2

zorz commented 5 April 2026 at 20:34

Assigned:	Unassigned	» zorz

I ran compare.py --no-baseline against the coding-standards skill from #3581705 with 8 behavioral evals I wrote for this test. The first 5 cover basic patterns (2-space indentation, elseif, constructor DI, short array syntax, controller create pattern). The last 3 target patterns that contradict general PHP conventions, where I expected Sonnet to struggle:

B06 (uncuddled braces): Drupal puts elseif and else on their own line after the closing brace, not cuddled as } elseif. This contradicts PSR-12.
B07 ($this->t): Classes must use $this->t() via StringTranslationTrait, not the global t().
B08 (accessCheck): Entity queries must call ->accessCheck() explicitly. Omitting it has been deprecated since Drupal 9.2 and throws an exception in Drupal 10.
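
To make the three harder patterns concrete, here is a minimal sketch of how such checks could be written as regex heuristics over generated PHP. These are not the evals used in this test. The function names and the sample snippets are hypothetical, and a real check would likely need a proper PHP parser or PHP_CodeSniffer rather than regexes.

```python
import re

# Hypothetical heuristics, not the actual B06-B08 evals from this issue.

def has_cuddled_else(php: str) -> bool:
    """True if else/elseif sits on the same line as the preceding closing brace."""
    return re.search(r"\}[ \t]*(else|elseif)\b", php) is not None

def uses_global_t(php: str) -> bool:
    """True if a bare t() call appears (not $this->t() or SomeClass::t())."""
    return re.search(r"(?<![\w>:$])t\(", php) is not None

def entity_query_missing_access_check(php: str) -> bool:
    """True if the snippet builds an entity query but never calls accessCheck()."""
    builds_query = re.search(r"(->getQuery\(|entityQuery\()", php) is not None
    return builds_query and "->accessCheck(" not in php

# Sample PHP that follows the Drupal patterns described above.
good = """
if ($a) {
  $msg = $this->t('Yes');
}
else {
  $ids = $storage->getQuery()->accessCheck(TRUE)->execute();
}
"""

# Sample PHP that violates all three patterns.
bad = """
if ($a) {
  $msg = t('Yes');
} else {
  $ids = $storage->getQuery()->execute();
}
"""

for name, check in [
    ("B06", has_cuddled_else),
    ("B07", uses_global_t),
    ("B08", entity_query_missing_access_check),
]:
    print(name, "good fails:", check(good), "bad fails:", check(bad))
```

Each check returns True when the snippet violates the pattern, so a passing case is one where all three return False.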

Sonnet results (3 runs, 8 cases)

Case               no skill     with skill      Delta
B01 (indent)       3/3 (100%)   3/3 (100%)         =
B02 (elseif)       3/3 (100%)   3/3 (100%)         =
B03 (DI)           3/3 (100%)   3/3 (100%)         =
B04 (arrays)       3/3 (100%)   3/3 (100%)         =
B05 (ctrl DI)      3/3 (100%)   3/3 (100%)         =
B06 (braces)       3/3 (100%)   3/3 (100%)         =
B07 ($this->t)     3/3 (100%)   3/3 (100%)         =
B08 (accessCheck)  3/3 (100%)   3/3 (100%)         =
TOTAL                  100%          100%        +0%
Avg cost           $0.0483      $0.0654       +35%

24/24 PASS on both configs. I also re-ran the baseline from an empty temp directory (no repo context, no CLAUDE.md) to rule out the model picking up hints from the project files. Same result: 8/8 PASS.

Sonnet already knows all 8 Drupal coding standard patterns, including the ones that contradict PSR-12. The skill adds 35% cost for zero quality improvement.
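
For anyone checking the arithmetic, the deltas in the table can be reproduced from the reported numbers. This is only a sketch of the calculation; the helper names are made up here and are not taken from compare.py.

```python
def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage."""
    return 100.0 * passed / total

def relative_delta(baseline: float, candidate: float) -> float:
    """Relative change from baseline to candidate, as a percentage."""
    return 100.0 * (candidate - baseline) / baseline

# Figures reported above: 24/24 passes in both configs, average cost per run in USD.
no_skill_rate = pass_rate(24, 24)
with_skill_rate = pass_rate(24, 24)
quality_delta = with_skill_rate - no_skill_rate
cost_delta = relative_delta(0.0483, 0.0654)

print(f"quality: {quality_delta:+.0f} points, cost: {cost_delta:+.0f}%")
# prints "quality: +0 points, cost: +35%"
```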

I also found a bug in compare.py while testing: load_skill() could not handle directory-based skills (skills/{name}/SKILL.md). Fixed and pushed to #3582953.
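
For context, a loader that accepts directory-based skills might look roughly like the sketch below. This is not the code pushed to #3582953. The signature, the fallback to a flat skills/{name}.md file, and the error handling are all assumptions made for illustration; only the skills/{name}/SKILL.md layout comes from this issue.

```python
import tempfile
from pathlib import Path

def load_skill(skills_dir: Path, name: str) -> str:
    """Return the skill text, preferring the directory layout.

    Assumed lookup order (hypothetical):
      1. skills/{name}/SKILL.md  (directory-based skill)
      2. skills/{name}.md        (flat file, assumed legacy layout)
    """
    candidates = [skills_dir / name / "SKILL.md", skills_dir / f"{name}.md"]
    for path in candidates:
        if path.is_file():
            return path.read_text(encoding="utf-8")
    raise FileNotFoundError(f"No skill named {name!r} under {skills_dir}")

# Usage: build a throwaway directory-based skill and load it.
with tempfile.TemporaryDirectory() as tmp:
    skills = Path(tmp) / "skills"
    (skills / "coding-standards").mkdir(parents=True)
    (skills / "coding-standards" / "SKILL.md").write_text(
        "Use 2-space indentation.", encoding="utf-8"
    )
    loaded = load_skill(skills, "coding-standards")

print(loaded)
```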

The "proving a new skill helps" workflow works. Coding standards is the wrong test case for Sonnet because it already has this knowledge. The value shows when testing genuine knowledge gaps (like RunTestsInSeparateProcesses in writing-automated-tests: 0% to 100%).

Comment #4

zorz commented 6 April 2026 at 05:34

Assigned:	zorz	» Unassigned
Status:	Active	» Needs review

Comment #5

pritam-osl commented 6 April 2026 at 05:53

Status	File	Size
new	impact.png	36.96 KB
new	drupal.png	24.2 KB
new	brands.png	55.44 KB

Comment #6

pritam-osl commented 6 April 2026 at 05:56

Status	File	Size
new	showinfo.gif	5.16 MB

Comment #7

webchick commented 6 April 2026 at 07:07

Status:	Needs review	» Fixed

"The skill adds 35% cost for zero quality improvement."

LOL, well THAT's no good. :D On the other hand, it is fantastic that we have a way to know this. :D

Thank you so much for testing (and for finding + fixing the directory bug!)

Comment #9

20 April 2026 at 07:10

Status:	Fixed	» Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.
