Problem/Motivation

#3581687: Guidance on how to write excellent documentation outlines some best practices for writing documentation, captured at DrupalCon.

Now that #3581832: Create an eval framework to determine if guidance updates are making things better or worse has been added, per @zorz in #3581687-7: Guidance on how to write excellent documentation:

it would be nice to have an example of what good documentation looks like and have the eval system tinker with the skill until the score is good enough.

One way to determine that is with data. Per @hestenet at #3581687-9: Guidance on how to write excellent documentation:

This is analysis recently done by the Pronovix team as part of an STA grant to support some documentation improvements - would this help?

https://docs.google.com/spreadsheets/d/1QZXAVnVf2wKdpL-8t3y1CtuoFhyaRFPa...

For that matter, they also produced a number of other artifacts, including things like templates and a contributor checklist:

https://drive.google.com/drive/folders/1EOXk8BfvycxpTyVRZzWlvG2MSH6Qetpy

Remaining tasks

Comments

webchick created an issue. See original summary.

zorz’s picture

Assigned: Unassigned » zorz

  • a982cdbf committed on 3583241-add-evals-for-docs
    feat: #3583241 Add documentation evals and check_markdown_structure...
zorz’s picture

Assigned: zorz » Unassigned
Status: Active » Needs review

I put together evals for this skill and found a few things worth sharing. The MR is !5.

What's in the MR

  • 5 behavioral eval cases + 12 static checks for how-to-write-documentation
  • A new check_markdown_structure assertion type for grading non-code output (heading hierarchy, required sections, code blocks, paragraph count). Deterministic, no LLM judge, zero cost.
  • An eval isolation fix for compare.py and run-evals.py (details below)
  • A --cwd flag so the model runs from a neutral directory and can't see eval files
  • Updated CONTRIBUTING.md documenting all of the above
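To make the new assertion type concrete, here is a minimal sketch of what a deterministic markdown-structure check can look like. This is an illustration of the idea only, not the implementation in the MR: it covers heading hierarchy, required sections, code-block count, and paragraph count, and returns failure messages (an empty list means pass), so no LLM judge is involved.

```python
import re

def check_markdown_structure(md, required_sections=(), min_code_blocks=0,
                             min_paragraphs=0):
    """Deterministic structural checks on markdown output.

    Returns a list of failure messages; an empty list means the check passed.
    Sketch only -- the real assertion type in the MR may differ.
    """
    failures = []
    lines = md.splitlines()

    # Heading hierarchy: levels must not jump by more than one (H1 -> H3 is bad).
    levels = [len(m.group(1)) for l in lines
              if (m := re.match(r"(#{1,6})\s", l))]
    for prev, cur in zip(levels, levels[1:]):
        if cur > prev + 1:
            failures.append(f"heading level jumps from H{prev} to H{cur}")

    # Required sections must appear as heading text.
    headings = {re.sub(r"^#{1,6}\s+", "", l).strip()
                for l in lines if re.match(r"#{1,6}\s", l)}
    for section in required_sections:
        if section not in headings:
            failures.append(f"missing required section: {section}")

    # Fenced code blocks: each block opens and closes with a backtick fence.
    fences = sum(1 for l in lines if l.strip().startswith("```"))
    if fences // 2 < min_code_blocks:
        failures.append(f"expected >= {min_code_blocks} code blocks, "
                        f"found {fences // 2}")

    # Paragraphs: blank-line-separated runs that are not headings.
    paragraphs = [p for p in md.split("\n\n")
                  if p.strip() and not p.lstrip().startswith("#")]
    if len(paragraphs) < min_paragraphs:
        failures.append(f"expected >= {min_paragraphs} paragraphs, "
                        f"found {len(paragraphs)}")

    return failures
```

Because every check is a pure string inspection, the same output always grades the same way, which is what makes the zero-cost, judge-free grading possible.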

Eval isolation fix

I discovered that claude -p loads every enabled plugin, hook, skill, and MCP server from ~/.claude/settings.json by default. That means all previous eval runs were contaminated by whatever plugins the person running them had installed. The A/B delta was always clean (both configs equally contaminated), so comparative results still hold. But absolute pass rates may have been inflated.

The fix is two flags: --setting-sources "" blocks all user/project settings, and --strict-mcp-config blocks MCP servers. I'm posting a separate note about this on #3582953.
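Assuming the harness shells out to the CLI (the helper name below is illustrative, not from the MR), the isolated invocation combining the two flags with the --cwd fix looks roughly like:

```python
import subprocess

# The two isolation flags from the fix above, kept in one place so both
# compare.py and run-evals.py can share them.
ISOLATION_ARGS = [
    "--setting-sources", "",   # block all user/project settings (~/.claude/settings.json)
    "--strict-mcp-config",     # load only explicitly passed MCP servers, i.e. none
]

def run_prompt_isolated(prompt, neutral_dir):
    """Hypothetical helper: run `claude -p` fully isolated, from a neutral
    working directory so the model cannot see the eval files (the --cwd fix)."""
    cmd = ["claude", "-p", prompt, *ISOLATION_ARGS]
    result = subprocess.run(cmd, cwd=neutral_dir, capture_output=True, text=True)
    return result.stdout
```

With this shape, forgetting isolation in one script but not the other becomes impossible, since both build their command from the same argument list.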

Results

Sonnet, 3 runs, fully isolated:

Case           no skill     with skill      Delta
-------- -------------- -------------- ----------
B01          3/3 (100%)     3/3 (100%)          =
B02          3/3 (100%)     3/3 (100%)          =
B03          3/3 (100%)     3/3 (100%)          =
B04          3/3 (100%)     3/3 (100%)          =
B05          3/3 (100%)     3/3 (100%)          =
-------- -------------- -------------- ----------
TOTAL             100%           100%        +0%

     Avg output tokens          1,136          723        -413
     Avg cost/question        $0.0261      $0.0191    -$0.0070

Same story as #3583192 (coding-standards): Sonnet already knows how to write Drupal documentation. The skill adds zero accuracy improvement but makes output 36% shorter and 27% cheaper.
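The percentage figures follow directly from the averages in the table:

```python
# Savings computed from the table averages above.
tokens_no_skill, tokens_with_skill = 1136, 723
cost_no_skill, cost_with_skill = 0.0261, 0.0191

pct_shorter = round(100 * (tokens_no_skill - tokens_with_skill) / tokens_no_skill)
pct_cheaper = round(100 * (cost_no_skill - cost_with_skill) / cost_no_skill)
print(pct_shorter, pct_cheaper)  # 36 27
```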

Haiku scored 80% on both configs (delta 0%). The one failing case (B05, contrib README) fails because Haiku asks for file write permission instead of generating content inline.

Eval design

I used the Pronovix STA grant deliverables that @hestenet shared in #3581687-9 as the basis for the structural checks. Their 24-item contributor checklist maps cleanly onto deterministic assertion rules. The check_markdown_structure assertion type should be reusable for any future non-code skill.
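As a sketch of how a checklist item turns into a deterministic rule, an eval case using the new assertion type might look like this. Only the "check_markdown_structure" type name and the B05 contrib-README scenario come from this thread; every field name here is an illustrative guess at the schema, not taken from the MR:

```python
# Hypothetical eval-case definition; field names are illustrative only.
CASE_CONTRIB_README = {
    "id": "B05",
    "prompt": "Write a README.md for a contributed Drupal module.",
    "assertions": [
        {
            "type": "check_markdown_structure",
            "required_sections": ["Requirements", "Installation", "Configuration"],
            "max_heading_jump": 1,   # no skipped heading levels
            "min_code_blocks": 1,
            "min_paragraphs": 3,
        },
    ],
}
```

Each checklist item ("has an Installation section", "no skipped heading levels") maps to exactly one field, so a failing run reports precisely which criterion was missed.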

What's still open

Task 1 from the issue (picking an example of "good" documentation) is partially addressed. I used the Pronovix quality criteria to define "good" structurally rather than pointing at a single exemplar page. The DDEV installation page that @eojthebrave mentioned could be added as a golden reference test in a follow-up.

  • webchick committed 08742c6c on 1.0.x authored by zorz
    feat: #3583241 Add evals for how-to-write-documentation.md
    
    By: zorz
webchick’s picture

Status: Needs review » Fixed

I discovered that claude -p loads every enabled plugin, hook, skill, and MCP server from ~/.claude/settings.json by default. That means all previous eval runs were contaminated by whatever plugins the person running them had installed.

Oooh, that's sneaky!! Nice find!

I'm comfortable addressing the remaining issues in a follow-up (if we still think that's desired; these types of deterministic checks seem even better) sooo...

Merged! :D Thank you so much!


  • webchick committed 08742c6c on 3583203-multi-model-comparison authored by zorz
    feat: #3583241 Add evals for how-to-write-documentation.md
    
    By: zorz

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.