Problem/Motivation
#3581687: Guidance on how to write excellent documentation outlines some best practices around writing documentation captured at DrupalCon.
Now that #3581832: Create an eval framework to determine if guidance updates are making things better or worse has been added, per @zorz in #3581687-7: Guidance on how to write excellent documentation:
It would be nice to have an example of what good documentation looks like and have the eval system tinker with the skill until the score is good enough.
One way to know that could be with data. Per @hestenet at #3581687-9: Guidance on how to write excellent documentation:
This is analysis recently done by the Pronovix team as part of an STA grant to support some documentation improvements - would this help?
https://docs.google.com/spreadsheets/d/1QZXAVnVf2wKdpL-8t3y1CtuoFhyaRFPa...
For that matter, they also produced a number of other artifacts, including things like templates and a contributor checklist:
https://drive.google.com/drive/folders/1EOXk8BfvycxpTyVRZzWlvG2MSH6Qetpy
Remaining tasks
- Decide on an example of "good" documentation (@eojthebrave mentioned https://www.drupal.org/docs/getting-started/installing-drupal/install-dr... for the handbook anyway)
- Formulate into eval framework language
Comments
Comment #2
zorz commented
Comment #4
zorz commented
I put together evals for this skill and found a few things worth sharing. The changes are in MR !5.
What's in the MR
- how-to-write-documentation
- check_markdown_structure assertion type for grading non-code output (heading hierarchy, required sections, code blocks, paragraph count). Deterministic, no LLM judge, zero cost.
- compare.py and run-evals.py (details below)
- --cwd flag so the model runs from a neutral directory and can't see eval files
Eval isolation fix
I discovered that claude -p loads every enabled plugin, hook, skill, and MCP server from ~/.claude/settings.json by default. That means all previous eval runs were contaminated by whatever plugins the person running them had installed. The A/B delta was always clean (both configs equally contaminated), so comparative results still hold. But absolute pass rates may have been inflated.
The fix is two flags: --setting-sources "" blocks all user/project settings, and --strict-mcp-config blocks MCP servers. I'm posting a separate note about this on #3582953.
Results
Sonnet, 3 runs, fully isolated:
Same story as #3583192 (coding-standards): Sonnet already knows how to write Drupal documentation. The skill adds zero accuracy improvement but makes output 36% shorter and 27% cheaper.
Haiku scored 80% on both configs (delta 0%). The one failing case (B05, contrib README) fails because Haiku asks for file write permission instead of generating content inline.
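For anyone who wants to reproduce the isolation described under "Eval isolation fix", here is a minimal sketch of how an eval runner might invoke the model. It is not the code from the MR: the helper names build_isolated_command and run_isolated are hypothetical, and running the subprocess from a temporary directory is only my assumption about what a --cwd style option does. The two claude flags are the ones named above.

```python
import subprocess
import tempfile


def build_isolated_command(prompt: str) -> list[str]:
    """Build a `claude -p` command line with the two isolation flags.

    Hypothetical helper for illustration; not taken from the MR.
    """
    return [
        "claude",
        "-p", prompt,
        "--setting-sources", "",  # empty value: load no user/project settings
        "--strict-mcp-config",    # do not pick up MCP servers from other configs
    ]


def run_isolated(prompt: str) -> str:
    """Run the command from a throwaway directory and return stdout.

    Assumption: a neutral working directory keeps eval files out of the
    model's view, similar in spirit to the --cwd flag mentioned above.
    """
    with tempfile.TemporaryDirectory() as neutral_dir:
        result = subprocess.run(
            build_isolated_command(prompt),
            cwd=neutral_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    return result.stdout
```

Building the argument list separately from running it makes it easy to check that both flags are present without actually calling the CLI.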
Eval design
I used the Pronovix STA grant deliverables that @hestenet shared in #3581687-9 as the basis for the structural checks. Their 24-item contributor checklist maps cleanly onto deterministic assertion rules. The check_markdown_structure assertion type should be reusable for any future non-code skill.
What's still open
Task 1 from the issue (picking an example of "good" documentation) is partially addressed. I used the Pronovix quality criteria to define "good" structurally rather than pointing at a single exemplar page. The DDEV installation page that @eojthebrave mentioned could be added as a golden reference test in a follow-up.
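To make "good, structurally" a bit more concrete, here is a rough sketch of what a deterministic structure check along these lines could look like. It is not the implementation in the MR: the function signature, parameter names, and failure messages are all assumptions made for illustration, and it only covers the four kinds of checks named above (heading hierarchy, required sections, code blocks, paragraph count).

```python
import re

FENCE = "`" * 3  # fenced-code marker, built indirectly to keep this example readable


def check_markdown_structure(text, required_sections=(), min_code_blocks=0, min_paragraphs=0):
    """Return a list of failure messages; an empty list means the text passes.

    Hypothetical signature for illustration only.
    """
    failures = []

    # Pull out fenced code blocks first so '#' comments inside code
    # are not mistaken for headings.
    fence_re = re.compile(rf"^{FENCE}.*?^{FENCE}[ \t]*$", re.MULTILINE | re.DOTALL)
    code_blocks = fence_re.findall(text)
    prose = fence_re.sub("", text)

    # Heading hierarchy: a heading may go at most one level deeper than the previous one.
    headings = re.findall(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", prose, re.MULTILINE)
    previous_level = 0
    for hashes, title in headings:
        level = len(hashes)
        if level > previous_level + 1:
            failures.append(f"heading '{title}' skips from level {previous_level} to {level}")
        previous_level = level

    # Required sections: compared by heading title, case-insensitively.
    titles = {title.lower() for _, title in headings}
    for section in required_sections:
        if section.lower() not in titles:
            failures.append(f"missing required section '{section}'")

    if len(code_blocks) < min_code_blocks:
        failures.append(f"expected at least {min_code_blocks} code block(s), found {len(code_blocks)}")

    # Paragraphs: blank-line-separated blocks that do not start with a heading marker.
    paragraphs = [
        block for block in re.split(r"\n\s*\n", prose)
        if block.strip() and not block.lstrip().startswith("#")
    ]
    if len(paragraphs) < min_paragraphs:
        failures.append(f"expected at least {min_paragraphs} paragraph(s), found {len(paragraphs)}")

    return failures
```

Returning a list of messages rather than a single boolean keeps the check deterministic while still telling you which checklist item failed. A golden reference page could then be run through the same function as a follow-up test.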
Comment #6
webchick
Oooh, that's sneaky!! Nice find!
I'm comfortable addressing the remaining issues in a follow-up (if we still think that's desired; these types of deterministic checks seem even better) sooo...
Merged! :D Thank you so much!