Update 2026-05-08: scope expanded
This issue originally proposed a JSON Schema for the dataset YAML format alone. Re-reading #3586840 (registry), #3588426 (Jamie's browser), and #3585124 (Marcus's convergence ticket) together makes it clear that three schemas need to ship as one coherent contract, not separately:
- `dataset.schema.json`: the dataset YAML format (the original scope of this issue).
- `rubric.schema.json`: a rubric is a versioned, named composition of checks. Cases reference rubrics by `rubric_ref`. Eleven check kinds cover the current ecosystem (deterministic + `llm_judge` + `composite`).
- `judge.schema.json`: an LLM-as-judge instruction template, versioned separately from the rubric that calls it. Carries TPR/TNR validation metadata.
Splitting into three entities (case → rubric → judge) is what lets a single judge prompt serve many rubrics, and lets a single rubric serve many cases. Inlining as the original draft did would have re-locked the same problem (no reuse, no versioning, no portability) at a different layer.
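A minimal sketch of the reuse this buys (all identifiers are illustrative, and the exact field names are assumptions pending the schemas below): two cases share one rubric, and that rubric's `llm_judge` check points at one versioned judge prompt.

```yaml
# Dataset: two cases reuse the same versioned rubric.
questions:
  - id: case_a
    input: "Explain cron in Drupal."
    rubric_ref: rubric/helpfulness@1.0.0
  - id: case_b
    input: "Explain the Batch API."
    rubric_ref: rubric/helpfulness@1.0.0
# The helpfulness rubric in turn references judge/helpfulness@1.0.0,
# so one judge prompt can serve many rubrics and many cases.
```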
A separate sibling issue will propose a result envelope schema for reproducibility metadata around an eval run (dataset_version, rubric_versions, judge_versions, model id, harness identity). That work is forthcoming, target 2026-05-22, and is not a prerequisite for this issue.
Problem/Motivation
The dataset YAML format is currently defined implicitly by what each grader plugin reads. There is no JSON Schema, no formal validator, and no admin-time check that a dataset is well-formed. Errors surface only at run time, often as silent skips: a typo turns `expected_tools` into `expected_tool`, the row still loads, the grader sees no expected tools, and the row scores 0 with no warning.
The same problem repeats one layer down. Grader behavior is currently expressed three different ways across the ecosystem:
- `ai_best_practices`: `must_contain_any[]`, `must_not_contain[]`, `check_php_lint`, `check_markdown_structure{...}` inlined directly on each case in `evals.json`.
- `ai_eval`: grader plugins annotated `@AiEvalGrader`; per-target config picks which graders run. Plugin code is the rubric.
- `ai_agents_test`: TBD per Marcus's convergence ticket #3585124.
This will get worse, not better:
- The shared dataset registry proposal (#3586840) needs a contract that registry-side validation can enforce when authors save a new dataset, rubric, or judge version.
- Per-domain extensions (privacy tiers, routing classes, RAG-specific fields, agent eval_conditions) need a way to be declared without forking the format.
- IDE tooling (YAML schemas in VS Code, JetBrains) cannot help authors today because there is nothing to point the IDE at.
- Reproducibility breaks: changing a grader plugin's logic silently changes scores across every eval target using it. Versioned rubrics and judges fix this; an ungoverned schema does not.
Steps to reproduce
- Create a dataset row with a misspelled grader-specific key: `expecteed_facts` instead of `expected_facts` (see the sketch after this list).
- Run `drush ai-eval:run` against a target using `fact_match_grader`.
- Observe: the row scores low or skips with no clear "your dataset has a typo" feedback.
- Separately: change a grader plugin's threshold from 0.7 to 0.8. Re-run. Observe: every eval target's scores shift, with no audit trail tying the change to the score change.
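A minimal dataset that reproduces step 1 (content is hypothetical; note the misspelled key):

```yaml
questions:
  - id: fact_check_1
    input: "What PHP version does Drupal 11 require?"
    expecteed_facts:   # typo: should be expected_facts
      - "PHP 8.3"
```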
Proposed resolution
Ship three JSON Schemas (Draft 2020-12) under `schema/` in the `ai_eval` module:
1. `schema/dataset.schema.json`
   - Top level: object with a required `questions` array.
   - Each question: required `id` (string) and `input` (string); optional `criteria`, `expected`, `expected_facts`, `expected_tools`; optional `rubric_ref` (pointer into the rubric registry, format `rubric/{id}@{semver}`); optional `bundle` (string naming a domain-specific extension bundle such as `rag`, `agent`, or `drupal_builder`; bundle field schemas are owned per-domain by the consuming module).
   - `additionalProperties: true` at the row level so domain-specific extensions remain valid.
   - `expected.format` as an enum (`json`, `text`), since FormatGrader's accepted values are closed.
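For illustration, a row that would validate against this schema (ids and values are made up; `retrieval_top_k` stands in for a bundle-owned extension field):

```yaml
questions:
  - id: rag_citation_basic
    input: "Summarize the privacy policy and cite your sources."
    expected_facts:
      - "data is retained for 30 days"
    expected:
      format: json
    rubric_ref: rubric/rag_citation@1.2.0
    bundle: rag
    # Bundle-owned extension field, permitted by additionalProperties: true.
    retrieval_top_k: 5
```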
2. `schema/rubric.schema.json`
   - Top level: object with required `id`, `version`, `checks`, `scoring`.
   - `id`: snake_case string. `version`: semver; breaking score-distribution changes require a major bump.
   - `checks`: array of one or more checks. Eleven check kinds ship at v1: `must_contain_any`, `must_not_contain`, `regex`, `json_schema`, `php_lint`, `markdown_structure`, `tool_usage`, `format`, `fact_match`, `llm_judge`, `composite`. Each kind is a discriminated subschema.
   - `scoring.combine`: enum (`all_pass`, `any_pass`, `weighted_avg`, `min`, `max`, `median`).
   - `llm_judge` checks reference judge prompts via `judge_prompt_ref: judge/{id}@{semver}` and may declare a jury (multiple judges with weights and aggregation).
   - `composite` checks reference another rubric via `rubric_ref`; single-level depth limit at v1 (no composite-of-composite).
   - `additionalProperties: true` at the root so bundles can extend.
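A sketch of a rubric under this schema. The top-level keys and enums are as specified above; the `kind` discriminator key and the per-check field names (`values`, `jury`, `members`, `weight`) are assumptions to be settled in the draft:

```yaml
id: rag_citation
version: 1.2.0
checks:
  - kind: must_contain_any
    values: ["30 days", "thirty days"]
  - kind: llm_judge
    judge_prompt_ref: judge/groundedness@1.0.0
    # Optional jury: several judges with weights and an aggregation rule.
    jury:
      aggregation: weighted_avg
      members:
        - judge_prompt_ref: judge/groundedness@1.0.0
          weight: 0.7
        - judge_prompt_ref: judge/citation_accuracy@1.0.0
          weight: 0.3
  - kind: composite
    rubric_ref: rubric/common_safety@2.0.0  # single-level depth only at v1
scoring:
  combine: all_pass
```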
3. `schema/judge.schema.json`
   - Top level: object with required `id`, `version`, `template`.
   - `template`: Mustache-style template. Variables: `{{ input }}`, `{{ output }}`, `{{ expected }}`, `{{ context }}`.
   - `score_type`: enum (`binary`, `continuous`, `levels`); `levels` requires `level_names`.
   - `validation`: optional object with `tpr`, `tnr`, `validated_against`, `validated_at`, `sample_size`. `JudgeValidator` populates these.
   - `applicable_to`: array of domain strings where this judge has been validated.
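And a judge prompt under this schema (template wording, validation numbers, and domain strings are illustrative):

```yaml
id: groundedness
version: 1.0.0
template: |
  You are grading whether the output is grounded in the supplied context.
  Input: {{ input }}
  Output: {{ output }}
  Expected: {{ expected }}
  Context: {{ context }}
  Answer PASS or FAIL.
score_type: binary
validation:
  tpr: 0.94
  tnr: 0.91
  validated_against: human_labels_2026_04
  validated_at: "2026-04-20"
  sample_size: 200
applicable_to: [rag, drupal_builder]
```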
Loader wiring
`DatasetLoader` validates cases on load. New `RubricLoader` and `JudgePromptLoader` validate their respective shapes.
- Validate on load. Soft-fail by default: log a warning per offending row and continue running. The point is to surface mistakes, not to block legitimate domain extensions.
- Add a `strict` mode (per-target boolean) that hard-fails on validation errors for users who want CI-grade strictness.
- Reference integrity: every `rubric_ref` in cases must resolve; every `judge_prompt_ref` in rubrics must resolve; no `composite` rubric may cycle.
- Version pinning: `rubric_ref: rubric/foo@1.2` resolves to exactly version 1.2; a missing pin warns and resolves to the latest minor (configurable per target; strict mode forbids unpinned refs). See the sketch after this list.
- Judge validation freshness: if `validation.validated_at` is older than 90 days, log a warning. Strict mode requires re-validation.
- Expose validation as Drush commands: `drush ai-eval:validate-dataset {filename}`, `drush ai-eval:validate-rubric {filename}`, `drush ai-eval:validate-judge {filename}` for offline checks before committing.
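To make the pinning rules concrete, two illustrative rows (identifiers made up):

```yaml
questions:
  - id: pinned_case
    input: "..."
    rubric_ref: rubric/summary_quality@1.2   # resolves to exactly 1.2
  - id: unpinned_case
    input: "..."
    rubric_ref: rubric/summary_quality       # warns, resolves to latest minor; forbidden in strict mode
```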
Publish the schema URLs so authors can add `# yaml-language-server: $schema=...` at the top of their YAML for IDE autocomplete and inline validation.
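For example (the schema URL below is a placeholder; the real URLs will be wherever the schemas end up published):

```yaml
# yaml-language-server: $schema=https://git.drupalcode.org/project/ai_eval/-/raw/1.x/schema/dataset.schema.json
questions:
  - id: demo_case
    input: "With the header above, IDEs autocomplete and validate this file."
```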
Migration
Existing ai_eval grader plugins map onto check kinds:
- `accuracy_grader` → `llm_judge` with `judge_prompt_ref: judge/accuracy@1.0`
- `fact_match_grader` → `fact_match`
- `format_grader` → `format`
- `tool_usage_grader` → `tool_usage`
- `structured_match_grader` → `json_schema`
The plugin manager stays. Plugins become executors for check kinds rather than the rubric definition itself.
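As a sketch, the `accuracy_grader` row of the mapping above could become this v1 rubric (field names beyond the schema outline are assumptions):

```yaml
id: legacy_accuracy
version: 1.0.0
checks:
  - kind: llm_judge
    judge_prompt_ref: judge/accuracy@1.0
scoring:
  combine: all_pass
```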
Remaining tasks
- Draft `schema/dataset.schema.json`, `schema/rubric.schema.json`, `schema/judge.schema.json` (initial drafts target Friday 2026-05-15).
- Add a `justinrainbow/json-schema` (or an equivalent such as `opis/json-schema`, if preferred for Drupal contrib) dev dependency.
- Modify `DatasetLoader` to validate on load. Add a `strict` field to `EvalTarget` config.
- Implement `RubricLoader` and `JudgePromptLoader` with reference-integrity validation and cycle detection.
- Refactor existing grader plugins to act as executors for check kinds (no behavior change at this step).
- Add Drush commands `ai-eval:validate-dataset` (alias `aevd`), `ai-eval:validate-rubric` (`aevr`), `ai-eval:validate-judge` (`aevj`).
- Document in the README under "Dataset Format", "Rubric Format", and "Judge Format" sections. Link the schemas from the project page.
- Tests: a well-formed dataset/rubric/judge passes; missing required fields warn; a misspelled known field still passes (`additionalProperties`); enum violations warn; reference integrity catches dangling `rubric_ref`/`judge_prompt_ref`; cycle detection catches composite-of-composite.
User interface changes
- The `EvalTarget` form gains a "Strict validation" checkbox that applies to dataset, rubric, and judge schemas uniformly.
- The `EvalTarget` form shows validation errors inline when a target is saved with a dataset/rubric/judge whose schema validation fails (warning in soft mode, error in strict mode).
API changes
- `DatasetLoader::load()` may now log warnings per row. No signature change.
- New services: `RubricLoader` and `JudgePromptLoader` with the shape `load(string $id, ?string $version): array`. Concrete interfaces to be decided during implementation.
- New Drush commands: `ai-eval:validate-dataset` (alias `aevd`), `ai-eval:validate-rubric` (`aevr`), `ai-eval:validate-judge` (`aevj`).
- Grader plugins gain a `$check_kind` property identifying which check kind they execute. Plugin discovery still uses the `@AiEvalGrader` annotation; a new `check_kind` annotation key is added (back-compat: a missing key is inferred from the plugin id).
Data model changes
- The `EvalTarget` config entity adds a `strict_validation` boolean (default `FALSE`). Existing config keeps the current soft-fail behavior.
- Two new on-disk artifact types live beneath the dataset directory: `rubrics/{id}.yaml` and `judges/{id}.yaml`, alongside the existing dataset YAML files. This is a filesystem layout convention; no Drupal-side entity is required for v1 (the entity question is deferred to #3586840).
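An illustrative layout (file names made up):

```
datasets/
  qa_smoke.yaml          # existing dataset YAML
  rubrics/
    rag_citation.yaml
  judges/
    groundedness.yaml
```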
Out of scope for this issue
- Dataset registry storage and JSON:API endpoints — that's #3586840.
- Result envelope schema (reproducibility metadata around an eval run) — separate forthcoming issue, target 2026-05-22.
- Browser UI — that's #3588426.
- Domain bundle field schemas (rag, agent, drupal_builder, etc.) — owned per-bundle by the consuming module.
- Entity-vs-files question for where rubrics/judges/datasets ultimately live — deferred to #3586840. Files-on-disk convention is sufficient for v1; the same shape works whether the registry stores entities, content entities, or git-tracked files.