Update 2026-05-08: scope expanded
This issue originally proposed a JSON Schema for the dataset YAML format alone. Re-reading #3586840 (registry), #3588426 (Jamie's browser), and #3585124 (Marcus's convergence ticket) together makes it clear that three schemas need to ship as one coherent contract, not separately:
- `dataset.schema.json`: the dataset YAML format (the original scope of this issue).
- `rubric.schema.json`: a rubric is a versioned, named composition of checks. Cases reference rubrics by `rubric_ref`. Eleven check kinds cover the current ecosystem (deterministic + `llm_judge` + `composite`).
- `judge.schema.json`: an LLM-as-judge instruction template, versioned separately from the rubric that calls it. Carries TPR/TNR validation metadata.
Splitting into three entities (case → rubric → judge) is what lets a single judge prompt serve many rubrics, and lets a single rubric serve many cases. Inlining as the original draft did would have re-locked the same problem (no reuse, no versioning, no portability) at a different layer.
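A minimal sketch of the reuse this buys (all identifiers are illustrative, and the exact field names are assumptions pending the schemas below): two cases share one rubric, and that rubric's `llm_judge` check points at one versioned judge prompt.

```yaml
# Dataset: two cases reuse the same versioned rubric.
questions:
  - id: case_a
    input: "Explain cron in Drupal."
    rubric_ref: rubric/helpfulness@1.0.0
  - id: case_b
    input: "Explain the Batch API."
    rubric_ref: rubric/helpfulness@1.0.0
# The helpfulness rubric in turn references judge/helpfulness@1.0.0,
# so one judge prompt can serve many rubrics and many cases.
```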
A separate sibling issue will propose a result envelope schema for reproducibility metadata around an eval run (dataset_version, rubric_versions, judge_versions, model id, harness identity). That work is forthcoming, target 2026-05-22, and is not a prerequisite for this issue.
Problem/Motivation
The dataset YAML format is currently defined implicitly by what each grader plugin reads. There is no JSON Schema, no formal validator, and no admin-time check that a dataset is well-formed. Errors surface only at run time, often as silent skips: a typo turns `expected_tools` into `expected_tool`, the row still loads, the grader sees no expected tools, and the row scores 0 with no warning.
The same problem repeats one layer down. Grader behavior is currently expressed three different ways across the ecosystem:
- `ai_best_practices`: `must_contain_any[]`, `must_not_contain[]`, `check_php_lint`, `check_markdown_structure{...}` inlined directly on each case in `evals.json`.
- `ai_eval`: grader plugins annotated `@AiEvalGrader`; per-target config picks which graders run. Plugin code is the rubric.
- `ai_agents_test`: TBD per Marcus's convergence ticket #3585124.
This will get worse, not better:
- The shared dataset registry proposal (#3586840) needs a contract that registry-side validation can enforce when authors save a new dataset, rubric, or judge version.
- Per-domain extensions (privacy tiers, routing classes, RAG-specific fields, agent eval_conditions) need a way to be declared without forking the format.
- IDE tooling (YAML schemas in VS Code, JetBrains) cannot help authors today because there is nothing to point the IDE at.
- Reproducibility breaks: changing a grader plugin's logic silently changes scores across every eval target using it. Versioned rubrics and judges fix this; an ungoverned schema does not.
Steps to reproduce
- Create a dataset row with a misspelled grader-specific key: `expecteed_facts` instead of `expected_facts` (see the sketch after this list).
- Run `drush ai-eval:run` against a target using `fact_match_grader`.
- Observe: the row scores low or skips with no clear "your dataset has a typo" feedback.
- Separately: change a grader plugin's threshold from 0.7 to 0.8. Re-run. Observe: every eval target's scores shift, with no audit trail tying the change to the score change.
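A minimal dataset that reproduces step 1 (content is hypothetical; note the misspelled key):

```yaml
questions:
  - id: fact_check_1
    input: "What PHP version does Drupal 11 require?"
    expecteed_facts:   # typo: should be expected_facts
      - "PHP 8.3"
```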
Proposed resolution
Ship three JSON Schemas (Draft 2020-12) under `schema/` in the `ai_eval` module:
1. `schema/dataset.schema.json`
   - Top level: object with a required `questions` array.
   - Each question: required `id` (string) and `input` (string); optional `criteria`, `expected`, `expected_facts`, `expected_tools`; optional `rubric_ref` (pointer into the rubric registry, format `rubric/{id}@{semver}`); optional `bundle` (string naming a domain-specific extension bundle such as `rag`, `agent`, or `drupal_builder`; bundle field schemas are owned per-domain by the consuming module).
   - `additionalProperties: true` at the row level so domain-specific extensions remain valid.
   - `expected.format` as an enum (`json`, `text`), since FormatGrader's accepted values are closed.
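For illustration, a row that would validate against this schema (ids and values are made up; `retrieval_top_k` stands in for a bundle-owned extension field):

```yaml
questions:
  - id: rag_citation_basic
    input: "Summarize the privacy policy and cite your sources."
    expected_facts:
      - "data is retained for 30 days"
    expected:
      format: json
    rubric_ref: rubric/rag_citation@1.2.0
    bundle: rag
    # Bundle-owned extension field, permitted by additionalProperties: true.
    retrieval_top_k: 5
```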
2. `schema/rubric.schema.json`
   - Top level: object with required `id`, `version`, `checks`, `scoring`.
   - `id`: snake_case string. `version`: semver; breaking score-distribution changes require a major bump.
   - `checks`: array of one or more checks. Eleven check kinds ship at v1: `must_contain_any`, `must_not_contain`, `regex`, `json_schema`, `php_lint`, `markdown_structure`, `tool_usage`, `format`, `fact_match`, `llm_judge`, `composite`. Each kind is a discriminated subschema.
   - `scoring.combine`: enum (`all_pass`, `any_pass`, `weighted_avg`, `min`, `max`, `median`).
   - `llm_judge` checks reference judge prompts via `judge_prompt_ref: judge/{id}@{semver}` and may declare a jury (multiple judges with weights and aggregation).
   - `composite` checks reference another rubric via `rubric_ref`; single-level depth limit at v1 (no composite-of-composite).
   - `additionalProperties: true` at the root so bundles can extend.
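A sketch of a rubric under this schema. The top-level keys and enums are as specified above; the `kind` discriminator key and the per-check field names (`values`, `jury`, `members`, `weight`) are assumptions to be settled in the draft:

```yaml
id: rag_citation
version: 1.2.0
checks:
  - kind: must_contain_any
    values: ["30 days", "thirty days"]
  - kind: llm_judge
    judge_prompt_ref: judge/groundedness@1.0.0
    # Optional jury: several judges with weights and an aggregation rule.
    jury:
      aggregation: weighted_avg
      members:
        - judge_prompt_ref: judge/groundedness@1.0.0
          weight: 0.7
        - judge_prompt_ref: judge/citation_accuracy@1.0.0
          weight: 0.3
  - kind: composite
    rubric_ref: rubric/common_safety@2.0.0  # single-level depth only at v1
scoring:
  combine: all_pass
```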
3. `schema/judge.schema.json`
   - Top level: object with required `id`, `version`, `template`.
   - `template`: Mustache-style template. Variables: `{{ input }}`, `{{ output }}`, `{{ expected }}`, `{{ context }}`.
   - `score_type`: enum (`binary`, `continuous`, `levels`); `levels` requires `level_names`.
   - `validation`: optional object with `tpr`, `tnr`, `validated_against`, `validated_at`, `sample_size`. `JudgeValidator` populates these.
   - `applicable_to`: array of domain strings where this judge has been validated.
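And a judge prompt under this schema (template wording, validation numbers, and domain strings are illustrative):

```yaml
id: groundedness
version: 1.0.0
template: |
  You are grading whether the output is grounded in the supplied context.
  Input: {{ input }}
  Output: {{ output }}
  Expected: {{ expected }}
  Context: {{ context }}
  Answer PASS or FAIL.
score_type: binary
validation:
  tpr: 0.94
  tnr: 0.91
  validated_against: human_labels_2026_04
  validated_at: "2026-04-20"
  sample_size: 200
applicable_to: [rag, drupal_builder]
```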
Loader wiring
`DatasetLoader` validates cases on load. New `RubricLoader` and `JudgePromptLoader` validate their respective shapes.
- Validate on load. Soft-fail by default: log a warning per offending row and continue running. The point is to surface mistakes, not to block legitimate domain extensions.
- Add a `strict` mode (per-target boolean) that hard-fails on validation errors for users who want CI-grade strictness.
- Reference integrity: every `rubric_ref` in cases must resolve; every `judge_prompt_ref` in rubrics must resolve; no `composite` rubric may cycle.
- Version pinning: `rubric_ref: rubric/foo@1.2` resolves to exactly version 1.2; a missing pin warns and resolves to the latest minor (configurable per target; strict mode forbids unpinned refs). See the sketch after this list.
- Judge validation freshness: if `validation.validated_at` is older than 90 days, log a warning. Strict mode requires re-validation.
- Expose validation as Drush commands: `drush ai-eval:validate-dataset {filename}`, `drush ai-eval:validate-rubric {filename}`, `drush ai-eval:validate-judge {filename}` for offline checks before committing.
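To make the pinning rules concrete, two illustrative rows (identifiers made up):

```yaml
questions:
  - id: pinned_case
    input: "..."
    rubric_ref: rubric/summary_quality@1.2   # resolves to exactly 1.2
  - id: unpinned_case
    input: "..."
    rubric_ref: rubric/summary_quality       # warns, resolves to latest minor; forbidden in strict mode
```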
Publish the schema URLs so authors can add `# yaml-language-server: $schema=...` at the top of their YAML for IDE autocomplete and inline validation.
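For example (the schema URL below is a placeholder; the real URLs will be wherever the schemas end up published):

```yaml
# yaml-language-server: $schema=https://git.drupalcode.org/project/ai_eval/-/raw/1.x/schema/dataset.schema.json
questions:
  - id: demo_case
    input: "With the header above, IDEs autocomplete and validate this file."
```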
Migration
Existing ai_eval grader plugins map onto check kinds:
- `accuracy_grader` → `llm_judge` with `judge_prompt_ref: judge/accuracy@1.0`
- `fact_match_grader` → `fact_match`
- `format_grader` → `format`
- `tool_usage_grader` → `tool_usage`
- `structured_match_grader` → `json_schema`
The plugin manager stays. Plugins become executors for check kinds rather than the rubric definition itself.
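As a sketch, the `accuracy_grader` row of the mapping above could become this v1 rubric (field names beyond the schema outline are assumptions):

```yaml
id: legacy_accuracy
version: 1.0.0
checks:
  - kind: llm_judge
    judge_prompt_ref: judge/accuracy@1.0
scoring:
  combine: all_pass
```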
Remaining tasks
- Draft `schema/dataset.schema.json`, `schema/rubric.schema.json`, `schema/judge.schema.json` (initial drafts target Friday 2026-05-15).
- Add a `justinrainbow/json-schema` (or an equivalent such as `opis/json-schema`, if preferred for Drupal contrib) dev dependency.
- Modify `DatasetLoader` to validate on load. Add a `strict` field to `EvalTarget` config.
- Implement `RubricLoader` and `JudgePromptLoader` with reference-integrity validation and cycle detection.
- Refactor existing grader plugins to act as executors for check kinds (no behavior change at this step).
- Add Drush commands `ai-eval:validate-dataset` (alias `aevd`), `ai-eval:validate-rubric` (`aevr`), `ai-eval:validate-judge` (`aevj`).
- Document in the README under "Dataset Format", "Rubric Format", and "Judge Format" sections. Link the schemas from the project page.
- Tests: a well-formed dataset/rubric/judge passes; missing required fields warn; a misspelled known field still passes (`additionalProperties`); enum violations warn; reference integrity catches dangling `rubric_ref`/`judge_prompt_ref`; cycle detection catches composite-of-composite.
User interface changes
- The `EvalTarget` form gains a "Strict validation" checkbox that applies to dataset, rubric, and judge schemas uniformly.
- The `EvalTarget` form shows validation errors inline when a target is saved with a dataset/rubric/judge whose schema validation fails (warning in soft mode, error in strict mode).
API changes
- `DatasetLoader::load()` may now log warnings per row. No signature change.
- New services: `RubricLoader` and `JudgePromptLoader` with the shape `load(string $id, ?string $version): array`. Concrete interfaces to be decided during implementation.
- New Drush commands: `ai-eval:validate-dataset` (alias `aevd`), `ai-eval:validate-rubric` (`aevr`), `ai-eval:validate-judge` (`aevj`).
- Grader plugins gain a `$check_kind` property identifying which check kind they execute. Plugin discovery still uses the `@AiEvalGrader` annotation; a new `check_kind` annotation key is added (back-compat: a missing key is inferred from the plugin id).
Data model changes
- The `EvalTarget` config entity adds a `strict_validation` boolean (default `FALSE`). Existing config keeps the current soft-fail behavior.
- Two new on-disk artifact types live beneath the dataset directory: `rubrics/{id}.yaml` and `judges/{id}.yaml`, alongside the existing dataset YAML files. This is a filesystem layout convention; no Drupal-side entity is required for v1 (the entity question is deferred to #3586840).
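An illustrative layout (file names made up):

```
datasets/
  qa_smoke.yaml          # existing dataset YAML
  rubrics/
    rag_citation.yaml
  judges/
    groundedness.yaml
```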
Out of scope for this issue
- Dataset registry storage and JSON:API endpoints — that's #3586840.
- Result envelope schema (reproducibility metadata around an eval run) — separate forthcoming issue, target 2026-05-22.
- Browser UI — that's #3588426.
- Domain bundle field schemas (rag, agent, drupal_builder, etc.) — owned per-bundle by the consuming module.
- Entity-vs-files question for where rubrics/judges/datasets ultimately live — deferred to #3586840. Files-on-disk convention is sufficient for v1; the same shape works whether the registry stores entities, content entities, or git-tracked files.