Update 2026-05-08: scope expanded

This issue originally proposed a JSON Schema for the dataset YAML format alone. After re-reading #3586840 (registry), #3588426 (Jamie's browser), and #3585124 (Marcus's convergence ticket) together, I'm convinced that three schemas need to ship as one coherent contract, not separately:

  1. dataset.schema.json — the dataset YAML format (original scope of this issue).
  2. rubric.schema.json — a rubric is a versioned, named composition of checks. Cases reference rubrics by rubric_ref. Eleven check kinds cover the current ecosystem (deterministic + llm_judge + composite).
  3. judge.schema.json — an LLM-as-judge instruction template, versioned separately from the rubric that calls it. Carries TPR/TNR validation metadata.

Splitting into three entities (case → rubric → judge) is what lets a single judge prompt serve many rubrics, and a single rubric serve many cases. Inlining everything per case, as the original draft did, would have recreated the same problems (no reuse, no versioning, no portability) one layer down.
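
To make the chain concrete, here is a minimal sketch with invented ids and values; the field names follow the proposed schemas detailed below:

```yaml
# datasets/routing.yaml: a case pins a rubric version via rubric_ref
questions:
  - id: saffron_recipe_search
    input: "Find recipes that use saffron."
    rubric_ref: rubric/tool_routing@1.0
---
# rubrics/tool_routing.yaml: the rubric pins a judge version via judge_prompt_ref
id: tool_routing
version: 1.0.0
checks:
  - kind: llm_judge
    judge_prompt_ref: judge/accuracy@1.0
scoring:
  combine: all_pass
```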

A separate sibling issue will propose a result envelope schema for reproducibility metadata around an eval run (dataset_version, rubric_versions, judge_versions, model id, harness identity). That work is forthcoming, target 2026-05-22, and is not a prerequisite for this issue.

Problem/Motivation

The dataset YAML format is currently defined implicitly by whatever each grader plugin happens to read. There is no JSON Schema, no formal validator, and no admin-time check that a dataset is well-formed. Errors surface only at run time, often as silent failures: a typo turns expected_tools into expected_tool, the row still loads, the grader sees no expected tools, and the row scores 0 with no warning.

The same problem repeats one layer down. Grader behavior is currently expressed three different ways across the ecosystem:

  • ai_best_practices: must_contain_any[], must_not_contain[], check_php_lint, check_markdown_structure{...} inlined directly on each case in evals.json.
  • ai_eval: grader plugins annotated @AiEvalGrader; per-target config picks which graders run. Plugin code is the rubric.
  • ai_agents_test: TBD per Marcus's convergence ticket #3585124.

This will get worse, not better:

  • The shared dataset registry proposal (#3586840) needs a contract that registry-side validation can enforce when authors save a new dataset, rubric, or judge version.
  • Per-domain extensions (privacy tiers, routing classes, RAG-specific fields, agent eval_conditions) need a way to be declared without forking the format.
  • IDE tooling (YAML schemas in VS Code, JetBrains) cannot help authors today because there is nothing to point the IDE at.
  • Reproducibility breaks: changing a grader plugin's logic silently changes scores across every eval target that uses it. Versioned rubrics and judges fix this; an ungoverned, implicit format does not.

Steps to reproduce

  1. Create a dataset row with a misspelled grader-specific key: expecteed_facts instead of expected_facts.
  2. Run drush ai-eval:run against a target using fact_match_grader.
  3. Observe: the row scores low or is skipped, with no clear "your dataset has a typo" feedback.
  4. Separately: change a grader plugin's threshold from 0.7 to 0.8. Re-run. Observe: every eval target's scores shift, with no audit trail tying the change to the score change.

Proposed resolution

Ship three JSON Schemas (Draft 2020-12) at schema/ in the ai_eval module:

1. schema/dataset.schema.json

  • Top level: object with required questions array.
  • Each question: required id (string) and input (string); optional criteria, expected, expected_facts, expected_tools.
  • Optional rubric_ref: a pointer into the rubric registry, format rubric/{id}@{semver}.
  • Optional bundle (string): names a domain-specific extension bundle such as rag, agent, or drupal_builder. Bundle field schemas are owned per-domain by the consuming module.
  • additionalProperties: true at the row level so domain-specific extensions remain valid.
  • expected.format as an enum (json, text) since FormatGrader's accepted values are closed.
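
As an illustration, here is a row that would validate against this shape. The ids and values are invented, and retrieval_corpus stands in for an arbitrary domain extension:

```yaml
questions:
  - id: australia_capital
    input: "What is the capital of Australia?"
    expected_facts:
      - Canberra
    expected:
      format: text                  # closed enum: json | text
    rubric_ref: rubric/factuality@1.0
    bundle: rag                     # field schema owned by the consuming module
    retrieval_corpus: city_facts    # unknown key, still valid (additionalProperties: true)
```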

2. schema/rubric.schema.json

  • Top level: object with required id, version, checks, scoring.
  • id: snake_case string.
  • version: semver. Breaking score-distribution changes require major bump.
  • checks: array of one or more checks. Eleven check kinds shipped at v1: must_contain_any, must_not_contain, regex, json_schema, php_lint, markdown_structure, tool_usage, format, fact_match, llm_judge, composite. Each kind is a discriminated subschema.
  • scoring.combine: enum (all_pass, any_pass, weighted_avg, min, max, median).
  • llm_judge checks reference judge prompts via judge_prompt_ref: judge/{id}@{semver} and may declare a jury (multiple judges with weights and aggregation).
  • composite checks reference another rubric via rubric_ref; single-level depth limit at v1 (no composite-of-composite).
  • additionalProperties: true at root so bundles can extend.
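
A hypothetical rubric against this shape; the per-check fields (values, weight) are assumptions here, since v1 fixes only the kinds and the top-level contract:

```yaml
id: drupal_code_quality
version: 1.0.0
checks:
  - kind: php_lint
    weight: 0.2
  - kind: must_not_contain
    values: ["eval(", "mysql_query("]
    weight: 0.3
  - kind: llm_judge
    judge_prompt_ref: judge/code_review@1.0
    weight: 0.5
scoring:
  combine: weighted_avg
```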

3. schema/judge.schema.json

  • Top level: object with required id, version, template.
  • template: Mustache-style template. Variables: {{ input }}, {{ output }}, {{ expected }}, {{ context }}.
  • score_type: enum (binary, continuous, levels); levels requires level_names.
  • validation: optional object with tpr, tnr, validated_against, validated_at, sample_size. JudgeValidator populates these.
  • applicable_to: array of domain strings where this judge has been validated.
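
A hypothetical judge file against this shape; the template text and validation numbers are invented placeholders, not measured values:

```yaml
id: accuracy
version: 1.0.0
template: |
  Compare the response to the expected answer.
  Input: {{ input }}
  Response: {{ output }}
  Expected: {{ expected }}
  Reply with PASS or FAIL only.
score_type: binary
validation:
  tpr: 0.94                 # placeholder; JudgeValidator writes the real numbers
  tnr: 0.91
  validated_against: golden_set_v3
  validated_at: 2026-04-20
  sample_size: 200
applicable_to: [rag, drupal_builder]
```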

Loader wiring

DatasetLoader validates cases on load. New RubricLoader and JudgePromptLoader validate their respective shapes.

  • Validate on load. Soft-fail by default: log a warning per offending row, continue running. The point is to surface mistakes, not to block legitimate domain extensions.
  • Add a strict mode (per-target boolean) that hard-fails on validation errors for users who want CI-grade strictness.
  • Reference integrity: every rubric_ref in a case must resolve, every judge_prompt_ref in a rubric must resolve, and composite rubric references must not form cycles.
  • Version pinning: rubric_ref: rubric/foo@1.2 resolves to exact version 1.2; a missing pin warns and resolves to the latest minor (configurable per-target: strict mode forbids unpinned refs). See the sketch after this list.
  • Judge validation freshness: if validation.validated_at is older than 90 days, log a warning. Strict mode requires re-validation.
  • Expose validation as Drush commands: drush ai-eval:validate-dataset {filename}, drush ai-eval:validate-rubric {filename}, drush ai-eval:validate-judge {filename} for offline checks before committing.
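
The pinning rules, restated in YAML with an invented id:

```yaml
# Exact pin: resolves to version 1.2 and nothing else.
rubric_ref: rubric/factuality@1.2

# Unpinned: logs a warning and resolves to the latest minor.
# Strict mode rejects unpinned refs outright.
# rubric_ref: rubric/factuality
```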

Publish the schema URLs so authors can add # yaml-language-server: $schema=... at the top of their YAML for IDE autocomplete and inline validation.
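
For example, assuming the schemas end up at stable git.drupalcode.org raw URLs (the exact URLs are still to be decided), a dataset file could start with:

```yaml
# yaml-language-server: $schema=https://git.drupalcode.org/project/ai_eval/-/raw/1.x/schema/dataset.schema.json
questions:
  - id: my_first_case
    input: "..."
```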

Migration

Existing ai_eval grader plugins map onto check kinds:

  • accuracy_grader → llm_judge with judge_prompt_ref: judge/accuracy@1.0
  • fact_match_grader → fact_match
  • format_grader → format
  • tool_usage_grader → tool_usage
  • structured_match_grader → json_schema

The plugin manager stays. Plugins become executors for check kinds rather than the rubric definition itself.
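
As a migration sketch, an ai_best_practices-style inline case re-expressed as a v1 rubric (the check values are invented for illustration):

```yaml
id: best_practices_legacy
version: 1.0.0
checks:
  - kind: must_contain_any
    values: ["#cache", "getCacheTags"]
  - kind: must_not_contain
    values: ["\\Drupal::service("]
  - kind: php_lint
scoring:
  combine: all_pass
```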

Remaining tasks

  • Draft schema/dataset.schema.json, schema/rubric.schema.json, schema/judge.schema.json (initial drafts target Friday 2026-05-15).
  • Add justinrainbow/json-schema (or an equivalent such as opis/json-schema, if preferred for Drupal contrib) as a dependency. Because the loaders validate at runtime, this must be a regular require, not require-dev.
  • Modify DatasetLoader to validate on load. Add the strict_validation field to EvalTarget config.
  • Implement RubricLoader and JudgePromptLoader with reference-integrity validation and cycle detection.
  • Refactor existing grader plugins to act as executors for check kinds (no behavior change at this step).
  • Add Drush commands ai-eval:validate-dataset (alias aevd), ai-eval:validate-rubric (aevr), ai-eval:validate-judge (aevj).
  • Document in README under "Dataset Format", "Rubric Format", "Judge Format" sections. Link the schemas from the project page.
  • Tests: well-formed dataset/rubric/judge passes; missing required fields warns; misspelled known field still passes (additionalProperties); enum violations warn; reference integrity catches dangling rubric_ref / judge_prompt_ref; cycle detection catches composite-of-composite.

User interface changes

  • EvalTarget form gains a "Strict validation" checkbox that applies to dataset, rubric, and judge schemas uniformly.
  • EvalTarget form shows validation errors inline when a target is saved with a dataset/rubric/judge whose schema validation fails (warning in soft mode, error in strict mode).

API changes

  • DatasetLoader::load() may now log warnings per row. No signature change.
  • New services: RubricLoader and JudgePromptLoader with shape load(string $id, ?string $version): array. Concrete interfaces to be decided during implementation.
  • New Drush commands: ai-eval:validate-dataset (alias aevd), ai-eval:validate-rubric (aevr), ai-eval:validate-judge (aevj).
  • Grader plugins gain a $check_kind property identifying which check kind they execute. Plugin discovery still uses the @AiEvalGrader annotation; a new check_kind annotation key is added (back-compat: when the key is missing, the kind is inferred from the plugin id).

Data model changes

  • EvalTarget config entity adds strict_validation boolean (default FALSE). Existing config keeps current soft-fail behavior.
  • Two new on-disk artifact types beneath the dataset directory: rubrics/{id}.yaml and judges/{id}.yaml, alongside the existing dataset YAML files. This is a filesystem layout convention; no Drupal-side entity is required for v1 (the entity question is deferred to #3586840).
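
A sketch of the resulting layout; the file names are invented, and only the rubrics/ and judges/ directories come from this proposal:

```
datasets/
  routing.yaml            # existing dataset YAML
  factuality.yaml
  rubrics/
    tool_routing.yaml     # resolved by rubric_ref: rubric/tool_routing@{semver}
  judges/
    accuracy.yaml         # resolved by judge_prompt_ref: judge/accuracy@{semver}
```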

Out of scope for this issue

  • Dataset registry storage and JSON:API endpoints — that's #3586840.
  • Result envelope schema (reproducibility metadata around an eval run) — separate forthcoming issue, target 2026-05-22.
  • Browser UI — that's #3588426.
  • Domain bundle field schemas (rag, agent, drupal_builder, etc.) — owned per-bundle by the consuming module.
  • Entity-vs-files question for where rubrics/judges/datasets ultimately live: deferred to #3586840. The files-on-disk convention is sufficient for v1; the same shape works whether the registry stores config entities, content entities, or git-tracked files.

Issue fork ai_eval-3586842

Comments

zorz created an issue. See original summary.

zorz:

Title: Add a JSON Schema for the dataset YAML format » Add JSON Schemas for the dataset, rubric, and judge formats