Problem/Motivation

The shipped sample dataset at data/sample.yaml contains rows written for graders that are not shipped in src/Plugin/AiEvalGrader/:

  • Rows S07, S08 use expected.route and also_valid_routes, intended for a RouteGrader plugin that is not shipped.
  • Rows S09, S10 use expected.tier, expected.entity_type, expected.project_id, and also_valid, intended for a StructuredMatchGrader plugin that is not shipped.

These rows came from the local reference implementation, where those graders are defined in a separate module. They were copied into the contrib module's sample dataset without their corresponding plugins. Anyone running drush ai-eval:run against the shipped sample with the shipped graders gets no structured validation from these rows: the grader that would read the expected fields is missing, so those fields are silently ignored.

Steps to reproduce

  1. Install ai_eval 1.0.0-alpha10.
  2. Configure an eval target pointing at data/sample.yaml with the shipped graders enabled.
  3. Run drush ai-eval:run.
  4. Observe that rows S07-S10 produce only the LLM-judge dimensions; their structured-match expectations are not validated by anything.
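
Because the fields are ignored silently, the run output alone does not say which rows are affected. A rough way to list them is sketched below. This is a hypothetical helper, not part of ai_eval: the field names are taken from this issue, but the row layout (a nested "expected" mapping, "id" and "input" keys), the grader-to-field mapping, and the T01 row are assumptions made for illustration only.

```python
# Hypothetical sketch: flag dataset fields that no shipped grader reads.
# Field names come from this issue; the row layout and the grader-to-field
# mapping are ASSUMPTIONS, not the module's real schema.

# Fields this issue names for the shipped graders (assumed, likely incomplete).
SHIPPED_GRADER_FIELDS = {
    "tool_usage_grader": {"expected_tools"},
    "fact_match_grader": {"expected_facts", "must_not_contain"},
}

# Keys that describe the row itself rather than an expectation (assumed).
ROW_METADATA_FIELDS = {"id", "input"}


def flatten(mapping, prefix=""):
    """Yield a dotted path for every non-mapping value in a nested mapping."""
    for key, value in mapping.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten(value, prefix=f"{path}.")
        else:
            yield path


def orphaned_fields(row):
    """Return the sorted dotted paths in a row that no shipped grader reads."""
    known = set().union(*SHIPPED_GRADER_FIELDS.values()) | ROW_METADATA_FIELDS
    return sorted(path for path in flatten(row) if path not in known)


# Placeholder rows shaped after the fields listed in this issue.
# All values are dummies; T01 is an invented row using a shipped grader field.
sample_rows = [
    {"id": "S07", "input": "...", "expected": {"route": "..."},
     "also_valid_routes": []},
    {"id": "S09", "input": "...",
     "expected": {"tier": "...", "entity_type": "...", "project_id": "..."},
     "also_valid": []},
    {"id": "T01", "input": "...", "expected_tools": ["..."]},
]

for row in sample_rows:
    fields = orphaned_fields(row)
    if fields:
        print(f"{row['id']}: no shipped grader reads {', '.join(fields)}")
```

To run this against the real file, the rows would first need to be loaded from data/sample.yaml with a YAML parser, and the field sets adjusted to whatever the shipped plugins actually read.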

Proposed resolution

Replace the orphaned rows with examples that exercise graders we actually ship:

  • Drop S07, S08 (RouteGrader rows). Replace with one or two tool_usage_grader examples using expected_tools.
  • Drop S09, S10 (StructuredMatchGrader rows). Replace with one or two fact_match_grader examples using expected_facts and must_not_contain.
  • Add a comment block at the top of sample.yaml noting that domain-specific fields (privacy tiers, entity types, route classification) belong in your own dataset, not the contrib's sample.

Remaining tasks

  • Patch data/sample.yaml: drop S07-S10, add two tool_usage_grader rows and two fact_match_grader rows.
  • Update README.md "Dataset Format" examples if any reference the dropped fields.
  • Ship in next alpha.

User interface changes

None.

API changes

None.

Data model changes

None. data/sample.yaml is documentation, not part of the data model.

Comments

zorz created an issue.