Problem/Motivation

This issue is a design conversation, filed as a sibling to #3585124 (research: merging ai_agents_test with ai_eval). It is not a recommendation and not a unilateral commitment by the ai_eval maintainer. The purpose is to make the registry idea legible to the AI Initiative team so the boundary discussion can happen in writing, in parallel with the convergence ticket.

Datasets in ai_eval today are flat YAML files at a configured path on the local filesystem. This works for single-site, single-team usage but creates friction in three real scenarios:

  1. Multi-site agencies maintaining the same eval suite across many client installations: copy-paste workflow, no shared source of truth.
  2. Reproducibility and traceability under EU AI Act-style obligations: "what was version 1.2 of dataset X on date Y?" has no immutable answer in a filesystem-based model.
  3. Convergence with ai_agents: ai_agents_test needs test fixtures of essentially the same shape as ai_eval datasets. Two parallel storage layers will diverge.

A user-facing UI editor for local files (#3586770) would help case (1) but not (2) or (3), and risks freezing the wrong abstraction.

Steps to reproduce

Not applicable. This is a design discussion.

Proposed resolution

Two concerns, deliberately split so the boundary question can be answered without dragging the implementation question with it:

  • (a) Where the dataset entity lives. Schema, fields, validation rules, versioning semantics. This is the boundary question that overlaps with #3585124.
  • (b) The registry recipe. Versioned HTTP-served storage, browse/edit UI, OAuth2-protected write, audit log. Once (a) is decided, (b) is independent of it and can ship in ai_eval regardless of where the entity lives.

Three options for (a). No recommendation pending input from the AI Initiative team:

Option A: entity in drupal/ai core

  • Aligns with @lbesenyei's comment #4 on #3585124 ("shared test dataset config entity in core").
  • ai_eval, ai_agents_test, AiLlm all consume the same entity type.
  • Schema governance lives with the AI Initiative.
  • ai_eval's role: graders, gates, optimizer, results. The registry recipe (b) still makes sense in ai_eval.
  • Cost: requires a change in drupal/ai core, and downstream modules cannot rely on the entity until that change lands on a known timeline.

Option B: entity in ai_eval

  • What the original draft of this issue assumed.
  • ai_eval defines the entity, ai_agents_test reads it through the adapter pattern from #3585124.
  • Schema governance lives with ai_eval, with explicit invitation for ai_agents_test contributors.
  • Cost: schema decisions made in ai_eval may not get full AI Initiative buy-in retroactively.

Option C: hybrid

  • Entity type and base schema in drupal/ai core.
  • Bundles, registry recipe, and editing UX in ai_eval.
  • Per-domain extensions live on bundles defined by consumers (ai_eval ships rag, agent, classification, judge_validation bundles; ai_agents_test could ship agent_fixture as another bundle on the same entity type).
  • Cost: more moving parts, slowest to design, but cleanest separation of governance from product.

Concern (b), the registry recipe, is sketched below to show its shape regardless of which option wins on (a):

  • New module ai_eval_registry (sub-module of ai_eval). Recipe-installable. Any Drupal site can be a registry.
  • JSON:API endpoints for the read hot path (GET dataset {id} version {v}). OAuth2 for write.
  • UI: standard Drupal entity forms, a structured/YAML editor, a diff view, and a per-version audit log.
  • Consuming side: new @AiEvalDatasetSource plugin type. Existing file-based loader becomes the default FileSource plugin; new RegistrySource plugin for the registry. Eval target config grows a dataset_source field; existing configs default to file and keep working unchanged.
  • Caching: stale-on-failure on the consumer side so registry downtime does not break eval runs.
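The stale-on-failure behaviour on the consuming side can be sketched concretely. The following is a minimal Python sketch, not the actual ai_eval plugin API: class and method names are illustrative, and `fetch` stands in for the HTTP call to the registry.

```python
import time


class RegistrySource:
    """Sketch of a consumer-side dataset source with stale-on-failure caching.

    Hypothetical names; the real implementation would be a Drupal plugin.
    `fetch` is a callable (dataset_id, version) -> payload dict, standing in
    for the HTTP request to the registry.
    """

    def __init__(self, fetch, ttl=3600):
        self._fetch = fetch
        self._ttl = ttl        # seconds before a cached copy counts as stale
        self._cache = {}       # (dataset_id, version) -> (payload, fetched_at)

    def load(self, dataset_id, version):
        key = (dataset_id, version)
        cached = self._cache.get(key)
        fresh = cached is not None and time.time() - cached[1] < self._ttl
        if fresh:
            return cached[0]
        try:
            payload = self._fetch(dataset_id, version)
        except Exception:
            if cached is not None:
                # Registry unreachable: serve the stale copy so eval runs
                # keep working (stale-on-failure).
                return cached[0]
            raise
        self._cache[key] = (payload, time.time())
        return payload
```

The key property is in the `except` branch: registry downtime only fails a run when there is no cached copy at all; otherwise the last known version is served.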

What this is not:

  • Not a unilateral commitment. The boundary question (a) is open. Implementation depends on the answer.
  • Not replacing file-based datasets. FileSource stays the default for single-site users. Registry is opt-in.
  • Not a hosted-product pitch. Reference implementation is open-source Drupal; anyone can self-host. A public PointBlank instance might exist for convenience but is not the gating path.
  • Not a position on graders or gates. Graders, gates, optimization, scoring, and judge validation stay in ai_eval regardless of where the dataset entity lives.

Remaining tasks

Open decisions (gate code work):

  1. Pick option A, B, or C for the dataset entity location. Input requested from @marcus_johansson, @lbesenyei, @yautja_cetanu. If the answer is "core" but the core change is not on a known timeline, please say so explicitly so consumers can decide whether to wait or to ship in ai_eval and migrate later.
  2. Bundles or single-shape entity? Mature eval ecosystems (HuggingFace datasets, LangSmith, DeepEval, Argilla) all model per-domain extensions: a RAG eval row carries different fields from an agent-routing eval row from a classification eval row. Bundles, or an equivalent per-domain extension mechanism, look necessary regardless of which option above wins. Formal schema work to be tracked separately.
  3. Convergence with ai_agents_test fixtures. Same entity type with separate bundles, or two distinct entity types?
  4. Auth model for registry writes. OAuth2 against the registry, or piggyback on the consuming site's existing auth?
  5. Canonical instance. Self-host only? PointBlank-hosted reference instance? Long-term move to a drupal.org-hosted official registry? This is the lowest-priority decision and can be revisited.

Implementation tasks (gated on the decisions above):

  • Define eval_dataset entity schema in the agreed module location, with versioning semantics.
  • Implement @AiEvalDatasetSource plugin type in ai_eval. Refactor existing loader into FileSource plugin.
  • Implement RegistrySource plugin (HTTP fetch, cache, stale-on-failure).
  • Implement registry-side: entity (or entity consumer), JSON:API config, OAuth2 scopes, admin UI.
  • Recipe to install registry mode on a fresh site.
  • Migration drush command for existing local datasets.
  • Documentation in README and integration.md.

User interface changes

On the consuming Drupal site (any option):

  • Eval target form gains a dataset_source select (file, registry).
  • When source is registry, the dataset field becomes a structured input: registry URL, dataset ID, and version pin.
  • Settings form gains a registry credentials section (OAuth2 client ID/secret) when at least one target uses a registry source.
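To make the consuming-site change concrete, a hypothetical exported eval target config could look like the fragment below. Only dataset_source and the version pin are taken from this summary; every other key name is illustrative.

```yaml
# Hypothetical EvalTarget config export (illustrative key names).
id: rag_smoke_suite
label: 'RAG smoke suite'
dataset_source: registry
dataset:
  registry_url: 'https://registry.example.com'
  dataset_id: 'rag_baseline'
  version: '1.2'
```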

On a registry-mode site (new, ships in ai_eval regardless of option):

  • Admin browse/list of dataset entities with version history and diff view.
  • Entity form for creating and editing datasets, with structured editor (YAML or table view) and JSON-schema validation surfaced inline.
  • Per-version immutable audit log view.

API changes

Independent of which option wins on (a):

  • New plugin type @AiEvalDatasetSource in ai_eval. Existing direct file-loader call sites refactor to go through the plugin manager. FileSource ships as the default plugin and preserves current behavior.
  • New service interface for dataset retrieval: DatasetSourceInterface::load(string $id, ?string $version): array.
  • EvalTarget config entity grows a dataset_source property (default: file). Existing config keeps working without change.
  • JSON:API endpoints exposed by registry-mode sites:
    • GET /jsonapi/eval_dataset/{id}/version/{v}: read.
    • POST /jsonapi/eval_dataset/{id}/versions: write (creates new immutable version), OAuth2 protected.
    • GET /jsonapi/eval_dataset/{id}/versions: list versions.
  • OAuth2 scopes: ai_eval_registry:read, ai_eval_registry:write.
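As a shape sketch only, a pinned read on the hot path might look like the exchange below. The path comes from the endpoint list above; the host, headers, and payload are illustrative, and a real response body would follow the JSON:API document structure.

```
GET /jsonapi/eval_dataset/rag_baseline/version/1.2 HTTP/1.1
Host: registry.example.com
Accept: application/vnd.api+json

HTTP/1.1 200 OK
Content-Type: application/vnd.api+json

{"data": {"type": "eval_dataset", "id": "rag_baseline",
          "attributes": {"schema_version": "1.2", "questions": [...]}}}
```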

Specific to option B (entity in ai_eval) only:

  • ai_eval ships the entity type definition and bundle storage handlers.

Specific to options A and C (entity in core):

  • Core ships the entity type. ai_eval ships bundles (option C) or just consumes (option A).

Data model changes

On the consuming site (any option):

  • EvalTarget config entity adds dataset_source string property (default: file).
  • No schema change for existing tables.

Dataset entity (location depends on outcome of decision 1):

  • Content entity with revisions enabled.
  • Bundles by dataset type (rag, agent, classification, judge_validation) for per-domain field schemas, in line with conventions across mature eval ecosystems. Single-shape entity left as an option for discussion under decision 2 above.
  • Core fields: label, machine_name, owner, visibility (public/private/org), tags, schema_version, questions (JSON payload), changelog, created, changed.
  • Bundle-specific fields declared by the bundle's owning module.
  • Revisions are immutable once published. Edits create new revisions; old revisions remain addressable by version pin.
  • Audit log: rely on Drupal core revisions plus standard watchdog entries. No new tables.
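For illustration, one published revision carrying the core fields above might serialize as the YAML below. Field names follow the core-field list; all values, and the internal structure of the questions payload in particular, are hypothetical and not yet specified anywhere in this issue.

```yaml
# Illustrative payload for one published dataset revision.
# Field names from the core-field list above; values hypothetical.
label: 'RAG baseline'
machine_name: rag_baseline
owner: zorz
visibility: org
tags: [rag, regression]
schema_version: '1.2'
questions:
  - input: 'What is the refund policy?'
    expected_context: ['policies/refunds.md']
changelog: 'Add refund-policy rows.'
```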

Comments

zorz created an issue. See original summary.

  • zorz committed 502a5aa9 on main
    docs(proposals): seed d.o issue drafts and post-BoF design conversation...