[Tracker]
Update Summary: [One-line status update for stakeholders]
Short Description: Research whether ai_agents_test should be merged with ai_eval, and how the core ai module's tests/src/AiLlm harness fits into the picture.
Check-in Date: MM/DD/YYYY
[/Tracker]

Problem/Motivation

We currently have three overlapping efforts in the Drupal AI ecosystem that each cover part of "validate that an AI configuration actually works":

  • ai_agents_test - a Drupal module that lets site builders build test suites from real-world prompts (captured from testers and end-users) and run them against any AI agent configuration to validate production readiness. Focused on agent decision-making across provider/model combinations. Currently at 1.0.0-alpha4.
  • ai_eval - a broader evaluation harness with two modes (agent mode that exercises end-to-end agent plugins including tool invocation, and chat mode that hits providers directly). Ships five pluggable graders (four LLM-as-judge on a 1-5 scale for relevance/completeness/accuracy/actionability, plus one deterministic format validator), hard/soft quality gates for CI/CD, a results dashboard with week-over-week trends, and a prompt-optimization loop that proposes improved system prompts when gates fail.
  • tests/src/AiLlm in the core ai module - a low-level PHPUnit kernel test harness for running tests against real providers. Provides AiProviderTestBase, AiTestUiInterface/AiTestUiTrait for exposing PHPUnit tests through a Drupal UI, and honors an AI_PHPUNIT_TARGET_MODELS env var so the same test class can run against many provider/model pairs. Includes a FiberTest example that exercises concurrent generation. Developer-facing, code-first, and used by the ai module's own test suite.

The overlap is real but the audiences are different:

  • ai_agents_test serves site builders who want to curate prompts from users and re-run them as the agent configuration evolves.
  • ai_eval serves teams that want structured scoring, CI gates, and automated prompt iteration.
  • AiLlm serves module developers writing PHPUnit tests that need real providers (hits OpenAI/Anthropic/etc. when credentials are present, skips otherwise).

The concern is that we end up with three separate ways to say "run a prompt set against a provider", three separate storage models for "a set of prompts with expected behaviour", and three separate result dashboards. Contrib sites that want CI-grade evaluation on user-curated prompt sets would benefit from an integrated story rather than piecing together tooling.

Proposed resolution

This is a research task, not an implementation task. Outcome should be a written decision on whether to merge, align, or keep separate, plus a concrete action plan.

  • Map the features of each project side-by-side: prompt/dataset storage model, agent vs chat targets, scoring mechanism, gate/threshold behaviour, UI surface, CI integration, provider-credential handling, and test-execution surface (Drupal UI, Drush, PHPUnit).
  • Identify the intersection: prompt sets, provider/model targeting, result storage. These are likely candidates for a shared core abstraction.
  • Identify the distinctions: ai_eval's LLM-judge graders and prompt-optimization loop are unique; ai_agents_test's user-prompt-capture workflow is unique; AiLlm's PHPUnit-and-env-var targeting is unique.
  • Decide which of three paths makes sense: (a) merge ai_agents_test into ai_eval as a prompt-capture submodule; (b) keep ai_agents_test as a lightweight site-builder UI that writes datasets into ai_eval's storage model; (c) keep both independent but extract a shared "AI evaluation dataset" schema into ai core.
  • For AiLlm: decide whether the PHPUnit-style kernel test harness should stay in the ai module as the developer-facing test API, while ai_eval/ai_agents_test operate at the site-builder layer on top. Evaluate whether AiTestUiInterface should converge with ai_eval's test-run UI.
  • Survey the existing overlap with community frameworks (Guardrails AI, DeepEval, promptfoo) to make sure we're not reinventing a solved problem.
  • Publish the findings as a comment here, and open concrete follow-up issues in the affected projects with the agreed direction.

Open questions for discussion:

  • Do ai_agents_test users expect to grade the captured prompts, or just re-run and eyeball the output? If graded, that's ai_eval territory.
  • Should ai_eval's agent-mode evaluation reuse the core AiAgentInterface executor the way the ai module already does, or keep its own runner?
  • Would a shared "test dataset" config entity (prompts, optional expected behaviour, optional gate thresholds) in ai core let both contribs cooperate without hard dependencies?
  • Does it make sense for AiLlm's AI_PHPUNIT_TARGET_MODELS provider matrix to feed ai_eval's gate reports, so PHPUnit and CI-gate runs share the same signal?

AI usage (if applicable)

[x] AI Assisted Issue
This issue was generated with AI assistance, but was reviewed and refined by the creator.

[ ] AI Assisted Code

[ ] AI Generated Code

[ ] Vibe Coded


Comments

marcus_johansson created an issue. See original summary.

arianraeesi’s picture

lbesenyei’s picture

Assigned: Unassigned » lbesenyei
lbesenyei’s picture

Assigned: lbesenyei » Unassigned
Status: Active » Needs review

Feature map

  • Prompt/dataset storage

    • ai_agents_test: Prompt suites curated from user/tester prompts
    • ai_eval: Structured evaluation datasets/cases
    • AiLlm: PHPUnit test methods + fixtures/code
  • Target type

    • ai_agents_test: Agent configs (decision behavior focus)
    • ai_eval: Agent mode + chat/provider mode
    • AiLlm: Provider/model API behavior via kernel tests
  • Scoring

    • ai_agents_test: Mostly pass/fail/manual review orientation
    • ai_eval: Pluggable graders
    • AiLlm: PHPUnit assertions
  • Gates/thresholds

    • ai_agents_test: Minimal/lightweight
    • ai_eval: Hard/soft quality gates for CI/CD
    • AiLlm: PHPUnit pass/fail + CI status
  • UI surface

    • ai_agents_test: Site-builder Drupal UI
    • ai_eval: Dashboard + trends + optimization UI
    • AiLlm: Dev/test UI bridge (AiTestUiInterface) + PHPUnit
  • CI integration

    • ai_agents_test: Limited/indirect
    • ai_eval: First-class CI gate flow
    • AiLlm: Native via PHPUnit pipelines
  • Execution surface

    • ai_agents_test: Drupal UI, Drush
    • ai_eval: Drupal UI, CI
    • AiLlm: PHPUnit CLI


Intersections to unify

  • Shared primitives: dataset, target matrix (provider/model/agent config), run result record.
  • These should be canonical across contrib: one schema, one run metadata contract, one way to compare runs over time.


Distinctions to keep

  • ai_agents_test: capture/curation workflow from real user prompts.
  • ai_eval: grader framework, quality gates, trend dashboard, prompt optimization loop.
  • AiLlm: code-first kernel/provider testing with AI_PHPUNIT_TARGET_MODELS.


Answers to open questions

  • Do ai_agents_test users expect grading?

    Mostly “rerun + eyeball.” If scored regression is needed, a “score in ai_eval” button could be added that runs on the same dataset.
  • Should ai_eval agent mode reuse core executor?

    Yes. Reuse the AiAgentInterface execution path to avoid behavior drift between “prod agent run” and “eval run.”
  • Shared test dataset config entity in core?

    Yes, but not everything: prompts, optional expected behavior, optional per-case metadata/tags.

    Keep grader/gate config in ai_eval to avoid bloating core.
  • Should AI_PHPUNIT_TARGET_MODELS feed ai_eval reports?

    Yes, via an adapter. Export PHPUnit run artifacts into the shared run schema so CI signals can be viewed together.
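
To make the proposed split concrete, here is a minimal sketch in Python rather than Drupal config-entity code. All class and field names (DatasetCase, SharedDataset, EvalGateConfig) and the example prompt are hypothetical. The only things taken from the answers above are the split itself: prompts, optional expected behavior, and optional tags on the shared side, with grader and gate settings kept in ai_eval.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DatasetCase:
    """One prompt in the shared dataset (the part proposed for ai core)."""
    prompt: str
    expected_behavior: Optional[str] = None          # optional
    tags: list[str] = field(default_factory=list)    # optional per-case metadata


@dataclass
class SharedDataset:
    """Hypothetical shared dataset: holds cases only, no grading policy."""
    dataset_id: str
    cases: list[DatasetCase] = field(default_factory=list)


@dataclass
class EvalGateConfig:
    """Grader/gate settings that would stay in ai_eval, keyed by dataset id."""
    dataset_id: str
    graders: list[str]
    gate: str          # "hard" or "soft"
    threshold: float


# Illustrative data only; the prompt text is made up for this sketch.
dataset = SharedDataset(
    dataset_id="views_agent_prompts",
    cases=[DatasetCase(prompt="Create a view of published articles", tags=["views"])],
)
gate_config = EvalGateConfig(
    dataset_id=dataset.dataset_id,
    graders=["accuracy_grader", "format_grader"],
    gate="soft",
    threshold=3.0,
)
```

The point of the shape is that ai_agents_test could read and write SharedDataset without knowing anything about graders, while ai_eval attaches its own policy by reference.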


Recommended path

Keep ai_agents_test as the site-builder UX, but have it write/read datasets using ai_eval’s evaluation dataset model.
Keep tests/src/AiLlm in core ai as the developer-facing PHPUnit harness.
Add a thin integration layer so AiLlm and ai_eval can emit compatible run artifacts (shared IDs/metadata), without forcing one runtime to become the other.
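
A rough sketch of what such an integration layer could look like, assuming the PHPUnit side writes a JUnit XML log (PHPUnit's --log-junit option) and that the shared run record is a flat dictionary. The record fields (run_id, target, source, case_id, status, elapsed_s), the sample XML, and the function name are assumptions for illustration, not an agreed schema.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in JUnit XML shape; not output from a real run.
SAMPLE_JUNIT = """<testsuites>
  <testsuite name="ExampleProviderTest" tests="2" failures="1">
    <testcase name="testChat" classname="ExampleProviderTest" time="1.20"/>
    <testcase name="testEmbeddings" classname="ExampleProviderTest" time="0.40">
      <failure type="AssertionError">Unexpected vector size</failure>
    </testcase>
  </testsuite>
</testsuites>
"""


def junit_to_run_records(xml_text, run_id, target):
    """Convert JUnit XML test cases into hypothetical shared run records."""
    root = ET.fromstring(xml_text)
    records = []
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            status = "fail"
        elif case.find("skipped") is not None:
            status = "skipped"
        else:
            status = "pass"
        records.append({
            "run_id": run_id,          # shared ID so runs can be correlated
            "target": target,          # provider/model pair the run was executed against
            "source": "phpunit",
            "case_id": f'{case.get("classname")}::{case.get("name")}',
            "status": status,
            "elapsed_s": float(case.get("time", "0")),
        })
    return records


records = junit_to_run_records(SAMPLE_JUNIT, run_id="run-1", target="provider/model")
```

Neither runtime changes how it executes tests in this arrangement; only the exported artifact is normalized.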

yautja_cetanu’s picture

- We should split the Test Builder UI out from the core of the module, which runs tests and defines the schema.
- Perhaps also split out the UI for reporting on tests.

Should these become sub-modules?

zorz’s picture

Quick follow-up from the hallway conversation at Dev Days Athens, with the first concrete artifact for the BoF.

Where we landed

  • Convergence direction: yes. One user-facing story for "test your agents", not two competing modules. My reading is that "merge" in the hallway meant the user experience (one story), not the codebase (one module). The convergence that makes sense is adapter-shaped: ai_agents_test keeps the agent fixtures, ai_eval keeps the graders, gates, and dataset schema, and a bridge composes them. Both modules stay installable independently. Your "Add skill to create a test" ticket on ai_agents_test reads as adapter-shaped work, so I'm scoping accordingly. Flag if you read the conversation differently.
  • Firsthand evaluation first. You're going to install ai_eval (1.0.0-alpha8 is on drupal.org: composer require drupal/ai_eval:1.0.0-alpha8) and run it before we lock API decisions.
  • AI Initiative devs to help. Much appreciated. Contributors from either side are welcome to work directly on ai_eval under the adapter shape; collaboration doesn't require a merge. Happy to sync on who and when offline.
  • BoF case = views agent. Reproducible eval target for the failure modes you mentioned. Rather than hold results for the live demo, attaching them here so everyone reading the ticket has the same artifact, Jamie included, since he's already left Athens.

Schema landscape (quick note)

One thing worth stating up-front, since the "dataset schema stays in ai_eval" line will probably draw scrutiny: there is no single mandated LLM-as-judge schema standard in the industry. The de facto conventions are OpenInference (telemetry/tracing layer on top of OpenTelemetry), LangSmith's EvaluationResult (key / score / comment), DeepEval's G-Eval pattern, and Ragas' SingleTurnSample/MultiTurnSample. ai_eval's current shape (per-question input + expected_facts + must_not_contain; per-grader score + reason) already maps cleanly onto those: not identical field names, but same three roles (metric key, quantitative grade, natural-language justification). So keeping the schema in ai_eval isn't "our thing vs. the industry"; it's "the same shape the industry uses, named in Drupal idiom." I can post a mapping table on a follow-up comment if that would help.
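
As a small illustration of the "same three roles" point, the per-grader result (score + reason) can be projected onto the key / score / comment convention mentioned above. This is a sketch only: GraderResult and to_key_score_comment are hypothetical names, not ai_eval's or LangSmith's API, and the sample reason text is made up.

```python
from dataclasses import dataclass


@dataclass
class GraderResult:
    """Hypothetical per-grader result: which grader, what score, and why."""
    grader: str
    score: float
    reason: str


def to_key_score_comment(result: GraderResult) -> dict:
    """Rename fields to the key/score/comment roles; values are untouched."""
    return {
        "key": result.grader,      # metric key
        "score": result.score,     # quantitative grade
        "comment": result.reason,  # natural-language justification
    }


example = GraderResult(
    grader="accuracy_grader",
    score=2,
    reason="Response only returns a URL.",  # illustrative text
)
mapped = to_key_score_comment(example)
```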

What "the eval" actually is

One clarification worth landing up front, because it shapes where the adapter goes: ai_eval isn't a point-in-time test runner. The dataset lives through the product's life in three phases, and the same YAML serves all three.

  1. Dev phase, dataset is the spec. You write input + expected_facts + must_not_contain rows before you write the agent. Running the eval during development tells you when you've hit the spec. It's TDD for LLM systems.
  2. Optimize phase, dataset is the tuning signal. The optimizer iterates prompts against the dataset with guardrails (3-run aggregation, margin floor, propose-queue) that prevent overfitting. Without those, any optimizer will benchmax the test set; with them, the optimizer is a tool rather than a cheat engine.
  3. Post-launch phase, dataset is the regression guard, fed by observability. Production traces (Langfuse, OpenInference-compatible pipelines) surface anomalies that become new dataset rows. The eval grows with the product instead of going stale.

That framing is why the schema stays in ai_eval: phases 1, 2, and 3 need to write to and read from the same artefact. If the schema lived in ai_agents_test, phase 3 (observability → dataset) would need a second translation layer, and the spec-vs-trace symmetry breaks.
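
The phase-2 guardrails can be pictured roughly as below. This is a sketch under stated assumptions, not ai_eval's optimizer: it assumes "3-run aggregation" means averaging three runs of a candidate prompt, "margin floor" means a minimum improvement over the baseline, and "propose-queue" means a list that holds candidates for human review instead of applying them automatically. The function name and parameters are hypothetical.

```python
from statistics import mean


def consider_candidate(baseline_score, candidate_runs, margin_floor,
                       propose_queue, candidate_prompt):
    """Queue a candidate prompt only if its aggregated score clears the margin."""
    if len(candidate_runs) < 3:
        raise ValueError("need at least 3 runs to aggregate")
    aggregated = mean(candidate_runs)  # smooths out single-run judge noise
    if aggregated - baseline_score >= margin_floor:
        # Proposed for review, never applied automatically.
        propose_queue.append({"prompt": candidate_prompt, "score": aggregated})
        return True
    return False


queue = []
# A small gain inside the noise margin is rejected.
rejected = consider_candidate(2.2, [2.3, 2.2, 2.4], 0.25, queue, "candidate A")
# A gain above the margin floor is queued.
accepted = consider_candidate(2.2, [2.6, 2.5, 2.7], 0.25, queue, "candidate B")
```

Requiring both aggregation and a margin means a single lucky run cannot promote a prompt, which is the overfitting concern the list above describes.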

Your new skill ticket

The ticket you opened right after the meeting, #3586527 "Add skill to create a test" on ai_agents_test, reads cleanly as the phase-1 leg of the lifecycle view above: a skill that turns an ai_agents_test fixture into ai_eval dataset rows = turning an agent definition into its spec. There's a natural phase-3 counterpart (an importer that reads observability traces into dataset rows) that would close the loop. Not asking you to scope that now; just naming the shape so we can talk adapter design against the whole picture rather than one slice of it.

For the skill ticket specifically: I'd rather not sketch an interface until I've read the fixture format and seen what you're imagining for the skill itself. If it's related to the convergence, happy to co-design once you've fleshed out the ticket. If it's separate, I'll start a dedicated adapter ticket in the right place instead.

First artifact: views-agent eval target (dataset + results)

7-question dataset targeting the views_agent plugin (from ai_agents_extra). Target config views_agent_bof: soft gate, threshold 3.0, graders accuracy_grader, fact_match_grader, tool_usage_grader, format_grader. Judge: google/gemma-4-31b-it via OpenRouter.

Full dataset YAML + results JSON + target config: gist (raw reproducible bundle)

Headline

  • avg_score: 2.21 / 5.0 (threshold 3.0, soft gate → reported as FAIL)
  • pass_rate: 0% (Wilson 95% CI [0%, 39%])
  • valid_count: 6/7 (V04 crashed; see below)
  • total elapsed: 488s (Gemma 4 31B judge, OpenRouter)
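
For anyone who wants to check the headline numbers, the sketch below recomputes them from the per-question averages reported in this comment. Two assumptions are made and are not confirmed by ai_eval's code: a question counts as a pass when its average meets the threshold, and the Wilson interval is computed over the 6 valid rows (the crashed V04 is excluded). Under those assumptions the output matches the headline.

```python
import math

# Per-question averages for the 6 valid rows (V04 crashed and is excluded).
valid_avgs = [2.25, 2.00, 2.25, 2.25, 2.25, 2.25]
THRESHOLD = 3.0


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% when z=1.96)."""
    if n == 0:
        return (0.0, 1.0)
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


avg_score = sum(valid_avgs) / len(valid_avgs)
passes = sum(1 for s in valid_avgs if s >= THRESHOLD)  # assumed pass rule
low, high = wilson_interval(passes, len(valid_avgs))

print(f"avg_score={avg_score:.2f} "
      f"pass_rate={passes / len(valid_avgs):.0%} "
      f"CI=[{low:.0%}, {high:.0%}]")
# prints: avg_score=2.21 pass_rate=0% CI=[0%, 39%]
```

The wide upper bound is a reminder that zero passes out of six still leaves a lot of uncertainty about the true pass rate.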

Per-question results

ID  | avg  | accuracy | fact_match | tool_usage | format | raw response (truncated)
V01 | 2.25 | 2        | 2          | 0          | 5      | Your view is here: /admin/structure/views/view/published_articles
V02 | 2.00 | 1        | 2          | 0          | 5      | There was an error creating the node type: No entity type for field type on view recent_registered_users
V03 | 2.25 | 2        | 2          | 0          | 5      | Your view is here: /admin/structure/views/view/articles_by_editor
V04 | 0.00 | -        | -          | -          | -      | agent invocation failed: Undefined array key "vid" ViewsAgent.php:515 · Cannot load the "taxonomy_vocabulary" entity with NULL ID
V05 | 2.25 | 2        | 2          | 0          | 5      | Your view is here: /admin/structure/views/view/articles_by_tag
V06 | 2.25 | 2        | 2          | 0          | 5      | Your view is here: /admin/structure/views/view/articles_json_export
V07 | 2.25 | 2        | 2          | 0          | 5      | Your view is here: /admin/structure/views/view/recent_nodes

Failure modes

  1. URL-only responses (V01, V03, V05, V06, V07). Agent creates a view, names it plausibly, and returns only the admin URL. No evidence of what was actually configured. The "looks good, might be wrong" pattern, exactly what expected_facts ground-truth mode is meant to catch.
  2. User-entity confusion (V02). Agent errors out trying to create a user-listing view as a node type. Real error surface, not a grader artefact.
  3. Taxonomy contextual filter crash (V04). Hard crash: Undefined array key "vid" ViewsAgent.php:515 followed by Cannot load the "taxonomy_vocabulary" entity with NULL ID. Deterministic, reproducible, LLM-independent.
  4. Tool usage never matches (all rows). tool_usage_grader scores 0/5 every row. I went looking for the cause in the ai_agents code before posting: the Views Agent (ai_agents_extra/src/Plugin/AiAgent/ViewsAgent.php) never outputs tool-call markers in its response string. It calls internal class methods like $this->createView(...) and returns only the user-facing text ("Your view is here: /admin/..."). AiAgentInterface has no getTools() or getToolsInvoked() contract, no separate Tool plugin registry, and no agreed cross-agent naming convention. On the ai_eval side, tool_usage_grader only regex-parses the response string for patterns like "tool_call": "name". Since the agent never emits such markers, no regex match is possible regardless of expected-tool name. So the zero score isn't an undiscovered naming contract; it's a missing surface. This is load-bearing for the adapter design, so a concrete question below.
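
The "missing surface" in item 4 can be reproduced in a few lines. This is a sketch of the described behaviour, not tool_usage_grader's actual code: the regex is modelled on the "tool_call": "name" pattern quoted above, the all-or-nothing 0/5 scoring is an assumption, and create_view is a hypothetical tool name.

```python
import re

# Modelled on the pattern quoted above; the real grader's regex may differ.
TOOL_CALL_PATTERN = re.compile(r'"tool_call"\s*:\s*"([^"]+)"')


def tools_found(response):
    """Return tool names that appear as tool-call markers in the response text."""
    return TOOL_CALL_PATTERN.findall(response)


def tool_usage_score(response, expected_tool):
    """Assumed scoring: 5 if the expected tool marker is present, else 0."""
    return 5 if expected_tool in tools_found(response) else 0


# The agent's actual response style: plain text, no markers.
agent_response = "Your view is here: /admin/structure/views/view/published_articles"
# What a response would need to contain for the regex to match.
marked_response = '{"tool_call": "create_view"}'

score_actual = tool_usage_score(agent_response, "create_view")
score_marked = tool_usage_score(marked_response, "create_view")
```

With the plain-text response there is nothing for the pattern to match, so the score is 0 no matter which expected tool name is configured.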

Rows I need your input on before locking

  • V04 (taxonomy contextual filter): current phrasing did reproduce a crash, but is that the failure you hit, or did you have a different one in mind?
  • V06 (REST export): real failure case for you, or smoke case?

The dataset was kept to 7 rows on purpose: for a time-boxed BoF, 20 red rows are worse than 6 with clean failure modes. Happy to extend the dataset with failure modes you've hit that aren't covered here. Raise them in the gist comments or on this ticket.

What I'm taking on next

  1. Update V04/V06 once you confirm the failure shape you had in mind.
  2. Adapter sketch: how a row in an ai_agents_test fixture maps to an ai_eval dataset question. Short design comment here after I've read the fixture format.
  3. Release-notes-grade summary of what alpha8 already contains that a convergence plan should account for: judge validation workflow, response_char_limit, ground-truth expected_facts, optimizer noise controls.
  4. Optional: if useful, a mapping table between ai_eval's schema and OpenInference / LangSmith EvaluationResult, so future ai_agents_test users see the naming equivalences without having to reverse-engineer them.

Things I'd like from you

  • Does the three-phase lifecycle match how you've been thinking about ai_agents_test, or is your mental model different? Answer shapes whether the phase-3 observability leg is worth scoping together.
  • Any fixture-format constraints on the ai_agents_test side I should respect before drafting the adapter?
  • Beyond V04/V06: any views-agent failure modes you'd specifically want covered?
  • Where should tool_usage_grader read tool-call evidence from, given that today's agents don't emit tool names in the response string? Options I can see: (a) agents embed a tool-call manifest in their response, (b) a hook or event subscriber logs method invocations on the agent side, (c) AiAgentInterface gains a getToolsInvoked() method, (d) ai_eval stops grading tool usage for this class of agent and relies on accuracy_grader + fact_match_grader instead. Preference? This shapes both the adapter and whether tool_usage_grader needs a rewrite for convergence.

Thanks again for the time yesterday.

George

zorz’s picture

Update: I posted a sibling design issue at #3586840 covering a shared dataset registry for ai_eval.

It is deliberately framed as a design conversation, not a recommendation. The body presents three options for where the dataset entity lives (in drupal/ai core per @lbesenyei's comment #4 above, in ai_eval, or hybrid) and explicitly asks for input before any code work starts. The registry recipe (versioned HTTP storage, browse/edit UI, OAuth2-protected write) is split out as a separate concern that can ship in ai_eval regardless of where the entity lands.