[Tracker]
Update Summary: [One-line status update for stakeholders]
Short Description: Research whether ai_agents_test should be merged with ai_eval, and how the core ai module's tests/src/AiLlm harness fits into the picture.
Check-in Date: MM/DD/YYYY
[/Tracker]
Problem/Motivation
We currently have three overlapping efforts in the Drupal AI ecosystem that each cover part of "validate that an AI configuration actually works":
- ai_agents_test - a Drupal module that lets site builders build test suites from real-world prompts (captured from testers and end-users) and run them against any AI agent configuration to validate production readiness. Focused on agent decision-making across provider/model combinations. Currently at 1.0.0-alpha4.
- ai_eval - a broader evaluation harness with two modes (agent mode that exercises end-to-end agent plugins including tool invocation, and chat mode that hits providers directly). Ships five pluggable graders (four LLM-as-judge on a 1-5 scale for relevance/completeness/accuracy/actionability, plus one deterministic format validator), hard/soft quality gates for CI/CD, a results dashboard with week-over-week trends, and a prompt-optimization loop that proposes improved system prompts when gates fail.
- tests/src/AiLlm in the core ai module - a low-level PHPUnit kernel test harness for running tests against real providers. Provides AiProviderTestBase, AiTestUiInterface/AiTestUiTrait for exposing PHPUnit tests through a Drupal UI, and honors an AI_PHPUNIT_TARGET_MODELS env var so the same test class can run against many provider/model pairs. Includes a FiberTest example that exercises concurrent generation. Developer-facing, code-first, and used by the ai module's own test suite.
The overlap is real but the audiences are different:
- ai_agents_test serves site builders who want to curate prompts from users and re-run them as the agent configuration evolves.
- ai_eval serves teams that want structured scoring, CI gates, and automated prompt iteration.
- AiLlm serves module developers writing PHPUnit tests that need real providers (hits OpenAI/Anthropic/etc. when credentials are present, skips otherwise).
The concern is that we end up with three separate ways to say "run a prompt set against a provider", three separate storage models for "a set of prompts with expected behaviour", and three separate result dashboards. Contrib sites that want CI-grade evaluation on user-curated prompt sets would benefit from an integrated story rather than piecing together tooling.
Proposed resolution
This is a research task, not an implementation task. Outcome should be a written decision on whether to merge, align, or keep separate, plus a concrete action plan.
- Map the features of each project side-by-side: prompt/dataset storage model, agent vs chat targets, scoring mechanism, gate/threshold behaviour, UI surface, CI integration, provider-credential handling, and test-execution surface (Drupal UI, Drush, PHPUnit).
- Identify the intersection: prompt sets, provider/model targeting, result storage. These are likely candidates for a shared core abstraction.
- Identify the distinctions: ai_eval's LLM-judge graders and prompt-optimization loop are unique; ai_agents_test's user-prompt-capture workflow is unique; AiLlm's PHPUnit-and-env-var targeting is unique.
- Decide which of three paths makes sense: (a) merge ai_agents_test into ai_eval as a prompt-capture submodule; (b) keep ai_agents_test as a lightweight site-builder UI that writes datasets into ai_eval's storage model; (c) keep both independent but extract a shared "AI evaluation dataset" schema into ai core.
- For AiLlm: decide whether the PHPUnit-style kernel test harness should stay in the ai module as the developer-facing test API, while ai_eval/ai_agents_test operate at the site-builder layer on top. Evaluate whether AiTestUiInterface should converge with ai_eval's test-run UI.
- Survey the existing overlap with community frameworks (Guardrails AI, DeepEval, promptfoo) to make sure we're not reinventing a solved problem.
- Publish the findings as a comment here, and open concrete follow-up issues in the affected projects with the agreed direction.
Open questions for discussion:
- Do ai_agents_test users expect to grade the captured prompts, or just re-run and eyeball the output? If graded, that's ai_eval territory.
- Should ai_eval's agent-mode evaluation reuse the core AiAgentInterface executor the way the ai module already does, or keep its own runner?
- Would a shared "test dataset" config entity (prompts, optional expected behaviour, optional gate thresholds) in ai core let both contribs cooperate without hard dependencies?
- Does it make sense for AiLlm's AI_PHPUNIT_TARGET_MODELS provider matrix to feed ai_eval's gate reports, so PHPUnit and CI-gate runs share the same signal?
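To make the shared "test dataset" question easier to discuss, here is a minimal sketch of what such an entity could hold: prompts, optional expected behaviour, optional gate thresholds. It is written in Python only to show the shape. Every class and field name is hypothetical, and none of it reflects an existing config entity in ai core, ai_eval, or ai_agents_test.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DatasetCase:
    """One captured prompt plus optional expectations (hypothetical fields)."""
    prompt: str
    expected_behaviour: Optional[str] = None
    tags: list[str] = field(default_factory=list)


@dataclass
class EvalDataset:
    """A named prompt set that several tools could read without depending on each other."""
    dataset_id: str
    cases: list[DatasetCase] = field(default_factory=list)
    gate_threshold: Optional[float] = None  # optional; None means "no gate configured"

    def ungraded_cases(self) -> list[DatasetCase]:
        """Cases with no expected behaviour, i.e. re-run-and-eyeball only."""
        return [case for case in self.cases if case.expected_behaviour is None]


# Illustrative data only; the prompts are invented for this sketch.
dataset = EvalDataset(
    dataset_id="demo_prompts",
    cases=[
        DatasetCase(prompt="Create a view of published articles",
                    expected_behaviour="Returns a link to the new view"),
        DatasetCase(prompt="List recent nodes"),
    ],
)
```

Keeping the expectation and threshold fields optional is what would let a capture-and-rerun tool and a grading tool share one entity: the first ignores those fields, and the second requires them only for the cases it scores.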
AI usage (if applicable)
[x] AI Assisted Issue
This issue was generated with AI assistance, but was reviewed and refined by the creator.
[ ] AI Assisted Code
[ ] AI Generated Code
[ ] Vibe Coded
- This issue was created with the help of AI
Comments
Comment #2
arianraeesi commented
Comment #3
lbesenyei commented
Comment #4
lbesenyei commented
Feature map
Prompt/dataset storage
Target type
Scoring
Gates/thresholds
UI surface
CI integration
Execution surface
Intersections to unify
Distinctions to keep
Answers to open questions
Mostly “rerun + eyeball,” but if scored regression is needed, a “score in ai_eval” button could be added that runs against the same dataset.
Yes. Reuse the AiAgentInterface execution path to avoid behavior drift between “prod agent run” and “eval run.”
Yes, but not everything: prompts, optional expected behavior, optional per-case metadata/tags.
Keep grader/gate config in ai_eval to avoid bloating core.
Yes via adapter. Export PHPUnit run artifacts into shared run schema so CI signals can be viewed together.
Recommended path
Keep ai_agents_test as the site-builder UX, but have it write/read datasets using ai_eval’s evaluation dataset model.
Keep tests/src/AiLlm in core ai as the developer-facing PHPUnit harness.
Add a thin integration layer so AiLlm and ai_eval can emit compatible run artifacts (shared IDs/metadata), without forcing one runtime to become the other.
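One way to picture the "compatible run artifacts" idea is a small record that both runtimes emit, keyed by shared dataset and case IDs. The sketch below is in Python purely for illustration. The RunArtifact fields and the group_by_case helper are hypothetical and do not describe any existing AiLlm or ai_eval output format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunArtifact:
    """One result row; every field name here is hypothetical."""
    source: str      # which runtime produced it, e.g. "phpunit" or "ai_eval"
    dataset_id: str  # shared dataset identifier
    case_id: str     # shared per-prompt identifier
    provider: str
    model: str
    passed: bool


def group_by_case(artifacts):
    """Index artifacts from both runtimes by (dataset_id, case_id)."""
    grouped = {}
    for artifact in artifacts:
        key = (artifact.dataset_id, artifact.case_id)
        grouped.setdefault(key, []).append(artifact)
    return grouped


# Invented example rows: the same case reported by two runtimes.
artifacts = [
    RunArtifact("phpunit", "demo_dataset", "case_1", "example_provider", "example_model", True),
    RunArtifact("ai_eval", "demo_dataset", "case_1", "example_provider", "example_model", False),
]
grouped = group_by_case(artifacts)

# Serialise one artifact the way a CI job might store it.
serialised = json.dumps(asdict(artifacts[0]), sort_keys=True)
```

Because the only coupling is the pair of shared IDs, neither runtime has to adopt the other's execution model. A report can simply line up the rows that share a key.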
Comment #5
yautja_cetanu commented
- We should split out the Test Builder UI from the core of the module that runs tests and defines the schema.
- Perhaps also split out the UI for reporting on tests.
Into sub-modules?
Comment #6
zorz commented
Quick follow-up from the hallway conversation at Dev Days Athens, with the first concrete artifact for the BoF.
Where we landed
composer require drupal/ai_eval:1.0.0-alpha8) and run it before we lock API decisions.
Schema landscape (quick note)
One thing worth stating up-front, since the "dataset schema stays in ai_eval" line will probably draw scrutiny: there is no single mandated LLM-as-judge schema standard in the industry. The de facto conventions are OpenInference (telemetry/tracing layer on top of OpenTelemetry), LangSmith's EvaluationResult (key/score/comment), DeepEval's G-Eval pattern, and Ragas' SingleTurnSample/MultiTurnSample. ai_eval's current shape (per-question input + expected_facts + must_not_contain; per-grader score + reason) already maps cleanly onto those: not identical field names, but the same three roles (metric key, quantitative grade, natural-language justification). So keeping the schema in ai_eval isn't "our thing vs. the industry"; it's "the same shape the industry uses, named in Drupal idiom." I can post a mapping table in a follow-up comment if that would help.
What "the eval" actually is
One clarification worth landing up front, because it shapes where the adapter goes: ai_eval isn't a point-in-time test runner. The dataset lives through the product's life in three phases, and the same YAML serves all three.
input + expected_facts + must_not_contain rows before you write the agent. Running the eval during development tells you when you've hit the spec. It's TDD for LLM systems.
That framing is why the schema stays in ai_eval: phases 1, 2, and 3 need to write to and read from the same artefact. If the schema lived in ai_agents_test, phase 3 (observability → dataset) would need a second translation layer, and the spec-vs-trace symmetry breaks.
Your new skill ticket
The ticket you opened right after the meeting, #3586527 "Add skill to create a test" on ai_agents_test, reads cleanly as the phase-1 leg of the lifecycle view above: a skill that turns an ai_agents_test fixture into ai_eval dataset rows = turning an agent definition into its spec. There's a natural phase-3 counterpart (an importer that reads observability traces into dataset rows) that would close the loop. Not asking you to scope that now; just naming the shape so we can talk adapter design against the whole picture rather than one slice of it.
For the skill ticket specifically: I'd rather not sketch an interface until I've read the fixture format and seen what you're imagining for the skill itself. If it's related to the convergence, happy to co-design once you've fleshed out the ticket. If it's separate, I'll start a dedicated adapter ticket in the right place instead.
First artifact: views-agent eval target (dataset + results)
7-question dataset targeting the views_agent plugin (from ai_agents_extra). Target config views_agent_bof: soft gate, threshold 3.0, graders accuracy_grader, fact_match_grader, tool_usage_grader, format_grader. Judge: google/gemma-4-31b-it via OpenRouter.
Full dataset YAML + results JSON + target config: gist (raw reproducible bundle)
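For readers unfamiliar with soft versus hard gates, the sketch below shows one plausible reading of "soft gate, threshold 3.0". It assumes the gate compares the mean grader score against the threshold, and that a soft gate warns where a hard gate fails. Both are assumptions made for illustration, not confirmed ai_eval behaviour. The evaluate_gate function is hypothetical, and Python is used only as neutral notation.

```python
from statistics import mean


def evaluate_gate(grader_scores, threshold, mode):
    """Return "pass", "warn" (soft gate miss) or "fail" (hard gate miss)."""
    if mode not in ("soft", "hard"):
        raise ValueError(f"unknown gate mode: {mode}")
    # Assumption: the gate looks at the plain average across graders.
    average = mean(grader_scores.values())
    if average >= threshold:
        return "pass"
    return "fail" if mode == "hard" else "warn"


# Invented scores for illustration only; they are not taken from the BoF results.
example_scores = {"accuracy_grader": 2.0, "format_grader": 3.0}
soft_result = evaluate_gate(example_scores, threshold=3.0, mode="soft")
hard_result = evaluate_gate(example_scores, threshold=3.0, mode="hard")
```

Under these assumptions the same below-threshold scores yield a warning in soft mode and a failure in hard mode, which is the practical difference for a CI pipeline.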
Headline
Per-question results
- Your view is here: /admin/structure/views/view/published_articles
- There was an error creating the node type: No entity type for field type on view recent_registered_users
- Your view is here: /admin/structure/views/view/articles_by_editor
- Undefined array key "vid" ViewsAgent.php:515 · Cannot load the "taxonomy_vocabulary" entity with NULL ID
- Your view is here: /admin/structure/views/view/articles_by_tag
- Your view is here: /admin/structure/views/view/articles_json_export
- Your view is here: /admin/structure/views/view/recent_nodes
Failure modes
- expected_facts ground-truth mode is meant to catch.
- Undefined array key "vid" ViewsAgent.php:515 followed by Cannot load the "taxonomy_vocabulary" entity with NULL ID. Deterministic, reproducible, LLM-independent.
- tool_usage_grader scores 0/5 on every row. I went looking for the cause in the ai_agents code before posting: the Views Agent (ai_agents_extra/src/Plugin/AiAgent/ViewsAgent.php) never outputs tool-call markers in its response string. It calls internal class methods like $this->createView(...) and returns only the user-facing text ("Your view is here: /admin/..."). AiAgentInterface has no getTools() or getToolsInvoked() contract, no separate Tool plugin registry, and no agreed cross-agent naming convention. On the ai_eval side, tool_usage_grader only regex-parses the response string for patterns like "tool_call": "name". Since the agent never emits such markers, no regex match is possible regardless of expected-tool name. So the zero score isn't an undiscovered naming contract; it's a missing surface. This is load-bearing for the adapter design, so a concrete question below.
Rows I need your input on before locking
The 7 rows were kept small on purpose: 20 red rows is worse than 6 with clean failure modes for a time-boxed BoF. Happy to extend the dataset with failure modes you've hit that aren't covered here. Raise them in the gist comments or on this ticket.
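The tool_usage_grader zero score described under Failure modes can be reproduced in isolation. The sketch below is a Python approximation of "regex-parse the response string for "tool_call": "name"". The exact pattern ai_eval uses is not quoted in this thread, so TOOL_CALL_PATTERN is an assumption, and create_view is a hypothetical tool name.

```python
import re

# Assumed pattern, modelled on the "tool_call": "name" shape quoted above.
TOOL_CALL_PATTERN = re.compile(r'"tool_call"\s*:\s*"([^"]+)"')


def tools_in_response(response):
    """Return every tool name that appears as a tool-call marker in the text."""
    return TOOL_CALL_PATTERN.findall(response)


# The kind of user-facing text the Views Agent returns: no markers at all.
views_agent_response = "Your view is here: /admin/structure/views/view/published_articles"

# A hypothetical response that does embed a marker (option (a) in the question below).
marked_response = 'Your view is here: /admin/... {"tool_call": "create_view"}'

found_plain = tools_in_response(views_agent_response)
found_marked = tools_in_response(marked_response)
```

With the plain response the match list is empty whatever tool name is expected, which is why the grader cannot score above zero until evidence of tool calls is exposed somewhere it can read.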
What I'm taking on next
- response_char_limit, ground-truth expected_facts, optimizer noise controls.
- EvaluationResult, so future ai_agents_test users see the naming equivalences without having to reverse-engineer them.
Things I'd like from you
- Where should tool_usage_grader read tool-call evidence from, given that today's agents don't emit tool names in the response string? Options I can see: (a) agents embed a tool-call manifest in their response, (b) a hook or event subscriber logs method invocations on the agent side, (c) AiAgentInterface gains a getToolsInvoked() method, (d) ai_eval stops grading tool usage for this class of agent and relies on accuracy_grader + fact_match_grader instead. Preference? This shapes both the adapter and whether tool_usage_grader needs a rewrite for convergence.
Thanks again for the time yesterday.
George
Comment #7
zorz commented
Update: I posted a sibling design issue at #3586840 covering a shared dataset registry for ai_eval.
It is deliberately framed as a design conversation, not a recommendation. The body presents three options for where the dataset entity lives (in drupal/ai core per @lbesenyei's comment #4 above, in ai_eval, or hybrid) and explicitly asks for input before any code work starts. The registry recipe (versioned HTTP storage, browse/edit UI, OAuth2-protected write) is split out as a separate concern that can ship in ai_eval regardless of where the entity lands.