Research: merging ai_agents_test with ai_eval (and AiLlm test harness) (#3585124)

>>> [!note] Migrated issue
Reported by: [marcus_johansson](https://www.drupal.org/user/385947)
>>>

[Tracker]
Update Summary: [One-line status update for stakeholders]
Short Description: Research whether ai_agents_test should be merged with ai_eval, and how the core ai module's tests/src/AiLlm harness fits into the picture.
Check-in Date: MM/DD/YYYY
[/Tracker]

<h3 id="summary-problem-motivation">Problem/Motivation</h3>
We currently have three overlapping efforts in the Drupal AI ecosystem that each cover part of "validate that an AI configuration actually works":
<ul>
<li>ai_agents_test - a Drupal module that lets site builders build test suites from real-world prompts (captured from testers and end-users) and run them against any AI agent configuration to validate production readiness. Focused on agent decision-making across provider/model combinations. Currently at 1.0.0-alpha4.</li>
<li>ai_eval - a broader evaluation harness with two modes (<code>agent</code> mode that exercises end-to-end agent plugins including tool invocation, and <code>chat</code> mode that hits providers directly). Ships five pluggable graders (four LLM-as-judge on a 1-5 scale for relevance/completeness/accuracy/actionability, plus one deterministic format validator), hard/soft quality gates for CI/CD, a results dashboard with week-over-week trends, and a prompt-optimization loop that proposes improved system prompts when gates fail.</li>
<li>tests/src/AiLlm in the core ai module - a low-level PHPUnit kernel test harness for running tests against real providers. Provides <code>AiProviderTestBase</code>, <code>AiTestUiInterface</code>/<code>AiTestUiTrait</code> for exposing PHPUnit tests through a Drupal UI, and honors an <code>AI_PHPUNIT_TARGET_MODELS</code> env var so the same test class can run against many provider/model pairs. Includes a <code>FiberTest</code> example that exercises concurrent generation.
Developer-facing, code-first, and used by the ai module's own test suite.</li>
</ul>
The overlap is real, but the audiences are different:
<ul>
<li>ai_agents_test serves site builders who want to curate prompts from users and re-run them as the agent configuration evolves.</li>
<li>ai_eval serves teams that want structured scoring, CI gates, and automated prompt iteration.</li>
<li>AiLlm serves module developers writing PHPUnit tests that need real providers (it hits OpenAI/Anthropic/etc. when credentials are present and skips otherwise).</li>
</ul>
The concern is that we end up with three separate ways to say "run a prompt set against a provider", three separate storage models for "a set of prompts with expected behaviour", and three separate result dashboards. Contrib sites that want CI-grade evaluation on user-curated prompt sets would benefit from an integrated story rather than having to piece tooling together.
<h3 id="summary-proposed-resolution">Proposed resolution</h3>
This is a research task, not an implementation task. The outcome should be a written decision on whether to merge, align, or keep the projects separate, plus a concrete action plan.
<ul>
<li>Map the features of each project side-by-side: prompt/dataset storage model, agent vs chat targets, scoring mechanism, gate/threshold behaviour, UI surface, CI integration, provider-credential handling, and test-execution surface (Drupal UI, Drush, PHPUnit).</li>
<li>Identify the intersection: prompt sets, provider/model targeting, result storage.
These are likely candidates for a shared core abstraction.</li>
<li>Identify the distinctions: ai_eval's LLM-judge graders and prompt-optimization loop are unique; ai_agents_test's user-prompt-capture workflow is unique; AiLlm's PHPUnit-and-env-var targeting is unique.</li>
<li>Decide which of three paths makes sense: (a) merge ai_agents_test into ai_eval as a prompt-capture submodule; (b) keep ai_agents_test as a lightweight site-builder UI that writes datasets into ai_eval's storage model; (c) keep both independent but extract a shared "AI evaluation dataset" schema into ai core.</li>
<li>For AiLlm: decide whether the PHPUnit-style kernel test harness should stay in the ai module as the developer-facing test API, while ai_eval/ai_agents_test operate at the site-builder layer on top. Evaluate whether <code>AiTestUiInterface</code> should converge with ai_eval's test-run UI.</li>
<li>Survey the existing overlap with community frameworks (Guardrails AI, DeepEval, promptfoo) to make sure we're not reinventing a solved problem.</li>
<li>Publish the findings as a comment here, and open concrete follow-up issues in the affected projects with the agreed direction.</li>
</ul>
Open questions for discussion:
<ul>
<li>Do ai_agents_test users expect to grade the captured prompts, or just re-run and eyeball the output?
If graded, that's ai_eval territory.</li>
<li>Should ai_eval's agent-mode evaluation reuse the core <code>AiAgentInterface</code> executor the way the ai module already does, or keep its own runner?</li>
<li>Would a shared "test dataset" config entity (prompts, optional expected behaviour, optional gate thresholds) in ai core let both contribs cooperate without hard dependencies?</li>
<li>Does it make sense for AiLlm's <code>AI_PHPUNIT_TARGET_MODELS</code> provider matrix to feed ai_eval's gate reports, so PHPUnit and CI-gate runs share the same signal?</li>
</ul>
<h3 id="summary-ai-usage">AI usage (if applicable)</h3>
[x] AI Assisted Issue
This issue was generated with AI assistance, but was reviewed and refined by the creator.
[ ] AI Assisted Code
[ ] AI Generated Code
[ ] Vibe Coded - This issue was created with the help of AI
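
To make the open question about a shared "test dataset" (prompts, optional expected behaviour, optional gate thresholds) easier to discuss, here is a rough sketch of what such a structure and a gate check could look like. It is written in Python only to keep the illustration short; a real version would presumably be a Drupal config entity in PHP. `PromptCase`, `Gate`, `EvalDataset` and `check_gates` are hypothetical names, not existing APIs in ai, ai_eval or ai_agents_test. Reading "hard" as run-failing and "soft" as warning-only is also an assumption, as is averaging the 1-5 judge scores per grader.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class PromptCase:
    """One captured prompt; the expected behaviour is optional."""
    prompt: str
    expected_behaviour: Optional[str] = None


@dataclass
class Gate:
    """Minimum mean score for one grader (assumed 1-5 judge scale)."""
    grader: str
    min_score: float
    hard: bool = True  # assumption: hard gates fail the run, soft gates only warn


@dataclass
class EvalDataset:
    """Hypothetical shared dataset: prompts plus optional gate thresholds."""
    name: str
    cases: List[PromptCase]
    gates: List[Gate] = field(default_factory=list)


def check_gates(
    dataset: EvalDataset, scores: Dict[str, List[float]]
) -> Tuple[bool, List[str], List[str]]:
    """Compare mean per-grader scores against the dataset's gates.

    `scores` maps a grader name to its per-case scores.
    Returns (passed, hard_failures, soft_warnings).
    """
    hard_failures: List[str] = []
    soft_warnings: List[str] = []
    for gate in dataset.gates:
        values = scores.get(gate.grader, [])
        if values:
            mean = sum(values) / len(values)
            if mean >= gate.min_score:
                continue  # gate satisfied
            message = f"{gate.grader}: mean {mean:.2f} below {gate.min_score:.2f}"
        else:
            # A gate with no scores is reported rather than silently passed.
            message = f"{gate.grader}: no scores recorded"
        (hard_failures if gate.hard else soft_warnings).append(message)
    return (not hard_failures, hard_failures, soft_warnings)


# Usage: two captured prompts, one hard gate and one soft gate.
dataset = EvalDataset(
    name="support-agent",
    cases=[
        PromptCase("Create an article about cats"),
        PromptCase("Delete node 5", expected_behaviour="asks for confirmation"),
    ],
    gates=[Gate("relevance", 4.0), Gate("actionability", 3.5, hard=False)],
)
passed, hard_failures, soft_warnings = check_gates(
    dataset, {"relevance": [5, 4], "actionability": [3, 3]}
)
print(passed, hard_failures, soft_warnings)
```

In this sketch ai_agents_test would only need to fill `cases`, while ai_eval (or a CI job) would supply `scores` and act on the result, which is roughly the split that options (b) and (c) above describe.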
