Research: merging ai_agents_test with ai_eval (and AiLlm test harness)
>>> [!note] Migrated issue
<!-- Drupal.org comment -->
<!-- Migrated from issue #3585124. -->
Reported by: [marcus_johansson](https://www.drupal.org/user/385947)
>>>
<p>[Tracker]<br>
<strong>Update Summary: </strong>[One-line status update for stakeholders]<br>
<strong>Short Description: </strong>Research whether ai_agents_test should be merged with ai_eval, and how the core ai module's tests/src/AiLlm harness fits into the picture.<br>
<strong>Check-in Date: </strong>MM/DD/YYYY<br>
[/Tracker]</p>
<h3 id="summary-problem-motivation">Problem/Motivation</h3>
<p>We currently have three overlapping efforts in the Drupal AI ecosystem that each cover part of "validate that an AI configuration actually works":</p>
<ul>
<li><strong>ai_agents_test</strong> - a Drupal module that lets site builders build test suites from real-world prompts (captured from testers and end-users) and run them against any AI agent configuration to validate production readiness. Focused on agent decision-making across provider/model combinations. Currently at 1.0.0-alpha4.</li>
<li><strong>ai_eval</strong> - a broader evaluation harness with two modes (<code>agent</code> mode that exercises end-to-end agent plugins including tool invocation, and <code>chat</code> mode that hits providers directly). Ships five pluggable graders (four LLM-as-judge on a 1-5 scale for relevance/completeness/accuracy/actionability, plus one deterministic format validator), hard/soft quality gates for CI/CD, a results dashboard with week-over-week trends, and a prompt-optimization loop that proposes improved system prompts when gates fail.</li>
<li><strong>tests/src/AiLlm</strong> in the core ai module - a low-level PHPUnit kernel test harness for running tests against real providers. Provides <code>AiProviderTestBase</code>, <code>AiTestUiInterface</code>/<code>AiTestUiTrait</code> for exposing PHPUnit tests through a Drupal UI, and honors an <code>AI_PHPUNIT_TARGET_MODELS</code> env var so the same test class can run against many provider/model pairs. Includes a <code>FiberTest</code> example that exercises concurrent generation. Developer-facing, code-first, and used by the ai module's own test suite.</li>
</ul>
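<p>To make the gate idea concrete for the comparison, here is a minimal sketch of hard/soft gating over 1-5 grader scores. It is not ai_eval's implementation: the <code>Gate</code> class, the <code>evaluate_gates</code> function, the threshold values, and the rule that hard gates fail the run while soft gates only warn are all assumptions made for illustration.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """Hypothetical threshold on one grader's mean score (1-5 scale)."""
    grader: str
    minimum: float
    hard: bool  # assumption: hard gates fail the run, soft gates only warn


def evaluate_gates(scores, gates):
    """Return (passed, failures, warnings) for a dict of per-grader score lists."""
    failures, warnings = [], []
    for gate in gates:
        values = scores.get(gate.grader, [])
        # Assumption: a gated grader with no scores counts as a violation.
        mean = sum(values) / len(values) if values else 0.0
        if mean < gate.minimum:
            message = f"{gate.grader}: mean {mean:.2f} < {gate.minimum:.2f}"
            (failures if gate.hard else warnings).append(message)
    return (not failures, failures, warnings)


# Example run with made-up scores and thresholds.
scores = {"relevance": [4, 5, 3], "accuracy": [2, 3, 3]}
gates = [
    Gate("relevance", minimum=3.5, hard=True),
    Gate("accuracy", minimum=3.0, hard=False),
]
passed, failures, warnings = evaluate_gates(scores, gates)
```

<p>In this example the run passes because the only violated gate is soft. A CI job would exit non-zero only when <code>failures</code> is non-empty.</p>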
<p>The overlap is real, but the audiences are different:</p>
<ul>
<li><strong>ai_agents_test</strong> serves site builders who want to curate prompts from users and re-run them as the agent configuration evolves.</li>
<li><strong>ai_eval</strong> serves teams that want structured scoring, CI gates, and automated prompt iteration.</li>
<li><strong>AiLlm</strong> serves module developers writing PHPUnit tests that need real providers (hits OpenAI/Anthropic/etc. when credentials are present, skips otherwise).</li>
</ul>
<p>The concern is that we end up with three separate ways to say "run a prompt set against a provider", three separate storage models for "a set of prompts with expected behaviour", and three separate result dashboards. Contrib sites that want CI-grade evaluation on user-curated prompt sets would benefit from one integrated story rather than having to piece tooling together from three projects.</p>
<h3 id="summary-proposed-resolution">Proposed resolution</h3>
<p>This is a research task, not an implementation task. The outcome should be a written decision on whether to merge, align, or keep the projects separate, plus a concrete action plan.</p>
<ul>
<li>Map the features of each project side-by-side: prompt/dataset storage model, agent vs chat targets, scoring mechanism, gate/threshold behaviour, UI surface, CI integration, provider-credential handling, and test-execution surface (Drupal UI, Drush, PHPUnit).</li>
<li>Identify the intersection: prompt sets, provider/model targeting, result storage. These are likely candidates for a shared core abstraction.</li>
<li>Identify the distinctions: ai_eval's LLM-judge graders and prompt-optimization loop are unique; ai_agents_test's user-prompt-capture workflow is unique; AiLlm's PHPUnit-and-env-var targeting is unique.</li>
<li>Decide which of three paths makes sense: (a) merge ai_agents_test into ai_eval as a prompt-capture submodule; (b) keep ai_agents_test as a lightweight site-builder UI that writes datasets into ai_eval's storage model; (c) keep both independent but extract a shared "AI evaluation dataset" schema into ai core.</li>
<li>For AiLlm: decide whether the PHPUnit-style kernel test harness should stay in the ai module as the developer-facing test API, while ai_eval/ai_agents_test operate at the site-builder layer on top. Evaluate whether <code>AiTestUiInterface</code> should converge with ai_eval's test-run UI.</li>
<li>Survey the existing overlap with community frameworks (Guardrails AI, DeepEval, promptfoo) to make sure we're not reinventing a solved problem.</li>
<li>Publish the findings as a comment here, and open concrete follow-up issues in the affected projects with the agreed direction.</li>
</ul>
<p>Open questions for discussion:</p>
<ul>
<li>Do ai_agents_test users expect to grade the captured prompts, or just re-run and eyeball the output? If graded, that's ai_eval territory.</li>
<li>Should ai_eval's agent-mode evaluation reuse the core <code>AiAgentInterface</code> executor the way the ai module already does, or keep its own runner?</li>
<li>Would a shared "test dataset" config entity (prompts, optional expected behaviour, optional gate thresholds) in ai core let both contribs cooperate without hard dependencies?</li>
<li>Does it make sense for AiLlm's <code>AI_PHPUNIT_TARGET_MODELS</code> provider matrix to feed ai_eval's gate reports, so PHPUnit and CI-gate runs share the same signal?</li>
</ul>
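<p>As a starting point for the shared "test dataset" question, the shape of such a schema could be sketched as below. This is a language-neutral illustration in Python, not a Drupal config entity: the names <code>PromptCase</code>, <code>EvalDataset</code>, and <code>plan_runs</code>, the field names, and the provider/model strings are all hypothetical.</p>

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class PromptCase:
    """One captured prompt; the expected behaviour is optional."""
    prompt: str
    expected_behaviour: Optional[str] = None


@dataclass(frozen=True)
class EvalDataset:
    """Hypothetical shared dataset: prompts plus optional gate thresholds."""
    id: str
    cases: tuple
    gate_thresholds: dict = field(default_factory=dict)


def plan_runs(dataset, targets):
    """Expand a dataset into one planned run per (provider, model) and case."""
    return [
        {
            "dataset": dataset.id,
            "provider": provider,
            "model": model,
            "prompt": case.prompt,
        }
        for (provider, model), case in product(targets, dataset.cases)
    ]


# Example with placeholder data; provider and model names are made up.
dataset = EvalDataset(
    id="support_prompts",
    cases=(
        PromptCase("Create an article about llamas", "Calls the node-creation tool"),
        PromptCase("List all content types"),
    ),
    gate_thresholds={"relevance": 3.5},
)
targets = [("provider_a", "model_x"), ("provider_b", "model_y")]
runs = plan_runs(dataset, targets)
```

<p>Keeping expected behaviour and gate thresholds optional is what would let a capture-only workflow and a graded workflow share the same records: an ungraded re-run only reads <code>prompt</code>, while a gated run also reads the other fields.</p>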
<h3 id="summary-ai-usage">AI usage (if applicable)</h3>
<p>[x] AI Assisted Issue<br>
This issue was generated with AI assistance, but was reviewed and refined by the creator.</p>
<p>[ ] AI Assisted Code</p>
<p>[ ] AI Generated Code</p>
<p>[ ] Vibe Coded</p>