# Drupal Eval Commons umbrella (#3586445)

## TL;DR

Four Drupal AI projects (`drupal/ai`, `ai_eval`, `ai_agents_test`, `ai_best_practices`) are each working on pieces of the same puzzle: how to write, run, and share evals for AI features. Today there is no shared format for eval cases or results, so the same work risks being redone in incompatible ways. This umbrella suggests splitting the problem into five layers and agreeing on the lowest layer first, so the other layers can move in parallel without one blocking the rest.

Nothing in your current module has to change today; the proposal is additive, and a module adopts each layer when ready by translating its eval data into the shared format. Once enough layers land, results become comparable across modules, cases and judges can be shared instead of rewritten, and a browser gives users a way to find evals.

If you maintain or follow any of those four projects, the useful thing to do now is comment on three questions in the body: is the five-layer split the right frame, should Layer 1 move first, and does storage stay a Layer 3 decision.

## Goal

Agree on a shared, layered contract for eval data across `drupal/ai`, `ai_eval`, `ai_agents_test`, and `ai_best_practices`, so the four active eval threads can progress in parallel without one settling prematurely and constraining the others.

## Background

A few current eval threads in Drupal AI are really touching the same underlying problem: we do not yet have a common contract for eval data.

Relevant work already in motion:

- [#3585124](https://www.drupal.org/i/3585124): convergence of test fixtures across modules
- [#3586840](https://www.drupal.org/i/3586840): shared dataset registry
- [#3586842](https://www.drupal.org/i/3586842): dataset schema work
- [#3588426](https://www.drupal.org/i/3588426): browser / community submission for skills, evals, and results
- [#3586440](https://www.drupal.org/i/3586440): `ai_best_practices` tooling vision (orchestration sibling to Layer 3)

These are not unrelated efforts.
They look like different layers of the same stack. Without a shared structure, one layer can easily be settled in a way that constrains the others too early. With a shared structure, the child issues can move in parallel against a clearer contract.

This issue is a coordination umbrella for that structure. It is filed on `drupal/ai` because the work crosses multiple related contrib spaces (`drupal/ai`, `ai_eval`, `ai_agents_test`, `ai_best_practices`); implementation stays in the individual issue queues.

## Proposed approach

Treat the eval stack as five layers and do not block the lower-level contract on higher-level implementation questions.

```text
Layer 5  Domain-specific bundles
Layer 4  Browser / community submission
Layer 3  Registry / storage / distribution
Layer 2  Result envelope
Layer 1  Cases, rubrics, judges
```

### Layer 1: Cases, rubrics, judges

Foundation. Natural extension of [#3586842](https://www.drupal.org/i/3586842). Standardize three versioned artifacts:

- case
- rubric
- judge prompt

That gives a clean separation between the eval case, the grading logic, and reusable judge prompts and validation metadata. This is the first layer that needs to become explicit, because every other layer depends on it.

This decomposition aligns with [Inspect AI](https://inspect.aisi.org.uk/)'s task model: a Layer 1 `case` matches an Inspect AI `Dataset` entry, and the `rubric` + `judge prompt` pair matches an Inspect AI `Scorer`. Keeping the alignment makes Inspect AI task imports straightforward and lets Drupal-side artifacts feed ecosystems already built on Inspect AI (such as Hugging Face's [community benchmark](https://huggingface.co/blog/community-evals) registration).

### Layer 2: Result envelope

Use EvalEval's `every_eval_ever` schema as the base envelope. Drupal-specific metadata lives in EEE's documented `additional_details` slots (the EEE-blessed extension mechanism, present at each major sub-object).
That gives a standard way to record:

- model and model version
- harness and harness version
- dataset version
- rubric and judge versions
- result scores and evaluation metadata

Where a harness already produces uncertainty estimates, the envelope should preserve them. `ai_eval`, for example, emits Wilson confidence intervals on `pass_rate`; losing that signal during translation would weaken cross-harness comparison.

The envelope stays format-agnostic. Per-harness adapters translate between native run data and the envelope: JSON-serialized OTel spans following GenAI semantic conventions, `AgentTestResult` entities from `ai_agents_test`, file-based outputs from simpler harnesses.

JSON-serialized OTel spans with GenAI semantic conventions are the closest thing to a cross-framework standard for runtime data right now, but there is no clear winner: LangGraph, CrewAI, LangSmith, and others all diverge. Treating the envelope as the shared integration point lets each producer keep its native format while still emitting comparable result records.

The same gap affects related debugging tooling (Agent Debugger, AI API Explorer), which has no shared format for storing run/request records today. A shared envelope is useful there too.

### Layer 3: Registry / storage / distribution

The space of [#3586840](https://www.drupal.org/i/3586840). The entity-location and storage question stays here, not on Layer 1. That means:

- define the contract first
- allow file-based representations first
- return to the registry/entity design with working examples and better evidence

Layer 3 also overlaps with the tooling work in `ai_best_practices` ([#3586440](https://www.drupal.org/i/3586440)). That work orchestrates how skills, evals, and guidelines reach a Drupal project: aggregation into installed projects, placement on disk, and lifecycle. The orchestration story is a natural sibling to the storage/registry story, since both answer "how do these artifacts get from producers to consumers."
Either layer is fine as the primary home for that conversation, as long as the contract from Layers 1 and 2 stays portable across both.

### Layer 4: Browser / community submission

The space opened by [#3588426](https://www.drupal.org/i/3588426). The browser should be treated as a consumer of the lower layers:

- first against file-based metadata and results
- later against registry-backed data if and when Layer 3 lands

That lets browser work begin without waiting for the full registry discussion to finish.

### Layer 5: Domain-specific bundles

Modules will likely need domain-specific extensions on top of the base contract. Examples may include `drupal_builder`, `agent`, `rag`, `classification`, `judge_validation`.

A note on `agent`: `ai_agents_test` currently stores run data in `AgentTestResult` entities, not OTel spans. For trace-derivable envelopes, either OTel emission can be added or a harness adapter can translate directly from `AgentTestResult` to the Layer 2 envelope. Both paths fit the translation-layer pattern in Layer 2.

This layer should stay intentionally light at first. The base contract must allow extensions without fragmenting the overall format.

### Scope by issue

- [#3586842](https://www.drupal.org/i/3586842) carries Layer 1 schema work
- a new sibling issue should carry Layer 2 result-envelope work
- [#3586840](https://www.drupal.org/i/3586840) carries Layer 3 registry/storage work
- [#3588426](https://www.drupal.org/i/3588426) carries Layer 4 browser/submission work
- Layer 5 evolves incrementally in the modules that need it

This umbrella does not replace those issues. It gives them a shared frame.

### Adoption posture

This umbrella is **additive, not a release blocker**. Existing eval frameworks across `ai_best_practices`, `ai_eval`, and `ai_agents_test` keep working as they do today. The Commons artifacts (Layer 1 schemas, Layer 2 result envelope, and so on) are designed to be adopted incrementally as each layer lands.
A module shipping today does not have to wait on Commons artifacts to release, and modules with their own runners can keep running them; the Layer 2 envelope is an output format, not a runtime dependency. The goal is interoperability between modules over time, not a synchronized release.

## Stakeholders

Pre-briefed and accepted (2026-05-11 / 2026-05-13):

- [Marcus_Johansson](https://www.drupal.org/u/marcus_johansson) (AI Initiative Technical Lead, `drupal/ai` maintainer): Layer 2 translation-layer framing.
- [yautja_cetanu](https://www.drupal.org/u/yautja_cetanu) (Jamie Abrahams, FreelyGive, AI Initiative PM): Layer 4 owner via [#3588426](https://www.drupal.org/i/3588426).
- [ronaldtebrake](https://www.drupal.org/u/ronaldtebrake) (Ronald te Brake): `ai_best_practices` tooling layer / orchestration sibling to Layer 3, [#3586440](https://www.drupal.org/i/3586440).
- [webchick](https://www.drupal.org/u/webchick) (Angie Byron): `ai_best_practices` roadmap driver, [#3585542](https://www.drupal.org/i/3585542).

Stakeholders can use `/subscribe` to follow this issue without being assigned.

## Open questions

1. Does this 5-layer structure match the problem well enough to use as the umbrella?
2. Do people agree that Layer 1 should proceed independently of the Layer 3 storage/entity debate?
3. Is `drupal/ai` the right primary home for this umbrella?

What should be decided now, if the answers to (1)–(3) are broadly yes:

- The 5-layer split is the right coordination model.
- Layer 1 should move first.
- The storage/entity-location question remains a Layer 3 decision, not a blocker on Layer 1.

_These decisions concern the cross-project pattern. Individual project maintainers retain authority over what ships in their own queue._

If there is agreement on those points, the child issues can make concrete progress immediately.
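To make the Layer 2 translation-layer idea easier to discuss, here is a minimal Python sketch of a per-harness adapter that keeps a Wilson confidence interval on `pass_rate` when translating a native run into an envelope-like record. It is an illustration under stated assumptions, not the `every_eval_ever` schema and not `ai_eval`'s data model: `NativeRun`, `to_envelope`, and every key except `additional_details` are hypothetical names, and whether the interval belongs in a native EEE field or in `additional_details` is a question for the Layer 2 issue.

```python
import json
import math
from dataclasses import dataclass


@dataclass
class NativeRun:
    """Hypothetical native harness output; not the real ai_eval data model."""
    model: str
    harness: str
    passed: int
    total: int


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def to_envelope(run: NativeRun) -> dict:
    """Translate a native run into an envelope-like dict (key names are assumptions)."""
    low, high = wilson_interval(run.passed, run.total)
    return {
        "model": run.model,
        "harness": run.harness,
        "pass_rate": run.passed / run.total,
        # Assumption: the uncertainty estimate rides along in an extension slot
        # so it is not lost during translation.
        "additional_details": {
            "pass_rate_ci": {"method": "wilson", "z": 1.96, "low": low, "high": high},
        },
    }


run = NativeRun(model="example-model", harness="ai_eval", passed=8, total=10)
envelope = to_envelope(run)
print(json.dumps(envelope, indent=2))
```

The point of the sketch is only that the adapter, not the harness, owns the mapping: the harness keeps its native format, and the interval survives as plain JSON that another consumer can compare.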
## Resources

Related issues:

- [#3585124](https://www.drupal.org/i/3585124): convergence of test fixtures across modules
- [#3586840](https://www.drupal.org/i/3586840): shared dataset registry
- [#3586842](https://www.drupal.org/i/3586842): dataset schema work (Layer 1)
- [#3588426](https://www.drupal.org/i/3588426): browser / community submission (Layer 4)
- [#3586440](https://www.drupal.org/i/3586440): `ai_best_practices` tooling vision (orchestration sibling to Layer 3)
- [#3585542](https://www.drupal.org/i/3585542): `ai_best_practices` roadmap

External:

- EvalEval / `every_eval_ever`: https://github.com/evaleval/every_eval_ever
- EvalEval site: https://evalevalai.com/

## Decision

_Leave empty until resolved._

Sections of this issue description were drafted with AI assistance and human-edited.
