feat: #3588915 Eval dashboard POC (EvalEval envelope + HuggingFace + Gradio)

Closes #3588915

What is this?

Drupal AI skills are SKILL.md files that guide AI assistants to give better answers in specific domains (automated testing, configuration management, etc.). This MR builds upon our existing infrastructure to measure how well those skills actually work — running test prompts against real models, recording the results in a structured format, and publishing them to a browsable dashboard.

If you're unfamiliar with any of the tools involved:

  • Evals — automated tests for AI: give a model a prompt, check whether the response meets defined criteria. Run across multiple prompts and models to catch regressions and compare quality. Think of it as PHPUnit, but for AI outputs instead of PHP code.
  • Inspect AI — an open-source eval framework from the UK AI Security Institute. We use it to send prompts to AI models, automatically check whether the responses meet our criteria, and save the results for later analysis.
  • EvalEval envelope — a structured schema for recording eval results (model version, scores, token usage, per-sample details) in a way that's interoperable across projects and tools.
  • HuggingFace — the standard platform for sharing AI datasets and models. We publish eval results there so anyone can browse, filter, and query them without running anything locally.
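
For reviewers new to evals, the PHPUnit analogy can be made concrete with a tiny offline sketch. Everything here is illustrative: `fake_model`, the case shape, and the criteria are assumptions for the example, not code from this MR.

```python
from typing import Callable

# Hypothetical cases: each pairs a prompt with a pass/fail criterion.
CASES = [
    {
        "id": "kernel-test",
        "prompt": "How do I test a Drupal service?",
        "criterion": lambda text: "kerneltestbase" in text.lower(),
    },
    {
        "id": "browser-test",
        "prompt": "Which base class runs browser tests?",
        "criterion": lambda text: "browsertestbase" in text.lower(),
    },
]


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    if "service" in prompt:
        return "Extend KernelTestBase and fetch the service from the container."
    return "I am not sure."


def run_evals(cases: list[dict], model: Callable[[str], str]) -> dict[str, bool]:
    """Send each prompt to the model and record whether the criterion holds."""
    return {case["id"]: case["criterion"](model(case["prompt"])) for case in cases}


def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed; 0.0 for an empty run."""
    return sum(results.values()) / len(results) if results else 0.0


if __name__ == "__main__":
    results = run_evals(CASES, fake_model)
    print(results, pass_rate(results))
```

Swapping `fake_model` for a real provider call and running the same cases against several models is what makes regressions and quality differences visible.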

Before this MR

The repo already had:

  • evals/{skill}/evals.json — test cases (prompts + expected outputs) per skill
  • evals/{skill}/static-checks.json — deterministic structural checks (does the file exist, does it contain required terms, etc.)
  • evals/run-evals.py — a script to run behavioral evals against Claude, Gemini, OpenRouter, and other providers
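
As a rough illustration of what a deterministic structural check involves, here is a minimal sketch. The check schema (`name`, `type`, `path`, `terms`) is invented for this example and is not the actual format of static-checks.json.

```python
import tempfile
from pathlib import Path


def run_static_checks(skill_dir: Path, checks: list[dict]) -> dict[str, bool]:
    """Evaluate each check against files under skill_dir (hypothetical schema)."""
    results = {}
    for check in checks:
        target = skill_dir / check["path"]
        if check["type"] == "file_exists":
            passed = target.is_file()
        elif check["type"] == "contains_terms":
            # A missing file counts as containing none of the terms.
            text = target.read_text(encoding="utf-8") if target.is_file() else ""
            passed = all(term in text for term in check["terms"])
        else:
            # Unknown check types fail instead of passing silently.
            passed = False
        results[check["name"]] = passed
    return results


def demo() -> dict[str, bool]:
    """Run three illustrative checks against a throwaway skill directory."""
    checks = [
        {"name": "skill-file-present", "type": "file_exists", "path": "SKILL.md"},
        {"name": "mentions-phpunit", "type": "contains_terms",
         "path": "SKILL.md", "terms": ["PHPUnit"]},
        {"name": "mentions-nightwatch", "type": "contains_terms",
         "path": "SKILL.md", "terms": ["Nightwatch"]},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        skill_dir = Path(tmp)
        (skill_dir / "SKILL.md").write_text(
            "# Automated testing\nUse PHPUnit.\n", encoding="utf-8"
        )
        return run_static_checks(skill_dir, checks)


if __name__ == "__main__":
    print(demo())
```

Because these checks never call a model, they are cheap to run on every change and give the same answer each time.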

Results landed as individual JSON trace files in a local --output-dir. There was no standard format for those results, no way to compare runs across models, and no browsable record of what passed or failed.

After this MR

The same eval cases now flow through a three-step pipeline that produces structured, publishable, browsable results:

  inspect eval drupal_inspect.py  →  convert_to_envelope.py  →  upload_to_hf.py
  (run evals, save .eval log)        (convert to EvalEval)      (publish to HF)

Step 1 — run evals with Inspect AI

  inspect eval evals/drupal_inspect.py@drupal_automated_testing \
    --model anthropic/claude-haiku-4-5-20251001

Step 2 — convert to EvalEval envelope

  python3 evals/convert_to_envelope.py \
    --inspect-log logs/<run>.eval \
    --output-dir ./envelope \
    --skill drupal-automated-testing \
    --validate

Step 3 — publish to HuggingFace

  python3 evals/upload_to_hf.py \
    --envelope-dir ./envelope \
    --repo your-org/eval-results
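
For anyone who wants to chain the three steps, one possible wrapper is sketched below. The commands and flags are the ones shown above; the function names are made up for this example, and the log path is passed in by the caller because the actual `.eval` filename depends on the run.

```python
import shlex
import subprocess


def pipeline_commands(task: str, model: str, inspect_log: str, skill: str,
                      repo: str, envelope_dir: str = "./envelope") -> list[list[str]]:
    """Build the three commands from the steps above as argument lists."""
    return [
        ["inspect", "eval", f"evals/drupal_inspect.py@{task}", "--model", model],
        ["python3", "evals/convert_to_envelope.py",
         "--inspect-log", inspect_log,
         "--output-dir", envelope_dir,
         "--skill", skill,
         "--validate"],
        ["python3", "evals/upload_to_hf.py",
         "--envelope-dir", envelope_dir,
         "--repo", repo],
    ]


def run_pipeline(commands: list[list[str]]) -> None:
    """Run the commands in order; check=True stops at the first failing step."""
    for command in commands:
        print("$", shlex.join(command))
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Dry run: print the commands without executing them.
    for command in pipeline_commands(
        task="drupal_automated_testing",
        model="anthropic/claude-haiku-4-5-20251001",
        inspect_log="logs/example.eval",  # placeholder path, not a real log name
        skill="drupal-automated-testing",
        repo="your-org/eval-results",
    ):
        print(shlex.join(command))
```

Stopping on the first non-zero exit code keeps a failed conversion or validation from publishing a partial envelope.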

What's included

Inspect AI integration:

  • evals/drupal_inspect.py — Task definitions and four reusable scorers: must_contain_any, must_not_contain, php_lint, markdown_structure. One @task per skill; scorers are shared across all of them.
  • evals/convert_to_envelope.py — Extended with --inspect-log to accept Inspect .eval logs alongside the existing --trace-dir path. Extracts real per-sample latency, estimates cost from token counts, formats case IDs consistently.
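
To give a feel for what the shared scorers check, here is a dependency-free sketch of three of them as plain functions. Only the scorer names come from this MR; the signatures, the case-insensitive matching, and the heading rule are assumptions, and the real versions are Inspect AI scorers rather than bare predicates. php_lint is left out because it would presumably need a PHP binary.

```python
import re

# Matches ATX-style Markdown headings such as "## Setup".
HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def must_contain_any(response: str, terms: list[str]) -> bool:
    """Pass if at least one term appears (case-insensitive, by assumption)."""
    lowered = response.lower()
    return any(term.lower() in lowered for term in terms)


def must_not_contain(response: str, terms: list[str]) -> bool:
    """Pass only if none of the forbidden terms appear."""
    lowered = response.lower()
    return not any(term.lower() in lowered for term in terms)


def markdown_structure(response: str, required_headings: list[str]) -> bool:
    """Pass if every required heading text is present at any heading level."""
    found = {match.group(1).strip().lower() for match in HEADING_RE.finditer(response)}
    return all(heading.lower() in found for heading in required_headings)


if __name__ == "__main__":
    answer = "## Setup\n\nExtend KernelTestBase.\n\n## Example\n\nSee below.\n"
    print(must_contain_any(answer, ["KernelTestBase", "UnitTestCase"]))
    print(must_not_contain(answer, ["SimpleTest"]))
    print(markdown_structure(answer, ["Setup", "Example"]))
```

Keeping each scorer a small predicate over the response text is what lets one set of scorers be reused by every per-skill task.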

Infrastructure improvements to existing eval tooling:

  • evals/providers.py — Add OpenRouter provider; extract actual model name from Claude CLI JSON so traces no longer land as unknown/unknown
  • evals/run-evals.py — Refactor failure handling so all cases write a trace file; failures now appear in results instead of being silently dropped
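
The failure-handling change can be pictured with the sketch below, which writes a trace whether or not the model call succeeds. The trace fields and function names are assumptions for illustration and do not mirror run-evals.py.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def run_case(case_id: str, prompt: str, call_model: Callable[[str], str],
             output_dir: Path) -> dict:
    """Run one case and always write a trace file, even when the call fails."""
    trace = {"case_id": case_id, "prompt": prompt,
             "status": "ok", "response": None, "error": None}
    try:
        trace["response"] = call_model(prompt)
    except Exception as exc:
        # Record the failure in the trace instead of dropping the case.
        trace["status"] = "error"
        trace["error"] = f"{type(exc).__name__}: {exc}"
    finally:
        output_dir.mkdir(parents=True, exist_ok=True)
        (output_dir / f"{case_id}.json").write_text(
            json.dumps(trace, indent=2), encoding="utf-8"
        )
    return trace


def demo() -> dict[str, str]:
    """Run one passing and one failing case; return the status per trace file."""
    def flaky_model(prompt: str) -> str:
        if "fail" in prompt:
            raise TimeoutError("provider timed out")
        return "ok response"

    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "traces"
        run_case("case-1", "normal prompt", flaky_model, out)
        run_case("case-2", "please fail", flaky_model, out)
        return {
            path.name: json.loads(path.read_text(encoding="utf-8"))["status"]
            for path in sorted(out.glob("*.json"))
        }


if __name__ == "__main__":
    print(demo())
```

With a trace per case, a provider timeout shows up as a failed row in the results rather than as a missing one.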

New publishing scripts:

  • evals/upload_to_hf.py — Uploads EvalEval envelope files to HuggingFace; also generates a flat results.jsonl for the HF dataset viewer
  • evals/space/ — Gradio dashboard: filter by namespace, skill, model, pass/fail; shows latency, token usage, cost per case
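
The flat results.jsonl is easiest to picture as one row per eval case. The sketch below uses an invented envelope shape (`skill`, `model`, `samples`, and the per-sample fields); it is not the EvalEval schema, and the filter only imitates the kind of query the dashboard offers.

```python
import json
from typing import Optional

# Hypothetical envelope; the real EvalEval schema has different field names.
EXAMPLE_ENVELOPE = {
    "skill": "drupal-automated-testing",
    "model": "anthropic/claude-haiku-4-5-20251001",
    "samples": [
        {"case_id": "case-1", "passed": True, "latency_s": 2.5,
         "input_tokens": 100, "output_tokens": 50},
        {"case_id": "case-2", "passed": False, "latency_s": 4.0,
         "input_tokens": 120, "output_tokens": 80},
    ],
}


def flatten_envelope(envelope: dict) -> list[dict]:
    """Turn one nested envelope into one flat row per sample."""
    return [
        {
            "skill": envelope["skill"],
            "model": envelope["model"],
            "case_id": sample["case_id"],
            "passed": sample["passed"],
            "latency_s": sample["latency_s"],
            "total_tokens": sample["input_tokens"] + sample["output_tokens"],
        }
        for sample in envelope["samples"]
    ]


def to_jsonl(rows: list[dict]) -> str:
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(row, sort_keys=True) + "\n" for row in rows)


def filter_rows(rows: list[dict], *, model: Optional[str] = None,
                passed: Optional[bool] = None) -> list[dict]:
    """Keep rows matching the given model and pass/fail state (None = any)."""
    return [
        row for row in rows
        if (model is None or row["model"] == model)
        and (passed is None or row["passed"] is passed)
    ]


if __name__ == "__main__":
    rows = flatten_envelope(EXAMPLE_ENVELOPE)
    print(to_jsonl(rows), end="")
    print(filter_rows(rows, passed=False))
```

A flat row per case suits a tabular viewer, since nested per-sample details do not need unpacking before filtering.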

New skill:

  • skills/drupal-eval-pipeline/ + evals/drupal-eval-pipeline/ — Documents the three-step pipeline with static checks and four behavioral eval cases

Live POC

Relation to the Eval Commons proposal

This MR is a proof of concept for ai#3586445. It provides a working end-to-end implementation of Layers 1–4 at POC depth and informed the review comment on that issue.

🤖 Generated with Claude Code (https://claude.ai/claude-code)

Edited by Angie Byron
