feat: #3588915 Eval dashboard POC (EvalEval envelope + HuggingFace + Gradio)
Closes #3588915
What is this?
Drupal AI skills are SKILL.md files that guide AI assistants to give better answers in specific domains (automated testing, configuration management, etc.). This MR builds upon our existing infrastructure to measure how well those skills actually work — running test prompts against real models, recording the results in a structured format, and publishing them to a browsable dashboard.
If you're unfamiliar with any of the tools involved:
- Evals — automated tests for AI: give a model a prompt, check whether the response meets defined criteria. Run across multiple prompts and models to catch regressions and compare quality. Think of it as PHPUnit, but for AI outputs instead of PHP code.
- Inspect AI — an open-source eval framework from the UK AI Security Institute. We use it to send prompts to AI models, automatically check whether the responses meet our criteria, and save the results for later analysis.
- EvalEval envelope — a structured schema for recording eval results (model version, scores, token usage, per-sample details) in a way that's interoperable across projects and tools.
- HuggingFace — the standard platform for sharing AI datasets and models. We publish eval results there so anyone can browse, filter, and query them without running anything locally.
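To make the PHPUnit analogy concrete, here is a minimal sketch of an eval in plain Python. It does not use Inspect AI. `EvalCase`, its `any_of` / `none_of` fields, and `fake_model` are hypothetical stand-ins invented for illustration, not names from this repo.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalCase:
    """One test prompt plus the criteria its response must satisfy (hypothetical shape)."""
    prompt: str
    any_of: list[str] = field(default_factory=list)   # at least one term must appear
    none_of: list[str] = field(default_factory=list)  # none of these terms may appear


def passes(case: EvalCase, response: str) -> bool:
    """Return True when the response meets every criterion of the case."""
    text = response.lower()
    has_required = not case.any_of or any(t.lower() in text for t in case.any_of)
    has_forbidden = any(t.lower() in text for t in case.none_of)
    return has_required and not has_forbidden


def run_evals(cases: list[EvalCase], model: Callable[[str], str]) -> dict[str, bool]:
    """Send each prompt to the model and record pass/fail per prompt."""
    return {case.prompt: passes(case, model(case.prompt)) for case in cases}


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same canned answer."""
    return "Extend KernelTestBase and run the test with PHPUnit."


cases = [
    EvalCase("How do I test a service?", any_of=["KernelTestBase", "UnitTestCase"]),
    EvalCase("Which test framework?", any_of=["PHPUnit"], none_of=["SimpleTest"]),
]
results = run_evals(cases, fake_model)
```

Swapping `fake_model` for a function that calls a real provider is what turns this from a unit test into an eval: the checks stay deterministic while the output under test does not.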
Before this MR
The repo already had:
- evals/{skill}/evals.json: test cases (prompts + expected outputs) per skill
- evals/{skill}/static-checks.json: deterministic structural checks (does the file exist, does it contain required terms, etc.)
- evals/run-evals.py: a script to run behavioral evals against Claude, Gemini, OpenRouter, and other providers
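A deterministic structural check of the kind described above could look like the following sketch. The real schema of static-checks.json is not shown in this description, so the `file` and `required_terms` keys and the `run_static_check` helper are assumptions.

```python
import tempfile
from pathlib import Path


def run_static_check(root: Path, check: dict) -> list[str]:
    """Return a list of failure messages; an empty list means the check passed.

    `check` uses a hypothetical shape: {"file": str, "required_terms": [str, ...]}.
    """
    target = root / check["file"]
    if not target.is_file():
        return [f"missing file: {check['file']}"]
    text = target.read_text(encoding="utf-8").lower()
    return [
        f"missing term: {term}"
        for term in check.get("required_terms", [])
        if term.lower() not in text
    ]


# Demonstrate against a throwaway skill directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "SKILL.md").write_text("# Automated testing\nUse PHPUnit.\n", encoding="utf-8")
    ok = run_static_check(root, {"file": "SKILL.md", "required_terms": ["phpunit"]})
    bad = run_static_check(root, {"file": "SKILL.md", "required_terms": ["behat"]})
    gone = run_static_check(root, {"file": "README.md"})
```

Because no model is involved, checks like these give the same answer on every run and can gate a merge cheaply before any behavioral eval is started.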
Results landed as individual JSON trace files in a local --output-dir. There was no standard format for those results, no way to compare runs across models, and no browsable record of what passed or failed.
After this MR
The same eval cases now flow through a three-step pipeline that produces structured, publishable, browsable results:
inspect eval drupal_inspect.py (run evals, save .eval log) → convert_to_envelope.py (convert to EvalEval) → upload_to_hf.py (publish to HF)
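The three steps can also be chained from a single script. The sketch below only builds the argument lists, using the flags shown in this description. The `build_pipeline_commands` helper and the example log path are hypothetical, and nothing is executed.

```python
import shlex


def build_pipeline_commands(task, model, log_path, skill, envelope_dir, repo):
    """Return the three pipeline commands as argv lists, in execution order."""
    return [
        ["inspect", "eval", task, "--model", model],
        [
            "python3", "evals/convert_to_envelope.py",
            "--inspect-log", log_path,
            "--output-dir", envelope_dir,
            "--skill", skill,
            "--validate",
        ],
        [
            "python3", "evals/upload_to_hf.py",
            "--envelope-dir", envelope_dir,
            "--repo", repo,
        ],
    ]


commands = build_pipeline_commands(
    task="evals/drupal_inspect.py@drupal_automated_testing",
    model="anthropic/claude-haiku-4-5-20251001",
    log_path="logs/example.eval",  # hypothetical log file name
    skill="drupal-automated-testing",
    envelope_dir="./envelope",
    repo="your-org/eval-results",
)
for argv in commands:
    # Print instead of running; pass each argv to subprocess.run(argv, check=True) to execute.
    print(shlex.join(argv))
```

Keeping each command as a list of arguments avoids shell-quoting problems if a path or repo name ever contains spaces.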
Step 1: run evals with Inspect AI
inspect eval evals/drupal_inspect.py@drupal_automated_testing \
--model anthropic/claude-haiku-4-5-20251001
Step 2: convert to EvalEval envelope
python3 evals/convert_to_envelope.py \
--inspect-log logs/<run>.eval \
--output-dir ./envelope \
--skill drupal-automated-testing \
--validate
Step 3: publish to HuggingFace
python3 evals/upload_to_hf.py \
--envelope-dir ./envelope \
--repo your-org/eval-results
What's included
Inspect AI integration:
- evals/drupal_inspect.py: Task definitions and four reusable scorers (must_contain_any, must_not_contain, php_lint, markdown_structure). One @task per skill; scorers are shared across all of them.
- evals/convert_to_envelope.py: Extended with --inspect-log to accept Inspect .eval logs alongside the existing --trace-dir path. Extracts real per-sample latency, estimates cost from token counts, and formats case IDs consistently.
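Estimating cost from token counts amounts to multiplying usage by a per-token price. The following is a minimal sketch, not the code in convert_to_envelope.py. The `PRICES_PER_MILLION` table, its placeholder rates, the model names, and the `estimate_cost` helper are all assumptions for illustration.

```python
from typing import Optional

# Hypothetical price table: USD per one million tokens, keyed by model name.
# The rates below are placeholders, not real provider pricing.
PRICES_PER_MILLION = {
    "example/model-small": {"input": 1.00, "output": 5.00},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> Optional[float]:
    """Return the estimated USD cost of one sample, or None for an unknown model."""
    rates = PRICES_PER_MILLION.get(model)
    if rates is None:
        return None  # Better to report "unknown" than to guess a price.
    cost = (
        input_tokens * rates["input"] + output_tokens * rates["output"]
    ) / 1_000_000
    return round(cost, 6)


sample_cost = estimate_cost("example/model-small", input_tokens=2_000, output_tokens=500)
unknown_cost = estimate_cost("example/unlisted", input_tokens=10, output_tokens=10)
```

Returning `None` for models missing from the table keeps an unpriced run distinguishable from a run that was free.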
Infrastructure improvements to existing eval tooling:
- evals/providers.py: Add OpenRouter provider; extract actual model name from Claude CLI JSON so traces no longer land as unknown/unknown
- evals/run-evals.py: Refactor failure handling so all cases write a trace file; failures now appear in results instead of being silently dropped
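The failure-handling change can be pictured as wrapping each case so that an exception still produces a trace file. This sketch is not the code in run-evals.py; `run_case_with_trace`, the trace fields, and the file naming are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def run_case_with_trace(case_id, prompt, call_model, output_dir: Path) -> dict:
    """Run one case and always write a trace file, whether the call succeeds or fails."""
    trace = {"case_id": case_id, "prompt": prompt}
    try:
        trace["response"] = call_model(prompt)
        trace["status"] = "ok"
    except Exception as exc:  # Record the failure instead of dropping the case.
        trace["response"] = None
        trace["status"] = "error"
        trace["error"] = f"{type(exc).__name__}: {exc}"
    output_dir.mkdir(parents=True, exist_ok=True)
    trace_path = output_dir / f"{case_id}.json"
    trace_path.write_text(json.dumps(trace, indent=2), encoding="utf-8")
    return trace


def flaky_model(prompt: str) -> str:
    """Stand-in provider that fails on one specific prompt."""
    if "timeout" in prompt:
        raise TimeoutError("provider did not respond")
    return "answer"


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    good = run_case_with_trace("case-1", "normal prompt", flaky_model, out)
    failed = run_case_with_trace("case-2", "trigger timeout", flaky_model, out)
    written = sorted(p.name for p in out.glob("*.json"))
```

With this shape, a provider outage shows up as a row with an error status rather than as a smaller-than-expected result set.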
New publishing scripts:
- evals/upload_to_hf.py: Uploads EvalEval envelope files to HuggingFace; also generates a flat results.jsonl for the HF dataset viewer
- evals/space/: Gradio dashboard with filters for namespace, skill, model, and pass/fail; shows latency, token usage, and cost per case
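Flattening nested results into one JSON object per line is what makes them usable in a table-style viewer and easy to filter. The sketch below assumes a simplified envelope shape; the field names (`model`, `skill`, `samples`, `passed`, and so on) are hypothetical and are not the EvalEval schema.

```python
import json


def flatten_envelope(envelope: dict) -> list[dict]:
    """Turn one nested envelope into flat rows, one per sample."""
    shared = {"model": envelope["model"], "skill": envelope["skill"]}
    return [{**shared, **sample} for sample in envelope["samples"]]


def to_jsonl(rows: list[dict]) -> str:
    """Serialize rows as JSON Lines: one compact JSON object per line."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


def filter_rows(rows: list[dict], **criteria) -> list[dict]:
    """Keep rows whose fields equal every given criterion, e.g. passed=False."""
    return [row for row in rows if all(row.get(k) == v for k, v in criteria.items())]


envelope = {
    "model": "example/model-small",
    "skill": "drupal-automated-testing",
    "samples": [
        {"case_id": "case-1", "passed": True, "latency_s": 1.2, "total_tokens": 640},
        {"case_id": "case-2", "passed": False, "latency_s": 2.9, "total_tokens": 910},
    ],
}
rows = flatten_envelope(envelope)
jsonl = to_jsonl(rows)
failures = filter_rows(rows, passed=False)
```

A dashboard filter for model or pass/fail then reduces to the same equality test that `filter_rows` applies here.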
New skill:
- skills/drupal-eval-pipeline/ + evals/drupal-eval-pipeline/: Documents the three-step pipeline, with static checks and four behavioral eval cases
Live POC
- Dataset: https://huggingface.co/datasets/webchick/eval-results-poc
- Dashboard: https://huggingface.co/spaces/webchick/eval-dashboard-poc
Relation to the Eval Commons proposal
This MR is a proof of concept for ai#3586445. It provides a working end-to-end implementation of Layers 1–4 of that proposal at POC depth, and it informed the review comment on that issue.