feat: add A/B comparison script for measuring skill effectiveness

Adds evals/compare.py, which runs the behavioral evals with and without a skill loaded and reports the pass-rate delta, token usage, and cost per question.
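For reviewers, the reported metrics can be sketched roughly as follows. This is an illustration only, not the contents of evals/compare.py: the `RunResult` type, its field names, and the `summarize` helper are hypothetical, and the real script may compute or label these values differently.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Aggregate outcome of one eval run (hypothetical shape)."""
    passed: int      # questions that passed
    total: int       # questions attempted
    tokens: int      # total tokens consumed by the run
    cost_usd: float  # total cost of the run in USD


def pass_rate(run: RunResult) -> float:
    """Fraction of questions passed; 0.0 for an empty run."""
    return run.passed / run.total if run.total else 0.0


def cost_per_question(run: RunResult) -> float:
    """Average cost per question; 0.0 for an empty run."""
    return run.cost_usd / run.total if run.total else 0.0


def summarize(baseline: RunResult, with_skill: RunResult) -> dict:
    """Compare a run without the skill (baseline) to a run with it."""
    return {
        "pass_rate_delta": pass_rate(with_skill) - pass_rate(baseline),
        "token_delta": with_skill.tokens - baseline.tokens,
        "baseline_cost_per_question": cost_per_question(baseline),
        "skill_cost_per_question": cost_per_question(with_skill),
    }
```

A positive `pass_rate_delta` means the skill improved the pass rate. The token and cost figures show what that improvement costs.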

Documents the before/after workflow in CONTRIBUTING.md so contributors can measure the impact of their changes.
