feat: add A/B comparison script for measuring skill effectiveness

Adds evals/compare.py, which runs the behavioral evals with and without a skill loaded and reports the pass-rate delta, token usage, and cost per question.
Documents the before/after workflow in CONTRIBUTING.md so contributors can measure the impact of their changes.