feat: #3583202 Add provider-agnostic eval runner

Adds codex, gemini, and mistral providers to compare.py on top of the merged claude multi-model support. Each provider is a black-box function with the signature (prompt, model, cwd) -> result dict; the 8-field result primitive (including response, elapsed, exit_code, tokens, and cost) is unchanged across providers. The provider is selected with the --provider flag (default: claude). Each provider has its own default model, and a pricing table is used for cost estimation.
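To illustrate the provider contract described above, here is a minimal sketch rather than the actual code in compare.py. It assumes a registry dict keyed by provider name, a per-provider default-model table, and a pricing table in USD per million tokens. The names `PROVIDERS`, `DEFAULT_MODELS`, `PRICING`, `estimate_cost`, `echo_provider`, and `run`, the `echo` provider, the `example-model` name, and its rates are all hypothetical. Only the five result fields named in this description are shown.

```python
import time

# Hypothetical pricing table: USD per 1M tokens as (input_rate, output_rate).
# The rates are placeholders for illustration, not real provider prices.
PRICING = {"example-model": (1.0, 2.0)}


def estimate_cost(model, input_tokens, output_tokens):
    """Estimate the cost of one query from token counts and the pricing table."""
    in_rate, out_rate = PRICING.get(model, (0.0, 0.0))
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def echo_provider(prompt, model, cwd):
    """Stand-in provider with the (prompt, model, cwd) -> result dict signature.

    A real provider would invoke a CLI or API from `cwd`; this one only echoes
    the prompt so the sketch runs without credentials.
    """
    start = time.monotonic()
    response = f"[{model}] {prompt}"
    # Whitespace word counts stand in for real token counts.
    tokens = {"input": len(prompt.split()), "output": len(response.split())}
    return {
        "response": response,
        "elapsed": time.monotonic() - start,
        "exit_code": 0,
        "tokens": tokens,
        "cost": estimate_cost(model, tokens["input"], tokens["output"]),
    }


# Hypothetical registry and per-provider default models.
PROVIDERS = {"echo": echo_provider}
DEFAULT_MODELS = {"echo": "example-model"}


def run(provider, prompt, cwd=".", model=None):
    """Dispatch to the selected provider, falling back to its default model."""
    fn = PROVIDERS[provider]
    return fn(prompt, model or DEFAULT_MODELS[provider], cwd)


result = run("echo", "hello world")
```

Because every provider returns the same dict shape, the scoring and reporting code never needs to branch on which provider produced a result.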

Cross-provider run on writing-automated-tests (5 cases, no-baseline, 1 run):

  • claude sonnet: 80% -> 100% with skill (+20 pp), $0.036 -> $0.044/q
  • codex gpt-5.4: 80% -> 100% with skill (+20 pp), $0.328 -> $0.106/q
  • gemini-2.5-pro: 20% -> 20% with skill (+0 pp), $0.031 -> $0.033/q

By: zorz
