Rust repo change
Small real changes with tests, cargo checks, and reviewable diffs.
Measurement
Phonton benchmarks should read like audit logs: fixed tasks, pinned commits, provider disclosure, verification outcomes, and correction burden.
Benchmark plan
A task only counts when the verification gate passes or failure is correctly escalated.
Track model tier, retries, tokens, and configured provider pricing separately.
Record how much human correction remains after Phonton marks work ready.
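The counting and cost rules above can be sketched as a small Rust record. This is only an illustration: every type and field name here (`GateResult`, `RunCost`, `CorrectionBurden`, `counts`) is hypothetical rather than Phonton's actual schema, and pricing is assumed to be configured in USD per million tokens.

```rust
/// Outcome of the verification gate for one task (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GateResult {
    /// The verification command passed.
    Passed,
    /// The gate failed and the run escalated the failure correctly.
    FailedEscalated,
    /// The gate failed but the work was still marked ready.
    FailedUnreported,
}

/// A task only counts when the gate passes or failure is correctly escalated.
pub fn counts(result: GateResult) -> bool {
    matches!(result, GateResult::Passed | GateResult::FailedEscalated)
}

/// Cost inputs kept as separate fields so none is folded into another.
#[derive(Debug, Clone, PartialEq)]
pub struct RunCost {
    pub model_tier: String,
    pub retries: u32,
    pub input_tokens: u64,
    pub output_tokens: u64,
    /// Assumption: provider pricing is configured in USD per million tokens.
    pub usd_per_million_input: f64,
    pub usd_per_million_output: f64,
}

impl RunCost {
    /// Cost derived from the recorded tokens and the configured pricing.
    pub fn estimated_usd(&self) -> f64 {
        (self.input_tokens as f64 / 1_000_000.0) * self.usd_per_million_input
            + (self.output_tokens as f64 / 1_000_000.0) * self.usd_per_million_output
    }
}

/// Human correction remaining after the work was marked ready.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CorrectionBurden {
    /// Assumption: measured as lines a reviewer had to change by hand.
    pub lines_changed_by_human: u32,
}
```

Keeping tokens and pricing as separate fields means the cost can be recomputed if the configured provider pricing changes, without rerunning the task.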
Disclosure
Pinned: Every run links to the exact repo state.
Declared: Model and routing settings are part of the result.
Reported: Syntax, workspace, and test outcomes are shown.
Run format
Fix repo commit, task prompt, provider config, and expected verification command.
Capture plan, retries, checks, tokens, cost, and final review payload.
Show verified completion, failure mode, or human correction burden without hiding misses.
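One way to picture the run format is as three pieces of data: inputs fixed before the run, values captured during it, and the outcome that is shown. The sketch below is an assumption-laden illustration, not Phonton's real report format; all names (`RunInputs`, `RunCapture`, `Outcome`, `report_line`) and the output layout are hypothetical.

```rust
use std::fmt;

/// Inputs fixed before a run starts (hypothetical field names).
#[derive(Debug, Clone)]
pub struct RunInputs {
    pub repo_commit: String,
    pub task_prompt: String,
    pub provider_config: String,
    pub verification_command: String,
}

/// Data captured while the run executes.
#[derive(Debug, Clone)]
pub struct RunCapture {
    pub plan: String,
    pub retries: u32,
    /// Each entry is (check name, passed).
    pub checks: Vec<(String, bool)>,
    pub tokens: u64,
    pub cost_usd: f64,
    pub review_payload: String,
}

/// What the report shows; misses are first-class variants, not omitted.
#[derive(Debug, Clone)]
pub enum Outcome {
    VerifiedCompletion,
    Failure { mode: String },
    HumanCorrection { note: String },
}

impl fmt::Display for Outcome {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Outcome::VerifiedCompletion => write!(f, "verified completion"),
            Outcome::Failure { mode } => write!(f, "failure ({})", mode),
            Outcome::HumanCorrection { note } => {
                write!(f, "human correction needed ({})", note)
            }
        }
    }
}

/// Builds a one-line summary that always lists the failed checks.
pub fn report_line(inputs: &RunInputs, capture: &RunCapture, outcome: &Outcome) -> String {
    let failed: Vec<&str> = capture
        .checks
        .iter()
        .filter(|check| !check.1)
        .map(|check| check.0.as_str())
        .collect();
    format!(
        "{} | {} | retries={} tokens={} failed_checks=[{}]",
        inputs.repo_commit,
        outcome,
        capture.retries,
        capture.tokens,
        failed.join(",")
    )
}
```

Modeling failure and human correction as explicit variants means a report cannot be produced without stating which of the three outcomes applied.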