Benchmarks
Confidence comes from artifacts.
Phonton is designed for context efficiency and proof-carrying development. Public claims against Cursor, Claude Code, Codex, HermesAgent, BridgeSpace, or other ADEs require reproducible benchmark packets.

Comparison protocol
An ADE benchmark needs the whole run.
Pin repo, commit, prompt, tool versions, model route, and allowed capabilities.
Capture raw logs, provider usage when available, tool calls, retries, and verifier output.
Publish final diff, review artifact, quality notes, rollback path, and cost summary.
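The pin step of the protocol can be sketched as a small manifest that is fingerprinted before the run starts, so that any later change to the inputs is detectable. This is a minimal sketch, not Phonton's actual format: the `RunManifest` class, its field names, and the `manifest_fingerprint` helper are hypothetical names chosen for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record of everything pinned before a run starts."""

    repo_url: str
    commit: str
    prompt: str
    tool_versions: dict[str, str]
    model_route: str
    allowed_capabilities: tuple[str, ...] = ()


def manifest_fingerprint(manifest: RunManifest) -> str:
    """Return a SHA-256 digest of the pinned inputs.

    Keys are sorted so that the digest does not depend on the order in
    which tool versions were recorded.
    """
    payload = json.dumps(asdict(manifest), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Recording the digest next to the raw logs gives a reviewer a quick way to confirm that a replay used the same repository, commit, prompt, and tool versions as the published run.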
Required packet
Every comparison should be replayable.
Fixture repo: Pinned repository and commit before the run starts.
Prompt: Exact goal text, including file and MCP mentions.
Tool versions: Phonton version, model/provider route, and comparator versions.
Raw logs: Provider usage, command output, tool calls, retries, and failures.
Final diff: The produced patch and changed-file summary.
Verification: Syntax, build, test, runtime, or failure diagnostics.
Review artifact: HandoffPacket or nearest equivalent completion summary.
Quality review: Human or automated quality notes with reproducible criteria.
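One way to enforce the packet is a completeness check that refuses to treat a comparison as replayable while any item is missing. The sketch below is illustrative only: `BenchmarkPacket`, `missing_items`, and `is_replayable` are hypothetical names, and each field is assumed to hold a path or text for the corresponding item.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BenchmarkPacket:
    """Hypothetical container with one slot per required packet item."""

    fixture_repo: Optional[str] = None
    prompt: Optional[str] = None
    tool_versions: Optional[str] = None
    raw_logs: Optional[str] = None
    final_diff: Optional[str] = None
    verification: Optional[str] = None
    review_artifact: Optional[str] = None
    quality_review: Optional[str] = None


def missing_items(packet: BenchmarkPacket) -> list[str]:
    """List the names of required items that are absent or empty."""
    return [f.name for f in fields(packet) if not getattr(packet, f.name)]


def is_replayable(packet: BenchmarkPacket) -> bool:
    """A packet counts as replayable only when nothing is missing."""
    return not missing_items(packet)
```

Running `missing_items` before publication turns the list above into a gate: a packet with an empty `raw_logs` or `verification` slot is reported by name instead of being silently accepted.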
Current public artifact
The existing benchmark is intentionally narrow.
The planner-preview report is useful release evidence, but it is not a provider invoice, not a cached-token measurement, not an end-to-end quality score, and not a competitor comparison.
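The report measures plan preview time and estimated context reduction only.
Provider invoices, cached-token usage, and end-to-end quality are not measured by the current public artifact.
Competitor comparisons are not claimed without fixed fixtures and raw evidence from every tool.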
Claim rule
Say what is proven, and separate what is designed.
Allowed:
Phonton is designed for context efficiency and visible proof.
Not allowed without artifacts:
Phonton uses 90% fewer tokens than another ADE.
Phonton beats Cursor, Claude Code, Codex, HermesAgent, or BridgeSpace.
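The claim rule can be expressed as a simple publication gate: design statements pass as they are, while measured or comparative statements pass only when they point at artifacts. This is a rough sketch under stated assumptions. The `Claim` type, the two `kind` values, and `may_publish` are hypothetical, and the check only confirms that artifacts are referenced, not that they are valid.

```python
from dataclasses import dataclass

# Hypothetical claim categories used only for this illustration.
ALLOWED_KINDS = {"design", "measured"}


@dataclass(frozen=True)
class Claim:
    """A public statement plus the artifacts that back it, if any."""

    text: str
    kind: str
    artifact_paths: tuple[str, ...] = ()


def may_publish(claim: Claim) -> bool:
    """Apply the claim rule: measured claims need at least one artifact."""
    if claim.kind not in ALLOWED_KINDS:
        raise ValueError(f"unknown claim kind: {claim.kind!r}")
    if claim.kind == "design":
        return True
    return len(claim.artifact_paths) > 0
```

Under this sketch, "Phonton is designed for context efficiency and visible proof" passes as a `design` claim, while a token-savings or competitor claim tagged `measured` is rejected until benchmark packets are attached.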