
You don't know if your coding agent is good. You have a vibe. Tonight we replace the vibe with a number.

Everyone ships an agent and says it "feels solid." Nobody can tell you its pass rate on a task it has never seen. VCN #45: Bench is the night you build a real eval and start measuring.

Format:

The walkthrough. What a coding-agent eval actually is. SWE-bench-style task sets, where they come from, and why a public benchmark tells you almost nothing about YOUR repo.

Assemble your own bench. Pull real tasks out of your own codebase: a bug with a known fix, a refactor with a clear oracle, a feature with a passing test. Five tasks you trust beat a thousand you can't.

Build the harness. Wire a deterministic oracle per task (the test that decides pass or fail, no LLM judge). Run the agent. Score it. Then run it again, and again, and measure pass@k (at least one of k runs passes) AND pass^k (all k runs pass), the consecutive-green reliability that actually predicts whether you can leave it alone.

Read the leaderboard. Your agent's real numbers on your real tasks. Where it's flaky, where it's solid, and what a single failing oracle just told you.

By 10pm you have a repeatable eval that scores any coding agent on your own tasks. We reuse this exact bench next Saturday at the Bake-Off (#46) to score agents head to head.

Builders only. Bring a repo with at least one test you trust.

Doors 7pm. Walkthrough 7:30. Frontier Tower Floor 10.

Hosted by Vibe Coding Nights: Rayyan Zahid (Immersive Commons), Michalis Vasileiadis (Hacker Bob), Eric…