Exploring Next

Exploring Next — Ep 415 w/ Justy & Cody — cameronrwolfe.substack.com/p/agent-evals

Justy and Cody dig into Cameron Wolfe’s argument that agent evals need to move from static benchmark thinking to realistic harnesses that test autonomy, tool use, recovery, and long-horizon behavior. They get specific about the agentic loop, why tool-call correctness is only part of the story, and where outcome-based evals can hide ugly behavior. Cody mostly buys the technical framing, with caveats about overfitting to harnesses and the difficulty of defining ground truth trajectories. Justy keeps pulling it back to who actually needs this now: teams shipping coding, workflow, or other higher-stakes agents where a demo is not the same as reliability.
