Ep 415 article 4:35 w/ Justy & Cody

Agent Evals

Justy and Cody dig into Cameron Wolfe’s argument that agent evals need to move from static benchmark thinking to realistic harnesses that test autonomy, tool use, recovery, and long-horizon behavior. They get specific about the agentic loop, why tool-call correctness is only part of the story, and where outcome-based evals can hide ugly behavior. Cody mostly buys the technical framing, with caveats about overfitting to harnesses and the difficulty of defining ground-truth trajectories. Justy keeps pulling it back to who actually needs this now: teams shipping coding, workflow, or other higher-stakes agents where a demo is not the same as reliability.

Script: GPT-5.4 Voice: Inworld TTS 1.5 Mini

Transcript

Justy Okay, Cody, this one got me fast. The real point is not "agents need evals." It’s that the old benchmark mindset barely tells you anything once a model can loop, use tools, and keep going on its own.

Cody Yeah.

Cody And I think that mostly holds. If the system can act over time, then a tidy question-answer score is almost decorative. It says something about the model, but not much about the agent you actually shipped.

Justy Also, I’m running on weird sleep because I got in late and then spent this morning fighting my calendar app like it was a personal enemy. Which honestly made me extra receptive to "please do not let an agent touch real workflows unless you can measure it."

Cody That is fair. I lost an hour today to a local dev setup that broke because one package decided its own versioning was a creative writing exercise. So yes, anyway, evaluating systems that can keep making decisions after the first mistake feels very real.

Justy The article’s clearest move, I think, is defining the difference pretty narrowly. An agent is basically an LLM using tools inside a loop, with enough autonomy to judge intermediate results and recover from errors instead of just answering once.

Cody Right, right.

Justy That sounds simple, but it matters because it shifts what counts as failure. Not just wrong answer. Wrong tool, bad tool args, unnecessary steps, getting stuck, or quietly doing something dumb in the environment.

Cody The Qwen3 example helps there. He walks through the XML-style tool tokens, like tool, tool call, and tool response, and the actual inference loop where generation stops, the call gets parsed, the tool runs, then generation resumes with the result in context.
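The loop Cody describes (generate, stop, parse the call, run the tool, resume with the result in context) can be sketched roughly as follows. This is a minimal illustration, not the article's or Qwen3's actual code: the tag names tool_call and tool_response, the JSON payload shape, the `fake_model` stand-in, and the `add` tool are all assumptions made for the example.

```python
import json

def wrap(tag, body):
    """Wrap body in an XML-style tag pair (tag names here are assumed)."""
    return f"<{tag}>{body}</{tag}>"

def extract(tag, text):
    """Return the body of the first tag pair found in text, or None."""
    open_t, close_t = f"<{tag}>", f"</{tag}>"
    start = text.find(open_t)
    end = text.find(close_t, start + len(open_t)) if start != -1 else -1
    if end == -1:
        return None
    return text[start + len(open_t):end]

# Hypothetical tool registry: one trivial tool.
TOOLS = {"add": lambda a, b: a + b}

def fake_model(context):
    """Hypothetical stand-in for an LLM: call a tool once, then answer."""
    result = extract("tool_response", context)
    if result is None:
        call = json.dumps({"name": "add", "arguments": {"a": 2, "b": 3}})
        return wrap("tool_call", call)
    return f"The sum is {result}."

def run_agent(prompt, model, tools, max_steps=5):
    """Generate, parse any tool call, run it, and resume with the result."""
    context = prompt
    trajectory = []  # recorded so an eval can inspect the path later
    for _ in range(max_steps):
        output = model(context)          # generation stops here
        context += output
        payload = extract("tool_call", output)
        if payload is None:              # no tool call: treat as final answer
            return output, trajectory
        call = json.loads(payload)       # parse the call
        result = tools[call["name"]](**call["arguments"])  # run the tool
        trajectory.append((call["name"], call["arguments"], result))
        context += wrap("tool_response", json.dumps(result))  # resume with result
    return None, trajectory              # step budget exhausted

answer, trajectory = run_agent("What is 2 + 3?", fake_model, TOOLS)
print(answer)
print(trajectory)
```

Recording the trajectory alongside the final answer is what lets a harness score the path as well as the outcome.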

Justy Mm-hm.

Cody That’s useful because it makes the eval problem concrete. You can score whether the model called a tool when it should have, picked the right one, produced valid structure, followed a sensible trajectory, or just reached the right outcome.
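As a rough sketch of those separate checks, the function below scores one episode along four axes: whether a tool was called when it should have been, whether the right tool was picked, whether the arguments have the expected structure, and whether the outcome was right. The field names (`needs_tool`, `expected_tool`, `arg_names`, `expected_answer`) are hypothetical and not taken from the article.

```python
def score_tool_use(case, calls, final_answer):
    """Score one episode on several independent axes (all field names assumed)."""
    used_tool = bool(calls)
    names = [c["name"] for c in calls]
    if case["needs_tool"]:
        right_tool = case["expected_tool"] in names
    else:
        right_tool = not used_tool
    # Argument check: calls to the expected tool must use exactly the expected keys.
    valid_args = all(
        set(c["arguments"]) == set(case["arg_names"])
        for c in calls
        if c["name"] == case["expected_tool"]
    )
    return {
        "tool_decision": used_tool == case["needs_tool"],
        "tool_choice": right_tool,
        "valid_args": valid_args,
        "outcome": final_answer == case["expected_answer"],
    }

case = {
    "needs_tool": True,
    "expected_tool": "search",
    "arg_names": ["query"],
    "expected_answer": "Paris",
}
# The agent picked the right tool but used the wrong argument key.
calls = [{"name": "search", "arguments": {"q": "capital of France"}}]
scores = score_tool_use(case, calls, "Paris")
print(scores)
```

In this example the outcome is correct while the argument check fails, which is the kind of gap a single aggregate score would hide.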

Justy And I liked that he doesn’t treat tool use as this magical layer. It’s almost annoyingly mundane. Tokens, schemas, docs, retries, error handling. Which is very much your love language, by the way.

Cody I mean, yes. The line about asking whether a human engineer could use the tool from the docs is probably the most grounded sentence in the whole piece. Bad tools poison the eval, because then you can’t tell whether the agent failed or your interface was just mush.

Justy That part changed the practical read for me. If a team is shipping a coding agent, or some workflow agent that can change tickets or schedules or whatever, they should care NOW. A slick demo is not evidence that the thing is reliable across long tasks.

Cody Exactly.

Cody Where I’d push a little is trajectory scoring. It’s helpful, but it can get weird fast. In real tasks there may be several valid paths, so a single ground-truth sequence can punish an agent for solving the problem differently, or even better.

Justy Yeah, that felt like the main place where the framework can overreach. If you score the path too rigidly, you end up rewarding obedience to your harness instead of competence.
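One way to picture the rigidity problem: compare a strict match against a single reference path with two looser checks, one that accepts any of several reference paths and one that only requires certain steps to appear in order. This is an illustrative sketch; the tool names and the three scoring functions are hypothetical, not a method from the article.

```python
def exact_match(observed, reference):
    """Strict scoring: the path must equal one reference sequence."""
    return observed == reference

def matches_any(observed, references):
    """Looser: accept any of several known-good paths."""
    return any(observed == ref for ref in references)

def contains_in_order(observed, required):
    """Loosest: required steps must appear in order; extra steps are allowed."""
    remaining = iter(observed)
    # 'step in remaining' advances the iterator, so order is enforced.
    return all(step in remaining for step in required)

references = [
    ["search", "read", "answer"],
    ["lookup_cache", "answer"],   # a shorter, equally valid path
]
observed = ["lookup_cache", "answer"]

print(exact_match(observed, references[0]))   # strict check against one path
print(matches_any(observed, references))      # accepts the alternative path
print(contains_in_order(["search", "open", "read", "answer"], references[0]))
```

Here the strict check rejects an agent that took the shorter valid path, while the looser checks accept it. The looser checks trade precision for tolerance, so they still need a separate outcome check.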

Cody Oh, interesting.

Cody Right. Outcome-only evals have the opposite problem, though. The agent can stumble into a correct answer after wasting tools, taking risky actions, or relying on brittle behavior. So his broader argument still works. You need multiple views, not one score pretending to be truth.
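A small sketch of "multiple views" in practice: report the outcome next to process signals such as total tool calls, failed calls, and risky actions, instead of collapsing everything into one number. The `EpisodeReport` fields, the trajectory format, and the `delete_ticket` tool are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class EpisodeReport:
    outcome_correct: bool
    tool_calls: int
    failed_calls: int
    risky_actions: int

def summarize(trajectory, final_answer, expected, risky_tools):
    """Build a multi-view report from one episode (trajectory format assumed)."""
    return EpisodeReport(
        outcome_correct=final_answer == expected,
        tool_calls=len(trajectory),
        failed_calls=sum(1 for step in trajectory if not step["ok"]),
        risky_actions=sum(1 for step in trajectory if step["name"] in risky_tools),
    )

# An episode that reaches the right answer the ugly way.
trajectory = [
    {"name": "search", "ok": False},
    {"name": "search", "ok": False},
    {"name": "delete_ticket", "ok": True},   # hypothetical risky tool
    {"name": "search", "ok": True},
]
report = summarize(trajectory, "42", "42", risky_tools={"delete_ticket"})
print(report)
```

An outcome-only eval would mark this episode as a pass. The extra fields show two failed calls and one risky action on the way there.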

Justy This is such an episode four hundred fifteen problem. We invented a machine that can kind of operate software, and now we’re like, cool, how do we grade its judgment without lying to ourselves.

Cody And without building a fake little school for robots.

Justy Which, to be clear, is exactly what eval harnesses are. Tiny weird obstacle courses. Very serious industry, deeply normal behavior.

Cody I do think the article is strongest when it says stop relying on anecdotal checks. That’s the part people need to hear. If the agent is headed for coding, medicine, or any workflow with real consequences, vibes are NOT an eval.

Justy And for everybody else, I think the practical takeaway is smaller. If you’re just using an LLM for narrow text tasks, this probably doesn’t change your week. But if the product story involves autonomy, tool use, and "it can handle the whole thing," then yeah, Cody, you need a harness before you need a launch post.

Cody That is annoyingly correct, Justy.

Justy Great. Let’s end there before you make me build a rubric for my calendar app.