Ep 201 tool 5:11 w/ Justy & Cody

Improving Deep Agents with harness engineering

LangChain improved their coding agent from Top 30 to Top 5 on Terminal Bench 2.0 by changing only the harness, the system that wraps around the model. They used trace analysis to identify failure patterns and implemented targeted fixes such as self-verification loops, context injection, and reasoning budget optimization. The 13.7-point improvement shows how much of the performance gain comes from better tooling around models, not just bigger models.

Script: Sonnet 4.5 Voice: OpenAI TTS

Transcript

Izzo Your coding agent writes beautiful code, tests it once, says 'looks good' and ships a bug to prod.

Izzo You're listening to Exploring Next, episode two-oh-two. I'm Izzo, here with Boone, and today we're talking harness engineering — the art of wrapping better tooling around your AI agents.

Boone LangChain just dropped some serious results, Izzo. They took their coding agent from Top 30 to Top 5 on Terminal Bench without touching the model at all.

Izzo A thirteen point seven point improvement just from changing the harness. That's the system that sits around GPT-5.2-Codex, right?

Boone Exactly. Think of it like this — you've got this incredibly smart but chaotic intern. The harness is your management system that channels that intelligence toward actually shipping working code.

Izzo Okay but why does this matter right now? Because every team I know is wrestling with agents that look brilliant in demos but fall apart in production.

Boone Right. And what's clever here is they used traces to debug at scale. Instead of guessing why agents fail, they built an automated trace analyzer that spawns parallel analysis agents to find patterns.

Izzo Hold on — they built agents to debug their agents?

Boone Meta, right? The flow is: fetch experiment traces from LangSmith, spawn analysis agents, synthesize findings, then make targeted harness changes. It's like automated boosting but for agent behavior.
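
A minimal sketch of the fetch, analyze in parallel, then synthesize flow Boone describes. Everything here is a hypothetical stand-in: the traces are an in-memory list rather than anything pulled from LangSmith, and `analyze_trace` is a simple rule where the real system would run an analysis agent.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for experiment traces fetched from a tracing backend.
TRACES = [
    {"task": "t1", "passed": False, "steps": ["write", "read", "finish"]},
    {"task": "t2", "passed": False, "steps": ["write", "edit", "edit", "edit", "finish"]},
    {"task": "t3", "passed": True, "steps": ["write", "run_tests", "finish"]},
    {"task": "t4", "passed": False, "steps": ["write", "finish"]},
]


def analyze_trace(trace):
    """Stand-in for one analysis agent: label a failed trace with a failure pattern."""
    if trace["passed"]:
        return None
    if trace["steps"].count("edit") >= 3:
        return "edit_loop"
    if "run_tests" not in trace["steps"]:
        return "no_verification"
    return "other"


def run_analysis(traces, max_workers=4):
    """Analyze traces in parallel, then synthesize the labels into ranked patterns."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        labels = list(pool.map(analyze_trace, traces))
    return Counter(label for label in labels if label).most_common()


if __name__ == "__main__":
    for pattern, count in run_analysis(TRACES):
        print(f"{pattern}: {count} failed trace(s)")
```

The ranked list is what would drive the "targeted harness changes" step: the most frequent pattern is the first candidate for a fix.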

Izzo That's actually brilliant. So what were the agents screwing up?

Boone Classic stuff. They'd write a solution, read their own code, say 'this looks fine' and stop. No testing, no verification against the actual task spec.

Izzo The confidence of a junior dev combined with the persistence of a machine. Terrifying.

Boone So they added what they call a PreCompletionChecklistMiddleware — intercepts the agent right before it tries to exit and forces a verification pass.

Izzo Boone, break down this middleware approach for me. How's it actually implemented?

Boone It's a hook-based architecture. The middleware sits between model calls and tool execution. So when the agent thinks it's done, the PreCompletion hook kicks in and basically says 'not so fast, did you actually test this?'
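
To make the hook idea concrete, here is a rough sketch under stated assumptions. The class, its method names, and the scripted `run_agent` loop are all invented for illustration and are not the Deep Agents middleware API; the only thing taken from the episode is the behavior, which is to intercept the exit and force a verification pass.

```python
class PreCompletionChecklist:
    """Hypothetical hook: blocks an exit attempt until tests have been run."""

    def __init__(self):
        self.tests_ran = False

    def on_tool_call(self, tool_name):
        # Track whether the agent has actually verified its work.
        if tool_name == "run_tests":
            self.tests_ran = True

    def before_exit(self):
        """Return None to allow the exit, or a reminder to send back to the agent."""
        if self.tests_ran:
            return None
        return "Before finishing: run the tests and check the result against the task spec."


def run_agent(actions, hook):
    """Replay a scripted list of actions; 'finish' is the agent trying to exit."""
    log = []
    for action in actions:
        if action == "finish":
            if hook.before_exit() is None:
                log.append("exit")
                break
            log.append("blocked")  # the reminder would be fed back to the model here
        else:
            hook.on_tool_call(action)
            log.append(action)
    return log


if __name__ == "__main__":
    print(run_agent(["write_file", "finish", "run_tests", "finish"], PreCompletionChecklist()))
```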

Izzo Smart. What else did they build?

Boone LocalContextMiddleware runs on startup, maps the current directory, finds Python installations, discovers available tools. Agents are terrible at environmental awareness, so you inject that context upfront.
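
A small sketch of what startup context injection could look like, using only the Python standard library. The function name, the tool list, and the output format are assumptions made for illustration; the episode only says the middleware maps the directory, finds Python installations, and discovers available tools.

```python
import os
import shutil
import sys


def collect_local_context(root=".", tools=("python3", "git", "pytest", "make"), max_entries=50):
    """Build a short text summary of the environment to prepend to the agent's prompt."""
    entries = sorted(os.listdir(root))[:max_entries]
    lines = [
        f"Working directory: {os.path.abspath(root)}",
        f"Entries: {', '.join(entries) if entries else '(empty)'}",
        f"Current interpreter: {sys.executable}",
    ]
    for name in tools:
        # shutil.which returns the executable's path, or None if it is not on PATH.
        lines.append(f"Tool {name}: {shutil.which(name) or 'not found'}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(collect_local_context())
```

Capping the directory listing keeps the injected context small, so the summary does not crowd out the task description.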

Izzo Makes sense from a product perspective. If I'm shipping this to developers, I can't have agents spending half their time figuring out basic env setup.

Boone And here's where it gets interesting — they implemented LoopDetectionMiddleware that tracks per-file edit counts. After N edits to the same file, it suggests reconsidering the approach.

Izzo Because agents get stuck in doom loops?

Boone Exactly. Ten-plus iterations of tiny variations on the same broken solution. The loop detection isn't perfect, but it helps agents step back and try a different angle.
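
The per-file edit counter can be sketched in a few lines. The class name, the default threshold of five, and the wording of the nudge are assumptions; the episode only says the middleware suggests reconsidering the approach after N edits to the same file.

```python
from collections import Counter


class LoopDetector:
    """Hypothetical loop detector: counts edits per file and nudges after a threshold."""

    def __init__(self, max_edits=5):  # stands in for the unspecified "N"
        self.max_edits = max_edits
        self.edit_counts = Counter()

    def record_edit(self, path):
        """Record one edit; return a nudge once the file reaches the threshold, else None."""
        self.edit_counts[path] += 1
        count = self.edit_counts[path]
        if count >= self.max_edits:
            return (
                f"You have edited {path} {count} times. "
                "Consider stepping back and trying a different approach."
            )
        return None


if __name__ == "__main__":
    detector = LoopDetector(max_edits=3)
    for _ in range(3):
        message = detector.record_edit("solution.py")
    print(message)
```

Returning a suggestion rather than raising an error matches the description in the episode: the agent is nudged, not stopped.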

Izzo What about the reasoning budget stuff? That seems like a key optimization.

Boone GPT-5.2-Codex has four reasoning modes — low, medium, high, xhigh. More reasoning means better evaluation but burns 2x more tokens and time.

Izzo Classic compute-quality tradeoff.

Boone They settled on what they call a 'reasoning sandwich' — xhigh for planning, high for implementation, xhigh for verification. Spend the compute where it matters most.

Izzo That's actually really smart resource allocation. Planning and verification are where you want the deep thinking.

Boone Running everything at xhigh scored poorly because agents timed out. Pure high scored 63.6%, but their sandwich approach hit 66.5%.
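
The reasoning sandwich amounts to a per-phase lookup. A sketch, assuming hypothetical phase names and a helper called `effort_for`; how the chosen effort level is actually passed to the model is left out, and the two scores are the ones quoted in the episode.

```python
# Phase-to-effort schedule as described: xhigh, high, xhigh.
REASONING_SANDWICH = {
    "planning": "xhigh",
    "implementation": "high",
    "verification": "xhigh",
}

# Scores quoted in the episode (percent of tasks solved).
SCORES = {"pure_high": 63.6, "sandwich": 66.5}


def effort_for(phase, schedule=REASONING_SANDWICH, default="high"):
    """Return the reasoning effort for a phase, falling back to a default."""
    return schedule.get(phase, default)


if __name__ == "__main__":
    for phase in ("planning", "implementation", "verification"):
        print(phase, "->", effort_for(phase))
    gain = round(SCORES["sandwich"] - SCORES["pure_high"], 1)
    print(f"Sandwich vs. pure high: +{gain} points")
```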

Izzo So the big insight here is that agent performance isn't just about model capability. It's about the systems you build around the model.

Boone Right. And what I love is they made trace analysis into a reusable skill. This isn't just debugging one agent, it's a systematic approach to harness optimization.

Izzo From a go-to-market angle, that's huge. Every team building agents needs this kind of debugging infrastructure.

Boone The Terminal Bench results are compelling too. Jumping from outside