Ep 352 Blog April 30, 2026 10:10 w/ Justy & Cody

Tuning Deep Agents to Work Well with Different Models

Justy and Cody dig into LangChain’s new Deep Agents model-specific harness profiles. Cody is skeptical that prompt-and-tool tuning is a durable win, while Justy sees a practical adoption path for builders who keep hitting model-specific quirks. They land on a cautious take: useful, real, and probably underappreciated, but not magic.

Script: GPT-5.4 mini. Voice: ElevenLabs.

Transcript

Justy Exploring Next, episode 352. We’re talking about Deep Agents getting model-specific profiles, and yeah, this is one of those things that sounds small until your agent starts acting weird on a different model.

Cody That’s exactly my read. The headline is basically, we found a nicer way to squeeze more out of the same models by changing prompts, tools, and middleware. Useful, but also a little suspicious because it’s easy to overcredit the harness.

Justy Sure, but if you’re a team shipping against more than one provider, that weirdness is the whole problem. People don’t want to rebuild the agent every time they swap from OpenAI to Anthropic or Google.

Cody Right, and the blog is pretty explicit that the old setup used one fixed set of prompts and tools for everyone. That’s clean architecturally, but it ignores the fact that model families have different prompting guides and different tool-use habits.

Justy And that’s the part that feels practical to me. This isn’t some abstract benchmark flex. It’s basically, your agent can be decent and still leave money on the table because the model wants to be talked to differently.

Cody Yeah. The example that jumps out is Codex wanting tool names like apply_patch and shell_command. That’s not just cosmetic. If the model has a stronger prior for a certain tool contract, giving it that contract can change how well it plans and executes.

Justy Which makes adoption feel less like a research curiosity and more like a product thing. If I’m a PM looking at an agent feature, I care whether the team can get the same workflow to behave across models without a pile of special-case code.

Cody And the clever bit is they’re not just swapping prompts. For Codex they override the default file edit implementation and alias execute as shell_command. So they’re admitting the interface itself matters, not just the words around it.
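The aliasing and override Cody describes can be sketched in plain Python. This is a hypothetical illustration, not the Deep Agents API: `HarnessProfile`, `build_toolset`, `DEFAULT_TOOLS`, the default tool name `edit_file`, and the stand-in tool bodies are all invented here. Only the names `execute`, `shell_command`, and `apply_patch` come from the episode.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

Tool = Callable[[str], str]

# Stand-in tools: they only return strings, nothing is executed or edited.
def execute(command: str) -> str:
    return f"ran: {command}"

def edit_file(patch: str) -> str:  # hypothetical default edit tool
    return f"default edit: {patch}"

def apply_patch(patch: str) -> str:  # hypothetical replacement edit tool
    return f"patch applied: {patch}"

DEFAULT_TOOLS: Dict[str, Tool] = {"execute": execute, "edit_file": edit_file}

@dataclass
class HarnessProfile:
    """Hypothetical per-model profile (not a real library class)."""
    # default tool name -> name shown to the model, same implementation
    aliases: Dict[str, str] = field(default_factory=dict)
    # default tool name -> (name shown to the model, new implementation)
    replacements: Dict[str, Tuple[str, Tool]] = field(default_factory=dict)

def build_toolset(profile: Optional[HarnessProfile] = None) -> Dict[str, Tool]:
    """Start from the shared defaults, then apply one model's profile."""
    tools = dict(DEFAULT_TOOLS)
    if profile is None:
        return tools
    for old_name, (new_name, impl) in profile.replacements.items():
        tools.pop(old_name)
        tools[new_name] = impl
    for old_name, new_name in profile.aliases.items():
        tools[new_name] = tools.pop(old_name)
    return tools

CODEX_PROFILE = HarnessProfile(
    aliases={"execute": "shell_command"},
    replacements={"edit_file": ("apply_patch", apply_patch)},
)

codex_tools = build_toolset(CODEX_PROFILE)
print(sorted(codex_tools))  # ['apply_patch', 'shell_command']
```

The point of the sketch is that the shared defaults stay untouched, and everything model-specific lives in one small object that can be swapped per provider.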

Justy That feels like a sane move. Also a little annoying, because of course the answer is more plumbing. [chuckles] But if it moves the benchmark, it moves the benchmark.

Cody The benchmark part is where I get cautious. They’re looking at a curated subset of tau2-bench, specifically harder tasks where frontier models aren’t saturated. That’s a reasonable place to measure harness effects, but it’s still a slice, not the whole world.

Justy Totally, though I think that’s fair. If everything were already saturated, you wouldn’t learn much. And the numbers they show are not tiny. GPT 5.3 Codex goes from 33 to 53, Claude Opus 4.7 from 43 to 53 on that subset.

Cody Those are meaningful jumps. I just don’t want people hearing that and thinking the harness is a universal cheat code. The blog also says they got gpt-5.2-codex from 52.8 to 66.5 on Terminal-Bench 2.0 with harness-layer changes. That’s impressive, but it’s the kind of improvement that can be very workload-specific.
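To put the quoted scores side by side, here is a small arithmetic sketch. It assumes all three pairs are scores on a 0 to 100 scale, and `gain` is a helper name made up for this example. The numbers themselves are the ones quoted in the episode.

```python
def gain(before: float, after: float):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative = absolute / before * 100
    return round(absolute, 1), round(relative, 1)

# (before, after) pairs as quoted in the episode.
results = {
    "GPT 5.3 Codex, tau2-bench subset": (33, 53),
    "Claude Opus 4.7, tau2-bench subset": (43, 53),
    "gpt-5.2-codex, Terminal-Bench 2.0": (52.8, 66.5),
}

for name, (before, after) in results.items():
    points, percent = gain(before, after)
    print(f"{name}: +{points} points ({percent}% relative)")
```

That works out to roughly 61% relative for the Codex tau2-bench jump, about 23% for Opus, and about 26% on Terminal-Bench 2.0. The gaps between those figures fit Cody's point that the size of the win depends on the model and the workload.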

Justy Yeah, but workload-specific is still where products live. Nobody ships ‘the benchmark.’ They ship code review agents, support agents, dev tools, internal automation. If the model is flaky on planning or tool use, a better harness is a real user story.

Cody I buy that. The Opus profile changes are a good example. They’re mostly prompt-focused, with lines like ‘reflect on tool results before proceeding’ and ‘use tools to observe state directly instead of reasoning from memory.’ That’s basically trying to stop the model from freelancing.
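A prompt-focused profile of that kind can be as small as a lookup of extra instructions per model family. The sketch below is an assumption about shape, not the actual Opus profile: `BASE_PROMPT`, `PROMPT_ADDENDA`, `system_prompt_for`, and the `"opus"` key are hypothetical. The two instruction lines are the ones paraphrased in the episode.

```python
from typing import Dict, List

# Hypothetical shared prompt used for every model.
BASE_PROMPT = "You are a coding agent. Use the available tools to complete the task."

# Hypothetical per-family additions; only the wording of these two lines
# comes from the episode.
PROMPT_ADDENDA: Dict[str, List[str]] = {
    "opus": [
        "Reflect on tool results before proceeding.",
        "Use tools to observe state directly instead of reasoning from memory.",
    ],
}

def system_prompt_for(model_family: str) -> str:
    """Append a family's extra lines to the shared prompt, if it has any."""
    extra = PROMPT_ADDENDA.get(model_family, [])
    return "\n".join([BASE_PROMPT, *extra])

print(system_prompt_for("opus"))
```

Families with no entry fall back to the shared prompt unchanged, which keeps the single-harness behavior as the default.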

Justy Which honestly sounds like half the bugs I’ve seen in agent demos. [laughs] The model says it checked the file, and then you look and it absolutely did not check the file.

Cody Exactly. The harness is nudging it into better habits. But there’s a trade-off. Once you start having per-model profiles, you’re signing up for ongoing maintenance. Every model update can shift the prompt sweet spot, and now you’ve got a matrix to keep healthy.

Justy That’s the adoption barrier to me. Not whether the idea is good. It’s whether a team has enough reason to manage profiles instead of saying, ‘good enough, one harness, ship it.’ Small teams especially are gonna feel that.

Cody And yet if you’re already supporting multiple models, you’re probably paying for that complexity somewhere anyway. The question is whether it’s hidden in random prompt hacks or centralized in something like Deep Agents profiles.

Justy That’s a good distinction. I think the market for this is teams building agentic developer tools, support workflows, research assistants, anything with tool use and provider switching. Less so a hobby project that just wants one bot to answer questions.

Cody Yeah, and I’d add that it’s especially relevant for people with evals. If you can measure the delta cleanly, the profile idea gives you a knob to turn without touching the rest of your stack.

Justy So the real verdict is probably: not magic, but not fluff either. If you’re serious about agents across model families, per-model harness tuning is a legit lever.

Cody Agreed. My skepticism is mostly about scope, not usefulness. The win is real, the gains are plausible, and the architecture makes sense. I just wouldn’t mistake a profile for a substitute for solid evals or a good task spec.

Justy And I wouldn’t ignore it just because it sounds like prompt fiddling. Sometimes prompt fiddling is the product. [chuckles]

Cody Fair. Build Next-wise, I’d do a weekend test with one open-source agent stack and two models. Take a simple repo, maybe a codebase you already know, and run the same task set with one shared harness versus model-specific prompts and tool names.

Justy Yeah, and keep it concrete. Compare something like Deep Agents if you’re already in that ecosystem, or even your own wrapper if you’re not. Use a few tasks that involve file edits, searches, and tests, then see if the profile actually changes pass rate or just makes the logs prettier.
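The weekend comparison can be sketched as a tiny pass-rate script. Everything here is a placeholder: `pass_rate`, `compare`, the harness labels, the task names, and especially `FAKE_OUTCOMES`, which are made-up values so the script runs and are not measurements. In a real test, `run_task` would call your agent and check the result.

```python
from typing import Callable, Dict, List, Tuple

RunTask = Callable[[str, str], bool]  # (task, harness label) -> passed?

def pass_rate(run_task: RunTask, tasks: List[str], harness: str) -> float:
    """Fraction of tasks that pass under one harness configuration."""
    passed = sum(1 for task in tasks if run_task(task, harness))
    return passed / len(tasks)

def compare(run_task: RunTask, tasks: List[str]) -> Dict[str, float]:
    """Run the same task set under both harnesses and report the delta."""
    shared = pass_rate(run_task, tasks, "shared")
    profiled = pass_rate(run_task, tasks, "profiled")
    return {"shared": shared, "profiled": profiled, "delta": profiled - shared}

TASKS = ["edit_file", "search_repo", "fix_test", "rename_symbol"]

# Placeholder outcomes only. Replace with real agent runs.
FAKE_OUTCOMES: Dict[Tuple[str, str], bool] = {
    ("edit_file", "shared"): True, ("edit_file", "profiled"): True,
    ("search_repo", "shared"): True, ("search_repo", "profiled"): True,
    ("fix_test", "shared"): False, ("fix_test", "profiled"): True,
    ("rename_symbol", "shared"): False, ("rename_symbol", "profiled"): False,
}

def fake_run_task(task: str, harness: str) -> bool:
    return FAKE_OUTCOMES[(task, harness)]

print(compare(fake_run_task, TASKS))
```

With only a handful of tasks, a one-task difference moves the rate a lot, so repeating runs before reading anything into the delta is the safer habit.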

Cody If you want a solo version, pick one bugfix task in a small repo and try it with two model configs. Track whether the model reads files before claiming things, whether it batches tool calls, and whether the final diff is actually usable.
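One of those checks, whether the model reads a file before touching it, is easy to automate if you log tool calls. The sketch assumes a trace format invented for this example: a list of dicts with `tool` and `path` keys, with hypothetical tool names `read_file` and `edit_file`. Adjust both to whatever your agent stack actually logs.

```python
from typing import Dict, List

def edits_without_prior_read(trace: List[Dict[str, str]]) -> List[str]:
    """Return paths that were edited before any read of the same path."""
    seen_reads = set()
    flagged = []
    for call in trace:
        path = call.get("path", "")
        if call["tool"] == "read_file":
            seen_reads.add(path)
        elif call["tool"] == "edit_file" and path not in seen_reads:
            flagged.append(path)
    return flagged

# Hypothetical trace: a.py is read then edited, b.py is edited blind.
example_trace = [
    {"tool": "read_file", "path": "a.py"},
    {"tool": "edit_file", "path": "a.py"},
    {"tool": "edit_file", "path": "b.py"},
]

print(edits_without_prior_read(example_trace))  # ['b.py']
```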

Cody And if the profile helps, great. If not, you’ve learned the harness wasn’t your bottleneck.

Justy Alright, that’s it for this one. Exploring Next. Thanks for hanging with us, and we’ll catch you in the next episode.