Ep 355 research 8:37 w/ Justy & Cody

Alibaba's HDPO cuts AI agent tool overuse from 98% to 2%

Justy and Cody dig into Alibaba's HDPO and Metis, a training setup that teaches AI agents to stop calling tools by default. Cody likes the core idea because it separates accuracy from efficiency during reinforcement learning, but he questions how portable the benchmark win is. Justy pushes on why this matters for real products right now: users feel latency, teams feel API bills, and nobody wants an agent that opens a toolbox for a task it already knows how to do.

Script: GPT-5.4 Voice: ElevenLabs

Transcript

Justy This is Exploring Next, episode 355. Cody thinks Alibaba's agent training result is a little too neat, and I think if it holds up, a lot of people get faster AI that stops wasting their time.

Cody Yeah, my skeptical read is not that the idea is fake. I actually think the core complaint is dead on. A lot of agents are weirdly eager to grab a tool. Search, Python, image crop, whatever. Even when the answer is already sitting there. And in a product, that means extra seconds, extra API spend, and sometimes a worse answer because the model polluted its own context.

Justy Right, and users absolutely feel that. They don't say, "Wow, what a sophisticated orchestration layer." They say, "Why did this thing take forever to answer a simple question?" Or, "Why did my usage cap disappear by lunch?"

Cody Exactly. What Alibaba is claiming with HDPO is basically, stop training the model with one mushy reward that mixes being right with being cheap. They split those apart. One channel optimizes correctness. Another optimizes efficiency. Then the efficiency signal is gated by the accuracy side, so the model doesn't get a gold star for being fast and wrong.
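
A minimal sketch of the split Cody describes, assuming a group of rollouts for one prompt. This is not Alibaba's HDPO code: the function names, the `eff_weight` knob, and the per-channel normalization are illustrative assumptions. Each channel is normalized on its own, and only correct rollouts take part in the efficiency channel.

```python
from statistics import mean, pstdev


def _normalize(values):
    """Center and scale a list; return zeros when there is no spread."""
    if not values:
        return []
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def decoupled_advantages(correct, tool_calls, eff_weight=0.5):
    """Combine two separately normalized channels for one prompt's rollouts.

    correct:    list of 0/1 correctness flags, one per rollout
    tool_calls: list of tool-call counts, one per rollout
    eff_weight: hypothetical knob for how much efficiency matters
    """
    # Accuracy channel: normalized over every rollout in the group.
    acc_adv = _normalize([float(c) for c in correct])

    # Efficiency channel: only correct rollouts compete on tool count,
    # so a fast wrong answer never earns an efficiency bonus.
    idx = [i for i, c in enumerate(correct) if c]
    eff_norm = _normalize([-float(tool_calls[i]) for i in idx])
    eff_adv = [0.0] * len(correct)
    for i, a in zip(idx, eff_norm):
        eff_adv[i] = a

    return [a + eff_weight * e for a, e in zip(acc_adv, eff_adv)]


# Four rollouts for one prompt: two correct, two wrong.
adv = decoupled_advantages(correct=[1, 1, 0, 0], tool_calls=[0, 3, 0, 5])
```

In this toy group the correct, tool-free rollout ranks first, the correct but tool-heavy one second, and the two wrong rollouts tie regardless of how many tools they called.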

Justy That part sounds more important than the headline number to me. The 98 percent to 2 percent thing is flashy, but the actual product story is, can the model learn restraint without turning into the kid who never raises their hand because they're afraid of being wrong.

Cody Yeah, that's the clever bit. If you punish tool use too hard, the model gets timid and skips tools when it really needs them. If you punish it lightly, nothing changes. If one reward is doing too many jobs, the gradients get muddy.

Justy So who cares right now? Any team building an agent with expensive or slow tools attached. Support copilots, document workflows, research assistants, coding helpers. If the model can answer from what it already knows or from the prompt, that's just better UX.

Cody And better reliability. Every tool call drags back more tokens, more junk, more chances for the model to chase the wrong thread. Sometimes the best move is to leave the toolbox closed.

Justy You say that like someone who has definitely overpacked cables for this trip.

Cody I brought one pouch. One. And yes, it has three adapters I did not need. [chuckles]

Justy Metis is the model they trained with this, built on Qwen3-VL-8B-Instruct, multimodal, with coding and search tools. And the examples are pretty intuitive. If the text on a museum sign is already readable, don't spin up Python to crop the image like you're directing a tiny film crew.

Cody Right, and in the opposite case, on a dense chart with a tiny subplot, Metis reportedly notices native vision isn't enough and uses Python to crop and zoom that region. So tool use becomes selective, not ceremonial.

Justy Where I push back on your skepticism a bit is, even if the benchmark story is curated, that's still useful. Teams need a training recipe for abstaining. Most agent stacks are biased toward action.

Cody That's fair. My concern is transfer. Their pipeline is smart, but also pretty tailored. They filtered examples, used an automated judge for strategic tool use, then kept RL prompts with a mix of success and failure so the signal stays informative.
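
The last filtering step Cody mentions can be sketched roughly like this, assuming you have already sampled several rollouts per prompt and marked each one right or wrong. The function name and the thresholds are hypothetical, and the paper's actual filter may differ. The intuition is that a prompt the model always solves or always fails gives a group-relative method nothing to compare.

```python
def keep_informative_prompts(outcomes, low=0.0, high=1.0):
    """Keep prompts whose rollouts show a mix of success and failure.

    outcomes: dict mapping prompt_id -> list of 0/1 flags, one per rollout
    low/high: hypothetical bounds on the success rate (exclusive)
    """
    kept = []
    for prompt_id, results in outcomes.items():
        if not results:
            continue  # nothing sampled, nothing to learn from
        rate = sum(results) / len(results)
        if low < rate < high:
            kept.append(prompt_id)
    return kept


sample = {
    "always_right": [1, 1, 1, 1],
    "always_wrong": [0, 0, 0, 0],
    "mixed": [1, 0, 1, 0],
}
kept = keep_informative_prompts(sample)
```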

Justy Which becomes the adoption barrier. It's not just, can I download the weights? They released code and Metis under Apache 2.0, which helps. But do I have the evals, the logging, the patience to reproduce this inside my own workflow with my own weird tools and documents?

Cody And do you measure the right thing? If a team only tracks task completion, the agent can keep acting busy forever. If they only track tool reduction, the model can become lazy. The whole point here is that accuracy and efficiency need separate scoreboards.
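
One way to keep the scoreboards separate, as a sketch only. The `Run` fields and the `needs_tool` label are assumptions about what your own eval set records, not anything from the paper. Accuracy is reported on its own, and tool use is split into overuse on tasks that needed no tool and underuse on tasks that did.

```python
from dataclasses import dataclass


@dataclass
class Run:
    correct: bool
    tool_calls: int
    needs_tool: bool  # hypothetical label from your own eval set


def _rate(flags):
    """Fraction of True values, or 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def scoreboards(runs):
    """Report accuracy and tool behavior as separate numbers."""
    return {
        "accuracy": _rate([r.correct for r in runs]),
        # Called a tool on a task that did not need one.
        "overuse": _rate([r.tool_calls > 0 for r in runs if not r.needs_tool]),
        # Skipped tools on a task that did need one.
        "underuse": _rate([r.tool_calls == 0 for r in runs if r.needs_tool]),
    }


runs = [
    Run(correct=True, tool_calls=2, needs_tool=False),
    Run(correct=True, tool_calls=0, needs_tool=False),
    Run(correct=False, tool_calls=0, needs_tool=True),
    Run(correct=True, tool_calls=1, needs_tool=True),
]
board = scoreboards(runs)
```

A single blended score would hide the fact that this toy agent is both too eager on easy tasks and too timid on hard ones.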

Justy So my honest verdict is, this feels less like a shiny new app and more like a training pattern people are going to copy. The user story is simple. I asked for help. The agent answered quickly. It used tools when it had to, not because it was nervous.

Cody Mine is, the mechanism is genuinely good. I buy the decoupled reward idea. I buy the curriculum effect too, where early training is mostly about getting the answer right and only later about becoming economical. I just want to see it outside in-house benchmarks before I declare the tool overuse problem solved.

Justy Build Next: if you're a solo builder, grab an agent framework like LangGraph or the OpenAI Agents SDK, wire up just two tools, maybe web search and Python, then log every tool call with its latency and the task's final correctness. Run a tiny eval set where half the tasks are answerable from the prompt alone and half truly need a tool.
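
A framework-agnostic sketch of that logging step, using only the standard library. The decorator, the `TOOL_LOG` list, and the stand-in `run_python` tool are all hypothetical. In a real agent you would wrap whatever callables your framework registers as tools and join these rows with per-task correctness afterward.

```python
import csv
import io
import time
from functools import wraps

TOOL_LOG = []  # one row per tool call, kept in memory for this sketch


def logged_tool(name):
    """Decorator that records the tool name and wall-clock latency."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Runs even if the tool raises, so failed calls are logged too.
                TOOL_LOG.append(
                    {"tool": name, "latency_s": time.perf_counter() - start}
                )
        return wrapper
    return decorator


@logged_tool("python")
def run_python(expr):
    """Stand-in tool: adds integers written like '2+3'."""
    return str(sum(int(x) for x in expr.split("+")))


def log_to_csv(rows):
    """Serialize logged rows to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["tool", "latency_s"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


result = run_python("2+3")
csv_text = log_to_csv(TOOL_LOG)
```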

Cody And if you want to mimic the paper's spirit, do a simple two-stage policy. Supervised fine-tune or prompt-tune for correct tool usage first, then add a reward model or rule-based scorer where reduced tool calls only count when the answer is correct. Even a weekend version works. A repo with traced runs, CSV logs, and a little script like python eval.py --mode abstain-test will tell you a lot.
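
A weekend-sized sketch of what such an `eval.py` could look like. The `--mode abstain-test` flag comes from Cody's example, but everything else here is assumed: the hard-coded `RUNS`, the field names, the `all` mode, and the scoring rule. A real version would load traced runs from your CSV logs. The rule-based scorer gives a restraint bonus only when the answer is correct.

```python
import argparse

# Hypothetical traced runs; a real script would read these from logs.
RUNS = [
    {"task": "read_sign", "needs_tool": False, "correct": True, "tool_calls": 0},
    {"task": "capital_city", "needs_tool": False, "correct": True, "tool_calls": 2},
    {"task": "tiny_subplot", "needs_tool": True, "correct": True, "tool_calls": 1},
    {"task": "live_price", "needs_tool": True, "correct": False, "tool_calls": 0},
]


def score(run, bonus=0.5):
    """Correctness first; fewer tool calls only help a correct answer."""
    if not run["correct"]:
        return 0.0
    return 1.0 + bonus / (1 + run["tool_calls"])


def evaluate(mode):
    """Score all runs, or only tasks answerable without tools."""
    if mode == "abstain-test":
        runs = [r for r in RUNS if not r["needs_tool"]]
    else:
        runs = RUNS
    return {r["task"]: score(r) for r in runs}


def main(argv=None):
    parser = argparse.ArgumentParser(description="Tiny agent eval sketch")
    parser.add_argument("--mode", choices=["all", "abstain-test"], default="all")
    args = parser.parse_args(argv)
    results = evaluate(args.mode)
    for task, value in results.items():
        print(f"{task}\t{value:.2f}")
    return results


# Demonstration call with explicit arguments; a real script would call
# main() with no arguments under an `if __name__ == "__main__":` guard.
results = main(["--mode", "abstain-test"])
```

In this toy data the agent that read the sign without tools outscores the one that made two tool calls for a capital city, and a wrong answer scores zero no matter how few tools it used.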

Justy That's episode 355 of Exploring Next. We land at cautious optimism. The big win isn't that an agent can use tools. It's that it might finally know when not to.