Ep 370 | Research Paper | May 6, 2026 | 9:19 | w/ Justy & Cody

HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness

Justy and Cody dig into HeavySkill, a paper arguing that a lot of so-called agent harness magic is really a simpler inner pattern: generate multiple reasoning paths in parallel, then run a separate deliberation pass that compares and summarizes them. They unpack the memory-cache trick, why it can beat plain Best-of-N, where the gains seem to come from, and what this means for builders deciding between brittle orchestration and something more shippable.

Script: GPT-5.4. Voice: ElevenLabs.

Transcript

Justy The funny part is this paper kind of says the fancy harness might just be a very expensive way to make the model think twice.

Cody Yeah, and not even think twice in some mystical agent sense. Their claim is more like the useful core is: branch a bunch of reasoning paths in parallel, then have a second pass read the pile and do a cleaner synthesis. That's HeavySkill in plain English.

Justy Which is a relief, honestly. I got in late last night, made bad hotel coffee in your kitchen this morning, and I do not have the energy for another paper that's like eight boxes and arrows pretending to be a product strategy. Anyway, this one is trying to isolate what actually helped.

Cody Exactly. The problem they're aiming at is that agent harnesses have gotten crowded with planners, subagents, skills, memory, tool wrappers, all that stuff. If performance goes up, it's hard to tell whether the gain came from orchestration itself or just from giving the model more shots plus a decent aggregation step.

Justy And a lot of people have been stuck on that. Research teams because they can't compare systems cleanly, and product teams because shipping a brittle multi-agent rig is way different from shipping one model with a smarter inference pattern.

Cody The key innovation here is they collapse the harness into a two-stage pipeline. Stage one is parallel reasoning, so for a question q they sample K separate trajectories from one model. Stage two is sequential deliberation, where another model reads a serialized cache of those attempts and writes a final answer after comparing them.
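
A minimal sketch of the two-stage shape Cody describes, assuming the model calls are passed in as plain Python callables. The names `heavy_think`, `toy_generate`, and `toy_deliberate` are hypothetical and not taken from the paper or its repo; the toy functions only stand in for real LLM calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def heavy_think(
    question: str,
    generate: Callable[[str, int], str],
    deliberate: Callable[[str, List[str]], str],
    k: int = 4,
) -> str:
    """Stage 1: sample k trajectories in parallel. Stage 2: one deliberation pass."""
    if k < 1:
        raise ValueError("k must be at least 1")
    with ThreadPoolExecutor(max_workers=k) as pool:
        # Each worker gets its own seed so the samples can differ.
        trajectories = list(pool.map(lambda seed: generate(question, seed), range(k)))
    # The deliberator sees every attempt at once, not just the final answers.
    return deliberate(question, trajectories)


# Toy stand-ins (assumptions): a real system would call an LLM in both places.
def toy_generate(question: str, seed: int) -> str:
    return f"attempt {seed} at: {question}"


def toy_deliberate(question: str, trajectories: List[str]) -> str:
    return f"synthesized from {len(trajectories)} attempts"


print(heavy_think("What is 17 * 24?", toy_generate, toy_deliberate, k=3))
```

Passing the two stages in as callables keeps the sketch independent of any serving stack, so the same function could wrap one model for both stages or two different ones.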

Justy So not majority vote, not just pick the prettiest answer. More like, read the work, notice where paths disagree, and use that disagreement as signal.

Cody Right. On STEM tasks with verifiable answers, Heavy-Pass@k beats the simpler baselines. My read is the deliberation pass is acting like an implicit verifier. It isn't merely counting answers, it's looking across rationales and spotting which chain has the stronger internal support.

Justy That part feels product-relevant. Because a lot of teams quietly do Best-of-N already. Generate five things, rerank, pray. If this gets better results without inventing a whole org chart of agents, that's a big simplification.
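
For contrast, here is a rough sketch of the two simpler baselines mentioned so far: majority vote over final answers, and Best-of-N with a reranker. The reranker scores are made-up numbers for illustration, and neither function reads the reasoning behind an answer, which is the gap a deliberation pass is meant to fill.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent final answer; ties go to the one seen first."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates: List[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest reranker score."""
    return max(candidates, key=score)


# Three sampled final answers; suppose the minority answer "15" is the right one.
answers = ["12", "12", "15"]

# Hypothetical reranker scores, invented for this example.
reranker_scores = {"12": 0.4, "15": 0.7}

print(majority_vote(answers))                            # counts answers only
print(best_of_n(answers, lambda a: reranker_scores[a]))  # trusts the scorer only
```

In this toy case the vote picks the popular wrong answer, and Best-of-N is only as good as its scorer.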

Cody And the mechanism matters. They add this serialized memory cache between stages because full trajectories can blow the context window. So they prune the trajectories, shuffle them to avoid positional bias, then feed that compact cache into the deliberation stage.
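
The prune, shuffle, serialize handoff could look roughly like this. The transcript does not say how the paper prunes, so keeping only the tail of each trajectory is an assumption here, as are the `build_cache` name, the `[Attempt i]` labels, and the character budget.

```python
import random
from typing import List


def build_cache(trajectories: List[str], max_chars: int = 200, seed: int = 0) -> str:
    """Prune each trajectory, shuffle the order, and serialize into one string."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    # Assumed pruning rule: keep the last max_chars characters, where the
    # conclusion of an attempt usually sits.
    pruned = [t[-max_chars:] for t in trajectories]
    # Shuffle so no attempt benefits from always appearing first or last.
    rng = random.Random(seed)
    rng.shuffle(pruned)
    # Labels are assigned after shuffling, so they carry no ordering signal.
    return "\n\n".join(f"[Attempt {i + 1}]\n{t}" for i, t in enumerate(pruned))


cache = build_cache(["a" * 500, "b" * 10, "c" * 300], max_chars=50)
print(cache)
```

A seeded `random.Random` keeps the shuffle reproducible for debugging; a production version would more likely budget in tokens than in characters.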

Justy I think that's one of the more believable parts of the paper. It's not pretending context is free. It's saying, no, you need a structured handoff between generation and deliberation or this thing collapses under its own transcript.

Cody They also do iterative deliberation. So after one summary pass, later rounds can append prior summary content back into the cache and refine again. It's basically a loop where the model revisits earlier attempts and its own synthesis.
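
A small sketch of that loop, assuming the cache is a list of text entries and each round appends the previous summary before deliberating again. The function name, the `[Prior summary]` label, and the toy deliberator are illustrative assumptions, not the paper's code.

```python
from typing import Callable, List


def iterative_deliberation(
    question: str,
    cache: List[str],
    deliberate: Callable[[str, List[str]], str],
    rounds: int = 2,
) -> str:
    """Run several deliberation rounds, feeding each summary back into the cache."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    cache = list(cache)  # copy so the caller's list is not mutated
    summary = ""
    for _ in range(rounds):
        summary = deliberate(question, cache)
        # Later rounds see the original attempts plus all earlier summaries.
        cache.append(f"[Prior summary]\n{summary}")
    return summary


# Toy deliberator (assumption): reports how many entries it was shown.
def toy_deliberate(question: str, cache: List[str]) -> str:
    return f"summary over {len(cache)} entries"


print(iterative_deliberation("q", ["t1", "t2", "t3"], toy_deliberate, rounds=2))
```

Because every round appends to the cache, the context grows with depth, which is one reason the pruning step matters.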

Justy Which is where I start asking who actually builds with this. For production, I think this is shippable in narrow cases: coding assistants, math-heavy support tooling, maybe internal analysis workflows where latency is acceptable and correctness matters more than speed.

Cody I agree, with an asterisk. The paper says quality and diversity of trajectories are the two big drivers, and the deliberation stage leans heavily on the capability of the summarizer model. So if your base model is weak, making ten bad branches doesn't magically help. You're just buying ten versions of confusion. [chuckles]

Justy Ten parallel interns with the same bad notes. Perfect. [laughs] But yeah, that trade-off is real. Cost goes up with width, latency goes up with deliberation depth, and now you've got another model choice in the loop.
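
To make that trade-off concrete, here is a back-of-the-envelope model. All numbers are invented for illustration, and the model assumes the K branches run fully in parallel, that each deliberation round reads the whole cache and writes one summary, and that the summary is appended to the cache for the next round.

```python
def heavy_tokens(k: int, rounds: int, traj_tokens: int,
                 cache_tokens_per_traj: int, summary_tokens: int) -> int:
    """Total tokens processed: k generated branches plus `rounds` deliberation passes."""
    generation = k * traj_tokens            # grows with width
    cache = k * cache_tokens_per_traj       # pruned cache size before round 1
    deliberation = 0
    for _ in range(rounds):
        deliberation += cache + summary_tokens  # read the cache, write a summary
        cache += summary_tokens                 # summary joins the cache
    return generation + deliberation


def heavy_latency(rounds: int, traj_seconds: float, delib_seconds: float) -> float:
    """Wall-clock estimate: branches overlap, deliberation rounds are sequential."""
    return traj_seconds + rounds * delib_seconds


# Hypothetical workload: 4 branches of 1000 tokens, pruned to 200 each,
# two deliberation rounds that each write a 300-token summary.
print(heavy_tokens(k=4, rounds=2, traj_tokens=1000,
                   cache_tokens_per_traj=200, summary_tokens=300))  # 6500
print(heavy_latency(rounds=2, traj_seconds=20.0, delib_seconds=5.0))  # 30.0
```

Under these assumptions, widening K mostly raises token cost while deepening the rounds mostly raises latency, which matches the split Justy describes.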

Cody Also, I like that they don't oversell it as pure training magic. They have a training-free framework first, then show RL with verifiable rewards can optimize both breadth and depth. In other words, you can train for better branch generation and better synthesis.

Justy The strongest claim, to me, is that stronger models under heavy thinking can get close to Pass@N upper bounds. If that holds broadly, then the summarization pass is doing real work, not cosmetic cleanup.
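
One way to check that claim on your own data is to compare an aggregator's accuracy against the empirical Pass@N ceiling, meaning the fraction of problems where at least one of the N samples is correct. The sketch below uses a tiny made-up dataset and majority vote as the aggregator; a deliberation pass would slot in as a different `aggregate` function. This is the plain "any sample correct" rate, not an unbiased pass@k estimator.

```python
from collections import Counter
from typing import Callable, List


def pass_at_n(samples: List[List[str]], gold: List[str]) -> float:
    """Fraction of problems where any sampled answer matches the gold answer."""
    hits = sum(g in s for s, g in zip(samples, gold))
    return hits / len(gold)


def aggregated_accuracy(samples: List[List[str]], gold: List[str],
                        aggregate: Callable[[List[str]], str]) -> float:
    """Fraction of problems where the aggregator's single pick is correct."""
    hits = sum(aggregate(s) == g for s, g in zip(samples, gold))
    return hits / len(gold)


def majority_vote(answers: List[str]) -> str:
    return Counter(answers).most_common(1)[0][0]


# Made-up data: three problems, three sampled answers each.
samples = [["4", "4", "5"], ["7", "9", "9"], ["1", "2", "3"]]
gold = ["4", "7", "6"]

ceiling = pass_at_n(samples, gold)                        # 2 of 3 problems
voted = aggregated_accuracy(samples, gold, majority_vote)  # 1 of 3 problems
print(f"Pass@N ceiling: {ceiling:.2f}, majority vote: {voted:.2f}")
```

The gap between the two numbers is the headroom any aggregation step, voting or deliberation, is competing for.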

Cody My only mild pushback is evaluation shape. They cover STEM, coding, and general tasks, which is good, but a lot of the cleanest story comes from domains with verifiable answers. In messy product environments, the summary model can still confidently synthesize nonsense.

Justy Yeah, same. Research-wise, solid. Production-wise, I'd call it more of a pattern than a turnkey system. But it's a useful pattern because it tells builders where to spend effort.

Cody If I were building next off this, I'd do three things. One, clone their repo at github.com/wjn1996/HeavySkill and reproduce one benchmark with your own prompts. Two, for a solo weekend project, wire up an open model stack in vLLM or Ollama that samples coding solutions in parallel, then runs a separate summarizer prompt over pruned traces. Three, add a tiny verifier layer on top to compare plain Best-of-N against HeavyS

Justy And I'd keep the product version boring on purpose. Start with one expensive endpoint for high-stakes requests, store the branch summaries, not the whole thought dump, and track whether deliberation actually reduces retries. If it doesn't, don't cosplay as an agent company because the diagram looked cool. [chuckles]

Cody That's episode 370 getting weirdly practical. Also, your coffee was terrible, Justy.

Justy Deserved. Anyway, Cody, I like this one because it cuts through the boxes and arrows. More thinking, better handoff, less theater.