Ep 463, June 4, 2026, 3:45, w/ Justy & Cody

NVIDIA Nemotron 3 Ultra Powers Faster, More Efficient Reasoning for Long Running Agents | NVIDIA Technical Blog

NVIDIA’s Nemotron 3 Ultra (550B parameters, 55B active) targets long-running agent workflows with hybrid Mamba-Transformer layers, NVFP4 quantization, LatentMoE routing, and multi-token prediction. It claims 5x throughput and up to 30% cost savings on agent tasks via token efficiency, while posting leading scores on Agent Productivity PinchBench (91%), Long Context Ruler @1M (95%), and others. Open weights, open recipes, and a transparent RL data pipeline aim at broad fine-tuning and domain specialization.

Script: Mistral Medium 3.5 128B. Voice: Deepgram TTS.

Transcript

Justy Okay, I just read this thing and my first thought is: holy crap, they’re finally talking about the token spiral in agents.

Cody Mm-hm.

Justy Like, every time you spin up a sub-agent or call a tool, you’re just dumping more tokens into the mix, and then the next turn has to re-read all of it. It gets ridiculous fast.
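The re-reading cost Justy describes can be put into rough numbers. The sketch below assumes a fixed system prompt and a fixed number of tokens appended per turn (both figures are hypothetical), and it ignores prompt caching, which many serving stacks use to soften exactly this effect.

```python
def cumulative_prompt_tokens(system_tokens: int, tokens_added_per_turn: int, turns: int) -> int:
    """Total prompt tokens processed when every turn re-reads the full history."""
    total = 0
    context = system_tokens
    for _ in range(turns):
        total += context                   # this turn re-reads everything so far
        context += tokens_added_per_turn   # tool output or sub-agent reply is appended
    return total


# Hypothetical workload: 2,000-token system prompt, 1,500 tokens appended per turn.
for turns in (10, 50, 100):
    print(turns, cumulative_prompt_tokens(2_000, 1_500, turns))
```

Because each turn's prompt grows linearly, the running total grows roughly with the square of the number of turns, which is why long workflows get expensive so quickly.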

Cody Right. And that’s where this Nemotron 3 Ultra thing slots in. They’re pitching it as the heavy lifter for the hard steps.

Justy Frontier reasoning plus five times the throughput? That’s the dream for anyone running long workflows.

Cody Yeah, if the throughput claim holds. Five times is… a lot.

Justy Well, they’re leaning on NVFP4 quantization and LatentMoE. Cody, you’ve seen MoE routing before—does LatentMoE actually cut the compute the way they say?

Cody It’s MoE but with a compressed latent space for the router. So instead of firing a bunch of experts every step, it picks a few and only materializes those. And NVFP4 is a four-bit floating-point format with a shared scale per small block of values, so yeah, you can fit way more on the same GPU with little accuracy loss.
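To make the four-bit idea concrete, here is a minimal pure-Python sketch of block-scaled 4-bit float quantization in the spirit of NVFP4. It assumes the E2M1 value grid and 16-element blocks; it keeps each block scale as an ordinary Python float, whereas the real format stores block scales in FP8 alongside a per-tensor scale, so treat this as an illustration and not as NVIDIA's implementation. The function names are made up for this example.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # one shared scale per block of 16 values


def quantize_block(block):
    """Return (scale, codes), where each code is a signed value from the E2M1 grid."""
    max_abs = max(abs(x) for x in block)
    scale = max_abs / 6.0 if max_abs > 0 else 1.0  # map the largest magnitude onto 6.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - target))
        codes.append(nearest if x >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a block scale and its 4-bit codes."""
    return [scale * c for c in codes]


def quantize(weights):
    """Split a flat list of weights into blocks and quantize each independently."""
    return [quantize_block(weights[i:i + BLOCK_SIZE])
            for i in range(0, len(weights), BLOCK_SIZE)]


weights = [0.03 * i - 0.4 for i in range(32)]  # made-up weights
for scale, codes in quantize(weights):
    print(round(scale, 4), dequantize_block(scale, codes)[:4])
```

Scaling per small block means one outlier only degrades the precision of its own 16 neighbours instead of a whole channel or tensor.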

Justy And the hybrid Mamba-Transformer layers—that’s for the long context, right?

Cody Mamba handles the sequential part cheaply, Transformer layers kick in when you need the expressivity. It’s a trick to keep long-context inference from blowing up.
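One way to see why fewer attention layers helps at long context is to estimate key-value cache size, which grows linearly with sequence length for every attention layer. The layer counts and head shapes below are hypothetical, not Nemotron 3 Ultra's published configuration, and the sketch leaves out the state kept by Mamba layers, which stays a fixed size regardless of sequence length.

```python
def kv_cache_bytes(attention_layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one key vector and one value vector per token, head, and layer.
    return 2 * attention_layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical shapes: a 64-layer all-attention stack versus a hybrid stack
# in which only 8 of the layers are attention layers.
full = kv_cache_bytes(attention_layers=64, kv_heads=8, head_dim=128, seq_len=1_000_000)
hybrid = kv_cache_bytes(attention_layers=8, kv_heads=8, head_dim=128, seq_len=1_000_000)

print(f"all-attention: {full / 1e9:.1f} GB")
print(f"hybrid:        {hybrid / 1e9:.1f} GB")
```

Under these assumed shapes the cache shrinks in proportion to the number of attention layers, which is the memory side of the trade-off Cody is pointing at.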

Justy So—this is such an Exploring Next take—if you’re running a multi-agent pipeline and you keep hitting context limits or token costs, swapping in Nemotron 3 Ultra for the orchestration layer could actually be a no-brainer.

Cody God, you already have the product slide in your head.

Justy I’m just saying—thirty percent lower token cost on SWE-bench? That’s real money for teams shipping agents at scale.

Cody Okay, but look at the EnterpriseOps-Gym number: thirty-three percent on long-horizon planning. That’s behind GLM and Qwen. So it’s not all wins.

Justy Fair. But they’re still leading on PinchBench and Long Context Ruler at a million tokens. And they hit ninety-five percent on Ruler. No one else in the table even reaches a million.

Cody Yeah… and the open weights and recipes are a big deal. If you’re in a domain where none of the teachers fit, you can fine-tune it with your own data.

Justy Which, by the way—Multi-Teacher On-Policy Distillation with more than ten domain-specific teachers. That’s how they’re getting the specialization without starting from scratch every time.

Cody Mm-hm. And the RL pipeline’s fully transparent, so you can audit what the model’s actually learning from.

Justy Anyway, I flew in late last night and my brain’s still on west coast time, so bear with me—

Cody You sound like you drank three espressos on the plane.

Justy I did. But the thing that’s sticking with me is the routing angle. You don’t need Nemotron for every single call, just the ones where the agent has to think hard.

Cody Right. So you’d pair it with a smaller, faster model for the routine stuff and only route the complex turns to Nemotron.
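The pairing Cody describes could look something like the sketch below: a cheap heuristic decides whether a turn goes to the large model or a small one. The model identifiers, keyword list, and thresholds are all placeholders invented for illustration; a production router would more likely use a learned classifier or the small model's own confidence.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    prompt: str
    context_tokens: int
    tool_calls_so_far: int


# Hypothetical model identifiers; substitute whatever your serving stack exposes.
LARGE_MODEL = "nemotron-3-ultra"
SMALL_MODEL = "small-fast-model"

# Crude keyword hints that a turn needs multi-step reasoning (an assumption, not a standard).
PLANNING_HINTS = ("plan", "debug", "refactor", "root cause", "design")


def pick_model(turn: Turn, context_threshold: int = 50_000, tool_threshold: int = 5) -> str:
    """Send hard or long-running turns to the large model, everything else to the small one."""
    needs_reasoning = any(hint in turn.prompt.lower() for hint in PLANNING_HINTS)
    long_running = (turn.context_tokens > context_threshold
                    or turn.tool_calls_so_far > tool_threshold)
    return LARGE_MODEL if needs_reasoning or long_running else SMALL_MODEL


print(pick_model(Turn("Format this date as ISO 8601", 1_200, 0)))
print(pick_model(Turn("Plan the database migration", 1_200, 0)))
```

Keeping the routing rule in one small function makes it easy to log which turns went where and to check whether the large model is actually earning its cost.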

Justy Exactly. And if the throughput and token efficiency pan out, the math might actually work.

Cody I mean… I’m still side-eyeing that five-x claim until I see third-party repros. But the architecture checks out.

Justy And the benchmarks—even with the mixed bag—are strong enough that I’d at least kick the tires.

Cody Yeah. If your agents are blowing through context or tokens, it’s worth a look. If not… eh.

Justy There it is—the Cody caveat. Always a caveat.

Cody Someone’s gotta be the skeptic.

Justy Alright, that’s episode four-sixty-whatever this is. Next time you’re in LA, we’re testing Nemotron on my to-do list. See if it can finally organize my inbox.

Cody Good luck with that.