Recursive Multi-Agent Systems
RecursiveMAS is a new multi-agent framework from researchers at UIUC, Stanford, NVIDIA, and MIT that replaces text-based agent handoffs with latent-space recursion, cutting token usage by up to 75.6%, speeding up inference 2.4x, and improving accuracy by an average of 8.3% across nine benchmarks. Justy and Cody dig into why passing hidden states instead of words is such a big deal, what the RecursiveLink module actually does, and whether any of this is shippable today.
Script: Sonnet 4.6 | Voice: ElevenLabs
Transcript
Justy What if the biggest bottleneck in multi-agent AI isn't the models, but the fact that they're all just... texting each other?
Justy Welcome to Exploring Next, episode 345. We're talking about a paper called RecursiveMAS — and honestly, Cody, when you sent this over I had to read the abstract twice.
Cody Yeah, it's one of those where the headline number, an 8.3% average accuracy gain across nine benchmarks, almost undersells what's actually going on under the hood. The token reduction is the part that stopped me: up to 75.6% fewer tokens used. That's not a rounding error.
Justy And this is coming out of UIUC, Stanford, NVIDIA, MIT — so it's not a one-lab thing. What's the core problem they're solving?
Cody So the standard way multi-agent systems work today: one agent finishes, writes out its answer in text, and the next agent reads that text and continues. Every handoff is a full decode-then-re-encode cycle. That's slow, it's expensive, and when you try to train the whole system end-to-end, gradients effectively stop at each handoff, because text is a discrete bottleneck and you can't backpropagate through words.
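To make the "discrete bottleneck" point concrete, here is a minimal sketch in plain Python. It is not the paper's mechanism: the toy embedding table, the target vector, and the softmax mixture used as a stand-in for a continuous handoff are all assumptions made for illustration. It nudges one logit and measures how much a downstream loss changes under a text-style handoff (argmax, then re-embed) versus a continuous one.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy setup (assumed): one 2-d embedding per token id, and a target the
# downstream "agent" is scored against.
EMBED = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
TARGET = [0.0, 1.0]

def loss(vec):
    return sum((v - t) ** 2 for v, t in zip(vec, TARGET))

def text_handoff(logits):
    # Commit to a single discrete token, then re-embed it for the next agent.
    token = max(range(len(logits)), key=lambda i: logits[i])
    return EMBED[token]

def latent_handoff(logits):
    # Stand-in for a continuous handoff: a probability-weighted mix of embeddings.
    probs = softmax(logits)
    return [sum(p * EMBED[i][d] for i, p in enumerate(probs)) for d in range(2)]

def finite_diff(handoff, logits, index, eps=1e-4):
    # Forward-difference estimate of d(loss)/d(logits[index]).
    bumped = list(logits)
    bumped[index] += eps
    return (loss(handoff(bumped)) - loss(handoff(logits))) / eps

logits = [2.0, 1.0, 0.5]
grad_text = finite_diff(text_handoff, logits, 1)      # argmax does not move
grad_latent = finite_diff(latent_handoff, logits, 1)  # loss responds smoothly
```

With the text handoff the small nudge never changes the chosen token, so the upstream agent gets no learning signal from it. The continuous handoff gives a nonzero slope that training could follow.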
Justy Right, so you're burning tokens just to carry information from one model to the next. It's like... printing an email, handing someone the paper, they retype it.
Cody Exactly. And their fix is — skip the text entirely for intermediate steps. Keep everything in continuous latent space. The key piece is a module called RecursiveLink. It's actually pretty small — a two-layer residual projection. Not a whole model, just a lightweight learned bridge.
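For a sense of scale, a two-layer residual projection can be sketched in a few lines of dependency-free Python. Everything beyond "two layers plus a residual connection" is an assumption here: the class name `ResidualLink`, the tanh activation, the hidden width, and the zero-initialised second layer are illustrative choices, not details taken from the paper.

```python
import math
import random

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def random_matrix(rows, cols, scale, rng):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class ResidualLink:
    """Hypothetical two-layer residual projection: h + W2 @ tanh(W1 @ h)."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = random_matrix(hidden, dim, 0.1, rng)
        # Assumption: start the second layer at zero so the link begins
        # as the identity map and only learns a correction on top.
        self.w2 = [[0.0] * hidden for _ in range(dim)]

    def __call__(self, h):
        inner = [math.tanh(z) for z in matvec(self.w1, h)]
        delta = matvec(self.w2, inner)
        return [x + d for x, d in zip(h, delta)]

link = ResidualLink(dim=4, hidden=8)
state = [0.5, -1.0, 0.25, 2.0]
out = link(state)  # identical to `state` until w2 is trained away from zero
```

The residual form means an untrained link passes the latent state through unchanged, which is one plausible reason a small bridge like this can be bolted onto existing agents without disrupting them.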
Justy So it's not replacing the agents, it's just changing how they pass information between each other.
Cody Right. And there are two versions of it. An inner RecursiveLink sits inside each agent and helps it consolidate its own latent thoughts during generation — the paper calls these the model's 'ongoing latent thoughts,' which is a nice way to put it. Then an outer RecursiveLink bridges hidden representations across agents that might be completely different model families, different sizes. So you could have a Qwen3 agent handing a latent state to a LLaMA-3 agent and the outer lin
Justy That's the part I keep coming back to — heterogeneous agents. Because in practice, nobody's running a system where every model is identical. You've got specialized tools, different fine-tunes. The fact that this works across model families is actually a big deal for real deployments.
Cody And critically — only the last agent in the last recursion round ever produces text output. Everything before that stays in latent space. That's where the token savings come from. You're not generating and re-ingesting intermediate text at every step.
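The control flow Cody describes can be mocked up with toy vectors. This is a sketch under loud assumptions: the "agents" are random linear maps with tanh, the hidden sizes 3 and 5 are arbitrary, and the cross-agent links `a_to_b` and `b_to_a` are hypothetical plain projections (how the paper handles mismatched hidden sizes is not covered in the episode). The only thing it is meant to show is that decoding happens once, at the end of the last round.

```python
import math
import random

rng = random.Random(0)

def matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def layer(m):
    # Stand-in for an agent step or a projection: linear map followed by tanh.
    return lambda v: [math.tanh(z) for z in matvec(m, v)]

DIM_A, DIM_B, VOCAB = 3, 5, 4
agent_a = layer(matrix(DIM_A, DIM_A))  # toy agent with hidden size 3
agent_b = layer(matrix(DIM_B, DIM_B))  # toy agent with hidden size 5
a_to_b = layer(matrix(DIM_B, DIM_A))   # hypothetical outer link, A -> B
b_to_a = layer(matrix(DIM_A, DIM_B))   # hypothetical outer link, B -> A
unembed = matrix(VOCAB, DIM_B)

decode_calls = 0

def decode(hidden):
    # The only place a latent state is turned into a discrete token.
    global decode_calls
    decode_calls += 1
    scores = matvec(unembed, hidden)
    return max(range(VOCAB), key=lambda i: scores[i])

def run(prompt_state, rounds):
    state_a = prompt_state
    for _ in range(rounds):
        hidden_a = agent_a(state_a)
        hidden_b = agent_b(a_to_b(hidden_a))
        state_a = b_to_a(hidden_b)  # feed back to A without decoding
    return decode(hidden_b)         # last agent, last round: decode once

token = run([0.2, -0.4, 0.9], rounds=3)
```

Three rounds across two agents would mean six text handoffs in a conventional pipeline. Here `decode_calls` ends at 1.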
Justy Okay so training this thing — how does that work? Because training a system that's looping through itself sounds like it could get messy fast.
Cody They handle it with what they call an inner-outer loop training algorithm. The inner loop first warms up each agent individually — trains its inner RecursiveLink to get comfortable with latent thought generation. Then the outer loop kicks in and trains the cross-agent connections, with gradients flowing back through the full recursion trace. So every agent sees feedback not just from its own outputs but from what happened downstream. The whole system co-optimizes together.
Justy They also prove it stays stable — like the gradient vanishing problem that kills text-based training doesn't happen here?
Cody They provide theoretical analysis on both runtime complexity and learning dynamics, yeah. The latent connections maintain stable gradient flow across recursion rounds. That's one of the things I find credible about the paper — they're not just showing empirical results, they're explaining why it should work mathematically.
Justy Alright, so — who actually builds with this? I'm thinking about the team that's already running something like a multi-agent pipeline in production. Is this shippable or is it a research artifact right now?
Cody Honest answer: it's closer to research artifact today. The project page is up at recursivemas.github.io but I haven't seen a fully open training repo yet. That said — the architecture is not exotic. The RecursiveLink is small enough that a team with ML infra could implement it. And the fact that they tested it on four different collaboration patterns — sequential reasoning, mixture-of-experts, expert-to-learner distillation, tool-integrated deliberation — tells me they were t
Justy The 75% token reduction is the number that would get a finance team or an infra team to actually pay attention. That's not a research metric, that's a cost line.
Cody True. Though I'd flag — that range is 34.6% to 75.6%. The low end is real but it's less dramatic. Depending on your pipeline, you might land closer to the middle, and then you have to weigh that against the added complexity of running this system.
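The spread between the two ends of that range is easy to see with a quick calculation. The 10,000-token baseline below is a made-up figure for illustration. Only the 34.6% and 75.6% reductions come from the episode.

```python
def tokens_after(baseline_tokens, reduction_pct):
    """Tokens remaining per task after a given percentage reduction."""
    return baseline_tokens * (1 - reduction_pct / 100)

BASELINE = 10_000  # hypothetical tokens per task in an existing pipeline

low_end = tokens_after(BASELINE, 34.6)   # roughly 6,540 tokens left
high_end = tokens_after(BASELINE, 75.6)  # roughly 2,440 tokens left
ratio = low_end / high_end               # how much more the low end still spends
```

At the low end of the range the pipeline still spends well over twice as many tokens as at the high end, which is why the headline figure alone is not enough to judge the trade-off.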
Justy What's your honest concern with it?
Cody Interpretability. When agents are passing text, you can read the intermediate outputs. You can debug. You can see where reasoning went wrong. When everything's in latent space, that trace is gone. You get the final answer and if it's wrong, you're doing a lot more forensic work to figure out why. For regulated domains — medicine is literally one of their benchmarks — that could be a real blocker.
Justy Yeah, that's a fair point. I think for internal tooling or code-generation pipelines you'd probably care less. But the moment you're in a domain where someone needs to audit the reasoning chain, losing that intermediate text is a problem. [sighs] It's always something.
Cody Always. [chuckles] Okay, Build Next.
Justy Yeah — three things. First, go to recursivemas.github.io and see what's actually released. Project page is live and worth bookmarking even if the full repo isn't open yet. Second, if you're a solo builder and want to feel what latent-passing actually is — grab a small Qwen3 or Gemma3 model, build a two-agent toy loop where you manually pass the last hidden state instead of decoded text, and just observe where things break. You'll learn more in a weekend than reading three mor
Cody And third — if you're on a team already running LangGraph or a similar orchestration framework, take one of your existing pipelines and actually benchmark your current token costs per task. Because if you don't have that baseline number today, you can't evaluate whether something like this is worth adopting when it does ship publicly.
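A baseline like the one Cody recommends can start as a small record per task. This sketch is framework-agnostic and does not use any LangGraph API. `TaskRecord`, `summarize`, and the sample numbers are hypothetical, and the token counts are assumed to come from whatever usage metadata your model provider returns.

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class TaskRecord:
    task_id: str
    handoff_tokens: list = field(default_factory=list)  # tokens per agent-to-agent handoff
    final_tokens: int = 0                                # tokens in the final answer

    @property
    def total(self):
        return sum(self.handoff_tokens) + self.final_tokens

    @property
    def intermediate_share(self):
        # Fraction of the task's tokens spent on handoffs rather than the answer.
        return sum(self.handoff_tokens) / self.total if self.total else 0.0

def summarize(records):
    return {
        "tasks": len(records),
        "mean_tokens_per_task": mean(r.total for r in records),
        "mean_intermediate_share": mean(r.intermediate_share for r in records),
    }

# Made-up sample data for two tasks.
records = [
    TaskRecord("t1", [400, 350], 250),
    TaskRecord("t2", [600, 300, 100], 1000),
]
baseline = summarize(records)
```

The intermediate share is the number to watch: it is the portion of spend that a latent-handoff approach could in principle remove, so a pipeline where it is small has less to gain.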
Justy Good one. Know your baseline before you chase the headline. Alright — we started by asking whether agents texting each other is the bottleneck. I think the answer is yes, and RecursiveMAS is a real shot at fixing it. Episode 345, done.