RecursiveMAS cuts multi-agent AI costs by 75%: researchers
Justy and Cody dig into RecursiveMAS, a research framework that lets multi-agent systems pass latent embeddings instead of text, cutting token usage and speeding up inference while keeping base model weights frozen.
Script: GPT-5.5 Voice: Inworld TTS 1.5 Max
Transcript
Justy Wait, the agents are basically passing brain mush to each other instead of writing memos. That is extremely my kind of weird.
Cody And extremely the kind of thing where I immediately want to see the bill of materials. Because the article’s central claim is pretty sharp: text is the expensive coordination layer in multi-agent systems, so stop making every agent serialize its intermediate state into tokens.
Justy Before we get too noble about tokens, I need to say I slept like garbage after flying back from D.C. Your city gave me one beautiful coffee and then punished me with airport lighting for three hours.
Cody That tracks. My week was mostly debugging a flaky home router and pretending that counted as personal growth. Also, your suitcase made a sound like a tiny printer when you rolled it in, which feels relevant because RecursiveMAS is also about avoiding unnecessary output.
Justy Cody, that suitcase has seen things. Anyway, yes. The reason this grabbed me is that so many agent demos feel magical until the invoice shows up. If this actually cuts token chatter without making the workflow worse, product teams care immediately.
Cody Right. The source says researchers from the University of Illinois Urbana-Champaign and Stanford built RecursiveMAS so agents collaborate through embedding space. Instead of agent one writing a text explanation, agent two reading it, and so on, they pass continuous latent representations around a loop.
Justy Right.
Cody That matters because normal text-based agent collaboration is sequential and slow. One model has to finish generating tokens before the next model can start. Also, forcing a model to spell out internal reasoning just so another model can consume it is a pretty expensive translation step.
Justy The phrase I kept circling was that they want the whole multi-agent system to co-evolve as one thing, not just improve each agent in isolation. That is very appealing from a product standpoint. Less duct tape between specialist agents, more one coordinated workflow.
Cody Yeah.
Justy But I know your face. This is your skeptical eyebrow face, which, for the record, has appeared in at least three hundred of the four hundred twenty episodes of our tiny nonsense show.
Cody It’s a useful face. Mechanically, the interesting part is the RecursiveLink. The base models stay frozen, and these lightweight two-layer modules learn how to map hidden states around. The article says the trained RecursiveLink modules come to roughly thirteen million parameters, about zero point three one percent of the frozen models’ parameter count.
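For a rough sense of scale, the two figures Cody quotes can be combined. This is a back-of-the-envelope sketch, not a number from the article: it assumes the zero point three one percent is a share of the combined parameter count of the frozen models, which the episode does not spell out, and the variable names are illustrative.

```python
# Figures quoted in the episode.
link_params = 13_000_000      # "roughly thirteen million" RecursiveLink parameters
link_share = 0.0031           # "about zero point three one percent"

# Assumption: the share is measured against the combined size of the frozen models.
implied_total_params = link_params / link_share

print(f"implied combined model size: {implied_total_params / 1e9:.2f}B parameters")
# implied combined model size: 4.19B parameters
```

Read that way, the trained part is a few million weights sitting next to a few billion frozen ones, which is why the training cost looks small.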
Justy Oh interesting.
Cody There are inner and outer versions. The inner link maps an agent’s newly generated embeddings back into its own input embedding space, so it can keep producing latent thoughts without decoding text. The outer link bridges agents, including agents with different embedding dimensions, which matters if one is Qwen and another is Mistral or Gemma three.
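The inner and outer links Cody describes can be sketched as small two-layer projections. This is a toy illustration under stated assumptions, not the released implementation: the class name `TwoLayerLink`, the hidden width, the GELU activation, the random initialization, and the `frozen_agent_step` stand-in are all hypothetical, and the dimensions are tiny so it runs with only the standard library.

```python
import math
import random

def make_matrix(rows, cols, rng):
    """Random weight matrix stored as a list of rows (illustrative init)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def gelu(x):
    """Tanh approximation of GELU; the activation choice is an assumption."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

class TwoLayerLink:
    """Hypothetical two-layer projection: in_dim -> hidden_dim -> out_dim."""

    def __init__(self, in_dim, hidden_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.w1 = make_matrix(hidden_dim, in_dim, rng)
        self.w2 = make_matrix(out_dim, hidden_dim, rng)

    def __call__(self, hidden_state):
        return matvec(self.w2, [gelu(h) for h in matvec(self.w1, hidden_state)])

    def num_params(self):
        return sum(len(row) for row in self.w1) + sum(len(row) for row in self.w2)

def frozen_agent_step(embedding):
    """Stand-in for a frozen model's forward pass; nothing here is trained."""
    return [math.tanh(v) for v in embedding]

DIM_A, DIM_B = 8, 12                          # toy sizes for two different agents
inner_a = TwoLayerLink(DIM_A, 16, DIM_A)      # agent A output -> agent A input space
outer_ab = TwoLayerLink(DIM_A, 16, DIM_B)     # agent A output -> agent B input space

state = [0.1 * i for i in range(DIM_A)]
for _ in range(3):                            # three latent steps, no text decoded
    state = inner_a(frozen_agent_step(state))
handoff = outer_ab(frozen_agent_step(state))  # what agent B would receive

print(len(state), len(handoff))               # 8 12
```

The point of the sketch is the shapes: the inner link keeps the vector in the agent's own dimension so it can loop, while the outer link changes the dimension so a different model can consume it.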
Justy That part feels less like science fiction and more like plumbing. Still fancy plumbing. But it makes sense: if the models speak different hidden-state dialects, the bridge is the product.
Cody Exactly.
Justy I’m scrolling for the exact benchmark bit because I do not want to over-sell this and have you frame it as Justy’s latest optimism crime. Okay, nine benchmarks. Math, science and medicine, code generation, and search-based question answering.
Cody Appreciated. The model mix was open-weights stuff like Qwen, Llama three, Gemma three, and Mistral, assigned into collaboration patterns like sequential reasoning and mixture-of-experts. They compared against standalone models using low-rank adaptation or full supervised fine-tuning, plus TextGrad, Mixture-of-Agents, LoopLM, and a text-recursive version called Recursive-TextMAS.
Justy Mm-hm.
Cody The reported numbers are real enough to pay attention to: average accuracy improvement of eight point three percent over the strongest baselines, eighteen point one percent over TextGrad on AIME twenty twenty-five, and thirteen percent on AIME twenty twenty-six. Inference speedup lands between one point two and two point four times.
Justy And the token story is the cleanest product story. Thirty-four point six percent reduction in the first recursion round versus Recursive-TextMAS, then seventy-five point six percent by round three. That is the kind of graph that makes a finance person suddenly become emotionally available.
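To make the token figures concrete, here is a small arithmetic sketch. It assumes each percentage is the reduction relative to what Recursive-TextMAS spends at that recursion round, and the ten-thousand-token baseline is a made-up number for illustration. Round two is not quoted in the episode, so it is left out rather than guessed.

```python
# Reductions versus Recursive-TextMAS as quoted in the episode.
# Round two is not quoted, so it is deliberately omitted.
REPORTED_REDUCTION = {1: 0.346, 3: 0.756}

def tokens_after_reduction(baseline_tokens, reduction):
    """Tokens left after cutting `reduction` (a fraction) from a text baseline."""
    return round(baseline_tokens * (1.0 - reduction))

# Hypothetical baseline: tokens a text-recursive workflow spends at a given round.
baseline_tokens = 10_000

for recursion_round, reduction in REPORTED_REDUCTION.items():
    remaining = tokens_after_reduction(baseline_tokens, reduction)
    print(f"round {recursion_round}: {remaining} of {baseline_tokens} tokens")
# round 1: 6540 of 10000 tokens
# round 3: 2440 of 10000 tokens
```

Under that reading, a workflow that would have spent ten thousand tokens at round three spends about a quarter of that, which is the gap Justy is pointing at.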
Cody That is a haunting sentence.
Justy No, but truly. If an enterprise team has a multi-step agent workflow that is slow, expensive, and awkward to train, this says: maybe the coordination layer is the waste. Maybe the agents do not need to narrate everything to each other.
Cody Sure. My caveat is inspectability. Text handoffs are inefficient, but they are also debuggable. You can log them. You can read where agent two misunderstood agent one. With latent handoffs, you may get speed and lower token cost, but the failure mode becomes murkier unless the tooling catches up.
Justy That is fair. Product reality is not just the model being cheaper. Someone on-call needs to understand why a workflow gave a weird answer at ten p.m. Still, if only the final agent emits text, I can imagine this being useful behind very narrow workflows where the evaluation signal is clear.
Cody Yeah, code generation benchmarks maybe fit that better than open-ended research assistants. The stronger the final correctness check, the more comfortable I am with hidden collaboration. Search-based Q and A could work too, depending on whether the retrieval evidence stays visible somewhere outside the latent loop.
Justy Tiny detour, but “agents communicating telepathically” is such dangerous branding. I am picturing five overcaffeinated interns sitting around your kitchen table silently deciding who forgot to buy oat milk.
Cody My agents would absolutely optimize for blaming the router.
Justy And then produce one final textual output: Cody is right, somehow.
Cody The other practical bit is memory. The article says if multiple agents use the same backbone model in different roles, they can share that backbone in GPU memory instead of loading separate copies. That could matter a lot for teams trying to run multi-agent systems without turning every experiment into a hardware procurement saga.
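The backbone-sharing idea can be sketched as a simple load-once registry. This is a generic caching pattern under stated assumptions, not the project's actual loader: `BackboneRegistry`, `fake_load`, the role names, and the backbone names are all hypothetical, and the loader returns a placeholder instead of real weights.

```python
load_calls = []

def fake_load(name):
    """Stand-in for an expensive weight load; returns a placeholder object."""
    load_calls.append(name)
    return {"name": name}

class BackboneRegistry:
    """Loads each named backbone once and hands the same object to every role."""

    def __init__(self, loader):
        self._loader = loader
        self._loaded = {}

    def get(self, name):
        if name not in self._loaded:
            self._loaded[name] = self._loader(name)
        return self._loaded[name]

# Hypothetical role assignment: three roles share one backbone, one role differs.
roles = {
    "planner": "backbone-a",
    "critic": "backbone-a",
    "solver": "backbone-a",
    "verifier": "backbone-b",
}

registry = BackboneRegistry(fake_load)
agents = {role: registry.get(name) for role, name in roles.items()}

print(load_calls)                              # ['backbone-a', 'backbone-b']
print(agents["planner"] is agents["critic"])   # True
```

Four roles end up backed by two loads, so memory scales with the number of distinct backbones rather than the number of agents.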
Justy And they released the code and trained weights under Apache two point zero, so this is at least pokeable, not just a chart floating around. I do not think it means every agent product rewrites itself tomorrow. But for people already hitting token cost and latency walls, this feels like a real design pattern to watch.
Cody That’s my read too. The argument holds up best where agent communication is the bottleneck and the task has measurable outcomes. It overgeneralizes if someone hears “latent agents” and assumes all the messy production stuff disappears. It doesn’t. It just moves the hard parts.
Justy Good. So, suitcase printer, telepathic interns, and one actually useful research idea. I’ll take it, Cody. That’s enough Exploring Next for one Wednesday.