Forge: Scalable Agent RL Framework and Algorithm
Izzo and Boone dive deep into MiniMax's Forge framework — a production-scale RL system that trained their M2.5 model across hundreds of thousands of real-world agent scaffolds. They explore how Forge solves the fundamental trilemma of system throughput, training stability, and agent flexibility through architectural innovations like middleware abstraction, windowed FIFO scheduling, and prefix tree merging for massive computational efficiency.
Script: Sonnet 4.5 Voice: ElevenLabs
Transcript
Izzo If you've tried to scale reinforcement learning for real agents, you've hit the wall.
Izzo Welcome back to Exploring Next, I'm Izzo. This is episode one-eighty-seven, and I'm here with Boone to talk about something that just dropped from MiniMax — their Forge framework that's actually solving the agent RL scaling problem.
Boone Yeah, and Izzo, this isn't just another research paper. They used this system to train their M2.5 model across over a hundred thousand distinct real-world agent scaffolds. We're talking millions of samples processed daily.
Izzo Right, so why does this matter right now? Everyone's trying to build production agents, but the moment you want to do RL at scale — to actually improve these agents through experience — you hit what they call the 'impossible triangle.'
Boone Exactly. You've got three things you need: high system throughput, training stability, and agent flexibility. Pick two. Traditional RL frameworks force you to choose, which is why most production agent systems are still doing supervised fine-tuning instead of proper RL.
Izzo So Boone, break down this impossible triangle for me. What's actually happening under the hood?
Boone It comes down to scheduling hell. Agent rollouts have insane variance — some finish in seconds with simple API calls, others take hours for complex reasoning chains. If you use strict FIFO scheduling, one slow agent blocks everything. But if you go greedy and process fast agents first, you get massive data distribution shift.
Izzo And that distribution shift is a killer for training stability. You start with easy tasks, then suddenly you're flooded with hard tasks, and your gradients start oscillating.
Boone Plus there's this computational waste problem. In agent scenarios, you get tons of requests sharing identical prefixes because of how context management works. Traditional systems are just burning compute on redundant processing.
Izzo Okay, so how does Forge actually solve this? What's their architectural innovation here?
Boone They went full middleware. Three-layer architecture that completely decouples the agent logic from the training infrastructure. You've got the Agent Side that just focuses on reasoning and environment interaction, then this Middleware layer with a Gateway server and Data Pool, and finally the Training and Inference engines.
Izzo Smart. So agents become pure trajectory producers, and they don't need to worry about the underlying model mechanics.
Boone Right, and the Gateway server is brilliant — it standardizes all the communication between agents and the LLM using common protocols. The agents don't even know what model they're talking to. This is how they integrated hundreds of scaffold types without touching the agent code.
Izzo From a product perspective, that's huge. You can onboard new agent architectures without rebuilding your training pipeline. What about the scheduling problem?
Boone They created this windowed FIFO strategy. Instead of pure FIFO or greedy processing, they batch completions within time windows. So you get the throughput benefits of async processing but maintain enough data distribution stability for training.
Izzo And the prefix redundancy issue?
Boone Prefix tree merging. When multiple agents share context prefixes, the system merges those computations instead of duplicating them. Given that they're handling 200k context lengths, this saves massive amounts of compute.
Izzo *laughs* Okay, I have to ask — is this actually working in production, or are we looking at research theater?
Boone No, this is real. They're processing millions of samples daily, and their M2.5 model is showing consistent reward convergence across all those scaffolds. The math checks out — they're maximizing throughput times sample efficiency while keeping update variance below their stability threshold.
Izzo What's interesting from a market angle is that they're not just solving this for their own agents. The middleware design means other companies could potentially plug into this framework.
Boone Exactly. And they've got this CISPO algorithm running on top that handles the credit assignment problem — figuring out which actions in a 200k context actually contributed to the final outcome.
Izzo That's the real challenge with agent RL. You've got these extended horizons where a single decision early in the chain affects everything downstream, but the reward signal is super sparse.
Boone And they're doing something clever with efficiency-aware rewards. Traditional RL only cares about correctness, but they're actually optimizing for wall-clock execution time. So agents learn to use tools efficiently, not just correctly.
Izzo That's a game-changer for production deployment. An agent that takes an hour to solve a problem correctly isn't actually useful, even if it gets the right answer.
Boone Right. And the system architecture supports both white-box and black-box agent training. White-box for when you want to optimize specific reasoning patterns, black-box for robustness across different scaffolds.
Izzo So what should people actually go build with this? What's the weekend project here?
Boone First, check out the MiniMax blog post — they've got architectural diagrams and the mathematical formulation. If you're working with agent frameworks, start thinking about how to decouple your agent logic from your model calls.
Izzo And if you're doing any kind of agent RL, experiment with windowed batching instead of pure FIFO. Even without the full Forge system, that scheduling strategy could improve your training stability. Also worth looking into prefix caching in your own systems. If you're seeing repeated context patterns, there's probably compute you can save. I'm definitely adding a prefix tree implementation to my weekend project list. The bigger takeaway is that production agent RL is actually