Ep 445 article 4:43 w/ Justy & Cody

Serving Multiple Users at Once: How Continuous Batching Keeps LLM Inference Efficient (MachineLearningMastery)

Continuous batching is a scheduling technique that keeps LLM inference servers from wasting GPU cycles on padding. Instead of forcing short requests to wait for long ones in a fixed batch, continuous batching frees up slots the moment a request finishes and admits new work immediately, eliminating idle padding tokens and improving throughput.

Script: Haiku 4 | Voice: Deepgram TTS

Transcript

Justy Okay, so this piece is about how LLM servers actually schedule requests without wasting GPU time on padding. Which sounds boring until you realize that if you're running any kind of inference service at scale, this is literally the difference between your hardware being useful and your hardware sitting expensively idle.

Cody Right. And the article's doing something smart—it starts with static batching, which is the intuitive but wrong way, then shows you why continuous batching fixes it.

Justy So static batching is… you group requests into fixed batches, each batch waits for its slowest request to finish, then the next batch starts. That's it.

Cody Exactly. And they use a concrete example: three requests in a batch. One needs six tokens, one needs fifty, one needs three hundred. The GPU decodes all of them token-by-token, but the six-token request finishes after step six. Its slot is still there, still active, still burning cycles on padding tokens until the three-hundred-token request finishes at step three hundred, nearly three hundred steps later.
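
To put a number on that example, here is a minimal sketch of the arithmetic. It assumes the simplest possible cost model (one slot-step per token, every slot held until the longest request ends) and the helper name `static_batch_waste` is made up for illustration; it is not the article's code.

```python
def static_batch_waste(lengths):
    """Count slot-steps in a static batch that runs until its longest request ends."""
    steps = max(lengths)            # the batch decodes for this many steps
    total = steps * len(lengths)    # every slot is held for every step
    useful = sum(lengths)           # slot-steps that emit a real token
    return total, useful, total - useful

# The three requests from the example: 6, 50 and 300 tokens.
total, useful, wasted = static_batch_waste([6, 50, 300])
print(total, useful, wasted)        # 900 356 544
print(round(useful / total, 3))     # 0.396
```

Under those assumptions, roughly 60 percent of the slot-steps in that one batch go to padding rather than real tokens.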

Justy That is so inefficient.

Cody It's bad. And they include the actual code: a static_batching function using Hugging Face transformers, with a batch size of three and six requests ranging from thirty to three hundred tokens. The output shows exactly what happens: all three slots wait at a barrier until the longest finishes, then the next wave starts.
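
The wave behaviour can be sketched without a model at all. The snippet below is a toy simulation, not the article's `static_batching` function: the name `simulate_static_waves` and the six request lengths are hypothetical, chosen only to fall in the thirty-to-three-hundred range mentioned above.

```python
def simulate_static_waves(lengths, batch_size):
    """Return the number of decode steps each fixed batch (wave) takes."""
    waves = [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]
    # A wave cannot end until its longest request has finished.
    return [max(wave) for wave in waves]

# Hypothetical token counts for six requests, processed three at a time.
lengths = [30, 300, 80, 150, 60, 200]
durations = simulate_static_waves(lengths, batch_size=3)
print(durations)        # [300, 200]
print(sum(durations))   # 500
```

In this toy run the thirty-token request is done at step 30, but its slot stays occupied until step 300, and the second wave cannot start before then.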

Justy Mm-hm. So the fix is continuous batching.

Cody The fix is to not have a batch barrier at all. The moment request A finishes at step six, you pull a new request into that slot. While requests B and C are still decoding, you're already prefilling a new request that arrived. No padding, no idle slots, no waiting.

Justy And the way you make that work is ragged batching—you don't pad everything to the same length. You let each request keep its own length in the batch.

Cody Right. You have to track which tokens belong to which request, but that's not expensive. The KV cache is already per-request anyway. Once you're in the decode loop, each step is just one forward pass that produces a new token for each active request, and you can dynamically add and remove requests from the active set without any barrier.
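
A rough sketch of that decode loop, with the model replaced by a counter. Everything here is an assumption for illustration: the function name `simulate_continuous`, the request lengths, and the cost model (one step per token, prefill treated as free). Real servers such as vLLM schedule in far more detail.

```python
from collections import deque

def simulate_continuous(lengths, num_slots):
    """Toy continuous-batching loop: returns (total_steps, finish_step_by_request)."""
    pending = deque(enumerate(lengths))  # (request_id, tokens_to_generate)
    active = {}                          # request_id -> tokens still to generate
    finished_at = {}
    step = 0
    while pending or active:
        # Admit waiting requests into any free slot before the next step.
        while pending and len(active) < num_slots:
            req_id, n = pending.popleft()
            active[req_id] = n
        step += 1
        # One decode step: each active request produces exactly one token.
        for req_id in list(active):
            active[req_id] -= 1
            if active[req_id] <= 0:
                finished_at[req_id] = step
                del active[req_id]       # slot is free again on the next step
    return step, finished_at

# Hypothetical token counts, three slots.
steps, finished_at = simulate_continuous([30, 300, 80, 150, 60, 200], num_slots=3)
print(steps)           # 340
print(finished_at[0])  # 30
```

With these made-up lengths the loop finishes in 340 steps. Fixed waves of three over the same list would take 300 + 200 = 500 steps, because each wave waits for its longest request.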

Justy So the implementation is actually not that hard?

Cody The concept is simple. That's why vLLM and SGLang and other modern frameworks do this by default now.

Justy And the reason I care about this is because if you're building any LLM service, you either implement continuous batching or your GPU is sitting idle waiting for slow requests. That's not a theoretical loss—that's real throughput, real cost.

Cody Exactly. If you're serving hundreds of concurrent users, heterogeneous request lengths are the norm. Some user asks for three tokens, another asks for five hundred. Static batching forces you to pick a batch size and a max length, and you're always padding short requests or truncating long ones or both.

Justy Okay, so one thing I want to push on: the article assumes you have a GPU with enough memory to hold multiple KV caches at once. If memory is tight, continuous batching becomes a trade-off, right?

Cody Yeah, that's real. KV cache size scales with sequence length and batch size. But the article assumes you're already batching, so you're already paying that memory cost. Continuous batching just lets you use that memory more efficiently by not padding.
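
As a back-of-the-envelope illustration of that scaling, the sketch below uses the usual estimate of two vectors (key and value) per layer per KV head per token. The model shape is hypothetical and not from the article, prompt tokens are ignored, and real allocators (paged KV caches, for example) behave differently from this simple padded-versus-actual comparison.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    # Factor of 2: one key vector and one value vector per layer per head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Hypothetical model shape: 32 layers, 32 KV heads, head dimension 128, 16-bit values.
per_token = kv_bytes_per_token(32, 32, 128, 2)

lengths = [6, 50, 300]                        # the three requests from the example
padded_tokens = max(lengths) * len(lengths)   # every slot sized to the longest request
actual_tokens = sum(lengths)                  # only tokens that actually exist

MiB = 1024 ** 2
print(per_token / MiB)                   # 0.5
print(padded_tokens * per_token / MiB)   # 450.0
print(actual_tokens * per_token / MiB)   # 178.0
```

Under these assumptions, the memory that padding would reserve for tokens that never exist is what a continuous scheduler can hand to newly admitted requests.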

Justy Fair. And for the people who should care about this—that's anyone running inference at scale. Not just researchers, not just hobbyists, but anyone actually serving requests.

Cody If you're using an off-the-shelf inference service like OpenAI or Anthropic, they're very likely already doing this. If you're running your own stack, understanding this is kind of essential. You can't tune or debug what you don't understand.

Justy Mm-hm. So if someone's building an LLM API or optimizing their inference costs, this is required reading.

Cody Absolutely. And if you're not building inference yourself but you're evaluating services, understanding this is how you ask smart questions about throughput and latency guarantees.

Justy Yeah. Like, "Do you use continuous batching?" is now a fair question to ask a vendor.

Cody If they don't, you know they're leaving money on the table.

Justy Yep. Alright, so if you're shipping inference, read this. If you're not but you're curious how the infrastructure actually works, also read it. The code is there, the problem is real, and the solution is practical.