Ep 484 research 8:05 w/ Justy & Cody

End-to-End Context Compression at Scale

Justy and Cody dig into Latent Context Language Models (LCLMs): encoder-decoder compressors that shrink long prompts into short latent sequences at ratios up to 1:16, cutting memory and latency while staying competitive on accuracy. They cover the architecture search, the training recipe, the agent use-case, and what production deployment actually looks like.

Script: GLM 5.1 Voice: Murf.AI Gen2

Transcript

Justy …so I land at midnight, right, and the whole kitchen situation at my place is just... there's nothing. Like, not even coffee. I had to DoorDash groceries at 1 a.m.

Cody That's deeply sad. You've been back from that trip how many days?

Justy Three. I'm still not recovered. Jet lag plus no groceries equals a rough week. Anyway — this LCLM paper, I actually read it on the plane and I cannot stop thinking about it.

Cody The context compression one? I skimmed it. The KV cache stuff is genuinely the bottleneck right now, so I was curious what they actually pulled off.

Justy Right. So the core problem — every token you feed into a model, you're caching key-value pairs. The longer your context, the bigger that cache gets. It's linear. You hit millions of tokens, you're just done. Memory's gone.
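
A back-of-the-envelope sketch of the linear growth Justy describes. The formula is the usual one for a transformer KV cache (one key and one value vector per token, per layer, per KV head). The model shape below is a hypothetical example chosen for illustration and is not taken from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one key vector and one value vector per token, layer, and KV head.
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value

# Hypothetical model shape (assumption, not from the episode or the paper).
SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for n in (8_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(n, **SHAPE) / 2**30
    print(f"{n:>9,} tokens -> {gib:7.2f} GiB")
```

With this assumed shape and 2-byte values, each token costs 128 KiB of cache, so doubling the context doubles the memory.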

Cody And the existing compression approaches kind of suck. You either trash model quality by throwing away too much, or you spend so much compute compressing a single prompt that you've defeated the purpose. Plus a lot of them need the full input to fit in the context window anyway, which — that's the whole problem you're trying to solve.

Justy Exactly. So what this team did is go back to encoder-decoder compressors. The idea's been around, but they were never competitive on the accuracy versus efficiency trade-off. These folks said, what if we just did a real architecture search and pre-trained properly?

Cody Mm-hm.

Justy So the encoder takes your long token sequence and maps it into a much shorter sequence of latent embeddings. Then the decoder — which is the actual language model — just reads those compressed latents instead of the raw tokens. They're calling the whole family Latent Context Language Models.
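
A toy sketch of the shape change Justy is describing: many token embeddings in, far fewer latent embeddings out. The paper's encoder is a trained model; the mean pooling here is a stand-in swapped in purely for illustration, and the function name `compress` is hypothetical.

```python
def compress(embeddings, ratio):
    """Mean-pool every `ratio` consecutive token embeddings into one latent vector.

    This is a placeholder for a learned encoder, not the method from the paper.
    """
    latents = []
    for start in range(0, len(embeddings), ratio):
        chunk = embeddings[start:start + ratio]
        dim = len(chunk[0])
        latents.append([sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)])
    return latents

# 16 toy token embeddings of width 2: token i is [i, 2*i].
tokens = [[float(i), float(2 * i)] for i in range(16)]
latents = compress(tokens, ratio=4)

print(len(tokens), "->", len(latents))  # 16 -> 4
print(latents[0])                       # [1.5, 3.0]
```

The decoder would then attend over the 4 latents instead of the 16 token embeddings, which is where the memory and latency savings come from.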

Cody And they trained them at actual scale: 350 billion tokens each, across three compression ratios (1:4, 1:8, and 1:16). The encoder's a 0.6 billion parameter model, and the decoder's 4 billion.
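
To make the three ratios concrete, here is the arithmetic for how many latent embeddings the decoder would see. The prompt length is a hypothetical example, and rounding a partial final group up is an assumption rather than something stated in the episode.

```python
import math

def latent_length(num_tokens, ratio):
    """Latent embeddings produced by a 1:ratio compressor (partial last group rounds up)."""
    return math.ceil(num_tokens / ratio)

prompt_tokens = 128_000  # hypothetical prompt length

for ratio in (4, 8, 16):
    print(f"1:{ratio:<2} -> {latent_length(prompt_tokens, ratio):,} latents")
```

At 1:16, a 128,000-token prompt becomes 8,000 latents on the decoder side.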

Justy Which is — I mean, sixteen times compression. That's aggressive.

Cody What I respect about the methodology is they didn't just fine-tune something and call it a day. They pre-trained many variants from scratch to figure out the right architecture. That's expensive and honestly most teams don't bother. They just take an off-the-shelf model, slap a compression head on it, and publish.

Justy And you can tell the difference. Like, the Pareto frontier results — they're actually pushing the whole curve forward across accuracy, compression speed, and peak memory. Not just winning on one metric and quietly losing on another.

Cody Sure. But I do want to poke at the encoder size. 0.6 billion parameters is small. That's the bottleneck, right? It has to understand the full long context well enough to compress it into these latents. What happens when you're pushing to truly massive contexts and the encoder itself starts struggling?

Justy That's fair. I think the play here is actually the agentic use-case they highlight. The paper shows LCLMs working as backbones for long-horizon agents — the agent skims through the compressed context and then adaptively expands relevant segments when it needs more detail. That's a product pattern, Cody. That's not just a benchmark win.

Cody Okay, that part is genuinely interesting. The skim-then-expand loop — you keep the compressed version in memory, which is cheap, and you only decompress the chunk the agent actually needs to reason over. That's how you'd want a production agent to work, not loading the full raw context every single inference step.
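
A minimal sketch of the skim-then-expand loop Cody describes, under heavy assumptions. Real latents are embedding sequences, not strings, and in the paper the agent decides what to expand. Here short strings stand in for the compressed view and a keyword-overlap rule stands in for the agent's decision. `Segment`, `pick_relevant`, and `build_context` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    raw_text: str    # full text, kept outside the model's working context
    compressed: str  # cheap stand-in for the segment's latent representation

def pick_relevant(query, segments):
    """Placeholder relevance rule: keyword overlap with the compressed view."""
    words = set(query.lower().split())
    return [i for i, seg in enumerate(segments)
            if words & set(seg.compressed.lower().split())]

def build_context(query, segments):
    """Compressed view everywhere, raw text only for segments judged relevant."""
    expand = set(pick_relevant(query, segments))
    return [seg.raw_text if i in expand else seg.compressed
            for i, seg in enumerate(segments)]

segments = [
    Segment("Full build log: thousands of lines of compiler output ...", "build log"),
    Segment("Full test report: test_auth failed with a timeout ...", "test report auth failure"),
    Segment("Full deploy notes: rollout steps for staging ...", "deploy notes"),
]

context = build_context("why did auth fail", segments)
for piece in context:
    print(piece)
```

Only the test report is expanded to raw text. The other two segments stay in their cheap compressed form, which is the point of the pattern.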

Justy Right? And one more thing — they specifically call out that this approach is compatible with modern production inference engines. A lot of KV cache compression tricks aren't, which has always been the quiet reason none of them ship.

Cody Yeah, that's the real kill-shot on most compression work. You get a nice paper result but it requires a custom inference path that no SWE is going to maintain. If LCLMs slot into existing serving infrastructure, that's — that actually matters.

Justy Look at you, almost optimistic.

Cody I said it matters. I didn't say I'm deploying it Monday. The encoder-decoder split adds real architectural complexity. You're now maintaining two models that have to stay in sync, you've got a training pipeline that's more involved, and the decoder's only four billion parameters. That's not where the industry's operating for most production workloads.

Justy No, I know. But they released the models on Hugging Face and the code is on GitHub. So at least you can actually test it, which is more than I can say for most compression papers.

Cody That's true. The repo is LCLM on GitHub under LeonLixyz, and the models are up on Hugging Face under latent-context. That's — yeah, I'll probably kick the tires this weekend.

Justy See? Optimistic.

Cody I'm going to regret telling you that.

Justy You absolutely are. Okay Cody, I need to go buy actual groceries before I wither away. This was fun.

Cody Go feed yourself. And maybe don't order groceries at midnight next time.