Ep 358 · Blog · May 1, 2026 · 4:03 · w/ Justy & Cody

Google AI breakthrough means chatbots use six times less memory during conversations without compromising performance

Google's TurboQuant compresses AI working memory (the KV cache) by up to 6x in real time using two novel techniques — PolarQuant and QJL — without degrading model performance. Justy and Cody dig into what this actually means for inference costs, who benefits first, and why the 'DeepSeek moment' framing is both apt and a little overblown.

Script: Sonnet 4.6 · Voice: Deepgram TTS

Transcript

Justy Every time you open a chatbot and just... keep talking, the thing has to hold all of that somewhere.

Justy Welcome to Exploring Next, episode 358. Google just published something called TurboQuant and the short version is: six times less memory, same output.

Cody And the memory we're talking about isn't storage — it's working memory. The KV cache. Think of it like RAM for the conversation itself. Every token the model processes, every partial answer it's building toward, lives there while it's generating a response.

Justy And it adds up fast, right? Like, it's not just your one little question.

Cody Scales linearly with users. A single sentence is maybe a few dozen tokens, totally fine. But sophisticated tasks — long documents, multi-turn research sessions — we're talking hundreds of thousands of tokens, which can mean tens of gigabytes per session. Then multiply that by the billions of requests a day a system like ChatGPT handles.
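
As a rough check on the "tens of gigabytes" figure, here is a back-of-the-envelope sketch. The shape numbers (32 layers, 8 key/value heads, head dimension 128, 16-bit values) are assumptions based on the published Llama 3.1-8B configuration, and the division by six simply applies the headline claim naively. Larger models have more layers and heads, so their per-token cost is higher.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Estimate KV cache size for one session.

    The leading factor of 2 counts one key vector and one value vector
    per layer, per KV head, per token. Defaults are assumed values for
    an 8B-class model with 16-bit cache entries.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


per_token = kv_cache_bytes(1)
full_context = kv_cache_bytes(128 * 1024)   # a 128K-token session
compressed = full_context / 6               # naive use of the claimed 6x

print(f"{per_token // 1024} KiB per token")            # 128 KiB per token
print(f"{full_context / GIB:.1f} GiB uncompressed")    # 16.0 GiB uncompressed
print(f"{compressed / GIB:.1f} GiB at 6x")             # 2.7 GiB at 6x
```

The cost is linear in tokens, so doubling the context doubles the cache, and every concurrent session pays it separately.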

Justy Okay so that's the actual cost center. Not the model weights sitting on disk — the live working memory during inference.

Cody Exactly. And what TurboQuant does is compress that cache in real time. Quantization itself isn't new, but the usual kind is static: compress once before the model runs, done. Here they're compressing the KV cache while the model is actively generating output, which means the compressed data has to stay both accurate and current as new tokens arrive.
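
To make "compressing while generating" concrete, here is a minimal sketch of a cache that quantizes each vector the moment it is appended. It uses plain per-vector 8-bit absmax quantization, which is a stand-in and not TurboQuant's method. The class name `Int8KVCache` and its methods are hypothetical, chosen only for illustration.

```python
class Int8KVCache:
    """Toy cache that compresses each vector at the moment it is appended.

    Assumption: simple absmax int8 quantization, one scale per vector.
    This illustrates online compression only, not TurboQuant itself.
    """

    def __init__(self):
        self.codes = []   # per token: list of ints in [-127, 127]
        self.scales = []  # per token: one float scale factor

    def append(self, vector):
        # The scale is computed from this vector alone, so no later token
        # can invalidate what has already been stored.
        peak = max(abs(x) for x in vector)
        scale = peak / 127 if peak > 0 else 1.0
        self.codes.append([round(x / scale) for x in vector])
        self.scales.append(scale)

    def get(self, index):
        # Dequantize on read; the error per entry is at most half a step.
        scale = self.scales[index]
        return [code * scale for code in self.codes[index]]

    def __len__(self):
        return len(self.codes)


# Simulate three decoding steps, each producing one new key vector.
cache = Int8KVCache()
for step in range(3):
    new_key = [0.1 * (step + 1), -0.4, 0.25]
    cache.append(new_key)

print(len(cache), cache.get(2))
```

The design point is that nothing is calibrated ahead of time: each entry has to be compressed using only what is known when it arrives, which is the constraint that separates this from one-time static quantization.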

Justy So what are the two techniques? I read PolarQuant and something called QJL, which sounds like a government agency.

Cody [chuckles] Quantized Johnson-Lindenstrauss, yeah, very approachable name. PolarQuant takes the vectors in the KV cache and rotates them from Cartesian coordinates into polar coordinates. When you do that, the angles line up more consistently, so you can represent them with fewer bits. Then QJL nudges the values back toward accuracy to clean up whatever errors the compression introduced. Together they hit that claimed six-times reduction — tested on Llama 3.1-8B, Gemma, Mistra
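
The two ideas can be sketched in a few lines, with heavy caveats. This is a toy illustration and not Google's implementation: the polar step below only splits a vector into (x, y) pairs and stores each angle as a small integer code, leaving the radius uncompressed, and the published method does considerably more. The QJL step keeps only the sign bit of random Gaussian projections and uses the standard unbiased inner-product estimator for such sign sketches. Applying it to the residual left by the first step, as `corrected_score` does, is one plausible reading of the "clean up the errors" description. All function names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def polar_quantize(vec, angle_bits=4):
    """Encode consecutive (x, y) pairs as (radius, integer angle code).

    Assumes an even-length vector. The radius stays a full-precision float.
    """
    levels = 2 ** angle_bits
    pairs = []
    for x, y in zip(vec[0::2], vec[1::2]):
        radius = math.hypot(x, y)
        theta = math.atan2(y, x) % (2 * math.pi)  # map angle into [0, 2*pi)
        code = int(theta / (2 * math.pi) * levels) % levels
        pairs.append((radius, code))
    return pairs


def polar_dequantize(pairs, angle_bits=4):
    """Rebuild the vector using the centre of each angle bin."""
    levels = 2 ** angle_bits
    vec = []
    for radius, code in pairs:
        theta = (code + 0.5) * 2 * math.pi / levels
        vec.extend([radius * math.cos(theta), radius * math.sin(theta)])
    return vec


def make_projections(num_rows, dim, seed=0):
    """Random Gaussian projection matrix, shared by keys and queries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_rows)]


def qjl_sketch(vec, projections):
    """Store one sign bit per projection, plus the vector's norm."""
    signs = [1 if dot(row, vec) >= 0 else -1 for row in projections]
    return signs, math.sqrt(dot(vec, vec))


def qjl_inner_product(query, signs, norm, projections):
    """Estimate <query, vec> from the sign sketch of vec."""
    total = sum(dot(row, query) * s for row, s in zip(projections, signs))
    return math.sqrt(math.pi / 2) / len(projections) * norm * total


def corrected_score(query, key, projections, angle_bits=4):
    """Coarse polar reconstruction plus a sign-sketch estimate of the residual."""
    approx_key = polar_dequantize(polar_quantize(key, angle_bits), angle_bits)
    residual = [k - a for k, a in zip(key, approx_key)]
    signs, norm = qjl_sketch(residual, projections)
    return dot(query, approx_key) + qjl_inner_product(query, signs, norm, projections)


rng = random.Random(1)
key = [rng.gauss(0, 1) for _ in range(16)]
query = [rng.gauss(0, 1) for _ in range(16)]
projections = make_projections(1024, 16)
print("exact score:    ", dot(query, key))
print("corrected score:", corrected_score(query, key, projections))
```

With 4-bit angles, each pair's reconstruction error is bounded by its radius times pi/16, and the sign sketch adds one bit per projection row to estimate what the coarse step missed.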

Justy Okay, walk me through who actually wins here first. My instinct is it's not the end user directly — it's whoever's running the inference.

Cody That's right, at least initially. Inference providers can either serve way more users on the same hardware, or run a much larger model for the same user count. The sleeper use case I find more interesting is on-device — phones, laptops. If you can fit a capable model in the memory envelope of a phone chip, that changes what's possible without a network call.
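
The "more users on the same hardware" point is simple arithmetic. Here is a sketch under made-up numbers: an 80 GiB accelerator, 16 GiB of model weights, and 4 GiB of KV cache per session are all illustrative assumptions, and `max_sessions` is a hypothetical helper. Real serving stacks also reserve memory for activations and fragmentation, which this ignores.

```python
GIB = 1024 ** 3


def max_sessions(memory_bytes, weight_bytes, session_cache_bytes, compression=1):
    """Count how many sessions' KV caches fit after loading the weights.

    `compression` is the cache compression ratio (1 means uncompressed).
    Integer arithmetic avoids floating-point rounding near exact multiples.
    """
    cache_budget = memory_bytes - weight_bytes
    return (cache_budget * compression) // session_cache_bytes


baseline = max_sessions(80 * GIB, 16 * GIB, 4 * GIB)
compressed = max_sessions(80 * GIB, 16 * GIB, 4 * GIB, compression=6)
print(baseline, compressed)  # 16 96
```

The weights are not compressed here, so only the cache portion of memory scales with the ratio. That is also why the gain is largest when the cache, not the weights, dominates memory.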

Justy That's the one that actually changes the product story. Local inference that doesn't feel compromised.

Cody Right. Though I want to flag the thing that I think is getting a little lost in the hype framing.

Justy The DeepSeek comparison. Cloudflare's CEO said it on X and then memory company stocks dropped the same day.

Cody Which, look, I get why the framing landed — surprise efficiency breakthrough, comparable outputs, fraction of the resource cost. But there's a Merrill Lynch analyst note that's the more grounded take: six-times efficiency gains don't typically become six-times less hardware. They become six-times bigger models or longer context windows. The efficiency gets reinvested. And also — this is still lab-stage. Google presented at ICLR end of April, PolarQuant and QJL at AISTATS in e

Justy Which is the adoption barrier I keep bumping into. Lab paper to production inference stack is not a short path.

Cody Months at minimum. Though if it gets picked up by open-source inference frameworks — vLLM, text-generation-inference — that timeline compresses. And TurboQuant is inference-only, which is actually fine — inference is where the ongoing operational cost lives. Training is expensive but you do it once. Inference is every single request, forever.

Justy [sighs] Okay, I think that's the honest picture. Genuinely clever technique, meaningful if it ships, but the stock-market-panic version is probably overdone.

Cody Agreed. Though I will say — the coordinate rotation idea in PolarQuant is just elegant. I'd have been annoyed I didn't think of it first.

Justy [chuckles] Sure, you were this close.

Cody And when the ICLR session recordings drop, the PolarQuant paper is worth reading directly. The math behind the coordinate rotation is actually pretty accessible — it's one of those papers where the core idea fits in about two pages.

Justy So: llama.cpp for solo builders, text-generation-inference if you're running something real, and the paper itself when it's up. That's a solid weekend. Thanks for being in DC, Cody — and for letting me use your coffee maker at six in the morning.