Ep 302 article 11:57 w/ Justy & Cody

The Complete Guide to Inference Caching in LLMs

Justy and Cody dig into inference caching for LLMs and why it matters right now for anybody paying real model bills or waiting on sluggish responses. They unpack the three layers from the article — KV caching inside a single generation, prefix caching across requests with identical leading tokens, and semantic caching using embeddings plus vector search to skip model calls entirely. The episode stays grounded in production reality: prompt structure, exact-match requirements, provider behavior, GPU memory trade-offs, and when semantic caching is actually worth the extra moving parts.

Script: GPT-5.4 Voice: ElevenLabs

Transcript

Cody Okay, I have a theory... half the reason people think their LLM app is magical is they haven't looked at the invoice yet.

Justy [chuckles] Also the latency. You click a button, stare at the spinner, and suddenly everybody becomes a performance engineer. Cody flew into LA for this, by the way, and immediately started judging my Wi-Fi like he's doing a site reliability audit in my kitchen.

Cody Your router has a personality, Justy. It drops packets like it's setting boundaries. And yes, we are once again making an extremely specific episode of Exploring Next, episode 302, about inference caching... which is catnip for maybe a dozen people and absolutely worth it.

Justy The reason to care is simple, Cody. If you're building with LLMs right now, the same prompt scaffolding gets recomputed over and over, and you're paying for that every single time. So today is basically, how do you stop lighting money on fire without rewriting your whole product?

Cody Right, and the article lays it out cleanly. There are really three layers here. KV caching inside a single request, prefix caching across requests that share the same beginning, and semantic caching where you don't call the model at all if a similar question was already answered.

Justy And I like that framing because these aren't substitutes. This is not pick your favorite buzzword. It's more like stacked defenses against waste. If I'm a product team shipping support chat, internal search, or a doc assistant, this stuff changes margins fast.

Cody Exactly. Start at the bottom. KV caching is the built-in one. During transformer inference, every token gets turned into query, key, and value vectors. The expensive part is attention over prior tokens. If you had to recompute keys and values for all previous tokens every time you generate the next token, decoding gets ugly fast.

Justy Cody, do the non-whiteboard version before your voice changes. What is the practical picture here?

Cody [sighs] Fine. Think of the model generating token one hundred. Without KV caching, it keeps redoing work for tokens one through ninety-nine. With KV caching, it stores the key and value states from earlier steps in GPU memory, then reuses them. So for token one hundred, you only compute the new part. Same answer path, less repeated math.

Justy So that's why the article says it's automatic and always on. You're not really choosing it. You're benefiting from it whether you know the term or not.

Cody Yeah, mostly. And that's important because prefix caching is basically that same idea stretched across requests. If the first thousand or ten thousand tokens are identical between requests, the engine can reuse the KV states for that prefix instead of rebuilding them every time.

Justy That, to me, is the money slide. Most real apps have a giant system prompt, some rules, maybe a reference doc, maybe few-shot examples... and then the only thing that changes is the user message at the end. So the user story is, stop making the model reread the employee handbook on every chat turn.

Cody [exhales] Yes. And the article is very clear about the catch. Prefix caching only works when the prefix is exactly identical. Not vaguely similar. Not same meaning. Exact tokens. A trailing space can break it. Reordered JSON keys can break it. Dropping today's date into the top of the prompt absolutely breaks it.

Justy Which is such a classic product footgun. Somebody adds a harmless little timestamp, ships on Friday, and then Monday everyone's asking why costs jumped. [laughs] This is why PMs become annoying about prompt templates.

Cody And for once, you're right to be annoying. Static stuff first, dynamic stuff last. If you've got instructions, examples, shared docs, put them up front. Session IDs, user text, anything that changes per request goes at the end. Also make serialization deterministic. If your app injects JSON, lock the key order.

Justy Wait, this is the same instinct as caching anywhere else. Stable inputs, predictable formatting, don't sabotage your own hit rate. You'd be amazed how many teams miss that because they treat prompts like loose strings instead of structured assets.

Cody Right, but there are provider differences worth knowing. The article mentions Anthropic exposing prompt caching with cache control on content blocks. OpenAI doing prefix caching automatically on prompts over 1024 tokens. Gemini calling it context caching and charging separately for stored cache, which changes the economics for giant reusable contexts.

Justy And if you're self-hosting, vLLM and SGLang matter here because they handle automatic prefix caching in the inference engine. That's a huge adoption point. If I can get savings without touching app logic, that's instantly more likely to make it into a roadmap instead of dying in a doc.

Cody [giggles] You mean into the backlog you invent every episode?

Justy We absolutely have a backlog now. My wife walked through earlier, heard us saying 'prefix invalidation,' and gave me the look people reserve for garage bands. Anyway... if it's low-code to adopt, it gets a much bigger market.

Cody And this is where semantic caching becomes a different animal. You're no longer reusing partial internal model state. You're caching full input-output pairs at the application layer. New query comes in, you embed it, search a vector store for similar past queries, and if the cosine similarity clears your threshold, you return the cached answer and skip the model call.

Justy So unlike prefix caching, which still calls the model but makes the front half cheaper, semantic caching can short-circuit the whole thing.

Cody Exactly... and that sounds amazing until you remember the overhead. Every request now needs an embedding step and a vector search. Pinecone, Weaviate, pgvector, whatever you pick, that's another subsystem. This only pays when traffic has a lot of repetition in meaning, not just wording.

Justy Customer support, FAQs, internal help desks. Places where people ask the same question ten slightly different ways before lunch.

Cody Yep. 'How do I reset my password' versus 'I can't log in, how do I change credentials'... semantically close enough that you may not need fresh generation. But if you're doing creative writing, open-ended analysis, or highly personalized agent flows, hit rates can be lousy and then you've just added plumbing for no real win.

Justy [pause] And there is a product risk here that's real, not performative skepticism. If the answer can go stale, semantic caching needs TTLs and maybe tighter scoping. Otherwise you serve a very confident old answer because the vectors matched. That's not the model being wrong, that's your cache policy being lazy.

Cody Totally. I actually think the article's decision logic is pretty sane. KV caching is just there. Prefix caching is the highest-leverage move for most production apps, especially RAG systems with a big shared document block. Semantic caching is situational. Great when queries repeat. Extra baggage when they don't.

Justy And I would grade that stack an A-minus. Prefix caching gets the A. Semantic caching gets, like, a B-plus because the upside is real but the operational sloppiness tax is also real. [laughs] You flew zero miles and still found a way to disrespect my favorite part. Semantic caching is going straight onto the weekend project list out of spite. That list is now longer than most enterprise roadmaps, Cody. But okay, if somebody wants to actually touch this stuff after we stop talk