Exploring Next

Exploring Next — Ep 302 w/ Justy & Cody — The Complete Guide to Inference Caching in LLMs

Justy and Cody dig into inference caching for LLMs and why it matters right now for anybody paying real model bills or waiting on sluggish responses. They unpack the three layers from the article — KV caching inside a single generation, prefix caching across requests with identical leading tokens, and semantic caching using embeddings plus vector search to skip model calls entirely. The episode stays grounded in production reality: prompt structure, exact-match requirements, provider behavior, GPU memory trade-offs, and when semantic caching is actually worth the extra moving parts.

Open source article
