Ep 229 News March 17, 2026 6:30 w/ Justy & Cody

Langsmart Publishes Industry’s First p95 Semantic Cache Benchmarks for On Premises AI Gateway, Challenges Market: “Show Me the p95”

Langsmart's Smartflow platform achieved 10.2x faster AI response times in Fortune 200 testing, delivering sub-300ms p95 latency on modest on-premises hardware while challenging the industry to publish real performance benchmarks.

Script: Sonnet 4.5. Voice: OpenAI TTS.

Transcript

Izzo If you're running AI in production at a bank, your prompts can't leave the building.

Izzo You're listening to Exploring Next, episode two twenty-nine. I'm Izzo, here with Boone, and we're diving into something that sounds like marketing fluff but actually isn't — Langsmart just published the first real p95 benchmarks for on-premises AI gateways.

Boone And they're throwing down a gauntlet to the whole industry with 'show me the p95' — which honestly, as someone who's debugged production systems at 3am, I love that energy.

Izzo Right? Because here's why this matters right now — every Fortune 500 is scrambling to deploy AI, but regulated industries like banking and healthcare can't just pipe their data through OpenAI's API.

Boone They need these AI gateway things that sit in their own data centers, but nobody's publishing real performance numbers. It's all been 'trust us, it's fast' until now.

Izzo So Langsmart tested their Smartflow platform with a Fortune 200 bank and got some wild results. Boone, break down what they actually measured here.

Boone They deployed this thing as a Docker container on basically a laptop — 4 vCPUs, 8 gigs of RAM — and got responses down from 2.2 seconds to 220 milliseconds. That's a 10x speedup.

Izzo Hold on, 2.2 seconds to 220 milliseconds? That's not just faster, that's crossing the line from 'users notice the delay' to 'feels instant.'

Boone Exactly. And the p95 latency — meaning 95% of requests finish faster than this — was under 300 milliseconds. For context, most enterprise SLAs want sub-500ms, so they're beating that comfortably.
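
To make the p95 definition concrete, here is a minimal sketch of a nearest-rank percentile calculation. The sample latencies, the `percentile` helper, and the `SLA_MS` name are illustrative assumptions, not Langsmart's data or tooling; only the 500 ms figure comes from the conversation.

```python
import math

def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical request latencies in milliseconds (illustrative only).
latencies_ms = [180, 190, 200, 205, 210, 215, 220, 225, 230, 235,
                240, 245, 250, 255, 260, 265, 270, 280, 290, 900]

p95 = percentile(latencies_ms, 95)
SLA_MS = 500  # the sub-500ms target mentioned in the episode
print(f"p95 = {p95} ms, within SLA: {p95 < SLA_MS}")
```

Note that the single 900 ms outlier sits in the slowest 5% and so does not move the p95, which is why teams often track p99 alongside it.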

Izzo But how does semantic caching actually work? Because this isn't just storing exact query matches, right?

Boone Right, that's the clever bit. Traditional caching is like 'if you ask the exact same question, here's the exact same answer.' Semantic caching is more like 'if you ask a similar enough question, here's the answer we already cached for it, which is probably good enough.'

Izzo So it's doing some kind of similarity matching on the meaning, not just the text string.

Boone Exactly. They're probably using embeddings to represent the semantic meaning of prompts, then doing nearest-neighbor search to find cached responses that are close enough. At a 0.95 similarity threshold, they got 40-50% hit rates.
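
The approach Boone is guessing at can be sketched roughly as follows. This is not Smartflow's implementation, which the episode does not describe. The `toy_embed` function is a bag-of-words stand-in for a real embedding model, the search is a brute-force scan rather than an approximate nearest-neighbor index, and every name here is hypothetical. Only the 0.95 threshold comes from the source.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model (assumption): word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold=0.95, embed=toy_embed):
        self.threshold = threshold
        self.embed = embed
        self.entries = []  # list of (embedding, cached response)

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

    def get(self, prompt):
        """Return the response cached for the most similar prompt,
        or None if nothing clears the similarity threshold."""
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:  # brute-force nearest neighbor
            score = cosine(query, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

cache = SemanticCache()
cache.put("what is the current rate for a 30 year fixed mortgage", "cached answer")

# One extra word: similarity is about 0.957, so this counts as a hit.
hit = cache.get("What is the current rate for a 30 year fixed mortgage today")
# No words in common: similarity is 0, so this is a miss.
miss = cache.get("how do I reset my online banking password")
print(hit, miss)
```

With word counts, only near-identical wording clears 0.95. A real embedding model would also match paraphrases, which is where the hit rate in a semantic cache comes from.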

Izzo That hit rate is actually impressive for semantic matching. But Boone, what's the architecture trade-off here? Why can't these banks just use cloud-hosted gateways?

Boone Data sovereignty. When you're a bank processing loan applications, you legally cannot send that data to a third-party cloud service. It has to stay in your network perimeter.

Izzo So the whole value prop is 'we give you cloud-level performance without your data leaving the building.' That's a real product-market fit for regulated industries.

Boone And the fact that they're doing it on such modest hardware is interesting. 4 vCPUs and 8GB RAM is like... I've got more compute power in my weekend project server.

Izzo Which makes me wonder about the competitive landscape. Who else is playing in this on-premises AI gateway space?

Boone That's where their 'show me the p95' challenge gets spicy. They're basically saying all the other vendors are making performance claims without publishing real latency percentiles.

Izzo Smart move. It's like when Netflix started publishing their chaos engineering results — suddenly everyone else looked less transparent by comparison.

Boone Right, and p95 latency is what actually matters in production. Average response time means nothing if 5% of your users are waiting forever.
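
A small worked example shows how an average can hide a slow tail. The workload below is invented for illustration: 94 fast requests and 6 very slow ones, so just over 5% of users wait a long time.

```python
import math
import statistics

# Hypothetical workload (illustrative only): 94 fast requests, 6 slow ones.
latencies_ms = [200] * 94 + [5000] * 6

mean_ms = statistics.mean(latencies_ms)

# Nearest-rank p95: the value at 1-based rank ceil(0.95 * n) in sorted order.
ordered = sorted(latencies_ms)
p95_ms = ordered[math.ceil(95 * len(ordered) / 100) - 1]

print(f"mean = {mean_ms:.0f} ms, p95 = {p95_ms} ms")
```

Here the mean comes out at 488 ms, which would pass a 500 ms target, while the p95 is 5000 ms.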

Izzo From a go-to-market angle, this feels like they're trying to create a new standard for how enterprise AI infrastructure gets evaluated. Make benchmarks a competitive requirement.

Boone Which is good for buyers but probably terrifying for competitors who've been handwaving their performance claims. Now they have to put up actual numbers.

Izzo The Docker deployment model is smart too — it's not some custom appliance you have to rack and stack. Just docker run and you're live.

Boone Although I'm curious about the semantic similarity algorithm. 0.95 threshold sounds high — that's pretty strict matching. I wonder what embedding model they're using under the hood.

Izzo And how they're handling cache invalidation when models get updated or business logic changes. That's always the hard part with caching systems.
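
One common way to handle the invalidation problem Izzo raises is to scope every cache entry to the model and policy version it was produced under, plus a time-to-live. The sketch below is an assumption about how this could be done, not a description of Smartflow. The class, the version strings, and the exact-match `prompt_key` are all hypothetical; a semantic cache would apply the same scoping to its similarity lookups.

```python
import time

class VersionedCache:
    """Entries are valid only for the model/policy version and TTL
    they were stored under."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested
        self.store = {}     # (model, policy, prompt_key) -> (expires_at, response)

    def put(self, model_version, policy_version, prompt_key, response):
        key = (model_version, policy_version, prompt_key)
        self.store[key] = (self.clock() + self.ttl_seconds, response)

    def get(self, model_version, policy_version, prompt_key):
        key = (model_version, policy_version, prompt_key)
        entry = self.store.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self.clock() >= expires_at:
            del self.store[key]  # lazily evict expired entries
            return None
        return response

# Usage with a fake clock so the example is deterministic.
now = [0.0]
cache = VersionedCache(ttl_seconds=60, clock=lambda: now[0])
cache.put("model-v1", "policy-2026-03", "loan-eligibility", "cached answer")

same_version = cache.get("model-v1", "policy-2026-03", "loan-eligibility")
after_upgrade = cache.get("model-v2", "policy-2026-03", "loan-eligibility")
now[0] = 61.0  # advance past the TTL
after_ttl = cache.get("model-v1", "policy-2026-03", "loan-eligibility")
print(same_version, after_upgrade, after_ttl)
```

Because the version is part of the key, a model upgrade or policy change makes old entries unreachable without an explicit flush, and the TTL bounds how long any answer can be served.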

Boone True. Plus, financial workloads are probably more repetitive than general AI usage — lots of similar loan evaluations or risk assessments. That might inflate their hit rates compared to other industries.

Izzo Fair point. But even if you cut their numbers in half, going from 2+ seconds to under 500ms is still a massive user experience improvement.

Boone Definitely. And I'm adding 'build a semantic cache for my local LLM setup' to the weekend project list. This has me curious about the implementation.

Izzo So if you want to dig into this yourself, go check out their full benchmarking methodology at langsmart.ai. They published the actual test setup, not just marketing claims. And if you're fe