Agentic AI: How to Save on Tokens | Towards Data Science
Cody and Justy examine whether the token-saving techniques in Ida Silfverskiöld's article (prompt caching, semantic caching, lazy-loading, routing, context cleanup) are practical wins or theoretical cost-cutting that introduces real friction. Cody opens skeptical: the savings are real but the tradeoffs are often hidden or underestimated. Justy counters that for production teams already bleeding money on agentic AI, even 20-30% savings justifies the engineering lift. They land on a nuanced take: prompt caching is genuinely low-risk and worth it; semantic caching and aggressive routing are trickier and need honest trade-off audits before deployment.
Script: Haiku 4 Voice: OpenAI TTS
Transcript
Justy Exploring Next, episode 354. So we're talking agentic AI and token costs. Cody's read the piece and he thinks a lot of these savings strategies are basically theater. I think they're practical, and honestly necessary right now.
Cody Okay, I'm not saying the savings aren't real. I'm saying people are selling these techniques like they're free optimization when they're actually trade-offs in disguise. Prompt caching? Fine. Semantic caching? That's where it falls apart. You're matching on similarity, which means you need embeddings, thresholds, cache invalidation logic. And nobody talks about when that fails—when the meaning matches but the answer is wrong because the context changed.
Justy Okay, but—look, I'm managing budgets. A team running agentic workflows is spending two, three grand a month on tokens. If prompt caching alone cuts that by 30 or 40 percent, that's real money. Is there friction? Sure. But it's worth it.
Cody The prompt caching part I get. You set your static stuff first, tool definitions, examples, system prompt. It matches exactly, you get the discount. That's straightforward. But semantic caching is a rabbit hole. You have to decide what similarity threshold to use. Is 0.85 cosine similarity close enough to reuse the cached answer? What if the user context changed? What if the data got updated? You're caching answers, not just computation. So you need a TTL, but how long? And o
Justy Fair. But the piece also covers routing and lazy-loading tools, and context cleanup. Those are less exotic. You route simple stuff to a smaller model, save money. You don't load tool definitions until you need them. And you actually delete old conversation history instead of letting it pile up. That's not theater. That's just discipline.
Cody Yeah, those are legitimate. But routing adds latency for the classification step, and you're assuming the cheaper model won't fail. Lazy-loading tools saves tokens on definitions you don't load, but you pay for the API calls or state management to figure out which tools you need upfront. Every savings strategy comes with a hidden cost.
Justy So what would actually be useful? Like, if you were building this for a production team, what would you want to know?
Cody I'd want a framework for auditing your current setup. What's actually eating tokens? Is it the system prompt, or the tool definitions, or old context, or repeated tool outputs? Because the fix is different for each one. And I'd want real numbers on trade-offs. If I enable semantic caching, what's my false-positive rate? What's the latency improvement? You don't get that from a conceptual article.
Justy Alright, so for someone wanting to actually try this—what's the move?
Cody Start with prompt caching. It's the safest win. Structure your prompt so the static stuff is first—system instructions, examples, tool definitions. Then enable caching on your API provider. OpenAI's 90 percent discount on cached tokens is real. Anthropic's the same, but you pay for storage. Then measure your actual token spend for a week or two. Break it down by component—system prompt, tools, context, outputs. That tells you where your real waste is. Maybe 40 percent of your
Justy So it's measure, optimize the obvious thing, measure again.
Cody Yeah. And semantic caching—I'd only touch that if you have high-volume generic queries and you've already optimized everything else. And you test it thoroughly.
Justy Fair. I think the piece is useful as a tour of the techniques, but you're right—it doesn't tell you how to actually deploy them without breaking things. It's a menu, not a recipe.
Cody [laughs] Yeah. A menu where some dishes are 'easy and delicious' and others are 'looks good in the photo but takes six hours and might taste weird.'
Justy Alright. So Build Next—what do you actually want someone to try this weekend?
Cody If you're using OpenAI, write a simple agent with a long system prompt and two or three tools. Call it ten times with the same user query. Measure the tokens on the first call and the second call. You should see a significant drop on the second call if prompt caching is working. Then change one character in the system prompt and call it again—the cache should miss. That teaches you how exact the matching is.
Cody Prompt caching? Absolutely. It's low-risk and high-reward. The others—routing, semantic caching, aggressive cleanup—they're worth doing if your costs are actually a problem. But don't do them because an article made them sound easy. Do them because you measured and you know they'll help.
Justy That's the move. Thanks, Cody.