Ep 347 tool 4:03 w/ Justy & Cody

Cut AI token usage by 96%? Here's how AWS Strands Agents does it.

AWS Strands Agents can cut agent token usage by having models request only the context they need, when they need it. Instead of stuffing huge prompts up front, it uses tools, memory, and session state to keep agents lean, which matters for cost, latency, and scaling.

Script: GPT-5.4 mini Voice: Deepgram TTS

Transcript

Justy If your agent is chewing through context like crazy, the bill and the lag show up fast. That’s the part people feel.

Justy Welcome back to Exploring Next, episode 347. I’m Justy, and Cody’s here with me in person, which is nice because we can argue about token bills without a screen in the way.

Cody Yeah, and this one’s timely because a lot of agent apps are doing the dumb expensive thing. They load a ton of text up front, then keep paying to resend it. That scales badly.
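A rough sketch of the arithmetic behind "keep paying to resend it", assuming a chat-style API that bills every turn for the full context it receives and no prompt caching. All numbers are hypothetical and are not figures from the episode or from AWS; the lean per-turn figure is set higher on purpose to account for tool results coming back into the context.

```python
def resend_cost(base_context_tokens, tokens_added_per_turn, turns):
    """Total input tokens billed when the whole context is resent each turn."""
    total = 0
    context = base_context_tokens
    for _ in range(turns):
        context += tokens_added_per_turn  # the conversation grows every turn
        total += context                  # and the full context is billed again
    return total

# Hypothetical workloads: a 20,000-token up-front prompt vs. an 800-token one
# that pulls extra context through tools (hence the larger per-turn growth).
stuffed = resend_cost(20_000, 500, turns=10)
lean = resend_cost(800, 700, turns=10)
saving = 1 - lean / stuffed

print(f"stuffed={stuffed} lean={lean} saving={saving:.0%}")
# prints: stuffed=227500 lean=46500 saving=80%
```

The up-front block is paid for on every turn, so its cost scales with the number of turns and sessions rather than being a one-time charge.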

Justy And the user doesn’t care that the prompt was elegant. They care that the thing took forever and the usage dashboard looks weird. So who is this actually for?

Cody Mostly teams building production agents. Support workflows, internal ops, coding assistants, research assistants. Anywhere the agent keeps looping over the same app state, docs, or user history. Strands Agents is interesting because it leans into tools and session-aware state instead of treating every turn like a fresh essay.

Justy That feels like a real adoption barrier, though. People already have some agent stack half-working. If the new thing means rewiring the whole flow, they’ll probably stay put.

Cody Right, and that’s the trade-off. The clever part is the pattern: keep the prompt small in the moment, and let the model call out for only what it needs. The article’s point is basically that you can cut token usage a lot by not stuffing everything into the prompt. I think the headline number was 96% in some cases, which is huge if it holds in your workload.

Justy Ninety-six is wild. But I’m always thinking, okay, what’s the story for the person actually paying for this? Is it a startup with one agent, or a bigger team with lots of sessions and lots of repeated context?

Cody Bigger teams feel it first. If you have dozens or hundreds of agent sessions, tiny inefficiencies become real money. And latency matters too. Smaller prompts mean faster turns, which makes the agent feel less like it’s thinking in molasses. [chuckles]

Justy So how does it work under the hood? Because I’ve seen a lot of agent frameworks say they’re efficient and then they just move the mess around.

Cody That’s fair. Strands Agents is built around orchestration rather than one giant prompt blob. The model can use tools, and those tools can fetch fresh context from external systems or session memory. So instead of serializing your whole world into the context window, you make context a resource the agent requests. That’s the part I find genuinely smart.
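A minimal sketch of "context as a resource the agent requests", in plain Python. It does not use the Strands Agents API; the stores, tool names (`recall`, `lookup`), and the list of tool calls are all hypothetical stand-ins for what a model would choose at run time.

```python
# Hypothetical stand-ins for session memory and an external knowledge source.
SESSION_MEMORY = {
    "user_name": "Dana",
    "plan": "pro",
    "last_ticket": "Refund issued on order 1182.",
}
KNOWLEDGE = {
    "refund_policy": "Refunds are issued within 14 days of purchase.",
    "shipping": "Orders ship in 2 business days. " * 20,
    "faq": "General questions and answers. " * 40,
}

def recall(key: str) -> str:
    """Tool: read one item from session memory."""
    return SESSION_MEMORY.get(key, "(nothing stored)")

def lookup(topic: str) -> str:
    """Tool: fetch one document from the external source."""
    return KNOWLEDGE.get(topic, "(no document)")

TOOLS = {"recall": recall, "lookup": lookup}

def stuffed_prompt(question: str) -> str:
    """The 'giant prompt blob': serialize everything up front."""
    everything = {**SESSION_MEMORY, **KNOWLEDGE}
    blob = "\n".join(f"{key}: {value}" for key, value in everything.items())
    return f"{blob}\n\nQuestion: {question}"

def lean_prompt(question: str, tool_calls) -> str:
    """Only include what the model asked for via (tool_name, argument) calls."""
    fetched = [TOOLS[name](argument) for name, argument in tool_calls]
    return "\n".join(fetched) + f"\n\nQuestion: {question}"

question = "Can I still get a refund?"
full = stuffed_prompt(question)
lean = lean_prompt(question, [("lookup", "refund_policy")])
print(len(full), len(lean))
```

In a real agent the model decides which calls to make, so the quality of the tool names and descriptions determines whether it asks for the right thing.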

Justy And the weird part is that it sounds more annoying for the developer, but better for the product. Which is usually how these things go.

Cody Yeah. More moving pieces on the back end, but cleaner behavior for the user. The downside is you need good tool boundaries and decent retrieval. If your tool calls are sloppy, the agent just becomes a confused little tourist asking for the wrong map.

Justy [sighs] That’s a very vivid image. I buy it, though. And from a product angle, the barrier is not just technical. It’s also trust. Teams need to believe the agent won’t forget something important because it wasn’t in the prompt.

Cody Exactly. The source’s core idea is that agents should be stateful without being bloated. That’s a nice middle ground. I do think the article is probably a little optimistic if someone reads it as 'just swap frameworks and your costs vanish.' You still need good evals, logging, and a sense of what context actually matters.

Justy Yeah, I’d push that too. Teams don’t adopt a framework because it’s elegant. They adopt it when the first few workflows feel safer, cheaper, and easier to ship. Otherwise it sits in a repo and everybody nods at it.

Cody Build Next-wise, I’d start simple. Use the AWS Strands Agents repo and wire up one agent that can answer questions from a small docs folder. Then add a tool that fetches only the relevant file chunks on demand, and log tokens before and after.
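One way the chunk-fetching tool and the before/after token log could look, sketched with the standard library only. This is not the Strands Agents API: the function names are hypothetical, the relevance score is plain word overlap rather than real retrieval, and the token count assumes roughly 4 characters per token, which is only a rule of thumb.

```python
from pathlib import Path

def estimate_tokens(text: str) -> int:
    """Crude estimate, assuming about 4 characters per token."""
    return len(text) // 4

def load_chunks(folder, chunk_chars=400):
    """Split every .md file in the folder into fixed-size character chunks."""
    chunks = []
    for path in sorted(Path(folder).glob("*.md")):
        text = path.read_text(encoding="utf-8")
        for start in range(0, len(text), chunk_chars):
            chunks.append((path.name, text[start:start + chunk_chars]))
    return chunks

def fetch_relevant_chunks(chunks, query, k=2):
    """Tool body: return up to k chunks that share the most words with the query."""
    query_words = set(query.lower().split())
    scored = []
    for name, text in chunks:
        score = len(query_words & set(text.lower().split()))
        if score:
            scored.append((score, name, text))
    # sort() is stable, so ties keep their original file order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(name, text) for _, name, text in scored[:k]]

def log_token_comparison(chunks, query, k=2):
    """Print and return estimated tokens for 'send everything' vs 'send top k'."""
    before = estimate_tokens("".join(text for _, text in chunks))
    selected = fetch_relevant_chunks(chunks, query, k)
    after = estimate_tokens("".join(text for _, text in selected))
    print(f"query={query!r} before={before} after={after}")
    return before, after
```

Usage would be along the lines of `log_token_comparison(load_chunks("docs"), "refund policy")`, where `docs` is whatever small folder of markdown files you point it at. For real measurements, the token counts reported by the model provider are more trustworthy than this estimate.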

Justy For a solo builder, that’s a solid weekend project. You could do the same thing with a local markdown folder and a tiny command-line app. No big platform needed.

Cody Yeah, and if you want to get more serious, compare that against a basic LangChain or custom tool-calling setup. Same task, same inputs, different orchestration style. Measure latency, token use, and how often the agent asks for the wrong thing.
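A small harness for that comparison, as a sketch. It assumes each setup can be wrapped in a function that takes a question and reports its own token usage and the context items it requested; `RunResult`, `benchmark`, and the field names are hypothetical, not part of Strands Agents or LangChain.

```python
import time
from dataclasses import dataclass, field

@dataclass
class RunResult:
    answer: str
    tokens_used: int                                # as reported by the agent wrapper
    requested: list = field(default_factory=list)   # context items the agent asked for

def benchmark(name, run_agent, tasks):
    """Run the same tasks through one agent setup and collect three metrics.

    tasks is a list of (question, expected_items), where expected_items is the
    set of context items a correct run should need for that question.
    """
    total_tokens = 0
    total_requests = 0
    wrong_requests = 0
    start = time.perf_counter()
    for question, expected_items in tasks:
        result = run_agent(question)
        total_tokens += result.tokens_used
        total_requests += len(result.requested)
        wrong_requests += sum(1 for item in result.requested if item not in expected_items)
    elapsed = time.perf_counter() - start
    return {
        "name": name,
        "latency_s": elapsed,
        "tokens": total_tokens,
        "wrong_request_rate": wrong_requests / total_requests if total_requests else 0.0,
    }
```

Running `benchmark` once per framework over an identical `tasks` list keeps the inputs fixed, so differences in the three numbers come from the orchestration style rather than the workload.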

Justy That’s the real test. Not whether it sounds clever in a blog post, but whether your app gets cheaper and less annoying to use. Alright, I’m gonna call that a win for lunch-table engineering today.

Justy We’ll leave it there. Exploring Next, episode 347. Cody, thanks for the deep dive, and yeah, I’m still thinking about that tourist with the wrong map.