'Observational memory' cuts AI agent costs 10x and outscores RAG on long-context benchmarks
Observational memory is a new approach to AI agent memory that uses two background agents to compress conversation history into dated observation logs, achieving 10x cost savings through stable context windows that enable prompt caching while outperforming traditional RAG systems on long-context benchmarks.
Script: Sonnet 4.5 Voice: ElevenLabs
Transcript
Izzo Your AI agent just forgot everything you told it last week.
Izzo You're listening to Exploring Next, episode one-seventy-six. I'm Izzo, and Boone, we're talking about something that's been driving me absolutely crazy in production — agents that can't remember context across sessions.
Boone Right, and this observational memory approach from Mastra is actually solving that in a really clever way. They're getting 10x cost savings while beating RAG on long-context benchmarks.
Izzo Okay, so why does this matter right now? Because everyone's moving from those quick chatbot demos to actual persistent agents embedded in real products. And the memory problem is brutal.
Boone Exactly. Traditional RAG keeps invalidating your prompt cache because it's dynamically injecting different context every turn. So you're paying full token costs instead of getting those 4-to-10x caching discounts from OpenAI and Anthropic.
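The caching arithmetic Boone describes can be sketched in a few lines of Python. This is a rough illustration, not provider pricing: `billed_token_equivalents` is a hypothetical helper, the 30,000-token prompt and 27,000-token cached prefix are made-up inputs, and the discount factors are just the two ends of the 4-to-10x range quoted above.

```python
def billed_token_equivalents(prompt_tokens: int, cached_tokens: int, discount: float) -> float:
    """Input cost of one turn, expressed in full-price token equivalents.

    Assumption: cached tokens are billed at 1/discount of the normal rate,
    which simplifies how real providers price cache reads and writes.
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    return uncached + cached_tokens / discount


# Hypothetical turn with a 30,000-token prompt.
no_cache = billed_token_equivalents(30_000, 0, discount=10)       # prefix changed, cache miss
low_end = billed_token_equivalents(30_000, 27_000, discount=4)    # stable prefix, 4x discount
high_end = billed_token_equivalents(30_000, 27_000, discount=10)  # stable prefix, 10x discount

print(no_cache, low_end, high_end)  # 30000.0 9750.0 5700.0
```

Under these assumed numbers, the per-turn saving comes entirely from how much of the prompt stays identical from one turn to the next.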
Izzo And from a user perspective — imagine you're using an agent in your CMS three weeks later, and it has zero memory that you asked for reports in a specific format. That's not a bug, that's a broken product experience.
Boone So let me walk through how observational memory actually works, because the architecture is surprisingly elegant. They split the context window into two blocks.
Izzo Break that down for me.
Boone First block contains observations — these are compressed, dated notes from previous conversations. Second block holds raw message history from your current session. Two background agents manage the whole thing.
Boone When unobserved messages hit 30,000 tokens — that's configurable — the Observer agent kicks in. It compresses those messages into new observations and appends them to the first block. The original messages get dropped entirely.
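The two-block layout and the Observer trigger can be sketched as follows. This is a minimal illustration under stated assumptions, not Mastra's implementation: `Context`, `add_message`, `count_tokens` and `toy_observer` are hypothetical names, the token count is a crude four-characters-per-token estimate, and the real Observer is a background agent calling a model rather than a plain function.

```python
from dataclasses import dataclass, field
from typing import Callable

OBSERVE_THRESHOLD = 30_000  # unobserved-message tokens, per the episode; configurable


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: about 4 characters per token."""
    return max(1, len(text) // 4)


@dataclass
class Context:
    observations: list[str] = field(default_factory=list)  # block 1: compressed, dated notes
    messages: list[str] = field(default_factory=list)      # block 2: raw current-session history

    def unobserved_tokens(self) -> int:
        return sum(count_tokens(m) for m in self.messages)


def add_message(ctx: Context, message: str,
                observe: Callable[[list[str]], list[str]],
                threshold: int = OBSERVE_THRESHOLD) -> None:
    """Append a message; once the raw block reaches the threshold, compress and drop it."""
    ctx.messages.append(message)
    if ctx.unobserved_tokens() >= threshold:
        ctx.observations.extend(observe(ctx.messages))  # append-only on block 1
        ctx.messages.clear()                            # originals are dropped entirely


def toy_observer(messages: list[str]) -> list[str]:
    """Hypothetical stand-in for the model call that writes observations."""
    return [f"Jan 15: compressed {len(messages)} messages"]
```

Passing a small `threshold` makes the behavior easy to watch: messages pile up in the second block until the limit is reached, then one new observation appears in the first block and the raw messages are gone.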
Izzo And what happens when the observations themselves get too big?
Boone That's where the Reflector comes in. At 40,000 tokens of observations, it restructures the whole log — combines related items, removes superseded information, but keeps that event-based structure intact.
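A toy version of the reflection step might look like the sketch below. It is only an approximation of the idea: the real Reflector is an agent that rewrites the log with a model, whereas this stand-in uses a hypothetical `topic` key to decide which entries supersede which. `Observation`, `reflect` and `maybe_reflect` are illustrative names, and the dates in the usage example are invented.

```python
from dataclasses import dataclass

REFLECT_THRESHOLD = 40_000  # observation tokens, per the episode


@dataclass(frozen=True)
class Observation:
    date: str   # ISO date, so string order matches chronological order
    topic: str  # hypothetical key used here to detect superseded entries
    text: str


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: about 4 characters per token."""
    return max(1, len(text) // 4)


def reflect(log: list[Observation]) -> list[Observation]:
    """Keep only the newest observation per topic, still ordered by date."""
    latest: dict[str, Observation] = {}
    for obs in sorted(log, key=lambda o: o.date):
        latest[obs.topic] = obs  # a later entry replaces an earlier one on the same topic
    return sorted(latest.values(), key=lambda o: o.date)


def maybe_reflect(log: list[Observation],
                  threshold: int = REFLECT_THRESHOLD) -> list[Observation]:
    """Run reflection only once the observation block reaches the threshold."""
    total = sum(count_tokens(o.text) for o in log)
    return reflect(log) if total >= threshold else list(log)
```

For example, given a January request for weekly reports and a later request on the same topic for monthly reports, `reflect` keeps only the later one, while unrelated observations survive with their dates intact.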
Izzo So instead of getting documentation-style summaries like traditional compaction, you're maintaining an actual decision log?
Boone Exactly. It's not 'the user discussed reports' — it's 'January 15th, user requested weekly content reports segmented by author and view count.' Specific, dated, actionable.
Izzo The economics here are fascinating though. How are they getting to that 10x cost reduction?
Boone It's all about prompt caching. The observation block is append-only until reflection runs. That means your system prompt plus existing observations form a consistent prefix that gets cached across many conversation turns.
Boone Most memory systems can't do this because they change the prompt every turn with dynamic retrieval. Cache miss, full token cost, every single time.
Izzo Right, so the cached prefix keeps hitting until unobserved messages reach that 30k threshold and the Observer runs. Even then, you're just appending new observations to the existing block, so you still get partial cache hits.
Boone And reflection only runs when observations hit 40k tokens, which is way less frequent. Their average context window for the benchmark was around 30k tokens total — way smaller than full conversation history would require.
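The difference between a stable prefix and a prompt that changes every turn can be made concrete with a small comparison. This sketch treats a prompt as a list of segments and counts how many leading segments two consecutive turns share; real caches match on token prefixes, so this is a simplification. `shared_prefix` and all of the prompt contents are made up for illustration.

```python
def shared_prefix(prev: list[str], curr: list[str]) -> int:
    """Number of leading segments two prompts have in common."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n


SYSTEM = "system prompt"

# Observational memory: observations are append-only, new messages land at the end.
om_turn1 = [SYSTEM, "obs: Jan 15 report format", "msg 1"]
om_turn2 = [SYSTEM, "obs: Jan 15 report format", "msg 1", "msg 2"]
# After the Observer runs: one observation appended, raw messages dropped.
om_turn3 = [SYSTEM, "obs: Jan 15 report format", "obs: msgs 1-2 compressed", "msg 3"]

# Retrieval-style prompt: different chunks injected right after the system prompt.
rag_turn1 = [SYSTEM, "chunk A", "chunk B", "msg 1"]
rag_turn2 = [SYSTEM, "chunk C", "chunk A", "msg 1", "msg 2"]

print(shared_prefix(om_turn1, om_turn2))    # 3: the whole previous prompt is reusable
print(shared_prefix(om_turn2, om_turn3))    # 2: partial hit after the Observer runs
print(shared_prefix(rag_turn1, rag_turn2))  # 1: only the system prompt survives
```

The middle case corresponds to the partial cache hit mentioned above: the system prompt and the older observations still match, and only the newly appended material is uncached.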
Izzo The benchmark results are pretty compelling too. They scored 94.87% on LongMemEval with GPT-5-mini, and 84.23% with GPT-4o, compared with 80.05% for their own RAG implementation.
Boone What I love about this is the compression ratios. For text content, they're getting 3-to-6x compression. But for tool-heavy agents generating large outputs? 5-to-40x compression.
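To get a feel for what those ratios mean in tokens, here is a small worked calculation. It assumes, purely for illustration, that the ratios apply to a single 30,000-token batch of unobserved messages; `compressed_range` is a hypothetical helper, not part of any library.

```python
def compressed_range(raw_tokens: int, low_ratio: int, high_ratio: int) -> tuple[int, int]:
    """Return (largest, smallest) size after compressing at low_ratio to high_ratio."""
    return raw_tokens // low_ratio, raw_tokens // high_ratio


text_batch = compressed_range(30_000, 3, 6)   # text content at 3-to-6x
tool_batch = compressed_range(30_000, 5, 40)  # tool-heavy output at 5-to-40x

print(text_batch)  # (10000, 5000)
print(tool_batch)  # (6000, 750)
```

Under that assumption, a tool-heavy batch could shrink to well under a thousand tokens of observations at the high end of the range.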
Izzo That makes sense for the use cases they're targeting. B2B SaaS companies embedding agents in their web apps, SRE systems tracking alert investigations over months, document processing workflows.
Boone These aren't use cases where you need to search a broad external corpus. You need the agent to remember what it's already seen and decided. That's the key trade-off here.
Izzo I'm giving this approach a solid A-minus for production environments where persistence matters more than dynamic knowledge discovery. The architecture is simpler — text-based, no vector databases — and the economics actually work.
Boone The fact that it's shipping with plugins for LangChain and Vercel's AI SDK means you can actually try this without rebuilding your entire stack.
Izzo So what should people build next? Boone, I know you're already adding this to your weekend project list.
Boone Guilty as charged. First thing — clone the Mastra repo and run their observational memory example locally. See how the Observer and Reflector agents actually compress your conversation history.
Boone Second, if you're already using LangChain or Vercel AI SDK, install their observational memory plugin and benchmark it against your current RAG setup on a long conversation. Track both accuracy and token costs.
Boone And third, build a simple persistent agent for something you actually use. Maybe a project tracker that remembers your preferences, or a code review agent that maintains context across PRs. Test how it feels when memory actually persists. That's Exploring Next. As age