Ep 443 research 5:59 w/ Justy & Cody

AI memory framework MeMo skips LLM retraining

MIT's MeMo framework encodes new knowledge into a small dedicated memory model so teams can swap in a better LLM without retraining — and the performance gains are real. Justy and Cody break down how it actually works, what the benchmarks mean, and where the trade-offs bite.

Script: Sonnet 4.6 Voice: Rime Mist v3

Transcript

Justy Okay so — what if you trained your memory ONCE, and then just... swapped in a smarter brain whenever a better model dropped? That's basically what this MIT paper is claiming.

Cody Which sounds too clean. But the numbers are actually interesting, so — yeah, let's get into it.

Justy Also I'm running on like five hours of sleep, I drove up this morning and traffic was unhinged. I'm not even fully here yet.

Cody I could tell. You texted me from the freeway, which, please don't do that. Anyway — MeMo.

Justy Okay, MeMo. So the thing it's solving — LLMs are frozen after training, right? Their internal knowledge just... stops. And if you want to update it, you're either doing full retraining, which is wildly expensive, or you're doing RAG, which has its own mess of problems.

Cody Right. And the RAG problems are real. One of the co-authors said something I thought was pretty sharp: that vector databases have a fundamentally difficult job encoding the full semantics of a chunk of text in a single vector, because relevance sometimes only becomes apparent when you look at multiple chunks together. So you end up retrieving the wrong stuff, or just... missing the point of the query entirely.

Justy Mm-hm.

Cody And that noise sensitivity is a real problem in enterprise deployments. Your knowledge base is a disaster — duplicate files, outdated policies, stuff that should've been deleted two years ago.

Justy Every company's internal wiki is just a graveyard of good intentions. So MeMo's answer is — don't retrieve from raw documents at all. You encode the knowledge into a separate small model instead.

Cody Yeah, and the architecture is actually clever. There are two main pieces. The Memory model is a smaller language model (they used Qwen2.5 14B in the experiments), and it gets fine-tuned specifically to hold the new knowledge in its weights. Then there's the Executive model, which is your big frozen LLM, the reasoning engine. It never gets touched.

Justy Oh interesting.

Cody At inference time, the Executive decomposes a user's question into atomic sub-questions, fires them at the Memory model like API calls basically, gets the facts back, and synthesizes the answer. It's treating the Memory model as an oracle.
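The inference loop Cody describes can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the names `answer`, `toy_decompose`, `toy_memory`, and the two synthesizers are hypothetical, and each "model" is stood in for by a plain function so the sketch runs without any real LLM.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces: every "model" here is just a text-in, text-out function.
Decomposer = Callable[[str], List[str]]
MemoryModel = Callable[[str], str]
Synthesizer = Callable[[str, List[str]], str]


def answer(question: str, decompose: Decomposer,
           memory: MemoryModel, synthesize: Synthesizer) -> str:
    """Executive loop: split the question, query memory per sub-question, synthesize."""
    sub_questions = decompose(question)
    facts = [memory(sq) for sq in sub_questions]
    return synthesize(question, facts)


# Toy stand-in for the fine-tuned Memory model: a lookup table of facts.
FACTS: Dict[str, str] = {
    "Who wrote the report?": "Dana wrote the report.",
    "When was the report filed?": "The report was filed in March.",
}


def toy_decompose(question: str) -> List[str]:
    # A real Executive would generate these; they are hard-coded for illustration.
    return ["Who wrote the report?", "When was the report filed?"]


def toy_memory(sub_question: str) -> str:
    return FACTS.get(sub_question, "unknown")


def executive_a(question: str, facts: List[str]) -> str:
    return " ".join(facts)


def executive_b(question: str, facts: List[str]) -> str:
    # A different Executive reusing the same Memory model unchanged.
    return "Summary: " + " ".join(facts)


QUESTION = "Who wrote the report and when was it filed?"
result_a = answer(QUESTION, toy_decompose, toy_memory, executive_a)
result_b = answer(QUESTION, toy_decompose, toy_memory, executive_b)
print(result_a)
print(result_b)
```

Swapping `executive_a` for `executive_b` leaves `toy_memory` untouched, which is the decoupling the hosts return to later in the episode.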

Justy And the thing that makes that work is what they call reflections — which I love as a name, honestly. Instead of just dumping raw documents into training, a Generator model first converts everything into targeted question-answer pairs. Every angle of the corpus, captured as QA pairs. THEN the Memory model trains on that.

Cody Which is doing real work. You're not just compressing text — you're forcing the model to internalize the knowledge in a queryable form. That's why it holds up under noise. The Executive never sees the messy raw docs.
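One way to picture the reflection step is the sketch below. It is an assumption-laden illustration: `Reflection`, `build_reflections`, `to_training_example`, and the toy generator are hypothetical names, the de-duplication pass is an added assumption rather than something the episode attributes to the paper, and a rule-based function stands in for the Generator model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Reflection:
    question: str
    answer: str
    source_id: str


# Hypothetical Generator interface: document text in, (question, answer) pairs out.
QAGenerator = Callable[[str], List[Tuple[str, str]]]


def build_reflections(docs: Iterable[Tuple[str, str]],
                      generate_qa: QAGenerator) -> List[Reflection]:
    """Convert (doc_id, text) pairs into QA pairs, skipping exact repeats."""
    seen = set()
    reflections: List[Reflection] = []
    for doc_id, text in docs:
        for question, ans in generate_qa(text):
            key = (question.strip().lower(), ans.strip().lower())
            if key in seen:  # assumption: duplicate documents yield duplicate pairs
                continue
            seen.add(key)
            reflections.append(Reflection(question, ans, doc_id))
    return reflections


def to_training_example(r: Reflection) -> Dict[str, str]:
    """Shape one reflection as a prompt/completion record for fine-tuning."""
    return {"prompt": r.question, "completion": r.answer}


def toy_generate_qa(text: str) -> List[Tuple[str, str]]:
    # Rule-based stand-in for the Generator: turns "X is Y." into ("What is X?", "Y").
    pairs = []
    for sentence in text.split("."):
        sentence = sentence.strip()
        if " is " in sentence:
            subject, value = sentence.split(" is ", 1)
            if subject and value:
                pairs.append((f"What is {subject[0].lower() + subject[1:]}?", value))
    return pairs


docs = [
    ("policy-v1", "The refund window is 30 days. The support line is 555-0100."),
    ("policy-copy", "The refund window is 30 days."),
]
reflections = build_reflections(docs, toy_generate_qa)
for r in reflections:
    print(to_training_example(r))
```

The Memory model would then be fine-tuned on records like these rather than on the raw documents.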

Justy Right, right. And the benchmark results are kind of wild, Cody. On NarrativeQA, which is a long-document multi-hop reasoning benchmark, MeMo hit 53.58% accuracy paired with Gemini 3 Flash. HippoRAG2, which is a state-of-the-art graph-based RAG system, maxed out at 23.21%.

Cody That gap is LARGE. Like, that's not a marginal improvement — that's a different regime of performance on the same task.

Justy And then the swap thing — they just switched the Executive model from Qwen to Gemini 3 Flash and got a 26.73% jump on NarrativeQA. No retraining. The Memory model didn't change at all.

Cody Yeah, that's the part that's actually interesting to me from a systems standpoint. The memory artifact is decoupled from the reasoning engine. So as frontier models improve, you just... plug in the new one. Your private data never has to leave your infrastructure to get fine-tuned into a closed model.

Justy Which is the enterprise pitch right there. Regulated industries, legal, healthcare — anywhere you can't just send your docs off to a third-party API for fine-tuning.

Cody Sure. Though I want to be honest about the trade-offs, because they're real. Continual updates, like when new documents come in, use model merging instead of full retraining. They derive a task vector from the new data and mathematically merge it into the existing Memory model weights. It's much cheaper. But the paper says you take an 11 to 19 percent accuracy hit compared to a full retrain.
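The merging idea can be illustrated with plain task arithmetic. This sketch assumes the simplest form (task vector = fine-tuned weights minus base weights, added into the current weights with a scale factor); the episode does not say which merging method the paper uses, and `task_vector`, `merge`, and `scale` are hypothetical names. Weights are modeled as small lists of floats instead of real tensors.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Per-parameter difference between a model tuned on new data and its base."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge(current: Weights, delta: Weights, scale: float = 1.0) -> Weights:
    """Add a scaled task vector into the current Memory model weights."""
    return {
        name: [c + scale * d for c, d in zip(current[name], delta[name])]
        for name in current
    }


# Toy numbers chosen so the arithmetic is exact in floating point.
base = {"w": [1.0, 2.0]}            # shared starting checkpoint
tuned_on_new = {"w": [1.5, 1.0]}    # base fine-tuned only on the new documents
current = {"w": [2.0, 2.0]}         # Memory model already holding older knowledge

delta = task_vector(tuned_on_new, base)
merged_full = merge(current, delta)             # scale = 1.0
merged_half = merge(current, delta, scale=0.5)  # gentler update
print(delta, merged_full, merged_half)
```

Only the small fine-tune on the new documents is trained here; the existing Memory model is updated by arithmetic alone, which is where the cost saving comes from.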

Justy Hm. Is that a dealbreaker though? For most enterprise use cases, you're not doing perfect retrieval anyway. An 11 percent hit off a baseline that already more than doubled HippoRAG2 is still probably fine.
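Justy's back-of-the-envelope claim can be checked with the two NarrativeQA figures quoted earlier. The sketch below assumes the 11 to 19 percent hit applies to the 53.58% result, which the conversation does not state, and it tries both readings of "percent" (relative drop and percentage points) since the episode does not say which one the paper means.

```python
MEMO_SCORE = 53.58       # MeMo with Gemini 3 Flash on NarrativeQA, as quoted
HIPPORAG2_SCORE = 23.21  # HippoRAG2 on the same benchmark, as quoted


def after_relative_hit(score: float, hit_percent: float) -> float:
    """Score after losing hit_percent of its own value."""
    return score * (1 - hit_percent / 100)


def after_point_hit(score: float, points: float) -> float:
    """Score after losing a fixed number of percentage points."""
    return score - points


for hit in (11, 19):
    relative = after_relative_hit(MEMO_SCORE, hit)
    absolute = after_point_hit(MEMO_SCORE, hit)
    print(f"hit={hit}: relative -> {relative:.2f}, points -> {absolute:.2f}")

print(f"ratio to HippoRAG2: {MEMO_SCORE / HIPPORAG2_SCORE:.2f}x")
```

Under either reading, and under the stated assumption, the worst case (about 34.58) still sits above HippoRAG2's 23.21.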

Cody Fair. My other flag is the upfront cost. Generating the reflections, which means converting your whole corpus into QA pairs with a 32-billion-parameter Generator model, is not free. For a huge knowledge base that's a meaningful compute bill before you've even trained the Memory model.

Justy One-time cost though, presumably. And then updates are cheap.

Cody Presumably. I'd want to see how the reflection generation scales on a corpus that's actually enterprise-sized — like, hundreds of thousands of documents, not research benchmark scale. But yeah, the architecture is genuinely novel. I'm not just being a pessimist here.

Justy You said that so defensively.

Cody I'm growing.

Justy Okay, this was episode 443 of Exploring Next, which, how are we at 443. That's unhinged. Get some rest, Cody, and I'll talk to you soon.

Cody Drive safe this time.