Ep 460 | Research Paper | June 4, 2026 | 4:17 | w/ Justy & Cody

MemTrain: Self Supervised Context Memory Training

Self-supervised framework MemTrain improves LLM context memory by training on unlabeled Wikipedia with coupled proxy tasks—masked reconstruction and memory recall—using GRPO. Achieves up to 17.67-point gains on long-horizon reasoning without task-specific labels.

Script: Mistral Medium 3.5 128B | Voice: Inworld TTS 1.5 Max

Transcript

Justy Okay, I take back every nice thing I’ve ever said about my phone’s memory…

Cody Here we go.

Justy No, no — hear me out. I was trying to book a trip yesterday, right? And the chatbot kept forgetting which hotel I’d picked three messages ago. It’s like talking to someone who just blinked and reset.

Cody That’s not memory, that’s a stateless prompt window.

Justy Exactly! And that’s the problem MemTrain’s going after. Long-horizon agents that can actually hold a thought.

Cody Right. So the paper’s framing it as the difference between cramming the whole conversation into the prompt — which explodes in cost — versus teaching the model to keep a little compressed notebook of what matters.

Justy And that notebook’s the memory state. Memory t-minus-one gets fed in with the new input, and the model writes memory t…
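
A minimal sketch of the update loop described here, assuming the memory state is plain text and the model is any function from (previous memory, new input) to new memory. The names `run_memory_rounds` and `toy_update` are hypothetical, and `toy_update` is a word-budget stand-in for the LLM, not the paper's method.

```python
from typing import Callable, List

# Hypothetical signature: (previous memory, new chunk) -> updated memory.
MemoryUpdater = Callable[[str, str], str]


def run_memory_rounds(
    chunks: List[str], update: MemoryUpdater, initial_memory: str = ""
) -> List[str]:
    """Feed chunks one at a time and return the memory state after each round."""
    memory = initial_memory
    states = []
    for chunk in chunks:
        # memory_{t-1} plus input_t produces memory_t
        memory = update(memory, chunk)
        states.append(memory)
    return states


def toy_update(memory: str, chunk: str, budget: int = 8) -> str:
    """Stand-in for the LLM: keep only the most recent `budget` words."""
    words = (memory + " " + chunk).split()
    return " ".join(words[-budget:])


if __name__ == "__main__":
    for state in run_memory_rounds(["a b c", "d e f g h i", "j"], toy_update):
        print(repr(state))
```

The point of the shape is that each round sees only a bounded memory plus the new input, so the per-round prompt stays roughly constant instead of growing with the whole conversation.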

Cody Mm-hm.

Justy But the kicker is, until now, training that notebook required labeled data and RL. Which is why it’s all domain-specific and brittle.

Cody Yeah, and labeled long-horizon memory tasks are a nightmare to collect. You need trajectories where the model has to remember something from turn seventeen to turn forty-two, and then you need a human to verify that it did.

Justy Which no one wants to pay for.

Cody Exactly. So MemTrain sidesteps that by using self-supervised tasks on unlabeled Wikipedia.

Justy Okay, I’m listening. How?

Cody Two coupled objectives. First, masked reconstruction: they hide an entity in the text, run the agent through multiple memory-updating rounds, then make it recover the masked entity from the final memory state. Forces the model to keep information that’ll matter later.
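
One way to picture how a masked-reconstruction sample could be built from unlabeled text, as a rough sketch. The episode does not describe the paper's exact masking or scoring, so the `[MASK]` token, the `MaskedSample` layout, the choice to hold out one sentence as the question, and the exact-match reward are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

MASK = "[MASK]"  # hypothetical placeholder token


@dataclass
class MaskedSample:
    chunks: List[str]  # text fed to the agent round by round
    question: str      # sentence with the entity hidden
    answer: str        # the hidden entity


def build_sample(sentences: List[str], target_idx: int, entity: str) -> MaskedSample:
    """Hide `entity` in one sentence; the remaining sentences become the chunks."""
    if entity not in sentences[target_idx]:
        raise ValueError("entity must appear in the target sentence")
    question = sentences[target_idx].replace(entity, MASK)
    chunks = [s for i, s in enumerate(sentences) if i != target_idx]
    return MaskedSample(chunks=chunks, question=question, answer=entity)


def reconstruction_reward(prediction: str, answer: str) -> float:
    """Assumed scoring rule: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if prediction.strip().lower() == answer.strip().lower() else 0.0
```

The label comes for free from the text itself: whatever was hidden is the answer, and the agent only gets credit if its final memory state still carries enough to recover it.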

Justy And the second?

Cody Intermediate memory recall. Same setup, but now the model has to reconstruct masked historical info using the memory state from partway through the interaction. So it’s not just about the end result; it’s about faithful compression at every step.
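
A sketch of how a process-focused score like this might be computed, assuming each probe names the round whose memory state it should be answered from. The probe format, the `Recall` callable, and averaging exact matches are assumptions for illustration, not details taken from the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical signature: (memory state, masked question) -> predicted entity.
Recall = Callable[[str, str], str]

# Hypothetical probe layout: (round index, masked question, expected answer).
Probe = Tuple[int, str, str]


def intermediate_recall_reward(
    memory_states: List[str], probes: List[Probe], recall: Recall
) -> float:
    """Average exact-match score, answering each probe from one intermediate state."""
    if not probes:
        return 0.0
    hits = 0
    for round_idx, question, answer in probes:
        prediction = recall(memory_states[round_idx], question)
        hits += int(prediction.strip().lower() == answer.strip().lower())
    return hits / len(probes)
```

Because every round's memory can be probed, a model that drops a fact at round three loses reward for that round even if it happens to get the final answer right.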

Justy So one’s outcome-focused, the other’s process-focused. Clever.

Cody And they jointly optimize both with GRPO, which I assume stands for… some flavor of policy optimization. The paper doesn’t spell it out, but the results speak for themselves.
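
For reference, GRPO is usually expanded as Group Relative Policy Optimization: several outputs are sampled for the same input, and each one's reward is normalized against its own group rather than against a learned value model. The sketch below shows only that normalization step. The `combined_reward` weighting is a hypothetical way to mix the two proxy rewards, since the episode does not say how the paper balances them.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(
    final_reward: float, recall_reward: float, weight: float = 0.5
) -> float:
    """Hypothetical mix of the two proxy rewards; the weighting is an assumption."""
    return weight * final_reward + (1.0 - weight) * recall_reward


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this detail
    # eps keeps the division defined when every sample in the group scores the same
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Samples that beat their group's average get a positive advantage and are reinforced; samples below it are pushed down. When every sample in a group scores the same, all advantages are zero and that group contributes no learning signal.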

Justy Up to seventeen point six seven points of gain on long-text QA. That’s… not nothing.

Cody Yeah, and it’s model-agnostic. They tested it across a few different LLMs, and the memory improvements transferred to downstream tasks without task-specific fine-tuning.

Justy So who ships this? I’m thinking any agent stack that’s doing multi-turn workflows — customer support, research assistants, even that terrible travel bot from yesterday.

Cody Well, the code’s not linked in the paper, so for now it’s research-only. But the approach is reproducible if you’ve got the compute for the self-supervised pretraining.

Justy Which, knowing you, you’re already calculating how many GPUs that’d take.

Cody I was not.

Justy Sure. Anyway — the trade-off here’s the cost of the proxy tasks, right? You’re training on Wikipedia, which is clean, but real-world interactions are messier.

Cody That’s my one push-back, yeah. The masked objectives are a proxy for memory needs. Wikipedia’s not interactive, so the ‘memory’ they’re training on might not map perfectly to, say, a user changing their mind halfway through a conversation.

Justy But it’s still a step forward from ‘here’s a labeled dataset of ten memory-heavy tasks, good luck generalizing.’

Cody No argument. And the GRPO optimization’s smart — balancing the two objectives so you’re not just overfitting to one.

Justy I do love that they’re using unlabeled data. Feels like the only scalable way to get memory right.

Cody Yeah, and the fact that it’s self-supervised means you could, in theory, keep throwing more unlabeled text at it to improve.

Justy At which point Cody starts sweating about the carbon footprint of training runs…

Cody Oh, come on. It’s a valid concern.

Justy It’s also a very take.

Justy Anyway. No code link, so no Build Next this time. But man, if this works in production, it’s the kind of thing that makes agents feel… I dunno. Less like a chatbot, more like a colleague.

Cody Or at least like a colleague who doesn’t forget your coffee order.

Justy I’ll take it. Safe travels back to D.C., and try not to overthink the GPU math on the flight.