FlashMemory DeepSeek V4: Lightning Index Ultra Long Context via Lookahead Sparse Attention
Researchers propose Lookahead Sparse Attention (LSA) with a Neural Memory Indexer to cut GPU memory usage for ultra-long LLM context. The indexer predicts in advance which KV cache chunks will matter, and it is trained independently, without loading the full backbone. FlashMemory-DeepSeek-V4 cuts the physical KV cache to 13.5% of baseline on average while maintaining or improving accuracy (+0.6% absolute) across LongBench-v2, LongMemEval, and RULER; at 500K tokens, it suppresses KV overhead by over 90%. The project is paused due to organizational changes, and the code is not yet public.
Script: Mistral Medium 3.5 128B
Voice: Murf.AI Gen2
Transcript
Justy Okay, this is the one that made me do a double-take this morning.
Cody Yeah?
Justy Some team—Tencent, HKUST, Tsinghua—just dropped a paper on slashing KV cache memory for long-context LLMs. Like, not by a little. By NINETY percent at half a million tokens.
Cody Of course you’d lead with the headline.
Justy I’m serious—this isn’t just ‘we tweaked the attention layers.’ They built this Lookahead Sparse Attention thing with a neural indexer that guesses which parts of the cache you’ll actually need next.
Cody Right…
Justy And they trained the indexer separately. No backbone in GPU memory. So the whole ‘we can’t fit this on a single node’ problem just… shrinks.
Cody Okay, slow down. First off, how do they guess?
Justy It’s a dual-encoder setup. The indexer learns to predict future context demands and only keeps the query-critical KV chunks loaded.
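Editor's sketch: the idea Justy describes, scoring cached chunks against a query and keeping only the best-scoring ones resident, can be illustrated with a toy ranking step. This is a minimal sketch, not the paper's implementation: the function names, the cosine scoring, the `keep_ratio` parameter, and the two-dimensional embeddings are all assumptions made for illustration. In a real dual-encoder, one encoder would produce the query embedding and the other the chunk embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_chunks(query_emb, chunk_embs, keep_ratio):
    """Return the sorted indices of the KV chunks to keep resident.

    Hypothetical helper: ranks chunks by similarity to the query embedding
    and keeps the top `keep_ratio` fraction (at least one chunk).
    """
    n_keep = max(1, math.ceil(keep_ratio * len(chunk_embs)))
    scores = [cosine(query_emb, c) for c in chunk_embs]
    ranked = sorted(range(len(chunk_embs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy data: four chunk embeddings and one query embedding (made up).
chunk_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query_emb = [0.9, 0.1]

kept = select_chunks(query_emb, chunk_embs, keep_ratio=0.5)
print(kept)  # indices of the two chunks most similar to the query
```

Chunks outside `kept` would be the ones left out of GPU memory; how the actual system offloads or discards them is not covered in this conversation.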
Cody Mm-hm.
Justy So instead of hauling the entire history in VRAM like some kind of digital hoarder, it’s just… pruning what won’t matter. And it’s not even a tiny gain: 13.5 percent of the baseline KV footprint on average across LongBench-v2, LongMemEval, RULER, all the usual suspects.
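Editor's sketch: to put the 13.5 percent figure in perspective, here is a back-of-the-envelope KV cache calculation. The model shape below (32 layers, 8 KV heads, head dimension 128, 16-bit values, 128K tokens) is entirely hypothetical and uses the textbook formula for standard grouped-query attention. It is not DeepSeek's actual cache layout, and the byte counts are not results from the paper.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Size of a plain KV cache: keys and values for every layer, head, and token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


# Hypothetical model shape, chosen only to make the arithmetic concrete.
baseline = kv_cache_bytes(tokens=128_000, layers=32, kv_heads=8, head_dim=128)

# 13.5 percent resident, computed with integers to avoid float rounding.
resident = baseline * 135 // 1000

gib = 1024 ** 3
print(f"baseline: {baseline / gib:.2f} GiB, resident: {resident / gib:.2f} GiB")
```

Under these assumed numbers, the baseline cache is a little over 15 GiB per sequence and the resident portion is a little over 2 GiB.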
Cody 13.5 percent is insane. But hold on—how do they avoid the indexer just hallucinating what’s important? If it misses, you lose recall.
Justy That’s the part I don’t fully get yet. But the paper says it actually acts like an attention denoiser and improves accuracy slightly on average: plus 0.6 percent absolute.
Cody Huh. And they’re not even loading the backbone during training?
Justy Exactly. Backbone-free, decoupled. Indexer trains on its own with standard retrieval frameworks.
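Editor's sketch: the conversation does not say which retrieval objective the indexer uses. One common choice for training dual-encoders is a contrastive (InfoNCE-style) loss, shown below purely as an assumption. The loss pulls a query embedding toward the chunk it should retrieve and pushes it away from the others, and nothing in it needs the backbone model's weights. All names and vectors here are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive, negatives, temperature=0.1):
    """Contrastive loss: negative log-softmax of the positive chunk's score."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# Toy embeddings (made up): the query points the same way as chunk_a.
query = [1.0, 0.0]
chunk_a, chunk_b, chunk_c = [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]

good_loss = info_nce(query, positive=chunk_a, negatives=[chunk_b, chunk_c])
bad_loss = info_nce(query, positive=chunk_b, negatives=[chunk_a, chunk_c])
print(good_loss < bad_loss)  # the aligned positive gives the smaller loss
```

Training would adjust the two encoders to drive this loss down over many query and chunk pairs, which is where Cody's later point about training data quality comes in.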
Cody That’s… clean. I like that. No monstrous fine-tune runs.
Justy I mean, imagine the serving costs. Cody, you’ve bitched about KV cache for a year.
Cody Yeah, yeah, I have. But let’s not pretend this solves everything.
Justy Here we go.
Cody First, decoupled training sounds great until you realize the indexer’s only as good as its training data. If the future context it’s predicting looks nothing like the retrieval corpus…
Justy Mm.
Cody …you’re flying blind. And second, 500K tokens is cool, but what’s the latency like? Predicting what to keep adds overhead. They didn’t even bench that.
Justy Fair. But they ran it on eight H20s with SGLang and the KV overhead just… vanished. And the accuracy didn’t tank.
Cody I’m not saying it’s snake oil. I’m saying the paper reads like they found a cheat code. And cheat codes usually have a catch.
Justy Like what?
Cody Like the project lead just left Tencent and the whole thing’s paused. No code. No checkpoints. Just a ‘hey, email me if you want to collaborate.’
Justy Oh. That’s… Exploring Next as hell, isn’t it?
Justy I mean, the numbers are real. 90 percent overhead suppression at scale. And the method’s elegant—indexer as a dual-encoder, train it like retrieval, plug it in.
Cody Sure. But until someone else replicates it with a public repo, it’s just a really impressive demo.
Justy You’re impossible.
Cody And you’re already shipping it in your head. ‘Justy’s Long Context Utopia, coming soon to a GPU near you.’
Justy Shut up. I just think if this works, it’s the first real crack in the memory wall. And the fact that they didn’t need to retrain the backbone…
Cody Yeah, that part’s solid.
Justy Damn right. Alright, I’m grabbing more coffee. This is going to be a long Tuesday.