Embarrassingly Simple Self-Distillation Improves Code Generation
Apple researchers developed Simple Self-Distillation (SSD), a technique that improves code generation models by fine-tuning them on their own raw outputs—no verification needed. The method improved Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench by reshaping token distributions to balance precision and exploration in code generation.
Script: Sonnet 4.5 Voice: Google TTS
Transcript
Izzo What if your coding model could get better just by learning from its own mistakes?
Izzo You're listening to Exploring Next, episode two-sixty-four. I'm Izzo, and with me is Boone. Today we're diving into Apple's Simple Self-Distillation paper—a technique that's embarrassingly simple but surprisingly powerful.
Boone And when Apple says 'embarrassingly simple,' they mean it. No teacher models, no verification, no RL. Just sample from your model and fine-tune on those raw outputs.
Izzo Which sounds almost too good to be true. But they took Qwen3-30B from 42% to 55% pass@1 on LiveCodeBench. That's a thirteen-point jump from doing basically nothing.
Boone Right, and it works across five different models—Llama, Qwen, different sizes. The gains are real.
Izzo So Boone, walk me through what they're actually doing here. Because sampling your own outputs and training on them feels like it should just reinforce whatever the model was already doing wrong.
Boone That's the intuition, but they're being clever about temperature settings. When they generate the training data, they sample at one temperature, call it T-train. Then at inference time, they decode at a different temperature, T-eval.
Izzo And that temperature shift is doing the heavy lifting?
Boone Exactly. They sample training data at higher temperatures to get more diversity, then deploy at lower temperatures for more focused outputs. But here's the key insight—they identify what they call the precision-exploration conflict.
Izzo Break that down for me.
Boone Think about code generation. You've got fork positions where multiple approaches are valid—like choosing between a hash map or a tree structure. Then you've got lock positions where syntax is rigid—you can't just make up function names.
Izzo Ah, so you want high temperature at the forks for creativity, but low temperature at the locks for correctness.
Boone Right! But with traditional decoding, you pick one global temperature. It's always a compromise. SSD reshapes the model's distributions in a context-dependent way.
Izzo That's actually brilliant. Instead of fighting the temperature knob at inference time, you're baking that intelligence into the model weights.
Boone And the mechanism is what they call support compression and within-support reshaping. At lock positions, it suppresses low-probability distractors. At fork positions, it preserves useful diversity.
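To make the fork and lock picture concrete, here is a minimal sketch of what a single global temperature does to two kinds of next-token distributions. The logits are invented for illustration and the helper names are hypothetical. It shows only the trade-off Boone describes, not SSD itself.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means more spread-out sampling."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical logits, made up for this example.
lock_logits = [6.0, 1.0, 0.5, 0.0]   # one correct token plus low-probability distractors
fork_logits = [2.0, 1.9, 1.8, -3.0]  # three plausible continuations plus one distractor

for t in (0.5, 1.0, 1.5):
    lock = softmax_with_temperature(lock_logits, t)
    fork = softmax_with_temperature(fork_logits, t)
    print(f"T={t}: lock distractor mass={1 - lock[0]:.4f}, "
          f"fork entropy={entropy(fork):.3f}")
```

Raising the temperature increases diversity at the fork, but it also moves probability onto the distractors at the lock. Lowering it does the reverse. That is the compromise a single global knob forces.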
Izzo From a product perspective, this is huge. Every company building coding assistants is hitting the same wall—how do you get better without massive human labeling or complex RL setups?
Boone The operational simplicity is the killer feature. You need prompts and compute. That's it. No execution environments, no test suites, no reward engineering.
Izzo And they only needed one sample per prompt. Not even multiple candidates. The dataset is ten thousand competitive programming problems from rStar-Coder.
Boone With minimal filtering. They just removed empty responses and single-line stubs. No correctness signal whatsoever.
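The data-building step described here can be sketched roughly as follows. The `sample` callable is a hypothetical stand-in for a call to the model at T-train. The exact definition of a "single-line stub" is an assumption, taken here as fewer than two non-blank lines.

```python
from typing import Callable, Dict, List

def is_usable(response: str) -> bool:
    """Keep a response unless it is empty or a single-line stub (assumed rule)."""
    non_blank = [line for line in response.splitlines() if line.strip()]
    return len(non_blank) > 1

def build_ssd_dataset(
    prompts: List[str], sample: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Draw one raw sample per prompt and keep it with no correctness check."""
    dataset = []
    for prompt in prompts:
        response = sample(prompt)  # hypothetical model call, one sample only
        if is_usable(response):
            dataset.append({"prompt": prompt, "response": response})
    return dataset

# Toy stand-in for a model so the sketch runs end to end.
canned = {
    "sum two ints": "def add(a, b):\n    return a + b",
    "reverse a string": "",   # empty response: dropped
    "is prime": "pass",       # single-line stub: dropped
}
data = build_ssd_dataset(list(canned), canned.__getitem__)
print(len(data), "of", len(canned), "responses kept")
```

Note that nothing in this loop runs the generated code or checks it against tests. The only filter is structural.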
Izzo Wait, let me make sure I understand the training setup. They're using standard supervised fine-tuning—just cross-entropy loss over the sampled sequences?
Boone Yep. Two thousand five hundred iterations for instruct models, three hundred for thinking models. Learning rate five times ten to the minus six. Nothing exotic.
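As a rough illustration of the objective Izzo names, the sketch below computes the mean cross-entropy over the tokens of one sampled response. The config key names are hypothetical and the token probabilities are made up. Only the learning rate and iteration counts come from the episode.

```python
import math

# Values quoted in the episode; the dictionary layout is an assumption.
SFT_CONFIG = {
    "learning_rate": 5e-6,
    "iterations": {"instruct": 2500, "thinking": 300},
}

def sequence_cross_entropy(token_probs):
    """Mean negative log-likelihood of the sampled tokens.

    token_probs[i] is the probability the model being trained assigns to
    the i-th token of the sampled response, given the prompt and the
    tokens before it.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Made-up per-token probabilities for two sampled responses.
confident = sequence_cross_entropy([0.9, 0.8, 0.95])
uncertain = sequence_cross_entropy([0.4, 0.3, 0.5])
print(f"confident={confident:.3f} uncertain={uncertain:.3f}")
```

Minimizing this loss pushes probability toward whatever tokens were sampled. That is why the sampling temperature used to build the dataset shapes what the fine-tuned model learns.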
Izzo This feels like one of those techniques that's going to propagate fast. The barrier to entry is so low.
Boone And the gains concentrate on harder problems, which is where teams actually need the help. Easy problems already work fine.
Izzo The coverage improvements are even more impressive. Hard problem pass@5 went from 31% to 54%. So it's not just getting one answer right—it's exploring the solution space better.
Boone That suggests the model is learning to maintain diversity where it matters while being precise where it counts. The theoretical analysis in their appendix backs this up.
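For readers unfamiliar with the pass@1 and pass@5 figures quoted above, here is the widely used unbiased pass@k estimator that was introduced alongside the HumanEval benchmark. The episode does not say how the paper computed its numbers, so treat this as background rather than the authors' exact procedure.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples per problem, c of which are correct.

    This is the probability that at least one of k samples drawn without
    replacement from the n generated samples is correct.
    """
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 samples generated, 3 of them pass the tests.
print(f"pass@1={pass_at_k(10, 3, 1):.3f} pass@5={pass_at_k(10, 3, 5):.3f}")
```

A benchmark score is the average of this quantity over all problems. A pass@5 that rises faster than pass@1 indicates the correct solutions are spread across more distinct samples.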
Izzo From a competitive standpoint, if I'm running a coding assistant product, this is probably shipping in the next quarter. The risk-reward is too good.
Boone Especially since it generalizes across model families. They tested Llama and Qwen, both instruct and thinking variants. It's not some weird architectural quirk.
Izzo The thinking models are interesting; those needed fewer training iterations but still saw gains. Makes sense if they're already better at reasoning through the problem space. And they validated on LiveCodeBench v6, which covers February to May 2025 problems. So it's not overfitting to some old benchmark. Boone, any concerns with this approach? It feels almost too clean.
Boone The main risk is probably mode collapse if you iterate too many times. They only did one round of SSD, but i