Autoresearch
Karpathy's autoresearch lets AI agents autonomously experiment on machine learning models overnight — modifying code, training for 5 minutes, evaluating results, and iterating while you sleep. We dive into how it works, the clever design constraints, and why this might be the beginning of fully autonomous AI research.
Script: Sonnet 4.5
Voice: ElevenLabs
Transcript
Izzo What if you could go to sleep and wake up to an AI that spent the night making your machine learning model better?
Izzo Welcome back to Exploring Next, episode two-thirty-eight. I'm Izzo, and with me as always is Boone. Today we're diving into something that feels straight out of science fiction but is very much shipping code.
Boone Karpathy just dropped autoresearch on GitHub, and honestly? This might be the most elegant approach to autonomous AI research I've seen.
Izzo Right, so here's the hook — you point an AI agent at a machine learning training setup, tell it to experiment overnight, and wake up to potentially dozens of model improvements. No human in the loop.
Boone And the beauty is in the constraints. Fixed five-minute training runs, single file modifications, one simple metric to optimize.
Izzo Okay but Boone, break this down for me. What's actually happening under the hood when this thing runs?
Boone Three core files. First, prepare.py handles data and utilities; that's fixed, and the agent never touches it. Then train.py contains your entire GPT model, optimizer, and training loop, and this is the file that gets modified. And finally program.md, which is basically the agent's instruction manual.
Izzo So the agent is literally rewriting the model architecture between experiments?
Boone Exactly. It might change the depth from eight layers to six, swap out the optimizer from Muon to pure AdamW, adjust batch sizes, even modify the attention patterns. Everything's fair game.
Izzo That's wild. But how do you compare experiments if the agent is changing fundamental architecture decisions?
Boone This is where the five-minute time budget is genius. Instead of training for a fixed number of epochs, everything runs for exactly five minutes wall-clock time. Doesn't matter if you're training a massive model or a tiny one — same time, same comparison basis.
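The fixed time budget Boone describes can be sketched as a loop that stops on elapsed wall-clock time rather than on a step or epoch count. This is a minimal illustration, not the code from the autoresearch repo: the function name `train_with_time_budget`, the `step_fn` callback, and the injectable `clock` parameter are all hypothetical names chosen for the example.

```python
import time


def train_with_time_budget(step_fn, budget_seconds, clock=time.monotonic):
    """Call step_fn repeatedly until the wall-clock budget is spent.

    step_fn stands in for one optimizer step (hypothetical); clock is
    injectable so the loop can be exercised without real waiting.
    Returns the number of completed steps.
    """
    start = clock()
    steps = 0
    # The budget is checked before each step, so a step that has already
    # started is allowed to finish even if it runs slightly over.
    while clock() - start < budget_seconds:
        step_fn()
        steps += 1
    return steps
```

Under this scheme a small model simply completes more steps than a large one in the same five minutes (`budget_seconds=300`), which is what makes runs with different architectures comparable.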
Izzo Ah, so it's optimizing for the best model your hardware can produce in five minutes, not the best model, period.
Boone Right. And the metric is validation bits per byte — lower is better, and it's vocabulary-size independent. So if the agent decides to experiment with different tokenizer sizes, the comparisons still make sense.
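To see why bits per byte does not depend on vocabulary size, it helps to write the conversion out. The sketch below assumes the evaluation loss is available as a sum of per-token cross-entropy values in nats (natural log, which is what PyTorch's cross-entropy returns) and that the byte count of the evaluated text is known. The function name `bits_per_byte` is hypothetical, and how the repo itself computes the metric is not shown in this episode.

```python
import math


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert summed cross-entropy (in nats) over a text to bits per byte.

    Dividing by ln(2) converts nats to bits; dividing by the byte count
    normalizes by the raw text length instead of the token count.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


# Illustrative numbers only: the same 1000-byte text under two tokenizers.
# A coarse tokenizer emits 250 tokens at 2.0 nats each; a finer one emits
# 500 tokens at 1.0 nats each. Per-token loss differs by 2x, but the
# total is 500 nats either way, so bits per byte is identical.
coarse = bits_per_byte(250 * 2.0, 1000)
fine = bits_per_byte(500 * 1.0, 1000)
```

Per-token loss would make the finer tokenizer look twice as good here, while bits per byte correctly reports the two as equal.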
Izzo From a product perspective, this is fascinating because it's solving the researcher productivity problem. How many experiments can a human realistically run per day versus this thing?
Boone Karpathy estimates about twelve experiments per hour, so roughly a hundred while you sleep. Compare that to a human researcher who might manage two or three thoughtful experiments in a full day.
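The arithmetic behind that estimate is simple enough to check. The sketch below assumes eight hours of sleep and adds an optional per-run overhead parameter; both are assumptions for illustration, as is the function name `experiments_per_night`.

```python
def experiments_per_night(minutes_per_run, hours_asleep, overhead_minutes=0.0):
    """Estimate how many fixed-length runs fit into a night.

    overhead_minutes is a hypothetical allowance for the time the agent
    spends editing code and evaluating between training runs.
    """
    runs_per_hour = 60 / (minutes_per_run + overhead_minutes)
    return int(runs_per_hour * hours_asleep)


# Five-minute runs with no overhead: 12 per hour, 96 over eight hours.
no_overhead = experiments_per_night(5, 8)
# With one minute of overhead per run: 10 per hour, 80 over eight hours.
with_overhead = experiments_per_night(5, 8, overhead_minutes=1)
```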
Izzo The iteration speed alone is compelling, but I'm curious about the scope limitation. Why constrain the agent to just one file?
Boone Keeps the blast radius manageable. The agent can't accidentally break your data pipeline or mess with evaluation code. All the creative destruction happens in train.py, and you can easily diff what changed between experiments.
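Because every change lands in a single file, comparing two experiments comes down to a text diff of train.py. Here is a minimal sketch using Python's standard `difflib` module; the helper name `diff_train_versions` and the `n_layer` and `optimizer` lines are made up for illustration and are not taken from the repo.

```python
import difflib


def diff_train_versions(before, after):
    """Return a unified diff between two versions of train.py source text."""
    lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile="train.py (before)",
        tofile="train.py (after)",
    )
    return "".join(lines)


# Hypothetical file contents: the agent reduced depth from 8 layers to 6.
before = "n_layer = 8\noptimizer = 'muon'\n"
after = "n_layer = 6\noptimizer = 'muon'\n"
print(diff_train_versions(before, after))
```

In practice the same result comes from running a version-control diff on train.py between two experiments; the point is that one file means one diff to read.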
Izzo Smart. It's like giving the agent a sandbox where it can't break anything critical but still has room to be genuinely creative.
Boone And here's what I love — it's deliberately minimal. Single GPU, no distributed training, no complex configs. The whole thing runs on one H100 with PyTorch and a few small packages.
Izzo Though I'm seeing forks already popping up for macOS, Windows, even AMD chips. The core idea seems to be resonating.
Boone Makes sense. Once people see autonomous research working, they want it on their platform. Though the H100 constraint isn't arbitrary — you need enough compute for meaningful five-minute experiments.
Izzo So who's the target user here? This feels like it's aimed at researchers who already know what they're doing.
Boone Definitely not beginners. You need to understand model architecture, hyperparameter trade-offs, and how to write effective prompts for the agent in program.md. But for experienced practitioners? This could be transformative.
Izzo The program.md file is interesting too — it's essentially programming the research methodology, not just the model.
Boone Exactly. You're encoding your research intuition into instructions for the agent. Over time, I bet people will develop increasingly sophisticated research strategies in those markdown files.
Izzo I'm giving this concept an A-minus. The execution is elegant, the constraints are thoughtful, and the potential impact is huge. My only hesitation is adoption — this requires serious ML chops.
Boone Fair, but remember — this is version one of what Karpathy calls 'how it all began.' The dystopian intro suggests he thinks this evolves into something much bigger.
Izzo Right, the whole 'autonomous swarms of AI agents' thing. Which is either exciting or terrifying depending on your perspective. Why not both? But seriously, if you want to experiment with this, here's what to try. First, clone the autoresearch repo and get the basic setup working. Run prepare.py to download data, then manually run train.py once to make sure everything works. Second, if you don't have an H100, check out the forks. The macOS MLX version looks particularly