Andrej Karpathy's new open source 'autoresearch' lets you run hundreds of AI experiments a night — with revolutionary implications
Andrej Karpathy released autoresearch, a 630-line open source script that runs autonomous AI experiments overnight. The system creates an optimization loop where agents modify their own code, test hypotheses, and keep improvements—completing hundreds of experiments while humans sleep. Early adopters distributed the approach across networks and applied it beyond ML to marketing, suggesting a fundamental shift toward automated scientific discovery.
Script: Sonnet 4.5 Voice: ElevenLabs
Transcript
Izzo Researchers just figured out how to make AI do science while they sleep.
Izzo You're listening to Exploring Next, episode two-thirty-nine. I'm Izzo, and I'm here with Boone to talk about something that might fundamentally change how research gets done.
Boone Andrej Karpathy dropped a 630-line script called autoresearch, and it's basically the scientific method running in a loop.
Izzo Right, but here's why this matters right now — we're all drowning in hyperparameter tuning. Every ML team I know spends weeks tweaking learning rates and architecture depths manually.
Boone And Karpathy just automated the entire workflow. The agent reads its own source code, forms a hypothesis like 'what if I bump the learning rate,' modifies the code, runs the experiment, checks if validation loss improved.
Izzo If it's better, keep it. If not, revert and try something else. Boone, break down how this optimization loop actually works.
Boone So you give the agent a training script and a fixed compute budget — typically five minutes on a GPU. It's constrained, which is brilliant because it forces fast iterations.
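A minimal sketch of the keep-or-revert loop described above. This is not Karpathy's script: `run_experiment`, `propose_change`, and `keep_or_revert_loop` are hypothetical names, the loss surface is a toy, and the five-minute budget is carried as a parameter that the toy ignores.

```python
import random

def run_experiment(config, budget_seconds=300):
    """Hypothetical stand-in for a budgeted training run; returns validation loss.

    The toy surface below has its minimum at lr=0.01, depth=8 and ignores
    budget_seconds, which is kept only to mirror the fixed five-minute budget.
    """
    return 0.9 + (config["lr"] - 0.01) ** 2 * 1e4 + (config["depth"] - 8) ** 2 * 0.01

def propose_change(config, rng):
    """Hypothetical stand-in for the agent's hypothesis step: tweak one setting."""
    candidate = dict(config)
    if rng.random() < 0.5:
        candidate["lr"] = config["lr"] * rng.choice([0.5, 0.8, 1.25, 2.0])
    else:
        candidate["depth"] = max(1, config["depth"] + rng.choice([-1, 1]))
    return candidate

def keep_or_revert_loop(config, n_experiments, seed=0):
    rng = random.Random(seed)
    best_loss = run_experiment(config)
    history = [best_loss]
    for _ in range(n_experiments):
        candidate = propose_change(config, rng)
        loss = run_experiment(candidate)
        if loss < best_loss:
            # Improvement: keep the change and build on it.
            config, best_loss = candidate, loss
        # Otherwise revert, which here just means discarding the candidate.
        history.append(best_loss)
    return config, best_loss, history

start = {"lr": 0.1, "depth": 4}
best_config, best_loss, history = keep_or_revert_loop(start, n_experiments=126)
print(best_config, round(best_loss, 4))
```

Because a change is kept only when validation loss drops, the best-so-far loss can never get worse from one experiment to the next; the short budget per run is what makes many such steps fit into one night.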
Izzo That's the key insight I'm seeing. It's not about having infinite compute, it's about velocity.
Boone Exactly. In one overnight run, Karpathy's agent completed 126 experiments, driving loss from 0.9979 down to 0.9697. That's real progress happening at silicon speed.
Izzo And then he left it running for two days and it made 700 autonomous changes. It found twenty improvements that transferred perfectly to larger models.
Boone The transfer part is huge, Izzo. Usually when you optimize for one model size, the gains don't carry over. But these agents found genuinely general improvements.
Izzo The business impact hit me when I saw that the 'Time to GPT-2' metric dropped from 2.02 hours to 1.80 hours. That's an eleven percent efficiency gain on something Karpathy thought was already optimized.
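For reference, the "eleven percent" figure is the relative reduction in training time, rounded. The same calculation applied to the overnight loss numbers quoted earlier gives a much smaller percentage. The helper name `relative_reduction` is just for illustration.

```python
def relative_reduction(before, after):
    """Fraction by which a quantity shrank, relative to its starting value."""
    return (before - after) / before

time_gain = relative_reduction(2.02, 1.80)      # 'Time to GPT-2', in hours
loss_gain = relative_reduction(0.9979, 0.9697)  # overnight validation loss

print(f"{time_gain:.1%}")  # 10.9%
print(f"{loss_gain:.1%}")  # 2.8%
```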
Boone Right, and he mentioned the agent caught oversights in attention scaling that he'd missed manually over twenty years of work. That's humbling.
Izzo But here's where it gets wild — people immediately started distributing this across networks. Tell me about what happened with Hyperspace.
Boone Varun Mathur took the single-agent loop and spread it across a peer-to-peer network. Thirty-five autonomous agents ran 333 experiments in one night, completely unsupervised.
Izzo And the emergent behavior was fascinating. Hardware diversity became a feature, not a bug.
Boone Yeah, the H100 GPUs were doing brute force approaches with aggressive learning rates, but the CPU-only agents on laptops had to get clever. They focused on initialization strategies — Kaiming, Xavier init — because they couldn't rely on raw throughput.
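For listeners who want the two schemes side by side, here is a small sketch of the standard formulas for the normal-distribution variants: Xavier (Glorot) uses a standard deviation of sqrt(2 / (fan_in + fan_out)), and Kaiming (He) uses sqrt(2 / fan_in), which is aimed at ReLU-style layers. The layer size of 256 is an assumption for illustration; the transcript does not say what the agents' models looked like.

```python
import math
import random
import statistics

def xavier_std(fan_in, fan_out):
    """Xavier/Glorot normal init: std = sqrt(2 / (fan_in + fan_out))."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def kaiming_std(fan_in):
    """Kaiming/He normal init (ReLU gain): std = sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)

def init_weights(fan_in, fan_out, std, rng):
    """Draw a fan_out x fan_in weight matrix from N(0, std^2)."""
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)
fan_in, fan_out = 256, 256  # illustrative layer size, not from the source
weights = init_weights(fan_in, fan_out, kaiming_std(fan_in), rng)
sample_std = statistics.pstdev(w for row in weights for w in row)

print(f"xavier std:  {xavier_std(fan_in, fan_out):.4f}")
print(f"kaiming std: {kaiming_std(fan_in):.4f}")
print(f"sampled std: {sample_std:.4f}")
```

For a square layer the Kaiming standard deviation is sqrt(2) times the Xavier one, so the two schemes start training from noticeably different weight scales at no extra compute cost.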
Izzo That's actually brilliant product strategy. The constraints forced innovation.
Boone And they used the GossipSub protocol for real-time sharing. When one agent discovered Kaiming initialization dropped loss by twenty-one percent, it spread through the network like a virus.
Izzo Within hours, twenty-three other agents incorporated that discovery. It's like watching evolution in fast-forward.
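To give a feel for how quickly a single discovery can spread, here is a toy push-gossip simulation. It is not GossipSub and does not use libp2p; the fanout of 3, the baseline loss of 1.0, and the `gossip_round` helper are assumptions for illustration. Only the peer count of thirty-five and the twenty-one percent drop come from the conversation.

```python
import random

def gossip_round(peers, fanout, rng):
    """Each peer pushes its current best result to `fanout` random other peers.

    Receivers adopt a result only if it beats what they already have.
    A snapshot is used so every peer sends what it knew at the round's start.
    """
    snapshot = [dict(p) for p in peers]
    for i, sender in enumerate(snapshot):
        others = [j for j in range(len(peers)) if j != i]
        for j in rng.sample(others, fanout):
            if sender["loss"] < peers[j]["loss"]:
                peers[j] = dict(sender)

rng = random.Random(0)
peers = [{"loss": 1.0, "init": "default"} for _ in range(35)]
peers[0] = {"loss": 0.79, "init": "kaiming"}  # one agent finds a 21% lower loss

rounds = 0
while any(p["init"] != "kaiming" for p in peers) and rounds < 100:
    gossip_round(peers, fanout=3, rng=rng)
    rounds += 1

print(f"all {len(peers)} peers adopted the discovery after {rounds} rounds")
```

In this kind of scheme the number of informed peers grows roughly geometrically per round, which is why a small network tends to converge in a handful of rounds rather than dozens.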
Boone The timeline compression is mind-bending. In seventeen hours, these agents independently rediscovered ML milestones that took human researchers at Google Brain and OpenAI eight years to formalize.
Izzo RMSNorm, tied embeddings — all the classics, just happening overnight. But Boone, this isn't staying in ML land.
Boone Eric Siu applied it to marketing experiments. Instead of the typical thirty experiments a year, he's talking about thirty-six thousand.
Izzo The framework is identical — replace the training script with a marketing asset, modify variables like subject lines, measure positive reply rate, keep or discard.
Boone It creates what he calls a 'proprietary map' of what resonates with your audience. The companies that win won't have better marketers, they'll have faster experiment loops.
Izzo I'm giving this concept an A-minus. The technical execution is elegant, the applications are immediate, and the network effects are already happening.
Boone The community raised valid concerns though. Someone asked about 'spoiling' the validation set — if you run enough experiments, are you just optimizing for quirks in your test data?
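The concern can be illustrated with a toy simulation, under the deliberately pessimistic assumption that none of the candidate changes has any real effect and measured loss is just the true loss plus noise. The noise level and the helper `best_of_n` are hypothetical; the count of 700 echoes the two-day run mentioned earlier. This only illustrates the question being raised and says nothing about whether the actual autoresearch gains were affected.

```python
import random
import statistics

def best_of_n(n_experiments, noise_std, rng, true_loss=1.0):
    """Pick the lowest validation loss among n no-effect experiments.

    Returns (winner's validation loss, winner's loss on fresh held-out data).
    """
    val = [true_loss + rng.gauss(0.0, noise_std) for _ in range(n_experiments)]
    winner_val = min(val)
    # The winner has no real advantage, so a fresh measurement is just noise
    # around the true loss again.
    winner_test = true_loss + rng.gauss(0.0, noise_std)
    return winner_val, winner_test

rng = random.Random(0)
trials = [best_of_n(700, noise_std=0.01, rng=rng) for _ in range(200)]
mean_val = statistics.mean(v for v, _ in trials)
mean_test = statistics.mean(t for _, t in trials)

print(f"mean winning validation loss: {mean_val:.3f}")
print(f"mean loss on fresh data:      {mean_test:.3f}")
```

The winning validation score looks better than the same configuration does on fresh data, purely because it was selected as the minimum of many noisy measurements. A separate held-out set that the loop never sees is the usual way to check for this.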
Izzo That's the classic overfitting problem scaled up. But Karpathy's response was solid — all we're doing is optimizing performance per compute, and these are real gains. Plus there's something beautiful about one user's insight — their Mac Mini M4 ran thirty-five experiments overnight, twenty-six failed, but the seven that worked revealed that the model got better by getting simpler. That discovery happened without human intervention. The agent found that less is more. I'm defin