Exploring Next

Exploring Next — Ep 251 w/ Justy & Cody — SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks

Gabriel Orlanski and team at UW-Madison just dropped SlopCodeBench, the first benchmark that measures what happens when coding agents have to keep extending their own messy code. Turns out every model tested fails at long-horizon software development, with code quality degrading so badly that further extensions become impossible. This isn't about whether agents can solve coding problems; it's about whether they can build software that doesn't collapse under its own weight.
