Ep 251 · Research Paper · March 31, 2026 · 2:23 · w/ Justy & Cody

SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long Horizon Iterative Tasks

Gabriel Orlanski and team at UW-Madison just dropped SlopCodeBench — the first benchmark that measures what happens when coding agents have to keep extending their own messy code. Turns out every single model fails spectacularly at long-term software development, with code quality degrading so badly that extensions become impossible. This isn't about whether agents can solve coding problems — it's about whether they can build software that doesn't collapse under its own weight.


Script: Sonnet 4.5 · Voice: Google TTS

Transcript

Izzo Every coding agent demo shows the same thing — write a function, pass the tests, ship it. But what happens when you need to extend that code six months later?

Izzo You're listening to Exploring Next, episode two-fifty-one. I'm Izzo, and with me is Boone. Today we're diving into SlopCodeBench — research that just shattered every assumption about how good AI coding agents actually are.

Boone And Izzo, this isn't about whether agents can solve coding puzzles. It's about whether they can build software that doesn't turn into an unmaintainable nightmare.

Izzo Right. Because here's what I see in product land — everyone's excited about agents writing code, but nobody's asking what happens when you need to iterate on that code. Which is, you know, the entire software development process.

Boone Exactly. So Gabriel Orlanski's team at UW-Madison built the first benchmark that actually measures this. They call it SlopCodeBench, and it's brutal.

Izzo How brutal are we talking?

Boone Zero agents — across eleven different models — solved any complete problem end-to-end. The best checkpoint solve rate was seventeen-point-two percent.

Izzo Wait, zero? Not even GPT-4 or Claude?

Boone Nope. And it gets worse. They tracked code quality across iterations using two metrics: verbosity, which measures redundant and duplicated code, and structural erosion — basically how much complexity gets jammed into individual functions.

Izzo Boone, break down how this benchmark actually works. Because most coding tests are just 'write a function that sorts an array,' right?

Boone Right, and that's exactly the problem. SlopCodeBench gives agents twenty problems with ninety-three checkpoints total. But here's the key — it forces architectural decisions without prescribing internal structure.

Izzo Meaning what in practice?

Boone So instead of 'implement quicksort,' it's more like 'build a data processing system, now add filtering, now add aggregation, now add real-time updates.' Each extension builds on your previous code.
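
Note: the gap between "zero problems solved end-to-end" and a nonzero checkpoint solve rate is easier to see in code. The sketch below assumes each checkpoint is simply pass or fail, and that a problem counts as solved end-to-end only if every one of its checkpoints passes. The `CheckpointResult` type and both function names are hypothetical, not part of the SlopCodeBench tooling.

```python
from dataclasses import dataclass


@dataclass
class CheckpointResult:
    problem: str   # which problem the checkpoint belongs to
    index: int     # position of the checkpoint within that problem
    passed: bool   # whether the checkpoint's tests passed


def checkpoint_solve_rate(results: list[CheckpointResult]) -> float:
    """Share of individual checkpoints that passed, across all problems."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def problems_solved_end_to_end(results: list[CheckpointResult]) -> list[str]:
    """Problems for which every checkpoint passed (assumed definition)."""
    by_problem: dict[str, list[bool]] = {}
    for r in results:
        by_problem.setdefault(r.problem, []).append(r.passed)
    return [name for name, flags in by_problem.items() if all(flags)]
```

Under these assumptions an agent can pass a fair number of early checkpoints and still finish no problem, because a single failed extension anywhere in the chain removes that problem from the end-to-end count.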

Izzo Ah, so the agent has to live with its own architectural choices. That's... actually genius. Because in real product development, you can't just rewrite everything when requirements change.

Boone Exactly. And the results are devastating. Quality degrades in eighty percent of trajectories for structural erosion and almost ninety percent for verbosity.
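
Note: a figure like "quality degrades in eighty percent of trajectories" can be computed in several ways. The sketch below assumes the simplest one: a trajectory is a list of metric values, one per checkpoint, higher means worse, and a trajectory counts as degraded when its last value exceeds its first. The paper may define degradation differently, and the function name is hypothetical.

```python
def degradation_share(trajectories: list[list[float]]) -> float:
    """Fraction of trajectories whose final metric value is worse than the first.

    Assumes higher values mean lower quality. Trajectories with fewer than
    two checkpoints carry no trend information and are skipped.
    """
    usable = [t for t in trajectories if len(t) >= 2]
    if not usable:
        return 0.0
    worse = sum(1 for t in usable if t[-1] > t[0])
    return worse / len(usable)
```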

Izzo What does that degradation actually look like? Give me specifics.

Boone So they compared agent code against forty-eight open-source Python repositories. Agent code was two-point-two times more verbose — so more than double the redundant code.
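
Note: the transcript does not spell out how verbosity is calculated, so the following is only a rough proxy for "redundant and duplicated code". It measures the share of substantive lines that repeat an earlier line exactly. The function name and the `min_length` cutoff are assumptions for illustration.

```python
from collections import Counter


def duplicate_line_share(source: str, min_length: int = 10) -> float:
    """Share of substantive lines that are exact repeats of an earlier line.

    Lines are stripped of surrounding whitespace. Comment lines and lines
    shorter than `min_length` characters are ignored so that trivial lines
    such as `else:` or `return x` do not dominate the count.
    """
    lines = [line.strip() for line in source.splitlines()]
    lines = [
        line for line in lines
        if len(line) >= min_length and not line.startswith("#")
    ]
    if not lines:
        return 0.0
    counts = Counter(lines)
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(lines)
```

Dividing the value for agent-written code by the value for a reference repository gives a ratio in the spirit of the comparison described above, though the numbers would not be comparable to the paper's.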

Izzo Ouch.

Boone And when they tracked twenty of those real repositories over time, human code quality stayed flat while agent code deteriorated with each iteration. It's like technical debt on steroids.

Izzo This explains so much of what I'm seeing in the field. Teams get excited about AI-generated code, ship the first version, then three months later they're rewriting everything because it's unmaintainable.

Boone The paper calls it 'extension robustness' — how well code survives being extended. And current benchmarks completely miss this.

Izzo So who actually uses this research? Because it feels like the entire 'AI will replace developers' narrative just took a massive hit.

Boone I think it's the opposite — it shows us exactly what to work on. Instead of focusing on single-shot code generation, we need agents that understand software architecture and long-term maintainability.

Izzo Right. And from a product perspective, this creates a huge opportunity. The company that solves iterative code quality is going to own the enterprise AI coding market.

Boone They did try one intervention study — prompting agents to focus on initial code quality. It helped the first iteration but didn't stop the degradation.

Izzo So it's not just about writing better prompts. The fundamental approach needs to change.

Boone Exactly. We need agents that think about code evolution, not just immediate functionality. Maybe something that maintains architectural invariants across iterations.

Izzo I'm giving this research an A-minus. It's the first benchmark that measures what actually matters in software development, and the methodology looks solid.

Boone Agreed. And it's language-agnostic, so you can test this across Python, JavaScript, whatever you're working in.

Izzo So what should people actually go build with this?

Boone First, clone the SlopCodeBench repo and run your favorite coding agent through it. See how badly it fails at maintaining code quality over iterations. Second, if