Ep 206 api 1:36 w/ Justy & Cody

Towards a Science of AI Agent Reliability

Title: Towards a Science of AI Agent Reliability
Authors: Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, Arvind Narayanan
Abstract: AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still continue to fail in practice.

Voice: OpenAI TTS

Transcript

Izzo So here’s one that’s been making the rounds: a paper called “Towards a Science of AI Agent Reliability.”

Izzo You’re listening to Exploring Next. I’m Izzo, and Boone’s here. Let’s get into it.

Boone Yeah, this caught my attention because, while rising accuracy scores on standard benchmarks suggest rapid progress, many agents still fail in practice.

Izzo From a product standpoint, the interesting question is who actually ships with this. The authors ground their approach in safety-critical engineering and propose twelve concrete metrics that break agent reliability down along four dimensions: consistency, robustness, predictability, and safety.

Boone Right, and technically, they evaluate 14 agentic models across two complementary benchmarks and find that recent capability gains have yielded only small improvements in reliability.
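
Show note: to make the gap between accuracy and reliability concrete, here is a minimal Python sketch. It is not one of the paper's twelve metrics, and the task names and outcomes are made up for illustration. It assumes you have run the same agent several times per task and recorded pass or fail each time, then compares average accuracy with the share of tasks that pass on every run.

```python
# Hypothetical data: task id -> pass/fail on each of 4 repeated runs.
# These values are invented for illustration, not taken from the paper.
runs = {
    "task_a": [True, True, True, True],
    "task_b": [True, False, True, True],
    "task_c": [True, True, False, True],
    "task_d": [False, False, False, False],
}


def accuracy(runs):
    """Average success rate across tasks, averaging over repeated runs."""
    per_task = [sum(outcomes) / len(outcomes) for outcomes in runs.values()]
    return sum(per_task) / len(per_task)


def all_runs_pass_rate(runs):
    """Share of tasks that succeed on every repeated run.

    A strict, illustrative view of consistency (an assumption of this
    sketch, not a definition from the paper).
    """
    return sum(all(outcomes) for outcomes in runs.values()) / len(runs)


print(f"accuracy:           {accuracy(runs):.3f}")
print(f"all-runs pass rate: {all_runs_pass_rate(runs):.3f}")
```

With this toy data, the agent looks decent on average accuracy but passes only one task in four every single time, which is the kind of gap a single benchmark score can hide.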

Izzo Okay so what should people actually go try? The original source is a good starting point: https://arxiv.org/abs/2602.16666

Boone Definitely read that first. And if you want to go deeper, look into related tools in the same space — build something small and see where it breaks.

Izzo Good call. That’s the episode — we’ll catch you on the next one.