Exploring Next

Exploring Next — Ep 464 w/ Justy & Cody — Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

Deep-research agents like Claude and GPT solve long, multi-step tasks by searching, using tools, and synthesizing evidence. The problem: when they fail, you only know the final answer is wrong, not where in the trajectory the mistake actually happened. This paper introduces TELBench, a 1,000-instance benchmark for pinpointing harmful errors in agent trajectories at the span level, and DRIFT, a claim-centric auditing framework that tracks the claims the agent makes, checks whether they're supported by evidence, and traces which unsupported claims later break the answer. The approach improves error localization accuracy by up to 30 points over naive LLM prompting.
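
To make the claim-centric idea concrete, here is a toy sketch in Python. It is not the paper's implementation: the `Claim` record, the `harmful_steps` function, and the `supported` flag (which a real system would obtain from an evidence check, for example an LLM or retrieval call) are all hypothetical names chosen for illustration. It only shows the tracing step, where an unsupported claim is flagged if the final answer depends on it, directly or through other claims.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """Hypothetical record for one claim extracted from a trajectory."""
    claim_id: str
    step: int          # index of the trajectory step where the claim appears
    supported: bool    # assumed output of a separate evidence check
    depends_on: list[str] = field(default_factory=list)


def harmful_steps(claims: list[Claim], answer_deps: list[str]) -> list[int]:
    """Return steps holding unsupported claims that the answer relies on."""
    by_id = {c.claim_id: c for c in claims}
    seen: set[str] = set()
    stack = list(answer_deps)
    # Walk the dependency graph backwards from the final answer.
    while stack:
        cid = stack.pop()
        if cid in seen or cid not in by_id:
            continue
        seen.add(cid)
        stack.extend(by_id[cid].depends_on)
    # Keep only the reachable claims that failed the evidence check.
    return sorted({by_id[cid].step for cid in seen if not by_id[cid].supported})


# Illustrative data, invented for this sketch.
claims = [
    Claim("c1", step=2, supported=True),
    Claim("c2", step=5, supported=False),
    Claim("c3", step=7, supported=True, depends_on=["c2"]),
    Claim("c4", step=9, supported=False),  # unsupported, but never used
]
print(harmful_steps(claims, answer_deps=["c3", "c1"]))
```

In this example only step 5 is flagged: claim `c4` is also unsupported, but the answer never relies on it, so under this sketch's assumptions it does not count as harmful.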
