Where Do Deep Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories
Deep-research agents built on models like Claude and GPT solve long, multi-step tasks by searching, using tools, and synthesizing evidence. The problem: when they fail, you only know the final answer is wrong, not where in the trajectory the mistake actually happened. This paper introduces TELBench, a 1,000-instance benchmark for pinpointing harmful errors in agent trajectories at the span level, and DRIFT, a claim-centric auditing framework that tracks what claims the agent makes, checks whether they are supported by evidence, and traces which unsupported claims later break the answer. The approach improves span-level error localization and first-error accuracy by up to 30 percentage points over naive LLM prompting.
Script: Haiku 4
Voice: Inworld TTS 1.5 Max
Transcript
Justy So when these research agents actually go wrong, nobody knows where. Like, the final answer is garbage, but is it because the search failed three steps back, or the evidence was weak, or the agent made a bad assumption early and kept building on it?
Cody Right. That's the core of this paper. You get outcome-level eval — did the agent answer correctly or not — but zero visibility into which part of the trajectory actually broke.
Justy Exactly. And the trajectory is long, right? Like, eleven-ish steps or spans on average. You can't hand-debug two thousand of those.
Cody They collected 2,790 real trajectories by running tasks from three benchmarks (GAIA, XBench, BrowseComp) with three models (GPT-5, Gemini-2.5-Pro, Claude-Sonnet-4.5) under two agent frameworks, MiroFlow and OAgent. Then they normalized the logs into semantic spans and had humans annotate which spans actually contained harmful errors.
Justy Okay, so they're not synthetic. These are actual agent runs that failed.
Cody Yeah. And semantic spans are their unit of analysis — not raw events, which are too low-level and framework-specific, but chunks around a coherent objective. Planning, retrieval, verification, finalization. That granularity is what makes it possible to pinpoint where the agent went wrong without drowning in noise.
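To make the span idea concrete, here is a minimal sketch of turning raw events into semantic spans. It assumes each event already carries an objective label; the `Event` and `Span` classes, the `to_spans` function, and the sample events are all hypothetical and are not taken from the paper's normalization code.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Event:
    step: int
    objective: str  # assumed label: planning, retrieval, verification, finalization
    text: str


@dataclass
class Span:
    span_id: int
    objective: str
    events: list


def to_spans(events):
    """Merge consecutive events that share an objective into one span."""
    spans = []
    for objective, group in groupby(events, key=lambda e: e.objective):
        spans.append(Span(span_id=len(spans), objective=objective, events=list(group)))
    return spans


# Illustrative events only, not real trajectory data.
events = [
    Event(0, "planning", "decide to look up the author"),
    Event(1, "planning", "list candidate sources"),
    Event(2, "retrieval", "issue a search query"),
    Event(3, "retrieval", "open the first result"),
    Event(4, "verification", "cross-check the date"),
    Event(5, "finalization", "write the answer"),
]
spans = to_spans(events)
print([(s.span_id, s.objective, len(s.events)) for s in spans])
```

Six raw events collapse into four spans here, which is the kind of granularity the annotators would then label as harmful or not.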
Justy So you're not labeling every line in the trajectory as an error or not. You're saying this chunk, this five-line planning phase, is where the agent introduced a claim that wasn't actually supported.
Cody Exactly. And the hard part is that those unsupported claims get inherited by later spans without revalidation. So the error isn't the final wrong answer. It's the early commitment that the agent kept building on.
Justy That's such a human mistake, honestly.
Cody It really is. So they built this benchmark, TELBench, with 1,000 instances where a model has to identify which spans are actually harmful errors versus benign exploration, failed searches, or just noise. The challenge is that deep-research trajectories mix all of those.
Justy Right, because the agent is searching and trying things and exploring. That's not all wrong. But somewhere in there is an assumption that breaks the whole chain.
Cody And they also propose this framework called DRIFT — claim-centric auditing. Instead of just scoring spans independently or throwing the whole trajectory at an LLM and hoping, DRIFT actually tracks what claims the agent makes, where they're introduced, where they become consequential, and which later spans depend on them.
Justy So it's like a ledger. The agent says 'X is true,' and DRIFT records that and then asks: is X actually supported by the evidence?
Cody Yeah. There's a Claim Keeper that maintains that ledger, a Support Seeker that checks whether claims are directly supported, weakly supported, unsupported, or contradicted, and then specialist auditors that do skill-routed checks — entity claims, constraint claims, evidence claims, retrieval claims, compute, process. Then a Dependency Tracer backtraces unsupported claims to figure out which ones actually matter.
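The ledger described here can be pictured as a small data structure. This is a sketch under stated assumptions: the four support levels come from the discussion above, while the `Claim` and `ClaimLedger` classes, their fields, and the rule that a claim counts as unsupported until checked are hypothetical and not DRIFT's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Support(Enum):
    DIRECT = "directly supported"
    WEAK = "weakly supported"
    UNSUPPORTED = "unsupported"
    CONTRADICTED = "contradicted"


@dataclass
class Claim:
    claim_id: str
    text: str
    kind: str            # e.g. "entity", "constraint", "evidence", "retrieval"
    introduced_in: int   # span where the claim first appears
    support: Support = Support.UNSUPPORTED  # assumption: unsupported until checked


class ClaimLedger:
    """Hypothetical ledger: remembers each claim and its current support status."""

    def __init__(self):
        self.claims = {}

    def record(self, claim):
        # Keep the first introduction so the originating span is preserved.
        self.claims.setdefault(claim.claim_id, claim)

    def set_support(self, claim_id, support):
        self.claims[claim_id].support = support

    def suspicious(self):
        bad = {Support.UNSUPPORTED, Support.CONTRADICTED}
        return [c for c in self.claims.values() if c.support in bad]


ledger = ClaimLedger()
ledger.record(Claim("c1", "The report was published in 2019", "entity", introduced_in=1))
ledger.record(Claim("c2", "The author works at the agency", "entity", introduced_in=2))
ledger.set_support("c1", Support.DIRECT)   # evidence found in a retrieved page
print([c.claim_id for c in ledger.suspicious()])
```

Only `c2` comes back as suspicious, because nothing ever checked it against evidence.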
Justy So it's not just 'this span has an error.' It's 'this span has an unsupported claim, and here's the chain of spans that depend on it.'
Cody Right. And that matters because not every unsupported claim breaks the answer. Some are dead ends. But if you can trace the dependency graph, you can figure out which ones actually propagated into the final result.
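The backtracing step can be sketched as a graph walk from the final answer. Everything named here is an assumption for illustration: the `upstream` function, the `depends_on` mapping, and the claim ids are hypothetical, and the paper's Dependency Tracer may work quite differently.

```python
def upstream(node, depends_on):
    """Return every claim the given node depends on, directly or transitively."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent in depends_on.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical dependency edges: each key relies on the claims in its list.
depends_on = {
    "answer": ["c3"],
    "c3": ["c1"],
    "c4": ["c2"],  # dead end: the final answer never uses c4
}

# Unsupported claims mapped to the span where each was introduced.
unsupported = {"c1": 2, "c2": 4}

reached = upstream("answer", depends_on)
harmful = {c: span for c, span in unsupported.items() if c in reached}
first_error_span = min(harmful.values(), default=None)

print(harmful, first_error_span)
```

Both `c1` and `c2` are unsupported, but only `c1` feeds the answer, so only span 2 is flagged. That is the dead-end distinction in miniature.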
Justy How much better does DRIFT actually do?
Cody Up to thirty percentage points on span-level error localization and first-error accuracy over naive LLM prompting. So, not a small margin.
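The episode does not spell out how the two metrics are computed, so the following is only one plausible reading: span-level localization scored as F1 over sets of flagged span ids, and first-error accuracy as whether the earliest predicted span matches the earliest annotated one. The function names and the example span ids are hypothetical.

```python
def span_f1(predicted, gold):
    """F1 between predicted and annotated error spans (assumed metric definition)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def first_error_hit(predicted, gold):
    """True when the earliest predicted span equals the earliest annotated span."""
    return bool(predicted) and bool(gold) and min(predicted) == min(gold)


# Illustrative span ids only, not benchmark data.
predicted_spans = [2, 5]
gold_spans = [2, 3, 5]

f1 = span_f1(predicted_spans, gold_spans)
hit = first_error_hit(predicted_spans, gold_spans)
print(round(f1, 2), hit)
```

Under this reading, a method can find the first error correctly while still missing later harmful spans, which is why the two numbers are worth reporting separately.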
Justy Okay, so the claim-centric structure actually helps. I was worried this was going to be one of those papers where they make it more complicated and don't gain much.
Cody Yeah, I had the same thought. But the mechanism makes sense — you're not asking the LLM to holistically judge a ten-span trajectory. You're asking it to check specific claims against specific evidence, which is a narrower, more reliable task.
Justy And it scales across different models and frameworks?
Cody They tested on GPT-5, Gemini, Claude, under MiroFlow and OAgent, so yeah. The results hold across model families.
Justy So who actually ships this? Like, if I'm building a research agent for a product, do I run DRIFT on every trajectory?
Cody That's the real question. The paper is research-focused: they built the benchmark and the framework and showed it works better than baselines. But operationally, yeah, if you care about reliability in deep-research agents, you'd want process-level auditing on top of outcome-level eval.
Justy That sounds expensive.
Cody It probably is. You're running multiple auditor agents over every trajectory. But if the final answer is high-stakes — research report, policy recommendation, legal analysis — you need to know not just that it's right, but WHERE it could be wrong. And this gives you that.
Justy Fair. And the paper's open-sourced, right?
Cody TELBench is on Hugging Face, and DRIFT code is on GitHub. So you could take either one and build on it.
Justy Alright, so the thing that sticks with me is the dependency tracing. Like, I think a lot of agent debugging today is just 'did it get the right answer,' and if not, 'why,' and you're just rereading the trajectory by hand. This actually gives you a structural way to find the first unsupported claim that mattered.
Cody Yeah, and it's not just academic. If you're running agents in production, you need to know which errors are worth fixing. Is it the retrieval? The reasoning? The claim formation? DRIFT tells you.
Justy This is going to be one of those papers that feels obvious in retrospect, but I'm genuinely glad someone did it. Because right now, agent eval is basically binary — pass or fail — and that's not enough for anything real.