Fault Tolerance in LangGraph: Retries, Timeouts and Error Handlers
Justy is hyped about LangGraph’s first-class fault tolerance primitives (retries, timeouts, error handlers) for production agents, but Cody wants to dig into whether the hype matches reality.
Script: Mistral Small 4 119B 2603 Voice: Deepgram TTS
Transcript
Justy Anyway — this LangGraph fault-tolerance post is perfect timing because our last agent run that took, like, twelve hours just crapped out at step eight and we had to start over.
Cody Right… because that’s the argument: throw in some RetryPolicy and TimeoutPolicy and your month-long agent run never dies.
Justy Yeah, exactly — no more cold starts at three a.m. when an LLM 500s mid-run.
Cody Sure, but their default retry_on list is ConnectionError, 5xx, and the occasional TimeoutError.
Justy Which is like… the exact set of things that happen when a prod agent is live.
Cody Mm-hm… as long as you squint past the ValueError, the TypeError, and the time your dev forgot to pass the API key and every node threw a RuntimeError. None of those are on that list.
Justy Okay okay — so you tweak retry_on and move on.
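A minimal sketch of what "tweaking retry_on" could look like, assuming the retry policy accepts a callable that receives the exception and returns whether to retry. The names `is_transient`, `TRANSIENT`, and `PERMANENT`, and the `status_code` attribute, are hypothetical and chosen for illustration; they are not taken from the post.

```python
# Hypothetical predicate for deciding which failures are worth retrying.
# Assumption: HTTP client errors expose an integer `status_code` attribute.

# Failures that usually clear up on their own.
TRANSIENT = (ConnectionError, TimeoutError)

# Failures that will fail the same way on every attempt
# (bad input, wrong types, missing configuration such as an API key).
PERMANENT = (ValueError, TypeError, RuntimeError)


def is_transient(exc: BaseException) -> bool:
    """Return True only for errors that a retry has a chance of fixing."""
    if isinstance(exc, PERMANENT):
        return False
    if isinstance(exc, TRANSIENT):
        return True
    # Treat server-side HTTP errors (5xx) as transient, everything else as not.
    status = getattr(exc, "status_code", None)
    return isinstance(status, int) and 500 <= status < 600
```

The permanent set is checked first so that a misconfiguration fails fast instead of burning through every retry attempt.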
Cody Wait… their RetryPolicy also includes exponential backoff and jitter, but the article doesn’t show any numbers on how those retries behave under load, like how often they pile up when the upstream service is already struggling.
Justy True, but the point is you don’t ship twenty-five lines of retry boilerplate per node anymore.
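For reference, the per-node boilerplate being replaced looks roughly like the sketch below: exponential backoff with a cap and optional jitter. This is a plain-Python illustration, not LangGraph's implementation; `backoff_delay`, `call_with_retries`, and all default values are assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, initial: float = 0.5, factor: float = 2.0,
                  max_interval: float = 30.0, jitter: bool = True) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    # Grow the delay geometrically, then cap it.
    delay = min(initial * factor ** (attempt - 1), max_interval)
    if jitter:
        # Up to one extra second of random spread so that many
        # failing callers do not all retry at the same instant.
        delay += random.random()
    return delay


def call_with_retries(fn: Callable[[], T],
                      should_retry: Callable[[Exception], bool],
                      max_attempts: int = 3,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fn`, retrying failures that `should_retry` accepts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return fn()
        except Exception as exc:
            # Give up on the last attempt or on a non-retryable error.
            if attempt >= max_attempts or not should_retry(exc):
                raise
            sleep(backoff_delay(attempt))
            attempt += 1
```

The `sleep` parameter is injectable only so the loop can be exercised in tests without real waiting.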
Cody Fair. What about timeouts?
Justy They’ve got run_timeout for the whole node attempt and idle_timeout for the gap between steps.
Cody Sure, but if your LLM returns tokens in bursts, a thirty-second run_timeout might cut off a token stream that’s actually still valid.
Justy So set it higher or use idle_timeout?
Cody Their docs imply run_timeout is wall-clock, which is fine until someone expects it to track token generation instead of elapsed time.
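The difference between the two timeouts can be illustrated with a small asyncio sketch. It assumes the semantics described in the conversation (a wall-clock budget for the whole attempt, plus a limit on the gap between chunks); `collect_with_timeouts` is a hypothetical helper, not a LangGraph function.

```python
import asyncio
from typing import AsyncIterator, List


async def collect_with_timeouts(stream: AsyncIterator[str],
                                run_timeout: float,
                                idle_timeout: float) -> List[str]:
    """Drain `stream`, enforcing a total budget and a per-gap limit."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + run_timeout  # wall-clock budget for the attempt
    chunks: List[str] = []
    iterator = stream.__aiter__()
    while True:
        remaining = deadline - loop.time()
        if remaining <= 0:
            raise TimeoutError("run_timeout exceeded")
        try:
            # Wait for the next chunk, but never longer than either limit.
            chunk = await asyncio.wait_for(
                iterator.__anext__(), timeout=min(idle_timeout, remaining))
        except StopAsyncIteration:
            return chunks  # stream finished normally
        except asyncio.TimeoutError:
            # Report whichever limit was the binding one.
            kind = "idle_timeout" if idle_timeout <= remaining else "run_timeout"
            raise TimeoutError(f"{kind} exceeded") from None
        chunks.append(chunk)
```

Under these assumptions, a bursty but healthy stream can still hit `run_timeout` because that clock never pauses, while `idle_timeout` only fires when nothing arrives for too long.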
Justy I see… so the post sells the primitives but doesn’t tell you how to tune them.
Cody And the error_handler section is just a second node that gets the failure context—sounds elegant until your handler throws and now you’re in a loop.
Justy Okay, but in practice you still save the run instead of aborting mid-stream, which is more than most tools give you.
Cody True. Fine, the design keeps the happy path short.
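One way to avoid the loop Cody describes is to give the handler exactly one chance and treat its own failure as terminal. The sketch below is plain Python under that assumption; `FailureContext`, `HandlerFailed`, and `run_with_handler` are hypothetical names, not part of LangGraph.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

State = Dict[str, Any]


@dataclass
class FailureContext:
    """What the handler is told about the failed node."""
    node: str
    error: Exception
    state: State


class HandlerFailed(Exception):
    """Raised when the error handler itself throws; never re-handled."""

    def __init__(self, node: str, original: Exception):
        super().__init__(f"error handler for node {node!r} failed")
        self.original = original  # the node error that triggered the handler


def run_with_handler(name: str,
                     node: Callable[[State], State],
                     handler: Callable[[FailureContext], State],
                     state: State) -> State:
    """Run `node`; on failure, call `handler` once and only once."""
    try:
        return node(state)
    except Exception as exc:
        ctx = FailureContext(node=name, error=exc, state=state)
        try:
            return handler(ctx)
        except Exception as handler_exc:
            # Do not route this back into the handler: fail loudly instead.
            raise HandlerFailed(name, exc) from handler_exc
```

Keeping both the node error and the handler error on the raised exception makes a staging test able to assert on each of them.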
Justy So… we care?
Cody If you’re running hours-long agent graphs today, yes. If you’re still prototyping, add the primitives later.
Justy Right — which is exactly the order most teams do it in anyway.
Cody That tracks.
Justy Anyway — for Build Next: wrap your node in TimeoutPolicy with run_timeout set to max_tokens divided by your model’s token rate, and idle_timeout longer than your longest tool call; then tweak retry_on to your actual transient set.
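The arithmetic in that recommendation can be written out as a small helper. This is a sketch: `suggest_timeouts` and the `headroom` multiplier are assumptions added for illustration, and the estimate ignores time to first token and network latency.

```python
from typing import Tuple


def suggest_timeouts(max_tokens: int,
                     tokens_per_second: float,
                     longest_tool_call_s: float,
                     headroom: float = 1.5) -> Tuple[float, float]:
    """Return (run_timeout, idle_timeout) in seconds.

    run_timeout  = time to generate max_tokens at the given rate, padded.
    idle_timeout = the longest expected tool call, padded.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if headroom < 1:
        raise ValueError("headroom must be at least 1")
    run_timeout = max_tokens / tokens_per_second * headroom
    idle_timeout = longest_tool_call_s * headroom
    return run_timeout, idle_timeout


# Example: 4096 tokens at 64 tokens/s with a 20 s worst-case tool call.
run_t, idle_t = suggest_timeouts(4096, 64, 20)
print(run_t, idle_t)  # 96.0 30.0
```

Measuring the token rate and tool latency on real traffic matters more than the exact multiplier.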
Cody And test the error_handler in staging before you ship it to prod.
Justy Obvious, but apparently not obvious enough.
Cody Yeah. See you next week.