Four Agent Orchestration Patterns
Justy and Cody dig into a benchmark study testing four multi-agent orchestration patterns across 10,000 SEC filings — sequential pipeline, parallel fan-out, hierarchical supervisor-worker, and reflexive self-correcting loop — unpacking the real cost-accuracy-scale trade-offs and how to pick the right one for production.
Script: Sonnet 4.6 Voice: Elevenlabs-V2S
Transcript
Justy Okay so I keep running into teams that built an agent, it works great in the demo, they push it to prod, and it just... falls apart. And I never had a clean way to explain why. Then I read this paper and it kind of snapped into focus.
Cody Yeah, the NYU benchmark — Siddhant and Yukta Kulkarni. They ran 10,000 SEC filings through five different LLMs and tested four orchestration patterns head to head. It's a pretty rigorous setup, not a toy example.
Justy Right. And SEC filings are like — that's a real stress test. Long, dense, structured in annoying ways.
Cody Exactly. And what they're isolating is not which model is smarter. The agents themselves are the same across all four patterns — parser, field extractor, table analyzer, confidence scorer. The only variable is how those agents are wired together.
Justy Which is the thing nobody thinks about when they're prototyping.
Cody Nobody. You just chain a few calls together and call it done. Anyway — you get coffee? Because I feel like I've been awake since Tuesday.
Justy I made a pot like an hour ago, it's probably terrible by now. Grab it if you want. Alright — walk me through the four patterns.
Cody So the simplest one is the sequential pipeline. Agent A finishes, passes everything to Agent B, B passes to C. Fixed order, no branching. The nice thing is it's completely predictable and it's cheap. At 100,000 tasks a day, it has the smallest accuracy drop of any pattern.
Justy Mm-hm.
Cody The ugly part is error propagation. If Agent A hallucinates something early, every downstream agent just inherits that mistake. There's no natural correction point unless you manually build one in. And the context window keeps growing as you pass accumulated output down the chain, so token costs climb with every stage you add.
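A minimal sketch of the sequential pattern Cody describes, assuming stub functions in place of real LLM calls. The agent names follow the ones mentioned in the conversation, but their bodies, the `run_sequential` helper, and the word-count stand-in for tokens are all hypothetical. It illustrates the two drawbacks: an early mistake is inherited downstream, and the input each stage reads keeps growing.

```python
from typing import Callable, List, Tuple

# Hypothetical stub agents: each reads the accumulated context and returns new text.
Agent = Callable[[str], str]

def parser(ctx: str) -> str:
    # Pretend this value is an early hallucination.
    return "parsed: revenue=100"

def field_extractor(ctx: str) -> str:
    # Trusts whatever the parser wrote; there is no correction point.
    value = ctx.split("revenue=")[-1].split()[0]
    return f"field: revenue={value}"

def confidence_scorer(ctx: str) -> str:
    return "score: 0.9"

def run_sequential(document: str, agents: List[Agent]) -> Tuple[str, List[int]]:
    """Run agents in fixed order, passing the full accumulated context along."""
    context = document
    input_sizes = []  # whitespace-separated words, a rough stand-in for tokens
    for agent in agents:
        input_sizes.append(len(context.split()))
        output = agent(context)
        context = context + "\n" + output
    return context, input_sizes

final, sizes = run_sequential(
    "10-K filing text", [parser, field_extractor, confidence_scorer]
)
print(sizes)  # [3, 5, 7]: each stage reads more than the one before
```

The parser's value shows up unchanged in the extractor's output, which is the propagation problem in miniature.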
Justy So for a product team, that's fine if the task is narrow and you need throughput. Like a document ingestion pipeline where you've already validated the inputs upstream.
Cody Exactly that use case. Okay, parallel fan-out is the speed play. A router sends independent subtasks to multiple workers simultaneously, they run concurrently, then a merge agent reconciles all the outputs.
Justy Oh interesting.
Cody Your total latency is basically the slowest branch plus merge time, not the sum of everything. But it's the most token-inefficient of the four — workers often need overlapping context, so you're feeding the same input multiple times. And the merge step is genuinely hard. Workers can come back with conflicting answers, and the merge agent doesn't always have enough context to know which one's right.
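A rough sketch of the fan-out and merge steps, assuming thread-based stub workers rather than real model calls. `make_worker`, `fan_out`, `merge`, and `estimated_latency` are hypothetical names, and the merge rule here (accept only unanimous answers, otherwise flag a conflict) is one simple choice, not the one used in the benchmark.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple

def make_worker(answer: str, delay: float) -> Callable[[str], str]:
    """Build a stub worker; the sleep stands in for an LLM call."""
    def worker(document: str) -> str:
        time.sleep(delay)
        return answer
    return worker

def fan_out(document: str, workers: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    # Every worker receives the same document, which is where the
    # duplicated input tokens come from.
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = {name: pool.submit(fn, document) for name, fn in workers.items()}
        return {name: future.result() for name, future in futures.items()}

def merge(outputs: Dict[str, str]) -> Tuple[Optional[str], bool]:
    """Return (answer, conflict). Without extra context, a conflict cannot be resolved here."""
    distinct = set(outputs.values())
    if len(distinct) == 1:
        return distinct.pop(), False
    return None, True

def estimated_latency(branch_seconds: List[float], merge_seconds: float) -> float:
    # Slowest branch plus merge time, not the sum of all branches.
    return max(branch_seconds) + merge_seconds

outputs = fan_out("filing text", {
    "table_worker": make_worker("100", 0.01),
    "text_worker": make_worker("120", 0.02),
})
print(merge(outputs))  # (None, True): the workers disagree
```

With branches of 2, 5, and 3 seconds and a 1 second merge, `estimated_latency` gives 6 seconds, against 11 seconds if the same work ran back to back.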
Justy That's the part that would scare me shipping it. Like, who arbitrates the conflict?
Cody The merge agent, and it's often not equipped to. It's kind of a known weak point. Now the hierarchical supervisor-worker — this is the one the paper basically recommends as the default for most production systems.
Justy Right, right.
Cody A supervisor agent plans the work, assigns subtasks to workers, and each worker returns an output with a confidence score. If the score drops below a threshold, the supervisor can reassign, request a second pass, or escalate to a stronger model. And workers only get the context they actually need — not the full accumulated chain.
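A minimal sketch of the confidence-gated routing described above, assuming two stub models and a threshold of 0.8. The threshold value, the `cheap_model` and `strong_model` stand-ins, and the `supervise` function are illustrative assumptions, not details from the paper. Only the escalation branch is shown; reassignment and second passes are omitted.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class WorkerResult:
    output: str
    confidence: float

def cheap_model(task_context: str) -> WorkerResult:
    # Hypothetical behaviour: confident only on short inputs.
    confidence = 0.95 if len(task_context) < 40 else 0.55
    return WorkerResult(f"cheap:{task_context[:10]}", confidence)

def strong_model(task_context: str) -> WorkerResult:
    # Hypothetical stronger (and pricier) model.
    return WorkerResult(f"strong:{task_context[:10]}", 0.97)

def supervise(subtasks: Dict[str, str], threshold: float = 0.8) -> Dict[str, WorkerResult]:
    """Give each worker only its own subtask context; escalate low-confidence results."""
    results = {}
    for name, context in subtasks.items():
        result = cheap_model(context)
        if result.confidence < threshold:
            # Below threshold: escalate this one subtask to the stronger model.
            result = strong_model(context)
        results[name] = result
    return results

results = supervise({
    "cover_page": "Item 1A risk factors",
    "tables": "A long, dense table section that the cheap model struggles with",
})
for name, result in results.items():
    print(name, result.output, result.confidence)
```

The point of the design is that the expensive call happens per subtask, only when the cheap result is not trusted.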
Justy So you can route cheap tasks to cheaper models and expensive tasks to GPT-4o or Claude.
Cody That's exactly what they show. In the benchmark it hit 0.929 F1 — that's 98.5% of the top accuracy score — at only about 61% of the cost of the most expensive pattern.
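The 98.5% figure can be checked directly, assuming the "top accuracy score" is the 0.943 F1 reported for the reflexive loop later in this conversation:

```python
supervisor_f1 = 0.929
top_f1 = 0.943  # assumption: the reflexive loop's score is the top score

ratio = supervisor_f1 / top_f1
print(f"{ratio:.1%}")  # 98.5%
```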
Justy That's a pretty compelling ratio for a PM to bring to an eng lead.
Cody Yeah, that's a slide that writes itself. The downside is the supervisor adds coordination complexity and decision-making latency. If your routing logic is sloppy, tasks get misassigned and things quietly break in ways that are hard to debug.
Justy Which brings us to the reflexive loop — the one that sounds amazing until you look at the bill.
Cody Yeah. An agent produces output, a separate verifier critiques it, sends structured feedback back, and the original agent revises. Keeps going until it passes or hits a fixed iteration cap — usually three rounds.
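A minimal sketch of the generate, verify, revise cycle with a fixed iteration cap, assuming stub functions instead of model calls. `reflexive_loop`, `stub_generate`, and `stub_verify` are hypothetical names; the cap of three follows the figure mentioned above.

```python
from typing import Callable, Optional, Tuple

MAX_ROUNDS = 3  # the typical cap mentioned in the conversation

def reflexive_loop(
    task: str,
    generate: Callable[[str, Optional[str]], str],
    verify: Callable[[str], Optional[str]],
    max_rounds: int = MAX_ROUNDS,
) -> Tuple[str, int, bool]:
    """Return (final draft, rounds used, passed). Each round costs one generate and one verify call."""
    feedback: Optional[str] = None
    draft = ""
    for round_number in range(1, max_rounds + 1):
        draft = generate(task, feedback)
        feedback = verify(draft)  # None means the verifier accepted the draft
        if feedback is None:
            return draft, round_number, True
    # Cap reached without passing: return the last draft, flagged as unverified.
    return draft, max_rounds, False

def stub_generate(task: str, feedback: Optional[str]) -> str:
    draft = "revenue: 100"
    if feedback == "missing currency":
        draft += " USD"
    return draft

def stub_verify(draft: str) -> Optional[str]:
    return None if "USD" in draft else "missing currency"

print(reflexive_loop("extract revenue", stub_generate, stub_verify))
# ('revenue: 100 USD', 2, True)
```

A verifier that never accepts runs all three rounds and returns an unverified draft, which is where the extra cost comes from without any gain in the answer.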
Justy Sure.
Cody Highest accuracy in the benchmark — 0.943 F1 with Claude 3.5 Sonnet. But 2.3 times the cost of sequential. And here's the thing that surprised me: above 25,000 tasks a day, performance starts to fall apart. The correction loops create queuing delays, you hit timeouts, iterations get cut short, and accuracy can actually drop below the sequential baseline.
Justy Wait — it can get worse than just running it straight through?
Cody At scale, yes. And there's another failure mode where the loop just keeps revising ambiguous text instead of settling. It can add complexity without actually improving the answer.
Justy So this is the pattern you'd use for, I don't know — legal document review, medical records, something where a wrong answer has real consequences and volume is low.
Cody That's exactly the framing from the paper. Low volume, high stakes, cost is secondary. Not a default architecture.
Justy Cody, I feel like the bigger thing here is that so many teams I talk to are defaulting to the reflexive pattern because it sounds the most thorough — like they're being responsible — and then wondering why their costs are insane.
Cody A hundred percent. And there's also a finding from a separate study of 70 real-world agent projects that the paper references: stronger models don't automatically make a system safer or more reliable. The architecture around the model matters more than most people assume.
Justy Which is a hard sell when everyone's excited about the new model drop.
Cody Right, nobody's writing a blog post about their supervisor routing logic.
Cody For actually building: LangGraph is probably the most practical starting point for wiring up a hierarchical or reflexive pattern with real routing logic. It has native support for supervisor-worker graphs and you can instrument token usage pretty easily. If you're solo and just want to explore on a weekend, I'd say start with a sequential pipeline on a small document set (maybe 50 PDFs you already have), get your baseline accuracy and cost numbers, then add a supervisor layer.
Justy Alright. Start simple, instrument early, earn the complexity. Cody, this is the kind of thing I wish someone had handed me six months ago when I was nodding along to agent demos like I understood what was happening.