Speeding up agentic workflows with WebSockets in the Responses API
Justy and Cody dig into OpenAI’s writeup on speeding up agentic workflows with WebSockets in the Responses API. Cody is skeptical of the hype around raw model speed, while Justy keeps pulling it back to user pain: long, repetitive agent loops that make coding tools feel sluggish. They land on a practical read — the transport change matters most when the model is fast enough that API overhead becomes the bottleneck — and they sketch a weekend experiment for building a tiny stateful agent loop.
Script: GPT-5.4 mini
Voice: Elevenlabs-V2S
Transcript
Justy So this one is basically: why does a coding agent still feel weirdly slow when the model itself got fast? That gap is the whole argument, Cody.
Cody Yeah, and I think that’s the interesting part. The article is almost admitting that once inference gets really fast, the boring plumbing starts to matter more than the model. If you’re doing a bug-fix loop with tool calls, the agent spends a lot of time just bouncing requests back and forth.
Justy I got coffee and then immediately regretted it because the first thing I read was, wait, they made the model way faster and users still had to sit there. That is such a product problem. If the thing feels snappy in a demo but drags in the real workflow, people notice fast.
Cody Exactly. Their example is Codex scanning files, reading context, editing, running tests, then repeating. That means dozens of Responses API calls, and each one has to validate, process, and rebuild state. When the model was slower, that overhead hid in the noise. At 1,000 tokens per second, it sticks out like a sore thumb.
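A rough way to see Cody's point is to compute what share of one call's wall-clock time goes to fixed per-request overhead as generation speeds up. In this sketch, the 200-token step and the 0.3 s overhead are made-up placeholder numbers, not figures from the article. Only the 1,000 tokens-per-second rate comes from the discussion.

```python
def overhead_share(output_tokens: int, tokens_per_second: float, overhead_s: float) -> float:
    """Fraction of one call's wall-clock time spent on fixed per-request overhead."""
    generation_s = output_tokens / tokens_per_second
    return overhead_s / (overhead_s + generation_s)


# Placeholder inputs: 200 output tokens per agent step, 0.3 s fixed overhead per call.
for tps in (50, 200, 1000):
    share = overhead_share(200, tps, 0.3)
    print(f"{tps:>5} tok/s -> overhead is {share:.0%} of each call")
```

With these placeholder inputs it prints roughly 7%, 23%, and 60%. The overhead never changes; it simply stops hiding behind slow generation.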
Justy Right, and from a market angle, that matters most for the people actually living inside these tools. Devs using coding agents, teams automating support workflows, maybe ops folks with multi-step internal agents. They don’t care that the architecture is elegant if the loop still takes minutes.
Cody Mm-hm.
Justy But I do think the adoption barrier is real. If I’m a product team, I’m not just swapping transport because it sounds cool. I need to know the stateful connection doesn’t become a mess, and that the user story is strong enough to justify the extra moving parts.
Cody That’s fair. The clever bit is they stopped treating every turn like a fresh little island. Instead, the client can keep a persistent connection open, send only new information, and let the service cache reusable state in memory for the life of that connection. That’s a structural fix, not just shaving a few milliseconds off a request.
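To make "send only new information" concrete, here is a minimal in-memory sketch. It compares how many conversation items cross the wire when every request replays the full history against how many cross when a connection keeps state and receives only deltas. `StatefulConnection` and `stateless_items_sent` are hypothetical names for illustration. This is not the Responses API, just the shape of the idea.

```python
from dataclasses import dataclass, field


@dataclass
class StatefulConnection:
    """Hypothetical session whose conversation state lives in memory
    for the life of the connection."""
    history: list = field(default_factory=list)
    items_sent: int = 0

    def send(self, new_items: list) -> int:
        # Only the delta crosses the wire; the server appends it to cached state.
        self.items_sent += len(new_items)
        self.history.extend(new_items)
        return len(self.history)  # full context the model still gets to see


def stateless_items_sent(turns: list) -> int:
    """Each request rebuilds state from scratch, so the whole history is resent."""
    history: list = []
    sent = 0
    for new_items in turns:
        history.extend(new_items)
        sent += len(history)
    return sent


turns = [["user: fix the failing test"], ["tool: read calc.py"],
         ["tool: edit calc.py"], ["tool: run tests"]]
conn = StatefulConnection()
for new_items in turns:
    conn.send(new_items)

print(stateless_items_sent(turns), conn.items_sent)  # 10 4
```

Both paths leave the model with the same four-item context. The difference is only in what gets re-sent and re-processed on each turn.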
Justy And they weren’t only doing that. They also mention caching rendered tokens and model config, cutting unnecessary hops, and tightening the safety stack so some checks happen faster. So it’s not like WebSockets alone magically did all of it.
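The caching Justy mentions can be pictured as memoizing an expensive rendering step. In this sketch, `render_prefix` is a hypothetical stand-in for turning the model config and tool definitions into tokens. Python's `functools.lru_cache` plays the role of the service-side cache. That substitution is an assumption for illustration, not a description of how OpenAI implements it.

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def render_prefix(model: str, tools: tuple) -> tuple:
    # Hypothetical stand-in for the costly step of rendering the model config
    # and tool schemas into tokens. Arguments must be hashable to be cached.
    return (f"<model:{model}>",) + tuple(f"<tool:{name}>" for name in tools)


def build_turn(model: str, tools: tuple, new_items: list) -> list:
    # The rendered prefix is reused across turns; only new items are appended.
    return list(render_prefix(model, tools)) + new_items


tools = ("read_file", "edit_file", "run_tests")
for step in ["read calc.py", "edit calc.py", "run tests"]:
    build_turn("toy-model", tools, [step])

info = render_prefix.cache_info()
print(info.misses, info.hits)  # 1 2
```

The prefix is rendered once and then served from the cache on every later turn with the same config.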
Cody Sure, but WebSockets seem like the piece that changes the shape of the system. HTTP request after HTTP request is a bad fit when the conversation is really one long session with small deltas. gRPC bidirectional streaming could also work, but WebSockets are a pretty straightforward fit for this kind of interactive agent loop.
Justy I see the appeal, but I’m still a little suspicious of the universal lesson here. For a lot of apps, stateless requests are simpler and plenty fast. If your workflow is one-and-done, this is probably overkill. The win feels concentrated in agentic products with lots of back-and-forth.
Cody Yeah, I think that’s the honest read. It’s not a general WebSockets victory lap. It’s a reminder that once the model stops being the bottleneck, your system design gets exposed. And if you’re replaying a giant conversation history on every step, you’re paying for the same work over and over.
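Cody's "paying for the same work over and over" has a simple shape under a simplifying assumption (ours, not the article's). Suppose each agent step adds roughly d new tokens and every request replays the whole history. Then step k resends k·d tokens, and over n steps:

```latex
\text{replay total} = \sum_{k=1}^{n} k\,d = \frac{n(n+1)}{2}\,d,
\qquad
\text{delta-only total} = n\,d,
\qquad
\frac{\text{replay total}}{\text{delta-only total}} = \frac{n+1}{2}.
```

With illustrative values of n = 50 steps and d = 2,000 tokens, replay sends 1,275 × 2,000 = 2,550,000 tokens versus 100,000 for deltas only, a 25.5× gap. This ignores the model's own outputs and any caching. It only shows that replay cost grows quadratically with loop length while delta-only cost grows linearly.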
Justy So the user story is: I ask the agent to do something messy, and it doesn’t feel like it’s reloading its brain every time it touches a file. That’s actually a good product promise. Not flashy, but good.
Cody Yeah, exactly. The promise isn’t more magic, it’s less waiting. And I think that’s why this matters now, because people have already gotten used to fast answers in chat. Once the agent starts doing real work, any extra lag feels like the machine is thinking through wet concrete.
Justy Okay, that was annoyingly good. But I’d still want proof in a real app. If I’m building this, I want to know the win survives long conversations, flaky tools, and a connection that doesn’t stay perfect forever.
Cody That’s the part I’d test too. Measure total loop time on a toy agent that does file lookup, edit, and test. Then swap the one-request-per-turn HTTP calls for a WebSocket session and compare time to first token (TTFT), end-to-end latency, and how much context replay you avoid. If the numbers don’t move, the architecture story is just a nice blog post.
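Here is a minimal harness for the measurement Cody describes, assuming each transport can be wrapped as a function that yields tokens. `fake_stream` and its overhead values are placeholders that simulate a slow per-request setup versus a cheaper persistent session. You would need to swap in real client calls to get meaningful numbers. Summing the end-to-end times over every step of a loop gives total loop time.

```python
import time
from statistics import median
from typing import Callable, Iterator


def measure(stream: Callable[[], Iterator[str]]) -> tuple:
    """Return (time to first token, end-to-end latency) in seconds for one call."""
    start = time.perf_counter()
    ttft = None
    for _token in stream():
        if ttft is None:
            ttft = time.perf_counter() - start
    total = time.perf_counter() - start
    return (total if ttft is None else ttft), total


def fake_stream(overhead_s: float, tokens: int = 5, per_token_s: float = 0.001):
    """Placeholder transport: fixed setup cost, then tokens at a steady rate."""
    def run() -> Iterator[str]:
        time.sleep(overhead_s)  # stands in for validation and state rebuild
        for i in range(tokens):
            time.sleep(per_token_s)
            yield f"tok{i}"
    return run


def compare(runs: int = 5) -> dict:
    """Median TTFT and end-to-end latency per transport over several runs."""
    results = {}
    for name, overhead_s in [("http_per_turn", 0.03), ("websocket_session", 0.005)]:
        samples = [measure(fake_stream(overhead_s)) for _ in range(runs)]
        results[name] = (median(s[0] for s in samples),
                         median(s[1] for s in samples))
    return results


for name, (ttft, total) in compare().items():
    print(f"{name}: TTFT {ttft * 1000:.1f} ms, end-to-end {total * 1000:.1f} ms")
```

Medians keep one slow run from dominating. With real network calls you would want many more runs than five.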
Justy And for a solo builder, I’d keep it tiny. Node or Python, a mock tool runner, one local repo, and a loop that pretends to be an agent. If the WebSocket version feels clearly better on your laptop, that’s already a useful signal.
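Justy's tiny version could look something like this in Python. It uses an in-memory "repo", three mock tools, and a scripted `fake_model` that stands in for the model. Every name here is hypothetical. In a real agent, each call to `fake_model` is where one Responses API request would go, and that request is the point where the HTTP and WebSocket variants would differ.

```python
from typing import Callable, Optional

# Hypothetical in-memory "repo" standing in for one local checkout.
repo = {"calc.py": "def add(a, b):\n    return a - b\n"}


def lookup(path: str) -> str:
    return repo[path]


def edit(path: str, old: str, new: str) -> str:
    repo[path] = repo[path].replace(old, new)
    return "edited"


def run_tests(path: str) -> str:
    # Mock test runner: execute the file and check one expected result.
    namespace: dict = {}
    exec(repo[path], namespace)
    return "pass" if namespace["add"](2, 3) == 5 else "fail"


TOOLS: dict[str, Callable[..., str]] = {"lookup": lookup, "edit": edit, "run_tests": run_tests}


def fake_model(history: list) -> Optional[tuple]:
    """Scripted stand-in for the model: pick the next tool call, or None when done."""
    called = [name for name, _ in history]
    if "lookup" not in called:
        return "lookup", ("calc.py",)
    if "edit" not in called:
        return "edit", ("calc.py", "a - b", "a + b")
    if history[-1] != ("run_tests", "pass"):
        return "run_tests", ("calc.py",)
    return None


def agent_loop(max_steps: int = 10) -> list:
    history: list = []
    for _ in range(max_steps):
        call = fake_model(history)  # one model request per step
        if call is None:
            break
        name, args = call
        history.append((name, TOOLS[name](*args)))
    return history


print([name for name, _ in agent_loop()])  # ['lookup', 'edit', 'run_tests']
```

Wrapping each `fake_model` call with a timer, once per transport, gives the comparison the hosts describe.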
Cody Yeah. That’s the weekend project. Build the dumb version first, then the persistent one, and see whether the annoying part is actually the transport or just your toolchain. My guess is the transport matters more than people expect once the model gets this fast.
Justy That’s probably the cleanest way to think about it. Not a miracle, just a better fit for a workflow that’s already stateful. Anyway, I’m glad they put numbers on it instead of hand-waving.
Cody Same. Forty percent faster end-to-end is a real claim, and the reason it works is pretty grounded. All right, I think that one earns its keep.
Justy Yeah. Good one to keep in the back pocket. All right, I’m gonna go find something less patient than an agent loop, which feels impossible, but still. We’ll do the next one when I’m less caffeinated.