Introducing LangSmith Engine
Justy and Cody dig into LangSmith Engine as a practical shift from manual agent triage to a more continuous loop: production traces get clustered into named issues, tied back to likely root causes in code, and turned into draft fixes plus new eval coverage. They focus on why that matters for teams drowning in traces, how the system piggybacks on existing LangSmith tracing and evaluators, and where the real adoption friction is for product teams and solo builders.
Script: GPT-5.4 Voice: Rime Mist v3
Transcript
Justy The part that hits me is how many teams already have traces everywhere, and still don't know what to fix on Monday.
Cody Yeah. The raw data problem got solved before the triage problem. So this launch matters because it's trying to turn production mess into a ranked work queue instead of a guilt pile.
Justy Right, and that's a very real product pain. You can have an agent in production, people are clearly having a weird time with it, and the team is still stuck reading individual traces like it's forensic work.
Cody Mm-hm.
Cody What LangSmith Engine seems to do is run that whole loop continuously. It watches production traces, groups repeated failures into named issues, checks signals like explicit tool errors, evaluator failures, anomalies, bad user feedback, even cases where people ask for stuff the agent was never built to handle.
Justy I got in late last night and made the mistake of coffee at, like, nine. So my brain is somehow tired and overclocked. Anyway, that actually feels relevant, because a lot of agent teams are operating exactly like that.
Cody Exactly. And the clever bit is it doesn't stop at, hey, something failed. If your repo is connected, it tries to diagnose the root cause against the code or prompt setup, then drafts a PR and also proposes eval coverage so the same class of failure gets watched going forward.
Justy That eval part is maybe the strongest product story, honestly. Not just fixing the bad thing, but turning the bad thing that escaped into a test case so it stops sneaking back in after the next deploy.
Cody Right, right.
Cody Their example was pretty concrete. Support bot, users ask about canceling a subscription, the bot handles it badly, online evals mark it as a failure, user feedback is negative, but latency is normal so no systems alert goes off. Engine surfaces one issue with a severity, says it hit 12 percent of support sessions that week, notes it started four days ago and lines up with a deployment.
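To make the numbers in that example concrete, here is a minimal sketch of how an issue summary like "12 percent of sessions, started four days ago, lines up with a deployment" could be computed from session records. This is not how Engine does it; the record shape (`day`, `failed`), the function name `summarize_issue`, and the seven-day window are all assumptions made for illustration.

```python
from datetime import date, timedelta


def summarize_issue(sessions, deploy_days, today, window_days=7):
    """Summarize one failure mode over a recent window.

    sessions: list of dicts with "day" (date) and "failed" (bool); hypothetical shape.
    deploy_days: list of dates on which a deployment happened.
    """
    cutoff = today - timedelta(days=window_days)
    recent = [s for s in sessions if s["day"] > cutoff]
    hits = [s for s in recent if s["failed"]]
    if not recent or not hits:
        return None
    first_seen = min(s["day"] for s in hits)
    # Latest deployment on or before the first failure, if there is one.
    prior = [d for d in deploy_days if d <= first_seen]
    return {
        "share": len(hits) / len(recent),
        "days_ago": (today - first_seen).days,
        "first_seen": first_seen,
        "nearest_deploy": max(prior) if prior else None,
    }
```

A real system would also need to decide which sessions belong to the same issue in the first place; this sketch assumes that labeling has already happened.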
Justy And then it claims it can read the code and figure out the cancellation tool description is ambiguous. Which, I mean, I can believe sometimes. That's also the spot where trust gets won or lost.
Cody Sure.
Cody Yeah, same. I buy the clustering more easily than I buy perfect diagnosis. But even a decent draft is useful if the alternative is a human spending two hours spelunking through long traces and then writing the same eval by hand. Their own framing is review-and-merge, not blind autopilot, which feels sane.
Justy Who I think buys this first is not the person hacking on a toy chatbot over a weekend. It's the team that already has LangSmith tracing, probably some online evaluators, a repo with enough structure, and enough support or ops volume that repeated failures actually matter in dollars or churn.
Cody Yeah.
Cody Because it's built on top of existing LangSmith plumbing. They keep saying no new infrastructure, which is true only if you're already in their world. Connect a tracing project, optionally connect the repo, and it starts surfacing issues. If you're not already tracing and evaluating, the barrier is still the setup discipline.
Justy That's the adoption wrinkle. The product pitch sounds simple, but the prerequisite is you already believe in observability for agents. A surprising number of teams still ship prompt changes with basically vibes and screenshots.
Justy You know I'm right, Cody.
Cody No, fully. The phrase is harsh, but fair. Also, this is why the online-plus-offline combo matters. Engine uses existing evaluator results as input for issue detection, and when it finds a gap, it proposes a custom online evaluator for that exact issue and pulls failing traces into an offline dataset with per-example criteria.
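As a rough illustration of what "an offline dataset with per-example criteria" might look like, the sketch below stores each failing input alongside its own pass conditions and grades a new output against them. The `Example` class, the phrase-matching criteria format, and `grade` are hypothetical and are not LangSmith's API; a real evaluator would likely use a rubric or a model-based judge rather than substring checks.

```python
from dataclasses import dataclass


@dataclass
class Example:
    inputs: str
    criteria: list[str]  # phrases a passing reply must contain (assumed format)


def grade(example: Example, output: str) -> dict:
    """Check one output against that example's own criteria."""
    text = output.lower()
    missing = [c for c in example.criteria if c.lower() not in text]
    return {"passed": not missing, "missing": missing}


def run_offline(dataset: list[Example], agent) -> float:
    """Return the pass rate of `agent` (any callable str -> str) over the dataset."""
    if not dataset:
        return 0.0
    results = [grade(ex, agent(ex.inputs)) for ex in dataset]
    return sum(r["passed"] for r in results) / len(results)
```

The point of per-example criteria is that each escaped failure carries its own definition of "fixed", so the dataset can grow one incident at a time.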
Justy That closes a loop a lot of teams never close. Production becomes the source of the next eval suite, instead of this separate spreadsheet project somebody keeps postponing after standup.
Cody Oh interesting.
Cody And architecturally, I like that it's a deep agent sitting above traces, evaluator feedback, and code, rather than pretending one model call can do all of this. The trade-off is obvious, though. More access means more trust, more permissions, more questions about whether you want an external system reading your repo and proposing commits.
Justy I could be wrong, but that probably splits the market. Some teams will use issue surfacing and dataset generation immediately, then hold off on repo access until they've seen a few good catches. That's a very normal landing path for this kind of tool.
Cody I think that's right. Also, the named-issue abstraction is sneakily important. Reading one bad trace is emotionally convincing but statistically useless. Reading a cluster with a timeline, severity, recurrence, and links back to evidence is way closer to how teams actually prioritize work.
Justy And it gives product people something legible. Not just, the model seemed weird, but, this failure mode started after that release and touched this slice of sessions. That's the difference between vague concern and an actual ticket.
Cody Small side note, your kitchen somehow has no normal mugs. I'm drinking coffee out of what feels like a soup decision. Anyway, build-next wise, the obvious move is: if someone already uses LangSmith, connect an existing tracing project and a non-critical repo in the beta and watch what issue clusters show up.
Justy The mug is ambitious, not wrong.
Cody For a solo builder, I'd do a tiny version of the pattern this weekend. Log traces, tag failures manually or with a simple evaluator, cluster similar bad runs by embedding or even keyword rules, and dump the worst cluster into a dataset. Then write one targeted eval that would have caught it. Even without Engine, that workflow teaches the habit.
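The weekend version Cody describes can be sketched in a few dozen lines. This is a minimal sketch under stated assumptions: traces are already logged and tagged as failed, clustering uses keyword rules rather than embeddings, and the `Trace` shape, the `RULES` table, and `cancellation_eval` are all made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    user_input: str
    output: str
    failed: bool  # tagged by hand or by a simple evaluator


# Hypothetical keyword rules; embeddings would be the fancier alternative.
RULES = {
    "cancellation": ("cancel", "unsubscribe"),
    "refund": ("refund", "money back"),
}


def label(trace: Trace) -> str:
    """Assign a trace to the first rule whose keyword appears in the user input."""
    text = trace.user_input.lower()
    for name, keywords in RULES.items():
        if any(k in text for k in keywords):
            return name
    return "other"


def worst_cluster(traces):
    """Group failed traces by label and return the largest group."""
    clusters = defaultdict(list)
    for t in traces:
        if t.failed:
            clusters[label(t)].append(t)
    if not clusters:
        return None, []
    name = max(clusters, key=lambda n: len(clusters[n]))
    return name, clusters[name]


def to_dataset(name, traces):
    """Dump a cluster into plain dict rows that can be saved as a dataset."""
    return [{"issue": name, "input": t.user_input, "bad_output": t.output} for t in traces]


def cancellation_eval(output: str) -> bool:
    """One targeted check (assumed): the reply must at least address canceling."""
    return "cancel" in output.lower()
```

Running `worst_cluster` over a week of logged traces, saving `to_dataset` output to a file, and wiring `cancellation_eval` into whatever test runner is already in use covers the whole loop by hand.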
Justy And for teams already deeper in the stack, I'd compare three things after a week or two: did the surfaced issues match what humans would have found, were the drafted fixes directionally right, and did the new evaluators actually catch regressions later. That's the real scorecard, not the demo path.
Cody Yeah, that's the whole game. If it saves triage time and steadily hardens the eval suite, it's useful. If it mostly generates busywork with fancy labels, people will bounce fast.
Justy I think that's a good place to leave it. Episode 395 and we're still learning that the fancy mug is less important than whether it keeps the coffee in, same as the agent loop, honestly.