Ep 399 article 6:06 w/ Justy & Cody

We built SmithDB, the data layer for agent observability

Justy and Cody dig into why agent traces have become a weird database problem, and why LangSmith built SmithDB instead of stretching a normal observability stack past its limits.

Script: GPT-5.4 Voice: OpenAI TTS

Transcript

Justy The reason this matters right now is simple. People finally have agents doing real work, then the debug screen turns into soup because the traces are huge and half the run hasn't even finished yet.

Cody Yeah.

Cody What jumped out to me is they’re saying agent traces stopped looking like normal app traces a while ago. You’ve got hundreds of nested spans, images or audio mixed in, and start and end events that can be hours apart, so the old request-response assumptions just fall over.

Justy Right.

Justy Also, tiny life note, I’m on my second coffee because I slept weird after the flight. Anyway, that fuzzy-brain feeling is kind of how a product feels when the observability tool takes minutes instead of seconds.

Cody And they put numbers on it, which I appreciate. P50 around 92 milliseconds for loading a trace tree, 71 milliseconds for a single run, 82 for run filtering, and about 400 for full-text search. For this kind of data, that’s pretty solid.

Justy Mm-hm.

Justy From the product side, the user story is very clear to me, Cody. Teams building agent workflows in production need to search weird failures, slice by metadata or feedback, rebuild long threads, and export eval sets without waiting long enough to lose the thread in their own head.

Cody Exactly.

Cody The architecture is the interesting part. SmithDB is basically object-storage-backed, with a small Postgres metastore, then stateless ingestion, query, and compaction services. Underneath, they describe it as an LSM design, so writes get buffered and flushed as immutable sorted batches, then compacted later.
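
A minimal sketch of the write path Cody describes, assuming a toy in-memory store rather than anything from SmithDB itself. The class name `TinyLSM`, the `flush_threshold` parameter, and the use of plain tuples as "segments" are all hypothetical; real segments would be files in object storage tracked by the metastore.

```python
from dataclasses import dataclass, field


@dataclass
class TinyLSM:
    """Toy LSM store: buffer writes, flush sorted immutable batches, compact later."""

    flush_threshold: int = 3                      # hypothetical knob: entries per flush
    memtable: dict = field(default_factory=dict)  # mutable write buffer
    segments: list = field(default_factory=list)  # oldest first; each a sorted tuple of pairs

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Freeze the buffer into an immutable, key-sorted batch.
        if self.memtable:
            self.segments.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: check the buffer, then segments from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            for k, v in segment:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge every segment into one, letting newer values overwrite older ones.
        merged = {}
        for segment in self.segments:
            merged.update(segment)
        self.segments = [tuple(sorted(merged.items()))] if merged else []
```

Reads have to consult several segments until compaction folds them together, which is why a background compaction service shows up in the architecture at all.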

Justy Which is a pretty good market tell too. If they care that much about object storage and stateless services, they’re aiming at enterprise teams that need self-hosted or multi-cloud options and do not want a weird disk-heavy cluster babysitting job.

Cody I see.

Cody They also built it in Rust and use Apache DataFusion plus Vortex, but with heavy customization. The custom part matters, because their query engine has to understand that a run is not one row. It’s a sequence of events that gets merged at query time, which is messier than a normal trace store.
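
To make "a run is not one row" concrete, here is a small sketch of folding a run's events into a single view at query time. The event shape (`seq`, `kind`, `fields`) and the status values are assumptions made for illustration, not SmithDB's schema, and this is plain Python rather than a DataFusion execution plan.

```python
def merge_run(events):
    """Fold a run's events, in sequence order, into one merged record."""
    run = {"status": "pending"}
    for event in sorted(events, key=lambda e: e["seq"]):
        # Later events overwrite fields set by earlier ones.
        run.update(event["fields"])
        if event["kind"] == "start":
            run["status"] = "running"
        elif event["kind"] == "end":
            run["status"] = "finished"
    return run


events = [
    {"seq": 2, "kind": "end", "fields": {"output": "done", "end_time": 50}},
    {"seq": 1, "kind": "start", "fields": {"name": "agent", "start_time": 10}},
]
print(merge_run(events))      # a finished run with fields from both events
print(merge_run(events[1:]))  # only the start event: the run is still "running"
```

A run whose end event has not arrived yet still merges into something queryable, which matches the point that start and end can be hours apart.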

Justy That part feels like the adoption barrier too. Not whether the database is fast, but whether a team is mature enough to instrument runs, metadata, tags, feedback, tool outputs, all of that, so the fancy filters are actually useful.

Cody Right, right.

Cody One clever bit I genuinely like is the progressive query strategy over object storage. Instead of opening every candidate file, sorting everything, deduping, then applying a limit, they walk backward through time, grab a bounded recent window, stream and merge, and stop once they know the result is correct. That is exactly the kind of trick that makes object storage feel less awful for interactive queries.
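
The early-stop idea can be sketched in a few lines, assuming each segment records the newest timestamp it contains. The segment layout (`max_ts`, `records`) and the function name `newest_runs` are hypothetical; the real planner works on files in object storage and streams its merge rather than holding everything in a dict.

```python
def newest_runs(segments, limit):
    """Return the `limit` most recently updated runs, opening as few segments as possible.

    Each segment is {"max_ts": int, "records": [(ts, run_id, payload), ...]}.
    Returns (results, segments_opened).
    """
    ordered = sorted(segments, key=lambda s: s["max_ts"], reverse=True)
    newest = {}  # run_id -> (ts, payload) for the newest event seen so far
    opened = 0
    for seg in ordered:
        if len(newest) >= limit:
            # Timestamp of the limit-th newest run found so far.
            cutoff = sorted((ts for ts, _ in newest.values()), reverse=True)[limit - 1]
            # Nothing in this or any older segment can enter or reorder the top results.
            if seg["max_ts"] < cutoff:
                break
        opened += 1
        for ts, run_id, payload in seg["records"]:
            # Dedupe by run, keeping only the newest event.
            if run_id not in newest or ts > newest[run_id][0]:
                newest[run_id] = (ts, payload)
    top = sorted(newest.items(), key=lambda kv: kv[1][0], reverse=True)[:limit]
    return [(run_id, ts, payload) for run_id, (ts, payload) in top], opened


segments = [
    {"max_ts": 50, "records": [(50, "r1", "start"), (40, "r0", "end")]},
    {"max_ts": 100, "records": [(100, "r3", "end"), (95, "r2", "end")]},
    {"max_ts": 90, "records": [(90, "r3", "start"), (85, "r1", "end")]},
]
results, opened = newest_runs(segments, limit=2)
print(results, opened)
```

With a limit of 2, only the newest segment is opened, because the next segment's newest timestamp is already older than the second result.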

Justy No way.

Justy That’s one of those details users never see, but they feel it. The page loads before they can mutter that the app is broken, which is honestly the whole game.

Cody And there’s another nice trick. Each file segment remembers which ingestion node wrote it, so if that node is still alive, the planner can read fresh stuff from that node’s SSD or memory cache instead of yanking a bunch of tiny files back out of object storage immediately.
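
A rough sketch of that routing decision, assuming each segment carries a writer id and the planner knows which ingestion nodes are alive. The field names (`path`, `writer`), the source labels, and the example paths are invented for illustration and do not reflect SmithDB's metadata format.

```python
def plan_reads(segments, live_nodes):
    """Pick a read source per segment: the writer's local cache if it is alive, else object storage."""
    plan = []
    for seg in segments:
        if seg["writer"] in live_nodes:
            # Fresh data is likely still in the writer's SSD or memory cache.
            plan.append((seg["path"], "node:" + seg["writer"]))
        else:
            # Writer is gone, so fall back to the durable copy.
            plan.append((seg["path"], "object_store"))
    return plan


segments = [
    {"path": "seg-001", "writer": "ingest-a"},
    {"path": "seg-002", "writer": "ingest-b"},
]
print(plan_reads(segments, live_nodes={"ingest-a"}))
```

Because the ingestion services are stateless, the object-store path always remains a correct fallback; the local read is only an optimization.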

Justy Okay okay.

Justy I could be wrong, but that seems like where the product gets sticky for bigger teams. If you’re logging hundreds of millions of events, seeing fresh traces in seconds instead of minutes changes whether people trust the tool during an incident or just open another tab and guess.

Cody My only mild question mark is compaction complexity. Once runs are multiple events, plus deletes, TTL expiry, index merging, and query-optimized rewrites, that gets hairy fast. That's not a knock on the design, just that a system like that probably earns its pager.

Justy Yeah, somebody definitely has a very exciting on-call rotation. But it sounds real, not like a lab project, since they say all US cloud ingestion and the tracing UI query traffic are already on it.

Cody If I were poking at this over a weekend, I’d read the SmithDB post, then clone DataFusion and look at how custom execution plans work. After that, I’d make a tiny trace store where one run is append-only events in object storage, then implement a newest-first query with a limit and see how much data you can avoid scanning.

Justy For a solo builder, I’d keep it even more practical. Spin up a small agent app in LangSmith, generate intentionally messy traces with retries and tool calls, then test which metadata and feedback fields actually help you find a bad run later. That’s the part that decides whether observability feels magical or just expensive storage.

Cody And if you want a comparison frame, don’t compare it to a plain OLTP database. Compare it to trying to force agent traces through a generic observability stack that was built for short-lived spans and simpler query patterns. Different shape of problem.

Justy Anyway, Cody, episode 399 and we somehow made storage engines sound like normal household drama. I’m counting that as progress.