Ep 313 Research Paper April 22, 2026 4:34 w/ Justy & Cody

AgentSPEX: An Agent SPecification and EXecution Language

Justy and Cody dig into AgentSPEX, a YAML-based language and runtime for building LLM agents with explicit control flow, typed steps, reusable submodules, parallel execution, and state management. They focus on the gap between loose ReAct prompting and Python-heavy orchestration tools, then unpack how AgentSPEX separates workflow specification from execution while still supporting tools, sandboxing, checkpointing, replay, and visual editing. The conversation lands on who this is for, where it feels shippable, and what a solo builder could try this weekend.

Read the source → Plain-text transcript →

Embed this episode

Paste this on any site — the player is a self-contained iframe with no cookies or trackers.

<iframe src="https://sandrise.io/exploring-next/embed/313"
  width="100%" height="180" style="max-width:640px;border:0;border-radius:12px;overflow:hidden"
  title="Exploring Next — Episode 313 audio player"
  loading="lazy" allow="autoplay" referrerpolicy="strict-origin-when-cross-origin"></iframe>

Embed & API docs →

Script GPT-5.4 Voice Inworld TTS 1.5 Max

Transcript

Justy The weird part of agent building right now is the smartest systems can still feel impossible to inspect.

Justy This is Exploring Next, episode 313. I'm in Cody's kitchen in DC, still calibrating after the flight, and we're talking about AgentSPEX.

Cody What grabbed me is the target. They are going after that awkward middle ground where plain ReAct prompts are too loose, but Python orchestration frameworks get tangled fast. If you've built with LangGraph or DSPy or CrewAI, you've probably felt that.

Justy Yeah, and that matters now because more teams want agents to do long jobs, not just one-shot chat. Research flows, software tasks, proposal writing, anything with loops and branching. Product people want to change behavior without reopening a pile of orchestration code.

Cody AgentSPEX puts the workflow in YAML as an executable spec, using primitives like task, step, if, while, call, parallel, gather, and state operations instead of scattering logic across Python and prompts.

Justy And the plain-language version is, your agent gets an actual flowchart. Not just vibes and a giant system prompt. [chuckles] That is a lot easier to reason about when something goes sideways.

Cody They also split specification from the runtime harness, which adds tools, sandboxing, checkpointing, logging, replay, and resume. That makes the workflow more portable in principle.

Justy The explicit state piece matters too. Workflows keep named variables, steps can save outputs, and templates pull in only what each step needs, which helps control context creep.

Cody They even make conversation history explicit: a task can start fresh, while a step can continue a persistent thread. That gives you more control over memory, performance, and reproducibility.

Justy The example that made it click for me was a research assistant that generates search queries, calls a search-and-summarize submodule, then writes a report. Not flashy, but very shippable.

Cody And submodules are just workflows calling other workflows, with parallel and gather for concurrency. So you can compose a deeper agent from smaller pieces instead of one monster prompt.

Justy This is where I think the audience splits. If you're a solo builder, this could be a nice way to keep a weekend project understandable. If you're a company, I could see it fitting research ops, internal assistants, paper triage, maybe software workflows where auditability matters. The question is whether it's research-only.

Cody My read is no. The runtime features make it feel closer to deployable infrastructure than a demo, though YAML can still become a maze if the workflow grows without discipline.

Justy I do like that they paired benchmarks with a user study on interpretability and accessibility. Authoring experience is part of the product, not just task scores.

Cody Methodology-wise, I buy the broader claim more than any single number: explicit control flow should help on long-horizon tasks. I'd still want stronger ablations around context management and debugging nested state.

Justy So if someone wants to build next, the obvious starting point is the GitHub repo, ScaleML slash AgentSPEX, and try one of the ready-made deep research or scientific research agents.

Cody Then do a side-by-side. Recreate the same workflow in AgentSPEX and in a Python graph framework. Measure edit time, number of files touched, and whether you can replay a failed run. That's a very honest comparison.

Justy For a solo builder, I'd make a tiny literature scout. One workflow to generate search queries, one submodule to summarize sources, one final writer step to produce a markdown brief. Keep context tight on purpose so each step only sees what it needs.

Cody And if you want to stress the language, add parallel searches plus a while loop with an iteration cap for follow-up queries. That gets you branching, concurrency, and state without building a huge system. [sighs] Which is probably enough complexity for one weekend.

Justy That's AgentSPEX. A more inspectable way to build agents, provided your flowchart doesn't turn into a spaghetti diagram on Cody's countertop.