Ep 379 article 6:53 w/ Justy & Cody

Validating agentic behavior when “correct” isn’t deterministic

GitHub's new validation framework for agentic systems moves beyond brittle, step-by-step testing toward outcome-focused validation. When autonomous agents (like Copilot Coding Agent) interact with real environments, correctness is no longer deterministic—loading screens may appear or vanish, timing shifts, and multiple valid action sequences can succeed. The framework uses dominator analysis and graph-based modeling (Prefix Tree Acceptors) to distinguish between essential outcomes and incidental noise, requiring only 2–10 successful traces to build a ground-truth model. Cody finds the approach clever but questions whether it scales beyond UI automation; Justy sees real market traction in CI/CD reliability and enterprise adoption.

Script: Haiku 4 Voice: Elevenlabs-V2S

Transcript

Justy So GitHub dropped this whole framework for testing agents when the test itself can't tell a real failure from a loading screen that just took an extra second.

Cody Right, and here's the thing: it's a genuinely thorny problem. You've got Copilot Coding Agent navigating VS Code, clicking around, waiting for things to appear. Traditional testing assumes every run follows the same path. But agents are non-deterministic by design. They adapt. So the test fails even though the agent actually succeeded, and your CI pipeline blocks a deploy for no reason.

Justy That's the false negative thing they keep mentioning.

Cody Exactly. And the problem gets worse the more realistic the environment. Timing drifts, UI elements render differently, loading screens appear or don't. You could record a trace on Tuesday, run it on Wednesday, and get a different result even though the agent nailed the task both times. The validation broke, not the agent.

Justy Okay, but how do you actually fix that without just giving up and saying 'we'll accept any behavior as long as the outcome is right'?

Cody That's where the graph stuff comes in. They're building a Prefix Tree Acceptor: nodes are observable states like screenshots, edges are the actions between them. You collect maybe two to ten successful runs, merge them into one unified graph, and apply dominator analysis. That's from compiler theory: one node dominates another if every path from the start to that second node has to go through the first. So you're automatically finding which states are mandatory, like 'search results appeared,' versus incidental, like a loading screen.
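
A minimal sketch of the merge-then-dominate idea Cody describes, not GitHub's implementation. It assumes each observed state has already been reduced to a label (so equivalent screenshots share a name), which collapses the prefix tree into a single merged graph. The function names, the `<start>`/`<end>` sentinels, and the example state labels are all hypothetical.

```python
from collections import defaultdict

START, END = "<start>", "<end>"  # hypothetical sentinel nodes

def build_graph(traces):
    """Merge successful traces into one graph of observed transitions."""
    succ = defaultdict(set)
    for trace in traces:
        path = [START, *trace, END]
        for a, b in zip(path, path[1:]):
            succ[a].add(b)
    return succ

def dominators(succ, start=START):
    """Iterative dataflow: dom[n] = {n} | intersection of dom[p] over predecessors p."""
    nodes = set(succ) | {n for targets in succ.values() for n in targets}
    preds = defaultdict(set)
    for a, targets in succ.items():
        for b in targets:
            preds[b].add(a)
    dom = {n: set(nodes) for n in nodes}
    dom[start] = {start}
    changed = True
    while changed:
        changed = False
        for n in nodes - {start}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

def essential_states(traces):
    """States that every successful path must pass through, in order."""
    dom = dominators(build_graph(traces))
    essential = dom[END] - {START, END}
    # Dominators of END form a chain, so the size of dom[] gives their order.
    return sorted(essential, key=lambda s: len(dom[s]))

traces = [
    ["editor_open", "search_box", "loading", "results", "file_open"],
    ["editor_open", "search_box", "results", "file_open"],
]
print(essential_states(traces))
# ['editor_open', 'search_box', 'results', 'file_open']
```

Because one run skipped `loading`, that state no longer sits on every path to the end, so it drops out of the essential set without anyone marking it as noise by hand.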

Justy So you're not saying 'click here, then wait, then look for this button.' You're saying 'these states must happen, but what happens in between them can vary.'

Cody Right. And the clever part is the equivalence detection. Two screenshots might look almost identical, or they might differ only by a timestamp. They built a three-tier system that escalates from cheap checks to expensive ones: fast perceptual hashes first, up to a multimodal LLM that gets asked, 'Are these the same logical state?' That feeds into the graph. The LLM isn't validating the whole trace. It's just answering a narrow question: 'Is this a meaningful difference or not?'
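
A rough sketch of the cheap-check-first, escalate-when-unsure pattern, under several assumptions. It uses a plain average hash as the perceptual hash, takes screenshots as small pre-downscaled grayscale grids, and shows only two tiers. The thresholds and the `ask_model` callback (a stand-in for the multimodal LLM question) are hypothetical, and the episode does not describe the framework's actual tiers in this detail.

```python
def average_hash(pixels):
    """Average hash: one bit per pixel, set when the pixel is brighter than the mean.

    `pixels` is a 2D list of grayscale values, assumed already downscaled.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)

def hamming(a, b):
    """Number of bit positions where two hashes differ."""
    return sum(x != y for x, y in zip(a, b))

def same_state(img_a, img_b, ask_model, same_max=1, different_min=6):
    """Decide whether two screenshots show the same logical state.

    Cheap tier: hash distance settles the clear cases.
    Expensive tier: only ambiguous pairs reach `ask_model` (hypothetical callback).
    """
    distance = hamming(average_hash(img_a), average_hash(img_b))
    if distance <= same_max:
        return True
    if distance >= different_min:
        return False
    return ask_model(img_a, img_b)
```

The design point is cost: most screenshot pairs are obviously the same or obviously different, so the model is consulted only for the narrow band in the middle.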

Justy Okay, so I'm sold on the idea. The real question is adoption. Is this built into GitHub Actions natively, or is it a thing you have to wire up yourself?

Cody That's where I get a little skeptical. This looks great in the blog post: UI automation in VS Code, containerized environments. But production is messier. You've got legacy APIs, weird edge cases, systems that don't render consistently. Does the PTA approach scale to that? And you need successful traces to start with. If you're testing a new agent workflow, you've got to manually seed it with two to ten clean runs. That's a human cost.

Justy But that's actually way better than writing assertions or maintaining a record-and-replay script. You're doing work once, upfront, and then it generalizes. And if you've got a mature agent workflow already running regularly, you've already got those traces. You're just mining what's already happening.

Cody True. And the explainability story is huge. If your test fails, you get a clear explanation—'essential state X was never reached'—instead of 'the trace diverged.' That's a trust multiplier for anyone shipping agents to production.
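
To make the explainability point concrete, here is a small sketch of a validator that checks a new trace against an ordered list of essential states and reports which one is missing. The function name, the message wording, and the state labels are hypothetical, and it assumes states have already been reduced to comparable labels.

```python
def validate(trace, milestones):
    """Check that every essential state appears in the trace, in order.

    Extra states between milestones are ignored. Returns (passed, explanation).
    """
    cursor = 0
    last_reached = None
    for milestone in milestones:
        try:
            # Search only the part of the trace after the previous milestone.
            cursor = trace.index(milestone, cursor) + 1
        except ValueError:
            where = f"after '{last_reached}'" if last_reached else "at any point"
            return False, f"essential state '{milestone}' was never reached {where}"
        last_reached = milestone
    return True, "all essential states were reached in order"

milestones = ["search_box", "results", "file_open"]
print(validate(["search_box", "loading", "results", "file_open"], milestones))
# (True, 'all essential states were reached in order')
print(validate(["search_box", "loading", "loading"], milestones))
# (False, "essential state 'results' was never reached after 'search_box'")
```

The failure message names the missing state and the last one that was reached, which is the kind of explanation Cody contrasts with a bare "the trace diverged."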

Justy Right. So the market angle is: if you're using agentic systems in production CI/CD, this solves a real pain. False negatives block deploys. Enterprise teams will pay attention because uptime and reliability matter.

Cody Okay, so Build Next. If you're already running an agent in a GitHub Actions workflow, start collecting successful traces. Screenshot the UI state at each step, or capture the API responses. Feed those into the PTA algorithm. GitHub probably has a reference implementation by now, or you can sketch it out yourself if you're comfortable with graph merging and dominator analysis.

Justy And the solo-builder version?

Cody Weekend project: take a simple browser automation agent—Playwright or Puppeteer—run it through a few successful scenarios, capture the DOM state at each step, build a PTA by hand, identify the essential milestones, and write a validator that checks for those instead of exact sequence matching. You'll immediately see how much noise disappears and how much more robust your tests become.
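
For the "capture the DOM state at each step" part of the weekend project, one possible starting point is to reduce each HTML snapshot to a fingerprint that ignores volatile text such as timestamps. This is a sketch under a loud assumption: that two pages count as the same state when their tag and `id` skeleton matches. The helper names are hypothetical, and the HTML string would come from your automation tool (for example, Playwright's `page.content()`).

```python
import hashlib
from html.parser import HTMLParser

class _Skeleton(HTMLParser):
    """Collects only tag names and id attributes, dropping all text content."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        ident = dict(attrs).get("id") or ""
        self.parts.append(f"{tag}#{ident}")

def dom_fingerprint(html):
    """Short, stable label for a DOM snapshot (assumption: structure defines state)."""
    parser = _Skeleton()
    parser.feed(html)
    parser.close()
    skeleton = "|".join(parser.parts)
    return hashlib.sha256(skeleton.encode("utf-8")).hexdigest()[:12]

tuesday = '<div id="results"><span>10:01</span><ul><li>a.py</li></ul></div>'
wednesday = '<div id="results"><span>10:02</span><ul><li>a.py</li></ul></div>'
spinner = '<div id="spinner"></div>'
print(dom_fingerprint(tuesday) == dom_fingerprint(wednesday))  # True
print(dom_fingerprint(tuesday) == dom_fingerprint(spinner))    # False
```

The fingerprints can then serve as the state labels when you merge traces by hand. One known limitation of this choice: a results list with a different number of items produces a different fingerprint, so you may need to loosen the skeleton for list-heavy pages.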

Justy That's the piece that matters. You feel the brittleness of the old way, and suddenly the graph approach clicks.