Ep 305 tool 10:20 w/ Justy & Cody

Harness engineering for coding agent users

Justy and Cody dig into harness engineering for coding agents: the practical idea that trust in AI-written code comes less from the model itself and more from the guardrails, checks, and feedback loops wrapped around it. They unpack feedforward guides versus feedback sensors, deterministic tooling versus LLM-based judgment, and why teams should treat the human as the person tuning the harness instead of reviewing every tiny diff forever.

Script: GPT-5.4 Voice: ElevenLabs

Transcript

Justy If your coding agent writes twenty files and you still have to hover over it like it's assembling furniture without the manual... then yeah, we have a problem.

Cody [chuckles] And somehow the demo always ends right before the part where you open the diff and your shoulders drop.

Justy I'm in DC this week, I got off a very average flight, and Cody handed me coffee like we were about to record for twelve people who deeply care about static analysis.

Cody That is our exact lane. Also you flew cross-country and still packed like a man who thought spring in DC meant July in LA.

Justy [laughs] Welcome back to Exploring Next, episode 305. I'm Justy, here with Cody, and today we're talking about harness engineering for coding agent users... which sounds abstract, but it's really about whether AI code can earn any trust at all.

Cody Yeah. This matters right now because people are actually using these agents on real codebases, not toy apps. And the bottleneck isn't generation anymore, it's supervision. If every AI change still needs full human babysitting, the speed boost kind of evaporates.

Justy Right, because from a product angle the promise isn't just faster typing. It's lower review toil, fewer dumb loops, fewer tokens burned on retries, and maybe getting to a place where the agent can handle bounded work without somebody staring at every line.

Cody [exhales] The article's useful move is narrowing what "harness" means. Not everything that surrounds the model, but the outer layer that you, the user or team, build for your codebase: instructions, retrieval, tests, checks, custom rules, all the stuff that shapes and judges the agent's work.

Justy And the split I liked was guides versus sensors. Guides try to steer before the model acts. Sensors watch what happened after and feed back signals so it can fix itself. That's way more practical than just saying, "well, the model got better this month."

Cody Exactly. And then the author slices those again into computational and inferential. Computational is your CPU-side, deterministic stuff: tests, linters, type checks, structural analysis. Fast, cheap, repeatable. Inferential is semantic judgment, AI code review, LLM-as-judge. Richer... but slower, pricier, and a little slippery.

Justy Which is kind of the whole thing, right? You want the cheap alarm that goes off every time, and then the expensive smart friend who occasionally notices, "Hey, this technically works but it's weird." [giggles]

Cody Yes, and the article makes a subtle point there. Inferential sensors are most useful when they produce feedback the model can actually act on. So not vague scolding. More like custom linter messages or review comments phrased so the agent can self-correct on the next pass.
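A minimal sketch of the "actionable linter message" idea Cody describes, written in Python. The rule itself (no bare print calls, use a module logger) and the names Finding, check_no_print, and render are hypothetical, chosen only for illustration. The point is that each message carries a concrete fix instruction an agent could follow on its next pass.

```python
import ast
from dataclasses import dataclass


@dataclass
class Finding:
    line: int
    rule: str
    fix: str  # instruction phrased so an agent can act on it directly


def check_no_print(source: str) -> list[Finding]:
    """Flag bare print() calls and say exactly what to do instead.

    The 'no-print' rule is a hypothetical team convention.
    """
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "print"
        ):
            findings.append(
                Finding(
                    line=node.lineno,
                    rule="no-print",
                    fix=(
                        "Replace print(...) with logger.info(...) and add "
                        "'logger = logging.getLogger(__name__)' at module top "
                        "if it is missing."
                    ),
                )
            )
    return sorted(findings, key=lambda f: f.line)


def render(findings: list[Finding], path: str) -> str:
    """Format findings as path:line: [rule] fix, one per line."""
    return "\n".join(f"{path}:{f.line}: [{f.rule}] {f.fix}" for f in findings)
```

Feeding the rendered text back into the agent's context is what turns the check from a plain alarm into coaching: the message names the replacement, not just the violation.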

Justy Positive prompt injection, basically. Which sounds sinister, but in practice it's just better coaching. And that's a real user story: teams already have conventions trapped in somebody's head or buried in old docs. The harness turns those into something the agent can consume every single time.

Cody [pause] Also, this is where timing matters. Fast computational sensors should sit way left, before commit if possible. Pre-commit hooks, quick tests, type checks. Then heavier things later in CI: broader review, mutation testing, maybe a more semantic pass that looks at the larger change. You don't want to spend premium judgment on code that already failed the easy stuff.
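The ordering Cody describes can be sketched as a small Python runner that executes checks cheapest first and skips the expensive ones once something has failed. Sensor, run_staged, and the cost numbers are assumptions made up for this sketch, not part of any particular tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Sensor:
    name: str
    cost: int  # rough relative cost; lower runs earlier
    run: Callable[[], bool]  # returns True when the check passes


def run_staged(sensors: list[Sensor]) -> list[tuple[str, str]]:
    """Run sensors cheapest first; skip the rest after the first failure."""
    results = []
    failed = False
    for sensor in sorted(sensors, key=lambda s: s.cost):
        if failed:
            # Do not spend expensive judgment on code that already failed.
            results.append((sensor.name, "skipped"))
            continue
        passed = sensor.run()
        results.append((sensor.name, "passed" if passed else "failed"))
        failed = not passed
    return results
```

In practice the cheap sensors would wrap lint, type-check, and unit-test commands, and the expensive one would wrap an LLM review call; here they are plain callables so the control flow stays visible.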

Justy Cody, you are in documentary narrator mode again. Somewhere a penguin is learning about pre-commit hooks.

Cody [laughs] In the wild, the junior agent approaches the type checker... no, but seriously, that's the architecture choice. Put deterministic controls beside the agent all the time, and use inferential controls where the extra cost buys you something.

Justy Midway tangent, your weekend list still alive? Because this article is basically telling teams to keep adding chores forever. New harness rule, new sensor, new drift check... I'm like, this is your apartment whiteboard in software form.

Cody [sighs] The list is under control. But yes, the human's job becomes steering the harness. If the agent repeats a mistake, don't just fix the diff again. Improve the guide or sensor so the same mistake gets less likely next time.

Justy That part I buy completely. It shifts the human from line-by-line hall monitor to system tuner. Very PM-friendly too, because once a team sees recurring failure modes, they can codify them. Architecture boundaries, logging standards, bootstrap steps, even codemods with something like OpenRewrite.
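One rough way to spot the recurring failure modes worth codifying, sketched in Python. It assumes the team tags each manual correction with a short label; the tag names, the threshold, and the function recurring_mistakes are all hypothetical.

```python
from collections import Counter


def recurring_mistakes(correction_tags: list[str], threshold: int = 3) -> list[str]:
    """Return tags seen at least `threshold` times, i.e. candidates for a new guide or sensor."""
    counts = Counter(correction_tags)
    return sorted(tag for tag, n in counts.items() if n >= threshold)


# Example: tags a reviewer attached to fixes made on agent-written diffs.
log = ["wrong-logger", "layer-violation", "wrong-logger", "naming", "wrong-logger"]
print(recurring_mistakes(log))
```

Anything that crosses the threshold is a signal to change the harness (an instruction, a lint rule, a structural test) instead of fixing the same diff a fourth time.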

Cody And the article groups the harness into maintainability, architecture fitness, and behavior. Maintainability is the easy win because tooling already exists. Duplicate code, complexity, style drift, missing coverage... machines are good at that. Architecture fitness is things like module boundaries or performance expectations. Behavior is the hard one, because if the spec is fuzzy, no sensor can reliably rescue that.

Justy Yeah, that's the honest limit. If the human asked for the wrong thing, you don't get correctness by sprinkling judges on top. Green tests can still mean the agent wrote tests for its own misunderstanding, which is like grading your own airport sandwich and giving it an A-minus because you were hungry.

Cody [chuckles] You would grade a sandwich. But that's the strongest caution in the piece: these harnesses absolutely raise trust for structure and maintainability, somewhat for semantics, and much less for misunderstood intent or unnecessary features. So autonomy goes up in slices, not all at once.

Justy So if somebody wants to get hands-on after this, I'd start small. Add an AGENTS.md or equivalent instructions file with your repo conventions. Then wire a pre-commit run for tests, lint, and type checks that the agent sees immediately. Give it the bumpers before you ask for fancy moves.

Cody And build one custom structural check. ArchUnit if you're on the JVM, or a lightweight script that enforces folder boundaries and import rules. Then try custom linter output with actionable fix text. If you're curious about codemods, go read up on OpenRewrite recipes and have the agent help draft one.
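For teams not on the JVM, the "lightweight script that enforces import rules" could look something like the Python sketch below. The FORBIDDEN layering table and the package names (domain, db, web) are invented for illustration; a real repo would substitute its own boundaries.

```python
import ast

# Hypothetical layering rule: key = top-level package,
# value = packages that it must not import.
FORBIDDEN = {
    "domain": {"web", "db"},
    "db": {"web"},
}


def imported_packages(source: str) -> set[str]:
    """Return top-level package names imported by the given source (absolute imports only)."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return found


def boundary_violations(package: str, source: str) -> list[str]:
    """List violations for a file belonging to `package`, with a fix hint in each message."""
    banned = FORBIDDEN.get(package, set())
    return [
        f"{package} must not import {name}; depend on an interface in {package} instead"
        for name in sorted(imported_packages(source) & banned)
    ]
```

Run from a pre-commit hook over changed files, a check like this gives the agent a deterministic boundary signal, and the message text doubles as the actionable fix hint mentioned earlier.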

Justy I'm adding that to the list. Not your list, my list. Anyway, that's Exploring Next. If you hear a faint sound after we stop, it's Cody opening another note titled "weekend" and pretending this time he means it.