Ep 304 article 11:10 w/ Justy & Cody

Harness engineering: leveraging Codex in an agent-first world

Justy and Cody dig into OpenAI’s writeup on building a product with Codex doing all the coding, and why the real shift is from typing code to designing an environment agents can reliably operate in. They cover the no-manual-code constraint, the repo-as-system-of-record approach, agent-readable docs, isolated worktrees, UI and observability access, and why this matters for teams trying to ship faster without drowning in review and QA.

Script: GPT-5.4. Voice: ElevenLabs.

Transcript

Justy You’re telling me the code factory is not the model... it’s the scaffolding around the model.

Cody [chuckles] Yeah. The glamorous headline is "agent wrote the code." The less glamorous truth is somebody had to build the kitchen, label the drawers, and stop the stove from catching fire.

Justy And I flew all the way to DC for that sentence. Also, your apartment is somehow too warm and too dry at the same time. Welcome back to Exploring Next, episode 304. I’m Justy, here with Cody, and today we’re talking about harness engineering and why this matters right now.

Cody Right, because the pain people actually feel isn’t just writing code. It’s review queues, broken tests, flaky UI checks, somebody forgetting the setup steps, then everybody losing half a day. This article is basically about turning that mess into something an agent can survive.

Justy Yeah, and that’s the part I care about. If you’re a team lead or PM, the fantasy is not "wow, a million lines." It’s "can I get useful changes shipped without burning my best people on babysitting?" Because that’s the real tax.

Cody Exactly. The article’s claim is pretty wild though: brand new product, started from an empty repo in late August 2025, around a million lines of code after five months, about 1,500 pull requests, and humans never manually wrote code in the repo. Application logic, tests, CI, docs, observability, internal tools... all Codex.

Justy [laughs] That "never manually wrote code" line is the kind of thing that makes half the internet stand up and half the internet lie down.

Cody It should, but the interesting part isn’t the purity test. It’s that they say it shipped with internal daily users and external alpha testers, and they estimate roughly a tenth of the time versus hand-coding. So the question becomes... okay, what made that even remotely believable?

Justy Cody, break that down for me without going full documentary narrator.

Cody [exhales] In a world where the weary engineer approaches the repo... no, fine. The core idea is humans stop being typists and become environment designers. When the agent fails, the answer isn’t "prompt harder." It’s "what capability is missing, and how do we make it visible and enforceable?"

Justy That’s a big product shift too. You’re not managing a feature backlog the old way. You’re managing legibility. Can the system see the spec, the quality bar, the architecture, the user journey... or is half the company’s brain still trapped in docs nobody opens?

Cody Yeah, and they were very explicit about that. They tried the giant AGENTS.md approach and said it failed. Too much context crowds out the task, goes stale fast, and can’t be checked mechanically. So AGENTS.md became a short map, like roughly a hundred lines, and the actual knowledge moved into a structured docs directory inside the repo.

Justy Which I loved. Table of contents, not encyclopedia. That’s such a normal-human lesson too. Nobody wants the airport novel version of instructions when they’re just trying to find the gate.

Cody [giggles] Justy will turn anything into a flight analogy. But yes. They had architecture docs, product specs, execution plans, design references, reliability docs, even a quality score doc per domain. And then they linted that knowledge base in CI, plus a doc-gardening agent opened PRs for stale docs. That’s the harness.
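
One way to picture the "short map plus linted docs" idea is a small check that could run in CI. The sketch below is not the tooling from the article. It assumes the map is a Markdown file named AGENTS.md at the repo root, treats the "roughly a hundred lines" figure as a hard limit, and only verifies that relative links point at files that exist. The function name `lint_agent_map` and the constant `MAX_MAP_LINES` are hypothetical.

```python
import re
from pathlib import Path

# Assumption: the "roughly a hundred lines" from the discussion, used as a hard cap.
MAX_MAP_LINES = 100

# Matches Markdown links like [text](path) or [text](path#anchor); captures the path.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)#\s]+)(?:#[^)]*)?\)")


def lint_agent_map(repo_root: Path, map_name: str = "AGENTS.md") -> list[str]:
    """Return human-readable problems; an empty list means the map passes."""
    map_path = repo_root / map_name
    if not map_path.is_file():
        return [f"{map_name} is missing"]

    problems: list[str] = []
    lines = map_path.read_text(encoding="utf-8").splitlines()

    # Keep the map short so it stays a table of contents, not an encyclopedia.
    if len(lines) > MAX_MAP_LINES:
        problems.append(f"{map_name} has {len(lines)} lines (limit {MAX_MAP_LINES})")

    # Every relative link must resolve to something inside the repo.
    for lineno, line in enumerate(lines, start=1):
        for target in LINK_RE.findall(line):
            if "://" in target or target.startswith("mailto:"):
                continue  # external links are out of scope for this sketch
            if not (repo_root / target).exists():
                problems.append(f"{map_name}:{lineno}: broken link to {target}")
    return problems


if __name__ == "__main__":
    found = lint_agent_map(Path.cwd())
    for problem in found:
        print(problem)
    raise SystemExit(1 if found else 0)
```

Run from the repo root, the script exits non-zero when the map is too long or points at a missing doc, which is enough for a CI job to fail on doc drift.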

Justy And that’s the market angle. This is not just for frontier labs. Any team with recurring workflows could steal this pattern. Maybe not the full no-human-code vow, which feels like a maniac move, but repo-local plans, agent-readable specs, and checks for doc drift? That’s very adoptable.

Cody The other big piece was making the app itself legible. They made each git worktree boot its own isolated app instance, then wired Chrome DevTools Protocol into the runtime so Codex could inspect DOM snapshots, screenshots, navigation... basically reproduce UI bugs itself instead of waiting for a human to squint at a screen.

Justy Which is huge, because otherwise your expensive new robot coworker still ends up texting you, spiritually speaking, "can you click the button and tell me what happened?" [laughs] That is not scale.

Cody And they did the same for observability. Each worktree got ephemeral logs, metrics, traces. The agent could query logs with LogQL and metrics with PromQL, then work against targets like startup under 800 milliseconds or no span over two seconds on critical flows. That’s way more concrete than "make it faster."
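
To make those targets concrete, here is a minimal sketch of a budget check an agent could run after collecting measurements. It does not query LogQL or PromQL. It assumes the startup time and span durations have already been fetched, and the `Span` shape, the `critical` flag, and `check_budgets` are hypothetical names chosen for illustration. Only the two thresholds come from the discussion.

```python
from dataclasses import dataclass

STARTUP_BUDGET_MS = 800.0  # "startup under 800 milliseconds"
SPAN_BUDGET_MS = 2000.0    # "no span over two seconds on critical flows"


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified trace span: a name, a duration, a criticality flag."""
    name: str
    duration_ms: float
    critical: bool = True


def check_budgets(startup_ms: float, spans: list[Span]) -> list[str]:
    """Return one message per violated budget; an empty list means all targets are met."""
    violations: list[str] = []

    # The startup target is "under" the budget, so hitting it exactly still fails.
    if startup_ms >= STARTUP_BUDGET_MS:
        violations.append(
            f"startup took {startup_ms:.0f} ms (budget: under {STARTUP_BUDGET_MS:.0f} ms)"
        )

    # Only spans on critical flows are held to the span budget.
    for span in spans:
        if span.critical and span.duration_ms > SPAN_BUDGET_MS:
            violations.append(
                f"span '{span.name}' took {span.duration_ms:.0f} ms "
                f"(budget: {SPAN_BUDGET_MS:.0f} ms)"
            )
    return violations
```

The point of the shape is that the agent gets a pass or fail answer with the offending span named, rather than an instruction like "make it faster."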

Justy [sighs] Meanwhile I’m over here grading my airport breakfast a B-minus and these agents are running six-hour overnight shifts. My wife asked what we were recording and I had to say, "it’s sort of about teaching software to read its own homework."

Cody [chuckles] That is, honestly, the cleanest explanation. And the six-hour run detail matters. Long-running agent work only pays off if the environment is isolated, inspectable, and recoverable. Otherwise you wake up to a beautifully formatted disaster.

Justy I do think there’s a limit, though. Teams hearing this and thinking "great, we can skip design discipline because the model is smart" are going to have a rough weekend. This only works if your repo is the truth and your feedback loops are brutal.

Cody Right. The article doesn’t really read like magic. It reads like systems engineering. The model is important, sure, but the compounding value came from worktrees, local tools, agent reviews, structured docs, and keeping context where the agent can actually touch it.

Justy [pause] So if somebody wants to try this without rebuilding their company by Friday, I’d start small. Put an AGENTS.md at the repo root, make it short, then add a docs folder with architecture, active plans, and product specs that are actually current.

Cody Then install Codex CLI and give it real tools, not just text prompts. Let it use gh, your test scripts, your formatter, your local app runner. If you can, add browser automation through Chrome DevTools Protocol or Playwright so it can verify UI behavior instead of guessing.

Justy And add one contained weekend project to the list, Cody. Build an isolated worktree setup where each task gets its own bootable app plus local logs and metrics. Even without the full stack from the article, that experiment will teach you fast where your process is invisible to the agent.
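
A possible starting point for that weekend project, sketched under several assumptions: each task gets its own git worktree next to the main checkout, a port derived from the task ID, and a local log directory. The naming scheme (`agent/<task>` branches, a `-wt-` path suffix, `.agent-logs`) and the port range are invented for illustration, and the hashed port can collide between tasks, so a real setup would need to confirm the port is free. The only external command used is `git worktree add -b <branch> <path>`.

```python
import hashlib
import subprocess
from dataclasses import dataclass
from pathlib import Path

PORT_BASE = 20000   # assumption: an unused local port range
PORT_RANGE = 10000


@dataclass(frozen=True)
class TaskWorkspace:
    task_id: str
    branch: str
    path: Path
    port: int
    log_dir: Path


def plan_workspace(repo_root: Path, task_id: str) -> TaskWorkspace:
    """Derive a deterministic worktree path, branch, port, and log dir for a task."""
    digest = hashlib.sha256(task_id.encode("utf-8")).hexdigest()
    port = PORT_BASE + int(digest[:8], 16) % PORT_RANGE
    path = repo_root.parent / f"{repo_root.name}-wt-{task_id}"
    return TaskWorkspace(
        task_id=task_id,
        branch=f"agent/{task_id}",
        path=path,
        port=port,
        log_dir=path / ".agent-logs",
    )


def worktree_command(ws: TaskWorkspace) -> list[str]:
    """Build the git command that creates the worktree on a new branch."""
    return ["git", "worktree", "add", "-b", ws.branch, str(ws.path)]


def create_workspace(repo_root: Path, ws: TaskWorkspace) -> None:
    """Create the worktree and its log directory; raises if git fails."""
    subprocess.run(worktree_command(ws), cwd=repo_root, check=True)
    ws.log_dir.mkdir(parents=True, exist_ok=True)
```

From there, the app runner for each task would be started inside `ws.path` on `ws.port`, writing its logs under `ws.log_dir`, so two agents working in parallel never share state.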

Cody Adding it to the list. Which is now, I think, longer than the flight delays board at Reagan. But yeah, this one passes the red-eye rule for me. If I still care after a bad travel day, it’s real.

Justy That’s episode 304 of Exploring Next. Turns out the future may involve less heroic typing and more building a workspace an agent can actually navigate... which, honestly, is a very Cody answer. I’ll see you after I re-grade your apartment climate.