Ep 342 research 7:01 w/ Justy & Cody

ClawMark: A Living World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

ClawMark is a benchmark for evaluating AI agents as persistent coworkers across multi-day workflows with dynamic, stateful environments. Unlike existing benchmarks that run single-episode tasks in static environments, ClawMark spans multiple in-universe workdays with exogenous state changes (emails arrive, calendars shift, files update) between turns, multimodal evidence (PDFs, audio, video, spreadsheets), and deterministic rule-based scoring via 1,537 Python checkers. The benchmark contains 100 tasks across 13 professional scenarios running against five sandboxed services (filesystem, email, calendar, knowledge base, spreadsheet). Current frontier models reach 75.8 weighted score but only 20% strict task success, revealing that adaptation to changing state remains a core unsolved challenge.

Script: Haiku 4 Voice: ElevenLabs

Transcript

Justy So we're in a world where AI agents are supposed to be coworkers, right? Like, they stick around for days, they help you with email and calendars and files. But nobody's actually measuring whether they can handle that.

Cody Exactly. Every benchmark out there runs the agent inside a single session where the environment is frozen. You give it a task, it completes, environment resets. But real office work doesn't work that way. An email arrives while you're thinking. Someone updates a spreadsheet. The calendar shifts. And the agent has to notice, adapt, and keep going.

Justy Right. And the paper's called ClawMark—it's a benchmark specifically for multi-day, multi-turn agent workflows. That's what caught my attention. So what's actually different here compared to, say, OSWorld or tau-bench?

Cody ClawMark puts together three things that no other benchmark combines. First, tasks span multiple in-universe workdays, and each day is a separate turn. Second, the environment changes between turns independently of the agent: not because the agent did something, but because the world did. The paper calls announced changes 'loud events' and unannounced ones 'silent mutations'. Third, evidence is fully multimodal: PDFs, scanned documents, audio, video, and spreadsheets, not just text.

Justy Okay, so the infrastructure—they built actual stateful services, right? Not just mocked logs.

Cody Yes. Five of them, all running in Docker. Real filesystem, GreenMail for email, Notion-compatible knowledge base, Google Sheets–compatible spreadsheet, Radicale for calendars. They're actual stateful services, not snapshots. Between turns, they mutate. The agent sees the real current state when it wakes up on Day 2, not a cached version from Day 1.
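
A minimal sketch of the between-turn behavior Cody describes, assuming a plain in-memory dictionary in place of the Dockerized services. The names `Change`, `Scenario`, and `start_day` are hypothetical and are not taken from the ClawMark repo. A change with an announcement stands in for a loud event; a change without one stands in for a silent mutation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Change:
    """One exogenous change applied to the world between turns (hypothetical)."""
    apply: Callable[[dict], None]
    announcement: Optional[str] = None  # None means a silent mutation


@dataclass
class Scenario:
    """Toy stand-in for the sandboxed services: one dict holds all state."""
    world: dict
    changes_before_day: Dict[int, List[Change]] = field(default_factory=dict)

    def start_day(self, day: int) -> List[str]:
        """Apply every change scheduled before `day`; return only the announced ones."""
        notices = []
        for change in self.changes_before_day.get(day, []):
            change.apply(self.world)
            if change.announcement is not None:
                notices.append(change.announcement)
        return notices


def client_email(world):
    world["email"].append("Client: deadline moved to Friday")


def budget_cut(world):
    world["sheet"]["budget"] = 800


scenario = Scenario(
    world={"email": [], "sheet": {"budget": 1000}},
    changes_before_day={2: [Change(client_email, "1 new email"), Change(budget_cut)]},
)
day2_notices = scenario.start_day(2)
```

In this toy setup the agent is told about the email on Day 2 but has to reread the spreadsheet to learn that the budget changed, which is the kind of gap the silent mutations are meant to probe.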

Justy And the scoring is deterministic. No LLM-as-judge.

Cody Exactly. 1,537 Python checkers inspect the post-execution state of each service. A task is only admitted to the benchmark after two independent runs produce bit-identical checker verdicts. That takes judge subjectivity out of the scoring.
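
A rough illustration of rule-based checking and the two-run admission gate, under stated assumptions: the state layout, the checker names, and the idea that a "run" means replaying a scripted reference solution in a fresh environment are all hypothetical. The episode does not describe how the real checkers or the admission step are implemented.

```python
def check_review_moved(state):
    """Hypothetical checker: the calendar shows the rescheduled time."""
    return state["calendar"].get("review") == "14:00"


def check_reply_sent(state):
    """Hypothetical checker: some sent email mentions the reschedule."""
    return any("rescheduled" in message.lower() for message in state["email_sent"])


CHECKERS = [check_review_moved, check_reply_sent]


def verdicts(state):
    """Run every checker against the final state; no model is consulted."""
    return tuple(bool(checker(state)) for checker in CHECKERS)


def admit(run_once):
    """Admit a task only if two independent runs yield identical verdicts."""
    return verdicts(run_once()) == verdicts(run_once())


def reference_run():
    """Stand-in for executing a reference solution in a fresh sandbox."""
    return {
        "calendar": {"review": "14:00"},
        "email_sent": ["Review rescheduled to 2pm"],
    }


reference_verdicts = verdicts(reference_run())
admitted = admit(reference_run)
```

Because each checker is a pure function of the final state, the same state always yields the same verdict, which is what makes this style of scoring repeatable.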

Justy So what do the results actually show? Because a 75.8 weighted score sounds good on paper.

Cody The gap between weighted score and strict task success is the story. Claude Sonnet 4.6 leads at 75.8 weighted, but the best strict task success is only 20%. That means agents are making progress—partial credit—but almost never completing the full workflow end-to-end. On tasks with exactly three turns, six out of seven models degrade on Day 2, right after the first exogenous environment update hits.
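
To see how a high weighted score and a low strict success rate can coexist, here is a small worked example. The weights, the per-task averaging, and the function names are assumptions for illustration; the episode does not spell out the paper's exact weighting scheme.

```python
def weighted_score(results):
    """results: list of (weight, passed) pairs for one task. Returns 0 to 100."""
    total = sum(weight for weight, _ in results)
    earned = sum(weight for weight, passed in results if passed)
    return 100.0 * earned / total


def strict_success(results):
    """A task counts as a strict success only if every checker passed."""
    return all(passed for _, passed in results)


# Made-up checker outcomes for three tasks.
runs = {
    "task_a": [(1, True), (1, True), (2, True)],
    "task_b": [(1, True), (1, True), (2, False)],
    "task_c": [(1, True), (2, True), (1, False)],
}

mean_weighted = sum(weighted_score(r) for r in runs.values()) / len(runs)
strict_rate = sum(strict_success(r) for r in runs.values()) / len(runs)
```

With these invented numbers the mean weighted score is 75.0 while only one task in three passes strictly: a single failed checker sinks a task's strict result but removes only part of its weighted credit.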

Justy So the agent sees the world change and just... loses track.

Cody Or doesn't notice it changed at all. The paper calls it out explicitly: 'adaptation to changing state as a key open challenge.' This isn't about reasoning or tool use. It's about maintaining context when the ground truth shifts underneath you between turns.

Justy That's a real problem for a coworker agent. So who's actually going to use this benchmark?

Cody It's open-source—github.com/evolvent-ai/ClawMark—so anyone can run it. The repo includes the evaluation harness, the five service implementations, all 1,537 checkers, and the construction pipeline. For teams building coworker agents, this becomes part of your validation pipeline before you ship. The deterministic scoring means you can run it in CI/CD.

Justy What about solo builders? Someone working on an agent framework who wants to test their own system?

Cody The construction pipeline is the key. You can use it to generate new scenarios without running the full 100-task suite. Start with a couple of services, say email and calendar, and build a three-day task where external events happen between days. Run your agent, then check the deterministic verdict. You get real feedback on adaptation without needing the whole benchmark infrastructure.
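
The loop Cody outlines can be sketched in a few lines. This does not use the ClawMark construction pipeline, whose interface is not described in the episode; `run_task`, the two stub agents, and the checker are hypothetical stand-ins meant only to show the shape of a three-day task with one unannounced change.

```python
import copy


def run_task(agent, initial_state, changes_by_day, days=3):
    """Run `agent` once per day, applying exogenous changes before each day."""
    state = copy.deepcopy(initial_state)
    for day in range(1, days + 1):
        for change in changes_by_day.get(day, []):
            change(state)
        agent(day, state)
    return state


def caching_agent():
    """Stub agent that reads the deadline on Day 1 and never rereads it."""
    memory = {}

    def act(day, state):
        if day == 1:
            memory["deadline"] = state["sheet"]["deadline"]
        if day == 3:
            state["calendar"]["submit"] = memory["deadline"]

    return act


def rereading_agent():
    """Stub agent that checks the current deadline before acting on Day 3."""
    def act(day, state):
        if day == 3:
            state["calendar"]["submit"] = state["sheet"]["deadline"]

    return act


def check_submit_matches_deadline(state):
    """Deterministic checker on the final state."""
    return state["calendar"].get("submit") == state["sheet"]["deadline"]


def move_deadline(state):
    state["sheet"]["deadline"] = "Thu"  # unannounced change before Day 2


initial = {"sheet": {"deadline": "Fri"}, "calendar": {}}
changes = {2: [move_deadline]}

stale_ok = check_submit_matches_deadline(run_task(caching_agent(), initial, changes))
fresh_ok = check_submit_matches_deadline(run_task(rereading_agent(), initial, changes))
```

The agent that trusts its Day 1 memory fails the check, and the one that rereads the spreadsheet passes, which mirrors the adaptation failure discussed earlier in the episode.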

Justy And the core insight here is that coworker agents aren't about single-turn reasoning. They're about persistence and adaptation when the world changes. ClawMark measures that in a way nothing else does right now.

Cody Exactly. The 20% strict task success rate is actually the honest number. It's not a failure of the benchmark; it's a failure of current agent systems to handle real coworker workflows. ClawMark just made it visible.

Justy This is Exploring Next, episode 342. We'll link the benchmark and the paper in the show notes. If you're building agent systems, this is worth your time.