Exploring Next

Exploring Next — Ep 342 w/ Justy & Cody — ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

ClawMark is a benchmark for evaluating AI agents as persistent coworkers across multi-day workflows in dynamic, stateful environments. Unlike existing benchmarks that run single-episode tasks in static environments, ClawMark tasks span multiple in-universe workdays, with exogenous state changes (emails arrive, calendars shift, files update) between turns, multimodal evidence (PDFs, audio, video, spreadsheets), and deterministic rule-based scoring by 1,537 Python checkers. The benchmark contains 100 tasks across 13 professional scenarios, run against five sandboxed services (filesystem, email, calendar, knowledge base, spreadsheet). Current frontier models reach a weighted score of 75.8 but only 20% strict task success, indicating that adaptation to changing state remains a core unsolved challenge.
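The gap between a high weighted score and a low strict success rate is easier to see with a small example. The sketch below is illustrative only: it assumes each checker carries a weight, that a task's weighted score is the weight-share of passing checkers, and that strict success means every checker passes. The summary above does not give ClawMark's actual formula, and the names `CheckResult`, `weighted_score`, `strict_success`, and `summarize` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """Outcome of one rule-based checker (hypothetical structure)."""
    name: str
    weight: float
    passed: bool


def weighted_score(results: list[CheckResult]) -> float:
    """Percentage of total checker weight earned on one task (assumed formula)."""
    total = sum(r.weight for r in results)
    if total == 0:
        return 0.0
    earned = sum(r.weight for r in results if r.passed)
    return 100.0 * earned / total


def strict_success(results: list[CheckResult]) -> bool:
    """A task counts as a strict success only if every checker passes."""
    return all(r.passed for r in results)


def summarize(tasks: dict[str, list[CheckResult]]) -> tuple[float, float]:
    """Return (mean weighted score, strict success rate in percent)."""
    n = len(tasks)
    mean_weighted = sum(weighted_score(t) for t in tasks.values()) / n
    strict_rate = 100.0 * sum(strict_success(t) for t in tasks.values()) / n
    return mean_weighted, strict_rate


# Made-up results for two tasks; not ClawMark data.
tasks = {
    "task_a": [
        CheckResult("reply_sent", 2.0, True),
        CheckResult("calendar_updated", 1.0, True),
        CheckResult("sheet_total_correct", 1.0, True),
    ],
    "task_b": [
        CheckResult("reply_sent", 2.0, True),
        CheckResult("calendar_updated", 1.0, True),
        # One missed update after a mid-task state change fails the whole task.
        CheckResult("sheet_total_correct", 1.0, False),
    ],
}

mean_weighted, strict_rate = summarize(tasks)
print(mean_weighted, strict_rate)
```

Under these assumptions, an agent can earn most of the available weight on every task and still fail strict success whenever a single checker fails, which is one way the two headline numbers can diverge.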
