Ep 311 tool 3:34 w/ Justy & Cody

Kimi K2.6 runs agents for days — and exposes the limits of enterprise orchestration

Exploring Next, episode 311. We look at Kimi K2.6 and why agents that run for hours or days are exposing a weak spot in enterprise orchestration, governance, and state management.

Script: GPT-5.4 mini
Voice: Inworld TTS 1.5 Max

Transcript

Justy Exploring Next, episode 311. Today’s about Kimi K2.6 and why agents that keep working for hours, even days, are starting to break the tools people already bought.

Cody And that matters right now because a lot of teams are already past toy demos. They’re handing agents real work, then discovering the orchestration layer was built for something that finished before lunch.

Justy That’s the user story I keep coming back to. If you’re in an enterprise, you don’t care that an agent can think for five days if you can’t tell what it changed on day three.

Cody Right, and Moonshot’s pitch with Kimi K2.6 is basically continuous execution. They say internal agents ran for hours, and one ran for five straight days on monitoring and incident response.

Justy Five days is wild. That sounds less like a chat session and more like a system component you have to trust.

Cody Exactly. The clever part is their improved Agent Swarms setup. Moonshot says it can manage up to 300 sub-agents across 4,000 coordinated steps at once, and the model itself decides how orchestration happens.

Justy So not a fixed lead-agent playbook with rigid roles?

Cody Yeah, that’s the contrast. Claude Code and Codex both use structured orchestration with lead agents, subagents, or background execution, while K2.6 leans more on the model to figure out the control flow. I think that’s interesting, but also a little fragile.

Justy Fragile in the product sense, not the demo sense.

Cody Exactly. Long-running agents keep touching tools, APIs, and databases while the world changes underneath them. If the task runs for minutes, you can get away with loose state. If it runs for days, state management becomes the whole problem.

Justy And for buyers, that becomes governance fast. If an agent can generate code or system changes faster than review cycles, the bottleneck moves to accountability, not capability.

Cody That’s what ArmorCode’s CPO was pointing at. It’s not enough to scan after the fact. You need context, prioritization, and a clear paper trail for what the agent did and why.

Justy So who actually uses this first? My read is platform teams, security teams, and very early adopters who already have automation muscle. Normal product teams will hit the adoption barrier when they ask, ‘who owns rollback?’

Cody Yeah, and also teams doing incident response, code maintenance, or background monitoring. K2.6 is available on Hugging Face, through its API, in Kimi Code, and in the Kimi app, so the surface area is broad enough for experiments.

Justy If I were a builder, I’d try one weekend project: take a repo, give an agent a long-running maintenance task, and force it to checkpoint state every few steps. Then break the environment and see if it recovers.
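
Editor's sketch of the checkpoint-and-recover experiment described above. This is a minimal illustration under stated assumptions, not anything Moonshot or Kimi K2.6 provides: the function names (`save_checkpoint`, `load_checkpoint`, `run_task`), the JSON file layout, and the `fail_at` switch that simulates a broken environment are all hypothetical. The agent's work is stood in for by a plain list of step labels.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    # Write to a temp file, then rename over the target, so a crash
    # mid-write never leaves a half-written checkpoint behind.
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    # No checkpoint yet means a fresh run starting at step 0.
    if not os.path.exists(path):
        return {"next_step": 0, "done": []}
    with open(path) as f:
        return json.load(f)


def run_task(steps, path, every=2, fail_at=None):
    # Resume from the last saved state, checkpointing every `every` steps.
    # `fail_at` (hypothetical) simulates the environment breaking mid-run.
    state = load_checkpoint(path)
    for i in range(state["next_step"], len(steps)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError(f"environment broke at step {i}")
        state["done"].append(steps[i])  # stand-in for real agent work
        state["next_step"] = i + 1
        if state["next_step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Any step completed after the last checkpoint is replayed on resume, so in a real run those steps would need to be safe to repeat, or the checkpoint interval would need to shrink to one.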

Cody I’d add a second test. Run a tiny swarm with one lead task and a few sub-agents, then log every tool call and decision. The interesting metric is not completion. It’s how messy the recovery looks when something changes mid-run.
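
Editor's sketch of the logging test described above, assuming a very small in-memory log. The `RunLog` class, its method names, and the `recovery_cost` measure (how many records follow the first failed tool call) are hypothetical choices for illustration, not part of any Kimi, Claude Code, or Codex API. "Messiness" could be measured many other ways.

```python
import json
import time


class RunLog:
    """Append-only record of agent decisions and tool calls."""

    def __init__(self):
        self.records = []

    def record(self, agent, kind, **detail):
        # Every entry gets a sequence number and timestamp for later replay.
        entry = {"seq": len(self.records), "ts": time.time(),
                 "agent": agent, "kind": kind, **detail}
        self.records.append(entry)
        return entry

    def call_tool(self, agent, name, fn, *args):
        # Run the tool and log the outcome whether it succeeds or fails.
        try:
            result = fn(*args)
        except Exception as exc:
            self.record(agent, "tool_call", tool=name, args=list(args),
                        ok=False, error=str(exc))
            raise
        self.record(agent, "tool_call", tool=name, args=list(args),
                    ok=True, result=result)
        return result

    def recovery_cost(self):
        # Number of records logged after the first failed tool call:
        # a crude proxy for how much work the recovery took.
        for r in self.records:
            if r["kind"] == "tool_call" and not r["ok"]:
                return len(self.records) - r["seq"] - 1
        return 0

    def dump(self):
        # JSON Lines output; default=str keeps odd result types printable.
        return "\n".join(json.dumps(r, default=str) for r in self.records)
```

A run would wrap each sub-agent's tool invocations in `call_tool` and log the lead task's choices with `record(..., "decision", ...)`, then compare `recovery_cost()` across runs where the environment is changed mid-run.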

Justy That’s the real shift here. Kimi K2.6 is interesting, but the bigger story is that orchestration is becoming a product problem, a training problem, and a trust problem all at once.

Justy We’ll keep following where this goes. I’m Justy, and this was Exploring Next.