Kimi K2.6 Is the Open Model Release
Justy and Cody dig into why Kimi K2.6 lands at exactly the right moment for people trying to run long-lived coding agents: it’s open, strong on coding, and can actually see screenshots and video without bolting on a separate vision model. They unpack the 1T MoE design with 32B active parameters, the 262K context window, benchmark wins that matter, and Moonshot’s bigger bet on tool-heavy, long-horizon agent work. They also separate the impressive parts from the marketing gloss, then close with concrete stuff to try this week.
Script: GPT-5.4
Voice: ElevenLabs
Transcript
Justy If your coding agent still falls apart the second you hand it a screenshot and a repo at the same time... yeah, this one matters.
Cody [chuckles] Also if it melts down around hour two, which is a very glamorous failure mode for our extremely specific little show.
Justy You flew all the way to LA to say that in my kitchen. And you're wearing a hoodie in eighty degrees again, Cody.
Cody The red-eye rule. If I still care after airport coffee and no sleep, it's real. This clears that bar.
Justy Welcome back to Exploring Next, episode 307. I'm Justy, Cody is here in person, and today we're talking about Kimi K2.6 and why agent builders are suddenly paying attention.
Justy The short version is, people do not need another model that looks amazing in a two-minute demo. They need something that can code, use tools, keep context, and deal with visual stuff like bug screenshots or UI recordings without turning the workflow into a relay race.
Cody Right, and K2.6 is unusually pointed at that. Moonshot says it's a one-trillion-parameter mixture-of-experts model, but only 32 billion are active at a time. So you get the bigger capacity story without paying the full cost on every token. And in their published runs they push it to a 262K context window, plus native image and video input.
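For a sense of scale, the two parameter counts Cody quotes can be turned into a ratio. This is only back-of-the-envelope arithmetic on the figures stated in the episode; the constant and function names are made up for illustration, and the result compares parameter counts only. It is not a cost, latency, or memory estimate.

```python
# Arithmetic on the two figures quoted in the episode (not an official spec sheet).
TOTAL_PARAMS = 1_000_000_000_000  # "one-trillion-parameter mixture-of-experts model"
ACTIVE_PARAMS = 32_000_000_000    # "only 32 billion are active at a time"


def active_fraction(active: int, total: int) -> float:
    """Share of the total parameters that participate in a single token."""
    return active / total


fraction = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
ratio = TOTAL_PARAMS / ACTIVE_PARAMS

print(f"Active share per token: {fraction:.1%}")
print(f"Total-to-active ratio: {ratio:.2f}x")
```

So on these numbers roughly one parameter in thirty-one is used for any given token, which is the "bigger capacity without the full cost on every token" point in concrete form.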
Justy Which is a product identity thing as much as a model spec thing. A lot of open coding models make you choose. Do you want the one that writes code, or the one that can look at the screen. K2.6 is saying, no, the working environment is messy, that's the job.
Cody Exactly. And the modes they expose tell you what they think the job is. Thinking mode, instant mode, preserve-thinking mode, interleaved thinking, multi-step tool calls... this is not aimed at casual chat. It's aimed at agents that keep doing stuff while you go make dinner.
Justy [laughs] Preserve-thinking mode sounds like me trying to remember why I walked into the garage.
Cody For the model, it means don't throw away the chain of work just because the task gets long. And that's important for tool use, because the failure usually isn't the first call. It's tool call nine, where it forgets what file it changed, or why the test was failing, or which screenshot matched the target.
Justy Yeah, and the benchmark table kind of reflects that. With the usual asterisk that vendor tables are vendor tables. They choose the harness, the settings, all of it. But even with that grain of salt, the numbers are hard to shrug off. HLE-Full with tools at 54.0, ahead of GPT-5.4 at 52.1, Claude at 53.0, Gemini 51.4. DeepSearchQA was even cleaner.
Cody And coding is where most people will look first. Terminal-Bench 2.0 at 66.7, ahead of GPT-5.4 and Claude, though behind Gemini. SWE-Bench Pro at 58.6, which beats the listed baselines. SWE-Bench Verified is tighter, basically in the same room as Claude and Gemini. So this isn't some cute open model that wins one cherry-picked row and then disappears.
Justy The vision rows are almost more interesting to me. Because if you're building a front-end agent, being able to compare the screenshot to the target and then edit code and check again... that's just a better loop than feeding a text-only model a sad paragraph about where the button probably is.
Cody [exhales] Yeah. That's the real differentiator. Qwen and Gemma have multimodal lines, sure. But the sharper comparison is coding-first open models people already drop into agent stacks. A lot of those are still basically text-and-code systems. K2.6 is trying to make visual state a normal part of the same loop.
Justy Which, okay... little tangent. My wife asked what we were recording, and I said, "a model that can look at a broken website and fix it." And she went, "so... eyes?" [giggles] Honestly, fair question.
Cody That's a better summary than half the launch posts, if we're being honest.
Justy Back to the long-run stuff, because I think that's where Moonshot is aiming the camera. They claim one run on a Mac went more than twelve hours, over four thousand tool calls, fourteen iterations, and took throughput from about fifteen tokens per second to roughly one-ninety-three. That's not a toy task.
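The figures in that claim are easier to weigh as ratios. The sketch below only restates the numbers quoted above; because "more than twelve hours" and "over four thousand tool calls" are both lower bounds, the per-call pacing is a rough indication rather than a measurement, and the variable names are illustrative.

```python
# Figures as quoted in the episode; these are Moonshot's claims, not audited results.
BASELINE_TOK_PER_S = 15.0  # "about fifteen tokens per second"
FINAL_TOK_PER_S = 193.0    # "roughly one-ninety-three"
TOOL_CALLS = 4000          # "over four thousand tool calls" (lower bound)
RUN_HOURS = 12.0           # "more than twelve hours" (lower bound)

# How much faster the optimized result is than the starting point.
speedup = FINAL_TOK_PER_S / BASELINE_TOK_PER_S

# Average pacing of the run, using the two lower bounds as stand-ins.
seconds_per_call = RUN_HOURS * 3600 / TOOL_CALLS

print(f"Throughput gain: about {speedup:.1f}x")
print(f"Average pacing: about {seconds_per_call:.1f} s per tool call")
```

That works out to a gain of roughly thirteen times, with a tool call landing about every eleven seconds on average for the whole run.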
Cody No, and the other case is more believable to me because it's uglier. They worked on exchange-core, an older open-source matching engine, for thirteen hours, tried a bunch of optimization paths, changed more than four thousand lines, and kept measuring. Those are still company claims, not an independent audit, but the shape of the work is right. Read, modify, benchmark, reject, try again.
Justy [sighs] This is where you go full documentary narrator on me. "In the wild, the agent returns to the terminal..."
Cody [chuckles] ...carefully observing the benchmark, aware that one bad patch could send it back into the brush. But seriously, Justy, that's the point. The useful question isn't "can it solve a benchmark." It's "does it stay coherent when the task gets boring."
Justy And that's why the timing matters. There are communities trying to run always-on coding agents, background workers, stuff that schedules, monitors, fixes, messages. Those users don't care about a heroic answer in one shot. They care whether the run survives the afternoon... and the provider doesn't tap the brakes halfway through.
Cody Yeah, model quality and serving quality get tangled together. Moonshot has weights on Hugging Face, their own API, Kimi app, all that. The article notes Novita as a provider, and people will absolutely watch for broader availability through places like Fireworks. Because the best long-run model is still useless if congestion kills the session at hour six.
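One way to make "serving quality" measurable is to record failures and latency per provider over the same sequence of steps. The sketch below is a minimal illustration under loud assumptions: `ProviderStats`, `probe`, and both provider functions are hypothetical names, and the providers are local stand-ins that simulate congestion. No real Moonshot, Novita, or Fireworks API is called or implied.

```python
import time
from dataclasses import dataclass, field
from statistics import median
from typing import Callable, List


@dataclass
class ProviderStats:
    """Per-provider tally of successful call latencies and failed calls."""
    name: str
    latencies: List[float] = field(default_factory=list)
    failures: int = 0

    @property
    def attempts(self) -> int:
        return len(self.latencies) + self.failures

    @property
    def failure_rate(self) -> float:
        return self.failures / self.attempts if self.attempts else 0.0

    @property
    def median_latency(self) -> float:
        return median(self.latencies) if self.latencies else float("nan")


def probe(name: str, call: Callable[[str], str], prompts: List[str]) -> ProviderStats:
    """Send every prompt through `call`, timing successes and counting exceptions."""
    stats = ProviderStats(name)
    for prompt in prompts:
        start = time.perf_counter()
        try:
            call(prompt)
        except Exception:
            stats.failures += 1
            continue
        stats.latencies.append(time.perf_counter() - start)
    return stats


def steady_provider(prompt: str) -> str:
    """Hypothetical stand-in for a provider that always answers."""
    return f"ok: {prompt}"


def make_flaky_provider(fail_every: int) -> Callable[[str], str]:
    """Hypothetical stand-in that raises on every `fail_every`-th call."""
    count = 0

    def call(prompt: str) -> str:
        nonlocal count
        count += 1
        if count % fail_every == 0:
            raise TimeoutError("simulated congestion")
        return f"ok: {prompt}"

    return call


prompts = [f"step {i}" for i in range(10)]
results = [
    probe("steady", steady_provider, prompts),
    probe("flaky", make_flaky_provider(3), prompts),
]
for r in results:
    print(f"{r.name}: {r.failures}/{r.attempts} failed ({r.failure_rate:.0%})")
```

In a real comparison the stand-ins would be replaced by actual client calls, and the prompt list by a long tool trace, since the failure Cody describes shows up hours into a session rather than in the first few requests.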
Justy I'm giving that reality an A-minus. Great model, terrible if the pipe is shaky. Also I'm putting "Cody grades my grading" on the list.
Cody What list. Your list is longer than the board at LAX. But yeah, my honest take is pretty simple: the benchmark caveat is real, and the big swarm claims are a little theatrical. Three hundred sub-agents and four thousand coordinated steps sounds impressive, but I care more about one agent that doesn't get weird after lunch.
Justy Same. Still, I buy the core claim. Open weights, strong coding, native multimodal input, long tool use as the default shape of work... that's a real package. If you're building agents for front-end fixes, repo maintenance, visual QA, or long-running coding loops, this jumps onto the shortlist immediately.
Cody [pause] And if I were testing this this weekend... assuming my weekend list doesn't defeat me...
Justy [laughs] It will. I'd pull the K2.6 weights from Hugging Face and run a side-by-side on a screenshot-driven bug fix task. Then I'd wire it into an agent harness with long tool traces and see where it drifts. And I'd compare providers, not just models, because latency and reliability are part of the result. Yeah. I'd do three concrete things. Grab the Hugging Face release and inspect the model card. Set up a small repo task where the model has to read a screenshot, patch a fro