Ep 388 article 5:21 w/ Justy & Cody

Thinking Machines shows off preview of near realtime AI voice and video conversation with new 'interaction models'

Thinking Machines previews 'interaction models': AI that processes voice and video in real time, listening and responding simultaneously instead of waiting for the user to finish speaking. Cody is skeptical about whether this solves a real problem or is architectural theater; Justy argues the latency gains and enterprise use cases (manufacturing safety oversight, customer service) are genuinely useful. They debate whether 'full-duplex' is a fundamental shift or incremental polish on existing models.

Script: Haiku 4
Voice: Inworld TTS 1.5 Max

Transcript

Justy So Thinking Machines just showed off this thing called 'interaction models' and I genuinely don't know if Cody thinks this is genius or if it's the most overhyped architecture flex I'll hear this month.

Cody It's somewhere in the middle, but let me start skeptical. The entire pitch is basically 'what if AI listened while it talked instead of waiting for you to shut up first.' And yeah, okay—that's technically different. But is it solving a real problem or is it solving a problem that mostly exists because we shipped turn-based models first?

Justy Right, but here's the thing—a 1.18-second delay on a customer service call is actually brutal. You call a bot, there's that dead air, and it feels broken even though technically the model is 'thinking.' Thinking Machines gets it down to 0.4 seconds.

Cody Okay, I hear that. But is 0.4 seconds from a native full-duplex model, or is it just... better inference scheduling on a regular model? Because latency optimization has been a thing for years.

Justy They're processing 200-millisecond chunks of audio and video simultaneously through the same transformer, not routing through separate encoders. So it's not just inference speed—the architecture is actually different.
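
A rough sketch of the chunking Justy describes: audio and video from the same 200-millisecond window are grouped into one combined chunk. The 200 ms figure is from the episode; the sample rate, the frame rate, and the names `chunk_bounds` and `interleave` are assumptions for illustration. This is not Thinking Machines' actual pipeline.

```python
from typing import Dict, Iterator, List, Tuple

CHUNK_MS = 200  # chunk length mentioned in the episode


def chunk_bounds(duration_ms: int, chunk_ms: int = CHUNK_MS) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) windows covering the stream; the last may be shorter."""
    return [(s, min(s + chunk_ms, duration_ms)) for s in range(0, duration_ms, chunk_ms)]


def interleave(
    audio: List[float],
    frames: List[str],
    sample_rate: int,
    fps: int,
    duration_ms: int,
) -> Iterator[Dict[str, object]]:
    """Yield one combined chunk per window, pairing the audio samples and
    video frames that fall inside the same time span."""
    for start, end in chunk_bounds(duration_ms):
        a = audio[start * sample_rate // 1000 : end * sample_rate // 1000]
        v = frames[start * fps // 1000 : end * fps // 1000]
        yield {"start_ms": start, "audio": a, "video": v}
```

With the assumed 16 kHz audio and 30 fps video, one second of input gives five chunks, each holding 3,200 samples and 6 frames. In a setup like this, a single model would consume the combined chunks in order instead of receiving audio and video from separate encoders.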

Cody Fair.

Justy And here's where I think it gets real: they can backchannel while you're still talking. Like, the model says 'mm-hmm' or 'I see' without interrupting you, and then it jumps into the answer. That's not a latency trick—that requires simultaneous input and output.

Cody That's the genuinely clever bit, I'll admit. But now I'm looking at their benchmark, FD-bench, and it's a benchmark designed to measure 'interaction quality.' Justy, you know how this goes. Every company ships a benchmark that makes them look good.

Justy True, but they also ran it against Gemini and GPT-realtime and crushed them. And the visual stuff, like RepCount-A, where the model counts repetitions in a video in real time, is not a trick. Either it can do it or it can't.

Cody Okay, I'm not dismissing the results. 77.8 on FD-bench versus 46.8 for GPT-realtime-2.0 is a real gap. The visual proactivity is legitimately stronger. What I'm questioning is whether this is a 'new class of model' or just the natural next step after we stopped being lazy about inference.

Justy Maybe both. But here's what matters to me—the enterprise play. In a manufacturing plant, you've got a worker on the floor and a camera watching them. Today, an AI model watches and waits for someone to ask, 'Is this safe?' With Thinking Machines, the model can just... interrupt when it sees a violation. That's not a latency optimization—that's a different interaction paradigm.

Cody Mm-hmm.

Justy And for customer service, when someone's frustrated, a bot that backchannels naturally instead of leaving dead air while it processes actually changes how people perceive the interaction.

Cody I buy that. The use case is real. My concern is different—they've got a research preview, no general release yet, no word on pricing, and it's unclear if they're going to open-source any of this. So it's impressive technology on a roadmap, not a product you can ship with today.

Justy Totally fair. But they mentioned they're 'committed to significant open-source components,' which—if that's real—changes the game. You'd have a reference architecture for native multimodal interaction.

Cody That's the question mark. Tinker, their fine-tuning thing, didn't light the world on fire adoption-wise. So even if they open-source the interaction model, will anyone actually build on it?

Justy You're saying they could hand out free genius and we'd all just... not use it?

Cody Basically, yeah. Integration friction is real. But if someone does build a customer service bot or a manufacturing safety system on this, the latency and visual reaction time would be noticeably better than what we have now.

Justy So what does Build Next look like? A weekend project to test this?

Cody Two angles. First: the moment they open the research preview, someone should wire up a live customer service bot—measure latency, backchannel naturalness, customer satisfaction compared to a standard turn-based system. Real metrics, not benchmarks.
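
One way to collect the latency numbers Cody asks for, as a minimal sketch: time how long a bot takes to produce a response for each utterance, then summarize with a median and a nearest-rank 95th percentile. The `respond` callable is a hypothetical stand-in for whatever system is under test. For a streaming voice bot, the time to first audio would be the figure to record, not the time to a complete reply.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def time_responses(respond: Callable[[str], str], utterances: List[str]) -> List[float]:
    """Return wall-clock seconds taken by respond() for each utterance."""
    latencies = []
    for utterance in utterances:
        start = time.perf_counter()
        respond(utterance)  # hypothetical bot call; the result is ignored here
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies: List[float]) -> Dict[str, float]:
    """Median, nearest-rank 95th percentile, and worst case, in seconds."""
    ordered = sorted(latencies)
    p95_index = math.ceil(0.95 * len(ordered)) - 1
    return {
        "median_s": statistics.median(ordered),
        "p95_s": ordered[p95_index],
        "max_s": ordered[-1],
    }
```

Running the same utterances through a turn-based baseline and the system under test, then comparing the two summaries, gives a like-for-like view. The tail numbers matter as much as the median, since occasional long pauses are what make a call feel broken.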

Justy Right.

Cody Second: if they do release the model weights or an API, build a safety detection system in a simulated manufacturing environment. Camera feed, real-time interruption when something goes wrong. That's where this either proves itself or doesn't.
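
The simulated version of that safety test could start as small as this sketch: walk a stream of timestamped frames, and speak up unprompted whenever a violation appears, with a cooldown so the system does not repeat itself on every frame. The `is_violation` detector, the frame format, and the 5-second cooldown are all assumptions for illustration; a real build would put a vision model behind the detector.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def monitor(
    frames: Iterable[Tuple[float, Dict[str, bool]]],
    is_violation: Callable[[Dict[str, bool]], bool],
    cooldown_s: float = 5.0,
) -> List[float]:
    """Return the timestamps at which the system would interrupt on its own."""
    alerts: List[float] = []
    last_alert: Optional[float] = None
    for timestamp, frame in frames:
        if not is_violation(frame):
            continue
        # Suppress repeat alerts that fall inside the cooldown window.
        if last_alert is None or timestamp - last_alert >= cooldown_s:
            alerts.append(timestamp)
            last_alert = timestamp
    return alerts
```

The gap between the first violating frame and the first alert is the reaction time worth measuring in this kind of test.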

Justy And the solo builder angle?

Cody If you want to play with this right now without waiting, you could prototype a dual-model system yourself—run a fast inference model for chat and a heavier async model for complex reasoning. Not full-duplex, but you'd learn where the friction actually is.
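
A minimal sketch of that dual-model prototype, assuming two stub coroutines in place of real models: the slow model starts in the background, the fast model answers first, and the deeper answer follows when it is ready. `fast_reply`, `deep_reply`, and the sleep durations are hypothetical placeholders, not any vendor's API.

```python
import asyncio
from typing import List


async def fast_reply(prompt: str) -> str:
    # Hypothetical stand-in for a small, low-latency chat model.
    await asyncio.sleep(0.01)
    return f"quick: {prompt}"


async def deep_reply(prompt: str) -> str:
    # Hypothetical stand-in for a slower, heavier reasoning model.
    await asyncio.sleep(0.05)
    return f"deep: {prompt}"


async def handle_turn(prompt: str) -> List[str]:
    """Return replies in the order the user would receive them."""
    replies: List[str] = []
    deep_task = asyncio.create_task(deep_reply(prompt))  # runs in the background
    replies.append(await fast_reply(prompt))  # the user hears this first
    replies.append(await deep_task)  # follow-up once the slow model finishes
    return replies


def run_demo(prompt: str) -> List[str]:
    return asyncio.run(handle_turn(prompt))
```

As Cody says, this is still turn-based: nothing here listens while it talks. It does expose the handoff problem quickly, for example what the fast model should say while the slow one is still working.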

Justy Fair. So the verdict—is this real or is it marketing?

Cody It's real technology. Whether it's a paradigm shift or just really good engineering is the question I can't answer yet. The benchmarks are solid, the architecture makes sense, but I won't believe it until I see adoption.

Justy I'm more bullish. The latency is real, the visual proactivity is real, and the enterprise problems are real. If they price it right and open-source parts of it, this could actually reshape how people interact with AI at work. We'll see in a few months when the research preview lands.