Ep 177 GitHub February 11, 2026 5:14 w/ Justy & Cody

Transformers.js v4 Preview: Now Available on NPM!

Transformers.js v4 brings massive performance improvements with a new C++ WebGPU runtime, a modular architecture, and a standalone tokenizer library. It now runs state-of-the-art AI models directly in browsers, Node, and Deno with hardware acceleration.

Script: Sonnet 4.5 | Voice: ElevenLabs

Transcript

Izzo If you've ever tried running AI models in JavaScript, you know the pain.

Izzo Welcome back to Exploring Next, episode 177. I'm here with Boone, and today we're diving into Transformers.js v4, a preview that just hit NPM after nearly a year in development.

Boone And Izzo, this isn't just another version bump. They've completely rewritten the WebGPU runtime in C++. We're talking about running GPT-OSS 20B at sixty tokens per second on an M4 Pro.

Izzo Right? That's production-ready performance. But let's start with why this matters right now. Every product team I talk to wants to run AI locally — privacy, latency, cost. But JavaScript has always been the slow kid in the AI playground.

Boone Exactly. And the fragmentation was brutal. Different runtimes, different performance characteristics. You'd write for the browser, then rewrite for Node. Transformers.js v4 solves that with one codebase that runs everywhere.

Izzo So walk me through what they actually built here, Boone. This new WebGPU runtime — what makes it different?

Boone They partnered with the ONNX Runtime team to build this thing from scratch in C++. The key insight was leveraging specialized operators like com.microsoft.GroupQueryAttention and com.microsoft.MatMulNBits instead of generic matrix operations.

Izzo Hold on, how big a speedup are we talking about for something like BERT?

Boone Four times faster by switching to com.microsoft.MultiHeadAttention. That's not marginal improvement, that's architectural advantage. They're using the GPU the way it was designed to be used.
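
For listeners unfamiliar with grouped-query attention, one of the operators named above, the core idea is that several query heads share a single key/value head. The sketch below shows only that head-sharing arithmetic under the usual convention. It is an illustration, not the ONNX Runtime kernel, and the function name `kvHeadForQueryHead` is hypothetical.

```javascript
// Grouped-query attention: numQueryHeads query heads share numKvHeads key/value heads.
// Returns the index of the KV head that a given query head reads from.
function kvHeadForQueryHead(queryHead, numQueryHeads, numKvHeads) {
  if (numQueryHeads % numKvHeads !== 0) {
    throw new Error("numQueryHeads must be a multiple of numKvHeads");
  }
  const groupSize = numQueryHeads / numKvHeads; // query heads per shared KV head
  return Math.floor(queryHead / groupSize);
}

// Example: 8 query heads sharing 2 KV heads.
const mapping = Array.from({ length: 8 }, (_, q) => kvHeadForQueryHead(q, 8, 2));
// mapping is [0, 0, 0, 0, 1, 1, 1, 1]
```

When `numKvHeads` equals `numQueryHeads` this reduces to ordinary multi-head attention, and when it is 1 every query head shares one KV head. Fewer KV heads means less key/value data to store and move, which is one reason a dedicated operator can beat generic matrix operations.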

Izzo From a product perspective, this is huge. You can now cache the WASM files locally and run completely offline after the initial download. That's a game-changer for enterprise deployments.
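
The offline behavior Izzo describes follows the general cache-first pattern: look for a stored copy, and only hit the network when there is none. The sketch below is a generic version of that pattern, not the mechanism Transformers.js itself uses. The function name `cacheFirst` and its parameters are assumptions made for illustration.

```javascript
// Cache-first lookup: serve a stored copy when present, otherwise download and store it.
// `cache` needs async match(url) and put(url, response), like the browser Cache API.
// `fetchImpl` is any fetch-compatible function.
async function cacheFirst(url, cache, fetchImpl) {
  const cached = await cache.match(url);
  if (cached !== undefined) return cached; // no network needed, works offline

  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`Download failed: ${response.status} ${url}`);
  }
  // A Response body can only be read once, so store a clone and return the original.
  await cache.put(url, response.clone());
  return response;
}
```

In a browser, the `cache` argument could come from `await caches.open("runtime-files")` (the cache name is made up here) and `fetchImpl` could be the global `fetch`. After the first successful call for a given URL, later calls are served from the cache without touching the network.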

Boone And they didn't just focus on performance. The architecture overhaul is equally impressive. They moved from a single 8,000-line models.js file to a proper modular structure using PNPM workspaces.

Izzo Eight thousand lines in one file? That's technical debt nightmare fuel.

Boone Right? Now it's split into focused modules with clear separation between core logic and model-specific implementations. Plus they migrated from Webpack to esbuild — build times went from two seconds to 200 milliseconds.

Izzo Ten-x improvement. And bundle sizes dropped by ten percent across the board, with their web build shrinking by fifty-three percent. That translates directly to faster user experiences.

Boone But here's what really caught my attention — the standalone tokenizers library. They extracted all the tokenization logic into a separate package that's just 8.8 kilobytes gzipped with zero dependencies.

Izzo That's smart product strategy. Teams that just need tokenization don't have to pull in the entire ML runtime. Clean separation of concerns.

Boone And it's fully type-safe, works across all JavaScript runtimes. The API is clean too — you fetch the tokenizer config from Hugging Face Hub, create a tokenizer instance, and you're tokenizing text in three lines of code.
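
The flow Boone describes (fetch the config from the Hub, create a tokenizer, tokenize) might look roughly like the sketch below. It assumes the standalone package is published as `@huggingface/tokenizers`, that it exports a `Tokenizer` class constructed from the two JSON files, and that the files are named `tokenizer.json` and `tokenizer_config.json`. Check the package README before relying on any of that. The helper names `hubFileUrl` and `loadTokenizer` are hypothetical.

```javascript
// Build the Hub URL for a file on the main branch of a model repo.
function hubFileUrl(modelId, fileName) {
  return `https://huggingface.co/${modelId}/resolve/main/${fileName}`;
}

// Fetch both tokenizer files, then construct the tokenizer from them.
async function loadTokenizer(modelId, fetchImpl = fetch) {
  // Assumption: the package name and the `Tokenizer` export are as described above.
  const { Tokenizer } = await import("@huggingface/tokenizers");

  const getJson = async (fileName) => {
    const res = await fetchImpl(hubFileUrl(modelId, fileName));
    if (!res.ok) throw new Error(`Failed to fetch ${fileName}: ${res.status}`);
    return res.json();
  };

  // The two files are independent, so download them in parallel.
  const [tokenizerJson, tokenizerConfig] = await Promise.all([
    getJson("tokenizer.json"),
    getJson("tokenizer_config.json"),
  ]);
  return new Tokenizer(tokenizerJson, tokenizerConfig);
}
```

With the package installed, usage would be along the lines of `const tokenizer = await loadTokenizer(modelId)` followed by `tokenizer.encode("Hello World")`, assuming the class exposes an `encode` method.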

Izzo What about the new model architectures? I saw they added support for Mamba and Mixture of Experts.

Boone That's where the specialized operators really shine. They implemented Multi-head Latent Attention, state-space models, MoE — all running with WebGPU acceleration. Models like GPT-OSS, Chatterbox, even some of the newer hybrid architectures.

Izzo And these all run in the browser?

Boone Yep. Same code, with WebGPU acceleration, whether you're in Chrome, Node, Bun, or Deno. Actual speed still depends on the hardware, but that's the promise of this new runtime architecture.

Izzo I'm giving this a solid A-minus. The performance gains are real, the architecture cleanup was overdue, and the developer experience improvements are substantial.

Boone Agreed. My only hesitation is that it's still in preview. But they're publishing regular updates under the next tag, so you can start experimenting now.

Izzo So what should listeners go build with this?

Boone First, install it — npm install @huggingface/transformers@next. Then try the standalone tokenizers library for any text processing pipeline you're building. It's tiny and works everywhere.

Izzo Second, if you're doing embeddings, benchmark the new BERT performance against your current setup. That four-x speedup could change your product economics.
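
To check a speedup claim like this against your own setup, a small timing harness is usually enough. The sketch below is one generic way to do it, with warmup runs discarded and the median reported. The names `benchmark` and `median` and the default run counts are assumptions, not part of Transformers.js.

```javascript
// Time an async function: discard warmup runs, then report the median of the timed runs.
async function benchmark(fn, { warmup = 2, runs = 10 } = {}) {
  // First calls often include one-time costs such as downloads or shader compilation.
  for (let i = 0; i < warmup; i++) await fn();

  const samples = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await fn();
    samples.push(performance.now() - start);
  }
  return { medianMs: median(samples), samples };
}

// Median is less sensitive to a single slow outlier than the mean.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}
```

To compare setups, pass a closure that runs one embedding call, for example a function that awaits an extractor created with `pipeline("feature-extraction", ...)` from `@huggingface/transformers`. Run it once on your current version and once on the preview, using the same model, inputs, and device.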

Boone And third — this is definitely going on my weekend project list — try running one of their new MoE models locally. The fact that you can run a 20B parameter model at sixty tokens per second in JavaScript is still kind of mind-blowing.

Izzo The examples repository is separate now, so check that out for real implementation patterns. This feels like the moment JavaScript AI went from proof-of-concept to production-ready.