Ep 263 tool 2:01 w/ Justy & Cody

Running local models on Macs gets faster with Ollama's MLX support

Ollama just added MLX support for Apple Silicon Macs, promising significantly faster local LLM performance through better unified memory usage. We break down what this actually means, why it matters as local models gain momentum, and the technical architecture that makes it work.

Script: Sonnet 4.5 Voice: Google TTS

Transcript

Izzo Local AI just got a lot more practical on Macs.

Izzo You're listening to Exploring Next, episode two-sixty-three. I'm Izzo, here with Boone, and today we're talking about Ollama's new MLX support.

Boone Perfect timing too. With OpenClaw hitting three hundred thousand GitHub stars and everyone suddenly wanting to run models locally.

Izzo Right, and I think people are finally hitting that wall with cloud costs. When you're paying Claude or ChatGPT subscription fees and still running into rate limits—

Boone —you start looking at that M3 MacBook and wondering if it can actually run something decent locally.

Izzo Exactly. So Boone, break down what MLX actually is and why it matters for this.

Boone MLX is Apple's machine learning framework that's designed specifically for their unified memory architecture. Unlike traditional setups where you have separate CPU and GPU memory—

Izzo —with all that copying data back and forth—

Boone Exactly. Apple Silicon shares memory between CPU and GPU, and MLX optimizes for that. So instead of hitting a separate VRAM limit, the GPU can draw on most of the system RAM.

Izzo And that's huge because most people don't have gaming rigs with massive VRAM. But a MacBook Pro with 32 gigs? That's getting common.

Boone Plus Ollama 0.19 adds support for Nvidia's NVFP4 format, which is a 4-bit floating-point quantization scheme, and improved caching. They're attacking the memory problem from multiple angles.

Izzo Okay but let's get specific. What can you actually run right now?

Boone Currently just Qwen3.5 with thirty-five billion parameters. You need Apple Silicon and at least thirty-two gigs of RAM.
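
Show note: a rough sketch of the arithmetic behind that RAM figure. The bit widths are assumptions for illustration (the episode only mentions a 4-bit format, NVFP4), and the helper name `weight_footprint_gb` is hypothetical. The estimate covers the weights only and ignores the KV cache, activations, and per-block scale overhead, so real usage will be higher.

```python
# Back-of-envelope size of the model weights alone.
# Assumption: every parameter is stored at the same bit width.
# Ignores KV cache, activations, and quantization scale overhead.

def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # 35 billion parameters is the figure quoted in the episode.
    for bits in (16, 8, 4):
        size = weight_footprint_gb(35, bits)
        print(f"35B parameters at {bits}-bit: ~{size:.1f} GB of weights")
```

Under these assumptions, only the 4-bit case leaves headroom on a 32 GB machine once the operating system and context cache take their share.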

Izzo So we're talking serious hardware requirements. That's not your base model MacBook Air.
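
Show note: one way to check a machine against the requirements mentioned here, as a minimal sketch. The function names and the `MIN_RAM_GB` constant are hypothetical. It assumes macOS reports total memory through `sysctl -n hw.memsize`, and that Python is running natively (under Rosetta, `platform.machine()` reports `x86_64` even on Apple Silicon).

```python
import platform
import subprocess

MIN_RAM_GB = 32  # minimum quoted in the episode


def meets_requirements(system: str, machine: str, ram_bytes: int) -> bool:
    """True for an Apple Silicon Mac with at least MIN_RAM_GB of memory."""
    is_apple_silicon = system == "Darwin" and machine == "arm64"
    return is_apple_silicon and ram_bytes >= MIN_RAM_GB * 1024**3


def detect_ram_bytes() -> int:
    """Total physical memory in bytes, read from sysctl (macOS only)."""
    result = subprocess.run(
        ["sysctl", "-n", "hw.memsize"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    system, machine = platform.system(), platform.machine()
    # Only call sysctl on macOS; elsewhere the check fails anyway.
    ram = detect_ram_bytes() if system == "Darwin" else 0
    if meets_requirements(system, machine, ram):
        print("This machine meets the requirements quoted in the episode.")
    else:
        print("This machine does not meet the requirements quoted in the episode.")
```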

Boone No, but here's what's interesting—if you have the new M5 series, you get access to those Neural Accelerators. Better tokens per second and faster time to first token.

Izzo Time to first token is huge for user experience. Nobody wants to wait ten seconds for the model to even start responding.
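
Show note: the two metrics mentioned here, time to first token and tokens per second, can be measured from any streaming response. This is a minimal sketch, not Ollama's own benchmarking method. `measure_stream` and `fake_stream` are hypothetical names, and the stream is simulated with sleeps so the example runs without a model.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (time_to_first_token_s, decode_tokens_per_s, token_count)."""
    start = clock()
    first_at = None
    count = 0
    for _ in tokens:
        if first_at is None:
            first_at = clock()  # the moment the first token arrives
        count += 1
    end = clock()
    if first_at is None:
        raise ValueError("stream produced no tokens")
    ttft = first_at - start
    decode_time = end - first_at
    # Decode rate covers the tokens that arrive after the first one.
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return ttft, rate, count


def fake_stream(n: int, first_delay: float, step_delay: float) -> Iterator[str]:
    """Simulated token stream: a slow first token, then steady output."""
    time.sleep(first_delay)
    yield "tok0"
    for i in range(1, n):
        time.sleep(step_delay)
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, rate, count = measure_stream(fake_stream(20, 0.05, 0.01))
    print(f"tokens: {count}")
    print(f"time to first token: {ttft:.3f} s")
    print(f"decode rate: {rate:.1f} tokens/s")
```

Separating the two numbers matters because the first token includes prompt processing, while the decode rate reflects steady-state generation.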

Boone And this is where the unified memory really shines. Traditional GPU setups, you're constantly moving data. With MLX, the model lives in shared memory space.

Izzo From a product perspective, this feels like the moment local models become viable for more than just hobbyists.

Boone I mean, we're still not talking frontier model quality. But good enough for code completion, document analysis, basic reasoning tasks.

Izzo Right, and privacy is becoming a real selling point. Especially for teams working on sensitive code or proprietary documents.

Boone Though I have to say—and this is important—they specifically warn against OpenClaw-style setups that give models deep system access.

Izzo Good point. Local doesn't automatically mean safe if you're giving it shell access and file system permissions.

Boone The architecture choices here are really smart though. Instead of fighting against Apple's design decisions, they're leaning into them.

Izzo What do you mean?

Boone Most ML frameworks were built for discrete GPUs. Ollama with MLX says 'okay, unified memory is actually an advantage if we design for it.'

Izzo That's clever. Work with the hardware, not against it.

Boone Exactly. And I'm curious how this scales as they add more models. Qwen3.5 is just the start.

Izzo Speaking of scaling, what should people actually try if they want to experiment with this?

Boone First, check if you have the hardware. Apple Silicon Mac, thirty-two gigs minimum. Then install Ollama and try the preview build.

Izzo Command line tool, right? That's still the main barrier for less technical users.

Boone Yeah, though there are GUI wrappers popping up. But honestly, 'ollama run qwen3.5:35b' isn't that scary.

Izzo Fair point. What else should people research?

Boone Look into MLX itself, Apple