Ep 244 News March 26, 2026 4:50 w/ Justy & Cody

Ai2 releases MolmoWeb, an open-weight visual web agent with 30K human task trajectories and a full training stack

Ai2 releases MolmoWeb, the first open-weight visual web agent that ships with its full training data and pipeline. Unlike closed APIs or empty frameworks, MolmoWeb includes 30K human task trajectories, works purely from screenshots, and gives developers full visibility into how it was built.

Script Sonnet 4.5 Voice ElevenLabs

Transcript

Izzo Browser automation that actually works.

Izzo You're listening to Exploring Next, episode 244. I'm Izzo, and today Boone and I are diving into MolmoWeb — the first open-weight visual web agent that ships with everything you need to actually understand how it works.

Boone And by everything, she means everything. Thirty thousand human task trajectories, the full training pipeline, even the Chrome extension they used to collect the data.

Izzo Right. Because if you're building browser automation today, you're stuck between two bad options: closed APIs you can't inspect, or open frameworks with no trained model underneath.

Boone Exactly. Browser-use is great as a framework, but you still need to bring your own LLM and figure out the agent layer yourself.

Izzo And enterprise teams need to audit what they're running, fine-tune on internal workflows, avoid per-call API costs. MolmoWeb gives you that third option.

Boone What's fascinating is their pure visual approach. It doesn't parse HTML at all — just takes screenshots and reasons about what it sees.

Izzo Wait, that seems harder than it needs to be. Why not use the DOM?

Boone Browser compatibility, Izzo. A screenshot works the same whether you're running Chrome, Safari, or some headless service. Plus, it sees exactly what a human user sees — no accessibility tree interpretation layer.

Izzo Okay, that's actually clever from a deployment perspective.

Boone The architecture is clean too. At each step it gets a task instruction, current screenshot, action history, URL and page title. Then it outputs a natural-language thought process followed by the next action.
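A minimal sketch of the per-step interface described above, written in Python. The names `Observation`, `StepOutput`, `run_episode`, `policy`, and `get_page`, along with the `"stop"` action string, are hypothetical and not taken from the MolmoWeb codebase. The sketch only illustrates the shape of the loop: the agent receives the task, a screenshot, the action history, the URL, and the page title, and returns a thought plus the next action.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Observation:
    """Inputs the agent sees at one step (field names are illustrative)."""
    task: str
    screenshot_png: bytes
    action_history: List[str]
    url: str
    page_title: str


@dataclass
class StepOutput:
    """Natural-language reasoning followed by the next action."""
    thought: str
    action: str


def run_episode(
    task: str,
    policy: Callable[[Observation], StepOutput],
    get_page: Callable[[], Tuple[bytes, str, str]],
    max_steps: int = 10,
) -> List[StepOutput]:
    """Call the policy once per step until it emits "stop" or steps run out.

    `get_page` stands in for the browser: it returns (screenshot, url, title).
    Executing the action in a real browser is deliberately left out.
    """
    history: List[str] = []
    outputs: List[StepOutput] = []
    for _ in range(max_steps):
        screenshot, url, title = get_page()
        obs = Observation(task, screenshot, list(history), url, title)
        out = policy(obs)
        outputs.append(out)
        if out.action == "stop":  # assumed termination signal
            break
        history.append(out.action)
    return outputs
```

Because the policy is just a callable, a stub can be swapped in for the real model to exercise the loop without a browser or GPU.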

Izzo Natural language reasoning before acting — that's huge for debugging.

Boone Right. You can literally read its thought process: 'I need to click the login button, which appears to be the blue rectangle in the top right.' Then it executes the click at those screen coordinates.
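To make the "thought, then click at coordinates" idea concrete, here is a small Python sketch. The output format (`Thought: ...` followed by `Action: click(x=..., y=...)` with coordinates normalized to the 0 to 1 range) is an assumption made for illustration; the episode does not specify how MolmoWeb actually encodes its actions. The helper names are hypothetical as well.

```python
import re
from typing import Tuple

# Assumed action syntax: click(x=<0..1>, y=<0..1>). Not MolmoWeb's real format.
CLICK_RE = re.compile(r"click\(\s*x=([0-9.]+)\s*,\s*y=([0-9.]+)\s*\)")


def split_thought_and_action(text: str) -> Tuple[str, str]:
    """Separate the readable reasoning from the action string."""
    thought, _, action = text.partition("\nAction:")
    thought = thought.strip()
    if thought.startswith("Thought:"):
        thought = thought[len("Thought:"):].strip()
    return thought, action.strip()


def click_to_pixels(action: str, width: int, height: int) -> Tuple[int, int]:
    """Convert a normalized click action into pixel coordinates."""
    match = CLICK_RE.fullmatch(action)
    if match is None:
        raise ValueError(f"not a click action: {action!r}")
    x, y = float(match.group(1)), float(match.group(2))
    return round(x * width), round(y * height)
```

Keeping the thought as plain text next to the parsed action is what makes a trace readable when a step goes wrong: the log shows both what the agent intended and where it clicked.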

Izzo But the real story here is MolmoWebMix, the training dataset. That's what no one else has shipped.

Boone Three components. Human demonstrations — thirty thousand task trajectories where humans actually completed browsing tasks while a Chrome extension recorded everything.

Izzo Five hundred ninety thousand individual subtasks across eleven hundred websites. That's serious scale.

Boone Then synthetic trajectories to scale beyond what humans can annotate. But here's the key — they used text-based accessibility agents, not proprietary vision models. No OpenAI Operator or Anthropic computer use in the training mix.

Izzo So it's actually reproducible.

Boone Exactly. And the third component is GUI perception data — two point two million screenshot question-answer pairs teaching it to read and reason about page content from images.

Izzo That's where the visual reasoning gets trained. 'Where is the search box?' 'What does this button do?'
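The three parts of MolmoWebMix mentioned in this exchange can be summarized as a small Python structure. This is only a bookkeeping sketch: the `MixComponent` class and its field names are hypothetical, and the counts are the figures quoted in the episode. No figure is given for the synthetic trajectories, so that count is left empty.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MixComponent:
    name: str
    source: str
    count: Optional[int]  # None where the episode gives no figure
    unit: str


# Figures as quoted in the episode; the structure itself is illustrative.
MOLMOWEBMIX = [
    MixComponent("human demonstrations",
                 "people completing tasks, recorded by a Chrome extension",
                 30_000, "task trajectories"),
    MixComponent("synthetic trajectories",
                 "text-based accessibility agents",
                 None, "task trajectories"),
    MixComponent("GUI perception",
                 "questions and answers about page screenshots",
                 2_200_000, "screenshot QA pairs"),
]


def describe(mix: List[MixComponent]) -> List[str]:
    """Render one human-readable line per dataset component."""
    lines = []
    for c in mix:
        count = f"{c.count:,}" if c.count is not None else "count not stated"
        lines.append(f"{c.name}: {count} {c.unit} ({c.source})")
    return lines
```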

Boone Performance-wise, it's leading the open-weight category across four live-website benchmarks. WebVoyager, Online-Mind2Web, DeepShop, WebTailBench.

Izzo And beating older GPT-4o agents that had both screenshots AND accessibility trees.

Boone Though they're honest about limitations. Text reading from screenshots isn't perfect, drag-and-drop is unreliable, and it struggles with ambiguous instructions.

Izzo Plus no training on logins or financial transactions — which makes sense for a public dataset.

Boone But for enterprise use cases, that audit trail is everything. You can see exactly what it was trained on, how it makes decisions, fine-tune it on your internal workflows.

Izzo I'm giving this a solid A-minus. It's not just releasing weights — it's releasing the entire playbook for building visual web agents.

Boone Alright, if you want to get hands-on: first, check out the MolmoWeb repo on GitHub. They've got the 4B and 8B parameter models ready to download.

Izzo Second, try the hosted demo at molmoweb.ai — it runs on Browserbase so you can see it working on real websites. And third, dive into the MolmoWebMix dataset. Even if you're not training agents, seeing thirty thousand human browsing trajectories is fascinating research. Adding it to my weekend project list, obviously. Obviously. This is how you do open-source AI right — not just the model, but everything needed to understand and extend it. We'll see you next time on Exploring