Exploring Next

Exploring Next — Ep 314 w/ Justy & Cody — Mind’s Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Justy and Cody dig into Mind’s Eye, a new benchmark that tests whether multimodal models can actually reason visually (rotation, folding, analogy, composition) rather than just describe images well. They unpack the paper’s A-R-T taxonomy, the gap between human and model scores, why prompting helps on some tasks and hurts on others, and what this means for anyone trying to ship multimodal features.

