Exploring Next
Exploring Next — Ep 314 w/ Justy & Cody — Mind’s Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs
Justy and Cody dig into Mind’s Eye, a new benchmark that tests whether multimodal models can actually reason visually (rotation, folding, analogy, composition) rather than just describe images well. They unpack the paper’s A-R-T taxonomy, the gap between human and model scores, why prompting helps some tasks and hurts others, and what this means for anyone trying to ship multimodal features.