Ep 314 research 2:53 w/ Justy & Cody

Mind’s Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Justy and Cody dig into Mind’s Eye, a new benchmark for testing whether multimodal models can actually do visual thinking like rotation, folding, analogy, and composition instead of just describing images well. They unpack the paper’s A-R-T taxonomy, the gap between human and model scores, why prompting helps some tasks and hurts others, and what this means for anyone trying to ship multimodal features.

Script: GPT-5.4 Voice: Deepgram TTS

Transcript

Justy A model can name the shape, and still completely fail the part where you turn it around in your head.

Justy Welcome to Exploring Next, episode 314. I’m Justy, recording from Cody’s kitchen in DC, slightly over-caffeinated, and today we’re talking about Mind’s Eye.

Cody Yeah, and this matters now because multimodal models look good on broad vision benchmarks, but a lot of product ideas quietly assume they can do mental rotation, folding, or visual analogy. I think that assumption is still shaky.

Justy Right. If you’re building anything with diagrams, UI agents, robotics, design tools, even puzzle-like planning, the gap is not academic. It changes what you can trust in production.

Cody This paper targets that gap. Existing benchmarks cover image description, OCR, and visual QA, but often miss actual visuospatial manipulation or let language priors carry the answer. Mind’s Eye tries to isolate visual thinking itself.

Cody The core idea is the A-R-T taxonomy: Abstraction, Relation, and Transformation. The tasks draw from classic cognitive tests like mental rotation and paper folding, and use multiple choice plus diagnostic distractors to expose specific failure modes.
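A rough sketch of how diagnostic distractors can be tallied, assuming each wrong option is tagged with the failure mode it is meant to expose. The field names (`answer`, `distractor_modes`) and the mode labels are hypothetical placeholders, not the benchmark's actual schema.

```python
from collections import Counter

def failure_mode_counts(items, predictions):
    """Count which tagged failure mode each wrong answer corresponds to.

    items: list of dicts with the correct option and a map from each
           distractor option to a failure-mode label (hypothetical schema).
    predictions: the option letter the model chose for each item.
    """
    counts = Counter()
    for item, choice in zip(items, predictions):
        if choice != item["answer"]:
            # Options without a tag are grouped under "unlabeled".
            counts[item["distractor_modes"].get(choice, "unlabeled")] += 1
    return counts

# Made-up items and predictions, for illustration only.
items = [
    {"answer": "A", "distractor_modes": {"B": "mirrored", "C": "wrong_angle"}},
    {"answer": "C", "distractor_modes": {"A": "mirrored", "B": "wrong_angle"}},
    {"answer": "D", "distractor_modes": {"B": "mirrored", "C": "wrong_angle"}},
]
predictions = ["B", "C", "B"]

counts = failure_mode_counts(items, predictions)
print(counts.most_common())
```

If one label dominates the wrong answers, that points at a specific failure mode rather than random guessing.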

Justy The headline is humans at about 80 percent accuracy, top multimodal models below 50 percent. And the gap lands exactly on tasks people casually assume these systems can do.

Cody What stood out to me is the difficulty trend. Humans drop as tasks get harder, but models stay kind of flat. That suggests they are not running a weaker version of the same operation. They may lack it, or only have it in a brittle way.
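One way to check for that flat-versus-dropping pattern is to group results by difficulty level and compare the easiest level to the hardest. This is a minimal sketch: the numeric difficulty levels and the toy results are assumptions for illustration, not data from the paper.

```python
from collections import defaultdict

def accuracy_by_level(results):
    """results: iterable of (difficulty_level, is_correct) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for level, is_correct in results:
        totals[level] += 1
        hits[level] += int(is_correct)
    return {level: hits[level] / totals[level] for level in sorted(totals)}

def easy_to_hard_drop(acc):
    """Accuracy at the easiest level minus accuracy at the hardest level."""
    levels = sorted(acc)
    return acc[levels[0]] - acc[levels[-1]]

# Made-up results, for illustration only.
results = [(1, True), (1, True), (2, True), (2, False), (3, False), (3, False)]

acc = accuracy_by_level(results)
drop = easy_to_hard_drop(acc)
print(acc, drop)
```

A drop near zero combined with low overall accuracy would match the flat trend described here. A large drop would look more like the human curve.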

Cody They also tried prompting interventions. Structured scaffolding helped some Abstraction tasks but hurt Transformation tasks. Language can help with rule extraction, but it does not reliably create a mental workspace for folding or rotating shapes.

Justy And that is where demos and products diverge. If a system only works with a carefully tuned reasoning prompt on one slice of tasks, that is not a stable capability.

Cody Methodologically, I also liked that they looked at attention. Models often localize the relevant region, so the failure is not always perception. Sometimes they know where to look and still cannot compute the needed relation.

Justy So I read this less as a recipe for solved visual reasoning and more as a cleaner eval layer. A better flashlight, not a finished engine.

Cody As a diagnostic, it is strong. For builders, I’d use it as evaluation infrastructure for model vendors, robotics, visual agents, and diagram-heavy systems. Start with the GitHub benchmark, break each model’s scores down by A, R, and T, and if you want a weekend project, compare performance with and without scaffolding on the Transformation subset.
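The weekend project could start from something like the sketch below, which reports accuracy per category and per prompting condition. The record fields (`category`, `condition`, `predicted`, `answer`) are assumed names, and the records are made-up placeholders. Loading the real benchmark and calling a model are left out.

```python
from collections import defaultdict

def accuracy_by_category(records):
    """Return {(category, condition): accuracy} from per-item results."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = (r["category"], r["condition"])
        total[key] += 1
        correct[key] += int(r["predicted"] == r["answer"])
    return {key: correct[key] / total[key] for key in total}

# Made-up records, for illustration only. "plain" and "scaffold" are
# hypothetical labels for the two prompting conditions.
records = [
    {"category": "A", "condition": "plain", "predicted": "B", "answer": "C"},
    {"category": "A", "condition": "scaffold", "predicted": "C", "answer": "C"},
    {"category": "T", "condition": "plain", "predicted": "A", "answer": "A"},
    {"category": "T", "condition": "scaffold", "predicted": "D", "answer": "A"},
]

scores = accuracy_by_category(records)
for (category, condition), acc in sorted(scores.items()):
    print(f"{category} {condition}: {acc:.2f}")
```

Keeping the condition in the key makes the with-and-without comparison a single table lookup per category.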

Cody And if you want one more practical tool, wire it into lm-eval-style reporting or a simple notebook pipeline, then test a vision-language model you can host yourself. Even a small open model will teach you a lot here, maybe more than a glossy demo ever would. [sighs] Also, Justy, next time you fly in, pick a less brutal arrival time.

Justy Fair. We learned that naming the shape is not the same as turning it around in your head. That’s episode 314 of Exploring Next.