Ep 353 · Research Paper · April 30, 2026 · 8:21 · w/ Justy & Cody

DV-World: Benchmarking Data Visualization Agents in Real World Scenarios

Justy and Cody dig into DV-World, a new benchmark from a multi-institution research team that stress-tests AI data visualization agents on real-world tasks — spreadsheet manipulation, cross-framework chart evolution, and handling ambiguous user intent. Even the best models top out around 50%, which tells you a lot about where the gap actually is.

Script: Sonnet 4.6 · Voice: ElevenLabs

Transcript

Justy So state-of-the-art models, best in class, can't break fifty percent on making a chart. That's where we are.

Justy Welcome to Exploring Next, episode 353. We're looking at DV-World — a new benchmark for data visualization agents, and the numbers are kind of brutal.

Cody Yeah, and I think the reason it stings is that visualization feels like it should be a solved problem by now. Like, you hand a model a table and say 'make me a bar chart' — that works fine. But that's not what anyone actually does at work.

Justy Right, and that's exactly the problem the paper is calling out. Everything that's been benchmarked so far is basically — clean data in, one chart out, in a code sandbox. Nobody's testing the other eighty percent of the job.

Cody The paper names three specific failure modes. Environmental decoupling — agents don't understand native spreadsheet object models. Creation-only myopia — every benchmark is one-shot generation, nobody's tested whether a model can revise a chart when requirements shift. And perfect-intent assumptions — benchmarks hand models a fully specified prompt, but real users don't do that.

Justy That third one — I feel that every single sprint. Someone files a ticket that says 'make the dashboard better' and you're just supposed to know what that means.

Cody [chuckles] Exactly. So DV-World splits into three domains. DV-Sheet is the native spreadsheet track — the agent has to create charts with dynamic cell range bindings, diagnose and repair broken chart objects, and arrange multiple charts into a coherent dashboard layout across sheets.

Justy Okay so that's genuinely hard. Fixing a broken chart binding in a real workbook is not a vibe, that's debugging.
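
Editor's note: to make "broken chart binding" concrete, here is a minimal sketch of the kind of check a repair step might start with. It assumes a chart series reference is a plain A1-style string such as `Sales!$B$2:$B$10` and that the used extent of each sheet is already known. The helper names (`col_to_index`, `parse_range`, `diagnose_binding`) are hypothetical and are not part of DV-World or any spreadsheet library; quoted sheet names and named ranges are deliberately not handled.

```python
import re

# Matches references like "Sales!$B$2:$B$10" (the "$" markers are optional).
_RANGE_RE = re.compile(
    r"^(?P<sheet>[^!]+)!"
    r"\$?(?P<c1>[A-Z]+)\$?(?P<r1>\d+):"
    r"\$?(?P<c2>[A-Z]+)\$?(?P<r2>\d+)$"
)


def col_to_index(col):
    """Convert a column label ("A", "Z", "AA") to a 1-based index."""
    idx = 0
    for ch in col:
        idx = idx * 26 + (ord(ch) - ord("A") + 1)
    return idx


def parse_range(ref):
    """Return (sheet, col1, row1, col2, row2), or None if ref is malformed."""
    m = _RANGE_RE.match(ref)
    if m is None:
        return None
    return (
        m["sheet"],
        col_to_index(m["c1"]), int(m["r1"]),
        col_to_index(m["c2"]), int(m["r2"]),
    )


def diagnose_binding(ref, sheets):
    """List problems with a series reference.

    `sheets` maps sheet name -> (max_col, max_row) of the used data.
    An empty list means the reference looks valid.
    """
    parsed = parse_range(ref)
    if parsed is None:
        return ["unparseable reference"]
    sheet, c1, r1, c2, r2 = parsed
    if sheet not in sheets:
        return [f"unknown sheet: {sheet}"]
    problems = []
    max_col, max_row = sheets[sheet]
    if c1 > c2 or r1 > r2:
        problems.append("range is inverted")
    if c2 > max_col or r2 > max_row:
        problems.append("range extends past the data")
    return problems


workbook = {"Sales": (3, 10)}  # hypothetical workbook: 3 columns, 10 rows
ok = diagnose_binding("Sales!$B$2:$B$10", workbook)
too_long = diagnose_binding("Sales!$B$2:$B$50", workbook)
dangling = diagnose_binding("#REF!", workbook)
```

A real agent would still have to rewrite the reference inside the workbook's chart object; this only covers the diagnosis half.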

Cody DV-Evol is about cross-framework migration. You give the agent a reference chart image and a new dataset, and it has to reproduce and adapt that chart in a different language or framework — five paradigms total: Python, Vega-Lite, D3.js, and a couple others. The point is testing whether a model actually understands visualization logic, or if it's just pattern-matching on syntax.

Justy So you can't just memorize matplotlib signatures. You have to actually get what a stacked bar chart is trying to communicate and then express that in D3.
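
Editor's note: one way to see the difference between visualization logic and syntax is to write the chart's intent down once and emit it in a declarative form. The sketch below builds a Vega-Lite stacked bar spec as a plain Python dict. The function name `stacked_bar_spec` and the sample field names are made up for illustration, and this is not the paper's migration pipeline.

```python
import json


def stacked_bar_spec(rows, category, value, group):
    """Build a Vega-Lite spec: one bar per category, stacked by group."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "data": {"values": rows},
        "mark": "bar",
        "encoding": {
            # Categories along the x-axis.
            "x": {"field": category, "type": "nominal"},
            # Summed values on the y-axis, stacked from a zero baseline.
            "y": {
                "field": value,
                "type": "quantitative",
                "aggregate": "sum",
                "stack": "zero",
            },
            # One colored segment per group inside each bar.
            "color": {"field": group, "type": "nominal"},
        },
    }


# Hypothetical sample data.
rows = [
    {"quarter": "Q1", "region": "East", "sales": 10},
    {"quarter": "Q1", "region": "West", "sales": 15},
    {"quarter": "Q2", "region": "East", "sales": 12},
    {"quarter": "Q2", "region": "West", "sales": 9},
]
spec = stacked_bar_spec(rows, category="quarter", value="sales", group="region")
spec_json = json.dumps(spec, indent=2)
```

The three encodings (category, summed value, group) are the part that has to survive a move to D3 or matplotlib; everything else is framework syntax.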

Cody That's the test, yeah. And DV-Interact is the conversational one — a user simulator generates deliberately ambiguous requests, and the agent has to ask clarifying questions, maintain state across the conversation, and land on a visualization that matches what the user actually wanted. Not what they literally said.
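
Editor's note: "maintain state across the conversation" can be pictured as slot filling. The sketch below is a toy, not DV-Interact's user simulator or any agent from the paper: it assumes a chart needs three hypothetical slots (`chart_type`, `x`, `y`), keeps answers from earlier turns, and asks about the first slot that is still missing.

```python
from dataclasses import dataclass, field

# Hypothetical slots the agent needs before it can draw anything,
# each paired with the clarifying question that fills it.
QUESTIONS = {
    "chart_type": "What kind of chart do you want?",
    "x": "Which column goes on the x-axis?",
    "y": "Which measure should I plot?",
}


@dataclass
class ChartIntent:
    slots: dict = field(default_factory=dict)

    def update(self, **answers):
        """Merge one turn's answers; earlier slots persist unless overridden."""
        for name, value in answers.items():
            if name in QUESTIONS and value is not None:
                self.slots[name] = value

    def next_question(self):
        """Return the next clarifying question, or None when fully specified."""
        for name, question in QUESTIONS.items():
            if name not in self.slots:
                return question
        return None


intent = ChartIntent()
intent.update(y="revenue")                    # turn 1: "show me how revenue is doing"
first_question = intent.next_question()       # chart type is still unknown
intent.update(chart_type="line", x="month")   # turn 2: the user's answer
```

Losing `y="revenue"` between turn 1 and turn 2 is the kind of state inconsistency the benchmark is probing.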

Justy And how do you even grade that? Like, 'did the agent ask the right questions' is not a unit test.

Cody The evaluation is a hybrid. Table-value Alignment checks that the numbers in the output chart match the source data — hard prerequisite before you even care about aesthetics. Then a hierarchical MLLM-as-a-Judge setup scores visual semantics and layout compliance against expert-annotated rubrics.
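
Editor's note: the data-fidelity gate is easy to picture in code. This is a minimal sketch under stated assumptions, not the paper's Table-value Alignment metric: it assumes the plotted values have already been extracted from the chart as a label-to-number mapping, and the function name `table_value_alignment` and the tolerance are hypothetical.

```python
import math


def table_value_alignment(source, plotted, rel_tol=1e-6):
    """Fraction of source (label -> value) pairs reproduced in the chart.

    Extra labels in `plotted` are ignored in this toy version.
    """
    if not source:
        return 1.0
    hits = sum(
        1
        for label, value in source.items()
        if label in plotted and math.isclose(plotted[label], value, rel_tol=rel_tol)
    )
    return hits / len(source)


source = {"Q1": 10.0, "Q2": 20.0}    # values from the source table
plotted = {"Q1": 10.0, "Q2": 25.0}   # values read back from the chart
score = table_value_alignment(source, plotted)
passes_gate = score == 1.0           # only then would a judge look at aesthetics
```

Treating the check as a gate rather than one more weighted score matches Cody's description: a chart with wrong numbers fails regardless of how it looks.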

Justy MLLM-as-a-Judge is getting used everywhere and I'm never totally sure how much to trust it, especially for something as subjective as 'does this dashboard look professional.'

Cody That's fair. The expert-annotated rubrics help — it's not just vibes. But I'd want to see the human agreement numbers in detail before I fully trusted the aesthetic dimension. The data fidelity side I buy more easily.

Justy Fair. So what do the actual results look like? Because you mentioned fifty percent earlier.

Cody Mostly worse than fifty percent, honestly. DV-Sheet peaks at 40.48, and DV-Interact is right next to it at 40.43. DV-Evol is the best at 51.44, but still, these are the best available models. The main bottleneck in DV-Sheet is managing native object models and dynamic bindings. In DV-Evol it's semantic brittleness. In DV-Interact it's state consistency across turns.

Justy So if you're building a BI copilot right now, this is basically a map of where your agent is going to fall apart.

Cody That's exactly how I'd use it. Before you ship anything to users, run your agent against DV-Sheet's repair tasks. If it can't fix a broken chart binding, it's going to embarrass you in production.

Justy Alright, Cody — Build Next. What do you actually do with this?

Cody Three things. First, the project page is live at dv-world-project.github.io — the dataset and task specs are public, so you can pull the benchmark and start running evals against whatever model you're using. Second, if you're building anything with spreadsheet agents, go straight to the DV-Sheet tasks and specifically the DVSheet-Fix subset. Run your current agent against those repair tasks and see where it breaks — it'll tell you more than any synthetic benchmark you've been

Justy I like that last one. It's a weekend project that actually tells you something real, not just 'look my model made a pie chart.'

Cody [chuckles] The bar for impressing yourself with AI charts is way too low right now. This raises it.

Justy Forty percent. State of the art. Okay. That's Exploring Next — go break some chart bindings.