Ep 339 research 7:24 w/ Justy & Cody

SketchVLM: Vision language models can annotate images to explain thoughts and guide users

In this episode, Justy and Cody dig into SketchVLM, a training-free framework that lets vision-language models explain answers by drawing editable SVG annotations on top of images. They talk through why text-only answers are hard to verify, how SketchVLM uses a draft-and-refine loop plus visual grounding to produce overlays, where it looks production-friendly, and where the trade-offs still show up.

Script: GPT-5.4 mini Voice: ElevenLabs

Transcript

Justy When a model answers an image question with a wall of text, I always feel like, okay, but can I actually check that?

Justy You’re listening to Exploring Next, episode 339. Today we’re digging into SketchVLM, which basically gives vision models a way to point at the image instead of just talking at it.

Cody Yeah, and I think that matters right now because these assistants are moving into browser stuff, office stuff, everyday workflows. If the answer is about a photo, a chart, a screen, people don’t just want a conclusion. They want the model to show its work.

Justy Exactly. And humans already do this. We circle things, underline stuff, draw arrows. That’s just a way easier trust check than reading a paragraph and hoping the model saw the same thing you did.

Cody SketchVLM is trying to make that native to the model. The paper’s approach is training-free and model-agnostic, so you can wrap it around systems like Gemini-3-Pro-Preview or GPT-5 without retraining the backbone.

Justy That part feels important for production. Because if I’m building an app, I don’t want to wait for a custom fine-tune just to get a better explanation layer.

Cody Right. The output is a separate SVG overlay on top of the original image. So instead of editing pixels, it draws annotations in another layer. That means the source image stays intact, which is a big trust thing. If the model is wrong, at least it didn’t quietly rewrite the evidence.
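To make the separate-layer idea concrete, here is a minimal sketch in Python. It is not the paper's implementation: the function name `build_overlay`, the box format, and the styling are all assumptions for illustration. The point it shows is that the overlay is generated from the image dimensions and a list of regions only, so the source image is never opened or rewritten.

```python
from xml.sax.saxutils import escape


def build_overlay(width, height, boxes):
    """Return an SVG string with one rectangle and one label per box.

    boxes: list of dicts with pixel keys x, y, w, h and a text label
    (a hypothetical format chosen for this sketch).
    The original image is never read or modified here.
    """
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}" viewBox="0 0 {width} {height}">'
    ]
    for i, b in enumerate(boxes):
        # Outline only (fill="none") so the evidence underneath stays visible.
        parts.append(
            f'<rect id="box-{i}" x="{b["x"]}" y="{b["y"]}" '
            f'width="{b["w"]}" height="{b["h"]}" '
            f'fill="none" stroke="red" stroke-width="3"/>'
        )
        # Escape the label so model-produced text cannot break the markup.
        parts.append(
            f'<text id="label-{i}" x="{b["x"]}" y="{max(b["y"] - 6, 12)}" '
            f'fill="red" font-size="16">{escape(b["label"])}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


overlay = build_overlay(
    640, 480,
    [{"x": 40, "y": 60, "w": 120, "h": 80, "label": "dipstick <max>"}],
)
```

In a web page, a string like `overlay` could be positioned on top of the unmodified image element, which is what keeps the annotation pass non-destructive.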

Justy So it’s more like a markup pass than an image generation pass.

Cody Yeah. The paper frames it as visually grounding the answer. The model can label parts, connect dots, draw shapes around objects, even sketch trajectories in some of the reasoning tasks. They test across seven benchmarks, including maze navigation, ball-drop prediction, object counting, part labeling, and a few drawing-style tasks.

Justy And the results are pretty strong, right? I saw the claim about a gain of up to 28.5 percentage points on visual reasoning.

Cody That’s the headline, yeah. And annotation quality goes up by as much as 1.48 times versus image-editing and fine-tuned sketching baselines. I think the interesting part is they also say the annotations are more faithful to the model’s stated answer. That’s the bit I’d care about if this shipped.

Justy Because if the overlay and the text disagree, users are gonna notice fast.

Cody Exactly. The mechanism seems to be a draft-and-refine style setup. In single-turn mode, the model generates the answer and the visual annotations in one shot, and that already works surprisingly well. Then multi-turn lets the user or system push back and refine the sketch, which opens up collaboration instead of just one-and-done output.

Justy That feels a lot closer to a real product than a one-off demo. Like, if I’m checking a car’s oil level or asking what part of a diagram matters, I can see a loop where the system highlights something, then I ask it to adjust or explain a different region.

Cody Yeah, and the non-destructive part is what makes that plausible. If you’re layering editable SVG, you can nudge a circle, move an arrow, remove a label. That’s much easier than trying to regenerate an image every time the interaction changes.
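The kind of edit Cody describes can be sketched with Python's standard `xml.etree.ElementTree`. This is an illustration under assumptions, not the paper's code: the helper names `nudge_circle` and `remove_by_id`, the element ids, and the sample SVG are all made up for the example. It shows that moving a circle or deleting a label is a small change to the overlay markup, with no image regeneration involved.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # keep output free of ns0: prefixes


def nudge_circle(root, element_id, dx, dy):
    """Shift the circle with the given id by (dx, dy). Returns True if found."""
    for el in root.iter(f"{{{SVG_NS}}}circle"):
        if el.get("id") == element_id:
            el.set("cx", str(float(el.get("cx")) + dx))
            el.set("cy", str(float(el.get("cy")) + dy))
            return True
    return False


def remove_by_id(root, element_id):
    """Remove the first element with the given id. Returns True if removed."""
    for parent in root.iter():
        for child in list(parent):
            if child.get("id") == element_id:
                parent.remove(child)
                return True
    return False


# Hypothetical overlay with one circle and one label.
svg_text = (
    f'<svg xmlns="{SVG_NS}" width="200" height="200">'
    '<circle id="c1" cx="50" cy="50" r="20" fill="none" stroke="red"/>'
    '<text id="t1" x="50" y="20">oil level</text>'
    "</svg>"
)
root = ET.fromstring(svg_text)
nudge_circle(root, "c1", 10, -5)   # user drags the circle slightly
remove_by_id(root, "t1")           # user deletes the label
edited = ET.tostring(root, encoding="unicode")
```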

Justy I do wonder, though, how brittle the grounding is when the image is messy. Like, a crowded screen or a cluttered photo seems tougher than a clean benchmark.

Cody I think that’s a fair concern. The paper is strong on the benchmarks they picked, but real-world scenes are uglier. If the grounding is off by even a little, the annotation can feel confident and still be misleading. So I wouldn’t treat it as solved. I’d treat it as a promising interaction layer.

Justy That’s the part I keep coming back to. The value isn’t just accuracy. It’s whether a normal user can glance at the overlay and say, okay, I get why the model thinks that.

Cody Yep. And if I were building on this, I’d start with the simplest possible stack. Take an existing VLM, ask it for both answer and regions of interest, then render SVG highlights in the browser. No fancy training. Just see if users trust it more and if they catch mistakes faster.
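A rough sketch of that simplest stack, in Python, with loud assumptions: the JSON reply shape (an `answer` string plus `regions` with normalized `[x0, y0, x1, y1]` boxes) is a hypothetical prompt contract, not something the paper or any VLM API specifies, and the model call itself is replaced by a hard-coded string. It shows the two steps after the model responds: validating the regions, then turning them into SVG highlights.

```python
import json
from xml.sax.saxutils import escape


def parse_reply(reply_text):
    """Parse a (hypothetical) JSON reply into an answer and clean regions."""
    data = json.loads(reply_text)
    answer = str(data.get("answer", ""))
    regions = []
    for r in data.get("regions", []):
        try:
            # Clamp every coordinate into the normalized 0..1 range.
            box = [min(max(float(v), 0.0), 1.0) for v in r["box"]]
        except (KeyError, TypeError, ValueError):
            continue  # skip malformed regions instead of failing the request
        if len(box) != 4 or box[2] <= box[0] or box[3] <= box[1]:
            continue  # skip empty or inverted boxes
        regions.append({"label": str(r.get("label", "")), "box": box})
    return answer, regions


def regions_to_svg(regions, width, height):
    """Scale normalized boxes to pixels and emit an SVG overlay string."""
    shapes = []
    for r in regions:
        x0, y0, x1, y1 = r["box"]
        x, y = x0 * width, y0 * height
        w, h = (x1 - x0) * width, (y1 - y0) * height
        shapes.append(
            f'<rect x="{x:.1f}" y="{y:.1f}" width="{w:.1f}" height="{h:.1f}" '
            f'fill="none" stroke="orange" stroke-width="3"/>'
        )
        shapes.append(
            f'<text x="{x:.1f}" y="{y + h + 16:.1f}" fill="orange">'
            f'{escape(r["label"])}</text>'
        )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {width} {height}">'
        + "".join(shapes)
        + "</svg>"
    )


# Stand-in for a model reply; the second region is inverted and gets dropped.
reply = (
    '{"answer": "Two mugs", "regions": ['
    '{"label": "mug 1", "box": [0.1, 0.2, 0.3, 0.6]},'
    '{"label": "bad", "box": [0.5, 0.5, 0.4, 0.9]}]}'
)
answer, regions = parse_reply(reply)
svg = regions_to_svg(regions, 800, 600)
```

Dropping bad regions rather than drawing them fits the trust concern raised earlier in the conversation: a missing highlight is easier to notice and question than a confident box in the wrong place.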

Justy For a solo builder, that actually sounds doable over a weekend. Pull a VLM API, make a tiny web app, and let it draw boxes, arrows, and labels on uploaded images.

Cody And if you want to go a little deeper, the paper’s demo and code are live, so you can inspect how they’re structuring the overlay layer and the annotation loop. I’d also test the multi-turn flow with a few ugly images, not just clean examples.

Justy Yeah, because that’s where the product question shows up. Not can it draw something, but can it help me verify the thing I care about without making me do detective work.

Cody That’s the real bar.

Justy Alright, that’s SketchVLM. I’m going to keep thinking about the overlay idea, because honestly, that’s the part that feels most shippable to me.