Visual imitation learning: Guidde trains AI agents on human 'expert video' instead of documentation
Guidde raised $50M to solve enterprise AI's 'last mile' problem by training agents on video recordings of human experts, not documentation. Instead of PDFs, they capture rich telemetry—every click, scroll, and DOM change—creating 'digital world models' that let AI navigate complex enterprise software with human-like spatial awareness.
Script: Sonnet 4.5
Voice: ElevenLabs
Transcript
Izzo Your AI agent just got stuck trying to create a ticket in ServiceNow. Again.
Izzo You're listening to Exploring Next, episode two-ten. I'm Izzo, and with me is Boone. Today we're diving into Guidde—a startup that just raised fifty million to train AI agents on video recordings of human experts instead of dusty PDFs.
Boone And this is actually brilliant timing, Izzo. We're hitting this weird inflection point where companies are deploying AI agents but they keep failing at the last mile.
Izzo Right, like you spend a million dollars on some fancy AI tool, but nobody knows how to use it because all you got was a thirty-minute training session.
Boone Exactly. And the agents themselves are even worse—they hallucinate the moment they hit your actual enterprise software because GPT-4 was never trained on your specific Salesforce setup.
Izzo So Guidde's approach is fascinating. Instead of feeding agents static documentation, they capture what they call 'Video Ground Truth'—basically recording human experts as they navigate complex software.
Boone But here's the clever part—they're not just recording pixels. They're capturing every click, scroll, DOM change, even the subtle pauses when a system lags.
Izzo Boone, break that down for me. What's the technical difference between a screen recording and what Guidde's doing?
Boone Think of it like this—a normal screen recording is like filming a driver. Guidde is capturing the steering wheel angle, brake pressure, and GPS coordinates. They're synchronizing video frames with underlying HTML metadata.
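To make Boone's point about synchronizing video frames with interaction metadata concrete, here is a minimal sketch of timestamp-based alignment. It is not Guidde's implementation: the `UiEvent` fields, the function names, and the fixed frame rate are all assumptions made for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class UiEvent:
    # Hypothetical telemetry record; field names are illustrative only.
    t_ms: int       # milliseconds since the recording started
    kind: str       # e.g. "click", "scroll", "dom_change"
    selector: str   # CSS selector of the element involved


def frame_timestamps(fps: float, n_frames: int) -> list[float]:
    """Start time in ms of each frame, assuming a constant frame rate."""
    return [i * 1000.0 / fps for i in range(n_frames)]


def align_events_to_frames(
    events: list[UiEvent], frame_ts: list[float]
) -> dict[int, list[UiEvent]]:
    """Attach each event to the frame that was on screen when it fired.

    Events after the final frame's start time attach to the final frame;
    events before the first frame are dropped.
    """
    by_frame: dict[int, list[UiEvent]] = {}
    for ev in sorted(events, key=lambda e: e.t_ms):
        # Index of the last frame whose start time is <= the event time.
        idx = bisect_right(frame_ts, ev.t_ms) - 1
        if idx < 0:
            continue
        by_frame.setdefault(idx, []).append(ev)
    return by_frame


frames = frame_timestamps(fps=10, n_frames=5)  # frames start at 0, 100, ..., 400 ms
events = [
    UiEvent(t_ms=50, kind="click", selector="#new-ticket"),
    UiEvent(t_ms=100, kind="dom_change", selector="#ticket-form"),
    UiEvent(t_ms=399, kind="scroll", selector="body"),
]
aligned = align_events_to_frames(events, frames)
```

In this sketch the click lands on frame 0, the DOM change on frame 1, and the scroll on frame 3, so each frame carries the structured record of what happened while it was visible.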
Izzo So they're building what they call a 'digital world model' of enterprise software. Each company gets their own unique map.
Boone Right, and that's their moat. Because every enterprise uses a different mix of apps and custom configurations. You can't just train on generic Salesforce—you need the telemetry from how this specific company actually uses it.
Izzo The architecture is interesting too. They're using a fleet of models instead of relying on one. Gemini for visual tasks, Claude for narrative scripts.
Boone Smart move. With a fleet, it's like having models fact-check each other. And they've got feedback loops: when users edit videos, that data goes back into training so the same mistakes aren't repeated.
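The fleet-of-models idea and the edit feedback loop from this exchange can be sketched as a small router. Everything here is hypothetical: the stub functions stand in for real model clients and call no vendor API, the task-type names are invented, and nothing below reflects how Guidde actually wires its models together.

```python
from typing import Callable

ModelFn = Callable[[str], str]


def make_stub(name: str) -> ModelFn:
    """Return a stand-in for a real model client; it only echoes the prompt."""
    def call(prompt: str) -> str:
        return f"[{name}] {prompt}"
    return call


class Orchestrator:
    """Route each task type to its own model and log human corrections."""

    def __init__(self, routes: dict[str, ModelFn], default: ModelFn) -> None:
        self.routes = routes
        self.default = default
        # (task_type, model_output, human_edit) triples for later retraining.
        self.corrections: list[tuple[str, str, str]] = []

    def run(self, task_type: str, prompt: str) -> str:
        # Fall back to the default model for unknown task types.
        return self.routes.get(task_type, self.default)(prompt)

    def record_edit(self, task_type: str, model_output: str, human_edit: str) -> None:
        # Only edits that actually changed the output are useful feedback.
        if human_edit != model_output:
            self.corrections.append((task_type, model_output, human_edit))


orchestrator = Orchestrator(
    routes={
        "visual": make_stub("vision-model"),     # placeholder name
        "script": make_stub("narrative-model"),  # placeholder name
    },
    default=make_stub("fallback-model"),
)

draft = orchestrator.run("script", "Narrate: open the ticket form")
orchestrator.record_edit("script", draft, "Open the ticket form from the sidebar.")
```

Keeping the routing table separate from the model clients means a task type can be reassigned to a different model without touching the calling code, and the `corrections` log is the raw material a feedback loop would consume.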
Izzo From a product standpoint, I love that they're solving two problems at once. The same videos that train humans also train the AI agents.
Boone It's like building Waymo for computer usage. The human demonstrations become the training data for autonomous navigation of enterprise UIs.
Izzo And the numbers are solid—forty-one percent reduction in video creation time, thirty-four percent fewer support tickets. That's real impact.
Boone Plus they've automated the whole production pipeline. Used to take weeks to create one training video with multiple teams. Now it's seconds.
Izzo The Magic Redaction feature is clever too: it automatically obscures sensitive data during capture, which helps with HIPAA compliance.
Boone Yeah, and they're replacing like six different tools—Loom, Adobe Premiere, ElevenLabs, Synthesia—with one AI-native platform.
Izzo What I find compelling is the timing. We're right at this moment where agentic AI is becoming real, but the knowledge infrastructure is completely broken.
Boone Totally. Foundation models are great at general reasoning but terrible at 'click the third button from the left in your custom SAP interface.'
Izzo And with forty-five hundred enterprise customers already, they're building a dataset that's really hard to replicate.
Boone The Vision-Language-Action training sets they're creating—that's the secret sauce. Most companies are still thinking in terms of text documentation.
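For readers unfamiliar with the term, a Vision-Language-Action training example pairs what the agent sees, the task in words, and the action the expert took. The sketch below shows one plausible shape for such a record. The class, field names, and `(frame_index, kind, target)` step format are assumptions for illustration, not a format Guidde or any VLA framework has published.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VlaExample:
    # Hypothetical record layout for one training example.
    frame_index: int   # vision: which video frame the agent observes
    instruction: str   # language: the task being demonstrated
    action: dict       # action: what the human expert did at that frame


def build_examples(
    instruction: str, steps: list[tuple[int, str, str]]
) -> list[VlaExample]:
    """Turn a recorded demonstration into per-step training examples.

    Each step is a (frame_index, kind, target) tuple.
    """
    return [
        VlaExample(
            frame_index=frame,
            instruction=instruction,
            action={"kind": kind, "target": target},
        )
        for frame, kind, target in steps
    ]


examples = build_examples(
    "Create a ticket",
    [(0, "click", "#new-ticket"), (12, "type", "#title")],
)
```

A text-only document would capture just the `instruction` field; the frame and action fields are what a video demonstration adds.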
Izzo I'm giving this approach a solid A-minus. The execution looks strong, the timing is perfect, and they're solving a real pain point.
Boone Only thing I'd want to see is how well the agents actually perform in production versus demos. But the technical foundation is sound.
Izzo So what should people go build with this? First, check out Guidde's free tier—you can capture up to twenty-five videos and see how the telemetry capture works.
Boone Second, if you're building agents, look into Vision-Language-Action frameworks. There's some interesting work happening in the VLA space that connects to this.
Izzo And third, experiment with multimodal model orchestration. The idea of using different models for different tasks instead of one giant model is worth exploring.
Boone Adding that to the weekend project list. Though at this point, I might need a separate spreadsheet just for the backlog.
Izzo Next time your AI agent gets confused navigating enterprise software, remember: it probably just needs better training data. We'll be back in two weeks.