Ep 373 api 8:10 w/ Justy & Cody

Gemini API File Search is now multimodal: build efficient, verifiable RAG

Justy and Cody dig into Gemini API File Search getting multimodal retrieval, metadata filters, and page-level citations, and why that matters for anyone tired of flaky RAG over PDFs and image folders.

Script: GPT-5.4 Voice: ElevenLabs

Transcript

Justy This one matters because a lot of people are still doing the same cursed thing, Cody. They dump PDFs and image folders into a system, ask a question, and get back something that sounds right but that they can't verify.

Cody Yeah, and Google's update is pretty direct about that pain. Gemini API File Search now adds multimodal retrieval, custom metadata filtering, and page-level citations, which is basically them saying, okay, your RAG stack has to handle messy reality now.

Justy I had too much coffee and still somehow almost missed that this is the useful kind of platform update, not just a model refresh. Anyway, the practical shift is you can search text and visuals together instead of treating images like dead attachments.

Cody Right. Under the hood they say it's powered by Gemini Embedding 2, and the important bit is native image understanding in retrieval, not just OCR pasted into a text index. So if somebody asks for an image with a certain mood or style, the system can retrieve on visual semantics, not filename luck.
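
A minimal sketch of what "retrieving on visual semantics, not filename luck" means in practice, assuming images and queries are both mapped to vectors and compared by cosine similarity. This does not call the Gemini API; the three-number vectors are made-up stand-ins for real embeddings, and the file names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy 3-d vectors standing in for real embeddings (values are invented).
ASSETS = {
    "IMG_0042.png": [0.9, 0.1, 0.0],  # imagine: dark, moody hero shot
    "final_v2.png": [0.1, 0.9, 0.1],  # imagine: bright product grid
    "scan.pdf": [0.0, 0.2, 0.9],      # imagine: text-heavy contract
}

def nearest(query_vec, assets, k=1):
    """Return the k asset names whose vectors are closest to the query."""
    ranked = sorted(assets, key=lambda name: cosine(query_vec, assets[name]), reverse=True)
    return ranked[:k]

query = [0.8, 0.2, 0.1]  # stand-in for an embedded "moody hero image" query
print(nearest(query, ASSETS))  # → ['IMG_0042.png']
```

The file name IMG_0042.png says nothing about mood, yet it ranks first because the match happens in vector space.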

Justy That creative-agency example from the post is actually pretty good. The user story is a person who does not remember where a thing lives, only what it felt like or what job it served, and keyword search is terrible at that.

Cody And the metadata piece is almost more important than the multimodal headline. If you can tag files with key-value labels like department equals Legal or status equals Final, then at query time you shrink the search space before generation even starts. That's good for latency, cost, and honestly for not embarrassing yourself with the wrong doc.
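
A minimal sketch of the pre-filtering idea Cody describes, assuming each file carries a flat dictionary of key-value labels. It is a local illustration only, not the File Search filter syntax; the function names and sample files are hypothetical.

```python
def matches(metadata, filters):
    """True when every filter key is present with exactly the requested value."""
    return all(metadata.get(key) == value for key, value in filters.items())

def prefilter(files, filters):
    """Keep only files whose metadata satisfies all filters."""
    return [f for f in files if matches(f["metadata"], filters)]

# Hypothetical corpus using the labels mentioned in the episode.
FILES = [
    {"name": "msa_draft.pdf", "metadata": {"department": "Legal", "status": "Draft"}},
    {"name": "msa_final.pdf", "metadata": {"department": "Legal", "status": "Final"}},
    {"name": "brand_guide.pdf", "metadata": {"department": "Marketing", "status": "Final"}},
]

# Narrow the candidate set before any ranking or generation happens.
candidates = prefilter(FILES, {"department": "Legal", "status": "Final"})
print([f["name"] for f in candidates])  # → ['msa_final.pdf']
```

Only one of three files survives the filter, so whatever ranking or generation comes next works on a third of the corpus and cannot surface the draft.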

Justy Totally. The market for this is every team that has unstructured stuff piling up and no appetite to build retrieval plumbing from scratch. Internal knowledge tools, contract lookup, support archives, brand libraries, maybe compliance-heavy workflows where provenance is the whole game.

Cody The page citations are the trust move. They say File Search captures the page number for every piece of indexed information, so when the model answers from a giant PDF, you can point back to the exact page. That's not full formal verification or anything, but it is way better than "trust me, bro." [chuckles]
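
A minimal sketch of page-level provenance, assuming each indexed chunk simply remembers which file and page it came from. The Chunk type and helper names are hypothetical and say nothing about how File Search stores this internally.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    source: str
    page: int
    text: str

def index_pages(source, pages):
    """Turn a list of page texts into chunks, recording 1-based page numbers."""
    return [Chunk(source, number, text) for number, text in enumerate(pages, start=1)]

def cite(chunks):
    """Format supporting chunks as 'file, p. N' references, deduplicated in order."""
    refs = []
    for chunk in chunks:
        ref = f"{chunk.source}, p. {chunk.page}"
        if ref not in refs:
            refs.append(ref)
    return refs

# Hypothetical three-page document.
chunks = index_pages(
    "policy.pdf",
    ["Scope and definitions", "Retention is seven years", "Exceptions"],
)
hits = [c for c in chunks if "retention" in c.text.lower()]
print(cite(hits))  # → ['policy.pdf, p. 2']
```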

Justy [laughs] Yeah. And from a product angle, that changes whether someone will actually use the tool at work. If the answer comes with page references, now it fits fact-checking, review, approval flows, all the annoying real-world moments where somebody asks, okay but where did that come from?

Cody I do think there's a trade-off hiding here. Metadata filtering only helps if your ingestion pipeline is disciplined. If your team uploads six versions of the same PDF and nobody labels anything consistently, the retrieval layer gets blamed for an ops problem.
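
A minimal sketch of the ingestion discipline Cody is asking for, assuming a gate that runs before upload: it rejects byte-identical duplicates and files with missing or off-vocabulary labels. The label schema and allowed values below are hypothetical.

```python
import hashlib

REQUIRED_KEYS = {"department", "status"}  # hypothetical label schema
ALLOWED_STATUS = {"Draft", "Final"}       # hypothetical controlled vocabulary

def content_hash(data: bytes) -> str:
    """Stable fingerprint of the file bytes, used to spot exact duplicates."""
    return hashlib.sha256(data).hexdigest()

def check_upload(data, metadata, seen_hashes):
    """Return a list of problems; an empty list means the file is OK to ingest."""
    problems = []
    digest = content_hash(data)
    if digest in seen_hashes:
        problems.append("duplicate content")
    missing = REQUIRED_KEYS - metadata.keys()
    if missing:
        problems.append("missing labels: " + ", ".join(sorted(missing)))
    if "status" in metadata and metadata["status"] not in ALLOWED_STATUS:
        problems.append(f"unknown status: {metadata['status']}")
    if not problems:
        seen_hashes.add(digest)  # remember only files that were accepted
    return problems

seen = set()
print(check_upload(b"contract", {"department": "Legal", "status": "Final"}, seen))
print(check_upload(b"contract", {"department": "Legal", "status": "Final"}, seen))
print(check_upload(b"other", {"status": "final"}, seen))
```

A hash only catches exact copies; six lightly edited versions of the same PDF would still need a status or version label to tell them apart.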

Justy That's the adoption barrier to me. Not really the API. It's getting the documents clean enough and the labels stable enough that people trust the outputs. A lot of teams want magic, but they actually need a boring content workflow first.

Cody Also, compared with rolling your own stack, this is attractive because File Search is taking on the infrastructure work. You don't have to stitch together separate image embeddings, vector storage, filtering logic, and citation mapping yourself. The question, I think, is how much control advanced teams give up versus how much time they save.

Justy And most teams should probably take the time savings. Unless retrieval is your product, shipping the thing usually beats building a bespoke index layer because a founder had a long weekend and opinions. Cody, I know that look. [chuckles]

Cody Rude but fair. [pause] I was literally thinking, okay, but I could wire this myself. Still, the clever part here is the combination, not any one feature alone. Multimodal retrieval gets you broader recall, metadata narrows the blast radius, citations make the answer usable.

Justy If I were trying this fast, I'd build an internal PDF assistant for contracts or policy docs where every answer has to show page citations. Solo-builder version is even simpler: upload a small corpus, add metadata like team, status, year, and make a tiny chat UI that refuses to answer without a cited source page.
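
A minimal sketch of the "refuses to answer without a cited source page" rule, assuming the retrieval step hands back a list of citation dictionaries with file and page keys. That shape is an assumption for illustration, not the actual response format.

```python
REFUSAL = "I can't answer that from the indexed documents with a page citation."

def answer_with_receipts(draft_answer, citations):
    """Return the answer only if at least one citation has a usable page number."""
    usable = [
        c for c in citations
        if isinstance(c.get("page"), int) and c["page"] >= 1
    ]
    if not usable:
        return REFUSAL
    refs = "; ".join(f"{c['file']} p. {c['page']}" for c in usable)
    return f"{draft_answer} [Sources: {refs}]"

print(answer_with_receipts("Retention is seven years.", [{"file": "policy.pdf", "page": 2}]))
print(answer_with_receipts("Retention is seven years.", []))
```

Putting the guard in the UI layer means a missing citation turns into a visible refusal instead of an unverifiable answer.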

Cody Yeah, or do the visual side. Make an asset finder for screenshots, ads, or design references where the query is a natural-language brief like "moody homepage hero, blue tones, product close-up." Start with the Gemini API docs and File Search guide from the post, ingest a mixed folder of PDFs and images, then test retrieval with and without metadata filters so you can feel the difference.
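
A minimal sketch of the with-and-without comparison Cody suggests, assuming a handful of test questions with known right answers. The search function here is a toy keyword-overlap stand-in for a real retrieval call, and the corpus and cases are hypothetical.

```python
def search(corpus, query_terms, filters=None, k=1):
    """Toy keyword-overlap search; a stand-in for a real retrieval call."""
    filters = filters or {}
    pool = [
        d for d in corpus
        if all(d["metadata"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(pool, key=lambda d: len(query_terms & d["terms"]), reverse=True)
    return [d["name"] for d in ranked[:k]]

def hit_rate(corpus, cases, use_filters):
    """Fraction of test cases whose expected file shows up in the top result."""
    hits = 0
    for case in cases:
        filters = case["filters"] if use_filters else None
        if case["expected"] in search(corpus, case["terms"], filters):
            hits += 1
    return hits / len(cases)

CORPUS = [
    {"name": "nda_v1.pdf", "terms": {"nda", "termination"}, "metadata": {"status": "Draft"}},
    {"name": "nda_v3.pdf", "terms": {"nda", "termination"}, "metadata": {"status": "Final"}},
    {"name": "hero_blue.png", "terms": {"moody", "blue", "hero"}, "metadata": {"status": "Final"}},
]
CASES = [
    {"terms": {"nda", "termination"}, "filters": {"status": "Final"}, "expected": "nda_v3.pdf"},
    {"terms": {"moody", "hero"}, "filters": {"status": "Final"}, "expected": "hero_blue.png"},
]

print(hit_rate(CORPUS, CASES, use_filters=False))  # → 0.5
print(hit_rate(CORPUS, CASES, use_filters=True))   # → 1.0
```

The unfiltered run misses the contract question because the draft and final NDA look identical to the ranker; the status filter is what breaks the tie.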

Justy That's a good weekend project because the win is obvious fast. If the thing can find the right page or the right image without you playing document archaeologist for an hour, it's doing real work. Anyway, I think this one has legs, Cody. Episode 373 gets to be the citations episode.

Cody Which is somehow the most us outcome possible.

Justy Yep. Go make the robot show its receipts, and maybe label your files like an adult.