Ep 391 api 5:15 w/ Justy & Cody

Local First AI Inference: A Cloud Architecture Pattern for Cost Effective Document Processing

Justy and Cody debate Local-First AI Inference — a pattern that routes most documents to deterministic local extraction while falling back to cloud AI for edge cases. They unpack the signal in the noise: who actually benefits, the clever confidence-gated routing, the real cost savings, and the architectural trade-offs. Then they lay out concrete ways to test the claims over a weekend.

Script: Mistral Small 4 119B 2603 Voice: Deepgram TTS

Transcript

Justy So we just paid like four hundred bucks this quarter for Azure OpenAI to churn through a bunch of engineering drawings that probably could’ve been scraped with regex and a little patience.

Cody Right. And you’re telling me we could’ve saved seventy-five percent of that cost if we’d just asked, ‘does this doc even need a cloud call?’ before we spun up the endpoint.

Justy Yeah. That’s the Local-First AI Inference pattern the article lays out — seventy to eighty percent of predictable PDFs route to deterministic local extraction at zero cost, then you only hit the cloud for the weird twenty or thirty percent.

Cody Mm-hm. And they built a composite scoring function to route reliably — not just text presence, not just a single metric, but four criteria: spatial, anchor, format, and contextual signals.

Cody Spots the difference between a title block scoring ninety-eight and a revision history scoring sixty-six on the same character string.

Justy Exactly. And once you wire the tiers, you’ve got a three-tier architecture: local deterministic, cloud AI as a last resort, and a human review lane to catch the rare misses. Bounds the error rate so the cloud-only ‘silent hallucination’ problem and the local-only ‘missed scans’ problem both go away.

Cody Sure. They ran it on four thousand seven hundred engineering drawings. Cloud-first would’ve cost forty-seven bucks and eaten a hundred minutes of wall time.

Cody Local-first cut it to ten to fifteen dollars and forty-five minutes, and they kept the human review queue tight enough to stay sane.

Justy But Cody, who’s actually adopting this? Verticals with boring, repeatable layouts — invoices, regulatory filings, those kinds of documents — they’re the sweet spot.

Cody Wait—you’re saying teams will refactor their whole pipeline for twenty-cent savings per doc?

Justy Not for the twenty-cent savings, for the forty-seven-dollar surprise that shows up on next month’s Azure bill when thirty percent of those invoices hallucinate a line item and you have to queue them all for review.

Cody Fine. But you’re underestimating the effort to tune that composite scorer. Five iterations of prompt hammering just to get from eighty-nine to ninety-eight percent accuracy.

Justy That’s the point — you only do that work once your corpus is stable enough to justify it. The alternative is shipping a cloud endpoint today and debugging drift six weeks later when the model quietly invents a new line item.

Cody Task-specific validation sets over vendor benchmarks, they say. They upgraded from GPT-4.1 to GPT-5+ and saw zero lift on their four-hundred-file set.

Justy Which means if you’re not measuring on your own documents, you’re paying for a shiny model that doesn’t move the needle.

Cody Okay, but who maintains the human review lane once it’s live? That queue isn’t free.

Justy It’s cheaper than twice-a-year surprise cloud bills when half your corpus leaks into hallucination retry loops.

Cody Fair. Still feels like a lot of plumbing for edge cases most teams will never hit.

Justy Maybe. But if your business runs on predictable layouts fifty weeks a year, the local-first router pays for itself the first time a cloud endpoint silently invents an invoice total.

Cody So this is basically a refactoring play for orgs that hate surprise Azure bills and love saying ‘I could’ve written a regex for that.’

Justy More like they love not getting woken up at 2 a.m. because the model hallucinated a missing PO number.

Cody Yeah. Still sounds like a weekend project someone in DevOps builds once, then forgets to maintain.

Justy Only if they treat the composite scorer like a fire-and-forget script instead of a product component tied to a stable corpus.

Cody So the real test is whether you’ll still be running the same router a year from now when the invoice layouts drift.

Justy That’s why you keep the human review lane: it’s the canary that tells you when the composite starts to stink.

Cody Alright. Let’s say I buy it. How do we actually poke at this?

Justy Two ways. One, a solo hack: grab pdfminer.six, wire a tiny composite scorer — local if format match and anchor present, cloud else — and use a local LLM as the fallback edge case.

Cody Two hundred lines of Python, no cloud budget required.

Justy Perfect. And two, if you’re on a team: drop a tiny Azure Function in front of your existing pipeline. It logs every local hit and every cloud call, and spits out cost deltas and latency over a week of real docs. Then you decide whether to refactor.

Cody So the verdict is: if your corpus is stable and layout-predictable, the three-tier Local-First router is worth the upfront legwork.

Justy Yep. Otherwise, keep hitting the cloud endpoint and budget for surprise retries — at least until the bill hits your desk.

Cody Good thing we’re not shipping anything mission-critical this week.

Justy True. Coffee’s cold anyway. Anyway—thread this thing into our backlog before someone spins up another hundred-dollar OpenAI run just to parse a PDF that’s mostly boxes and arrows.