I Spent May Evaluating Different Engines for OCR | Towards Data Science
Justy and Cody react to a hands-on OCR engine shootout across 93 messy real-world documents. The author’s core claim: OCR is now a routing problem, not a single-engine race—specialist models excel in their niche but break on out-of-domain docs, while paid structured APIs may be overkill for many use cases. They debate the economics, practicality of ‘classify-then-route,’ and whether most teams should just test on their own data.
Script: Mistral Medium 3.5 128B Voice: Hume TTS
Transcript
Justy Okay, so I read this thing where someone spent May running ninety-three hideous documents through fourteen OCR engines just to answer one question: do you actually need to pay sixty-five bucks per thousand pages for structured output?
Cody Right.
Justy And the answer—surprise—is a hard maybe. Unless your docs are clean PDFs, in which case this isn’t even a conversation.
Cody Or unless you like throwing money at AWS. I mean, Textract’s fine, but sixty-five per thousand?
Justy Exactly. And she tested everything—old invoices, handwritten notes, tax forms, even scanned newspapers.
Cody Mm-hm.
Justy Anyway—my week was spent wrestling with expense reports, which, turns out, are the perfect real-world OCR stress test. Cody, how was the red-eye?
Cody Ugh. Landed at four, crashed at the apartment, woke up to my neighbor’s dog howling at a garbage truck. So, living the dream.
Justy So you’re saying your brain is also a garbage truck right now.
Cody Fair. But fine, back to your expense reports. What’d this experiment actually prove?
Justy That OCR’s not a winner-take-all game anymore. The specialist models crushed it on their specific thing—tables, handwriting—but fell apart the second the doc strayed. So her big takeaway? OCR’s a routing problem. Classify first, then pick the engine.
Cody Okay, but that’s a lot of infrastructure for most teams. You’re telling me I need a classifier, a router, and then a fleet of models just to parse a PDF?
Justy No, I’m telling you that if you’re paying for structured output on every single thing, you’re probably overspending. Route the messy stuff to the heavy hitters, send the simple scans to the cheap open-source models, and save a ton.
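The classify-then-route idea Justy describes can be sketched roughly as follows. This is a minimal illustration, not the article author's pipeline: the `Doc` fields, the `classify` rules, and the four engine functions are all hypothetical stand-ins, and a real classifier would look at the page image rather than pre-set flags.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Doc:
    # Hypothetical document record; the flags stand in for real image analysis.
    name: str
    has_tables: bool = False
    handwritten: bool = False


def classify(doc: Doc) -> str:
    """Assign a coarse document type (assumed labels, for illustration only)."""
    if doc.handwritten:
        return "handwriting"
    if doc.has_tables:
        return "table"
    return "simple"


# Placeholder engines: each just reports which engine would have been called.
def cheap_ocr(doc: Doc) -> str:
    return f"cheap_ocr:{doc.name}"


def table_specialist(doc: Doc) -> str:
    return f"table_specialist:{doc.name}"


def handwriting_specialist(doc: Doc) -> str:
    return f"handwriting_specialist:{doc.name}"


def structured_api(doc: Doc) -> str:
    return f"structured_api:{doc.name}"


ROUTES: Dict[str, Callable[[Doc], str]] = {
    "simple": cheap_ocr,
    "table": table_specialist,
    "handwriting": handwriting_specialist,
}


def route(doc: Doc) -> str:
    """Classify first, then pick the engine; unknown labels fall back to the paid API."""
    engine = ROUTES.get(classify(doc), structured_api)
    return engine(doc)


if __name__ == "__main__":
    for d in [Doc("scan_01"), Doc("invoice_07", has_tables=True), Doc("note_03", handwritten=True)]:
        print(route(d))
```

Keeping the routing table as a plain dictionary means adding or swapping an engine is a one-line change, which is roughly the scale at which this stays "simple routing."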
Cody Right, right. And if your ‘messy stuff’ is fifty percent of your volume, suddenly you’re maintaining a whole pipeline just to shave a few cents off per page.
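Cody's objection is easy to put numbers on. The sketch below computes a blended cost per thousand pages when only the messy fraction goes to the paid tier. The sixty-five dollar figure comes from the discussion; the `cheap_price` of 1.50 dollars per thousand pages is an assumed placeholder, not a quoted price for any real engine, and the calculation ignores the cost of building and running the router itself.

```python
def blended_cost_per_1000(messy_fraction: float,
                          premium_price: float = 65.0,
                          cheap_price: float = 1.50) -> float:
    """Cost per 1,000 pages when messy pages use the premium tier and the rest use the cheap tier.

    cheap_price is an assumed illustrative value, not a real quote.
    """
    if not 0.0 <= messy_fraction <= 1.0:
        raise ValueError("messy_fraction must be between 0 and 1")
    return messy_fraction * premium_price + (1.0 - messy_fraction) * cheap_price


def savings_per_page(messy_fraction: float,
                     premium_price: float = 65.0,
                     cheap_price: float = 1.50) -> float:
    """Savings per single page versus sending everything to the premium tier."""
    blended = blended_cost_per_1000(messy_fraction, premium_price, cheap_price)
    return (premium_price - blended) / 1000.0


if __name__ == "__main__":
    for frac in (0.1, 0.5, 0.9):
        print(frac, blended_cost_per_1000(frac), savings_per_page(frac))
```

Under these assumed prices, a fifty percent messy share gives a blended cost of 33.25 dollars per thousand pages, a saving of a little over three cents per page. Whether that justifies a pipeline depends on volume.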
Justy Cody, you’re doing the thing where you assume the worst-case scenario is the only scenario.
Cody No, I’m doing the thing where I’ve seen ‘simple routing’ turn into a six-month project because someone didn’t account for the edge cases.
Justy Okay okay. But her actual advice is just to test on your own data. Stop trusting public benchmarks. Run your docs through a few engines, see what fails, and then decide.
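A "test on your own data" run can be as small as the sketch below: score each candidate engine against hand-checked transcriptions of your own documents using character error rate (edit distance divided by reference length). The metric is a common choice, not necessarily the one used in the article, and the `engines` mapping is a hypothetical interface in which each engine is a function from document ID to extracted text.

```python
from typing import Callable, Dict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def evaluate(engines: Dict[str, Callable[[str], str]],
             ground_truth: Dict[str, str]) -> Dict[str, float]:
    """Mean CER per engine over every document that has a ground-truth transcription."""
    scores = {}
    for name, run in engines.items():
        errors = [cer(truth, run(doc_id)) for doc_id, truth in ground_truth.items()]
        scores[name] = sum(errors) / len(errors)
    return scores


if __name__ == "__main__":
    truth = {"receipt_1": "TOTAL 12.50", "memo_1": "See attached"}
    fake_engines = {
        "engine_a": lambda doc_id: truth[doc_id],               # perfect output
        "engine_b": lambda doc_id: truth[doc_id].replace("e", "c"),
    }
    print(evaluate(fake_engines, truth))
```

Breaking the scores out per document type, rather than averaging over everything, is what would reveal the "great on clean text, lost on a rotated receipt" pattern.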
Cody That part I’ll buy. I’ve seen Tesseract do great on clean text and then completely lose its mind on a rotated receipt.
Justy And that’s the point. The market’s flooded with new options (small vision models, VLMs, LlamaParse), but none of them are universal. So the only real answer is ‘it depends.’
Cody Which, for a field that’s been around since the eighties, feels… underwhelming.
Justy That is SUCH an Exploring Next take. We spend an hour nerding out on an OCR bake-off and your summary is ‘this is fine but depressing.’
Cody I mean, it’s accurate.
Justy Anyway, practical upshot: if you’re not testing on your own data, you’re flying blind. And if you’re paying for structured OCR on docs that don’t need it, you’re just throwing money at AWS.
Cody Or at least at the problem.
Justy There it is. Alright, I’m gonna go pretend my expense reports are someone else’s problem.
Cody Good luck with that.