Ep 375 research 10:44 w/ Justy & Cody

Hallucinations Undermine Trust; Metacognition is a Way Forward

Justy and Cody dig into a paper arguing that the real trust problem with language models is not merely being wrong, but being wrong with unwarranted confidence. They unpack the paper’s shift from answer-versus-abstain to ‘faithful uncertainty,’ where a model’s wording should reflect its actual internal uncertainty. Cody breaks down the discrimination-versus-calibration distinction and why that matters for both chatbots and tool-using agents. Justy pushes on what this means in production, where hedging can either build trust or feel slippery if it is not tied to real behavior.

Script: GPT-5.4 Voice: ElevenLabs

Transcript

Justy The part that stuck with me is they’re basically saying the trust hit isn’t just being wrong. It’s being wrong like you’re absolutely sure, which... yeah, that’s the part people remember.

Cody Yeah, and I think that framing is the whole paper. They’re not claiming models suddenly become truthful if they say “maybe” more. They’re saying the sticking point is that we’ve mostly improved factuality by cramming in more facts, not by teaching the model to notice the edge of what it knows.

Justy Which feels very product-real. If the thing answers more often, people call it useful. If it declines too much, everyone complains it got nerfed. So the old choice has been answer and occasionally make stuff up, or play it safe and become kind of annoying.

Cody Right. Their move is to say that trade-off only looks inevitable if you define hallucination as any error. They redefine it more narrowly as a confident error. Then there’s this middle lane where the model can still offer an answer, but with uncertainty that actually matches its internal state.

Justy I had to reread that part because at first it sounded like polished hedging. And I was making coffee on no sleep, so maybe that’s on me. [chuckles] But they’re stricter than that. They want the hedge to be informative per answer, not just legal padding.

Cody Exactly, Justy. They lean hard on the difference between calibration and discrimination. Calibration is the easy-to-say one: if a model says eighty percent confidence a bunch of times, it should be right about eighty percent of those times. Discrimination is tougher. That means the confidence signal actually separates likely-right answers from likely-wrong answers on specific instances.
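
A minimal sketch of the distinction Cody describes. The episode does not name specific metrics, so this is an assumption made for illustration: expected calibration error (ECE) stands in for calibration, AUROC stands in for discrimination, and the function names and toy data are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / total) * abs(mean_conf - accuracy)
    return ece


def auroc(confidences, correct):
    """Chance that a random correct answer outranks a random wrong one (ties count half)."""
    pos = [c for c, y in zip(confidences, correct) if y == 1]
    neg = [c for c, y in zip(confidences, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up): 8 of 10 answers are right.
correct = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
flat = [0.8] * len(correct)                    # always says "80% sure"
sharp = [0.9 if y else 0.4 for y in correct]   # confidence tracks correctness

for name, conf in (("flat", flat), ("sharp", sharp)):
    print(name, "ECE:", round(expected_calibration_error(conf, correct), 3),
          "AUROC:", auroc(conf, correct))
```

On this toy data the flat signal is almost perfectly calibrated yet cannot tell right answers from wrong ones, while the sharp signal separates them completely even though its stated numbers are further off.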

Justy So a model can look decent in aggregate and still be terrible in the moment that matters. That tracks with how these systems feel in practice. You get this smooth tone, and the actual confidence signal a person gets is basically vibes.

Cody Yes. And the paper’s conjecture, which I think is a pretty honest one, is that models may just not have enough discriminative power to perfectly sort truth from error internally. If that’s true, then a pure abstain strategy taxes utility, because you have to suppress a lot of answers that were actually fine just to catch the bad ones.
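
The utility tax can be illustrated with a small calculation. This is a sketch under stated assumptions, not the paper's method: the function name, the confidence scores, and the correctness labels are all made up, and abstention is modeled as a simple confidence threshold.

```python
def abstention_tradeoff(confidences, correct, threshold):
    """Abstain on every answer whose confidence is below `threshold`.

    Returns the share of wrong answers that get suppressed (errors_caught)
    and the share of correct answers suppressed along with them (utility_tax).
    """
    n_correct = sum(correct)
    n_wrong = len(correct) - n_correct
    if n_correct == 0 or n_wrong == 0:
        raise ValueError("need at least one correct and one incorrect answer")
    suppressed = [y for c, y in zip(confidences, correct) if c < threshold]
    correct_suppressed = sum(suppressed)
    wrong_suppressed = len(suppressed) - correct_suppressed
    return {
        "errors_caught": wrong_suppressed / n_wrong,
        "utility_tax": correct_suppressed / n_correct,
    }


# Toy data (made up): confidence is only loosely related to correctness.
confidences = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5]
correct = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]

for threshold in (0.6, 0.7, 0.85):
    print(threshold, abstention_tradeoff(confidences, correct, threshold))
```

In this example, catching every wrong answer requires a threshold of 0.85, which also suppresses four of the seven correct answers.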

Justy That “utility tax” phrase is good. Because teams do feel that. If I ship a support assistant or internal knowledge bot, I can’t have it refusing every slightly fuzzy question. People stop using it. But if it states a bad answer like it’s settled fact, trust falls through the floor.

Cody And they connect that to a bunch of recent weirdness. Truthfulness probes don’t generalize well. Models can hallucinate confidently. Even training them to admit mistakes hasn’t solved it. Then there’s the uncomfortable bit where longer reasoning sometimes increases hallucinations and makes abstention worse, which is a pretty strong hint that current training pressures favor producing something over accurately signaling uncertainty.

Justy That part was kind of brutal. We all got excited about more reasoning tokens, and this paper is like, cool, sometimes that just gives the model more runway to sound convincing. [laughs]

Cody [chuckles] Yeah. More words are not automatically more self-awareness. One detail I liked is their behavioral semantics idea. If the model says it’s confident, that should mean it would likely give the same answer again. If it says it’s uncertain, it should be more likely to vary or conflict on another pass. That’s concrete. You can measure that.
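
One way to measure that behavioral reading of confidence is to sample the same prompt several times and check how often the answers agree. The sketch below is only an illustration: the sampled answers are hard-coded rather than drawn from a model, exact match after light normalization is a crude stand-in for semantic equivalence, and the helper names and the 0.8 cutoff are assumptions.

```python
from collections import Counter


def normalize(answer):
    """Lowercase, trim, drop a trailing period, and collapse whitespace."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def agreement_rate(answers):
    """Return the most common normalized answer and the fraction of samples matching it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(a) for a in answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


def hedge_matches_behavior(stated_confident, rate, cutoff=0.8):
    """True when the stated stance agrees with how stable the samples were.

    `cutoff` is an arbitrary illustrative value, not a recommended setting.
    """
    return stated_confident == (rate >= cutoff)


# Five hypothetical samples for one prompt.
samples = ["Paris", "paris.", "Paris ", "Lyon", "Paris"]
answer, rate = agreement_rate(samples)
print(answer, rate)
print("disagreement rate:", 1 - rate)
```

The disagreement rate printed at the end is the crude uncertainty signal: the lower the agreement across samples, the less a confident tone is warranted.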

Justy That’s the bridge to shipping, I think. Because then uncertainty is not just tone design. It becomes a contract. In a product, I could imagine using that signal to decide whether to show the answer inline, attach a source requirement, or kick off retrieval before the user even sees a polished response.

Cody And that’s where their metacognition angle gets more interesting than just chat UX. For agents, uncertainty is the control layer. It tells the system when to search, when to stop searching, and how much to trust retrieved evidence versus its own parametric memory. Without that, agents either spam tools or skip tools when they really shouldn’t.
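
The control-layer idea can be sketched as a small routing function that maps an uncertainty signal to one of the actions Justy lists. Everything here is hypothetical: the action names, the use of a 0-to-1 agreement score as the signal, and the two thresholds are illustrative choices, not values from the paper.

```python
from enum import Enum


class Action(Enum):
    ANSWER_INLINE = "answer_inline"              # show the answer as-is
    ANSWER_WITH_SOURCES = "answer_with_sources"  # answer, but require citations
    RETRIEVE_FIRST = "retrieve_first"            # search before drafting a reply


def route(agreement, high=0.8, low=0.5):
    """Pick an action from an agreement score in [0, 1].

    `high` and `low` are placeholder thresholds that would need tuning
    against real traffic.
    """
    if not 0.0 <= agreement <= 1.0:
        raise ValueError("agreement must be between 0 and 1")
    if agreement >= high:
        return Action.ANSWER_INLINE
    if agreement >= low:
        return Action.ANSWER_WITH_SOURCES
    return Action.RETRIEVE_FIRST


for score in (0.9, 0.6, 0.2):
    print(score, route(score).value)
```

A fixed rule like this is closer to the hand-built thresholds teams already use than to a model that tracks its own uncertainty; the point is only to show where such a signal would plug in.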

Justy I buy that. I also think this is where a lot of product teams are currently faking it a little. They have retrieval thresholds and hand-built rules, but not a model that really knows when it’s out over its skis. So it’s operationally useful, but it’s not exactly self-aware.

Cody I could be wrong, but that’s my main read too. The paper is more conceptual than architectural. It points to metacognitive prompting, fine-tuning, and internal-state approaches from recent work, but it’s not handing over one blessed stack. If I were implementing it, I’d probably combine consistency checks across multiple sampled answers, confidence elicitation, and a retrieval policy trained on disagreement patterns.

Justy So not research-only, but also not a drop-in checkbox. More like a design principle with some measurable proxies. And for teams building customer-facing stuff, the hard part is making uncertainty feel honest instead of slippery. If every answer is “possibly, maybe, could be,” people hate it.

Cody Totally. The paper says that too, in spirit. Uniform hedging is useless. A model that sprinkles uncertainty everywhere might be calibrated in some aggregate sense and still tell you nothing on the instance you care about. The uncertainty has to vary with the model’s actual internal instability, otherwise it’s just style.

Justy One thing I appreciated is they don’t pretend this solves everything. If the model is confidently wrong and genuinely believes the wrong thing, that’s still an honest mistake in their framing. You fix that the old-fashioned way by expanding knowledge, better data, better training, maybe better retrieval.

Cody Yeah, it’s complementary, not magical. Build Next-wise, I’d do three things. One, read the papers they cite around metacognitive prompting and teaching models uncertainty, especially MetaFaith and the fine-tuning work they mention. Two, for a weekend project, take a small factoid QA set like Natural Questions or TriviaQA, sample five answers per prompt from an open model, and use disagreement rate as a crude uncertainty signal. Three, wire that into a simple RAG flow in somet

Justy I like that because a solo builder can actually do it without a giant eval team. Even just logging answer variance, hedge language, and whether retrieval changed the final answer would teach you a lot. Also, maybe don’t name the dashboard “metacognition cockpit,” Cody. That’s how episode 375 turns into a cry for help. [laughs]

Cody Rude. Accurate, but rude. [chuckles]

Justy Anyway, that’s the useful takeaway for me, Cody. Don’t just make the model know more. Teach it to notice when it doesn’t know, and to say so. And maybe keep the cockpit off the roadmap.