Exploring Next

Exploring Next — Ep 179 w/ Justy & Cody — Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models

A deep dive into fixing deceptive alignment in reward models: why reaching the right answer is not enough when the reasoning behind it is wrong, and how a hybrid training approach that combines outcome accuracy with rationale consistency achieves state-of-the-art performance while solving a critical RLHF generalization problem.

Open source article
