Exploring Next
Exploring Next — Ep 179 w/ Justy & Cody — Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models
Deep dive into fixing deceptive alignment in reward models - why getting the right answer isn't enough if the reasoning is wrong, and how a hybrid training approach combining outcome accuracy with rationale consistency achieves state-of-the-art performance while solving a critical RLHF generalization problem.