Exploring Next

Exploring Next — Ep 94 w/ Justy & Cody — How confessions can keep language models honest

In this episode, we look at a research approach that trains language models to admit when they have not followed instructions correctly. The method, called 'confessions', is intended to make AI systems more transparent. We discuss its implications for trust and safety, potential real-world use cases, and what it could mean for the future of AI interaction.

