Exploring Next

Exploring Next — Ep 207 w/ Justy & Cody — Exposing biases, moods, personalities, and abstract concepts hidden in large language models

MIT researchers developed a method for identifying and manipulating hidden concepts, such as biases, personalities, and moods, in large language models using recursive feature machines (RFMs). The approach zeroes in on the internal representation of a specific concept and can then strengthen or weaken that concept in the model's generated responses. It offers a more targeted alternative to broad unsupervised learning approaches, with the goal of improving LLM safety and performance.
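To make "strengthen or weaken a concept" concrete, here is a minimal toy sketch of the general idea of steering along a concept direction. It is not the researchers' RFM method: it swaps in a much simpler difference-of-means direction, and the "hidden states" are random vectors invented for illustration rather than activations from a real model. The names `concept_direction`, `steer`, `projection`, and `alpha` are all hypothetical.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_direction(with_concept, without_concept):
    """Unit vector from the 'without' mean to the 'with' mean.

    Assumption: a plain difference of means stands in for the
    learned concept representation described in the episode.
    """
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def steer(hidden, direction, alpha):
    """Shift a hidden vector along the concept direction by alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden, direction):
    """How strongly the hidden vector expresses the concept direction."""
    return sum(h * d for h, d in zip(hidden, direction))


# Toy stand-in for hidden states: the "concept" shifts coordinate 0.
rng = random.Random(0)
dim = 8


def sample(shift):
    return [rng.gauss(0, 1) + (shift if i == 0 else 0.0) for i in range(dim)]


with_concept = [sample(3.0) for _ in range(200)]
without_concept = [sample(0.0) for _ in range(200)]

direction = concept_direction(with_concept, without_concept)
hidden = sample(0.0)
stronger = steer(hidden, direction, alpha=2.0)   # amplify the concept
weaker = steer(hidden, direction, alpha=-2.0)    # suppress the concept

print(projection(weaker, direction),
      projection(hidden, direction),
      projection(stronger, direction))
```

Because the direction is a unit vector, a positive `alpha` raises the projection onto the concept direction by exactly `alpha` and a negative one lowers it by the same amount. In a real model the shifted vector would be fed back into the forward pass, which this sketch does not attempt.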

