Exploring Next

Exploring Next — Ep 390 w/ Justy & Cody — SocialReasoning-Bench shows the limits of today’s AI agents

Justy and Cody dig into SocialReasoning-Bench, a new benchmark for whether AI agents actually advocate for a user instead of just finishing the task. They unpack the two test settings, the outcome and process metrics, and why near-perfect task completion can still hide pretty bad delegation.
