SocialReasoning-Bench shows the limits of today’s AI agents
Justy and Cody dig into SocialReasoning-Bench, a new benchmark for whether AI agents actually advocate for a user instead of just finishing the task. They unpack the two test settings, the outcome and process metrics, and why near-perfect task completion can still hide pretty bad delegation.
Script: GPT-5.4 Voice: OpenAI TTS
Transcript
Justy The weird part is an agent can finish the job and still absolutely not have your back.
Cody Yeah. And that's why this one matters right now, Justy. People are starting to let agents touch calendars, email, even purchases, and the failure mode isn't always a crash. It's the agent politely agreeing to a bad deal on your behalf.
Justy I think that's the product reality. Nobody cares that the meeting got booked if it grabbed the worst slot of the day, or that the order went through if it paid more than it needed to. The user story is delegation, not mere completion.
Cody Right.
Cody So Microsoft Research made SocialReasoning-Bench to test that exact gap. Two settings. One is calendar coordination, where an assistant has a value score over time slots from zero to one, and the other side is a requester with basically the inverse preferences. The other is marketplace negotiation, where a buyer agent has a private max price and the seller has its own floor.
Justy And they force some tension into it, right? This isn't a friendly sandbox where both sides already want the same thing.
Cody Exactly.
Cody Yeah, they build in a real negotiation shape. They make sure there's a zone of possible agreement, or ZOPA, at least three workable calendar slots with different value to the user, and an opening request that conflicts with the user's calendar. In the marketplace case, the seller opens above the buyer's reservation price, so the buyer has to push back or walk away from value immediately.
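The scenario constraints Cody lists can be sketched as a small validity check. This is a minimal illustration, not the benchmark's own code: the names `MarketScenario`, `zopa`, `has_tension`, and `calendar_has_tension` are hypothetical, and the rules are read directly off the description above (a zone of possible agreement exists, the seller opens above the buyer's limit, at least three workable slots with differing value, and an opening request that conflicts with the calendar).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarketScenario:
    buyer_max: float       # buyer's private maximum price (hypothetical field)
    seller_floor: float    # seller's private minimum price (hypothetical field)
    seller_opening: float  # seller's first ask


def zopa(s: MarketScenario):
    """Return the (low, high) price range both sides could accept, or None."""
    if s.seller_floor <= s.buyer_max:
        return (s.seller_floor, s.buyer_max)
    return None


def has_tension(s: MarketScenario) -> bool:
    """True when a deal is possible but the opening ask exceeds the buyer's limit."""
    return zopa(s) is not None and s.seller_opening > s.buyer_max


def calendar_has_tension(slot_values, free_slots, requested_slot, min_options=3):
    """slot_values maps slot -> user value in [0, 1]; free_slots are workable slots."""
    workable = [slot for slot in free_slots if slot in slot_values]
    distinct_values = {slot_values[slot] for slot in workable}
    return (
        len(workable) >= min_options      # enough real alternatives
        and len(distinct_values) > 1      # the alternatives differ in value
        and requested_slot not in free_slots  # opening request conflicts
    )


scenario = MarketScenario(buyer_max=100.0, seller_floor=70.0, seller_opening=120.0)
print(zopa(scenario), has_tension(scenario))
```

A scenario that fails these checks would give the agent nothing to negotiate over, so any score on it would say little about advocacy.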
Justy I had too much coffee and still needed a second pass on the acronym, but ZOPA is a good one. Sounds like a robot spa. Anyway, that's the key product thing. If there's room for a better outcome and the agent just folds, users will feel that fast.
Cody Totally. And the clever part is they don't only score the final answer. They use Outcome Optimality, which is how much of the available value the agent got for its user on a zero-to-one scale, and Due Diligence, which checks whether the steps looked like a competent advocate's process.
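One way to picture a zero-to-one outcome score is as the share of the negotiable range the agent captured for its user. The episode does not give the paper's exact formula, so the normalization below is an assumption, as are the function names and the choice to score a failed deal as zero.

```python
def buyer_outcome_optimality(price, buyer_max, seller_floor):
    """Share of the price range captured by the buyer, clipped to [0, 1].

    Assumption: no deal (price is None) scores 0.0.
    """
    if price is None:
        return 0.0
    span = buyer_max - seller_floor
    if span <= 0:
        return 0.0  # no zone of possible agreement
    return max(0.0, min(1.0, (buyer_max - price) / span))


def calendar_outcome_optimality(chosen_slot, slot_values, feasible):
    """Value of the chosen slot, rescaled between the worst and best feasible slot."""
    if chosen_slot is None or chosen_slot not in feasible:
        return 0.0
    values = [slot_values[slot] for slot in feasible]
    best, worst = max(values), min(values)
    if best == worst:
        return 1.0  # every feasible slot is equally good
    return (slot_values[chosen_slot] - worst) / (best - worst)


# Paying the full reservation price closes the deal but captures nothing.
print(buyer_outcome_optimality(100.0, buyer_max=100.0, seller_floor=70.0))
print(buyer_outcome_optimality(85.0, buyer_max=100.0, seller_floor=70.0))
```

Under this reading, a booked meeting in the worst feasible slot and a purchase at the buyer's maximum both count as completed tasks with an outcome score of zero.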
Justy Mm-hm.
Cody That second metric matters because luck can fake competence. If the other side opens with something great, an agent can accept instantly and still look smart on outcome. Due Diligence compares each move against a deterministic reasonable-agent policy. Gather context. Start from a favorable position. Concede only after better options are exhausted.
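The three process rules Cody names can be sketched as checks over a log of the buyer agent's moves. This is a simplified stand-in, not the benchmark's actual reasonable-agent policy: the move format, the action names (`inspect`, `offer`, `accept`), and the rule that "exhausting better options" means making at least one counteroffer before accepting are all assumptions made for illustration.

```python
def due_diligence(moves, seller_opening):
    """Score a buyer's process from an ordered list of (action, price_or_None).

    Returns (score in [0, 1], dict of individual check results).
    """
    actions = [action for action, _ in moves]
    offers = [price for action, price in moves if action == "offer"]

    # Index of the first move that commits to a price (offer or accept).
    first_commit = next(
        (i for i, a in enumerate(actions) if a in ("offer", "accept")),
        len(actions),
    )
    # Moves that came before the accept, or all moves if nothing was accepted.
    before_accept = actions[: actions.index("accept")] if "accept" in actions else actions

    checks = {
        # Gather context: inspected information before committing to a price.
        "gathered_context": "inspect" in actions[:first_commit],
        # Start favorable: the first offer undercuts the seller's opening ask.
        "favorable_opening": bool(offers) and offers[0] < seller_opening,
        # Concede late: never accepted without countering first.
        "countered_before_accepting": "accept" not in actions or "offer" in before_accept,
    }
    return sum(checks.values()) / len(checks), checks


lucky = [("accept", 72.0)]  # great price, zero process
careful = [("inspect", None), ("offer", 75.0), ("offer", 85.0), ("accept", 90.0)]
print(due_diligence(lucky, seller_opening=72.0))
print(due_diligence(careful, seller_opening=120.0))
```

The `lucky` log is the case Cody describes: an instant accept can land a good price and still fail every process check.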
Justy That's the part I like. A lot of agent evals stop at, did it happen. But if I'm shipping this into a work tool, adoption lives or dies on whether people trust the behavior. Careless success does not feel safe.
Cody Sure.
Justy Also, tiny life note, your kitchen somehow has no normal mugs. Everything is either enormous or looks like lab equipment. Anyway, this bench is basically asking whether the model behaves like a decent delegate.
Cody I contain multitudes. And bad dishware choices. But yes, that's their framing too, more or less. They borrow the principal-agent idea from law and economics, where somebody acts on your behalf and owes care, loyalty, confidentiality.
Justy So who uses this first? My guess is anyone building executive assistant stuff, sales ops helpers, procurement copilots, maybe even internal tooling where one agent talks to another agent and nobody wants the softest negotiator in the room.
Cody Yeah, and the adoption barrier is ugly because the demo can still look amazing. The paper says frontier models complete most tasks almost perfectly, but they leave value on the table a lot. In earlier related work they mention agents accepting the first proposal up to ninety-three percent of the time, which is just brutal.
Justy No way.
Cody And in this setup they tested GPT-4.1 with chain-of-thought, GPT-5.4 at high reasoning effort, Claude Sonnet 4.6, and Gemini 3 Flash at high thinking levels. The counterparty is fixed as Gemini 3 Flash at medium effort, so differences are supposed to come from the tested model, not a changing opponent.
Justy That control is nice. Though I do wonder if one fixed opponent makes the benchmark a little too tidy. Real products get weird counterparties, passive ones, pushy ones, maybe someone fishing for private details.
Cody I think that's a fair concern. They do say some requesters are acting in good faith and others try to extract private calendar info or nudge the assistant toward bad slots. But yeah, a single benchmark opponent family can only cover so much. I wouldn't treat this as the whole map.
Justy Right, right.
Cody Still, the result is pretty damning. Defensive prompting helps, meaning explicit instructions to consult sources and advocate for the user's best outcome, but it doesn't close the gap. So stronger wording alone doesn't turn a compliant assistant into a sharp representative.
Justy Which is such a classic product trap, Cody. Teams think the fix is a sterner system prompt, maybe one more policy paragraph, and then they call it delegated autonomy. But the market asks a harsher question. Did it protect my time, my money, my preferences?
Cody And maybe my privacy. Because social reasoning isn't just bargaining. It's deciding what to reveal, what to hold back, and when the other side's request is actually trying to learn something it shouldn't. That's much closer to agent architecture than prompt craft. Memory, policy checks, tool inspection, explicit user utility models.
Justy Okay, Build Next. If somebody wants to play with this over a weekend, I'd do the solo-builder version first. Make a tiny negotiation harness with two agents and a hidden value function, then log whether your buyer or scheduler explores options before accepting. Even a spreadsheet and a script get you somewhere.
Cody Yeah. Use a framework like LangGraph or AutoGen if you want the orchestration, then run repeated simulations with one model as the assistant and one as the counterparty. Add a simple score for captured value and a process checker for things like inspecting calendar slots, making a counteroffer, or refusing to reveal private constraints.
Justy Hm.
Cody And if you want a concrete experiment, set a ten-round cap like they did. Try a basic prompt versus a defensive one, same hidden reservation prices every run. Then compare close rate to actual value captured. I think a lot of people will discover their agent looks competent right up until you score the deal quality.
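The experiment Cody outlines can be prototyped before any model is wired in. The sketch below keeps the ten-round cap and the fixed hidden reservation prices, but everything else is an assumption: the two scripted buyers are hypothetical stand-ins for a basic-prompt and a defensive-prompt agent, the seller is a fixed concession script rather than an LLM, and the value-captured formula is one plausible normalization rather than the paper's metric.

```python
MAX_ROUNDS = 10  # round cap mentioned in the episode


def seller_ask(round_idx, opening, floor, step=10.0):
    """Scripted seller: lowers its ask by `step` each round, never below its floor."""
    return max(floor, opening - step * round_idx)


def compliant_buyer(ask, buyer_max, round_idx):
    """Stand-in for a basic prompt: accepts the first ask it can afford."""
    return ask <= buyer_max


def patient_buyer(ask, buyer_max, round_idx):
    """Stand-in for a defensive prompt: holds out for 80% of its max,
    and only settles for anything affordable on the final round."""
    target = buyer_max * 4 / 5
    last_round = round_idx == MAX_ROUNDS - 1
    return ask <= target or (last_round and ask <= buyer_max)


def run_episode(buyer, buyer_max, seller_floor, seller_opening):
    """Return the agreed price, or None if no deal closes within the cap."""
    for round_idx in range(MAX_ROUNDS):
        ask = seller_ask(round_idx, seller_opening, seller_floor)
        if buyer(ask, buyer_max, round_idx):
            return ask
    return None


def summarize(buyer, scenarios):
    """Return (close rate, mean value captured); a failed deal counts as 0 value."""
    captured = []
    closed = 0
    for buyer_max, seller_floor, seller_opening in scenarios:
        price = run_episode(buyer, buyer_max, seller_floor, seller_opening)
        if price is None:
            captured.append(0.0)
            continue
        closed += 1
        captured.append((buyer_max - price) / (buyer_max - seller_floor))
    return closed / len(scenarios), sum(captured) / len(scenarios)


# Same hidden reservation prices for every run: (buyer_max, seller_floor, seller_opening).
SCENARIOS = [(100.0, 70.0, 120.0), (200.0, 150.0, 230.0), (50.0, 40.0, 65.0)]

for name, buyer in [("compliant", compliant_buyer), ("patient", patient_buyer)]:
    close_rate, value = summarize(buyer, SCENARIOS)
    print(f"{name}: close rate {close_rate:.2f}, value captured {value:.2f}")
```

In this toy setup both buyers close every deal, so close rate alone cannot tell them apart; only the value-captured column separates them. Swapping the scripted functions for real model calls keeps the scoring loop unchanged.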
Justy Yeah. That's probably the cleanest lesson from this whole Exploring Next detour into robot negotiation. The task getting done is the easy screenshot. The hard part is whether the agent was actually on your side, and your weird mugs are still a product risk, Cody.