Ep 402 research 5:52 w/ Justy & Cody

Many Shot CoT ICL: Making In Context Learning Truly Learn

Justy and Cody dig into a paper arguing that long-context chain-of-thought prompting behaves less like stuffing a prompt with relevant examples and more like teaching the model during inference. They unpack why many-shot tricks from classification break on reasoning, why semantic retrieval stops helping, and how the paper’s Curvilinear Demonstration Selection tries to order examples like a smooth mini-curriculum.

Script: GPT-5.4 Voice: Inworld TTS 1.5 Max

Transcript

Justy The weird part, Cody, is they’re basically saying bigger prompts don’t automatically make reasoning smarter.

Cody Yeah. The paper is pushing on a blind spot. A lot of many-shot ICL results came from classification-ish tasks, then people kind of carried those rules over to chain-of-thought reasoning as if it was the same game.

Justy Right, and that matters now because long context is cheap enough that teams will absolutely try to brute-force prompts before they fine-tune anything. I had to reheat my coffee twice reading this, which is usually a sign the paper is either onto something or messing with me. Anyway, this one is onto something.

Cody Mm-hm.

Cody Their core claim is that many-shot CoT-ICL should be seen as test-time learning inside the prompt, not just pattern matching at larger scale. More examples can be unstable, and nearest-neighbor style retrieval can stop helping on reasoning tasks.

Justy That’s a problem for anyone hoping long-context prompting could stand in for training. On non-reasoning tasks, dozens or hundreds of prompt examples got close to fine-tuning, but for math, geometry, and multi-step logic, the behavior was still mushy.

Cody Exactly.

Cody They split experiments across model types and task types instead of mushing everything together. On reasoning tasks, piling on more CoT demos mostly helps reasoning-oriented models, while standard instruction models can get unstable as the shot count rises.

Justy Which is honestly a useful reality check for product teams. If the base model doesn't already have decent reasoning habits, stuffing 64 worked examples into the prompt may just buy you a larger bill and more variance.

Cody Right, right.

Cody And the retrieval result is maybe the cleanest intuition pump in the paper. For normal tasks, semantic similarity still helps. For reasoning, similar-looking questions can require very different procedures, so semantic closeness is a bad proxy for whether one chain of thought will actually teach the next one.

Justy So when they say procedural compatibility, I read that as: does this example set the model up to use the right moves next, not just recognize the same nouns. That's a much more annoying thing to engineer.

Cody Yeah, annoying but more real. Their reframing is almost pedagogical. One principle is ease of understanding, meaning the demonstrations should match what that specific model can already parse and imitate. They even note this helps explain why self-generated demonstrations can work better for weaker models.

Justy I could be wrong, but that felt like one of the strongest practical notes in the whole paper. People love gold-standard prompt examples. In production, the best demo set might be the one your model can actually digest, even if it’s less elegant.

Cody Yeah.

Cody The second principle is smooth progression. They look at the embedding trajectory across demonstrations and try to minimize total curvature, basically avoiding a sequence that jerks the model from one reasoning mode to some distant one and back again.
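
One way to make "total curvature" concrete is to sum the turning angles along the path that the demonstration embeddings trace in order. The sketch below assumes that reading; the function name `total_curvature` and the angle-sum definition are illustrative choices, not necessarily the paper's exact formulation.

```python
import numpy as np

def total_curvature(embeddings) -> float:
    """Sum of turning angles (radians) along the path through the embeddings, in order.

    Assumption: curvature is measured as the angle between consecutive
    displacement vectors. The paper may define it differently.
    """
    points = np.asarray(embeddings, dtype=float)
    if len(points) < 3:
        return 0.0  # fewer than two segments, so there is nothing to turn
    steps = np.diff(points, axis=0)           # displacement between consecutive demos
    norms = np.linalg.norm(steps, axis=1)
    keep = norms > 1e-12                      # ignore zero-length steps (duplicate embeddings)
    steps = steps[keep] / norms[keep][:, None]
    if len(steps) < 2:
        return 0.0
    # Cosine of the angle between each pair of consecutive unit steps.
    cosines = np.clip(np.sum(steps[:-1] * steps[1:], axis=1), -1.0, 1.0)
    return float(np.sum(np.arccos(cosines)))

# Toy 2-D "embeddings": the same four points in two different orders.
smooth = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
jerky = np.array([[0.0, 0.0], [3.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
print(total_curvature(smooth), total_curvature(jerky))
```

Under this definition a lower value means a steadier sequence, so an ordering that doubles back on itself scores worse than one that moves in a consistent direction.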

Justy That’s the Curvilinear Demonstration Selection bit, right? The name is a little grad-school, but the idea is clean. Don’t just retrieve a pile of relevant examples. Arrange them so the prompt teaches in a steady arc.

Cody Yeah, and they report up to a 5.42-point gain on geometry with 64 demonstrations from that ordering method. Also, as you add more CoT demos, variance from ordering grows rather than washing out.

Justy That part is big for shipping. If order sensitivity gets worse with more reasoning examples, then prompt construction becomes an actual systems problem. You’d want cached bundles, versioned orderings, evals per task family, maybe even per model release.
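
If ordering really is part of the artifact, a versioned bundle could be as simple as hashing the ordered demonstration IDs together with the task family and model release. This is a hypothetical sketch: `PromptBundle` and its fields are made up for illustration and do not come from the paper.

```python
import hashlib
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptBundle:
    """Hypothetical record of one ordered demonstration set."""
    task_family: str
    model_release: str
    demo_ids: tuple  # order matters: it is part of the version

    def version(self) -> str:
        # Serialize deterministically so the same bundle always hashes the same way.
        payload = json.dumps(
            {
                "task": self.task_family,
                "model": self.model_release,
                "demos": list(self.demo_ids),
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

bundle_a = PromptBundle("geometry", "model-v1", ("d1", "d2", "d3"))
bundle_b = PromptBundle("geometry", "model-v1", ("d3", "d1", "d2"))
print(bundle_a.version(), bundle_b.version())
```

Because the demo list is hashed in order, reshuffling the same examples yields a new version, which makes it possible to tie eval results to one specific ordering.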

Cody Okay okay.

Cody My mild pushback is methodological. I’d want to know how stable the curvature metric is across embedding models, and whether simpler heuristics, like difficulty sorting plus diversity constraints, recover most of the gain.
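
The simpler baseline Cody mentions could look roughly like this: walk the pool from easiest to hardest and skip any demonstration that sits too close to one already chosen. The difficulty scores, the `min_distance` threshold, and the function name are all assumptions made for the sketch.

```python
import numpy as np

def easy_to_hard_diverse(embeddings, difficulty, k, min_distance=0.1):
    """Pick up to k demo indices in ascending difficulty, skipping near-duplicates.

    Assumptions: `difficulty` is any per-demo score where lower means easier,
    and embeddings are non-zero vectors. Diversity is enforced with a minimum
    cosine distance to every demo already chosen.
    """
    vectors = np.asarray(embeddings, dtype=float)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    chosen = []
    for idx in np.argsort(difficulty, kind="stable"):
        if len(chosen) == k:
            break
        far_enough = all(
            1.0 - float(unit[idx] @ unit[j]) >= min_distance for j in chosen
        )
        if far_enough:
            chosen.append(int(idx))
    return chosen

pool = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [-1.0, 0.0]]
scores = [1, 2, 3, 4]
print(easy_to_hard_diverse(pool, scores, k=3))
```

Running this next to a curvature-based ordering on the same pool would be one way to check how much of the gain a plain heuristic recovers.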

Justy Yeah, that feels fair. Also, this doesn't read like a universal prompt trick for every app. It feels more shippable in narrow reasoning workflows where wrong answers are expensive and the task distribution is stable enough to curate demonstration pools.

Cody Mm-hm.

Cody I wouldn’t start with it for open-ended chat. I would start where there’s repeated structure: geometry tutoring, finance spreadsheets, code transformation tasks. Anywhere you can maintain a library of solved examples and measure whether a prompt curriculum beats plain retrieval.

Justy Also, selfishly, I like that this gives product people a middle path between raw prompting and full training. Not free, still messy, but you can imagine a service that builds per-task teaching prompts instead of retrievers. Like, congratulations, your RAG stack now needs a syllabus.

Cody That’s basically it. Build Next-wise, I’d do two things. One, replicate a tiny version with a long-context model and a reasoning set like GSM8K or a geometry subset, then compare random order, semantic retrieval order, and a smooth-order heuristic. Two, use an embedding model in something like sentence-transformers to compute path curvature over candidate demos.
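
A minimal sketch of the three orderings in that comparison, working on precomputed embeddings. The smooth variant here is a greedy nearest-neighbor chain built backwards from the query; it is a stand-in heuristic, not the paper's Curvilinear Demonstration Selection, and all function names are illustrative.

```python
import numpy as np

# In practice the embeddings might come from sentence-transformers, e.g.
#   SentenceTransformer("all-MiniLM-L6-v2").encode(list_of_texts)
# That call is not executed here; the model name is only an example.

def _unit(x):
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def random_order(n, seed=0):
    """Baseline: a seeded random permutation of demo indices."""
    rng = np.random.default_rng(seed)
    return [int(i) for i in rng.permutation(n)]

def semantic_order(demo_emb, query_emb):
    """Least similar first, so the most similar demo sits right before the query."""
    sims = _unit(demo_emb) @ _unit(query_emb)
    return [int(i) for i in np.argsort(sims, kind="stable")]

def smooth_order(demo_emb, query_emb):
    """Greedy chain: start at the query, hop to the nearest unused demo, then reverse."""
    demos = _unit(demo_emb)
    current = _unit(query_emb)
    remaining = list(range(len(demos)))
    backwards = []
    while remaining:
        dists = [np.linalg.norm(demos[i] - current) for i in remaining]
        nxt = remaining[int(np.argmin(dists))]
        backwards.append(nxt)
        remaining.remove(nxt)
        current = demos[nxt]
    return backwards[::-1]  # the prompt ends with the demo closest to the query

demo_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_emb = [1.0, 0.1]
for name, order in [
    ("random", random_order(len(demo_emb))),
    ("semantic", semantic_order(demo_emb, query_emb)),
    ("smooth", smooth_order(demo_emb, query_emb)),
]:
    print(name, order)
```

Each ordering would then be rendered into a prompt and scored on the same held-out questions, ideally over several seeds so ordering variance shows up in the results.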

Justy I’d add one more production-ish version. Store solved examples with metadata like task type, required operations, answer format, and model family that handled them well. Then your prompt builder can choose examples the model actually understands, not just the ones that look closest in vector space. That seems very buildable.
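
The metadata store Justy describes could start as a plain record type plus a filter. Everything named below (`SolvedExample`, `candidate_demos`, the field names, the sample data) is hypothetical and only mirrors the fields mentioned in the conversation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SolvedExample:
    """Hypothetical record for one solved demonstration."""
    example_id: str
    task_type: str
    required_operations: frozenset
    answer_format: str
    handled_well_by: frozenset  # model families that solved it reliably

def candidate_demos(pool, task_type, answer_format, model_family, needed_operations):
    """Filter by metadata first, then rank by overlap with the needed operations."""
    needed = frozenset(needed_operations)
    matches = [
        ex
        for ex in pool
        if ex.task_type == task_type
        and ex.answer_format == answer_format
        and model_family in ex.handled_well_by
        and ex.required_operations & needed  # shares at least one operation
    ]
    # sorted() is stable, so ties keep their original pool order.
    return sorted(
        matches,
        key=lambda ex: len(ex.required_operations & needed),
        reverse=True,
    )

pool = [
    SolvedExample("g1", "geometry", frozenset({"area", "ratio"}), "number", frozenset({"family-a"})),
    SolvedExample("g2", "geometry", frozenset({"area"}), "number", frozenset({"family-a", "family-b"})),
    SolvedExample("g3", "geometry", frozenset({"angle"}), "number", frozenset({"family-a"})),
    SolvedExample("a1", "algebra", frozenset({"ratio"}), "number", frozenset({"family-a"})),
]
picked = candidate_demos(pool, "geometry", "number", "family-a", {"area", "ratio"})
print([ex.example_id for ex in picked])
```

A vector-similarity or ordering step can still run afterwards, but only over candidates that this model family has already handled well.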

Cody Sure.

Justy Anyway, episode 402 and we’re apparently grading prompts like homework now. But yeah, Cody, this paper made long context feel less like bigger memory and more like temporary teaching. That’s a useful shift.