Ep 477 research 7:35 w/ Justy & Cody

A $1,500 foundation model that rivals larger LLMs

Justy and Cody unpack Sapient's claim that HRM-Text, a one-billion-parameter foundation model trained from scratch for about fifteen hundred dollars, can compete with larger open models by changing the architecture and training objective.

Script: GPT-5.5 Voice: ElevenLabs v3

Transcript

Justy Cody, this one is catnip for me: a foundation model from scratch for about fifteen hundred bucks, allegedly punching near bigger open models.

Cody Yeah, and my first reaction was deeply predictable. I saw the price tag and immediately started looking for the trick in the receipt.

Justy Of course you did. The Exploring Next expense police have arrived, episode four seventy-seven, checking whether the GPU invoice has vibes hidden in it.

Justy The central claim, I think, is not just cheap training. It's that Sapient is arguing the normal LLM recipe wastes a ton of compute learning from raw text, when many enterprise users mainly need a model that follows instructions and reasons over specific tasks.

Cody That is the interesting part. They built HRM-Text, a one-billion-parameter model, using a Hierarchical Reasoning Model instead of a standard Transformer. The architecture splits work between a slow H module that holds semantic context and a fast L module that does local refinement.

Cody In the article's description, processing runs through two high-level cycles, each with three fast updates before one slow update. So the pitch is: don't just predict the next token forever. Spend computation on a looping reasoning process that's more sample-efficient.
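A toy sketch of the loop schedule Cody describes, assuming nothing about Sapient's actual code: the function name, the tanh updates, and the mixing weights are all made up for illustration. The only part taken from the episode is the ordering, two high-level cycles with three fast L updates before each slow H update.

```python
import math

def hrm_style_forward(x, n_cycles=2, l_steps=3):
    """Toy nested loop: l_steps fast (L) updates, then one slow (H) update, per cycle.

    The update rules and weights are placeholders, not the published model.
    """
    h = [0.0] * len(x)  # slow state, standing in for semantic context
    l = [0.0] * len(x)  # fast state, standing in for local refinement
    trace = []
    for _ in range(n_cycles):
        for _ in range(l_steps):
            # Fast update: sees the input, its own state, and the frozen slow state.
            l = [math.tanh(0.5 * li + 0.3 * hi + xi) for li, hi, xi in zip(l, h, x)]
            trace.append("L")
        # Slow update: folds the refined fast state into the slow state.
        h = [math.tanh(0.5 * hi + 0.5 * li) for hi, li in zip(h, l)]
        trace.append("H")
    return h, trace

final_h, trace = hrm_style_forward([0.1, -0.2, 0.4])
print("".join(trace))  # LLLHLLLH
```

The printed trace is the useful part: the same weights get reused eight times per input, which is where the extra computation per token comes from.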

Justy And they trained only on instruction-response pairs, which is such a product-manager sentence that I am embarrassed to like it. But I do. Most workplace use is not "please continue this random internet paragraph." It's "answer my question, check this rule, solve this constrained thing."

Justy That makes the enterprise angle more plausible to me. A bank, insurer, or logistics company may not need a giant general model that memorized everything. They might want a compact reasoning core next to retrieval, permissions, and their own knowledge stores.

Cody Technically, the paper's supporting details are pretty concrete. The model used forty billion curated instruction-response tokens across general instructions, math, symbolic logic, textbook exercises, and rewritten knowledge. They also stripped out explicit thinking tokens, trying to force the architecture to carry the reasoning rather than copying a visible chain.

Cody The hard training problem is recurrence. Loops can blow up or fade out numerically, especially on language. So they added MagicNorm to stabilize internal signals, plus a warm-up schedule that starts with shorter loops and later increases the reasoning depth.
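One way to picture the warm-up idea is a depth schedule that starts shallow and ramps up. This is a minimal sketch under assumptions: the episode does not say how the ramp is shaped, so the linear staircase, the function name `loop_depth`, and the step counts below are hypothetical. MagicNorm itself is not sketched because its details are not given.

```python
def loop_depth(step, warmup_steps, start_depth, final_depth):
    """Ramp an integer loop depth from start_depth to final_depth over warmup_steps.

    Hypothetical linear staircase; the real schedule may differ.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return final_depth
    step = max(step, 0)
    # Integer arithmetic keeps the staircase boundaries exact.
    return start_depth + (step * (final_depth - start_depth)) // warmup_steps

# Example: ramp the number of fast updates per cycle from 1 to 3 over 1000 steps.
for s in (0, 499, 500, 999, 1000):
    print(s, loop_depth(s, warmup_steps=1000, start_depth=1, final_depth=3))
```

Starting with short loops keeps early gradients from passing through many repeated applications of untrained weights, which is the blow-up or fade-out risk Cody mentions.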

Justy I love that the name is MagicNorm. Somewhere, a very tired researcher named a stabilization method at two in the morning and everyone just accepted it.

Cody Honestly, if it works, call it Spreadsheet Goblin for all I care.

Justy No, don't tempt enterprise software. A procurement team would buy Spreadsheet Goblin Pro by Friday.

Cody Sadly credible.

Justy Anyway, the results are why this article exists. HRM-Text reportedly got sixty point seven percent on MMLU, eighty-four point five percent on GSM8K, and fifty-six point two percent on MATH. For a one-billion-parameter model trained in one point nine days on sixteen GPUs, that's not nothing.

Cody And the article says that is one hundred to nine hundred times fewer training tokens, and ninety-six to four hundred thirty-two times less estimated compute, compared with models like Qwen, Gemma, and Llama. That is a real compute-to-performance claim, even if I want the footnotes tattooed on my eyelids before I fully believe it.
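For anyone checking the receipt alongside Cody, here is the back-of-envelope arithmetic using only figures quoted in the episode: about fifteen hundred dollars, one point nine days, sixteen GPUs, forty billion tokens, and the one hundred to nine hundred times token ratio. The implied per-GPU-hour rate is a derived number and assumes the whole budget went to that single run, which the episode does not confirm.

```python
DAYS = 1.9
GPUS = 16
BUDGET_USD = 1500
TRAIN_TOKENS = 40_000_000_000  # forty billion instruction-response tokens

gpu_hours = DAYS * 24 * GPUS
usd_per_gpu_hour = BUDGET_USD / gpu_hours

# Token budgets implied for the comparison models by the quoted 100x to 900x range.
low_trillions = TRAIN_TOKENS * 100 // 10**12
high_trillions = TRAIN_TOKENS * 900 // 10**12

print(f"{gpu_hours:.1f} GPU-hours")                 # 729.6 GPU-hours
print(f"${usd_per_gpu_hour:.2f} per GPU-hour")      # $2.06 per GPU-hour
print(f"{low_trillions} to {high_trillions} trillion comparison tokens")  # 4 to 36 trillion comparison tokens
```

So the headline price works out to roughly two dollars per GPU-hour, and the token ratio implies comparison models trained on about four to thirty-six trillion tokens.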

Justy Here's where I think people should care without overbuying the headline. If you're an enterprise team that has avoided pretraining because it sounds like setting a pile of money on fire, this suggests a smaller, domain-shaped model might be a real experiment, not a fantasy.

Cody I buy that narrow version. Where I get cautious is the comparison. Training from scratch on instruction-response pairs is not the same task as broad raw-text pretraining, and critics in the article call that apples-to-oranges. Sapient pushes back by saying modern models all see instruction data anyway, but still, benchmark competitiveness is not the same as broad usefulness.

Cody Also, fifteen hundred dollars sounds clean, but it probably does not include data curation, engineering time, failed runs, evaluation work, or the boring infrastructure glue that makes a model usable. And because this is recurrent, I would want latency and serving behavior under real workloads, not just training cost.

Justy That is fair, and annoyingly responsible. My practical read is: this does not mean every company should train a foundation model next quarter. It means architecture choices may reopen the build-versus-buy conversation for narrow reasoning systems, especially when data control matters.

Cody Yeah. And I like that it separates reasoning from memorized knowledge. Pair a compact model with retrieval, let the external system fetch current facts, and use the model for rule-following and synthesis. That is a sane shape, if the reliability is there.
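The shape Cody describes can be sketched in a few lines. Everything here is hypothetical: the word-overlap retriever, the `build_prompt` helper, and the sample policy store are stand-ins, and the model call is left out entirely. The point is only that facts come from the external store while the compact model sees an instruction plus fetched context.

```python
def retrieve(query, store, k=2):
    """Rank documents by naive word overlap with the query (toy retriever)."""
    q_words = set(query.lower().split())
    ranked = sorted(
        store.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]

def build_prompt(query, store, k=2):
    """Assemble retrieved context and the instruction for a compact reasoning model."""
    doc_ids = retrieve(query, store, k)
    context = "\n".join(f"[{doc_id}] {store[doc_id]}" for doc_id in doc_ids)
    return f"Context:\n{context}\n\nInstruction: {query}"

# Hypothetical knowledge store; a real system would also enforce permissions here.
store = {
    "policy-1": "Refunds over 500 dollars require manager approval",
    "policy-2": "Shipping to Canada takes five business days",
    "policy-3": "Refunds are issued within ten business days",
}

prompt = build_prompt("Who must approve refunds over 500 dollars", store)
print(prompt)
```

Updating a policy then means editing the store, not retraining the model, which is the practical appeal of separating reasoning from memorized knowledge.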

Justy Cody, look at you ending on a sane shape instead of a smoking crater. Growth.

Cody I contain multitudes. Mostly log files, but multitudes.

Justy Go eat actual dinner, please. I refuse to have Spreadsheet Goblin be the most nourishing thing in your day.