6 Things I Learned Building LLMs From Scratch That No Tutorial Teaches You | Towards Data Science
Justy and Cody dig into what actually changes when you stop calling an LLM API and start building pieces yourself: why fine-tuning tricks like RsLoRA matter, why RoPE won, where weight tying still makes sense, why Pre-LN became the default, and how KV cache buys speed by spending memory.
Script: GPT-5.4. Voice: OpenAI TTS.
Transcript
Justy Okay, I have a theory... most people do not need to build an LLM from scratch, but a lot more people need to understand why the thing they shipped got slow and weird.
Cody [chuckles] And yet here we are, in your kitchen, after a cross-country flight, voluntarily talking about normalization layers. This show has a very specific target market of, like, twelve tired nerds.
Justy My wife walked by and asked what episode 309 of Exploring Next was about, and I said, "kind of... invisible design choices?" She gave me a look that was, honestly, fair.
Cody Mine would've been the airport version. If I still care about this after a delayed flight and a bad sandwich, it matters. This one does, because teams keep treating model behavior like magic when it's often just architecture.
Justy Yeah, and that's why this matters right now, Cody. Plenty of people can call an API. Fewer people know why one fine-tune is cheap and useful, another collapses, and a third suddenly needs way more memory in prod.
Cody Right. The article is basically a field report from implementing GPT-2 in plain PyTorch, then layering in LoRA, RoPE, KV cache, that whole stack. I liked it because it's not doing mystical vibes. It's saying, no, this knob changes variance, this one changes memory, this one changes training stability.
Justy Start with the LoRA part, because that's where a lot of builders actually touch the model. And break it down like I'm sleep-deprived, which... to be clear, I am.
Cody [exhales] So regular LoRA adds a small low-rank update to frozen weights. In the article's setup, only about 0.18 percent of weights were trainable, which is why people love it. The catch is the usual scaling term, alpha over r. As rank goes up, that factor gets smaller, so the variance of your update shrinks. Quietly. So you think you're giving the model more expressive room, but you're also turning down the volume.
Justy Which is such a rude little gotcha. You buy a bigger mixing board and somehow the music gets softer. [laughs] That feels very on-brand for machine learning.
Cody Exactly. rsLoRA changes that scaling to alpha over the square root of r, so the update variance stays roughly constant as rank grows instead of decaying like one over r. That's the real point, not the formula itself. If you're increasing rank and wondering why gains flatten, it may not be your data. It may be the math of the adapter.
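A minimal sketch of the scaling difference described above, in plain Python. The function names and the value alpha = 16 are illustrative assumptions, not taken from the article, and the variance model is a deliberate simplification: it assumes the unscaled low-rank product has variance proportional to the rank r.

```python
import math

def lora_scale(alpha: float, r: int) -> float:
    """Standard LoRA scaling factor: alpha / r."""
    return alpha / r

def rslora_scale(alpha: float, r: int) -> float:
    """Rank-stabilized scaling factor: alpha / sqrt(r)."""
    return alpha / math.sqrt(r)

def relative_update_variance(scale: float, r: int) -> float:
    """Simplifying assumption: the unscaled update has variance
    proportional to r, so the scaled update has variance scale**2 * r."""
    return scale ** 2 * r

alpha = 16.0  # illustrative value, not from the article
for r in (4, 16, 64, 256):
    v_lora = relative_update_variance(lora_scale(alpha, r), r)
    v_rs = relative_update_variance(rslora_scale(alpha, r), r)
    print(f"r={r:4d}  LoRA variance ~ {v_lora:8.2f}  rsLoRA variance ~ {v_rs:8.2f}")
```

Under that assumption the LoRA column falls off as one over r while the rsLoRA column stays flat, which is the "turning down the volume" effect in numbers.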
Justy And product-wise, that matters because people treat adapter settings like cheap experimentation. If the default behavior is misleading, you can burn a week and conclude the use case doesn't work, when really your tuning recipe was grading itself on a curve. I'm giving that bug a D.
Cody You would grade a sunset if it had poor contrast, Justy. But yeah, this is one of those cases where understanding the mechanism saves actual time, not just intellectual pride.
Justy RoPE felt similar to me. People hear positional embeddings and their eyes glaze over, but the user story is simple: can the model keep track of where things are in a sequence without mangling the token information?
Cody Yep. Older approaches added either fixed sinusoidal signals or learned position vectors directly to the token embeddings. That works, but you're literally mixing position into the token representation. RoPE does something cleaner. It rotates the query and key vectors based on position, so the token embedding itself stays intact. Zero extra learned parameters for position, and relative distance falls out much more naturally, because the dot product of a rotated query and a rotated key depends only on how far apart the two positions are.
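A small sketch of the rotation idea, in plain Python so it runs without any framework. It assumes the interleaved-pair convention and a base of 10000; real implementations differ in how they pair dimensions, and the names `rope` and `dot` plus the example vectors are made up for illustration.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by an angle that grows with `pos`.
    Pair j uses frequency base ** (-2j / d), written here via the even index i = 2j."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]   # hypothetical query vector
k = [1.0, 0.4, -0.6, 0.9]   # hypothetical key vector

# Same offset (3 positions apart), very different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
print(near, far)
```

The two scores match up to floating-point error, because only the offset between positions survives the dot product. The rotation also leaves each vector's length unchanged, and nothing here is a learned parameter.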
Justy [sighs] Documentary narrator voice activated. "In the wild, the query rotates gently, preserving the token's natural habitat..."
Cody [giggles] And nearby, a product manager tries to ship before checking the gas gauge. But that's the reason RoPE won. Better extrapolation, less baggage, and it doesn't smear the word representation just to tell you what came before what.
Justy The weight tying bit was a nice reminder that scale changes what counts. On a 124 million parameter model, saving 38 million parameters is huge. On a giant model, it's pocket lint. So for a solo builder making a small local model, absolutely care. For a huge hosted model... probably not where I'd spend my Saturday.
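The arithmetic behind that 38 million figure can be sketched in a few lines. The vocabulary size and model width below are GPT-2 small's published configuration, assumed here because the conversation only gives rounded totals; the 124 million total is the nominal figure quoted above.

```python
VOCAB_SIZE = 50_257          # GPT-2 tokenizer vocabulary (assumed configuration)
D_MODEL = 768                # GPT-2 small hidden width (assumed configuration)
NOMINAL_TOTAL = 124_000_000  # rounded total quoted in the conversation

def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in one vocab-by-width matrix (token embedding or output head)."""
    return vocab_size * d_model

# Tying reuses the token embedding matrix as the output projection,
# so one full copy of this matrix is never allocated.
saved = embedding_params(VOCAB_SIZE, D_MODEL)
share = saved / NOMINAL_TOTAL

print(f"parameters saved by tying: {saved:,}")
print(f"share of the nominal total: {share:.1%}")
```

The same vocabulary-by-width matrix becomes a much smaller share once the rest of the model grows, which is why the saving matters mostly at small scale.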
Cody And the normalization point is similar. Post-LN can squeeze out better final performance, but deep networks are harder to train with it because the gradients get ugly. Pre-LN became the practical choice because stable training beats theoretical elegance when you actually want the run to finish. That's one of those boring decisions that keeps the lights on.
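The difference between the two layouts is only where the normalization sits relative to the residual add. A toy sketch in plain Python, where `sublayer` is a hypothetical stand-in for attention or the MLP and the layer norm has no learned scale or shift:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    """Hypothetical stand-in for an attention or MLP sublayer."""
    return [0.5 * v + 0.1 for v in x]

def post_ln_block(x):
    """Post-LN: add the residual first, then normalize the sum."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_block(x):
    """Pre-LN: normalize the sublayer input, leave the residual path untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 4.0, 8.0]
post_out = post_ln_block(x)
pre_out = pre_ln_block(x)
print(post_out)
print(pre_out)
```

In the Pre-LN version the input `x` reaches the output through a plain addition, so gradients have an unnormalized identity path through every block. In the Post-LN version every block's output, residual included, passes through a normalization.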
Justy [pause] Quick tangent. Cody has a weekend list that's now so long I'm pretty sure it has sub-bullets. If you say you're gonna benchmark normalization variants "this weekend," I'm confiscating your laptop.
Cody [chuckles] Added to the list: remove Justy's list privileges. But speaking of practical, KV cache is the one everybody feels immediately.