Ep 394 article 8:07 w/ Justy & Cody

How Lakebase Architecture Delivers 5x Faster Postgres Writes

Justy and Cody dig into Databricks Lakebase claiming much faster Postgres writes by turning off full page writes at the compute layer and pushing page image generation into distributed storage. Cody likes the architectural trick but questions where the complexity moved, while Justy argues the real win is for teams hitting write bottlenecks without wanting to re-architect their app.

Script: GPT-5.4 Voice: Murf.AI Gen2

Transcript

Justy Cody, this one matters because a lot of teams are still basically asking, "why is my app slow?" when the real answer is just that Postgres writes got expensive.

Cody Yeah. My skeptical read is the headline is probably directionally real, but the number is doing a lot of work. They're fixing a known Postgres pain point, not inventing new physics.

Cody The pain point is durability overhead. In normal Postgres, after a checkpoint, the first write to a page can force a full 8KB page image into WAL so crash recovery doesn't replay onto a half-written page. On write-heavy systems that balloons log traffic, and Databricks says WAL volume can go up by like 15x in bad cases.
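To put rough numbers on the amplification Cody describes, here is a back-of-envelope sketch in Python. It is a simplification, not how Postgres accounts for WAL internally: it assumes every change emits a fixed-size record and that some fraction of changes are the first touch of a page after a checkpoint, each adding one uncompressed 8KB image. The function name and parameters are hypothetical.

```python
PAGE_SIZE = 8192  # default Postgres block size in bytes


def wal_amplification(record_bytes: int, first_touch_fraction: float) -> float:
    """Estimate WAL bytes with full page images relative to WAL bytes without.

    record_bytes: assumed size of an ordinary WAL record for one change.
    first_touch_fraction: assumed share of changes that are the first write
        to their page since the last checkpoint (each adds one page image).
    """
    if record_bytes <= 0:
        raise ValueError("record_bytes must be positive")
    if not 0.0 <= first_touch_fraction <= 1.0:
        raise ValueError("first_touch_fraction must be between 0 and 1")
    # Average bytes per change when page images are included.
    with_images = record_bytes + first_touch_fraction * PAGE_SIZE
    return with_images / record_bytes


if __name__ == "__main__":
    # Illustrative inputs only; real record sizes vary by workload.
    for fraction in (0.0, 0.05, 0.2, 1.0):
        print(fraction, round(wal_amplification(100, fraction), 1))
```

Under these assumptions the multiplier is driven by how often a change lands on a page that has not been touched since the last checkpoint, which is why frequent checkpoints and scattered writes make it worse.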

Justy Right.

Cody Lakebase gets to dodge that because compute is stateless and storage is separate. The compute streams WAL to a Paxos-backed safekeeper quorum, so there isn't a local data page sitting on disk that can get torn in the old-school way.

Justy I had to reheat my coffee because I made the mistake of letting you explain WAL before caffeine. Anyway, the user story is pretty clean. Small team, app works, traffic grows, writes start hurting, and nobody wants to become a database intern over the weekend.

Justy If this really gives more headroom without changing app code, that's a real product story. Especially for people building transactional backends or those AI app stacks where Postgres ends up holding session state, metadata, tool traces, all the unglamorous stuff.

Cody Sure.

Cody The clever part is they didn't just turn off full page writes and call it a day. That would save WAL bandwidth, but reads could get ugly because storage might need to replay a very long chain of tiny deltas to reconstruct one page.

Cody So their pageserver generates page images in storage once a page crosses some delta threshold. I actually like that. It's more grounded than tying image creation to checkpoint timing, which is kind of a blunt instrument.
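The threshold idea can be sketched with a toy model. This is not Lakebase's pageserver code: the class, its fields, and the choice to count deltas (rather than bytes or replay cost) are all assumptions for illustration, and the image is built inline here whereas the episode describes it happening in the background.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPage:
    """A page stored as a base image plus a chain of small deltas."""

    threshold: int                                # deltas allowed before a new image
    image: dict = field(default_factory=dict)     # last materialized page image
    deltas: list = field(default_factory=list)    # changes recorded since that image
    images_built: int = 0                         # how many images were generated

    def apply_delta(self, key: str, value: int) -> None:
        """Record one change; build a fresh image once the chain gets long."""
        self.deltas.append((key, value))
        if len(self.deltas) >= self.threshold:
            self._materialize()

    def _materialize(self) -> None:
        # Fold the delta chain into a new base image so reads stay cheap.
        self.image = self.read()
        self.deltas.clear()
        self.images_built += 1

    def read(self) -> dict:
        """Reconstruct the page: start from the image, replay pending deltas."""
        page = dict(self.image)
        for key, value in self.deltas:
            page[key] = value
        return page


if __name__ == "__main__":
    page = ToyPage(threshold=3)
    for i in range(7):
        page.apply_delta(f"row{i}", i)
    print(page.images_built, len(page.deltas), page.read())
```

The point of the sketch is the bound: a read never replays more than `threshold - 1` deltas, no matter how many writes the page has absorbed or when the last checkpoint happened.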

Justy That part felt important to me too. The claim isn't only faster writes. They also say around 94 percent less WAL traffic and about 2x better read tail latency, which is exactly where product claims usually get slippery.

Cody Mm-hm.

Justy If both sides move the right way, then okay, this is more than benchmark theater. But I still think adoption depends on whether a team believes the storage layer is now smart in a good way, not smart in a mysterious way.

Cody Exactly.

Cody That's my real reservation, Justy. The complexity did not vanish. It moved. Now you need confidence in pageserver behavior, image thresholds, replay costs, branch interactions, and what happens under weird contention patterns.

Cody The blog says image generation can be shared across multiple pageservers in the background, which sounds great for scale. I just want to know what the observability looks like when one hot page goes pathological, because that's where managed systems get annoying.

Justy Wait—

Justy I think that's fair, but for the buyer, some of that is exactly the point. They are paying to not own that weirdness. The adoption barrier is less technical purity and more, do I trust this enough to move a production app that already works.

Cody Yeah.

Justy And there's some market timing here. Everybody wants one operational data system that can sit near analytics and AI workflows without doing a giant platform split. So 'Postgres, but with more write headroom because the architecture is different' lands pretty well right now.

Cody I could be wrong, but I buy the mechanism more than I buy the generic hype. No torn local pages means full page writes become optional at compute. Then storage recreates the reset points on its own terms. That's coherent engineering.

Cody What I would not assume is automatic 5x for every workload. If the app is read-heavy, lock-heavy, or bottlenecked somewhere else, this probably feels a lot less dramatic than the headline.
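Cody's caveat is essentially Amdahl's law. A minimal sketch, assuming only the write path gets faster and everything else (reads, lock waits, network) takes the same time as before; the function name and the example fractions are illustrative, not measurements.

```python
def overall_speedup(write_fraction: float, write_speedup: float) -> float:
    """Amdahl's law: whole-workload speedup when only writes get faster.

    write_fraction: assumed share of total time spent on the write path.
    write_speedup: how much faster that write path becomes (e.g. 5.0).
    """
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    if write_speedup <= 0:
        raise ValueError("write_speedup must be positive")
    # Time left over = untouched part + accelerated write part.
    remaining = (1.0 - write_fraction) + write_fraction / write_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    for fraction in (0.1, 0.5, 0.9, 1.0):
        print(fraction, round(overall_speedup(fraction, 5.0), 2))
```

Only a workload that spends essentially all of its time on writes would see the full headline factor; a mostly read-bound or lock-bound app lands much closer to 1x.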

Justy So, classic episode 394 energy. The number on the billboard is loud, the actual idea is quieter and maybe better. Best fit feels like teams already committed to Postgres semantics, hitting write pressure, and willing to buy into a managed architecture shift instead of sharding themselves into sadness.

Cody Build-next wise, I'd do two things. One, run pgbench locally against vanilla Postgres with full_page_writes on and off, plus different checkpoint settings, just to feel the WAL amplification yourself. Two, try a disaggregated Postgres service with a Neon-style storage model and compare write throughput, WAL volume, and p99 reads under a write-heavy script.
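For the first experiment, one way to quantify WAL volume is to note the output of `SELECT pg_current_wal_lsn()` before and after each pgbench run and convert the difference to bytes. The helper below is a sketch of that arithmetic only; it does not connect to a database, and the function names and sample values are hypothetical.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a Postgres LSN such as '0/16B3748' to a byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def wal_bytes_per_txn(start_lsn: str, end_lsn: str, transactions: int) -> float:
    """Average WAL bytes generated per transaction between two LSNs."""
    if transactions <= 0:
        raise ValueError("transactions must be positive")
    return (lsn_to_bytes(end_lsn) - lsn_to_bytes(start_lsn)) / transactions


if __name__ == "__main__":
    # Made-up LSNs and transaction count, purely to show the call shape.
    print(wal_bytes_per_txn("0/1000000", "0/5000000", 10_000))
```

Comparing this figure across runs with `full_page_writes` on and off, and with different checkpoint settings, makes the amplification visible as bytes per transaction rather than as a feeling. Turning `full_page_writes` off is only safe as an experiment on a throwaway instance.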

Justy For a solo builder, even simpler: spin up a small app that writes chat history or order events, hammer it with pgbench or k6, and watch whether the bottleneck is actually WAL traffic before shopping for magic. Anyway, I think that's the honest version. Smart architecture, real use case, still needs trust.

Cody Yep. Good trick, not magic.

Justy Cool. Finish your coffee before you start explaining Paxos at my kitchen counter again.