Ep 330 research 5:58 w/ Justy & Cody

Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets

SLIDERS solves the aggregation bottleneck in document question answering by extracting information into a relational database and reasoning over structured data via SQL instead of concatenating chunks. It uses data reconciliation to fix duplicates and inconsistencies, outperforming GPT-4 on long-context benchmarks and scaling to 36M tokens.

Script: Haiku 4 Voice: ElevenLabs

Transcript

Justy So you've got a stack of documents—call it a thousand pages—and you need to answer a question that touches five different parts of five different files. You chunk it, throw chunks at an LLM, get back some facts. Then what? Suddenly you're sitting on a pile of evidence and no clear way to combine it all without just… concatenating everything and hoping the context window holds.

Cody Right. That's the aggregation bottleneck. And it gets worse because each chunk extraction is local. One chunk might say a company was founded in 2015. Another says 2014. The LLM doesn't know these contradict each other because it never saw them side by side.

Justy So SLIDERS flips that. Instead of chunking and concatenating, you extract into a database.

Cody Exactly. You define a schema—let's say companies, funding rounds, executives. You run extraction on each document chunk and insert structured records into those tables instead of collecting text snippets.
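To make the schema idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and the sample record are illustrative assumptions, not the schema SLIDERS itself uses.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE companies (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    founded_year INTEGER,
    source_doc TEXT,   -- which document the fact came from
    rationale TEXT     -- the evidence the extractor quoted
);
CREATE TABLE funding_rounds (
    id INTEGER PRIMARY KEY,
    company_id INTEGER REFERENCES companies(id),
    round_type TEXT,
    amount_usd INTEGER,
    source_doc TEXT
);
"""

def build_db(path=":memory:"):
    """Create a database with the sketch schema and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def insert_company(conn, record):
    """Insert one extracted company record and return its row id."""
    cur = conn.execute(
        "INSERT INTO companies (name, founded_year, source_doc, rationale) "
        "VALUES (:name, :founded_year, :source_doc, :rationale)",
        record,
    )
    return cur.lastrowid

conn = build_db()
# Made-up record standing in for the output of one chunk's extraction.
acme_id = insert_company(conn, {
    "name": "Acme Robotics",
    "founded_year": 2015,
    "source_doc": "report_a.pdf",
    "rationale": "Acme Robotics was founded in 2015",
})
conn.execute(
    "INSERT INTO funding_rounds (company_id, round_type, amount_usd, source_doc) "
    "VALUES (?, ?, ?, ?)",
    (acme_id, "Series A", 12_000_000, "report_a.pdf"),
)
conn.commit()
```

Keeping source_doc and rationale on every row is a design choice for this sketch: it preserves where each fact came from, which a later reconciliation pass can draw on.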

Justy And then you query it with SQL.

Cody Right. SQL is far more precise than reasoning over concatenated text. You can say, 'give me all companies founded between 2010 and 2015 that raised Series A funding,' and the database returns exactly that set.
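The example question could be written roughly like this. The tables and rows are made up for illustration, and the schema is an assumption rather than anything taken from the SLIDERS repo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Tiny made-up dataset so the query has something to run against.
conn.executescript("""
CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT, founded_year INTEGER);
CREATE TABLE funding_rounds (company_id INTEGER, round_type TEXT);
INSERT INTO companies VALUES
    (1, 'Acme Robotics', 2015),
    (2, 'Borealis AI', 2008),
    (3, 'Cinder Labs', 2012);
INSERT INTO funding_rounds VALUES
    (1, 'Series A'),
    (2, 'Series A'),
    (3, 'Seed');
""")

# "Companies founded between 2010 and 2015 that raised Series A funding."
QUERY = """
SELECT DISTINCT c.name
FROM companies AS c
JOIN funding_rounds AS f ON f.company_id = c.id
WHERE c.founded_year BETWEEN ? AND ?
  AND f.round_type = ?
ORDER BY c.name
"""
matches = [row[0] for row in conn.execute(QUERY, (2010, 2015, "Series A"))]
print(matches)
```

Only Acme Robotics satisfies both conditions here: Borealis AI falls outside the date range and Cinder Labs only raised a seed round.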

Justy But if I'm extracting from a thousand documents, I'm probably going to get duplicates or conflicts.

Cody That's where reconciliation comes in. SLIDERS identifies duplicates and inconsistencies using metadata and extraction rationales to decide which record is right or whether they need to be merged.
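As a rough illustration of reconciliation, the sketch below groups records by a normalized name, merges duplicates, and flags conflicting values. The source_rank field and the "trust the best-ranked source" rule are assumptions made for this example. The actual SLIDERS procedure uses metadata and extraction rationales, and this toy rule stands in for it rather than reproducing it.

```python
from collections import defaultdict

# Made-up extracted records. "source_rank" is a hypothetical metadata field
# (lower means more trusted), introduced only for this sketch.
records = [
    {"name": "Acme Robotics", "founded_year": 2015, "source": "annual_report.pdf",
     "source_rank": 1, "rationale": "founded in 2015"},
    {"name": "acme robotics ", "founded_year": 2014, "source": "news_clip.txt",
     "source_rank": 2, "rationale": "started around 2014"},
    {"name": "Cinder Labs", "founded_year": 2012, "source": "filing.txt",
     "source_rank": 1, "rationale": "incorporated 2012"},
    {"name": "Cinder Labs", "founded_year": 2012, "source": "blog.txt",
     "source_rank": 3, "rationale": "since 2012"},
]

def normalize(name):
    """Lowercase and collapse whitespace so near-identical names group together."""
    return " ".join(name.lower().split())

def reconcile(records):
    """Merge records per entity; report entities whose sources disagree."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalize(rec["name"])].append(rec)

    resolved, conflicts = [], []
    for key, group in groups.items():
        years = {rec["founded_year"] for rec in group}
        # Assumed rule: keep the record from the most trusted source.
        best = min(group, key=lambda rec: rec["source_rank"])
        resolved.append(dict(best, sources=sorted(rec["source"] for rec in group)))
        if len(years) > 1:
            conflicts.append({"entity": key, "candidates": sorted(years),
                              "kept": best["founded_year"]})
    return resolved, conflicts

resolved, conflicts = reconcile(records)
print(conflicts)
```

The two Cinder Labs rows agree and collapse into one record, while the Acme Robotics rows disagree on the founding year and are surfaced as a conflict alongside the value that was kept.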

Justy How much does this actually improve over just throwing everything at GPT-4?

Cody On existing benchmarks, SLIDERS beats GPT-4 by 6.6 points. At 3.9M and 36M token scales, it improves by 19 and 32 points respectively.

Justy But that's a benchmark. In the real world, who's building with this?

Cody SLIDERS is clever, but not plug-and-play. You need to design your extraction schema and set up a database. If your documents are semi-structured—financial reports, contracts—where you can define a clear schema, then yes. If they're totally unstructured, the overhead might outweigh the benefit.

Justy And latency? The reconciliation step adds a pass over the data.

Cody Right. So it's not real-time like simple RAG. But for batch jobs, say 'analyze these documents overnight,' it's solid. The reconciliation overhead is worth it because conflicting records get caught and resolved, which text concatenation never does.

Justy If you were building this yourself, what would you do differently?

Cody I might start with a lighter schema—maybe key-value pairs instead of full relational normalization. Get extraction and reconciliation working, then add structure as you learn what questions matter. Also, use SQLite instead of a full database server for solo projects.
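A lighter key-value layout along those lines might look like the sketch below: a single facts table in SQLite, plus a query that surfaces attributes with disagreeing values. The table layout and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One flat table of (entity, attribute, value) facts instead of normalized tables.
conn.execute("""
CREATE TABLE facts (
    entity TEXT NOT NULL,
    attribute TEXT NOT NULL,
    value TEXT NOT NULL,
    source_doc TEXT NOT NULL
)
""")
facts = [
    ("Acme Robotics", "founded_year", "2015", "report_a.txt"),
    ("Acme Robotics", "founded_year", "2014", "news_b.txt"),
    ("Acme Robotics", "ceo", "J. Doe", "report_a.txt"),
]
conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?)", facts)

# Attributes with more than one distinct value are candidates for reconciliation.
disputed = conn.execute("""
    SELECT entity, attribute, COUNT(DISTINCT value) AS n_values
    FROM facts
    GROUP BY entity, attribute
    HAVING COUNT(DISTINCT value) > 1
""").fetchall()
print(disputed)
```

The trade-off is that every value is stored as text, so typed comparisons and joins get harder. That is the point at which moving to real tables starts to pay off.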

Justy What can people actually grab and try?

Cody The GitHub repo is at stanford-oval/sliders. If you want to experiment, run it on a small document set first—like a single annual report. See if the schema makes sense for your domain.

Justy For someone working solo?

Cody Start with a simple project. Take a public dataset like Wikipedia articles or SEC filings, define a minimal schema, write a Python script to extract facts using an LLM API, insert them into SQLite, and write SQL queries to answer test questions. You'll see immediately where the schema breaks.
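The weekend-project pipeline could be wired together roughly as follows. The extract_facts function is a hypothetical stand-in for an LLM API call: it uses a regular expression so the sketch runs without any external service, and the documents are made up.

```python
import re
import sqlite3

def extract_facts(chunk, doc_id):
    """Stand-in for an LLM extraction call (hypothetical).

    A real version would prompt a model and parse structured output. This
    regex only handles sentences shaped like '<Name> was founded in <year>'.
    """
    pattern = re.compile(r"([A-Z][\w ]+?) was founded in (\d{4})")
    return [
        {"name": m.group(1).strip(), "founded_year": int(m.group(2)),
         "source_doc": doc_id}
        for m in pattern.finditer(chunk)
    ]

# Made-up documents; each is treated as a single chunk for brevity.
documents = {
    "doc1.txt": "Acme Robotics was founded in 2015. It builds warehouse robots.",
    "doc2.txt": "Cinder Labs was founded in 2012. Acme Robotics was founded in 2015.",
}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companies (name TEXT, founded_year INTEGER, source_doc TEXT)"
)
for doc_id, text in documents.items():
    for fact in extract_facts(text, doc_id):
        conn.execute(
            "INSERT INTO companies VALUES (:name, :founded_year, :source_doc)",
            fact,
        )

# Test question: when was each company founded, and how many documents agree?
answer = conn.execute(
    "SELECT name, founded_year, COUNT(*) AS mentions FROM companies "
    "GROUP BY name, founded_year ORDER BY name"
).fetchall()
print(answer)
```

Swapping the regex for a real model call is where the schema tends to show its gaps, since the model will return fields and phrasings the tables did not anticipate.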

Justy That's a weekend project.

Cody Yep. And either way, you've learned whether structured extraction is worth the effort for your use case. The reconciliation piece is what makes it hold up in the real world, where extraction is messy and inconsistent.

Justy This is Exploring Next, episode 330. Thanks for walking through the paper with me, Cody.