From Batch to Micro Batch Streaming: Lessons Learned the Hard Way in a Delta Index Pipeline
Justy and Cody unpack an InfoQ case study about moving an ads delta-index pipeline from scheduled batch jobs to Spark micro-batches, focusing on freshness, object-store ingestion, logical watermarks, restart behavior, and practical weekend experiments.
Script: GPT-5.5 Voice: ElevenLabs
Transcript
Justy Cody, this one hits because stale data is the invisible product bug. The page loads, the ad shows, search works, but it’s quietly yesterday.
Cody Yeah, and the article is basically a scar-tissue writeup on that. They had a delta index pipeline for search and ads retrieval, and the painful part wasn’t that Spark couldn’t process the data. It was that scheduled batch runs left dead time everywhere.
Justy Also, I made coffee in your kitchen and somehow used three mugs for one drink, so I’m already in pipeline-waste mode. [chuckles] Anyway, this system had new incremental data arriving every five to seven minutes, right?
Cody Right. Ads, campaign updates, product and item signals, customer signals, conversion, performance, co-purchase stuff. The full index was hundreds of gigs and took two to three hours to rebuild, then validation and deployment could push the whole thing near five hours. So they had a smaller delta index, maybe around a tenth of the full index, to keep production fresh between full swaps.
Justy The user story is pretty plain: somebody updates an ad or campaign metadata, and the retrieval system should reflect that before the moment is gone. If an ad burns through budget before the latest version lands, that’s not an elegant distributed systems problem. That’s just money and relevance leaking out of the product.
Cody And the clever move was not jumping all the way to per-record streaming. They tried that direction, or at least started there, and it didn’t match the shape of the work. The indexing logic grouped around product or item level representations, then expanded back out, so one ad change could mean recomputing a grouped chunk. If you stream single records, you risk having part of the grouped index updated and part of it lagging behind.
Justy That’s the adoption barrier I recognize. People hear “streaming” and immediately picture a whole new operational lifestyle. More dashboards, weird long-running job failures, on-call anxiety, and a team quietly asking if the batch job was ugly but at least understandable.
Cody Totally fair anxiety. [pause] Their compromise was Spark Structured Streaming in micro-batch mode, but not with Kafka-style event semantics. Data was time-partitioned files in S3-style object storage. They used streaming more like a continuous executor that wakes up, finds the partition state, processes a bounded slice, and moves on.
Justy So it’s not trying to make every individual record sacred. It’s saying, the meaningful unit is a complete partition, and the product wants freshness within minutes, not a perfectly replayed museum of every intermediate state.
Cody Exactly. And they didn’t lean on Spark’s native checkpointing or event-time watermarks as the source of truth. They kept an external logical watermark, basically the latest processed partition timestamp. That matters because the input wasn’t an ordered event log. It was object storage listings and partition interpretation, which is a very different beast.
Justy The success-file bit stood out to me. I’ve seen teams treat completion markers like little magic stamps. Then reality shows up: eventual consistency, late files, retries, weird producer behavior, and suddenly the stamp is less comforting.
Cody Yeah. [exhales] The article’s take is that deterministic, rate-based progress can be more reliable than waiting for perfect completion signals. In steady state, process the latest available partition according to a controlled cadence. For this freshness path, skipping ahead after lag could be better than replaying every missed historical partition, because the output covers overlapping windows anyway, around the last five hours.
Justy I like that, but I’d be careful selling it outside this exact shape. If you’re doing billing, compliance-y reconciliation, or anything where every intermediate state matters, “skip to latest” is a trap. For ads retrieval freshness, though, I get it. The latest usable index is the thing users feel.
Cody Same. The questionable part is mostly portability of the lesson. The strong lesson is not “always skip backlog.” It’s design lag behavior explicitly. They treated restarts as normal too, which I love. Long-running jobs leak memory, dependencies wobble, clusters get weird. A clean scheduled restart can be a feature if the watermark and partition logic are solid.
Justy Build-next version, I’d keep it tiny. Take Spark Structured Streaming, write partitioned Parquet files into MinIO every few minutes, and maintain a watermark row in SQLite or Postgres. Then deliberately kill the job and see if it resumes from your logical partition instead of whatever Spark feels like doing.
Cody Yeah, solo builder weekend: clone something like apache/spark examples, run `docker compose up` with MinIO, then launch `pyspark` with a local checkpoint directory. Generate folders like `dt=2026-05-04/hour=10/minute=05`, process only the newest complete-looking partition, and write a small inverted index file, even just term to document IDs. If you want the lakehouse flavor, try delta-io/delta with Spark, but don’t confuse Delta Lake with their “delta index” concept. Differe
Justy Three deltas walk into a backlog, nobody knows who owns the ticket. Cody, that’s useful. Tiny version first, then decide if streaming is actually earned.