Stream vs Batch


1. What Is It?

Batch processing runs a job over a bounded dataset that has a known start and end — yesterday's orders, last hour's logs, all rows in a table. The job starts, computes, writes results, and exits. Stream processing runs over an unbounded dataset — events arriving continuously with no defined end. The job runs forever (in principle), consuming events as they arrive and emitting results incrementally.
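The bounded/unbounded distinction can be sketched in a few lines of Python. This is only an illustration: the function names and the list of order amounts are hypothetical, not taken from any framework. The batch function sees all of its input and returns once; the stream function is a generator that emits an updated result per event and never needs the input to end.

```python
from typing import Iterable, Iterator


def batch_total(orders: list[float]) -> float:
    """Batch: the input is bounded, so one call reads all of it and returns once."""
    return sum(orders)


def stream_totals(orders: Iterable[float]) -> Iterator[float]:
    """Stream: the input may never end, so emit an updated total after each event."""
    running = 0.0
    for amount in orders:
        running += amount
        yield running


if __name__ == "__main__":
    print(batch_total([10.0, 5.0, 2.5]))
    print(list(stream_totals([10.0, 5.0, 2.5])))
```

Because `stream_totals` is lazy, it also works on an iterator that never terminates; the caller simply keeps pulling results.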

The problem batch solves: high-throughput, complete-picture computation when freshness is measured in hours or days. The problem stream solves: keeping derived state, alerts, dashboards, and downstream systems in sync with reality within seconds or sub-second. Without streaming, you choose between staleness and brute-force re-computation; without batch, you give up the simplicity of "the data is sitting there, just read it."

QUICK CHECK

Your team runs a nightly job that aggregates all user activity from the past 24 hours to update a reporting dashboard. Users accept that the dashboard reflects data from the previous day. A new requirement arrives: the dashboard must reflect activity within 5 seconds of it occurring. Which architectural shift does this requirement demand, and why?

2. How It Works

Batch job lifecycle:

  1. A scheduler (Airflow, cron, Spark submit) launches the job.
  2. The job reads the bounded input (a partitioned table, a date-range of files in S3).
  3. It computes — joins, aggregates, transforms — on the full dataset.
  4. It writes outputs atomically (commit a table, swap a directory).
  5. It exits. State is gone. The next run starts from scratch (or from the next slice of input).
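A minimal sketch of steps 2 to 5, assuming a hypothetical input layout (a directory of `.jsonl` files whose lines carry a `user_id` field) and a single JSON output file. The atomic publish uses the write-to-temp-then-rename pattern; real table formats and object stores have their own commit mechanisms.

```python
import json
import os
import tempfile
from pathlib import Path


def run_batch_job(input_dir: Path, output_file: Path) -> dict[str, int]:
    """Count events per user over a bounded set of files, then publish atomically."""
    counts: dict[str, int] = {}
    # Step 2: read the bounded input (every file present when the job starts).
    for path in sorted(input_dir.glob("*.jsonl")):
        with path.open() as f:
            for line in f:
                if not line.strip():
                    continue
                event = json.loads(line)
                # Step 3: compute over the full dataset.
                counts[event["user_id"]] = counts.get(event["user_id"], 0) + 1
    # Step 4: write to a temp file in the target directory, then rename it over
    # the target. Readers see the old file or the new one, never a partial write
    # (os.replace is atomic on POSIX when both paths share a filesystem).
    fd, tmp_name = tempfile.mkstemp(dir=output_file.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(counts, tmp)
    os.replace(tmp_name, output_file)
    # Step 5: return and exit; nothing is kept in memory for the next run.
    return counts
```

Re-running the job after a failure is safe because a crashed run never replaces the published file.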

Stream job lifecycle:

  1. A long-lived worker (a TaskManager, an application, a service) subscribes to one or more topics.
  2. For each incoming event it updates in-process state, optionally emits an output event, and periodically commits offsets or checkpoints.
  3. It runs indefinitely. On failure, it restarts from the last checkpoint or committed offset and resumes.
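The loop above can be sketched without any framework. Everything here is an assumption made for illustration: the topic is modelled as an in-memory list, `Checkpoint` is a hypothetical record holding the next offset plus the state, and `crash_at` only exists to simulate a failure. A real job would persist checkpoints to durable storage and would not stop when the log is drained.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Checkpoint:
    offset: int = 0                                  # next event index to read
    counts: dict = field(default_factory=dict)       # state as of that offset


def run_stream_job(log: list, ckpt: Checkpoint, every: int = 2,
                   crash_at: Optional[int] = None) -> Checkpoint:
    """Consume `log` starting at ckpt.offset and return the last durable checkpoint."""
    counts = dict(ckpt.counts)      # restore in-process state from the checkpoint
    offset = ckpt.offset
    durable = ckpt
    while offset < len(log):
        if crash_at is not None and offset == crash_at:
            return durable          # simulated crash: work since the checkpoint is lost
        key = log[offset]
        counts[key] = counts.get(key, 0) + 1         # update state for this event
        offset += 1
        if offset % every == 0:                      # periodic checkpoint
            durable = Checkpoint(offset, dict(counts))
    return Checkpoint(offset, counts)                # demo only: checkpoint on drain


if __name__ == "__main__":
    log = ["a", "b", "a", "c", "a"]
    saved = run_stream_job(log, Checkpoint(), crash_at=3)
    print(saved)                         # state as of the last checkpoint before the crash
    print(run_stream_job(log, saved))    # restart resumes from there and catches up
```

The event processed between the last checkpoint and the crash is processed again after the restart, which is why the checkpoint must capture the offset and the state together.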

Concrete example. "Daily active users" computed two ways:

  • Batch (Spark): Every night, scan yesterday's events table, SELECT COUNT(DISTINCT user_id) WHERE date = yesterday. Result is correct for "yesterday" but you learn it the next morning.
  • Stream: Subscribe to the events topic, maintain a HyperLogLog sketch keyed by hour, and emit the (approximate) count on each update. The dashboard shows the current hour's distinct-user count within seconds; the previous hour's count is final within minutes of the hour boundary.
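Both versions of the distinct-user count can be compared side by side in plain Python. Note the substitution: this sketch keeps an exact `set` per hour instead of a HyperLogLog sketch, which is simpler to read but uses memory proportional to the number of users. The `(hour, user_id)` event shape is a hypothetical stand-in for the real event schema.

```python
from collections import defaultdict
from typing import Iterable, Iterator

Event = tuple[int, str]   # (hour, user_id); hypothetical event shape


def batch_distinct_users(events: list[Event], hour: int) -> int:
    """Batch: scan the full table once and count distinct users for one hour."""
    return len({user for h, user in events if h == hour})


def stream_distinct_users(events: Iterable[Event]) -> Iterator[tuple[int, int]]:
    """Stream: keep per-hour state and emit (hour, count so far) after every event."""
    seen: dict[int, set[str]] = defaultdict(set)   # exact sets, standing in for HLL
    for hour, user in events:
        seen[hour].add(user)
        yield hour, len(seen[hour])


if __name__ == "__main__":
    events = [(9, "u1"), (9, "u2"), (9, "u1"), (10, "u3")]
    print(batch_distinct_users(events, hour=9))
    print(list(stream_distinct_users(events)))
```

The batch function answers once, after all events for the hour exist; the stream function produces a fresh answer per event and has to hold the per-hour state itself.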
QUICK CHECK

A team wants to display a real-time dashboard showing the number of distinct active users in the current hour, updating within seconds of each new event. They are choosing between a nightly batch job and a long-lived stream processor. Which characteristic of stream processing makes it the appropriate choice here, and what is the key trade-off compared to batch?

3. What Mid-Senior SWEs Actually Need to Know

  • Latency targets drive the choice. If business SLA is "fresh within an hour," batch (or micro-batch) is simpler and cheaper. If it's "fresh within seconds," streaming is the only option.
  • Streaming is more expensive per byte. Long-lived workers consume RAM and CPU 24/7 even when traffic is low. A batch job for the same logic might run for 10 minutes a day.
  • State is the hard part of streaming. A batch job's state lives in the input table. A streaming job's state lives in its memory + checkpoints, and you are responsible for backups, restores, and migrations.
  • Backfills are the hard part of streaming, too. Re-running batch is trivial; re-running a streaming pipeline against historical events requires either replaying the source (a log with long retention, or an archive) or a separate batch pipeline that produces the same output schema, which leaves two implementations of the same logic to keep in agreement.
  • "Streaming" can mean micro-batch. Spark Structured Streaming runs mini-batches on a short trigger interval, typically every few seconds, so its latency floor is roughly hundreds of milliseconds to seconds. True per-record streaming gets to roughly tens of milliseconds.
  • Common misunderstanding: "We need real-time" → often the actual need is "we need fresh within a minute," which is achievable with a tight batch loop and much less operational complexity.
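Two of the points above reduce to simple arithmetic. The sketch below is a back-of-the-envelope model, not a benchmark: the single worker, the 30-second loop interval, and the 20-second run time are assumed numbers chosen for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def daily_worker_minutes(workers: int, minutes_running: float) -> float:
    """Compute time consumed per day, in worker-minutes."""
    return workers * minutes_running


def worst_case_staleness(interval_s: float, runtime_s: float) -> float:
    """An event that just misses a run waits one interval, then one full run."""
    return interval_s + runtime_s


# A streaming worker is up all day; the equivalent batch job runs 10 minutes.
streaming = daily_worker_minutes(workers=1, minutes_running=MINUTES_PER_DAY)
batch = daily_worker_minutes(workers=1, minutes_running=10)
cost_ratio = streaming / batch

# A tight batch loop: start a run every 30 s, each run takes 20 s (assumed).
staleness = worst_case_staleness(interval_s=30, runtime_s=20)

if __name__ == "__main__":
    print(f"streaming uses {cost_ratio:.0f}x the worker-minutes of batch")
    print(f"tight batch loop worst-case staleness: {staleness:.0f} s")
```

Under these assumptions the batch loop stays inside a "fresh within a minute" target, which is the case the last bullet describes.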
QUICK CHECK

Your team's dashboard needs to display sales totals that are 'fresh within a minute.' A colleague immediately proposes migrating to a streaming pipeline. What is the most important consideration before committing to that approach?

4. Tradeoffs & Decisions

| If you need... | Pick... | You'd choose the other when... |
|---|---|---|
| Hourly/daily dashboards, correct full re-computation | Batch | Latency requirement drops below ~1 minute |
| Sub-second alerts, live feature updates, derived event streams | Stream | Computation needs whole-dataset semantics (rank everyone by lifetime spend) |
| Complex joins across many large tables | Batch | The join can be expressed as stream-table enrichment with a slowly-changing dimension |
| Simple operational pipelines, small team | Batch | You already have Kafka and the latency win pays for the operational cost |
| Backfill + reprocessing of historical data | Batch | Source retains full history and you can replay through the stream job (kappa architecture) |

Key tradeoff: simplicity vs freshness. Batch is operationally simpler: failures restart cleanly, and state is implicit in the input. Streaming pushes complexity into the application (state, checkpoints, watermarks, exactly-once guarantees) in exchange for low-latency outputs.
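The "stream-table enrichment" case from the table can be sketched as a lookup against an in-memory dimension. The field names (`customer_id`, `tier`) and the dictionary standing in for the slowly-changing dimension table are hypothetical; a real job would refresh that table from a changelog or a periodic snapshot.

```python
from typing import Iterable, Iterator


def enrich(orders: Iterable[dict], customers: dict[str, dict]) -> Iterator[dict]:
    """Join each order event against the current dimension row for its customer."""
    for order in orders:
        # Look up at processing time: later events see later dimension updates.
        dim = customers.get(order["customer_id"], {})
        yield {**order, "tier": dim.get("tier", "unknown")}


if __name__ == "__main__":
    customers = {"c1": {"tier": "gold"}}
    orders = [
        {"order_id": 1, "customer_id": "c1"},
        {"order_id": 2, "customer_id": "c2"},
    ]
    for row in enrich(orders, customers):
        print(row)
```

This works as a streaming join because only the small side is held as state; a join between two large, fast-changing inputs needs the whole-dataset view that batch provides.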

QUICK CHECK

Your team needs to rank all customers by their lifetime total spend to power a weekly loyalty-tier report. The dataset spans years of transaction history across multiple large tables. Which processing approach is most appropriate, and why?

5. Interview & System Design Cheat Sheet

  • Batch processes bounded data; stream processes unbounded data. That single distinction drives every downstream design choice (state, failure model, backfill).
  • The default question is not "batch or stream" but "what is the freshness SLA?" — that determines the answer.
  • Stream jobs own their state and must be checkpointed; batch jobs derive state from the input table on every run, which is why batch is easier to operate.
  • Backfill is the hidden cost of streaming. A serious streaming system needs either long source retention or a parallel batch path.
  • True per-record streaming (~tens of ms latency) and micro-batch streaming (~seconds) are different operating points: per-record engines vs Spark Structured Streaming.

Common follow-ups:

  • "Can a stream job replace a batch job entirely?": Often, if you can replay the source and accept the operational cost. The kappa architecture is exactly this argument.
  • "Why is exactly-once hard in streaming but trivial in batch?": Batch's idempotency comes from atomic table writes. Streaming emits per-event side effects across system boundaries, so it needs coordination (transactions, two-phase commits).
  • "When do you use both?": Stream for fresh approximate results, batch for the daily authoritative re-compute.
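One common way to get an exactly-once effect on top of at-least-once delivery is to make the sink idempotent. The sketch below is a simplified illustration of that idea, not the transactional or two-phase-commit approach itself: `IdempotentSink` and the `(event_id, account, amount)` event shape are hypothetical, and a real system would have to commit the set of applied IDs and the totals in a single transaction.

```python
class IdempotentSink:
    """Applies each event at most once, keyed by a unique event ID."""

    def __init__(self) -> None:
        self.totals: dict[str, int] = {}
        self.applied: set[str] = set()

    def apply(self, event_id: str, account: str, amount: int) -> bool:
        """Return True if the event changed state, False if it was a duplicate."""
        if event_id in self.applied:
            return False                      # redelivered after a restart: skip
        self.totals[account] = self.totals.get(account, 0) + amount
        self.applied.add(event_id)            # must be durable together with totals
        return True


if __name__ == "__main__":
    sink = IdempotentSink()
    # "e2" arrives twice, as it would if the job restarted before committing its offset.
    for event in [("e1", "acct", 10), ("e2", "acct", 5), ("e2", "acct", 5)]:
        sink.apply(*event)
    print(sink.totals)
```

A batch job gets the same guarantee for free: a failed run publishes nothing, and a re-run overwrites the output as a whole.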

If asked to design X, anchor on this: Start by asking the freshness SLA. If it's hours, you design a batch pipeline and you're done. If it's seconds, you design a streaming pipeline and the next twelve decisions are about state, watermarks, exactly-once, and backfill.
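As a memory aid, the SLA-first rule can be written as a tiny decision function. The thresholds are assumptions read off this page (an hour or more favours batch, around a minute still fits a tight batch loop or micro-batch, seconds force streaming); treat them as rough guides, not hard limits.

```python
def suggest_pipeline(freshness_sla_seconds: float) -> str:
    """Map a freshness SLA to a starting-point architecture (rule of thumb only)."""
    if freshness_sla_seconds >= 3600:
        return "batch"
    if freshness_sla_seconds >= 60:
        return "tight batch loop or micro-batch"
    return "streaming"


if __name__ == "__main__":
    for sla in (24 * 3600, 60, 5):
        print(f"{sla:>6} s -> {suggest_pipeline(sla)}")
```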

QUICK CHECK

Your team is deciding whether to build a batch pipeline or a streaming pipeline for a new data feature. Which question should drive that architectural decision first?
