Kafka Streams vs Flink vs Spark

1. What Is It?

Three stream-processing technologies dominate JVM-centric production usage: Kafka Streams (a library), Apache Flink (a true streaming engine), and Spark Structured Streaming (a micro-batch engine built on Spark). All three can read from Kafka, transform records, manage state, and write results to Kafka or external sinks. They look interchangeable in marketing diagrams. They are not interchangeable in production.

The problem this comparison solves: every team eventually has to choose one for a stateful streaming workload (joins, aggregations, pipelines, real-time analytics), and the wrong pick costs months of operational pain: Kafka Streams scaled to multi-cluster topologies it wasn't designed for, Flink dragged into a use case where a library would have sufficed, or Spark Structured Streaming used for sub-second SLAs it can't hit. The three tools occupy distinct niches that map cleanly to deployment model, execution model, and state model. Once you see those three axes, the choice becomes mechanical.

QUICK CHECK

A backend team needs sub-second latency for a real-time fraud detection pipeline that reads from Kafka, maintains stateful aggregations, and must produce alerts within 200ms. Which architectural characteristic of Spark Structured Streaming makes it a poor fit for this requirement?

2. How It Works

Kafka Streams

A JVM library you embed in your own application. There is no separate cluster: your service's main() builds a topology and starts a KafkaStreams instance, and each instance reads from Kafka, processes records, maintains local state (RocksDB on disk), and commits offsets. Scaling = run more instances of your service; partitions are divided among instances exactly the way consumer groups work (because under the hood, it is a consumer group). State is partitioned by key, co-located with the consumer of that partition, and backed up to an internal changelog topic so a crashed instance's replacement can rebuild state on another node.

  • Source/sink: Kafka only (with Kafka Connect bridging external systems).
  • Execution: continuous, record-at-a-time, in your own process.
  • Deployment: whatever runs your JVM service (Kubernetes pod, EC2, on-prem).
  • State backend: local RocksDB + Kafka changelog topic for recovery.
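
The scaling rule above can be sketched in plain Java. This is a simplified round-robin model for illustration only, not the actual Kafka Streams assignor, and the pod names are made up:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PartitionAssignmentSketch {

    // Spreads partition ids 0..partitionCount-1 over the instances round-robin.
    // Simplification: the real assignor also weighs state locality and standby tasks.
    static Map<String, List<Integer>> assign(int partitionCount, List<String> instances) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String instance : instances) {
            assignment.put(instance, new ArrayList<>());
        }
        for (int partition = 0; partition < partitionCount; partition++) {
            String owner = instances.get(partition % instances.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Six partitions, two pods: three partitions each.
        System.out.println(assign(6, List.of("pod-a", "pod-b")));
        // Add a third pod: two partitions each.
        System.out.println(assign(6, List.of("pod-a", "pod-b", "pod-c")));
        // Two partitions, three pods: the third pod gets nothing to do.
        System.out.println(assign(2, List.of("pod-a", "pod-b", "pod-c")));
    }
}
```

The last case is the one to remember: the input topic's partition count caps parallelism, so adding instances beyond it adds idle pods rather than throughput.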

Flink

A dedicated distributed processing engine. You write a job (DataStream API or SQL), submit it to a cluster (JobManager + TaskManagers), and the cluster runs it as a long-lived, continuously executing graph. Each operator can be parallelized independently. State lives in operator-local state backends (HashMap on the JVM heap by default, or RocksDB on disk for large state, with periodic checkpoints to durable storage like S3/HDFS). Multiple sources (Kafka, Kinesis, files, JDBC, CDC, custom) are first-class.

  • Source/sink: Kafka, Kinesis, Pulsar, files, JDBC, CDC connectors, custom.
  • Execution: true streaming — every event flows through immediately, with watermarks driving event-time logic.
  • Deployment: standalone cluster, YARN, Kubernetes, or managed (Ververica, AWS MSF, Confluent).
  • State backend: RocksDB or in-memory + snapshots to durable storage.
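
How watermarks drive event-time logic is easier to see in a toy model. The sketch below is plain Java rather than Flink API code: the class name, the 10-second window, and the 2-second out-of-orderness bound are all assumptions chosen for illustration. It counts events per tumbling window and fires a window once the watermark (highest timestamp seen minus the bound) passes the window's end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WatermarkWindowSketch {
    private final long windowSizeMs;
    private final long maxOutOfOrdernessMs;
    // windowStart -> event count, for windows that have not fired yet
    private final TreeMap<Long, Long> openWindows = new TreeMap<>();
    private long maxTimestampSeen = Long.MIN_VALUE;

    WatermarkWindowSketch(long windowSizeMs, long maxOutOfOrdernessMs) {
        this.windowSizeMs = windowSizeMs;
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    // Watermark: no event with an older timestamp is expected any more.
    long watermark() {
        return maxTimestampSeen == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxTimestampSeen - maxOutOfOrdernessMs;
    }

    // Handles one event (non-negative epoch millis); returns the windows it caused to fire.
    List<String> onEvent(long eventTimeMs) {
        List<String> fired = new ArrayList<>();
        long windowStart = eventTimeMs - (eventTimeMs % windowSizeMs);
        if (windowStart + windowSizeMs <= watermark()) {
            return fired; // late event: its window has already closed, so it is dropped
        }
        openWindows.merge(windowStart, 1L, Long::sum);
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimeMs);
        // Fire every open window whose end is at or before the new watermark.
        while (!openWindows.isEmpty() && openWindows.firstKey() + windowSizeMs <= watermark()) {
            Map.Entry<Long, Long> w = openWindows.pollFirstEntry();
            fired.add("[" + w.getKey() + "," + (w.getKey() + windowSizeMs) + ") count=" + w.getValue());
        }
        return fired;
    }

    public static void main(String[] args) {
        WatermarkWindowSketch op = new WatermarkWindowSketch(10_000, 2_000);
        for (long t : new long[] {1_000, 9_000, 11_000, 8_000, 12_500, 9_500}) {
            System.out.println("event@" + t + " -> fired " + op.onEvent(t));
        }
    }
}
```

In this run the out-of-order event at 8,000 still lands in the first window because the watermark has only reached 9,000. The event at 12,500 pushes the watermark to 10,500, which fires the [0, 10,000) window with a count of 3, and the straggler at 9,500 arrives too late and is dropped.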

Spark Structured Streaming

A micro-batch engine (it technically supports a "continuous" mode, but that is not the production-default path). The same Spark engine that runs batch ETL applies streaming semantics by slicing the stream into small batches (trigger-driven, often hundreds of milliseconds to seconds apart). Each micro-batch is processed as a Spark job. State and watermarks are layered on top.

  • Source/sink: Kafka, files, Delta Lake, Iceberg, JDBC, custom — extremely broad.
  • Execution: micro-batch — events accumulate until trigger fires, then a batch executes.
  • Deployment: Spark cluster (YARN, Kubernetes, Databricks, EMR).
  • State backend: HDFS-compatible filesystem (S3, HDFS) for state checkpoints.
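
The latency consequence of "events accumulate until trigger fires" is simple arithmetic. The sketch below is a simplified model, not Spark code: it assumes triggers fire at fixed multiples of the interval and that every batch takes the same time to run. The 2-second trigger and 300 ms batch duration are assumed values.

```java
class MicroBatchLatencySketch {

    // Time from an event's arrival until the micro-batch containing it has finished.
    // Simplified model: fixed trigger schedule, constant batch duration.
    static long resultLatencyMs(long arrivalMs, long triggerIntervalMs, long batchDurationMs) {
        long intervalsElapsed = (arrivalMs + triggerIntervalMs - 1) / triggerIntervalMs; // ceiling division
        long nextTriggerMs = intervalsElapsed * triggerIntervalMs;
        return (nextTriggerMs - arrivalMs) + batchDurationMs;
    }

    public static void main(String[] args) {
        long triggerMs = 2_000; // assumed trigger interval
        long batchMs = 300;     // assumed time to process one batch
        for (long arrival : new long[] {100, 1_999, 2_001}) {
            System.out.println("arrival=" + arrival + "ms latency="
                    + resultLatencyMs(arrival, triggerMs, batchMs) + "ms");
        }
    }
}
```

Under these assumptions an event that arrives just after a trigger waits almost the full interval (2,299 ms for the arrival at 2,001 ms), and even the luckiest event pays the batch duration (301 ms for the arrival at 1,999 ms). Neither fits a 200 ms budget, which is the point of the latency floor.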

Concrete example. A fraud-detection pipeline that scores transactions against rolling 5-minute customer-spend windows:

  • Kafka Streams: Embedded in the existing fraud-service Java app. A new version of the service ships a Kafka Streams topology. No new infra. Scales by running more pods of the same service.
  • Flink: A standalone job submitted to a Flink cluster. Code lives in its own repo. Separate operational surface, but easy to parallelize independently from the consuming service.
  • Spark Structured Streaming: Job runs on the existing Databricks platform. Micro-batch trigger every 2 seconds. Same engineers who own batch ETL own this job; same monitoring; same notebooks.

The same logic works in all three. The differences show up in operational footprint, latency floor, and how the rest of the engineering org interacts with it.

QUICK CHECK

Your team runs a Java-based payment service deployed as Kubernetes pods and wants to add real-time fraud detection using rolling time windows. A colleague suggests using Kafka Streams. Which of the following accurately describes how scaling this solution would work?

3. What Mid-Senior SWEs Actually Need to Know

  • Latency floor:

    • Kafka Streams: low single-digit milliseconds end-to-end achievable.
    • Flink: low single-digit milliseconds end-to-end achievable.
    • Spark Structured Streaming: roughly trigger-interval latency — hundreds of ms minimum, often seconds in practice. Not the right tool for sub-second SLAs.
  • State scale:

    • Kafka Streams: limited by local disk on each instance; works well into GB-per-instance, painful at TB-per-instance because rebuilding state from the changelog can take a long time on instance churn.
    • Flink: scales to TB+ per cluster, with incremental checkpoints making large state practical to recover.
    • Spark Structured Streaming: scales similarly to Flink for state, but the higher latency floor means it's used for different workloads.
  • Source diversity:

    • Kafka Streams: Kafka only. Need a non-Kafka source? You're using Kafka Connect to bring it in first, or you're using a different tool.
    • Flink: extremely broad: Kafka, Kinesis, Pulsar, JDBC, CDC (Debezium), files, S3, custom. Best choice when you have heterogeneous sources.
    • Spark Structured Streaming: very broad source ecosystem, especially strong on lakehouse formats (Delta, Iceberg).
  • Operational surface:

    • Kafka Streams: zero net-new infra — it runs inside your existing service. Best when an application team owns one logical pipeline.
    • Flink: a dedicated cluster (or managed service) — meaningful ops work, especially for self-hosted. Best when a platform team owns shared streaming infra serving many tenants.
    • Spark Structured Streaming: usually piggybacks on an existing Spark/Databricks platform — operational cost is "incremental" because the platform is already there.
  • Deployment model maps to team structure:

    • One app team, one pipeline, tightly coupled to existing JVM service → Kafka Streams.
    • Multiple jobs, shared infra, heterogeneous sources, central platform team → Flink.
    • Org already standardized on Spark/Databricks, batch and streaming live in the same notebooks → Spark Structured Streaming.
  • Common misunderstanding: "Flink is always better than Kafka Streams." Flink is more powerful and more flexible. It is also more operationally expensive. If your job is a single topology consuming and producing within one Kafka cluster and the state fits on local disk, Kafka Streams is simpler and just as capable.

  • Common misunderstanding: "Spark Structured Streaming is streaming." Strictly, it's micro-batch with streaming semantics. For most analytics-style use cases (1-minute aggregates feeding dashboards) it's a great fit. For sub-second SLAs (fraud, alerting, user-facing), it isn't.

  • Common misunderstanding: "Kafka Streams can do exactly-once." Yes, but only for Kafka-to-Kafka pipelines, via Kafka transactions. Once you write to an external system (DB, API), exactly-once is on you (idempotency at the sink). The same caveat applies to Flink's EXACTLY_ONCE checkpointing: exactly-once is a contract between the engine and the sink connector, and external sinks must support transactional or idempotent writes for it to hold.

  • Code shape:

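    // Kafka Streams: embedded in a service's main().
    // Assumes `props` sets application.id, bootstrap.servers, and default String serdes.
    StreamsBuilder builder = new StreamsBuilder();
    builder.stream("transactions")
           .groupByKey()
           // TimeWindows.of(...) is deprecated since Kafka 3.0
           .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
           .count()
           .toStream()
           // Flatten the windowed key so plain String/Long serdes can write the result
           .map((windowedKey, count) -> KeyValue.pair(
                   windowedKey.key() + "@" + windowedKey.window().start(), count))
           .to("transaction-counts", Produced.with(Serdes.String(), Serdes.Long()));
    new KafkaStreams(builder.build(), props).start();

    // Flink DataStream: submitted to a Flink cluster.
    // kafkaSource, kafkaSink, Transaction, and SumAggregator are user-defined (not shown).
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.fromSource(kafkaSource, WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(5)), "txns")
       .keyBy(Transaction::getCustomerId)
       // Newer Flink releases take Duration.ofMinutes(5) here instead of Time.minutes(5)
       .window(TumblingEventTimeWindows.of(Time.minutes(5)))
       .aggregate(new SumAggregator())
       .sinkTo(kafkaSink);
    env.execute("transaction-counts");

    -- Spark Structured Streaming (SQL): runs on a Spark cluster.
    -- Assumes `transactions` is a streaming table or view with an event_time column.
    SELECT customer_id,
           window(event_time, '5 minutes') AS w,
           count(*) AS cnt
    FROM transactions
    GROUP BY customer_id, window(event_time, '5 minutes')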

    All three express the same logic. The differences are operational, not expressive.
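
The state-scale point in this section ("painful at TB-per-instance") comes down to arithmetic: restore time is roughly state size divided by sustained restore throughput. A minimal sketch, assuming a restore rate of 100 MiB/s per instance (an illustrative figure, not a measurement; real rates depend on brokers, network, and disk):

```java
import java.util.Locale;

class StateRestoreEstimate {

    // Hours needed to replay stateGiB of changelog at a sustained restoreMiBPerSec.
    static double restoreHours(double stateGiB, double restoreMiBPerSec) {
        double seconds = (stateGiB * 1024.0) / restoreMiBPerSec;
        return seconds / 3600.0;
    }

    public static void main(String[] args) {
        double throughput = 100.0; // assumed MiB/s per instance, not a measured figure
        for (double gib : new double[] {50, 500, 2048}) {
            System.out.println(String.format(Locale.ROOT, "%.0f GiB -> %.1f h",
                    gib, restoreHours(gib, throughput)));
        }
    }
}
```

At the assumed rate, 50 GiB of state rebuilds in under ten minutes, while 2 TiB takes close to six hours. That gap is why instance churn is tolerable at GB scale and painful at TB scale.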

QUICK CHECK

Your team runs a fraud detection service that must flag suspicious transactions with sub-second latency. The service is a single JVM application that already consumes from and produces to Kafka, and its state fits comfortably on local disk. Which streaming framework is the best fit, and why?

4. Tradeoffs & Decisions

If you need... | Pick... | Why
A single stateful pipeline owned by one app team, Kafka-to-Kafka, state fits on a node | Kafka Streams | No new infra; lives in the service that already owns the workload
Shared streaming platform serving many jobs, heterogeneous sources, large state, sub-second latency | Flink | True streaming engine, broad connectors, mature checkpointing
Streaming that lives next to batch ETL in a Spark/Databricks org | Spark Structured Streaming | Reuses platform; analytics-grade latency; same tooling as batch
Sub-second end-to-end SLA | Kafka Streams or Flink | Spark's micro-batch floor is too high
State in the TB range with fast recovery | Flink | Incremental RocksDB checkpoints; battle-tested at scale
CDC ingest from databases | Flink (with Debezium) or Spark (with Delta/CDC sources) | Kafka Streams can't read directly from DBs; needs Kafka Connect upstream
Minimum operational footprint | Kafka Streams | No cluster to run; just your JVM service
Multi-source streaming with non-Kafka inputs | Flink | Kafka Streams is Kafka-only by design

Key tradeoff: operational footprint vs flexibility. Kafka Streams asks for the least new operational surface but locks you to Kafka and to state-fits-on-one-node. Flink gives the most flexibility (any source, large state, complex topologies) at the cost of running and tuning a real cluster. Spark Structured Streaming gives broad ecosystem fit at the cost of a higher latency floor.

Secondary tradeoff: team shape vs tool shape. Kafka Streams fits when one app team owns one pipeline. Flink fits when a platform team runs streaming as shared infra. Spark Structured Streaming fits when a data platform team already owns Spark/Databricks. Mismatched team-to-tool shapes are a real source of operational pain.

QUICK CHECK

A platform team at a large company needs to support dozens of streaming jobs from multiple teams, reading from a mix of Kafka topics and relational databases, with state that can grow into the terabyte range. Which streaming framework is the best fit, and what is the primary reason?

5. Interview & System Design Cheat Sheet

  • Kafka Streams = library, Flink = engine, Spark Structured Streaming = micro-batch engine. That single difference drives most of the rest.
  • Latency floor: Kafka Streams and Flink reach low-ms; Spark Structured Streaming's micro-batch model floors at the trigger interval, typically not sub-second.
  • State scale: Kafka Streams up to local-disk-per-instance; Flink to TB with incremental checkpoints; Spark similar at the cost of latency.
  • Source diversity: Kafka Streams is Kafka-only; Flink and Spark are broad.
  • Pick by team and topology, not by feature checklist. All three can express the same logic — the differences are operational, organizational, and latency-shaped.

Common follow-ups:

  • "When would you pick Kafka Streams over Flink?" When the workload is a single Kafka-to-Kafka pipeline owned by one app team, state is bounded by local disk, and the team prefers running its own JVM service over operating a Flink cluster. The simplicity is the win.
  • "Can Spark Structured Streaming do sub-second latency?" Not reliably. Its micro-batch model means the trigger interval is the floor, typically hundreds of ms to seconds in practice. There is a "continuous" mode, but it has limitations and is rarely used in production. For sub-second SLAs, choose Flink or Kafka Streams.
  • "Does Flink's exactly-once apply to my Postgres sink?" Only if the sink connector supports transactional commits coordinated with Flink's checkpoints, or if you make writes idempotent at the sink. Exactly-once is a property of the whole pipeline, not just the engine.
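
What "idempotent at the sink" means can be shown without any engine at all. The sketch below is plain Java with in-memory maps standing in for a database table; the WindowCount shape and the key format are assumptions for illustration. It replays the same two results twice, as would happen when a job restarts from its last checkpoint, and compares a sink that adds each delivery with one that upserts by primary key.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class IdempotentSinkSketch {

    // One windowed result as an engine might deliver it to the sink (hypothetical shape).
    static final class WindowCount {
        final String customerId;
        final long windowStartMs;
        final long count;

        WindowCount(String customerId, long windowStartMs, long count) {
            this.customerId = customerId;
            this.windowStartMs = windowStartMs;
            this.count = count;
        }

        String primaryKey() {
            return customerId + "@" + windowStartMs;
        }
    }

    // Non-idempotent: every delivery increments the stored total, so replays inflate it.
    static Map<String, Long> appendingSink(List<WindowCount> deliveries) {
        Map<String, Long> table = new TreeMap<>();
        for (WindowCount d : deliveries) {
            table.merge(d.primaryKey(), d.count, Long::sum);
        }
        return table;
    }

    // Idempotent: upsert by primary key, so a redelivery overwrites with the same value.
    static Map<String, Long> upsertSink(List<WindowCount> deliveries) {
        Map<String, Long> table = new TreeMap<>();
        for (WindowCount d : deliveries) {
            table.put(d.primaryKey(), d.count);
        }
        return table;
    }

    public static void main(String[] args) {
        WindowCount a = new WindowCount("c1", 0L, 3L);
        WindowCount b = new WindowCount("c2", 0L, 1L);
        // The second pair simulates a replay after recovery from a checkpoint.
        List<WindowCount> deliveries = List.of(a, b, a, b);
        System.out.println("appending: " + appendingSink(deliveries));
        System.out.println("upsert:    " + upsertSink(deliveries));
    }
}
```

The appending sink ends with doubled counts; the upsert sink ends with the correct ones. In Postgres the same idea is typically an INSERT ... ON CONFLICT (customer_id, window_start) DO UPDATE statement, where the table and column names here are hypothetical.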

If asked to design X, anchor on this: Identify three things: (1) what sources and sinks the pipeline talks to, (2) how much state it must hold and how fast it must recover, (3) what latency the business actually needs. Map those onto Kafka Streams (Kafka-only, modest state, low ms), Flink (any source, large state, low ms), or Spark Structured Streaming (broad ecosystem, large state, seconds-scale latency). The choice usually picks itself.
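
The mapping above can be written down as a small decision function. This is one possible encoding, not a definitive rule: the parameter names are made up, the "existing Spark platform" input comes from the tradeoff table rather than the three questions, and real decisions weigh more than four booleans.

```java
class EnginePicker {

    enum Engine { KAFKA_STREAMS, FLINK, SPARK_STRUCTURED_STREAMING }

    // Hypothetical encoding of the selection heuristic; the ordering of checks is an assumption.
    static Engine pick(boolean kafkaOnlySourcesAndSinks,
                       boolean stateFitsOnLocalDisk,
                       boolean needsSubSecondLatency,
                       boolean existingSparkPlatform) {
        boolean libraryIsEnough = kafkaOnlySourcesAndSinks && stateFitsOnLocalDisk;
        if (needsSubSecondLatency) {
            // Micro-batch is ruled out; choose between the two continuous engines.
            return libraryIsEnough ? Engine.KAFKA_STREAMS : Engine.FLINK;
        }
        if (existingSparkPlatform) {
            return Engine.SPARK_STRUCTURED_STREAMING;
        }
        return libraryIsEnough ? Engine.KAFKA_STREAMS : Engine.FLINK;
    }

    public static void main(String[] args) {
        // Kafka-to-Kafka, modest state, sub-second SLA, no Spark platform.
        System.out.println(pick(true, true, true, false));
        // Mixed sources, large state, sub-second SLA.
        System.out.println(pick(false, false, true, true));
        // Analytics-grade latency in an org that already runs Spark.
        System.out.println(pick(false, true, false, true));
    }
}
```

Latency is checked first because it is the only input that eliminates an option outright; the remaining checks express preference rather than feasibility.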

QUICK CHECK

Your team needs to build a real-time pipeline with a strict sub-second latency SLA. The pipeline reads exclusively from Kafka topics, maintains a modest amount of state that fits on local disk, and will be owned end-to-end by a single application team that wants to avoid running a separate cluster. Which streaming approach best fits these requirements, and why?
