Stream Processor Roles
8 min read
Streaming Systems Index
Streaming Systems Index
Stream Processor Roles
1. What Is It?
A stream processor is a long-running job that reads one or more streams of events, transforms them, and writes results to other streams or systems — continuously, with no defined "end." That's the literal definition. The more useful framing is by role: in a real production architecture, stream processors do not all do the same kind of work. There are a handful of distinct roles they play, and the skill of designing streaming systems is knowing which role you actually need before reaching for a tool.
The problem it solves: once you have events flowing through (or any log), you need something to do work on them. That work falls into recognizable patterns — filtering, enriching, aggregating, joining, routing, materializing — and conflating these patterns leads to overbuilt systems. Most teams discover too late that 80% of their "streaming jobs" are stateless filters dressed up as topologies, while the genuinely hard ones (windowed joins, sessionization) get treated as if they're the same shape. Distinguishing the roles up front tells you which problems you actually have, which tool fits, and where the operational risk lives.
A backend team has a Kafka pipeline where 80% of their streaming jobs simply drop events that don't match a set of field conditions, forwarding the rest downstream unchanged. They've implemented all of these as full Flink topologies. What is the core design mistake they've made?
2. How It Works
A stream processor sits between source streams and downstream sinks. It's a process (or fleet of processes) that maintains a position in each source stream (offsets, for ), holds optional state (in memory + checkpointed to durable storage), and emits output. Beyond that shared skeleton, here are the six common roles:
Role 1: Stateless transformer / filter / router
Read a record, decide something based only on that record, emit zero-or-more records. No state. No memory of past events.
- Examples: drop events missing required fields; route by event type to A or B; redact PII before downstream consumers see it.
Role 2: Enricher (lookup join against a slowly-changing reference)
Read a stream event, look up associated context (user profile, product info, geo) from a side table, attach it, emit the enriched event.
- Examples: attach the customer tier to every order; attach the device type to every click.
- State requirement: cache of the reference data (small to medium, slowly-changing).
Role 3: Aggregator (windowed counts, sums, top-K)
Read a stream, group by some key over a window (1 minute, 1 hour, session), emit aggregated results.
- Examples: requests-per-second per endpoint; revenue per merchant per hour; top 10 trending hashtags.
- State requirement: per-key, per-window accumulator. Watermarks matter (see Tier 2).
Role 4: Joiner (stream–stream or stream–table)
Combine two streams on a key, within a time bound (stream-stream) or against a versioned table (stream-table).
- Examples: match
clickwithimpressionwithin 5 minutes; correlateorder_placedwithpayment_received. - State requirement: buffered events from both sides within the join window. This is the most expensive state.
Role 5: Materializer (build a queryable view from a stream)
Consume a change stream and maintain a derived store — a key/value lookup, a search index, an OLAP cube — that other services query.
- Examples: project the
account_updatedtopic into a Cassandra row; project orders into a per-customer "active orders" view. - State requirement: the derived view itself; usually backed by or the downstream store.
Role 6: Reactor / side-effect emitter
Consume events, decide whether to call out to an external system (send email, charge a card, trigger an alert), emit nothing (or an outcome event).
- Examples: fraud-detection rule fires → freeze account API; threshold crossed → page on-call.
- State requirement: usually small (rate limiters, dedup caches); the hard problem is idempotency at the side-effect target.
Concrete example. An e-commerce platform's streaming layer has, in production, all six roles:
- A filter that drops bot traffic from
clicksbefore it reaches anything else. - An enricher that joins
clicksagainst auserstable to attach loyalty tier. - An aggregator computing per-product views per minute for the trending widget.
- A joiner matching
add_to_cartevents withpurchaseevents within 24h to compute conversion. - A materializer projecting
order_status_changedinto a per-order key/value store the frontend queries. - A reactor that watches
payment_failedand calls the retry-payment API.
These are six different jobs with six different state, latency, and correctness profiles. Treating them as "the streaming app" is the mistake — they should be (and almost always end up as) separate jobs.
Your team needs to watch a payment_failed stream and automatically call an external retry-payment API whenever a failure event arrives. Occasionally, the stream processor crashes and replays events from its last checkpoint, meaning the same payment_failed event can be delivered more than once. Which type of stream processor role fits this use case, and what is the primary correctness challenge it introduces?
3. What Mid-Senior SWEs Actually Need to Know
- Most "streaming jobs" in the wild are Role 1 or Role 2 (stateless or simple lookup). For these, a plain in Java/Go/Python is enough — you do not need . Reaching for for a stateless filter is the most common over-engineering pattern in streaming.
- Role 3 and Role 4 are where stream processors earn their keep. , watermarks, , and state management make these jobs genuinely hard to write correctly by hand. This is where Flink, , or Spark Structured Streaming pay off.
- Role 5 (materialization) is often confused with caching. A cache is best-effort, evictable, and can be lossy. A materialized view from a stream is the authoritative answer for that query — it must be correct, must replay-able, and must not silently fall behind.
- Role 6 (reactor) is where matters most and is hardest. Side effects to external systems can't be transactional with . The only correct approach is idempotent external calls (deduplication by event ID at the receiver) and treating duplicate fires as expected.
- State size dictates tool choice. No state → any works. Small state (MBs per key, thousands of keys) → in-memory state in or Flink is fine. Large state (GBs per , millions of keys) → Flink with or external store.
- Latency requirement dictates batch vs streaming. Sub-second latency requires a real streaming engine. Minute-level latency may be cheaper served by a frequent batch job (often called "micro-batch") that runs every few minutes.
- Common misunderstanding: "We need Flink because we're streaming." No — you need Flink (or Kafka Streams) when you have stateful streaming (Roles 3, 4, partly 5) and you can't tolerate the operational pain of building correct /state/replay yourself. For Roles 1, 2, and 6, a plain consumer is simpler, cheaper, and easier to run.
- Common misunderstanding: "A single Flink job will do everything." Don't pile six logical jobs into one topology — you make every deploy a high-blast-radius change, and you can't scale or tune each role independently. One role per job is the default.
Your team is building a streaming pipeline that filters out bot traffic from a clickstream before writing valid events to a downstream topic. The job holds no state, performs no windowing, and has no aggregation logic. A junior engineer suggests adopting Apache Flink to handle this workload. What is the most accurate assessment of this proposal?
4. Tradeoffs & Decisions
| Role | When to use a real stream processor (Flink/KStreams) vs plain consumer |
|---|---|
| Role 1: Stateless transform / filter | Plain consumer almost always. Use a stream processor only if it lives in a larger topology that already needs one. |
| Role 2: Enricher (lookup) | Plain consumer + cached lookup is usually fine. Use stream processor if the reference table is itself a stream (CDC) and you need a versioned join. |
| Role 3: Windowed aggregator | Stream processor wins. Hand-rolling windowing, late data handling, and exactly-once aggregation is a year-long project to get right. |
| Role 4: Joiner | Stream processor wins, particularly for stream-stream joins. Hand-rolling buffered windowed joins is even harder than aggregation. |
| Role 5: Materializer | Depends. Materializing into the same engine's state (e.g., Kafka Streams KTable, Flink Table) → stream processor. Materializing into an external DB (Cassandra, Elasticsearch) → often a plain consumer with idempotent upserts is simpler. |
| Role 6: Reactor | Plain consumer + idempotency. Stream processors add little here and complicate side-effect semantics. |
Key tradeoff: one big job vs many small jobs. One big topology shares state and avoids inter-job topics, but every change is high-risk and you can't scale roles independently. Many small jobs are independently deployable and tunable, at the cost of more inter-stage topics and slightly more end-to-end latency. Default to many small jobs; merge only when the latency or coordination cost forces it.
Secondary tradeoff: latency vs cost. Sub-second processing requires keeping the processor hot and over-provisioned. Minute-level can often be served by micro-batch (run every minute, process the gap). Be honest about which one the business actually needs — most "real-time" requirements are "within a minute" requirements.
Your team needs to implement a windowed aggregator that counts API errors per service per 5-minute window, with support for late-arriving events. A colleague suggests hand-rolling this logic using a plain Kafka consumer with custom state stored in Redis. What is the strongest argument against this approach?
5. Interview & System Design Cheat Sheet
- Stream processors fall into six roles: stateless filter/route, enricher, windowed aggregator, joiner, materializer, reactor. The hard ones are aggregators and joiners.
- Stateless work doesn't need — a plain is the right tool for filters, routers, and most enrichers.
- Stateful work (, joins) is where engines like and pay off — building correct , late-data handling, and aggregation by hand is a multi-quarter project.
- One role per job is the default; merging roles into mega-topologies trades testability and independent scaling for marginal latency wins.
- Side effects to external systems are not transactional with — Role 6 (reactor) jobs must rely on idempotency at the target, not on the stream processor's guarantees.
Common follow-ups:
- "When would you not use Flink?" — Stateless or small-state work where the operational cost of running a Flink cluster outweighs the win. Plain Kafka consumers in your existing service runtime are often simpler and cheaper.
- "How do you decide between and Flink?" — Kafka Streams is a library you embed in a JVM service — best when the job is closely tied to one Kafka cluster and your team is already running Java services. Flink is a cluster — best for large state, complex topologies, multiple sources, or non-Kafka inputs.
- "What if the job is a mix of roles?" — Split it. One role per job is the default, with intermediate topics between them. The downside is more topics; the upside is independent deploy, scale, and debug.
If asked to design X, anchor on this: Before picking a tool, name the role each component plays — filter, enricher, aggregator, joiner, materializer, reactor. Stateless roles get plain consumers; stateful roles get a real stream processor; side-effecting roles get idempotent consumers. That mapping decides 80% of the architecture.
Your team needs to build a pipeline that reads user click events from Kafka, drops any events missing a required field, and forwards valid events to a downstream topic. No state needs to be maintained between events. Which tooling approach is most appropriate for this job?
Glossary History
Click dotted jargon to save explanations here.
Glossary History
Click dotted jargon to save explanations here.