Consumer Groups
6 min read
Streaming Systems Index
Tier 1 -- Foundations
Event-Driven Foundations
Kafka Mental Model
Stream Processing Landscape
Tier 2 -- Core Concepts
Tier 3 -- Production & System Design
Streaming Systems Index
Tier 1 -- Foundations
Event-Driven Foundations
Kafka Mental Model
Stream Processing Landscape
Tier 2 -- Core Concepts
Tier 3 -- Production & System Design
Consumer Groups
1. What Is It?
A is a set of consumers that cooperate to read from the same (s), where each is read by exactly one in the group at a time. Different groups read the same independently — each maintains its own position. The is how delivers both queue semantics (within a group) and semantics (across groups) over the same log primitive.
The problem it solves: a single consumer process cannot keep up with a high-throughput topic. You need to fan out the work across multiple processes while preserving per- ordering and exactly-one-owner semantics. And independently, you want multiple unrelated systems (analytics, billing, fraud) to each read every event without coordinating. Consumer groups handle both with one mechanism.
Your team runs three separate services — analytics, billing, and fraud detection — all of which need to process every event from a high-throughput Kafka topic. You also want each service to scale internally by running multiple instances. Which setup correctly achieves both goals?
2. How It Works
- Each specifies
group.id(e.g.,fraud-scoring). All consumers sharing that ID form one group. - One is selected as the group coordinator for the group, responsible for membership and commits.
- When members join or leave, the coordinator triggers a rebalance — the assignment is recomputed and distributed to members. Different assignors (range, round-robin, cooperative sticky — modern default) choose how.
- Each member then independently reads its assigned partitions, processes records, and commits offsets to a special internal
__consumer_offsets(keyed bygroup.id + +). - If a member fails its heartbeat (
session.timeout.ms) or doesn't poll withinmax.poll.interval.ms, the coordinator kicks it out and rebalances.
Concrete example. Topic orders with 6 partitions, two groups:
fraud-scoringwith 3 instances → each gets 2 partitions. If a 4th instance joins, assignment becomes 6/4 ≈ 1–2 per consumer. If a 7th joins, one sits idle.analyticswith 1 instance → reads all 6 partitions independently. Its progress has nothing to do withfraud-scoring's offset progress.
The two groups don't know about each other. Adding a new group email-receipts later doesn't affect the existing two; it just starts reading from auto.offset.reset (earliest or latest, depending on config).
A Kafka topic has 6 partitions. A consumer group named order-processing starts with 3 consumers, each handling 2 partitions. A fourth consumer joins the group. Shortly after, a seventh consumer is added. What happens to partition assignments at each stage?
3. What Mid-Senior SWEs Actually Need to Know
- count is the ceiling on group concurrency. A 6- supports at most 6 active consumers per group. Adding a 7th gives you idle capacity, not more throughput. Plan partition count for peak parallelism.
- Group ID identifies progress. Two consumers with the same
group.idshare work and share offsets. Two with differentgroup.ids each read everything. Never accidentally reuse agroup.idbetween unrelated apps — they'll fight over partitions. - Rebalances pause everyone in the group. During a rebalance, all members stop processing while the coordinator reassigns. With the older eager assignment, every member gives up all partitions, then re-acquires. With cooperative sticky (modern default), only moved partitions are revoked — much smoother.
- Rebalances are triggered by: members joining or leaving, partition count changing, subscriptions changing, member failing heartbeat or
max.poll.interval.ms. - Slow processing → rebalance storm. The most common production pathology: one takes too long in
processBatch(), missesmax.poll.interval.ms, gets kicked, group rebalances, lag spikes, work isn't drained, the next consumer also misses the deadline. Fix: speed up processing, lowermax.poll.records, or move slow work off the poll thread. - commits are per-group, per-partition. They live in
__consumer_offsets. Deleting a group ID (or changing it) loses progress — the new group starts fromauto..reset. auto..resetonly applies when there's no committed offset. Once you've committed even one offset, future restarts resume from there, ignoringauto..reset.- Static membership (
group.instance.id) avoids rebalances on rolling restart. A consumer that disappears briefly (e.g., during a deploy) keeps its assignment untilsession.timeout.ms* a long timeout, so a fast restart resumes the same partitions. Worth using for stateful consumers. - Common misunderstanding: "I'll just add more consumers to scale." Works until you hit the partition count. Real scaling = more partitions and more consumers, and partition count is hard to change later.
Useful CLI:
-consumer-groups.sh --describe --group fraud-scoring— shows current offset, log end, lag, and which consumer owns which partition.-consumer-groups.sh --reset-offsets --group X --to-earliest --topic Y --execute— operational tool for replays (use carefully — only when group has no active members).
A backend service consumes from a 4-partition Kafka topic and occasionally performs slow downstream API calls inside its poll loop. Ops notices a repeating pattern: consumer lag spikes every few minutes, then partially recovers, then spikes again. What is the most likely root cause of this cycle?
4. Tradeoffs & Decisions
| If you need... | Choose... | Tradeoff |
|---|---|---|
| Independent fan-out to multiple systems | Distinct group.id per consuming system | Each group adds offset-commit traffic; not a real cost at typical scale |
| Maximum throughput in one consumer system | More consumers + enough partitions for them | Wasted consumers if more than partitions |
| Smooth rolling restarts | group.instance.id (static membership) | Slightly longer time to detect a truly-dead consumer |
| Minimal rebalance disruption | Cooperative sticky assignor (modern default) | Slightly more complex assignment protocol; not user-visible |
| Replay from beginning | Use a new group.id or reset offsets to earliest | Reprocessing cost; downstream must be idempotent |
| Per-partition ordering preserved across scaling | Partition by key; consumers within the group naturally process each partition in order | Cannot parallelize within a partition; hot keys cap throughput |
Key tradeoff: rebalance disruption vs failure detection speed. Aggressive timeouts catch dead consumers fast but rebalance frequently on transient slowdowns. Cooperative sticky + static membership + reasonable timeouts (5–10s heartbeat, several minutes max.poll.interval.ms) is the sane default.
Secondary tradeoff: explicit commit vs auto-commit. Auto-commit is simpler but breaks on a crash mid-batch. Explicit commit after processing is the correct default for production; auto-commit is fine for "metrics where I don't care about a few lost records."
A backend team uses Kafka auto-commit for a payment processing consumer. During a crash mid-batch, some records are lost without being processed. What is the root cause of this issue, and what is the recommended fix?
5. Interview & System Design Cheat Sheet
- A is the unit of cooperative consumption: one → one in the group, but unlimited groups can read the same independently.
- count is the ceiling on group concurrency — choose it for peak parallelism, not current.
- Rebalances pause everyone — minimize them via cooperative sticky assignor, static membership, and keeping
processBatchundermax.poll.interval.ms. - commits are per-group, per-partition, stored in
__consumer_offsets. Group ID changes = lost progress. - The same delivers queue semantics within a group and semantics across groups — that's the entire model.
Common follow-ups:
- "What's the difference between an eager and cooperative rebalance?" — Eager: all members release all partitions, then everyone re-acquires (long stop-the-world). Cooperative (incremental sticky): only moving partitions are revoked, others keep processing. Modern default is cooperative.
- "How would you replay data for a new feature?" — Either spin up a new with
auto..reset=earliest, or use-consumer-groupsto reset offsets on the existing group (after stopping it). - "My is spiking — what's going on?" — Check (a) processing throughput per partition, (b) rebalance rate, (c) whether one specific partition is hot, (d) whether
max.poll.interval.msis being exceeded. Lag spikes from rebalances look very different from lag spikes from slow processing — diagnose by metric.
If asked to design X, anchor on this: Identify which systems are independent (separate group.ids) and within each, how many consumer instances you need vs how many partitions the topic has. The answer to "how do we scale this consumer" is almost always "more partitions and more consumers — but first verify processing isn't the bottleneck."
Your team runs a Kafka-based order processing service with one consumer group reading from a topic that has 6 partitions. A new analytics service needs to process every order event independently, without affecting the existing service's offset tracking. What is the correct approach?
Glossary History
Click dotted jargon to save explanations here.
Glossary History
Click dotted jargon to save explanations here.