Consumer Groups

6 min read

Reading Progress0%
Streaming Systems Index
Tier 1 -- Foundations
Tier 2 -- Core Concepts
Tier 3 -- Production & System Design
Streaming Systems Index
Tier 1 -- Foundations
Tier 2 -- Core Concepts
Tier 3 -- Production & System Design

Consumer Groups

1. What Is It?

A is a set of consumers that cooperate to read from the same (s), where each is read by exactly one in the group at a time. Different groups read the same independently — each maintains its own position. The is how delivers both queue semantics (within a group) and semantics (across groups) over the same log primitive.

The problem it solves: a single consumer process cannot keep up with a high-throughput topic. You need to fan out the work across multiple processes while preserving per- ordering and exactly-one-owner semantics. And independently, you want multiple unrelated systems (analytics, billing, fraud) to each read every event without coordinating. Consumer groups handle both with one mechanism.

QUICK CHECK

Your team runs three separate services — analytics, billing, and fraud detection — all of which need to process every event from a high-throughput Kafka topic. You also want each service to scale internally by running multiple instances. Which setup correctly achieves both goals?

Choose one answer

2. How It Works

  1. Each specifies group.id (e.g., fraud-scoring). All consumers sharing that ID form one group.
  2. One is selected as the group coordinator for the group, responsible for membership and commits.
  3. When members join or leave, the coordinator triggers a rebalance — the assignment is recomputed and distributed to members. Different assignors (range, round-robin, cooperative sticky — modern default) choose how.
  4. Each member then independently reads its assigned partitions, processes records, and commits offsets to a special internal __consumer_offsets (keyed by group.id + + ).
  5. If a member fails its heartbeat (session.timeout.ms) or doesn't poll within max.poll.interval.ms, the coordinator kicks it out and rebalances.

Concrete example. Topic orders with 6 partitions, two groups:

  • fraud-scoring with 3 instances → each gets 2 partitions. If a 4th instance joins, assignment becomes 6/4 ≈ 1–2 per consumer. If a 7th joins, one sits idle.
  • analytics with 1 instance → reads all 6 partitions independently. Its progress has nothing to do with fraud-scoring's offset progress.

The two groups don't know about each other. Adding a new group email-receipts later doesn't affect the existing two; it just starts reading from auto.offset.reset (earliest or latest, depending on config).

QUICK CHECK

A Kafka topic has 6 partitions. A consumer group named order-processing starts with 3 consumers, each handling 2 partitions. A fourth consumer joins the group. Shortly after, a seventh consumer is added. What happens to partition assignments at each stage?

Choose one answer

3. What Mid-Senior SWEs Actually Need to Know

  • count is the ceiling on group concurrency. A 6- supports at most 6 active consumers per group. Adding a 7th gives you idle capacity, not more throughput. Plan partition count for peak parallelism.
  • Group ID identifies progress. Two consumers with the same group.id share work and share offsets. Two with different group.ids each read everything. Never accidentally reuse a group.id between unrelated apps — they'll fight over partitions.
  • Rebalances pause everyone in the group. During a rebalance, all members stop processing while the coordinator reassigns. With the older eager assignment, every member gives up all partitions, then re-acquires. With cooperative sticky (modern default), only moved partitions are revoked — much smoother.
  • Rebalances are triggered by: members joining or leaving, partition count changing, subscriptions changing, member failing heartbeat or max.poll.interval.ms.
  • Slow processing → rebalance storm. The most common production pathology: one takes too long in processBatch(), misses max.poll.interval.ms, gets kicked, group rebalances, lag spikes, work isn't drained, the next consumer also misses the deadline. Fix: speed up processing, lower max.poll.records, or move slow work off the poll thread.
  • commits are per-group, per-partition. They live in __consumer_offsets. Deleting a group ID (or changing it) loses progress — the new group starts from auto..reset.
  • auto..reset only applies when there's no committed offset. Once you've committed even one offset, future restarts resume from there, ignoring auto..reset.
  • Static membership (group.instance.id) avoids rebalances on rolling restart. A consumer that disappears briefly (e.g., during a deploy) keeps its assignment until session.timeout.ms * a long timeout, so a fast restart resumes the same partitions. Worth using for stateful consumers.
  • Common misunderstanding: "I'll just add more consumers to scale." Works until you hit the partition count. Real scaling = more partitions and more consumers, and partition count is hard to change later.

Useful CLI:

  • -consumer-groups.sh --describe --group fraud-scoring — shows current offset, log end, lag, and which consumer owns which partition.
  • -consumer-groups.sh --reset-offsets --group X --to-earliest --topic Y --execute — operational tool for replays (use carefully — only when group has no active members).
QUICK CHECK

A backend service consumes from a 4-partition Kafka topic and occasionally performs slow downstream API calls inside its poll loop. Ops notices a repeating pattern: consumer lag spikes every few minutes, then partially recovers, then spikes again. What is the most likely root cause of this cycle?

Choose one answer

4. Tradeoffs & Decisions

If you need...Choose...Tradeoff
Independent fan-out to multiple systemsDistinct group.id per consuming systemEach group adds offset-commit traffic; not a real cost at typical scale
Maximum throughput in one consumer systemMore consumers + enough partitions for themWasted consumers if more than partitions
Smooth rolling restartsgroup.instance.id (static membership)Slightly longer time to detect a truly-dead consumer
Minimal rebalance disruptionCooperative sticky assignor (modern default)Slightly more complex assignment protocol; not user-visible
Replay from beginningUse a new group.id or reset offsets to earliestReprocessing cost; downstream must be idempotent
Per-partition ordering preserved across scalingPartition by key; consumers within the group naturally process each partition in orderCannot parallelize within a partition; hot keys cap throughput

Key tradeoff: rebalance disruption vs failure detection speed. Aggressive timeouts catch dead consumers fast but rebalance frequently on transient slowdowns. Cooperative sticky + static membership + reasonable timeouts (5–10s heartbeat, several minutes max.poll.interval.ms) is the sane default.

Secondary tradeoff: explicit commit vs auto-commit. Auto-commit is simpler but breaks on a crash mid-batch. Explicit commit after processing is the correct default for production; auto-commit is fine for "metrics where I don't care about a few lost records."

QUICK CHECK

A backend team uses Kafka auto-commit for a payment processing consumer. During a crash mid-batch, some records are lost without being processed. What is the root cause of this issue, and what is the recommended fix?

Choose one answer

5. Interview & System Design Cheat Sheet

  • A is the unit of cooperative consumption: one → one in the group, but unlimited groups can read the same independently.
  • count is the ceiling on group concurrency — choose it for peak parallelism, not current.
  • Rebalances pause everyone — minimize them via cooperative sticky assignor, static membership, and keeping processBatch under max.poll.interval.ms.
  • commits are per-group, per-partition, stored in __consumer_offsets. Group ID changes = lost progress.
  • The same delivers queue semantics within a group and semantics across groups — that's the entire model.

Common follow-ups:

  • "What's the difference between an eager and cooperative rebalance?" — Eager: all members release all partitions, then everyone re-acquires (long stop-the-world). Cooperative (incremental sticky): only moving partitions are revoked, others keep processing. Modern default is cooperative.
  • "How would you replay data for a new feature?" — Either spin up a new with auto..reset=earliest, or use -consumer-groups to reset offsets on the existing group (after stopping it).
  • "My is spiking — what's going on?" — Check (a) processing throughput per partition, (b) rebalance rate, (c) whether one specific partition is hot, (d) whether max.poll.interval.ms is being exceeded. Lag spikes from rebalances look very different from lag spikes from slow processing — diagnose by metric.

If asked to design X, anchor on this: Identify which systems are independent (separate group.ids) and within each, how many consumer instances you need vs how many partitions the topic has. The answer to "how do we scale this consumer" is almost always "more partitions and more consumers — but first verify processing isn't the bottleneck."

QUICK CHECK

Your team runs a Kafka-based order processing service with one consumer group reading from a topic that has 6 partitions. A new analytics service needs to process every order event independently, without affecting the existing service's offset tracking. What is the correct approach?

Choose one answer