Brokers & Replication
7 min read
Streaming Systems Index
Streaming Systems Index
Brokers & Replication
1. What Is It?
A is one server process, holding partitions on local disk. is 's mechanism for keeping multiple copies of each across different brokers so that the loss of one (disk, host, AZ) doesn't lose data or stop the cluster. Each has one leader broker (handles reads and writes) and N-1 follower brokers that keep in-sync replicas.
The problem this solves: any single disk or machine eventually fails. Without , a broker outage means losing every partition assigned to it — both for reads and writes — and likely losing data. Replication turns single-machine durability into cluster-level durability; the cost is more disks, more network, and a more careful write path.
A Kafka cluster has a topic with 3 partitions and a replication factor of 3, spread across 3 brokers. One broker suddenly goes offline due to a hardware failure. What happens to the partitions whose leader was on that broker?
2. How It Works
Roles
- Leader of a : handles all client produce and fetch requests for that . Each partition has exactly one leader at a time.
- Followers: continuously fetch from the leader and write the data to their own logs.
- In-Sync Replicas (): the set of replicas that are "caught up" — currently within
replica.lag.time.max.msof the leader. The leader is always in the .
Write path with acks=all
- sends the record to the partition leader.
- Leader appends to its local log.
- All followers in the ISR fetch the record and append to their logs.
- Once every ISR follower has written the record, the leader advances the high and acknowledges the .
- Consumers can only read up to the high — uncommitted records are invisible.
Leader failure
- If the current leader fails (no longer in ISR or fails health check), the controller promotes one of the remaining in-sync replicas to leader.
- Producers and consumers fetch fresh metadata, retry against the new leader.
- Failover typically takes seconds (network heartbeat + controller decision + metadata propagation). Brief produce/fetch errors are normal.
min.insync.replicas
- -level setting: minimum ISR size required for an
acks=allproduce to succeed. - If too few replicas are in-sync, the leader returns
NotEnoughReplicasExceptionand the producer cannot write (preserving durability over availability).
Concrete example. orders with factor 3, min.insync.replicas=2, acks=all, on a 3- cluster spanning 3 AZs:
- Steady state: Every partition has 3 replicas (1 leader + 2 followers), all in-sync.
- One dies (e.g., AZ outage): ISR drops to 2. Writes still succeed because
min.insync.replicas=2is met. Affected partitions whose leader was on the dead broker fail over to one of the remaining replicas. Brief produce/fetch errors during failover. - Two brokers die: ISR drops to 1. Writes fail (
NotEnoughReplicasException) becausemin.insync.replicas=2is not met. The cluster preserves durability — you'd rather block writes than lose data. - Dead broker comes back: Follower catches up by re-fetching from the leader; once within
replica.lag.time.max.ms, it rejoins the ISR.
A Kafka topic has replication factor 3 and min.insync.replicas=2. Producers use acks=all. Two of the three brokers go down simultaneously. What happens to produce requests?
3. What Mid-Senior SWEs Actually Need to Know
- The standard safe combo is
RF=3, min.insync.replicas=2, acks=all. This tolerates one failure without data loss or write outage. Anything weaker risks losing acknowledged messages during failover. acks=alldoes not mean "wait for all replicas" — it means "wait for all in-sync replicas." That's whymin.insync.replicasexists: it preventsacks=allfrom silently downgrading toacks=1when the has shrunk.- Unclean is a footgun. With
unclean.leader.election.enable=true, an out-of-sync replica can become leader after a worst-case failure — preserving availability at the cost of losing acknowledged messages. Default is false; keep it false unless you understand the tradeoff (rarely worth it). - Rack awareness must be configured. Set
.rackon each broker (e.g., to AZ). Otherwise may place all 3 replicas of a in the same AZ, and a single AZ outage takes the fully offline. - is network and disk heavy. Each produce writes 1 leader + (RF-1) followers — total cluster write amplification is RF. Plan inter-broker network bandwidth and disk IOPS accordingly.
- A shrinking is a leading indicator of trouble. Use
-topics --describeand the JMX metricUnderReplicatedPartitions— if it's nonzero, something is wrong (slow disk, network saturation, GC pause, broker overloaded). - Broker storage is local-disk by design. Adding/removing a broker streams partition data across the network, which can take hours for large clusters. Replacements are not zero-cost; plan capacity for catch-up.
- Common misunderstanding: "If RF=3 I can lose 2 brokers and keep writing." Only if you set
min.insync.replicas=1— which means you might lose acknowledged data if the lone surviving broker dies before catching up. Most teams wantmin.insync.replicas=2and accept that 2 simultaneous broker losses block writes.
A Kafka cluster is configured with RF=3 and acks=all. Due to slow disk I/O on two brokers, the ISR for a critical partition shrinks from 3 replicas down to 1. What happens to produce requests targeting that partition, assuming min.insync.replicas=2?
4. Tradeoffs & Decisions
| If you need... | Set... | Tradeoff |
|---|---|---|
| Maximum durability, accept write outage on 2 broker losses | RF=3, min.insync.replicas=2, acks=all, unclean=false | Most expensive; standard safe default |
| High availability, can tolerate rare data loss | RF=3, min.insync.replicas=1, acks=all, unclean=false | One surviving replica can ack; if it then dies, loss is possible |
| Maximum throughput, accept loss on broker failure | acks=1 (leader only) | ~10–30% throughput win; lose data if leader dies before replicating |
| Survive AZ outage | RF=3 across 3 AZs, broker.rack set | Cross-AZ replication latency (~ms) and inter-AZ data egress costs |
| Survive whole-region outage | MirrorMaker 2 / Cluster Linking to a passive cluster in another region | DR complexity, lag between regions, failover runbook |
Key tradeoff: durability vs availability vs cost. More replicas survive more failures but cost more disk and network. Higher min.insync.replicas protects data but blocks writes earlier when replicas are unhealthy. There is no single right answer — match the 's business value.
Secondary tradeoff: fast failover vs stable leadership. Aggressive failure detection (low replica.lag.time.max.ms) catches dead followers fast but flaps membership under load spikes. The defaults are reasonable; tune only with evidence.
A Kafka topic stores financial transaction records. The team configures RF=3, min.insync.replicas=2, acks=all, and unclean.leader.election.enable=false. During a rolling maintenance window, two of the three brokers go offline simultaneously. What happens to producer writes during this time?
5. Interview & System Design Cheat Sheet
- A is one server; a cluster is N brokers. Each has one leader (serves reads/writes) and N-1 followers (replicate).
- In-Sync Replicas () = the set of replicas currently caught up;
min.insync.replicasdefines how many must be in the foracks=allproduces to succeed. - Standard safe setting:
RF=3, min.insync.replicas=2, acks=all, unclean.leader.election.enable=false, with rack awareness spanning AZs. - Leader failure triggers controller-driven failover in seconds; brief produce/fetch errors are expected and clients retry transparently.
UnderReplicatedPartitionsis the canary metric — nonzero means a follower is behind, which is the first symptom of disk saturation, network issues, or overload.
Common follow-ups:
- "What's the difference between
acks=allandmin.insync.replicas?" —acks=allis the asking the leader to wait for all ISR followers.min.insync.replicasis the leader's check that the ISR is large enough to honor that contract; without it,acks=allsilently weakens when followers fall out of sync. - "What does unclean cost?" — Availability win after worst-case failure (no in-sync replicas left) at the cost of losing acknowledged messages. Disable unless you have explicitly chosen availability over correctness.
- "How do you tune for inter-AZ deployments?" —
broker.rackfor replica placement; tunereplica.fetch.max.bytesandsocket.receive.buffer.bytesfor higher RTTs; expect ~ms-level cross-AZ latency.
If asked to design X, anchor on this: State the factor, the min.insync.replicas, the acks setting, whether you're rack-aware, and your DR posture (single-region with multi-AZ replication, or multi-region with MM2 / Cluster Linking). Those five answers are the durability story.
A Kafka topic is configured with replication factor 3 and min.insync.replicas=2. One follower falls significantly behind and drops out of the ISR, leaving only the leader and one other follower in sync. A producer with acks=all sends a message. What happens?
Glossary History
Click dotted jargon to save explanations here.
Glossary History
Click dotted jargon to save explanations here.