Brokers & Replication

7 min read

Reading Progress0%

Streaming Systems Index

Brokers & Replication

1. What Is It?

A is one server process, holding partitions on local disk. is 's mechanism for keeping multiple copies of each across different brokers so that the loss of one (disk, host, AZ) doesn't lose data or stop the cluster. Each has one leader broker (handles reads and writes) and N-1 follower brokers that keep in-sync replicas.

The problem this solves: any single disk or machine eventually fails. Without , a broker outage means losing every partition assigned to it — both for reads and writes — and likely losing data. Replication turns single-machine durability into cluster-level durability; the cost is more disks, more network, and a more careful write path.

QUICK CHECK

A Kafka cluster has a topic with 3 partitions and a replication factor of 3, spread across 3 brokers. One broker suddenly goes offline due to a hardware failure. What happens to the partitions whose leader was on that broker?

2. How It Works

Roles

Leader of a : handles all client produce and fetch requests for that . Each partition has exactly one leader at a time.
Followers: continuously fetch from the leader and write the data to their own logs.
In-Sync Replicas (): the set of replicas that are "caught up" — currently within replica.lag.time.max.ms of the leader. The leader is always in the .

Write path with `acks=all`

sends the record to the partition leader.
Leader appends to its local log.
All followers in the ISR fetch the record and append to their logs.
Once every ISR follower has written the record, the leader advances the high and acknowledges the .
Consumers can only read up to the high — uncommitted records are invisible.

Leader failure

If the current leader fails (no longer in ISR or fails health check), the controller promotes one of the remaining in-sync replicas to leader.
Producers and consumers fetch fresh metadata, retry against the new leader.
Failover typically takes seconds (network heartbeat + controller decision + metadata propagation). Brief produce/fetch errors are normal.

`min.insync.replicas`

-level setting: minimum ISR size required for an acks=all produce to succeed.
If too few replicas are in-sync, the leader returns NotEnoughReplicasException and the producer cannot write (preserving durability over availability).

Concrete example. orders with factor 3, min.insync.replicas=2, acks=all, on a 3- cluster spanning 3 AZs:

Steady state: Every partition has 3 replicas (1 leader + 2 followers), all in-sync.
One dies (e.g., AZ outage): ISR drops to 2. Writes still succeed because min.insync.replicas=2 is met. Affected partitions whose leader was on the dead broker fail over to one of the remaining replicas. Brief produce/fetch errors during failover.
Two brokers die: ISR drops to 1. Writes fail (NotEnoughReplicasException) because min.insync.replicas=2 is not met. The cluster preserves durability — you'd rather block writes than lose data.
Dead broker comes back: Follower catches up by re-fetching from the leader; once within replica.lag.time.max.ms, it rejoins the ISR.

QUICK CHECK

A Kafka topic has replication factor 3 and min.insync.replicas=2. Producers use acks=all. Two of the three brokers go down simultaneously. What happens to produce requests?

3. What Mid-Senior SWEs Actually Need to Know

The standard safe combo is RF=3, min.insync.replicas=2, acks=all. This tolerates one failure without data loss or write outage. Anything weaker risks losing acknowledged messages during failover.
acks=all does not mean "wait for all replicas" — it means "wait for all in-sync replicas." That's why min.insync.replicas exists: it prevents acks=all from silently downgrading to acks=1 when the has shrunk.
Unclean is a footgun. With unclean.leader.election.enable=true, an out-of-sync replica can become leader after a worst-case failure — preserving availability at the cost of losing acknowledged messages. Default is false; keep it false unless you understand the tradeoff (rarely worth it).
Rack awareness must be configured. Set .rack on each broker (e.g., to AZ). Otherwise may place all 3 replicas of a in the same AZ, and a single AZ outage takes the fully offline.
is network and disk heavy. Each produce writes 1 leader + (RF-1) followers — total cluster write amplification is RF. Plan inter-broker network bandwidth and disk IOPS accordingly.
A shrinking is a leading indicator of trouble. Use -topics --describe and the JMX metric UnderReplicatedPartitions — if it's nonzero, something is wrong (slow disk, network saturation, GC pause, broker overloaded).
Broker storage is local-disk by design. Adding/removing a broker streams partition data across the network, which can take hours for large clusters. Replacements are not zero-cost; plan capacity for catch-up.
Common misunderstanding: "If RF=3 I can lose 2 brokers and keep writing." Only if you set min.insync.replicas=1 — which means you might lose acknowledged data if the lone surviving broker dies before catching up. Most teams want min.insync.replicas=2 and accept that 2 simultaneous broker losses block writes.

QUICK CHECK

A Kafka cluster is configured with RF=3 and acks=all. Due to slow disk I/O on two brokers, the ISR for a critical partition shrinks from 3 replicas down to 1. What happens to produce requests targeting that partition, assuming min.insync.replicas=2?

4. Tradeoffs & Decisions

If you need...	Set...	Tradeoff
Maximum durability, accept write outage on 2 broker losses	RF=3, `min.insync.replicas=2`, `acks=all`, unclean=false	Most expensive; standard safe default
High availability, can tolerate rare data loss	RF=3, `min.insync.replicas=1`, `acks=all`, unclean=false	One surviving replica can ack; if it then dies, loss is possible
Maximum throughput, accept loss on broker failure	`acks=1` (leader only)	~10–30% throughput win; lose data if leader dies before replicating
Survive AZ outage	RF=3 across 3 AZs, `broker.rack` set	Cross-AZ replication latency (~ms) and inter-AZ data egress costs
Survive whole-region outage	MirrorMaker 2 / Cluster Linking to a passive cluster in another region	DR complexity, lag between regions, failover runbook

Key tradeoff: durability vs availability vs cost. More replicas survive more failures but cost more disk and network. Higher min.insync.replicas protects data but blocks writes earlier when replicas are unhealthy. There is no single right answer — match the 's business value.

Secondary tradeoff: fast failover vs stable leadership. Aggressive failure detection (low replica.lag.time.max.ms) catches dead followers fast but flaps membership under load spikes. The defaults are reasonable; tune only with evidence.

QUICK CHECK

A Kafka topic stores financial transaction records. The team configures RF=3, min.insync.replicas=2, acks=all, and unclean.leader.election.enable=false. During a rolling maintenance window, two of the three brokers go offline simultaneously. What happens to producer writes during this time?

5. Interview & System Design Cheat Sheet

A is one server; a cluster is N brokers. Each has one leader (serves reads/writes) and N-1 followers (replicate).
In-Sync Replicas () = the set of replicas currently caught up; min.insync.replicas defines how many must be in the for acks=all produces to succeed.
Standard safe setting: RF=3, min.insync.replicas=2, acks=all, unclean.leader.election.enable=false, with rack awareness spanning AZs.
Leader failure triggers controller-driven failover in seconds; brief produce/fetch errors are expected and clients retry transparently.
UnderReplicatedPartitions is the canary metric — nonzero means a follower is behind, which is the first symptom of disk saturation, network issues, or overload.

Common follow-ups:

"What's the difference between acks=all and min.insync.replicas?" — acks=all is the asking the leader to wait for all ISR followers. min.insync.replicas is the leader's check that the ISR is large enough to honor that contract; without it, acks=all silently weakens when followers fall out of sync.
"What does unclean cost?" — Availability win after worst-case failure (no in-sync replicas left) at the cost of losing acknowledged messages. Disable unless you have explicitly chosen availability over correctness.
"How do you tune for inter-AZ deployments?" — broker.rack for replica placement; tune replica.fetch.max.bytes and socket.receive.buffer.bytes for higher RTTs; expect ~ms-level cross-AZ latency.

If asked to design X, anchor on this: State the factor, the min.insync.replicas, the acks setting, whether you're rack-aware, and your DR posture (single-region with multi-AZ replication, or multi-region with MM2 / Cluster Linking). Those five answers are the durability story.

QUICK CHECK

A Kafka topic is configured with replication factor 3 and min.insync.replicas=2. One follower falls significantly behind and drops out of the ISR, leaving only the leader and one other follower in sync. A producer with acks=all sends a message. What happens?

Glossary History

Click dotted jargon to save explanations here.

Glossary History

Click dotted jargon to save explanations here.

Brokers & Replication

Streaming Systems Index

Streaming Systems Index

Brokers & Replication

1. What Is It?

2. How It Works

Roles

Write path with `acks=all`

Leader failure

`min.insync.replicas`

3. What Mid-Senior SWEs Actually Need to Know

4. Tradeoffs & Decisions

5. Interview & System Design Cheat Sheet

On This Page

On This Page

Glossary History

Glossary History

Brokers & Replication

Streaming Systems Index

Streaming Systems Index

Brokers & Replication

1. What Is It?

2. How It Works

Roles

Write path with acks=all

Leader failure

min.insync.replicas

3. What Mid-Senior SWEs Actually Need to Know

4. Tradeoffs & Decisions

5. Interview & System Design Cheat Sheet

On This Page

On This Page

Glossary History

Glossary History

Write path with `acks=all`

`min.insync.replicas`