This problem appears in multiple sheets. Depth expectations increase as you progress:
Interview Prompt
Design a Distributed Coordination Service (like Apache ZooKeeper).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Is this optimized for reads or writes? | Coordination services are typically read-heavy (10:1 or 100:1 read-to-write ratio). |
| What are the core primitives we must support? | Leader election, configuration management, and distributed locks dictate the need for Ephemeral nodes and Watches. |
| Does it need to survive network partitions? | Yes, CP systems must prioritize consistency over availability during partitions. |
Scope
In scope
- Hierarchical data model (ZNodes)
- Watches and event notification
- Consensus and Leader Election (ZAB protocol)
- Client session management
Out of scope (state explicitly)
- Storing large files (blob storage)
- High-throughput data streaming (Kafka use case)
Assumptions
- Data size is small (fits entirely in RAM)
- Strong consistency is required for all writes
- Clients will maintain long-lived TCP connections
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Hierarchical Namespace: Data is organized in a file-system-like tree of nodes (called
znodes). - Znode Types: Support Persistent, Ephemeral (deleted when client session ends), and Sequential (auto-incrementing suffix) znodes.
- Watch Mechanism: Clients can set watches on znodes to receive push notifications when the znode changes or its children change.
- Core API:
create,delete,exists,getData,setData,getChildren.
- High Availability: The service must remain available as long as a majority (quorum) of servers are alive.
- Strict Ordering: All updates are strictly ordered. A client never sees the system go "backwards" in time.
- High Read Throughput: Typical workloads are 10:1 read-to-write ratio. Reads must be extremely fast and scalable.
- Linearizable Writes: Writes must be serializable and strongly consistent.
ZooKeeper uses a replicated ensemble of servers (typically 3, 5, or 7). One server is elected the Leader, and the rest are Followers.
1. The Data Model (Znode Tree)
Unlike a standard key-value store, ZooKeeper organizes keys in a tree structure mimicking a file system. The entire tree is kept in memory on every server to guarantee extremely fast reads.
- Persistent Znodes: Remain in the tree until explicitly deleted. Used for static configuration.
- Ephemeral Znodes: Bound to the client's session. If the client disconnects or crashes (and misses heartbeat deadlines), the znode is automatically deleted. Critical for failure detection and service discovery.
- Sequential Znodes: ZooKeeper automatically appends a monotonically increasing 10-digit counter to the name (e.g.,
node-00000001). Critical for implementing distributed queues and fair locks.
2. ZAB Protocol (ZooKeeper Atomic Broadcast) ⭐
ZAB is ZooKeeper's consensus protocol, ensuring all nodes in the ensemble agree on the exact sequence of state changes. It operates in two main phases: Broadcast and Recovery.
--- ZAB (ZooKeeper Atomic Broadcast) Write Protocol --- 1. Client sends WRITE request (e.g., SET /config) to Follower 1. 2. Follower 1 forwards WRITE to Leader. 3. Leader creates a PROPOSAL (zxid: 100) and broadcasts to all Followers. 4. Followers append PROPOSAL to their local Transaction Log on disk. 5. Followers reply with ACK to Leader. 6. Leader waits for a QUORUM of ACKs (e.g., 2 out of 3 nodes). 7. Leader broadcasts COMMIT to all Followers. 8. Leader applies change to its own In-Memory Tree and replies SUCCESS to Client. 9. Followers apply change to their In-Memory Trees.
Because the Leader coordinates all writes, write throughput does not scale linearly as you add more nodes. In fact, adding nodes slightly decreases write throughput because a larger quorum is required for consensus, increasing network round-trip overhead.
3. Read & Write Paths
Write Path: 1. Client sends write to any node. 2. If node is a Follower, it forwards the write to the Leader. 3. Leader executes ZAB protocol (Propose -> Ack -> Commit). 4. Once committed, Leader applies it to memory and replies to the client. Read Path: 1. Client sends read to the node it is connected to (Leader or Follower). 2. The node reads directly from its local in-memory Znode tree. 3. Fast reads, but may return slightly stale data if the Follower hasn't processed the latest Commit. (To guarantee reading latest data, client can issue a 'sync' command first).
Wait, aren't reads stale? Yes, ZooKeeper provides Sequential Consistency, not Strict Consistency. A read from a follower might return slightly stale data if it hasn't applied the latest commit. However, a client will never see time go backward. If a client absolutely needs the latest data, it can call sync() before reading, which forces the follower to catch up with the leader.
4. Watches & Push Notifications
Polling for changes (e.g., polling for a lock release every 100ms) would overwhelm the cluster. ZooKeeper provides Watches for event-driven updates.
- When a client calls
getData("/config", true), the server records the client's session as interested in/config. - If another client updates
/config, the server pushes a watch event to the client over its persistent TCP connection. - Herd Effect Prevention: Watches are one-time triggers. After firing, the client must set a new watch. Furthermore, in distributed locks, clients should only watch the specific znode immediately preceding their own in the sequence, rather than all clients watching the lock holder.
5. Persistence & Snapshots
Since the tree is in RAM, a cluster-wide power loss would erase everything. ZooKeeper prevents this using dual disk persistence:
- Transaction Log: Every write proposal is sequentially appended to a write-ahead log on disk. It must be flushed (fsync) before acknowledging the leader.
- Fuzzy Snapshots: Periodically, the in-memory tree is serialized and dumped to disk. It's "fuzzy" because writes continue while the dump happens, meaning the snapshot isn't point-in-time perfect.
- Recovery: On boot, a server loads the latest fuzzy snapshot, then replays the transaction log from the snapshot's start point to restore the exact state. Idempotency guarantees this is safe.
ZooKeeper provides a simple, filesystem-like API. These primitives are combined by engineers to build complex distributed mechanisms (locks, leader election).
// 1. Create a ZNode
// Flags: EPHEMERAL, SEQUENTIAL, or PERSISTENT
String path = zk.create("/services/payment/node_1", "10.0.0.5".getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
// 2. Read Data & Set Watch
// The 'true' flag tells ZooKeeper to notify this client if the data changes
byte[] data = zk.getData("/config/db_url", true, stat);
// 3. Write Data
// Uses 'version' for CAS (Compare-And-Swap) optimistic concurrency
zk.setData("/config/db_url", "jdbc:postgresql://new-db".getBytes(), stat.getVersion());
// 4. Get Children & Set Watch
// Useful for Service Discovery or Distributed Locks
List<String> children = zk.getChildren("/locks/resource_1", true);- Watches: One-time triggers sent by the server to the client when a znode changes, eliminating the need for client polling.
- Versions: Every write requires the client to pass the last known version. If another client modified it concurrently, the write fails (Optimistic Concurrency Control).
The entire data tree is stored in RAM for extreme read performance. Each node in the tree is called a ZNode.
// A ZNode is not just a byte array; it holds critical metadata (Stat structure)
struct Stat {
int64 czxid; // Zxid of the transaction that created this znode
int64 mzxid; // Zxid of the transaction that last modified this znode
int64 pzxid; // Zxid of the transaction that last modified children
int32 version; // Incremented on every data change (used for CAS)
int32 cversion; // Incremented on every child change
int32 aversion; // Incremented on every ACL change
int64 ephemeralOwner; // Session ID if ephemeral, 0 if persistent
int32 dataLength; // Length of the byte array
int32 numChildren;
};
// Internal representation in RAM
class DataNode {
byte[] data;
Long acl;
StatPersisted stat;
Set<String> children;
}- Ephemeral Nodes: Tied to the active TCP session of the client that created them. If the client crashes or loses network, the ZooKeeper server automatically deletes the node. Crucial for Service Discovery.
- Sequential Nodes: ZooKeeper appends a monotonically increasing counter to the path (e.g.,
/lock-0001). Crucial for Distributed Locks.
ZooKeeper requires a strict majority (Quorum) to elect a leader and commit writes. For an ensemble of N nodes, the quorum size is (N/2) + 1.
| Ensemble Size | Quorum Size | Fault Tolerance (Nodes can fail) |
|---|---|---|
| 3 | 2 | 1 |
| 5 | 3 | 2 |
| 7 | 4 | 3 |
Split-Brain Prevention: In a network partition of a 5-node cluster splitting into groups of 2 and 3, only the group of 3 (the majority) can elect a leader and process writes. The group of 2 will stop serving writes, preventing data divergence. This guarantees strict consistency over availability (CP system).
In-Memory Data vs. Data Size
Because the entire Znode tree must fit in RAM, ZooKeeper is NOT a general-purpose database. It is designed to store megabytes, not gigabytes, of data (e.g., config strings, host IP addresses). A single znode should generally not exceed 1 MB. Attempting to store large binaries in ZooKeeper will cause severe GC pauses and network saturation during snapshot synchronization.
Read Scalability vs. Write Bottlenecks
ZooKeeper scales reads beautifully—just add more Follower nodes or Observers (nodes that serve reads but don't vote in quorums). However, writes hit a hard bottleneck because every write must be serialized through the single Leader and acknowledged by a quorum.
ZAB vs. Raft vs. Paxos
Paxos: The theoretical foundation of consensus, but notoriously difficult to implement in real systems.
Raft (etcd): Designed for understandability. Joint consensus for membership changes. Strongly consistent reads by default.
ZAB (ZooKeeper): Designed specifically for primary-backup state machine replication. It implements FIFO client ordering and prioritizes high-throughput transaction broadcasting. ZAB's recovery phase ensures that any committed transaction will eventually be delivered to all nodes, even across leader elections.
CP vs AP (CAP Theorem)
ZooKeeper is strongly a CP system (Consistent and Partition Tolerant). In the event of a network partition where a majority quorum cannot be established, ZooKeeper will stop serving read and write requests entirely (sacrificing Availability) rather than serving split-brain/divergent data.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: Single Node In-Memory Store
A simple tree structure in memory with a TCP server handling connections and ephemeral node tracking.
Key components: Tree Data Structure · Session Manager · TCP Server
Move to next phase when: Single point of failure; cannot survive node crashes.
Phase 2: High Availability & Replication
Introduce a cluster of nodes with Leader Election and a consensus protocol to replicate writes.
Key components: ZAB Protocol · Quorum Peer · Write-Ahead Log (WAL)
Move to next phase when: Need for high read throughput and fault tolerance.
Phase 3: Watchers & Observer Nodes
Add Watch mechanisms for push-based notifications, and Observer nodes that don't participate in voting to scale reads globally.
Key components: Watch Manager · Observer Nodes · Fuzzy Snapshots
Move to next phase when: Polling overloads the network; read volume requires scaling beyond the voting quorum.
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Read Latency | 99% < 2ms | Reads are served from RAM by any replica. |
| Write Latency | 99% < 10ms | Requires network round-trip for quorum consensus and disk fsync. |
| Availability | 99.99% | Core infrastructure dependency; if ZK goes down, Kafka/Hadoop go down. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| GC Pause on the Leader | Followers timeout waiting for heartbeats; false leader election is triggered. | Tune JVM GC (use G1GC or ZGC). Ensure session timeouts are generous enough to survive typical GC pauses. |
| Watch Storm (Thundering Herd) | A popular ZNode changes, and the server spends all its CPU sending notifications to thousands of clients, who then immediately reconnect. | Clients should jitter their reconnections. In modern designs, use tree-based or delegated watchers to distribute the load. |
| Disk I/O Contention | Write latency spikes because WAL fsyncs are waiting on disk. | ZooKeeper's WAL must be on a dedicated, high-IOPS SSD. Do not share the disk with swap or other heavy logging processes. |
Cost Drivers (Staff lens)
- Memory (entire dataset must fit in RAM)
- High-performance SSDs for the Write-Ahead Log
- Network bandwidth for cross-node replication and watch events
Multi-Region & DR
Usually deployed in 3 or 5 nodes within a single region or closely peered regions. Cross-global-region deployments suffer from high write latency because ZAB requires a synchronous majority quorum.