Design a Distributed Coordination Service (ZooKeeper)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Focus on the hierarchical namespace (ZNodes), ephemeral nodes, and watches.
Arch 75	Staff angles: ZAB consensus protocol, strictly ordered updates, and handling split-brain scenarios.

Interview Prompt

Design a Distributed Coordination Service (like Apache ZooKeeper).

Clarifying Questions (ask before designing)

Question	Why it matters
Is this optimized for reads or writes?	Coordination services are typically read-heavy (10:1 or 100:1 read-to-write ratio).
What are the core primitives we must support?	Leader election, configuration management, and distributed locks dictate the need for Ephemeral nodes and Watches.
Does it need to survive network partitions?	Yes, CP systems must prioritize consistency over availability during partitions.

Scope

In scope

Hierarchical data model (ZNodes)
Watches and event notification
Consensus and Leader Election (ZAB protocol)
Client session management

Out of scope (state explicitly)

Storing large files (blob storage)
High-throughput data streaming (Kafka use case)

Assumptions

Data size is small (fits entirely in RAM)
Strong consistency is required for all writes
Clients will maintain long-lived TCP connections

Hierarchical Namespace: Data is organized in a file-system-like tree of nodes (called znodes).
Znode Types: Support Persistent, Ephemeral (deleted when client session ends), and Sequential (auto-incrementing suffix) znodes.
Watch Mechanism: Clients can set watches on znodes to receive push notifications when the znode changes or its children change.
Core API: create, delete, exists, getData, setData, getChildren.

ZooKeeper uses a replicated ensemble of servers (typically 3, 5, or 7). One server is elected the Leader, and the rest are Followers.

Loading...

1. The Data Model (Znode Tree)

Unlike a standard key-value store, ZooKeeper organizes keys in a tree structure mimicking a file system. The entire tree is kept in memory on every server to guarantee extremely fast reads.

Loading...

Persistent Znodes: Remain in the tree until explicitly deleted. Used for static configuration.
Ephemeral Znodes: Bound to the client's session. If the client disconnects or crashes (and misses heartbeat deadlines), the znode is automatically deleted. Critical for failure detection and service discovery.
Sequential Znodes: ZooKeeper automatically appends a monotonically increasing 10-digit counter to the name (e.g., node-00000001). Critical for implementing distributed queues and fair locks.

2. ZAB Protocol (ZooKeeper Atomic Broadcast) ⭐

ZAB is ZooKeeper's consensus protocol, ensuring all nodes in the ensemble agree on the exact sequence of state changes. It operates in two main phases: Broadcast and Recovery.


--- ZAB (ZooKeeper Atomic Broadcast) Write Protocol ---

1. Client sends WRITE request (e.g., SET /config) to Follower 1.
2. Follower 1 forwards WRITE to Leader.
3. Leader creates a PROPOSAL (zxid: 100) and broadcasts to all Followers.
4. Followers append PROPOSAL to their local Transaction Log on disk.
5. Followers reply with ACK to Leader.
6. Leader waits for a QUORUM of ACKs (e.g., 2 out of 3 nodes).
7. Leader broadcasts COMMIT to all Followers.
8. Leader applies change to its own In-Memory Tree and replies SUCCESS to Client.
9. Followers apply change to their In-Memory Trees.

Because the Leader coordinates all writes, write throughput does not scale linearly as you add more nodes. In fact, adding nodes slightly decreases write throughput because a larger quorum is required for consensus, increasing network round-trip overhead.

3. Read & Write Paths

Write Path:
1. Client sends write to any node.
2. If node is a Follower, it forwards the write to the Leader.
3. Leader executes ZAB protocol (Propose -> Ack -> Commit).
4. Once committed, Leader applies it to memory and replies to the client.

Read Path:
1. Client sends read to the node it is connected to (Leader or Follower).
2. The node reads directly from its local in-memory Znode tree.
3. Fast reads, but may return slightly stale data if the Follower hasn't processed the latest Commit.
   (To guarantee reading latest data, client can issue a 'sync' command first).

Wait, aren't reads stale? Yes, ZooKeeper provides Sequential Consistency, not Strict Consistency. A read from a follower might return slightly stale data if it hasn't applied the latest commit. However, a client will never see time go backward. If a client absolutely needs the latest data, it can call sync() before reading, which forces the follower to catch up with the leader.

4. Watches & Push Notifications

Polling for changes (e.g., polling for a lock release every 100ms) would overwhelm the cluster. ZooKeeper provides Watches for event-driven updates.

When a client calls getData("/config", true), the server records the client's session as interested in /config.
If another client updates /config, the server pushes a watch event to the client over its persistent TCP connection.
Herd Effect Prevention: Watches are one-time triggers. After firing, the client must set a new watch. Furthermore, in distributed locks, clients should only watch the specific znode immediately preceding their own in the sequence, rather than all clients watching the lock holder.

5. Persistence & Snapshots

Since the tree is in RAM, a cluster-wide power loss would erase everything. ZooKeeper prevents this using dual disk persistence:

Transaction Log: Every write proposal is sequentially appended to a write-ahead log on disk. It must be flushed (fsync) before acknowledging the leader.
Fuzzy Snapshots: Periodically, the in-memory tree is serialized and dumped to disk. It's "fuzzy" because writes continue while the dump happens, meaning the snapshot isn't point-in-time perfect.
Recovery: On boot, a server loads the latest fuzzy snapshot, then replays the transaction log from the snapshot's start point to restore the exact state. Idempotency guarantees this is safe.

ZooKeeper provides a simple, filesystem-like API. These primitives are combined by engineers to build complex distributed mechanisms (locks, leader election).

JAVA

// 1. Create a ZNode
// Flags: EPHEMERAL, SEQUENTIAL, or PERSISTENT
String path = zk.create("/services/payment/node_1", "10.0.0.5".getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

// 2. Read Data & Set Watch
// The 'true' flag tells ZooKeeper to notify this client if the data changes
byte[] data = zk.getData("/config/db_url", true, stat);

// 3. Write Data
// Uses 'version' for CAS (Compare-And-Swap) optimistic concurrency
zk.setData("/config/db_url", "jdbc:postgresql://new-db".getBytes(), stat.getVersion());

// 4. Get Children & Set Watch
// Useful for Service Discovery or Distributed Locks
List<String> children = zk.getChildren("/locks/resource_1", true);

Watches: One-time triggers sent by the server to the client when a znode changes, eliminating the need for client polling.
Versions: Every write requires the client to pass the last known version. If another client modified it concurrently, the write fails (Optimistic Concurrency Control).

The entire data tree is stored in RAM for extreme read performance. Each node in the tree is called a ZNode.

C

// A ZNode is not just a byte array; it holds critical metadata (Stat structure)
struct Stat {
    int64 czxid;      // Zxid of the transaction that created this znode
    int64 mzxid;      // Zxid of the transaction that last modified this znode
    int64 pzxid;      // Zxid of the transaction that last modified children
    int32 version;    // Incremented on every data change (used for CAS)
    int32 cversion;   // Incremented on every child change
    int32 aversion;   // Incremented on every ACL change
    int64 ephemeralOwner; // Session ID if ephemeral, 0 if persistent
    int32 dataLength; // Length of the byte array
    int32 numChildren;
};

// Internal representation in RAM
class DataNode {
    byte[] data;
    Long acl;
    StatPersisted stat;
    Set<String> children;
}

Ephemeral Nodes: Tied to the active TCP session of the client that created them. If the client crashes or loses network, the ZooKeeper server automatically deletes the node. Crucial for Service Discovery.
Sequential Nodes: ZooKeeper appends a monotonically increasing counter to the path (e.g., /lock-0001). Crucial for Distributed Locks.

ZooKeeper requires a strict majority (Quorum) to elect a leader and commit writes. For an ensemble of N nodes, the quorum size is (N/2) + 1.

Ensemble Size	Quorum Size	Fault Tolerance (Nodes can fail)
3	2	1
5	3	2
7	4	3

Split-Brain Prevention: In a network partition of a 5-node cluster splitting into groups of 2 and 3, only the group of 3 (the majority) can elect a leader and process writes. The group of 2 will stop serving writes, preventing data divergence. This guarantees strict consistency over availability (CP system).

SLOs & Error Budgets

Metric	Target	Rationale
Read Latency	99% < 2ms	Reads are served from RAM by any replica.
Write Latency	99% < 10ms	Requires network round-trip for quorum consensus and disk fsync.
Availability	99.99%	Core infrastructure dependency; if ZK goes down, Kafka/Hadoop go down.

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
GC Pause on the Leader	Followers timeout waiting for heartbeats; false leader election is triggered.	Tune JVM GC (use G1GC or ZGC). Ensure session timeouts are generous enough to survive typical GC pauses.
Watch Storm (Thundering Herd)	A popular ZNode changes, and the server spends all its CPU sending notifications to thousands of clients, who then immediately reconnect.	Clients should jitter their reconnections. In modern designs, use tree-based or delegated watchers to distribute the load.
Disk I/O Contention	Write latency spikes because WAL fsyncs are waiting on disk.	ZooKeeper's WAL must be on a dedicated, high-IOPS SSD. Do not share the disk with swap or other heavy logging processes.

Cost Drivers (Staff lens)

Memory (entire dataset must fit in RAM)
High-performance SSDs for the Write-Ahead Log
Network bandwidth for cross-node replication and watch events

Multi-Region & DR

Usually deployed in 3 or 5 nodes within a single region or closely peered regions. Cross-global-region deployments suffer from high write latency because ZAB requires a synchronous majority quorum.

Interview Prompt

Clarifying Questions (ask before designing)

Scope

In scope

Out of scope (state explicitly)

Assumptions

1. The Data Model (Znode Tree)

2. ZAB Protocol (ZooKeeper Atomic Broadcast) ⭐

3. Read & Write Paths

4. Watches & Push Notifications

5. Persistence & Snapshots

In-Memory Data vs. Data Size

Read Scalability vs. Write Bottlenecks

ZAB vs. Raft vs. Paxos

CP vs AP (CAP Theorem)

Phase 1: Single Node In-Memory Store

Phase 2: High Availability & Replication

Phase 3: Watchers & Observer Nodes

SLOs & Error Budgets

Incident Scenarios (2am reality)

Cost Drivers (Staff lens)

Multi-Region & DR