This problem appears in multiple sheets. Depth expectations increase as you progress:
| Track | What to demonstrate |
|---|---|
| Arch 50 | Demonstrate understanding of ACID properties, WAL (Write-Ahead Logging), and B-Tree indexing. |
| Arch 75 | Staff angles: MVCC (Multi-Version Concurrency Control), transaction isolation levels, and buffer pool management. |
Interview Prompt
Design a Relational Database (like PostgreSQL).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Are we designing a single-node engine or a distributed database? | Single-node focuses on disk I/O and locks; distributed focuses on consensus and replication. |
| What is the primary workload? | OLTP requires row-oriented storage and B-Trees; OLAP requires columnar storage. |
| Do we need to support strict serializability? | Dictates the complexity of our concurrency control implementation. |
Scope
In scope
- Storage engine and page layouts
- Write-Ahead Logging (WAL) for durability
- Concurrency control (MVCC)
- Index structures (B-Tree)
Out of scope (state explicitly)
- SQL parsing and query optimization engine
- Network protocol implementation
- Distributed consensus (focusing on single-node internals first)
Assumptions
- The database must guarantee ACID properties
- The working set size exceeds available RAM
- Storage is backed by SSDs/HDDs, not purely in-memory
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Relational Model: Store data in tables with rows and typed columns.
- SQL Support: Support standard SQL queries (SELECT, INSERT, UPDATE, DELETE, JOINs, aggregations).
- ACID Transactions: Atomicity, Consistency, Isolation, and Durability must be guaranteed.
- Indexing: Support B-Tree indexes to speed up point queries and range scans.
- Concurrency Control: Allow multiple clients to read and write simultaneously without locking the entire database (MVCC).
- Durability (Zero Data Loss): Committed transactions must survive power failures and crashes.
- High Performance: Must leverage in-memory caching (Buffer Pool) and sequential disk I/O to maximize throughput.
- Scalability: Vertical scaling is the primary model, though read-replicas should be supported for scaling read traffic.
- Crash Recovery: The system must recover to a consistent state rapidly after an unexpected shutdown.
A relational database consists of a query processing layer on top of a transactional storage engine.
1. Query Execution Pipeline
When a SQL query arrives, it goes through several stages:
- Parser: Verifies syntax and generates a Parse Tree.
- Analyzer: Checks semantics (do tables/columns exist? permissions?).
- Optimizer (Cost-Based): Generates multiple execution plans (e.g., Index Scan vs. Sequential Scan, Hash Join vs. Nested Loop Join) and estimates the I/O and CPU cost of each using statistics. It picks the cheapest plan.
- Executor: Processes the plan node-by-node (Volcano model), pulling tuples from the storage engine.
EXPLAIN ANALYZE SELECT * FROM users WHERE age > 30 ORDER BY name LIMIT 10;
Limit (cost=0.42..1.53 rows=10 width=45) (actual time=0.045..0.055 rows=10 loops=1)
-> Index Scan using users_name_idx on users (cost=0.42..111.45 rows=1000 width=45)
Filter: (age > 30)
Rows Removed by Filter: 15
Planning Time: 0.150 ms
Execution Time: 0.080 ms2. Buffer Pool (Shared Memory)
Disk I/O is slow. PostgreSQL allocates a large chunk of RAM called the shared_buffers. The database reads data from disk in fixed-size blocks (usually 8 KB) called Pages.
- When a query requests a row, the storage engine checks if the page containing that row is in the Buffer Pool.
- If it is (Cache Hit), it's returned immediately.
- If not (Cache Miss), the page is loaded from disk into the Buffer Pool, evicting an older page if necessary (using an eviction policy like Clock-Sweep or LRU).
- Modifications are made in-memory first. The page becomes "dirty".
3. Write-Ahead Log (WAL) & Durability ⭐
If the database crashes before dirty pages are flushed to disk, data is lost. Flushing 8 KB pages randomly to disk for every transaction is terribly slow. The solution is the WAL.
- Before modifying a page in the Buffer Pool, a small log entry describing the change is appended to the WAL in memory.
- On
COMMIT, the WAL is flushed to disk sequentially (which is extremely fast, even on HDDs). - The actual dirty data pages in the Buffer Pool are flushed lazily in the background by the
bgwriterprocess. - Rule: The WAL record must be on disk before the corresponding dirty data page is written to disk.
LSN 0/1A2B3C: Transaction ID: 5092 Resource Manager: Heap Action: INSERT Relation: users (filenode: 16384) Block: 42 Offset: 12 Tuple Data: (id=5, name='Bob', age=35)
4. Multi-Version Concurrency Control (MVCC) ⭐
"Readers must not block writers, and writers must not block readers." Instead of placing a lock on a row when updating it, PostgreSQL creates a completely new version of the row.
- Every row has two hidden columns:
xmin(transaction ID that created it) andxmax(transaction ID that deleted/updated it). - A transaction only sees rows where
xmin <= current_tx_idandxmaxis either 0 or> current_tx_id. - This provides Snapshot Isolation without requiring explicit read locks.
5. Indexes (B-Trees & Hash)
Indexes speed up lookups from O(N) sequential scans to O(log N). B-Trees (specifically B+Trees) are the default.
- Leaf nodes contain pointers (Tuple IDs) to the actual row locations in the heap files.
- Because the tree is balanced and highly branched, even tables with billions of rows only require a tree depth of 3 or 4 (i.e., only 3-4 disk reads are needed to find any row).
- PostgreSQL also supports Hash indexes, GiST (for geospatial), and GIN (for full-text search and JSONB arrays).
6. Connection Pooling (PgBouncer)
PostgreSQL forks a dedicated OS process for every client connection. This consumes ~10MB of RAM per connection and introduces process-forking overhead. At scale (e.g., thousands of microservices connecting), this leads to memory exhaustion.PgBouncer is used as a lightweight connection pooler sitting in front of PostgreSQL, multiplexing thousands of client connections onto a small pool of actual database connections.
7. High Availability & Replication
To survive a catastrophic node failure, PostgreSQL uses Replication.
- Physical Replication (Streaming): The primary database streams its WAL records to a standby replica byte-for-byte. The replica applies the WAL, keeping an exact block-level copy of the primary.
- Logical Replication: Replicates data based on decoding the WAL into SQL-like logical changes (INSERT, UPDATE). Useful for replicating specific tables or upgrading between major versions with zero downtime.
- Automated Failover: Tools like Patroni (using ZooKeeper/etcd) monitor the primary. If it dies, Patroni automatically promotes a standby to primary and updates the routing layer.
PostgreSQL doesn't use standard REST APIs; it uses a highly optimized, stateful binary TCP protocol called the PostgreSQL Wire Protocol.
Message Flow (Query)
// Frontend (Client) -> Backend (PostgreSQL)
// Query Message ('Q')
Q: "SELECT id, name FROM users WHERE age > 30;"
// Backend -> Frontend
// Row Description ('T')
T: [Field1: "id" (Int4), Field2: "name" (VarChar)]
// Data Row ('D')
D: [5, "Bob"]
// Data Row ('D')
D: [12, "Alice"]
// Command Complete ('C')
C: "SELECT 2"
// Ready For Query ('Z')
Z: (Transaction Status: Idle)This stateful connection model is why PgBouncer is required at scale. PgBouncer sits in front of the database, speaking this exact Wire Protocol to clients, but multiplexing those "Q" messages onto a very small pool of persistent backend connections.
Data isn't stored as JSON or CSV. It's stored in highly structured 8KB Blocks (Pages). This format aligns with OS filesystem blocks and allows the Buffer Pool to read/write exact 8KB chunks efficiently.
// Standard 8KB Page Layout in PostgreSQL
struct PageHeaderData {
uint64 pd_lsn; // Log Sequence Number (ties page to WAL)
uint16 pd_checksum; // Detects data corruption / bit rot
uint16 pd_flags; // Flag bits
uint16 pd_lower; // Offset to start of free space
uint16 pd_upper; // Offset to end of free space
uint16 pd_special; // Offset to special space (used by B-trees)
uint16 pd_pagesize_version;
};
struct ItemIdData { // Line Pointer Array (grows forwards)
unsigned lp_off:15; // Offset to the actual tuple data
unsigned lp_flags:2; // State of tuple (Normal, Redirect, Dead)
unsigned lp_len:15; // Length of tuple
};
// Actual Tuple Data (HeapTupleHeader) grows backwards from the end of the 8KB page.
// MVCC metadata is stored directly on the Tuple Header:
struct HeapTupleHeaderData {
TransactionId t_xmin; // Tx ID that inserted this tuple
TransactionId t_xmax; // Tx ID that deleted/updated this tuple
CommandId t_cmin_cmax;
ItemPointerData t_ctid; // Pointer to the newer version of this row (for updates)
};- Line Pointers (ItemIds): Grow forward from the header. They point to the exact byte offset where the row data starts.
- Tuple Data: Grows backward from the end of the page. The gap in the middle is "Free Space".
- MVCC Coherence: Because
t_xminandt_xmaxare stored directly on the Tuple Header on disk, PostgreSQL doesn't need a separate Undo Log. To check if a row is visible to a transaction, it just compares its current Transaction ID with the tuple'st_xmin/t_xmax.
If the server crashes (e.g. power failure), memory is wiped, meaning all dirty pages in the Buffer Pool are lost.
Recovery Process (ARIES algorithm principles)
- Upon startup, PostgreSQL locates the last valid Checkpoint. (A checkpoint is a moment where all dirty pages were safely flushed to disk, and the WAL can be safely truncated).
- It begins reading the WAL sequentially from that checkpoint forward.
- REDO Phase: It reapplies every change recorded in the WAL to the data pages, bringing the database exactly back to the state it was in at the moment of the crash.
- Transactions that were in-progress (no COMMIT record in the WAL) are rolled back natively due to MVCC (their XIDs are marked as aborted).
PostgreSQL MVCC vs. MySQL (InnoDB) MVCC
PostgreSQL stores all row versions directly in the main table (Heap). This causes table bloat and requires aggressive background VACUUMing. However, it makes rollbacks instantaneous (just abort the transaction ID).
MySQL stores old row versions in a separate Undo Log. The main table is kept clean, but long-running transactions can cause the Undo Log to grow massively, and rolling back requires physically applying undo operations.
B-Tree vs. LSM Tree (LevelDB/RocksDB)
B-Trees are optimized for read-heavy workloads but suffer from write-amplification and random I/O during massive random inserts. LSM Trees (used in Cassandra, RocksDB) are optimized for write-heavy workloads by writing sequentially to memory and flushing to immutable files, but they suffer from read-amplification due to compaction and searching multiple levels.
Synchronous vs. Asynchronous Replication
Async Replication: Primary commits locally and returns success immediately, then ships WAL to standby in the background. High performance, but risks data loss if the primary crashes before shipping.
Sync Replication: Primary waits for the standby to acknowledge writing the WAL before returning success to the client. Guarantees zero data loss (RPO=0), but doubles commit latency and halts writes if the standby network partitions.
Vertical Scaling vs. Sharding
Relational databases scale vertically (adding more CPU/RAM) very well. However, past a certain point, a single machine cannot handle the write throughput.Sharding (horizontal partitioning) splits data across multiple PostgreSQL nodes (e.g., using Citus). Sharding provides infinite scale but severely breaks standard SQL paradigms (cross-shard JOINs become incredibly slow, and distributed transactions require complex 2PC algorithms).
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: In-Memory + Append-Only Log
A basic engine that keeps all data in memory and writes an append-only log to disk for durability.
Key components: In-memory Hash Map · Append-only File
Move to next phase when: Data exceeds RAM; need structured disk access.
Phase 2: Paged Storage & Buffer Pool
Data is organized into fixed-size pages on disk. A Buffer Pool caches hot pages in memory.
Key components: Page Manager · Buffer Pool (LRU cache) · B-Tree Indexes
Move to next phase when: Need for concurrent transactions and ACID guarantees.
Phase 3: MVCC & WAL
Full implementation of Multi-Version Concurrency Control and Write-Ahead Logging for crash recovery.
Key components: MVCC Tuple visibility · WAL Manager · Vacuum Process
Move to next phase when: Read locks are blocking write throughput.
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Transaction Commit Latency | 99% < 5ms | fsync latency on modern NVMe drives is sub-millisecond; software overhead should be minimal. |
| Data Durability | 100% | A relational database cannot lose committed transactions. |
| Buffer Pool Hit Ratio | > 95% | Disk I/O is the primary bottleneck; hot data must stay in RAM. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Transaction ID Wraparound | System alerts approaching the maximum 32-bit transaction ID. | PostgreSQL stops accepting writes if wraparound is imminent. Must run an aggressive database-wide VACUUM FREEZE to mark old rows as universally visible. |
| Severe Table Bloat | Query latency increases; sequential scan times double. | Identify long-running transactions holding back the MVCC horizon. Kill them. Run pg_repack or VACUUM FULL to rewrite the table. |
| Disk I/O Saturation during Checkpointing | I/O wait metrics spike periodically. | Spread out checkpointing over a longer period (tuning checkpoint_completion_target) to avoid I/O spikes. |
Cost Drivers (Staff lens)
- High-IOPS SSDs for WAL and data files
- Large RAM allocations for the Buffer Pool
- CPU for sorting, hashing, and concurrent connections
Multi-Region & DR
Single-node databases don't scale natively across regions. Require logical replication or specialized distributed layers (like Citus or CockroachDB architectures) to achieve multi-region consensus.