Design a Relational Database (PostgreSQL)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Demonstrate understanding of ACID properties, WAL (Write-Ahead Logging), and B-Tree indexing.
Arch 75	Staff angles: MVCC (Multi-Version Concurrency Control), transaction isolation levels, and buffer pool management.

Interview Prompt

Design a Relational Database (like PostgreSQL).

Clarifying Questions (ask before designing)

Question	Why it matters
Are we designing a single-node engine or a distributed database?	Single-node focuses on disk I/O and locks; distributed focuses on consensus and replication.
What is the primary workload?	OLTP requires row-oriented storage and B-Trees; OLAP requires columnar storage.
Do we need to support strict serializability?	Dictates the complexity of our concurrency control implementation.

Scope

In scope

Storage engine and page layouts
Write-Ahead Logging (WAL) for durability
Concurrency control (MVCC)
Index structures (B-Tree)

Out of scope (state explicitly)

SQL parsing and query optimization engine
Network protocol implementation
Distributed consensus (focusing on single-node internals first)

Assumptions

The database must guarantee ACID properties
The working set size exceeds available RAM
Storage is backed by SSDs/HDDs, not purely in-memory

Relational Model: Store data in tables with rows and typed columns.
SQL Support: Support standard SQL queries (SELECT, INSERT, UPDATE, DELETE, JOINs, aggregations).
ACID Transactions: Atomicity, Consistency, Isolation, and Durability must be guaranteed.
Indexing: Support B-Tree indexes to speed up point queries and range scans.
Concurrency Control: Allow multiple clients to read and write simultaneously without locking the entire database (MVCC).

A relational database consists of a query processing layer on top of a transactional storage engine.


1. Query Execution Pipeline

When a SQL query arrives, it goes through several stages:

Parser: Verifies syntax and generates a Parse Tree.
Analyzer: Checks semantics (do tables/columns exist? permissions?).
Optimizer (Cost-Based): Generates multiple execution plans (e.g., Index Scan vs. Sequential Scan, Hash Join vs. Nested Loop Join) and estimates the I/O and CPU cost of each using statistics. It picks the cheapest plan.
Executor: Processes the plan node-by-node (Volcano model), pulling tuples from the storage engine.

SQL

EXPLAIN ANALYZE SELECT * FROM users WHERE age > 30 ORDER BY name LIMIT 10;

Limit  (cost=0.42..1.53 rows=10 width=45) (actual time=0.045..0.055 rows=10 loops=1)
  ->  Index Scan using users_name_idx on users  (cost=0.42..111.45 rows=1000 width=45)
        Filter: (age > 30)
        Rows Removed by Filter: 15
Planning Time: 0.150 ms
Execution Time: 0.080 ms
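The plan above can be read as a chain of pull-based operators. The following is a minimal sketch of the Volcano (iterator) model, assuming an in-memory array that stands in for the index scan; the `Row` and `Node` types and the `run_plan` helper are hypothetical names for illustration, not PostgreSQL's executor API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical row type for this sketch; not a PostgreSQL structure. */
typedef struct { const char *name; int age; } Row;

/* Every plan node exposes the same pull interface: next() returns the
 * next row, or NULL once the node is exhausted. */
typedef struct Node Node;
struct Node {
    const Row *(*next)(Node *self);
    Node *child;            /* input node; NULL for the leaf scan */
    const Row *rows;        /* scan only: rows already in name order */
    size_t nrows, pos;      /* scan only: table size and cursor */
    int min_age;            /* filter only: keep rows with age > min_age */
    size_t limit, emitted;  /* limit only */
};

static const Row *scan_next(Node *n) {
    return n->pos < n->nrows ? &n->rows[n->pos++] : NULL;
}

static const Row *filter_next(Node *n) {
    const Row *r;
    while ((r = n->child->next(n->child)) != NULL) {
        if (r->age > n->min_age) return r;   /* drop non-matching rows */
    }
    return NULL;
}

static const Row *limit_next(Node *n) {
    if (n->emitted >= n->limit) return NULL; /* stop pulling from the child */
    const Row *r = n->child->next(n->child);
    if (r != NULL) n->emitted++;
    return r;
}

/* Builds Limit -> Filter -> Scan, drains it, and returns the number of
 * rows produced. *scanned receives how many rows the scan had to read. */
size_t run_plan(const Row *rows, size_t nrows, int min_age,
                size_t limit, size_t *scanned) {
    assert(scanned != NULL);
    Node scan   = { .next = scan_next, .rows = rows, .nrows = nrows };
    Node filter = { .next = filter_next, .child = &scan, .min_age = min_age };
    Node lim    = { .next = limit_next, .child = &filter, .limit = limit };
    size_t out = 0;
    while (lim.next(&lim) != NULL) out++;
    *scanned = scan.pos;
    return out;
}
```

Because the Limit node stops calling its child once it has enough rows, the scan reads only as much of the input as needed, which is why the plan above can finish after touching a small fraction of the index.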

2. Buffer Pool (Shared Memory)

Disk I/O is slow, so PostgreSQL keeps recently used pages in a region of shared memory whose size is set by the shared_buffers parameter. Data moves between disk and memory in fixed-size blocks called Pages (8 KB by default).

When a query requests a row, the storage engine checks if the page containing that row is in the Buffer Pool.
If it is (Cache Hit), it's returned immediately.
If not (Cache Miss), the page is loaded from disk into the Buffer Pool, evicting an older page if necessary (PostgreSQL uses Clock-Sweep, an approximation of LRU).
Modifications are made in memory first. A modified page is marked "dirty" until it is written back to disk.
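The lookup and eviction steps above can be sketched as a tiny clock-sweep buffer pool. This is an illustration under stated assumptions: the pool has only three frames, `MAX_USAGE` is an assumed cap, and `BufferPool`, `pool_init`, and `pool_get` are hypothetical names rather than PostgreSQL functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NFRAMES   3     /* tiny pool so eviction is easy to observe */
#define MAX_USAGE 5     /* assumed cap on the per-frame usage counter */
#define NO_PAGE   (-1)

typedef struct {
    int  page_id;       /* disk page held by this frame, or NO_PAGE */
    int  usage;         /* bumped on access, decayed by the sweep */
    bool dirty;
} Frame;

typedef struct {
    Frame    frames[NFRAMES];
    size_t   hand;      /* position of the clock hand */
    unsigned hits, misses;
} BufferPool;

void pool_init(BufferPool *bp) {
    for (size_t i = 0; i < NFRAMES; i++)
        bp->frames[i] = (Frame){ NO_PAGE, 0, false };
    bp->hand = 0;
    bp->hits = bp->misses = 0;
}

/* Returns the index of the frame holding page_id, loading it on a miss. */
size_t pool_get(BufferPool *bp, int page_id) {
    assert(page_id >= 0);
    for (size_t i = 0; i < NFRAMES; i++) {
        if (bp->frames[i].page_id == page_id) {      /* cache hit */
            bp->hits++;
            if (bp->frames[i].usage < MAX_USAGE) bp->frames[i].usage++;
            return i;
        }
    }
    bp->misses++;                                    /* cache miss */
    for (;;) {
        size_t idx = bp->hand;
        Frame *f = &bp->frames[idx];
        bp->hand = (bp->hand + 1) % NFRAMES;
        if (f->usage == 0) {
            /* A real engine would write the old page back here if it is
             * dirty, then read the requested page from disk. */
            *f = (Frame){ page_id, 1, false };
            return idx;
        }
        f->usage--;   /* give the frame one more lap before eviction */
    }
}
```

Pages that are touched repeatedly build up a higher usage count and survive more passes of the clock hand, so frequently used pages tend to stay resident without the bookkeeping cost of a strict LRU list.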

3. Write-Ahead Log (WAL) & Durability ⭐

If the database crashes before dirty pages are flushed to disk, the changes on those pages are lost. Flushing 8 KB pages to random disk locations on every commit would be far too slow. The solution is the WAL.

Before a page in the Buffer Pool is modified, a small log record describing the change is appended to the WAL buffer in memory.
On COMMIT, the WAL is flushed to disk as a sequential append, which is much cheaper than random page writes, even on HDDs.
The dirty data pages themselves are written out lazily in the background by the background writer and checkpointer processes.
Rule: The WAL record must be on disk before the corresponding dirty data page is written to disk.

LSN 0/1A2B3C: 
  Transaction ID: 5092
  Resource Manager: Heap
  Action: INSERT
  Relation: users (filenode: 16384)
  Block: 42
  Offset: 12
  Tuple Data: (id=5, name='Bob', age=35)
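The WAL-before-data rule can be expressed as a comparison between a page's LSN and the WAL position known to be durable. The sketch below assumes a byte-offset LSN and replaces real disk I/O with counters; `Wal`, `Page`, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;   /* assumed: byte offset into the WAL stream */

typedef struct {
    Lsn insert_lsn;     /* end of WAL appended to the in-memory buffer */
    Lsn flushed_lsn;    /* end of WAL known to be durable on disk */
} Wal;

typedef struct {
    Lsn  page_lsn;      /* LSN of the last WAL record that touched the page */
    bool dirty;
} Page;

/* Appends a record of len bytes and returns the LSN just past it. */
Lsn wal_append(Wal *w, uint32_t len) {
    assert(len > 0);
    w->insert_lsn += len;
    return w->insert_lsn;
}

/* Log first, then change the page and stamp it with the record's LSN. */
void page_modify(Wal *w, Page *p, uint32_t record_len) {
    p->page_lsn = wal_append(w, record_len);
    p->dirty = true;
}

/* COMMIT path: stand-in for a sequential write followed by fsync. */
void wal_flush(Wal *w) {
    w->flushed_lsn = w->insert_lsn;
}

/* A dirty page may go to disk only once its WAL record is durable. */
bool page_can_be_written(const Wal *w, const Page *p) {
    return p->page_lsn <= w->flushed_lsn;
}
```

A background flusher built on this check would either skip a page whose LSN is ahead of `flushed_lsn` or force a WAL flush first; either way the log reaches disk before the data page does.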

4. Multi-Version Concurrency Control (MVCC) ⭐

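"Readers must not block writers, and writers must not block readers." Instead of overwriting a row in place and making readers wait on a lock, PostgreSQL writes a completely new version of the row and keeps the old version for transactions that still need it. Writers updating the same row still serialize on a row-level lock.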


Every row has two hidden columns: xmin (transaction ID that created it) and xmax (transaction ID that deleted/updated it).
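Simplified rule: a transaction sees a row version when xmin <= current_tx_id and xmax is either 0 or > current_tx_id. The real check also requires that the creating transaction committed and that the deleting transaction, if any, had not committed when the reader's snapshot was taken.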
This provides Snapshot Isolation without requiring explicit read locks.
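The simplified visibility rule can be written as a single predicate. This sketch implements only the comparison stated above and deliberately ignores commit status, in-progress transaction lists, and ID wraparound; `TupleVersion` and `tuple_visible` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;
#define INVALID_XID 0u   /* xmax of a version that has not been deleted */

typedef struct {
    Xid xmin;   /* transaction that created this version */
    Xid xmax;   /* transaction that deleted or replaced it, or 0 */
} TupleVersion;

/* Simplified rule: created at or before the reader, and not yet deleted
 * from the reader's point of view. */
bool tuple_visible(const TupleVersion *t, Xid current) {
    assert(t->xmin != INVALID_XID);
    bool created_before = t->xmin <= current;
    bool not_deleted = (t->xmax == INVALID_XID) || (t->xmax > current);
    return created_before && not_deleted;
}
```

For example, if transaction 100 inserts a row and transaction 105 updates it, the heap holds two versions: (xmin=100, xmax=105) and (xmin=105, xmax=0). Under this rule a reader with ID 102 sees only the first version and a reader with ID 110 sees only the second, so neither reader ever waits on the writer.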

5. Indexes (B-Trees & Hash)

Indexes speed up lookups from O(N) sequential scans to O(log N). B-Trees (specifically B+Trees) are the default.


Leaf nodes contain pointers (Tuple IDs) to the actual row locations in the heap files.
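Because the tree is balanced and has a high fan-out (hundreds of entries per 8 KB page), even tables with billions of rows need a tree depth of only 3 or 4. A lookup therefore reads 3-4 index pages plus one heap page, and the upper index levels are usually already cached in the Buffer Pool.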
PostgreSQL also supports Hash indexes, GiST (for geospatial), and GIN (for full-text search and JSONB arrays).
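The depth claim is simple arithmetic: the number of levels is the smallest d such that fanout^d covers the row count. The sketch below assumes a fan-out of 400 entries per page in the usage note, which is an illustrative figure and not a measured one; `btree_levels` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest number of levels such that fanout^levels >= rows. */
unsigned btree_levels(uint64_t rows, uint64_t fanout) {
    assert(fanout >= 2);
    unsigned levels = 1;
    uint64_t capacity = fanout;   /* entries reachable with `levels` levels */
    while (capacity < rows) {
        if (capacity > UINT64_MAX / fanout) {
            /* The next multiplication would overflow, so one more level
             * is certainly enough for any 64-bit row count. */
            levels++;
            break;
        }
        capacity *= fanout;
        levels++;
    }
    return levels;
}
```

With the assumed fan-out of 400, `btree_levels(1000000, 400)` is 3 and `btree_levels(1000000000ULL, 400)` is 4, since 400^3 is 64 million and 400^4 is about 25.6 billion.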

6. Connection Pooling (PgBouncer)

PostgreSQL forks a dedicated OS process for every client connection. Each process consumes roughly 10 MB of RAM and adds process-forking overhead. At scale (e.g., thousands of microservice instances connecting), this leads to memory exhaustion. PgBouncer is a lightweight connection pooler that sits in front of PostgreSQL and multiplexes thousands of client connections onto a small pool of actual database connections.

7. High Availability & Replication

To survive a catastrophic node failure, PostgreSQL uses Replication.

Physical Replication (Streaming): The primary database streams its WAL records to a standby replica byte-for-byte. The replica applies the WAL, keeping an exact block-level copy of the primary.
Logical Replication: Decodes the WAL into row-level logical changes (INSERT, UPDATE, DELETE) and replays them on the subscriber. Useful for replicating specific tables or upgrading between major versions with minimal downtime.
Automated Failover: Tools like Patroni (using ZooKeeper/etcd) monitor the primary. If it dies, Patroni automatically promotes a standby to primary and updates the routing layer.

Data isn't stored as JSON or CSV. It's stored in highly structured 8 KB blocks (Pages). The page size is a multiple of common filesystem block sizes (typically 4 KB), and the Buffer Pool reads and writes whole 8 KB pages.

C

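// Simplified 8 KB page layout in PostgreSQL.
// The typedefs below are included so the sketch compiles on its own.

#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint32_t CommandId;

typedef struct ItemPointerData {   // Tuple ID: (block number, line pointer index)
    uint16_t bi_hi;                // Block number, high 16 bits
    uint16_t bi_lo;                // Block number, low 16 bits
    uint16_t ip_posid;             // Line pointer index within the block (1-based)
} ItemPointerData;

struct PageHeaderData {
    uint64_t pd_lsn;               // LSN of the last WAL record that changed this page
    uint16_t pd_checksum;          // Detects data corruption / bit rot
    uint16_t pd_flags;             // Flag bits
    uint16_t pd_lower;             // Offset to start of free space
    uint16_t pd_upper;             // Offset to end of free space
    uint16_t pd_special;           // Offset to special space (used by B-trees)
    uint16_t pd_pagesize_version;  // Page size and layout version
    TransactionId pd_prune_xid;    // Oldest prunable XID on the page, or 0 if none
};                                 // The line pointer array follows the header

struct ItemIdData {                // One line pointer; the array grows forward
    unsigned lp_off:15;            // Offset of the tuple from the start of the page
    unsigned lp_flags:2;           // State of tuple (Unused, Normal, Redirect, Dead)
    unsigned lp_len:15;            // Length of the tuple in bytes
};

// Tuple data grows backward from the end of the 8 KB page.
// MVCC metadata is stored directly in the tuple header (abridged here):
struct HeapTupleHeaderData {
    TransactionId t_xmin;          // Tx ID that inserted this tuple
    TransactionId t_xmax;          // Tx ID that deleted/updated this tuple (0 if none)
    CommandId t_cmin_cmax;         // Command ID within the inserting/deleting transaction
    ItemPointerData t_ctid;        // Points to the newer version after an UPDATE, else to itself
};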

Line Pointers (ItemIds): Grow forward from the header. They point to the exact byte offset where the row data starts.
Tuple Data: Grows backward from the end of the page. The gap in the middle is "Free Space".
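MVCC Coherence: Because t_xmin and t_xmax are stored directly in the tuple header on disk, old row versions stay in the heap and PostgreSQL doesn't need a separate Undo Log. To check whether a row version is visible, it compares the tuple's t_xmin/t_xmax against the transaction's snapshot and the commit status of those transactions. The trade-off is that dead versions accumulate until VACUUM removes them.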
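The way line pointers and tuple data grow toward each other can be sketched with two offsets that play the roles of pd_lower and pd_upper. This is a simplified model: it ignores tuple alignment and tuple headers, stores each line pointer as a plain (offset, length) pair, and uses hypothetical names (`SketchPage`, `page_add_item`, `page_get_item`).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE     8192
#define HEADER_SIZE   24   /* bytes reserved for the page header */
#define LINE_PTR_SIZE 4    /* one (offset, length) pair of uint16_t */

typedef struct {
    uint16_t lower;             /* end of line pointers = start of free space */
    uint16_t upper;             /* start of tuple data  = end of free space */
    uint8_t  bytes[PAGE_SIZE];
} SketchPage;

void page_init(SketchPage *p) {
    memset(p->bytes, 0, PAGE_SIZE);
    p->lower = HEADER_SIZE;
    p->upper = PAGE_SIZE;
}

uint16_t page_free_space(const SketchPage *p) {
    return (uint16_t)(p->upper - p->lower);
}

/* Stores a tuple and returns its 1-based slot number, or 0 if it won't fit. */
uint16_t page_add_item(SketchPage *p, const void *tuple, uint16_t len) {
    assert(len > 0);
    if ((uint32_t)len + LINE_PTR_SIZE > page_free_space(p)) return 0;
    p->upper = (uint16_t)(p->upper - len);     /* tuple data grows backward */
    memcpy(p->bytes + p->upper, tuple, len);
    uint16_t ptr[2] = { p->upper, len };
    memcpy(p->bytes + p->lower, ptr, LINE_PTR_SIZE);
    p->lower = (uint16_t)(p->lower + LINE_PTR_SIZE);  /* pointers grow forward */
    return (uint16_t)((p->lower - HEADER_SIZE) / LINE_PTR_SIZE);
}

/* Follows a line pointer to the tuple bytes and reports the tuple length. */
const void *page_get_item(const SketchPage *p, uint16_t slot, uint16_t *len) {
    uint16_t ptr[2];
    assert(slot >= 1 && HEADER_SIZE + slot * LINE_PTR_SIZE <= p->lower);
    memcpy(ptr, p->bytes + HEADER_SIZE + (slot - 1) * LINE_PTR_SIZE,
           LINE_PTR_SIZE);
    *len = ptr[1];
    return p->bytes + ptr[0];
}
```

Because a tuple is addressed by its slot number and not by its byte offset, the engine can move tuple data around inside the page (for example when compacting free space) without invalidating index entries that point at the slot.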

SLOs & Error Budgets

Metric	Target	Rationale
Transaction Commit Latency	p99 < 5 ms	fsync latency on modern NVMe drives is typically sub-millisecond; software overhead should be minimal.
Data Durability	100% of committed transactions	A relational database cannot lose committed transactions.
Buffer Pool Hit Ratio	> 95%	Disk I/O is the primary bottleneck; hot data must stay in RAM.

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Transaction ID Wraparound	Alerts on the age of the oldest unfrozen transaction ID (age(datfrozenxid)) approaching the ~2 billion limit imposed by 32-bit transaction IDs.	PostgreSQL stops accepting writes if wraparound is imminent. Run an aggressive database-wide VACUUM FREEZE to mark old rows as universally visible.
Severe Table Bloat	Query latency increases; sequential scan times double.	Identify long-running transactions holding back the MVCC horizon and terminate them. Run pg_repack or VACUUM FULL to rewrite the table (VACUUM FULL holds an exclusive lock on the table while it runs).
Disk I/O Saturation during Checkpointing	I/O wait metrics spike periodically.	Spread out checkpointing over a longer period (tuning checkpoint_completion_target) to avoid I/O spikes.
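The wraparound scenario in the table above follows from comparing 32-bit transaction IDs in modulo-2^32 arithmetic, where each ID has roughly 2 billion IDs "in the past" and 2 billion "in the future". The sketch below shows that comparison and a simple age check; the function names and the alert threshold are hypothetical, and the special reserved transaction IDs are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;

/* True if a is logically older than b on the circular 32-bit ID space.
 * The unsigned difference is reinterpreted as signed, so IDs up to about
 * 2^31 behind b count as older, including across the wrap point. */
bool xid_precedes(Xid a, Xid b) {
    return (int32_t)(a - b) < 0;
}

/* Number of transaction IDs consumed since the oldest unfrozen one. */
uint32_t xid_age(Xid next_xid, Xid oldest_unfrozen) {
    return next_xid - oldest_unfrozen;   /* wraps correctly modulo 2^32 */
}

/* Fires when the age reaches an operator-chosen threshold (assumed). */
bool wraparound_alert(Xid next_xid, Xid oldest_unfrozen, uint32_t threshold) {
    assert(threshold < 0x80000000u);     /* must stay below the 2^31 horizon */
    return xid_age(next_xid, oldest_unfrozen) >= threshold;
}
```

Once the age of the oldest unfrozen row approaches 2^31, a very old row would start to compare as being in the future and vanish from view, which is why freezing has to happen well before that point.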

Cost Drivers (Staff lens)

High-IOPS SSDs for WAL and data files
Large RAM allocations for the Buffer Pool
CPU for sorting, hashing, and concurrent connections

Multi-Region & DR

A single-node database does not scale across regions natively. Multi-region deployments need cross-region replication (physical or logical) for disaster recovery, or a specialized distributed layer (such as Citus, or a CockroachDB-style architecture) to support multi-region writes and consensus.
