Write-Ahead Logging (WAL) – System Design Core Concept

What:

Write-Ahead Logging (WAL) is a technique in which every change is first recorded in an append-only, sequential log on disk before it is applied to the main data files.

Primary purpose:

Providing transaction durability (the D in ACID) and crash recovery while keeping commit latency low, since a commit needs only a sequential log append rather than random page writes.

Usually used for:

Relational databases, key-value stores (LSM-Trees), message brokers, and consensus state machines.

How should I think about this inside system architectures?

✍️ Log Intent Sequentially

Always record transaction intents to the append-only WAL first. Only update complex in-memory indexes and database pages after the log is written.

⚡ Sequential > Random I/O

Appending sequentially to a log is far cheaper than updating scattered leaf pages inside B+Tree files: random writes incur disk-head seek time on spinning disks and, to a lesser degree, extra overhead on SSDs.

🔄 Recoverable State

If power fails, memory is lost, but the WAL survives. Scan the log on boot to redo committed transactions and undo uncommitted updates.
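The three ideas above can be sketched as a toy key-value store. This is a minimal illustration, not how any particular engine does it: the class name `TinyWalStore`, the JSON-lines record format, and the file path are all assumptions made for the example. A real engine would also checksum each record and truncate a torn tail before appending again.

```python
import json
import os


class TinyWalStore:
    """Toy key-value store: every write hits the log before the in-memory dict."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.data = {}
        self._replay()  # rebuild in-memory state from the log on startup
        self._wal = open(wal_path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, "r", encoding="utf-8") as f:
            for line in f:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    break  # torn final record from a crash mid-append: stop here
                self.data[record["key"]] = record["value"]

    def put(self, key, value):
        # 1. Append the intent to the log and force it to stable storage.
        self._wal.write(json.dumps({"key": key, "value": value}) + "\n")
        self._wal.flush()             # Python buffer -> OS page cache
        os.fsync(self._wal.fileno())  # OS page cache -> disk
        # 2. Only now update the in-memory structure.
        self.data[key] = value

    def close(self):
        self._wal.close()
```

Reopening the same path after a simulated crash (for example, constructing a second `TinyWalStore` without carrying over any in-memory state) rebuilds `data` purely from the log.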

Log Sequence Number (LSN): A monotonically increasing identifier assigned to every WAL record, used to order records and to track how far the log has been flushed, applied, or replicated.
Sequential Disk I/O: Avoids seek latency on spinning disks by only appending records to the end of the log file.
Checkpointing: A background sweep flushes dirty in-memory pages to disk, letting the engine safely recycle or truncate WAL segments older than the checkpoint.
Fsync Write Frequency Matrix: How often the log is forced to disk with `fsync` trades durability against latency and throughput:

Policy	Durability	Latency	Throughput
`fsync` every commit	Max (no loss of acknowledged commits)	High (~5-15ms per write on spinning disks; typically far less on SSDs)	Low
`fsync` every N seconds (periodic flush)	Medium (lose up to N sec of commits)	Low	High (batched sequential writes)
No `fsync` (OS buffered)	Low (all unflushed data at risk)	Lowest (sub-millisecond)	Highest
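The three policies in the table can be sketched as one log writer with a switch. This is a rough sketch under assumptions: the class `LogWriter`, the policy names `"always"`, `"interval"` and `"never"`, and the `fsync_calls` counter are invented for the example. For simplicity the interval check runs on the append path, whereas real engines usually sync from a background thread or timer.

```python
import os
import time


class LogWriter:
    """Appends records and syncs by policy: 'always', 'interval', or 'never'."""

    def __init__(self, path, policy="always", interval_s=1.0):
        if policy not in ("always", "interval", "never"):
            raise ValueError(f"unknown policy: {policy}")
        self.policy = policy
        self.interval_s = interval_s
        self.fsync_calls = 0  # counter kept only so the policies can be compared
        self._last_sync = time.monotonic()
        self._f = open(path, "ab")

    def append(self, payload: bytes):
        self._f.write(payload + b"\n")
        self._f.flush()  # hand the bytes to the OS; not yet durable
        if self.policy == "always":
            self._sync()
        elif self.policy == "interval":
            if time.monotonic() - self._last_sync >= self.interval_s:
                self._sync()
        # 'never': rely on the OS to write back dirty pages eventually

    def _sync(self):
        os.fsync(self._f.fileno())  # block until the device reports the data stored
        self.fsync_calls += 1
        self._last_sync = time.monotonic()

    def close(self):
        self._f.close()
```

Comparing `fsync_calls` after the same number of appends under each policy shows where the latency in the table comes from: `"always"` pays one sync per record, the other two amortize or skip it.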

Benefit	Cost
Durability (committed transactions survive a crash, provided the log was flushed before the commit was acknowledged)	Write Hotspot (the WAL device is a single point of highly concurrent write pressure)
Fast Writes (replaces heavy, random B+Tree page writes with O(1) sequential appends)	Disk Space Consumption (un-checkpointed logs can quickly consume gigabytes of storage)
Replication Stream & Point-In-Time Recovery (simplifies log-shipping replication and restoring to a chosen point in time)	Recovery Startup Delay (a crashed database can take minutes to replay the WAL at startup)

Problem	Usage
Payment Transaction Ledger	WAL with synchronous `fsync` on every commit so that an acknowledged payment is never lost in a crash
Redis Append-Only File (AOF)	Command log flushed to disk asynchronously (by default about once per second) to rebuild the Redis dataset on restart
LSM-Tree Storage Engine (RocksDB)	Sequential WAL protecting in-memory MemTable before data is flushed to SSTables
Distributed Consensus Logs (Raft/Paxos)	The replicated log acts as the WAL: an entry is persisted on a majority quorum before it is applied to the state machine
Database Change Data Capture (CDC)	CDC pipelines stream updates by parsing the database engine's binary WAL

Fsync Group Commit Internals

To keep throughput from collapsing under heavy write concurrency, database engines implement Group Commit. When a transaction is ready to commit, instead of calling fsync immediately, it joins a wait queue. The leading thread waits for a short window, typically measured in microseconds (e.g. PostgreSQL's commit_delay), so that concurrent commits can join the batch. It then issues a single physical disk flush for the entire batch, so the cost of one fsync is shared by every transaction in it.
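The leader/follower pattern described above might look roughly like this. It is a simplified sketch, not any engine's actual implementation: `GroupCommitLog`, `commit_delay_s`, and `fsync_calls` are hypothetical names, and the leader holds the lock while it flushes, which a production engine would avoid so that new commits can queue during the fsync.

```python
import os
import threading
import time


class GroupCommitLog:
    def __init__(self, path, commit_delay_s=0.002):
        self._f = open(path, "ab")
        self._cond = threading.Condition()
        self._pending = []          # records waiting for the next flush
        self._next_seq = 0          # sequence number for the next commit
        self._durable_seq = -1      # highest sequence number already on disk
        self._leader_active = False
        self.commit_delay_s = commit_delay_s
        self.fsync_calls = 0

    def commit(self, payload: bytes):
        with self._cond:
            seq = self._next_seq
            self._next_seq += 1
            self._pending.append(payload)
            if self._leader_active:
                # Follower: sleep until a leader has flushed our record.
                while self._durable_seq < seq:
                    self._cond.wait()
                return seq
            self._leader_active = True

        # Leader: wait briefly so concurrent commits can join the batch.
        time.sleep(self.commit_delay_s)
        with self._cond:
            batch, self._pending = self._pending, []
            last_seq = self._next_seq - 1
            self._f.write(b"".join(p + b"\n" for p in batch))
            self._f.flush()
            os.fsync(self._f.fileno())  # one physical flush for the whole batch
            self.fsync_calls += 1
            self._durable_seq = last_seq
            self._leader_active = False
            self._cond.notify_all()     # wake every follower in the batch
        return seq
```

Calling `commit` from several threads at once and then reading `fsync_calls` shows the effect: the number of flushes stays well below the number of commits whenever the commits overlap within the delay window.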

ARIES Recovery Algorithm Internals

Many transactional recovery systems are based on the **ARIES (Algorithms for Recovery and Isolation Exploiting Semantics)** model, which runs three phases during crash recovery:

Analysis Phase: Scan the WAL forward starting from the last checkpoint to identify active transactions (loser list) and dirty pages in memory at the moment of the crash.
REDO Phase: Replay all logged operations forward (both committed and uncommitted) to return the database state to the exact point of crash.
UNDO Phase: Scan the log backward to roll back all transactions that were still active (the loser list) at the time of the crash, ensuring atomicity.
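The three phases can be illustrated on an in-memory log. This is a heavily simplified sketch and not full ARIES: it has no checkpoints, page LSNs, or compensation log records, it replays from the start of the log, and it assumes locking kept two concurrent transactions from updating the same key. The record layout (`lsn`, `txn`, `type`, `key`, `before`, `after`) and the function name `recover` are assumptions made for the example.

```python
def recover(log, disk_state):
    """Three-pass recovery over `log`, a list of dict records ordered by LSN."""
    state = dict(disk_state)

    # Analysis: any transaction without a commit record is a loser.
    seen, committed = set(), set()
    for rec in log:
        seen.add(rec["txn"])
        if rec["type"] == "commit":
            committed.add(rec["txn"])
    losers = seen - committed

    # Redo: repeat history, reapplying every update regardless of outcome.
    for rec in log:
        if rec["type"] == "update":
            state[rec["key"]] = rec["after"]

    # Undo: walk backward, restoring the before-images written by losers.
    for rec in reversed(log):
        if rec["type"] == "update" and rec["txn"] in losers:
            if rec["before"] is None:
                state.pop(rec["key"], None)  # the key did not exist before
            else:
                state[rec["key"]] = rec["before"]

    return state, losers


example_log = [
    {"lsn": 1, "txn": "T1", "type": "update", "key": "A", "before": None, "after": 10},
    {"lsn": 2, "txn": "T2", "type": "update", "key": "B", "before": None, "after": 20},
    {"lsn": 3, "txn": "T1", "type": "commit"},
    {"lsn": 4, "txn": "T2", "type": "update", "key": "B", "before": 20, "after": 25},
]
state, losers = recover(example_log, {})
# state == {"A": 10}, losers == {"T2"}
```

T1 committed, so its write to `A` survives. T2 never committed, so redo first brings `B` to 25 (the state at the crash) and undo then removes it again.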

Streaming Replication & Log Shipping

In high-availability configurations, WAL is shipped across network nodes to keep replica databases synchronized:

File-Based Log Shipping: The primary database writes complete WAL segment files (typically 16MB) and ships each one to the replicas once it is full. Replicas can therefore lag by up to the time it takes to fill a segment.
Streaming Replication: Replicas connect directly to the primary's WAL stream and receive records as they are generated, tracking their position by Log Sequence Number (LSN). Replication lag typically drops to well under a second.
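The role of the LSN in keeping a replica in sync can be sketched in a few lines. This is an in-process illustration only: the `Primary` and `Replica` classes and the pull-based `records_after` call are hypothetical, whereas real streaming replication pushes records over a network connection.

```python
class Primary:
    def __init__(self):
        self.wal = []   # list of (lsn, key, value), LSNs are dense: 1, 2, 3, ...
        self.data = {}

    def write(self, key, value):
        lsn = len(self.wal) + 1
        self.wal.append((lsn, key, value))  # log first
        self.data[key] = value              # then apply
        return lsn

    def records_after(self, lsn):
        """Return every WAL record with an LSN greater than `lsn`."""
        return self.wal[lsn:]  # works because the record with LSN n sits at index n - 1


class Replica:
    def __init__(self):
        self.data = {}
        self.applied_lsn = 0  # resume point: last LSN applied locally

    def pull(self, primary):
        for lsn, key, value in primary.records_after(self.applied_lsn):
            self.data[key] = value
            self.applied_lsn = lsn
```

Because the replica remembers `applied_lsn`, it can disconnect and later resume from exactly where it left off, and pulling twice in a row applies nothing new.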

Redis Append-Only File (AOF) Compaction

In Redis, the AOF plays the role of the WAL. Because Redis appends every mutating command to it, the AOF grows steadily over time. To bound its size, Redis runs an **AOF Rewrite** in the background: it forks a child process, which walks the current in-memory dataset and writes the minimum set of commands needed to recreate it (e.g., 100 increments collapse into a single set command). The new file then replaces the old log.
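The compaction idea can be shown with a toy command log. This is an illustrative sketch, not Redis code: the tuple-based command format and the helpers `replay` and `rewrite` are assumptions, and `replay` here merely stands in for the live in-memory dataset that the forked child would read.

```python
def replay(commands):
    """Apply a list of (op, key, *args) commands to an empty dict."""
    state = {}
    for op, key, *args in commands:
        if op == "SET":
            state[key] = args[0]
        elif op == "INCR":
            state[key] = int(state.get(key, 0)) + 1
        elif op == "DEL":
            state.pop(key, None)
        else:
            raise ValueError(f"unsupported command: {op}")
    return state


def rewrite(state):
    """Emit one SET per live key: a minimal log that rebuilds `state`."""
    return [("SET", key, value) for key, value in state.items()]


history = (
    [("SET", "hits", 0)]
    + [("INCR", "hits")] * 100
    + [("SET", "tmp", "x"), ("DEL", "tmp")]
)
state = replay(history)      # {"hits": 100}
compacted = rewrite(state)   # [("SET", "hits", 100)]
```

The 103-command history shrinks to a single command, and replaying `compacted` yields the same state as replaying `history`.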