What:
Write-Ahead Logging (WAL) is an append-only sequential log stored on disk.
Primary purpose:
Providing ACID transaction durability and crash recovery while keeping commit latency low.
Usually used for:
Relational databases, key-value stores (LSM-Trees), message brokers, and consensus state machines.
How should I think about this inside system architectures?
✍️ Log Intent Sequentially
Always record transaction intents to the append-only WAL first. Only update complex in-memory indexes and database pages after the log is written.
⚡ Sequential > Random I/O
Appending sequentially to disk is fast, while updating scattered leaf pages inside B+Tree files requires random I/O, which on spinning disks means slow head seeks.
🔄 Recoverable State
If power fails, memory is lost, but the WAL survives. Scan the log on boot to redo committed transactions and undo uncommitted updates.
Needed When:
You require strict financial or transactional durability (ACID), fast write times, or real-time logical backup replication.
Avoids:
Silent database page corruptions, write bottleneck queues on live tables, and mismatched state machines after system crashes.
Optimizes For:
Transaction commit latency, hardware failure tolerance, point-in-time state recovery, and high write volume scaling.
In a WAL architecture, transactions write their intent to a sequential file in the synchronous path and defer the slower data file updates to background workers.
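The log-first ordering described above can be sketched as a toy key-value store. This is a minimal illustration under stated assumptions, not how any particular engine does it: the class name `TinyWalStore`, the JSON-lines record format, and the "drop a torn final record" rule are all hypothetical choices made for the example. Real engines use binary records with checksums.

```python
import json
import os


class TinyWalStore:
    """Toy key-value store: each write is logged and fsynced before it is applied."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.data = {}
        self._replay()
        # Append mode: new records always go at the end of the log.
        self.wal = open(wal_path, "a", encoding="utf-8")

    def _replay(self):
        """Rebuild in-memory state from the log (the boot-time recovery step)."""
        if not os.path.exists(self.wal_path):
            return
        good_bytes = 0
        with open(self.wal_path, "rb") as f:
            for raw in f:
                if not raw.endswith(b"\n"):
                    break  # incomplete final record from a crash mid-write
                try:
                    record = json.loads(raw)
                except ValueError:
                    break  # unreadable record; stop replaying here
                self.data[record["key"]] = record["value"]
                good_bytes += len(raw)
        # Drop any torn tail so later appends start on a clean record boundary.
        with open(self.wal_path, "r+b") as f:
            f.truncate(good_bytes)

    def put(self, key, value):
        # 1. Append the intent to the log and force it to stable storage.
        self.wal.write(json.dumps({"key": key, "value": value}) + "\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())
        # 2. Only now update the in-memory state.
        self.data[key] = value

    def close(self):
        self.wal.close()
```

Constructing a second `TinyWalStore` on the same path after a simulated crash rebuilds `data` from the log alone, which is the property the synchronous log write buys.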
The Crash Recovery Pipeline
When a database process crashes, its in-memory pages are lost. On boot, the engine replays the log forward from the last known checkpoint, relying on the following mechanisms:
- Log Sequence Number (LSN): An ever-increasing unique integer assigned to every WAL record, giving the log a strict total order.
- Sequential Disk I/O: Avoids mechanical seek latency by only appending records to the end of the log file.
- Checkpointing: A background sweep flushes dirty in-memory pages to the data files, letting the engine safely truncate WAL segments older than the checkpoint.
- Fsync Write Frequency Matrix: How often the log is physically flushed to disk trades durability against latency and throughput:
| Policy | Durability | Latency | Throughput |
|---|---|---|---|
| fsync every commit | Max (zero data loss) | High (~5-15ms per write) | Low |
| fsync every N seconds (periodic flush) | Medium (lose up to N sec) | Low | High (batched sequential writes) |
| No fsync (OS buffered) | Low (all dirty buffers at risk) | Lowest (sub-millisecond) | Highest |
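The three policies in the matrix can be sketched as a small log writer. The class name `WalWriter` and the policy names `"always"`, `"interval"`, and `"never"` are hypothetical labels for this example. As a simplification, the interval policy only checks the clock when a record is appended, whereas a real engine would typically flush from a background timer.

```python
import os
import time


class WalWriter:
    """Appends records and applies one of three flush policies (names are illustrative)."""

    def __init__(self, path, policy="always", interval_s=1.0):
        if policy not in ("always", "interval", "never"):
            raise ValueError(f"unknown policy: {policy}")
        self.policy = policy
        self.interval_s = interval_s
        self.fsync_calls = 0
        self._last_sync = time.monotonic()
        self._f = open(path, "ab")

    def append(self, payload: bytes):
        self._f.write(payload + b"\n")
        self._f.flush()  # hand the bytes to the OS page cache
        if self.policy == "always":
            self._sync()  # durable before the caller is acknowledged
        elif self.policy == "interval":
            # Simplification: flush on append once the interval has elapsed.
            if time.monotonic() - self._last_sync >= self.interval_s:
                self._sync()
        # "never": rely on the OS to write back dirty buffers eventually.

    def _sync(self):
        os.fsync(self._f.fileno())
        self.fsync_calls += 1
        self._last_sync = time.monotonic()

    def close(self):
        self._f.close()
```

Counting `fsync_calls` for the same workload under each policy makes the trade-off concrete: one flush per record for `"always"`, far fewer for `"interval"`, and none for `"never"`.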
| Benefit | Cost |
|---|---|
| Guaranteed Durability (acknowledged writes survive a crash because their intents were logged first) | Disk Hotspots (the WAL disk is a highly concurrent single point of write pressure) |
| Ultra-Fast Writes (replaces heavy, random B+Tree page writes with O(1) sequential appends) | Disk Space Consumption (un-checkpointed logs can quickly eat gigabytes of storage) |
| Replication Stream & Point-In-Time Recovery (simplifies log-shipping replication and logical rollback) | Recovery Startup Delay (crashed databases can take minutes to replay WAL at startup) |
Problem: Running a high-concurrency write workload with an fsync on every commit forces the storage device to flush repeatedly, choking throughput and pushing commit latency to ~15ms.
Mitigation: Implement Group Commit (batching multiple concurrent transactions into a single physical WAL sync) or use high-speed NVMe storage or controllers with battery-backed write caches.
Problem: When database checkpoint sweeps trigger, the engine flushes massive amounts of dirty in-memory pages to database files on disk, saturating the disk controller and inducing user latency spikes.
Mitigation: Tune rate-limited checkpointing parameters (e.g. Postgres's checkpoint_completion_target) to spread page flushes continuously over time instead of in single burst storms.
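The idea of spreading flushes over time can be illustrated with a short sketch. The helper `paced_batches` is hypothetical and is not how Postgres implements `checkpoint_completion_target`. It only shows the principle of splitting one burst of dirty pages into near-equal slices, one per scheduling tick.

```python
def paced_batches(dirty_pages, ticks):
    """Split dirty pages into `ticks` near-equal batches, one batch per tick."""
    if ticks < 1:
        raise ValueError("ticks must be at least 1")
    pages = list(dirty_pages)
    base, extra = divmod(len(pages), ticks)
    batches = []
    start = 0
    for i in range(ticks):
        # The first `extra` batches take one additional page each.
        size = base + (1 if i < extra else 0)
        batches.append(pages[start:start + size])
        start += size
    return batches
```

A checkpointer built on this would flush one batch, sleep until the next tick, and repeat, so the disk sees a steady trickle instead of a single storm.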
Problem: A long-running transaction or a stalled replication consumer prevents the database engine from advancing checkpoints and recycling old WAL, letting WAL files accumulate until disk space is exhausted.
Mitigation: Deploy disk alert monitors and cap WAL retention (e.g., PostgreSQL's max_wal_size, a soft limit that triggers checkpoints, and max_slot_wal_keep_size, which bounds the WAL held back by replication slots).
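A monitor for this failure mode can be sketched as follows, assuming LSNs are byte offsets into the WAL stream (as they are in PostgreSQL). The function names and the alert threshold are hypothetical. The oldest position still needed by the checkpoint or by any consumer determines how much WAL must be kept.

```python
def retained_wal_bytes(current_lsn, checkpoint_lsn, consumer_lsns):
    """Bytes of WAL that cannot be recycled yet (LSNs treated as byte offsets)."""
    # WAL must be kept back to the oldest position anyone still needs.
    oldest_needed = min([checkpoint_lsn, *consumer_lsns])
    return current_lsn - oldest_needed


def should_alert(current_lsn, checkpoint_lsn, consumer_lsns, limit_bytes):
    """True when retained WAL exceeds the configured (hypothetical) limit."""
    return retained_wal_bytes(current_lsn, checkpoint_lsn, consumer_lsns) > limit_bytes
```

A single stalled consumer dominates the result: with a checkpoint at 900 and consumers at 950 and 200, everything from 200 onward is pinned no matter how recent the checkpoint is.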
| Use Case | How WAL Is Used |
|---|---|
| Payment Transaction Ledger | WAL with synchronous fsync on every commit to prevent double-spending |
| Redis Append-Only File (AOF) | WAL layer written asynchronously to disk to reconstruct Redis cache on reboot |
| LSM-Tree Storage Engine (RocksDB) | Sequential WAL protecting in-memory MemTable before data is flushed to SSTables |
| Distributed Consensus Logs (Raft/Paxos) | Raft log acts as the WAL replicated to a majority quorum before execution |
| Database Change Data Capture (CDC) | CDC pipelines stream updates by direct-parsing the database engine's binary WAL |
- You are designing storage engines, transactional ledgers, or custom databases.
- You must support real-time point-in-time data recovery (PITR).
- You require absolute guarantees that "acknowledged" writes survive complete hardware power failures.
- You need to replicate transactions across network nodes reliably (log shipping / streaming replication).
- You face heavy write workloads and want to avoid random-access disk write overhead in the transaction path.
- LSM-Tree Storage Engines (utilizing a sequential WAL paired with SSTables)
- Raft Log Consensus (replicating WAL updates to a quorum to reach global agreement)
- Postgres Streaming Replication (shipping WAL records over the network, with progress tracked by LSN)
- Point-in-Time Recovery (PITR) (replaying archived WAL files to reconstruct the database state at a chosen moment)
- ACID Guarantees (the foundation of atomic rollback and database durability)
Fsync Group Commit Internals
To prevent database performance from falling off a cliff under heavy write concurrency, database engines implement Group Commit. When a transaction commits, instead of calling fsync immediately, it joins a wait queue. The leading thread waits for a short, microsecond-scale window (e.g. Postgres's commit_delay) so that concurrent commits can join the batch. It then issues a single physical disk flush for the entire batch, so the cost of one fsync is shared by every transaction in it.
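The leader/follower pattern described here can be sketched with a condition variable. This is an illustrative sketch, not any engine's implementation: `GroupCommitLog`, `sync_fn` (a stand-in for "write the batch and fsync"), and `delay_s` (a stand-in for a `commit_delay`-style window) are hypothetical names, and error handling for a failed flush is omitted.

```python
import threading
import time


class GroupCommitLog:
    """One thread at a time acts as leader and flushes everything queued so far."""

    def __init__(self, sync_fn, delay_s=0.0):
        self._sync_fn = sync_fn      # stand-in for writing a batch and calling fsync
        self._delay_s = delay_s      # window in which more commits may join the batch
        self._cond = threading.Condition()
        self._pending = []           # records not yet handed to sync_fn
        self._next_seq = 0           # sequence number for the next record
        self._flushed_seq = -1       # highest sequence number known durable
        self._flushing = False       # True while a leader owns the flush
        self.sync_calls = 0

    def commit(self, record):
        with self._cond:
            seq = self._next_seq
            self._next_seq += 1
            self._pending.append(record)
            # Follower path: wait while a leader is flushing.
            while self._flushing and self._flushed_seq < seq:
                self._cond.wait()
            if self._flushed_seq >= seq:
                return  # a leader's flush already covered this record
            self._flushing = True  # become the leader
        if self._delay_s:
            time.sleep(self._delay_s)  # let concurrent commits join the batch
        with self._cond:
            batch, self._pending = self._pending, []
            last_seq = self._next_seq - 1
        self._sync_fn(batch)  # one physical flush for the whole batch
        with self._cond:
            self.sync_calls += 1
            self._flushed_seq = last_seq
            self._flushing = False
            self._cond.notify_all()
```

With a single caller every commit pays for its own flush. With many concurrent callers and a non-zero `delay_s`, `sync_calls` is usually far smaller than the number of commits, though the exact batch sizes depend on thread scheduling.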
ARIES Recovery Algorithm Internals
Most transaction recovery systems are based on the **ARIES (Algorithms for Recovery and Isolation Exploiting Semantics)** model, executing three phases during boot recovery:
- Analysis Phase: Scan the WAL forward starting from the last checkpoint to identify active transactions (loser list) and dirty pages in memory at the moment of the crash.
- REDO Phase: Replay all logged operations forward (both committed and uncommitted) to return the database state to the exact point of crash.
- UNDO Phase: Scan the log backward to roll back (reverse) every transaction that was active but uncommitted at the time of the crash (the loser list), ensuring database atomicity.
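The three phases can be sketched over an in-memory log. This is a heavily simplified illustration, not full ARIES: it omits checkpoints, the dirty page table, per-page LSN comparisons, and compensation log records. The `LogRecord` layout and the `recover` function are hypothetical.

```python
from typing import NamedTuple, Optional


class LogRecord(NamedTuple):
    lsn: int
    txn: str
    kind: str                     # "update" or "commit"
    key: Optional[str] = None
    before: Optional[int] = None  # before-image; None means the key did not exist
    after: Optional[int] = None   # after-image


def recover(log, pages):
    """Return (recovered pages, loser transactions) for a crash after the last record."""
    pages = dict(pages)

    # Analysis: any transaction without a commit record is a loser.
    seen, committed = set(), set()
    for rec in log:
        seen.add(rec.txn)
        if rec.kind == "commit":
            committed.add(rec.txn)
    losers = seen - committed

    # Redo: replay every update forward, committed or not.
    for rec in log:
        if rec.kind == "update":
            pages[rec.key] = rec.after

    # Undo: walk backward and restore before-images of loser updates.
    for rec in reversed(log):
        if rec.kind == "update" and rec.txn in losers:
            if rec.before is None:
                pages.pop(rec.key, None)
            else:
                pages[rec.key] = rec.before
    return pages, losers
```

Redoing loser updates before undoing them looks wasteful, but it brings the pages to the exact crash-time state first, so the undo pass can reverse updates in a well-defined order.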
Streaming Replication & Log Shipping
In high-availability configurations, WAL is shipped across network nodes to keep replica databases synchronized:
- File-Based Log Shipping: The primary database writes complete WAL segment files (typically 16MB) and sends each one to replicas once it is full. Replicas can therefore lag by as much as the time it takes to fill a segment.
- Streaming Replication: Replicas connect directly to the primary's WAL stream and receive records as they are written, tracking progress by Log Sequence Number (LSN), so they lag only by network and apply time rather than by a whole segment.
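LSN-based catch-up can be sketched with two in-memory objects. The `Primary` and `Replica` classes are hypothetical and ignore networking, acknowledgements, and failover. The point is only that a replica remembers the last LSN it applied and asks for everything after it.

```python
class Primary:
    """Holds the authoritative data and an ordered WAL of (lsn, key, value)."""

    def __init__(self):
        self.wal = []
        self.data = {}

    def write(self, key, value):
        lsn = len(self.wal) + 1  # LSNs are 1, 2, 3, ... in this sketch
        self.wal.append((lsn, key, value))
        self.data[key] = value
        return lsn

    def records_after(self, lsn):
        # Record with LSN n lives at index n - 1, so this returns LSNs > lsn.
        return self.wal[lsn:]


class Replica:
    """Applies WAL records in order and remembers how far it has got."""

    def __init__(self):
        self.data = {}
        self.applied_lsn = 0

    def catch_up(self, primary):
        for lsn, key, value in primary.records_after(self.applied_lsn):
            self.data[key] = value
            self.applied_lsn = lsn
        return self.applied_lsn
```

Replication lag in this model is simply `len(primary.wal) - replica.applied_lsn`, which is the quantity streaming replication tries to keep close to zero.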
Redis Append-Only File (AOF) Compaction
In Redis, the AOF plays the role of the WAL. Because Redis appends every mutating command sequentially, the AOF grows quickly over time. To keep disk usage bounded, Redis runs an **AOF Rewrite** in the background: it forks a child process, which scans the current in-memory dataset and writes the minimum set of commands needed to reproduce it (e.g., compacting 100 increments into a single set command), and the result then replaces the old log.
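The compaction idea can be sketched with a tiny command log. This is not Redis code: the tuple-based command format and the functions `replay` and `rewrite_from_state` are hypothetical, and only three commands are modeled. As in the description above, the rewrite works from the current state rather than from the old log.

```python
def replay(commands):
    """Apply a command log and return the resulting key-value state."""
    state = {}
    for op, key, *args in commands:
        if op == "SET":
            state[key] = args[0]
        elif op == "INCR":
            state[key] = int(state.get(key, 0)) + 1
        elif op == "DEL":
            state.pop(key, None)
        else:
            raise ValueError(f"unsupported command: {op}")
    return state


def rewrite_from_state(state):
    """Emit the shortest log that reproduces `state`: one SET per live key."""
    return [("SET", key, value) for key, value in state.items()]
```

A log of one `SET`, a hundred `INCR` commands, and a key that was set and then deleted collapses to a single `SET`, and replaying the rewritten log yields the same state as replaying the original.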