What:
Structured Relational databases (PostgreSQL, MySQL) mapping strict tables vs unstructured/semi-structured Non-Relational stores (Cassandra, DynamoDB, MongoDB).
Primary purpose:
Selecting the optimal data persistence paradigm based on transaction safety requirements, scalability limits, and query dynamics.
Usually used for:
Transactional records, distributed session storage, real-time message streams, and heavy full-text search indexes.
How should I think about this inside system architectures?
💍 Relational Normalization
Eliminate duplication. Normalize records into distinct primary tables and join them on-the-fly using foreign key references.
🌾 Denormalized Wide-Column
Duplicate data aggressively. In NoSQL (Cassandra), write your schemas exactly to match your read query patterns (one table per query).
🏗️ Polyglot Persistence
Never select 'one DB for everything'. Deploy PostgreSQL for financials, Redis for sessions, and Cassandra for telemetry history.
Needed When:
Designing system backends, justifying storage structures, or sizing database write/read throughputs.
Avoids:
Database locks during peak traffic, relational JOIN melts under high scales, schema migrations downtime, and data corruption.
Optimizes For:
Data integrity, transactional consistency, write latency scaling, and flexible query structures.
Modern distributed platforms deploy a **Polyglot Persistence** pattern, routing different business models to the optimal database engines:
- Storage Paradigm Classifications: Structuring persistence bounds:
| Storage Style | Schema Model | Transaction Bound | Scaling Paradigm |
|---|---|---|---|
| Relational (SQL) | Strictly typed tabular schemas | Full ACID transaction support | Vertical scaling (rely on read replicas for reads) |
| Key-Value NoSQL | Schemaless key-value mappings | Atomic single-key updates | Horizontal scaling (hash partitioned arrays) |
| Wide-Column NoSQL | Column family groupings | Row-level atomicity | Horizontal masterless ring arrays |
| Benefit | Cost |
|---|---|
| Strict ACID Guarantees (relational consistency checks ensure zero double-spending or orphan references) | Horizontal Scaling Walls (splitting tables across multiple nodes requires distributed joints and 2PC coordination) |
| Scalable Write Concurrency (wide-column systems bypass locking completely using append-only LSM-trees, scaling linearly) | Limited Query Flexibility (unable to join tables on-the-fly; must pre-denormalize data schemas per read pattern) |
Problem: Executing queries with multiple nested tables `JOIN` statements scales quadratically. As tables hit millions of rows, joins trigger disk table scans, locking the database engine.
Mitigation: Implement strict indexing, pre-denormalize hot read columns, or offload heavy joint lookups to a caching layer.
Problem: Distributed NoSQL stores (like Cassandra) rely on asynchronous replication across nodes. Readers querying different nodes immediately after a write can read stale data.
Mitigation: Enforce QUORUM read/write consistency settings (W + R > N) inside Cassandra when strong consistency is required.
| Production System | Selected Engine | Architectural Rationale |
|---|---|---|
| Banking Ledger | PostgreSQL (SQL) | Financial updates demand strict ACID guarantees (Atomic money moves, Isolated records, and durable WAL locks). |
| Uber Driver Tracker | Cassandra / Redis (NoSQL) | High-velocity coordinate telemetry updates require high-write scaling, eventually matching eventual consistency. |
| Amazon Shopping Cart | DynamoDB (NoSQL) | Predictable latency, decentralized write availability, and absolute resilience override transaction relationships. |
- You are designing payment ledgers, inventory trackers, or booking systems where transactions MUST be ACID compliant.
- You must scale to handle write volumes exceeding 50,000 QPS (wide-column LSM append-only databases).
- You need to store unstructured logs, chat histories, dynamic product properties, or document indexes.
- Database Indexing & Query Optimization: Optimizing point lookup speeds in SQL.
- Sharding & Database Partitioning: Spreading SQL databases horizontally.
- Consistent Hashing: Hash rings in Cassandra Wide-Column clusters.
ACID vs BASE Paradigms
Databases represent two opposing design philosophies on the CAP theorem spectrum:
1. ACID (SQL Baseline)
- Atomicity: Entire transaction succeeds or rolls back.
- Consistency: Database transition matches all schema constraints.
- Isolation: Concurrent transactions do not bleed into each other.
- Durability: Committed transactions survive power crashes (via WAL log flush).
2. BASE (NoSQL Baseline)
- Basically Available: The database prioritizes accepting write commands over immediate synchronization.
- Soft State: Node replica values can drift over time without active lock enforcement.
- Eventual Consistency: Replicas eventually sync to store identical states after update streams stop.
Select ACID when consistency is absolute; choose BASE when write availability is the scaling priority.
Access-Pattern-First Data Modeling
In SQL interviews, candidates often start with entity diagrams (User, Post, Comment) and hope JOINs will work at scale. In NoSQL and high-scale SQL designs, flip the process: **list every read and write query first**, then design tables around those access paths.
1. Start from Queries, Not Entities
Write down the hot paths before drawing schemas: "fetch user timeline by user_id sorted by created_at DESC", "lookup order by order_id", "list products in category X under $50". Each distinct query pattern may need its own physical storage layout — one normalized table cannot serve all of them efficiently at billion-row scale.
2. Denormalize for Read Paths
If a feed query joins Users + Posts + LikeCounts on every page load, precompute and embed like_count and author_name directly into the Post row (or a feed-specific table). You pay extra write amplification on each like, but eliminate a three-table JOIN on every read — the classic read-heavy trade-off.
3. Composite Keys for Wide-Column Stores
In Cassandra or DynamoDB, the partition key determines which node holds the data; the sort key orders rows within that partition. Design the composite key to match your query: (user_id, created_at) for a per-user timeline, (category_id, price) for category-sorted product browse. A query that does not include the partition key triggers a full cluster scan — a design failure, not a tuning problem.
# Timeline table — one query, one table PRIMARY KEY ((user_id), created_at, post_id) → SELECT * FROM timeline WHERE user_id = ? ORDER BY created_at DESC LIMIT 20 # Separate table for the inverse query (posts by hashtag) PRIMARY KEY ((hashtag), created_at, post_id)
4. When to Duplicate Data Across Tables
- Different partition keys for the same entity: Store a user's posts under user_id for their profile and under hashtag for search — two write paths, zero JOINs at read time.
- Read replicas of hot fields: Copy display_name into every comment row so rendering a thread never hits the Users table.
- Materialized counters: Maintain a denormalized follower_count column updated asynchronously rather than COUNT(*) on every profile view.
Rule of thumb: duplicate when the read QPS on a JOIN path exceeds what a single node can serve, and the duplicated fields change infrequently or can tolerate brief staleness.