Design a Code Hosting Platform (GitHub)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Focus on how Git objects are stored and transferred over the network (SSH/HTTPS).
Arch 75	Staff angles: How to scale a monolithic Git repo (like Linux kernel), replication strategies (Gitaly), and handling massive clone traffic.

Interview Prompt

Design a Code Hosting Platform (like GitHub or GitLab).

Clarifying Questions (ask before designing)

Question	Why it matters
What scale are we designing for?	Millions of repos require a distributed storage tier, not just a single NFS mount.
Are we focusing on Git operations or the web UI?	Git operations (Push/Pull) are I/O and network intensive; the Web UI is a standard CRUD application.
How large are the repositories?	Large repos require delta compression, shallow clones, and specialized Git backend tuning.

Scope

In scope

Git Push and Pull operations over SSH/HTTPS
Distributed storage of Git repositories
Web interface for browsing code and commits
High availability and disaster recovery

Out of scope (state explicitly)

CI/CD pipelines and runners
Issue tracking and project management boards
Detailed GitHub Actions implementation

Assumptions

Read traffic (git pull / clone) heavily outweighs write traffic (git push)
Git is a decentralized system; the server acts as a central remote
Security and access control are paramount

Git Operations: Support git push, git pull, and git clone over SSH and HTTPS.
Web Interface: Users can browse the repository file tree, view commit history, and read file contents.
Pull Requests: Users can create PRs, diff code, and leave line-by-line comments.
Issues & Collaboration: Issue tracking, starring, and forking repositories.

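Platforms like GitHub and GitLab separate the Web/API tier (which handles UI, PRs, Issues, and DB metadata) from the Git Storage Tier. The component names used below (Gitaly, Sidekiq, PostgreSQL) follow GitLab's open-source stack; GitHub runs analogous in-house systems. We can structure this into three distinct layers: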

API / Gateway Layer: The HAProxy / Nginx load balancers and SSH entry nodes that terminate connections and route HTTP/SSH Git commands to the appropriate backend routers.
Service Layer: Includes the Web Fleet (Ruby on Rails apps handling the UI, Pull Requests, and Issues), the Git Router (maps repo URLs to internal storage nodes), and background worker fleets (CI/CD runners would also sit here, but they are out of scope for this design).
Data / Storage Layer: The most critical tier. It consists of PostgreSQL for relational metadata (users, PRs), Redis for caching and Sidekiq queues, and Gitaly (Git Storage Nodes) for highly-available, RPC-based storage of the actual Git commit DAGs.

1. Understanding Git Internals (Content-Addressable Storage)

Git is fundamentally a content-addressable file system. Every piece of data is hashed (SHA-1/SHA-256) and stored as an immutable object in a key-value store where the key is the hash.

Blob: The raw binary contents of a file. Filenames are NOT stored here.
Tree: A directory listing. It maps filenames and permissions to Blob SHAs or other sub-Tree SHAs.
Commit: Immutable metadata (author, message, timestamp, parent commit SHA) and a pointer to the root Tree SHA representing the state of the repository at that exact moment.
Ref (Branch/Tag): A mutable, human-readable pointer (e.g., refs/heads/main) pointing to a specific Commit SHA. Once the new objects have been uploaded, a push completes by updating this pointer (a small text file, or an entry in packed-refs).
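
To make the content-addressing idea concrete, here is a minimal sketch of how an object ID is derived for SHA-1 repositories. Git hashes a short header (`<type> <size>` followed by a NUL byte) plus the raw content; the function name `git_object_id` is a placeholder chosen for this example.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 object ID for a Git object of the given type."""
    # The hashed bytes are "<type> <size>\0" followed by the raw content.
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


# The filename is not part of a blob, so identical content always
# produces the same ID no matter where the file lives in the tree.
empty_blob = git_object_id("blob", b"")
hello_blob = git_object_id("blob", b"hello world\n")

print(empty_blob)
print(hello_blob)
```

The two IDs should match what `git hash-object` reports for the same bytes, which is why two files with identical content cost only one blob on disk.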

2. The Storage Layer Evolution (NFS vs RPC) ⭐

Originally, GitHub and GitLab stored raw `.git` folders on massive Network File Systems (NFS/NAS). Stateless Web workers would mount the NFS drive over the network and run Git binaries locally.
The Bottleneck: Git commands (like git log) perform thousands of tiny, random disk reads to traverse the commit DAG. Doing this over a network mount introduces massive latency per file read, causing severe IOPS starvation on the NFS appliance.

The RPC Solution (Gitaly / DGit): Move the Git binary to the storage node. Instead of mounting a disk over the network, the Web Worker makes a single gRPC call to the Storage Node (e.g., GetCommitHistory()). The Storage Node runs the native Git binary directly on its local SSD/NVMe and returns the final aggregated result over the wire. This replaces thousands of per-file network round trips with a single request/response, removing the network-mounted disk as the IOPS bottleneck.
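
A back-of-the-envelope comparison shows why moving the computation matters. The figures below (10,000 object reads, a 500 µs network round trip, a 50 µs local SSD read) are illustrative assumptions, not measurements, and the function names are hypothetical.

```python
def nfs_latency_us(object_reads: int, network_rtt_us: int) -> int:
    """Every object read crosses the network when the disk is NFS-mounted."""
    return object_reads * network_rtt_us


def rpc_latency_us(object_reads: int, local_read_us: int, network_rtt_us: int) -> int:
    """One network round trip; all object reads hit the local disk."""
    return network_rtt_us + object_reads * local_read_us


# Assumed, illustrative numbers (integers in microseconds).
OBJECT_READS = 10_000
NETWORK_RTT_US = 500
LOCAL_READ_US = 50

nfs_total = nfs_latency_us(OBJECT_READS, NETWORK_RTT_US)
rpc_total = rpc_latency_us(OBJECT_READS, LOCAL_READ_US, NETWORK_RTT_US)

print(f"NFS mount: {nfs_total / 1_000_000:.2f} s")
print(f"RPC call:  {rpc_total / 1_000_000:.2f} s")
```

Under these assumptions the network cost is paid once rather than once per object, which is the whole argument for the RPC tier.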

3. Routing and Sharding (Git Router)

A single storage node (or a single NFS appliance) cannot hold all repositories. Repositories must be sharded across thousands of Storage Nodes.

When a request arrives for torvalds/linux.git, the Git Router acts as an L7 reverse proxy.
It looks up the repository location in a fast routing database (Redis or a highly cached PostgreSQL).
It discovers that linux.git is hosted on Storage Node 402 and seamlessly proxies the SSH/HTTPS connection or gRPC call there.
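
The lookup path can be sketched as a read-through cache in front of the routing table. This is a simplified illustration: the `GitRouter` class, the in-memory dictionaries standing in for Redis and PostgreSQL, and the node name `storage-node-402` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class GitRouter:
    # Authoritative mapping of repo path -> storage node (stand-in for PostgreSQL).
    routing_db: Dict[str, str]
    # Hot lookup cache (stand-in for Redis).
    cache: Dict[str, str] = field(default_factory=dict)
    db_lookups: int = 0

    def resolve(self, repo_path: str) -> Optional[str]:
        """Return the storage node hosting repo_path, or None if unknown."""
        node = self.cache.get(repo_path)
        if node is not None:
            return node
        # Cache miss: fall back to the routing database and remember the answer.
        self.db_lookups += 1
        node = self.routing_db.get(repo_path)
        if node is not None:
            self.cache[repo_path] = node
        return node


router = GitRouter(routing_db={"torvalds/linux.git": "storage-node-402"})
first = router.resolve("torvalds/linux.git")
second = router.resolve("torvalds/linux.git")
print(first, second, router.db_lookups)
```

A real router would also need to invalidate the cached entry when a repository is migrated to another node; that step is omitted here.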

4. Forking (Zero-Cost Copy via Object Pools)

If 10,000 users fork an open-source repository (e.g., Linux kernel at ~3GB), duplicating the repository 10,000 times would waste 30 Terabytes of expensive SSD storage.

Git Alternates / Object Pools: When you click "Fork", GitHub does not copy the Git objects. It creates an empty repository whose objects/info/alternates file (a plain-text list of additional object directories, not a symlink) points to a shared "Object Pool" belonging to the upstream repository network.
When a user pushes a new commit uniquely to their fork, only the new objects are saved in the fork's own object directory.
This deduplication makes forking nearly instantaneous (the cost is independent of repository size) and highly storage-efficient.
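
The fork mechanism can be sketched without running Git at all, because the alternates mechanism is a text file containing one object-directory path per line. The helper `create_fork` and the directory names `pool.git` and `fork.git` are hypothetical; a production system would create the repository through Git itself rather than writing the skeleton by hand.

```python
import tempfile
from pathlib import Path


def create_fork(pool_repo: Path, fork_repo: Path) -> Path:
    """Create an empty repository skeleton that borrows objects from pool_repo."""
    info_dir = fork_repo / "objects" / "info"
    info_dir.mkdir(parents=True)
    (fork_repo / "refs" / "heads").mkdir(parents=True)
    (fork_repo / "HEAD").write_text("ref: refs/heads/main\n")
    # One absolute path to another repository's object directory per line.
    alternates = info_dir / "alternates"
    alternates.write_text(f"{(pool_repo / 'objects').resolve()}\n")
    return alternates


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    pool = root / "pool.git"
    (pool / "objects").mkdir(parents=True)

    alternates_file = create_fork(pool, root / "fork.git")
    borrowed_from = Path(alternates_file.read_text().strip())
    # The fork's own object directory holds nothing except the info/ folder.
    fork_object_entries = sorted(
        p.name for p in (root / "fork.git" / "objects").iterdir()
    )

print(borrowed_from.parent.name, fork_object_entries)
```

The work done is a handful of small file writes regardless of how large the pool is, which is where the size-independent fork cost comes from.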

5. Pull Requests and Caching Diff Computations

To render a Pull Request UI, the web tier needs the diff between two commits. Because the upstream repo and the fork share the same underlying Object Pool on the same physical Storage Node, the node can run a local git diff SHA1 SHA2 without fetching objects from anywhere else.
Diff calculation is CPU-intensive, so the resulting HTML/JSON is cached in Redis/Memcached and subsequent PR reviewers load the page from the cache. Since commits are immutable, a cache entry keyed by the (base SHA, head SHA) pair never goes stale; a new push to the PR branch produces a new head SHA and therefore a new key.
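
A minimal sketch of the caching idea follows. It keys the cache on the pair of commit SHAs and uses Python's `difflib` as a stand-in for `git diff`; the `DiffCache` class, the in-memory dictionary standing in for Redis/Memcached, and the fake SHAs are assumptions made for illustration.

```python
import difflib
from typing import Dict, Tuple


class DiffCache:
    def __init__(self, snapshots: Dict[str, str]) -> None:
        # Hypothetical stand-in for the object store: commit SHA -> file text.
        self.snapshots = snapshots
        # Stand-in for Redis/Memcached, keyed by (base SHA, head SHA).
        self.cache: Dict[Tuple[str, str], str] = {}
        self.computations = 0

    def diff(self, base_sha: str, head_sha: str) -> str:
        key = (base_sha, head_sha)
        if key not in self.cache:
            # Expensive path: compute once, then serve every later reviewer from cache.
            self.computations += 1
            lines = difflib.unified_diff(
                self.snapshots[base_sha].splitlines(keepends=True),
                self.snapshots[head_sha].splitlines(keepends=True),
                fromfile=base_sha,
                tofile=head_sha,
            )
            self.cache[key] = "".join(lines)
        return self.cache[key]


store = DiffCache({"aaa111": "print('hi')\n", "bbb222": "print('hello')\n"})
first = store.diff("aaa111", "bbb222")
second = store.diff("aaa111", "bbb222")
print(first)
print("computations:", store.computations)
```

Because the key contains only immutable identifiers, no invalidation logic is needed; old entries can simply expire.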

A Code Hosting Platform requires a strict separation of concerns between Git Object Data (stored in the storage tier) and Relational Metadata (stored in PostgreSQL).

SQL

// 1. Git Object Storage (On Disk / Gitaly)
// Everything in Git is hashed via SHA-1 (or SHA-256) into an object.
Blob {
    length: size_t
    content: byte[]   // The actual file contents
}
Tree {
    entries: [
        { mode: 100644, type: "blob", sha: "abc1234...", name: "index.js" },
        { mode: 040000, type: "tree", sha: "def5678...", name: "src" }
    ]
}
Commit {
    tree_sha: "xyz9876..."
    parent_shas: ["prev123..."]
    author: "Ronnie <ronnie@example.com>"
    message: "Fix bug"
}

-- 2. Relational Metadata (PostgreSQL)
-- PostgreSQL has no inline ENUM(...) column type; the enum is declared as a named type first.
CREATE TYPE pr_status AS ENUM ('OPEN', 'MERGED', 'CLOSED');

CREATE TABLE repositories (
    id SERIAL PRIMARY KEY,
    owner_id UUID REFERENCES users(id),  -- assumes a users table with a UUID primary key
    name VARCHAR(255),
    storage_shard_id VARCHAR(50) -- Crucial: Tells the router which Gitaly node holds the Git data
);

CREATE TABLE pull_requests (
    id SERIAL PRIMARY KEY,
    repo_id INT REFERENCES repositories(id),
    base_branch VARCHAR(255),  -- e.g. "main"
    head_branch VARCHAR(255),  -- e.g. "feature-branch"
    status pr_status NOT NULL DEFAULT 'OPEN'
);

Failure Case	System Solution Design
Storage Node Failure	Each repo is replicated to 3 nodes using a Raft-based consensus protocol for Git updates. If the Primary fails, the remaining nodes elect a Secondary as the new Primary once the election timeout expires.
Network Partition	If a Primary cannot reach the Quorum, it steps down to prevent Split-Brain writes. Pushes will temporarily fail until a new Primary is established.
Heavy CI/CD Clone Traffic	Read traffic (Clones) can be load-balanced across the Secondary replicas to protect the Primary, which is busy handling Git Pushes and PR merges.
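
The majority rule and the read offloading described in the table can be sketched as follows. This shows only the quorum arithmetic and a round-robin replica choice, not a Raft implementation; all function names and node names are hypothetical.

```python
from typing import Dict, List


def quorum_size(replica_count: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replica_count // 2 + 1


def ref_update_committed(acks: Dict[str, bool]) -> bool:
    """A ref update counts as committed only if a majority acknowledged it."""
    return sum(acks.values()) >= quorum_size(len(acks))


def pick_read_replica(replicas: List[str], primary: str, request_id: int) -> str:
    """Spread clone traffic over secondaries; fall back to the primary if alone."""
    secondaries = [r for r in replicas if r != primary]
    pool = secondaries or [primary]
    return pool[request_id % len(pool)]


nodes = ["node-a", "node-b", "node-c"]
print(quorum_size(len(nodes)))
print(ref_update_committed({"node-a": True, "node-b": True, "node-c": False}))
print(ref_update_committed({"node-a": True, "node-b": False, "node-c": False}))
print([pick_read_replica(nodes, "node-a", i) for i in range(4)])
```

A primary that can gather only one acknowledgement out of three cannot commit, which is the behaviour that makes it step down during a partition instead of accepting divergent writes.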

SLOs & Error Budgets

Metric	Target	Rationale
Web UI Latency	99% < 300ms	Developers expect instantaneous browsing of code.
Git Push Success Rate	99.99%	Failing a push disrupts developer workflows and CI/CD triggers.
Repository Durability	99.999999999%	Losing source code is an existential failure for a hosting platform.
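
Translating these targets into an error budget makes them easier to reason about. The sketch below assumes a 30-day window and hypothetical monthly volumes (10 million pushes, 1 million page loads); only the SLO percentages come from the table.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def allowed_failures(slo: float, total_requests: int) -> int:
    """Number of requests that may miss the target without breaching the SLO."""
    return int(round((1.0 - slo) * total_requests))


# Assumed traffic volumes, for illustration only.
MONTHLY_PUSHES = 10_000_000
MONTHLY_PAGE_LOADS = 1_000_000

push_budget = allowed_failures(0.9999, MONTHLY_PUSHES)
slow_page_budget = allowed_failures(0.99, MONTHLY_PAGE_LOADS)
push_downtime = error_budget_minutes(0.9999)

print(f"Failed pushes allowed per month: {push_budget}")
print(f"Page loads allowed over 300 ms:  {slow_page_budget}")
print(f"Equivalent full push outage:     {push_downtime:.2f} minutes")
```

Seen this way, a 99.99% push target leaves only a few minutes of total outage per month, which is why failover has to be automated.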

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
DDoS attack via massive repository clones	Network egress spikes; CPU on storage nodes hits 100% generating packfiles.	Rate limit by IP or user. Enable packfile caching. Serve a static 429 Too Many Requests response at the HAProxy layer.
Split Brain in Storage Replication	Users see inconsistent commit histories depending on which replica serves their read.	The replication coordinator must strictly enforce quorum (e.g., 2 out of 3 writes must succeed). Use strong leader election per repository to resolve write conflicts.
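Database Migration locks a critical table	Web UI becomes completely unresponsive; 502 Bad Gateway errors spike.	Kill the blocking migration transaction immediately. All schema changes must use online, non-blocking migration techniques to avoid long table locks (gh-ost fills this role for MySQL; for the PostgreSQL metadata store assumed here, examples include CREATE INDEX CONCURRENTLY and a short lock_timeout on DDL statements).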
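
The rate-limiting mitigation for clone floods can be sketched as a token bucket per client. This is an illustrative in-process version with the clock passed in explicitly; in practice the limit would be enforced at the load balancer, and the class name, capacity, and refill rate here are assumptions.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float        # maximum burst size
    refill_per_sec: float  # sustained request rate
    tokens: float          # tokens currently available
    last_ts: float = 0.0   # timestamp of the previous request

    def allow(self, now: float) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.last_ts)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_ts = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def clone_status(bucket: TokenBucket, now: float) -> int:
    """HTTP status for a clone request: 200 if admitted, 429 if throttled."""
    return 200 if bucket.allow(now) else 429


bucket = TokenBucket(capacity=2, refill_per_sec=1, tokens=2)
burst = [clone_status(bucket, now=0.0) for _ in range(3)]
after_wait = clone_status(bucket, now=1.0)
print(burst, after_wait)
```

Rejecting early with a 429 keeps the expensive packfile generation on the storage nodes from ever starting for throttled clients.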

Cost Drivers (Staff lens)

Storage costs (Petabytes of Git data, though highly compressed)
Network Egress (Git clones transfer massive amounts of data)
Compute for CI/CD runners (if included in scope)

Multi-Region & DR

Metadata (users, PRs) can be globally distributed, but the Git repository itself usually lives in one primary region with asynchronous replicas in other regions. Writes are always routed to the primary region's storage node.
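
The routing rule above can be sketched as a small function: pushes always go to the primary region, while reads are served locally when a replica exists. The `RepoPlacement` type and the region names are hypothetical, and because replication is asynchronous a local read may briefly lag behind the latest push.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RepoPlacement:
    primary_region: str
    replica_regions: Tuple[str, ...]


def route(placement: RepoPlacement, operation: str, client_region: str) -> str:
    """Pick the region that should serve a Git operation."""
    if operation == "push":
        # Writes always go to the primary region's storage node.
        return placement.primary_region
    if client_region == placement.primary_region:
        return client_region
    if client_region in placement.replica_regions:
        # Reads may be served by a nearby asynchronous replica.
        return client_region
    return placement.primary_region


placement = RepoPlacement(primary_region="us-east", replica_regions=("eu-west",))
print(route(placement, "push", "eu-west"))
print(route(placement, "clone", "eu-west"))
print(route(placement, "clone", "ap-south"))
```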
