Interview Prompt
Design a Code Hosting Platform (like GitHub or GitLab).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| What scale are we designing for? | Millions of repos require a distributed storage tier, not just a single NFS mount. |
| Are we focusing on Git operations or the web UI? | Git operations (Push/Pull) are I/O and network intensive; the Web UI is a standard CRUD application. |
| How large are the repositories? | Large repos require delta compression, shallow clones, and specialized Git backend tuning. |
Scope
In scope
- Git Push and Pull operations over SSH/HTTPS
- Distributed storage of Git repositories
- Web interface for browsing code and commits
- High availability and disaster recovery
Out of scope (state explicitly)
- CI/CD pipelines and runners
- Issue tracking and project management boards
- Detailed GitHub Actions implementation
Assumptions
- Read traffic (git pull / clone) heavily outweighs write traffic (git push)
- Git is a decentralized system; the server acts as a central remote
- Security and access control are paramount
Functional Requirements
- Git Operations: Support `git push`, `git pull`, and `git clone` over SSH and HTTPS.
- Web Interface: Users can browse the repository file tree, view commit history, and read file contents.
- Pull Requests: Users can create PRs, diff code, and leave line-by-line comments.
- Issues & Collaboration: Issue tracking, starring, and forking repositories.
Non-Functional Requirements
- High Availability: Source code must always be accessible for CI/CD pipelines.
- High Read Throughput: `git clone` and `git fetch` (reads) vastly outnumber `git push` (writes).
- Storage Efficiency: Repositories can grow to tens of gigabytes; storage must be deduplicated and compressed.
- Data Integrity: Zero tolerance for source code corruption.
GitHub separates its Web/API tier (which handles UI, PRs, Issues, and DB metadata) from its Git Storage Tier. We can structure this into three distinct layers:
- API / Gateway Layer: The HAProxy / Nginx load balancers and SSH entry nodes that terminate connections and route HTTP/SSH Git commands to the appropriate backend routers.
- Service Layer: Includes the Web Fleet (Ruby on Rails apps handling the UI, Pull Requests, and Issues), the Git Router (maps repo URLs to internal storage nodes), and background CI/CD Worker fleets.
- Data / Storage Layer: The most critical tier. It consists of PostgreSQL for relational metadata (users, PRs), Redis for caching and Sidekiq queues, and Gitaly (Git Storage Nodes) for highly-available, RPC-based storage of the actual Git commit DAGs.
1. Understanding Git Internals (Content-Addressable Storage)
Git is fundamentally a content-addressable file system. Every piece of data is hashed (SHA-1/SHA-256) and stored as an immutable object in a key-value store where the key is the hash.
- Blob: The raw binary contents of a file. Filenames are NOT stored here.
- Tree: A directory listing. It maps filenames and permissions to Blob SHAs or other sub-Tree SHAs.
- Commit: Immutable metadata (author, message, timestamp, parent commit SHA) and a pointer to the root Tree SHA representing the state of the repository at that exact moment.
- Ref (Branch/Tag): A mutable, human-readable pointer (e.g., `refs/heads/main`) pointing to a specific Commit SHA. Pushing to a branch just updates this text file pointer.
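To make the content-addressing idea concrete, here is a minimal sketch of how a SHA-1 object ID is derived: Git hashes a short header (`<type> <size>` followed by a NUL byte) plus the raw payload. The function name `git_object_id` is a hypothetical helper for illustration, not part of any Git library.

```python
import hashlib


def git_object_id(obj_type: str, payload: bytes) -> str:
    """Return the SHA-1 object ID for an object of the given type."""
    # The hashed bytes are the header "<type> <size>\0" followed by the payload.
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


# The well-known ID of the empty blob.
print(git_object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# Identical content yields the same ID no matter which filename refers to it.
print(git_object_id("blob", b"same bytes") == git_object_id("blob", b"same bytes"))
```

Because the key depends only on the content, two files with identical bytes are stored once, which is what the storage tier later exploits for deduplication.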
2. The Storage Layer Evolution (NFS vs RPC) ⭐
Originally, GitHub and GitLab stored raw `.git` folders on massive Network File Systems (NFS/NAS). Stateless Web workers would mount the NFS drive over the network and run Git binaries locally.
The Bottleneck: Git commands (like git log) perform thousands of tiny, random disk reads to traverse the commit DAG. Doing this over a network mount introduces massive latency per file read, causing severe IOPS starvation on the NFS appliance.
The RPC Solution (Gitaly / DGit): Move the Git binary to the storage node. Instead of mounting a disk over the network, the Web Worker makes a single gRPC call to the Storage Node (e.g., `GetCommitHistory()`). The Storage Node runs the native Git binary directly on its local SSD/NVMe and returns the final aggregated result over the wire. This replaces thousands of per-file network round trips with one request, which removes the NFS IOPS bottleneck from the read path.
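A rough back-of-the-envelope model shows why the move matters. The read count and latencies below are assumed, illustrative values, not measurements from any real deployment.

```python
def nfs_latency_ms(object_reads: int, network_rtt_ms: float) -> float:
    """On an NFS mount, every object read pays a network round trip."""
    return object_reads * network_rtt_ms


def rpc_latency_ms(object_reads: int, network_rtt_ms: float, local_read_ms: float) -> float:
    """With an RPC tier, one round trip is paid and the reads hit local disk."""
    return network_rtt_ms + object_reads * local_read_ms


reads = 5_000   # assumed object reads for one history walk
rtt = 0.5       # assumed network round trip, in milliseconds
local = 0.05    # assumed local NVMe read, in milliseconds

print(f"NFS: {nfs_latency_ms(reads, rtt):.1f} ms")
print(f"RPC: {rpc_latency_ms(reads, rtt, local):.1f} ms")
```

Under these assumptions the network cost scales with the number of reads in the NFS case and stays constant in the RPC case, which is the whole argument for colocating the Git binary with the data.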
3. Routing and Sharding (Git Router)
A single massive cluster cannot hold all repositories. Repositories must be sharded across thousands of Storage Nodes.
- When a request arrives for `torvalds/linux.git`, the Git Router acts as an L7 reverse proxy.
- It looks up the repository location in a fast routing database (Redis or a highly cached PostgreSQL).
- It discovers that `linux.git` is hosted on `Storage Node 402` and seamlessly proxies the SSH/HTTPS connection or gRPC call there.
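The lookup path above can be sketched as a read-through cache in front of the authoritative routing table. `GitRouter`, `resolve`, and `invalidate` are hypothetical names, and plain dictionaries stand in for Redis and PostgreSQL.

```python
from typing import Callable, Dict


class GitRouter:
    """Resolves a repository path to the storage node that hosts it."""

    def __init__(self, lookup_in_db: Callable[[str], str]) -> None:
        self._lookup_in_db = lookup_in_db   # slow, authoritative lookup
        self._cache: Dict[str, str] = {}    # stands in for Redis

    def resolve(self, repo_path: str) -> str:
        node = self._cache.get(repo_path)
        if node is None:
            node = self._lookup_in_db(repo_path)  # raises KeyError if unknown
            self._cache[repo_path] = node
        return node

    def invalidate(self, repo_path: str) -> None:
        """Drop the cached entry, e.g. after a repository moves to another node."""
        self._cache.pop(repo_path, None)


routing_table = {"torvalds/linux.git": "storage-node-402"}
router = GitRouter(routing_table.__getitem__)
print(router.resolve("torvalds/linux.git"))
```

The explicit `invalidate` call is the part that is easy to forget: when a repository is rebalanced to another node, a stale cache entry would send traffic to the wrong place.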
4. Forking (Zero-Cost Copy via Object Pools)
If 10,000 users fork an open-source repository (e.g., Linux kernel at ~3GB), duplicating the repository 10,000 times would waste 30 Terabytes of expensive SSD storage.
- Git Alternates / Object Pools: When you click "Fork", GitHub does not copy the Git objects. It creates a nearly empty repository whose `objects/info/alternates` file points to a shared "Object Pool" belonging to the upstream repository network.
- When a user pushes a new commit to their fork, only the new objects are saved in the fork's own object directory.
- This deduplication makes forking instantaneous (O(1) time complexity) and incredibly storage-efficient.
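The fallback behaviour of an object pool can be modelled in a few lines. This is an in-memory sketch, not how Git implements it on disk: the `alternate` attribute plays the role of the `objects/info/alternates` file, which in a real repository is a plain-text list of other object directories to search. `ObjectStore` and its methods are hypothetical names.

```python
from typing import Dict, Optional


class ObjectStore:
    """A repository's object directory, optionally backed by a shared pool."""

    def __init__(self, alternate: Optional["ObjectStore"] = None) -> None:
        self.objects: Dict[str, bytes] = {}
        self.alternate = alternate  # plays the role of objects/info/alternates

    def read(self, sha: str) -> Optional[bytes]:
        if sha in self.objects:
            return self.objects[sha]
        if self.alternate is not None:
            return self.alternate.read(sha)  # fall back to the shared pool
        return None

    def write(self, sha: str, data: bytes) -> None:
        # Store the object locally only if neither this store nor the pool has it.
        if self.read(sha) is None:
            self.objects[sha] = data

    def local_bytes(self) -> int:
        return sum(len(data) for data in self.objects.values())


pool = ObjectStore()
pool.write("upstream-object", b"x" * 1_000)

fork = ObjectStore(alternate=pool)   # creating the fork copies nothing
fork.write("fork-only-object", b"new work")
print(fork.local_bytes())            # size of the fork's own objects only
```

Creating the fork is constant-time because it only records a reference to the pool; the fork's footprint grows with the user's own pushes rather than with the size of the upstream history.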
5. Pull Requests and Caching Diff Computations
To render a Pull Request UI, the web tier needs the diff between two commits. Because the upstream repo and the fork share the same underlying Object Pool on the same physical Storage Node, the node can easily run a local git diff SHA1 SHA2. Diff calculation is highly CPU-intensive. Once calculated, the resulting HTML/JSON is heavily cached in Redis/Memcached so subsequent PR reviewers load the page instantly.
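A minimal sketch of the caching idea: because commit SHAs are immutable, a diff keyed by the pair `(base_sha, head_sha)` never goes stale and needs no invalidation. Here `difflib` stands in for `git diff`, a dictionary stands in for Redis/Memcached, and `DiffCache` and `load_text` are hypothetical names.

```python
import difflib
from typing import Callable, Dict, Tuple


class DiffCache:
    """Caches rendered diffs keyed by the two commit SHAs being compared."""

    def __init__(self, load_text: Callable[[str], str]) -> None:
        self._load_text = load_text                    # hypothetical: SHA -> file text
        self._cache: Dict[Tuple[str, str], str] = {}   # stands in for Redis/Memcached
        self.computations = 0                          # how often the diff was computed

    def diff(self, base_sha: str, head_sha: str) -> str:
        key = (base_sha, head_sha)
        if key not in self._cache:
            self.computations += 1
            lines = difflib.unified_diff(
                self._load_text(base_sha).splitlines(keepends=True),
                self._load_text(head_sha).splitlines(keepends=True),
                fromfile=base_sha,
                tofile=head_sha,
            )
            self._cache[key] = "".join(lines)
        return self._cache[key]


snapshots = {"aaa111": "x\ny\n", "bbb222": "x\nz\n"}
cache = DiffCache(snapshots.__getitem__)
cache.diff("aaa111", "bbb222")
cache.diff("aaa111", "bbb222")   # second reviewer: served from the cache
print(cache.computations)        # 1
```

When the author pushes again, the head SHA changes, so the new diff simply lands under a new key and the old entry can expire on its own.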
Standard REST APIs are used for PRs and Issues, but actual Git code transfer uses the specialized Git Smart HTTP Protocol to minimize bandwidth.
Git Clone / Fetch (Read)
`git clone https://github.com/user/repo.git`
1. Client sends `GET /user/repo.git/info/refs?service=git-upload-pack`.
   - Router routes to the Primary Storage Node for `repo.git`.
   - Node spawns `git-upload-pack` and returns the list of references (branches/tags) and their commit SHAs.
2. Client calculates what objects it needs.
3. Client sends `POST /user/repo.git/git-upload-pack`.
   - Includes `want <commit-sha>` and `have <commit-sha>`.
   - Storage Node dynamically generates a highly compressed packfile containing only the missing objects.
   - Streams the packfile back to the client.
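The want/have negotiation boils down to a set difference over the commit graph: everything reachable from the wanted tips, minus everything reachable from what the client already has. The sketch below works on commits only and ignores trees, blobs, and packfile encoding; `reachable` and `commits_to_send` are hypothetical helper names.

```python
from typing import Dict, Iterable, List, Set


def reachable(parents: Dict[str, List[str]], tips: Iterable[str]) -> Set[str]:
    """Return every known commit reachable from the tips by following parents."""
    seen: Set[str] = set()
    stack = list(tips)
    while stack:
        sha = stack.pop()
        if sha in seen or sha not in parents:
            continue  # already visited, or a commit the server does not know
        seen.add(sha)
        stack.extend(parents[sha])
    return seen


def commits_to_send(
    parents: Dict[str, List[str]], wants: Iterable[str], haves: Iterable[str]
) -> Set[str]:
    """Commits the server must pack: wanted history minus what the client has."""
    return reachable(parents, wants) - reachable(parents, haves)


# Linear history c1 <- c2 <- c3 <- c4; the client already has everything up to c2.
history = {"c1": [], "c2": ["c1"], "c3": ["c2"], "c4": ["c3"]}
print(sorted(commits_to_send(history, wants=["c4"], haves=["c2"])))  # ['c3', 'c4']
```

A fresh clone is the same computation with an empty `haves` list, which is why clones are so much more expensive for the storage node than incremental fetches.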
Git Push (Write)
`git push origin main`
1. Client sends `POST /user/repo.git/git-receive-pack`.
   - Client sends a stream containing a packfile of the new commits/trees/blobs.
   - Storage Node receives the packfile, unpacks or stores it, and updates the branch reference (`refs/heads/main`).
2. Once the ref is updated, a post-receive hook fires.
   - Emits an event to Kafka (e.g., `push_event`).
   - Async workers pick this up to trigger CI/CD (Actions), update PRs, and send webhooks.
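The write path can be sketched as a compare-and-swap on the ref followed by an event publish. This is an assumed simplification: `RefStore` and the event fields are hypothetical, and an in-process `queue.Queue` stands in for the Kafka topic.

```python
import queue
from typing import Dict, Optional


class RefStore:
    """Holds branch pointers and publishes an event after each successful update."""

    def __init__(self, events: "queue.Queue[dict]") -> None:
        self.refs: Dict[str, str] = {}
        self.events = events  # stands in for a Kafka topic

    def update_ref(self, ref: str, old_sha: Optional[str], new_sha: str) -> bool:
        """Move ref to new_sha only if it still points at old_sha."""
        if self.refs.get(ref) != old_sha:
            return False  # another push won the race; the client must fetch and retry
        self.refs[ref] = new_sha
        # Post-receive step: publish only after the ref update has succeeded.
        self.events.put(
            {"type": "push_event", "ref": ref, "before": old_sha, "after": new_sha}
        )
        return True


events: "queue.Queue[dict]" = queue.Queue()
store = RefStore(events)
print(store.update_ref("refs/heads/main", None, "aaa111"))   # True: branch created
print(store.update_ref("refs/heads/main", None, "bbb222"))   # False: stale old_sha
```

Publishing after the ref update, rather than before, means downstream workers never react to a push that was ultimately rejected.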
A Code Hosting Platform requires a strict separation of concerns between Git Object Data (stored in the storage tier) and Relational Metadata (stored in PostgreSQL).
// 1. Git Object Storage (On Disk / Gitaly)
// Everything in Git is hashed via SHA-1 (or SHA-256) into an object.
Blob {
length: size_t
content: byte[] // The actual file contents
}
Tree {
entries: [
{ mode: 100644, type: "blob", sha: "abc1234...", name: "index.js" },
{ mode: 040000, type: "tree", sha: "def5678...", name: "src" }
]
}
Commit {
tree_sha: "xyz9876..."
parent_shas: ["prev123..."]
author: "Ronnie <ronnie@example.com>"
message: "Fix bug"
}
// 2. Relational Metadata (PostgreSQL)
CREATE TABLE repositories (
id SERIAL PRIMARY KEY,
owner_id UUID REFERENCES users(id),
name VARCHAR(255),
storage_shard_id VARCHAR(50) -- Crucial: Tells the router which Gitaly node holds the Git data
);
CREATE TABLE pull_requests (
id SERIAL PRIMARY KEY,
repo_id INT REFERENCES repositories(id),
base_branch VARCHAR(255), -- e.g. "main"
head_branch VARCHAR(255), -- e.g. "feature-branch"
status VARCHAR(16) NOT NULL CHECK (status IN ('OPEN', 'MERGED', 'CLOSED')) -- PostgreSQL has no inline ENUM(...) column syntax
);

| Failure Case | System Solution Design |
|---|---|
| Storage Node Failure | Each repo is replicated to 3 nodes using a Raft-based consensus protocol for Git updates. If the Primary fails, the remaining replicas elect one of the Secondaries as the new Primary; pushes stall only for the duration of that election. |
| Network Partition | If a Primary cannot reach the Quorum, it steps down to prevent Split-Brain writes. Pushes will temporarily fail until a new Primary is established. |
| Heavy CI/CD Clone Traffic | Read traffic (Clones) can be load-balanced across the Secondary replicas to protect the Primary, which is busy handling Git Pushes and PR merges. |
Strong Consistency vs. Eventual Consistency
Git must be strongly consistent. If a developer pushes a commit to main and a CI server immediately clones the repo 100 milliseconds later, the new commit must be present. Therefore, code hosting platforms cannot use asynchronous eventual consistency for repository storage. They must use synchronous distributed transactions (e.g., three-phase commit or Raft consensus) to replicate the push across a quorum of storage nodes before returning an HTTP 200 "Success" to the user's git push command. This trades write latency for absolute correctness.
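The quorum rule itself is simple to state in code. The sketch below is deliberately not Raft or three-phase commit: it omits leader election, log ordering, and rollback of partially applied updates, and only shows the "acknowledge the push once a majority has accepted it" decision. `replicate_push` and the replica callables are hypothetical.

```python
from typing import Callable, List


def replicate_push(
    replicas: List[Callable[[str, str], bool]], ref: str, new_sha: str
) -> bool:
    """Apply a ref update on every replica; succeed only with a majority of acks."""
    quorum = len(replicas) // 2 + 1
    acks = 0
    for apply_update in replicas:
        try:
            if apply_update(ref, new_sha):
                acks += 1
        except ConnectionError:
            pass  # an unreachable replica does not count toward the quorum
    return acks >= quorum


def healthy(ref: str, sha: str) -> bool:
    return True


def unreachable(ref: str, sha: str) -> bool:
    raise ConnectionError("replica down")


print(replicate_push([healthy, healthy, unreachable], "refs/heads/main", "abc123"))      # True
print(replicate_push([healthy, unreachable, unreachable], "refs/heads/main", "abc123"))  # False
```

With three replicas the system tolerates one failure and still accepts pushes; with two failures it refuses writes rather than risk divergent histories.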
Monorepo vs. Multirepo Architecture
While GitHub handles millions of small repositories efficiently, giant monorepos (like those at Google/Meta) break standard Git architecture. Checking out a 100GB repository to an engineer's laptop is impractical. To support monorepos, systems trade full decentralization for Virtual File Systems (VFS for Git / Scalar), where the client only downloads the specific files they are actively editing, lazily fetching objects from the server on-demand.
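The lazy-fetch idea can be illustrated with a small model: the client downloads a cheap manifest of paths up front and pulls file contents only on first read. This is an assumed sketch of the concept, not how VFS for Git or Scalar are implemented; `LazyWorktree` and `fetch_blob` are hypothetical names.

```python
from typing import Callable, Dict


class LazyWorktree:
    """Downloads a file's contents only the first time it is read."""

    def __init__(self, manifest: Dict[str, str], fetch_blob: Callable[[str], bytes]) -> None:
        self._manifest = manifest       # path -> blob SHA, cheap to download up front
        self._fetch_blob = fetch_blob   # hypothetical network call to the server
        self._local: Dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        sha = self._manifest[path]
        if sha not in self._local:
            self._local[sha] = self._fetch_blob(sha)  # fetched on demand
        return self._local[sha]

    def downloaded_blobs(self) -> int:
        return len(self._local)


server_blobs = {"sha-a": b"print('a')", "sha-b": b"print('b')", "sha-c": b"print('c')"}
manifest = {"src/a.py": "sha-a", "src/b.py": "sha-b", "docs/c.py": "sha-c"}

tree = LazyWorktree(manifest, server_blobs.__getitem__)
tree.read("src/a.py")
print(tree.downloaded_blobs())   # 1 of the 3 blobs has been transferred
```

The cost is that the working copy is no longer self-sufficient: reading an untouched file requires the server to be reachable.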
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: Standard NFS Mounts
App servers execute git commands directly against repositories stored on a shared NFS drive.
Key components: NFS Server · Ruby/Go App Servers · PostgreSQL for Metadata
Move to next phase when: NFS becomes a massive I/O bottleneck and single point of failure.
Phase 2: RPC Storage Tier
Eliminate NFS. Create a dedicated storage tier where nodes manage local disk repositories and expose Git operations via gRPC.
Key components: Gitaly / Spokes (RPC Service) · Routing Database · Local NVMe storage
Move to next phase when: Need for high availability; a single node crash takes down thousands of repos.
Phase 3: Replication & Global Edge
Implement 3-way replication for all repositories, and deploy edge nodes to cache packfiles for global users.
Key components: Three-way Replication Coordinator · Edge Packfile Cache · Kafka for webhooks/events
Move to next phase when: Global latency is too high; enterprise users require strict disaster recovery guarantees.
SLOs & Error Budgets
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| DDoS attack via massive repository clones | Network egress spikes; CPU on storage nodes hits 100% generating packfiles. | Rate limit by IP or user. Enable packfile caching. Serve a static 429 Too Many Requests response at the HAProxy layer. |
| Split Brain in Storage Replication | Users see inconsistent commit histories depending on which replica serves their read. | The replication coordinator must strictly enforce quorum (e.g., 2 out of 3 writes must succeed). Use strong leader election per repository to resolve write conflicts. |
| Database Migration locks a critical table | Web UI becomes completely unresponsive; 502 Bad Gateway errors spike. | Kill the blocking migration transaction immediately. All schema changes must go through an online, non-blocking migration process (gh-ost fills this role on MySQL; on PostgreSQL, use patterns such as `CREATE INDEX CONCURRENTLY` and a short `lock_timeout`) to avoid long table locks. |
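For the clone-flood scenario, "rate limit by IP or user" is commonly implemented as a token bucket per client key. The sketch below is one possible shape under assumed parameters; `CloneRateLimiter` is a hypothetical name, and the caller passes the current time explicitly so the behaviour is easy to test.

```python
from typing import Dict, Tuple


class CloneRateLimiter:
    """Token bucket per client key (IP address or user ID)."""

    def __init__(self, capacity: int, refill_per_second: float) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._buckets: Dict[str, Tuple[float, float]] = {}  # key -> (tokens, last_seen)

    def allow(self, key: str, now: float) -> bool:
        tokens, last = self._buckets.get(key, (float(self.capacity), now))
        # Refill in proportion to elapsed time, never exceeding the bucket capacity.
        tokens = min(float(self.capacity), tokens + (now - last) * self.refill_per_second)
        if tokens < 1:
            self._buckets[key] = (tokens, now)
            return False  # caller responds with 429 Too Many Requests
        self._buckets[key] = (tokens - 1, now)
        return True


limiter = CloneRateLimiter(capacity=2, refill_per_second=1.0)
print([limiter.allow("203.0.113.9", now=0.0) for _ in range(3)])  # [True, True, False]
print(limiter.allow("203.0.113.9", now=1.0))                      # True: one token refilled
```

The bucket capacity controls how large a burst is tolerated, while the refill rate sets the sustained clone rate per client.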
Cost Drivers (Staff lens)
- Storage costs (Petabytes of Git data, though highly compressed)
- Network Egress (Git clones transfer massive amounts of data)
- Compute for CI/CD runners (if included in scope)
Multi-Region & DR
Metadata (users, PRs) can be globally distributed, but the Git repository itself usually lives in one primary region with asynchronous replicas in other regions. Writes are always routed to the primary region's storage node.
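The routing rule in this paragraph can be written down directly. This is a sketch under the stated topology, with hypothetical region names; note that a read served from an asynchronous replica may lag behind the primary, so callers that need read-your-writes behaviour should still go to the primary region.

```python
from typing import List


def pick_region(operation: str, client_region: str, primary: str, replicas: List[str]) -> str:
    """Send writes to the primary region; serve reads locally when a copy exists."""
    if operation == "push":
        return primary  # writes always go to the primary region
    if client_region == primary or client_region in replicas:
        return client_region  # a local copy exists, so read it nearby
    return primary  # no local replica: fall back to the primary


print(pick_region("push", "eu-west", primary="us-east", replicas=["eu-west"]))   # us-east
print(pick_region("fetch", "eu-west", primary="us-east", replicas=["eu-west"]))  # eu-west
```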