This problem appears in multiple sheets. Depth expectations increase as you progress:
Interview Prompt
Design an Online Judge System (like LeetCode or HackerRank).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| What is the peak load? | Contests drive massive traffic spikes in short windows (e.g., 50k submissions in 5 minutes). |
| What programming languages must we support? | Different languages need different compilation and sandboxing strategies (e.g., JVM vs C++). |
| Is this for real-time interviews or asynchronous contests? | Real-time execution requires sub-second P99 latency, whereas contests can queue evaluations. |
Scope
In scope
- Code submission and queuing workflow
- Secure code execution environment (Sandboxing)
- Result aggregation and ranking
- Handling massive traffic spikes during contests
Out of scope (state explicitly)
- Discussion forums and comments
- Payment processing for premium users
- IDE UI implementation
Assumptions
- Submissions are usually small (< 100KB)
- We must assume all submitted code is highly malicious
- Execution time is strictly bounded (e.g., 2 seconds)
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- View Problems: Users can browse and view coding problems, including descriptions, constraints, and public test cases.
- Code Submission: Users can write and submit code in multiple programming languages (e.g., Python, Java, C++).
- Code Execution: The system evaluates submitted code against hidden test cases.
- Results Reporting: Return execution results (Accepted, Wrong Answer, Time Limit Exceeded (TLE), Memory Limit Exceeded (MLE), Runtime Error).
- Leaderboard: Rank users based on problems solved and competition ratings.
- High Availability: The platform must be highly available for browsing and submissions.
- Scalability: Must handle sudden spikes during contests (e.g., 100x normal load).
- Security & Isolation: User-submitted code must be executed in a strictly isolated environment to prevent malicious actions (e.g., infinite loops, network access, file system tampering).
- Low Latency Execution: The overhead of setting up the environment should be minimal so users get feedback quickly (< 2-3 seconds for normal execution).
- Fairness: Resource allocation (CPU, RAM) must be strictly enforced so execution times are consistent and deterministic.
| Metric | Calculation | Value |
|---|---|---|
| Daily Active Users (DAU) | Given | 1M |
| Peak concurrent users | DAU × 10% | 100K |
| Submissions / sec (peak) | 100K × (1 sub / min) | ~1,600 subs/s |
| Problem reads / sec | 10x submissions | ~16,000 reads/s |
| Storage per submission | Code + Meta | ~2 KB |
| Storage per year (subs) | 1M DAU × 2 subs/day × 365 × 2KB | ~1.5 TB/year |
| Compute power (Sandboxes) | 1600/s × 2s execution | ~3,200 cores |
Compute Constraints
If a peak submission spike during a contest reaches 1,600 submissions/sec, and each submission takes ~2 seconds to compile and run against 50 test cases, we need 1600 × 2 = 3200 concurrent sandbox environments available. This necessitates a horizontally scalable fleet of execution workers.
The architecture is split between a standard web tier for browsing problems and a specialized async execution pipeline for running untrusted code. We can break this down into three distinct layers:
- API / Gateway Layer: Handles SSL termination, rate limiting, and authenticates the user's JWT. It routes read requests to the standard web tier, and pushes code submissions directly to the message broker.
- Service Layer: Consists of the Submission Service (validates constraints), the Execution Fleet (spawns secure Firecracker MicroVM sandboxes to run user code), and the Result Aggregator (listens for execution results and streams them back to the user via SSE).
- Data / Storage Layer: Relies on Kafka for durable async queueing of submissions, AWS S3 for storing massive test case files and raw user code, and PostgreSQL for managing user metadata and problem schemas.
1. Submission Queue & Dispatcher (Kafka & Worker Fleet)
Synchronous execution is impossible at scale; a slow Python script could lock an API thread for 10 seconds. Thus, the API Gateway immediately pushes submission payloads to a Kafka Topic partitioned by problem_id or user geography.
- Why Kafka? It provides extreme throughput and durability. If the execution fleet crashes during a contest, Kafka retains the submissions until workers recover.
- Dispatcher/Consumer Group: A pool of Go/Rust worker nodes listens to the Kafka topic. Each worker pulls a batch of code submissions, provisions a sandbox, and streams the standard output and execution metrics back to a Result Aggregator.
- Backpressure: If Kafka lag grows (e.g., during a coding contest), an Auto-Scaler triggers the provisioning of additional EC2 Spot Instances to expand the worker fleet dynamically.
2. Execution Engine & Secure Sandboxing
Executing untrusted code is the most critical vulnerability vector. A user could write code to mine crypto, read host environment variables, or launch a DDoS attack.
- Container Isolation (Docker/containerd): Each submission runs in an ephemeral container. The filesystem is entirely read-only except for a tiny
/tmpscratch space. - Resource Quotas (cgroups v2): We strictly enforce CPU (e.g., 1 core max) and Memory (e.g., 256MB max). If a user's code exceeds this, the kernel OOM Killer immediately terminates the process, resulting in an
MLE (Memory Limit Exceeded). - System Call Filtering (seccomp/AppArmor): We drop almost all Linux capabilities. The code is forbidden from calling
fork()(preventing fork bombs),socket()(preventing network access), andexecve(). - MicroVMs (Firecracker / gVisor): For absolute security, standard Docker is often insufficient. Platforms like Leetcode/AWS Lambda use Firecracker MicroVMs—hardware-level virtualization that boots a tiny Linux kernel in <125ms, ensuring perfect tenant isolation.
3. Test Case Execution Strategy
A single problem might have 150 hidden test cases. Booting 150 isolated containers per submission would be disastrously slow.
- In-Memory Injection: We boot one container per submission. Inside the container, a specialized runner script (written by the platform) dynamically loads the user's compiled code/function.
- Batch Execution: The runner script iterates through all 150 test cases, capturing the output and timing for each. If a test case fails, the runner immediately halts and reports
Wrong AnswerorRuntime Error. - Test Data Storage: Huge test cases (e.g., arrays with 100,000 elements) are not sent over Kafka. They are pulled directly from AWS S3 or a local read-heavy Redis cache by the worker node before booting the sandbox.
4. Result Aggregation & Real-time Delivery
Once the sandbox completes, the execution worker publishes the result to a Kafka results topic. A dedicated Result Aggregator consumes this and updates the PostgreSQL database.
- Client Delivery (SSE / WebSockets / Polling): The user UI needs the result fast. Instead of forcing the UI to poll the DB heavily, the API can use Server-Sent Events (SSE). When the result hits Kafka, a routing layer forwards the event directly to the specific API server holding the user's open SSE connection.
Submit Code
POST /api/v1/submissions
Authorization: Bearer <token>
Content-Type: application/json
{
"problem_id": 123,
"language": "python3",
"code": "def twoSum(nums, target):\n d = {}\n for i, n in enumerate(nums):\n if target - n in d:\n return [d[target - n], i]\n d[n] = i"
}
Response: 202 Accepted
{
"submission_id": "sub_987654321",
"status": "PENDING"
}Stream Submission Status (SSE)
GET /api/v1/submissions/sub_987654321/stream
Authorization: Bearer <token>
Response: 200 OK
Content-Type: text/event-stream
data: {"status": "IN_PROGRESS", "testcases_passed": 10}
data: {"status": "IN_PROGRESS", "testcases_passed": 30}
data: {"status": "ACCEPTED", "runtime_ms": 45, "memory_kb": 14200}Common Error Responses
400 Bad Request: invalid input, missing fields, or malformed JSON 401 Unauthorized: missing or invalid auth token or API key 403 Forbidden: authenticated but insufficient permissions 404 Not Found: resource ID does not exist 409 Conflict: duplicate write or version conflict; retry with idempotency key 422 Unprocessable Entity: valid syntax but invalid business logic 429 Too Many Requests: rate limit exceeded; honor Retry-After header 500 Internal Error: unexpected server fault; retry with idempotency key 503 Service Unavailable: dependency down or overloaded; use exponential backoff
We use a relational database (PostgreSQL) for managing problems, users, and submission metadata.
CREATE TABLE problems (
id SERIAL PRIMARY KEY,
title VARCHAR(255) NOT NULL,
difficulty ENUM('EASY', 'MEDIUM', 'HARD'),
time_limit_ms INT DEFAULT 2000,
memory_limit_kb INT DEFAULT 256000
);
CREATE TABLE submissions (
id VARCHAR(50) PRIMARY KEY,
user_id UUID REFERENCES users(id),
problem_id INT REFERENCES problems(id),
language VARCHAR(20),
code_s3_url VARCHAR(255), -- Offloaded to S3!
status ENUM('PENDING', 'ACCEPTED', 'WRONG_ANSWER', 'TLE', 'MLE', 'ERROR'),
runtime_ms INT,
memory_kb INT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE test_cases (
id SERIAL PRIMARY KEY,
problem_id INT REFERENCES problems(id),
input_s3_key VARCHAR(255), -- Massive test cases stored in S3
output_s3_key VARCHAR(255),
is_hidden BOOLEAN
);Note on Code & Test Storage: The actual raw text for the submitted code and massive test cases (which can be gigabytes of integers) are completely offloaded to AWS S3. The PostgreSQL database only holds the URI pointers (code_s3_url), maintaining strict B-Tree index efficiency and preventing DB bloat.
| Scenario | Handling Strategy |
|---|---|
| Contest Spikes (100x traffic) | Auto-scale the Execution Worker fleet based on the Kafka topic lag. Use pre-warmed container pools to avoid cold start latency. |
| Infinite Loops (TLE) | The container runtime monitors CPU time using cgroups. A supervisor process kills the container if execution exceeds the problem's time limit. |
| Worker Node Crash | If a worker dies mid-execution, the message is not ACKed in Kafka. Another worker will pick up the message and re-run it. |
| Malicious Code Escape | Network access is disabled inside the sandbox. seccomp blocks dangerous syscalls. Firecracker microVMs provide a hardware-level isolation barrier. |
1. Container Isolation vs. MicroVMs
Standard Docker containers share the host kernel. A kernel vulnerability (like a zero-day in eBPF) could allow a user to break out of the sandbox and compromise the worker node. MicroVMs (like AWS Firecracker) provide hardware-virtualized isolation, which is much safer, but have slightly higher cold-start times (~120ms vs ~50ms). Modern OJ systems trade the slight overhead of MicroVMs for the absolute security they provide against kernel exploits.
2. Real-time Delivery: SSE vs. WebSockets vs. Polling
For returning execution results, we need real-time UI updates.
- Polling: Simplest to implement, entirely stateless. But polling every 500ms creates massive read-heavy load on the DB/Redis layer.
- WebSockets: Full duplex, low latency. But highly stateful, requiring sticky sessions or complex Pub/Sub routing, overkill for a unidirectional status update.
- Server-Sent Events (SSE): The perfect middle ground. Unidirectional (Server to Client), runs over standard HTTP, natively supported by browsers, and much easier to scale than WebSockets.
3. Code Storage: DB vs. Object Storage
Storing submitted code strings directly in PostgreSQL is convenient but degrades DB performance as the table grows to Terabytes of text. A better architecture offloads the actual code string to AWS S3 (or a similar object store) and only stores the S3 URI path in PostgreSQL. This keeps the relational database lean and index-efficient, trading slight read latency for massive scalability.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: MVP (Basic Executor)
A single API server parsing submissions and running them locally inside isolated Docker containers.
Key components: Docker containers · PostgreSQL database · Polling for results
Move to next phase when: Execution blocks the API server; security risks of shared kernels.
Phase 2: Queuing & Sandboxing
Introduce asynchronous processing and robust security boundaries.
Key components: Message Queue (Kafka) · Worker pool · Firecracker MicroVMs / gVisor · Redis for status caching
Move to next phase when: Contest spikes overwhelm the system; need for real-time feedback.
Phase 3: Scale & Global Contests
Multi-region execution to reduce latency and handle millions of users.
Key components: WebSocket push notifications · CDN for static assets and problem descriptions · Sharded database for contest leaderboards
Move to next phase when: Global user base; leaderboards take too long to compute.
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Submission Ingestion Latency | 99.9% < 200ms | Users must instantly know their submission was received. |
| Execution Queue Time | 99% < 5 seconds | Delays in execution ruin the contest experience. |
| Sandbox Isolation Breach | 0 incidents | A single host-compromise destroys trust in the platform. |
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Kafka queue backs up during a major contest | Consumer lag metrics spike; P99 execution time exceeds 30 seconds. | Auto-scale worker nodes based on queue depth. If cloud limits reached, temporarily increase timeout visibility to users. |
| A zero-day Linux kernel exploit is discovered | Security bulletins; anomaly detection showing unusual syscalls. | Because we use Firecracker (hardware virtualization) or gVisor (user-space kernel), we are insulated from standard host kernel panics. |
| Redis goes down, causing polling failures | Elevated 5xx errors on the submission status endpoint. | Fall back to querying the primary database directly, but apply strict rate limiting to prevent cascading database failure. |
Cost Drivers (Staff lens)
- Compute instances for the worker pool (code execution is CPU intensive)
- Bandwidth costs for streaming large test cases to workers
- Database IOPS for rapidly updating submission statuses
Multi-Region & DR
Code execution can be localized. Route the user to the nearest regional worker pool. However, user profiles and global contest leaderboards must be synchronized across regions (often using an active-passive DB architecture).