Design an Online Judge System (Leetcode)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Focus on the isolation layer and secure execution environment. How are system calls restricted?
Arch 75	Staff angles: How do you handle malicious infinite loops vs legitimate slow algorithms? Discuss MicroVMs vs Containers for multi-tenancy.

Interview Prompt

Design an Online Judge System (like LeetCode or HackerRank).

Clarifying Questions (ask before designing)

Question	Why it matters
What is the peak load?	Contests drive massive traffic spikes in short windows (e.g., 50k submissions in 5 minutes).
What programming languages must we support?	Different languages need different compilation and sandboxing strategies (e.g., JVM vs C++).
Is this for real-time interviews or asynchronous contests?	Real-time execution requires sub-second P99 latency, whereas contests can queue evaluations.

Scope

In scope

Code submission and queuing workflow
Secure code execution environment (Sandboxing)
Result aggregation and ranking
Handling massive traffic spikes during contests

Out of scope (state explicitly)

Discussion forums and comments
Payment processing for premium users
IDE UI implementation

Assumptions

Submissions are usually small (< 100KB)
We must assume all submitted code is highly malicious
Execution time is strictly bounded (e.g., 2 seconds)

View Problems: Users can browse and view coding problems, including descriptions, constraints, and public test cases.
Code Submission: Users can write and submit code in multiple programming languages (e.g., Python, Java, C++).
Code Execution: The system evaluates submitted code against hidden test cases.
Results Reporting: Return execution results (Accepted, Wrong Answer, Time Limit Exceeded (TLE), Memory Limit Exceeded (MLE), Runtime Error).
Leaderboard: Rank users based on problems solved and competition ratings.

Metric	Calculation	Value
Daily Active Users (DAU)	Given	1M
Peak concurrent users	DAU × 10%	100K
Submissions / sec (peak)	100K × (1 sub / min)	~1,600 subs/s
Problem reads / sec	10x submissions	~16,000 reads/s
Storage per submission	Code + Meta	~2 KB
Storage per year (subs)	1M DAU × 2 subs/day × 365 × 2KB	~1.5 TB/year
Compute power (Sandboxes)	1600/s × 2s execution	~3,200 cores

Compute Constraints

If a peak submission spike during a contest reaches 1,600 submissions/sec, and each submission takes ~2 seconds to compile and run against 50 test cases, we need 1600 × 2 = 3200 concurrent sandbox environments available. This necessitates a horizontally scalable fleet of execution workers.

The architecture is split between a standard web tier for browsing problems and a specialized async execution pipeline for running untrusted code. We can break this down into three distinct layers:

Loading...

API / Gateway Layer: Handles SSL termination, rate limiting, and authenticates the user's JWT. It routes read requests to the standard web tier, and pushes code submissions directly to the message broker.
Service Layer: Consists of the Submission Service (validates constraints), the Execution Fleet (spawns secure Firecracker MicroVM sandboxes to run user code), and the Result Aggregator (listens for execution results and streams them back to the user via SSE).
Data / Storage Layer: Relies on Kafka for durable async queueing of submissions, AWS S3 for storing massive test case files and raw user code, and PostgreSQL for managing user metadata and problem schemas.

1. Submission Queue & Dispatcher (Kafka & Worker Fleet)

Synchronous execution is impossible at scale; a slow Python script could lock an API thread for 10 seconds. Thus, the API Gateway immediately pushes submission payloads to a Kafka Topic partitioned by problem_id or user geography.

Why Kafka? It provides extreme throughput and durability. If the execution fleet crashes during a contest, Kafka retains the submissions until workers recover.
Dispatcher/Consumer Group: A pool of Go/Rust worker nodes listens to the Kafka topic. Each worker pulls a batch of code submissions, provisions a sandbox, and streams the standard output and execution metrics back to a Result Aggregator.
Backpressure: If Kafka lag grows (e.g., during a coding contest), an Auto-Scaler triggers the provisioning of additional EC2 Spot Instances to expand the worker fleet dynamically.

2. Execution Engine & Secure Sandboxing

Loading...

Executing untrusted code is the most critical vulnerability vector. A user could write code to mine crypto, read host environment variables, or launch a DDoS attack.

Container Isolation (Docker/containerd): Each submission runs in an ephemeral container. The filesystem is entirely read-only except for a tiny /tmp scratch space.
Resource Quotas (cgroups v2): We strictly enforce CPU (e.g., 1 core max) and Memory (e.g., 256MB max). If a user's code exceeds this, the kernel OOM Killer immediately terminates the process, resulting in an MLE (Memory Limit Exceeded).
System Call Filtering (seccomp/AppArmor): We drop almost all Linux capabilities. The code is forbidden from calling fork() (preventing fork bombs), socket() (preventing network access), and execve().
MicroVMs (Firecracker / gVisor): For absolute security, standard Docker is often insufficient. Platforms like Leetcode/AWS Lambda use Firecracker MicroVMs—hardware-level virtualization that boots a tiny Linux kernel in <125ms, ensuring perfect tenant isolation.

3. Test Case Execution Strategy

A single problem might have 150 hidden test cases. Booting 150 isolated containers per submission would be disastrously slow.

In-Memory Injection: We boot one container per submission. Inside the container, a specialized runner script (written by the platform) dynamically loads the user's compiled code/function.
Batch Execution: The runner script iterates through all 150 test cases, capturing the output and timing for each. If a test case fails, the runner immediately halts and reports Wrong Answer or Runtime Error.
Test Data Storage: Huge test cases (e.g., arrays with 100,000 elements) are not sent over Kafka. They are pulled directly from AWS S3 or a local read-heavy Redis cache by the worker node before booting the sandbox.

4. Result Aggregation & Real-time Delivery

Once the sandbox completes, the execution worker publishes the result to a Kafka results topic. A dedicated Result Aggregator consumes this and updates the PostgreSQL database.

Client Delivery (SSE / WebSockets / Polling): The user UI needs the result fast. Instead of forcing the UI to poll the DB heavily, the API can use Server-Sent Events (SSE). When the result hits Kafka, a routing layer forwards the event directly to the specific API server holding the user's open SSE connection.

Submit Code

HTTP

POST /api/v1/submissions
Authorization: Bearer <token>
Content-Type: application/json

{
  "problem_id": 123,
  "language": "python3",
  "code": "def twoSum(nums, target):\n    d = {}\n    for i, n in enumerate(nums):\n        if target - n in d:\n            return [d[target - n], i]\n        d[n] = i"
}

Response: 202 Accepted
{
  "submission_id": "sub_987654321",
  "status": "PENDING"
}

Stream Submission Status (SSE)

HTTP

GET /api/v1/submissions/sub_987654321/stream
Authorization: Bearer <token>

Response: 200 OK
Content-Type: text/event-stream

data: {"status": "IN_PROGRESS", "testcases_passed": 10}
data: {"status": "IN_PROGRESS", "testcases_passed": 30}
data: {"status": "ACCEPTED", "runtime_ms": 45, "memory_kb": 14200}

Common Error Responses

400 Bad Request: invalid input, missing fields, or malformed JSON
401 Unauthorized: missing or invalid auth token or API key
403 Forbidden: authenticated but insufficient permissions
404 Not Found: resource ID does not exist
409 Conflict: duplicate write or version conflict; retry with idempotency key
422 Unprocessable Entity: valid syntax but invalid business logic
429 Too Many Requests: rate limit exceeded; honor Retry-After header
500 Internal Error: unexpected server fault; retry with idempotency key
503 Service Unavailable: dependency down or overloaded; use exponential backoff

We use a relational database (PostgreSQL) for managing problems, users, and submission metadata.

SQL

CREATE TABLE problems (
    id SERIAL PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    difficulty ENUM('EASY', 'MEDIUM', 'HARD'),
    time_limit_ms INT DEFAULT 2000,
    memory_limit_kb INT DEFAULT 256000
);

CREATE TABLE submissions (
    id VARCHAR(50) PRIMARY KEY,
    user_id UUID REFERENCES users(id),
    problem_id INT REFERENCES problems(id),
    language VARCHAR(20),
    code_s3_url VARCHAR(255), -- Offloaded to S3!
    status ENUM('PENDING', 'ACCEPTED', 'WRONG_ANSWER', 'TLE', 'MLE', 'ERROR'),
    runtime_ms INT,
    memory_kb INT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE test_cases (
    id SERIAL PRIMARY KEY,
    problem_id INT REFERENCES problems(id),
    input_s3_key VARCHAR(255), -- Massive test cases stored in S3
    output_s3_key VARCHAR(255),
    is_hidden BOOLEAN
);

Note on Code & Test Storage: The actual raw text for the submitted code and massive test cases (which can be gigabytes of integers) are completely offloaded to AWS S3. The PostgreSQL database only holds the URI pointers (code_s3_url), maintaining strict B-Tree index efficiency and preventing DB bloat.

Scenario	Handling Strategy
Contest Spikes (100x traffic)	Auto-scale the Execution Worker fleet based on the Kafka topic lag. Use pre-warmed container pools to avoid cold start latency.
Infinite Loops (TLE)	The container runtime monitors CPU time using cgroups. A supervisor process kills the container if execution exceeds the problem's time limit.
Worker Node Crash	If a worker dies mid-execution, the message is not ACKed in Kafka. Another worker will pick up the message and re-run it.
Malicious Code Escape	Network access is disabled inside the sandbox. seccomp blocks dangerous syscalls. Firecracker microVMs provide a hardware-level isolation barrier.

SLOs & Error Budgets

Metric	Target	Rationale
Submission Ingestion Latency	99.9% < 200ms	Users must instantly know their submission was received.
Execution Queue Time	99% < 5 seconds	Delays in execution ruin the contest experience.
Sandbox Isolation Breach	0 incidents	A single host-compromise destroys trust in the platform.

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Kafka queue backs up during a major contest	Consumer lag metrics spike; P99 execution time exceeds 30 seconds.	Auto-scale worker nodes based on queue depth. If cloud limits reached, temporarily increase timeout visibility to users.
A zero-day Linux kernel exploit is discovered	Security bulletins; anomaly detection showing unusual syscalls.	Because we use Firecracker (hardware virtualization) or gVisor (user-space kernel), we are insulated from standard host kernel panics.
Redis goes down, causing polling failures	Elevated 5xx errors on the submission status endpoint.	Fall back to querying the primary database directly, but apply strict rate limiting to prevent cascading database failure.

Cost Drivers (Staff lens)

Compute instances for the worker pool (code execution is CPU intensive)
Bandwidth costs for streaming large test cases to workers
Database IOPS for rapidly updating submission statuses

Multi-Region & DR

Code execution can be localized. Route the user to the nearest regional worker pool. However, user profiles and global contest leaderboards must be synchronized across regions (often using an active-passive DB architecture).