Design an LLM Chat Application (ChatGPT)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Focus on WebSocket connections, chat history storage, and streaming LLM responses.
Arch 75	Staff angles: Context window management, GPU orchestration, and KV cache scaling.

Interview Prompt

Design an LLM Chat Application (like ChatGPT or Claude).

Clarifying Questions (ask before designing)

Question	Why it matters
Are we hosting our own LLM models or using a third-party API?	Hosting own models requires deep GPU cluster management and inference routing. Using APIs shifts the focus entirely to state management and UI streaming.
Do we need to retain chat history indefinitely?	Impacts database choice (e.g., Cassandra for massive write-heavy logs).
What is the expected latency for the first token?	Time To First Token (TTFT) is the most critical UX metric for LLM apps.

Scope

In scope

Real-time streaming of tokens to the client
Chat session and history management
Context window truncation and summarization
Orchestration of inference requests (if hosting models)

Out of scope (state explicitly)

Training or fine-tuning the foundational models
Detailed implementation of the transformer architecture

Assumptions

The system handles millions of concurrent users
Responses are generated via an autoregressive language model
Users expect immediate streaming feedback

Conversational UI: Users can send text prompts and receive text responses from an AI.
Streaming Responses: Responses must be streamed back to the user token-by-token in real-time.
Context/History: The AI must remember the context of the current conversation session.
Chat History: Users can view, resume, and delete past conversation threads.

An LLM Chat application combines standard web architecture (databases for chat history, rate limiters) with specialized ML inference infrastructure. We can structure this into three distinct layers:

API / Gateway Layer: Terminates TLS and handles user authentication and rate limiting. It maintains long-lived Server-Sent Events (SSE) connections to stream generated tokens directly back to the user's browser.
Service Layer: The Context Manager fetches conversation history from the DB to build the full prompt, and the Inference Engine Fleet (vLLM / TensorRT-LLM) manages massive GPU clusters, performing Continuous Batching and KV-Caching to generate tokens.
Data / Storage Layer: Relies on an ultra-fast NoSQL store (DynamoDB or Cassandra) for O(1) retrieval of chat histories, and distributed blob storage (S3) for persisting multi-gigabyte open-weights LLM model files.

1. Streaming via Server-Sent Events (SSE)

Waiting for a 500-word response to fully generate on the GPU before sending it to the client takes 10+ seconds. This is unacceptable UX. We use Server-Sent Events (SSE) (or WebSockets) to stream chunks to the client as soon as the GPU outputs them. SSE is preferred over WebSockets here because the token stream is unidirectional (server to client) and SSE runs over an ordinary HTTP response: it needs no protocol upgrade handshake, passes through standard proxies and load balancers, and can share a connection with other requests when served over HTTP/2 (HTTP/1.1 itself does not multiplex).

HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive

data: {"id": "msg_1", "object": "chat.completion.chunk", "choices": [{"delta": {"content": "The"}}]}

data: {"id": "msg_1", "object": "chat.completion.chunk", "choices": [{"delta": {"content": " capital"}}]}

data: {"id": "msg_1", "object": "chat.completion.chunk", "choices": [{"delta": {"content": " of"}}]}

data: [DONE]
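
The framing above can be sketched in a few lines of Python. This is a minimal illustration, not a production server: the helper names `sse_frames` and `read_stream` are hypothetical, and the chunk shape simply mirrors the example stream shown here.

```python
import json
from typing import Iterable, Iterator


def sse_frames(message_id: str, tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE frame per token, then the [DONE] sentinel."""
    for token in tokens:
        chunk = {
            "id": message_id,
            "object": "chat.completion.chunk",
            "choices": [{"delta": {"content": token}}],
        }
        # An SSE event is a "data:" line terminated by a blank line.
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"


def read_stream(frames: Iterable[str]) -> str:
    """Client side: concatenate the delta contents until [DONE] arrives."""
    parts = []
    for frame in frames:
        payload = frame.removeprefix("data: ").strip()
        if payload == "[DONE]":
            break
        parts.append(json.loads(payload)["choices"][0]["delta"]["content"])
    return "".join(parts)
```

In a real service the generator would be handed to the web framework's streaming response, and the token iterable would come from the inference engine rather than a list.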

2. Context Manager & Prompt Builder

LLMs are strictly stateless mathematical functions. To give the illusion of memory, the server must fetch the recent chat history from the database and prepend it to the user's prompt before sending it to the model.

User: "What's the capital of France?"

System internally builds the array:
[
  {"role": "system", "content": "You are a helpful AI assistant. Limit answers to 1 sentence."},
  {"role": "user", "content": "Hi"},
  {"role": "assistant", "content": "Hello! How can I help?"},
  {"role": "user", "content": "What's the capital of France?"}
]

Tokenized and sent to Inference Engine.

If the history exceeds the model's context window (e.g., 8K or 128K tokens), the Context Manager must intervene:
1. Truncation (Sliding Window): Simply drop the oldest messages.
2. Summarization: Use a background worker to summarize older blocks of conversation into a dense paragraph and inject that as a system prompt.
3. RAG (Retrieval-Augmented Generation): Vectorize past messages into a VectorDB and retrieve only semantically relevant past messages.
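
As a rough sketch of the first strategy (sliding-window truncation), the function below keeps the system prompt and the newest user message, then fills the remaining budget with history from newest to oldest. The names `build_prompt` and `count_tokens` are hypothetical, and the word-count tokenizer is a deliberate simplification; a real system would use the model's own tokenizer.

```python
from typing import Dict, List

Message = Dict[str, str]


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def build_prompt(
    system: Message,
    history: List[Message],
    new_message: Message,
    max_tokens: int,
) -> List[Message]:
    """Keep the system prompt and newest message; fill the rest newest-first."""
    budget = (
        max_tokens
        - count_tokens(system["content"])
        - count_tokens(new_message["content"])
    )
    if budget < 0:
        raise ValueError("system prompt plus new message exceed the context window")
    kept: List[Message] = []
    for message in reversed(history):
        cost = count_tokens(message["content"])
        if cost > budget:
            break  # this message and everything older is dropped
        kept.append(message)
        budget -= cost
    # Restore chronological order before sending to the model.
    return [system, *reversed(kept), new_message]
```

Summarization and RAG would plug in at the same point: instead of discarding the messages that fall outside the budget, they would be replaced by a summary message or by retrieved snippets.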

3. The Inference Engine & GPU Optimization ⭐

Running an LLM in production requires specialized, highly optimized inference engines (like vLLM or TensorRT-LLM), not just a raw PyTorch script.

Continuous Batching (Iteration-Level Scheduling)

In traditional request-level batching, the GPU waits for the longest request in the batch to finish before accepting new requests. Continuous Batching ejects finished requests at the exact token iteration they finish and immediately inserts waiting requests into the freed slots. This dramatically increases GPU utilization and throughput.
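
A toy simulation makes the difference concrete. It counts decode iterations only and assumes every request needs at least one output token; the function names are illustrative and real engines also account for prefill and memory limits.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Request-level batching: each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Iteration-level scheduling: refill free slots before every decode step."""
    waiting = deque(lengths)
    running: List[int] = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode iteration produces one token for every active request;
        # requests that just produced their last token leave the batch.
        running = [remaining - 1 for remaining in running if remaining > 1]
        steps += 1
    return steps
```

With output lengths `[2, 10, 2, 2, 2, 2]` and a batch size of 2, the static scheduler needs 14 iterations while the continuous one needs 10, because the short requests cycle through the second slot while the long request is still decoding.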

KV Cache & PagedAttention

The KV Cache stores the self-attention Key and Value tensors for every token already processed (prompt and generated), so they don't have to be recomputed for each new token. However, each request's KV cache grows by one token per decode step and its final length is unknown in advance, so naive contiguous allocation leads to severe memory fragmentation on the GPU.

PagedAttention (introduced by vLLM) solves this by managing GPU memory much like an OS manages RAM: with virtual memory pages. KV cache blocks no longer need to be contiguous in physical VRAM, which nearly eliminates fragmentation and lets far more requests share a GPU without OOM crashes (the vLLM authors report 2-4x higher throughput).
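
The block-table idea can be sketched as a small allocator. This is a toy model under stated assumptions, not vLLM's implementation: the class name `PagedKVCache` is hypothetical and integer block IDs stand in for fixed-size pages of VRAM.

```python
from typing import Dict, List


class PagedKVCache:
    """Toy block-table allocator; block IDs stand in for fixed-size VRAM pages."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}  # request -> physical blocks
        self.token_counts: Dict[str, int] = {}

    def append_token(self, request_id: str) -> bool:
        """Reserve room for one more token; return False if the pool is exhausted."""
        table = self.block_tables.setdefault(request_id, [])
        used = self.token_counts.get(request_id, 0)
        if used == len(table) * self.block_size:  # last block is full (or none yet)
            if not self.free_blocks:
                return False  # caller should queue or preempt instead of crashing
            table.append(self.free_blocks.pop())
        self.token_counts[request_id] = used + 1
        return True

    def release(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)
```

Because a request only ever wastes the unused tail of its last block, memory is handed out in small increments as the sequence grows, and the `False` return gives the scheduler a clean signal to queue work rather than hit an out-of-memory error.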

4. Model Router & Prefix Caching

A specialized L7 load balancer routes requests to Inference Nodes. Advanced routers implement Prefix-Aware Routing: if a user asks a follow-up question about a 10,000-word document, the router hashes the document (the prefix) and routes the request to the GPU node that already holds that document's KV cache in memory. The node then only has to run the expensive "prefill" phase over the new tokens instead of the whole document, which saves a large amount of compute and cuts TTFT.
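
One stateless way to approximate this is to hash a fixed-length token prefix and pick a node with rendezvous (highest-random-weight) hashing, so requests sharing a prefix land on the same node. This is a sketch under assumptions: `prefix_key`, `pick_node`, the node names, and the 256-token prefix length are all illustrative, and a production router would more likely track which nodes actually hold which cached prefixes.

```python
import hashlib
from typing import List


def prefix_key(token_ids: List[int], prefix_len: int) -> str:
    """Hash only the first prefix_len tokens so follow-ups share a key."""
    prefix = ",".join(str(t) for t in token_ids[:prefix_len])
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def pick_node(token_ids: List[int], nodes: List[str], prefix_len: int = 256) -> str:
    """Rendezvous hashing: the node with the highest score for this prefix wins."""
    key = prefix_key(token_ids, prefix_len)

    def score(node: str) -> str:
        return hashlib.sha256(f"{key}:{node}".encode("utf-8")).hexdigest()

    return max(nodes, key=score)
```

Rendezvous hashing is used here because removing an unrelated node does not move a prefix to a different node, which keeps warm KV caches useful during scale-in and scale-out.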

Standard request/response REST calls would time out waiting for an LLM to generate 1000 words. We must use Server-Sent Events (SSE).

Prompt Construction

The API server dynamically constructs the full context window before querying the LLM Engine.

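Because the LLM engine is stateless, the conversation history must be fetched from the database on every request to construct the prompt array (the Context Manager's job in the Service Layer). We use a highly scalable NoSQL database (like DynamoDB or Cassandra) partitioned by session_id, so loading a conversation is a single-partition key lookup followed by an ordered read of its messages, regardless of how many sessions exist.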

// DynamoDB / Cassandra Table Schema
// Partition Key: session_id
// Sort Key: message_id (or timestamp)

{
  "session_id": "sess_12345",         // PK
  "message_id": "msg_001",            // SK
  "user_id": "usr_987",               // GSI (Global Secondary Index)
  "role": "user",                     // 'user', 'assistant', 'system'
  "content": "What's the capital?",
  "created_at": "2023-10-01T12:00:00Z",
  "tokens": 6
}
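
The access pattern this schema supports can be sketched with an in-memory stand-in. The class `ChatHistoryStore` is hypothetical and only mimics the partition-key and sort-key behaviour; it assumes message IDs sort lexicographically in creation order (for example, zero-padded counters or time-ordered IDs).

```python
from collections import defaultdict
from typing import Dict, List


class ChatHistoryStore:
    """In-memory stand-in for a table partitioned by session_id, sorted by message_id."""

    def __init__(self) -> None:
        # session_id (partition key) -> message_id (sort key) -> item
        self._partitions: Dict[str, Dict[str, dict]] = defaultdict(dict)

    def put(self, item: dict) -> None:
        self._partitions[item["session_id"]][item["message_id"]] = item

    def recent(self, session_id: str, limit: int) -> List[dict]:
        """Return the newest `limit` messages of one session in chronological order."""
        partition = self._partitions.get(session_id, {})
        newest_ids = sorted(partition, reverse=True)[:limit]
        return [partition[message_id] for message_id in reversed(newest_ids)]
```

In a real store the same query would be a single-partition read in descending sort-key order with a limit, which is what keeps prompt construction cheap even for very long conversations.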

Failure Case	System Solution Design
GPU Node OOM (Out of Memory)	Inference engine strictly monitors KV cache allocation and queues requests if VRAM is full. Node auto-restarts on crash.
High Traffic Spikes	Scale-out takes minutes due to large model weights (10-100GB). We must use a dynamic queue and inform users of wait times, plus keep pre-warmed standby nodes.
Database Slowdown	Use a highly scalable NoSQL database (DynamoDB / Cassandra) keyed by SessionID for O(1) history retrieval.

SLOs & Error Budgets

Metric	Target	Rationale
Time To First Token (TTFT)	99% < 1.5s	Crucial for perceived performance and UX.
Token Generation Rate	> 20 tokens/sec	Must be faster than human reading speed.
Availability	99.9%	Standard consumer application availability.
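
To make the targets tangible, the sketch below converts an availability target into a downtime budget and checks a TTFT sample against the 99% target. The function names and the 30-day window are assumptions for illustration; the percentile uses the simple nearest-rank definition.

```python
from typing import List


def downtime_budget_minutes(availability: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given availability target."""
    return (1.0 - availability) * window_days * 24 * 60


def p99(samples: List[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = -(-99 * len(ordered) // 100)  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def meets_ttft_slo(ttft_seconds: List[float], target_seconds: float = 1.5) -> bool:
    """True if 99% of requests saw their first token within the target."""
    return p99(ttft_seconds) < target_seconds
```

For a 99.9% target over 30 days, the budget works out to about 43.2 minutes of downtime, which is the amount an on-call team can spend on incidents and risky deploys before the SLO is breached.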

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
GPU Out of Memory (OOM)	Inference servers crash with CUDA out of memory errors; 500 errors spike.	Implement continuous batching and strict KV cache memory limits (e.g., PagedAttention). Gracefully degrade by rejecting new requests rather than crashing.
Safety Filter False Positives	High volume of user reports that benign queries are blocked.	Allow users to contest flags. Implement a fast, secondary fallback classifier, or adjust the moderation threshold dynamically.
Thundering Herd on a specific model tier	GPT-4 endpoint latency spikes while GPT-3.5 is idle.	Implement dynamic pricing or hard queue limits. Surface wait times in the UI to manage expectations.

Cost Drivers (Staff lens)

GPU Compute (A100s / H100s) for inference
API costs (if using third-party providers, token costs dominate)
Database storage for billions of chat messages

Multi-Region & DR

Users can be routed to the nearest regional GPU cluster to minimize TTFT. Chat history can be replicated asynchronously since users rarely access the same chat session from two different continents simultaneously.
