This problem appears in multiple sheets. Depth expectations increase as you progress.
Interview Prompt
Design an LLM Chat Application (like ChatGPT or Claude).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| Are we hosting our own LLM models or using a third-party API? | Hosting own models requires deep GPU cluster management and inference routing. Using APIs shifts the focus entirely to state management and UI streaming. |
| Do we need to retain chat history indefinitely? | Impacts database choice (e.g., Cassandra for massive write-heavy logs). |
| What is the expected latency for the first token? | Time To First Token (TTFT) is the most critical UX metric for LLM apps. |
Scope
In scope
- Real-time streaming of tokens to the client
- Chat session and history management
- Context window truncation and summarization
- Orchestration of inference requests (if hosting models)
Out of scope (state explicitly)
- Training or fine-tuning the foundational models
- Detailed implementation of the transformer architecture
Assumptions
- The system handles millions of concurrent users
- Responses are generated via an autoregressive language model
- Users expect immediate streaming feedback
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- Conversational UI: Users can send text prompts and receive text responses from an AI.
- Streaming Responses: Responses must be streamed back to the user token-by-token in real-time.
- Context/History: The AI must remember the context of the current conversation session.
- Chat History: Users can view, resume, and delete past conversation threads.
- Low TTFT (Time To First Token): The user should see the first word within 500ms to feel the system is responsive.
- High Throughput (Tokens per Second): Generation speed should stay comfortably ahead of human reading speed; target ~20-30 tokens/sec per response.
- Scalability: GPU resources are extremely expensive and limited; the system must maximize GPU utilization via batching.
- Context Window Limits: The system must gracefully handle truncating or summarizing history when the context window is exceeded.
An LLM Chat application combines standard web architecture (databases for chat history, rate limiters) with specialized ML inference infrastructure. We can structure this into three distinct layers:
- API / Gateway Layer: Terminates SSL and handles user authentication and rate limiting. It maintains long-lived Server-Sent Event (SSE) connections to stream generated tokens directly back to the user's browser.
- Service Layer: The Context Manager fetches conversation history from the DB to build the full prompt, and the Inference Engine Fleet (vLLM / TensorRT-LLM) manages massive GPU clusters, performing Continuous Batching and KV-Caching to generate tokens.
- Data / Storage Layer: Relies on an ultra-fast NoSQL store (DynamoDB or Cassandra) for O(1) retrieval of chat histories, and distributed blob storage (S3) for persisting multi-gigabyte open-weights LLM model files.
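The gateway's rate limiting can be illustrated with a token bucket. This is a minimal in-process sketch, assuming one bucket per user; the class name, parameters, and the choice of token bucket are illustrative assumptions rather than something the design above prescribes, and a production gateway would more likely keep this state in a shared store such as Redis.

```python
import time
from typing import Optional


class TokenBucket:
    """Hypothetical per-user limiter: bursts up to `capacity`, refills over time."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 now: Optional[float] = None):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity  # start full so the first burst is allowed
        self.updated_at = time.monotonic() if now is None else now

    def allow(self, cost: float = 1.0, now: Optional[float] = None) -> bool:
        """Return True and spend `cost` tokens if the request may proceed."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated_at)
        # Refill proportionally to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Passing `cost` proportional to the prompt size would let the same limiter approximate a per-token budget instead of a per-request one.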
1. Streaming via Server-Sent Events (SSE)
Waiting for a 500-word response to fully generate on the GPU before sending it to the client takes 10+ seconds, which is unacceptable UX. Instead, we use Server-Sent Events (SSE) (or WebSockets) to stream chunks to the client as soon as the GPU outputs them. SSE is preferred over WebSockets here because the response stream is unidirectional (server to client) and SSE runs over plain HTTP: it needs no protocol upgrade handshake, passes through standard proxies and load balancers, and can share a connection with other requests when served over HTTP/2.
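As a rough sketch of the server side, the generator below frames each generated token as one SSE event using the chunk shape shown in this section. The function name `sse_events` and its parameters are hypothetical; a real server would hand these strings to whatever streaming response type its web framework provides.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str], message_id: str) -> Iterator[str]:
    """Yield one SSE frame per token, then a terminating [DONE] frame."""
    for token in tokens:
        chunk = {
            "id": message_id,
            "object": "chat.completion.chunk",
            "choices": [{"delta": {"content": token}}],
        }
        # An SSE frame is a "data:" line terminated by a blank line.
        yield f"data: {json.dumps(chunk)}\n\n"
    # Sentinel frame telling the client the stream is complete.
    yield "data: [DONE]\n\n"
```

Because the function is a generator, each frame can be flushed to the socket as soon as the inference engine produces the token, which is what keeps TTFT low.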
HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
data: {"id": "msg_1", "object": "chat.completion.chunk", "choices": [{"delta": {"content": "The"}}]}
data: {"id": "msg_1", "object": "chat.completion.chunk", "choices": [{"delta": {"content": " capital"}}]}
data: {"id": "msg_1", "object": "chat.completion.chunk", "choices": [{"delta": {"content": " of"}}]}
data: [DONE]
2. Context Manager & Prompt Builder
LLMs are strictly stateless mathematical functions. To give the illusion of memory, the server must fetch the recent chat history from the database and prepend it to the user's prompt before sending it to the model.
User: "What's the capital of France?"
System internally builds the array:
[
{"role": "system", "content": "You are a helpful AI assistant. Limit answers to 1 sentence."},
{"role": "user", "content": "Hi"},
{"role": "assistant", "content": "Hello! How can I help?"},
{"role": "user", "content": "What's the capital of France?"}
]
Tokenized and sent to Inference Engine.
If the history exceeds the model's context window (e.g., 8K or 128K tokens), the Context Manager must intervene:
1. Truncation (Sliding Window): Simply drop the oldest messages.
2. Summarization: Use a background worker to summarize older blocks of conversation into a dense paragraph and inject that as a system prompt.
3. RAG (Retrieval-Augmented Generation): Vectorize past messages into a VectorDB and retrieve only semantically relevant past messages.
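The first strategy, sliding-window truncation, can be sketched as follows. This is a minimal illustration under stated assumptions: `build_prompt` and `approx_tokens` are hypothetical names, and the 4-characters-per-token estimate is a crude stand-in for the model's real tokenizer, which should be used in practice.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def approx_tokens(text: str) -> int:
    """Crude placeholder for a real tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def build_prompt(
    system: Message,
    history: List[Message],
    user_msg: Message,
    max_tokens: int,
    count: Callable[[str], int] = approx_tokens,
) -> List[Message]:
    """Keep the system prompt and newest user message; add history newest-first."""
    budget = max_tokens - count(system["content"]) - count(user_msg["content"])
    kept: List[Message] = []
    for msg in reversed(history):
        cost = count(msg["content"])
        if cost > budget:
            break  # this message and everything older is dropped
        kept.append(msg)
        budget -= cost
    # Restore chronological order before sending to the model.
    return [system, *reversed(kept), user_msg]
```

The system prompt and the latest user message are always kept, so the budget is spent only on history. A summarization strategy could reuse the same loop and replace the dropped messages with a single summary message instead of discarding them.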
3. The Inference Engine & GPU Optimization ⭐
Running an LLM in production requires specialized, highly optimized inference engines (like vLLM or TensorRT-LLM), not just a raw PyTorch script.
Continuous Batching (Iteration-Level Scheduling)
In traditional request-level batching, the GPU waits for the longest request in the batch to finish before accepting new requests. Continuous Batching ejects finished requests at the exact token iteration they finish and inserts new requests instantly. This dramatically increases GPU utilization and throughput.
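A toy simulation makes the difference concrete. It counts decode iterations only and ignores prefill cost, so it is a simplified model rather than a description of how vLLM or TensorRT-LLM schedule work; the function names and the example workload are illustrative assumptions.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Request-level batching: each batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Iteration-level scheduling: refill free slots before every decode step."""
    waiting = deque(lengths)
    running: List[int] = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode iteration emits one token per active request
        # Requests that just produced their last token leave the batch.
        running = [left - 1 for left in running if left > 1]
    return steps
```

With output lengths `[8, 1, 1, 1]` and two slots, the static scheduler needs 9 iterations because the second batch waits behind the 8-token request, while the continuous scheduler finishes in 8 by slotting the short requests in next to the long one.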
KV Cache & PagedAttention
The KV Cache stores the internal self-attention state tensors (Keys and Values) for previously generated tokens so they don't have to be recomputed for every new token. However, KV cache memory is highly dynamic, leading to extreme memory fragmentation on the GPU. PagedAttention (introduced by vLLM) solves this by managing GPU memory exactly like an OS manages CPU RAM: using virtual memory pages. This allows KV cache blocks to be non-contiguous in physical VRAM, virtually eliminating memory fragmentation and allowing 2-4x larger batch sizes without OOM crashes.
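The paging idea can be sketched with a toy block table. This models only the bookkeeping (which fixed-size block holds which slice of a sequence), not the attention kernel itself; the class name, method names, and the block size are illustrative assumptions and not vLLM's actual API.

```python
from typing import Dict, List


class PagedKVAllocator:
    """Toy block table: maps each sequence to non-contiguous fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size  # tokens per block (arbitrary example value)
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}
        self.seq_lens: Dict[str, int] = {}

    def append_token(self, seq_id: str) -> bool:
        """Reserve room for one more token; return False if no block is free."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:  # last block is full
            if not self.free_blocks:
                return False  # caller should queue or preempt instead of crashing
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return True

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because a sequence only ever wastes the unused tail of its last block, memory is allocated on demand rather than reserved up front for the maximum possible output length.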
4. Model Router & Prefix Caching
A specialized L7 load balancer routes requests to Inference Nodes. Advanced routers implement Prefix-Aware Routing: if a user asks a follow-up question regarding a massive 10,000-word document, the router hashes the document (the prefix) and ensures the request is routed to the exact GPU node that already holds that document's KV cache in its memory. This skips the expensive "prefill" computation for the cached prefix, so only the new question has to be processed, saving massive amounts of compute.
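One way to approximate prefix-aware routing without tracking each node's cache contents is to hash a fixed-length prefix of the prompt and pick a node deterministically. The sketch below uses rendezvous hashing; the function names, node names, and the 256-token prefix length are assumptions for illustration, and real routers may instead consult live cache state.

```python
import hashlib
from typing import List, Sequence


def prefix_key(token_ids: Sequence[int], prefix_len: int = 256) -> str:
    """Hash only the leading tokens so follow-ups on one document share a key."""
    head = ",".join(str(t) for t in token_ids[:prefix_len])
    return hashlib.sha256(head.encode("utf-8")).hexdigest()


def pick_node(token_ids: Sequence[int], nodes: List[str]) -> str:
    """Rendezvous hashing: the node with the highest score for this prefix wins."""
    key = prefix_key(token_ids)

    def score(node: str) -> str:
        return hashlib.sha256(f"{key}:{node}".encode("utf-8")).hexdigest()

    return max(nodes, key=score)
```

Two questions about the same document share their first 256 tokens and therefore land on the same node. Rendezvous hashing also means that removing an unrelated node does not move this prefix, so existing KV caches stay useful during scale-in.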
Standard REST requests would time out waiting for an LLM to generate 1000 words. We must use Server-Sent Events (SSE).
Chat Completion Stream
HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
data: {"id": "msg_1", "object": "chat.completion.chunk", "choices": [{"delta": {"content": "The"}}]}
data: {"id": "msg_1", "object": "chat.completion.chunk", "choices": [{"delta": {"content": " capital"}}]}
data: {"id": "msg_1", "object": "chat.completion.chunk", "choices": [{"delta": {"content": " of"}}]}
data: [DONE]
Prompt Construction
The API server dynamically constructs the full context window before querying the LLM Engine.
User: "What's the capital of France?"
System internally builds the array:
[
{"role": "system", "content": "You are a helpful AI assistant. Limit answers to 1 sentence."},
{"role": "user", "content": "Hi"},
{"role": "assistant", "content": "Hello! How can I help?"},
{"role": "user", "content": "What's the capital of France?"}
]
Tokenized and sent to Inference Engine.
Because the LLM engine is stateless, the API Gateway must fetch the entire conversation history from the database on every request to construct the prompt array. We use a highly scalable NoSQL database (like DynamoDB or Cassandra) for O(1) history retrieval by session_id.
// DynamoDB / Cassandra Table Schema
// Partition Key: session_id
// Sort Key: message_id (or timestamp)
{
"session_id": "sess_12345", // PK
"message_id": "msg_001", // SK
"user_id": "usr_987", // GSI (Global Secondary Index)
"role": "user", // 'user', 'assistant', 'system'
"content": "What's the capital?",
"created_at": "2023-10-01T12:00:00Z",
"tokens": 6
}
| Failure Case | System Solution Design |
|---|---|
| GPU Node OOM (Out of Memory) | Inference engine strictly monitors KV cache allocation and queues requests if VRAM is full. Node auto-restarts on crash. |
| High Traffic Spikes | Scale-out takes minutes due to large model weights (10-100GB). We must use a dynamic queue and inform users of wait times, plus keep pre-warmed standby nodes. |
| Database Slowdown | Use a highly scalable NoSQL database (DynamoDB / Cassandra) keyed by SessionID for O(1) history retrieval. |
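The "dynamic queue with wait times" mitigation for traffic spikes could look roughly like the sketch below. It is a simplified single-process model: `AdmissionQueue`, its parameters, and the linear wait estimate are assumptions for illustration, and a real deployment would measure service time continuously rather than take it as a constant.

```python
from collections import deque
from typing import Deque, Optional


class AdmissionQueue:
    """Bounded FIFO that reports an estimated wait instead of accepting all load."""

    def __init__(self, max_depth: int, workers: int, avg_service_sec: float):
        self.max_depth = max_depth
        self.workers = workers  # inference replicas draining the queue
        self.avg_service_sec = avg_service_sec
        self.pending: Deque[str] = deque()

    def enqueue(self, request_id: str) -> Optional[float]:
        """Return the estimated wait in seconds, or None if the request is shed."""
        if len(self.pending) >= self.max_depth:
            return None  # caller should respond with HTTP 429/503 and a retry hint
        self.pending.append(request_id)
        # Requests ahead of this one, spread across the available workers.
        ahead = len(self.pending) - 1
        return ahead / self.workers * self.avg_service_sec

    def dequeue(self) -> Optional[str]:
        """Hand the oldest waiting request to a free worker, if any."""
        return self.pending.popleft() if self.pending else None
```

The returned estimate is what the UI would surface as a wait time, and the hard depth limit keeps a spike from turning into unbounded latency while standby nodes load model weights.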
Open Weights vs Managed API
Using OpenAI/Anthropic APIs is easy and requires zero GPU infrastructure, but costs scale linearly per token and data privacy is a major concern for enterprise customers. Hosting open-weights models (Llama 3, Mistral) on AWS EC2/GCP GPUs requires heavy MLOps engineering and high upfront fixed costs, but provides absolute data privacy, no vendor lock-in, and can be dramatically cheaper at massive scale.
Latency (TTFT) vs. Throughput (Batch Size)
To process requests efficiently, the inference engine batches multiple users' prompts together. A larger batch size drastically increases total tokens/sec (throughput), but increases the Time-To-First-Token (TTFT) latency because the GPU must process everyone's prompt simultaneously before emitting the first word. You must tune the max batch size based on whether you are optimizing for real-time human chat (optimize for low TTFT) or offline background summarization (optimize for max throughput).
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: Stateless API Wrapper
A simple web app that accepts a prompt and waits synchronously for the full response from an external API (like OpenAI).
Key components: React SPA · Node.js API · External LLM API
Move to next phase when: Users complain about 10-second wait times; TTFT is too high.
Phase 2: Streaming & History
Implement SSE for streaming tokens. Introduce a database to store conversation history and manage context windows.
Key components: Server-Sent Events (SSE) · PostgreSQL / DynamoDB for Chat History · Redis for rate limiting
Move to next phase when: Costs explode due to sending massive, unmanaged context arrays; need in-house model hosting.
Phase 3: Custom Inference & KV Routing
Deploy open-source models on a managed GPU cluster. Implement intelligent routing to optimize KV cache hits.
Key components: GPU Inference Cluster (vLLM / TGI) · Context-aware Load Balancer · Vector DB for RAG capabilities
Move to next phase when: Inference costs require deep hardware optimization.
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Time To First Token (TTFT) | 99% < 1.5s | Crucial for perceived performance and UX. |
| Token Generation Rate | > 20 tokens/sec | Must be faster than human reading speed. |
| Availability | 99.9% | Standard consumer application availability. |
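The targets in the table translate into concrete error budgets. The helper functions below are a small illustrative sketch (the names and the 30-day window are assumptions): a 99.9% availability SLO allows about 43.2 minutes of downtime per 30 days, and a "99% < 1.5s" TTFT target allows 1 in 100 requests to be slower.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO."""
    return (1.0 - slo) * window_days * 24 * 60


def allowed_slow_requests(total_requests: int, percentile: float) -> int:
    """How many requests may exceed a latency target such as '99% < 1.5s'."""
    return round(total_requests * (1.0 - percentile))


# 99.9% availability over 30 days leaves roughly 43.2 minutes of budget.
availability_budget = error_budget_minutes(0.999)

# With 1,000,000 requests, a p99 TTFT target tolerates 10,000 slow responses.
ttft_budget = allowed_slow_requests(1_000_000, 0.99)
```

Tracking how quickly these budgets burn down is what turns the SLO table into an operational signal, for example to decide when to pause risky deployments.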
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| GPU Out of Memory (OOM) | Inference servers crash with CUDA out of memory errors; 500 errors spike. | Implement continuous batching and strict KV cache memory limits (e.g., PagedAttention). Gracefully degrade by rejecting new requests rather than crashing. |
| Safety Filter False Positives | High volume of user reports that benign queries are blocked. | Allow users to contest flags. Implement a fast, secondary fallback classifier, or adjust the moderation threshold dynamically. |
| Thundering Herd on a specific model tier | GPT-4 endpoint latency spikes while GPT-3.5 is idle. | Implement dynamic pricing or hard queue limits. Surface wait times in the UI to manage expectations. |
Cost Drivers (Staff lens)
- GPU Compute (A100s / H100s) for inference
- API costs (if using third-party providers, token costs dominate)
- Database storage for billions of chat messages
Multi-Region & DR
Users can be routed to the nearest regional GPU cluster to minimize TTFT. Chat history can be replicated asynchronously since users rarely access the same chat session from two different continents simultaneously.