System Design Problem

Design an LLM Chat Application (ChatGPT)

Commonly Asked By: OpenAI, Anthropic, Google, Meta

  • Conversational UI: Users can send text prompts and receive text responses from an AI.
  • Streaming Responses: Responses must be streamed back to the user token by token as they are generated.
  • Context/History: The AI must remember the context of the current conversation session.
  • Chat History: Users can view, resume, and delete past conversation threads.
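
One common way to meet the streaming requirement is Server-Sent Events (SSE), where each generated token is written to the open HTTP response as a small text frame. The sketch below shows only the wire framing. The function names, the JSON payload shape ({"token": ...}), and the "done" event name are illustrative assumptions, not a prescribed format.

```python
import json
from typing import Iterable, Iterator


def sse_frame(token: str) -> str:
    """Format one token as an SSE event.

    JSON-encoding the payload escapes newlines inside the token, so the
    event always fits on a single "data:" line. A blank line ends the event.
    """
    return f"data: {json.dumps({'token': token})}\n\n"


def stream_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE frame per token, then a final end-of-stream event."""
    for token in tokens:
        yield sse_frame(token)
    # Hypothetical end-of-stream marker so the client knows to stop reading.
    yield "event: done\ndata: {}\n\n"


if __name__ == "__main__":
    # In a real server these strings would be flushed to the HTTP response
    # (Content-Type: text/event-stream) as the model produces tokens.
    for frame in stream_tokens(["Hello", ",", " world"]):
        print(frame, end="")
```

On the browser side, frames like these can be consumed with the standard EventSource API or by reading the response body incrementally.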

An LLM Chat application combines standard web architecture (databases for chat history, rate limiters) with specialized ML inference infrastructure. We can structure this into three distinct layers:

  • API / Gateway Layer: Terminates TLS and handles user authentication and rate limiting. It holds long-lived Server-Sent Events (SSE) connections open so that generated tokens can be streamed back to the user's browser as they are produced.
  • Service Layer: The Context Manager fetches the conversation history from the DB and assembles the full prompt. The Inference Engine Fleet (for example vLLM or TensorRT-LLM) runs on GPU clusters and uses continuous batching and KV caching to generate tokens efficiently for many concurrent requests.
  • Data / Storage Layer: Relies on a low-latency NoSQL store (DynamoDB or Cassandra) keyed by conversation ID, so that a thread's history can be fetched with a single key lookup, and on distributed blob storage (S3) for persisting multi-gigabyte LLM weight files.
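
The Context Manager step in the Service Layer can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the Message type, the build_prompt function, and the whitespace-based estimate_tokens helper are all hypothetical, and a real system would count tokens with the model's own tokenizer. The sketch keeps the system prompt and the new user message, then adds prior turns from newest to oldest until the token budget is used up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    role: str      # "system", "user", or "assistant"
    content: str


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return max(1, len(text.split()))


def build_prompt(system: Message, history: List[Message],
                 new_message: Message, max_tokens: int) -> List[Message]:
    """Assemble the prompt, dropping the oldest turns that do not fit."""
    # Reserve room for the parts that must always be present.
    budget = (max_tokens
              - estimate_tokens(system.content)
              - estimate_tokens(new_message.content))
    kept: List[Message] = []
    # Walk the stored history from newest to oldest.
    for msg in reversed(history):
        cost = estimate_tokens(msg.content)
        if cost > budget:
            break  # this turn and everything older is dropped
        kept.append(msg)
        budget -= cost
    kept.reverse()  # restore chronological order
    return [system, *kept, new_message]


if __name__ == "__main__":
    history = [
        Message("user", "one two three"),
        Message("assistant", "four five"),
        Message("user", "six"),
    ]
    prompt = build_prompt(Message("system", "be helpful"), history,
                          Message("user", "seven eight"), max_tokens=7)
    for m in prompt:
        print(f"{m.role}: {m.content}")
```

Truncating from the oldest turn is only one possible policy; summarizing older turns is another option when long conversations must keep early context.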