Design a Document Q&A Platform (RAG System)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Focus on the fundamental RAG pipeline: Ingestion, Chunking, Embedding, and Retrieval.
Arch 75	Staff angles: Advanced chunking strategies, hybrid search (Sparse + Dense), and mitigating LLM hallucination.

Interview Prompt

Design a Document Q&A Platform using RAG (Retrieval-Augmented Generation).

Clarifying Questions (ask before designing)

Question	Why it matters
What types of documents are we ingesting?	PDFs require OCR and complex layout parsing; Markdown/Text is much simpler.
How large is the document corpus?	Determines the scale of the Vector Database and the ingestion pipeline.
Are there strict data privacy or access control requirements?	Enterprise RAG must respect document permissions during retrieval.

Scope

In scope

Document ingestion and parsing pipeline
Chunking and embedding generation
Vector database retrieval
Prompt orchestration and LLM generation

Out of scope (state explicitly)

Training the foundational embedding model or LLM
Real-time collaborative document editing

Assumptions

We will use an external LLM API for embeddings and generation
Accuracy and factual correctness are prioritized over raw speed
Users will ask questions that require synthesizing multiple chunks

Document Upload: Users can upload PDFs, Word docs, and text files.
Q&A Interface: Users can ask natural language questions about their uploaded documents.
Citations: The AI's response must include exact citations (e.g., "Source: Employee_Handbook.pdf, Page 4").
Access Control: Users should only be able to query documents they have permission to view.

Retrieval-Augmented Generation (RAG) bridges the gap between a user's private data and a public LLM. We can break this architecture down into three explicit layers handling both Ingestion (Offline) and Retrieval (Online):

1. The Ingestion Pipeline (Offline)


2. The Retrieval & Generation Pipeline (Online)


API / Gateway Layer: Exposes the document upload endpoints (handling multipart form data) and the chat query endpoints (streaming LLM responses to the client over Server-Sent Events, SSE).
Service Layer: Contains the Ingestion Workers (which run OCR, chunk text, and call Embedding Models) and the RAG Orchestrator (which receives queries, fetches vectors, constructs prompts, and calls the Inference Engine).
Data / Storage Layer: The foundation. Uses Amazon S3 for raw document storage, PostgreSQL for document metadata and user permissions, and a horizontally scalable Vector Database (e.g., Pinecone or Milvus) for storing high-dimensional embeddings and executing fast nearest-neighbor searches.

1. Document Parsing & Text Chunking

LLMs have strict context window limits and degrade in reasoning capability when overloaded with text. We cannot feed a 500-page PDF directly into the prompt. The ingestion pipeline must meticulously parse and chunk the document.

Parsing Artifacts: Extracting text from raw PDFs often destroys tables and spatial layouts. High-end RAG systems use multi-modal models (like GPT-4V) or specialized OCR pipelines (Unstructured.io) to parse tables into Markdown or HTML to preserve relational data.
Chunk Size: Usually 512 to 1024 tokens. If chunks are too small, they lose semantic context. If too large, they dilute the relevance of the embedding vector.
Overlap: A 10-20% sliding window overlap between consecutive chunks (e.g., roughly 50 to 100 tokens for a 512-token chunk) ensures that a sentence cut at a chunk boundary still appears with its surrounding context in at least one chunk.
Semantic Chunking: Advanced systems chunk by document structure (paragraphs, sections, headers) rather than raw character counts, ensuring cohesive thoughts remain together.
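The fixed-size chunking with overlap described above can be sketched as follows. This is a minimal illustration, not a production chunker: the function name `chunk_tokens` is hypothetical, and a list of words stands in for real tokenizer output.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into fixed-size windows that overlap by `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: plain words stand in for tokens from a real tokenizer.
words = [f"w{i}" for i in range(1200)]
chunks = chunk_tokens(words, chunk_size=512, overlap=50)
print([len(c) for c in chunks])
```

A structure-aware (semantic) chunker would first split on headers and paragraphs and only fall back to a window like this for sections that exceed the size limit.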

2. Vector Embeddings

An embedding model (like OpenAI's text-embedding-3-small) converts a chunk of text into a high-dimensional array of floats (e.g., 1536 dimensions). These models are trained via contrastive learning so that semantically similar texts (e.g., "puppy" and "dog") are mapped to points that are geometrically close together in this 1536-dimensional space, even if they share zero exact keywords.
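To make "geometrically close" concrete, here is a small sketch of cosine similarity. The three-dimensional vectors are invented for illustration only; real embeddings come from the embedding model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as unrelated
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, not real model output).
puppy = [0.9, 0.8, 0.1]
dog = [0.85, 0.75, 0.2]
invoice = [0.1, 0.05, 0.95]

print(cosine_similarity(puppy, dog) > cosine_similarity(puppy, invoice))
```

With these toy values, "puppy" scores much closer to "dog" than to "invoice", which is the property retrieval relies on.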

3. Vector Database (Pinecone / Milvus / pgvector) ⭐

A Vector Database stores these embeddings and performs Approximate Nearest Neighbor (ANN) search to find the closest vectors to the user's query vector using metrics like Cosine Similarity or Dot Product.

HNSW (Hierarchical Navigable Small World): A widely used ANN algorithm. It builds a multi-layered graph (similar to a skip list) to quickly navigate to the neighborhood of the target vector, typically answering queries over millions of vectors in a few milliseconds or less without scanning every vector. The trade-off is that results are approximate and the graph is memory intensive.
Metadata Filtering (Pre/Post-Filtering): Crucial for enterprise security. We attach user_id, tenant_id, and document_id as metadata to every vector. During retrieval, we instruct the Vector DB to only search vectors where user_id == current_user. Pre-filtering applies this rule before the vector search; post-filtering applies it after (which risks returning zero results if the top-K vectors belonged to someone else).
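The difference between the two filtering modes can be shown with a toy example. This sketch uses an exhaustive dot-product scan in place of a real ANN index, and the record layout and function names are assumptions made for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query, records, k):
    """Exhaustive nearest-neighbor search; a stand-in for the ANN index."""
    return sorted(records, key=lambda r: dot(query, r["vector"]), reverse=True)[:k]


def search_post_filter(query, records, user_id, k):
    """Search everything first, then drop results the user may not see."""
    hits = top_k(query, records, k)
    return [r for r in hits if r["user_id"] == user_id]


def search_pre_filter(query, records, user_id, k):
    """Restrict the candidate set to the user's vectors, then search."""
    allowed = [r for r in records if r["user_id"] == user_id]
    return top_k(query, allowed, k)


records = [
    {"id": "c1", "user_id": "ceo", "vector": [1.0, 0.0]},
    {"id": "c2", "user_id": "ceo", "vector": [0.9, 0.1]},
    {"id": "c3", "user_id": "alice", "vector": [0.6, 0.4]},
]
query = [1.0, 0.0]

print([r["id"] for r in search_post_filter(query, records, "alice", k=2)])
print([r["id"] for r in search_pre_filter(query, records, "alice", k=2)])
```

With k=2, the post-filter variant returns nothing for alice because both nearest neighbors belong to another user, while the pre-filter variant still returns her chunk.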

4. Prompt Building & Citations

Once the Top-K chunks are retrieved, the Context Builder injects them into the LLM's prompt. To make citations verifiable and reduce hallucinations, we prepend explicit metadata headers to each chunk so the model can quote the source document and page.

System: You are an expert Q&A system. 
Base your answer SOLELY on the following context. If the context does not contain the answer, say "I don't know."

Context:
[Document: HR_Manual.pdf, Page: 4]
Employees are entitled to 20 days of paid time off per year. 
Unused days roll over to a maximum of 5 days.

[Document: Benefits_2024.pdf, Page: 2]
PTO requests must be submitted at least two weeks in advance.

User Query: How many PTO days do I get, and do they roll over?
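A prompt like the one above could be assembled along these lines. This is a sketch under assumptions: the function name `build_prompt` and the chunk dictionary keys (`document`, `page`, `text`) are hypothetical, chosen to mirror the example prompt.

```python
SYSTEM_PROMPT = (
    "You are an expert Q&A system.\n"
    "Base your answer SOLELY on the following context. "
    'If the context does not contain the answer, say "I don\'t know."'
)


def build_prompt(chunks, question):
    """Prefix each retrieved chunk with a citation header, then append the question."""
    blocks = []
    for chunk in chunks:
        header = f"[Document: {chunk['document']}, Page: {chunk['page']}]"
        blocks.append(f"{header}\n{chunk['text']}")
    context = "\n\n".join(blocks)
    return f"System: {SYSTEM_PROMPT}\n\nContext:\n{context}\n\nUser Query: {question}"


retrieved = [
    {"document": "HR_Manual.pdf", "page": 4,
     "text": "Employees are entitled to 20 days of paid time off per year."},
    {"document": "Benefits_2024.pdf", "page": 2,
     "text": "PTO requests must be submitted at least two weeks in advance."},
]
print(build_prompt(retrieved, "How many PTO days do I get, and do they roll over?"))
```

Because the headers come from stored metadata rather than from the model, a citation in the answer can be checked against the chunks that were actually supplied.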

Ingest Document (Async)

Document ingestion (OCR, chunking, embedding) takes too long for a synchronous response, so the API returns a Job ID immediately.

HTTP

POST /api/v1/documents
Content-Type: multipart/form-data

file: <HR_Manual.pdf>
metadata: {"department": "HR", "access_level": "employee"}

Response: 202 Accepted
{
  "document_id": "doc_8899",
  "status": "PROCESSING",
  "job_id": "job_1234"
}
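The accept-then-process flow can be sketched with an in-memory job table. Everything here beyond the response fields shown above is an assumption for illustration: the function names, the `READY` and `FAILED` statuses, and the dictionary standing in for a durable job store and queue.

```python
import itertools

_job_counter = itertools.count(1)
JOBS = {}  # job_id -> job record; stands in for a durable job table


def submit_document(filename):
    """Register the upload and return immediately with a 202-style response."""
    n = next(_job_counter)
    job = {"document_id": f"doc_{n}", "status": "PROCESSING", "job_id": f"job_{n}"}
    JOBS[job["job_id"]] = {**job, "filename": filename}
    return 202, job


def complete_job(job_id, ok=True):
    """Called by an ingestion worker once parsing, chunking, and embedding finish."""
    JOBS[job_id]["status"] = "READY" if ok else "FAILED"  # assumed status names


def get_job_status(job_id):
    """What a hypothetical status-polling endpoint would return."""
    return JOBS[job_id]["status"]


status_code, body = submit_document("HR_Manual.pdf")
print(status_code, body["status"])
complete_job(body["job_id"])
print(get_job_status(body["job_id"]))
```

The client polls the job (or receives a webhook) instead of holding the upload connection open while OCR and embedding run.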

The core of a RAG system is the Vector Database. It stores both dense embeddings (for semantic search) and sparse vectors (for exact keyword match), along with strictly enforced metadata for multi-tenant isolation.

JSON

// Vector Database Schema (e.g., Pinecone, Milvus, pgvector)
// We store chunks, not full documents.

{
  "id": "chunk_998877",             // Unique ID for the chunk
  "vector": [0.014, -0.052, ...],   // 1536-dimensional dense embedding
  "sparse_vector": {                // For BM25 Hybrid Search keyword matching
    "indices": [41, 992, 104],
    "values": [0.8, 0.4, 0.2]
  },
  "metadata": {
    "document_id": "doc_8899",
    "tenant_id": "org_123",         // Mandatory for multi-tenant isolation!
    "access_level": "employee",     // Used for Pre-filtering before vector search
    "page_number": 4,
    "text_content": "Employees are entitled to 20 days..." // The actual text payload
  }
}
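Since each chunk carries both a dense and a sparse representation, a hybrid query produces two ranked lists that must be merged. One common option, which the text above does not prescribe, is Reciprocal Rank Fusion (RRF); the sketch below assumes the two searches have already returned ordered chunk IDs, and the constant k=60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of IDs; each list contributes 1 / (k + rank) per ID."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from the dense (semantic) and sparse (keyword) searches.
dense_hits = ["chunk_a", "chunk_b", "chunk_c"]
sparse_hits = ["chunk_c", "chunk_a", "chunk_d"]

print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

RRF only uses rank positions, so it avoids having to normalize cosine scores against BM25 scores, which live on different scales.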

Pre-filtering vs Post-filtering: When a user queries the Vector DB, we must ensure they only search documents they have access to. If we post-filter (search first, then discard unauthorized results), we might end up with 0 results if the top 10 nearest neighbors belonged to the CEO. Production systems use Pre-filtering (applying a strict SQL-like `WHERE tenant_id = 'org_123'` filter inside the Vector DB before calculating Cosine Similarity).

Problem	Advanced Solution
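Query is too vague to retrieve good chunks.	Query Transformation: Use a fast LLM to rewrite the query, expand it with synonyms, or split it into multiple sub-queries before hitting the Vector DB. A related technique, HyDE (Hypothetical Document Embeddings), has the LLM draft a hypothetical answer and embeds that draft as the search vector instead of the raw query.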
Vector search retrieves irrelevant chunks.	Re-ranking: Retrieve Top-50 chunks from the Vector DB (high recall), then use a dedicated Cross-Encoder model (Cohere Rerank) to accurately re-score and select the true Top-5 chunks (high precision).
The answer requires synthesizing multiple documents.	Agentic RAG: Instead of a single pass, deploy an Agent that can issue multiple sequential search queries to the Vector DB to gather missing puzzle pieces before synthesizing the final answer.
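The retrieve-then-re-rank pattern from the table can be sketched as a second scoring pass over the candidates. In this illustration a toy word-overlap score stands in for the cross-encoder; a real system would pass a model-backed scoring function as `score_fn`. All names here are hypothetical.

```python
def overlap_score(query, text):
    """Toy relevance score: fraction of query words that appear in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def rerank(query, candidates, top_n=5, score_fn=overlap_score):
    """Re-score the high-recall candidate set and keep only the best top_n."""
    scored = sorted(candidates, key=lambda c: score_fn(query, c["text"]), reverse=True)
    return scored[:top_n]


# Pretend these came back from the Vector DB as the broad first-stage result.
candidates = [
    {"id": "c1", "text": "Office parking rules and badge access"},
    {"id": "c2", "text": "Unused PTO days roll over to a maximum of 5 days"},
    {"id": "c3", "text": "PTO requests must be submitted in advance"},
]

best = rerank("do pto days roll over", candidates, top_n=2)
print([c["id"] for c in best])
```

The first stage is cheap per document and optimizes recall; the second stage is expensive per document, so it only runs on the short candidate list.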

SLOs & Error Budgets

Metric	Target	Rationale
Q&A Latency	99% < 3s	Retrieval + LLM generation is slow; streaming helps perceived latency.
Ingestion Latency	99% < 1 minute	Users expect uploaded documents to be searchable quickly.
Vector Search Recall	Top-5 > 90%	Retrieval quality directly impacts the LLM's ability to answer correctly.
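The recall target can be measured offline against a labeled evaluation set. The sketch below assumes one reading of "Top-5 > 90%": the fraction of known-relevant chunks that appear in the top 5 results, averaged over queries. The function names and sample IDs are illustrative.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant chunks that appear in the first k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top = set(retrieved_ids[:k])
    return len(top & set(relevant_ids)) / len(set(relevant_ids))


def mean_recall_at_k(eval_set, k=5):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(r, rel, k) for r, rel in eval_set) / len(eval_set)


# Hypothetical evaluation data: ranked results and ground-truth labels per query.
eval_set = [
    (["c1", "c7", "c3", "c9", "c4", "c2"], {"c1", "c2", "c3"}),  # c2 falls outside top 5
    (["c5", "c6"], {"c5"}),
]
print(mean_recall_at_k(eval_set, k=5))
```

Tracking this number per release makes parsing or chunking regressions visible before users report bad answers.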

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
API Rate Limits hit for Embedding Service	Ingestion pipeline throws 429 Too Many Requests errors.	Implement exponential backoff in the ingestion workers. For large backfills, batch embedding requests to maximize API throughput.
Poor Answer Quality / Hallucination Spike	User thumbs-down metrics increase; automated evaluation pipelines flag low relevance.	Inspect the retrieved chunks. Often caused by a bad parsing update (e.g., extracting table data as garbled text) or embedding drift. Re-tune chunk sizes and overlap.
Vector DB Memory Exhaustion	Vector DB nodes crash; index becomes read-only.	Graph-based ANN indexes such as HNSW are highly memory intensive because the vectors and graph links are usually held in RAM. Scale up RAM on DB nodes or shard the index by tenant/workspace to distribute the load.
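The exponential backoff mitigation for 429 errors can be sketched as a retry wrapper. This is an illustration under assumptions: `RateLimitError` is a hypothetical exception standing in for whatever the embedding client raises on HTTP 429, and the delays use the "full jitter" variant (a random wait between zero and the current cap).

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error raised when the embedding API answers 429."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(); on RateLimitError wait with exponentially growing, jittered delays."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted; surface the error to the job runner
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # jitter avoids synchronized retries
```

An ingestion worker would wrap each batched embedding request, for example `call_with_backoff(lambda: embed_batch(texts))`, where `embed_batch` is a placeholder for the real client call. The `sleep` parameter exists so the wait can be stubbed out in tests.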

Cost Drivers (Staff lens)

LLM API Costs (Generation and Embeddings)
Vector DB Hosting (requires significant RAM)
Compute for OCR and document parsing
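A back-of-the-envelope calculation shows why Vector DB hosting is RAM-heavy. The formula below is a rough approximation, not a vendor sizing guide: it assumes float32 vectors plus about 2 x M 4-byte neighbor links per vector for an HNSW graph, and ignores metadata, upper graph layers, and replication. The parameter M=16 is an assumed, commonly used setting.

```python
def estimate_hnsw_bytes(num_vectors, dims, m=16):
    """Rough in-memory size: float32 vectors plus ~2*m 4-byte links per vector."""
    vector_bytes = dims * 4       # float32 components
    link_bytes = m * 2 * 4        # approximate base-layer neighbor list
    return num_vectors * (vector_bytes + link_bytes)


total = estimate_hnsw_bytes(10_000_000, 1536, m=16)
print(f"{total / 1e9:.2f} GB")  # prints "62.72 GB"
```

At 10 million chunks of 1536 dimensions the raw index alone is on the order of 60 GB, which is why sharding by tenant, reducing dimensions, or quantizing vectors are typical cost levers.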

Multi-Region & DR

The Vector DB can be replicated across regions for low-latency retrieval. The ingestion pipeline can run in a central region since it is asynchronous.