This problem appears in multiple sheets. Depth expectations increase as you progress.
Interview Prompt
Design a Document Q&A Platform using RAG (Retrieval-Augmented Generation).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| What types of documents are we ingesting? | PDFs require OCR and complex layout parsing; Markdown/Text is much simpler. |
| How large is the document corpus? | Determines the scale of the Vector Database and the ingestion pipeline. |
| Are there strict data privacy or access control requirements? | Enterprise RAG must respect document permissions during retrieval. |
Scope
In scope
- Document ingestion and parsing pipeline
- Chunking and embedding generation
- Vector database retrieval
- Prompt orchestration and LLM generation
Out of scope (state explicitly)
- Training the foundational embedding model or LLM
- Real-time collaborative document editing
Assumptions
- We will use an external LLM API for embeddings and generation
- Accuracy and factual correctness are highly prioritized over raw speed
- Users will ask questions that require synthesizing multiple chunks
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
Functional Requirements
- Document Upload: Users can upload PDFs, Word docs, and text files.
- Q&A Interface: Users can ask natural language questions about their uploaded documents.
- Citations: The AI's response must include exact citations (e.g., "Source: Employee_Handbook.pdf, Page 4").
- Access Control: Users should only be able to query documents they have permission to view.
Non-Functional Requirements
- Low Hallucination Rate: The system must strictly adhere to the source documents and avoid making up facts.
- Fast Retrieval: Searching through thousands of documents to find relevant context should take < 500ms.
- Scalable Storage: The system must efficiently store and search millions of high-dimensional vectors.
Retrieval-Augmented Generation (RAG) bridges the gap between a user's private data and a public LLM. The architecture has two pipelines, both built on the three layers listed below:
1. The Ingestion Pipeline (Offline)
2. The Retrieval & Generation Pipeline (Online)
- API / Gateway Layer: Exposes the document upload endpoints (handling multi-part form data) and the chat query endpoints (handling SSE streaming for real-time LLM responses).
- Service Layer: Contains the Ingestion Workers (which run OCR, chunk text, and call Embedding Models) and the RAG Orchestrator (which receives queries, fetches vectors, constructs prompts, and calls the Inference Engine).
- Data / Storage Layer: The foundation. Uses Amazon S3 for raw document storage, PostgreSQL for document metadata and user permissions, and a highly scalable Vector Database (e.g., Pinecone or Milvus) for storing high-dimensional embeddings and executing fast nearest-neighbor searches.
1. Document Parsing & Text Chunking
LLMs have strict context window limits and degrade in reasoning capability when overloaded with text. We cannot feed a 500-page PDF directly into the prompt. The ingestion pipeline must meticulously parse and chunk the document.
- Parsing Artifacts: Extracting text from raw PDFs often destroys tables and spatial layouts. High-end RAG systems use multi-modal models (like GPT-4V) or specialized OCR pipelines (Unstructured.io) to parse tables into Markdown or HTML to preserve relational data.
- Chunk Size: Usually 512 to 1024 tokens. If chunks are too small, they lose semantic context. If too large, they dilute the relevance of the embedding vector.
- Overlap: A 10-20% sliding-window overlap (e.g., roughly 50 to 100 tokens for a 512-token chunk) between adjacent chunks keeps sentences that straddle a boundary from losing their context.
- Semantic Chunking: Advanced systems chunk by document structure (paragraphs, sections, headers) rather than raw character counts, ensuring cohesive thoughts remain together.
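The fixed-size chunking with overlap described above can be sketched as follows. This is a minimal illustration, not a production chunker: the function name `chunk_tokens` is hypothetical, and a plain list of strings stands in for the output of a real tokenizer.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 512, overlap: int = 50) -> List[List[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Synthetic tokens stand in for real tokenizer output (an assumption).
words = [f"w{i}" for i in range(1200)]
chunks = chunk_tokens(words, chunk_size=512, overlap=50)
print([len(c) for c in chunks])
```

Each chunk begins with the last 50 tokens of the previous one, so a sentence cut at a boundary appears whole in at least one chunk as long as it is shorter than the overlap.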
2. Vector Embeddings
An embedding model (like OpenAI's text-embedding-3-small) converts a chunk of text into a high-dimensional array of floats (e.g., 1536 dimensions). These models are trained via contrastive learning so that semantically similar texts (e.g., "puppy" and "dog") are mapped to points that are geometrically close together in this 1536-dimensional space, even if they share zero exact keywords.
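To make "geometrically close" concrete, here is a small sketch of cosine similarity, one common closeness measure for embeddings. The three-dimensional vectors are hand-made stand-ins for real embeddings, chosen only for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Illustrative values only; real embeddings come from an embedding model.
puppy = [0.9, 0.8, 0.1]
dog = [0.85, 0.75, 0.2]
invoice = [0.1, 0.05, 0.95]

print(cosine_similarity(puppy, dog) > cosine_similarity(puppy, invoice))
```

With these toy values, "puppy" scores far higher against "dog" than against "invoice", which is the property retrieval relies on.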
3. Vector Database (Pinecone / Milvus / pgvector) ⭐
A Vector Database stores these embeddings and performs Approximate Nearest Neighbor (ANN) search to find the closest vectors to the user's query vector using metrics like Cosine Similarity or Dot Product.
- HNSW (Hierarchical Navigable Small World): A widely used ANN algorithm. It builds a multi-layered graph (similar to a skip list) to quickly navigate to the neighborhood of the target vector, typically answering queries over millions of vectors in a few milliseconds or less without comparing the query against every stored vector.
- Metadata Filtering (Pre/Post-Filtering): Crucial for enterprise security. We attach `user_id`, `tenant_id`, and `document_id` as metadata to every vector. During retrieval, we instruct the Vector DB to only search vectors where `user_id == current_user`. Pre-filtering applies this rule before the vector search; post-filtering applies it after (which risks returning zero results if the top-K vectors belonged to someone else).
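The difference between pre-filtering and post-filtering can be seen in a toy example. This sketch uses a brute-force dot-product search over four hand-made vectors as a stand-in for a real ANN index; the function names and record layout are assumptions for illustration, and a real Vector DB applies the filter inside its index.

```python
from typing import List, Tuple

Record = Tuple[str, str, List[float]]  # (chunk_id, tenant_id, vector)


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(records: List[Record], query: List[float], k: int) -> List[Record]:
    """Brute-force nearest neighbors by dot product (stand-in for ANN search)."""
    return sorted(records, key=lambda r: dot(r[2], query), reverse=True)[:k]


def post_filter_search(records: List[Record], query: List[float], tenant_id: str, k: int) -> List[str]:
    # Search everything first, then throw away hits the caller may not see.
    return [r[0] for r in top_k(records, query, k) if r[1] == tenant_id]


def pre_filter_search(records: List[Record], query: List[float], tenant_id: str, k: int) -> List[str]:
    # Restrict the candidate set first, then search only permitted vectors.
    allowed = [r for r in records if r[1] == tenant_id]
    return [r[0] for r in top_k(allowed, query, k)]


records: List[Record] = [
    ("ceo_1", "org_exec", [1.0, 0.0]),
    ("ceo_2", "org_exec", [0.9, 0.1]),
    ("emp_1", "org_123", [0.6, 0.4]),
    ("emp_2", "org_123", [0.1, 0.9]),
]
query = [1.0, 0.0]

print(post_filter_search(records, query, "org_123", k=2))
print(pre_filter_search(records, query, "org_123", k=2))
```

Here the two nearest vectors belong to another tenant, so post-filtering returns an empty list while pre-filtering still returns the caller's two best chunks.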
4. Prompt Building & Citations
Once the Top-K chunks are retrieved, the Context Builder injects them into the LLM's system prompt. To guarantee verifiable citations and reduce hallucinations, we prepend explicit metadata headers to each chunk.
System: You are an expert Q&A system. Base your answer SOLELY on the following context. If the context does not contain the answer, say "I don't know."
Context:
[Document: HR_Manual.pdf, Page: 4]
Employees are entitled to 20 days of paid time off per year. Unused days roll over to a maximum of 5 days.
[Document: Benefits_2024.pdf, Page: 2]
PTO requests must be submitted at least two weeks in advance.
User Query: How many PTO days do I get, and do they roll over?
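A Context Builder that produces a prompt of this shape might look like the sketch below. The `RetrievedChunk` class and `build_prompt` function are hypothetical names introduced for illustration; the instruction text and header format are taken from the example prompt.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RetrievedChunk:
    document: str  # source file name, used for the citation header
    page: int
    text: str


SYSTEM_INSTRUCTIONS = (
    "You are an expert Q&A system. Base your answer SOLELY on the following context. "
    'If the context does not contain the answer, say "I don\'t know."'
)


def build_prompt(chunks: List[RetrievedChunk], question: str) -> str:
    """Prepend a citation header to each chunk, then assemble the full prompt."""
    context_blocks = [
        f"[Document: {c.document}, Page: {c.page}]\n{c.text}" for c in chunks
    ]
    context = "\n\n".join(context_blocks)
    return f"System: {SYSTEM_INSTRUCTIONS}\n\nContext:\n{context}\n\nUser Query: {question}"


chunks = [
    RetrievedChunk("HR_Manual.pdf", 4, "Employees are entitled to 20 days of paid time off per year."),
    RetrievedChunk("Benefits_2024.pdf", 2, "PTO requests must be submitted at least two weeks in advance."),
]
prompt = build_prompt(chunks, "How many PTO days do I get?")
print(prompt)
```

Because every chunk carries its own header, the model can quote the document name and page directly, and the application can check that any cited source actually appeared in the context.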
Ingest Document (Async)
Document ingestion (OCR, chunking, embedding) takes too long for a synchronous response, so the API returns a Job ID immediately.
POST /api/v1/documents
Content-Type: multipart/form-data
file: <HR_Manual.pdf>
metadata: {"department": "HR", "access_level": "employee"}
Response: 202 Accepted
{
"document_id": "doc_8899",
"status": "PROCESSING",
"job_id": "job_1234"
}
The core of a RAG system is the Vector Database. It stores both dense embeddings (for semantic search) and sparse vectors (for exact keyword match), along with strictly enforced metadata for multi-tenant isolation.
// Vector Database Schema (e.g., Pinecone, Milvus, pgvector)
// We store chunks, not full documents.
{
"id": "chunk_998877", // Unique ID for the chunk
"vector": [0.014, -0.052, ...], // 1536-dimensional dense embedding
"sparse_vector": { // For BM25 Hybrid Search keyword matching
"indices": [41, 992, 104],
"values": [0.8, 0.4, 0.2]
},
"metadata": {
"document_id": "doc_8899",
"tenant_id": "org_123", // Mandatory for multi-tenant isolation!
"access_level": "employee", // Used for Pre-filtering before vector search
"page_number": 4,
"text_content": "Employees are entitled to 20 days..." // The actual text payload
}
}
Pre-filtering vs Post-filtering: When a user queries the Vector DB, we must ensure they only search documents they have access to. If we post-filter (search first, then discard unauthorized results), we might end up with 0 results if the top 10 nearest neighbors belonged to the CEO. Production systems use Pre-filtering (applying a strict SQL-like `WHERE tenant_id = 'org_123'` filter inside the Vector DB before calculating Cosine Similarity).
| Problem | Advanced Solution |
|---|---|
| Query is too vague to retrieve good chunks. | Query Transformation: Use a fast LLM to rewrite the user's query, expand it with synonyms, or split it into multiple sub-queries before hitting the Vector DB. A related technique, HyDE (Hypothetical Document Embeddings), has the LLM draft a hypothetical answer and embeds that draft instead of the raw query. |
| Vector search retrieves irrelevant chunks. | Re-ranking: Retrieve Top-50 chunks from the Vector DB (high recall), then use a dedicated Cross-Encoder model (Cohere Rerank) to accurately re-score and select the true Top-5 chunks (high precision). |
| The answer requires synthesizing multiple documents. | Agentic RAG: Instead of a single pass, deploy an Agent that can issue multiple sequential search queries to the Vector DB to gather missing puzzle pieces before synthesizing the final answer. |
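The retrieve-then-re-rank pattern from the table can be sketched as a two-stage function. Everything here is a toy stand-in: `toy_first_stage` returns a fixed noisy ordering in place of a Vector DB call, and `toy_scorer` counts shared words in place of a Cross-Encoder. Only the control flow is the point.

```python
import re
from typing import Callable, List


def retrieve_then_rerank(
    query: str,
    first_stage: Callable[[str, int], List[str]],
    scorer: Callable[[str, str], float],
    recall_k: int = 50,
    final_k: int = 5,
) -> List[str]:
    """Fetch a wide candidate set cheaply, then re-score it with a slower, sharper model."""
    candidates = first_stage(query, recall_k)
    reranked = sorted(candidates, key=lambda passage: scorer(query, passage), reverse=True)
    return reranked[:final_k]


def toy_first_stage(query: str, k: int) -> List[str]:
    # Hypothetical ANN output: roughly relevant, but in a noisy order.
    return [
        "PTO requests need two weeks notice.",
        "The cafeteria opens at 8am.",
        "Unused PTO days roll over, max 5.",
    ][:k]


def toy_scorer(query: str, passage: str) -> float:
    # Word overlap stands in for a Cross-Encoder relevance score (an assumption).
    q = set(re.findall(r"\w+", query.lower()))
    p = set(re.findall(r"\w+", passage.lower()))
    return float(len(q & p))


top = retrieve_then_rerank(
    "Do unused PTO days roll over?", toy_first_stage, toy_scorer, recall_k=3, final_k=1
)
print(top)
```

The first stage only has to get the right passage somewhere into the candidate list; the second stage is responsible for moving it to the top.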
Dense Vectors vs Keyword Search (Hybrid Search)
Dense Vector Search excels at semantic meaning ("What's the policy on sick leave?"), but is weak at exact keyword matching (e.g., searching for a specific product serial number "XJ-9000", an acronym, or a specific name). Production RAG systems rarely use Vector Search alone. They use Hybrid Search: executing Vector Search alongside a traditional BM25/TF-IDF keyword search (via Elasticsearch or Postgres), and merging the two ranked result sets with an algorithm like Reciprocal Rank Fusion (RRF) to get the best of both worlds.
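Reciprocal Rank Fusion is simple enough to sketch in a few lines: each document scores the sum of 1 / (k + rank) over every list it appears in. The constant k = 60 is a commonly used default, and the chunk IDs are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked ID lists; items ranked high in many lists win."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["c1", "c2", "c3"]   # hypothetical dense-search ranking
keyword_hits = ["c3", "c1", "c4"]  # hypothetical BM25 ranking

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

RRF uses only ranks, not raw scores, so it avoids having to normalize cosine similarities against BM25 scores, which live on different scales.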
Chunk Size vs Embedding Density
Embedding an entire 10-page document into a single vector dilutes its meaning: the vector becomes an average of everything and matches nothing well in search space. Chunking into single sentences gives high retrieval precision, but strips the LLM of the surrounding context needed to actually answer the question. A common mitigation is Parent-Child Retrieval: embedding small chunks (sentences/paragraphs) for high-precision search, but returning the larger parent chunk (e.g., the entire page) to the LLM for generation.
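A minimal sketch of Parent-Child Retrieval follows. Word overlap stands in for embedding similarity, and the `parents` / `children` structures and function names are assumptions made for this example. The key step is the last one: search over sentences, but return deduplicated parent pages.

```python
import re
from typing import Dict, List, Tuple

# Parent chunks (whole pages) keyed by a hypothetical parent ID.
parents: Dict[str, str] = {
    "page_4": "Employees are entitled to 20 days of paid time off per year. "
              "Unused days roll over to a maximum of 5 days.",
    "page_2": "PTO requests must be submitted at least two weeks in advance.",
}

# Child chunks (single sentences), each pointing back to its parent.
children: List[Tuple[str, str]] = [
    ("page_4", "Employees are entitled to 20 days of paid time off per year."),
    ("page_4", "Unused days roll over to a maximum of 5 days."),
    ("page_2", "PTO requests must be submitted at least two weeks in advance."),
]


def overlap_score(query: str, text: str) -> int:
    # Shared-word count stands in for vector similarity (an assumption).
    q = set(re.findall(r"\w+", query.lower()))
    t = set(re.findall(r"\w+", text.lower()))
    return len(q & t)


def retrieve_parents(query: str, top_k_children: int = 2) -> List[str]:
    """Rank the small child chunks, then return each matching parent once."""
    ranked = sorted(children, key=lambda c: overlap_score(query, c[1]), reverse=True)
    parent_ids: List[str] = []
    for parent_id, _sentence in ranked[:top_k_children]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[pid] for pid in parent_ids]


context = retrieve_parents("do unused days roll over")
print(context)
```

Both top-scoring sentences come from the same page, so the LLM receives that page once, with the surrounding sentences intact.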
Staff interviews expect you to articulate how the system evolves under real growth rather than jumping straight to the final architecture.
Phase 1: Basic RAG Script
A synchronous script that reads a text file, splits it by paragraphs, calls an embedding API, and queries a local vector index.
Key components: LangChain · Local FAISS index · OpenAI API
Move to next phase when: Need to handle thousands of users, large PDFs, and persistent data.
Phase 2: Scalable Ingestion & Managed DB
Asynchronous worker queues for parsing PDFs and generating embeddings. A managed Vector DB is introduced.
Key components: Kafka / Celery · Managed Vector DB (Pinecone) · Text Extractors (OCR)
Move to next phase when: Retrieval quality is poor; users complain about missing context or wrong answers.
Phase 3: Advanced Retrieval & Re-ranking
Implement hybrid search, semantic chunking (splitting by section, not just char count), and a re-ranking model.
Key components: Elasticsearch (BM25) · Cross-Encoder Re-ranker · Metadata Filtering
Move to next phase when: Enterprise requirements for strict document access control and high accuracy.
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Q&A Latency | 99% < 3s | Retrieval + LLM generation is slow; streaming helps perceived latency. |
| Ingestion Latency | 99% < 1 minute | Users expect uploaded documents to be searchable quickly. |
| Vector Search Recall | Recall@5 > 90% | Retrieval quality directly impacts the LLM's ability to answer correctly. |
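One way to measure the retrieval recall target is recall at k over a labeled evaluation set: the fraction of known-relevant chunks that appear in the top k results. The sketch below assumes such a labeled set exists; the chunk IDs are invented for illustration.

```python
from typing import List, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs found among the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(examples: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average recall@k across (retrieved, relevant) pairs in an evaluation set."""
    return sum(recall_at_k(r, rel, k) for r, rel in examples) / len(examples)


# Hypothetical evaluation data: retrieved ranking and ground-truth relevant IDs.
examples = [
    (["a", "b", "c", "d", "e", "f"], {"a", "c", "f"}),  # "f" falls outside the top 5
    (["x", "y"], {"y"}),
]
print(mean_recall_at_k(examples, k=5))
```

Tracking this number per release makes it possible to tell a retrieval regression apart from a generation regression when answer quality drops.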
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| API Rate Limits hit for Embedding Service | Ingestion pipeline throws 429 Too Many Requests errors. | Implement exponential backoff in the ingestion workers. For large backfills, batch embedding requests to maximize API throughput. |
| Poor Answer Quality / Hallucination Spike | User thumbs-down metrics increase; automated evaluation pipelines flag low relevance. | Inspect the retrieved chunks. Often caused by a bad parsing update (e.g., extracting table data as garbled text) or embedding drift. Re-tune chunk sizes and overlap. |
| Vector DB Memory Exhaustion | Vector DB nodes crash; index becomes read-only. | In-memory ANN indexes such as HNSW are highly memory intensive. Scale up RAM on DB nodes or shard the index by tenant/workspace to distribute the load. |
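The exponential backoff mitigation for 429 errors can be sketched as a small retry wrapper. `RateLimitError` is a hypothetical exception standing in for whatever the embedding client raises on HTTP 429, and the delay parameters are illustrative defaults. The wrapper uses random jitter so that many workers do not retry in lockstep.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 raised by the embedding client."""


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `fn` on rate limits, doubling the delay ceiling on each attempt."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the job queue
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))  # jitter spreads out retries
    raise RuntimeError("unreachable")


# Demo: an embedding call that is rate-limited twice, then succeeds.
calls = {"n": 0}


def flaky_embed() -> List[float]:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError()
    return [0.1, 0.2]


delays: List[float] = []
result = call_with_backoff(flaky_embed, sleep=delays.append)  # record delays instead of sleeping
print(result, len(delays))
```

Injecting `sleep` keeps the demo instant; in a worker the default `time.sleep` would apply the real delay.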
Cost Drivers (Staff lens)
- LLM API Costs (Generation and Embeddings)
- Vector DB Hosting (requires significant RAM)
- Compute for OCR and document parsing
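A back-of-the-envelope calculation shows why Vector DB hosting is RAM-heavy. The sketch assumes float32 vectors and, as a rough approximation for HNSW, about 2 × M four-byte neighbor links per vector on the base layer with M = 16. Real engines add further overhead (metadata, upper graph layers, replicas), so treat the result as a lower bound.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_float: int = 4) -> int:
    """Memory for the embeddings alone, stored as float32 by default."""
    return num_vectors * dims * bytes_per_float


def hnsw_link_bytes(num_vectors: int, m: int = 16, bytes_per_link: int = 4) -> int:
    """Rough base-layer graph overhead: about 2 * M neighbor IDs per vector (an assumption)."""
    return num_vectors * 2 * m * bytes_per_link


num_vectors = 10_000_000  # illustrative corpus size: ten million chunks
dims = 1536

vectors_gb = raw_vector_bytes(num_vectors, dims) / 1e9
links_gb = hnsw_link_bytes(num_vectors) / 1e9
print(f"vectors: {vectors_gb:.2f} GB, graph links: {links_gb:.2f} GB")
```

Under these assumptions, ten million 1536-dimensional chunks need about 61 GB for the vectors alone, which is why sharding by tenant or reducing vector precision comes up quickly at scale.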
Multi-Region & DR
The Vector DB can be replicated across regions for low-latency retrieval. The ingestion pipeline can run in a central region since it is asynchronous.