Design an AI Coding Assistant (Cursor / Claude Code)

This problem appears in multiple sheets. Depth expectations increase as you progress:

Track	What to demonstrate
Arch 50	Focus on the execution loop: observe, plan, act. How is the environment sandboxed?
Arch 75	Staff angles: Handling agent state persistence, state machine recovery, and secure multi-tenant execution.

Interview Prompt

Design an Autonomous Coding Agent (like Devin or SWE-agent).

Clarifying Questions (ask before designing)

Question	Why it matters
What actions can the agent take?	Read/write files, run shell commands, browse the web? Each capability expands the attack surface and orchestration complexity.
How long can a single task run?	Long-running tasks (e.g., days) require robust state persistence so the agent can be suspended and resumed.
Are we building a single-user tool or a multi-tenant SaaS?	Multi-tenant SaaS requires strict network and compute isolation between agent environments.

Scope

In scope

Agent orchestration loop (ReAct pattern)
Secure, stateful execution environments (Sandboxing)
Context and memory management for the LLM
Real-time streaming of agent actions to the frontend

Out of scope (state explicitly)

Training the foundational coding LLM
IDE UI implementation details

Assumptions

We rely on a powerful LLM (like GPT-4 or Claude 3.5 Sonnet) for reasoning
The agent requires access to the internet to read docs and fetch packages
Code execution must be completely isolated from the host system

IDE Integration: The agent operates directly within the user's local IDE (e.g., VSCode extension) or local terminal.
Codebase Context: The agent can automatically find and index relevant local files across a massive repository without manual user uploading.
Fast Apply / Inline Editing: The agent generates code and instantly applies the diffs directly to the user's open files for review.
Human-in-the-Loop (HitL): The user actively guides the agent, reviews diffs, and can seamlessly take over typing at any time.

Unlike fully autonomous cloud agents such as Devin, which run in isolated remote sandboxes (for example, Firecracker microVMs), copilots like Cursor and Claude Code run as local binaries or IDE extensions. They have direct access to the user's filesystem, language server (via LSP), and terminal, and act as a tightly integrated pair programmer.

1. Human-in-the-Loop (HitL) Copilot Flow

Instead of a pure autonomous ReAct loop, modern IDE agents use a guided flow. They use speculative execution to test code in the background before presenting it to the user.

The Orchestrator runs locally. It parses the user's request, uses local tools (LSP, ripgrep) to gather context, and sends a highly optimized prompt to the cloud LLM.

2. Local Codebase Context Indexing ⭐

A 10,000-file repository does not fit in an LLM context window. The agent must build a fast local index of the codebase so it can retrieve only the slices relevant to each request.

Local Vector DB: Upon opening a project, the extension uses a fast local embedding model (or a batch cloud API) to vectorize all files and store them in a local SQLite/LanceDB instance.
BM25 + Vector Hybrid Search: When a user asks "Fix the auth bug", the agent performs a hybrid search locally to find the top 5 most relevant files to inject into the LLM context window.
LSP Integration: Before sending, the agent queries the local language server (e.g., tsserver for TypeScript) for the exact type definitions and interface signatures used in those 5 files. Grounding the prompt in real signatures reduces hallucinated types and method names.

3. Shadow Workspaces & Predictive Compilation

How does the agent know if its generated code actually works before showing it to you?

The agent maintains a Shadow Workspace: a hidden, synchronized copy of the project in a temporary directory.
As the LLM streams the code, the agent applies it to the Shadow Workspace and runs the local compiler/linter (e.g., tsc --noEmit) in the background.
If the compiler throws an error, the agent intercepts it, auto-prompts the LLM with the error, and fixes it before ever rendering the final diff to the user's actual editor.
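A minimal sketch of the shadow-workspace check and the auto-fix loop described above. Several things here are assumptions for illustration: the workspace is a full copy made per check (a real tool would keep it synchronized incrementally), the checker is whatever command the caller passes in (the usage note below uses Python's `py_compile` as a stand-in for `tsc --noEmit` so the sketch runs without a Node toolchain), and `ask_llm_to_fix` is a hypothetical callback standing in for the LLM round trip.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable


def check_in_shadow(project_dir: Path, rel_path: str, new_content: str,
                    check_cmd: list[str]) -> tuple[bool, str]:
    """Apply a proposed edit to a throwaway copy of the project and run a checker there."""
    with tempfile.TemporaryDirectory(prefix="shadow-") as tmp:
        shadow = Path(tmp) / "workspace"
        shutil.copytree(project_dir, shadow)  # the user's real files are never touched
        (shadow / rel_path).write_text(new_content, encoding="utf-8")
        result = subprocess.run(check_cmd, cwd=shadow, capture_output=True, text=True)
        return result.returncode == 0, (result.stdout + result.stderr).strip()


def propose_until_clean(project_dir: Path, rel_path: str, first_draft: str,
                        ask_llm_to_fix: Callable[[str, str], str],
                        check_cmd: list[str], max_retries: int = 3) -> tuple[str, bool, str]:
    """Check a draft in the shadow copy, feeding errors back until it passes or retries run out.

    Returns (final_draft, passed, last_checker_output).
    """
    draft = first_draft
    for attempt in range(max_retries + 1):
        ok, errors = check_in_shadow(project_dir, rel_path, draft, check_cmd)
        if ok or attempt == max_retries:
            return draft, ok, errors
        draft = ask_llm_to_fix(draft, errors)  # hypothetical LLM call
    return draft, False, ""  # unreachable, kept for type checkers
```

For example, `propose_until_clean(project, "app.py", draft, fixer, [sys.executable, "-m", "py_compile", "app.py"])` returns the first draft that compiles, or the last broken draft together with the checker output once the retry limit is reached.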

4. Fast Apply (Diff Application Algorithm)

LLMs output text sequentially. If an LLM needs to change 1 line in a 1,000-line file, making it rewrite the entire 1,000 lines is slow and expensive.

Unified Diff Format: The LLM is prompted to output standard Git-style diffs or specific SEARCH/REPLACE blocks.
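Fuzzy Matching: LLMs often get indentation or whitespace slightly wrong in their SEARCH blocks, so the local IDE extension cannot rely on exact string matching. It locates the target region with a diff algorithm (such as Myers or Hunt-Szymanski) combined with fuzzy heuristics, for example whitespace normalization and a similarity threshold, and then merges the LLM's intended change into the actual local file.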
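One way to sketch the fuzzy matcher is shown below. It is not the Myers or Hunt-Szymanski algorithm: it swaps in Python's `difflib.SequenceMatcher` over a sliding window of whitespace-normalized lines, which is simpler but slower. The 0.9 similarity threshold and the re-indentation heuristic are assumptions chosen for illustration.

```python
import difflib
import textwrap
from typing import Optional


def _normalize(line: str) -> str:
    """Collapse all runs of whitespace so indentation differences do not matter."""
    return " ".join(line.split())


def find_fuzzy(file_lines: list[str], search_lines: list[str],
               min_ratio: float = 0.9) -> Optional[tuple[int, int]]:
    """Return the (start, end) line span most similar to the SEARCH block, or None."""
    n = len(search_lines)
    if n == 0 or n > len(file_lines):
        return None
    target = "\n".join(_normalize(l) for l in search_lines)
    best_ratio, best_start = 0.0, -1
    for start in range(len(file_lines) - n + 1):
        window = "\n".join(_normalize(l) for l in file_lines[start:start + n])
        ratio = difflib.SequenceMatcher(None, window, target, autojunk=False).ratio()
        if ratio > best_ratio:
            best_ratio, best_start = ratio, start
    if best_ratio < min_ratio:
        return None  # likely a hallucinated SEARCH block: refuse to apply
    return best_start, best_start + n


def apply_search_replace(source: str, search: str, replace: str,
                         min_ratio: float = 0.9) -> Optional[str]:
    """Apply one SEARCH/REPLACE edit; return the new file text or None if no match."""
    file_lines = source.splitlines()
    span = find_fuzzy(file_lines, search.splitlines(), min_ratio)
    if span is None:
        return None
    start, end = span
    first = file_lines[start]
    indent = first[: len(first) - len(first.lstrip())]
    # Heuristic: re-base the replacement onto the indentation of the matched region.
    new_block = textwrap.indent(textwrap.dedent(replace), indent).splitlines()
    return "\n".join(file_lines[:start] + new_block + file_lines[end:]) + "\n"
```

Returning `None` instead of guessing keeps a bad match from silently corrupting the file; the caller can then ask the user to apply the change manually or retry.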

Agent System Prompt

The local orchestrator injects available tools into the system prompt before calling the cloud LLM.

System: You are an AI Coding Assistant.
You have access to the following local tools:
1. read_file(path, start_line, end_line)
2. search_codebase(query) // Uses local Vector DB
3. get_lsp_references(symbol) // Uses local Language Server
4. run_terminal_command(cmd) // Requires human approval

User Request: Fix the CORS bug in the API Gateway.
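A small sketch of how an orchestrator might generate this prompt from a tool registry and enforce the approval requirement when the LLM later calls a tool. The `Tool` dataclass, `build_system_prompt`, `dispatch`, and the shape of the `approve` callback are all hypothetical names introduced for illustration; only the tool names and notes come from the prompt above.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Tool:
    name: str
    signature: str
    note: str = ""
    needs_approval: bool = False


TOOLS = [
    Tool("read_file", "path, start_line, end_line"),
    Tool("search_codebase", "query", "Uses local Vector DB"),
    Tool("get_lsp_references", "symbol", "Uses local Language Server"),
    Tool("run_terminal_command", "cmd", "Requires human approval", needs_approval=True),
]


def build_system_prompt(tools: list[Tool], user_request: str) -> str:
    """Render the tool registry into the numbered list the LLM sees."""
    lines = [
        "System: You are an AI Coding Assistant.",
        "You have access to the following local tools:",
    ]
    for i, tool in enumerate(tools, start=1):
        suffix = f" // {tool.note}" if tool.note else ""
        lines.append(f"{i}. {tool.name}({tool.signature}){suffix}")
    lines += ["", f"User Request: {user_request}"]
    return "\n".join(lines)


def dispatch(tool_name: str, args: dict[str, Any],
             handlers: dict[str, Callable[..., str]],
             approve: Callable[[str, dict[str, Any]], bool]) -> str:
    """Run a tool call from the LLM, asking the human first when the tool requires it."""
    tool = next((t for t in TOOLS if t.name == tool_name), None)
    if tool is None:
        return f"error: unknown tool {tool_name}"
    if tool.needs_approval and not approve(tool_name, args):
        return "error: user denied the action"
    return handlers[tool_name](**args)
```

Keeping the approval flag in the same registry that generates the prompt means the text the LLM reads and the policy the orchestrator enforces cannot drift apart.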

Fast Apply Diff Streaming

The LLM streams back a SEARCH/REPLACE block. The local extension applies this live to the editor.

JAVASCRIPT

// The agent streams this SEARCH/REPLACE block back instead of rewriting the whole file.
// The local Fast Apply algorithm fuzzily matches the SEARCH half and swaps in the REPLACE half.
// Target file: src/gateway.ts

<<<<
app.use(cors({ origin: 'localhost' }));
====
app.use(cors({ origin: process.env.ALLOWED_ORIGINS || '*' }));
>>>>
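Before an edit can be applied, the extension has to pull the SEARCH and REPLACE halves out of the streamed text. A minimal parser sketch follows, assuming the `<<<<`, `====`, and `>>>>` markers shown above each sit on their own line; the function name is hypothetical.

```python
def parse_edit_blocks(text: str) -> list[tuple[str, str]]:
    """Extract completed (search, replace) pairs from streamed LLM output.

    A block that has been opened but not yet closed is ignored, so this can be
    called repeatedly on the accumulated buffer while tokens are still arriving.
    """
    blocks: list[tuple[str, str]] = []
    state = None  # None (outside a block), "search", or "replace"
    search: list[str] = []
    replace: list[str] = []
    for line in text.splitlines():
        marker = line.strip()
        if state is None and marker == "<<<<":
            state, search, replace = "search", [], []
        elif state == "search" and marker == "====":
            state = "replace"
        elif state == "replace" and marker == ">>>>":
            blocks.append(("\n".join(search), "\n".join(replace)))
            state = None
        elif state == "search":
            search.append(line)
        elif state == "replace":
            replace.append(line)
    return blocks
```

Each returned pair can then be handed to the fuzzy matcher, which decides whether the SEARCH half actually exists in the target file.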

The agent must maintain a local vector database to quickly retrieve context without sending the entire repository to the cloud.

SQL

-- Local SQLite / Vector DB Schema

CREATE TABLE file_chunks (
    id TEXT PRIMARY KEY,
    file_path TEXT NOT NULL,
    chunk_index INTEGER,
    content TEXT NOT NULL,
    embedding BLOB, -- 1536-dimensional float array
    last_modified TIMESTAMP
);

-- When a file is saved, the IDE extension hashes it.
-- If the hash changed, it asynchronously re-chunks the file along AST boundaries and updates this local DB.
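The hash-then-reindex step can be sketched with Python's built-in `sqlite3` module. Two parts are assumptions rather than anything stated above: the `file_hashes` helper table (the schema has no column for the hash, so one is needed somewhere) and the fixed-size line chunker, which stands in for real AST-based chunking. Embeddings are left NULL here on the assumption that a background job fills them in.

```python
import hashlib
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS file_chunks (
    id TEXT PRIMARY KEY,
    file_path TEXT NOT NULL,
    chunk_index INTEGER,
    content TEXT NOT NULL,
    embedding BLOB,
    last_modified TIMESTAMP
);
-- Hypothetical helper table: remembers the last indexed hash per file.
CREATE TABLE IF NOT EXISTS file_hashes (
    file_path TEXT PRIMARY KEY,
    content_hash TEXT NOT NULL
);
"""


def chunk_text(text: str, max_lines: int = 40) -> list[str]:
    """Naive stand-in for AST chunking: split into fixed-size line windows."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]


def on_file_saved(db: sqlite3.Connection, file_path: str, text: str) -> bool:
    """Re-index a file only if its content hash changed. Returns True if re-indexed."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    row = db.execute(
        "SELECT content_hash FROM file_hashes WHERE file_path = ?", (file_path,)
    ).fetchone()
    if row is not None and row[0] == digest:
        return False  # unchanged: skip the expensive re-chunk and re-embed
    with db:  # one transaction: readers never see a half-updated file
        db.execute("DELETE FROM file_chunks WHERE file_path = ?", (file_path,))
        for index, chunk in enumerate(chunk_text(text)):
            db.execute(
                "INSERT INTO file_chunks (id, file_path, chunk_index, content, embedding, last_modified) "
                "VALUES (?, ?, ?, ?, NULL, CURRENT_TIMESTAMP)",
                (f"{file_path}:{index}", file_path, index, chunk),
            )
        db.execute(
            "INSERT OR REPLACE INTO file_hashes (file_path, content_hash) VALUES (?, ?)",
            (file_path, digest),
        )
    return True
```

Saving the same content twice is a no-op, which matters because editors with format-on-save or autosave can trigger many save events without changing the file.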

Failure Case	System Solution Design
LLM Hallucinates a Diff Block	The Fast Apply fuzzy matcher fails to find the target code in the local file. The agent immediately halts application and asks the user to manually apply or retry.
Infinite Auto-Fix Loop	If the Shadow Workspace compiler keeps failing, the agent enforces a strict 'Max Retries' limit (e.g., 3 tries) before giving up and showing the broken code to the user with the compiler error attached.
Context Window Overflow	The local orchestrator dynamically tracks token counts using a tokenizer (like tiktoken). If the gathered context exceeds the LLM limit, it aggressively prunes the least-relevant files or uses a Tree-Sitter AST to drop function bodies and keep only signatures.
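The context-overflow mitigation can be sketched as a greedy budget fit. The token counter below is a rough characters-divided-by-four estimate, which is an assumption; a real orchestrator would plug in the tokenizer that matches its model. The `ContextFile` type and `fit_to_budget` name are hypothetical, and the signature-only fallback is not shown.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ContextFile:
    path: str
    text: str
    relevance: float  # higher means more relevant to the user's request


def estimate_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def fit_to_budget(files: list[ContextFile], budget: int,
                  count_tokens: Callable[[str], int] = estimate_tokens) -> list[ContextFile]:
    """Keep the most relevant files that fit in the token budget, dropping the rest."""
    kept: list[ContextFile] = []
    used = 0
    for f in sorted(files, key=lambda f: f.relevance, reverse=True):
        cost = count_tokens(f.text)
        if used + cost <= budget:
            kept.append(f)
            used += cost
    return kept
```

A file that is too large is skipped rather than ending the loop, so a small but less relevant file can still use the remaining budget.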

SLOs & Error Budgets

Metric	Target	Rationale
Sandbox Startup Time	99% < 2s	Users expect immediate agent initialization.
Action to UI Latency	99% < 500ms	The UI must feel responsive as the agent types in the terminal.
Agent Orchestration Uptime	99.9%	Agents running long tasks must not be interrupted by control plane outages.
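To make the error budget concrete, the uptime target can be converted into allowed downtime. The sketch below assumes a 30-day rolling window, which the table does not specify; the function names are hypothetical.

```python
def error_budget_minutes(availability_slo: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed in the window for a given availability target."""
    return (1.0 - availability_slo) * window_days * 24 * 60


def budget_remaining(availability_slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is blown)."""
    budget = error_budget_minutes(availability_slo, window_days)
    return (budget - downtime_minutes) / budget
```

Under that assumption, the 99.9% orchestration target works out to roughly 43 minutes of downtime per 30 days, so a single 20-minute control plane outage spends almost half of the budget.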

Incident Scenarios (2am reality)

Scenario	How you detect	Mitigation
Agent gets stuck in an infinite loop	High API cost alerts; agent executes the exact same command 10 times in a row.	Implement orchestrator-level loop detection. If a repeated failure pattern is detected, pause the agent and ask the human user for intervention.
Malicious user attempts to mine Bitcoin	CPU usage inside the sandbox hits 100% for extended periods; network egress connects to known mining pools.	Strict CPU quotas via cgroups. Network egress filtering via eBPF or NAT gateways to block mining pool IPs.
Context window limit exceeded	LLM API returns 400 Bad Request due to token limits.	Automatically trigger a summarization step to compress the history, drop the oldest terminal logs, and retry.
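The orchestrator-level loop detection from the first scenario can be sketched as a sliding window over recent commands. The threshold of 10 comes from the detection column above; the class name and the whitespace normalization are assumptions.

```python
from collections import deque


class LoopDetector:
    """Flags when the agent has issued the same command N times in a row."""

    def __init__(self, max_repeats: int = 10) -> None:
        self.max_repeats = max_repeats
        self._recent: deque[str] = deque(maxlen=max_repeats)

    def record(self, command: str) -> bool:
        """Record a command; return True if the agent should be paused for human input."""
        # Normalize whitespace so "npm  test" and "npm test" count as the same command.
        self._recent.append(" ".join(command.split()))
        return len(self._recent) == self.max_repeats and len(set(self._recent)) == 1
```

The orchestrator would call `record` before executing each command and, on `True`, pause the agent and surface the repeated command to the user instead of running it again.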

Cost Drivers (Staff lens)

LLM API Token costs (Agents burn through millions of tokens rapidly)
Compute infrastructure for running isolated sandboxes
State storage for lengthy event histories

Multi-Region & DR

Agent execution is easily sharded. Route users to the regional cluster where their codebase/sandbox resides. Since agents are independent state machines, cross-region replication is not strictly necessary for active execution.
