Interview Prompt
Design an Autonomous Coding Agent (like Devin or SWE-agent).
Clarifying Questions (ask before designing)
| Question | Why it matters |
|---|---|
| What actions can the agent take? | Read/write files, run shell commands, browse the web? Each capability expands the attack surface and orchestration complexity. |
| How long can a single task run? | Long-running tasks (e.g., days) require robust state persistence so the agent can be suspended and resumed. |
| Are we building a single-user tool or a multi-tenant SaaS? | Multi-tenant SaaS requires strict network and compute isolation between agent environments. |
Scope
In scope
- Agent orchestration loop (ReAct pattern)
- Secure, stateful execution environments (Sandboxing)
- Context and memory management for the LLM
- Real-time streaming of agent actions to the frontend
Out of scope (state explicitly)
- Training the foundational coding LLM
- IDE UI implementation details
Assumptions
- We rely on a powerful LLM (like GPT-4 or Claude 3.5 Sonnet) for reasoning
- The agent requires access to the internet to read docs and fetch packages
- Code execution must be completely isolated from the host system
These foundational concepts underpin the patterns used in this problem. Review them before deep-diving into component-level trade-offs.
- IDE Integration: The agent operates directly within the user's local IDE (e.g., VSCode extension) or local terminal.
- Codebase Context: The agent can automatically find and index relevant local files across a massive repository without manual user uploading.
- Fast Apply / Inline Editing: The agent generates code and instantly applies the diffs directly to the user's open files for review.
- Human-in-the-Loop (HitL): The user actively guides the agent, reviews diffs, and can seamlessly take over typing at any time.
- Low Latency (Fast Apply): Applying a 500-line diff must feel instantaneous (< 100ms) so as not to break the developer's flow.
- Context Window Efficiency: Intelligently selecting which files to send to the LLM to avoid blowing out context limits or racking up massive API bills.
- Local Execution Safety: Executing compilation or linting checks in a safe, non-destructive background process (Shadow Workspace).
Unlike purely autonomous cloud agents that run in isolated Firecracker VMs (like Devin), Copilots like Cursor and Claude Code run as local binaries or IDE extensions. They have direct access to the user's filesystem, Language Server (LSP), and terminal, acting as a highly integrated pairing partner.
1. Human-in-the-Loop (HitL) Copilot Flow
Instead of a pure autonomous ReAct loop, modern IDE agents use a guided flow. They use speculative execution to test code in the background before presenting it to the user.
The Orchestrator runs locally. It parses the user's request, uses local tools (LSP, ripgrep) to gather context, and sends a highly optimized prompt to the cloud LLM.
System: You are an AI Coding Assistant. You have access to the following local tools:
1. read_file(path, start_line, end_line)
2. search_codebase(query) // Uses local Vector DB
3. get_lsp_references(symbol) // Uses local Language Server
4. run_terminal_command(cmd) // Requires human approval

User Request: Fix the CORS bug in the API Gateway.
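The approval requirement on run_terminal_command can be enforced in the local orchestrator rather than left to the prompt. The following is a minimal sketch, not the implementation of any particular product: `Tool`, `ApprovalDenied`, `dispatch`, and the stub tool bodies are hypothetical names introduced for illustration, and only two of the four tools are stubbed.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Tool:
    fn: Callable[..., Any]
    needs_approval: bool = False  # gate destructive tools behind the user


class ApprovalDenied(Exception):
    """Raised when the user rejects a gated tool call."""


def dispatch(
    tools: Dict[str, Tool],
    name: str,
    args: Dict[str, Any],
    approve: Callable[[str, Dict[str, Any]], bool],
) -> Any:
    """Run a tool requested by the LLM, asking the user first when required."""
    tool = tools.get(name)
    if tool is None:
        raise KeyError(f"unknown tool: {name}")
    if tool.needs_approval and not approve(name, args):
        raise ApprovalDenied(name)
    return tool.fn(**args)


# Hypothetical stubs; a real extension would call the vector index and a shell.
TOOLS = {
    "search_codebase": Tool(fn=lambda query: [f"stub hit for {query!r}"]),
    "run_terminal_command": Tool(
        fn=lambda cmd: f"(would run) {cmd}", needs_approval=True
    ),
}
```

In an IDE the `approve` callback would typically open a confirmation dialog; passing it in as a parameter keeps the gate testable without a UI.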
2. Local Codebase Context Indexing ⭐
An LLM cannot fit a 10,000-file repository into its context window, so the agent must build a fast local index of the codebase.
- Local Vector DB: Upon opening a project, the extension uses a fast local embedding model (or a batch cloud API) to vectorize all files and store them in a local SQLite/LanceDB instance.
- BM25 + Vector Hybrid Search: When a user asks "Fix the auth bug", the agent performs a hybrid search locally to find the top 5 most relevant files to inject into the LLM context window.
- LSP Integration: Before sending, the agent queries the local tsserver (Language Server) to find exact type definitions and interface signatures used in those 5 files, drastically reducing LLM hallucinations.
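One common way to combine a BM25 ranking with a vector ranking is reciprocal rank fusion (RRF). The source does not say which fusion method is used, so RRF is an assumption here, as are the function name, the constant k = 60, and the example file paths.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60, top_n: int = 5
) -> List[str]:
    """Fuse several ranked lists of file paths into one list of at most top_n."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, path in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); higher ranks add more.
            scores[path] += 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by path for determinism.
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_n]


# Hypothetical results for the query "Fix the auth bug".
bm25_hits = ["src/auth/login.ts", "src/auth/token.ts", "README.md"]
vector_hits = ["src/auth/token.ts", "src/middleware/session.ts", "src/auth/login.ts"]
top_files = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

RRF only needs ranks, not raw scores, which sidesteps the problem that BM25 scores and cosine similarities live on different scales.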
3. Shadow Workspaces & Predictive Compilation
How does the agent know if its generated code actually works before showing it to you?
- The agent maintains a Shadow Workspace: a hidden, synchronized copy of the project in a temporary directory.
- As the LLM streams the code, the agent applies it to the Shadow Workspace and runs the local compiler/linter (e.g., tsc --noEmit) in the background.
- If the compiler throws an error, the agent intercepts it, auto-prompts the LLM with the error, and fixes it before ever rendering the final diff to the user's actual editor.
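The check-and-repair cycle described above can be sketched as a bounded loop. This is an illustration under assumptions: `auto_fix`, `FixResult`, and the `check` and `fix` callables are hypothetical. In practice `check` would run the compiler against the shadow directory and `fix` would call the LLM; both are passed in so the loop can be exercised without either.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FixResult:
    code: str
    error: Optional[str]  # None means the last check passed
    retries: int          # number of fix attempts made


def auto_fix(
    code: str,
    check: Callable[[str], Optional[str]],
    fix: Callable[[str, str], str],
    max_retries: int = 3,
) -> FixResult:
    """Re-prompt for a fix until the check passes or retries run out."""
    error = check(code)
    retries = 0
    while error is not None and retries < max_retries:
        code = fix(code, error)  # e.g. send code plus compiler error to the LLM
        retries += 1
        error = check(code)
    return FixResult(code=code, error=error, retries=retries)
```

When `error` is still set on return, the caller can show the code to the user with the compiler message attached instead of looping further.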
4. Fast Apply (Diff Application Algorithm)
LLMs output text sequentially. If an LLM needs to change 1 line in a 1,000-line file, making it rewrite the entire 1,000 lines is slow and expensive.
- Unified Diff Format: The LLM is prompted to output standard Git-style diffs or specific SEARCH/REPLACE blocks.
- Fuzzy Matching: Because LLMs often make slight indentation mistakes in their SEARCH blocks, the local IDE extension uses algorithms (like Myers Diff or Hunt-Szymanski) with fuzzy heuristic matching to cleanly merge the LLM's intent into the actual local file.
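A simple form of the fuzzy matching described above is to compare lines with leading and trailing whitespace stripped, then re-indent the replacement to match the file. This sketch is deliberately simpler than a Myers-style diff and is not taken from any specific tool; `apply_search_replace` is a hypothetical name. It returns None when the SEARCH block is missing or ambiguous, so the caller can halt and ask the user.

```python
from typing import Optional


def _indent(line: str) -> str:
    """Return the leading whitespace of a line."""
    return line[: len(line) - len(line.lstrip())]


def apply_search_replace(source: str, search: str, replace: str) -> Optional[str]:
    """Apply one SEARCH/REPLACE block, ignoring indentation differences."""
    src = source.splitlines()
    pattern = [line.strip() for line in search.strip("\n").splitlines()]
    if not any(pattern):
        return None  # an empty SEARCH block cannot be located
    n = len(pattern)
    matches = [
        i
        for i in range(len(src) - n + 1)
        if [line.strip() for line in src[i : i + n]] == pattern
    ]
    if len(matches) != 1:
        return None  # not found, or found more than once: refuse to guess
    start = matches[0]
    rep_lines = replace.strip("\n").splitlines()
    file_base = _indent(src[start])
    rep_base = _indent(rep_lines[0]) if rep_lines else ""
    out = []
    for line in rep_lines:
        if not line.strip():
            out.append("")
            continue
        # Keep indentation relative to the first replacement line.
        body = line[len(rep_base):] if line.startswith(rep_base) else line.lstrip()
        out.append(file_base + body)
    result = "\n".join(src[:start] + out + src[start + n :])
    return result + "\n" if source.endswith("\n") else result
```

Refusing ambiguous matches trades some convenience for safety: a SEARCH block that appears twice in the file is never applied to a guessed location.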
Agent System Prompt
The local orchestrator injects available tools into the system prompt before calling the cloud LLM.
System: You are an AI Coding Assistant. You have access to the following local tools:
1. read_file(path, start_line, end_line)
2. search_codebase(query) // Uses local Vector DB
3. get_lsp_references(symbol) // Uses local Language Server
4. run_terminal_command(cmd) // Requires human approval

User Request: Fix the CORS bug in the API Gateway.
Fast Apply Diff Streaming
The LLM streams back a SEARCH/REPLACE block. The local extension applies this live to the editor.
// The agent streams this SEARCH/REPLACE block back (note that it is not a unified diff).
// The local Fast Apply algorithm fuzzily matches the SEARCH block and replaces it.
<<<<
// src/gateway.ts
app.use(cors({ origin: 'localhost' }));
====
// src/gateway.ts
app.use(cors({ origin: process.env.ALLOWED_ORIGINS || '*' }));
>>>>

The agent must maintain a local vector database to quickly retrieve context without sending the entire repository to the cloud.
-- Local SQLite / Vector DB Schema
CREATE TABLE file_chunks (
id TEXT PRIMARY KEY,
file_path TEXT NOT NULL,
chunk_index INTEGER,
content TEXT NOT NULL,
embedding BLOB, -- 1536-dimensional float array
last_modified TIMESTAMP
);
-- When a file is saved, the IDE extension hashes it.
-- If the hash changed, it asynchronously re-chunks the file's AST and updates this local DB.

| Failure Case | System Solution Design |
|---|---|
| LLM Hallucinates a Diff Block | The Fast Apply fuzzy matcher fails to find the target code in the local file. The agent immediately halts application and asks the user to manually apply or retry. |
| Infinite Auto-Fix Loop | If the Shadow Workspace compiler keeps failing, the agent enforces a strict 'Max Retries' limit (e.g., 3 tries) before giving up and showing the broken code to the user with the compiler error attached. |
| Context Window Overflow | The local orchestrator dynamically tracks token counts using a tokenizer (like tiktoken). If the gathered context exceeds the LLM limit, it aggressively prunes the least-relevant files or uses a Tree-Sitter AST to drop function bodies and keep only signatures. |
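The context-overflow row can be illustrated with a greedy budget fitter: take files in relevance order, and when a full file does not fit, fall back to its signature-only view before dropping it. Everything here is a sketch under assumptions. `ContextFile`, `fit_context`, and `estimate_tokens` are hypothetical, the four-characters-per-token estimate stands in for a real tokenizer, and the `signatures` field is assumed to be precomputed (for example by an AST pass).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ContextFile:
    path: str
    content: str      # full file text
    signatures: str   # signature-only view, assumed precomputed elsewhere
    relevance: float  # higher means more relevant to the request


def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); swap in a real tokenizer.
    return max(1, len(text) // 4)


def fit_context(
    files: List[ContextFile],
    budget: int,
    count: Callable[[str], int] = estimate_tokens,
) -> List[Tuple[str, str, str]]:
    """Return (path, kind, text) entries that fit within the token budget."""
    chosen: List[Tuple[str, str, str]] = []
    remaining = budget
    for f in sorted(files, key=lambda f: f.relevance, reverse=True):
        # Prefer the full file; degrade to signatures; otherwise drop it.
        for kind, text in (("full", f.content), ("signatures", f.signatures)):
            cost = count(text)
            if cost <= remaining:
                chosen.append((f.path, kind, text))
                remaining -= cost
                break
    return chosen
```

Because the most relevant files are considered first, the least relevant ones are the first to be degraded or dropped when the budget runs out.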
Local Execution vs Cloud Sandbox
Fully autonomous cloud agents (Devin) run in secure Firecracker VMs. This allows them to run destructive bash commands safely without wrecking the user's machine, but limits their access to the user's uncommitted local state, active dev servers, and private VPNs. Cursor/Claude Code run locally: they have perfect access to the user's exact environment, but running arbitrary LLM-generated bash scripts locally carries a massive security risk, necessitating strict "Human-in-the-Loop" approval for terminal commands.
Diff Output vs Full File Output
Asking the LLM to output full files avoids diff-matching bugs but greatly increases generation latency and output-token cost, because every unchanged line must be re-emitted. Asking the LLM to output SEARCH/REPLACE blocks is fast and cheap, but fragile if the LLM hallucinates the SEARCH block or gets its indentation wrong.
Staff interviews expect you to articulate how the system evolves under real growth — not jump straight to the final architecture.
Phase 1: Local CLI Tool
A Python script running locally on the user's machine, executing commands directly in their shell.
Key components: OpenAI API · Subprocess execution · Local file system
Move to next phase when: The product must move to the cloud for SaaS monetization, or local execution becomes too dangerous.
Phase 2: Cloud Sandbox SaaS
Agents run in isolated Docker containers in the cloud. Context is managed in memory.
Key components: Docker containers · Node.js Orchestrator · WebSockets for UI
Move to next phase when: Docker isolation is not strong enough for untrusted code, and in-memory context hits its limits quickly.
Phase 3: Persistent MicroVMs & Advanced Memory
Hardware-isolated MicroVMs that can be paused/resumed. Vector databases for semantic codebase search.
Key components: Firecracker MicroVMs · Event-sourced State DB · Vector DB (Pinecone/Milvus)
Move to next phase when: Long-running, multi-day autonomous tasks must run safely.
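To make the event-sourced state idea concrete: if every agent action is appended to a log, a paused agent can be resumed by replaying that log into a fresh state object. The event kinds, payload fields, and names below (`AgentState`, `apply_event`, `replay`) are hypothetical, and a real system would persist the log in a database rather than an in-memory list.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Event = Tuple[str, Dict[str, str]]  # (kind, payload)


@dataclass
class AgentState:
    files_written: Dict[str, str] = field(default_factory=dict)
    commands_run: List[str] = field(default_factory=list)


def apply_event(state: AgentState, event: Event) -> None:
    """Fold a single event into the state."""
    kind, payload = event
    if kind == "file_written":
        state.files_written[payload["path"]] = payload["content"]
    elif kind == "command_run":
        state.commands_run.append(payload["cmd"])
    else:
        raise ValueError(f"unknown event kind: {kind}")


def replay(events: List[Event]) -> AgentState:
    """Rebuild the agent state from scratch, e.g. after a resume."""
    state = AgentState()
    for event in events:
        apply_event(state, event)
    return state
```

Replaying the log restores orchestrator state only; the sandbox filesystem itself would still come from a MicroVM snapshot.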
SLOs & Error Budgets
| Metric | Target | Rationale |
|---|---|---|
| Sandbox Startup Time | 99% < 2s | Users expect immediate agent initialization. |
| Action to UI Latency | 99% < 500ms | The UI must feel responsive as the agent types in the terminal. |
| Agent Orchestration Uptime | 99.9% | Agents running long tasks must not be interrupted by control plane outages. |
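The 99.9% uptime target implies a concrete error budget. The arithmetic below assumes a 30-day window, which the table does not specify, and `error_budget_minutes` is a hypothetical helper name.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for an availability SLO over a window."""
    if not 0.0 <= slo <= 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


# A 99.9% target over 30 days allows roughly 43 minutes of downtime.
budget = error_budget_minutes(0.999)
```

A single control-plane outage of under an hour would therefore consume the whole monthly budget.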
Incident Scenarios (2am reality)
| Scenario | How you detect | Mitigation |
|---|---|---|
| Agent gets stuck in an infinite loop | High API cost alerts; agent executes the exact same command 10 times in a row. | Implement orchestrator-level loop detection. If a repeated failure pattern is detected, pause the agent and ask the human user for intervention. |
| Malicious user attempts to mine Bitcoin | CPU usage inside the sandbox hits 100% for extended periods; network egress connects to known mining pools. | Strict CPU quotas via cgroups. Network egress filtering via eBPF or NAT gateways to block mining pool IPs. |
| Context window limit exceeded | LLM API returns 400 Bad Request due to token limits. | Automatically trigger a summarization step to compress the history, drop the oldest terminal logs, and retry. |
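The loop-detection mitigation in the first row can be sketched as a streak counter over normalized commands. `LoopDetector` is a hypothetical name, and the default threshold of 10 mirrors the detection signal in the table rather than any measured value. This version only catches exact repeats; alternating patterns (A, B, A, B) would need a window-based check.

```python
from typing import Optional


class LoopDetector:
    """Flags an agent that issues the same command many times in a row."""

    def __init__(self, threshold: int = 10) -> None:
        self.threshold = threshold
        self._last: Optional[str] = None
        self._streak = 0

    def observe(self, command: str) -> bool:
        """Record a command; return True when the agent should be paused."""
        normalized = " ".join(command.split())  # collapse whitespace runs
        if normalized == self._last:
            self._streak += 1
        else:
            self._last = normalized
            self._streak = 1
        return self._streak >= self.threshold
```

The orchestrator would call `observe` before each tool execution and hand control back to the user when it returns True.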
Cost Drivers (Staff lens)
- LLM API Token costs (Agents burn through millions of tokens rapidly)
- Compute infrastructure for running isolated sandboxes
- State storage for lengthy event histories
Multi-Region & DR
Agent execution is easily sharded. Route users to the regional cluster where their codebase/sandbox resides. Since agents are independent state machines, cross-region replication is not strictly necessary for active execution.