Design an AI Code Assistant¶

What We're Building¶

An AI-powered code assistant — like Google's Gemini Code Assist, GitHub Copilot, or Cursor — that provides real-time code completions in the IDE, answers coding questions in chat, and understands the full repository context.

Two primary surfaces:

Surface	Latency	Input	Output
Inline completion	< 200ms	Cursor position + surrounding code	1-5 line suggestion
Chat	< 3s TTFT	Natural language question + code context	Multi-paragraph explanation + code

Why This Problem Is Uniquely Hard¶

Challenge	Description
Latency	Completions must appear faster than the developer can type (~200ms)
Context	A single file is insufficient; need repository-level understanding
Accuracy	Suggesting wrong code is worse than no suggestion (breaks flow)
Diversity	50+ programming languages, each with different syntax and idioms
Privacy	Enterprise code is proprietary; cannot be sent to third parties
Evaluation	"Good code suggestion" is highly subjective

Real-World Scale¶

Metric	Scale
Active developers	1M+
Completion requests/day	500M+ (every keystroke can trigger)
Chat queries/day	10M+
Median file size	200 lines
Median repo size	5,000 files
Languages	50+

Key Concepts Primer¶

Fill-in-the-Middle (FIM)¶

Standard LLMs predict the next token. Code completions often need to fill in the middle — the cursor is between existing code.

# Standard left-to-right:
def calculate_area(    →  "radius"

# Fill-in-the-Middle (FIM):
PREFIX: def calculate_area(radius):
MIDDLE: <cursor>         →  "return math.pi * radius ** 2"  
SUFFIX:
    print(f"Area: {area}")

FIM training reformats data as: <PREFIX>code_before<SUFFIX>code_after<MIDDLE>code_to_fill

Repository-Level Context¶

A single file is rarely enough. The model needs:

Current file context:
├── Lines above cursor (prefix)
├── Lines below cursor (suffix)
└── Imports and type signatures in current file

Cross-file context:
├── Recently opened/edited files
├── Files imported by current file
├── Type definitions referenced
├── Test files for current module
└── Similar code patterns in repo

Speculative Decoding for Completions¶

For inline completions, latency is paramount. Speculative decoding uses a tiny draft model:

Draft model (150M params):  generates 8 candidate tokens in ~15ms
Target model (7B params):   verifies all 8 in ~40ms  
Total: ~55ms for 5-6 accepted tokens

Without speculative: 7B model generates 5 tokens = 5 × 30ms = 150ms
Speedup: ~2.7x

Abstract Syntax Tree (AST) Awareness¶

Tree-sitter parses code into ASTs in real-time. This enables:

# AST-guided context selection
def get_relevant_context(cursor_position, ast):
    current_function = ast.find_enclosing_function(cursor_position)
    current_class = ast.find_enclosing_class(cursor_position)
    imports = ast.get_imports()

    referenced_symbols = ast.get_referenced_symbols(current_function)
    definitions = [ast.find_definition(sym) for sym in referenced_symbols]

    return Context(
        enclosing_function=current_function,
        enclosing_class=current_class,
        imports=imports,
        referenced_definitions=definitions,
    )

Step 1: Requirements Clarification¶

Questions to Ask¶

Question	Why It Matters
Inline completions, chat, or both?	Very different latency/context requirements
Cloud-hosted or on-prem option?	Enterprise privacy constraints
Which IDEs?	VS Code, JetBrains, Vim, Emacs — different APIs
Repository indexing?	Pre-index repos or real-time context only?
Code execution?	Can the assistant run tests, check types?

Functional Requirements¶

Requirement	Priority	Description
Inline code completion	Must have	Real-time suggestions as developer types
Code chat	Must have	Natural language Q&A about code
Multi-language support	Must have	Python, Java, Go, TypeScript, C++ at minimum
Repository context	Should have	Understand cross-file references
Code actions	Nice to have	Refactor, generate tests, explain code
Code search	Nice to have	Semantic search over codebase

Non-Functional Requirements¶

Requirement	Target	Rationale
Completion latency	< 200ms P95	Must not slow down typing flow
Chat TTFT	< 500ms P95	Responsive conversational experience
Completion acceptance rate	> 30%	Below this, suggestions are noise
Availability	99.9%	Developer productivity tool
Privacy	No code stored server-side	Enterprise requirement
Throughput	10K completion req/s	Peak developer hours

API Design¶

# POST /v1/completions (inline completion)
{
    "file_path": "src/main/java/com/example/UserService.java",
    "language": "java",
    "prefix": "public User findById(String id) {\n    ",
    "suffix": "\n    return user;\n}",
    "cursor_offset": 45,
    "max_tokens": 64,
    "context": {
        "open_files": [
            {"path": "src/.../User.java", "content": "public class User {...}"},
            {"path": "src/.../UserRepository.java", "content": "..."}
        ],
        "repo_snippets": [
            {"path": "src/.../ProductService.java", "snippet": "...findById pattern..."}
        ]
    }
}

# Response (< 200ms)
{
    "completions": [
        {
            "text": "User user = userRepository.findById(id)\n        .orElseThrow(() -> new UserNotFoundException(id));",
            "confidence": 0.87,
            "stop_reason": "natural_stop"
        }
    ],
    "telemetry_id": "tel-abc123"
}

# POST /v1/completions/feedback (async telemetry)
{
    "telemetry_id": "tel-abc123",
    "action": "accepted",        # accepted, rejected, partially_accepted
    "accepted_chars": 89,
    "total_chars": 95,
    "time_to_accept_ms": 1200
}

Technology Selection & Tradeoffs¶

A production AI code assistant requires choosing across completion model + code parsing/indexing + context retrieval + IDE integration layer. Each directly affects latency, accuracy, and developer experience.

Completion model¶

Option	Strengths	Weaknesses	When to choose
StarCoder2 (3B/7B/15B)	Open-weight; strong fill-in-the-middle (FIM) support; trained on permissive licenses	Smaller context than proprietary models; weaker on rare languages	Default for inline completions; on-prem deployment needed
CodeLlama (7B/13B/34B)	Excellent infilling via FIM training; long context (100K)	Meta license restrictions at scale; 34B too slow for inline	Long-context understanding; licensing acceptable
GPT-4o / Claude (API)	Highest reasoning quality; broad language coverage	Latency too high for inline (300ms+); cost; privacy concerns	Chat surface only; enterprises accepting cloud inference
DeepSeek-Coder-V2	State-of-the-art benchmarks; Mixture of Experts (MoE) reduces active parameters	Large download; MoE inference needs careful memory management	Benchmark accuracy over operational simplicity

Code parsing & indexing¶

Option	Strengths	Weaknesses	When to choose
Tree-sitter	Incremental parsing; 50+ grammars; sub-ms updates; battle-tested in editors	Grammar quality varies; no semantic type resolution	Default for real-time AST extraction and symbol-level chunking
Language Server Protocol (LSP)	Full semantic understanding (types, go-to-definition, references)	Heavy per-language setup; latency too high for inline path	Precise type-aware cross-file resolution beyond syntax
SCIP / LSIF index	Pre-computed precise code intelligence; language-agnostic format	Requires CI to generate; stale between builds	Enterprise repos with CI pipelines where pre-indexed precision is justified

Context retrieval¶

Option	Strengths	Weaknesses	When to choose
Code-specific embedding (StarEncoder, Jina-Code)	Trained on code; understands syntax and semantics	Smaller community; fewer pre-trained options	Default for code similarity search in snippet retriever
BM25 / sparse retrieval	No GPU required; fast; excellent for exact identifier matching	Misses semantic similarity	Hybrid alongside dense embeddings for keyword precision
General text embedding (text-embedding-3-small)	Widely available; good zero-shot on code	Not optimized for code structure	Quick bootstrap before code-specific models

Our choice: StarCoder2-7B for inline completions (fits on a single A100 with room for KV-cache, sub-200ms latency) and a 70B chat model for the chat surface (tolerates higher latency, stronger reasoning). Tree-sitter for real-time parsing supplemented by SCIP/LSIF for enterprise repos. Hybrid retrieval (code-specific embeddings + BM25) for context, keeping combined latency under 20ms. Dedicated VS Code and JetBrains plugins using native completion APIs for proper ghost-text rendering.

Tip

Interview angle: The completion model choice is inseparable from the latency budget. A 70B model cannot serve inline completions at 200ms without aggressive speculative decoding or distillation. Showing awareness of this coupling signals real-world experience.

Step 2: Back-of-Envelope Estimation¶

Traffic¶

Active developers:            500K
Completion triggers/dev/day:  1,000 (not every keystroke; debounced)
Total completion req/day:     500M
Completion QPS (average):     500M / 86,400 ≈ 5,800
Completion QPS (peak, 3x):    ~17,000

Chat queries/dev/day:         20
Total chat req/day:           10M
Chat QPS (peak):              ~350

Compute (Completion Model)¶

Model: 7B parameters (code-specialized, int8 quantized = 7 GB)
GPU: A100 40GB (fits model + KV-cache for batch)

Input tokens (context):      500 (prefix + suffix + cross-file)
Output tokens:               30 (typical completion)
Prefill: 500 tokens × 0.1ms  = 50ms
Decode: 30 tokens × 5ms      = 150ms
Total per request:            200ms

With continuous batching (batch=16):
  Effective throughput:       ~60 req/s per GPU

GPUs for 17K peak QPS:       17,000 / 60 ≈ 284 A100 GPUs

With speculative decoding (draft=150M, target=7B):
  Effective throughput:       ~120 req/s per GPU
  GPUs needed:                17,000 / 120 ≈ 142 A100 GPUs

Compute (Chat Model)¶

Model: 70B parameters (general code reasoning)
Input tokens:                 2,000 (code context + conversation)
Output tokens:                500
Time per request:             ~12s

With continuous batching:     ~30 req/s per 2-GPU replica
Replicas for 350 peak QPS:   350 / 30 ≈ 12 replicas = 24 A100 GPUs

Context Index Storage¶

Per-repository index:
  5,000 files × 200 lines × 50 chars = ~50 MB source
  Embeddings: 5,000 chunks × 768 × 4 bytes = ~15 MB
  AST metadata: ~5 MB
  Total: ~70 MB per repo

10K enterprise repos indexed: 700 GB total

Step 3: High-Level Design¶

flowchart TB
    subgraph ide["IDE Plugin Layer"]
        VSCode[VS Code<br/>Extension]
        JetBrains[JetBrains<br/>Plugin]
        Vim[Vim/Neovim<br/>Plugin]
    end

    subgraph gw["API Gateway"]
        LB[Load Balancer]
        Auth["Auth + Rate Limit"]
        Router[Model Router]
    end

    subgraph ctx["Context Engine"]
        LocalCtx[Local Context<br/>Assembler]
        RepoIndex[Repository<br/>Index Service]
        SnippetRetriever[Snippet<br/>Retriever]
    end

    subgraph srv["Model Serving"]
        subgraph ccomp["Completion Cluster"]
            Draft[Draft Model<br/>150M]
            Target[Target Model<br/>7B]
        end
        subgraph cchat["Chat Cluster"]
            ChatLLM[Chat Model<br/>70B]
        end
    end

    subgraph post["Post-Processing"]
        Filter[Code Filter<br/>Syntax Check]
        Dedup[Deduplication]
        License[License<br/>Scanner]
    end

    subgraph tel["Telemetry & Evaluation"]
        AcceptTracker[Acceptance<br/>Tracker]
        QualityEval[Quality<br/>Evaluation]
        ABTest[A/B Testing]
    end

    ide --> gw
    gw --> ctx
    ctx --> srv
    srv --> post
    post --> gw
    gw --> ide
    ide --> tel

Step 4: Deep Dive¶

4.1 Context Assembly Pipeline¶

The most differentiating component. Better context = better completions.

class ContextAssembler:
    """Assemble optimal context for code completion within token budget."""

    MAX_CONTEXT_TOKENS = 2048

    def assemble(self, request: CompletionRequest) -> AssembledContext:
        budget = TokenBudget(total=self.MAX_CONTEXT_TOKENS)

        prefix_tokens = self._get_prefix(request, budget.allocate(ratio=0.4))
        suffix_tokens = self._get_suffix(request, budget.allocate(ratio=0.2))

        current_file_symbols = self._extract_symbols(request.file_content)

        cross_file_context = self._get_cross_file_context(
            request, current_file_symbols, budget.remaining()
        )

        return AssembledContext(
            prefix=prefix_tokens,
            suffix=suffix_tokens,
            cross_file=cross_file_context,
        )

    def _get_cross_file_context(self, request: CompletionRequest,
                                 symbols: list[str], budget: int) -> list[Snippet]:
        """Priority-ordered cross-file context retrieval."""
        snippets = []

        imported = self._resolve_imports(request.file_content, request.language)
        for imp in imported:
            definition = self._find_definition(imp, request.repo)
            if definition:
                snippets.append(Snippet(source="import", priority=1, **definition))

        type_refs = self._find_type_references(request.file_content, request.cursor)
        for ref in type_refs:
            typedef = self._find_type_definition(ref, request.repo)
            if typedef:
                snippets.append(Snippet(source="type_ref", priority=2, **typedef))

        similar = self.repo_index.find_similar_code(
            request.prefix[-200:], request.language, top_k=5
        )
        snippets.extend(
            Snippet(source="similar", priority=3, **s) for s in similar
        )

        for f in request.recently_edited_files[:3]:
            snippets.append(Snippet(source="recent", priority=4, path=f.path, 
                                     content=f.content[:500]))

        snippets.sort(key=lambda s: s.priority)
        return self._fit_to_budget(snippets, budget)

4.2 Completion Engine with Speculative Decoding¶

class SpeculativeCompletionEngine:
    """Fast code completion using speculative decoding."""

    def __init__(self, draft_model, target_model, max_draft_tokens: int = 8):
        self.draft = draft_model        # 150M params, very fast
        self.target = target_model      # 7B params, high quality
        self.max_draft_tokens = max_draft_tokens

    def complete(self, context: AssembledContext, max_tokens: int = 64) -> Completion:
        prompt = self._format_fim_prompt(context)
        input_ids = self.tokenizer.encode(prompt)

        generated = []
        while len(generated) < max_tokens:
            draft_tokens = self.draft.generate(
                input_ids + generated, 
                max_new_tokens=self.max_draft_tokens,
                temperature=0.0,
            )

            target_logits = self.target.forward(
                input_ids + generated + draft_tokens
            )

            accepted = 0
            for i, draft_token in enumerate(draft_tokens):
                target_prob = target_logits[len(input_ids) + len(generated) + i]
                draft_prob = 1.0

                if self._should_accept(draft_token, target_prob, draft_prob):
                    accepted += 1
                else:
                    correction = self._sample_from(target_prob)
                    generated.append(correction)
                    break

            generated.extend(draft_tokens[:accepted])

            if self._is_natural_stop(generated, context):
                break

        return Completion(
            text=self.tokenizer.decode(generated),
            confidence=self._compute_confidence(generated, target_logits),
        )

    def _format_fim_prompt(self, context: AssembledContext) -> str:
        return (
            f"<|fim_prefix|>{context.cross_file_str}\n{context.prefix}"
            f"<|fim_suffix|>{context.suffix}"
            f"<|fim_middle|>"
        )

    def _is_natural_stop(self, tokens: list[int], context: AssembledContext) -> bool:
        text = self.tokenizer.decode(tokens)
        if "\n\n" in text:
            return True
        if text.rstrip().endswith(("}", ")", ";", ":")):
            return True
        return False

4.3 Repository Indexing Pipeline¶

flowchart LR
    subgraph trig["Triggers"]
        Clone[Repo Clone]
        Push[Git Push]
        FileChange[File Save<br/>in IDE]
    end

    subgraph pipe["Indexing Pipeline"]
        Parse[Parse Files<br/>tree-sitter]
        Extract["Extract Symbols<br/>Functions, Classes"]
        Embed[Generate<br/>Embeddings]
        BuildGraph[Build Dependency<br/>Graph]
    end

    subgraph istore["Index Store"]
        SymbolIndex[(Symbol Index)]
        EmbeddingIndex[(Embedding Index<br/>HNSW)]
        DepGraph[(Dependency Graph)]
    end

    trig --> pipe --> istore

class RepositoryIndexer:
    """Index a repository for code-aware context retrieval."""

    SUPPORTED_LANGUAGES = {
        ".py": "python", ".java": "java", ".go": "go",
        ".ts": "typescript", ".js": "javascript", ".cpp": "cpp",
        ".rs": "rust", ".rb": "ruby",
    }

    def index_repository(self, repo_path: str) -> RepoIndex:
        files = self._discover_files(repo_path)

        symbols = []
        chunks = []
        dependencies = {}

        for file_path in files:
            lang = self.SUPPORTED_LANGUAGES.get(file_path.suffix)
            if not lang:
                continue

            content = file_path.read_text()

            ast = self.parser.parse(content, language=lang)
            file_symbols = self._extract_symbols(ast)
            symbols.extend(file_symbols)

            file_chunks = self._chunk_by_symbol(content, file_symbols)
            chunks.extend(file_chunks)

            deps = self._extract_dependencies(ast, lang)
            dependencies[str(file_path)] = deps

        embeddings = self.embedding_model.encode_batch(
            [c.text for c in chunks], batch_size=64
        )

        return RepoIndex(
            symbol_index=self._build_symbol_index(symbols),
            embedding_index=self._build_hnsw_index(chunks, embeddings),
            dependency_graph=self._build_dep_graph(dependencies),
        )

    def _chunk_by_symbol(self, content: str, symbols: list[Symbol]) -> list[CodeChunk]:
        """Chunk code by function/class boundaries rather than fixed size."""
        chunks = []
        for sym in symbols:
            chunk_text = content[sym.start_byte:sym.end_byte]
            chunks.append(CodeChunk(
                text=chunk_text,
                symbol_name=sym.name,
                symbol_type=sym.type,
                file_path=sym.file_path,
                start_line=sym.start_line,
                end_line=sym.end_line,
            ))
        return chunks

4.4 Post-Processing Pipeline¶

class CompletionPostProcessor:
    """Filter and refine completions before showing to user."""

    def process(self, completion: Completion, context: AssembledContext) -> Completion | None:
        text = completion.text

        text = self._trim_to_natural_boundary(text, context.language)

        if not self._is_syntactically_valid(text, context):
            return None

        if self._is_duplicate_of_existing(text, context):
            return None

        if self.license_scanner.is_verbatim_copy(text, min_chars=150):
            return None

        if completion.confidence < self._get_threshold(context):
            return None

        return Completion(text=text, confidence=completion.confidence)

    def _trim_to_natural_boundary(self, text: str, language: str) -> str:
        """Trim completion at natural code boundary (end of statement/block)."""
        lines = text.split("\n")
        for i in range(len(lines) - 1, 0, -1):
            partial = "\n".join(lines[:i + 1])
            if self._ends_at_boundary(partial, language):
                return partial
        return lines[0]

    def _is_syntactically_valid(self, text: str, context: AssembledContext) -> bool:
        """Check that prefix + completion + suffix parses without errors."""
        full_code = context.prefix + text + context.suffix
        tree = self.parser.parse(full_code, language=context.language)
        return not tree.has_error

    def _get_threshold(self, context: AssembledContext) -> float:
        """Adaptive confidence threshold. Higher for visible code, lower for boilerplate."""
        if context.is_inside_string:
            return 0.5
        if context.is_inside_comment:
            return 0.4
        return 0.6

4.5 Telemetry and Evaluation¶

class CompletionTelemetry:
    """Track completion quality metrics from IDE telemetry."""

    def compute_metrics(self, events: list[TelemetryEvent]) -> CompletionMetrics:
        shown = [e for e in events if e.type == "shown"]
        accepted = [e for e in events if e.type == "accepted"]
        partially_accepted = [e for e in events if e.type == "partially_accepted"]

        acceptance_rate = len(accepted) / len(shown) if shown else 0

        total_accepted_chars = sum(e.accepted_chars for e in accepted + partially_accepted)
        total_typed_chars = sum(e.total_chars_typed for e in events if e.type == "keystroke")
        keystroke_savings = total_accepted_chars / max(1, total_typed_chars + total_accepted_chars)

        persistence_3min = sum(
            1 for e in accepted
            if not self._was_undone_within(e, seconds=180)
        ) / max(1, len(accepted))

        return CompletionMetrics(
            acceptance_rate=acceptance_rate,
            keystroke_savings=keystroke_savings,
            persistence_rate_3min=persistence_3min,
            median_latency_ms=self._median([e.latency_ms for e in shown]),
            shown_per_hour=len(shown) / self._total_hours(events),
        )

Key metrics:

Metric	What It Measures	Target
Acceptance rate	% of shown completions accepted	> 30%
Keystroke savings	% of characters from AI vs typed	> 25%
Persistence rate	% of accepted not undone in 3 min	> 90%
Latency P95	Time from trigger to suggestion shown	< 200ms
Shown rate	How often we show (vs suppress low-confidence)	40-60%

4.6 IDE Integration Architecture¶

sequenceDiagram
    participant Dev as Developer
    participant IDE as IDE
    participant Plugin as Extension
    participant Local as Local Context
    participant Server as Completion Server

    Dev->>IDE: Types code
    IDE->>Plugin: onChange event
    Plugin->>Plugin: Debounce (150ms)
    Plugin->>Local: Gather local context
    Local-->>Plugin: prefix, suffix, open files
    Plugin->>Server: POST /completions

    par Speculative Decode
        Server->>Server: Draft model → verify with target
    end

    Server-->>Plugin: Completion response
    Plugin->>Plugin: Post-filter, dedup
    Plugin->>IDE: Show inline ghost text

    alt Developer presses Tab
        Dev->>IDE: Accept (Tab)
        IDE->>Plugin: accepted event
        Plugin->>Server: POST /telemetry (accepted)
    else Developer keeps typing
        Dev->>IDE: Reject (keep typing)
        Plugin->>Server: POST /telemetry (rejected)
    end

Debouncing strategy:

Trigger	Debounce	Rationale
Character typed	150ms	Wait for typing pause
New line	0ms (immediate)	High-value moment
After `:`, `(`, `.`	50ms	Developer expects completion
Cursor repositioned	300ms	Exploratory; may keep moving

Step 5: Scaling & Production¶

Failure Handling¶

Failure	Detection	Recovery
Completion server down	Health check	Show nothing (graceful degradation)
High latency	P95 > 300ms	Reduce context window; skip cross-file
Repo index stale	File watcher gap	Fall back to open-file-only context
Model OOM	CUDA error	Reduce batch size; scale up
License match	Scanner flags	Suppress completion; log

Trade-offs¶

Decision	Option A	Option B	Recommendation
Completion model	1.5B (fast)	7B (smarter)	7B with speculative decode (quality + speed)
Context	Current file only	Cross-file repo	Cross-file (measurable acceptance rate lift)
Debounce	100ms	300ms	150ms (balance latency vs. wasted requests)
Show threshold	Low (show more)	High (show less, higher quality)	Tune per-language; start high
Repo indexing	On-demand	Pre-indexed	Pre-indexed for enterprise; on-demand for OSS

Interview Tips¶

Tip

What differentiates a strong answer: 1. Discussing FIM (fill-in-the-middle) rather than just left-to-right generation 2. Explaining why speculative decoding matters for latency 3. Cross-file context assembly — not just current file 4. Telemetry-driven evaluation (acceptance rate, persistence, keystroke savings) 5. License scanning and code attribution

Hypothetical Interview Transcript¶

Note

This transcript simulates a 45-minute Google L5/L6 system design round. The interviewer is a Staff Engineer on the Gemini Code Assist team.

Interviewer: Design an AI code assistant that provides inline completions in the IDE and has a chat interface for code questions. Think Copilot or Gemini Code Assist.

Candidate: Great. Let me clarify: are we designing both the inline completion and chat surfaces, or focusing on one? And is this for individual developers or enterprise (where code privacy matters)?

Interviewer: Both surfaces. Enterprise context — code cannot leave the customer's environment unless they opt in.

Candidate: That's an important constraint. Let me separate the two surfaces because they have fundamentally different requirements.

Inline completions need to be under 200ms — the developer is typing and any delay breaks flow. The model sees the cursor position, code above (prefix), code below (suffix), and ideally cross-file context. It generates a short suggestion, typically 1-5 lines.

Chat is more like the chatbot problem — higher latency tolerance (seconds), longer context (full files, error messages), and longer output.

For completions, I'd use a specialized code model around 7B parameters. It's small enough for fast inference but large enough for quality. Critically, this model must be trained with fill-in-the-middle — standard left-to-right models are bad at code completion because they can't see what comes after the cursor.

For chat, I'd use a larger model — 70B — that excels at reasoning and explanation.

Interviewer: Walk me through what happens when a developer types a character in VS Code.

Candidate: Sure. Let me trace the full path:

IDE event: The developer types a character. The VS Code extension receives an onDidChangeTextDocument event.
Debounce: We don't fire on every keystroke. We wait 150ms after the last keystroke. If the developer is still typing rapidly, we wait. But for high-value triggers — typing after (, ., :, or pressing Enter for a new line — we use a shorter debounce or trigger immediately.
Local context assembly: The extension gathers the prefix (code above cursor), suffix (code below cursor), and a snapshot of currently open files. This happens client-side and takes ~5ms.
Request to server: The extension sends a POST request with the prefix, suffix, file path, language, and open file snippets.
Server-side context enrichment: The server looks up the repository index (if pre-indexed) and retrieves relevant cross-file context — imported type definitions, similar code patterns in the repo, recently edited files. This takes ~20ms.
Inference with speculative decoding: We format the context as a FIM prompt: <prefix>cross_file_context + code_above<suffix>code_below<middle>. A small 150M draft model generates 8 candidate tokens quickly (~15ms). The 7B target model verifies them in a single forward pass (~40ms). With ~75% acceptance rate, we get about 6 tokens verified. We repeat until we hit a natural stopping point — end of statement, empty line, or closing brace.
Post-processing: We check that the completion is syntactically valid (parse prefix + completion + suffix with tree-sitter), not a verbatim copy of licensed code, and above our confidence threshold. If it fails any check, we suppress it entirely.
Response: The completion appears as ghost text in the IDE, typically within 150-200ms of the keystroke.
User action: If the developer presses Tab, the completion is accepted. If they keep typing, it's implicitly rejected. Either way, telemetry is sent back asynchronously.

Interviewer: You mentioned cross-file context. How do you decide which code from other files to include?

Candidate: This is the highest-leverage area for quality improvement. I use a priority-ordered retrieval:

Priority 1: Import definitions. Parse the current file's imports using tree-sitter. For each imported symbol, find its definition in the repo index. If the developer is calling userService.findById(...), including the UserService interface definition is extremely valuable.

Priority 2: Type references. If the current function signature references types like User, OrderRequest, find those class/struct definitions. This helps the model understand the shape of data.

Priority 3: Similar code patterns. Use embedding similarity to find functions in the repo that do similar things. If the developer is writing a new REST handler, finding existing handlers helps the model follow the project's conventions.

Priority 4: Recently edited files. The developer's recent edits are strong signals of relevance — they likely share context with the current task.

All of this is constrained by a token budget — typically 2048 tokens total, with ~40% for prefix, ~20% for suffix, and ~40% for cross-file context. We greedily fill by priority order.

Interviewer: How do you know if your completions are actually good? How do you measure quality?

Candidate: We can't rely on offline benchmarks alone — they don't capture real developer workflows. I'd build a multi-level evaluation:

Level 1: Acceptance rate. The most direct signal. Of all completions shown, what percentage did the developer accept (Tab/Enter)? Target: >30%. Below 25%, the suggestions are noise and should be suppressed more aggressively.

Level 2: Persistence rate. Of accepted completions, what percentage is still in the code 3 minutes later? If developers accept then immediately delete, the completion was wrong. Target: >90% persistence.

Level 3: Keystroke savings. What fraction of the developer's code was written by the AI? This measures actual productivity impact. Target: >25%.

Level 4: Edit distance. When a developer partially accepts and then modifies, how many edits were needed? Lower is better — it means the completion was close but not perfect.

All of these power our A/B testing. When we ship a new model or context strategy, we A/B test on 5% of users and compare these metrics. A new model must improve acceptance rate without degrading persistence rate.

Interviewer: What about the enterprise privacy requirement? How do you handle "code cannot leave the customer's environment"?

Candidate: Three deployment options, each with different tradeoffs:

Option 1: Customer-hosted. Deploy the model in the customer's VPC or on-prem Kubernetes cluster. They provide GPUs; we provide the container image. Full privacy, but higher operational burden for the customer and harder for us to update.

Option 2: Dedicated cloud tenancy. We run the model on dedicated GPU instances in a specific cloud region. Customer data is isolated — not shared with other tenants, not used for training. This is the most common enterprise model.

Option 3: Ephemeral processing. Code is sent to our shared infrastructure but never stored. We process the request, return the completion, and immediately discard all input data. We commit to never training on their code. This requires strong contractual guarantees and audit logging.

For the most security-conscious enterprises (finance, government), Option 1 or 2 is required. For most companies, Option 3 with data processing agreements is sufficient.

In all cases, the telemetry we collect is aggregated and anonymized — we track acceptance rate and latency, not the actual code content.

Interviewer: Great answer. One more — how would you handle the cold start problem when a developer opens a new repository for the first time?

Candidate: Good question. Without a pre-built repository index, we lose cross-file context, which hurts quality. Three approaches:

Immediate: Use only local context — the current file and any open tabs. This works surprisingly well for many completions, especially single-line suggestions.

Background indexing: On first open, trigger an async indexing job. For a 5,000-file repo, parsing with tree-sitter takes about 10 seconds. Embedding generation (5,000 chunks) takes about 30 seconds with batching. So within a minute, we have a usable index. We prioritize files near the cursor first — index the current directory, then fan out.

Incremental enrichment: As the developer opens and edits files, we learn which files are related. The first few minutes are lower quality, but it improves rapidly. By the second day, the index reflects the developer's actual working set, which is typically only 5-10% of the repo.

For enterprise, we'd pre-index repositories overnight using CI/CD hooks — every push triggers a re-index. This means when a developer opens a repo that's already indexed, the experience is instant.

Interviewer: Excellent depth. Let's wrap up.