Design an LLM-Powered Chatbot¶
What We're Building¶
A production-grade conversational AI chatbot — similar to Google Gemini, ChatGPT, or Claude — that serves millions of users with multi-turn dialogue, streaming responses, safety guardrails, and cost-efficient GPU serving.
This is not "wrap an API." We are designing the full system from scratch: model serving infrastructure, conversation management, safety pipeline, and the feedback loop that improves quality over time.
Why This Problem Is Hard¶
| Challenge | Description |
|---|---|
| GPU economics | A single A100 costs ~$3/hr; a 70B model needs 4 GPUs per replica |
| Latency | Users expect first-token in < 500ms; full response in seconds |
| Safety | One toxic response = headline news. Zero-tolerance for harm |
| Multi-turn context | Must track conversation history efficiently (KV-cache grows linearly) |
| Concurrency | Batching autoregressive generation is fundamentally different from REST API batching |
| Evaluation | "Good response" is subjective — no single metric captures quality |
Real-World Scale (Google Gemini-class)¶
| Metric | Scale |
|---|---|
| Daily active users | 100M+ |
| Queries per day | 500M+ |
| Peak QPS | ~20,000 |
| Avg tokens per response | 300–500 |
| Model size | 70B–540B parameters |
| GPU fleet | Thousands of TPUs/GPUs |
| Regions | 10+ global regions |
Key Concepts Primer¶
Autoregressive Generation¶
LLMs generate text one token at a time. Each token depends on all previous tokens.
Input: "What is the capital of France?"
Step 1: "The" (attend to input)
Step 2: "capital" (attend to input + "The")
Step 3: "of" (attend to input + "The capital")
Step 4: "France" ...
Step 5: "is" ...
Step 6: "Paris" ...
Step 7: "." ...
Step 8: <EOS> (stop)
Implication: You cannot parallelize token generation for a single request. Throughput comes from batching multiple requests.
KV-Cache¶
During generation, the model computes Key and Value tensors for every attention layer at every position. Re-computing these for each new token is wasteful.
Without KV-cache: O(n²) compute per token (re-process entire sequence)
With KV-cache: O(n) compute per token (only compute new token's attention)
Memory cost:
KV-cache per token = 2 × num_layers × hidden_dim × 2 bytes (fp16)
Llama-70B: 80 layers × 8192 hidden × 2 × 2 = ~2.5 MB per token
At 4096 context: ~10 GB KV-cache per request
Warning
KV-cache memory is the primary bottleneck for LLM serving, not compute. This is why techniques like PagedAttention, KV-cache compression, and shorter contexts matter enormously.
PagedAttention (vLLM)¶
Traditional KV-cache allocates a contiguous block for max sequence length, wasting memory for shorter sequences. PagedAttention borrows from OS virtual memory:
Traditional: [████████████░░░░░░░░] ← allocated for max_len, mostly wasted
PagedAttention: [██][██][██] ← allocate pages on demand, non-contiguous
Benefits:
- 2-4x more concurrent requests per GPU
- Near-zero memory waste
- Efficient memory sharing for beam search / parallel sampling
Continuous Batching¶
Traditional batching waits for a batch to fill, then processes all requests together. But LLM requests have variable lengths — some finish in 10 tokens, others in 2000.
Static batching:
Request A: [████████████████████████████████] 1000 tokens
Request B: [██████████]░░░░░░░░░░░░░░░░░░░░░ 200 tokens (GPU idle after)
Request C: [████████████████]░░░░░░░░░░░░░░░ 400 tokens (GPU idle after)
Continuous batching:
Request A: [████████████████████████████████]
Request B: [██████████]
Request C: [████████████████]
Request D: [████████████████████████] ← slots in when B finishes
Request E: [████████████████] ← slots in when C finishes
GPU utilization: ~95% vs ~40%
Speculative Decoding¶
Use a small "draft" model to generate candidate tokens, then verify them in parallel with the large model:
Draft model (1B): generates 5 tokens fast → [t1, t2, t3, t4, t5]
Target model (70B): verifies all 5 in ONE forward pass
Accept: [t1, t2, t3] ✓ (3 tokens verified in 1 step instead of 3)
Reject: t4 ✗ → resample from target model distribution
Speedup: 2-3x with ~95% acceptance rate on typical text
Step 1: Requirements Clarification¶
Questions to Ask¶
| Question | Why It Matters |
|---|---|
| What modalities? | Text-only vs multi-modal (images, code, files) |
| Max conversation length? | Determines KV-cache and context window needs |
| Latency budget? | Time-to-first-token vs tokens/second |
| Safety requirements? | Regulated industry? Children? General public? |
| Personalization? | Per-user memory? System prompts? |
| Cost target? | Per-query cost ceiling drives model size decisions |
Functional Requirements¶
| Requirement | Priority | Description |
|---|---|---|
| Multi-turn conversation | Must have | Maintain context across turns |
| Streaming responses | Must have | Token-by-token SSE delivery |
| Safety guardrails | Must have | Block harmful inputs/outputs |
| Conversation history | Must have | Persist and retrieve past sessions |
| System prompts | Should have | Configurable persona/instructions |
| Multi-modal input | Nice to have | Images, files, code |
| Tool/function calling | Nice to have | Web search, calculator, API calls |
| Response regeneration | Nice to have | User can request a different response |
Non-Functional Requirements¶
| Requirement | Target | Rationale |
|---|---|---|
| Time-to-first-token | < 500ms (P95) | User perceives responsiveness |
| Streaming throughput | > 30 tokens/sec | Faster than reading speed |
| Availability | 99.9% | Consumer product expectations |
| Concurrent users | 100K+ | Peak hours |
| Cost per query | < $0.01 (consumer) | Sustainability at scale |
| Safety | Zero harmful outputs | Regulatory and reputational |
API Design¶
# POST /v1/chat/completions (streaming)
{
"model": "gemini-pro",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing"},
{"role": "assistant", "content": "Quantum computing uses..."},
{"role": "user", "content": "How does entanglement work?"}
],
"stream": true,
"temperature": 0.7,
"max_tokens": 2048,
"safety_settings": {
"harassment": "BLOCK_MEDIUM_AND_ABOVE",
"dangerous_content": "BLOCK_LOW_AND_ABOVE"
}
}
# Response (SSE stream):
# data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":"Quantum"},"index":0}]}
# data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":" entanglement"},"index":0}]}
# ...
# data: {"id":"chatcmpl-abc","choices":[{"delta":{},"finish_reason":"stop"}]}
# data: [DONE]
Technology Selection & Tradeoffs¶
A production LLM chatbot is assembled from inference engine + conversation storage + safety classification + model routing strategy. Each decision affects cost per query, latency, and safety posture.
LLM inference engine¶
| Option | Strengths | Weaknesses | When to choose |
|---|---|---|---|
| vLLM | PagedAttention for near-zero KV-cache waste; continuous batching; OpenAI-compatible API | Tensor parallelism only (no pipeline); less mature on AMD GPUs | Default for production on NVIDIA GPUs with throughput as primary concern |
| TensorRT-LLM (NVIDIA) | Highest absolute throughput; fused kernels; in-flight batching | NVIDIA-only; complex build pipeline; less flexible for custom models | Maximum tokens/second on fixed NVIDIA fleet |
| SGLang | Radix attention for prefix caching; structured generation support | Smaller community; fewer production deployments | When prefix caching (repeated system prompts) is a critical lever |
| TGI (HuggingFace) | Easy to deploy; good model hub integration; token streaming built in | Lower throughput at high concurrency | Quick prototyping or low-scale deployments |
Conversation storage¶
| Option | Strengths | Weaknesses | When to choose |
|---|---|---|---|
| Spanner / CockroachDB | Globally consistent; interleaved tables for conversation locality | Expensive at scale; Spanner GCP-only; CockroachDB adds ops complexity | Global consistency and relational queries required |
| DynamoDB / Bigtable | Massive scale; low-latency point reads; natural row-key design | No cross-row transactions; poor for ad-hoc analytics | Scale exceeds relational limits; queries are primarily key-based |
| PostgreSQL | Open-source; mature; JSONB for flexible messages; strong ecosystem | Single-region default; sharding needs extensions (Citus) | Smaller deployments; avoiding vendor lock-in |
Safety classification¶
| Option | Strengths | Weaknesses | When to choose |
|---|---|---|---|
| Cascade: keyword + FastText + LLM | Layered cost: 99% handled by cheap filters; LLM only for ambiguous cases | Keyword lists need maintenance; FastText has limited nuance | Default for high-throughput where per-query safety cost must stay low |
| Dedicated safety LLM (Llama Guard) | High accuracy; understands context and nuance | 50-100ms per call; requires dedicated GPU capacity | Safety accuracy is paramount (children's products, regulated) |
| Third-party API (OpenAI Moderation) | Zero ops; continuously updated; broad language coverage | Data leaves infrastructure; external latency | Prototyping or when internal models aren't mature |
Model routing strategy¶
| Option | Strengths | Weaknesses | When to choose |
|---|---|---|---|
| Tiered routing (7B + 70B) | Route simple queries to cheap model; complex to flagship; 3-5x cost reduction | Requires query classifier; misrouted queries degrade quality | Default for cost-sensitive production where complexity varies |
| Single flagship (70B) | Simplest; consistent quality; one model to monitor | Highest cost; over-provisioned for simple queries | Early-stage where quality is the only priority |
| Speculative decoding (1B draft + 70B target) | 2-3x speedup with identical output quality | Added complexity; draft model must match target | Latency reduction without sacrificing quality |
Our choice: vLLM as inference engine (PagedAttention delivers 2-4x more concurrent requests per GPU, directly reducing cost). Tiered routing with 7B for simple queries (60-70% of traffic) and 70B for complex, using a lightweight classifier (distilled BERT, sub-5ms). Spanner for conversation storage (global consistency, interleaved tables) with Redis caching active conversation windows. Three-stage safety cascade (keyword → FastText → Llama Guard) keeping average overhead under 2ms while maintaining accuracy on edge cases.
Tip
Interview angle: The KV-cache memory discussion shows depth. KV-cache per request often exceeds the model weights themselves (10 GB for 70B at 4096 context). PagedAttention solves this the same way virtual memory solved physical memory fragmentation -- this analogy resonates with systems-minded interviewers.
Step 2: Back-of-Envelope Estimation¶
Traffic¶
Daily active users: 10M (mid-scale, not full Google)
Queries per user/day: 5
Total daily queries: 50M
QPS (average): 50M / 86,400 ≈ 580
QPS (peak, 5x): ~3,000
GPU Compute¶
Model: 70B parameters (fp16 = 140 GB, quantized int8 = 70 GB)
GPUs per replica: 2 × A100 (80GB) with int8 quantization
Avg input tokens: 200 (conversation context)
Avg output tokens: 400
Prefill time (200 tokens): ~50ms on A100
Decode time (400 tokens): 400 / 40 tok/s = ~10s
With continuous batching (batch=32):
Effective throughput: ~40 requests/s per 2-GPU replica
Replicas for peak QPS: 3,000 / 40 = 75 replicas = 150 A100 GPUs
KV-cache per request: ~2 GB (4096 context, 70B model)
KV-cache per GPU (batch=32): ~64 GB → fits in 80 GB A100
Storage¶
Conversation history:
50M conversations/day × avg 2 KB = 100 GB/day
30-day retention = 3 TB
User profiles: 10M × 1 KB = 10 GB
Safety audit logs: 50M × 0.5 KB = 25 GB/day
Cost¶
GPU cost (150 A100s): 150 × $2.5/hr = $375/hr = $9,000/day
Cost per query: $9,000 / 50M = $0.00018 per query ✓
With spot instances (60% discount):
$3,600/day → $0.000072 per query
Step 3: High-Level Design¶
flowchart TB
subgraph cli["Client Layer"]
Web[Web App]
Mobile[Mobile App]
API[API Clients]
end
subgraph gtw["API Gateway"]
LB[Load Balancer]
Auth["Auth & Rate Limit"]
Router[Request Router]
end
subgraph safe["Safety Pipeline"]
InputFilter[Input Classifier]
OutputFilter[Output Classifier]
PII[PII Detector]
end
subgraph serve["LLM Serving Cluster"]
Scheduler[Request Scheduler]
subgraph gpool["GPU Pool"]
GPU1[GPU Replica 1<br/>vLLM]
GPU2[GPU Replica 2<br/>vLLM]
GPUN[GPU Replica N<br/>vLLM]
end
KVStore[KV-Cache<br/>Manager]
end
subgraph ctx["Context Assembly"]
ConvMgr[Conversation<br/>Manager]
SysPrompt[System Prompt<br/>Store]
ToolRouter[Tool Router]
end
subgraph stor["Storage Layer"]
ConvDB[(Conversation DB<br/>Spanner/Bigtable)]
UserDB[(User Store<br/>Spanner)]
AuditLog[(Safety Audit Log<br/>BigQuery)]
end
subgraph fb["Feedback & Evaluation"]
ThumbsUD["Thumbs Up/Down"]
Eval[Evaluation Pipeline]
ABTest[A/B Testing]
end
cli --> gtw
gtw --> safe
safe -->|"clean input"| ctx
ctx -->|"assembled prompt"| serve
serve -->|"raw output"| safe
safe -->|"filtered output"| gtw
gtw -->|"SSE stream"| cli
ctx --> stor
serve --> stor
safe --> AuditLog
cli --> fb
fb --> Eval
Component Responsibilities¶
| Component | Responsibility |
|---|---|
| API Gateway | Authentication, rate limiting, request routing, SSE management |
| Safety Pipeline | Input/output classification, PII detection, policy enforcement |
| Conversation Manager | Assemble conversation context, manage history, truncation |
| LLM Serving Cluster | GPU inference with continuous batching, KV-cache management |
| Request Scheduler | Queue management, priority, preemption, load balancing across GPUs |
| Tool Router | Dispatch function calls to external tools (search, calculator) |
| Feedback System | Collect thumbs up/down, route to RLHF training pipeline |
Step 4: Deep Dive¶
4.1 Conversation Context Management¶
The most critical non-obvious component. With a 128K context window, naive approaches fail:
class ConversationManager:
"""Assembles the prompt from conversation history, system prompt, and tools."""
MAX_CONTEXT_TOKENS = 128_000
RESERVED_FOR_OUTPUT = 4_096
RESERVED_FOR_SYSTEM = 2_000
def assemble_prompt(self, conversation_id: str, new_message: str) -> list[dict]:
system_prompt = self.get_system_prompt(conversation_id)
history = self.get_history(conversation_id)
available_tokens = (
self.MAX_CONTEXT_TOKENS
- self.RESERVED_FOR_OUTPUT
- self.RESERVED_FOR_SYSTEM
- self.count_tokens(new_message)
)
truncated_history = self._truncate_history(history, available_tokens)
return [
{"role": "system", "content": system_prompt},
*truncated_history,
{"role": "user", "content": new_message},
]
def _truncate_history(self, history: list[dict], budget: int) -> list[dict]:
"""Keep most recent turns. Optionally summarize older turns."""
result = []
used = 0
for msg in reversed(history):
tokens = self.count_tokens(msg["content"])
if used + tokens > budget:
break
result.insert(0, msg)
used += tokens
return result
def _summarize_old_context(self, old_messages: list[dict]) -> str:
"""Use a smaller model to compress older conversation into a summary."""
summary_prompt = f"Summarize this conversation concisely:\n{old_messages}"
return self.summary_model.generate(summary_prompt, max_tokens=500)
Strategies for long conversations:
| Strategy | Pros | Cons |
|---|---|---|
| Sliding window | Simple, predictable | Loses early context |
| Summarize + window | Retains key info | Summary may lose nuance |
| Hierarchical compression | Best retention | Complex, adds latency |
| Retrieval from history | Precise recall | Requires embedding + index |
4.2 Inference Engine (vLLM-style)¶
import asyncio
from dataclasses import dataclass, field
from enum import Enum
class RequestState(Enum):
WAITING = "waiting"
RUNNING = "running"
FINISHED = "finished"
PREEMPTED = "preempted"
@dataclass
class InferenceRequest:
request_id: str
prompt_tokens: list[int]
max_output_tokens: int
temperature: float
state: RequestState = RequestState.WAITING
output_tokens: list[int] = field(default_factory=list)
kv_cache_blocks: list[int] = field(default_factory=list)
class ContinuousBatchScheduler:
"""Scheduler implementing continuous batching with PagedAttention."""
def __init__(self, max_batch_size: int, max_blocks: int, block_size: int = 16):
self.max_batch_size = max_batch_size
self.max_blocks = max_blocks
self.block_size = block_size
self.waiting_queue: list[InferenceRequest] = []
self.running_batch: list[InferenceRequest] = []
self.free_blocks: set[int] = set(range(max_blocks))
def add_request(self, request: InferenceRequest):
request.state = RequestState.WAITING
self.waiting_queue.append(request)
def schedule_step(self) -> list[InferenceRequest]:
"""Called every decode step to determine what runs next."""
finished = [r for r in self.running_batch if self._is_finished(r)]
for req in finished:
self._free_kv_blocks(req)
req.state = RequestState.FINISHED
self.running_batch = [r for r in self.running_batch if r.state == RequestState.RUNNING]
while (
self.waiting_queue
and len(self.running_batch) < self.max_batch_size
and self._has_free_blocks(self.waiting_queue[0])
):
req = self.waiting_queue.pop(0)
self._allocate_kv_blocks(req)
req.state = RequestState.RUNNING
self.running_batch.append(req)
if not self.running_batch and self.waiting_queue:
victim = self._select_preemption_victim()
if victim:
self._preempt(victim)
return self.running_batch
def _is_finished(self, req: InferenceRequest) -> bool:
return (
len(req.output_tokens) >= req.max_output_tokens
or (req.output_tokens and req.output_tokens[-1] == EOS_TOKEN)
)
def _has_free_blocks(self, req: InferenceRequest) -> bool:
needed = (len(req.prompt_tokens) + self.block_size - 1) // self.block_size
return needed <= len(self.free_blocks)
def _allocate_kv_blocks(self, req: InferenceRequest):
needed = (len(req.prompt_tokens) + self.block_size - 1) // self.block_size
blocks = [self.free_blocks.pop() for _ in range(needed)]
req.kv_cache_blocks = blocks
def _free_kv_blocks(self, req: InferenceRequest):
self.free_blocks.update(req.kv_cache_blocks)
req.kv_cache_blocks = []
def _select_preemption_victim(self) -> InferenceRequest | None:
if not self.running_batch:
return None
return max(self.running_batch, key=lambda r: len(r.output_tokens))
def _preempt(self, req: InferenceRequest):
self._free_kv_blocks(req)
req.state = RequestState.PREEMPTED
self.running_batch.remove(req)
self.waiting_queue.insert(0, req)
EOS_TOKEN = 2
4.3 Streaming Architecture¶
sequenceDiagram
participant Client
participant Gateway
participant Safety
participant LLM as LLM Engine
Client->>Gateway: POST /chat (stream=true)
Gateway->>Safety: classify_input(message)
Safety-->>Gateway: SAFE
Gateway->>LLM: schedule_request(prompt)
loop Every token
LLM-->>Gateway: token
Gateway-->>Client: SSE: data: {"delta": "token"}
end
LLM-->>Gateway: "finish_reason: stop"
Gateway->>Safety: classify_output(full_response)
alt Output SAFE
Gateway-->>Client: SSE: data: [DONE]
else Output UNSAFE
Gateway-->>Client: SSE: data: {"error": "response_filtered"}
end
Gateway->>Gateway: log(audit_record)
Key design decisions for streaming:
| Decision | Choice | Rationale |
|---|---|---|
| Protocol | Server-Sent Events (SSE) | Simpler than WebSocket; HTTP/2 compatible; sufficient for one-way streaming |
| Buffering | Token-level, no buffering | Users expect character-by-character; buffering adds perceived latency |
| Output safety | Post-hoc check after full response | Cannot classify safety mid-stream reliably; can revoke if unsafe |
| Connection timeout | 5 minutes | Long responses can take 30-60s; keep-alive for multi-turn |
4.4 Safety Pipeline¶
flowchart LR
subgraph inp["Input Safety - Pre-Generation"]
I1[Keyword<br/>Blocklist] --> I2[ML Classifier<br/>FastText]
I2 --> I3[LLM Safety<br/>Classifier]
I3 --> I4[PII<br/>Detector]
end
subgraph gen["Generation"]
G[LLM<br/>Inference]
end
subgraph out["Output Safety - Post-Generation"]
O1[Toxicity<br/>Classifier] --> O2[Factuality<br/>Check]
O2 --> O3[PII<br/>Scrubber]
O3 --> O4[Citation<br/>Injector]
end
inp -->|SAFE| gen
gen --> out
inp -->|BLOCKED| Reject[Return safe<br/>refusal message]
Cascade design — fast filters first:
class SafetyPipeline:
"""Multi-stage safety pipeline. Fast checks first, expensive checks last."""
def __init__(self):
self.keyword_filter = KeywordBlocklist() # ~0.1ms
self.fast_classifier = FastTextClassifier() # ~1ms
self.llm_classifier = LLMSafetyClassifier() # ~50ms
self.pii_detector = PIIDetector() # ~5ms
async def classify_input(self, text: str) -> SafetyResult:
if self.keyword_filter.contains_blocked(text):
return SafetyResult(blocked=True, reason="keyword_match", stage="keyword")
fast_score = await self.fast_classifier.predict(text)
if fast_score > 0.95:
return SafetyResult(blocked=True, reason="fast_classifier", stage="fast_ml")
if fast_score > 0.5:
llm_result = await self.llm_classifier.classify(text)
if llm_result.is_unsafe:
return SafetyResult(
blocked=True, reason=llm_result.category, stage="llm_classifier"
)
pii_result = self.pii_detector.scan(text)
return SafetyResult(blocked=False, pii_entities=pii_result.entities)
async def classify_output(self, text: str) -> SafetyResult:
toxicity = await self.fast_classifier.predict(text)
pii = self.pii_detector.scan(text)
if toxicity > 0.8 or pii.has_critical:
return SafetyResult(blocked=True, reason="output_safety")
return SafetyResult(blocked=False)
Safety categories (Google-style):
| Category | Examples | Threshold |
|---|---|---|
| Harassment | Bullying, threats, hate speech | BLOCK_MEDIUM_AND_ABOVE |
| Dangerous content | Weapons instructions, self-harm | BLOCK_LOW_AND_ABOVE |
| Sexually explicit | Adult content | BLOCK_MEDIUM_AND_ABOVE |
| Medical/Legal advice | Specific diagnoses, legal counsel | Disclaimer injection |
| PII exposure | SSN, credit cards, addresses | Always scrub |
4.5 GPU Auto-Scaling¶
class GPUAutoscaler:
"""Auto-scale GPU replicas based on queue depth and latency."""
SCALE_UP_QUEUE_THRESHOLD = 100
SCALE_DOWN_IDLE_SECONDS = 300
MIN_REPLICAS = 10
MAX_REPLICAS = 200
COOLDOWN_SECONDS = 120
def evaluate(self, metrics: ClusterMetrics) -> ScalingDecision:
if metrics.avg_queue_depth > self.SCALE_UP_QUEUE_THRESHOLD:
new_count = min(
self.MAX_REPLICAS,
metrics.current_replicas + max(1, metrics.current_replicas // 4),
)
return ScalingDecision(action="scale_up", target=new_count)
if (
metrics.avg_gpu_utilization < 0.3
and metrics.idle_duration_seconds > self.SCALE_DOWN_IDLE_SECONDS
and metrics.current_replicas > self.MIN_REPLICAS
):
new_count = max(
self.MIN_REPLICAS,
int(metrics.current_replicas * 0.75),
)
return ScalingDecision(action="scale_down", target=new_count)
return ScalingDecision(action="no_change", target=metrics.current_replicas)
Key scaling signals:
| Signal | Scale Up When | Scale Down When |
|---|---|---|
| Queue depth | > 100 pending requests | < 10 for 5 min |
| TTFT P95 | > 1 second | Consistently < 200ms |
| GPU memory utilization | > 90% | < 30% for 5 min |
| Time of day | Pre-schedule for known peaks | Night hours in region |
4.6 Model Routing and A/B Testing¶
flowchart LR
Request[User Request] --> Router{Model Router}
Router -->|"90%"| ModelA[Gemini Pro<br/>Production]
Router -->|"5%"| ModelB[Gemini Pro v2<br/>Experiment A]
Router -->|"5%"| ModelC[Fine-tuned<br/>Experiment B]
ModelA --> Logger[Metrics Logger]
ModelB --> Logger
ModelC --> Logger
Logger --> Dashboard[A/B Dashboard]
Routing criteria:
| Factor | Example |
|---|---|
| User bucket | Consistent hashing on user_id for sticky assignment |
| Query complexity | Route simple queries to smaller model; complex to larger |
| Cost tier | Free users → smaller model; paid users → flagship |
| Region | Route to nearest GPU cluster |
| Capacity | Overflow to secondary model during peak |
4.7 Conversation Storage Schema¶
-- Conversations table (Spanner)
CREATE TABLE conversations (
conversation_id STRING(36) NOT NULL,
user_id STRING(36) NOT NULL,
title STRING(256),
model_id STRING(64),
system_prompt STRING(MAX),
created_at TIMESTAMP NOT NULL,
updated_at TIMESTAMP NOT NULL,
message_count INT64,
total_tokens INT64,
) PRIMARY KEY (user_id, conversation_id);
-- Messages table (interleaved for locality)
CREATE TABLE messages (
conversation_id STRING(36) NOT NULL,
user_id STRING(36) NOT NULL,
message_id STRING(36) NOT NULL,
role STRING(16) NOT NULL, -- "user", "assistant", "system"
content STRING(MAX),
token_count INT64,
model_id STRING(64),
created_at TIMESTAMP NOT NULL,
safety_result JSON,
feedback STRING(16), -- "thumbs_up", "thumbs_down", null
latency_ms INT64,
) PRIMARY KEY (user_id, conversation_id, message_id),
INTERLEAVE IN PARENT conversations ON DELETE CASCADE;
4.8 Evaluation and Feedback Loop¶
flowchart TB
subgraph onl["Online Signals"]
ThumbsUp["Thumbs Up / Down"]
Regen[Regenerate Clicks]
Abandon[Session Abandonment]
CopyPaste["Copy/Paste Rate"]
end
subgraph offl["Offline Evaluation"]
HumanEval[Human Eval<br/>Side-by-side rating]
AutoEval[Auto Eval<br/>LLM-as-Judge]
Benchmark["Benchmark Suite<br/>MMLU, HumanEval, etc."]
end
subgraph trainp["Training Pipeline"]
RLHF["RLHF / DPO<br/>Training"]
SFT[Supervised<br/>Fine-tuning]
Safety_T[Safety<br/>Tuning]
end
onl --> DataPipeline[Data Pipeline]
DataPipeline --> trainp
offl --> trainp
trainp --> NewModel[New Model<br/>Candidate]
NewModel --> ABTest[A/B Test]
ABTest --> Production[Promote to<br/>Production]
Evaluation dimensions:
| Dimension | How to Measure | Target |
|---|---|---|
| Helpfulness | Human side-by-side, thumbs up rate | Win rate > 50% vs baseline |
| Harmlessness | Safety classifier, red-team attacks | 0 critical failures |
| Honesty | Factual accuracy, hallucination rate | > 90% factual |
| Instruction following | LLM-as-judge scoring | > 85% score |
| Latency | TTFT P50, P95, P99 | P95 < 500ms |
| Cost | $ per query | < $0.01 |
Step 5: Scaling & Production¶
Failure Handling¶
| Failure | Detection | Recovery |
|---|---|---|
| GPU OOM | KV-cache allocation fails | Preempt lowest-priority request, retry |
| GPU hardware failure | Health check timeout | Drain node, reschedule to healthy GPU |
| Model loading failure | Startup probe fails | Retry with exponential backoff; alert if 3 failures |
| Safety pipeline down | Circuit breaker trips | Fail closed (reject all requests); alert |
| Conversation DB unavailable | Connection timeout | Serve from cache; degrade to stateless mode |
| Overload | Queue depth > 1000 | Shed load: 429 with Retry-After; scale up |
Monitoring Dashboard¶
| Metric | Alert Threshold |
|---|---|
| TTFT P95 | > 1s |
| Token throughput | < 20 tokens/s per replica |
| GPU utilization | > 95% for 5 min |
| KV-cache utilization | > 90% |
| Safety block rate | > 5% (indicates attack) |
| Error rate | > 1% |
| Queue depth | > 500 |
Trade-offs¶
| Decision | Option A | Option B | Recommendation |
|---|---|---|---|
| Model size | 7B (fast, cheap) | 70B (smarter) | 70B for flagship; 7B for cost-sensitive |
| Quantization | fp16 (accurate) | int8 (2x throughput) | int8 with quality validation |
| Context strategy | Truncate | Summarize old context | Summarize for conversations > 10 turns |
| Safety timing | Pre-generation only | Pre + post | Both (belt and suspenders) |
| Streaming | SSE | WebSocket | SSE (simpler; WebSocket only if bidirectional needed) |
Interview Tips¶
Tip
Google interviewers will probe these areas: 1. "How do you handle a 100-turn conversation that exceeds context length?" — Summarization, retrieval from history 2. "What happens when GPU utilization hits 100%?" — Preemption, load shedding, queue management 3. "How do you evaluate if a response is good?" — Multi-dimensional eval, LLM-as-judge, human eval 4. "Walk me through the lifecycle of a single request" — Full path from client to GPU and back 5. "How do you keep cost under $0.01 per query?" — Quantization, speculative decoding, caching
Common follow-up questions: - How would you add tool/function calling? - How do you handle multi-modal inputs (images)? - How would you implement a memory system across conversations? - What changes for enterprise (on-prem, data residency)? - How do you prevent prompt injection attacks?
Hypothetical Interview Transcript¶
Note
This transcript simulates a 45-minute Google L5/L6 system design round. The interviewer is a Staff Engineer on the Gemini serving team.
Interviewer: Let's design a conversational AI chatbot — think Google Gemini or ChatGPT. Millions of users, real-time responses, the works. Where would you start?
Candidate: Before diving in, let me clarify requirements. Are we designing the full stack — model serving, conversation management, safety — or focusing on a specific part? Also, what scale are we targeting? And is this text-only or multi-modal?
Interviewer: Full stack. Let's say 10 million daily active users, 5 queries each. Text-only for now, but I'd like you to think about how multi-modal would change things later.
Candidate: Great. So 50 million queries per day, roughly 580 QPS average, peaking at maybe 3,000 QPS. Let me think about the model size — for Gemini-quality responses, we'd need at least a 70B parameter model. At fp16, that's 140 GB, so minimum 2 A100 80GB GPUs per replica. With int8 quantization, we could fit it on 2 GPUs comfortably.
Let me estimate GPU needs. With continuous batching, say we get about 40 requests per second per 2-GPU replica. For 3,000 peak QPS, we'd need about 75 replicas — 150 A100 GPUs. That's roughly $9,000 per day in GPU costs, which gives us about $0.00018 per query. Very feasible.
Interviewer: Good estimation. What about KV-cache? That's often the real bottleneck.
Candidate: Exactly right. For a 70B model with 80 layers, the KV-cache per token is about 2.5 MB. At a 4096-token context, that's roughly 10 GB per request. With a batch size of 32, we need about 320 GB of KV-cache memory. That's more than the model weights themselves.
This is where PagedAttention becomes critical. Instead of pre-allocating KV-cache for the maximum sequence length — which wastes 50-70% of memory — we allocate pages on demand, similar to virtual memory in operating systems. This can give us 2-4x more concurrent requests per GPU.
I'd use something like vLLM's architecture: pages of 16 tokens each, a block table mapping logical to physical blocks, and a free-list allocator. When a request finishes, its blocks are immediately recycled.
Interviewer: Walk me through what happens when a user sends a message in an ongoing conversation.
Candidate: Sure, let me trace the full request lifecycle:
-
Client sends message — POST to
/v1/chat/completionswith the conversation ID andstream=true. -
API Gateway — authenticates the user, checks rate limits, opens an SSE connection back to the client.
-
Conversation Manager — fetches the conversation history from Spanner (or from a read-through cache). It assembles the full prompt: system prompt, conversation history, and the new message. If the history exceeds the context window, it truncates older messages, keeping the most recent turns. For very long conversations, I'd summarize older turns with a smaller, faster model.
-
Input Safety — before sending to the LLM, the message goes through a safety cascade: keyword blocklist (0.1ms), a FastText classifier (1ms), and for borderline cases, an LLM-based classifier (50ms). If blocked, we return a safe refusal message.
-
Request Scheduler — the assembled prompt enters the continuous batching scheduler. It tokenizes the prompt, checks if there's GPU capacity and free KV-cache blocks. If yes, it starts the prefill phase. If the cluster is at capacity, the request waits in a priority queue.
-
Prefill — the GPU processes all input tokens in parallel. For 200 input tokens on an A100, this takes about 50ms. This computes the KV-cache for all input positions.
-
Decode — the model generates tokens one at a time. Each token is immediately streamed back through the gateway as an SSE event. At ~40 tokens/second, a 400-token response takes about 10 seconds.
-
Output Safety — after generation completes, the full response is checked by the output safety classifier. If it fails, we send a filtered message to the client. This is a tradeoff — we can't check mid-stream reliably.
-
Persist — the assistant message is saved to Spanner asynchronously. Safety audit logs go to BigQuery.
Interviewer: You mentioned summarizing older turns. How would you handle a user who has a 100-turn conversation?
Candidate: This is one of the trickiest parts. Let me outline three approaches in increasing sophistication:
Approach 1: Sliding window. Keep the last N turns that fit in the context. Simple, but you lose earlier context. The user says "as I mentioned earlier" and the model has no idea.
Approach 2: Summarize + window. After every K turns (say 10), run a smaller model to compress older turns into a 200-token summary. The prompt becomes: [system prompt] [summary of turns 1-80] [full turns 81-100] [new message]. This preserves key facts while staying within the window.
Approach 3: Retrieval from history. Embed each message and store in a vector index. When assembling the prompt, retrieve the most relevant past messages based on semantic similarity to the current query. This handles "what did we discuss about databases?" even if it was 50 turns ago.
For production, I'd recommend Approach 2 as the default, with Approach 3 as an enhancement for power users. The summary model can be a fast 7B model — the latency hit is about 200ms, which we can absorb in the prefill phase.
Interviewer: Good. Let's talk about safety. How do you prevent the model from generating harmful content?
Candidate: I think of safety as defense-in-depth — multiple layers, each catching different things:
Layer 1: Input filtering. Before generation even starts. A cascade of increasingly expensive classifiers — keyword blocklist (microseconds), FastText (1ms), and for ambiguous cases, an LLM classifier (50ms). This catches obvious prompt injection, requests for harmful content, and adversarial attacks. About 99% of clearly harmful inputs are caught here.
Layer 2: System prompt / constitutional instructions. The model's system prompt includes safety guidelines. This is the model's "conscience" — it's been RLHF-trained to follow these.
Layer 3: Output filtering. After generation, classify the full response for toxicity, PII, and policy violations. If it fails, replace with a safe refusal. This catches cases where the input looked benign but the model produced something harmful.
Layer 4: PII scrubbing. Always scan outputs for SSNs, credit card numbers, phone numbers using regex + Named Entity Recognition (NER) models. Never let PII through.
Layer 5: Audit logging. Log every input/output pair to BigQuery for offline analysis. A separate safety team reviews flagged interactions daily.
The tradeoff: more safety layers = more latency + more false positives (blocking legitimate queries). We'd A/B test safety thresholds to find the sweet spot. For a consumer product, I'd err on the side of blocking — a false positive is a minor inconvenience; a false negative can be a PR disaster.
Interviewer: How would you evaluate whether your chatbot is actually good?
Candidate: Evaluation is multi-dimensional. No single metric captures "good."
Online metrics — things we measure in production: - Thumbs up/down rate (our primary online signal) - Regeneration rate (user asked for a different answer — implies dissatisfaction) - Session length and return rate - Copy/paste rate (user found the response useful enough to copy)
Offline evaluation: - Human side-by-side evaluation (Elo rating): show two responses from different models, human picks the better one. This is our gold standard. - LLM-as-judge: use a strong model (GPT-4 or Gemini Ultra) to score responses on helpfulness, accuracy, and safety. Cheaper than humans, correlates ~85% with human judgment. - Benchmark suites: MMLU for knowledge, HumanEval for coding, TruthfulQA for factuality. These track regression, not absolute quality.
The feedback loop: thumbs down responses are candidates for RLHF/DPO training data. We'd pair the rejected response with a preferred response (either human-written or from a better model run), then use DPO to update the model weekly.
Interviewer: Last question — if I told you the cost needs to drop by 5x, what would you do?
Candidate: Several levers, from least to most disruptive:
-
Quantization — move from fp16 to int4 (GPTQ/AWQ). This halves GPU memory, roughly doubling throughput. Quality loss is typically < 2% on benchmarks for a 70B model.
-
Speculative decoding — use a 1B draft model to generate 4-5 candidate tokens, verify them in one forward pass of the 70B model. This gives 2-3x speedup with identical output quality.
-
Prompt caching — for repeated system prompts and common conversation patterns, cache the KV-cache of the prefix. Saves prefill compute for every request with the same system prompt.
-
Semantic caching — for near-duplicate queries ("what is the capital of France" vs "capital of France?"), return cached responses. Maybe 10-15% cache hit rate.
-
Model distillation — train a smaller 7B model on the 70B model's outputs for common query types. Route simple queries (greetings, factual Q&A) to the 7B model, complex queries to 70B. If 60% of queries can use the small model, that's a 3-4x cost reduction.
-
Spot/preemptible instances — use cloud spot instances for non-critical traffic (free tier users). 60-70% cost savings with the risk of preemption — mitigated by having some on-demand capacity as a floor.
I'd start with quantization and speculative decoding (easy wins, no quality loss), then invest in model distillation for the biggest long-term savings.
Interviewer: Excellent. Great depth on the inference optimization and safety design. Let's wrap up.