Design a Document Q&A System for 10,000+ PDFs
What We're Building
A document-grounded question answering system over a large corpus of PDFs — think internal research libraries, legal discovery, compliance manuals, or technical specification archives. Users ask natural language questions; the system retrieves the most relevant passages from thousands of PDFs, re-ranks them, and generates answers with explicit citations (document, page, section).
The key difference from a generic chatbot: answers must be attributable to specific PDF regions. PDFs make this harder: scanned pages, multi-column layouts, embedded tables, figures, and mixed fonts all break naive "read the file as text" pipelines.
Why This Problem Is Hard
| Challenge | Description |
|---|---|
| PDF is a presentation format, not semantic text | Text order, headers, and tables often require layout-aware parsing or OCR |
| Scale across documents | 10K+ PDFs implies millions of chunks; ANN search, ACL filtering, and freshness must compose |
| Retrieval vs. reasoning | Multi-document questions need either fusion retrieval or orchestrated sub-queries before generation |
| Non-text modalities | Tables and images carry information that plain text extraction loses without structure or vision |
| Access control | Per-document permissions must be enforced before or immediately after retrieval — leaks are unacceptable |
| Index lifecycle | Adds, updates, and deletes must propagate to vector, sparse, and metadata indexes consistently |
Real-World Scale
| Metric | Scale |
|---|---|
| PDF documents | 10,000–50,000 (single tenant); 100K+ (multi-tenant archive) |
| Total pages | 2M–20M (avg 200–400 pages per PDF) |
| Chunks (512 tokens, overlap) | ~2M–20M at roughly one chunk per page (~400 extracted tokens per page); dense pages yield several chunks each |
| Queries per day | 50K–200K (enterprise knowledge product) |
| Concurrent users | 1K–10K peak |
| Ingestion rate | 100–2,000 new/updated PDFs per day |
| End-to-end latency target | < 3–5 s (retrieval + re-rank + LLM) |
| Embedding dimensions | 768–1024 (BGE, E5-class bi-encoders) |
Warning
Interviewers often probe failure modes: scanned PDFs, wrong reading order, tables rendered as garbage text, and "the right answer spread across three documents." Show you understand parsing, chunking, hybrid retrieval, and multi-hop / multi-doc strategies — not only "embed and call GPT."
Key Concepts Primer
End-to-End RAG over PDFs
```mermaid
flowchart LR
    subgraph ingest["Ingestion"]
        PDF[PDF Blob] --> Parse[Parser<br/>PyMuPDF / Unstructured]
        Parse --> Chunk[Chunking<br/>512 / overlap 50]
        Chunk --> Emb[Bi-encoder<br/>BGE / E5]
        Emb --> VDB[(Vector DB<br/>HNSW)]
        Chunk --> BM25[(BM25 / sparse)]
    end
    subgraph query["Query Path"]
        Q[User Query] --> QE[Query<br/>Embedding]
        QE --> ANN[ANN top-20]
        Q --> BM25Q[BM25 top-20]
        BM25Q --> Fuse[Hybrid<br/>Fusion]
        ANN --> Fuse
        Fuse --> XR[Cross-encoder<br/>re-rank → top-5]
        XR --> Gen[LLM +<br/>citations]
    end
    VDB --> ANN
    BM25 --> BM25Q
```
Bi-Encoder vs. Cross-Encoder
| Model class | Training | Query-time cost | Best for |
|---|---|---|---|
| Bi-encoder (BGE, E5) | Contrastive; query/doc encoded independently | Low: one forward pass for the query; passage vectors are precomputed at index time | First-stage retrieval (ANN) |
| Cross-encoder (e.g., MS MARCO–style) | Joint encoding of (query, passage) pairs | High: one forward pass per (query, passage) pair, so cost grows linearly with the candidate count | Re-ranking top-K after ANN |
```python
# Conceptual: bi-encoder produces fixed vectors; cross-encoder scores pairs.
import torch
import torch.nn.functional as F


def cosine_sim(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0))


class BiEncoderRetrieval:
    def __init__(self, query_tower, doc_tower):
        self.query_tower = query_tower
        self.doc_tower = doc_tower

    def encode_query(self, text: str) -> torch.Tensor:
        return F.normalize(self.query_tower(text), dim=-1)

    def encode_docs(self, texts: list[str]) -> torch.Tensor:
        return F.normalize(self.doc_tower(texts), dim=-1)


class CrossEncoderReranker:
    """Scores (query, passage) jointly; too expensive for the full corpus."""

    def __init__(self, model):
        self.model = model

    def score_pairs(self, query: str, passages: list[str]) -> list[float]:
        pairs = [(query, p) for p in passages]
        logits = self.model(pairs)  # batch forward
        return logits.tolist()
```
Chunking and Overlap (Intuition)
Recursive character splitting with 512-token chunks and 50-token overlap preserves local context across boundaries while keeping vectors within model limits. Section-aware splitting (headings from parser or heuristic rules) reduces mid-sentence cuts.
```mermaid
flowchart TB
    Doc[Full document text] --> Sec[Split on<br/>section boundaries]
    Sec --> Rec[Recursive split:<br/>paragraph → sentence → char]
    Rec --> Tok[Token budget:<br/>max 512, overlap 50]
    Tok --> Meta[Attach metadata:<br/>doc_id, page, section, bbox]
```
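To make the overlap arithmetic concrete, here is a minimal sketch of a sliding token window. It assumes the text is already tokenized into a list (a real pipeline would use the embedding model's tokenizer), and `window_chunks` is an illustrative name, not a library function.

```python
def window_chunks(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Slide a fixed-size window over tokens; consecutive windows share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    stride = size - overlap  # each new window starts this many tokens after the last
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reached the end; avoid a redundant tail chunk
    return chunks


tokens = [f"t{i}" for i in range(1000)]
chunks = window_chunks(tokens)
print([len(c) for c in chunks])
```

With the defaults the stride is 462 tokens, so the last 50 tokens of one chunk reappear as the first 50 of the next.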
HNSW at a Glance
Hierarchical Navigable Small World (HNSW) builds a multi-layer graph for approximate nearest neighbor search. Key knobs: M (max edges per node), efConstruction (build quality), efSearch (query accuracy vs. latency).
Tip
For partitioned corpora (per collection or tenant), maintain separate HNSW graphs or namespaces in the vector store so a query scoped to one library does not scan unrelated vectors. A smaller graph reaches the same recall at a lower efSearch, which reduces latency.
Step 1: Requirements Clarification
Questions to Ask
| Question | Why It Matters |
|---|---|
| Are PDFs mostly digital text or scanned images? | Chooses PyMuPDF path vs. OCR / vision pipeline |
| Single vs. multiple collections / tenants? | Sharding, partitioning, and ACL model |
| Required citation granularity? | Page-level vs. bounding-box vs. table cell |
| Compliance / data residency? | On-prem embeddings vs. cloud APIs |
| Max acceptable latency? | Whether you can afford cross-encoder + large context |
| Who can see which documents? | Per-doc ACLs, RBAC, ABAC |
| Do users ask single-doc or synthesis questions? | Multi-doc retrieval and prompt strategy |
| Languages? | Multilingual encoders and tokenizers |
Functional Requirements
| Requirement | Priority | Description |
|---|---|---|
| Ingest PDFs at scale | Must have | Upload or connector-driven ingestion with deduplication |
| Text + table + image handling | Must have | Structured extraction; OCR / VLM fallback for scans |
| Semantic search over chunks | Must have | Dense embeddings + metadata filters |
| Natural language answers | Must have | LLM generation grounded in retrieved chunks |
| Citations | Must have | Document title, page, section (and optional bbox) |
| Hybrid retrieval | Should have | BM25 + dense for keyword + semantic coverage |
| Cross-encoder re-ranking | Should have | top-20 → top-5 before generation |
| Multi-document answers | Should have | Fuse evidence from 2+ PDFs when needed |
| Incremental index updates | Should have | New PDFs indexed without full rebuild |
| Access control | Must have | Enforce per-document permissions on every query path |
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| P95 query latency | < 4 s | Includes retrieval, re-rank, and ~500-token generation |
| Ingestion freshness | < 5–15 min | Business users expect near-real-time for new docs |
| Retrieval recall@20 | > 90% (eval set) | Wrong retrieval cannot be fixed downstream |
| Faithfulness / grounding | > 93% on sampled eval | Regulatory and trust requirements |
| Availability | 99.9% | Read path degrades gracefully if LLM slow |
| Durability | No silent doc loss | Ingest pipeline idempotency + dead-letter queue |
API Design
```jsonc
// POST /v1/collections/{collection_id}/documents
{
  "source": "s3://bucket/reports/2024/q3-financial.pdf",
  "document_id": "doc-fin-2024-q3",        // optional client id; else server-generated
  "title": "Q3 2024 Financial Report",
  "acl_principal_ids": ["group:finance", "user:auditor-42"],
  "parse_profile": "financial_pdf_v2",     // hints for table detection
  "metadata": {"fiscal_year": 2024, "region": "NA"}
}

// POST /v1/query
{
  "collection_id": "col-research",
  "query": "How did gross margin compare to Q2 and what drove the change?",
  "conversation_id": null,
  "filters": {
    "document_ids": null,
    "metadata": {"fiscal_year": 2024}
  },
  "retrieval": {
    "ann_top_k": 20,
    "rerank_top_k": 5,
    "hybrid_alpha": 0.5
  },
  "generation": {
    "max_answer_tokens": 1024,
    "citation_format": "numeric"
  }
}

// Response
{
  "answer": "Gross margin improved to 41.2% in Q3 from 38.7% in Q2, primarily due to lower input costs in the packaging line and favorable mix toward higher-margin SKUs [1][2].",
  "citations": [
    {
      "ref": 1,
      "document_id": "doc-fin-2024-q3",
      "title": "Q3 2024 Financial Report",
      "page": 14,
      "section": "MD&A - Gross Margin",
      "chunk_id": "chunk-abc123",
      "snippet": "Gross margin increased to 41.2% compared to 38.7% in Q2..."
    },
    {
      "ref": 2,
      "document_id": "doc-ops-packaging-2024",
      "title": "Packaging Cost Initiative",
      "page": 3,
      "section": "Summary",
      "chunk_id": "chunk-def456",
      "snippet": "Year-to-date packaging unit costs declined 6% vs H1..."
    }
  ],
  "retrieval_debug": {
    "ann_candidates": 20,
    "after_acl_filter": 18,
    "after_rerank": 5
  },
  "latency_ms": 3200
}
```
Technology Selection & Tradeoffs
A document Q&A system combines a document parsing pipeline, a chunking strategy, an embedding model, a vector index, an LLM, and a citation extraction layer. The right combination depends on document types, accuracy requirements, and latency constraints.
Document parsing
| Option | Strengths | Weaknesses | When to choose |
|---|---|---|---|
| Apache Tika + custom extractors | Broad format support (PDF, DOCX, PPTX, HTML); open-source; extensible | Table extraction quality varies; no native layout understanding; needs post-processing | General-purpose ingestion; mixed document formats |
| Azure Document Intelligence | Excellent table and form extraction; layout-aware OCR; pre-built models | Cloud dependency; per-page cost; latency for large batches | Financial documents, forms, scanned PDFs with complex layouts |
| Unstructured.io | Purpose-built for RAG pipelines; layout-aware chunking; open-source core | Newer ecosystem; hosted version adds cost; complex docs may need tuning | RAG-first pipelines where chunk quality directly drives answer quality |
| LlamaParse / LLM-based parsing | Handles complex layouts via vision models; understands context | Expensive per page; slower; overkill for simple text docs | High-value documents where parsing errors are costly (legal, medical) |
Vector index
| Option | Strengths | Weaknesses | When to choose |
|---|---|---|---|
| FAISS (Facebook AI Similarity Search) | Blazing fast; multiple index types (IVF, HNSW, PQ); GPU support; battle-tested | No built-in metadata filtering; single-node (needs wrapper for distributed); no persistence layer | High-performance search on moderate corpus; teams comfortable managing infra |
| Pinecone | Managed; metadata filtering; namespace isolation; consistent sub-50ms latency | Vendor lock-in; cost grows with scale; less index tuning control | Managed production deployment; rapid time-to-market |
| Qdrant | Rich filtering; Rust-based (fast); open-source with managed option; payload indexing | Smaller community than alternatives; distributed mode relatively newer | Open-source requirement with strong filtering needs |
| pgvector | Leverage existing PostgreSQL; transactional consistency; simple ops | Slower at scale; limited index types; no GPU acceleration | Small-to-medium corpus; ACID guarantees needed alongside vector search |
Chunking strategy
| Option | Strengths | Weaknesses | When to choose |
|---|---|---|---|
| Document-aware (section/heading) | Respects document structure; preserves context boundaries; tables stay intact | Requires layout parsing; section sizes vary widely | Structured documents with clear headings (reports, wikis, specs) |
| Semantic chunking | Groups related sentences by embedding similarity; adaptive boundaries | Slower (needs embedding per sentence); tuning threshold matters | Mixed documents with varying structure |
| Parent-child (small embed, large retrieve) | Best of both: precise embedding match + sufficient context for LLM | More complex indexing; two-level retrieval adds latency | Long documents where answer context spans multiple paragraphs |
| Fixed-size with overlap | Simple; predictable chunk count; easy to implement | Splits mid-sentence or mid-table; context loss at boundaries | Baseline/prototyping; homogeneous text-heavy documents |
Tip
Interview angle: The chunking strategy is often the single biggest lever for answer quality. Start with document-aware chunking for structured PDFs, and mention parent-child as an upgrade path — it lets you embed small precise chunks but retrieve the full parent section for LLM context, solving the "chunk too small for context" problem.
Our choice: Unstructured.io (or Azure Document Intelligence for scanned/complex PDFs) for parsing, because document Q&A lives or dies on extraction quality — garbage in, garbage out. Document-aware chunking with parent-child retrieval to balance embedding precision with sufficient LLM context. Qdrant as the vector index for its strong metadata filtering (needed for per-document ACL and page-level references) with FAISS as a fallback for teams wanting self-hosted simplicity. This stack optimizes for citation accuracy and traceability — the defining requirement that separates document Q&A from generic chatbots.
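The parent-child pattern mentioned above can be sketched in a few lines: match on small child chunks, then hand the LLM the parent sections they belong to. The stores below are hypothetical in-memory dictionaries (a real system would keep the mapping in the vector store payload or the metadata DB), and `expand_to_parents` is an illustrative name.

```python
# Hypothetical in-memory stores, for illustration only.
PARENTS = {
    "sec-1": "Full text of section 1 ...",
    "sec-2": "Full text of section 2 ...",
}
CHILD_TO_PARENT = {"c1": "sec-1", "c2": "sec-1", "c3": "sec-2"}


def expand_to_parents(ranked_child_ids: list[str], max_parents: int = 3) -> list[tuple[str, str]]:
    """Map ranked child hits to unique parent sections, keeping first-hit order."""
    seen: dict[str, None] = {}  # dict preserves insertion order
    for child_id in ranked_child_ids:
        parent_id = CHILD_TO_PARENT.get(child_id)
        if parent_id is not None:
            seen.setdefault(parent_id, None)
    return [(pid, PARENTS[pid]) for pid in list(seen)[:max_parents]]


context = expand_to_parents(["c2", "c3", "c1"])
print([pid for pid, _ in context])
```

Two child hits from the same section collapse into one parent, so the context window is not spent on duplicated text.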
Step 2: Back-of-Envelope Estimation
Traffic
```text
Assumptions:
  PDFs indexed: 12,000
  Avg pages per PDF: 250 → 3M pages
  Avg tokens per page (extracted): 400 → 1.2B tokens raw
  Chunks (512 tokens, 50-token overlap → ~460 new tokens per chunk): 1.2B / 460 ≈ 2.6M chunks
Queries per day: 80,000
Average QPS: 80,000 / 86,400 ≈ 0.93
Business-hours QPS (traffic concentrated in ~8 h): 80,000 / 28,800 ≈ 2.8
Peak QPS (3–5× burst on top of that): ~8–15
```
Storage
```text
Chunk text + metadata (~800 bytes avg):
  2.6M × 800 B ≈ 2.1 GB
Embeddings (1024-dim float32):
  2.6M × 1024 × 4 B ≈ 10.6 GB
HNSW overhead (often 1.5–2× vector data):
  ~16–22 GB in RAM working set per full index (order-of-magnitude)
BM25 inverted index (compressed):
  ~2–6 GB depending on vocabulary and stemming
Original PDF storage (S3 / GCS):
  12K × 3 MB avg ≈ 36 GB (+ versions)
```
Compute
```text
Initial embedding (cold corpus):
  2.6M chunks × ~2 ms/chunk (batched GPU, same rate as below) ≈ 5,200 s ≈ 1.5 GPU-hours
  (order-of-magnitude; model dependent)
Steady-state ingestion: 500 PDFs/day × 200 chunks/PDF = 100K new chunks/day
  At 2 ms/chunk on GPU batching → ~200 s GPU time/day (plus OCR tail)
Per query:
  1 × query embedding
  1 × ANN (partitioned HNSW)
  20 × cross-encoder forward (batched to 1–2 GPU calls)
  1 × LLM completion (dominant latency)
```
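The traffic and storage figures above are simple enough to recompute in a few lines, which makes it easy to swap in a different corpus size. This sketch only reuses the assumptions stated in the estimates; the constant names are illustrative.

```python
# Assumptions copied from the estimates above; change them to match your corpus.
PDFS = 12_000
PAGES_PER_PDF = 250
TOKENS_PER_PAGE = 400
EFFECTIVE_TOKENS_PER_CHUNK = 460  # 512-token chunks minus the shared overlap
EMBED_DIM = 1024
BYTES_PER_FLOAT32 = 4
CHUNK_RECORD_BYTES = 800
QUERIES_PER_DAY = 80_000

pages = PDFS * PAGES_PER_PDF
raw_tokens = pages * TOKENS_PER_PAGE
chunks = raw_tokens / EFFECTIVE_TOKENS_PER_CHUNK
embedding_gb = chunks * EMBED_DIM * BYTES_PER_FLOAT32 / 1e9
chunk_text_gb = chunks * CHUNK_RECORD_BYTES / 1e9
avg_qps = QUERIES_PER_DAY / 86_400

print(f"pages={pages:,} tokens={raw_tokens:,} chunks={chunks / 1e6:.1f}M")
print(f"embeddings={embedding_gb:.1f} GB chunk_text={chunk_text_gb:.1f} GB avg_qps={avg_qps:.2f}")
```

Small differences from the rounded figures in the text come from carrying the unrounded chunk count through the later steps.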
Cost (Rough Monthly Order-of-Magnitude)
| Component | Assumption | ~USD / month |
|---|---|---|
| Object storage for PDFs | 50 GB @ $0.023/GB | ~2–5 |
| Vector DB (managed, 3-node) | HA deployment | 1.5K–4K |
| GPU for embeddings + re-rank | Shared T4/L4 pool | 2K–8K |
| LLM inference (proprietary API) | 80K q/day × 2K tokens | Highly variable (10K–50K+) |
| OCR / VLM for scans | 10% of pages need OCR | +GPU or API line item |
Note
In an interview, show the latency budget explicitly: ANN + re-rank + LLM. Argue partitioning and caching query embeddings for repeat queries to protect the tail.
Step 3: High-Level Design
```mermaid
flowchart TB
    subgraph clients["Clients"]
        UI[Web / API Clients]
    end
    subgraph dm["Document Management"]
        DocAPI[Document Management API]
        ACL[Access Control Layer]
        MetaDB[(Metadata DB<br/>Postgres / Spanner)]
    end
    subgraph ingest["Document Ingestion Pipeline"]
        Upload[Upload / Connector]
        Queue[Ingest Queue<br/>Kafka / SQS]
        Parser[Parser Service<br/>PDF / OCR / Tables]
        ChunkEng[Chunking Engine]
        EmbSvc[Embedding Service<br/>GPU workers]
        CiteMeta[Citation Metadata<br/>enrichment]
    end
    subgraph store["Storage & Index"]
        Obj[(Object Store<br/>PDF blobs)]
        VDB[(Vector Store<br/>HNSW partitions)]
        BM25[(BM25 Index<br/>OpenSearch / ES)]
    end
    subgraph qpath["Query Path"]
        Ret[Retrieval Service]
        Rerank[Cross-encoder<br/>Re-ranker]
        Gen[Generation Service<br/>LLM]
        CiteExt[Citation Extractor]
    end
    UI --> DocAPI
    UI --> Ret
    DocAPI --> ACL
    ACL --> MetaDB
    DocAPI --> Upload
    Upload --> Obj
    Upload --> Queue
    Queue --> Parser
    Parser --> ChunkEng
    ChunkEng --> EmbSvc
    ChunkEng --> BM25
    EmbSvc --> VDB
    ChunkEng --> MetaDB
    ChunkEng --> CiteMeta
    CiteMeta --> MetaDB
    Ret --> ACL
    Ret --> VDB
    Ret --> BM25
    Ret --> Rerank
    Rerank --> Gen
    Gen --> CiteExt
    CiteExt --> UI
```
Component Responsibilities
| Component | Role |
|---|---|
| Document Management API | Register documents, versions, collections, ACL bindings, ingest triggers |
| Access Control Layer | Resolves user → principals; filters chunk IDs or applies post-filter with over-fetch |
| Parser Service | PDF text (PyMuPDF), layout-aware parsing (Unstructured), OCR (Tesseract / cloud), optional multimodal captioning for figures |
| Chunking Engine | Recursive splits with token caps, overlap, section/table awareness |
| Embedding Service | Batch bi-encoder inference (BGE-large-en-v1.5, E5-mistral, etc.) with queue backpressure |
| Vector Store | HNSW (or IVF-PQ) per collection or tenant shard; stores embedding + chunk metadata |
| Retrieval Service | Hybrid dense + BM25, ANN top-20, ACL-aware filtering, optional query expansion |
| Re-ranker | Cross-encoder scores candidates → top-5 for context window |
| Generation Service | Prompt assembly, LLM call, streaming optional |
| Citation Extractor | Validates [n] references, maps to chunk metadata; optional Natural Language Inference (NLI) grounding check |
Document Ingestion Pipeline
```mermaid
flowchart LR
    subgraph input["Input"]
        Raw[PDF bytes]
    end
    subgraph detect["Classification"]
        Digital{Text layer?}
        Raw --> Digital
        Digital -->|yes| Layout[Layout-aware<br/>extraction]
        Digital -->|no| OCR[OCR pipeline]
    end
    Layout --> Struct[Tables / images<br/>structured paths]
    OCR --> Struct
    Struct --> Norm[Normalize +<br/>dedupe pages]
    Norm --> Chunk[Chunking Engine]
    Chunk --> Emb[Embedding Service]
    Emb --> Idx[Vector + BM25<br/>upsert]
    Chunk --> Idx
```
Step 4: Deep Dive
4.1 PDF Parsing Strategies (PyMuPDF vs. Unstructured vs. Multimodal)
| Approach | Strengths | Weaknesses |
|---|---|---|
| PyMuPDF (fitz) | Fast, good text extraction for many digital PDFs | Weak on complex layouts; reading order can be wrong |
| Unstructured / layout models | Better headings, tables, reading order | Heavier deps; slower |
| OCR (Tesseract, cloud) | Scanned PDFs | Noise, cost, latency |
| Multimodal VLM | Figures, charts, screenshots | Expensive; needs guardrails for PII |
```python
# PDF parsing and text extraction (illustrative)
from dataclasses import dataclass

import fitz  # PyMuPDF


@dataclass
class PageBlock:
    page_number: int
    text: str
    bbox: tuple[float, float, float, float] | None
    block_type: str  # "paragraph", "table", "image_caption"


def extract_pages_pymupdf(path: str) -> list[PageBlock]:
    """Return one PageBlock per page (page-level text, no bounding boxes)."""
    blocks: list[PageBlock] = []
    with fitz.open(path) as doc:  # context manager closes the file handle
        for i, page in enumerate(doc):
            text = page.get_text("text")
            blocks.append(PageBlock(i + 1, text.strip(), None, "paragraph"))
    return blocks


# Unstructured (pseudo-import; API varies by version)
# from unstructured.partition.pdf import partition_pdf
# elements = partition_pdf(filename=path, infer_table_structure=True, strategy="hi_res")
# → iterate Table, Title, NarrativeText elements for section-aware chunking
```
Production pattern: run a fast path (PyMuPDF + heuristics). If text entropy is low or page is image-dominant, escalate to hi-res Unstructured + OCR queue.
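One way to express that escalation rule is a small per-page check. This is a sketch under stated assumptions: the thresholds are placeholders to tune on your own corpus, `image_area_ratio` is assumed to be computed upstream by the parser, and `needs_escalation` is an illustrative name.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy in bits per character; 0.0 for empty text."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def needs_escalation(
    text: str,
    image_area_ratio: float,
    min_chars: int = 50,           # assumed threshold
    min_entropy: float = 3.0,      # assumed threshold, bits per character
    max_image_ratio: float = 0.6,  # assumed threshold
) -> bool:
    """Return True when the fast-path output looks unreliable for this page."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return True  # almost no text layer: likely a scan
    if image_area_ratio > max_image_ratio:
        return True  # page is image-dominant
    return char_entropy(stripped) < min_entropy  # repetitive or garbage text
```

Pages that return True go to the hi-res and OCR queue; everything else stays on the fast path.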
4.2 Chunking Algorithms: Recursive, Semantic, Table-Aware
```python
from typing import Iterator

from transformers import AutoTokenizer


class RecursiveChunker:
    """Chunks of at most 512 tokens; prefers paragraph → sentence boundaries.

    The 50-token overlap applies only to the hard token cut used when no
    separator is left; separator-based splits do not overlap.
    """

    def __init__(
        self,
        model_id: str = "BAAI/bge-large-en-v1.5",
        chunk_tokens: int = 512,
        overlap_tokens: int = 50,
    ):
        self.tok = AutoTokenizer.from_pretrained(model_id)
        self.chunk_tokens = chunk_tokens
        self.overlap_tokens = overlap_tokens
        self.separators = ["\n\n", "\n", ". ", " ", ""]

    def _encode_len(self, text: str) -> int:
        return len(self.tok.encode(text, add_special_tokens=False))

    def split(self, text: str, section: str | None = None) -> Iterator[dict]:
        def _split_recursive(s: str, seps: list[str]) -> list[str]:
            if self._encode_len(s) <= self.chunk_tokens:
                return [s]
            if not seps:
                # Hard cut by tokens, sliding by (chunk_tokens - overlap_tokens).
                ids = self.tok.encode(s, add_special_tokens=False)
                parts = []
                step = self.chunk_tokens - self.overlap_tokens
                for start in range(0, len(ids), step):
                    sub = ids[start : start + self.chunk_tokens]
                    parts.append(self.tok.decode(sub))
                    if start + self.chunk_tokens >= len(ids):
                        break  # window reached the end; skip a redundant tail
                return parts
            sep = seps[0]
            splits = s.split(sep) if sep else [s]
            out: list[str] = []
            buf = ""
            for part in splits:
                # Greedily pack pieces into buf until the token budget is hit.
                cand = part if not buf else buf + sep + part
                if self._encode_len(cand) <= self.chunk_tokens:
                    buf = cand
                else:
                    if buf:
                        out.extend(_split_recursive(buf, seps[1:]))
                    buf = part
            if buf:
                out.extend(_split_recursive(buf, seps[1:]))
            return out

        for chunk_text in _split_recursive(text, self.separators):
            yield {"text": chunk_text, "section": section}


# Table-aware: emit one chunk per serialized table + surrounding caption, do not interleave with prose.
def table_chunks(table_markdown: str, page: int, doc_id: str) -> dict:
    return {
        "text": table_markdown,
        "metadata": {"doc_id": doc_id, "page": page, "modality": "table"},
    }
```
Semantic chunking (optional upgrade): embed sentences or paragraphs; merge until cosine similarity to running centroid drops below threshold — better for heterogeneous PDFs at higher compute cost.
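A minimal sketch of the centroid-merge idea described above, assuming sentence embeddings are already computed and passed in as plain lists. The 0.7 threshold is an arbitrary placeholder, `semantic_groups` is an illustrative name, and a production version would also enforce the token cap per group.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_groups(vectors: list[list[float]], threshold: float = 0.7) -> list[list[int]]:
    """Group consecutive sentence indices; start a new group when similarity
    to the running centroid of the current group falls below `threshold`."""
    groups: list[list[int]] = []
    centroid: list[float] = []
    for i, vec in enumerate(vectors):
        if groups and cosine(vec, centroid) >= threshold:
            group = groups[-1]
            n = len(group)
            # Incremental mean keeps the centroid without re-summing the group.
            centroid = [(c * n + v) / (n + 1) for c, v in zip(centroid, vec)]
            group.append(i)
        else:
            groups.append([i])
            centroid = list(vec)
    return groups


# Toy 2-D "embeddings": two sentences on one topic, then two on another.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
groups = semantic_groups(vectors)
print(groups)
```

Each group of sentence indices becomes one chunk, so boundaries follow topic shifts rather than a fixed token count.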
4.3 Embedding Model Selection, Batching, and HNSW Configuration
| Model family | Typical dims | Notes |
|---|---|---|
| BGE | 1024 (large) | Strong general retrieval; instruction variants for asymmetric query-doc |
| E5 | 1024 | Prefixes query: / passage: matter at inference |
```python
import torch
from sentence_transformers import SentenceTransformer


def batch_embed(texts: list[str], model_name: str, batch_size: int = 64) -> torch.Tensor:
    # Illustrative: a long-lived worker would load the model once and reuse it.
    model = SentenceTransformer(model_name, device="cuda" if torch.cuda.is_available() else "cpu")
    embeddings = model.encode(
        texts,
        batch_size=batch_size,
        convert_to_tensor=True,
        normalize_embeddings=True,
        show_progress_bar=False,
    )
    return embeddings
```
HNSW (conceptual config):
| Parameter | Typical range | Effect |
|---|---|---|
| M | 16–64 | Higher → better recall, more memory |
| efConstruction | 200–800 | Build quality |
| efSearch | 64–256 | Query-time accuracy vs. latency |
Tip
Partition the index by collection_id (or tenant). Each partition holds fewer points → lower efSearch for same recall, and ACL scoping can skip whole partitions.
4.4 Retrieval, Hybrid Fusion, and Cross-Encoder Re-Ranking
```mermaid
flowchart TB
    Q[Query] --> EQ[Encode query<br/>bi-encoder]
    Q --> BM[BM25 top-20]
    EQ --> ANN[ANN top-20<br/>HNSW partition]
    ANN --> Fuse[Score fusion<br/>RRF / weighted]
    BM --> Fuse
    Fuse --> ACL[ACL filter<br/>+ over-fetch]
    ACL --> XR[Cross-encoder<br/>score 20 pairs]
    XR --> Top5[Take top-5<br/>passages]
    Top5 --> LLM[LLM prompt]
```
```python
def reciprocal_rank_fusion(rank_lists: list[list[str]], k: int = 60) -> list[str]:
    """Sum 1 / (k + rank) over every list an id appears in; higher is better."""
    scores: dict[str, float] = {}
    for ranked in rank_lists:
        for i, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + i + 1)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


class HybridRetriever:
    def __init__(self, vector_index, bm25_index, alpha: float = 0.5):
        self.vector_index = vector_index
        self.bm25_index = bm25_index
        self.alpha = alpha  # reserved for weighted score fusion; RRF below ignores it

    def retrieve(
        self,
        query: str,
        query_vector: list[float],
        collection_id: str,
        ann_k: int = 20,
    ) -> list[str]:
        dense_ids = self.vector_index.search(query_vector, ann_k, namespace=collection_id)
        sparse_ids = self.bm25_index.search(query, ann_k, collection_id=collection_id)
        return reciprocal_rank_fusion([dense_ids, sparse_ids])
```
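The flowchart also names weighted fusion as an alternative to RRF. A minimal sketch of that variant follows, assuming `alpha` is the weight on the dense score (which is one plausible reading of `hybrid_alpha` in the API, not something the spec above confirms) and that raw scores are min-max normalized per list because BM25 and cosine scores live on different scales.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Scale scores to [0, 1]; a constant list maps to 1.0 for every id."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def weighted_fusion(dense: dict[str, float], sparse: dict[str, float], alpha: float = 0.5) -> list[str]:
    """Rank ids by alpha * dense + (1 - alpha) * sparse on normalized scores.
    An id missing from one list contributes 0 from that side."""
    d, s = min_max(dense), min_max(sparse)
    fused = {k: alpha * d.get(k, 0.0) + (1 - alpha) * s.get(k, 0.0) for k in d.keys() | s.keys()}
    return sorted(fused, key=lambda k: (-fused[k], k))  # tie-break on id for determinism


dense_scores = {"c1": 0.82, "c2": 0.75, "c3": 0.40}
sparse_scores = {"c2": 12.0, "c4": 9.0, "c3": 3.0}
ranking = weighted_fusion(dense_scores, sparse_scores, alpha=0.5)
print(ranking)
```

RRF needs only ranks and no score calibration, which is why it is the simpler default; weighted fusion gives an explicit knob when one signal should dominate.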
Cross-encoder re-ranking (Python):
```python
from sentence_transformers import CrossEncoder


class CrossEncoderReranker:
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model = CrossEncoder(model_name)

    def rerank(self, query: str, passages: list[str], top_k: int = 5) -> list[tuple[int, float]]:
        """Return (passage_index, score) for the top_k passages, best first."""
        pairs = [[query, p] for p in passages]
        scores = self.model.predict(pairs)
        ranked = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
        return [(i, float(scores[i])) for i in ranked[:top_k]]
```
Java example — batch re-rank HTTP worker (sketch):
```java
// Pseudo-production: call internal ONNX/TorchServe re-rank service with HTTP/2 batching.
// httpClient, RerankRequest, and RerankResponse are assumed to be defined elsewhere.
public final class RerankClient {
    public record ScoredChunk(String chunkId, float score) {}

    public List<ScoredChunk> rerank(String query, List<String> passages) {
        var req = new RerankRequest(query, passages);
        RerankResponse resp = httpClient.post("/v1/rerank", req);
        return resp.topK(5);
    }
}
```
4.5 Citation-Grounded Generation and Citation Extraction
```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    text: str
    document_id: str
    page: int
    section: str | None


SYSTEM = """You answer using ONLY the provided passages. Every factual claim must end with a numeric citation like [1] referring to the passage index. If passages disagree, say so."""


def build_prompt(query: str, chunks: list[Chunk]) -> str:
    parts = []
    for i, c in enumerate(chunks, start=1):
        parts.append(
            f"[{i}] doc={c.document_id} page={c.page} section={c.section or 'n/a'}\n{c.text}"
        )
    ctx = "\n\n".join(parts)
    return f"{SYSTEM}\n\nPASSAGES:\n{ctx}\n\nQUESTION: {query}\nANSWER:"


_CITATION_RE = re.compile(r"\[(\d+)\]")


def extract_citations(answer: str, chunks: list[Chunk]) -> list[dict]:
    """Map [n] markers in the answer back to chunk metadata; out-of-range refs are dropped."""
    refs = {int(m.group(1)) for m in _CITATION_RE.finditer(answer)}
    out = []
    for r in sorted(refs):
        if 1 <= r <= len(chunks):
            c = chunks[r - 1]
            out.append(
                {
                    "ref": r,
                    "chunk_id": c.chunk_id,
                    "document_id": c.document_id,
                    "page": c.page,
                    "section": c.section,
                }
            )
    return out
```
Optional: run lightweight NLI (premise = cited passage, hypothesis = atomic claim) to drop unsupported sentences before returning to the client.
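The NLI check can be sketched as a filter over answer sentences. Here `entails(premise, hypothesis)` is a hypothetical callable standing in for an NLI model, not a real library function, and the substring stub exists only to make the example runnable. The sketch assumes each citation sits before the sentence's final punctuation, and it drops uncited sentences as well as unsupported ones.

```python
import re
from typing import Callable

# Hypothetical interface: returns True when the premise supports the hypothesis.
Entails = Callable[[str, str], bool]

_CITE = re.compile(r"\[(\d+)\]")
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def drop_unsupported(answer: str, passages: list[str], entails: Entails) -> str:
    """Keep only sentences that at least one of their cited passages entails."""
    kept = []
    for sentence in _SENTENCE_END.split(answer.strip()):
        refs = [int(n) for n in _CITE.findall(sentence)]
        claim = _CITE.sub("", sentence).strip()
        valid = [r for r in refs if 1 <= r <= len(passages)]
        if valid and any(entails(passages[r - 1], claim) for r in valid):
            kept.append(sentence)
    return " ".join(kept)


def substring_stub(premise: str, claim: str) -> bool:
    """Toy stand-in for an NLI model: case-insensitive substring match."""
    return claim.rstrip(" .").lower() in premise.lower()


passages = ["Gross margin increased to 41.2% in Q3.", "Packaging unit costs declined 6%."]
answer = "Gross margin increased to 41.2% in Q3 [1]. Revenue doubled [2]."
filtered = drop_unsupported(answer, passages, substring_stub)
print(filtered)
```

The second sentence cites a real passage but is not supported by it, so only the first sentence survives.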
4.6 Multi-Document Query Resolution
Users often ask questions that require evidence from multiple PDFs (compare Q2 vs. Q3; policy + exception memo).
```mermaid
flowchart TB
    Q[Multi-doc style query] --> Cls{Query<br/>classifier}
    Cls -->|single retrieval| R1[Standard hybrid<br/>retrieval]
    Cls -->|decomposable| Sub[LLM generates<br/>sub-queries]
    Sub --> R2[Parallel retrieve<br/>per sub-query]
    R2 --> Merge[Deduplicate +<br/>RRF merge]
    R1 --> Pool[Candidate pool]
    Merge --> Pool
    Pool --> XR[Cross-encoder<br/>global re-rank]
    XR --> Gen[Single LLM call<br/>with merged top-K]
```
Go sketch — parallel retrieval fan-out:
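```go
package retrieval // hypothetical package name

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// Chunk, Retriever, and mergeDedup are assumed to be defined elsewhere in the package.

type SubQuery struct {
	Text string
}

func RetrieveParallel(ctx context.Context, subs []SubQuery, ret Retriever) ([]Chunk, error) {
	g, ctx := errgroup.WithContext(ctx)
	chunks := make([][]Chunk, len(subs))
	for i, sq := range subs {
		i, sq := i, sq // capture loop variables (needed before Go 1.22)
		g.Go(func() error {
			c, err := ret.Hybrid(ctx, sq.Text, 20)
			chunks[i] = c // each goroutine writes its own slot, so no lock is needed
			return err
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	return mergeDedup(chunks), nil
}
```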
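The "Deduplicate + RRF merge" step from the flowchart can be sketched in Python as follows. Chunks are assumed to be dicts carrying a `chunk_id` key, and `merge_dedup` is an illustrative name for the merge helper the Go sketch leaves undefined.

```python
def merge_dedup(result_lists: list[list[dict]], k: int = 60) -> list[dict]:
    """Merge per-sub-query results: one entry per chunk_id, ordered by RRF score."""
    scores: dict[str, float] = {}
    first_seen: dict[str, dict] = {}
    for results in result_lists:
        for rank, chunk in enumerate(results, start=1):
            cid = chunk["chunk_id"]
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank)
            first_seen.setdefault(cid, chunk)  # keep the first copy of each chunk
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return [first_seen[cid] for cid in ordered]


q2_hits = [{"chunk_id": "a"}, {"chunk_id": "b"}]
q3_hits = [{"chunk_id": "b"}, {"chunk_id": "c"}]
merged = merge_dedup([q2_hits, q3_hits])
print([c["chunk_id"] for c in merged])
```

A chunk returned by more than one sub-query accumulates score from each list, so evidence relevant to both halves of a comparison rises to the top before the global re-rank.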
4.7 Table and Image Understanding
| Modality | Strategy | Embedding |
|---|---|---|
| Tables | HTML/Markdown serialization + row IDs; optional row-level chunks for wide tables | Same bi-encoder on text; or late interaction for large tables |
| Images / charts | VLM caption → text chunk; or CLIP (Contrastive Language-Image Pre-Training) image embeddings in a separate index | Dual index + score fusion at query time |
Warning
Caption-only approaches hallucinate chart values. For numeric Q&A, prefer extracted table cells or tool-assisted chart parsing (e.g., plot data extraction) when accuracy is critical.
4.8 Access Control, Incremental Index Updates, and Deletion Propagation
ACL patterns:
| Strategy | Mechanism | Trade-off |
|---|---|---|
| Metadata filter | Store allowed_principal_ids on chunk; filter in vector DB | Needs native filtered ANN; index churn on ACL change |
| Post-filter + over-fetch | Retrieve 5–10× K; drop unauthorized | Simple; watch recall when corpus is highly restricted |
| Partition by clearance | Separate indexes per level | Strong isolation; operational complexity |
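The post-filter + over-fetch row can be sketched as a small loop. Here `search(limit)` is a hypothetical callable returning ranked hits, each a dict with `chunk_id` and an `allowed` list of principals; the multiplier and round count are placeholder defaults, not recommendations.

```python
from typing import Callable

# Hypothetical interface: returns up to `limit` ranked hits.
Search = Callable[[int], list[dict]]


def acl_post_filter(
    search: Search,
    user_principals: set[str],
    k: int = 20,
    multiplier: int = 5,
    max_rounds: int = 3,
) -> list[dict]:
    """Over-fetch, drop unauthorized hits, and widen the fetch if too few survive."""
    limit = k * multiplier
    allowed: list[dict] = []
    for _ in range(max_rounds):
        hits = search(limit)
        # Fail closed: a hit with no matching principal is never returned.
        allowed = [h for h in hits if user_principals & set(h["allowed"])]
        if len(allowed) >= k or len(hits) < limit:
            break  # enough results, or the index has nothing more to return
        limit *= 2
    return allowed[:k]


# Toy corpus: every tenth chunk is readable by group:finance.
corpus = [
    {"chunk_id": f"c{i}", "allowed": ["group:finance"] if i % 10 == 0 else ["group:legal"]}
    for i in range(200)
]
visible = acl_post_filter(lambda limit: corpus[:limit], {"group:finance"}, k=5)
print([h["chunk_id"] for h in visible])
```

The widening loop is what protects recall when most of the corpus is restricted; the bounded round count keeps the worst-case latency predictable.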
```mermaid
flowchart LR
    E[Doc event:<br/>create/update/delete] --> Q[Ingest queue]
    Q --> W[Indexer worker]
    W --> P[Parse + chunk]
    P --> Del{delete old<br/>chunk IDs?}
    Del -->|update| Old[Bulk delete by doc_id<br/>in vector + BM25]
    Old --> Ins[Upsert new chunks]
    Del -->|create| Ins
    Ins --> V[(Vector)]
    Ins --> B[(BM25)]
    Ins --> M[(Metadata)]
    subgraph delprop["Deletion propagation"]
        D[Delete document] --> Tomb[tombstone record]
        Tomb --> Purge[Async purge job<br/>chunk_ids from all indexes]
    end
```
Incremental index update (Python):
```python
class IncrementalIndexer:
    def __init__(self, vector_index, bm25_index, chunk_store):
        self.vector_index = vector_index
        self.bm25_index = bm25_index
        self.chunk_store = chunk_store

    def reindex_document(self, doc_id: str, new_chunks: list[dict], embeddings: list[list[float]]):
        # Remove the previous version's chunks from both indexes before upserting.
        old_ids = self.chunk_store.list_chunk_ids(doc_id)
        if old_ids:
            self.vector_index.delete(ids=old_ids)
            self.bm25_index.delete(ids=old_ids)
        for ch, emb in zip(new_chunks, embeddings):
            self.vector_index.upsert(id=ch["chunk_id"], vector=emb, metadata=ch["metadata"])
            self.bm25_index.upsert(id=ch["chunk_id"], text=ch["text"], metadata=ch["metadata"])
        self.chunk_store.replace_document_chunks(doc_id, [c["chunk_id"] for c in new_chunks])

    def delete_document(self, doc_id: str):
        old_ids = self.chunk_store.list_chunk_ids(doc_id)
        self.vector_index.delete(ids=old_ids)
        self.bm25_index.delete(ids=old_ids)
        self.chunk_store.delete_document(doc_id)
```
Step 5: Scaling & Production
Failure Handling
| Failure | Mitigation |
|---|---|
| Embedding backlog | Autoscale GPU workers; shed load with 429 + retry-after |
| Vector partition hot | Shard by hash(doc_id) within collection |
| OCR timeouts | Dead-letter queue; partial publish with "low confidence" flag |
| LLM outage | Return ranked passages + snippets without synthesis |
| Stale ACL | Fail closed; prefer denying access over leaking |
Monitoring
| Signal | Why |
|---|---|
| Ingest lag (p95) | Freshness SLA |
| OCR escalation rate | Corpus quality / scanner issues |
| ANN recall@K (offline) | Regression on embedding or index changes |
| Cross-encoder score distribution | Detect drift / domain mismatch |
| Citation parse errors | Prompt or model formatting regressions |
| ACL filter drop ratio | Tunes over-fetch multiplier |
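The offline recall@K signal in the table reduces to a set intersection per query. The sketch below assumes a labeled eval set of (retrieved ids, relevant ids) pairs; the same function measures ANN quality if `relevant` is instead the exact nearest-neighbor set. Function names are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 20) -> float:
    """Fraction of relevant chunk ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int = 20) -> float:
    """Average recall@k over (retrieved, relevant) pairs from a labeled eval set."""
    if not runs:
        return 0.0
    return sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)


eval_runs = [
    (["c1", "c7", "c3"], {"c1", "c3"}),  # both relevant chunks retrieved
    (["c9", "c2", "c4"], {"c2", "c5"}),  # one of two retrieved
]
score = mean_recall_at_k(eval_runs, k=3)
print(score)
```

Tracking this number across embedding-model or index-parameter changes catches retrieval regressions before they show up as worse answers.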
Trade-offs
| Decision | A | B | Recommendation |
|---|---|---|---|
| Parsing | Speed (PyMuPDF) | Quality (Unstructured + OCR) | Tiered pipeline with escalation |
| Chunking | Fixed recursive | Semantic | Recursive + section hints; semantic for hard corpora |
| Index | One global HNSW | Partitioned by collection | Partitioned for latency + ACL |
| Multi-doc | Single retrieval | Sub-query decomposition | Classify query; decompose when comparative |
| Images | Skip | VLM captions | Captions + separate image index if product needs visuals |
Interview Tips
Tip
Strong answers explicitly cover: (1) PDF → text failure modes, (2) hybrid retrieval justification, (3) why cross-encoder after ANN, (4) multi-doc strategies, (5) incremental + delete semantics, (6) ACL enforcement point in the stack.
Common follow-ups:
- Why not embed entire PDFs instead of chunks? (context limits, retrieval precision, cost)
- How do you evaluate retrieval vs. generation quality separately?
- What happens when two chunks contradict each other?
- How would you support 100M chunks? (sharding, disk ANN, quantization, two-stage retrieval)
- When is a VLM worth the cost vs. OCR + tables only?
Hypothetical Interview Transcript
Note
Simulated 45-minute Google-style system design conversation (abbreviated for readability; pacing: requirements → HLD → deep dives → trade-offs).
Interviewer: Design a Q&A system over more than ten thousand PDFs. Users should get answers with citations to the document and page.
Candidate: I will clarify a few things. Are these mostly text-based PDFs or scanned? Do we need per-document access control? What latency and compliance constraints should I assume?
Interviewer: Mix of digital and scanned. Yes, per-document ACLs. Target under four seconds end-to-end. Data must stay in our cloud account.
Candidate: Got it. I would split the system into ingestion and query. Ingestion: store raw PDFs in object storage, enqueue work, run a parser service that tries a fast text extraction path and escalates to OCR and layout-aware parsing when needed. We extract tables as structured text chunks and optionally run a VLM for figures if the product needs chart Q&A. A chunking engine produces around 512-token segments with 50-token overlap, respecting section boundaries where we detect headings. Each chunk carries metadata: document_id, page, section, and ACL principals.
We batch-encode chunks with a bi-encoder like BGE or E5, and store vectors in a partitioned HNSW index — one partition per collection to keep graphs small and queries fast. We also maintain BM25 for hybrid retrieval.
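The 512-token window with 50-token overlap that the candidate describes can be sketched as a sliding window. This is a minimal illustration, not the pipeline's actual chunker: `chunk_tokens` is a hypothetical name, whitespace-split words stand in for model tokens, and section-boundary detection is left out.

```python
from typing import Iterator


def chunk_tokens(
    tokens: list[str], size: int = 512, overlap: int = 50
) -> Iterator[tuple[int, list[str]]]:
    """Yield (start_offset, window) pairs; consecutive windows share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    start = 0
    while start < len(tokens):
        yield start, tokens[start:start + size]
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
        start += step


if __name__ == "__main__":
    words = [f"w{i}" for i in range(1000)]
    for offset, window in chunk_tokens(words):
        print(offset, len(window))
```

In a real pipeline the start offset would be stored alongside document_id, page, section, and ACL principals so that citations can point back to the source region.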
Interviewer: Walk me through the query path.
Candidate: Embed the query with the same bi-encoder. Run ANN top-20 inside the right partition and BM25 top-20, then fuse the two lists with reciprocal rank fusion. Apply ACL filtering: I would post-filter with over-fetch unless the vector database can apply metadata filters efficiently during the HNSW search itself. Then re-rank the fused candidates that survive the ACL filter with a cross-encoder down to top-5. Build a prompt listing those five passages with numeric labels, and ask the LLM to answer with [1]…[5] citations. A citation extractor maps those labels back to pages and document IDs; optionally, an NLI model checks critical claims against the cited passages.
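Two steps of this query path are easy to get subtly wrong: the ACL post-filter and the mapping from numeric labels back to sources. The sketch below shows one possible shape for both. The `Passage` type and the function names are hypothetical, and a passage with an empty ACL is treated as unreadable (fail closed), which is an assumption rather than something the design specifies.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    chunk_id: str
    document_id: str
    page: int
    acl: frozenset[str]  # principals allowed to read the parent document


def acl_post_filter(
    candidates: list[Passage], principals: set[str], limit: int
) -> list[Passage]:
    """Drop passages the caller cannot read, keeping rank order, then truncate."""
    allowed = [p for p in candidates if p.acl & principals]
    return allowed[:limit]


CITATION = re.compile(r"\[(\d+)\]")


def extract_citations(answer: str, passages: list[Passage]) -> list[tuple[str, int]]:
    """Map [n] labels in the answer to (document_id, page), skipping invalid labels."""
    refs: list[tuple[str, int]] = []
    for match in CITATION.finditer(answer):
        index = int(match.group(1)) - 1  # labels in the prompt are 1-based
        if 0 <= index < len(passages):
            ref = (passages[index].document_id, passages[index].page)
            if ref not in refs:
                refs.append(ref)
    return refs
```

Over-fetching means asking the retriever for more candidates than the final limit, so that enough readable passages remain after filtering. Labels that point outside the prompt's passage list are dropped here; a production system might instead flag them as a groundedness failure.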
Interviewer: Why hybrid retrieval for PDFs specifically?
Candidate: Dense retrieval handles paraphrases and concepts, which is useful when users do not remember exact wording. But PDFs often contain SKU codes, legal citations, model numbers, and acronyms where exact token match still wins. BM25 also helps when embeddings under-represent rare strings. Hybrid typically improves recall compared to either method alone.
Interviewer: How do you handle a question that combines two documents?
Candidate: I would add a lightweight query classifier. If the question is comparative or explicitly references multiple time periods or products, an LLM or rules produce sub-queries. Each sub-query retrieves in parallel; we deduplicate chunks, merge with RRF, then cross-encoder re-rank globally once so the LLM sees the single best five-passage context. If we still see low scores, we widen K or ask a clarifying question.
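The parallel sub-query retrieval and RRF merge can be sketched as follows. This is an illustration under stated assumptions: `retrieve` is a hypothetical callable that returns ranked chunk IDs for one sub-query, `k=60` is the constant commonly used for reciprocal rank fusion rather than a value the design fixes, and ties are broken by chunk ID only to keep the output deterministic.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per chunk."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Summing per chunk ID deduplicates chunks returned by several sub-queries.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def multi_query_retrieve(
    sub_queries: list[str], retrieve: Callable[[str], list[str]], k: int = 60
) -> list[str]:
    """Run each sub-query concurrently, then merge the ranked lists with RRF."""
    with ThreadPoolExecutor(max_workers=max(1, len(sub_queries))) as pool:
        rankings = list(pool.map(retrieve, sub_queries))
    return rrf_fuse(rankings, k)
```

The same `rrf_fuse` function applies to fusing the dense and BM25 lists of a single query, since RRF only needs ranks and not comparable scores. The fused list would then go to the cross-encoder for the single global re-rank.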
Interviewer: Incremental updates?
Candidate: Every document gets a stable document_id, and each revision gets a monotonically increasing version. On update, the worker lists existing chunk_ids for that doc, deletes them from the vector and BM25 stores, then upserts new chunks and embeddings, either in one logical transaction or with idempotent retries. A delete tombstones the document, and an async job purges the related chunk IDs from every index (vector, sparse, metadata) to avoid orphans.
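The delete-then-upsert flow with idempotent retries can be illustrated with an in-memory stand-in for the three indexes. Everything here is hypothetical: the `InMemoryIndex` class, its method names, and the chunk ID scheme are for illustration only, and the sketch does not model tombstones or the async purge job.

```python
class InMemoryIndex:
    """Toy stand-in for the vector, sparse, and metadata stores."""

    def __init__(self) -> None:
        self.vectors: dict[str, list[float]] = {}   # chunk_id -> embedding
        self.sparse: dict[str, str] = {}            # chunk_id -> text for BM25
        self.docs: dict[str, tuple[int, list[str]]] = {}  # doc -> (version, chunk_ids)

    def _purge(self, document_id: str) -> None:
        """Remove every chunk of the document from both chunk-level stores."""
        _, old_ids = self.docs.get(document_id, (0, []))
        for chunk_id in old_ids:
            self.vectors.pop(chunk_id, None)
            self.sparse.pop(chunk_id, None)

    def reindex(
        self,
        document_id: str,
        version: int,
        chunks: dict[str, tuple[list[float], str]],
    ) -> bool:
        """Replace a document's chunks; stale or repeated versions are no-ops."""
        current = self.docs.get(document_id)
        if current is not None and current[0] >= version:
            return False  # retry of an applied update, or an out-of-order message
        self._purge(document_id)
        for chunk_id, (embedding, text) in chunks.items():
            self.vectors[chunk_id] = embedding
            self.sparse[chunk_id] = text
        self.docs[document_id] = (version, list(chunks))
        return True

    def delete(self, document_id: str) -> None:
        """Remove the document and all of its chunks."""
        self._purge(document_id)
        self.docs.pop(document_id, None)
```

The version check is what makes retries safe: replaying the same update message leaves the index unchanged instead of duplicating or resurrecting chunks.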
Interviewer: Tables and images?
Candidate: Tables: extract as Markdown/HTML per table, chunk separately from prose, maybe row-sliced chunks for very wide tables. Images: default path is VLM captions stored as text chunks linked to figure IDs; for numeric chart Q&A I would push for structured extraction because captions alone can be unreliable.
Interviewer: How would you test quality?
Candidate: Offline golden sets with labeled relevant passages for recall@K and MRR. Online: sample answers for groundedness checks, track citation validity, and monitor abstention rate when scores are low. Separate dashboards for ingest errors and OCR escalation to catch corpus issues early.
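The two offline retrieval metrics named here are short enough to show in full. A minimal sketch, assuming each golden-set entry pairs a ranked list of retrieved chunk IDs with the set of labeled relevant IDs; the function names are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1 / rank of the first relevant hit; a query with no hit scores 0."""
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, chunk_id in enumerate(retrieved, start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)
```

Because these metrics look only at retrieved chunk IDs, they isolate retrieval quality from generation quality, which is measured separately through the groundedness and citation-validity checks.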
Interviewer: Sounds good. That wraps this section.
Summary¶
This design delivers citation-grounded Q&A over 10,000+ PDFs by combining: (1) a tiered parsing pipeline (PyMuPDF, Unstructured, OCR, optional VLM) for text, tables, and images; (2) recursive, section-aware chunking with 512 / 50 token settings; (3) bi-encoder embeddings stored in partitioned HNSW vector indexes with rich metadata; (4) hybrid BM25 + dense retrieval, ANN top-20, cross-encoder re-rank to top-5; (5) LLM generation with citation extraction; (6) multi-document query routing via decomposition when needed; (7) incremental reindexing and deletion propagation; and (8) a document management API with an access control layer enforced on the query path. Master the latency budget, index partitioning, and failure modes of PDFs to stand out in a system design interview.