Design a Machine Translation System (like Google Translate)¶

What We're Building¶

We are designing a large-scale neural machine translation (NMT) platform comparable in spirit to Google Translate: support for 100+ languages, billions of translations per day, and multiple modalities—plain text, image/OCR (translate text in photos), speech (ASR → translate → TTS), and real-time conversation (streaming, low-latency turn-taking).

Tip

In interviews, anchor numbers to public ballparks (orders of magnitude), then show how you derive capacity from QPS, sequence length, and model FLOPs—interviewers care about structured reasoning more than exact figures.

Reference scale (public-order benchmarks)¶

Dimension	Representative public-scale anchor
Daily volume	On the order of 100B+ words/day translated globally (industry-scale aggregate)
Language coverage	100+ languages in major consumer products; 133 languages cited for Google Translate coverage
Modalities	Web/mobile text, documents, camera/OCR, voice, conversation
Quality bar	Strong pairs often report high BLEU/chrF in research; production uses human eval + side-by-side

Why this problem is hard at scale¶

Challenge	Why it matters
Long tail of language pairs	$O(L^2)$ pairs if naive; multilingual models amortize
Morphology & script	Tokenization and normalization dominate quality for Finnish, Turkish, Arabic, Hindi
Latency vs quality	Beam search improves BLEU but hurts P99 latency
Safety & abuse	Translation can be used to evade moderation; rate limits and policy layers matter
Low-resource	Parallel data scarce; pivoting and transfer are mandatory

ML Concepts Primer¶

This section is the conceptual backbone for the rest of the design. Production MT is less “one giant LSTM” and more data + tokenization + multilingual training + serving math.

Encoder–decoder and sequence-to-sequence (seq2seq)¶

Seq2seq maps a source sequence $x_{1:T}$ to a target sequence $y_{1:S}$. Classic RNN seq2seq used two recurrent networks: an encoder summarizes the source into a context vector; a decoder generates the target token by token.

Bottleneck problem: A single fixed-size vector cannot faithfully represent long or ambiguous sentences.

Attention mechanism¶

Attention lets the decoder look at all encoder hidden states when predicting each target token. For timestep $i$, attention weights $\alpha_{ij}$ say how much to focus on source position $j$.

Conceptually:

\[ \text{context}_i = \sum_j \alpha_{ij} \cdot h_j \]

where $h_j$ are encoder states and $\alpha_{ij}$ are normalized scores (often softmax of a compatibility function).

Note

In interviews, say clearly: dot-product / additive attention were precursors; Transformers generalize this into multi-head self-attention and cross-attention.

Transformer architecture for translation¶

A Transformer block typically contains:

Component	Role in MT
Self-attention (encoder)	Each source token attends to all source tokens; builds contextual source representations
Self-attention (decoder, masked)	Each target token attends to previous target tokens only (causal mask)
Cross-attention (decoder)	Target-side queries attend to encoder outputs (source)
Feed-forward + residuals + LayerNorm	Nonlinear transforms and training stability
Positional encoding	Attention is permutation-invariant; positions are injected via sinusoidal or learned embeddings

Positional encoding (sinusoidal sketch):

\[ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right),\quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right) \]

Modern systems often use learned positional embeddings or relative position (e.g., T5-style) for simplicity and performance.

Subword tokenization: BPE, SentencePiece, and why word-level fails¶

Method	Idea	Typical use
BPE (Byte Pair Encoding)	Merge frequent symbol pairs iteratively to build a vocabulary	Open-source NMT pipelines
SentencePiece	Unsupervised subword training; can encode raw sentences without pre-tokenization	Multilingual models (mBERT, mT5-style)
Unigram LM	Probabilistic subword segmentation	Alternative to BPE in some toolchains

Why word-level tokenization breaks:

OOV explosion for rare words, proper nouns, typos
Morphologically rich languages (Finnish, Turkish, German compounding): one “word” can encode what English expresses as a phrase
Agglutinative languages: long tokens with many morphemes; character-only models are expressive but inefficient without subwords

Warning

For production, you almost always want shared multilingual subword vocab + script-appropriate normalization (Unicode NFC/NFKC, digit mapping policies), not “split on spaces.”

Multilingual models: shared vocabulary, language tokens, zero-shot¶

Multilingual NMT trains one model on many language pairs simultaneously.

Technique	What it does
Shared subword vocabulary	One tokenizer across languages; improves parameter efficiency
Special tokens	`<2en>`, `<2de>` target-language control tokens (common in mBART/mT5-style setups)
Language embedding / adapter	Per-language conditioning without full separate models
Zero-shot / zero-resource transfer	If the model sees `A→English` and `English→B`, it may generalize to `A→B` without direct parallel data—quality varies

Trade-off: multilingual models reduce footprint and enable transfer, but can suffer interference between unrelated languages unless balanced sampling and capacity are tuned.

Quality estimation (QE)¶

QE predicts how good a translation is without references (or with limited references).

Signal	Use
Model confidence	Token-level softmax entropy, sequence log-probability (noisy but cheap)
Dedicated QE model	Regressor/classifier on source + hypothesis features; trained on human judgments
Reference-based metrics (offline)	BLEU, chrF++, COMET (neural metric) for eval—not always available online

Production pattern: show “Translation quality may be low” when QE or confidence crosses thresholds—especially for low-resource pairs or OCR/speech.

Beam search vs sampling¶

Decoding	Pros	Cons
Beam search (width $k$)	Higher BLEU, more stable	Slower, less diverse, can favor “safe” bland outputs
Sampling (top-k / top-p)	Diversity, more natural in some generative settings	Risk of hallucination in MT if mis-tuned
Greedy	Fastest	Often suboptimal for NMT

For interactive short text, you might use small beams + length normalization. For batch/documents, slightly wider beams or segment-level reranking may be acceptable if latency budget allows.

Low-resource languages: data augmentation, back-translation, pivoting¶

Technique	Description
Back-translation	Train a reverse model (English→X) on monolingual English; synthesize pseudo-parallel X→English
Self-training / iterative BT	Repeat pseudo-labeling with quality filtering
Pivot translation	Translate X→English→Y when X→Y parallel data is tiny
Data augmentation	Noising source, code-switch regularization (careful), transliteration for script gaps
Few-shot LLM prompting	Useful for gisting or rare pairs; needs guardrails for determinism and safety

flowchart LR subgraph hire["High-resource pairs"] A[Large parallel corpora] end subgraph lowr["Low-resource pair X→Y"] B[Small parallel seed] C[Monolingual data in X and Y] end A --> D[Multilingual backbone model] B --> D C --> E["Back-translation & denoising"] E --> D D --> F["Pivot: X→EN→Y as fallback"]

Step 1: Requirements¶

Functional requirements¶

ID	Requirement	Notes
F1	Text-to-text translation	Core API: source text + source/target lang → translated text
F2	Language detection	Auto-detect source language when unspecified
F3	Broad language coverage	100+ languages; graceful degradation for tail pairs
F4	Interactive + batch	Online API for product; batch jobs for large files
F5	Document translation	Preserve structure (HTML/DOCX), segment long inputs
F6	Alternatives	Optional n-best or paraphrase alternatives for UI
F7	Multimodal paths	OCR pipeline for images; ASR/TTS wrappers for speech; streaming for conversation

Non-functional requirements (targets for discussion)¶

NFR	Target	Rationale
Latency (short text)	P99 < 200 ms server-side (excluding client networking) for small requests	Snappy UI; conversation requires tighter budgets per chunk
Throughput	100K+ translations/sec aggregate (horizontally scaled)	Global traffic; isolate hot pairs
Quality	BLEU > 40 for select major pairs (offline eval)	Illustrative bar—pair-dependent; use human eval in production
Availability	99.9%+ tier-1 regions	Fallback routes (pivot, smaller model)
Cost	Minimize $/1M characters	Quantization, batching, routing
Safety & compliance	Abuse detection hooks, regional data policies	Especially for user-generated content

Note

Always separate per-request latency (interactive) from document latency (minutes acceptable). Interviewers reward this distinction.

Step 2: Estimation¶

Back-of-the-envelope reasoning ties product requirements to GPU/TPU capacity, memory, and network.

Request volume¶

Assume aggregate 100B words/day as an industry-scale anchor (order-of-magnitude storytelling).

Words/day $\approx 10^{11}$
Seconds/day $\approx 8.64 \times 10^4$
Average words/sec $\approx 10^{11} / 8.6 \times 10^4 \approx 1.16 \times 10^6$ words/s (if spread uniformly)

Not all traffic hits the heaviest model tier (routing, caching, tiny queries). For capacity planning, split:

Tier	Fraction of traffic (example)	Implication
Edge cached / repeated	10–40%	CDN-like caching of popular strings
Small interactive	40–60%	Latency-sensitive
Batch / documents	10–30%	Throughput-sensitive

Model serving compute (sketch)¶

Let:

$T$ = average target tokens per request after subword segmentation
$C_{\text{forward}}$ = forward pass cost per token for decoder (dominant at inference)
Batch size $B$ improves throughput via tensor cores

Rule of thumb: decoder steps $\approx T$ autoregressive steps; total compute scales roughly linearly with $T$ for inference (without speculative decoding). KV-cache avoids recomputing prefix keys/values for incremental decoding in conversation/documents.

Example: if an average interactive job is 30 target tokens, global 1M req/s truly hitting the GPU tier (after cache misses) implies 30M decoding steps/s—this drives GPU fleet sizing and model sharding.

Vocabulary size¶

Typical multilingual SentencePiece vocabularies range 32k–128k pieces for large models. Storage per embedding table:

FP16: 2 bytes × vocab × hidden_dim
For vocab 64k, hidden 4096: $64{,}000 \times 4{,}096 \times 2 \approx 512$ MB per embedding table (plus optimizer states only in training)

One model vs many bilingual models¶

Approach	Footprint	Pros	Cons
Per-pair bilingual models	$O(L^2)$ explosion	Tailored quality for top pairs	Operational nightmare at 100+ languages
Single massive multilingual Transformer	One stack + routing	Parameter efficient; transfer learning	Interference; harder debugging
Hybrid (common)	Multilingual backbone + adapters / LoRA / small experts	Balance	More complex serving

At “Google-scale,” multilingual models + routing + specialization (adapters, distilled student models for hot paths) are standard talking points.

Step 3: High-Level Design¶

End-to-end flow¶

flowchart TB subgraph inp["Inputs"] T[Text] I["Image / OCR"] S[Speech ASR stream] end LD[Language Detection] PRE[Preprocessing<br/>normalize + tokenize] ROUTE["Model Router<br/>pair / modality / SLA"] MT[Translation Model<br/>GPU workers] POST[Post-processing<br/>detokenize + casing + placeholders] QE[Quality Estimation] OUT["Response + warnings + alternatives"] T --> LD --> PRE --> ROUTE --> MT --> POST --> QE --> OUT I --> OCR["OCR + script detection"] --> LD S --> ASR[ASR] --> LD BATCH["Batch / Document Pipeline"] --> SEG["Segmentation + doc context cache"] --> ROUTE

Batch and document path¶

flowchart LR DOC[User uploads 40-page PDF] --> EXTRACT[Extract text + layout] EXTRACT --> CHUNK[Chunk / sentence segmentation] CHUNK --> CACHE_CTX[Cross-sentence context window] CACHE_CTX --> MTB[MT workers with KV-cache] MTB --> STITCH[Stitch + format preservation] STITCH --> QE2[QE + glossary enforcement]

Router chooses:

Model variant (tiny vs large; speech vs text)
Pivot path (X→EN→Y) if direct pair is weak or model missing
Quantization policy (INT8 for speed vs FP16 for quality)

Step 4: Deep Dive¶

4.1 Language Detection¶

Goals: Extremely fast, robust across scripts, good on short strings.

Techniques:

Method	When
fastText supervised classifier	Strong baseline; tiny model; CPU friendly
Script / Unicode ranges	Disambiguate CJK; Arabic vs Latin, etc.
Heuristics for ambiguity	`pt` vs `es` on short strings: n-gram profiles

Below is illustrative Python: a tiny character n-gram + multinomial NB sketch (stand-in for fastText API), plus script detection.

from __future__ import annotations

import re
import unicodedata
from collections import Counter
from dataclasses import dataclass


def normalize_text(s: str) -> str:
    # NFC helps compare composed vs decomposed Unicode
    return unicodedata.normalize("NFC", s.strip())


def script_histogram(text: str) -> dict[str, float]:
    """Coarse script buckets for routing / ambiguity handling."""
    counts: Counter[str] = Counter()
    for ch in text:
        o = ord(ch)
        if "\u4e00" <= ch <= "\u9fff":
            counts["CJK"] += 1
        elif "\u0600" <= ch <= "\u06ff" or "\u0750" <= ch <= "\u077f":
            counts["Arabic"] += 1
        elif "\u0590" <= ch <= "\u05ff":
            counts["Hebrew"] += 1
        elif "\u0900" <= ch <= "\u097f":
            counts["Devanagari"] += 1
        elif "a" <= ch.lower() <= "z":
            counts["Latin"] += 1
        elif ch.isdigit():
            counts["Digit"] += 1
        elif not ch.isspace():
            counts["Other"] += 1
    total = sum(counts.values()) or 1
    return {k: v / total for k, v in counts.items()}


def char_ngrams(s: str, n: int = 3) -> list[str]:
    s = re.sub(r"\s+", " ", s.lower())
    if len(s) < n:
        return [s] if s else []
    return [s[i : i + n] for i in range(len(s) - n + 1)]


@dataclass
class LangDetector:
    """
    Sketch: train with labeled (text, lang) pairs.
    Production would call fastText.train_supervised / load_model.
    """

    vocab: list[str]
    class_log_prior: dict[str, float]
    word_log_prob: dict[str, dict[str, float]]  # lang -> ngram -> log p

    @classmethod
    def train(cls, examples: list[tuple[str, str]], min_df: int = 2) -> "LangDetector":
        df = Counter()
        lang_counts = Counter()
        ngram_lang_counts: dict[str, Counter[str]] = {}

        for text, lang in examples:
            lang_counts[lang] += 1
            grams = char_ngrams(normalize_text(text), n=3)
            seen = set(grams)
            for g in seen:
                df[g] += 1
            for g in grams:
                ngram_lang_counts.setdefault(lang, Counter())[g] += 1

        vocab = [g for g, c in df.items() if c >= min_df]
        vocab_set = set(vocab)
        V = len(vocab_set) or 1

        word_log_prob: dict[str, dict[str, float]] = {}
        alpha = 0.1  # Laplace smoothing
        for lang, ctr in ngram_lang_counts.items():
            denom = sum(ctr[g] for g in vocab_set) + alpha * V
            word_log_prob[lang] = {}
            for g in vocab_set:
                num = ctr.get(g, 0) + alpha
                word_log_prob[lang][g] = float(__import__("math").log(num / denom))

        total_docs = sum(lang_counts.values()) or 1
        class_log_prior = {l: float(__import__("math").log(c / total_docs)) for l, c in lang_counts.items()}
        return cls(vocab=vocab, class_log_prior=class_log_prior, word_log_prob=word_log_prob)

    def predict(self, text: str) -> tuple[str, float]:
        import math

        grams = char_ngrams(normalize_text(text), n=3)
        if not grams:
            best = max(self.class_log_prior.items(), key=lambda x: x[1])
            return best[0], 1.0

        scores: dict[str, float] = {}
        for lang, lp in self.class_log_prior.items():
            s = lp
            wlp = self.word_log_prob.get(lang, {})
            for g in grams:
                s += wlp.get(g, -10.0)  # backoff constant for OOV in sketch
            scores[lang] = s
        best_lang = max(scores.items(), key=lambda x: x[1])
        # Convert log-score to a pseudo-confidence via softmax max margin (sketch)
        m = max(scores.values())
        exp_sum = sum(math.exp(v - m) for v in scores.values())
        top = math.exp(best_lang[1] - m) / exp_sum
        return best_lang[0], top


def disambiguate_short_text(text: str, top_langs: list[str]) -> list[str]:
    """
    If confidence is low or text is very short, return candidate set for router:
    try multiple languages or ask user.
    """
    if len(normalize_text(text)) < 8:
        return top_langs[:3]
    return top_langs[:1]

Ambiguity playbook:

Return top-k languages with scores to the router
If scores within $\epsilon$, ask user or run tiny translate-to-English compare (expensive; last resort)

4.2 Model Architecture¶

Core idea: Transformer encoder-decoder or encoder-decoder with shared multilingual stack. Below: multi-head attention, cross-attention, and a language adapter hook.

from __future__ import annotations

import math
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F


def sinusoidal_pos_emb(seq_len: int, dim: int, device: torch.device) -> torch.Tensor:
    position = torch.arange(seq_len, device=device).float().unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, device=device).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(seq_len, dim, device=device)
    pe[:, 0::2] = torch.sin(position * div)
    pe[:, 1::2] = torch.cos(position * div)
    return pe


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        x: torch.Tensor,
        attn_mask: torch.Tensor | None = None,
        kv: torch.Tensor | None = None,
    ) -> torch.Tensor:
        """
        x: (B, T, D)
        If kv is provided, computes cross-attention where queries come from x and keys/values from kv.
        """
        if kv is None:
            kv = x
        B, T, D = x.shape
        _, S, _ = kv.shape

        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, S, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, S, self.n_heads, self.d_head).transpose(1, 2)

        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_head)
        if attn_mask is not None:
            scores = scores.masked_fill(attn_mask == 0, float("-inf"))
        attn = torch.softmax(scores, dim=-1)
        attn = self.dropout(attn)
        out = torch.matmul(attn, v)
        out = out.transpose(1, 2).contiguous().view(B, T, D)
        return self.proj(out)


class LanguageAdapter(nn.Module):
    """Lightweight per-language shift in embedding space (adapter-style)."""

    def __init__(self, num_langs: int, d_model: int):
        super().__init__()
        self.lang_emb = nn.Embedding(num_langs, d_model)

    def forward(self, x: torch.Tensor, lang_ids: torch.Tensor) -> torch.Tensor:
        return x + self.lang_emb(lang_ids).unsqueeze(1)


class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads, dropout)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, src_mask: torch.Tensor | None) -> torch.Tensor:
        x2 = self.self_attn(self.ln1(x), attn_mask=src_mask)
        x = x + self.dropout(x2)
        x2 = self.ff(self.ln2(x))
        x = x + self.dropout(x2)
        return x


class TransformerDecoderLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads, dropout)
        self.cross_attn = MultiHeadAttention(d_model, n_heads, dropout)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.ln3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        x: torch.Tensor,
        memory: torch.Tensor,
        tgt_mask: torch.Tensor,
        cross_mask: torch.Tensor | None = None,
    ) -> torch.Tensor:
        x2 = self.self_attn(self.ln1(x), attn_mask=tgt_mask)
        x = x + self.dropout(x2)
        x2 = self.cross_attn(self.ln2(x), attn_mask=cross_mask, kv=memory)
        x = x + self.dropout(x2)
        x2 = self.ff(self.ln3(x))
        x = x + self.dropout(x2)
        return x


@dataclass
class TinyNMTConfig:
    vocab_size: int = 32000
    d_model: int = 512
    n_heads: int = 8
    d_ff: int = 2048
    n_enc_layers: int = 6
    n_dec_layers: int = 6
    max_len: int = 1024
    num_langs: int = 128


class TinyMultilingualNMT(nn.Module):
    """
    Illustrative stack: embeddings + encoder + decoder + shared LM head.
    Real systems add MoE layers, relative positions, etc.
    """

    def __init__(self, cfg: TinyNMTConfig):
        super().__init__()
        self.cfg = cfg
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.d_model)
        self.encoder_layers = nn.ModuleList(
            [TransformerEncoderLayer(cfg.d_model, cfg.n_heads, cfg.d_ff) for _ in range(cfg.n_enc_layers)]
        )
        self.decoder_layers = nn.ModuleList(
            [TransformerDecoderLayer(cfg.d_model, cfg.n_heads, cfg.d_ff) for _ in range(cfg.n_dec_layers)]
        )
        self.adapter = LanguageAdapter(cfg.num_langs, cfg.d_model)
        self.lm_head = nn.Linear(cfg.d_model, cfg.vocab_size, bias=False)

    def encode(self, src_ids: torch.Tensor, src_lang: torch.Tensor, src_pad_mask: torch.Tensor) -> torch.Tensor:
        B, T = src_ids.shape
        device = src_ids.device
        x = self.tok_emb(src_ids) + sinusoidal_pos_emb(T, self.cfg.d_model, device).unsqueeze(0)
        x = self.adapter(x, src_lang)
        attn_mask = src_pad_mask.unsqueeze(1).unsqueeze(2)  # (B,1,1,T)
        for layer in self.encoder_layers:
            x = layer(x, attn_mask)
        return x

    def decode_step(
        self,
        tgt_ids: torch.Tensor,
        memory: torch.Tensor,
        tgt_lang: torch.Tensor,
        causal_mask: torch.Tensor,
        src_pad_mask: torch.Tensor,
    ) -> torch.Tensor:
        B, U = tgt_ids.shape
        device = tgt_ids.device
        x = self.tok_emb(tgt_ids) + sinusoidal_pos_emb(U, self.cfg.d_model, device).unsqueeze(0)
        x = self.adapter(x, tgt_lang)
        cross_mask = src_pad_mask.unsqueeze(1).unsqueeze(2)  # attend to non-pad src
        for layer in self.decoder_layers:
            x = layer(x, memory, causal_mask, cross_mask=cross_mask)
        return self.lm_head(x)

Mixture-of-Experts (MoE) talking points:

Sparsely activated experts increase capacity without proportional FLOPs per token
Routing can be language-aware (route to region-specific experts)
Challenges: load balancing, all-to-all communication, training instability

4.3 Training Pipeline¶

Stages:

Corpus collection: OPUS, Paracrawl, CommonCrawl mining, licensed data, user feedback (with consent)
Cleaning: length ratio filters, language ID filtering, dedup (MinHash), toxicity filters
Augmentation: back-translation, word dropout, span masking (denoising objectives)
Curriculum: train on cleaner/high-resource first; gradually add noisy tail
Distillation: large teacher → smaller student for serving latency

from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class ParallelExample:
    src: str
    tgt: str
    src_lang: str
    tgt_lang: str


def length_ratio_filter(ex: ParallelExample, min_ratio: float = 0.5, max_ratio: float = 2.0) -> bool:
    a = max(1, len(ex.src.split()))
    b = max(1, len(ex.tgt.split()))
    r = a / b
    return min_ratio <= r <= max_ratio


def noisy_copy(ex: ParallelExample, drop_prob: float = 0.1) -> ParallelExample:
    """Simple augmentation: random word dropout on source."""
    toks = ex.src.split()
    kept = [t for t in toks if random.random() > drop_prob]
    src2 = " ".join(kept) if kept else ex.src
    return ParallelExample(src=src2, tgt=ex.tgt, src_lang=ex.src_lang, tgt_lang=ex.tgt_lang)


def backtranslate_stub(mono_en: str, reverse_model) -> ParallelExample:
    """
    reverse_model: trained EN->X monolingual translator
    Returns synthetic X->EN parallel example for training X->EN forward model.
    """
    synthetic_src = reverse_model.translate(mono_en, src_lang="en", tgt_lang="xx")
    return ParallelExample(src=synthetic_src, tgt=mono_en, src_lang="xx", tgt_lang="en")


@dataclass
class CurriculumScheduler:
    """Increase weight of noisier datasets as training progresses."""

    step: int = 0

    def mixing_weights(self) -> dict[str, float]:
        t = min(1.0, self.step / 100_000)
        return {"clean": 1.0 - 0.5 * t, "noisy": 0.5 * t}


def kl_distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """
    student/teacher logits: (B, V) for next token
    """
    import torch.nn.functional as F
    import torch

    p = F.log_softmax(student_logits / temperature, dim=-1)
    q = F.softmax(teacher_logits / temperature, dim=-1)
    return torch.sum(q * (torch.log(q + 1e-9) - p), dim=-1).mean()

4.4 Serving Architecture¶

Optimizations:

Technique	Benefit
INT8/FP8 quantization	Higher QPS, lower memory
Dynamic batching	GPU utilization
Routing by source language	Co-locate hot pairs; reduce interference
KV-cache	Reuse prefix states in documents and conversation
Speculative decoding	Draft model proposes tokens; target verifies (latency win)

from __future__ import annotations

from collections import defaultdict, deque
from dataclasses import dataclass
import time

import torch


@dataclass
class TranslationRequest:
    req_id: str
    src_lang: str
    tgt_lang: str
    tokens: list[int]
    max_new_tokens: int


class LanguageBatchScheduler:
    """Group pending requests by (src_lang) bucket to improve locality (illustrative)."""

    def __init__(self, max_wait_ms: float = 5.0, max_batch: int = 16):
        self.max_wait_ms = max_wait_ms
        self.max_batch = max_batch
        self._queues: dict[str, deque[TranslationRequest]] = defaultdict(deque)

    def push(self, req: TranslationRequest) -> None:
        self._queues[req.src_lang].append(req)

    def pop_batch(self) -> list[TranslationRequest]:
        # Pick the longest queue; wait until deadline in real impl
        if not self._queues:
            return []
        src = max(self._queues.keys(), key=lambda k: len(self._queues[k]))
        q = self._queues[src]
        batch: list[TranslationRequest] = []
        while q and len(batch) < self.max_batch:
            batch.append(q.popleft())
        return batch


class KVCache:
    """Stores per-layer K/V tensors for a growing prefix (shape details omitted)."""

    def __init__(self, layers: int):
        self.layers = layers
        self.cache: dict[int, list[torch.Tensor]] = {}

    def update(self, layer: int, step: int, k: torch.Tensor, v: torch.Tensor) -> None:
        self.cache.setdefault(layer, []).append((k, v))


def speculative_decode_stub(draft, target, prefix_ids, max_new: int, gamma: int = 4):
    """
    draft/target are callables returning next-token logits for a given prefix.
    This is a structural sketch only.
    """
    ids = list(prefix_ids)
    for _ in range(max_new):
        # Draft proposes gamma tokens
        proposal = ids[:]
        for _ in range(gamma):
            logits = draft(proposal)
            next_tok = int(logits.argmax(-1)[-1])
            proposal.append(next_tok)
        # Target verifies proposal in parallel (real impl compares distributions)
        logits = target(proposal)
        # Accept/reject loop would go here
        ids.append(int(logits.argmax(-1)[len(ids) - 1]))
    return ids

4.5 Quality Estimation and Fallback¶

Combine cheap signals + learned QE + product policy.

from __future__ import annotations

from dataclasses import dataclass


@dataclass
class QESignals:
    mean_token_prob: float
    seq_logprob: float
    length_ratio: float


def heuristic_confidence(mean_token_prob: float) -> float:
    return max(0.0, min(1.0, mean_token_prob))


class QEModelStub:
    """
    Real QE: Transformer regressor on (src, hyp) concatenated, or COMET-style.
    """

    def score(self, src: str, hyp: str) -> float:
        # Placeholder: length sanity + char overlap toy score
        overlap = len(set(src.lower()) & set(hyp.lower()))
        return max(0.0, min(1.0, overlap / (10.0 + abs(len(hyp) - len(src)))))


@dataclass
class EscalationPolicy:
    min_conf: float = 0.35
    min_qe: float = 0.40


def should_warn_user(sig: QESignals, qe: float, pol: EscalationPolicy) -> bool:
    if heuristic_confidence(sig.mean_token_prob) < pol.min_conf:
        return True
    if qe < pol.min_qe:
        return True
    if sig.length_ratio < 0.3 or sig.length_ratio > 3.0:
        return True
    return False


def alternatives_from_nbest(nbest: list[str], max_k: int = 3) -> list[str]:
    # De-dup and filter trivial variants
    out: list[str] = []
    for s in nbest:
        s2 = s.strip()
        if s2 and s2 not in out:
            out.append(s2)
        if len(out) >= max_k:
            break
    return out

Human escalation: route to professional translation partners when:

User selects “certified translation”
Legal/medical disclaimers require it
QE signals failure on high-risk content

4.6 Document and Contextual Translation¶

Problems: inconsistent terminology, wrong anaphora, broken formatting.

from __future__ import annotations

import re
from dataclasses import dataclass


PLACEHOLDER_RE = re.compile(r"(\{\{[^}]+\}\}|<[^>]+>|https?://\S+)")


@dataclass
class Segment:
    idx: int
    text: str


def protect_placeholders(text: str) -> tuple[str, dict[str, str]]:
    mapping: dict[str, str] = {}

    def repl(m: re.Match[str]) -> str:
        key = f"__PH{len(mapping)}__"
        mapping[key] = m.group(0)
        return key

    protected = PLACEHOLDER_RE.sub(repl, text)
    return protected, mapping


def restore_placeholders(translated: str, mapping: dict[str, str]) -> str:
    out = translated
    for k, v in mapping.items():
        out = out.replace(k, v)
    return out


def sentence_segment(text: str) -> list[Segment]:
    # Production: use language-specific segmenters (spaCy, Moses, etc.)
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [Segment(i, p) for i, p in enumerate(parts) if p]


class ContextWindow:
    """Keeps previous source/target tails for contextual conditioning (sketch)."""

    def __init__(self, max_chars: int = 512):
        self.max_chars = max_chars
        self.src_hist: list[str] = []
        self.tgt_hist: list[str] = []

    def update(self, src_seg: str, tgt_seg: str) -> None:
        self.src_hist.append(src_seg)
        self.tgt_hist.append(tgt_seg)
        joined_src = " ".join(self.src_hist)
        while len(joined_src) > self.max_chars and len(self.src_hist) > 1:
            self.src_hist.pop(0)
            self.tgt_hist.pop(0)
            joined_src = " ".join(self.src_hist)

    def prefix(self) -> str:
        return " ".join(self.src_hist[-2:])


def glossary_enforce(text: str, term_map: dict[str, str]) -> str:
    out = text
    for src_term, tgt_term in sorted(term_map.items(), key=lambda kv: -len(kv[0])):
        out = re.sub(re.escape(src_term), tgt_term, out, flags=re.IGNORECASE)
    return out

4.7 Low-Resource Languages¶

from __future__ import annotations


def pivot_translate(text: str, translate_fn, src: str, pivot: str = "en", tgt: str | None = None) -> str:
    """
    translate_fn: (text, src_lang, tgt_lang) -> str
    """
    mid = translate_fn(text, src, pivot)
    if tgt is None:
        raise ValueError("tgt language required")
    return translate_fn(mid, pivot, tgt)


def few_shot_llm_prompt(src: str, tgt: str, examples: list[tuple[str, str]]) -> str:
    """
    For rare pairs, provide 3-5 high-quality in-context examples.
    Production systems constrain decoding, verify language IDs, and filter unsafe outputs.
    """
    ex_blocks = "\n".join([f"SRC: {s}\nTGT: {t}\n" for s, t in examples])
    return (
        "You are a careful translator. Follow the style of examples.\n\n"
        f"{ex_blocks}\n"
        f"SRC ({src}): {src}\nTGT ({tgt}):"
    )


def transfer_schedule(num_finetune_steps: int) -> dict[str, float]:
    """
    Illustrates staged unfreezing / LR for low-resource fine-tuning.
    """
    return {"head_only": 0.2, "adapters": 0.5, "full": 0.3}

Pivot risks: error compounding (X→EN→Y). Mitigate with reranking, QE, or direct pair model once data exists.

Step 5: Scaling & Production¶

Failure handling¶

Failure	Mitigation
Model host crash	Health checks, fast failover, redundant replicas per region
OOM on long doc	Chunking, streaming decode, cap `max_tokens`
Bad rollout	Canary, automatic rollback on BLEU/comet regression
Poisoned data	Provenance tracking, human audit gates for training corpora

Monitoring¶

Signal	Why
Latency P50/P95/P99	SLO tracking per region
Token throughput	Capacity planning
BLEU/chrF/COMET (offline)	Regression tests on golden sets
Human side-by-side	Gold standard for quality
User edits / copy-feedback	Implicit quality signal (with privacy controls)
Language-pair dashboards	Catch interference or data bugs

flowchart TB subgraph offev["Offline regression"] G[Golden parallel sets per pair] M["Metrics: BLEU/chrF/COMET"] end subgraph onmon["Online monitoring"] L["Latency + error rates"] Q[QE distribution shifts] U[User feedback volume] end G --> M --> Gate{Release gate} L --> Gate Q --> Gate U --> Gate

Trade-offs (interview gold)¶

Axis	Option A	Option B
Latency vs quality	Greedy / small beam	Wide beam + reranker
Cost vs quality	Quantized 8-bit	FP16
Coverage vs interference	One multilingual model	Mixture-of-experts + adapters
Automation vs safety	LLM gisting everywhere	Policy filters + escalation

Interview Tips¶

Strong story arc¶

Clarify modality, latency, and quality targets
Commit to multilingual backbone + routing + specialization
Explain tokenization and normalization as first-class
Close with eval (offline + human) and rollback

Likely Google-style follow-ups¶

Topic	What to say
Multilingual vs bilingual	Parameter efficiency, transfer, interference; hybrid MoE/adapters
Low-resource	Back-translation, pivoting, curated seed data, human-in-the-loop
Metrics	BLEU limits; use chrF++, COMET; human side-by-side for truth
Decoding	Beam width trade-offs; length normalization; n-best reranking
Document translation	Segmentation, context windows, glossary, layout preservation
Safety	Abuse, PII handling, regional policy

Tip

Practice one numeric capacity estimate (words/day → GPU tokens/s) and one failure mode (bad rollout + automatic rollback)—that combination signals senior thinking.

Term	Meaning
BLEU	N-gram overlap vs reference; common but imperfect
chrF	Character n-gram F-score; better for morphology
COMET	Neural metric estimating translation quality
Pivot	Indirect translation through a bridge language
MoE	Mixture-of-Experts: sparse routing to specialized subnetworks

Challenge	Why it matters
Long tail of language pairs	\(O(L^2)\) pairs if naive; multilingual models amortize
Morphology & script	Tokenization and normalization dominate quality for Finnish, Turkish, Arabic, Hindi
Latency vs quality	Beam search improves BLEU but hurts P99 latency
Safety & abuse	Translation can be used to evade moderation; rate limits and policy layers matter
Low-resource	Parallel data scarce; pivoting and transfer are mandatory

Decoding	Pros	Cons
Beam search (width \(k\))	Higher BLEU, more stable	Slower, less diverse, can favor “safe” bland outputs
Sampling (top-k / top-p)	Diversity, more natural in some generative settings	Risk of hallucination in MT if mis-tuned
Greedy	Fastest	Often suboptimal for NMT

Approach	Footprint	Pros	Cons
Per-pair bilingual models	\(O(L^2)\) explosion	Tailored quality for top pairs	Operational nightmare at 100+ languages
Single massive multilingual Transformer	One stack + routing	Parameter efficient; transfer learning	Interference; harder debugging
Hybrid (common)	Multilingual backbone + adapters / LoRA / small experts	Balance	More complex serving