Design a Speech Recognition System (ASR)¶

What We're Building¶

We are designing an automatic speech recognition (ASR) platform comparable in spirit to Google Cloud Speech-to-Text, OpenAI Whisper, Amazon Transcribe, or Azure Speech. The system converts spoken audio into text for real-time streaming (voice assistants, live captions, dictation) and batch workloads (call-center analytics, media transcription, legal discovery). It should support 100+ languages, robustness to noise and accents, optional speaker diarization (“who spoke when”), keyword spotting, and rich post-processing (punctuation, capitalization, formatting).

Capability	User-visible outcome
Streaming ASR	Partial transcripts appear as the user speaks; final text stabilizes after pauses
Batch ASR	Hours of audio transcribed offline with high accuracy and timestamps
Multilingual	Same API routes audio to the right language model or detects language automatically
Production quality	Low word error rate (WER), stable latency, high availability

Real-World Scale (Illustrative)¶

Signal	Order of magnitude	Why it shapes design
Google Assistant reach	500M+ users on devices (historical public figures vary by year)	Massive tail latency sensitivity; caching and routing matter globally
Voice query volume (Assistant-class)	Billions of voice queries per day (ecosystem-scale)	Requires multi-region serving, autoscaling, degraded modes
Whisper training data (OpenAI)	680k hours of weakly supervised web data	Demonstrates scale of pretraining + multitask learning for robustness
Enterprise transcription	Petabytes/year of call audio in large contact centers	Drives batch pipelines, cost per hour, compliance (retention, encryption)

Note

Interview framing: separate research accuracy (WER on benchmarks) from product metrics (task success rate, user edits per minute, caption lag). Production ASR is a systems + ML problem.

Why Speech Recognition Is Hard¶

Ambiguity: Acoustics underdetermine words (“ice cream” vs “I scream”); language models resolve ambiguity.
Variability: Speaker age, accent, microphone, codec, and room acoustics shift the feature distribution.
Alignment: Audio is a continuous signal; text is discrete. Models must solve time-to-text alignment (HMM, CTC, attention, RNN-T).
Streaming constraint: Future audio is unknown—you cannot “look ahead” indefinitely without violating latency budgets.
Long tail: Rare names, entities, and code-switching break naive LMs unless you add contextual biasing and personalization.

ML Concepts Primer¶

This section goes deeper than a typical “ML 101” because ASR interviews often probe features, decoders, and streaming trade-offs.

Audio Features: From Waveform to Mel Spectrograms, MFCCs, and Filterbanks¶

Raw waveform \(x[n]\) is sampled at rates such as 8 kHz (telephony narrowband), 16 kHz (common for ASR), or 48 kHz (broadcast). Most neural ASR models do not consume raw samples directly; they use time-frequency representations.

Representation	What it captures	Typical use in ASR
Short-time Fourier transform (STFT)	Magnitude vs frequency over time	Baseline spectrogram
Mel spectrogram	Perceptually warped frequency axis (mel scale)	Dominant input to transformers/CNNs
Log-mel filterbanks	Log compression + mel filters	Stable dynamic range for deep nets
MFCCs	DCT-compressed cepstral features	Legacy pipelines; still seen in classical systems

Mel scaling approximates human hearing sensitivity—more resolution at low frequencies. Log compression reduces loudness variation.

Tip

In interviews, say: “We use 20–25 ms windows, a 10 ms hop, 40–128 mel bins, and per-utterance CMVN or global stats depending on deployment.”
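As a quick sanity check on window and hop sizes, the sketch below converts durations into sample counts and a frame count. The 16 kHz rate, the 25 ms / 10 ms settings, and the helper name `frame_geometry` are assumptions chosen for illustration.

```python
from __future__ import annotations


def frame_geometry(sample_rate: int, window_ms: float, hop_ms: float, duration_s: float) -> dict[str, int]:
    """Convert window/hop durations to sample counts and a frame count (no padding)."""
    window = int(sample_rate * window_ms / 1000.0)
    hop = int(sample_rate * hop_ms / 1000.0)
    total = int(sample_rate * duration_s)
    # Frames that fit entirely inside the signal.
    frames = 0 if total < window else 1 + (total - window) // hop
    return {"window_samples": window, "hop_samples": hop, "num_frames": frames}


geom = frame_geometry(sample_rate=16000, window_ms=25.0, hop_ms=10.0, duration_s=1.0)
print(geom)
```

With these settings one second of audio yields roughly 100 feature frames, which is the sequence length the encoder sees before any subsampling.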

Traditional ASR Pipeline: Acoustic Model → Language Model → Decoder¶

Classical systems decomposed the problem:

flowchart LR
    A[Audio] --> F[Feature Extraction]
    F --> AM[Acoustic Model HMM-DNN / GMM]
    AM --> D[Decoder WFST search]
    LM[Language Model n-gram] --> D
    Lex[Lexicon / Pronunciations] --> D
    D --> T[Text + alignment]

Component	Role
Acoustic model (AM)	\(P(\text{acoustics} \mid \text{hidden states})\) often via HMM state tying
Lexicon	Maps words to phone sequences
Language model (LM)	Prior over word sequences \(P(w_1,\dots,w_n)\)
Decoder	Weighted finite-state transducers (WFST) compose AM+LM efficiently

Why it mattered: Decoupled training; strong WFST tooling; excellent CPU decoding for small footprints.
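The decomposition in the table above follows the usual noisy-channel view of recognition. As a sketch, writing \(X\) for the acoustic feature sequence and \(W\) for a candidate word sequence (symbols assumed here for illustration, not taken from the text above):

\[ \hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W) \]

The acoustic model and lexicon together score \(P(X \mid W)\), the language model scores \(P(W)\), and the decoder carries out the search over \(W\).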

End-to-End Models: CTC/Attention/RNN-T¶

Modern ASR often uses a single neural network with a differentiable alignment mechanism:

Paradigm	Alignment mechanism	Strengths	Weaknesses
CTC	Blank symbol; dynamic programming alignment	Stable training; streaming variants exist	“Peakiness”; needs LM for best WER
Attention encoder-decoder	Cross-attention aligns frames to tokens	Great accuracy on long-form	Non-streaming unless heavily modified
RNN-T (transducer)	Predict next label given audio prefix + label history	Natural streaming; label-loop	Training complexity; latency tuning
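To make the CTC row concrete, here is a minimal sketch of best-path (greedy) CTC decoding: take the argmax per frame, merge consecutive repeats, then drop blanks. The function name `ctc_greedy_decode` and the toy frame sequence are assumptions for illustration.

```python
from __future__ import annotations

import numpy as np


def ctc_greedy_decode(log_probs: np.ndarray, blank: int = 0) -> list[int]:
    """Best-path decoding: argmax per frame, merge repeats, then drop blanks."""
    best = np.argmax(log_probs, axis=1)
    out: list[int] = []
    prev = blank
    for tok in best:
        tok = int(tok)
        # Emit only when the frame label changes and is not blank.
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out


# Toy posteriors whose per-frame argmax is 1 1 _ 1 2 2 _ (with _ = blank).
frames = [1, 1, 0, 1, 2, 2, 0]
lp = np.full((len(frames), 3), -10.0)
lp[np.arange(len(frames)), frames] = 0.0
print(ctc_greedy_decode(lp))
```

The blank between the two runs of label 1 is what lets the output contain the same label twice in a row; without it the repeats would merge.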

Warning

Attention models can “cheat” with future frames—great for Whisper-style offline ASR, problematic for true low-latency streaming unless you constrain attention or switch architectures.

Whisper Architecture (Representative)¶

Whisper is an encoder-decoder Transformer trained on weakly supervised web-scale audio with multitask objectives:

flowchart TB
    subgraph Enc[Encoder]
        M[Log-mel spectrogram patches] --> T[Transformer encoder]
    end
    subgraph Dec[Decoder]
        T --> TD[Transformer decoder]
        TD --> O[Token outputs]
    end
    MT["Special tokens: transcribe / translate language id, timestamps"] --> Dec

Multitask training includes:

Transcription in many languages
Translation to English (for some checkpoints)
Timestamp prediction (segment-level)
Voice activity / no-speech behavior via conditioning

This multitask setup improves robustness and generalization compared to narrow single-task training.

Streaming vs Non-Streaming¶

Mode	Definition	Key constraints
Offline / batch	Entire utterance or file available	Can use full-context attention, large beams, rescoring
Streaming	Emit partial hypotheses as audio arrives	Causal encoders, chunking, endpointing, limited lookahead

Challenges for real-time ASR:

Causal convolution / streaming conformer: Only use past + small right context per chunk.
Chunked processing: Run inference every \(\Delta t\) ms; stitch partials.
Look-ahead buffers: A small future window improves stability but increases latency.
Endpoint detection: Deciding “user finished speaking” affects UX and downstream NLU.
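The chunking and look-ahead ideas above can be sketched as an attention mask. This is one possible formulation, assuming frame-level chunks, a fixed number of visible left chunks, and an optional right context measured in frames; the name `chunked_attention_mask` and all parameter values are illustrative.

```python
import numpy as np


def chunked_attention_mask(num_frames: int, chunk: int, left_chunks: int, right_frames: int = 0) -> np.ndarray:
    """mask[i, j] is True when query frame i may attend to key frame j."""
    idx = np.arange(num_frames)
    chunk_id = idx // chunk
    # Earliest visible frame: start of the chunk `left_chunks` before the query's chunk.
    start = np.maximum(chunk_id - left_chunks, 0) * chunk
    # Latest visible frame: end of the query's own chunk plus optional look-ahead.
    end = np.minimum((chunk_id + 1) * chunk - 1 + right_frames, num_frames - 1)
    keys = idx[None, :]
    return (keys >= start[:, None]) & (keys <= end[:, None])


mask = chunked_attention_mask(num_frames=8, chunk=4, left_chunks=1, right_frames=1)
print(mask.astype(int))
```

In this sketch a frame at the start of a chunk cannot be finalized until the rest of its chunk and the right context have arrived, so the added algorithmic delay is bounded by `chunk + right_frames` frames.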

Language Models in ASR: Shallow Fusion, Deep Fusion, Rescoring¶

Technique	What happens	Typical latency impact
Shallow fusion	Combine AM log-probs with LM log-probs during beam search with weight \(\lambda\)	Moderate
Deep fusion / cold fusion	LM hidden states gate AM predictions (architectural)	Higher
Second-pass rescoring	Generate N-best list; neural LM rescores full hypotheses	Adds batch-like delay unless async

Formula sketch (shallow fusion):

\[ \log p_\text{joint}(y \mid x) \approx \log p_\text{ASR}(y \mid x) + \lambda \log p_\text{LM}(y) \]
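As a worked instance of the formula, the sketch below applies the fused score to a fixed N-best list. True shallow fusion applies the same sum inside beam search at every step; the scores, the weight `lam=0.3`, and the helper name `rescore_nbest` are made up for illustration.

```python
from __future__ import annotations


def rescore_nbest(nbest: list[tuple[str, float]], lm_logp: dict[str, float], lam: float = 0.3) -> list[tuple[str, float]]:
    """Re-rank (hypothesis, asr_logp) pairs by asr_logp + lam * lm_logp."""
    scored = [(hyp, asr + lam * lm_logp[hyp]) for hyp, asr in nbest]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical scores: the acoustic model slightly prefers the less likely phrase.
nbest = [("i scream", -3.0), ("ice cream", -3.2)]
lm_scores = {"i scream": -9.0, "ice cream": -5.0}
ranked = rescore_nbest(nbest, lm_scores, lam=0.3)
print(ranked[0][0])
```

Here the language model prior flips the ranking: the fused scores are -5.7 for "i scream" and -4.7 for "ice cream".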

Speaker Diarization: Who Spoke When¶

Diarization segments an audio stream into speaker-homogeneous regions and labels speaker identities (often anonymous: Speaker A/B).

Approach family	Idea
Embedding + clustering	Compute x-vectors / ECAPA embeddings per segment; cluster (AHC, spectral)
End-to-end diarization	Neural models directly predict speaker activity overlaps
Overlap handling	Separate modeling for simultaneous speech

Combining ASR + diarization: run VAD/segmentation, then diarization to assign speaker IDs per time region, then ASR per segment; research systems also explore joint models.

Step 1: Requirements¶

Functional Requirements¶

ID	Requirement	Notes
F1	Real-time streaming transcription	Partial results, stable finals, punctuation policy
F2	Batch transcription	Long files, diarization, timestamps, profanity filters
F3	Language detection / routing	Auto-detect vs user-specified language
F4	Punctuation & capitalization	Either ASR-integrated or separate seq2seq post-model
F5	Speaker diarization	Optional; increases cost and latency
F6	Keyword spotting / commands	“Wake words” or constrained grammars for device UX

Non-Functional Requirements¶

ID	Requirement	Target (example)
N1	Streaming latency	<300 ms end-to-end budget is common for “snappy” UX (product-dependent)
N2	Accuracy	WER < 5% on clean major-language test sets; higher WER acceptable in noisy channels
N3	Coverage	100+ languages with tiered quality
N4	Availability	99.99% API availability (regional redundancy)
N5	Privacy	Encryption in transit/at rest; data retention controls; on-device option

Tip

Translate “300 ms” into audio buffer + feature + inference + beam + post with a rough pie chart in interviews—numbers matter more than buzzwords.
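A latency budget of this kind can be written down as a small table in code. Every stage timing below is an illustrative assumption rather than a measurement, and the two network stages are an addition to the components named above.

```python
# All stage timings are illustrative assumptions, not measurements.
BUDGET_MS = 300.0
stages_ms = {
    "audio_buffer": 80.0,
    "network_uplink": 40.0,
    "feature_extraction": 5.0,
    "encoder_inference": 60.0,
    "beam_search": 30.0,
    "post_processing": 15.0,
    "network_downlink": 40.0,
}

total_ms = sum(stages_ms.values())
headroom_ms = BUDGET_MS - total_ms
for name, ms in stages_ms.items():
    # Share of the total spent in each stage.
    print(f"{name:20s} {ms:6.1f} ms  {100.0 * ms / total_ms:5.1f}%")
print(f"total={total_ms:.1f} ms, headroom={headroom_ms:.1f} ms")
```

Laying the budget out this way makes it obvious which knob to turn first: in this made-up split, the audio buffer and encoder together account for about half of the total.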

Step 2: Estimation¶

Estimation is interviewer-specific; treat numbers as Fermi-style anchors you can defend.

Audio Processing Compute (GPU/CPU)¶

Assume 16 kHz, 16-bit PCM mono:

Byte rate: \(16000 \times 2 = 32{,}000\) bytes/s (32 KB/s) ≈ 115 MB/hour raw PCM.
Feature frontend (CPU): Often 1–4 ms per second of audio CPU time on modern cores for mel/STFT—cheap compared to GPU inference.
GPU inference: Dominated by encoder forward + decoder steps. Throughput scales with batch and model size.

Order-of-magnitude streaming cost driver: concurrent streams × per-stream GPU compute and memory footprint × autoscaling headroom.

Model Size¶

Class	Parameters (indicative)	Deployment implication
Edge / on-device	10M–100M (quantized)	Fits NPU/phone; limited languages
Cloud streaming	100M–1B	GPU serving; aggressive quantization
High-accuracy batch	1B+ (some architectures)	Multi-GPU or heavy batching

Bandwidth for Audio Streaming¶

Opus / AAC compressed streams: 12–64 kbps typical, roughly 4–20× smaller than 256 kbps PCM (16 kHz, 16-bit mono).
Client-side VAD can reduce uplink by not sending silence (privacy + cost trade-offs).

Storage for Transcripts¶

Text is tiny vs audio: at ~150 words per minute, plain transcript text is on the order of 1 KB per minute of speech (depends on language).
Metadata (timestamps, speaker labels, confidence) may dominate storage for analytics pipelines.

Asset	1 hour (rough order)
Raw PCM (16 kHz mono)	~115 MB
Compressed audio	~10–30 MB
Transcript text + JSON metadata	~100 KB–2 MB
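The per-hour figures can be reproduced with a few lines of arithmetic. The 24 kbps bitrate, 150 words per minute, and 6 bytes per word are assumptions picked for illustration; the transcript figure covers plain text only, before any JSON metadata.

```python
def pcm_mb_per_hour(sample_rate: int = 16000, bytes_per_sample: int = 2, channels: int = 1) -> float:
    """Raw PCM size in MB (10^6 bytes) for one hour of audio."""
    return sample_rate * bytes_per_sample * channels * 3600 / 1e6


def compressed_mb_per_hour(bitrate_kbps: float) -> float:
    """Size in MB for one hour of a constant-bitrate compressed stream."""
    return bitrate_kbps * 1000 / 8 * 3600 / 1e6


def transcript_kb_per_hour(words_per_minute: float = 150.0, bytes_per_word: float = 6.0) -> float:
    """Plain-text transcript size in KB for one hour of continuous speech."""
    return words_per_minute * bytes_per_word * 60 / 1e3


print(pcm_mb_per_hour(), compressed_mb_per_hour(24.0), transcript_kb_per_hour())
```

Under these assumptions an hour of speech is about 115 MB as PCM, about 11 MB compressed, and about 54 KB as bare text, so timestamps and confidences can easily outweigh the words themselves.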

Step 3: High-Level Design¶

Batch / Offline Path¶

flowchart LR
    IN[Audio Input file or chunk] --> FE[Feature Extraction log-mel spectrogram]
    FE --> AM[ASR Model encoder-decoder / CTC / RNN-T]
    AM --> DEC[Decoder beam search / RNN-T joint]
    DEC --> PP[Post-processing punctuation, ITN, formatting]
    PP --> OUT[Text + timestamps]
    LM[Language model optional fusion / rescoring] --> DEC

Streaming Path (Conceptual)¶

flowchart LR
    MIC[Mic / Stream] --> BUF[Ring buffer chunked frames]
    BUF --> FE[Streaming features causal frontend]
    FE --> ENC[Streaming encoder chunked Conformer / RNN-T]
    ENC --> EP[Endpointer / VAD]
    ENC --> DEC[Incremental decoder partial + final]
    DEC --> PP[Stabilization + punctuation policy]
    PP --> UI[Client UI / API callbacks]

Key differences:

Chunked inference every \(\Delta t\).
Endpointer decides utterance boundaries.
Partial results may be revised—clients should handle substitutions gracefully.
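One simple way for a client to cope with revised partials is to treat only the words shared by the most recent hypotheses as stable. The sketch below assumes whitespace-tokenized partials and a two-hypothesis window; the helper name `common_word_prefix` is hypothetical, and production stabilizers typically also use model confidence and timing.

```python
from __future__ import annotations


def common_word_prefix(hypotheses: list[str]) -> str:
    """Longest word-level prefix shared by every hypothesis in the list."""
    if not hypotheses:
        return ""
    split = [h.split() for h in hypotheses]
    prefix: list[str] = []
    # zip stops at the shortest hypothesis, so the prefix never overruns any of them.
    for words in zip(*split):
        if all(w == words[0] for w in words):
            prefix.append(words[0])
        else:
            break
    return " ".join(prefix)


partials = ["turn on", "turn on the light", "turn on the lights in"]
print(common_word_prefix(partials[-2:]))
```

The UI can render the shared prefix as committed text and the remainder in a lighter style, so a late change from "light" to "lights" does not make already-committed words flicker.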

Note

Many products run two models: a tiny streaming model for UX + a larger batch rescoring pass on pauses—hybrid latency/accuracy trade-off.

Step 4: Deep Dive¶

4.1 Audio Preprocessing¶

Goals: normalize loudness, reduce noise, remove silence for efficiency, compute stable features.

"""
Educational numpy-only sketch: STFT magnitudes -> mel spectrogram.
Not production-grade; demonstrates real signal-processing steps.
"""
from __future__ import annotations

import math
import numpy as np


def hz_to_mel(hz: float) -> float:
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(m: float) -> float:
    return 700.0 * (10 ** (m / 2595.0) - 1.0)


def preemphasis(x: np.ndarray, coeff: float = 0.97) -> np.ndarray:
    """High-pass emphasis to balance spectral tilt."""
    return np.append(x[0], x[1:] - coeff * x[:-1])


def framing(
    x: np.ndarray, sample_rate: int, frame_ms: float = 25.0, hop_ms: float = 10.0
) -> tuple[np.ndarray, int, int]:
    frame_len = int(sample_rate * frame_ms / 1000.0)
    hop_len = int(sample_rate * hop_ms / 1000.0)
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("Invalid framing parameters")
    num_frames = 1 + (len(x) - frame_len) // hop_len
    frames = np.stack([x[i * hop_len : i * hop_len + frame_len] for i in range(num_frames)], axis=0)
    return frames, frame_len, hop_len


def hann_window(length: int) -> np.ndarray:
    n = np.arange(length)
    return 0.5 - 0.5 * np.cos(2.0 * math.pi * n / (length - 1))


def stft_magnitude(frames: np.ndarray, sample_rate: int) -> np.ndarray:
    window = hann_window(frames.shape[1])
    windowed = frames * window
    fft_size = frames.shape[1]
    spectrum = np.fft.rfft(windowed, n=fft_size, axis=1)
    mag = np.abs(spectrum)
    return mag


def build_mel_filterbank(
    sample_rate: int, n_fft: int, n_mels: int, fmin: float, fmax: float
) -> np.ndarray:
    """Shape: (n_mels, n_bins) where n_bins = n_fft//2 + 1"""
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, num=n_bins)
    mel_min, mel_max = hz_to_mel(fmin), hz_to_mel(fmax)
    mel_points = np.linspace(mel_min, mel_max, n_mels + 2)
    hz_points = np.array([mel_to_hz(m) for m in mel_points])
    bin_indices = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fb = np.zeros((n_mels, n_bins), dtype=np.float64)
    for m in range(n_mels):
        left, center, right = bin_indices[m], bin_indices[m + 1], bin_indices[m + 2]
        for k in range(n_bins):
            if k < left or k > right:
                continue
            if k < center:
                fb[m, k] = (k - left) / max(center - left, 1)
            else:
                fb[m, k] = (right - k) / max(right - center, 1)
    # Slaney-style normalization (simplified)
    enorm = np.maximum(fb.sum(axis=1, keepdims=True), 1e-12)
    fb /= enorm
    return fb


def log_mel_spectrogram(
    x: np.ndarray,
    sample_rate: int = 16000,
    n_mels: int = 80,
    frame_ms: float = 25.0,
    hop_ms: float = 10.0,
) -> np.ndarray:
    x = preemphasis(x.astype(np.float64))
    frames, frame_len, _ = framing(x, sample_rate, frame_ms=frame_ms, hop_ms=hop_ms)
    mag = stft_magnitude(frames, sample_rate)
    mel_fb = build_mel_filterbank(sample_rate, frame_len, n_mels, fmin=0.0, fmax=sample_rate / 2.0)
    mel = np.matmul(mag, mel_fb.T)
    log_mel = np.log(np.maximum(mel, 1e-10))
    return log_mel


def simple_energy_vad(frames_power: np.ndarray, energy_threshold: float) -> np.ndarray:
    """frames_power: per-frame energy proxy (mean mel or RMS). Returns boolean voice activity."""
    return frames_power > energy_threshold


# Example usage with synthetic audio
if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr * 0.5) / sr
    tone = 0.1 * np.sin(2 * math.pi * 440.0 * t)
    feats = log_mel_spectrogram(tone, sample_rate=sr)
    print(feats.shape)  # (num_frames, n_mels)

Noise reduction (production patterns):

DSP: Wiener filtering, spectral subtraction (fast on device).
Neural: Speech enhancement model as a front-end (latency trade-off).

VAD: WebRTC VAD, Silero VAD, or learned VAD for robust operation across codecs.
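As a sketch of the DSP option, here is basic magnitude-domain spectral subtraction. It assumes the first few frames contain noise only, which is a strong assumption; the function name, `over_sub`, and `floor` values are illustrative rather than tuned.

```python
import numpy as np


def spectral_subtraction(mag: np.ndarray, noise_frames: int = 10, over_sub: float = 1.0, floor: float = 0.05) -> np.ndarray:
    """mag: (num_frames, n_bins) STFT magnitudes. Returns denoised magnitudes, same shape."""
    # Assumption: the first `noise_frames` frames are noise-only.
    noise_est = mag[:noise_frames].mean(axis=0, keepdims=True)
    cleaned = mag - over_sub * noise_est
    # A spectral floor avoids negative magnitudes and limits "musical noise".
    return np.maximum(cleaned, floor * mag)


rng = np.random.default_rng(0)
noise = np.abs(rng.normal(scale=0.1, size=(50, 201)))
speech = np.zeros_like(noise)
speech[20:40, 30] = 1.0  # synthetic tone in bin 30 for frames 20..39
denoised = spectral_subtraction(noise + speech)
print(denoised.shape, float(denoised[:10].mean()), float(denoised[25, 30]))
```

The noise-only frames are pushed toward the floor while the synthetic tone survives largely intact. Real systems track the noise estimate over time instead of fixing it from the first frames.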

4.2 Model Architecture¶

Representative building blocks you should name confidently:

Component	Role
Conformer	Convolution + self-attention; strong accuracy
Streaming Conformer	Chunk-based attention with limited future context
RNN-T	Streaming-friendly transducer
CTC head	Alignment for encoder-only models

CTC alignment sketch (conceptual):

import numpy as np


def ctc_forward_backward(log_probs: np.ndarray, labels: list[int], blank: int = 0) -> float:
    """
    Toy CTC log-likelihood (forward pass only) for sanity checking implementations.
    log_probs: (T, V) - already log-softmax per time
    labels: target label sequence without blanks (repeated labels are kept as-is)
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    # Interleave blanks: [blank, l1, blank, l2, ..., blank]. Repeats are NOT collapsed;
    # the blank between them is what lets CTC emit the same label twice.
    L_prime = [blank]
    for x in labels:
        L_prime.extend([x, blank])
    U = len(L_prime)
    T = log_probs.shape[0]
    alpha = np.full((T, U), -np.inf)
    alpha[0, 0] = log_probs[0, L_prime[0]]
    alpha[0, 1] = log_probs[0, L_prime[1]]
    for t in range(1, T):
        for u in range(U):
            terms = [alpha[t - 1, u]]
            if u - 1 >= 0:
                terms.append(alpha[t - 1, u - 1])
            # Skipping a blank is legal only between two different non-blank labels.
            if u - 2 >= 0 and L_prime[u] != blank and L_prime[u] != L_prime[u - 2]:
                terms.append(alpha[t - 1, u - 2])
            alpha[t, u] = log_probs[t, L_prime[u]] + np.logaddexp.reduce(np.array(terms))
    # Valid paths end on the last label or on the trailing blank.
    return float(np.logaddexp(alpha[T - 1, U - 1], alpha[T - 1, U - 2]))


rng = np.random.default_rng(0)
T, V = 12, 6
log_probs = np.log(rng.random((T, V)))
log_probs -= np.logaddexp.reduce(log_probs, axis=1, keepdims=True)  # normalize rows to log-softmax
labels = [1, 2, 2, 3]
score = ctc_forward_backward(log_probs, labels, blank=0)
print("CTC log-likelihood (toy):", score)

Whisper-style multitask (conceptual API):

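from __future__ import annotations

from dataclasses import dataclass


@dataclass
class WhisperTask:
    language: str | None  # None => language not yet known (auto-detect)
    translate_to_english: bool
    include_timestamps: bool


def build_decoder_prompt(task: WhisperTask) -> list[str]:
    """Conceptual prompt builder; token spellings follow Whisper's released tokenizer."""
    tokens = ["<|startoftranscript|>"]
    # Whisper has no "auto" token: when the language is unknown, the decoder first
    # predicts the language token itself, and that prediction fills this slot.
    if task.language is not None:
        tokens.append(f"<|{task.language}|>")
    tokens.append("<|translate|>" if task.translate_to_english else "<|transcribe|>")
    # Timestamps are predicted by default; the prompt opts out rather than opting in.
    if not task.include_timestamps:
        tokens.append("<|notimestamps|>")
    return tokens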

4.3 Streaming Inference¶

Patterns:

Chunked encoder: every 80–120 ms emit hidden states.
Trigger/endpointer: VAD + speech/silence classifier.
Incremental decoding: maintain beam state across chunks; stabilization rules for UI.

from collections import deque
import numpy as np


class StreamingChunkProcessor:
    def __init__(self, chunk_ms: int = 80, sample_rate: int = 16000):
        self.chunk_ms = chunk_ms
        self.sample_rate = sample_rate
        self.chunk_samples = int(sample_rate * chunk_ms / 1000.0)
        self.buffer = deque()

    def accept_audio(self, pcm_chunk: np.ndarray) -> list[np.ndarray]:
        """Returns zero or more fixed-size chunks ready for inference."""
        self.buffer.extend(pcm_chunk.tolist())
        ready = []
        while len(self.buffer) >= self.chunk_samples:
            piece = np.array([self.buffer.popleft() for _ in range(self.chunk_samples)], dtype=np.float32)
            ready.append(piece)
        return ready


class SimpleEndpointer:
    """Energy-based endpointing sketch (replace with neural EP in prod)."""

    def __init__(self, silence_ms: int = 500, energy_ratio: float = 2.0):
        self.silence_ms = silence_ms
        # energy_ratio > 1: frames quieter than energy_ratio x the noise floor count as silence.
        self.energy_ratio = energy_ratio
        self.silence_frames = 0

    def update(self, frame_rms: float, noise_floor: float) -> bool:
        if frame_rms < self.energy_ratio * max(noise_floor, 1e-6):
            self.silence_frames += 1
        else:
            self.silence_frames = 0
        # Assume 10ms frames for illustration
        return self.silence_frames * 10 >= self.silence_ms


class IncrementalHypothesis:
    def __init__(self):
        self.partial_text = ""
        self.stable_prefix = ""

    def merge(self, new_partial: str, stable: bool) -> tuple[str, str]:
        """Returns (display_text, stable_prefix)."""
        self.partial_text = new_partial
        if stable:
            self.stable_prefix = new_partial
        return self.partial_text, self.stable_prefix

Latency vs accuracy trade-off:

Knob	Effect
Chunk duration	Larger chunks → more context, higher latency
Beam width	Wider beam → better WER, slower
Right context	More lookahead → better stability, higher latency
Second-pass rescoring	Better finals; may add delay at pauses

4.4 Language Model Integration¶

Shallow fusion scoring sketch:

import math


def shallow_fusion_score(
    asr_logp: float,
    lm_logp: float,
    lam: float = 0.35,
    lm_weight: float = 1.0,
) -> float:
    return asr_logp + lam * lm_weight * lm_logp


def ngram_lm_logp(tokens: tuple[str, ...], counts: dict[tuple[str, ...], int]) -> float:
    """Toy bigram log-probability with Laplace smoothing."""
    vocab_size = 10000
    alpha = 1.0
    logp = 0.0
    prev = "<s>"
    for w in tokens:
        c = counts.get((prev, w), 0)
        d = sum(v for k, v in counts.items() if k[0] == prev)
        p = (c + alpha) / (d + alpha * vocab_size)
        logp += math.log(p)
        prev = w
    return logp


counts = {
    ("<s>", "hello"): 50,
    ("hello", "world"): 40,
    ("<s>", "world"): 1,
}
print(shallow_fusion_score(asr_logp=-4.2, lm_logp=ngram_lm_logp(("hello", "world"), counts)))

Contextual biasing for rare entities:

Class-based language model biasing
WFST biasing graphs (classical)
Neural biasing via partial prompts or keyword spotting gating

def biased_lexicon_boost(hypothesis: str, term: str, base_score: float, boost: float = 5.0) -> float:
    """Illustrative: add constant boost if hypothesis contains an important term."""
    return base_score + (boost if term in hypothesis else 0.0)


hypothesis = "contact john doe at acme"
print(biased_lexicon_boost(hypothesis, "acme", 10.0))

4.5 Speaker Diarization¶

Typical pipeline: segment audio → embedding per segment → cluster speakers → refine boundaries.

import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def greedy_speaker_clusters(embeds: np.ndarray, threshold: float) -> np.ndarray:
    """Greedy cosine-similarity clustering: each new segment joins best centroid or opens a cluster."""
    n = embeds.shape[0]
    norms = np.linalg.norm(embeds, axis=1, keepdims=True) + 1e-8
    x = embeds / norms
    labels = np.zeros(n, dtype=np.int32)
    centroids = [x[0].copy()]
    counts = [1]
    for i in range(1, n):
        sims = [cosine_sim(x[i], c) for c in centroids]
        j = int(np.argmax(sims))
        if sims[j] >= threshold:
            labels[i] = j
            counts[j] += 1
            centroids[j] += (x[i] - centroids[j]) / counts[j]
        else:
            k = len(centroids)
            labels[i] = k
            centroids.append(x[i].copy())
            counts.append(1)
    return labels


emb = np.random.default_rng(1).normal(size=(8, 16))
print(greedy_speaker_clusters(emb, threshold=0.85))

Overlap handling: use multi-label segmentation models or separate overlap detection + ASR on separated tracks (research-heavy).

Combining ASR + diarization:

def assign_words_to_speakers(word_intervals: list[tuple[str, float, float]], speaker_intervals: list[tuple[int, float, float]]):
    """word: (text, t0, t1); speaker: (spk, t0, t1)"""
    out = []
    for w, w0, w1 in word_intervals:
        mid = 0.5 * (w0 + w1)
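        # Distance from the word midpoint to each speaker turn: 0 inside the turn,
        # otherwise the gap to the nearest edge, so uncovered words go to the closest turn.
        spk = min(speaker_intervals, key=lambda s: max(s[1] - mid, mid - s[2], 0.0))[0]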
        out.append((spk, w))
    return out

4.6 Training Pipeline¶

Stage	Purpose
Data collection	Licensed speech, user logs (with consent), calls, podcasts
Pseudo-labeling	Teacher model labels massive unlabeled audio
Augmentation	Noise, reverberation, codec simulation, tempo perturbation
SpecAugment	Mask time/frequency bands to improve robustness
SSL pretraining	wav2vec/HuBERT-style representation learning
Multilingual training	Shared encoder + language adapters or tokens

import math

import numpy as np


def spec_augment(mel: np.ndarray, freq_mask: int = 8, time_mask: int = 30) -> np.ndarray:
    """mel: (time, freq). Zeroes one random frequency band and one random time span."""
    mel = mel.copy()
    t, f = mel.shape
    if freq_mask > 0:
        f0 = np.random.randint(0, max(f - freq_mask, 1))
        mel[:, f0 : f0 + freq_mask] = 0.0
    if time_mask > 0:
        t0 = np.random.randint(0, max(t - time_mask, 1))
        mel[t0 : t0 + time_mask, :] = 0.0
    return mel


def additive_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into clean speech at the requested signal-to-noise ratio (dB)."""
    if noise.shape[0] < clean.shape[0]:
        raise ValueError("noise must be at least as long as clean")
    clean_power = np.mean(clean**2) + 1e-12
    noise = noise[: clean.shape[0]]
    noise_power = np.mean(noise**2) + 1e-12
    # Scale noise so that clean_power / scaled_noise_power == 10 ** (snr_db / 10).
    factor = math.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10.0)))
    return clean + noise * factor


rng = np.random.default_rng(0)
clean = rng.normal(size=16000).astype(np.float32)
noise = rng.normal(size=16000).astype(np.float32)
mixed = additive_noise(clean, noise, snr_db=10.0)

Self-supervised sketch (contrastive intuition):

def info_nce_loss(sim_pos: float, sim_negs: np.ndarray, tau: float = 0.07) -> float:
    logits = np.array([sim_pos / tau, *(sim_negs / tau)])
    logits -= logits.max()
    exp = np.exp(logits)
    return float(-math.log(exp[0] / exp.sum()))

4.7 Serving at Scale¶

from dataclasses import dataclass
import numpy as np


@dataclass
class GpuBatch:
    features: np.ndarray  # (B, T, F)
    lengths: np.ndarray  # (B,)


def dynamic_batch(requests: list[np.ndarray], max_batch: int = 16, pad_token: float = 0.0) -> GpuBatch:
    batch = requests[:max_batch]
    lengths = np.array([x.shape[0] for x in batch], dtype=np.int32)
    T_max = int(lengths.max())
    F = batch[0].shape[1]
    padded = np.full((len(batch), T_max, F), pad_token, dtype=np.float32)
    for i, x in enumerate(batch):
        padded[i, : x.shape[0], :] = x
    return GpuBatch(features=padded, lengths=lengths)


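def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float, float]:
    """Affine quantization to int8; dequantize with (q + 128) * scale + xmin."""
    xmin, xmax = float(x.min()), float(x.max())
    scale = (xmax - xmin) / 255.0 if xmax > xmin else 1.0
    # Map [xmin, xmax] onto [-128, 127]; casting 0..255 straight to int8 would overflow.
    q = (np.round((x - xmin) / scale) - 128).astype(np.int8)
    return q, scale, xmin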


def route_model(language: str, streaming: bool) -> str:
    if streaming:
        return "streaming_conformer_rnn_t_en_us_int8"
    if language.startswith("en"):
        return "whisper_large_v3_en_batch"
    return "whisper_large_v3_multilingual_batch"

gRPC streaming pseudo-interface:

class AsrStreamServicer:
    def StreamRecognize(self, request_iterator, context):
        processor = StreamingChunkProcessor(chunk_ms=80)
        for audio_chunk in request_iterator:
            pcm = np.frombuffer(audio_chunk, dtype=np.int16).astype(np.float32) / 32768.0
            for fixed in processor.accept_audio(pcm):
                feats = log_mel_spectrogram(fixed, sample_rate=16000)
                # yield partial decode ...
                yield b"PARTIAL: ..."

Edge deployment checklist:

INT8 weights, structured pruning, knowledge distillation
On-device LM: small n-gram vs cloud LM
Privacy: local inference for sensitive domains

Step 5: Scaling & Production¶

Failure Handling¶

Failure	Mitigation
GPU OOM / crash	Fallback to CPU model tier; shed load; retry with smaller batch
Region outage	Multi-region active-active; DNS failover
Model regression	Canary releases; automatic rollback on WER drift
Bad audio	Detect music-only / silence; return actionable errors

Monitoring¶

Metric	Why
WER by channel (clean/noisy)	Catches robustness regressions
Real-time factor (RTF)	Cost + capacity planning
Partial stability rate	UX quality for streaming
Language confusion rate	Misrouting under multilingual traffic
GPU utilization	Autoscaling quality

Warning

Watch label drift: transcripts from human reviewers are biased by guidelines; mixing reviewer sets can shift WER without a “true” change in user-perceived quality.

Trade-offs (Interview Gold)¶

Choice	Upside	Downside
Attention offline	Best WER on long-form	Not inherently streaming
RNN-T streaming	Natural partials	Tuning complexity
Big LM rescoring	Big WER gains	Latency / infra cost
On-device	Privacy + offline	Limited model size

Security, Privacy, and Compliance¶

Concern	Practical control
Encryption in transit	TLS for all streaming RPC/websocket audio; certificate pinning on mobile SDKs
Encryption at rest	Customer-managed keys for buckets storing audio and transcript JSON
Data minimization	Optional no-retain mode: process stream, return text, discard audio
Geo-fencing	Route EU customer traffic to EU regions only (policy-driven)
Training consent	Opt-in for using customer audio to improve models; default off in enterprise contracts

Note

Regulated customers (healthcare, finance) often require on-device or VPC-isolated deployment even if cloud accuracy is higher—design for both.

Training and Release Pipeline¶

flowchart TB
    subgraph dplane["Data plane"]
        RAW["Raw corpora + user opt-in"] --> LIC["License & PII scrub"]
        LIC --> VER[Versioned datasets]
    end
    subgraph trn["Training"]
        VER --> AUG[Augmentation + SpecAugment]
        AUG --> SSL[Optional SSL pretrain]
        SSL --> FT[Supervised fine-tune]
    end
    subgraph ship["Shipping"]
        FT --> EVAL["Eval: WER + robustness suites"]
        EVAL --> SHADOW[Shadow traffic]
        SHADOW --> CAN[Canary]
        CAN --> ROLL["Rollout + rollback hooks"]
    end

Capacity Planning Snapshot¶

Quantity	Back-of-envelope
GPU capacity	Peak concurrent audio streams × RTF ÷ streams batched per GPU × redundancy factor
CPU for features	Usually small vs GPU; dominates only for very small models or at the edge
Network cost	Bandwidth is dominated by client→cloud audio upload unless compressed aggressively; transcript egress is tiny by comparison
Queue depth	Batch jobs buffer in Kafka/SQS; SLA drives max wait alarms
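The GPU row can be turned into a small calculator. This is a back-of-envelope sketch in which `rtf` is taken to mean processing seconds per audio second for one batch on one GPU; the traffic numbers, batch size, and 1.3× headroom are assumptions, and the helper name `gpus_needed` is hypothetical.

```python
import math


def gpus_needed(peak_streams: int, rtf: float, streams_batched_per_gpu: int = 1, headroom: float = 1.3) -> int:
    """
    rtf: processing seconds per audio second for one batch on one GPU.
    A GPU then sustains roughly streams_batched_per_gpu / rtf real-time streams.
    """
    streams_per_gpu = streams_batched_per_gpu / rtf
    return math.ceil(peak_streams / streams_per_gpu * headroom)


print(gpus_needed(peak_streams=10_000, rtf=0.05, streams_batched_per_gpu=8))
```

With these inputs each GPU carries about 160 concurrent streams, so 10,000 peak streams need 63 GPUs before headroom and 82 after it.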

Offline Beam Search (Toy)¶

Beam search keeps the top B partial hypotheses; essential for attention and many CTC decoders when paired with an LM.

import math
import numpy as np


def lm_bigram_logp(prefix: tuple[int, ...], bigram_counts: dict, vocab_size: int = 5000) -> float:
    if len(prefix) < 2:
        return 0.0
    a, b = prefix[-2], prefix[-1]
    num = bigram_counts.get((a, b), 0)
    den = sum(bigram_counts.get((a, k), 0) for k in range(vocab_size))
    return math.log((num + 1.0) / (den + vocab_size))


def beam_search_decoder(
    log_probs: np.ndarray,
    beam_size: int = 8,
    lm_fn=lm_bigram_logp,
    lm_weight: float = 0.35,
    bigram_counts: dict | None = None,
) -> tuple[int, ...]:
    """
    log_probs: (T, V) — log-softmax per frame
    Returns best token sequence (toy; no CTC blank handling).
    """
    if bigram_counts is None:
        bigram_counts = {}
    beams: list[tuple[float, tuple[int, ...]]] = [(0.0, tuple())]
    for t in range(log_probs.shape[0]):
        next_beams: list[tuple[float, tuple[int, ...]]] = []
        for score, pref in beams:
            for tok in range(log_probs.shape[1]):
                ns = score + log_probs[t, tok]
                npref = pref + (tok,)
                if lm_fn is not None:
                    ns += lm_weight * lm_fn(npref, bigram_counts)
                next_beams.append((ns, npref))
        next_beams.sort(key=lambda x: x[0], reverse=True)
        beams = next_beams[:beam_size]
    return beams[0][1]


rng = np.random.default_rng(2)
lp = np.log(rng.random((20, 50)))
lp -= np.logaddexp.reduce(lp, axis=1, keepdims=True)  # normalize rows to log-softmax
print(beam_search_decoder(lp, beam_size=4, lm_fn=None))

Tip

In interviews, say beam search is approximate; production systems add length normalization, blank handling for CTC, and diverse decoding for N-best rescoring.
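Length normalization is easy to show in isolation. The sketch below uses one simple variant, dividing the summed log-probability by `length ** alpha`; other formulas exist, and `alpha=0.6` and the candidate scores are assumed values for illustration.

```python
def length_normalized_score(logp: float, length: int, alpha: float = 0.6) -> float:
    """Divide the total log-probability by length**alpha (alpha=0 disables normalization)."""
    return logp / (max(length, 1) ** alpha)


# Hypothetical finished hypotheses: (tokens, summed log-probability).
candidates = [(("hi",), -2.0), (("hi", "there", "everyone"), -3.0)]
raw_best = max(candidates, key=lambda c: c[1])
norm_best = max(candidates, key=lambda c: length_normalized_score(c[1], len(c[0])))
print(raw_best[0], norm_best[0])
```

Raw log-probabilities favor the one-token hypothesis simply because it has fewer terms to sum; after normalization the longer hypothesis wins.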

Incident Response and Rollback¶

Signal	Action
WER SLO breach	Automatically shift traffic to last-known-good model revision
Latency spike	Shed load → CPU fallback tier → extend client-side buffering
Bad region	Drain region in LB; fail over

Interview Tips¶

Likely Follow-Ups (Google-Style)¶

Streaming vs batch: Explain causal encoders, chunking, lookahead, and UI stabilization; mention hybrid two-pass approaches.
Noise robustness: Augmentation, speech enhancement, multistyle training, domain adaptation, and evaluation on matched noisy test sets.
On-device vs cloud: Privacy, latency, cost, model size, quantization, and fallback strategies.
Multilingual routing: LID model costs, language confusion, code-switching, and per-language calibration.
Entity accuracy: Contextual biasing, personalized lexicons, and careful measurement (WER misses rare words disproportionately).

How to Structure a Strong Answer¶

Start with requirements and metrics (WER, RTF, latency).
Draw the batch and streaming diagrams.
Deep dive on one of: streaming, LM fusion, or diarization.
Close with failure modes and monitoring.

A Crisp Soundbite¶

“ASR is alignment + prior: a neural acoustic model proposes hypotheses, a language model supplies prior knowledge, and production is about streaming constraints, latency, and continuous evaluation under real audio.”

Appendix: Quick Reference Tables¶

Feature Extraction Defaults (Typical)¶

Parameter	Common values
Sample rate	16 kHz (ASR), 48 kHz (some pipelines resample)
Frame length	25 ms
Hop	10 ms
Mel bins	40–128
CMVN	utterance vs global
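The CMVN row distinguishes per-utterance from global statistics. A minimal sketch of both modes, assuming features shaped `(time, n_mels)`; the function name `cmvn` and the synthetic features are illustrative.

```python
from __future__ import annotations

import numpy as np


def cmvn(feats: np.ndarray, mean: np.ndarray | None = None, std: np.ndarray | None = None) -> np.ndarray:
    """feats: (time, n_mels). With no stats given, normalize per utterance."""
    if mean is None:
        mean = feats.mean(axis=0)
    if std is None:
        std = feats.std(axis=0)
    # Guard against zero variance in constant feature dimensions.
    return (feats - mean) / np.maximum(std, 1e-8)


rng = np.random.default_rng(0)
feats = rng.normal(loc=5.0, scale=3.0, size=(100, 80))
normed = cmvn(feats)
print(normed.mean(axis=0)[:3], normed.std(axis=0)[:3])
```

Per-utterance statistics need the whole utterance, so streaming systems usually pass precomputed global `mean` and `std` (or a running estimate) instead.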

Model Families¶

Family	Streaming	Notes
RNN-T / transducer	Strong	Used in many production assistants
CTC + LM	Moderate	Good baseline
Attention AED	Offline-first	Whisper-like

Evaluation¶

Metric	Definition hint
WER	\(\frac{S+D+I}{N}\) substitutions/deletions/insertions vs reference words
MER / CER	Token/character variants for morphologically rich languages
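The WER definition can be computed with a word-level edit distance. The sketch below assumes whitespace tokenization and no text normalization, both of which real evaluations handle more carefully; the function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and the first j hyp words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            sub = prev[j - 1] + (r != h)  # match or substitution
            curr[j] = min(sub, prev[j] + 1, curr[j - 1] + 1)  # vs deletion, insertion
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("turn on the kitchen lights", "turn on kitchen light"))
```

The example has one deletion ("the") and one substitution ("lights" to "light") against five reference words, giving a WER of 0.4. Because insertions count, WER can exceed 1.0.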

Tip

Pair WER with semantic task metrics for voice assistants (intent capture), not just string edit distance.

Glossary¶

Term	Meaning
ASR	Automatic speech recognition
CTC	Connectionist temporal classification
RNN-T	Recurrent neural network transducer
VAD	Voice activity detection
WER	Word error rate
RTF	Real-time factor: processing time / audio duration
LID	Language identification
