Gemini: A Family of Highly Capable Multimodal Models¶

Authors: Gemini Team, Google
Year: 2023 | Venue: arXiv / Technical Report
Link: arXiv:2312.11805

TL;DR¶

Gemini trains natively multimodal models over text, images, audio, and video interleaved in context — from the ground up, not by bolting a frozen vision encoder onto a text decoder. The family includes Ultra, Pro, and Nano sizes, with benchmark-leading results across reasoning, code, multimodal understanding, and tool use. Gemini represents Google DeepMind's unified approach to general AI models.

Why This Paper Matters¶

Gemini represents the state of the art in multimodal AI:

Native multimodal: All modalities processed in one unified architecture (not stitched together)
Interleaved context: Can process sequences mixing text, images, audio, and video tokens
Long context: Extended to million-token context windows in later versions
Safety and deployment: Comprehensive safety evaluation across all modalities
Product integration: Powers Google's AI products (Bard/Gemini app, Workspace, etc.)

Key Concepts Explained Simply¶

Native Multimodal vs. Stitched¶

Stitched approach (e.g., LLaVA, GPT-4V initial): - Train a vision encoder (CLIP) separately - Train a text decoder (LLaMA) separately - Connect them with a thin projection layer - Each component was optimized for its own modality

Native approach (Gemini): - Train one model from scratch on interleaved multimodal data - The model learns cross-modal relationships during pre-training - Text, images, audio, and video are all tokenized and processed together - Cross-modal interactions are learned deeply, not just at the projection layer

Multimodal Tokenization¶

Different modalities are converted to tokens: - Text: Standard BPE tokenization - Images: Vision Transformer (ViT) patches → tokens - Audio: Spectrograms → tokens (via encoder) - Video: Frame-level patches → tokens (with temporal encoding)

All tokens enter the same Transformer decoder — the model doesn't know "this is an image" vs. "this is text" except through the token embeddings.

Model Sizes¶

Model	Use Case
Ultra	Most capable — research benchmarks, complex reasoning
Pro	Best balance of quality and efficiency — production API
Nano	On-device — mobile, edge deployment

The Math — Explained Step by Step¶

Unified Autoregressive Objective¶

All modalities are flattened into a single token sequence \(z\):

\[ P_\theta(z) = \prod_{t=1}^{T} P_\theta(z_t \mid z_{<t}) \]

Breaking it down:

\(z = (z_1, z_2, \ldots, z_T)\): A sequence mixing text tokens, image patch tokens, audio tokens, etc.
Same autoregressive objective as GPT — predict the next token
The model learns when a text token follows an image patch, how they relate
Modality-specific embedding layers convert each type of input to the shared dimension

In the Transformer layers, self-attention operates over all tokens regardless of modality:

\[ \text{Attention}(Q_{\text{all}}, K_{\text{all}}, V_{\text{all}}) \]

An image token at position \(i\) can attend to a text token at position \(j\) with the same attention mechanism. This is how the model learns cross-modal relationships (e.g., "a photo of a sunset" should attend to the actual sunset image patches).

Vision Tokenization¶

An image is split into \(P \times P\) patches (e.g., 16×16), each patch is linearly projected:

\[ z_{\text{patch}} = \text{Linear}(\text{flatten}(\text{patch})) + e_{\text{pos}} \]

For a 256×256 image with 16×16 patches: \(256/16 = 16\) patches per side → 256 visual tokens per image.

Video Extension¶

Video adds temporal encoding:

\[ z_{\text{video}} = \text{concat}(z_{\text{frame}_1}, z_{\text{frame}_2}, \ldots, z_{\text{frame}_n}) + e_{\text{temporal}} \]

With temporal subsampling to control token count.

Python Implementation¶

import numpy as np


def stable_softmax(logits):
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)


def image_to_patches(image, patch_size=16):
    """
    Convert an image into patches (ViT-style).
    image: [H, W, C]
    Returns: [n_patches, patch_size * patch_size * C]
    """
    H, W, C = image.shape
    n_h = H // patch_size
    n_w = W // patch_size

    patches = []
    for i in range(n_h):
        for j in range(n_w):
            patch = image[
                i*patch_size:(i+1)*patch_size,
                j*patch_size:(j+1)*patch_size,
                :
            ]
            patches.append(patch.flatten())

    return np.array(patches)


def embed_patches(patches, W_proj, pos_embedding):
    """
    Project patches and add positional embeddings.
    patches: [n_patches, patch_dim]
    W_proj: [patch_dim, d_model]
    pos_embedding: [n_patches, d_model]
    """
    return patches @ W_proj + pos_embedding


def interleave_tokens(text_tokens, image_tokens, modality_embeddings):
    """
    Interleave text and image tokens for multimodal input.
    Returns combined sequence with modality type indicators.
    """
    text_emb = modality_embeddings["text"]
    image_emb = modality_embeddings["image"]

    sequence = []
    types = []

    # Image tokens first, then text (simplified layout)
    for tok in image_tokens:
        sequence.append(tok + image_emb)
        types.append("image")

    for tok in text_tokens:
        sequence.append(tok + text_emb)
        types.append("text")

    return np.array(sequence), types


def cross_modal_attention(query_tokens, kv_tokens, d_model):
    """
    Cross-modal attention: tokens from one modality attend to another.
    """
    d_k = d_model
    scores = (query_tokens @ kv_tokens.T) / np.sqrt(d_k)
    attn = stable_softmax(scores)
    return attn @ kv_tokens


def video_temporal_sampling(frames, sample_rate=4):
    """
    Subsample video frames for token efficiency.
    frames: list of [H, W, C] arrays
    """
    sampled = frames[::sample_rate]
    return sampled


class SimpleMultimodalModel:
    """Simplified native multimodal model."""

    def __init__(self, d_model=64, vocab_size=1000, patch_dim=768):
        self.d_model = d_model
        self.text_embedding = np.random.randn(vocab_size, d_model) * 0.02
        self.patch_proj = np.random.randn(patch_dim, d_model) * 0.02
        self.modality_emb = {
            "text": np.random.randn(d_model) * 0.02,
            "image": np.random.randn(d_model) * 0.02,
            "audio": np.random.randn(d_model) * 0.02,
        }

    def encode_text(self, token_ids):
        return self.text_embedding[token_ids] + self.modality_emb["text"]

    def encode_image(self, patches):
        return patches @ self.patch_proj + self.modality_emb["image"]

    def forward(self, text_ids, image_patches=None):
        """
        Process interleaved text and image tokens.
        """
        tokens = []
        token_types = []

        if image_patches is not None:
            img_tokens = self.encode_image(image_patches)
            for t in img_tokens:
                tokens.append(t)
                token_types.append("image")

        text_tokens = self.encode_text(text_ids)
        for t in text_tokens:
            tokens.append(t)
            token_types.append("text")

        sequence = np.array(tokens)

        # Self-attention (all tokens attend to all — cross-modal)
        d_k = self.d_model
        scores = (sequence @ sequence.T) / np.sqrt(d_k)
        # Causal mask
        mask = np.triu(np.full(scores.shape, -1e9), k=1)
        scores = scores + mask
        attn = stable_softmax(scores)
        output = attn @ sequence

        return output, token_types


def compare_approaches():
    """Compare native multimodal vs stitched approach."""
    approaches = {
        "Stitched (CLIP + LLM)": {
            "training": "Separate pre-training, then alignment",
            "cross_modal": "Only at projection layer",
            "modalities": "Image + Text (typically)",
            "flexibility": "Easy to swap components",
            "depth": "Shallow cross-modal interaction",
        },
        "Native (Gemini)": {
            "training": "Joint pre-training from scratch",
            "cross_modal": "Throughout all layers",
            "modalities": "Text + Image + Audio + Video",
            "flexibility": "Requires full retraining",
            "depth": "Deep cross-modal interaction",
        },
    }

    print("--- Native vs. Stitched Multimodal ---")
    for name, attrs in approaches.items():
        print(f"\n  {name}:")
        for k, v in attrs.items():
            print(f"    {k}: {v}")


def token_count_analysis():
    """Analyze token counts for multimodal inputs."""
    scenarios = [
        ("Text-only (1K words)", 1300, 0, 0),
        ("Image + caption", 50, 256, 0),
        ("10 images + description", 200, 2560, 0),
        ("1 min video (1fps)", 100, 0, 60 * 256),
        ("1 min video (4fps)", 100, 0, 240 * 256),
    ]

    print("\n--- Token Count Analysis ---")
    print(f"{'Scenario':<30} {'Text':>8} {'Image':>8} {'Video':>10} {'Total':>10}")
    print("-" * 70)
    for name, text, image, video in scenarios:
        total = text + image + video
        print(f"{name:<30} {text:>8} {image:>8} {video:>10} {total:>10,}")


# --- Demo ---
if __name__ == "__main__":
    np.random.seed(42)

    # Image to patches
    print("--- Image Tokenization ---")
    image = np.random.randn(64, 64, 3)
    patches = image_to_patches(image, patch_size=16)
    print(f"Image {image.shape} → {patches.shape[0]} patches of dim {patches.shape[1]}")

    # Multimodal forward pass
    print("\n--- Multimodal Forward Pass ---")
    model = SimpleMultimodalModel(d_model=32, vocab_size=100, patch_dim=patches.shape[1])

    text_ids = np.array([10, 20, 30, 40, 50])
    output, types = model.forward(text_ids, patches)
    print(f"Output shape: {output.shape}")
    print(f"Token types: {types}")

    n_image = sum(1 for t in types if t == "image")
    n_text = sum(1 for t in types if t == "text")
    print(f"Image tokens: {n_image}, Text tokens: {n_text}")

    # Comparison
    compare_approaches()

    # Token analysis
    token_count_analysis()

    # Video sampling
    print("\n--- Video Frame Sampling ---")
    frames = [np.random.randn(64, 64, 3) for _ in range(60)]
    for rate in [1, 2, 4, 8]:
        sampled = video_temporal_sampling(frames, sample_rate=rate)
        patches_per_frame = (64 // 16) ** 2
        total_tokens = len(sampled) * patches_per_frame
        print(f"  Sample rate {rate}: {len(sampled)} frames, "
              f"{total_tokens} visual tokens")

Interview Importance¶

Gemini is relevant for understanding multimodal AI systems, which is increasingly important as all major models become multimodal.

Difficulty Level: ⭐⭐⭐ (Medium)¶

Interview Questions & Answers¶

Q1: Compare native multimodal pretraining to frozen encoder + LLM stacks.¶

Answer: - Native (Gemini): All modalities trained together from scratch. Deep cross-modal interactions at every layer. Better at tasks requiring tight vision-language integration. Drawback: expensive to train, can't easily swap components. - Stitched (LLaVA, GPT-4V initial): Use pre-trained frozen vision encoder (CLIP) + pre-trained LLM, connected by a learned projection. Cheaper to build — leverage existing strong components. Cross-modal interaction is shallow (only at the projection layer). Good "enough" for many tasks. - Trend: Models are moving toward native multimodal training as compute becomes more available.

Q2: What evaluation challenges are unique to video or audio vs. images?¶

Answer: 1. Temporal understanding: Video requires understanding events over time (before/after, cause/effect) 2. Token cost: Video generates orders of magnitude more tokens than images — need efficient sampling 3. Audio-visual alignment: Matching what's heard with what's seen (e.g., lip sync, sound localization) 4. Benchmark scarcity: Fewer high-quality video/audio benchmarks compared to image benchmarks 5. Subjectivity: Audio/video understanding often involves subjective judgments (mood, tone) 6. Real-time processing: Audio/video applications often require low-latency processing

Q3: How would you design red-teaming for a multimodal product?¶

Answer: 1. Cross-modal attacks: Embed harmful text in images (adversarial examples, text overlays) 2. Modality-specific harms: Test for harmful image generation, deepfake detection failures 3. Jailbreaking: Attempt to bypass safety filters by encoding harmful prompts as images 4. Cultural sensitivity: Test with culturally diverse images that may be misinterpreted 5. Privacy: Test for ability to identify real individuals from photos 6. Audio attacks: Adversarial audio, voice cloning, impersonation 7. Accessibility: Ensure the model doesn't discriminate based on visual disabilities in images 8. Multi-hop attacks: Combine benign image + benign text to create harmful output

Connections to Other Papers¶

CLIP → Gemini's native approach contrasts with CLIP's contrastive dual-encoder approach
PaLM → Gemini is PaLM's successor at Google
Transformer → Same underlying architecture, extended to multimodal tokens
FlashAttention → Critical for handling the long sequences from multimodal inputs
Constitutional AI → Safety evaluation per modality builds on alignment work

Key Takeaways for Quick Review¶

Concept	Remember
Approach	Native multimodal pre-training (not stitched)
Modalities	Text, images, audio, video — interleaved in context
Architecture	Unified Transformer decoder over all modality tokens
Model sizes	Ultra (largest), Pro (balanced), Nano (on-device)
Tokenization	ViT patches for images, BPE for text, spectrograms for audio
Key advantage	Deep cross-modal interaction at every layer
Challenge	Token count explodes with video/audio inputs