Gemini: A Family of Highly Capable Multimodal Models¶
Authors: Gemini Team, Google
Year: 2023 | Venue: arXiv / Technical Report
Link: arXiv:2312.11805
TL;DR¶
Gemini trains natively multimodal models over text, images, audio, and video interleaved in context — from the ground up, not by bolting a frozen vision encoder onto a text decoder. The family includes Ultra, Pro, and Nano sizes, with benchmark-leading results across reasoning, code, multimodal understanding, and tool use. Gemini represents Google DeepMind's unified approach to general AI models.
Why This Paper Matters¶
Gemini represents the state of the art in multimodal AI:
- Native multimodal: All modalities processed in one unified architecture (not stitched together)
- Interleaved context: Can process sequences mixing text, images, audio, and video tokens
- Long context: Extended to million-token context windows in later versions
- Safety and deployment: Comprehensive safety evaluation across all modalities
- Product integration: Powers Google's AI products (Bard/Gemini app, Workspace, etc.)
Key Concepts Explained Simply¶
Native Multimodal vs. Stitched¶
Stitched approach (e.g., LLaVA, GPT-4V initial): - Train a vision encoder (CLIP) separately - Train a text decoder (LLaMA) separately - Connect them with a thin projection layer - Each component was optimized for its own modality
Native approach (Gemini): - Train one model from scratch on interleaved multimodal data - The model learns cross-modal relationships during pre-training - Text, images, audio, and video are all tokenized and processed together - Cross-modal interactions are learned deeply, not just at the projection layer
Multimodal Tokenization¶
Different modalities are converted to tokens: - Text: Standard BPE tokenization - Images: Vision Transformer (ViT) patches → tokens - Audio: Spectrograms → tokens (via encoder) - Video: Frame-level patches → tokens (with temporal encoding)
All tokens enter the same Transformer decoder — the model doesn't know "this is an image" vs. "this is text" except through the token embeddings.
Model Sizes¶
| Model | Use Case |
|---|---|
| Ultra | Most capable — research benchmarks, complex reasoning |
| Pro | Best balance of quality and efficiency — production API |
| Nano | On-device — mobile, edge deployment |
The Math — Explained Step by Step¶
Unified Autoregressive Objective¶
All modalities are flattened into a single token sequence \(z\):
Breaking it down:
- \(z = (z_1, z_2, \ldots, z_T)\): A sequence mixing text tokens, image patch tokens, audio tokens, etc.
- Same autoregressive objective as GPT — predict the next token
- The model learns when a text token follows an image patch, how they relate
- Modality-specific embedding layers convert each type of input to the shared dimension
Cross-Modal Attention¶
In the Transformer layers, self-attention operates over all tokens regardless of modality:
An image token at position \(i\) can attend to a text token at position \(j\) with the same attention mechanism. This is how the model learns cross-modal relationships (e.g., "a photo of a sunset" should attend to the actual sunset image patches).
Vision Tokenization¶
An image is split into \(P \times P\) patches (e.g., 16×16), each patch is linearly projected:
For a 256×256 image with 16×16 patches: \(256/16 = 16\) patches per side → 256 visual tokens per image.
Video Extension¶
Video adds temporal encoding:
With temporal subsampling to control token count.
Python Implementation¶
import numpy as np
def stable_softmax(logits):
z = logits - np.max(logits, axis=-1, keepdims=True)
e = np.exp(z)
return e / np.sum(e, axis=-1, keepdims=True)
def image_to_patches(image, patch_size=16):
"""
Convert an image into patches (ViT-style).
image: [H, W, C]
Returns: [n_patches, patch_size * patch_size * C]
"""
H, W, C = image.shape
n_h = H // patch_size
n_w = W // patch_size
patches = []
for i in range(n_h):
for j in range(n_w):
patch = image[
i*patch_size:(i+1)*patch_size,
j*patch_size:(j+1)*patch_size,
:
]
patches.append(patch.flatten())
return np.array(patches)
def embed_patches(patches, W_proj, pos_embedding):
"""
Project patches and add positional embeddings.
patches: [n_patches, patch_dim]
W_proj: [patch_dim, d_model]
pos_embedding: [n_patches, d_model]
"""
return patches @ W_proj + pos_embedding
def interleave_tokens(text_tokens, image_tokens, modality_embeddings):
"""
Interleave text and image tokens for multimodal input.
Returns combined sequence with modality type indicators.
"""
text_emb = modality_embeddings["text"]
image_emb = modality_embeddings["image"]
sequence = []
types = []
# Image tokens first, then text (simplified layout)
for tok in image_tokens:
sequence.append(tok + image_emb)
types.append("image")
for tok in text_tokens:
sequence.append(tok + text_emb)
types.append("text")
return np.array(sequence), types
def cross_modal_attention(query_tokens, kv_tokens, d_model):
"""
Cross-modal attention: tokens from one modality attend to another.
"""
d_k = d_model
scores = (query_tokens @ kv_tokens.T) / np.sqrt(d_k)
attn = stable_softmax(scores)
return attn @ kv_tokens
def video_temporal_sampling(frames, sample_rate=4):
"""
Subsample video frames for token efficiency.
frames: list of [H, W, C] arrays
"""
sampled = frames[::sample_rate]
return sampled
class SimpleMultimodalModel:
"""Simplified native multimodal model."""
def __init__(self, d_model=64, vocab_size=1000, patch_dim=768):
self.d_model = d_model
self.text_embedding = np.random.randn(vocab_size, d_model) * 0.02
self.patch_proj = np.random.randn(patch_dim, d_model) * 0.02
self.modality_emb = {
"text": np.random.randn(d_model) * 0.02,
"image": np.random.randn(d_model) * 0.02,
"audio": np.random.randn(d_model) * 0.02,
}
def encode_text(self, token_ids):
return self.text_embedding[token_ids] + self.modality_emb["text"]
def encode_image(self, patches):
return patches @ self.patch_proj + self.modality_emb["image"]
def forward(self, text_ids, image_patches=None):
"""
Process interleaved text and image tokens.
"""
tokens = []
token_types = []
if image_patches is not None:
img_tokens = self.encode_image(image_patches)
for t in img_tokens:
tokens.append(t)
token_types.append("image")
text_tokens = self.encode_text(text_ids)
for t in text_tokens:
tokens.append(t)
token_types.append("text")
sequence = np.array(tokens)
# Self-attention (all tokens attend to all — cross-modal)
d_k = self.d_model
scores = (sequence @ sequence.T) / np.sqrt(d_k)
# Causal mask
mask = np.triu(np.full(scores.shape, -1e9), k=1)
scores = scores + mask
attn = stable_softmax(scores)
output = attn @ sequence
return output, token_types
def compare_approaches():
"""Compare native multimodal vs stitched approach."""
approaches = {
"Stitched (CLIP + LLM)": {
"training": "Separate pre-training, then alignment",
"cross_modal": "Only at projection layer",
"modalities": "Image + Text (typically)",
"flexibility": "Easy to swap components",
"depth": "Shallow cross-modal interaction",
},
"Native (Gemini)": {
"training": "Joint pre-training from scratch",
"cross_modal": "Throughout all layers",
"modalities": "Text + Image + Audio + Video",
"flexibility": "Requires full retraining",
"depth": "Deep cross-modal interaction",
},
}
print("--- Native vs. Stitched Multimodal ---")
for name, attrs in approaches.items():
print(f"\n {name}:")
for k, v in attrs.items():
print(f" {k}: {v}")
def token_count_analysis():
"""Analyze token counts for multimodal inputs."""
scenarios = [
("Text-only (1K words)", 1300, 0, 0),
("Image + caption", 50, 256, 0),
("10 images + description", 200, 2560, 0),
("1 min video (1fps)", 100, 0, 60 * 256),
("1 min video (4fps)", 100, 0, 240 * 256),
]
print("\n--- Token Count Analysis ---")
print(f"{'Scenario':<30} {'Text':>8} {'Image':>8} {'Video':>10} {'Total':>10}")
print("-" * 70)
for name, text, image, video in scenarios:
total = text + image + video
print(f"{name:<30} {text:>8} {image:>8} {video:>10} {total:>10,}")
# --- Demo ---
if __name__ == "__main__":
np.random.seed(42)
# Image to patches
print("--- Image Tokenization ---")
image = np.random.randn(64, 64, 3)
patches = image_to_patches(image, patch_size=16)
print(f"Image {image.shape} → {patches.shape[0]} patches of dim {patches.shape[1]}")
# Multimodal forward pass
print("\n--- Multimodal Forward Pass ---")
model = SimpleMultimodalModel(d_model=32, vocab_size=100, patch_dim=patches.shape[1])
text_ids = np.array([10, 20, 30, 40, 50])
output, types = model.forward(text_ids, patches)
print(f"Output shape: {output.shape}")
print(f"Token types: {types}")
n_image = sum(1 for t in types if t == "image")
n_text = sum(1 for t in types if t == "text")
print(f"Image tokens: {n_image}, Text tokens: {n_text}")
# Comparison
compare_approaches()
# Token analysis
token_count_analysis()
# Video sampling
print("\n--- Video Frame Sampling ---")
frames = [np.random.randn(64, 64, 3) for _ in range(60)]
for rate in [1, 2, 4, 8]:
sampled = video_temporal_sampling(frames, sample_rate=rate)
patches_per_frame = (64 // 16) ** 2
total_tokens = len(sampled) * patches_per_frame
print(f" Sample rate {rate}: {len(sampled)} frames, "
f"{total_tokens} visual tokens")
Interview Importance¶
Gemini is relevant for understanding multimodal AI systems, which is increasingly important as all major models become multimodal.
Difficulty Level: ⭐⭐⭐ (Medium)¶
Interview Questions & Answers¶
Q1: Compare native multimodal pretraining to frozen encoder + LLM stacks.¶
Answer: - Native (Gemini): All modalities trained together from scratch. Deep cross-modal interactions at every layer. Better at tasks requiring tight vision-language integration. Drawback: expensive to train, can't easily swap components. - Stitched (LLaVA, GPT-4V initial): Use pre-trained frozen vision encoder (CLIP) + pre-trained LLM, connected by a learned projection. Cheaper to build — leverage existing strong components. Cross-modal interaction is shallow (only at the projection layer). Good "enough" for many tasks. - Trend: Models are moving toward native multimodal training as compute becomes more available.
Q2: What evaluation challenges are unique to video or audio vs. images?¶
Answer: 1. Temporal understanding: Video requires understanding events over time (before/after, cause/effect) 2. Token cost: Video generates orders of magnitude more tokens than images — need efficient sampling 3. Audio-visual alignment: Matching what's heard with what's seen (e.g., lip sync, sound localization) 4. Benchmark scarcity: Fewer high-quality video/audio benchmarks compared to image benchmarks 5. Subjectivity: Audio/video understanding often involves subjective judgments (mood, tone) 6. Real-time processing: Audio/video applications often require low-latency processing
Q3: How would you design red-teaming for a multimodal product?¶
Answer: 1. Cross-modal attacks: Embed harmful text in images (adversarial examples, text overlays) 2. Modality-specific harms: Test for harmful image generation, deepfake detection failures 3. Jailbreaking: Attempt to bypass safety filters by encoding harmful prompts as images 4. Cultural sensitivity: Test with culturally diverse images that may be misinterpreted 5. Privacy: Test for ability to identify real individuals from photos 6. Audio attacks: Adversarial audio, voice cloning, impersonation 7. Accessibility: Ensure the model doesn't discriminate based on visual disabilities in images 8. Multi-hop attacks: Combine benign image + benign text to create harmful output
Connections to Other Papers¶
- CLIP → Gemini's native approach contrasts with CLIP's contrastive dual-encoder approach
- PaLM → Gemini is PaLM's successor at Google
- Transformer → Same underlying architecture, extended to multimodal tokens
- FlashAttention → Critical for handling the long sequences from multimodal inputs
- Constitutional AI → Safety evaluation per modality builds on alignment work
Key Takeaways for Quick Review¶
| Concept | Remember |
|---|---|
| Approach | Native multimodal pre-training (not stitched) |
| Modalities | Text, images, audio, video — interleaved in context |
| Architecture | Unified Transformer decoder over all modality tokens |
| Model sizes | Ultra (largest), Pro (balanced), Nano (on-device) |
| Tokenization | ViT patches for images, BPE for text, spectrograms for audio |
| Key advantage | Deep cross-modal interaction at every layer |
| Challenge | Token count explodes with video/audio inputs |