Gemini 3.1 Pro — February 2026¶
What Changed¶
Gemini 3.1 Pro scored 77.1% on ARC-AGI-2 and 94.3% on GPQA Diamond — tying GPT-5.4 Pro on the Intelligence Index at roughly one-third the cost per token. It cemented Google's position at the frontier and demonstrated that compute efficiency had become as important a competitive axis as raw benchmark performance.
Key Technical Details¶
ARC-AGI-2 tests novel reasoning patterns not seen in training — it is specifically designed to resist memorization. High performance requires genuine compositional generalization:
A model that memorizes training distributions fails ARC-AGI-2 even at very large scale.
In Plain English
Compositional generalization means solving new puzzles built from familiar primitives in unfamiliar combinations. If the model only pattern-matches training examples, ARC-AGI-2 punishes that — it rewards rule induction and transfer, not recall.
Technical Details
- Cost efficiency: ~3x cheaper per token than GPT-5.4 Pro at comparable quality — a meaningful production trade-off.
- Native multimodal: text, images, audio, video; continuation of the Gemini native-multimodal approach.
- Grounded search: integrated Google Search grounding produces citations and factual grounding in outputs.
Practical Implications¶
When cost per token drops at fixed quality, re-evaluate batch jobs (summarization, classification) that were previously uneconomical at frontier quality. For grounded search, design prompts that require citations and validate URLs in post-processing. For ARC-style reasoning, prefer evaluations that mix novel rule systems over static knowledge benchmarks alone.
Interview Questions
- Why is ARC-AGI-2 considered a harder benchmark than MMLU or GPQA? What reasoning failure mode does it specifically target?
- How does cost per token become a competitive axis at the frontier, and what architectural choices drive inference efficiency at scale?
Code Example¶
Illustrative grounded generation request (parameters and names vary by API version):