Qwen 3 — April 2025¶
What Changed¶
Alibaba's Qwen 3 is a family of eight open-weight models (Apache 2.0) released in April 2025, spanning dense architectures (0.6B to 32B) and MoE architectures (30B-A3B and the flagship 235B-A22B). Qwen 3 introduced hybrid thinking modes — a single checkpoint that switches between step-by-step reasoning and fast direct responses — and expanded language support from 29 to 119 languages. A follow-up, Qwen 3.5, was released in February 2026 with further benchmark improvements.
Key Technical Details¶
Model lineup:
| Model | Architecture | Total Params | Active Params | Context |
|---|---|---|---|---|
| Qwen3-0.6B to 32B | Dense | 0.6B–32B | All | 32K–128K |
| Qwen3-30B-A3B | MoE (128E / 8A) | 30B | 3B | 128K |
| Qwen3-235B-A22B | MoE (128E / 8A) | 235B | 22B | 128K |
All models use: Grouped Query Attention (GQA), SwiGLU activation, Rotary Positional Embeddings (RoPE), extendable to 1M tokens via YaRN.
Hybrid thinking modes: Unlike separate "thinking" and "non-thinking" model variants, Qwen 3 unifies both in a single model:
- Thinking mode: Generates internal chain-of-thought reasoning tokens before the final answer. Activated via system prompt or API parameter.
- Non-thinking mode: Skips reasoning overhead for straightforward queries.
- Thinking budget: Users can cap the number of reasoning tokens, creating a latency-quality trade-off slider.
In Plain English
Previous generations required deploying two models (e.g., a fast model and a reasoning model) and routing between them. Qwen 3 merges both into one checkpoint — the model learned when and how deeply to reason during RL training. The "thinking budget" is conceptually similar to giving the model a compute allowance: more budget = deeper reasoning = higher accuracy on hard problems, but higher latency.
Training pipeline:
- Pre-training on a large multilingual corpus
- Long-context extension via YaRN (progressive scaling of RoPE base frequency)
- Thinking-mode training: a two-stage RL process:
- Stage 1: RL with only thinking mode enabled (learns to reason)
- Stage 2: RL with both modes, teaching the model to select the appropriate mode
Qwen3-Next (September 2025): A hybrid architecture variant combining:
- GatedDeltaNet (linear attention) for efficient long-range processing
- GatedAttention (standard softmax attention) for precise local reasoning
- MoE routing with 512 experts
At sequences longer than 32K tokens, Qwen3-Next delivers 10× the throughput of Qwen3-32B by using linear attention for most of the sequence and reserving full attention for critical positions.
Benchmark Performance¶
Qwen3-235B-A22B is competitive with frontier models:
| Benchmark Category | Competitive With |
|---|---|
| Coding | DeepSeek-R1, o1 |
| Mathematics | o3-mini, Grok-3 |
| General knowledge | Gemini 2.5 Pro |
| Multilingual | SotA across 119 languages |
The small Qwen3-4B reportedly rivals Qwen2.5-72B-Instruct, demonstrating significant efficiency improvements in the training recipe.
Practical Implications¶
Unified thinking model simplifies deployment — one model serves both fast and reasoning workloads, reducing infrastructure complexity. The thinking budget provides a practical latency-cost control that aligns well with tiered API pricing.
119 languages makes Qwen 3 the broadest multilingual open model, relevant for applications in underserved language markets.
Qwen3-Next's hybrid architecture points toward a future where linear attention handles the bulk of long sequences efficiently, while full attention is reserved for positions that need it — a potential path to truly sub-quadratic LLMs.
Interview Questions
- How does Qwen 3's thinking budget mechanism work? How is it different from simply truncating the chain-of-thought output?
- Compare the two-stage RL training for hybrid thinking modes with DeepSeek-R1's approach. What does each stage optimize for?
- Explain how YaRN extends context from 128K to 1M tokens. What happens to positional encoding frequencies, and why does this preserve quality?
- What is the advantage of Qwen3-Next's hybrid linear + softmax attention over pure linear attention (like Mamba) or pure softmax attention? What are the trade-offs?
- With 128 experts and 8 active, how does the expert routing in Qwen3-235B-A22B compare to Mixtral's 8-expert/2-active design? What are the implications for load balancing?