
GLM-4.6 — September 2025

What Changed

Zhipu AI's GLM-4.6 is a 357B-parameter MoE model (32B active per token) released open-weight under the MIT license in September 2025. It extends context from 128K tokens (GLM-4.5) to 200K, with 128K-token output support — enabling long-form generation at a scale most models cannot match. The GLM family traces back to the original GLM (General Language Model) pretraining scheme, which uses autoregressive blank infilling rather than a purely causal or masked LM pretraining objective.

Key Technical Details

GLM pretraining objective: the model predicts shuffled spans in a document using bidirectional attention for the context and causal attention for each span:

\[ \mathcal{L}_{\text{GLM}} = -\mathbb{E}\left[\sum_{i=1}^{|S|} \log P_\theta(s_i \mid x_{\text{corrupt}}, s_{<i})\right] \]

where \(S = \{s_1, \dots, s_{|S|}\}\) is the (shuffled) set of masked spans, \(s_{<i}\) denotes the spans generated before span \(i\), and \(x_{\text{corrupt}}\) is the text with those spans replaced by mask tokens.
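
To make the corruption concrete, here is a minimal sketch of how a GLM-style training example could be assembled — the token lists, span format, and helper name are ours for illustration, not GLM's actual data pipeline:

```python
import random

MASK, START, END = "[MASK]", "[S]", "[E]"

def glm_corrupt(tokens, spans):
    """Build a GLM-style example: Part A is the corrupted text (each
    masked span collapsed to one [MASK]); Part B holds the spans in
    shuffled order, each prefixed with [S], to be generated left to
    right against targets that end with [E]."""
    part_a, cursor = [], 0
    for start, end in spans:               # spans are (start, end) index pairs
        part_a += tokens[cursor:start] + [MASK]
        cursor = end
    part_a += tokens[cursor:]

    order = list(range(len(spans)))
    random.shuffle(order)                  # spans are predicted in random order
    part_b, targets = [], []
    for i in order:
        start, end = spans[i]
        part_b += [START] + tokens[start:end]
        targets += tokens[start:end] + [END]
    return part_a, part_b, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
part_a, part_b, targets = glm_corrupt(tokens, [(1, 3), (6, 8)])
print(part_a)  # ['the', '[MASK]', 'fox', 'jumps', 'over', '[MASK]', 'dog']
```

Each shuffled ordering of the spans corresponds to one sample of the expectation in the loss above.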

In Plain English

GLM unifies the MLM objective (bidirectional context, like BERT) with the autoregressive objective (left-to-right generation within each span, like GPT) in a single pretraining task. The model sees full surrounding context when encoding, but generates each masked span sequentially — training understanding and generation jointly.
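
The "bidirectional context, causal spans" idea reduces to a single attention mask. A small sketch, assuming Part A (corrupted context) comes first and Part B (spans being generated) second; the function name and boolean-matrix representation are illustrative:

```python
import numpy as np

def glm_attention_mask(len_a, len_b):
    """True where position (row) may attend to position (col).
    Part A attends bidirectionally to itself; Part B attends to all of
    Part A and causally to earlier Part B positions; Part A never sees
    Part B."""
    n = len_a + len_b
    mask = np.zeros((n, n), dtype=bool)
    mask[:, :len_a] = True                           # everyone sees Part A
    b = np.arange(len_b)
    mask[len_a:, len_a:] = b[:, None] >= b[None, :]  # causal within Part B
    mask[:len_a, len_a:] = False                     # Part A blind to Part B
    return mask

mask = glm_attention_mask(len_a=3, len_b=2)
```

One mask thus gives BERT-style encoding over the context and GPT-style decoding over each span.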

Agentic capabilities in GLM-4.6 are enhanced via RL-trained tool use: the model autonomously plans, invokes tools, and coordinates across tool calls without explicit orchestration code.
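
The plan-invoke-observe cycle can be sketched as a minimal loop. This is conceptual only: `model` stands in for a GLM-4.6 chat call, and the reply shapes (`{'tool': ..., 'args': ...}` / `{'answer': ...}`) are our illustration, not the vendor's wire format:

```python
import json

def run_agent(model, tools, user_msg, max_steps=5):
    """Minimal agent loop: the model either returns a final answer or
    requests a tool call; we execute the tool and feed the result back
    as a new message until the model answers or the budget runs out."""
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        reply = model(messages)
        if "answer" in reply:
            return reply["answer"]
        result = tools[reply["tool"]](**reply["args"])  # invoke requested tool
        messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("agent exceeded max_steps without answering")
```

RL training pushes the tool-selection and stopping decisions into the model itself, so the host loop stays this thin.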

Technical Details
  • Architecture: MoE Transformer, 357B total / ~32B active parameters.
  • Context: 200K input, 128K output — one of the highest output-token limits available.
  • License: MIT — commercially permissive.
  • Languages: strong Chinese and English, 24 additional languages.
  • GLM-Z1: companion reasoning model with "deep thinking" mode.
  • GLM-4-32B: a dense 32B variant trained on 15T tokens, competitive with much larger models.
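
The "357B total / ~32B active" split comes from top-K expert routing. A conceptual sketch of such a router — the gating scheme shown (top-k over logits, renormalized softmax) is a generic MoE pattern, not GLM-4.6's documented internals:

```python
import numpy as np

def top_k_gate(logits, k=2):
    """Conceptual MoE router: pick the top-k experts per token and
    renormalize their gate weights with a softmax; only those experts'
    FFNs run for that token."""
    idx = np.argsort(logits)[..., ::-1][..., :k]      # top-k expert ids
    gates = np.take_along_axis(logits, idx, axis=-1)
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)
    return idx, gates

logits = np.array([[0.1, 2.0, -1.0, 0.5]])  # one token, 4 expert logits
idx, gates = top_k_gate(logits, k=2)
print(idx[0])  # [1 3]
```

Per token, only the selected experts' weights enter the forward pass, which is why active parameters, not total, set the FLOPs bill.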

Practical Implications

128K output enables single-session artifacts (books, large codebases, data dumps) that previously required chunking and stitching — but also raises moderation and storage obligations. MoE serving needs expert-aware batching: not all 357B parameters run per token — monitor routing imbalance. Open weights ease on-prem deployment; still verify license terms for your use case.
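
One way to monitor the routing imbalance mentioned above: compare the busiest expert's load to the mean load per batch. A minimal sketch (the metric definition is our choice, not a GLM-specific one):

```python
import numpy as np

def routing_imbalance(expert_ids, num_experts):
    """Max-over-mean expert load for one batch of routed tokens.
    expert_ids: integer array of shape (tokens, top_k).
    Returns 1.0 for perfectly balanced routing; larger means hot experts."""
    counts = np.bincount(expert_ids.ravel(), minlength=num_experts)
    return counts.max() / counts.mean()

# 1024 tokens, top-2 routing over 8 experts (synthetic assignments)
rng = np.random.default_rng(0)
assignments = rng.integers(0, 8, size=(1024, 2))
print(routing_imbalance(assignments, 8))
```

A sustained ratio well above 1.0 means some experts bottleneck the batch while others idle — the signal that expert-aware batching or capacity limits are needed.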

Interview Questions

  1. How does the GLM pretraining objective differ from BERT's MLM and GPT's causal LM? What tasks does each objective best prepare the model for?
  2. Why is a 128K output token limit practically significant, and what types of tasks become possible that were not possible with a 4K output limit?
  3. How does MoE sparsity (32B active out of 357B total) affect per-token inference cost vs. a dense 32B model?

Code Example

Illustrative comparison of total vs. active parameters (conceptual):

Total parameters:    357B  (all experts stored)
Active per token:    ~32B  (top-K experts per layer)
Dense baseline:      32B   (full FFN every layer — different FLOPs profile)
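
The back-of-envelope arithmetic behind that comparison, using only the figures from this note (the bf16 sizing and "FLOPs scale with active parameters" rule of thumb are standard approximations, not vendor numbers):

```python
# Per-token compute scales with ACTIVE parameters; memory footprint
# scales with TOTAL parameters (figures from the text above).
total_params  = 357e9   # all experts stored
active_params = 32e9    # top-K experts actually run per token
dense_params  = 32e9    # dense 32B baseline

print(f"active fraction: {active_params / total_params:.1%}")  # ~9.0%
print(f"weights at bf16: {total_params * 2 / 1e9:.0f} GB")     # 714 GB
# Per-token FLOPs land near the dense 32B model (roughly 2 * active
# params per token), but serving still pays to store and route all 357B.
```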

For integration, use the vendor's OpenAI-compatible or native API as documented for GLM-4.6; when requesting long generations, set max_tokens with the 128K output ceiling in mind.