GLM-5: Scaling Agentic Engineering with Asynchronous RL

Authors: Zhipu AI, Tsinghua University (THUDM)  |  Year: 2026  |  Venue: arXiv  |  Link: zhipuai.cn/glm-5


TL;DR

GLM-5 is a 744B-parameter MoE model (40B active per forward pass) released open-weight under MIT license in February 2026. It builds on the GLM blank-infilling lineage (GLM-130B → GLM-4 → GLM-4.6) but shifts focus from bilingual chat to agentic software engineering — sustained multi-step planning, tool orchestration, and autonomous code generation across full-stack applications. The key training innovation is Slime, an asynchronous reinforcement learning framework that decouples rollout generation from policy updates, enabling higher GPU utilization during RL fine-tuning on long-horizon agentic trajectories.

Headline results: SWE-bench Verified 77.8 (SotA open-weight), Terminal Bench 2.0 56.2, approaching Claude Opus 4.5 on software engineering tasks.

One-sentence pitch: GLM-5 proves that open-weight MoE models can reach frontier-class agentic coding performance when RL training is specifically designed for long-horizon task completion.


Why This Paper Matters

  • Agentic RL at scale: Slime demonstrates that asynchronous RL infrastructure — decoupling rollout workers from policy optimizers — is critical when reward signals come from multi-step tool interactions (compile, test, iterate) that take minutes, not milliseconds.
  • Open-weight frontier for coding: At 77.8 on SWE-bench Verified, GLM-5 is the strongest open-weight model for autonomous code repair, narrowing the gap with proprietary APIs.
  • MoE scaling trajectory: The progression from GLM-4 (dense) → GLM-4.6 (357B/32B MoE) → GLM-5 (744B/40B MoE) is a clean case study in when and how to scale MoE — active parameters grew modestly (32B → 40B) while total capacity doubled.
  • GLM objective at MoE scale: Validates that autoregressive blank infilling works at 744B with expert routing — non-trivial since the mixed attention mask (bidirectional context + causal spans) interacts with expert selection.
  • Interview relevance: Combines MoE serving costs, RL for agents, async training systems, and the GLM pretraining paradigm — topics that span ML fundamentals and systems design.

Key Concepts Explained Simply

1. Slime: Asynchronous Reinforcement Learning

Standard RL training (synchronous PPO/GRPO) follows a lockstep loop: generate rollouts → compute advantages → update policy → repeat. When rollouts involve tool execution (running compilers, test suites, web browsers), each rollout can take seconds to minutes. Synchronous training wastes GPU cycles waiting for the slowest rollout.

Slime breaks this lock:

  • Rollout workers continuously generate trajectories using a slightly stale policy copy, buffering completed episodes.
  • Policy optimizer pulls completed rollouts from the buffer and updates the policy asynchronously.
  • Staleness mitigation: importance-weighted corrections (similar to V-trace) adjust for the gap between the behavior policy (used during rollout) and the current policy (used for the gradient step).

In Plain English

Imagine a kitchen: synchronous training means every chef waits for the slowest dish before anyone starts the next course. Slime lets chefs work independently — some are still plating the previous course while others start prepping the next. A quality controller (importance weights) ensures slightly stale preparations are still usable.
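The decoupling described above can be sketched as a thread-based producer/consumer loop. This is a toy illustration of the idea, not Slime's actual implementation; `generate_rollout`, `rollout_worker`, and `optimizer_loop` are hypothetical names, and the real system would apply importance-weighted gradient updates where the comment indicates.

```python
"""Toy sketch of decoupled rollout generation vs. policy updates."""
import queue
import threading
import time

rollout_buffer = queue.Queue(maxsize=64)  # completed episodes wait here


def generate_rollout(worker_id: int, policy_version: int) -> dict:
    # Stand-in for a slow tool-interaction episode (compile, test, iterate).
    time.sleep(0.01 * (worker_id + 1))  # variable latency per worker
    return {"worker": worker_id, "policy_version": policy_version}


def rollout_worker(worker_id: int, stop: threading.Event) -> None:
    while not stop.is_set():
        # Workers use a possibly stale snapshot of the policy.
        rollout_buffer.put(generate_rollout(worker_id, policy_version=0))


def optimizer_loop(num_updates: int) -> int:
    updates = 0
    while updates < num_updates:
        rollout_buffer.get()  # pull whichever rollout finished first
        # A real optimizer would apply importance-weighted gradients here.
        updates += 1
    return updates


stop = threading.Event()
workers = [threading.Thread(target=rollout_worker, args=(i, stop), daemon=True)
           for i in range(4)]
for w in workers:
    w.start()
done = optimizer_loop(num_updates=8)
stop.set()
print(f"applied {done} updates without waiting for the slowest worker")
```

The optimizer never blocks on the slowest worker; it consumes whatever has finished, which is exactly the utilization win the kitchen analogy describes.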

2. GLM pretraining objective (inherited)

The foundational GLM training objective persists in GLM-5:

\[ \mathcal{L}_{\text{GLM}} = -\mathbb{E}\left[\sum_{s \in \mathcal{S}} \sum_{i=1}^{|s|} \log P_\theta(s_i \mid x_{\text{corrupt}}, s_{1:i-1})\right] \]

Bidirectional attention over uncorrupted tokens; causal attention within each masked span. At GLM-5's scale, this operates across 744B parameters with top-K expert routing selecting ~40B active parameters per token.
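The mixed attention pattern can be made concrete with a small mask constructor. This is an illustrative sketch of the blank-infilling mask, not GLM-5's actual code: Part A (the uncorrupted context) is fully bidirectional, while Part B (the masked spans, appended after the context) is causal and can attend to all of Part A, but Part A never attends into Part B.

```python
# Sketch of GLM's mixed attention mask (illustrative, not GLM-5's exact code).
import torch


def glm_attention_mask(n_context: int, n_span: int) -> torch.Tensor:
    n = n_context + n_span
    mask = torch.zeros(n, n, dtype=torch.bool)  # mask[i, j]: may i attend to j?
    mask[:, :n_context] = True                  # every token sees all of Part A
    mask[n_context:, n_context:] = torch.tril(  # Part B sees earlier Part B only
        torch.ones(n_span, n_span, dtype=torch.bool))
    return mask


m = glm_attention_mask(n_context=4, n_span=3)
print(m.int())
```

In an MoE layer this mask sits upstream of expert routing: each token's representation (shaped by bidirectional or causal visibility) is what the router scores, which is why the interaction between the mask and expert selection is non-trivial.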

3. MoE scaling: GLM-4.6 → GLM-5

Metric | GLM-4.6 (Sep 2025) | GLM-5 (Feb 2026)
Total parameters | 357B | 744B
Active parameters | ~32B | ~40B
Expert configuration | MoE | MoE (more experts, similar top-K)
Pre-training tokens | ~20T | 28.5T
Context length | 200K | 200K+
RL method | Standard | Slime (async)
Primary strength | Bilingual chat | Agentic engineering

The 2× scaling in total parameters with only a 1.25× increase in active parameters means serving compute scales sublinearly with capacity: the extra capacity sits in dormant experts rather than in per-token active compute.

4. Long-horizon RL reward design

Agentic tasks require reward signals from multi-step outcomes: did the code compile? Did tests pass? Did the PR merge? GLM-5's RL training uses outcome-based rewards at the end of tool-interaction trajectories, with intermediate process signals from tool feedback (compiler errors, test results) shaping exploration.

5. Thinking modes

GLM-5 supports multiple inference-time reasoning modes — users can trade latency for reasoning depth. This mirrors the hybrid thinking paradigm from Kimi K2.5 and Qwen 3, implemented through GLM-specific routing that selects between fast direct generation and extended chain-of-thought.


The Math — Explained Step by Step

1. V-trace correction for asynchronous rollouts

Let \(\pi_\theta\) be the current policy and \(\mu\) be the behavior policy that generated the rollout (a stale copy of \(\pi\)). The importance weight for token \(t\) is:

\[ \rho_t = \min\left(\bar{\rho},\, \frac{\pi_\theta(a_t \mid s_t)}{\mu(a_t \mid s_t)}\right) \]

where \(\bar{\rho}\) is a clipping threshold preventing extreme weights. The V-trace target for value estimation becomes:

\[ v_t = V(s_t) + \sum_{k=t}^{t+n-1} \gamma^{k-t} \left(\prod_{j=t}^{k-1} c_j\right) \delta_k \]

with TD errors \(\delta_k = \rho_k (r_k + \gamma V(s_{k+1}) - V(s_k))\) and trace-cutting coefficients \(c_j = \min(\bar{c}, \frac{\pi_\theta(a_j|s_j)}{\mu(a_j|s_j)})\).

Why this matters: Without importance weighting, gradients computed from stale-policy rollouts would be biased. V-trace ensures the policy update is approximately correct even when the rollout was generated by a slightly different policy — the key to making asynchronous RL work.

2. MoE routing with load balancing

For input token representation \(h\), the router computes expert affinity scores:

\[ s_i = w_i^\top h \]

Top-K experts are selected by the bias-adjusted scores \(s_i + b_i\), and the output mixes the selected experts with gates normalized over the selected set:

\[ \text{MoE}(h) = \sum_{i \in \text{TopK}} g_i \cdot E_i(h), \qquad g_i = \operatorname{Softmax}_{i \in \text{TopK}}(s_i) \]

GLM-5 uses bias-adjusted load balancing (following DeepSeek-V3's approach): \(b_i\) is nudged by \(\pm\gamma\) based on each expert's utilization, steering selection without entering the gate values, which avoids auxiliary-loss pollution of the primary objective.
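A minimal routing sketch, following the DeepSeek-V3 convention the text invokes (the bias steers which experts are selected, while gate values come from the unbiased scores). Shapes and the explicit per-token loop are simplifications for clarity, not GLM-5's actual router.

```python
# Minimal top-K MoE routing with a selection-only bias (illustrative sketch).
import torch


def moe_forward(h, W_r, bias, experts, top_k=2):
    scores = h @ W_r                                   # (batch, n_experts)
    _, idx = torch.topk(scores + bias, top_k, dim=-1)  # bias steers selection only
    gates = torch.softmax(scores.gather(-1, idx), dim=-1)  # gates from unbiased scores
    out = torch.zeros_like(h)
    for b in range(h.shape[0]):                        # explicit loop for clarity
        for k in range(top_k):
            out[b] += gates[b, k] * experts[int(idx[b, k])](h[b])
    return out, idx


torch.manual_seed(0)
d, n_experts = 8, 4
experts = [torch.nn.Linear(d, d) for _ in range(n_experts)]
h, W_r = torch.randn(3, d), torch.randn(d, n_experts)
bias = torch.zeros(n_experts)  # nudged +/- gamma by a utilization tracker
with torch.no_grad():
    out, idx = moe_forward(h, W_r, bias, experts)
print(out.shape, idx.shape)
```

Because the bias enters only the `topk` call, an over-used expert can be demoted out of the selection without distorting the mixture weights of whichever experts are chosen.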

3. Serving cost analysis

For a dense model with \(N\) parameters, forward pass FLOPs \(\approx 2N\) per token. For MoE with \(N_{\text{total}}\) parameters but \(N_{\text{active}}\) active:

\[ \text{FLOPs}_{\text{MoE}} \approx 2 N_{\text{active}} + \text{router overhead} \]

GLM-5: \(N_{\text{active}} = 40\text{B}\), so compute cost per token is comparable to a 40B dense model. However, memory must hold all \(744\text{B}\) parameters (or use expert offloading), making the memory-to-compute ratio the key serving constraint.
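The back-of-envelope arithmetic above can be checked directly. Illustrative numbers only; real serving also budgets KV cache, activations, and routing communication.

```python
# Back-of-envelope GLM-5 serving arithmetic (ignores KV cache and overheads).
total_params = 744e9
active_params = 40e9
bytes_per_param = 2  # BF16

weight_memory_tb = total_params * bytes_per_param / 1e12
flops_per_token = 2 * active_params          # dense-equivalent forward cost
gpus_for_weights = weight_memory_tb * 1e12 / 80e9  # 80 GB per A100/H100

print(f"weights: {weight_memory_tb:.2f} TB")                 # ~1.49 TB
print(f"forward FLOPs/token: {flops_per_token:.1e}")         # ~8e10
print(f"min 80GB GPUs for weights alone: {gpus_for_weights:.1f}")
```

The compute line matches a 40B dense model, while the memory line is 18.6× one GPU's capacity, which is exactly the memory-to-compute imbalance the text identifies as the binding serving constraint.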

4. Agentic trajectory reward

For a trajectory \(\tau = (a_1, o_1, a_2, o_2, \ldots, a_T, o_T)\) where \(a_i\) are model actions (generate code, call tool) and \(o_i\) are environment observations (tool outputs):

\[ R(\tau) = R_{\text{outcome}}(\tau) + \sum_{t=1}^{T} \alpha_t \cdot r_{\text{process}}(a_t, o_t) - \lambda \cdot |\tau| \]

where \(R_{\text{outcome}}\) is the final task success signal (e.g., all tests pass), \(r_{\text{process}}\) provides intermediate shaping (e.g., compilation success), and \(\lambda |\tau|\) penalizes unnecessarily long trajectories.
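The reward equation above can be written out directly. The weights \(\alpha\) and \(\lambda\) here are illustrative choices, not GLM-5's published values, and the constant-\(\alpha\) simplification collapses the per-step \(\alpha_t\) schedule.

```python
# Sketch of the trajectory reward R(tau) defined above (illustrative weights).
def trajectory_reward(outcome_success: bool,
                      process_signals: list[float],
                      alpha: float = 0.1,
                      length_penalty: float = 0.01) -> float:
    r_outcome = 1.0 if outcome_success else 0.0       # R_outcome: tests pass
    r_process = alpha * sum(process_signals)          # e.g. +1 per clean compile
    r_length = length_penalty * len(process_signals)  # lambda * |tau|
    return r_outcome + r_process - r_length


# Trajectory: 3 tool calls, two clean compiles, all tests pass at the end.
r = trajectory_reward(outcome_success=True, process_signals=[1.0, 0.0, 1.0])
print(r)  # ~1.17: 1.0 outcome + 0.2 process - 0.03 length penalty
```

The length penalty is what discourages the model from padding trajectories with redundant tool calls that each earn small process rewards.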


Python Implementation

The following demonstrates asynchronous rollout buffering with V-trace importance weighting — the core mechanism behind Slime's decoupled training.

"""
Simplified Slime-style async RL buffer with V-trace corrections.
Educational only — not the full GLM-5 training stack.
"""
from __future__ import annotations

import random
from collections import deque
from dataclasses import dataclass

import torch


@dataclass
class Rollout:
    states: torch.Tensor              # (T, d_state)
    actions: torch.Tensor             # (T,) token ids
    rewards: torch.Tensor             # (T,)
    log_probs_behavior: torch.Tensor  # (T,) log mu(a|s) from the stale behavior policy
    done: bool = False


class AsyncRolloutBuffer:
    """FIFO buffer that rollout workers push into and the optimizer samples from."""

    def __init__(self, max_size: int = 256):
        self._buffer: deque[Rollout] = deque(maxlen=max_size)

    def push(self, rollout: Rollout) -> None:
        if rollout.done:
            self._buffer.append(rollout)

    def sample(self, batch_size: int) -> list[Rollout]:
        return random.sample(list(self._buffer), min(batch_size, len(self._buffer)))

    def __len__(self) -> int:
        return len(self._buffer)


def vtrace_targets(
    log_probs_current: torch.Tensor,
    log_probs_behavior: torch.Tensor,
    rewards: torch.Tensor,
    values: torch.Tensor,
    gamma: float = 0.99,
    rho_bar: float = 1.0,
    c_bar: float = 1.0,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Compute V-trace targets and advantages for off-policy correction.

    Args:
        log_probs_current: log pi_theta(a|s) for current policy, shape (T,)
        log_probs_behavior: log mu(a|s) for behavior policy, shape (T,)
        rewards: per-step rewards, shape (T,)
        values: value estimates V(s), shape (T+1,) (includes bootstrap)
        gamma: discount factor
        rho_bar: importance weight clipping for targets
        c_bar: importance weight clipping for trace coefficients

    Returns:
        (vtrace_targets, advantages) each shape (T,)
    """
    T = rewards.shape[0]
    importance_weights = torch.exp(log_probs_current - log_probs_behavior)
    rho = torch.clamp(importance_weights, max=rho_bar)
    c = torch.clamp(importance_weights, max=c_bar)

    td_errors = rho * (rewards + gamma * values[1:] - values[:T])

    vs = torch.zeros(T + 1)
    vs[T] = values[T]
    for t in reversed(range(T)):
        vs[t] = values[t] + td_errors[t] + gamma * c[t] * (vs[t + 1] - values[t + 1])

    advantages = rho * (rewards + gamma * vs[1:] - values[:T])
    return vs[:T], advantages


def async_training_step(
    buffer: AsyncRolloutBuffer,
    policy_log_prob_fn,
    value_fn,
    batch_size: int = 8,
) -> torch.Tensor | None:
    """One async update step: sample from buffer, compute V-trace, return loss."""
    if len(buffer) < batch_size:
        return None

    rollouts = buffer.sample(batch_size)
    total_loss = torch.tensor(0.0)

    for rollout in rollouts:
        log_probs_current = policy_log_prob_fn(rollout.states, rollout.actions)
        values = value_fn(rollout.states)
        bootstrap = torch.zeros(1)  # terminal state: bootstrap value of 0
        values_extended = torch.cat([values, bootstrap])

        _, advantages = vtrace_targets(
            log_probs_current,
            rollout.log_probs_behavior,
            rollout.rewards,
            values_extended,
        )
        total_loss += -(advantages.detach() * log_probs_current).mean()

    return total_loss / len(rollouts)


if __name__ == "__main__":
    T, d = 20, 64
    buf = AsyncRolloutBuffer()
    for _ in range(16):
        buf.push(Rollout(
            states=torch.randn(T, d),
            actions=torch.randint(0, 1000, (T,)),
            rewards=torch.zeros(T),
            log_probs_behavior=torch.randn(T),
            done=True,
        ))
    print(f"Buffer size: {len(buf)}")
    print("V-trace async buffer ready for training steps.")

Interview Importance

GLM-5 sits at the intersection of four major interview themes: (1) MoE architecture and serving, (2) RL for agents, (3) asynchronous distributed training, and (4) the GLM pretraining paradigm. Expect questions comparing GLM-5's approach to DeepSeek-R1's GRPO and Kimi K2.5's PARL — all solve "RL for long-horizon reasoning" differently.

Drill themes: (1) Sync vs async RL — when does staleness hurt? (2) MoE memory vs compute cost. (3) Agentic reward design — outcome vs process signals. (4) GLM objective at MoE scale — does expert routing interact with the mixed attention mask?


Interview Questions & Answers (6 Q&As)

Q1: How does Slime's asynchronous RL differ from synchronous PPO/GRPO, and when is async preferable? A: Synchronous RL blocks the optimizer until all rollouts complete — wasteful when rollouts involve variable-latency tool calls. Slime decouples generation from optimization: rollout workers produce trajectories into a buffer, and the optimizer pulls batches asynchronously. V-trace importance weights correct for policy staleness. Async is preferable when rollout latency is high and variable (agentic tasks with tool use), and less critical when rollouts are fast and uniform (pure text generation).

Q2: GLM-5 has 744B total but 40B active parameters. What determines serving cost — total or active params? A: Compute (FLOPs per token) scales with active parameters (~40B), so latency is comparable to a 40B dense model. Memory must accommodate all 744B parameters (or use expert offloading with latency penalty). The binding constraint depends on the deployment: GPU-memory-bound setups are limited by total params; compute-bound batch inference is limited by active params. Expert parallelism across GPUs can distribute the memory burden.

Q3: Compare the GLM pretraining objective with standard CLM. Why might blank infilling be better for agentic tasks? A: CLM sees only left context; GLM sees bidirectional context around masked spans and generates spans autoregressively. For agentic tasks like code repair (SWE-bench), the model must understand surrounding code (bidirectional) and generate a patch (autoregressive) — naturally matching the blank-infilling paradigm. Pure CLM must learn this mapping implicitly from next-token prediction alone.

Q4: How does V-trace handle policy staleness in Slime? A: V-trace clips importance ratios \(\rho_t = \min(\bar{\rho}, \pi/\mu)\) to prevent high-variance updates from very stale rollouts. Trace-cutting coefficients \(c_j\) further limit how far corrections propagate through time. The result is a biased but low-variance target that converges to on-policy values as staleness decreases. The clip thresholds \(\bar{\rho}\) and \(\bar{c}\) are hyperparameters trading bias vs variance.

Q5: What infrastructure is needed to serve GLM-5 on-premise? A: 744B parameters in BF16 require ~1.5TB of model weights. With 80GB A100/H100 GPUs, you need at least 20 GPUs for weights alone, plus KV cache memory for 200K context. Expert parallelism distributes different experts across GPUs; tensor parallelism splits individual expert FFNs. Practical setups combine both with pipeline parallelism across nodes. Expert-aware batching groups tokens routed to the same experts to minimize communication.

Q6: Compare GLM-5's Slime, DeepSeek-R1's GRPO, and Kimi K2.5's PARL — three approaches to RL for LLMs. A: GRPO (DeepSeek-R1): critic-free, group-relative advantages across samples for the same prompt — solves "how to compute advantages without a value model." Slime (GLM-5): asynchronous training infrastructure — solves "how to keep GPUs utilized when rollouts involve slow tool execution." PARL (Kimi K2.5): parallel agent coordination — solves "how to train a coordinator that decomposes tasks for parallel sub-agents." They're complementary: you could combine GRPO advantages with Slime async infrastructure, or train PARL coordination using either.


Connections to Other Papers

Paper / Line | Connection to GLM-5
GLM-4 / ChatGLM | Direct predecessor — shares the blank-infilling objective; GLM-5 scales it to 744B MoE with agentic RL.
GLM-4.6 | Intermediate scaling step (357B/32B MoE); GLM-5 doubles capacity while shifting focus from bilingual chat to agentic coding.
DeepSeek-V3 | Shared MoE techniques: bias-adjusted load balancing, large expert pools. GLM-5 applies similar routing at 744B scale.
DeepSeek-R1 | Both use RL for reasoning; R1 uses synchronous GRPO, GLM-5 uses async Slime — different systems solutions to long-horizon RL.
Kimi K2.5 | Both target agentic capabilities; K2.5 via parallel sub-agent orchestration (PARL), GLM-5 via single-agent tool mastery with async RL.
IMPALA / V-trace | Slime's off-policy correction derives from the IMPALA V-trace framework, adapted from game RL to LLM agentic training.
T5 | Shares span-corruption heritage — T5 uses encoder-decoder span corruption, GLM uses single-stack blank infilling.

Key Takeaways for Quick Review

Topic | One-liner
Scale | 744B total, 40B active MoE; MIT license; 28.5T pretraining tokens.
Slime | Async RL: decouple rollout generation from policy updates; V-trace for staleness correction.
Agentic focus | RL reward from multi-step tool outcomes (compile → test → iterate), not single-turn helpfulness.
GLM objective | Inherited autoregressive blank infilling: bidirectional context + causal span generation.
MoE routing | Bias-adjusted load balancing (no auxiliary loss), following DeepSeek-V3's approach.
Serving | Compute ≈ 40B dense model; memory ≈ 1.5TB for full weights; expert + tensor parallelism required.
SWE-bench | 77.8 Verified — SotA open-weight, approaching Claude Opus 4.5.
Thinking modes | Multiple inference-time reasoning depths; user-controlled latency-quality trade-off.
vs R1 | R1 = GRPO advantages; GLM-5 = async training systems; complementary innovations.
Interview frame | MoE costs, async RL, agentic rewards, GLM objective — four themes in one model.