ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

Authors: Zhipu AI, Tsinghua University (THUDM)  |  Year: 2024  |  Venue: arXiv  |  Link: arXiv:2406.12793


TL;DR

The GLM series introduces a third pretraining paradigm — autoregressive blank infilling — alongside BERT’s masked language modeling (MLM) and GPT’s causal language modeling (CLM). The model masks random spans, uses bidirectional attention over uncorrupted context, and generates each masked span autoregressively left-to-right. GLM-4 scales this recipe to frontier quality with strong Chinese–English bilingual performance, 128K context, and native tool use (“All Tools”). GLM-4.6 extends the line to a 357B mixture-of-experts (MoE) model with 200K input and 128K output tokens.


Why This Paper Matters

GLM sits at the intersection of encoder-style understanding and decoder-style generation: one objective trains both, which matters when you want a single stack for retrieval-augmented workflows, dialogue, and code. For interviews, it is a clean contrast to BERT (bidirectional but non-generative) and GPT (generative but unidirectional over the full sequence). The paper also documents tool-augmented training (browser, code interpreter, custom tools) and RL for autonomous tool invocation—directly comparable to modern agent stacks. Finally, bilingual data curation and long context are recurring system-design themes (KV memory, attention complexity, routing in MoE).


Key Concepts Explained Simply

1. GLM pretraining objective (autoregressive blank infilling)

Randomly sample one or more spans in the document and replace each span with a special mask token (e.g. [MASK]). The uncorrupted tokens keep bidirectional visibility so the model can use full context to predict spans. Each span is predicted token-by-token with causal attention inside that span—like a tiny LM “inside” the hole. Positions are tracked with 2D encoding: document position (where the token sits in the original text) and span position (index within the span being generated).

2. BERT MLM vs GPT CLM vs GLM

Paradigm | Context for prediction | Generation
BERT / MLM | Bidirectional over visible tokens | No autoregressive generation of full text
GPT / CLM | Causal (left-to-right) over entire sequence | Strong open-ended generation
GLM | Bidirectional on context + causal within each masked span | Unifies understanding and span generation in one loss

GLM trains both understanding-like context use and generation of missing pieces, using mixed attention masks rather than only full-sequence causality or only MLM independence.

3. All Tools capability

GLM-4 can autonomously invoke a web browser, code interpreter, and user-defined tools without hand-written orchestration scripts for each step. Tool use is improved with reinforcement learning so the model learns when to call tools and how to chain results—closer to an agent than a single-turn chat completion.
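To make the contrast with hand-written orchestration concrete, here is a minimal sketch of the kind of dispatch loop that "All Tools" replaces. The names (run_python, agent_step) and the "tool:&lt;name&gt;:&lt;arg&gt;" emission format are invented for illustration; GLM-4 learns tool invocation end-to-end rather than routing through code like this.

```python
# Hypothetical tool-dispatch loop; all names and formats here are assumptions,
# not GLM-4's actual interface.
from typing import Callable


def run_python(expr: str) -> str:
    """Stand-in for a sandboxed code interpreter (expressions only)."""
    return str(eval(expr, {"__builtins__": {}}))


TOOLS: dict[str, Callable[[str], str]] = {"python": run_python}


def agent_step(model_output: str) -> str:
    """Route 'tool:<name>:<arg>' emissions to a tool; pass plain text through."""
    if model_output.startswith("tool:"):
        _, name, arg = model_output.split(":", 2)
        return TOOLS[name](arg)
    return model_output


print(agent_step("tool:python:2 + 3"))  # "5"
print(agent_step("plain answer"))       # "plain answer"
```

The RL-trained model internalizes both decisions this router makes explicitly: when a tool call is warranted, and how to format the call.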

4. Bilingual strength

Strong Chinese and English results come from multilingual pretraining with careful data curation (quality filtering, deduplication, domain mix), not from English-only scaling alone.

5. GLM-4.6 at a glance

Reported configuration highlights include 357B total parameters in an MoE layout with about 32B active parameters per forward pass, an MIT license, and 200K input / 128K output tokens, emphasizing long-horizon reasoning and generation under practical routing constraints.


The Math — Explained Step by Step

GLM loss (autoregressive span prediction)

Let \(x\) be the original token sequence. After corruption, denote by \(x_{\text{corrupt}}\) the sequence with sampled spans replaced by masks. For each sampled span \(s\) (tokens \(s_1,\ldots,s_{|s|}\)), GLM minimizes the negative log-likelihood:

\[ \mathcal{L}_{\text{GLM}} = - \mathbb{E}\left[ \sum_{s \in \mathcal{S}} \sum_{i=1}^{|s|} \log P\bigl(s_i \mid x_{\text{corrupt}}, s_{1:i-1}\bigr) \right], \]

where \(\mathcal{S}\) is the set of sampled spans, and conditioning includes bidirectional context for unmasked positions plus previously generated span tokens \(s_{1:i-1}\). Compare to BERT’s independent token predictions inside a mask and GPT’s prediction of the next token along the whole sequence.
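The loss can be checked numerically with made-up per-token probabilities (the numbers below are purely illustrative):

```python
import math

# Two sampled spans with invented per-token probabilities
# P(s_i | x_corrupt, s_{1:i-1}); illustrative values only.
span_probs = [[0.5, 0.25], [0.8]]

# L_GLM = - sum over spans s, sum over i of log P(s_i | ...)
loss = -sum(math.log(p) for span in span_probs for p in span)
print(round(loss, 4))  # 2.3026 (= -log(0.5 * 0.25 * 0.8) = -log 0.1)
```

Each span contributes a chain of conditional log-probabilities, exactly as in standard autoregressive training, but conditioned on the bidirectional corrupted context.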

2D positional encoding

Each token carries two indices: document position \(p\) (order in the original text) and span position \(q\) (0 for unmasked context; \(1,2,\ldots\) inside a generated span). Embeddings combine both, e.g.:

\[ \mathrm{pos}(t) = \bigl( p(t),\, q(t) \bigr), \]

so the model knows both global layout and local order within infilled spans.
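A minimal sketch of the assignment, following the paper's convention that each span token reuses the document position of its [MASK] in the corrupted sequence (the function name is ours):

```python
def positions_2d(corrupt_len: int, mask_positions: list[int],
                 span_lens: list[int]) -> list[tuple[int, int]]:
    """(p, q) per token: corrupted-context tokens get p = own index, q = 0;
    tokens of span k get p = position of its [MASK] and q = 1..span length."""
    pos = [(p, 0) for p in range(corrupt_len)]        # context part
    for m, ln in zip(mask_positions, span_lens):      # one entry per span
        pos.extend((m, q) for q in range(1, ln + 1))
    return pos


# 4 corrupted tokens with a [MASK] at index 1 hiding a 2-token span:
print(positions_2d(4, [1], [2]))
# [(0, 0), (1, 0), (2, 0), (3, 0), (1, 1), (1, 2)]
```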

Attention mask pattern (schematic)

Partition tokens into context region \(A\) (unmasked text) and span regions \(B^{(1)}, B^{(2)}, \ldots\):

  • Within \(A\): full bidirectional attention (each token attends to all other \(A\) tokens).
  • Within each span \(B^{(k)}\): causal attention (token \(i\) attends to earlier tokens in the same span).
  • Cross-region: span tokens attend to the context (\(B \rightarrow A\) allowed, so infilling is conditioned on the document), while context tokens do not attend into spans (\(A \rightarrow B\) blocked), so context encoding cannot "peek" at the answers; implementation details follow the official mask layout in the paper.

A compact description: bidirectional for \([A]\) context, causal within each \([B]\) span, with controlled visibility between \(A\) and \(B\) so infilling stays well-defined.

Token utilization (informative comparison)

  • BERT: only ~15% of tokens contribute to the MLM loss (masked subset); the rest are ignored for the objective.
  • GPT: 100% of (shifted) positions participate in next-token loss.
  • GLM: the masked span fraction is similar to BERT's, but those positions are trained autoregressively inside each span, so the model still gets sequential supervision on corrupted regions rather than independent per-token MLM predictions.
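The comparison can be made concrete by counting loss-contributing positions on a toy sequence (a rough sketch; real fractions depend on each model's masking schedule):

```python
def supervised_fraction(n_tokens: int, spans: list[tuple[int, int]]) -> dict[str, float]:
    """Fraction of positions that contribute to each objective's loss."""
    masked = sum(b - a for a, b in spans)
    return {
        "bert_mlm": masked / n_tokens,         # only masked positions
        "gpt_clm": (n_tokens - 1) / n_tokens,  # every shifted position
        "glm": masked / n_tokens,              # span positions, trained AR
    }


print(supervised_fraction(20, [(3, 6)]))
# {'bert_mlm': 0.15, 'gpt_clm': 0.95, 'glm': 0.15}
```

GLM and BERT supervise the same fraction here, but GLM's supervision is sequential within the span rather than a set of independent logits.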

Python Implementation

The following script demonstrates span sampling, corrupted input construction, autoregressive targets for each span, and a boolean attention mask: bidirectional among context tokens, causal inside each span, with context visible to span tokens (A→B).

"""
GLM-style span infilling: corrupted input, AR targets per span, and attention mask.
Educational only — not a full GLM-4 training recipe.
"""
from __future__ import annotations

import random


def tokenize_simple(text: str) -> list[str]:
    return text.strip().split()


def sample_spans(
    n_tokens: int,
    *,
    min_spans: int = 1,
    max_spans: int = 3,
    min_len: int = 1,
    max_len: int = 4,
    rng: random.Random | None = None,
) -> list[tuple[int, int]]:
    """Return list of (start, end) half-open intervals, non-overlapping."""
    rng = rng or random.Random()
    n_spans = rng.randint(min_spans, max_spans)
    spans: list[tuple[int, int]] = []
    used = [False] * n_tokens
    for _ in range(n_spans * 4):
        if len(spans) >= n_spans:
            break
        ln = rng.randint(min_len, min(max_len, n_tokens))
        start = rng.randint(0, n_tokens - ln)
        end = start + ln
        if any(used[start:end]):
            continue
        for i in range(start, end):
            used[i] = True
        spans.append((start, end))
    spans.sort()
    return spans


def glm_corruption(
    tokens: list[str],
    spans: list[tuple[int, int]],
    mask_token: str = "[MASK]",
) -> tuple[list[str], list[list[str]], list[int]]:
    """
    Build corrupted sequence (spans -> single MASK each) and AR targets per span.
    Returns (corrupt_tokens, targets_per_span, span_starts_in_corrupt).
    """
    span_set = set()
    for a, b in spans:
        span_set.update(range(a, b))

    corrupt: list[str] = []
    span_starts: list[int] = []
    targets: list[list[str]] = []

    i = 0
    while i < len(tokens):
        if i in span_set:
            a = i
            while i < len(tokens) and i in span_set:
                i += 1
            targets.append(tokens[a:i])
            span_starts.append(len(corrupt))
            corrupt.append(mask_token)
        else:
            corrupt.append(tokens[i])
            i += 1
    return corrupt, targets, span_starts


def expand_corrupt_sequential(
    corrupt: list[str], targets: list[list[str]], mask_token: str = "[MASK]"
) -> tuple[list[str], list[tuple[int, int]]]:
    out: list[str] = []
    spans: list[tuple[int, int]] = []
    ti = 0
    for tok in corrupt:
        if tok == mask_token:
            s = len(out)
            out.extend(targets[ti])
            e = len(out)
            spans.append((s, e))
            ti += 1
        else:
            out.append(tok)
    return out, spans


def attention_bool_matrix(span_intervals: list[tuple[int, int]], n: int) -> list[list[bool]]:
    """mat[i][j] = True iff query position i may attend to key position j."""
    span_pos: set[int] = set()
    for a, b in span_intervals:
        span_pos.update(range(a, b))
    ctx_positions = [i for i in range(n) if i not in span_pos]

    mat = [[False] * n for _ in range(n)]
    # Bidirectional among context (A <-> A)
    for i in ctx_positions:
        for j in ctx_positions:
            mat[i][j] = True
    # Span queries attend to all context (B -> A); causal within each span (B -> B)
    for a, b in span_intervals:
        for i in range(a, b):
            for j in ctx_positions:
                mat[i][j] = True
            for j in range(a, i + 1):
                mat[i][j] = True
    # Strict variant: context queries do not attend to span keys (no A -> B)
    return mat


def demo() -> None:
    text = "the language model fills missing words here today"
    toks = tokenize_simple(text)
    rng = random.Random(0)
    spans = sample_spans(len(toks), rng=rng)
    corrupt, targ, _ = glm_corruption(toks, spans)
    expanded, sint = expand_corrupt_sequential(corrupt, targ)
    attn = attention_bool_matrix(sint, len(expanded))
    print("original:", toks)
    print("spans:", spans)
    print("corrupt:", corrupt)
    print("targets:", targ)
    print("expanded:", expanded)
    print("attn[0] row:", [int(x) for x in attn[0]])


if __name__ == "__main__":
    demo()

This illustrates the masking strategy and a consistent attention pattern: context fully connected, each span causal, span tokens see context, and context does not attend into span tokens (one common strict variant).


Interview Importance

Expect compare-and-contrast questions: GLM vs BERT vs GPT objectives, why 2D positions exist, and how span infilling changes data efficiency and downstream behavior (e.g., generation vs classification). Tool use questions may probe RL vs supervised copying, and long context may connect to KV cost and MoE routing. Be ready to sketch the attention mask on a whiteboard.


Interview Questions & Answers (6 Q&As)

Q1: How does GLM differ from BERT’s MLM and GPT’s CLM?
A: BERT predicts independent masked tokens with bidirectional context but does not train open-ended left-to-right generation over long outputs. GPT trains every position with causal attention. GLM masks spans, keeps bidirectional context among visible tokens, and predicts each span autoregressively—unifying context understanding and generative span completion in one objective.

Q2: What does the GLM attention mask look like at a high level?
A: Bidirectional among uncorrupted context tokens; causal inside each masked span as it is generated; span tokens attend to context so infilling is conditioned on the document; context typically does not attend to span tokens to avoid leaking answers during context encoding (exact layout follows the paper’s block mask).

Q3: Why use 2D positional encodings?
A: Tokens need a global document order \(p\) and a local order \(q\) within each infilled span. A single positional index, as in GPT, is not enough when spans are non-contiguous holes; 2D encodings disambiguate "where in the document" from "which step inside the span."

Q4: How does bilingual performance usually improve in such models?
A: Through balanced high-quality pretraining corpora in both languages, cleaning and deduplication, and often English-centric initialization plus continued multilingual training, with evaluation on both MMLU-style English benchmarks and Chinese benchmarks.

Q5: When might GLM-style infilling be preferable to pure CLM?
A: When tasks benefit from bidirectional context (e.g., cloze, rewriting, conditional infilling, structured templates) or when you want span-level supervision without throwing away non-predicted positions entirely—at the cost of more complex masking and position machinery than standard GPT training.

Q6: How is “All Tools” training related to classic RLHF?
A: Beyond preference alignment on text, the model is rewarded for successful tool outcomes (correct code execution, useful retrieval). That requires tool-augmented rollouts, reward signals on trajectories, and credit assignment across multiple tool calls—closer to RL for agents than single-turn reward models alone.


Connections to Other Papers

Paper / line | Connection to GLM / GLM-4
BERT | Shared MLM-style masking heritage, but GLM adds autoregressive span prediction and generation.
GPT-2 / GPT-3 | CLM baseline; GLM relaxes full-sequence causality on context regions.
T5 | Span corruption with an encoder-decoder text-to-text stack; GLM uses a single transformer with mixed masks instead of an explicit encoder-decoder split.
Toolformer | Early tool augmentation via supervised examples; GLM-4 pushes autonomous multi-tool use with RL.
ReAct | Reasoning + acting loop; GLM-4's All Tools is an end-to-end LLM agent direction with native browser and code tools.

Key Takeaways for Quick Review (table)

Topic | One-liner
Objective | Autoregressive blank infilling: mask spans, bidirectional context, causal span generation.
Vs BERT / GPT | Understanding + generation in one loss; not pure MLM or full-sequence CLM.
Positions | 2D: document index + span index inside each hole.
Loss | \(\mathcal{L} = - \mathbb{E}\bigl[\sum_{s}\sum_i \log P(s_i \mid x_{\text{corrupt}}, s_{1:i-1})\bigr]\).
Attention | Mixed mask: bidirectional A, causal B, controlled A↔B visibility.
Tokens | Masked fraction similar to BERT; AR supervision inside spans, unlike independent MLM.
GLM-4 | Strong bilingual results, 128K context, All Tools (browser, code, custom).
GLM-4.6 | 357B MoE, ~32B active, 200K in / 128K out, MIT license.