4. Common Models & Benchmarks

Overview

This chapter summarizes widely used LLM architectures and model families (encoder-only, decoder-only, encoder-decoder, PrefixLM, MoE), along with the design choices that matter in practice and a pragmatic evaluation checklist.

Learning goals

By the end of this chapter, you should be able to:

  • Explain the main Transformer archetypes (encoder-only, decoder-only, encoder-decoder, PrefixLM) and when each is used.
  • Interpret core formulas and what they imply in practice (attention, CLM/MLM/S2S losses, KV cache sizing, RoPE, MoE routing).
  • Reason about inference trade-offs (MHA vs MQA vs GQA; KV-cache scaling; context length).
  • Find and validate “real specs” from technical reports/model cards and configs.
  • Choose a small, diverse evaluation set and avoid common evaluation pitfalls.

Math Recap

Attention

Given token representations \(X \in \mathbb{R}^{n \times d}\), a standard attention head is: \[ Q = XW_Q,\quad K = XW_K,\quad V = XW_V \] \[ \mathrm{Attn}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V \] where \(M\) is an attention mask (e.g., causal mask for decoder-only LMs).
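
For intuition, here is a minimal NumPy sketch of a single attention head with an optional causal mask; the dimensions and weight names are illustrative assumptions, not tied to any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V, causal=True):
    """One attention head over token representations X of shape (n, d)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V               # each (n, d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n) attention scores
    if causal:
        n = X.shape[0]
        scores = scores + np.triu(np.full((n, n), -np.inf), k=1)  # hide future tokens
    return softmax(scores, axis=-1) @ V               # (n, d_k)

# Toy usage: 5 tokens, hidden size 16, head dim 8
rng = np.random.default_rng(0)
n, d, d_k = 5, 16, 8
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))
print(attention_head(X, W_Q, W_K, W_V).shape)         # (5, 8)
```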

Compute/memory rule of thumb (per layer, dense attention)

  • Attention scores are \(n \times n\) ⇒ time/memory scale as \(O(n^2)\).
  • For hidden size \(d\) and sequence length \(n\), attention compute is roughly \(O(n^2 d)\).

KV cache (decoder-only inference)

At generation step \(t\), the keys and values of prior tokens are reused rather than recomputed, so they are cached.

Cache size per layer and per KV head is roughly:

\[ \mathrm{KV\ bytes} \approx 2 \cdot n \cdot d_k \cdot b \]

where \(b\) is bytes per element (e.g., \(b=2\) for fp16/bf16).

The factor of 2 accounts for \(K\) and \(V\); multiply by the number of KV heads/groups (which depends on the attention type, see below) and by the number of layers to get the whole-model total.
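
A quick sizing helper that follows this rule of thumb; the model dimensions in the example are illustrative assumptions, not the specs of any real model.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size: 2 (K and V) x layers x KV heads x tokens x head_dim x bytes."""
    return 2 * n_layers * n_kv_heads * n_tokens * head_dim * bytes_per_elem

# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16, 8K-token sequence
size = kv_cache_bytes(n_tokens=8192, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2)
print(f"{size / 2**30:.2f} GiB per sequence")  # 1.00 GiB
```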

MHA vs MQA vs GQA (why it matters)

Let \(H\) be query heads and \(G\) be key/value groups.

  • MHA: \(G = H\) (each head has its own \(K,V\)) ⇒ highest quality, largest KV cache.
  • MQA: \(G = 1\) (all heads share one \(K,V\)) ⇒ smallest KV cache, sometimes quality hit.
  • GQA: \(1 < G < H\) ⇒ trade-off between quality and KV-cache size.

KV-cache size scales linearly with \(G\): relative to MHA, GQA shrinks the cache by a factor of \(H/G\), and MQA by a factor of \(H\).
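
A minimal NumPy sketch of GQA's K/V sharing (the causal mask is omitted for brevity; head counts and shapes are illustrative assumptions). MHA corresponds to \(G = H\) and MQA to \(G = 1\).

```python
import numpy as np

def gqa_attention(Q, K, V, n_groups):
    """Grouped-query attention: H query heads share G = n_groups K/V heads (H % G == 0)."""
    H, n, d_k = Q.shape                                  # Q: (H, n, d_k); K, V: (G, n, d_k)
    K = np.repeat(K, H // n_groups, axis=0)              # expand each KV group to its query heads
    V = np.repeat(V, H // n_groups, axis=0)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)     # (H, n, n); causal mask omitted here
    scores -= scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ V                                     # (H, n, d_k)

rng = np.random.default_rng(0)
H, G, n, d_k = 8, 2, 4, 16                               # 8 query heads sharing 2 KV groups
Q = rng.normal(size=(H, n, d_k))
K = rng.normal(size=(G, n, d_k))
V = rng.normal(size=(G, n, d_k))
print(gqa_attention(Q, K, V, n_groups=G).shape)          # (8, 4, 16)
# Only the (G, n, d_k) K/V tensors need caching, so the cache is H/G = 4x smaller than MHA.
```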

MLP / FFN

Classic FFN: \[ \mathrm{FFN}(x)=\sigma(xW_1 + b_1)W_2 + b_2 \] Common modern variant (GLU/SwiGLU family): \[ \mathrm{SwiGLU}(x) = (xW_a) \odot \mathrm{swish}(xW_b) \] In the full FFN block, the gated output is then passed through a down-projection (the analogue of \(W_2\) above), typically without biases.
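
A minimal NumPy sketch of both FFN variants, with the SwiGLU block including the usual down-projection; widths and weight names are illustrative assumptions (and \(\sigma\) is ReLU here purely for brevity).

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))            # swish / SiLU: x * sigmoid(x)

def ffn_classic(x, W1, b1, W2, b2):
    """Classic FFN: sigma(x W1 + b1) W2 + b2 (ReLU stands in for sigma here)."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def ffn_swiglu(x, W_a, W_b, W_c):
    """SwiGLU FFN: gate (x W_a) with swish(x W_b), then project back down with W_c (no biases)."""
    return ((x @ W_a) * swish(x @ W_b)) @ W_c

rng = np.random.default_rng(0)
d, d_ff = 16, 64
x = rng.normal(size=(4, d))                  # 4 tokens
W_a, W_b = rng.normal(size=(d, d_ff)), rng.normal(size=(d, d_ff))
W_c = rng.normal(size=(d_ff, d))
print(ffn_swiglu(x, W_a, W_b, W_c).shape)    # (4, 16)
```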

Three common training objectives

  • MLM (masked language modeling): predict masked tokens using both left and right context.
  • Causal LM: predict next token \(p(x_t\mid x_{<t})\) with a causal mask.
  • Seq2Seq: encoder produces hidden states; decoder predicts target sequence with cross-attention.

Losses (common interview formulas)

  • Causal LM: \[ \mathcal{L}_{\text{CLM}} = - \sum_{t=1}^{n} \log p(x_t \mid x_{<t}) \]
  • MLM (mask set \(\mathcal{M}\)): \[ \mathcal{L}_{\text{MLM}} = - \sum_{t\in \mathcal{M}} \log p(x_t \mid x_{\setminus \mathcal{M}}) \]
  • Seq2Seq (target length \(m\)): \[ \mathcal{L}_{\text{S2S}} = - \sum_{t=1}^{m} \log p(y_t \mid y_{<t},\; x) \]
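
A minimal NumPy sketch of the causal-LM loss over toy logits, showing the usual one-position shift between inputs and targets; the MLM and Seq2Seq losses differ mainly in which positions are summed and what the model conditions on. Shapes and values are illustrative.

```python
import numpy as np

def causal_lm_loss(logits, token_ids):
    """Sum of -log p(x_t | x_<t): position t's logits predict token t+1 (the usual shift)."""
    logits = logits[:-1]                                   # last position has nothing to predict
    targets = token_ids[1:]                                # shift targets left by one
    logits = logits - logits.max(axis=-1, keepdims=True)   # stable log-softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

rng = np.random.default_rng(0)
vocab, n = 100, 6
logits = rng.normal(size=(n, vocab))
token_ids = rng.integers(0, vocab, size=n)
print(causal_lm_loss(logits, token_ids))   # scalar NLL; divide by (n - 1) for a per-token loss
```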

Positional encoding (the minimum you should know)

RoPE is commonly written as applying a position-dependent rotation to \(Q,K\) in 2D subspaces: \[ \mathrm{RoPE}(q_t) = R(t)q_t,\quad \mathrm{RoPE}(k_t) = R(t)k_t \] \[ R(t)=\begin{bmatrix}\cos(\theta t)&-\sin(\theta t)\\ \sin(\theta t)&\cos(\theta t)\end{bmatrix} \] In practice, \(R(t)\) is applied blockwise across dimensions with a range of frequencies.
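
A minimal NumPy sketch of RoPE with the usual base-10000 frequency schedule; note that pairing conventions (interleaved vs. split-half) vary across implementations, and the shapes here are illustrative.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate dimension pairs (2i, 2i+1) of x (shape (n, d), d even) by angle t * base^(-2i/d)."""
    n, d = x.shape
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)   # one frequency per 2D subspace
    angles = positions[:, None] * freqs[None, :]      # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                   # the two halves of each pair (interleaved)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                # apply the 2D rotation R(t) per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8))                           # 5 tokens, head dim 8
print(rope(q, positions=np.arange(5)).shape)          # (5, 8)
```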

MoE (mixture-of-experts) routing

A common formulation routes tokens to top-\(k\) experts: \[ g(x)=\mathrm{softmax}(Wx)\,,\qquad \mathrm{MoE}(x)=\sum_{e\in \mathrm{TopK}(g(x))} g_e(x)\,E_e(x) \] Key trade-off: capacity/quality vs routing/serving complexity.
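
A minimal top-\(k\) routing sketch for a single token; renormalizing the gate weights over the selected experts is one common choice among several, and the tiny linear "experts" are toy placeholders just to make the example runnable.

```python
import numpy as np

def moe_forward(x, W_router, experts, k=2):
    """Route token x to its top-k experts and mix their outputs by (renormalized) gate weight."""
    logits = W_router @ x                              # (n_experts,)
    gates = np.exp(logits - logits.max())
    gates /= gates.sum()                               # softmax over experts
    topk = np.argsort(gates)[-k:]                      # indices of the k largest gates
    weights = gates[topk] / gates[topk].sum()          # renormalize over the selected experts
    return sum(w * experts[e](x) for w, e in zip(weights, topk))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
x = rng.normal(size=d)
W_router = rng.normal(size=(n_experts, d))
expert_weights = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]
experts = [lambda v, W=W: W @ v for W in expert_weights]   # toy linear experts
print(moe_forward(x, W_router, experts, k=2).shape)        # (16,)
```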

Architecture archetypes

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "primaryColor": "transparent", "secondaryColor": "transparent", "tertiaryColor": "transparent", "primaryTextColor": "#111111", "primaryBorderColor": "#111111", "lineColor": "#111111", "clusterBkg": "transparent", "clusterBorder": "#111111", "edgeLabelBackground": "transparent", "fontFamily": "system-ui, -apple-system, Segoe UI, Roboto, sans-serif", "fontSize": "14px"}}}%%
flowchart LR
  subgraph A["Encoder-only (BERT)"]
    A1["bi-directional self-attention"]
  end

  subgraph B["Decoder-only (GPT / LLaMA / Qwen / Mistral)"]
    B1["causal self-attention"]
  end

  subgraph C["Encoder-decoder (T5)"]
    C1["encoder: bi-directional self-attention"]
    C2["decoder: causal self-attention"]
    C3["cross-attention to encoder"]
  end

  subgraph D["PrefixLM (GLM family)"]
    D1["prefix: bi-directional self-attention"]
    D2["suffix: causal self-attention"]
  end

Quick comparison table

| Family / Model | Core architecture | Typical objective | Typical strengths | Typical trade-offs |
| --- | --- | --- | --- | --- |
| BERT | Encoder-only | MLM | embeddings, classification, retrieval features | not a native generator |
| T5 | Encoder-decoder | Seq2Seq | strong conditional generation | more compute than decoder-only at same params |
| GPT-style | Decoder-only | Causal LM | general generation, instruction tuning | quadratic attention cost; long-context challenges |
| LLaMA-style | Decoder-only | Causal LM | efficient modern decoder recipe | same decoder-only limitations |
| Mistral-style | Decoder-only | Causal LM | efficiency tweaks (e.g., windowed attn) | windowing can reduce global context interactions |
| GLM / PrefixLM | Hybrid mask (PrefixLM) | PrefixLM / Causal-ish | good for understanding + generation hybrids | more masking complexity |
| MoE families | Transformer + routed FFNs | usually Causal LM / Seq2Seq | scale params cheaply; high capacity | routing complexity; serving efficiency trade-offs |

Model families (architectures, choices, trade-offs)

BERT

  • Architecture: encoder-only Transformer (bi-directional self-attention).
  • Objective: MLM (plus next-sentence prediction in original BERT).
  • Design notes:
    • Great for representations (embeddings) and classification.
    • Less natural for open-ended generation without conversion to a decoder.
  • References: BERT paper, The Illustrated BERT, Google AI Blog announcement

T5 (Text-to-Text Transfer Transformer)

  • Architecture: encoder-decoder Transformer.
  • Objective: span corruption / denoising (Seq2Seq).
  • Design notes:
    • Strong at conditional tasks (translate/summarize/structured outputs).
    • Compared to decoder-only, you pay extra compute for the encoder.
  • References: T5 paper, Google AI Blog post

GPT family (incl. “GPT-style decoders”)

  • Architecture: decoder-only Transformer with a causal mask.
  • Objective: next-token prediction (Causal LM), then instruction tuning and/or RLHF.
  • Common design choices (varies by release):
    • Pre-LN blocks; FFN activations like GELU/SwiGLU.
    • Attention variants: MHA, MQA/GQA.
    • Positional encoding: learned absolute or RoPE/relative variants.
  • References: GPT, GPT-2, GPT-2 blog, GPT-3, InstructGPT (RLHF)

PaLM

  • Architecture: large decoder-only Transformer.
  • Objective: causal LM; later instruction-tuned variants.
  • Notes:
    • Treat as a strong example of “GPT-style at scale”.
    • In the PaLM report, the authors describe several architecture choices: SwiGLU activations, parallel Transformer sublayers, Multi-Query Attention (MQA), RoPE, shared input/output embeddings, and no biases in dense kernels / layer norms. They also use a SentencePiece vocabulary (256k tokens) and report a training sequence length of 2048.
  • References: PaLM paper, Google AI Blog post

LLaMA

  • Architecture: decoder-only Transformer.
  • Typical recipe (common in LLaMA-like stacks):
    • RoPE positional encoding
    • RMSNorm (see the sketch after this entry)
    • SwiGLU MLP
    • (Later variants often use) GQA for faster inference
  • Trade-offs:
    • Efficient and widely adopted; many downstream derivatives.
    • Long-context requires RoPE scaling / long-context adaptation.
  • References: LLaMA, Llama 2
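
RoPE and SwiGLU appear in the Math Recap above, but RMSNorm does not, so here is a minimal sketch; the epsilon and the learned gain follow the usual formulation, and the shapes are illustrative.

```python
import numpy as np

def rmsnorm(x, gain, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square of the features (no mean subtraction, no bias)."""
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return x / rms * gain

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(rmsnorm(x, gain=np.ones(4)))   # each row rescaled to (roughly) unit RMS, times the gain
```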

Qwen

  • Architecture: typically LLaMA-like decoder-only Transformer family.
  • Notes:
    • Commonly shares the modern decoder recipe (RoPE + RMSNorm + SwiGLU) with model-specific tweaks.
    • For Qwen2, the official release notes state they use GQA across all sizes, and long-context support varies by checkpoint (the family is pretrained at 32K context; some instruction-tuned variants go up to 128K).
    • For Qwen2.5, the release notes describe a dense decoder-only family with variants supporting up to 128K context, and mention pretraining up to 18T tokens (per the post).
  • References: Qwen report, Qwen2 release notes, Qwen2.5 release notes, Qwen docs/model cards

Mistral

  • Architecture: decoder-only Transformer.
  • Known family-level choices (typical for Mistral-style efficiency):
    • Grouped-query attention (GQA)
    • Sliding-window / local attention in some models
  • Trade-offs:
    • Strong efficiency/latency; local attention can constrain very long-range interactions unless augmented.
  • References: Mistral 7B, Mixtral 8x7B (MoE)

GLM (e.g., GLM 4.5/4.6)

  • Architecture: PrefixLM-style hybrid masking.
  • Masking/objective: bi-directional attention over the prefix plus causal decoding for the suffix (sketched below).
  • Trade-offs:
    • A bridge between encoder-style understanding and decoder-style generation.
    • More complexity in masking/training setup.
  • References: GLM paper (blank infilling / PrefixLM)
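
A minimal sketch of the hybrid mask: prefix positions attend bidirectionally within the prefix, suffix positions decode causally over everything before them (boolean convention: True = may attend; sizes are illustrative).

```python
import numpy as np

def prefix_lm_mask(prefix_len, total_len):
    """Boolean mask where entry (i, j) is True if query position i may attend to key position j."""
    mask = np.tril(np.ones((total_len, total_len), dtype=bool))  # start from a causal mask
    mask[:, :prefix_len] = True   # every position sees the whole prefix => prefix is bidirectional
    return mask

print(prefix_lm_mask(prefix_len=3, total_len=6).astype(int))
# Rows 0-2 (prefix) see columns 0-2 only, in both directions;
# rows 3-5 (suffix) see the full prefix plus earlier suffix tokens (causal).
```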

DeepSeek (R1, V3.x)

  • V2: MoE model family; the repository describes 236B total parameters / 21B activated per token, 128K context for the main V2 model, pretraining on 8.1T tokens, and architecture components including MLA (Multi-head Latent Attention) and DeepSeekMoE.
  • V3: MoE model family; the repository describes 671B total parameters / 37B activated per token, 128K context, pretraining on 14.8T tokens, and use of MLA + DeepSeekMoE, plus an auxiliary-loss-free load-balancing strategy and a Multi-Token Prediction (MTP) objective.
  • R1: reasoning/post-training family; the repository describes DeepSeek-R1-Zero (RL without SFT) and DeepSeek-R1 (adds cold-start data before RL). It also states R1 models are trained based on DeepSeek-V3-Base.
  • References: DeepSeek-V2 repo, DeepSeek-V2 paper, DeepSeek-V3 repo, DeepSeek-V3 report, DeepSeek-R1 repo, DeepSeek-R1 paper

Nemotron-3-Nano (optional)

Benchmarks & websites (best places to learn and compare)

Suites / toolkits

  • OpenCompass: broad coverage and a practical pipeline for running many benchmarks consistently.
  • lm-evaluation-harness (EleutherAI): the de-facto standard harness for offline academic benchmarks; lots of existing baselines.
  • HELM (Stanford CRFM): strongest when you want scenario-style evaluation and transparent reporting dimensions (robustness, fairness, etc.).

Leaderboards / aggregators

“Where to read the real specs”

  • Official technical reports / model cards (most reliable for architecture details).
  • Inference engine repos / configs (often reveal attention type, rope scaling, context length).
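
As an example of the second point, here is a hedged sketch of pulling a model's config.json from the Hugging Face Hub. The field names (num_attention_heads, num_key_value_heads, max_position_embeddings, rope_scaling) are LLaMA-style conventions that vary by family, and the repo id is a placeholder.

```python
import json
from huggingface_hub import hf_hub_download  # pip install huggingface_hub

repo_id = "some-org/some-model"              # placeholder: substitute the model you are auditing
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")

with open(config_path) as f:
    cfg = json.load(f)

n_heads = cfg.get("num_attention_heads")
n_kv_heads = cfg.get("num_key_value_heads", n_heads)   # missing field usually implies MHA
attn = "MHA" if n_kv_heads == n_heads else ("MQA" if n_kv_heads == 1 else f"GQA ({n_kv_heads} KV heads)")
print("attention:", attn)
print("context length:", cfg.get("max_position_embeddings"))
print("rope scaling:", cfg.get("rope_scaling"))
```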

Evaluation

This section focuses on how to evaluate models (methods) and what to run (recommended benchmarks).

Evaluation methods

  1. Offline (automatic) evaluation
    • Pros: fast, reproducible.
    • Cons: benchmark overfitting; data contamination; prompt sensitivity.
  2. Model-based judging (LLM-as-a-judge)
    • Pros: scalable for instruction following and writing quality.
    • Cons: bias, position effects, judge model drift; needs careful controls.
  3. Human preference tests
    • Pros: best alignment with UX.
    • Cons: expensive and slow; needs good sampling + rater calibration.
  4. System-level / agentic evaluation
    • Pros: measures end-to-end usefulness (tools, codebases, long tasks).
    • Cons: harder to standardize; infra heavy.

Setup pitfalls (what makes results misleading)

  • Prompting: 0-shot vs few-shot vs CoT; formatting; system prompts.
  • Decoding: temperature/top-p; deterministic vs sampled; seed control.
  • Contamination: train/test leakage and benchmark memorization.
  • Scoring: exact match vs pass@k; normalization; unit tests.
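
On the scoring point, pass@k is usually computed with the standard unbiased estimator (popularized by the HumanEval/Codex evaluation) rather than by literally sampling k completions; a minimal sketch, assuming n samples with c of them passing the tests:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k) for n samples with c correct."""
    if n - c < k:
        return 1.0               # too few failures to fill k slots without including a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 sampled solutions per problem, 3 of which pass the unit tests
for k in (1, 5, 10):
    print(k, round(pass_at_k(n=20, c=3, k=k), 3))
```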

Summary

  • Most “LLMs” you see in practice are decoder-only Transformers; encoder-only and encoder-decoder models remain common for embeddings and conditional generation, respectively.
  • The core scaling constraints show up in attention (\(O(n^2)\)) and KV cache (roughly linear in \(n\) and number of KV groups).
  • Many modern families share the same decoder recipe (RoPE + RMSNorm + SwiGLU) and differ in efficiency choices (GQA/MQA, windowed attention, MoE routing).
  • Trust architecture claims that are backed by technical reports/model cards and configs.
  • Evaluation is only meaningful when prompts/decoding are controlled and you use a small, diverse benchmark set.

References