4. Common Models & Benchmarks

Overview

This chapter summarizes widely used LLM architectures and model families (encoder-only, decoder-only, encoder-decoder, PrefixLM, MoE), along with the design choices that matter in practice and a pragmatic evaluation checklist.

Learning goals

By the end of this chapter, you should be able to:

  • Explain the main Transformer archetypes (encoder-only, decoder-only, encoder-decoder, PrefixLM) and when each is used.
  • Interpret core formulas and what they imply in practice (attention, CLM/MLM/S2S losses, KV cache sizing, RoPE, MoE routing).
  • Reason about inference trade-offs (MHA vs MQA vs GQA; KV-cache scaling; context length).
  • Find and validate “real specs” from technical reports/model cards and configs.
  • Choose a small, diverse evaluation set and avoid common evaluation pitfalls.

Math Recap

Attention

Given token representations \(X \in \mathbb{R}^{n \times d}\), a standard attention head is: \[ Q = XW_Q,\quad K = XW_K,\quad V = XW_V \] \[ \mathrm{Attn}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V \] where \(M\) is an attention mask (e.g., causal mask for decoder-only LMs).
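
For intuition, here is a minimal NumPy sketch of a single attention head with an optional causal mask; the dimensions and weight names are illustrative assumptions, not tied to any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V, causal=True):
    """One attention head over token representations X of shape (n, d)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V               # each (n, d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n) attention scores
    if causal:
        n = X.shape[0]
        scores = scores + np.triu(np.full((n, n), -np.inf), k=1)  # hide future tokens
    return softmax(scores, axis=-1) @ V               # (n, d_k)

# Toy usage: 5 tokens, hidden size 16, head dim 8
rng = np.random.default_rng(0)
n, d, d_k = 5, 16, 8
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))
print(attention_head(X, W_Q, W_K, W_V).shape)         # (5, 8)
```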

Compute/memory rule of thumb (per layer, dense attention)

  • Attention scores are \(n \times n\) ⇒ time/memory scale as \(O(n^2)\).
  • For hidden size \(d\) and sequence length \(n\), attention compute is roughly \(O(n^2 d)\).

KV cache (decoder-only inference)

At generation step \(t\), the keys and values of prior tokens are reused rather than recomputed, so they are cached.

Cache size per layer and per KV head is roughly:

\[ \mathrm{KV\ bytes} \approx 2 \cdot n \cdot d_k \cdot b \]

where \(b\) is bytes per element (e.g., \(b=2\) for fp16/bf16).

The factor of 2 accounts for \(K\) and \(V\); multiply by the number of KV heads/groups (which depends on the attention type, see below) and by the number of layers to get the whole-model total.
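
A quick sizing helper that follows this rule of thumb; the model dimensions in the example are illustrative assumptions, not the specs of any real model.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size: 2 (K and V) x layers x KV heads x tokens x head_dim x bytes."""
    return 2 * n_layers * n_kv_heads * n_tokens * head_dim * bytes_per_elem

# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16, 8K-token sequence
size = kv_cache_bytes(n_tokens=8192, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2)
print(f"{size / 2**30:.2f} GiB per sequence")  # 1.00 GiB
```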

MHA vs MQA vs GQA (why it matters)

Let \(H\) be query heads and \(G\) be key/value groups.

  • MHA: \(G = H\) (each head has its own \(K,V\)) ⇒ highest quality, largest KV cache.
  • MQA: \(G = 1\) (all heads share one \(K,V\)) ⇒ smallest KV cache, sometimes quality hit.
  • GQA: \(1 < G < H\) ⇒ trade-off between quality and KV-cache size.

KV-cache size scales linearly with \(G\): relative to MHA, GQA shrinks the cache by a factor of \(H/G\), and MQA by a factor of \(H\).
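
A minimal NumPy sketch of GQA's K/V sharing (the causal mask is omitted for brevity; head counts and shapes are illustrative assumptions). MHA corresponds to \(G = H\) and MQA to \(G = 1\).

```python
import numpy as np

def gqa_attention(Q, K, V, n_groups):
    """Grouped-query attention: H query heads share G = n_groups K/V heads (H % G == 0)."""
    H, n, d_k = Q.shape                                  # Q: (H, n, d_k); K, V: (G, n, d_k)
    K = np.repeat(K, H // n_groups, axis=0)              # expand each KV group to its query heads
    V = np.repeat(V, H // n_groups, axis=0)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)     # (H, n, n); causal mask omitted here
    scores -= scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ V                                     # (H, n, d_k)

rng = np.random.default_rng(0)
H, G, n, d_k = 8, 2, 4, 16                               # 8 query heads sharing 2 KV groups
Q = rng.normal(size=(H, n, d_k))
K = rng.normal(size=(G, n, d_k))
V = rng.normal(size=(G, n, d_k))
print(gqa_attention(Q, K, V, n_groups=G).shape)          # (8, 4, 16)
# Only the (G, n, d_k) K/V tensors need caching, so the cache is H/G = 4x smaller than MHA.
```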

MLP / FFN

Classic FFN: \[ \mathrm{FFN}(x)=\sigma(xW_1 + b_1)W_2 + b_2 \] Common modern variant (GLU/SwiGLU family): \[ \mathrm{SwiGLU}(x) = (xW_a) \odot \mathrm{swish}(xW_b) \] In the full FFN block, the gated output is then passed through a down-projection (the analogue of \(W_2\) above), typically without biases.
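
A minimal NumPy sketch of both FFN variants, with the SwiGLU block including the usual down-projection; widths and weight names are illustrative assumptions (and \(\sigma\) is ReLU here purely for brevity).

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))            # swish / SiLU: x * sigmoid(x)

def ffn_classic(x, W1, b1, W2, b2):
    """Classic FFN: sigma(x W1 + b1) W2 + b2 (ReLU stands in for sigma here)."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def ffn_swiglu(x, W_a, W_b, W_c):
    """SwiGLU FFN: gate (x W_a) with swish(x W_b), then project back down with W_c (no biases)."""
    return ((x @ W_a) * swish(x @ W_b)) @ W_c

rng = np.random.default_rng(0)
d, d_ff = 16, 64
x = rng.normal(size=(4, d))                  # 4 tokens
W_a, W_b = rng.normal(size=(d, d_ff)), rng.normal(size=(d, d_ff))
W_c = rng.normal(size=(d_ff, d))
print(ffn_swiglu(x, W_a, W_b, W_c).shape)    # (4, 16)
```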

Three common training objectives

  • MLM (masked language modeling): predict masked tokens using both left and right context.
  • Causal LM: predict next token \(p(x_t\mid x_{<t})\) with a causal mask.
  • Seq2Seq: encoder produces hidden states; decoder predicts target sequence with cross-attention.

Losses (common interview formulas)

  • Causal LM: \[ \mathcal{L}_{\text{CLM}} = - \sum_{t=1}^{n} \log p(x_t \mid x_{<t}) \]
  • MLM (mask set \(\mathcal{M}\)): \[ \mathcal{L}_{\text{MLM}} = - \sum_{t\in \mathcal{M}} \log p(x_t \mid x_{\setminus \mathcal{M}}) \]
  • Seq2Seq (target length \(m\)): \[ \mathcal{L}_{\text{S2S}} = - \sum_{t=1}^{m} \log p(y_t \mid y_{<t},\; x) \]
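
A minimal NumPy sketch of the causal-LM loss over toy logits, showing the usual one-position shift between inputs and targets; the MLM and Seq2Seq losses differ mainly in which positions are summed and what the model conditions on. Shapes and values are illustrative.

```python
import numpy as np

def causal_lm_loss(logits, token_ids):
    """Sum of -log p(x_t | x_<t): position t's logits predict token t+1 (the usual shift)."""
    logits = logits[:-1]                                   # last position has nothing to predict
    targets = token_ids[1:]                                # shift targets left by one
    logits = logits - logits.max(axis=-1, keepdims=True)   # stable log-softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

rng = np.random.default_rng(0)
vocab, n = 100, 6
logits = rng.normal(size=(n, vocab))
token_ids = rng.integers(0, vocab, size=n)
print(causal_lm_loss(logits, token_ids))   # scalar NLL; divide by (n - 1) for a per-token loss
```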

Positional encoding (the minimum you should know)

RoPE is commonly written as applying a position-dependent rotation to \(Q,K\) in 2D subspaces: \[ \mathrm{RoPE}(q_t) = R(t)q_t,\quad \mathrm{RoPE}(k_t) = R(t)k_t \] \[ R(t)=\begin{bmatrix}\cos(\theta t)&-\sin(\theta t)\\ \sin(\theta t)&\cos(\theta t)\end{bmatrix} \] In practice, \(R(t)\) is applied blockwise across dimensions with a range of frequencies.
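
A minimal NumPy sketch of RoPE with the usual base-10000 frequency schedule; note that pairing conventions (interleaved vs. split-half) vary across implementations, and the shapes here are illustrative.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate dimension pairs (2i, 2i+1) of x (shape (n, d), d even) by angle t * base^(-2i/d)."""
    n, d = x.shape
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)   # one frequency per 2D subspace
    angles = positions[:, None] * freqs[None, :]      # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                   # the two halves of each pair (interleaved)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                # apply the 2D rotation R(t) per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8))                           # 5 tokens, head dim 8
print(rope(q, positions=np.arange(5)).shape)          # (5, 8)
```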

MoE (mixture-of-experts) routing

A common formulation routes tokens to top-\(k\) experts: \[ g(x)=\mathrm{softmax}(Wx)\,,\qquad \mathrm{MoE}(x)=\sum_{e\in \mathrm{TopK}(g(x))} g_e(x)\,E_e(x) \] Key trade-off: capacity/quality vs routing/serving complexity.
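
A minimal top-\(k\) routing sketch for a single token; renormalizing the gate weights over the selected experts is one common choice among several, and the tiny linear "experts" are toy placeholders just to make the example runnable.

```python
import numpy as np

def moe_forward(x, W_router, experts, k=2):
    """Route token x to its top-k experts and mix their outputs by (renormalized) gate weight."""
    logits = W_router @ x                              # (n_experts,)
    gates = np.exp(logits - logits.max())
    gates /= gates.sum()                               # softmax over experts
    topk = np.argsort(gates)[-k:]                      # indices of the k largest gates
    weights = gates[topk] / gates[topk].sum()          # renormalize over the selected experts
    return sum(w * experts[e](x) for w, e in zip(weights, topk))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
x = rng.normal(size=d)
W_router = rng.normal(size=(n_experts, d))
expert_weights = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]
experts = [lambda v, W=W: W @ v for W in expert_weights]   # toy linear experts
print(moe_forward(x, W_router, experts, k=2).shape)        # (16,)
```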

Architecture archetypes

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "primaryColor": "transparent", "secondaryColor": "transparent", "tertiaryColor": "transparent", "primaryTextColor": "#111111", "primaryBorderColor": "#111111", "lineColor": "#111111", "clusterBkg": "transparent", "clusterBorder": "#111111", "edgeLabelBackground": "transparent", "fontFamily": "system-ui, -apple-system, Segoe UI, Roboto, sans-serif", "fontSize": "14px"}}}%%
flowchart LR
  subgraph A["Encoder-only (BERT)"]
    A1["bi-directional self-attention"]
  end

  subgraph B["Decoder-only (GPT / LLaMA / Qwen / Mistral)"]
    B1["causal self-attention"]
  end

  subgraph C["Encoder-decoder (T5)"]
    C1["encoder: bi-directional self-attention"]
    C2["decoder: causal self-attention"]
    C3["cross-attention to encoder"]
  end

  subgraph D["PrefixLM (GLM family)"]
    D1["prefix: bi-directional self-attention"]
    D2["suffix: causal self-attention"]
  end

Quick comparison table

| Family / Model | Core architecture | Typical objective | Typical strengths | Typical trade-offs |
| --- | --- | --- | --- | --- |
| BERT | Encoder-only | MLM | embeddings, classification, retrieval features | not a native generator |
| T5 | Encoder-decoder | Seq2Seq | strong conditional generation | more compute than decoder-only at same params |
| GPT-style | Decoder-only | Causal LM | general generation, instruction tuning | quadratic attention cost; long-context challenges |
| LLaMA-style | Decoder-only | Causal LM | efficient modern decoder recipe | same decoder-only limitations |
| Mistral-style | Decoder-only | Causal LM | efficiency tweaks (e.g., windowed attn) | windowing can reduce global context interactions |
| GLM / PrefixLM | Hybrid mask (PrefixLM) | PrefixLM / Causal-ish | good for understanding + generation hybrids | more masking complexity |
| MoE families | Transformer + routed FFNs | usually Causal LM / Seq2Seq | scale params cheaply; high capacity | routing complexity; serving efficiency trade-offs |

Model families (architectures, choices, trade-offs)

BERT

  • Architecture: encoder-only Transformer (bi-directional self-attention).
  • Objective: MLM (plus next-sentence prediction in original BERT).
  • Design notes:
    • Great for representations (embeddings) and classification.
    • Less natural for open-ended generation without conversion to a decoder.
  • References: BERT paper, The Illustrated BERT, Google AI Blog announcement

T5 (Text-to-Text Transfer Transformer)

  • Architecture: encoder-decoder Transformer.
  • Objective: span corruption / denoising (Seq2Seq).
  • Design notes:
    • Strong at conditional tasks (translate/summarize/structured outputs).
    • Compared to decoder-only, you pay extra compute for the encoder.
  • References: T5 paper, Google AI Blog post

GPT family (incl. “GPT-style decoders”)

  • Architecture: decoder-only Transformer with a causal mask.
  • Objective: next-token prediction (Causal LM), then instruction tuning and/or RLHF.
  • Common design choices (varies by release):
    • Pre-LN blocks; FFN activations like GELU/SwiGLU.
    • Attention variants: MHA, MQA/GQA.
    • Positional encoding: learned absolute or RoPE/relative variants.
  • References: GPT, GPT-2, GPT-2 blog, GPT-3, InstructGPT (RLHF)

PaLM

  • Architecture: large decoder-only Transformer.
  • Objective: causal LM; later instruction-tuned variants.
  • Notes:
    • Treat as a strong example of “GPT-style at scale”.
    • In the PaLM report, the authors describe several architecture choices: SwiGLU activations, parallel Transformer sublayers, Multi-Query Attention (MQA), RoPE, shared input/output embeddings, and no biases in dense kernels / layer norms. They also use a SentencePiece vocabulary (256k tokens) and report a training sequence length of 2048.
  • References: PaLM paper, Google AI Blog post

LLaMA

  • Architecture: decoder-only Transformer.
  • Typical recipe (common in LLaMA-like stacks):
    • RoPE positional encoding
    • RMSNorm (see the sketch after this entry)
    • SwiGLU MLP
    • (Later variants often use) GQA for faster inference
  • Trade-offs:
    • Efficient and widely adopted; many downstream derivatives.
    • Long-context requires RoPE scaling / long-context adaptation.
  • References: LLaMA, Llama 2
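
RoPE and SwiGLU appear in the Math Recap above, but RMSNorm does not, so here is a minimal sketch; the epsilon and the learned gain follow the usual formulation, and the shapes are illustrative.

```python
import numpy as np

def rmsnorm(x, gain, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square of the features (no mean subtraction, no bias)."""
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return x / rms * gain

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(rmsnorm(x, gain=np.ones(4)))   # each row rescaled to (roughly) unit RMS, times the gain
```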

Qwen

  • Architecture: typically LLaMA-like decoder-only Transformer family.
  • Notes:
    • Commonly shares the modern decoder recipe (RoPE + RMSNorm + SwiGLU) with model-specific tweaks.
    • For Qwen2, the official release notes state they use GQA across all sizes, and long-context support varies by checkpoint (the family is pretrained at 32K context; some instruction-tuned variants go up to 128K).
    • For Qwen2.5, the release notes describe a dense decoder-only family with variants supporting up to 128K context, and mention pretraining up to 18T tokens (per the post).
  • References: Qwen report, Qwen2 release notes, Qwen2.5 release notes, Qwen docs/model cards

Mistral

  • Architecture: decoder-only Transformer.
  • Known family-level choices (typical for Mistral-style efficiency):
    • Grouped-query attention (GQA)
    • Sliding-window / local attention in some models
  • Trade-offs:
    • Strong efficiency/latency; local attention can constrain very long-range interactions unless augmented.
  • References: Mistral 7B, Mixtral 8x7B (MoE)

GLM (e.g., GLM 4.5/4.6)

  • Architecture: PrefixLM-style hybrid masking.
  • Masking/objective: bi-directional attention over the prefix plus causal decoding for the suffix (sketched below).
  • Trade-offs:
    • A bridge between encoder-style understanding and decoder-style generation.
    • More complexity in masking/training setup.
  • References: GLM paper (blank infilling / PrefixLM)
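
A minimal sketch of the hybrid mask: prefix positions attend bidirectionally within the prefix, suffix positions decode causally over everything before them (boolean convention: True = may attend; sizes are illustrative).

```python
import numpy as np

def prefix_lm_mask(prefix_len, total_len):
    """Boolean mask where entry (i, j) is True if query position i may attend to key position j."""
    mask = np.tril(np.ones((total_len, total_len), dtype=bool))  # start from a causal mask
    mask[:, :prefix_len] = True   # every position sees the whole prefix => prefix is bidirectional
    return mask

print(prefix_lm_mask(prefix_len=3, total_len=6).astype(int))
# Rows 0-2 (prefix) see columns 0-2 only, in both directions;
# rows 3-5 (suffix) see the full prefix plus earlier suffix tokens (causal).
```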

DeepSeek (R1, V3.x)

  • V2: MoE model family; the repository describes 236B total parameters / 21B activated per token, 128K context for the main V2 model, pretraining on 8.1T tokens, and architecture components including MLA (Multi-head Latent Attention) and DeepSeekMoE.
  • V3: MoE model family; the repository describes 671B total parameters / 37B activated per token, 128K context, pretraining on 14.8T tokens, and use of MLA + DeepSeekMoE, plus an auxiliary-loss-free load-balancing strategy and a Multi-Token Prediction (MTP) objective.
  • R1: reasoning/post-training family; the repository describes DeepSeek-R1-Zero (RL without SFT) and DeepSeek-R1 (adds cold-start data before RL). It also states R1 models are trained based on DeepSeek-V3-Base.
  • References: DeepSeek-V2 repo, DeepSeek-V2 paper, DeepSeek-V3 repo, DeepSeek-V3 report, DeepSeek-R1 repo, DeepSeek-R1 paper

Nemotron-3-Nano (optional)

Benchmarks & websites (best places to learn and compare)

Suites / toolkits

  • OpenCompass: broad coverage and a practical pipeline for running many benchmarks consistently.
  • lm-evaluation-harness (EleutherAI): the de-facto standard harness for offline academic benchmarks; lots of existing baselines.
  • HELM (Stanford CRFM): strongest when you want scenario-style evaluation and transparent reporting dimensions (robustness, fairness, etc.).

Leaderboards / aggregators

“Where to read the real specs”

  • Official technical reports / model cards (most reliable for architecture details).
  • Inference engine repos / configs (often reveal attention type, rope scaling, context length).
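
As an example of the second point, here is a hedged sketch of pulling a model's config.json from the Hugging Face Hub. The field names (num_attention_heads, num_key_value_heads, max_position_embeddings, rope_scaling) are LLaMA-style conventions that vary by family, and the repo id is a placeholder.

```python
import json
from huggingface_hub import hf_hub_download  # pip install huggingface_hub

repo_id = "some-org/some-model"              # placeholder: substitute the model you are auditing
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")

with open(config_path) as f:
    cfg = json.load(f)

n_heads = cfg.get("num_attention_heads")
n_kv_heads = cfg.get("num_key_value_heads", n_heads)   # missing field usually implies MHA
attn = "MHA" if n_kv_heads == n_heads else ("MQA" if n_kv_heads == 1 else f"GQA ({n_kv_heads} KV heads)")
print("attention:", attn)
print("context length:", cfg.get("max_position_embeddings"))
print("rope scaling:", cfg.get("rope_scaling"))
```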

Evaluation

This section focuses on how to evaluate models (methods) and what to run (recommended benchmarks).

Evaluation methods

  1. Offline (automatic) evaluation
    • Pros: fast, reproducible.
    • Cons: benchmark overfitting; data contamination; prompt sensitivity.
  2. Model-based judging (LLM-as-a-judge)
    • Pros: scalable for instruction following and writing quality.
    • Cons: bias, position effects, judge model drift; needs careful controls.
  3. Human preference tests
    • Pros: best alignment with UX.
    • Cons: expensive and slow; needs good sampling + rater calibration.
  4. System-level / agentic evaluation
    • Pros: measures end-to-end usefulness (tools, codebases, long tasks).
    • Cons: harder to standardize; infra heavy.

Setup pitfalls (what makes results misleading)

  • Prompting: 0-shot vs few-shot vs CoT; formatting; system prompts.
  • Decoding: temperature/top-p; deterministic vs sampled; seed control.
  • Contamination: train/test leakage and benchmark memorization.
  • Scoring: exact match vs pass@k; normalization; unit tests.
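
On the scoring point, pass@k is usually computed with the standard unbiased estimator (popularized by the HumanEval/Codex evaluation) rather than by literally sampling k completions; a minimal sketch, assuming n samples with c of them passing the tests:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k) for n samples with c correct."""
    if n - c < k:
        return 1.0               # too few failures to fill k slots without including a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 sampled solutions per problem, 3 of which pass the unit tests
for k in (1, 5, 10):
    print(k, round(pass_at_k(n=20, c=3, k=k), 3))
```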

Summary

  • Most “LLMs” you see in practice are decoder-only Transformers; encoder-only and encoder-decoder models remain common for embeddings and conditional generation, respectively.
  • The core scaling constraints show up in attention (\(O(n^2)\)) and KV cache (roughly linear in \(n\) and number of KV groups).
  • Many modern families share the same decoder recipe (RoPE + RMSNorm + SwiGLU) and differ in efficiency choices (GQA/MQA, windowed attention, MoE routing).
  • Trust architecture claims that are backed by technical reports/model cards and configs.
  • Evaluation is only meaningful when prompts/decoding are controlled and you use a small, diverse benchmark set.

References