```mermaid
%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "primaryColor": "transparent", "secondaryColor": "transparent", "tertiaryColor": "transparent", "primaryTextColor": "#111111", "primaryBorderColor": "#111111", "lineColor": "#111111", "clusterBkg": "transparent", "clusterBorder": "#111111", "edgeLabelBackground": "transparent", "fontFamily": "system-ui, -apple-system, Segoe UI, Roboto, sans-serif", "fontSize": "14px"}}}%%
flowchart LR
subgraph A["Encoder-only (BERT)"]
A1["bi-directional self-attention"]
end
subgraph B["Decoder-only (GPT / LLaMA / Qwen / Mistral)"]
B1["causal self-attention"]
end
subgraph C["Encoder-decoder (T5)"]
C1["encoder: bi-directional self-attention"]
C2["decoder: causal self-attention"]
C3["cross-attention to encoder"]
end
subgraph D["PrefixLM (GLM family)"]
D1["prefix: bi-directional self-attention"]
D2["suffix: causal self-attention"]
end
```
4. Common Models & Benchmarks
Overview
This chapter summarizes widely used LLM architectures and model families (encoder-only, decoder-only, encoder-decoder, PrefixLM, MoE), along with the design choices that matter in practice and a pragmatic evaluation checklist.
Learning goals
By the end of this chapter, you should be able to:
- Explain the main Transformer archetypes (encoder-only, decoder-only, encoder-decoder, PrefixLM) and when each is used.
- Interpret core formulas and what they imply in practice (attention, CLM/MLM/S2S losses, KV cache sizing, RoPE, MoE routing).
- Reason about inference trade-offs (MHA vs MQA vs GQA; KV-cache scaling; context length).
- Find and validate “real specs” from technical reports/model cards and configs.
- Choose a small, diverse evaluation set and avoid common evaluation pitfalls.
Math Recap
Attention
Given token representations \(X \in \mathbb{R}^{n \times d}\), a standard attention head is: \[ Q = XW_Q,\quad K = XW_K,\quad V = XW_V \] \[ \mathrm{Attn}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V \] where \(M\) is an attention mask (e.g., causal mask for decoder-only LMs).
Compute/memory rule of thumb (per layer, dense attention)
- Attention scores are \(n \times n\) ⇒ time/memory scale as \(O(n^2)\).
- For hidden size \(d\) and sequence length \(n\), attention compute is roughly \(O(n^2 d)\).
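A minimal single-head sketch of the formula above (assuming PyTorch; batch and multi-head dimensions omitted), with an explicit causal mask \(M\):

```python
# Minimal single-head attention with a causal mask (batch dimension omitted).
import math
import torch

def causal_attention(x, w_q, w_k, w_v):
    """x: (n, d); w_q/w_k/w_v: (d, d_k). Returns (n, d_k)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / math.sqrt(d_k)                # (n, n): the O(n^2) term
    n = scores.shape[0]
    mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)  # forbid attending to future tokens
    return torch.softmax(scores + mask, dim=-1) @ v

n, d, d_k = 8, 16, 16
x = torch.randn(n, d)
out = causal_attention(x, torch.randn(d, d_k), torch.randn(d, d_k), torch.randn(d, d_k))
print(out.shape)  # torch.Size([8, 16])
```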
KV cache (decoder-only inference)
At generation step \(t\), the keys/values of all prior tokens are cached and reused instead of being recomputed.
Cache size per layer is roughly:
\[ \mathrm{KV\ bytes} \approx 2 \cdot n \cdot G \cdot d_{\text{head}} \cdot b \]
where \(G\) is the number of KV heads (groups), \(d_{\text{head}}\) is the per-head dimension, and \(b\) is bytes per element (e.g., \(b=2\) for fp16/bf16).
The factor 2 covers \(K\) and \(V\); multiply by the number of layers to size the whole model's cache.
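A back-of-the-envelope helper for the rule above (a sketch; real serving stacks add paging and allocator overhead, so treat the result as a lower bound):

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, d_head, bytes_per_elem=2):
    """2 (K and V) * tokens * layers * KV heads * head dim * bytes per element."""
    return 2 * n_tokens * n_layers * n_kv_heads * d_head * bytes_per_elem

# Example: a LLaMA-7B-like config (32 layers, 32 KV heads, d_head = 128), fp16, 4096-token context.
print(kv_cache_bytes(4096, 32, 32, 128) / 2**30, "GiB per sequence")  # 2.0 GiB
```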
MHA vs MQA vs GQA (why it matters)
Let \(H\) be query heads and \(G\) be key/value groups.
- MHA: \(G = H\) (each head has its own \(K,V\)) ⇒ highest quality, largest KV cache.
- MQA: \(G = 1\) (all heads share one \(K,V\)) ⇒ smallest KV cache, sometimes quality hit.
- GQA: \(1 < G < H\) ⇒ trade-off between quality and KV-cache size.
KV-cache size scales linearly with \(G\) (the number of KV groups), so moving from MHA to GQA shrinks the cache by a factor of \(H/G\).
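A sketch of how GQA is typically served (assuming PyTorch): only \(G\) KV heads are cached, and each one is expanded to cover \(H/G\) query heads:

```python
import torch

H, G, n, d_head = 8, 2, 16, 64         # 8 query heads share 2 cached KV groups (GQA)
q = torch.randn(H, n, d_head)
k = torch.randn(G, n, d_head)           # only G KV heads live in the cache
v = torch.randn(G, n, d_head)

# Expand each KV group to serve H // G query heads (causal mask omitted for brevity).
k_exp = k.repeat_interleave(H // G, dim=0)   # (H, n, d_head)
v_exp = v.repeat_interleave(H // G, dim=0)
attn = torch.softmax(q @ k_exp.transpose(-2, -1) / d_head ** 0.5, dim=-1) @ v_exp
print(attn.shape)  # torch.Size([8, 16, 64]); the KV cache shrank by a factor of H / G = 4
```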
MLP / FFN
Classic FFN: \[ \mathrm{FFN}(x)=\sigma(xW_1 + b_1)W_2 + b_2 \] Common modern variant (GLU/SwiGLU family): \[ \mathrm{SwiGLU}(x) = (xW_a) \odot \mathrm{swish}(xW_b) \] In a full FFN block, the gated output is then projected back to the model dimension by a third (down-projection) matrix, analogous to \(W_2\).
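A minimal SwiGLU FFN block in the style used by LLaMA-like stacks (a sketch, assuming PyTorch; the layer names and the \(d_{ff}\) choice are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN: down(silu(x W_gate) * (x W_up)); biases omitted, as in many modern decoders."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

ffn = SwiGLUFFN(d_model=512, d_ff=1376)        # d_ff is often ~(8/3) * d_model in SwiGLU stacks
print(ffn(torch.randn(4, 10, 512)).shape)      # torch.Size([4, 10, 512])
```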
Three common training objectives
- MLM (masked language modeling): predict masked tokens using both left and right context.
- Causal LM: predict next token \(p(x_t\mid x_{<t})\) with a causal mask.
- Seq2Seq: encoder produces hidden states; decoder predicts target sequence with cross-attention.
Losses (common interview formulas)
- Causal LM: \[ \mathcal{L}_{\text{CLM}} = - \sum_{t=1}^{n} \log p(x_t \mid x_{<t}) \] (a code sketch follows this list)
- MLM (mask set \(\mathcal{M}\)): \[ \mathcal{L}_{\text{MLM}} = - \sum_{t\in \mathcal{M}} \log p(x_t \mid x_{\setminus \mathcal{M}}) \]
- Seq2Seq (target length \(m\)): \[ \mathcal{L}_{\text{S2S}} = - \sum_{t=1}^{m} \log p(y_t \mid y_{<t},\; x) \]
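The causal-LM loss above is plain token-level cross-entropy with the targets shifted by one position; a minimal sketch (assuming PyTorch):

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits, input_ids, ignore_index=-100):
    """logits: (batch, seq, vocab); input_ids: (batch, seq). Predict each token from everything before it."""
    shift_logits = logits[:, :-1, :]    # predictions made at positions 0..n-2
    shift_labels = input_ids[:, 1:]     # targets are the next tokens (positions 1..n-1)
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=ignore_index,      # mask out padding (or prompt tokens for SFT-style losses)
    )

logits = torch.randn(2, 16, 32000)
ids = torch.randint(0, 32000, (2, 16))
print(causal_lm_loss(logits, ids))      # mean negative log-likelihood per predicted token
```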
Positional encoding (the minimum you should know)
RoPE is commonly written as applying a position-dependent rotation to \(Q,K\) in 2D subspaces: \[ \mathrm{RoPE}(q_t) = R(t)q_t,\quad \mathrm{RoPE}(k_t) = R(t)k_t \] \[ R(t)=\begin{bmatrix}\cos(\theta t)&-\sin(\theta t)\\ \sin(\theta t)&\cos(\theta t)\end{bmatrix} \] In practice, \(R(t)\) is applied blockwise across dimensions with a range of frequencies.
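A compact RoPE sketch (assuming PyTorch; here adjacent dimension pairs are rotated with a spectrum of frequencies, matching the blockwise \(R(t)\) above, though implementations differ in how they pair dimensions):

```python
import torch

def rope(x, base=10000.0):
    """x: (seq, d) with d even; rotates dim pair (2i, 2i+1) at position t by angle t * base^(-2i/d)."""
    seq, d = x.shape
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # (d/2,) frequencies
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * inv_freq    # (seq, d/2) rotation angles
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(16, 64)
print(rope(q).shape)   # apply the same rotation to K; relative position then survives in the q.k dot product
```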
MoE (mixture-of-experts) routing
A common formulation routes tokens to top-\(k\) experts: \[ g(x)=\mathrm{softmax}(Wx)\,,\qquad \mathrm{MoE}(x)=\sum_{e\in \mathrm{TopK}(g(x))} g_e(x)\,E_e(x) \] Key trade-off: capacity/quality vs routing/serving complexity.
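A toy top-\(k\) router matching the formulation above (a sketch, assuming PyTorch; real MoE layers add load-balancing objectives, capacity limits, and expert-parallel kernels):

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                   # x: (tokens, d_model)
        gates = torch.softmax(self.router(x), dim=-1)       # g(x)
        topv, topi = gates.topk(self.top_k, dim=-1)
        topv = topv / topv.sum(dim=-1, keepdim=True)        # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                      # naive loops; real kernels batch tokens by expert
            for e in range(len(self.experts)):
                sel = topi[:, slot] == e
                if sel.any():
                    out[sel] += topv[sel, slot, None] * self.experts[e](x[sel])
        return out

moe = TinyMoE(d_model=64)
print(moe(torch.randn(32, 64)).shape)   # torch.Size([32, 64]); only top_k experts run per token
```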
Architecture archetypes
Quick comparison table
| Family / Model | Core architecture | Typical objective | Typical strengths | Typical trade-offs |
|---|---|---|---|---|
| BERT | Encoder-only | MLM | embeddings, classification, retrieval features | not a native generator |
| T5 | Encoder-decoder | Seq2Seq | strong conditional generation | more compute than decoder-only at same params |
| GPT-style | Decoder-only | Causal LM | general generation, instruction tuning | quadratic attention cost; long-context challenges |
| LLaMA-style | Decoder-only | Causal LM | efficient modern decoder recipe | same decoder-only limitations |
| Mistral-style | Decoder-only | Causal LM | efficiency tweaks (e.g., windowed attn) | windowing can reduce global context interactions |
| GLM / PrefixLM | Hybrid mask (PrefixLM) | blank infilling (PrefixLM) | good for understanding + generation hybrids | more masking complexity |
| MoE families | Transformer + routed FFNs | usually Causal LM / Seq2Seq | scale params cheaply; high capacity | routing complexity; serving efficiency trade-offs |
Model families (architectures, choices, trade-offs)
BERT
- Architecture: encoder-only Transformer (bi-directional self-attention).
- Objective: MLM (plus next-sentence prediction in original BERT).
- Design notes:
- Great for representations (embeddings) and classification.
- Less natural for open-ended generation without conversion to a decoder.
- References: BERT paper, The Illustrated BERT, Google AI Blog announcement
T5 (Text-to-Text Transfer Transformer)
- Architecture: encoder-decoder Transformer.
- Objective: span corruption / denoising (Seq2Seq).
- Design notes:
- Strong at conditional tasks (translate/summarize/structured outputs).
- Compared to decoder-only, you pay extra compute for the encoder.
- References: T5 paper, Google AI Blog post
GPT family (incl. “GPT-style decoders”)
- Architecture: decoder-only Transformer with a causal mask.
- Objective: next-token prediction (Causal LM), then instruction tuning and/or RLHF.
- Common design choices (varies by release):
- Pre-LN blocks; FFN activations like GELU/SwiGLU.
- Attention variants: MHA, MQA/GQA.
- Positional encoding: learned absolute or RoPE/relative variants.
- References: GPT, GPT-2, GPT-2 blog, GPT-3, InstructGPT (RLHF)
PaLM
- Architecture: large decoder-only Transformer.
- Objective: causal LM; later instruction-tuned variants.
- Notes:
- Treat as a strong example of “GPT-style at scale”.
- In the PaLM report, the authors describe several architecture choices: SwiGLU activations, parallel Transformer sublayers, Multi-Query Attention (MQA), RoPE, shared input/output embeddings, and no biases in dense kernels / layer norms. They also use a SentencePiece vocabulary (256k tokens) and report a training sequence length of 2048.
- References: PaLM paper, Google AI Blog post
LLaMA
- Architecture: decoder-only Transformer.
- Typical recipe (common in LLaMA-like stacks):
- RoPE positional encoding
- RMSNorm
- SwiGLU MLP
- GQA for faster inference (in later variants)
- Trade-offs:
- Efficient and widely adopted; many downstream derivatives.
- Long-context requires RoPE scaling / long-context adaptation.
- References: LLaMA, Llama 2
Qwen
- Architecture: typically LLaMA-like decoder-only Transformer family.
- Notes:
- Commonly shares the modern decoder recipe (RoPE + RMSNorm + SwiGLU) with model-specific tweaks.
- For Qwen2, the official release notes state they use GQA across all sizes, and long-context support varies by checkpoint (the family is pretrained at 32K context; some instruction-tuned variants go up to 128K).
- For Qwen2.5, the release notes describe a dense decoder-only family with variants supporting up to 128K context, and mention pretraining up to 18T tokens (per the post).
- References: Qwen report, Qwen2 release notes, Qwen2.5 release notes, Qwen docs/model cards
Mistral
- Architecture: decoder-only Transformer.
- Known family-level choices (typical for Mistral-style efficiency):
- Grouped-query attention (GQA)
- Sliding-window / local attention in some models
- Trade-offs:
- Strong efficiency/latency; local attention can constrain very long-range interactions unless augmented.
- References: Mistral 7B, Mixtral 8x7B (MoE)
GLM (blank infilling / PrefixLM)
- Architecture: PrefixLM-style hybrid masking.
- Objective/masking: autoregressive blank infilling; bi-directional attention over the prefix and causal decoding of the suffix.
- Trade-offs:
- A bridge between encoder-style understanding and decoder-style generation.
- More complexity in masking/training setup.
- References: GLM paper (blank infilling / PrefixLM)
DeepSeek (V2, V3, R1)
- V2: MoE model family; the repository describes 236B total parameters / 21B activated per token, 128K context for the main V2 model, pretraining on 8.1T tokens, and architecture components including MLA (Multi-head Latent Attention) and DeepSeekMoE.
- V3: MoE model family; the repository describes 671B total parameters / 37B activated per token, 128K context, pretraining on 14.8T tokens, and use of MLA + DeepSeekMoE, plus an auxiliary-loss-free load-balancing strategy and a Multi-Token Prediction (MTP) objective.
- R1: reasoning/post-training family; the repository describes DeepSeek-R1-Zero (RL without SFT) and DeepSeek-R1 (adds cold-start data before RL). It also states R1 models are trained based on DeepSeek-V3-Base.
- References: DeepSeek-V2 repo, DeepSeek-V2 paper, DeepSeek-V3 repo, DeepSeek-V3 report, DeepSeek-R1 repo, DeepSeek-R1 paper
Nemotron (optional)
- NVIDIA’s Nemotron family is documented via public model cards (Hugging Face) and a technical report for Nemotron-4.
- Example specs from the model cards:
- Nemotron-3-8B-Base-4k: GPT-style Transformer; supports 4096 context length; trained with sequence length 4096 and uses multi-head attention (per the model card).
- Nemotron-4-340B (Base/Instruct): decoder-only Transformer; sequence length 4096; uses GQA and RoPE; Instruct supports 4096 context length.
- References: Nemotron-3-8B-Base-4k model card, Nemotron-4-340B-Base model card, Nemotron-4-340B-Instruct model card, Nemotron-4 technical report
Benchmarks & websites (best places to learn and compare)
Suites / toolkits
- OpenCompass: broad coverage and a practical pipeline for running many benchmarks consistently.
- lm-evaluation-harness (EleutherAI): the de facto standard harness for offline academic benchmarks; lots of existing baselines.
- HELM (Stanford CRFM): strongest when you want scenario-style evaluation and transparent reporting dimensions (robustness, fairness, etc.).
Leaderboards / aggregators
- e.g., LMSYS Chatbot Arena (human pairwise preferences) and the Hugging Face Open LLM Leaderboard (automatic harness-based scores); treat any single aggregate number as a starting point, not a verdict.
“Where to read the real specs”
- Official technical reports / model cards (most reliable for architecture details).
- Inference engine repos / configs (often reveal attention type, rope scaling, context length).
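A sketch of pulling spec fields straight from a Hugging Face-style config.json (the repo URL is a placeholder and the field names below are typical of LLaMA-like configs; they vary by architecture, so verify against the actual file):

```python
import json
import urllib.request

# Hypothetical repo id for illustration; substitute a real model you have access to.
url = "https://huggingface.co/ORG/MODEL/raw/main/config.json"
cfg = json.loads(urllib.request.urlopen(url).read())

# Common LLaMA-style fields (names differ across architectures).
for key in ("num_hidden_layers", "hidden_size", "num_attention_heads",
            "num_key_value_heads",        # < num_attention_heads implies GQA; 1 implies MQA
            "max_position_embeddings",    # nominal context length
            "rope_theta", "rope_scaling"):
    print(key, "=", cfg.get(key))
```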
Evaluation
This section focuses on how to evaluate models (methods) and what to run (recommended benchmarks).
Evaluation methods
- Offline (automatic) evaluation
- Pros: fast, reproducible.
- Cons: benchmark overfitting; data contamination; prompt sensitivity.
- Model-based judging (LLM-as-a-judge)
- Pros: scalable for instruction following and writing quality.
- Cons: bias, position effects, judge model drift; needs careful controls.
- Human preference tests
- Pros: best alignment with UX.
- Cons: expensive and slow; needs good sampling + rater calibration.
- System-level / agentic evaluation
- Pros: measures end-to-end usefulness (tools, codebases, long tasks).
- Cons: harder to standardize; infra heavy.
Setup pitfalls (what makes results misleading)
- Prompting: 0-shot vs few-shot vs CoT; formatting; system prompts.
- Decoding: temperature/top-p; deterministic vs sampled; seed control (see the sketch after this list).
- Contamination: train/test leakage and benchmark memorization.
- Scoring: exact match vs pass@k; normalization; unit tests.
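One way to pin the decoding setup so runs stay comparable, sketched with the transformers API (the model id is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(0)                    # seed control (only matters when sampling)
model_id = "ORG/MODEL"                  # placeholder; use the checkpoint under test
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tok("Q: 2+2=?\nA:", return_tensors="pt")
out = model.generate(
    **inputs,
    do_sample=False,                    # greedy decoding: deterministic and easier to reproduce
    max_new_tokens=32,
)
print(tok.decode(out[0], skip_special_tokens=True))
```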
Recommended benchmark set (practical)
Pick a small but diverse set; don’t rely on a single number.
Two evaluation formulas that come up often
- Perplexity: \[ \mathrm{PPL} = \exp\left(\frac{1}{n}\sum_{t=1}^{n} -\log p(x_t\mid x_{<t})\right) \]
- Pass@k (coding): the probability that at least one of \(k\) samples passes the unit tests, typically estimated from \(n \ge k\) samples with \(c\) passing via the unbiased estimator \[ \mathrm{pass@}k = \mathbb{E}\left[\,1 - \binom{n-c}{k}\Big/\binom{n}{k}\,\right] \] (often reported as pass@1 / pass@10).
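Minimal reference implementations of both (plain Python; the pass@k function uses the standard unbiased estimator computed from \(n\) samples with \(c\) passes):

```python
import math

def perplexity(token_logprobs):
    """token_logprobs: list of log p(x_t | x_<t) in nats. PPL = exp(mean negative log-likelihood)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k given n samples, c of which pass the unit tests."""
    if n - c < k:
        return 1.0                      # even the failures cannot fill all k slots
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

print(perplexity([-2.0, -1.5, -0.5]))   # ~3.79
print(pass_at_k(n=20, c=3, k=10))       # estimated pass@10 from 20 samples
```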
General knowledge & reasoning
- e.g., MMLU (broad multiple-choice knowledge), BBH (multi-step reasoning)
Math
- e.g., GSM8K (grade-school word problems), MATH (competition-style problems)
Coding
- e.g., HumanEval, MBPP (function-level generation scored with unit tests, reported as pass@k)
Instruction following / chat quality
- e.g., MT-Bench and AlpacaEval (LLM-as-a-judge), Chatbot Arena (human preference)
Truthfulness / safety
- TruthfulQA
- Safety red-teaming: pick a policy/threat model, then use a mix of automated scanners + curated datasets; examples include garak and NVIDIA’s AEGIS (referenced in the Nemotron-4 model card).
Multilingual
- e.g., MGSM (multilingual grade-school math), translated/multilingual MMLU variants
Long-context
- LongBench
- Needle-in-a-haystack: use controlled synthetic retrieval probes; a commonly referenced public implementation is https://github.com/gkamradt/LLMTest_NeedleInAHaystack
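A minimal probe-construction sketch (plain Python; the filler text, needle wording, and depth schedule are illustrative choices, not taken from any particular implementation):

```python
def build_probe(needle, filler_sentence, depth_frac, n_filler=2000):
    """Bury `needle` at a relative depth inside a long filler context and ask for it back."""
    filler = [filler_sentence] * n_filler
    insert_at = int(depth_frac * len(filler))
    context = " ".join(filler[:insert_at] + [needle] + filler[insert_at:])
    question = "What is the secret number mentioned in the context above?"
    return f"{context}\n\n{question}"

prompt = build_probe(
    needle="The secret number is 7481.",
    filler_sentence="The sky was a calm shade of blue that afternoon.",
    depth_frac=0.25,          # sweep depth (0.0-1.0) and context length to map retrieval quality
)
# Score by checking whether the model's answer contains "7481".
```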
Summary
- Most “LLMs” you see in practice are decoder-only Transformers; encoder-only and encoder-decoder are still common for embedding and conditional generation.
- The core scaling constraints show up in attention (\(O(n^2)\)) and KV cache (roughly linear in \(n\) and number of KV groups).
- Many modern families share the same decoder recipe (RoPE + RMSNorm + SwiGLU) and differ in efficiency choices (GQA/MQA, windowed attention, MoE routing).
- Trust architecture claims that are backed by technical reports/model cards and configs.
- Evaluation is only meaningful when prompts/decoding are controlled and you use a small, diverse benchmark set.