1. Foundation

Encoding, decoding, and the building blocks of large language models.

1 Overview

A large language model (LLM) is a neural network trained to predict the next token in a sequence. This chapter develops a minimal mental model of how modern LLMs operate in practice.

At inference time, text is tokenized into token IDs (a deterministic mapping from text to discrete symbols), mapped to embeddings (learned vectors), processed by transformer blocks (stacked layers combining self-attention and a position-wise feed-forward network), and converted into logits (unnormalized vocabulary scores). Decoding then selects tokens from these logits to produce generated text.

You do not need to memorize every formula. In interviews and real systems work, the goal is to identify the major components, reason about tradeoffs (quality vs cost; latency vs throughput), and diagnose common failure modes (tokenization issues, unstable training dynamics, and repetitive decoding).

By the end of this chapter, you should be able to explain why subword tokenization dominates; describe embeddings and positional encoding; explain self-attention, masking, and why attention is expensive at long context; explain the role of residual connections, normalization, and the MLP (multi-layer perceptron; i.e., the transformer’s feed-forward network / FFN); compare positional encoding families (absolute vs rotary position embedding (RoPE) vs attention with linear biases (ALiBi)); and explain core decoding methods (greedy, beam, top-k/top-p, temperature) and their failure modes.

2 Tokenization

Tokenization is the deterministic mapping from text to discrete symbols (tokens) that the model can process.

Tokenization is a frequent interview topic because it affects quality (domain coverage), cost (token count drives computation), and multilingual behavior. Changing a tokenizer effectively changes the model’s input alphabet, which can invalidate training-time assumptions and complicate evaluation.

2.1 Tokenization units

Tokenization units are commonly defined at the word, character, or subword level.

Word-level tokenization uses whole words as vocabulary items. It is conceptually simple, but it suffers from out-of-vocabulary (OOV) terms, a heavy-tailed vocabulary, and weak sharing across morphological variants (for example, swim vs swimming).

Character-level tokenization uses individual characters, which largely eliminates OOV issues but produces much longer sequences and makes attention substantially more expensive.

Subword-level tokenization uses frequent character sequences as tokens. It typically provides broad coverage without exploding sequence length, and it captures morphology and domain terms better than word-level tokenization. In practice, subword tokenization is the standard choice because it sits between character-level tokenization (full coverage, but long sequences) and word-level tokenization (short sequences, but OOV failures).

2.2 Learning subword vocabularies

Most subword algorithms take a corpus and a target vocabulary size \(V\), then learn (i) a vocabulary of pieces and (ii) a rule for segmenting text into those pieces. The core goal is to get good coverage with short sequences: fewer tokens per sentence (cheaper attention) without reverting to word-level OOV failures.

Byte-pair encoding (BPE) (including byte-level BPE) builds a subword vocabulary by iteratively merging the most frequent adjacent units to reduce token count.
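The merge loop can be sketched in a few lines of plain Python. This is a toy illustration over a whitespace-split corpus, not a production tokenizer (real implementations handle pre-tokenization, byte fallback, and tie-breaking differently):

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merges from a toy corpus (a list of words)."""
    # Represent each word as a tuple of symbols (characters to start).
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge inside every word.
        merged = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges

merges = bpe_train(["low", "low", "lower", "newest", "widest"], num_merges=3)
print(merges)  # first two merges: ('l', 'o') then ('lo', 'w')
```

On this toy corpus, "l"+"o" and then "lo"+"w" are merged first because low appears most often, which is exactly the "most frequent adjacent units" rule in action.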

BBPE (byte-level byte-pair encoding) runs BPE over bytes instead of Unicode strings, which tends to be robust for multilingual text and noisy inputs and can generalize to arbitrary byte sequences.

WordPiece chooses merges that increase likelihood under a language-model objective (roughly, merge when the merged token is much more likely than the parts).

One common way to express the merge preference is as a log-likelihood gain (equivalently, a pointwise mutual information (PMI)-like score). If \(x\) and \(y\) are adjacent pieces and \(z=xy\) is their concatenation, then merging is attractive when \(P(z)\) is large relative to \(P(x)P(y)\):

\[ \log P(z) - (\log P(x) + \log P(y)) = \log \frac{P(z)}{P(x)P(y)} \]

Here, \(P(\cdot)\) is estimated from the corpus (or from an auxiliary LM objective, depending on the implementation). Intuitively, adjacent pieces that frequently co-occur get promoted into single tokens.
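A quick numeric check of the score, using hypothetical corpus counts for illustration:

```python
import math

# Hypothetical piece probabilities estimated from a toy corpus of 1000 pieces.
total = 1000
count = {"play": 60, "ing": 80, "playing": 40}
p = {k: v / total for k, v in count.items()}

# Log-likelihood gain of merging x="play" and y="ing" into z="playing".
gain = math.log(p["playing"]) - (math.log(p["play"]) + math.log(p["ing"]))
# Equivalently, the PMI-like form log( P(z) / (P(x) P(y)) ).
same = math.log(p["playing"] / (p["play"] * p["ing"]))
print(round(gain, 4))  # → 2.1203
```

A positive score means \(P(z)\) exceeds \(P(x)P(y)\), so the merge is attractive; a score near zero means the parts co-occur no more often than chance.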

Unigram language model (LM) starts with a large candidate set and then deletes tokens that least hurt likelihood. A simple view is that the tokenizer defines a distribution over segmentations: for a string \(s\) segmented into tokens \(t_1,\dots,t_m\), the model assumes token independence, \[ p(t_{1:m}) = \prod_{i=1}^{m} p(t_i), \] and chooses the highest-probability segmentation subject to concatenating back to the original string: \[ \hat{t}_{1:m} = \arg\max_{t_{1:m}:\,\mathrm{concat}(t_{1:m})=s} \prod_{i=1}^{m} p(t_i). \] In practice, decoding the best segmentation is done with dynamic programming (a Viterbi-style search), and single-character fallbacks are kept to avoid OOV.
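The Viterbi-style search mentioned above can be sketched with simple dynamic programming. The vocabulary and log-probabilities here are made up for illustration:

```python
import math

def viterbi_segment(s, logp):
    """Best segmentation of s under a unigram token model (dict of log-probs)."""
    n = len(s)
    best = [0.0] + [-math.inf] * n   # best[i] = best log-prob of s[:i]
    back = [0] * (n + 1)             # back[i] = start index of last token in s[:i]
    for i in range(1, n + 1):
        for j in range(i):
            piece = s[j:i]
            if piece in logp and best[j] + logp[piece] > best[i]:
                best[i] = best[j] + logp[piece]
                back[i] = j
    # Recover the token sequence by walking the backpointers.
    tokens, i = [], n
    while i > 0:
        tokens.append(s[back[i]:i])
        i = back[i]
    return tokens[::-1]

# Hypothetical vocabulary, with single-character fallbacks to avoid OOV.
vocab = {"un": -3.0, "related": -4.0, "relate": -5.0}
vocab.update({c: -8.0 for c in "unrelated"})
print(viterbi_segment("unrelated", vocab))  # → ['un', 'related']
```

The split "un" + "related" wins (log-prob \(-7\)) over "un" + "relate" + "d" and over pure character fallback, matching the arg-max over segmentations above.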

BPE and WordPiece grow the vocabulary by merging smaller units into larger ones, while Unigram LM shrinks the vocabulary by pruning a large candidate set down to \(V\).

2.3 Interview Questions

2.3.1 Why do LLMs prefer subword tokenization?

LLMs prefer subwords because they reduce OOV problems while keeping sequences shorter than character tokenization, which improves both coverage and computational efficiency.

2.3.2 How does vocabulary size trade off compute and quality?

Larger vocabularies increase embedding/softmax parameter size and can increase memory and compute per token, while smaller vocabularies increase sequence length and can raise attention cost; the best choice balances these effects for the target data.
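As a back-of-envelope sketch of the parameter side of this tradeoff (hypothetical model sizes, untied input/output matrices):

```python
# Hypothetical hidden size for illustration.
d_model = 4096

for vocab_size in (32_000, 128_000):
    embed_params = vocab_size * d_model   # input embedding matrix E
    output_params = vocab_size * d_model  # untied output (softmax) projection
    total = embed_params + output_params
    print(f"V={vocab_size:>7}: {total / 1e6:.0f}M params in embeddings + softmax")
```

Quadrupling the vocabulary quadruples these matrices (roughly 262M vs 1,049M parameters here), while weight tying or a smaller vocabulary shrinks them at the cost of longer token sequences.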

2.3.3 When would you extend a tokenizer, and what can go wrong?

You extend a tokenizer when your domain has systematic fragmentation (for example, biomedical terms) or when you need better multilingual coverage; the main risks are embedding mismatch, degraded comparability with prior evaluations, and deployment bugs from inconsistent preprocessing.

2.3.4 What are some common pitfalls in tokenization?

Tokenizer mismatch. Training with one tokenizer or chat template and serving with another can break formatting, tool calls, and safety/guardrail assumptions.

Domain fragmentation. When important terms split into many tokens (for example, biomedical or code identifiers), compute increases and fidelity can degrade.

Evaluation comparability. Perplexity and token-based metrics are only comparable under the same tokenizer, so changing tokenization can invalidate historical baselines.

Numeric brittleness. Token boundaries can destabilize counting and numerical behavior, especially when numbers are split across many tokens.

3 Embedding

Embeddings turn token IDs into vectors so the model can do math on “meaning-like” representations. Embeddings are the learned interface between discrete tokens and continuous computation. A useful mental model is: tokenization chooses the symbols; embeddings learn how those symbols behave in the model’s vector space.

3.1 From one-hot vectors to dense embeddings

Token IDs are integers, but their numeric values are arbitrary labels. If the model were to treat IDs as numbers (for example, “token 9 is larger than token 3”), it would introduce meaningless structure.

One common way to represent tokens without imposing that false ordering is a one-hot vector. For a vocabulary of size \(V\), token \(t\) is represented as a vector \(x_t \in \mathbb{R}^V\) where the \(t\)-th entry is 1 and all others are 0. One-hot vectors are unambiguous, but they are high-dimensional and treat all tokens as equally unrelated (every pair is orthogonal).

A dense embedding replaces the \(V\)-dimensional one-hot representation with a learned \(d\)-dimensional vector, where \(d \ll V\). This reduces dimensionality and lets the model learn a geometry in which related tokens can be close.

3.2 Embedding lookup

Let the vocabulary size be \(V\) and the embedding dimension be \(d\). The embedding matrix is a learned lookup table \[ E \in \mathbb{R}^{V \times d}. \] Given a token ID \(t \in \{0,\dots,V-1\}\), the model looks up its vector \(e_t = E[t]\). In code, nn.Embedding(vocab_size, embedding_dim) implements this mapping.

Equivalently, if \(x_t\) is the one-hot vector for token \(t\), then \(e_t = x_t^\top E\). This makes clear that an embedding lookup is a linear map from a sparse one-hot vector into a dense feature vector.
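Both views can be checked with toy numbers (a made-up \(V=4\), \(d=3\) table in plain Python):

```python
# Toy embedding matrix E (V=4 rows, d=3 columns).
E = [[0.1, 0.2, 0.3],
     [1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5],
     [-0.2, 0.4, 0.0]]

t = 2
lookup = E[t]  # what nn.Embedding does: read row t

# The same result as the product x_t^T E with a one-hot x_t.
x = [1.0 if i == t else 0.0 for i in range(len(E))]
matmul = [sum(x[i] * E[i][j] for i in range(len(E))) for j in range(len(E[0]))]

assert matmul == lookup  # row lookup == one-hot matrix product
```

This is why embedding lookup is cheap in practice: frameworks implement the sparse matrix product as a direct row read.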

3.3 How embeddings are learned

Classic approaches for learning embeddings include the word2vec family. Continuous bag of words (CBOW) predicts a target word from its surrounding context by averaging context embeddings. It is fast and produces useful semantic vectors, but it largely ignores word order.

In CBOW, context words are \(w_1, w_2, \dots, w_C\) and the target word is \(w_t\). Each context word can be represented as a one-hot vector \(x_i \in \mathbb{R}^V\), and an embedding matrix \(W \in \mathbb{R}^{V \times d}\) maps \(x_i\) to an embedding vector \(v_i = W^T x_i\). The hidden representation is the average \(h = \frac{1}{C} \sum_{i=1}^{C} v_i\), and a second matrix \(W' \in \mathbb{R}^{d \times V}\) produces scores \(u = {W'}^T h\) so that \[ p(w_t \mid \text{context}) = \frac{\exp(u_t)}{\sum_{j=1}^{V} \exp(u_j)}. \]

Skip-gram inverts the task and predicts context words from a target word, which often performs better on rare words at the cost of more computation.

FastText extends CBOW/skip-gram by representing a word as the sum of its character \(n\)-gram vectors, which improves robustness to morphology and reduces OOV issues. For example, with \(n=3\) and boundary markers, the word playing can be represented by character trigrams such as <pl, pla, lay, ayi, yin, ing, ng> (details vary by implementation).

The embedding of a word can be written as \[ v_{\text{playing}} = \sum_{g \in G} z_g, \] where \(G\) is the set of character n-grams and \(z_g\) is the embedding vector for each n-gram. Training then proceeds with CBOW or skip-gram on top of the n-gram representation.
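Extracting the \(n\)-grams is straightforward; this sketch uses one common convention (boundary markers, a single \(n\)), whereas real fastText also sums over a range of \(n\) values and includes the full word itself:

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word with boundary markers (one common convention)."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

grams = char_ngrams("playing")
print(grams)  # → ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']
```

A rare or unseen word like replaying still shares most of its trigrams with playing, which is exactly how the \(n\)-gram sum transfers information across morphological variants.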

3.4 Training accelerations

To accelerate word2vec-style training, hierarchical softmax replaces the full softmax over the vocabulary with a binary tree, reducing per-example complexity from \(O(V)\) to \(O(\log V)\) at the cost of changing the normalization. Negative sampling is another common trick that replaces the full softmax with a small binary classification problem against \(K\) sampled negatives.

For example, one positive pair might be:

(target = "cat", context = "cute")

and negatives might be:

("cat", "banana")
("cat", "engine")
("cat", "chair")

The negative sampling objective can be written as a log-sigmoid for the positive example plus log-sigmoid terms for the negatives: \[ L = \log \sigma(v_{+}^\top v_t) + \sum_{i=1}^{K} \log \sigma(-v_{\text{neg}_i}^\top v_t), \] where \(\sigma\) is sigmoid, \(v_+\) is the real context embedding, and \(v_{\text{neg}_i}\) are negative embeddings.
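The objective above is easy to compute directly. The embeddings below are made-up 2-dimensional toy vectors:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def neg_sampling_loss(v_t, v_pos, v_negs):
    """The objective L from the text (maximized; training minimizes -L)."""
    L = math.log(sigmoid(dot(v_pos, v_t)))
    L += sum(math.log(sigmoid(-dot(v_n, v_t))) for v_n in v_negs)
    return L

v_t = [0.5, 0.1]                     # target ("cat"), toy embedding
v_pos = [0.4, 0.3]                   # real context ("cute")
v_negs = [[-0.2, 0.6], [0.1, -0.9]]  # sampled negatives ("banana", "engine")
print(neg_sampling_loss(v_t, v_pos, v_negs))
```

Each log-sigmoid term costs \(O(d)\), so one update touches \(K{+}1\) vectors instead of all \(V\) softmax rows; aligning \(v_t\) with the positive context increases \(L\), while aligning with negatives decreases it.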

3.5 Weight tying

Decoder-only LLMs typically map hidden states back to vocabulary logits with an output projection \(W_{\text{out}} \in \mathbb{R}^{d \times V}\). Weight tying reuses parameters by setting \(W_{\text{out}} = E^\top\) (up to conventions). This reduces parameters and often improves sample efficiency, without changing the high-level structure of the forward pass.

3.6 Word embeddings vs sentence embeddings

A word/token embedding is a vector for a single vocabulary item (or subword token). A sentence embedding is a single vector meant to represent an entire span of text (a sentence, paragraph, or document), usually produced by pooling token-level hidden states (for example, mean pooling or a special classification token) or by a model trained specifically for representation learning.

3.7 Interview questions

3.7.1 Why are embeddings needed (why not use token IDs directly)?

Token IDs are arbitrary labels; embeddings provide a learned continuous space where similarity and compositional features can be represented efficiently.

3.7.2 What does “weight tying” mean (input embedding = output softmax weights), and why do it?

Reuse the input embedding matrix as the output projection (typically \(W_{\text{out}}=E^\top\)) to reduce parameters and often improve sample efficiency.

3.7.3 How do word/token embeddings and sentence embeddings differ?

Token embeddings represent individual vocabulary items; sentence embeddings compress a span of text into a single vector, usually by pooling token-level hidden states or training a dedicated encoder.

4 Attention

Attention is the mechanism that makes transformers context-aware. For each token position, the model computes a weighted mixture of other tokens’ representations, where the weights are learned from query–key similarity. In interviews, a strong answer connects this immediately to systems: attention is expressive, but it becomes expensive as context length grows.

4.1 Self-attention

Self-attention lets each token build a context-aware representation by taking a weighted average of other tokens in the same sequence. Intuitively, each position produces a query (“what I want to find”), while every position also exposes a key (“what I match on”) and a value (“what I contribute if selected”). The model compares queries to keys to get weights, then uses those weights to mix values.

Let \(X \in \mathbb{R}^{L\times d}\) be the sequence of hidden states for \(L\) positions (per batch element). For a single attention head, we compute learned linear projections \[ Q = XW_Q,\quad K = XW_K,\quad V = XW_V, \] where \(W_Q, W_K \in \mathbb{R}^{d\times d_k}\) and \(W_V \in \mathbb{R}^{d\times d_v}\). This produces \(Q,K \in \mathbb{R}^{L\times d_k}\) and \(V \in \mathbb{R}^{L\times d_v}\).

Self-attention then follows three steps:

First, compute all query–key similarity scores: \[ S = \frac{QK^\top}{\sqrt{d_k}} \in \mathbb{R}^{L\times L}. \] The scaling by \(\sqrt{d_k}\) keeps dot products from growing with dimension and saturating the softmax.

Second, optionally apply a mask (for example, a causal mask in decoder-only models), then normalize row-wise into attention weights: \[ A = \operatorname{softmax}(S) \in \mathbb{R}^{L\times L}, \] where each row of \(A\) sums to 1 and represents how strongly one position attends across all key positions.

Third, mix values using those weights: \[ Y = AV \in \mathbb{R}^{L\times d_v}. \]

In one line, the scaled dot-product form is: \[ \mathrm{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V. \]

Pseudocode (scaled dot-product attention)

# X: [batch, seq, d_model]
# Wq, Wk: [d_model, d_k], Wv: [d_model, d_v]

Q = X @ Wq  # [batch, seq, d_k]
K = X @ Wk  # [batch, seq, d_k]
V = X @ Wv  # [batch, seq, d_v]

# scores: [batch, seq, seq]
scores = Q @ K.transpose(-2, -1)
scores = scores / sqrt(d_k)

# Causal mask (decoder-only)
# set j > i to -inf so softmax gives 0 probability.
# causal_mask: [1, seq, seq] or [batch, seq, seq]
scores = scores + causal_mask

weights = softmax(scores, dim=-1)  # row-wise over keys
Y = weights @ V  # [batch, seq, d_v]

Multi-head attention repeats this computation for multiple heads (with smaller per-head dimensions), then concatenates the head outputs and applies a final linear projection back to \(d_{\text{model}}\). In encoder-style self-attention, tokens can attend to both left and right context. In decoder-only LLMs, self-attention is made causal so each position can only use past context.

4.2 Causal masking

Decoder-only models must prevent “peeking” at future tokens during training and generation. A causal mask adds \(-\infty\) (or a very large negative number) to attention scores above the diagonal, forcing the softmax to assign zero probability mass to future positions. This preserves the autoregressive factorization \(p(x_t \mid x_{<t})\) while keeping computation parallel over sequence positions during training.
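The additive mask can be built and checked in a few lines (toy \(3\times 3\) scores, a large negative standing in for \(-\infty\)):

```python
import math

def causal_mask(L, neg=-1e9):
    """L x L additive mask: 0 on/below the diagonal, a large negative above."""
    return [[0.0 if j <= i else neg for j in range(L)] for i in range(L)]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

scores = [[0.5, 1.0, 0.2],
          [0.1, 0.3, 0.9],
          [0.7, 0.2, 0.4]]
mask = causal_mask(3)
masked = [[s + m for s, m in zip(srow, mrow)] for srow, mrow in zip(scores, mask)]
weights = [softmax(row) for row in masked]
print(weights[0])  # position 0 attends only to itself: [1.0, ~0, ~0]
```

After masking, row \(i\) places zero probability on every position \(j > i\), so each token's output depends only on past context.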

4.3 Cross-attention

Cross-attention attends from one sequence to another: queries come from one stream, while keys/values come from a different stream. This is the mechanism that lets a decoder condition on encoder outputs in encoder–decoder models (for example, translation or summarization), and it also appears in multimodal models (text queries attending to image/audio features).

4.4 Multi-head attention (MHA)

Multi-head attention runs several attention heads in parallel. Each head has its own projections for \(Q, K, V\), so different heads can specialize in different relational patterns (for example, local syntax, long-range dependencies, or positional structure). Practically, the heads are concatenated and projected back to the model dimension.

4.5 Complexity: why attention is expensive at long context

For a sequence of length \(L\), standard self-attention forms an \(L \times L\) score matrix per head. That is why both compute and (naively) memory scale as \(O(L^2)\), and why long-context prefill can dominate time-to-first-token (TTFT).
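A quick back-of-envelope calculation (arbitrary example lengths) makes the quadratic growth concrete:

```python
# Score-matrix entries per head per layer at different context lengths.
for L in (2_048, 8_192, 32_768):
    entries = L * L
    print(f"L={L:>6}: {entries / 1e6:8.1f}M score entries per head per layer")
# Quadrupling L multiplies the score work by 16x.
```

Going from 2K to 32K context multiplies the score computation by 256x, which is why long prompts dominate prefill time even though each decoded token is comparatively cheap.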

4.6 Attention acceleration

Several approaches target different bottlenecks:

Sparse attention restricts the attention pattern (for example, sliding windows or block sparsity) to reduce the number of score entries computed.

Linear attention replaces the softmax with kernel feature maps so the computation can be reordered into associative products, reducing the asymptotic complexity to \(O(L)\) at the cost of changing the inductive bias.

At inference time, the key–value (KV) cache stores keys and values for previously processed tokens so decoding does not recompute them at every step, trading memory bandwidth and capacity for lower per-token compute.

FlashAttention is an implementation technique that computes attention in tiles with numerically stable reductions, avoiding materializing the full \(L\times L\) matrix and improving GPU efficiency.

Multi-query attention (MQA) shares keys and values across heads to shrink the KV cache substantially, while grouped-query attention (GQA) shares keys/values within groups of heads to preserve quality while still reducing cache size; both are motivated by decode-time memory and bandwidth.

4.7 Interview questions

4.7.1 What is the causal mask and why do we need it for decoder-only LLMs?

A causal mask prevents future-token leakage by setting attention scores to \(-\infty\) for \(j>i\) (or applying an equivalent boolean mask). This makes the model’s computation consistent with autoregressive next-token prediction while still allowing parallel training.

4.7.2 Why does vanilla attention scale as \(O(L^2)\) in sequence length?

Because each token attends to every other token, attention computes an \(L\times L\) score matrix per head. That quadratic term dominates prefill compute and can dominate memory unless you use kernel/tiling tricks or sparsity.

4.7.3 How can you turn vanilla attention into “linear attention”?

The core idea is to remove the explicit \(L\times L\) softmax attention matrix by approximating the softmax with a kernel feature map so you can reassociate the products. If you approximate \[ \exp(q^\top k) \approx \phi(q)^\top \phi(k), \] then (ignoring masking details) you can rewrite attention as \[ \mathrm{Attn}(Q,K,V) \approx \frac{\phi(Q)\bigl(\phi(K)^\top V\bigr)}{\phi(Q)\bigl(\phi(K)^\top \mathbf{1}\bigr)}, \] which can be computed in \(O(L)\) time and memory by streaming a running summary like \(\sum_j \phi(k_j)v_j^\top\) instead of forming all pairwise scores. The tradeoff is that this changes the inductive bias and can be less accurate than true softmax attention, especially for tasks that benefit from sharp, selective attention.
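The reordering can be verified numerically: a quadratic implementation that forms all pairwise kernel scores matches a linear-time streaming implementation exactly (non-causal case, ignoring masking; the \(\mathrm{elu}(x)+1\) feature map below follows common linear-attention papers, and the small \(Q,K,V\) are toy values):

```python
import math

def phi(v):
    """Positive feature map elu(x) + 1, applied elementwise."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def kernel_attn_quadratic(Q, K, V):
    """O(L^2): explicit pairwise scores phi(q)^T phi(k), row-normalized."""
    out = []
    for q in Q:
        fq = phi(q)
        w = [dot(fq, phi(k)) for k in K]
        z = sum(w)
        out.append([sum(wi * vi[j] for wi, vi in zip(w, V)) / z
                    for j in range(len(V[0]))])
    return out

def kernel_attn_linear(Q, K, V):
    """O(L): stream running sums S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j)."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    for k, v in zip(K, V):
        fk = phi(k)
        for i in range(d_k):
            z[i] += fk[i]
            for j in range(d_v):
                S[i][j] += fk[i] * v[j]
    out = []
    for q in Q:
        fq = phi(q)
        denom = dot(fq, z)
        out.append([sum(fq[i] * S[i][j] for i in range(d_k)) / denom
                    for j in range(d_v)])
    return out

Q = [[0.2, -0.5], [1.0, 0.3]]
K = [[0.1, 0.4], [-0.7, 0.9]]
V = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]
a, b = kernel_attn_quadratic(Q, K, V), kernel_attn_linear(Q, K, V)
```

The linear version never materializes an \(L\times L\) matrix: its state is a fixed-size summary \((S, z)\), which is also what makes the causal variant stream token by token.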

4.7.4 Explain the KV cache. Why do we need it?

In autoregressive decoding, each new token would naively recompute attention keys/values for all previous tokens at every step. A KV cache stores the previously computed keys and values \((K_{1:t}, V_{1:t})\) for each layer so that when you generate token \(t{+}1\) you only compute the new \((k_{t+1}, v_{t+1})\) and attend to the cached tensors. This reduces decode-time compute substantially (you avoid redoing the old projections) at the cost of memory and bandwidth proportional to context length.
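The bookkeeping can be sketched as follows (toy single-layer, single-head cache; `project` is a stand-in for the real key/value projections that also counts how often it runs):

```python
class KVCache:
    """Per-layer cache of past keys and values (toy sketch)."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def project(x):
    # Stand-in for the k/v projections; counts calls to show the savings.
    project.calls += 1
    return [2.0 * xi for xi in x], [0.5 * xi for xi in x]

project.calls = 0
cache = KVCache()
tokens = [[0.1], [0.4], [0.9], [0.3]]
for x in tokens:
    k, v = project(x)   # only the NEW token is projected each step
    cache.append(k, v)
    # ... attend over cache.keys / cache.values here ...

assert project.calls == len(tokens)  # 4 projections, not 1 + 2 + 3 + 4 = 10
```

Without the cache, step \(t\) would redo \(t\) projections, giving \(O(T^2)\) projection work over a length-\(T\) generation; with it, the work is \(O(T)\) at the cost of memory that grows linearly with context.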

4.7.5 What changes in MQA/GQA (and why does that matter for the KV cache)?

In standard MHA, each head has its own keys/values, so the KV cache scales with the number of heads. MQA shares a single set of keys/values across heads (small cache), while GQA shares within groups of heads (middle ground), reducing decode-time memory and bandwidth with smaller quality tradeoffs.

5 Feed-Forward Network, Activations, Residual Connections, and Layer Norms

5.1 FFN

The FFN (also called the MLP) is the position-wise “compute” sublayer in a transformer block. It is applied independently at each token position and primarily mixes information across features (channels) within that token.

This complements attention, which primarily mixes information across tokens by taking a content-dependent weighted sum over positions. That makes attention excellent at routing, copying, and aggregating information from the context, but it is a limited compute primitive by itself.

The FFN adds nonlinear per-token computation: without a nonlinearity, stacked linear maps collapse to a single linear map. It is also a convenient place for feature construction (thresholding, gating, and interactions) and for increasing capacity without increasing sequence length (attention cost scales with sequence length, while FFN cost scales mainly with hidden size).

In its simplest form, the FFN is a 2-layer MLP with a nonlinearity: \[ \mathrm{FFN}(x) = W_2\,\sigma(W_1 x + b_1) + b_2. \]

In practice, the hidden dimension is usually expanded (often \(d_\text{ff} \approx 4\,d_\text{model}\)) and modern LLMs frequently use gated variants (e.g., SwiGLU), which add a learned gate that improves quality-per-FLOP.

You can think of a transformer block as: move information (attention) → compute on it (FFN).

5.2 Activations

An activation function is the nonlinearity applied between linear transformations inside an MLP/FFN. It matters because a stack of purely linear layers is still just a single linear map, which severely limits what the network can represent.

In transformers, attention is the mechanism that moves information across tokens, while the FFN/MLP is where most of the per-token feature building happens. The activation inside the FFN is what makes that per-token computation nonlinear, enabling thresholds, interactions, and input-dependent behaviors.

A useful intuition is that activations such as the Gaussian error linear unit (GELU) and SwiGLU act like nonlinear “valves”: they decide which intermediate features are emphasized, suppressed, or gated.

5.2.1 Types of activations

There are multiple choices in practice because we care about a slightly different mix of properties: stable gradients, smoothness, computational efficiency, and (for gated MLPs) an inductive bias for conditional feature flow.

5.2.1.1 Classic: ReLU

The rectified linear unit (ReLU) is simple and fast, but it can create “dead” units when many activations fall into the \(x<0\) region.

A vanilla two-layer FFN is typically written as: \[ \mathrm{FFN}(x) = f(xW_1 + b_1)W_2 + b_2, \] where \(f\) is an activation (for example, ReLU). For ReLU specifically: \[ \mathrm{FFN}_{\text{ReLU}}(x) = \operatorname{ReLU}(xW_1 + b_1)W_2 + b_2. \]

5.2.1.2 Smooth: GELU / Swish

Smooth activations like GELU and the Swish function often optimize better at scale and are common defaults in transformer FFNs.

5.2.1.3 Gated: GLU-family (SwiGLU)

Gated variants in the gated linear unit (GLU) family (including SwiGLU) add an explicit learned gate. This frequently improves quality-per-compute, but it costs extra projections (more parameters and matrix multiplies).

While notation differs across papers and codebases, the common pattern is: \[ \mathrm{FFN}_{\text{gated}}(x) = \bigl((xW_{\text{up}}) \odot g(xW_{\text{gate}})\bigr) W_{\text{down}} + b, \] where \(g(\cdot)\) is the gating nonlinearity.

In this view, GLU corresponds to \(g=\sigma\) (sigmoid gating), and SwiGLU corresponds to \(g=\text{Swish}\).
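The gated pattern can be sketched in plain Python with toy dimensions and made-up weights (real implementations use batched matrix multiplies, and bias conventions vary by codebase):

```python
import math

def swish(x):
    return x / (1.0 + math.exp(-x))  # x * sigmoid(x)

def matvec(W, x):
    """W has one row per output feature."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu_ffn(x, W_up, W_gate, W_down):
    """Gated FFN: (W_up x) ⊙ swish(W_gate x), then project down (biases omitted)."""
    up = matvec(W_up, x)
    gate = [swish(g) for g in matvec(W_gate, x)]
    hidden = [u * g for u, g in zip(up, gate)]
    return matvec(W_down, hidden)

# Toy sizes: d_model = 2, d_ff = 3.
W_up   = [[0.5, -0.1], [0.2, 0.7], [-0.3, 0.4]]
W_gate = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_down = [[0.1, 0.2, 0.3], [-0.2, 0.0, 0.4]]
y = swiglu_ffn([0.8, -0.5], W_up, W_gate, W_down)
assert len(y) == 2  # back to d_model
```

Note the cost of the gate: three projections (`W_up`, `W_gate`, `W_down`) instead of the two in a vanilla FFN, which is the extra-compute tradeoff discussed above.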

Common activations you’ll see include the exponential linear unit (ELU) and related nonlinearities.

Activation Formula Notes
Sigmoid \[\sigma(x) = \frac{1}{1 + e^{-x}}\] Saturates for large \(|x|\); can cause vanishing gradients; uses exponentials
Tanh \[\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\] Zero-centered but still saturates; uses exponentials
ReLU \[\text{ReLU}(x) = \max(0, x)\] Simple and fast; zero gradient for \(x<0\) (“dying ReLU”)
Leaky ReLU \[\text{LeakyReLU}(x) = \begin{cases}x, & x>0 \\ \alpha x, & x<0 \end{cases}\] Mitigates dying ReLU with small negative slope
ELU \[\text{ELU}(x) = \begin{cases}x, & x>0 \\ \alpha(e^x - 1), & x<0 \end{cases}\] Smooth; negative outputs help shift mean; exponential cost
Swish \[\text{Swish}(x) = x \cdot \sigma(x)\] Smooth, non-monotonic; strong empirical performance
GELU \[\text{GELU}(x) = x \cdot \Phi(x)\] Standard in many Transformers; behaves like probabilistic gating
SwiGLU \[\text{SwiGLU}(x) = \text{Swish}(x_1) \odot x_2\] A common and effective gated MLP nonlinearity in modern LLMs; here \(x_1\) and \(x_2\) are two separate linear projections of the input
Softmax \[\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\] Maps logits to a categorical distribution

5.3 Residual connections

Deep networks are hard to train because information and gradients have to pass through many nonlinear layers. The most common failure modes are:

  • Vanishing/exploding gradients: gradients can shrink to near-zero or blow up as depth increases.
  • Hard optimization: each layer has to learn a full transformation, even when “do nothing” would be a good default.
  • Representation drift: activation scales can change from layer to layer, making training numerically unstable.

Residual (skip) connections were popularized by He et al. in ResNets (Deep Residual Learning, 2015/2016) and are now a standard tool for training deep transformer stacks.

A residual connection adds an identity “skip path” around a sublayer so the sublayer only needs to learn an update to the current representation: \[ x_{l+1} = x_l + f(x_l). \]

Notation: \(x_l\) is the hidden state at layer \(l\), and \(f(\cdot)\) is a sublayer transformation (self-attention or MLP/FFN).

5.4 Normalization

Training becomes brittle when activation scales drift as you stack many layers: a small change in scale at one layer can compound through depth, leading to unstable gradients and sensitivity to the learning rate. This is especially noticeable in very deep transformers and when using mixed precision, where limited numeric range makes overflow/underflow easier.

Normalization fixes this by rescaling each token’s hidden-state vector so every sublayer sees inputs in a more predictable range. The key point is that transformer norms operate per token (they do not mix information across tokens), so they stabilize optimization without changing the attention pattern.

The two normalization layers you’ll see most often in transformers are LayerNorm and RMSNorm:

  • LayerNorm: subtract the mean and divide by the standard deviation across the \(H\) hidden features, then apply a learned scale and bias.
  • RMSNorm: divide by the root-mean-square magnitude across hidden features (no mean subtraction), usually with a learned scale and no bias.
Method Formula Notes
LayerNorm \[x_{\text{norm}} = \frac{x - \mathbb{E}[x]}{\sqrt{\operatorname{Var}(x)+\epsilon}} \cdot \gamma + \beta\] Classic choice; very common in early Transformers
RMSNorm \[\mathrm{RMS}(x)=\sqrt{\frac{1}{H}\sum_{i=1}^{H} x_i^2 + \epsilon},\quad x_{\text{norm}} = \frac{x}{\mathrm{RMS}(x)} \cdot \gamma\] Simpler than LayerNorm; widely used in modern LLMs (e.g., LLaMA-family)
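Both formulas can be checked with a small plain-Python sketch (toy vector; scalar \(\gamma = 1\), \(\beta = 0\) for simplicity, whereas real layers learn a per-feature scale):

```python
import math

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    """Normalize to zero mean / unit variance across the feature dimension."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [gamma * (xi - mean) / math.sqrt(var + eps) + beta for xi in x]

def rms_norm(x, gamma=1.0, eps=1e-6):
    """Divide by root-mean-square magnitude; no mean subtraction."""
    rms = math.sqrt(sum(xi * xi for xi in x) / len(x) + eps)
    return [gamma * xi / rms for xi in x]

x = [2.0, -1.0, 0.5, 3.5]
ln, rn = layer_norm(x), rms_norm(x)
# LayerNorm output has ~zero mean; RMSNorm output has ~unit RMS.
```

Both operate on a single token's vector, which is the "per token, no mixing across positions" property noted above; RMSNorm simply drops the mean subtraction and bias, making it slightly cheaper.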

In practice, normalization makes optimization less sensitive to depth and learning rate, reduces activation drift across layers, and improves numerical stability.

5.4.1 Common transformer variants

Residual connections and normalization are usually discussed together because they solve complementary training problems. The residual (skip) path provides an explicit identity route for activations and gradients, while normalization keeps the input scale to each sublayer in a predictable numeric range. In practice, most “transformer variants” in this area are simply different ways of combining these two ideas—specifically, choosing whether the normalization happens before the sublayer (pre-norm) or after the residual addition (post-norm), and sometimes adding extra scaling or gating to keep very deep stacks stable.

In transformers, the main design choice is where normalization is applied relative to the residual. Let \(\mathrm{Norm}(\cdot)\) denote a normalization layer (LayerNorm or RMSNorm).

  • Post-norm (early Transformers): normalize after adding the residual. \[x_{l+1} = \mathrm{Norm}(x_l + f(x_l)).\]

  • Pre-norm (modern LLMs): normalize the input to the sublayer, then add the residual. \[x_{l+1} = x_l + f(\mathrm{Norm}(x_l)).\]

Pre-norm is the dominant choice in modern decoder-only LLMs because it tends to be more stable at large depth.

Method Formula Notes
Standard Residual \[x_{l+1} = x_l + f(x_l)\] Original Residual Network (ResNet)‑style skip connection
Post‑Norm Transformer \[x_{l+1} = \text{Norm}(x_l + f(x_l))\] Used in early Transformers (including BERT); less stable for very deep stacks
Pre‑Norm Transformer \[x_{l+1} = x_l + f(\text{Norm}(x_l))\] Used in modern LLMs (GPT, LLaMA); typically more stable
DeepNorm \[x_{l+1} = \text{Norm}(\alpha \cdot x_l + f(x_l))\] Post-norm with an up-scaled residual so very deep stacks stay stable; \(\alpha\) is a constant that grows with the number of layers \(N\) (e.g., \(\alpha = (2N)^{1/4}\) for encoder-only models)
ReZero \[x_{l+1} = x_l + \alpha \cdot f(x_l)\] (\(\alpha\) learnable, init = 0) Starts with near-zero residual; can stabilize early training

5.5 Interview questions

5.5.1 Why do we need the MLP/FFN if we already have attention?

Attention is a content-dependent weighted sum over tokens; it routes information but does not provide much nonlinear feature construction by itself. The FFN adds position-wise nonlinear computation that builds and transforms features after information has been mixed across tokens.

5.5.2 Why are gated activations (e.g., SwiGLU) common in modern LLMs?

Gated activations add a learned “selector” that controls which features flow through the FFN. This often improves quality at similar parameter/computation budgets because the model can represent conditional computation (feature is useful only in some contexts) more easily than with a single non-gated nonlinearity.

5.5.3 What’s the tradeoff between expressivity and compute?

More expressive activations (especially gated ones) typically require extra matrix multiplies and parameters (for example, two projections for a gate), increasing compute and memory bandwidth. In exchange, they often provide better training dynamics and higher accuracy per FLOP.

5.5.4 Why do residual connections help deep networks train?

They create an explicit identity path for both activations and gradients. This makes the network easier to optimize (layers learn residual updates), improves gradient flow through depth, and reduces the chance that training collapses due to unstable intermediate representations.

5.5.5 Pre-norm vs post-norm: what changes and why does stability differ?

In pre-norm, the input to the sublayer is normalized: \(x_{l+1}=x_l+f(\mathrm{LN}(x_l))\).

In post-norm, the residual branch is normalized after addition: \(x_{l+1}=\mathrm{LN}(x_l+f(x_l))\).

The practical stability intuition is that pre-norm keeps the input distribution to each sublayer more consistent throughout training, and it preserves a cleaner identity/skip path for gradients through the residual connection. This is one reason very deep decoder-only LLMs almost always use pre-norm (often with RMSNorm) variants.

6 Positional Encoding

Note

Positions tell the model the order of tokens—without positions, “dog bites man” and “man bites dog” look the same.

6.1 Why it is needed (what problem it solves)

Self-attention by itself has no notion of order. If you permute (shuffle) the input tokens, the attention mechanism sees the same set of token vectors and can produce the same result. In other words, without position information a transformer is permutation-invariant, which is incompatible with language where order changes meaning.

Positional encoding is also a practical systems concern because it affects:

  • Generalization to longer context (“length extrapolation”): some schemes degrade badly when you run longer sequences at inference than during training.
  • Inductive bias: some schemes favor locality/recency (useful for many tasks), while others make it easier to represent long-range structure.

6.2 What it is

Positional encoding is any mechanism that injects position information into the transformer so the model can distinguish token order. You can think of it as defining a position signal \(p(t)\) (or a pairwise bias \(b(i,j)\)) and then combining it with the usual content-based computation.

A helpful way to organize the design space is where position enters the computation:

  • Embedding-level (absolute positions): add or combine a position vector with each token embedding.
  • Attention-level (relative positions): modify attention scores so they depend on token-to-token distance.
  • Geometry-level (RoPE): rotate \(Q\) and \(K\) so relative position is reflected directly in their dot products.

6.3 How to implement it

Below are the common implementation patterns you’ll see in codebases.

6.4 Core families

Absolute positional encodings assign each position index its own vector (fixed or learned) and combine it with token embeddings. In most implementations, token and position embeddings are added (not concatenated), keeping a fixed hidden size.

Implementation pattern (absolute positions)

# token_ids: [batch, seq_len]
x = tok_embed(token_ids)                  # [batch, seq_len, d_model]
pos = pos_embed(arange(seq_len))          # [seq_len, d_model] (learned or sinusoidal)
x = x + pos[None, :, :]                   # broadcast over batch

The classic sinusoidal scheme uses sin/cos pairs with log-spaced frequencies: \[ PE(t,2i)=\sin\left(\frac{t}{10000^{2i/d_{\text{model}}}}\right),\quad PE(t,2i+1)=\cos\left(\frac{t}{10000^{2i/d_{\text{model}}}}\right) \]

Intuition: different frequencies give each position a unique “signature,” and dot products between these vectors expose relative offsets.
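A quick NumPy construction of this table, assuming the standard layout with sines on even dimensions and cosines on odd dimensions:

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model, base=10000.0):
    """PE[t, 2i] = sin(t / base^(2i/d)), PE[t, 2i+1] = cos(t / base^(2i/d))."""
    t = np.arange(seq_len)[:, None]             # [seq, 1]
    i = np.arange(d_model // 2)[None, :]        # [1, d/2]
    angles = t / base ** (2 * i / d_model)      # [seq, d/2], log-spaced freqs
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(seq_len=128, d_model=64)
# Position 0 has sin(0)=0 in even dims and cos(0)=1 in odd dims.
```

Because the table is deterministic, it can be precomputed once up to the maximum sequence length and sliced at runtime.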

Relative positional encodings represent position in terms of distances between token indices rather than absolute indices. A common pattern is to modify the attention score between token \(i\) and \(j\) with a term that depends on \((i-j)\) or \(|i-j|\).

This is the key reason relative methods often extrapolate better: the model learns “how far apart” rather than “what absolute index.”

Implementation pattern (relative bias)

scores = Q @ K.transpose(-2, -1) / sqrt(d_k)     # [batch, heads, seq, seq]
scores = scores + relative_bias                  # [1 or batch, heads, seq, seq]
weights = softmax(scores, dim=-1)
out = weights @ V

RoPE (rotary position embedding) is often implemented by applying a position-dependent 2D rotation to each pair of hidden dimensions of \(Q\) and \(K\) before computing scores. A key practical detail is that you rotate \(Q\) and \(K\) (not \(V\)), and you do it per head.

Implementation sketch (RoPE)

# Q, K: [batch, heads, seq, head_dim]
Q = apply_rope(Q, seq_positions)
K = apply_rope(K, seq_positions)
scores = Q @ K.transpose(-2, -1) / sqrt(head_dim)
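For reference, here is a minimal NumPy `apply_rope` consistent with the sketch above. It rotates consecutive (even, odd) dimension pairs; real implementations differ in how they pair dimensions (interleaved vs split halves), so treat this as illustrative:

```python
import numpy as np

def apply_rope(x, positions, base=10000.0):
    """Rotate consecutive (even, odd) dimension pairs of x by a
    position-dependent angle. x: [..., seq, head_dim]."""
    head_dim = x.shape[-1]
    freqs = base ** (-np.arange(0, head_dim, 2) / head_dim)   # [head_dim/2]
    angles = positions[:, None] * freqs[None, :]              # [seq, head_dim/2]
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# RoPE's key property: for fixed content vectors q and k,
# <R_i q, R_j k> depends only on the offset i - j.
rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

def score(i, j):
    qi = apply_rope(q, np.array([i]))[0]
    kj = apply_rope(k, np.array([j]))[0]
    return float(qi @ kj)
```

Because each 2D rotation preserves length, RoPE changes only the angle of \(Q\)/\(K\), never their norms.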

6.5 Common methods (and intuition)

| Method | Formula | What it encodes | Intuition |
| --- | --- | --- | --- |
| Sinusoidal (absolute) | \(PE(t)_i = \sin(t \cdot \omega_i),\ \omega_i = 10000^{-i/d}\) | Absolute position index | Each position gets a unique “wave signature.” Different frequencies let the model infer order (like reading a timestamp). |
| Learned absolute | \(PE(t) = E_t\) | Absolute position index | The model learns a table of positional vectors, like a “positional vocabulary.” |
| Relative (Shaw et al.) | \(\text{score}_{ij} = Q_i K_j^\top + a_{i-j}\) | Relative distance \((i-j)\) | Instead of “where are they,” it learns “how far apart are they.” This matches how language works (“this word modifies the next one”). |
| Inter-token distance (general) | \(PE(i,j) = f(\lvert i-j \rvert)\) | Arbitrary distance functions | Any mapping from distance to bias. ALiBi and T5 are specific cases of this general framework. |
| ALiBi (Attention with Linear Biases) | \(\text{score}_{ij} \mathrel{+}= -m_h (i-j)\), slope \(m_h > 0\) | Linear distance penalty | Far‑away tokens get a soft negative penalty that encourages attending to recent tokens and can generalize to very long sequences. |
| RoPE (rotary position embedding) | \(Q_t^{\text{rot}}=R_tQ_t,\ K_t^{\text{rot}}=R_tK_t\), with \(R_t=\begin{bmatrix}\cos(\theta t)&-\sin(\theta t)\\\sin(\theta t)&\cos(\theta t)\end{bmatrix}\) | Relative shift via geometric rotation | Positions rotate Q/K vectors in a circular space; relative distances become differences in rotation. Elegant, smooth, natural for attention. |
| T5 / DeBERTa (relative bias) | \(\text{score}_{ij} = Q_i K_j^\top + b_{\lvert i-j \rvert}\) | Bucketed distances | Distances are grouped into buckets; “far but not too far” gets the same bucket. Stable and easy for NLP tasks. |

One way to summarize the intuition is: sinusoidal encodings give each position a unique wave signature; learned absolute encodings treat positions like a positional vocabulary; Shaw-style relative encodings learn “how far apart”; ALiBi acts like a recency bias; RoPE rotates queries and keys so relative position emerges from angle differences; T5/DeBERTa use distance buckets; and general inter-token distance methods allow arbitrary distance-to-bias mappings.

6.5.1 Length Extrapolation

Length extrapolation refers to running a model at inference on contexts much longer than those seen during training (for example, training at 20k tokens and deploying at 128k tokens). Naive position schemes can degrade because the model never saw such large indices or such long-range dependencies. The goal is to encode position so the model’s attention computations remain well-behaved at longer lengths.

| Method | Math (core formula) | Why it helps | Pros | Cons |
| --- | --- | --- | --- | --- |
| RPE (relative position embedding) | \(\text{score}_{ij} = Q_i K_j^\top + a(i-j)\) | Learns distances instead of absolute indices | Often extrapolates well; matches linguistic locality | Needs learned biases/embeddings; bucketization is common |
| RoPE (rotary position embedding) | \(Q_t^{\text{rot}}=R_tQ_t,\; K_t^{\text{rot}}=R_tK_t\) | Relative offsets emerge from angle differences | Smooth; widely used; strong default | Can break at very long context without scaling |
| ALiBi (Attention with Linear Biases) | \(\text{score}_{ij} \mathrel{+}= -m_h (i-j)\) | Adds a simple recency bias via distance penalty | Very simple; good length generalization | Weaker modeling of some long-range patterns |
| Position Interpolation (PI) | \(t' = \frac{L_{\text{train}}}{L_{\text{new}}}\cdot t\) | Compresses positions back into the familiar range | Very easy; sometimes works surprisingly well | Interpolation can distort fine-grained structure |
| Base-N positional encoding | \(t = \sum_k d_k B^k,\quad PE_k(t) = d_k\) | Multi-scale digit representation grows naturally | Potentially unbounded length | Discontinuous; may need smoothing/tuning |
| NTK-aware RoPE scaling | \(\theta' = \theta / s\) | Slows down RoPE frequencies globally | Easy and effective | Global scaling may over/under-correct |
| Frequency-segmented NTK | \(\theta'_i = \theta_i / s\cdot\text{part}(i)\) | Scales high frequencies more than low ones | Less distortion than global scaling | More knobs; needs careful design |
| Dynamic NTK | \(\theta'_i = \theta_i \cdot \frac{m}{\beta^{d/2-1}}\) | Adaptive scaling across dimensions | Stable long-range behavior (when tuned well) | More complex; implementation details vary |
| YaRN | \(\sqrt{1/t}=0.1\ln(s)+1\) (with RoPE frequency scaling) | Balances scaling and stability | Robust long-context extension | Slightly more complex to implement |

6.6 Intuition summary

| Method | Key idea |
| --- | --- |
| RPE | Learn distances |
| RoPE | Encode position as angle rotation |
| NTK | Slow down rotation globally |
| NTK parts | Slow down rotation by frequency segment |
| Dynamic NTK | Per‑dimension adaptive scaling |
| YaRN | Balanced, stable RoPE scaling |
| ALiBi | Linear recency bias (a soft mask) |
| PI | Compress indices back to the training max |
| Base‑N | Interpret positions as digits |
| Inter-token | Arbitrary distance mapping |

6.7 Interview questions

6.7.1 Why does a transformer need positional encoding?

Because attention alone has no built-in notion of order: without position information the model cannot reliably distinguish sequences that contain the same tokens in different orders.

6.7.2 Absolute vs relative vs RoPE: what’s the one-line intuition?

  • Absolute: give each position an ID vector and add it to the token embedding.
  • Relative: bias attention based on distance \((i-j)\) so “how far apart” is explicit.
  • RoPE: encode position by rotating \(Q/K\) so relative offsets show up naturally in \(QK^\top\).

6.7.3 What is “length extrapolation,” and why does it matter?

It’s running inference on contexts longer than the model saw during training. Some position schemes (especially learned absolute embeddings) can degrade because the model never learned representations for those indices, while many relative/rotary/bias-based schemes can degrade more gracefully (though RoPE often still needs scaling tricks for very long contexts).

6.7.4 Precompute vs compute-on-the-fly: what’s typical in production?

Sinusoidal positions are typically precomputed (cheap, deterministic). Learned absolute embeddings are simple table lookups. Relative biases and ALiBi are often generated once per sequence length (or cached per length). RoPE is applied on-the-fly to \(Q/K\) but is lightweight compared to the matmuls.

7 Decoding

Note

Decoding is choosing the next token from the model’s probability distribution—different strategies trade off creativity, repetition, and correctness.

Decoding is the bridge from probabilities to actual text. At each step it turns model logits into a next-token decision, appends that token, and repeats until a stop condition is met.

Formally, decoding is how you turn the model’s next-token distribution \(p(x_t \mid x_{<t})\) into an actual token choice \(x_t\), repeatedly, until generation ends.

In production, decoding is a pipeline: you may transform logits (temperature, repetition penalties), optionally truncate the candidate set (top-\(k\)/top-\(p\)), choose the next token (argmax or sampling), and enforce termination (EOS token, stop sequences, max tokens). The core families you should be able to compare are greedy decoding (fast and deterministic), beam search (searches multiple partial sequences but can be bland for open-ended text), sampling-based methods (temperature + top-\(k\)/top-\(p\) for diversity control), and compute-amplifying strategies like best-of-\(N\) or self-consistency (better reliability at higher cost).

7.1 Greedy Search [code]

\[ x_t = \arg\max_i p(i \mid x_{<t}) \] Greedy search always picks the highest-probability token, which is fast and deterministic but can be monotonic and repetitive.
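A minimal sketch of the greedy loop, using a toy bigram “model” (a fixed logit table) in place of a real network; the names are illustrative:

```python
import numpy as np

def greedy_decode(logits_fn, prompt, eos_id, max_tokens=20):
    """Repeatedly pick the argmax of the next-token logits."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        next_id = int(np.argmax(logits_fn(tokens)))
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Toy bigram 'model': next-token logits depend only on the last token.
table = np.array([
    [0.0, 2.0, 1.0],   # after token 0, token 1 is most likely
    [1.0, 0.0, 2.0],   # after token 1, token 2 (EOS) is most likely
    [0.0, 0.0, 0.0],
])
logits_fn = lambda toks: table[toks[-1]]
print(greedy_decode(logits_fn, prompt=[0], eos_id=2))  # [0, 1, 2]
```

Because every step is an argmax, the same prompt always yields the same output — the determinism (and the repetition risk) comes directly from this loop.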

7.2 Beam Search [code]

Beam search maintains the top-\(B\) partial sequences under a cumulative score (usually log-probability): \[ \mathrm{score}(x_{1:t}) = \sum_{\tau=1}^{t} \log p(x_\tau \mid x_{<\tau}),\quad \mathrm{Beam}_t = \operatorname{TopB}\bigl(\mathrm{score}(x_{1:t})\bigr) \] It targets globally likely sequences but is often too conservative for open-ended chat unless you add diversity-promoting modifications.
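Beam search can be sketched compactly with a toy bigram model standing in for a real network; this version omits length normalization and other production refinements:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def beam_search(logits_fn, prompt, eos_id, beam=3, max_tokens=10):
    """Keep the top-`beam` partial sequences ranked by summed log-probability."""
    beams = [(0.0, list(prompt), False)]           # (score, tokens, finished)
    for _ in range(max_tokens):
        candidates = []
        for score, toks, done in beams:
            if done:                               # finished beams carry over
                candidates.append((score, toks, True))
                continue
            logp = np.log(softmax(logits_fn(toks)))
            for tok_id, lp in enumerate(logp):
                candidates.append((score + lp, toks + [tok_id], tok_id == eos_id))
        beams = sorted(candidates, key=lambda c: -c[0])[:beam]
        if all(done for _, _, done in beams):
            break
    return beams[0][1]

# Toy bigram 'model': next-token logits depend only on the last token.
table = np.array([[0.0, 2.0, 1.0],
                  [1.0, 0.0, 2.0],
                  [0.0, 0.0, 0.0]])
best = beam_search(lambda toks: table[toks[-1]], prompt=[0], eos_id=2, beam=2)
```

Expanding every beam by the full vocabulary each step is what makes beam search \(B\times\) more expensive per token than greedy decoding.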

7.3 Top‑k Sampling [code]

Restrict sampling to the top-\(k\) tokens, then renormalize: \[ \widetilde{p}(i) \propto p(i)\,\mathbf{1}[i \in \operatorname{TopK}(p)],\quad x_t \sim \widetilde{p} \] Top-k sampling works well when the distribution is peaked and performs poorly when the distribution is flat.

7.4 Top‑p Sampling (Nucleus Sampling) [code]

Choose the smallest set \(S\) such that: \[ \sum_{i\in S} p(i) \ge p \] Then renormalize within \(S\) and sample: \[ x_t \sim \widetilde{p}(\cdot \mid \cdot \in S) \] Top-p sampling is adaptive (it avoids a fixed \(k\)) and is often stable for creative generation.
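A minimal nucleus-sampling sketch over an explicit probability vector (illustrative, not a library API):

```python
import numpy as np

def top_p_sample(probs, p, rng):
    """Sample from the smallest set of tokens whose cumulative mass >= p."""
    order = np.argsort(probs)[::-1]                 # most likely first
    csum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(csum, p)) + 1      # smallest prefix reaching p
    keep = order[:cutoff]
    renorm = probs[keep] / probs[keep].sum()        # renormalize within S
    return int(rng.choice(keep, p=renorm))

rng = np.random.default_rng(0)
probs = np.array([0.55, 0.30, 0.10, 0.05])
# With p = 0.8, only tokens 0 and 1 (mass 0.85) are in the nucleus,
# so the long tail can never be sampled.
samples = {top_p_sample(probs, 0.8, rng) for _ in range(200)}
```

The adaptivity is visible here: if the distribution were sharper, the nucleus would shrink to a single token; if flatter, it would widen automatically.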

7.5 Temperature Sampling

Temperature rescales logits before the softmax: \[ p_i' = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} \] Temperature sampling uses \(T>1\) to make the distribution flatter (more random) and \(T<1\) to make it sharper (more deterministic).

7.6 Top-\(k\)/Top-\(p\) + temperature (common pipeline)

Temperature scales logits before sampling (lower values make generation more deterministic). Top-k restricts sampling to the \(k\) most likely tokens, and top-p (nucleus) restricts it to the smallest set whose cumulative probability is at least \(p\). Repetition penalties (including no-repeat n-gram heuristics) reduce looping, and stop sequences enforce clean termination for structured formats such as JavaScript Object Notation (JSON) outputs or tool calls.

In practice you typically apply temperature to logits, compute probabilities, then apply top-\(k\) and/or top-\(p\) filtering before sampling. Combining these controls randomness (temperature) and truncates low-probability noise (top-\(k\)/top-\(p\)).
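The pipeline can be sketched as one function. This is one common ordering; real implementations differ in details, for example whether top-p is computed on the renormalized top-k set (as done here) or on the full distribution:

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """One common ordering: temperature -> softmax -> top-k -> top-p -> sample."""
    if rng is None:
        rng = np.random.default_rng()
    z = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(z - z.max())                  # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # most likely first
    if top_k > 0:
        order = order[:top_k]                    # fixed-size truncation
    csum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(csum / csum[-1], top_p)) + 1
    keep = order[:cutoff]                        # nucleus within the kept set
    renorm = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=renorm))
```

In the limit of tiny temperature with `top_k=1`, this collapses to greedy decoding, which is a handy sanity check when debugging sampling code.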

7.7 Best‑of‑N (Self‑Scoring)

Generate \(N\) samples: \[ \{y^{(1)}, y^{(2)}, \ldots, y^{(N)}\} \] Then score them using the LLM itself, a reward model, or task-specific scoring, and pick the best.

7.8 Majority Voting / Self‑Consistency

Generate multiple chains: \[ y^{(1)}, y^{(2)}, \ldots, y^{(N)} \] Choose the majority answer (or cluster center). This is often used for reasoning-style prompts and can reduce variance by aggregating across samples.
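The voting step itself is simple; in this sketch, `sample_fn` stands in for “sample one reasoning chain and extract its final answer”:

```python
from collections import Counter

def self_consistency(sample_fn, n):
    """Draw n answers and return the most common one (majority vote)."""
    answers = [sample_fn() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Stand-in for sampled chains: canned final answers from 5 generations.
canned = iter(["42", "41", "42", "43", "42"])
print(self_consistency(lambda: next(canned), n=5))  # prints 42 (3 of 5 votes)
```

The compute cost is the whole story here: \(n\) full generations per query, traded for lower variance in the final answer.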

7.8.1 Summary

| Method | Deterministic? | Diversity | Notes |
| --- | --- | --- | --- |
| Greedy | Yes | Low | Fast; can be repetitive |
| Beam search | Partly | Low | Optimizes likelihood; often bland for chat |
| Top-\(k\) | No | Medium | Simple truncation; sensitive when distribution is flat |
| Top-\(p\) | No | High | Adaptive truncation; strong default for creative text |
| Temperature | No | Adjustable | Controls randomness by rescaling logits |
| Top-\(k\)/top-\(p\) + temperature | No | High | Common production knob set |
| Best-of-\(N\) | No (generation) | High | Improves quality if scoring is good |
| Self-consistency | No (generation) | Very high | Useful for reasoning; costs more compute |

Decoding is a high-leverage layer in the stack: it can materially change quality, tone, and failure modes without retraining, because it controls how uncertainty in the next-token distribution turns into an actual sequence. In interviews and production, many “model problems” are really decoding problems: repetitive loops, blandness, hallucinations from over-random sampling, and inconsistent formatting when termination is not enforced.

It is also a systems knob. More search or more samples generally improves reliability at higher latency/cost (e.g., best-of-\(N\) multiplies work), while stricter constraints (stop sequences, repetition controls) improve safety and structure but can reduce diversity. As a practical intuition: too-low temperature tends to collapse into dull or repetitive text; too-high temperature increases incoherence; and beam search is often a poor default for open-ended chat because likelihood optimization can favor generic continuations.

8 Architecture

Architecture is the layer of decisions that turns “transformers work” into “this model fits the task and the hardware.” Without clear architectural choices, you typically run into one (or more) of these problems:

  • Cost blow-ups: depth/width choices and attention style drive training FLOPs and memory.
  • Inference bottlenecks: the KV cache and attention pattern set latency and throughput.
  • Task mismatch: the objective and masking do not fit what you want (for example, bidirectional encoders for representation learning vs causal decoders for generation).
  • Scaling limits: small details (norm placement, attention variant, MoE routing) dominate stability and efficiency as you push depth or context length.

At a high level, a transformer “architecture” specifies the objective (masked LM vs next-token prediction vs seq2seq), the attention mask pattern (bi-directional, causal, or hybrid), the repeated block structure (self-attention + FFN, stacked \(N\) times), and any extras such as cross-attention, mixture of experts (MoE), or long-context techniques.

In practice, “implementing the architecture” is mostly choosing a small set of knobs and then stacking blocks: depth vs width (number of layers \(N\) vs hidden size \(d\)), attention style (causal vs bi-directional; MHA vs MQA/GQA; long-context variants), FFN style (dense FFN vs MoE; gated MLPs), and stability choices (normalization type/placement and residual scaling for very deep models).

Below are the common high-level families you should recognize.

8.1 Encoder-only (bi-directional)

Encoder-only models compute attention over both the left and right context: \[ \mathrm{Attention}(x_i) = f(x_{<i}, x_i, x_{>i}) \]

They are naturally bi-directional, which makes them strong for feature extraction tasks such as embeddings and classification, and their pretraining objective is typically masked language modeling (MLM). By themselves, they are not designed for open-ended generation. Examples include BERT, RoBERTa, and DeBERTa.

8.2 Decoder-only (auto-regressive)

Decoder-only models are trained with next-token prediction, and masked causal attention ensures only left context is accessible: \[ p(x_t \mid x_{<t}) \]

They are ideal for generation, naturally support zero-shot and few-shot prompting, and are trained efficiently under the causal language modeling objective. Examples include GPT-family models, LLaMA, BLOOM, OPT, and Mistral.

8.3 Encoder–decoder (Seq2Seq)

Encoder: \[ h = \text{Enc}(x_{1:n}), \quad \text{bi‑directional} \]

Decoder: \[ p(y_t \mid y_{<t}, h), \quad \text{auto‑regressive} \]

Encoder–decoder models are often a strong fit for machine translation and summarization, because the decoder attends to both the encoder output and its own past tokens. Examples include T5, Flan-T5, and BART.

8.4 PrefixLM (partial causal masking)

PrefixLM is a hybrid where the prefix is encoded bi-directionally while the suffix is decoded causally, which aims to capture some encoder–decoder benefits with decoder-only efficiency.

Mask: \[ M_{ij} = \begin{cases} 0, & j \leq P \quad (\text{prefix: visible to all})\\ 0, & P < j \leq i \quad (\text{suffix: causal})\\ -\infty, & \text{otherwise} \end{cases} \]

Examples include GLM and U-PaLM.
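A PrefixLM additive mask can be built in a few lines (0-indexed positions; the first `prefix_len` keys are visible to every query, later keys only causally — the names here are illustrative):

```python
import numpy as np

def prefix_lm_mask(seq_len, prefix_len):
    """0 = visible, -inf = masked; added to attention scores before softmax."""
    i = np.arange(seq_len)[:, None]     # query index
    j = np.arange(seq_len)[None, :]     # key index
    visible = (j < prefix_len) | (j <= i)
    return np.where(visible, 0.0, -np.inf)

m = prefix_lm_mask(seq_len=5, prefix_len=2)
# Row 0 (a prefix token) can see position 1 (bi-directional within the prefix),
# but not positions 2-4 (the causal suffix).
```

Setting `prefix_len=0` recovers a plain causal mask, and `prefix_len=seq_len` recovers a fully bi-directional encoder mask — which is why PrefixLM is described as a hybrid.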

8.5 Mixture of Experts (MoE)

MoE replaces a single feed-forward block with multiple experts and routes each token through a subset of those experts. \[ \mathrm{FFN}(x) = \sum_{k=1}^{N} g_k(x)\, E_k(x) \] Where \(E_k\) are expert networks and \(g_k\) are gating weights (softmax or top-\(k\)).

Routing matters because it determines load balancing (whether experts are utilized evenly), capacity (whether tokens get dropped or overflow), and ultimately both quality and throughput.

SoftMoE merges experts via a soft combination into a single feed-forward computation: \[ \mathrm{SoftMoE}(x)=W\Big(\sum_k g_k(x) E_k(x)\Big) \]

Examples include Switch Transformer, Mixtral MoE, and DeepSeek V2/V3.
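A minimal top-\(k\) routing sketch follows; the token-by-token loop is for clarity only (real implementations batch this and add load-balancing losses), and all names are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_ffn(x, W_gate, experts, top_k=2):
    """Route each token to its top_k experts; mix outputs by renormalized gates.
    x: [tokens, d]; W_gate: [d, n_experts]; experts: list of callables."""
    gates = softmax(x @ W_gate)                      # [tokens, n_experts]
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(gates[t])[::-1][:top_k]     # chosen expert ids
        w = gates[t, top] / gates[t, top].sum()      # renormalize over chosen
        for e_id, w_e in zip(top, w):
            out[t] += w_e * experts[e_id](x[t])
    return out

rng = np.random.default_rng(0)
d, n_experts = 4, 4
# Each 'expert' is just a linear map here, standing in for a full FFN.
experts = [(lambda W: (lambda v: v @ W))(rng.normal(size=(d, d)))
           for _ in range(n_experts)]
W_gate = rng.normal(size=(d, n_experts))
x = rng.normal(size=(3, d))
y = moe_ffn(x, W_gate, experts, top_k=2)
```

The sparsity is visible in the loop: each token touches only `top_k` of the `n_experts` networks, which is where the compute savings (and the load-balancing headaches) come from.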

8.6 Summary

| Architecture | Attention style | Objective | Strengths | Examples |
| --- | --- | --- | --- | --- |
| Encoder-only | Bi-directional | MLM | Feature extraction | BERT |
| Decoder-only | Causal | Next-token (NLL) | Generation; prompting | GPT, LLaMA |
| Encoder–decoder | Bi-dir + causal | Seq2Seq | Translation; summarization | T5, BART |
| PrefixLM | Hybrid mask | Prefix-style LM | Reasoning; dialogue | GLM |
| MoE | Routed experts | Any (sparse FFN) | Efficient scaling | DeepSeek, Mixtral |

8.7 Interview questions

8.7.1 Why are most chat LLMs decoder-only?

Because chat is a generation problem, and decoder-only models are trained with next-token prediction under a causal mask. That objective matches inference-time generation, and the architecture scales cleanly while supporting fast decoding with a KV cache.

8.7.2 When would you prefer encoder-only vs encoder–decoder?

Use encoder-only when you need representations (classification, retrieval embeddings, reranking). Use encoder–decoder when the task is naturally seq2seq (translation, summarization) and you benefit from separating “encode the source” from “generate the target.”

8.7.3 What is MoE in one sentence, and why does routing matter?

MoE replaces a single FFN with multiple expert FFNs and routes each token through only a subset of experts. Routing matters because it controls load balancing (throughput), capacity (overflows/drops), and quality.

8.7.4 What architectural choice tends to dominate inference memory?

For decoder-only LLMs, it’s usually the KV cache, which scales with context length, number of layers, and head dimensions (and is reduced by choices like MQA/GQA).
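A quick back-of-envelope helper makes the scaling concrete. The shapes below are illustrative 7B-class numbers, not any specific model's configuration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    # 2x for keys and values; dtype_bytes=2 assumes fp16/bf16 storage
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Full multi-head attention: every query head has its own KV head.
full_mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=32_768)
# GQA with 8 KV heads shared across query heads.
gqa_8 = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32_768)
print(full_mha / 2**30, gqa_8 / 2**30)  # GiB; GQA with 8 KV heads is 4x smaller
```

At these shapes the full-MHA cache is 16 GiB per 32k-token sequence, which is why reducing KV heads (MQA/GQA) is such a common inference optimization.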

8.8 Interview questions (decoding)

8.8.1 What does decoding do operationally (logits → transforms → filtering → selection → stop)?

Decoding turns next-token logits into an actual token: apply logit transforms (temperature/penalties), optionally filter (top-\(k\)/top-\(p\)), select (argmax/sampling/beam), append, and repeat until EOS, a stop sequence, or a token limit.

8.8.2 Why can beam search reduce diversity and increase generic outputs in open-ended generation?

Beam search keeps the highest-likelihood partial sequences, so it tends to collapse to “safe” high-probability phrasing and prunes creative but initially lower-probability branches unless you add explicit diversity mechanisms.

8.8.3 Top-\(k\) vs top-\(p\): what is the difference, and which is more stable across prompts?

Top-\(k\) samples from a fixed set of the \(k\) most likely tokens; top-\(p\) (nucleus) samples from the smallest set whose cumulative probability is at least \(p\). Top-\(p\) is usually more stable because it adapts when the distribution is sharp vs flat.

8.8.4 What does temperature change mathematically, and what failure modes show up at very low vs very high \(T\)?

Temperature rescales logits by dividing by \(T\) before softmax. Very low \(T\) becomes near-greedy (bland/repetitive), while very high \(T\) increases randomness (incoherence, contradictions, broken formatting).

8.8.5 How do repetition penalties / no-repeat n-gram heuristics work, and what can they break (e.g., valid repeated structure)?

They reduce looping by downweighting previously used tokens or forbidding tokens that would repeat an \(n\)-gram. They can break valid repetition (lists, parallel phrasing, code/JSON patterns) and sometimes force awkward rewrites.

8.8.6 When would you use best-of-\(N\) or self-consistency, and what’s the systems cost?

Use them when you can afford multiple samples to improve reliability (pick the best with a scorer, or take a majority/cluster). The systems cost is roughly \(N\times\) more generation work (tokens/latency) plus any scoring/verifier overhead.

8.8.7 What are “stop sequences,” and why do they matter for tool calls / structured outputs?

Stop sequences are strings that terminate generation when produced. They matter because they prevent trailing prose and help ensure outputs like tool calls or JSON end cleanly and remain parseable.

9 Beyond Transformers

Transformers are the default choice for modern LLMs, but they have a clear scaling pain point: standard attention becomes expensive at long context.

In production, two costs usually dominate:

  • Prefill compute grows as \(O(L^2)\) in sequence length \(L\), because full attention forms \(L\times L\) interactions; this often dominates time-to-first-token.
  • Decode memory/bandwidth grows with context, because autoregressive generation relies on a KV cache that scales with context length and number of layers.
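The quadratic growth is easy to make concrete with a small calculation (counting attention-score entries only, ignoring the linear-in-\(L\) projection costs):

```python
def score_entries(seq_len, n_heads):
    # Number of attention-score entries per layer: heads * L * L
    return n_heads * seq_len * seq_len

short = score_entries(seq_len=4_096, n_heads=32)
long_ = score_entries(seq_len=32_768, n_heads=32)
print(long_ / short)  # 64.0: an 8x longer prompt costs 64x more score work
```

This 8x-length-to-64x-compute relationship is the prefill bottleneck that most sub-quadratic architectures target.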

Most “beyond transformers” ideas target one of these bottlenecks (or deliberately change the inductive bias). The tradeoff is typically some mix of quality, training stability, and ecosystem/tooling maturity.

Here, “beyond transformers” means any model family or inference-time strategy that changes how sequence information is processed relative to standard softmax attention. Some approaches change the architecture itself (for example, state-space models). Others keep a transformer backbone but change the inference pipeline (for example, recursion, retrieval, or chunking).

A practical mental map:

  • Transformer (decoder-only): the standard stack of causal self-attention + FFN blocks that generates token-by-token using a KV cache.
  • SSM / state-space (e.g., Mamba): replaces softmax attention with a state update that can process long sequences in roughly linear time, with a different inductive bias and a smaller ecosystem.
  • Retention / hybrid (e.g., RetNet): uses a retention/recurrence-like mechanism (often with parallel training) to reduce long-context cost while staying transformer-like.
  • Recurrent hybrids (e.g., RWKV): mix transformer-style training with RNN-style inference, so generation does not rely on an attention KV cache.
  • Diffusion language models: generate by iterative refinement rather than one-step next-token prediction, often improving controllability at the cost of multiple steps.
  • Recursive LMs (inference-time): keep the base model but add orchestration (retrieve, chunk, summarize, loop) so the system can handle problems larger than the model’s raw context window.

The table below is a practical map of the major families.

| Family | Generation style | Main win | Main tradeoff | Where it shows up |
| --- | --- | --- | --- | --- |
| Transformer (decoder-only) | Autoregressive (token-by-token) | Strong quality + mature tooling | Attention/KV-cache cost | Default for chat LLMs |
| SSM / state-space (e.g., Mamba) | Autoregressive | Linear-time sequence processing | Different inductive bias; smaller ecosystem | Long-seq efficiency research + some production |
| Retention / hybrid (e.g., RetNet) | Autoregressive | Bridges recurrence + parallel training | Less standard than transformers | Research + specialized deployments |
| Recurrent hybrids (e.g., RWKV) | Autoregressive | No KV cache; “RNN-like inference” | Model family not as dominant | Edge/efficiency enthusiasts + research |
| Diffusion language models | Iterative refinement | Parallel-ish generation + controllability | Multiple denoise steps; infra differs | Emerging research; niche production |
| Recursive LMs (inference-time) | Autoregressive + scaffolding | Handle prompts beyond context windows | More orchestration; not “just the model” | Agents + long-document workflows |

In practice, you usually don’t need formulas. You need to clearly state (i) which bottleneck is being targeted, and (ii) what you give up.

10 Summary

This chapter built a working mental model of how modern LLMs turn text into text: tokenization maps strings into discrete symbols, embeddings (plus positional information) turn those symbols into vectors, transformer blocks repeatedly mix information across tokens (attention) and compute new features per token (FFN/MLP), and decoding turns next-token logits into actual token choices until a stop condition is met.

Most of the practical engineering tradeoffs come from scaling: attention is expressive but expensive at long context, prefill compute grows roughly as \(O(L^2)\) in sequence length, and decode-time throughput is often limited by the KV cache (memory/bandwidth) rather than pure FLOPs. The rest of the chapter focused on the knobs that change quality, cost, and stability without changing the core idea: tokenization and vocabulary size control sequence length and coverage; residual connections + normalization (typically pre-norm) make deep stacks train reliably; positional encodings (absolute, relative, RoPE, ALiBi) determine how order is represented and how well the model extrapolates to longer contexts; decoding strategies trade off determinism, diversity, and compute; and architecture choices (encoder-only vs decoder-only vs encoder–decoder, plus MoE) match objectives and hardware constraints. Finally, “beyond transformers” families aim to reduce long-context bottlenecks or change inductive bias, typically trading ecosystem maturity and training stability for better scaling behavior.