3. Mid-Training
From CPT to Agentic RL: Foundations, recipes, and failure modes
0.1 Learning goals
Beyond memorizing definitions, interviews test whether you can choose the right stage (CPT vs SFT vs RL vs distill) and anticipate failure modes.
ELI5: Training an LLM is like training a new hire: first you teach broad skills, then company-specific knowledge, then “how we talk to customers,” and finally you coach them with feedback on real tasks.
By the end of this chapter, you should be able to:
- Explain why mid-training (CPT) exists and how it differs from pretraining and SFT.
- Implement critical data strategies: General Replay for memory, packing for throughput, and chat templates + masking for instruction tuning.
- Apply PEFT (LoRA/QLoRA) strategies, including multi-tenant serving patterns.
- Compare alignment techniques: PPO/RLHF vs. DPO/ORPO vs. Agentic RL (e.g., GRPO and related variants).
- Design pipelines that optimize for Reasoning (System 2) using verifiers and process rewards.
- Build reliable tool use systems (schema correctness + tool selection + chaining).
- Apply test-time scaling (sampling, revision, verifier reranking) to boost reliability.
- Debug common regressions like the stability gap, reward hacking, and template mismatch.
0.2 The big picture: The life cycle of an LLM
This chapter organizes the landscape into a decision flow: ask what is missing (knowledge, behavior, preferences, reasoning reliability, or cost), because each kind of gap corresponds to a different lever.
0.2.1 Interview framing: “Which knob would you turn?”
When you’re given a product requirement, answer in this order:
- Define the target behavior (format, safety, tool use, reasoning, latency).
- Diagnose the gap (knowledge gap vs behavior gap vs optimization gap).
- Pick the cheapest effective lever (prompt → SFT → DPO → RL → CPT → distill).
- Name the regression risks (forgetting, reward hacking, template mismatch, latency).
ELI5: If the model doesn’t know the facts, teach it with reading (CPT). If it knows facts but talks wrong, teach it with examples (SFT). If it needs to prefer “better” answers, use preferences/RL. If it’s too slow/expensive, compress it.
We treat training and inference as a pipeline of interventions that alter distributions and compute budgets.
```mermaid
flowchart LR
  A[Base Model θ0] --> B{Domain gap?}
  B -->|Yes: Knowledge/Vocab/Context| C[Mid-training / CPT]
  B -->|No: Just behavior| D[SFT]
  C --> E[θ_mid]
  E --> D
  D --> F[θ_sft]
  F --> G{Alignment path}
  G -->|Chat/Style| H[DPO / ORPO / PPO]
  G -->|Reasoning/Math/Code| I[Agentic RL: GRPO / Self-training loops]
  H --> J[θ_aligned]
  I --> J
  J --> K{Inference budget?}
  K -->|Low| L[Direct decode]
  K -->|High| M[Test-time scaling]
  J --> N[Distillation / Quantization]
  N --> O[θ_deploy]
```
Core mental model
- Pretraining: builds the engine (general capabilities).
- Mid-training (CPT): tunes the engine for terrain (domain knowledge, context-length priors, sometimes RL-compatibility).
- Post-training (SFT/DPO/RL): teaches the driver (behavior, safety, preference alignment).
- Test-time scaling: spends compute at inference to navigate complex routes (better reliability without retraining).
1 Phase 1: Mid-training / Continued Pretraining (CPT)
CPT is the workhorse for domain specialization and long-context priors. It’s also the most common place to introduce regressions if you don’t guardrail general capabilities.
ELI5: CPT is “more reading”: you keep training the model on domain documents so it picks up jargon and facts naturally.
1.0.1 When CPT is the right tool
Use CPT when you see:
- high perplexity on in-domain corpora (a quick check is sketched below this list),
- systematic entity/jargon failures,
- domain-specific formatting/structures (legal docs, codebases),
- long-document understanding gaps.
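One quick diagnostic for the first signal is to compare the base model's perplexity on a sample of in-domain documents against general text. A minimal sketch using Hugging Face transformers; the model name and the two tiny document lists are placeholders, not recommendations:

```python
# Sketch: compare base-model perplexity on domain vs. general text.
# "gpt2" and the document lists are placeholders; swap in your base model and corpus samples.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_perplexity(model, tokenizer, docs, max_len=1024):
    losses = []
    for doc in docs:
        ids = tokenizer(doc, return_tensors="pt", truncation=True, max_length=max_len)["input_ids"]
        with torch.no_grad():
            out = model(ids, labels=ids)  # HF shifts labels internally for the causal-LM loss
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

domain_docs = ["The plaintiff's motion for summary judgment is denied."]   # replace with a real domain sample
general_docs = ["The weather was pleasant and the streets were quiet."]    # replace with held-out general text

print(f"domain ppl={mean_perplexity(model, tokenizer, domain_docs):.1f} "
      f"vs general ppl={mean_perplexity(model, tokenizer, general_docs):.1f}")
```

A large gap between the two numbers, especially when domain perplexity stays high after prompting tricks, is the classic "knowledge gap" signal that points to CPT rather than SFT.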
1.0.2 CPT design checklist (interview-ready)
- Data: quality > quantity; de-dup, remove boilerplate, enforce doc boundaries.
- Mixture: start with general replay (e.g., 80% domain / 20% replay) and tune the ratio against regression gates (see the config sketch after this list).
- LR: typically much lower than initial pretraining peak LR.
- Eval: track both domain gains and general regressions continuously.
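One way to make the checklist concrete is a small run config. The sketch below is illustrative only: the keys and values are assumptions meant as starting points to tune against your regression gates, not a recommended recipe.

```python
# Illustrative CPT run config capturing the checklist above.
# All values are assumptions / starting points, not recommendations.
cpt_config = {
    "data": {
        "dedup": True,                  # near-duplicate removal before training
        "strip_boilerplate": True,      # drop templated headers/footers, nav text
        "respect_doc_boundaries": True, # delimit documents with EOS when packing
    },
    "mixture": {
        "p_domain": 0.8,                # 80% domain / 20% general replay to start
        "replay_source": "pretrain_like_general_corpus",
    },
    "optim": {
        "peak_lr": 2e-5,                # typically well below the original pretraining peak LR
        "warmup_steps": 1000,
        "schedule": "cosine",
    },
    "eval": {
        "eval_every_steps": 500,
        "domain_evals": ["domain_ppl", "domain_qa"],
        "general_regression_gates": ["mmlu_subset", "gsm8k_subset"],
    },
}
```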
1.1 What it is (and isn’t)
1.1.1 What CPT optimizes
CPT uses the same next-token objective as base pretraining, but on a different distribution.
- It tends to improve knowledge recall and domain fluency.
- It can also improve tool-use and reasoning indirectly when your domain data contains those patterns (e.g., code repos, math solutions).
1.1.2 What CPT is not
- Not instruction-following by itself: you can CPT a model into a great encyclopedia that still refuses to answer in JSON.
- Not a replacement for alignment: CPT can even make safety worse if your corpus contains unsafe patterns.
ELI5: CPT teaches what to say; SFT teaches how to say it to a user.
CPT is continued next-token training on a target distribution. It is necessary when the base model lacks:
- domain knowledge (facts/jargon/entity priors),
- long-context priors (document structure, long-range retrieval),
- and sometimes RL-compatibility for reasoning-style RL (this varies across model families).
It is not SFT: SFT is about behavioral formatting and instruction following.
1.2 Optional: Tokenizer extension
Tokenizer extension can be worth it when your domain has many high-frequency terms that get split into many sub-tokens (e.g., rare drug names, API identifiers, legal citations).
ELI5: Tokenizer extension is like adding new dictionary words so the model stops spelling domain terms letter-by-letter.
1.2.1 When to extend
- Decision rule: measure the fragmentation rate on a domain vocabulary list (see the sketch after this list). If important terms routinely become 5+ tokens, you're wasting context and compute.
- Signal: you see frequent truncation/length issues, or the model mangles domain terms (broken identifiers, misspellings, bad citations).
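A minimal way to measure fragmentation is to tokenize a domain vocabulary list and count tokens per term. In the sketch below, the tokenizer name and the example terms are placeholders:

```python
# Sketch: fragmentation check for a domain vocabulary list.
# "gpt2" and domain_terms are placeholders; use your base tokenizer and real term list.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
domain_terms = ["pembrolizumab", "28 U.S.C. § 1331", "get_user_embeddings_v2"]

fragmented = 0
for term in domain_terms:
    n_tokens = len(tokenizer.tokenize(term))
    if n_tokens >= 5:                 # the 5+ token decision rule from above
        fragmented += 1
    print(f"{term!r}: {n_tokens} tokens")

print(f"fragmentation rate: {fragmented / len(domain_terms):.0%}")
```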
1.2.2 How to extend (minimal-risk workflow)
- Compile a vocabulary shortlist (top entities, terms, APIs).
- Add tokens for the worst offenders.
- Resize embeddings; initialize new token rows (randomly, or as the average of the constituent sub-token embeddings; see the sketch after this list).
- Warm up: oversample examples containing the new tokens for the first phase of CPT.
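A minimal sketch of steps 2 and 3 with Hugging Face transformers, initializing each new token's embedding as the mean of its old sub-token embeddings. The model/tokenizer names and the example tokens are placeholders; if your model's output embeddings are untied, initialize those rows the same way.

```python
# Sketch: add domain tokens, resize embeddings, and initialize new rows as the
# mean of the old sub-token embeddings. "gpt2" and new_tokens are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

new_tokens = ["pembrolizumab", "§1331"]

# Record each new token's decomposition under the *old* tokenizer first.
old_sub_ids = {t: tokenizer(t, add_special_tokens=False)["input_ids"] for t in new_tokens}

tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings().weight
with torch.no_grad():
    for t in new_tokens:
        new_id = tokenizer.convert_tokens_to_ids(t)
        emb[new_id] = emb[old_sub_ids[t]].mean(dim=0)   # average of the old sub-token rows
```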
1.2.3 Failure modes to mention
- Undertrained tokens: new tokens behave like noise early on → mitigated via oversampling and warm-up (see the sketch after this list).
- Segmentation mismatch: any downstream pipeline that tokenizes text must use the updated tokenizer.
- Distribution shock: adding tokens changes token counts and packing; re-check sequence length assumptions.
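For the oversampling/warm-up mitigation, one simple approach is to upweight documents that contain the new tokens when sampling the first phase of CPT. This is a minimal sketch; the boost factor, the example documents, and the `warmup_sampler` helper are illustrative assumptions:

```python
# Sketch: oversample documents containing newly added tokens during the CPT warm-up phase.
# `boost`, the example docs, and this helper's name are illustrative, not prescriptive.
import random

def warmup_sampler(docs, new_tokens, boost=5.0):
    weights = [boost if any(t in doc for t in new_tokens) else 1.0 for doc in docs]
    while True:
        yield random.choices(docs, weights=weights, k=1)[0]

sampler = warmup_sampler(
    docs=["pembrolizumab dosing guidance for stage IV melanoma.", "Unrelated general text."],
    new_tokens=["pembrolizumab"],
)
doc = next(sampler)   # draw documents from this sampler during the warm-up phase
```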
1.3 The stability gap
1.3.1 Why it happens
Early in CPT, gradients strongly adapt the model to the new distribution, often pushing it away from general-purpose representations.
Common contributors:
- too-high learning rate,
- low-diversity domain data (narrow corpora),
- insufficient general replay,
- noisy/mislabeled documents (scrapes, templated boilerplate).
1.3.2 Debugging playbook
If general benchmarks drop sharply:
- decrease LR and/or increase warm-up,
- increase the replay ratio,
- improve data quality (dedup, filter spam),
- add regularization toward the reference (e.g., a KL penalty to the base model's logits, if available; see the sketch below).
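A minimal sketch of that last mitigation: anchor the CPT model's next-token distribution to the frozen base model with a KL term. The HF-style `.logits` access, the function name, and the `kl_weight` value are assumptions.

```python
# Sketch: CPT loss with a KL anchor to the frozen base model's logits.
# Assumes HF-style causal LMs returning .logits; kl_weight is illustrative.
import torch
import torch.nn.functional as F

def cpt_loss_with_kl_anchor(model, base_model, x, kl_weight=0.1):
    logits = model(x[:, :-1]).logits                      # [B, T-1, V]
    targets = x[:, 1:]
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

    with torch.no_grad():                                 # base model stays frozen
        base_logits = base_model(x[:, :-1]).logits

    kl = F.kl_div(
        F.log_softmax(logits, dim=-1),
        F.log_softmax(base_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    return ce + kl_weight * kl
```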
ELI5: Stability gap is like cramming for one exam and temporarily forgetting other subjects—replay is doing a little “general homework” to prevent that.
CPT often causes an early dip in general performance before recovering.
- Mitigation: General replay (e.g., 80% domain / 20% replay as a starting point), plus regression gates.
```python
# Pseudocode: CPT loop with packing + replay.
# Assumed: model, optimizer, cross_entropy, plus helpers sample, pack_sequences
# (sketched below), regression_failed, and tune.
for step in range(T):
    # Mixture: draw a domain batch with probability p_domain, else a replay batch.
    batch = sample(D_domain) if rand() < p_domain else sample(D_replay)
    # Packing: concatenate docs to fill the context window (no padding; delimit with EOS).
    x = pack_sequences(batch, seq_len=L)                          # [batch, L]
    # Next-token objective: predict token t+1 from tokens <= t.
    logits = model(x[:, :-1])                                     # [batch, L-1, vocab]
    loss = cross_entropy(logits.reshape(-1, logits.shape[-1]), x[:, 1:].reshape(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # Regression gate: periodically run a small general-capability suite.
    if step % eval_every == 0 and regression_failed():
        tune(p_domain="down", lr="down", data_quality="up")
```
1.3.3 Practical add-ons (what interviews like)
- Curriculum: start with higher replay, then anneal toward more domain data.
- Domain mixing: if you have multiple sub-domains, use adaptive sampling (upweight domains with high loss).
- Guardrails: run a small regression suite every N steps; stop or roll back on large drops.
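The loop above calls a hypothetical `pack_sequences` helper. Here is a minimal sketch of greedy packing (tokenize, join documents with EOS, and slice fixed-length windows); its signature differs slightly from the pseudocode in that it takes raw documents plus a tokenizer, and it assumes the tokenizer defines `eos_token_id`:

```python
# Sketch of a minimal pack_sequences helper: concatenate tokenized documents,
# delimited by EOS, and slice into fixed-length training sequences (no padding).
import torch

def pack_sequences(docs, tokenizer, seq_len):
    stream = []
    for doc in docs:
        stream.extend(tokenizer(doc, add_special_tokens=False)["input_ids"])
        stream.append(tokenizer.eos_token_id)        # document boundary
    # Keep only full windows; a production version would carry the remainder
    # over to the next batch instead of dropping it.
    n_full = len(stream) // seq_len
    return torch.tensor(stream[: n_full * seq_len]).view(n_full, seq_len)
```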
1.4 Mid-training to enable RL scaling (optional add-on)
Two useful framing papers for the “CPT → RL compatibility” story are:
- OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling (arXiv:2506.20512) [@octothinker_2025]
- On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models (arXiv:2512.07783) [@interplay_pre_mid_rl_2025]