3. Mid-Training
Continued Pretraining: Controlled Domain Adaptation, Stability Engineering, and System Compatibility
1 Overview
Your team starts with a publicly available base model. It can explain Python, summarize documents, and perform well on general benchmarks. So you wire it into the company’s internal codebase and aim to turn it into a “code assistant”: understand private APIs, explain dependencies across internal services, and generate repair suggestions from runbooks. The pre-launch demo looks smooth. Real traffic after launch exposes something else entirely: the model autocompletes internal API names incorrectly, treats deprecated configuration as current convention, and loses its sense of repository hierarchy in multi-file contexts. It is not that it cannot write code. It is that it cannot “think in the distribution of this repository.”
At that point, prompting usually only fixes the surface. You can make the system prompt more detailed, stuff in more few-shot examples, or add retrieval and feed the model READMEs and interface docs. All of that helps. None of it truly changes the model’s internal priors. If the model has no parameterized representation of the repository’s naming conventions, directory structure, calling patterns, and internal terminology, it will keep mistaking patterns that “look like code” for “code from this repository.” That is why mid-training—continued pretraining (CPT)—exists.
Mid-training is not retraining a model from scratch. It is controlled continued pretraining on top of an already trained base model. The objective is often still next-token prediction, but the training distribution, data mixture, learning rate, restore mode, regression gating, and downstream compatibility requirements all change. It moves the model toward the target domain, but it does not automatically make that move stable, and it definitely does not automatically make the resulting model suitable for chat, tool use, or safety alignment.
From an engineering perspective, the value of CPT shows up mainly in four scenarios:
- The model needs to internalize new domain knowledge parametrically, not merely retrieve it at inference time;
- The structure and linguistic habits of the domain differ sharply from general internet text, for example regulations, clinical notes, papers, logs, and code repositories;
- Tokenization and context priors need adjustment, for example when high-frequency terms are split too aggressively or long-document structure is poorly learned;
- You want later SFT, preference optimization, or RL to start from a point closer to the target distribution.
But once you continue training, the risks enter the system too: catastrophic forgetting, the stability gap, distribution mismatch, tokenizer inconsistency, inconsistent checkpoint restore modes, and worse compatibility with downstream alignment. So mid-training is not “just feed it more data.” It is a classic systems engineering task: you are making a controlled tradeoff between specialization gains and preservation of general capability.
The central question of this chapter
Once a base model has finished pretraining, how do engineers adapt it to a new domain, language, context length, or task distribution without breaking what it already knows?
Running case: the internal code assistant
This chapter repeatedly uses one running case: a general code model that must be adapted to an enterprise private codebase. We use it to connect the full chain: when CPT is warranted, how to design the data mix, why general capability drops, when tokenizer expansion is worth it, and how CPT affects later SFT / DPO / RL.
1.1 Where mid-training sits in the lifecycle
This section answers a few key questions first:
- Why can the lifecycle of a modern LLM no longer be split into only “training” and “inference”?
- What role does mid-training play in the full model lifecycle?
- Why is “diagnose the gap first, then choose the training lever” a safer engineering order than “start fine-tuning first”?
Mid-training is the bridge between “general pretraining” and “task alignment.” Pretraining teaches the model a broad language world. Mid-training moves it closer to the target distribution. Post-training then shapes the behavior into a deployable product form. If you blur these three together, you will pick the wrong method, waste budget, and even hand behavior problems that should be solved by SFT to CPT, while handing knowledge problems that should be solved by CPT to prompt engineering.
If you abstract the whole system into “diagnose the gap → choose the cheapest lever that works,” you can write a very engineering-style objective:
\[ a^* = \arg\min_{a \in \mathcal{A}} C(a) \quad \text{s.t.} \quad Q(a; g) \ge \tau, \quad R(a) \le \rho \]
where:
- \(\mathcal{A} = \{\text{Prompt}, \text{RAG}, \text{SFT}, \text{CPT}, \text{DPO}, \text{RL}, \text{Distill}\}\) is the set of candidate interventions;
- \(g\) is the “gap vector,” for example a knowledge gap, behavior gap, search gap, freshness gap, or cost gap;
- \(C(a)\) is implementation cost;
- \(Q(a;g)\) is the expected repair effect of that method on the current gap;
- \(R(a)\) is the regression or system risk it introduces.
The point of this formula is not that it can be solved exactly. The point is to remind you that method selection is a constrained system-optimization problem.
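The selection rule can be sketched in a few lines. Every number below is an illustrative placeholder, not a measured cost or quality; the point is the shape of the constrained argmin, not the values:

```python
# Hypothetical numbers for illustration: cost C(a), expected repair
# quality Q(a; g) for the current gap g, and regression risk R(a).
INTERVENTIONS = {
    # name:    (cost, quality, risk)
    "Prompt":  (1.0, 0.30, 0.00),
    "RAG":     (3.0, 0.55, 0.05),
    "SFT":     (8.0, 0.60, 0.15),
    "CPT":     (40.0, 0.85, 0.30),
    "RL":      (60.0, 0.90, 0.40),
}

def select_intervention(tau, rho):
    """a* = argmin_a C(a)  s.t.  Q(a; g) >= tau and R(a) <= rho."""
    feasible = [(cost, name)
                for name, (cost, quality, risk) in INTERVENTIONS.items()
                if quality >= tau and risk <= rho]
    # Cheapest feasible intervention, or None if the constraints are infeasible.
    return min(feasible)[1] if feasible else None

print(select_intervention(tau=0.5, rho=0.2))  # RAG: cheapest feasible lever
```

Note what the sketch makes explicit: raising the quality bar \(\tau\) or tightening the risk budget \(\rho\) can push you from a cheap lever to an expensive one, or leave you with no feasible intervention at all, which is itself useful information.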
In practice, different stages modify different things:
- Pretraining mainly shapes the base representation space and general language priors;
- Mid-training mainly changes the model’s parameterized knowledge, terminology habits, and document-structure priors under a particular distribution;
- Post-training mainly shapes interaction behavior, preference ranking, safety boundaries, and tool-use patterns;
- The inference system decides how to trade budget for quality, latency, or throughput under a fixed model.
That also means a mature engineering answer is rarely “I’ll fine-tune it a bit.” It starts by defining the gap:
- If the gap is knowledge—for example the model does not understand contract clauses, is unfamiliar with internal APIs, or cannot parse medical abbreviations—prioritize CPT or external memory;
- If the gap is behavior—for example it does not output JSON, does not answer in the right customer-service tone, or does not call tools—prioritize SFT;
- If the gap is preference or long-horizon strategy—for example multi-step reasoning is unstable or answer ranking is inconsistent—then consider DPO or RL;
- If the gap is cost or latency, you should stop adding training and work on distillation, quantization, or serving optimization instead.
Seeing the running case through the lifecycle
For the internal code assistant, pretraining gives the model general programming ability. Mid-training moves it into the language world of this repository. SFT then teaches it to output patches in the team’s required format, explain changes, and call retrieval or code-search tools. Later preference learning or RL is what optimizes which repair suggestions are more reliable and which tool trajectories are more cost-efficient.
1.2 Intervention selection: when to use Prompt, RAG, SFT, CPT, RL
This section answers a few key questions first:
- What is CPT, and how is it different from pretraining and SFT?
- What kinds of problems are CPT, SFT, prompt engineering, and RAG each suited to solve?
- When is continued pretraining the right answer, and when is it overtraining?
- If you have an open-source 7B model and 50,000 clinical notes, should you do CPT, SFT, or both?
- Can mid-training introduce a new language, a new domain, or even a new modality? Where are the boundaries?
CPT solves “insufficient parameterized knowledge and distribution priors,” not “incorrect interaction behavior.” It keeps the next-token objective from pretraining, but keeps pulling the model toward a new data distribution. So CPT is best for domain knowledge, terminology density, document structure, long-context habits, and some vocabulary issues. If the problem is only output format, role tone, tool-calling protocol, or refusal boundaries, CPT is usually not the first choice.
1.2.1 The essential difference between CPT, pretraining, and SFT
Formally, pretraining and CPT both use next-token prediction. The difference is not the loss function. It is the training distribution, the learning rate, and the scope of the objective. Pretraining acquires base capability on a huge and messy general distribution. CPT performs controlled continued training on a narrower, more biased distribution.
SFT is different. Its core purpose is not to make the model understand a language world more deeply. Its purpose is to teach the model a behavior mapping under a given user input: when to answer directly, when to call a tool, and what counts as an acceptable output format. SFT data is usually demonstration dialogues or structured trajectories, not large volumes of raw documents.
So in one sentence:
- Pretraining learns the rough statistical structure of the world;
- CPT changes that “world” into one closer to the target domain;
- SFT learns how to act and speak correctly inside that world.
1.2.2 A more executable diagnostic framework
It is closer to real engineering to write “which lever should I choose?” as a simplified diagnostic. Let:
- \(g_k\) = knowledge gap
- \(g_b\) = behavior gap
- \(g_f\) = freshness gap
- \(g_t\) = tool / trajectory gap
- \(g_r\) = reasoning / search gap
Then a very practical whiteboard heuristic is:
\[ \text{choose} = \begin{cases} \text{Prompt} & \text{if } g_k, g_b, g_r \text{ are all low; only trigger is unstable}\\ \text{RAG} & \text{if } g_f \text{ is high; knowledge must be updated externally}\\ \text{SFT} & \text{if } g_b \text{ or } g_t \text{ is high}\\ \text{CPT} & \text{if } g_k \text{ is high and large-scale domain corpora are available}\\ \text{RL} & \text{if } g_r \text{ is high and the task is verifiable} \end{cases} \]
Real systems are, of course, usually combinations rather than single-choice answers. But this framework at least blocks a common mistake: treating every problem as “we should just train it more.”
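The whiteboard heuristic above can be written as a tiny decision function. The thresholds and the gap encoding are assumptions made up for illustration; real diagnosis uses eval suites, not hand-set scores:

```python
def choose_lever(g):
    """Whiteboard heuristic from the case analysis above.
    g maps gap names ('k', 'b', 'f', 't', 'r') to scores in [0, 1];
    thresholds are illustrative, not calibrated."""
    HIGH, LOW = 0.6, 0.3
    if g["f"] >= HIGH:
        return "RAG"        # freshness gap: update knowledge externally
    if g["b"] >= HIGH or g["t"] >= HIGH:
        return "SFT"        # behavior / tool-trajectory gap
    if g["k"] >= HIGH:
        return "CPT"        # knowledge gap (assumes domain corpora exist)
    if g["r"] >= HIGH:
        return "RL"         # reasoning gap on a verifiable task
    if all(g[key] <= LOW for key in ("k", "b", "r")):
        return "Prompt"     # capability exists; only the trigger is unstable
    return "Prompt+RAG"     # mixed case: combine the cheap levers first
```

The fall-through branch is the honest one: most production systems land there, which is why the framework is a filter against over-training rather than a full decision procedure.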
1.2.3 When to use CPT, and when not to
| Signal | Judgment | Why |
|---|---|---|
| The model’s perplexity on the target domain is clearly higher than on general corpora | Use CPT | The model is not yet standing inside that domain distribution |
| High-frequency terms, entity names, abbreviations, and citation formats are often wrong | Use CPT | Term co-occurrence and document-structure priors are insufficient |
| Inputs are structured long documents such as regulations, clinical notes, papers, long reports, logs, and code repositories | Use CPT | Long-document organization needs to be learned parametrically |
| You tried prompting or a small amount of SFT, and the model “sounds more like it,” but facts and terminology are still unstable | Use CPT | Surface behavior can be learned, but the knowledge prior is still missing |
| You want later SFT / RL to start from an initialization closer to the task distribution | Use CPT | It reduces the burden of filling knowledge gaps during downstream alignment |
| You only need the model to trigger capabilities it already has more reliably | Do not use CPT | Prompting / few-shot is enough |
| Knowledge changes quickly and is easier to maintain through retrieval than through parameters | Do not use CPT | Prefer RAG / external memory |
| The problem is output format, function-call schema, role tone, or refusal boundaries | Do not use CPT | That is SFT’s job |
| You do not have enough clean domain documents, only a small number of high-quality demonstrations | Do not use CPT | The data is insufficient to support CPT |
| The deployment budget is tight and cannot absorb CPT compute, evaluation, and re-alignment cost | Do not use CPT | CPT is expensive, high-leverage surgery |
Classic diagnostic question: if the model’s perplexity on legal documents is high but its general performance is fine, the first lever to reach for is CPT. High perplexity means the model is not yet standing inside the legal-text distribution. Term co-occurrence, citation format, and long-document organization are all unstable. You do not patch that by writing a few extra prompts.
1.2.4 An engineering case: clinical notes
Suppose you have an open-source 7B model and 50,000 clinical notes, and you want a medical QA assistant. A mature answer is not “do SFT directly” or “do CPT directly.” It is to reason in layers.
First, this dataset looks more like raw domain documents than curated QA demonstrations, so it is naturally better suited to CPT than to being used directly as SFT data. Second, medicine depends heavily on terminology, abbreviations, entities, and document-structure priors. A model may be decent at generic QA and still lack stable clinical language habits. The safer path is usually:
- Start with data governance: de-identify, remove templates, deduplicate, and filter obviously bad samples;
- Do CPT first: give the model clinical terminology and chart-structure priors;
- Then do a small amount of high-quality SFT: shape the model into explicit product behaviors such as QA, summarization, and triage suggestions;
- Finish with safety and evaluation: in medicine, “half-understood” is often more dangerous than a clean refusal.
So the good answer here is usually: do both, but not in an arbitrary order—CPT lays the foundation, SFT gives it form.
1.2.5 The boundary of new languages and new modalities
CPT is also often used to adapt to new languages or new modalities, but the boundary has to be stated clearly.
- A new language: if the base model already has the relevant script or byte-level tokenization capability and simply has not seen enough data, CPT can work very well;
- A new vocabulary / a new script: you may need tokenizer expansion, or even embedding resize;
- A new modality: if the base model has no visual, audio, or other modality input path, CPT alone cannot conjure a new modality into existence. You usually need a new encoder, an adapter, or a cross-modal alignment layer as well.
That boundary awareness matters. It distinguishes problems that continued pretraining can solve from problems that require architectural changes.
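When tokenizer expansion forces an embedding resize, the common heuristic is to initialize the new rows near the mean of the existing embeddings so new tokens start in a plausible region of the representation space. A minimal NumPy sketch, assuming that mean-initialization heuristic (frameworks such as Hugging Face Transformers expose this as a one-call resize):

```python
import numpy as np

def resize_embeddings(old_emb, n_new_tokens, rng=None):
    """Grow an embedding matrix for newly added vocabulary items.
    Heuristic (an assumption, not a universal rule): initialize new rows
    as the mean of existing embeddings plus small noise."""
    rng = rng or np.random.default_rng(0)
    mean = old_emb.mean(axis=0)
    noise = rng.normal(scale=0.02, size=(n_new_tokens, old_emb.shape[1]))
    # Existing rows are untouched; only new rows are appended.
    return np.vstack([old_emb, mean + noise])

old = np.random.default_rng(1).normal(size=(100, 16))   # toy (vocab, dim)
new = resize_embeddings(old, n_new_tokens=8)
print(new.shape)  # (108, 16)
```

The key property to preserve is visible in the code: the original rows are copied unchanged, so the resize itself does not perturb what the model already knows; only subsequent training on the new tokens does.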
Seeing the running case through the decision tree
For the internal code assistant, if the problem is only “the model knows the API, but keeps forgetting to output a fixed patch schema,” you should start with SFT. If the problem is “the model is simply unfamiliar with the internal functions and directory structure in the repository,” CPT is what becomes worth doing. If knowledge changes frequently, retrieving engineering docs may be more maintainable than storing them parametrically.
2 What CPT actually changes: same objective, different distribution, rewritten priors
This section answers a few key questions first:
- Why does continued pretraining still use next-token prediction but still produce large behavioral changes?
- What does “same objective, different distribution” actually change—knowledge, the representation space, or the behavior interface?
- Why should some problems be handed to CPT while others must be left to SFT, DPO, or RL?
- Mathematically, where are CPT and pretraining actually the same, and where do they differ?
For an autoregressive model, the objective function can stay the same, but once the training distribution shifts from general web text to legal text, clinical notes, papers, logs, or an internal code repository, the optimal parameters change with it. The model learns not only new facts, but also “which words are more common, which structures are more normal, and which continuations feel like this domain.”
The objective stays the same, but the optimal parameters change
We can write autoregressive language modeling in a unified form:
\[ \mathcal{L}_{AR}(\theta; D) = -\mathbb{E}_{x \sim D} \sum_{t=1}^{|x|} \log p_{\theta}(x_t \mid x_{<t}) \]
If we denote the optimal parameters under training distribution \(D\) as:
\[ \theta^*(D) = \arg\min_{\theta}\mathcal{L}_{AR}(\theta; D) \]
then moving from pretraining to mid-training does not mean replacing the objective. It means replacing \(D\) from a broad general distribution \(D_{web}\) with a narrower, target-skewed \(D_{domain}\) or mixed distribution \(D_{mix}\). Once the distribution changes, \(\theta^*(D)\) changes. That is why the loss is still next-token, yet the model increasingly starts to sound like a lawyer, a doctor, an auditor, or a maintainer of an internal repository.
CPT is often summarized as “keep training with next-token loss,” but that sentence hides the real engineering fact: the objective function staying the same does not mean system behavior stays the same. When the training distribution moves from general web pages, forums, and encyclopedias to regulations, clinical notes, papers, or internal code repositories, the model re-estimates a large set of conditional probabilities:
- co-occurrence relationships between terms;
- prior frequencies of certain entities;
- local document structure and long-range organization;
- which token sequences are “more natural” in the new world.
You can write the data mixture in a simple form:
\[ D_{mix} = \lambda D_{domain} + (1 - \lambda) D_{replay} \]
and CPT’s objective is still:
\[ \mathcal{L}_{CPT}(\theta) = - \mathbb{E}_{x \sim D_{mix}} \sum_{t=1}^{|x|} \log p_\theta(x_t \mid x_{<t}) \]
What actually determines where the model is being pulled is not what the formula looks like. It is how much of each kind of data sits inside \(D_{mix}\).
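A mixture like \(D_{mix} = \lambda D_{domain} + (1-\lambda) D_{replay}\) is implemented, in its simplest form, as weighted sampling over document streams. A minimal sketch with placeholder streams (the document strings are stand-ins, not real corpora):

```python
import itertools
import random

def sample_batch(sources, alphas, batch_size, rng):
    """Draw one batch from D_mix = sum_i alpha_i * D_i: pick a source
    with probability alpha_i, then take its next document."""
    names = list(sources)
    weights = [alphas[n] for n in names]
    return [next(sources[rng.choices(names, weights=weights, k=1)[0]])
            for _ in range(batch_size)]

sources = {
    "domain": itertools.cycle(["<repo doc>"]),   # placeholder streams
    "replay": itertools.cycle(["<web doc>"]),
}
rng = random.Random(0)
batch = sample_batch(sources, {"domain": 0.8, "replay": 0.2}, 1000, rng)
print(batch.count("<repo doc>") / len(batch))  # roughly 0.8
```

Changing the `alphas` dict is exactly the knob the text describes: the loss function never changes, but every gradient step now sees a different world.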
If you want to write “domain gain” and “general retention” explicitly as a bi-objective problem, you can also write:
\[ \max_\theta \; \Delta M_{domain}(\theta) \quad \text{s.t.} \quad \Delta M_{general}(\theta) \ge -\tau_g, \qquad \Delta M_{safety}(\theta) \ge -\tau_s \]
This is the real engineering constraint of mid-training: you are not simply maximizing domain capability. You are maximizing domain gain inside a set of regression boundaries.
An intuitive example: it is still “continuation,” but the world has already changed
Suppose two prefixes come from two different distributions:
- General web text:
  `The service supports multiple environments, including ...`
- An enterprise internal repository:
  `The billing worker writes retries to /svc/retry_queue and calls ...`
Under the first distribution, the model is more likely to continue with a common tutorial-style explanation. Under the second, it has to learn internal paths, private APIs, team conventions, and repository hierarchy. Both are “continuation.” They are not the same conditional-probability table.
The same thing happens in legal and medical settings:
- In contract text, tokens skew toward clauses, exceptions, citations, and long-range cross-references;
- In clinical notes, tokens skew toward abbreviations, timelines, lab values, and clinical style;
- In repository text, tokens skew toward paths, config keys, function families, and internal co-occurrence patterns.
So mid-training is not “memorize a little more knowledge.” It is re-sculpting the terrain of the model’s conditional probabilities so it looks more like the terrain of the target domain.
What CPT, SFT, and DPO / RL each change
It becomes clearer if you compare the common training stages side by side:
| Method | Main optimization objective | What it mainly changes | Typical data shape | Typical problem |
|---|---|---|---|---|
| Pretraining / CPT | Autoregressive next-token prediction | Knowledge priors, term co-occurrence, representation space, long-document habits | Raw documents, code, long text | “The model simply does not understand this language world” |
| SFT | Supervised behavior mapping | Output format, role tone, tool templates, dialogue interface | Instruction dialogues, tool trajectories, demonstration answers | “The model knows it, but cannot do it the way we need” |
| DPO / ORPO | Relative preference objective | Tendencies in answer quality, preference ranking | chosen / rejected preference pairs | “It can write answers, but quality ranking is unstable” |
| RL | Expected reward maximization | Search, exploration, long-horizon strategy | rollout + reward / verifier | “The task needs trial and error, planning, and verifiable feedback” |
- CPT changes what world the model believes is more common;
- SFT changes how the model should answer users inside that world;
- DPO / RL changes which behavior the model prefers among multiple plausible ones.
Why this matters in real systems
When a team says, “The model gets internal APIs wrong, and the JSON schema is unstable too,” those are actually two different layers of problem:
- It does not know the distribution of this repository. That is what CPT or external memory should solve;
- It does not know how to output under product constraints. That is what SFT or inference-time constraints should solve.
If you merge these into one problem, the most common outcome is predictable: you use the wrong training stage to patch the wrong gap, and in the end the model neither learns the new domain nor preserves its behavioral foundation.
3 Data engineering: domain corpora, mixture, general replay, packing
This section answers a few key questions first:
- What data is truly suitable for continued pretraining: raw documents, QA pairs, tool trajectories, or a mixed web crawl?
- Why is general replay not a small trick but one of the most important stability levers in mid-training?
- Why is mixture not a “recipe list” but a knob that directly controls domain gain and forgetting risk?
- Packing looks like a throughput optimization. Why does it still change the topology of the data seen by the gradient?
Most CPT projects succeed or fail first as a data engineering problem.
Loss is only a surface interface. What actually determines where the model is being pulled is the corpus distribution it sees, the weights of subdomains, the proportion of general replay, the strategy for document boundaries, and how all of those evolve over the course of training.
3.1 What good domain corpora look like
The best CPT data is usually not carefully written dialogue answers. It is raw domain documents: regulations, case law, contracts, clinical notes, papers, specifications, repository files, logs, comments, design docs. The reason is direct: CPT wants to learn the language world itself of the domain, not the behavior mapping between user and assistant.
This kind of data has three engineering advantages:
- It preserves terminology and co-occurrence relationships from the natural distribution;
- It preserves real document structure, such as section hierarchy, citations, table context, and code-file boundaries;
- It is easier to collect and expand continuously at scale.
Of course, that also means data cleaning matters even more. Residual web-crawl junk, template footers, duplicate samples, nav bars, OCR errors, and broken concatenation all flow directly into parameter updates. Many teams think they are doing “domain adaptation” when they are really teaching the model to memorize template noise.
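A first-pass governance filter can be as simple as length, boilerplate, and exact-duplicate checks. This is a minimal sketch with a made-up boilerplate pattern; real pipelines add near-deduplication (e.g. MinHash), language ID, and PII scrubbing on top:

```python
import hashlib
import re

# Illustrative boilerplate pattern, not an exhaustive rule set.
BOILERPLATE = re.compile(r"(copyright \d{4}|all rights reserved|click here)", re.I)

def clean_corpus(docs, min_chars=200):
    """Drop near-empty docs, obvious template hits, and exact duplicates."""
    seen, kept = set(), []
    for doc in docs:
        text = doc.strip()
        if len(text) < min_chars or BOILERPLATE.search(text):
            continue
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:          # exact duplicate already kept
            continue
        seen.add(digest)
        kept.append(text)
    return kept
```

The order of operations matters less than the fact that every filter here runs before tokenization: anything that survives this stage flows directly into parameter updates.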
3.2 Mixing general and domain data
Write the mixed distribution as:
\[ D_{mix} = \sum_{i=1}^{K}\alpha_i D_i, \qquad \sum_{i=1}^{K}\alpha_i = 1,\quad \alpha_i \ge 0 \]
where at least one \(D_i\) is target-domain data, and another common component is \(D_{replay}\). So tuning the mixture is not “seasoning.” It changes the composition of the world that the next parameter update sees. That is why many teams put “80/20” into their recipe. It is not a sacred number. It is a reminder that replay is not an after-the-fact patch. It is part of the training distribution itself.
Perplexity is often used to judge whether a model is truly standing inside a distribution. Given domain text \(x = (x_1, \ldots, x_T)\), its average negative log-likelihood is:
\[ \ell(x; \theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}) \]
and the corresponding perplexity is:
\[ \mathrm{PPL}(x; \theta) = \exp\big(\ell(x; \theta)\big) \]
If the model’s PPL on the target domain is clearly higher than on general corpora, while entity, terminology, and structural mistakes are frequent, that is usually a signal more in favor of CPT than of pure prompting or SFT.
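The PPL formula above takes one line to compute from per-token log-probabilities, which any scoring API can produce. A sanity check: a model that is uniform over a vocabulary of size \(V\) has perplexity exactly \(V\):

```python
import math

def perplexity(token_logprobs):
    """PPL(x; theta) = exp(-(1/T) * sum_t log p_theta(x_t | x_<t)).
    token_logprobs: per-token log-probabilities of the observed tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Uniform model over a 50k vocab assigns log(1/50000) to every token:
print(perplexity([math.log(1 / 50000)] * 10))  # 50000.0 (up to float error)
```

In practice you compare this number across corpora with the same tokenizer; comparing PPL across different tokenizers is meaningless because the per-token unit changes.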
What general replay means is simple: while you feed the model new domain data, you also keep a portion of the original, more general and diverse training data. Its role is not merely “do not forget old knowledge.” It does three things at once:
- It reduces catastrophic forgetting, so the model does not leave its original general representation too quickly;
- It buffers optimization shock, so updates are not fully dominated by the narrow distribution;
- It preserves the base needed for later alignment, retaining general QA, common-sense reasoning, and dialogue behavior.
A common starting recipe is 80% domain data / 20% general replay, but that is not a law. It is just a conservative starting point in engineering. The right way to tune it depends on signals in both directions:
- If domain metrics improve too slowly, you can gradually increase the domain share;
- If general evals drop too fast, raise replay, lower the learning rate, improve data quality, and reduce the number of training steps if necessary.
| Strategy | Approach | Advantages | Risks |
|---|---|---|---|
| Fixed mixture | Keep a fixed 80/20, 90/10, or similar recipe throughout | Simple, stable, reproducible | May not be the optimum |
| Curriculum | Use more replay early, then gradually increase the domain share later | More stable early, more specialized later | Harder to tune, slower to converge |
| Subdomain adaptive sampling | Temporarily upweight subdomains with high loss | Can patch the weakest subdomains | Easy to make the overall distribution wobble |
| Regression-gate-driven adjustment | Automatically raise replay or lower LR when evals degrade | More engineering-friendly, good for production | Requires continuous evals and automation infrastructure |
For most teams, the stable order is: start with a fixed mixture, then decide from regression gates whether curriculum learning or adaptive weights are needed.
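The regression-gate-driven row of the table can be sketched as a single knob-adjustment function. Thresholds and step size are illustrative assumptions, not recommendations:

```python
def adjust_mixture(alpha_domain, d_general, d_domain,
                   tau_g=0.02, eps_d=0.005, step=0.05):
    """Gate-driven mixture knob (illustrative thresholds):
    - raise replay (lower domain share) when general evals degrade past tau_g;
    - raise the domain share when domain gain stalls below eps_d
      and general evals are still safe."""
    if d_general < -tau_g:
        alpha_domain = max(0.5, alpha_domain - step)   # back off toward replay
    elif d_domain < eps_d:
        alpha_domain = min(0.95, alpha_domain + step)  # push the domain harder
    return alpha_domain
```

Note the asymmetry built into the branches: a general-capability regression always wins over a stalled domain metric, which is exactly the priority order the text argues for.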
3.3 Packing: throughput optimization and document-boundary risk
CPT often processes huge volumes of documents with widely varying lengths. If you do not concatenate anything, short documents create a lot of padding and waste GPU compute badly. So a common engineering practice is to pack multiple documents into one fixed context window: document packing. We will unpack its mathematics and engineering meaning later, in the section on training topology and system implementation. For now, remember one sentence:
Packing improves throughput. But if document boundaries are handled badly, it can also write nonexistent cross-document patterns into the parameters.
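That risk is easiest to see in code. This greedy-packing sketch records a per-token segment id alongside each token, which is what lets the attention mask be reset at document boundaries (the exact masking mechanism is framework-dependent and not shown here):

```python
def pack_documents(docs, seq_len, eos_id):
    """Greedily pack tokenized docs into fixed-length sequences, separated
    by EOS, recording segment ids so attention can be blocked across
    document boundaries instead of learning nonexistent cross-doc patterns."""
    tokens, segments, out = [], [], []
    seg = 0
    for doc in docs:
        for tok in doc + [eos_id]:
            tokens.append(tok)
            segments.append(seg)
            if len(tokens) == seq_len:
                out.append((tokens, segments))
                tokens, segments = [], []
        seg += 1
    return out  # any trailing partial sequence is dropped in this sketch

packs = pack_documents([[1, 2, 3], [4, 5], [6, 7, 8, 9]], seq_len=4, eos_id=0)
print(packs[1])  # ([4, 5, 0, 6], [1, 1, 1, 2]): two docs share one window
```

If you drop the `segments` output and attend across the whole window, the loss on token `6` is conditioned on document 1's tokens, which is precisely the "nonexistent cross-document pattern" the text warns about.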
Seeing the running case through the data distribution
For the internal code assistant, the most valuable CPT data is not QA pairs about “how to answer code questions.” It is the repository’s real language world: source files, interface docs, change logs, runbooks, comments, and configuration templates. Only then does the model learn what the conditional probabilities of this codebase actually look like.
4 Evaluation, gating, and stopping criteria
This section answers a few key questions first:
- Which metrics should you watch during a mid-training run to decide whether it is truly improving?
- How should regression gates be designed so they can brake early when the model starts drifting?
- For checkpoint selection, should you take the last point, the highest domain score, or the Pareto-optimal point?
- When should you keep going, and when should you stop decisively?
Mid-training is not “train first, evaluate later.” It is “train, evaluate, and decide continuously whether the next token of budget is still worth spending.”
A run is not declared successful by one domain metric. It is judged by a constrained multi-objective set of questions: is domain gain rising, have general capabilities fallen through the floor, and have behavior or safety regressed to an unacceptable degree?
4.1 What belongs in the eval suite
The most common evaluation mistake in mid-training is to treat the domain benchmark as the only signal. The steadier approach is to split the eval suite into four buckets from the start:
\[ \mathcal{E} = \{\mathcal{E}_{domain},\; \mathcal{E}_{general},\; \mathcal{E}_{behavior},\; \mathcal{E}_{safety}\} \]
- \(\mathcal{E}_{domain}\): the target tasks—legal, medical, internal code, long documents, and so on;
- \(\mathcal{E}_{general}\): common-sense QA, general reasoning, basic writing;
- \(\mathcal{E}_{behavior}\): structured output, tool templates, dialogue usability;
- \(\mathcal{E}_{safety}\): refusal, sensitive boundaries, robustness on extreme inputs.
Then give the stopping criteria an explicit form. A very practical gating rule is:
\[ \text{stop if} \quad \Delta M_{general} < -\tau_g \;\; \text{or} \;\; \Delta M_{safety} < -\tau_s \;\; \text{or} \;\; \Delta M_{domain} < \epsilon_d \; \text{for } N \text{ consecutive evals} \]
In other words, “it is no longer worth burning more tokens” and “the regression is no longer acceptable” should be two separate stopping conditions.
4.2 Checkpoint selection: look for the Pareto frontier, not the last point
Suppose the domain, general, and safety metrics at evaluation step \(t\) in a run are \((M_d(t), M_g(t), M_s(t))\). A more engineering-friendly approach is to define the Pareto frontier first:
\[ \mathcal{P} = \left\{ t \;\middle|\; \nexists s,\; M_d(s)\ge M_d(t),\; M_g(s)\ge M_g(t),\; M_s(s)\ge M_s(t) \text{ with at least one strictly better} \right\} \]
Intuitively, the checkpoints on \(\mathcal{P}\) are candidates that are not completely dominated by any other point. You then choose the point on that frontier that best fits product constraints, instead of mechanically taking:
- the last checkpoint;
- the checkpoint with the highest domain score;
- or the checkpoint with the lowest training loss.
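The frontier definition above translates directly into a dominance check over evaluation snapshots. A minimal sketch over toy `(M_d, M_g, M_s)` tuples:

```python
def pareto_frontier(points):
    """points: list of (t, (M_d, M_g, M_s)). A checkpoint stays on the
    frontier if no other checkpoint is >= on every metric and strictly
    better on at least one."""
    def dominates(a, b):
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))
    return [t for t, m in points
            if not any(dominates(other, m) for _, other in points)]

ckpts = [(1, (0.50, 0.80, 0.90)),   # toy metric values for three evals
         (2, (0.70, 0.70, 0.90)),
         (3, (0.60, 0.60, 0.80))]   # dominated by checkpoint 2
print(pareto_frontier(ckpts))  # [1, 2]
```

Checkpoints 1 and 2 survive because each trades domain gain against general capability; only checkpoint 3, worse than 2 on every axis, is removed. The product decision then happens inside the surviving set.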
For the internal code assistant, this matters a lot: the point with the highest repository-level pass@k may also be the point where general QA and refusal behavior are worst. Mistaking “the highest score” for “the best checkpoint” is one of the most expensive judgment errors in mid-training.
4.3 Gates: bringing evaluation into the training loop
Regression gates do not mean running one final report after training. They mean turning a small set of regression suites into continuous monitors during training. A practical gate set should cover at least four kinds of signal:
- General capability: common-sense QA, general reasoning, basic writing;
- Domain capability: the legal, medical, code, or long-document tasks you actually want to improve;
- Behavior capability: whether instruction following, structured output, and tool templates have degraded;
- Safety boundaries: refusal, sensitive-content boundaries, robustness on extreme inputs.
If you want to turn the gate into a single score that ranks checkpoints, one simple and practical form is:
\[ J(t) = w_d\,\widetilde{\Delta M_d}(t) - w_g\,\widetilde{\mathrm{Reg}_{general}}(t) - w_b\,\widetilde{\mathrm{Reg}_{behavior}}(t) - w_s\,\widetilde{\mathrm{Reg}_{safety}}(t) \]
where \(\widetilde{\cdot}\) denotes normalized metrics. The point is not to create a magical scalar that explains everything. The point is to make checkpoint selection auditable instead of “we eyeballed a few tables.”
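As code, \(J(t)\) is a weighted difference over already-normalized deltas. The weights below are illustrative assumptions (safety regressions weighted hardest), not tuned values:

```python
def gate_score(delta_d, reg_g, reg_b, reg_s, w=(1.0, 2.0, 1.0, 3.0)):
    """J(t) = w_d*dM_d - w_g*Reg_g - w_b*Reg_b - w_s*Reg_s.
    Inputs are assumed normalized to comparable scales; weights are
    illustrative, with safety regressions penalized hardest."""
    w_d, w_g, w_b, w_s = w
    return w_d * delta_d - w_g * reg_g - w_b * reg_b - w_s * reg_s

# Rank two hypothetical checkpoints by J(t):
ckpts = {"step_10k": (0.06, 0.01, 0.00, 0.00),
         "step_20k": (0.09, 0.04, 0.01, 0.02)}
best = max(ckpts, key=lambda k: gate_score(*ckpts[k]))
print(best)  # step_10k: more domain gain at 20k, but the regressions cost more
```

This is the auditable version of "we eyeballed a few tables": the weights, the normalization, and the winning checkpoint are all written down and reviewable.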
4.4 Stopping criteria
Suppose you run 20B tokens of CPT and MMLU drops by 5 points. You keep going to 50B tokens and the score recovers partway. A U-shaped curve like that is usually explained as:
- early on, the model absorbs the new distribution quickly, and the general representation gets compressed first;
- as replay continues to enter, the learning rate falls, and optimization settles, the model recovers part of the balance;
- but “partial recovery” does not mean it will recover all the way to the starting point, and it certainly does not mean this is the best checkpoint.
So the stopping rule should not be “since it recovered a bit, maybe wait longer.” It should be defined in advance:
- the minimum acceptable general score;
- the minimum threshold for domain gain;
- stop after no improvement for \(N\) consecutive evals;
- whether the gain per token still justifies further spend.
Writing the stop policy as an explicit rule is much steadier than “let’s wait and see.”
A typical form is:
\[ \text{stop if} \quad \Delta M_{general} < -\tau_g \;\;\text{or}\;\; \Delta M_{safety} < -\tau_s \;\;\text{or}\;\; \Delta M_{domain} < \epsilon_d \text{ for } N \text{ consecutive evals} \]
where:
- \(\tau_g\): the maximum acceptable degradation in general capability;
- \(\tau_s\): the maximum acceptable degradation in safety capability;
- \(\epsilon_d\): the minimum threshold for domain gain;
- \(N\): how many consecutive evals with no improvement trigger a stop.
The value of this formulation is not that it makes training look mathematically elegant. Its value is that it turns resource allocation from an intuition problem into an explicit operating policy.
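A minimal sketch of this stop policy as code. The thresholds (`tau_g`, `tau_s`, `eps_d`, `N`) and the metric-dict keys are illustrative assumptions, not recommendations:

```python
# Sketch of the explicit stop rule: stop on general/safety regression past a
# floor, or on N consecutive evals without meaningful domain gain.
def should_stop(history, tau_g=0.03, tau_s=0.01, eps_d=0.002, N=3):
    """history: list of eval dicts with keys d_general, d_safety, d_domain
    (deltas vs. the pre-CPT baseline). Returns (stop?, reason)."""
    latest = history[-1]
    if latest["d_general"] < -tau_g:
        return True, "general regression past tau_g"
    if latest["d_safety"] < -tau_s:
        return True, "safety regression past tau_s"
    if len(history) >= N and all(h["d_domain"] < eps_d for h in history[-N:]):
        return True, f"no domain gain for {N} consecutive evals"
    return False, "continue"
```

Because the rule is a pure function of the eval history, it can run inside the training loop and its decisions can be logged and reviewed after the fact.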
At this point you know how to judge whether a run is healthy, which metrics to watch, how to choose a checkpoint, and what “time to stop” actually means. Next, when we talk about stability, we can define clearly what it means for training to go bad.
5 Stability engineering: learning rate, restore mode, gradient control, and rollback
This section answers a few key questions first:
- What is the stability gap in CPT, and why is it such a common phenomenon?
- If general benchmarks fall suddenly right after CPT starts, in what order should you investigate?
- What roles do learning rate, data distribution, and gradient norm each play in stability?
- What are regression gates, and why are they more important than “check after training finishes”?
- When you encounter a U-shaped curve, how should you set stopping rules instead of blindly burning more compute?
Early degradation in CPT is not rare, but it must be actively managed. The danger of stability problems is not that scores necessarily fall. The danger is that many runs look, at first, like the degradation is “probably temporary,” and teams keep training until a problem that was still reversible becomes irreversible distribution drift.
5.1 What the stability gap actually is
The stability gap means that once the model starts training on a new distribution, domain loss falls quickly, but general capability drops noticeably first. After that it may partially recover, or it may not. The real challenge is that you cannot use hope to tell those cases apart.
Write two metrics as functions of training step \(t\):
- domain metric: \(M_d(t)\)
- general metric: \(M_g(t)\)
Then the simplest form of the “stability gap” is:
\[ \Delta_{stab}(t) = M_g(t) - M_g(0) \]
In many real runs, \(M_d(t)\) rises monotonically, while \(\Delta_{stab}(t)\) is significantly negative early on and only later may recover partway. That is the familiar U-shaped recovery trajectory in engineering practice.
It usually happens because several factors stack together:
- Learning-rate mismatch: for a large model that has already converged, the usable LR for continued training is often smaller than in initial pretraining;
- A distribution shift that is too abrupt: the domain data is too narrow or the change is too violent, so the model gets yanked rapidly into a local subspace;
- Insufficient replay: the old distribution is not weighted heavily enough to offset gradients from the new one;
- Higher gradient noise: dirty data, duplicated templates, a few extreme batches, and broken packing all amplify instability;
- Side effects from long context and packing: longer sequences, incorrect boundary handling, and sequence-parallel configuration issues can all make optimization more brittle.
5.2 Learning rate and warm-up: why CPT is more sensitive than pretraining to a “heavy hand”
In mid-training, the learning rate is often the first-order stability knob. The reason is not mysterious: CPT does not start from random initialization. It continues to move inside a parameter space that already has structure. Move too fast, and you rewrite old representations brutally. Move too slowly, and the domain adaptation remains weak.
A common but sloppy practice is to reuse the peak learning rate or decay schedule from pretraining. The more stable practice is usually:
- use a lower peak LR;
- design a separate warm-up for continued training;
- evaluate densely early on to make sure large regressions have not appeared;
- when needed, shorten each individual training window and compare more checkpoints instead of running everything in one shot.
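One way to implement "lower peak LR plus a fresh warm-up" is a small schedule function: linear warm-up to a reduced peak, then cosine decay to a floor. All numeric values below are placeholder assumptions, not tuned settings:

```python
import math

# Sketch of a CPT learning-rate schedule for a weights-only resume:
# fresh linear warm-up, reduced peak LR, cosine decay to a floor.
def cpt_lr(step, peak_lr=5e-5, warmup_steps=500, total_steps=20_000, min_lr=5e-6):
    if step < warmup_steps:
        # fresh warm-up: the optimizer has no history of these parameters
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The same function doubles as documentation: anyone auditing the run can see the peak, the warm-up length, and the floor in one place.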
5.3 An executable debugging order
If general benchmarks collapse after CPT starts, an engineering-grade debugging order is usually:
- Check learning rate and warm-up first: they are the most common culprit;
- Then check the replay ratio: is the domain share too aggressive?
- Check data quality: deduplicate, remove templates, remove mislabeled samples;
- Check packing and boundary handling: are you introducing cross-document contamination?
- Inspect gradient norms and loss curves: are there anomalous batches or optimizer instability?
- Add reference regularization if needed: for example a KL constraint against base-model logits to limit representational drift;
- Rollback and shrink the experiment: reproduce the problem with a smaller budget instead of gambling further on the full run.
If you want to write “reference regularization” more concretely, one common form is:
\[ \mathcal{L}_{total} = \mathcal{L}_{CPT} + \beta\, \mathbb{E}_{x \sim D_{replay}}\big[\mathrm{KL}(p_{\theta_0}(\cdot \mid x)\,\|\, p_{\theta}(\cdot \mid x))\big] \]
where \(\theta_0\) is the base-model parameter set and \(\theta\) is the current CPT parameter set. This term is not part of default CPT, but in high-risk settings it can serve as a soft “do not move too far away” constraint.
5.4 Gradient control, anomalous batches, and rollback during training
Beyond learning rate and mixture, another badly underestimated stability knob is gradient control. The most common practices are:
- gradient clipping: cap gradient spikes from anomalous batches;
- anomaly detection: set thresholds on loss, gradient norm, and length distribution;
- rollback: once a gate is violated, roll back to the most recent stable checkpoint.
The most common form of gradient clipping is:
\[ g_t \leftarrow g_t \cdot \min\left(1,\frac{c}{\|g_t\|_2}\right) \]
where \(c\) is the maximum gradient norm you set. Add one simple anomaly rule such as:
\[ \text{rollback if } \|g_t\|_2 > \mu_g + k\sigma_g \]
and you get a very plain but often useful engineering rule: do not gamble on anomalous batches.
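Both rules can be sketched framework-free (in a real run you would use e.g. `torch.nn.utils.clip_grad_norm_` for the clipping step); the window size and `k` below are illustrative assumptions:

```python
# Sketch: norm clipping plus a running-statistics anomaly check.
def clip_by_norm(grad, c):
    """Scale grad (a flat list of floats) so its L2 norm is at most c."""
    norm = sum(g * g for g in grad) ** 0.5
    scale = min(1.0, c / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

class GradAnomalyMonitor:
    """Flag a step for rollback if ||g||_2 > mu + k * sigma over a trailing window."""
    def __init__(self, k=6.0, window=100):
        self.k, self.window, self.norms = k, window, []

    def update(self, norm):
        recent = self.norms[-self.window:]
        anomalous = False
        if len(recent) >= 10:  # need some history before judging
            mu = sum(recent) / len(recent)
            sigma = (sum((n - mu) ** 2 for n in recent) / len(recent)) ** 0.5
            anomalous = norm > mu + self.k * sigma
        self.norms.append(norm)
        return anomalous
```

The monitor compares each new gradient norm against the statistics of the preceding window, so a single extreme batch trips the flag without stalling normal training.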
5.5 A CPT training skeleton with replay and gating
```python
# Pseudocode: CPT training loop with replay, packing, KL reference constraint,
# and regression gate
for step in range(T):
    batch = sample(D_domain) if rand() < p_domain else sample(D_replay)
    x = pack_sequences(batch, seq_len=L, add_eos=True)
    logits = model(x[:, :-1])
    loss_nll = cross_entropy(logits, x[:, 1:])
    if use_ref_kl:
        with torch.no_grad():
            ref_logits = ref_model(x[:, :-1])
        loss_kl = kl_divergence(ref_logits, logits)
        loss = loss_nll + beta * loss_kl
    else:
        loss = loss_nll
    loss.backward()
    clip_grad_norm_(model.parameters(), max_norm=grad_clip)
    optimizer.step()
    optimizer.zero_grad()
    if step % eval_every == 0:
        metrics = run_regression_suite()
        if gate_triggered(metrics):
            rollback_or_tune(lr="down", replay="up", data="clean")
```

Seeing the running case through stability
A common failure mode in the internal code assistant is not “loss does not go down.” It is “repository tasks improve while general dialogue and safety behavior decline.” If the team watches only code-task pass@k, and does not include general QA, structured output, and refusal in its regression gates, it can easily burn a run that was still recoverable into a model that only “speaks repository slang.”
The goal of stability engineering is not to guarantee that metrics never fall. It is to guarantee that once they start falling, you know where to look, what to tune, and whether more compute should still be spent.
6 Tokenizer expansion: high-risk structural surgery
This section answers a few key questions first:
- When should you expand the tokenizer during CPT instead of keeping the original vocabulary?
- Which domains are most likely to be hurt by token fragmentation: scientific notation, code tokens, multilingual characters, or legal citations?
- How should the embeddings of new tokens be initialized to keep risk as low as possible?
- What is an “undertrained token,” and why does it show up so often after vocabulary expansion?
- Once the tokenizer is expanded, why must the training, evaluation, and deployment pipeline all be upgraded together?
Tokenizer expansion is high-return, high-risk “checkpoint surgery,” not a routine move. It can significantly reduce fragmentation of domain terms, save context budget, and improve modeling efficiency. But it also changes the embedding matrix, the sequence-length distribution, packing efficiency, log parsing, and the serving stack. Unless you have a strong enough signal that the gains are real, the safest default is still usually “do not expand the vocabulary.”
6.1 When it is worth expanding the tokenizer
A practical test is to measure the fragmentation rate of the target domain. For a term \(u\), let the original tokenizer produce a subword sequence of length \(k(u)\). Then you can define:
\[ r(u) = k(u) \]
so an unfragmented term has \(r(u) = 1\), and higher values mean the term is split into more pieces.
If you take the expectation over a weighted term set \(\mathcal{V}_{domain}\) with occurrence frequency \(f(u)\), the average fragmentation rate becomes:
\[ \bar r = \frac{\sum_{u \in \mathcal{V}_{domain}} f(u)\, r(u)}{\sum_{u \in \mathcal{V}_{domain}} f(u)} \]
When many high-frequency, high-value terms have persistently high \(r(u)\)—for example:
- drug names, protein names, and trial IDs in biomedicine;
- clause numbers and citation formats in law;
- API names, paths, and configuration keys in code;
- formula symbols, units, and scientific notation in scientific literature;
- new scripts and rare character combinations in multilingual settings;
the model pays two costs at once:
- Context cost: a concept that should take 1 token now takes 4 to 8;
- Learning cost: the model has to reconstruct a stable concept across multiple fragments, which makes both training and decoding harder.
Only when that fragmentation is frequent enough, expensive enough, and clearly hurts quality is vocabulary expansion worth putting on the roadmap.
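Measuring \(\bar r\) only needs the tokenizer's encode function over a frequency-weighted term list. The toy tokenizer below is a stand-in assumption for your real one, included only so the sketch is self-contained:

```python
# Sketch of the average fragmentation rate: r(u) = len(tokenize(u)),
# weighted by occurrence frequency f(u).
def avg_fragmentation(term_freqs, tokenize):
    """term_freqs: {term: occurrence count}. Returns frequency-weighted mean r(u)."""
    num = sum(f * len(tokenize(u)) for u, f in term_freqs.items())
    den = sum(term_freqs.values())
    return num / den

# Toy stand-in tokenizer: known words are 1 token, unknown terms split per character.
VOCAB = {"the", "contract", "user"}
def toy_tokenize(term):
    return [term] if term in VOCAB else list(term)

terms = {"get_user_profile_v2": 40, "contract": 10}  # hypothetical term counts
rate = avg_fragmentation(terms, toy_tokenize)
```

Run this over your real high-frequency term list before and after a candidate vocabulary: a large drop in \(\bar r\) on terms that actually matter is the signal that expansion might pay for its risk.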
6.2 What you can still do without expanding the vocabulary
Tokenizer expansion is not the only answer. In many settings, cheaper alternatives are already enough:
- continue domain CPT so the model learns to understand the terms stably even under the old tokenization;
- reduce free-form generation of complex terms at inference time through retrieval and templates;
- normalize on the serving side, for example by mapping common aliases to a canonical form.
So tokenizer expansion should be treated as an engineering upgrade for when fragmentation has become an obvious bottleneck in quality or cost—not as a routine beautification pass for the vocabulary.
6.3 Expanding the vocabulary is surgery, not one line of config
Suppose the old vocabulary size is \(V\), the embedding dimension is \(d\), and the old embedding matrix is:
\[ E_{old} \in \mathbb{R}^{V \times d} \]
Now add \(k\) new tokens, and the new embedding matrix becomes:
\[ E_{new} \in \mathbb{R}^{(V+k) \times d} \]
If the model uses tied embeddings, the output head \(W_{lm}\) must be expanded at the same time. For a new token \(u_{new}\), one more stable initialization is to take the mean of the vectors corresponding to its old subword decomposition:
\[ E_{new}(u_{new}) = \frac{1}{m}\sum_{j=1}^{m} E_{old}(s_j) \]
where \((s_1, \ldots, s_m)\) is the old-tokenizer decomposition of that term. Compared with purely random initialization, this “folded initialization” is closer to compressing an old representation into a new slot, and is usually smoother.
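A plain-Python sketch of this folded initialization. Real code would operate on the checkpoint's embedding tensor (and, with tied embeddings, expand the LM head identically); plain lists stand in here for \(E \in \mathbb{R}^{V \times d}\):

```python
# Sketch: expand the embedding matrix by k new rows, each initialized as the
# mean of the rows of that token's old subword decomposition.
def expand_embeddings(E_old, new_token_pieces):
    """E_old: list of d-dim rows; new_token_pieces: one list of old-token ids
    per new token. Returns E_new with k extra mean-initialized rows appended."""
    E_new = [row[:] for row in E_old]  # old rows are kept unchanged
    d = len(E_old[0])
    for pieces in new_token_pieces:
        mean_row = [sum(E_old[i][j] for i in pieces) / len(pieces) for j in range(d)]
        E_new.append(mean_row)
    return E_new

E_old = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # V=3 old rows, d=2 (toy values)
E_new = expand_embeddings(E_old, [[0, 1]])      # one new token = mean of rows 0 and 1
```

New token ids are appended after the old vocabulary, so existing token ids (and everything keyed on them) stay valid.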
6.4 The “undertrained token” problem
The most common failure mode after vocabulary expansion is undertrained tokens. The essence is simple: the new token has entered the vocabulary, but samples containing it still appear too infrequently early in training, so the new embedding row never receives enough gradient updates. The symptoms are:
- unstable generation;
- outputs that sometimes look like random noise;
- worse behavior in real domain context than under the old tokenization.
The usual mitigations include:
- oversample samples that contain the new tokens;
- give the expanded-tokenizer model its own warm-up stage;
- control the number of new tokens, expanding only the truly high-frequency, high-value terms;
- evaluate typical tasks that contain the new tokens more densely, instead of looking only at overall loss.
6.5 Tokenizer consistency
Once the tokenizer is expanded, every stage—training, evaluation, inference, log analysis, caching, offline preprocessing—must be upgraded in lockstep. Otherwise the most likely result is not “performance is slightly worse.” It is samples and checkpoints that are fundamentally incompatible:
- the same sentence is split into different tokens at training time and inference time;
- embedding indices no longer match;
- preprocessing caches become invalid;
- the online length budget no longer matches offline estimates.
So tokenizer expansion is not only a modeling problem. It is also a version-management and deployment-consistency problem.
Seeing the running case through tokenizer surgery
In the internal code assistant, tokenizer expansion is most easily triggered by high-frequency API names, configuration keys, path fragments, and internal service abbreviations. But those are also the tokens most likely to break on the serving side: if training has moved to the new tokenizer while online retrieval or caching still uses the old one, the same function name gets split differently in different stages. At that point, “good offline, weird online” is almost inevitable.
7 Where do gradients actually come from? Three learning topologies
The first time we encounter mid-training, SFT, DPO, or RL, we focus on “what does the data look like?” Documents, dialogues, preference pairs, tool trajectories. That intuition is not wrong. It is just not deep enough.
What actually determines how the model updates its parameters is not just the content of the samples themselves. It is these more basic questions:
- At every position, what context did the model actually see?
- Which positions in the sequence really participated in the loss?
- Was the training sample given statically, or generated by the current policy itself?
Packing, Masking, and Rollout matter not because they are the “three hottest training tricks,” but because each of them controls one of the three most upstream variables in how gradients are formed:
- context topology
- supervision topology
- data-generation topology
If any one of those three is designed incorrectly, training may still converge, but the model is likely learning the wrong thing. Many training failures look on the surface like “the data quality is average” or “the loss is a bit unstable.” At the core, they are often cases where the gradient source is wrong.
This section answers a few key questions first:
- Why are Packing, Masking, and Rollout not on the same abstraction level, yet still worth explaining together?
- Why is it more accurate to say they do not merely “change a training detail,” but change where gradients come from?
- In CPT, SFT, DPO, and RL, which tokens actually receive gradient?
- Why can the model learn completely different things from the exact same text under a different training topology?
7.1 A unified framework: where do gradients actually come from
Let us put different training methods into one unified mathematical frame. Let the model parameters be \(\theta\), and let the training sequence be \(s = (s_1, s_2, \dots, s_T)\). Then many autoregressive training methods can be abstracted as:
\[ \mathcal{L}(\theta) = \mathbb{E}_{s \sim q} \left[ \sum_{t=1}^{T} w_t\, \ell_t(\theta; s_{\le t}) \right] \]
where:
- \(q\) denotes the distribution of training sequences;
- \(s_{\le t}\) is the context prefix visible at position \(t\);
- \(w_t\) is the loss weight at that position;
- \(\ell_t\) is the local loss corresponding to position \(t\).
Then the gradient is:
\[ g(\theta) = \nabla_\theta \mathcal{L}(\theta) = \nabla_\theta \mathbb{E}_{s \sim q} \left[ \sum_{t=1}^{T} w_t\, \ell_t(\theta; s_{\le t}) \right] \]
This formula matters a lot. In the end, parameter updates in training come from three things.
1. Sample distribution \(q\)
What kind of sequences are you actually sampling from?
- Static text from a corpus?
- Sequences mixed from domain documents and general replay?
- Or trajectories rolled out by the current policy itself?
2. Context prefix \(s_{\le t}\)
The same token produces a different gradient if it is predicted under a different context.
For a language model, the loss is usually written as a conditional probability:
\[ \ell_t(\theta; s_{\le t}) = -\log p_\theta(s_{t+1} \mid s_{\le t}) \]
In other words, the model is not learning “what an isolated token is.” It is learning “given this prefix, what should the next token be?” So if the prefix changes, the gradient changes.
3. Loss weight \(w_t\)
Not every token has to be optimized. Some positions can participate in the loss; others can serve only as context without being supervised. Once \(w_t\) changes, the source of the parameter update changes as well.
7.2 The three most important “gradient entry points”
Under the unified framework above, Packing, Masking, and Rollout each correspond to a different upstream control:
- Packing mainly changes \(s_{\le t}\), the context each position sees;
- Masking mainly changes \(w_t\), which positions actually enter the loss;
- Rollout mainly changes \(q\), who generated the training samples in the first place.
So the more accurate statement is not “these three methods change the gradient the most.” It is:
They control the three most fundamental variables in the gradient-formation path: context, supervision positions, and sample source.
That is why they deserve to be explained together.
A more intuitive example
It becomes easier to understand these three topologies if you think of training as a teacher grading homework.
Packing
It is like stapling several students’ assignments together and handing them in as one packet. The teacher assumes the pages are connected. So the local context of each question changes.
Masking
It is like telling the teacher: “Only grade the last page. Ignore the earlier draft.” So not all content participates in scoring.
Rollout
It is like no longer giving the teacher a fixed answer sheet at all. Instead, you let the student solve the problem live and then grade the final result. So who wrote the work, and how it is graded, both change.
The analogy is not precise, but it matches the core distinction well:
- Packing changes how the homework is arranged;
- Masking changes what gets graded;
- Rollout changes how the homework is generated.
7.3 Packing: changing context topology
What packing actually is
Packing is not a new optimization objective. It is a data-layout strategy. It appears most often in pretraining and continued pretraining (CPT): to reduce padding waste, you pack multiple shorter documents into one fixed-length context window.
Let the original document collection be:
\[ D^{(i)} = (x^{(i)}_1, x^{(i)}_2, \dots, x^{(i)}_{n_i}) \]
Then the packing operation can be written as:
\[ z = \mathrm{Pack}\left(D^{(i_1)}, D^{(i_2)}, \dots, D^{(i_k)}\right) = [D^{(i_1)}, \langle eos \rangle, D^{(i_2)}, \langle eos \rangle, \dots] \]
After packing, the training objective is usually still the standard next-token loss:
\[ \mathcal{L}_{\text{pack}}(\theta) = -\sum_{t=1}^{T-1} \log p_\theta(z_{t+1} \mid z_{\le t}, A) \]
where \(A\) is an optional attention mask used to control whether document boundaries may be crossed.
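A small sketch of packing plus a boundary-respecting causal mask. `EOS_ID` and the list-of-lists mask layout are illustrative simplifications of what a real attention implementation would take as a tensor:

```python
# Sketch: pack documents with <eos> separators, then build a causal attention
# mask A that never crosses document boundaries.
EOS_ID = 0  # placeholder id for <eos>

def pack(docs, eos_id=EOS_ID):
    z = []
    for doc in docs:
        z.extend(doc)
        z.append(eos_id)
    return z

def block_causal_mask(z, eos_id=EOS_ID):
    """allowed[t][s] is True only if s <= t and s, t are in the same document."""
    doc_id, current = [], 0
    for tok in z:
        doc_id.append(current)
        if tok == eos_id:
            current += 1  # next position starts a new document
    T = len(z)
    return [[s <= t and doc_id[s] == doc_id[t] for s in range(T)] for t in range(T)]

z = pack([[1, 2], [3]])
mask = block_causal_mask(z)
```

With this mask, the first token of document B is predicted without seeing document A's tail, which removes exactly the spurious cross-document statistics discussed below.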
Why it changes gradients
The key is not whether the loss formula changed. The key is that the context changed.
If the model was originally trained in single-document mode, the beginning of document B would only ever be predicted under the context “the start of document B.” But once you pack document A and document B together, the beginning of document B is now predicted under the context “after the tail of document A.”
In other words, Packing changes:
\[ s_{\le t} \]
It changes the conditional context seen by every token.
For a language model, that change is large, because the training objective was conditional to begin with:
\[ -\log p_\theta(x_t \mid x_{<t}) \]
For the same \(x_t\), once \(x_{<t}\) changes, the gradient is no longer the same gradient.
Example: two completely unrelated documents packed together
Suppose there are two short documents:
- Document A:
  The patient reports chest pain for two hours.
- Document B:
  This contract takes effect on the date of signing.
After packing, the sequence may become:
\[ [ \text{The patient reports chest pain for two hours.}, \langle eos \rangle, \text{This contract takes effect on the date of signing.}, \langle eos \rangle ] \]
From a throughput perspective, this is great: almost no padding is wasted.
But from the perspective of the learning signal, if boundaries are handled badly, the model sees a statistical relation that does not exist:
“The patient reports chest pain for two hours.” is naturally followed by “This contract takes effect on the date of signing.”
That is obviously not a structure you want the model to learn.
Engineering meaning
The gain from Packing is throughput. The cost is that the context topology gets rewritten. So it is not a “pure systems optimization trick.” It is a learning design choice that directly affects gradient formation.
Common failure modes
- boundary leakage
- inconsistent EOS separation
- incorrect attention-mask design
- length-distribution changes that alter loss statistics and training dynamics
Packing changes the prefix under which a token is predicted.
7.4 Masking: changing supervision topology
What masking actually is
If Packing answers “how are the samples arranged?”, Masking answers “which positions count?”
In supervised fine-tuning (SFT), we often concatenate a whole dialogue into one sequence, but we do not want every token to participate in the loss. In many cases we only want the model to learn the assistant’s output, not to “repeat the user input.”
Let the training sequence be:
\[ s = (s_1, s_2, \dots, s_T) \]
Define a loss mask:
\[ m_t \in \{0,1\} \]
where:
- \(m_t = 1\): position \(t\) participates in the loss;
- \(m_t = 0\): position \(t\) does not participate in the loss.
Then the training objective becomes:
\[ \mathcal{L}_{\text{mask}}(\theta) = -\sum_{t=1}^{T-1} m_{t+1}\log p_\theta(s_{t+1} \mid s_{\le t}) \]
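The masked objective is a one-liner once per-position log-probabilities exist. Everything below is a toy illustration of the formula, not a framework API; `log_probs[t]` stands in for \(\log p_\theta(s_{t+1} \mid s_{\le t})\):

```python
# Sketch of the masked NLL: only positions with m_t = 1 contribute loss;
# masked positions still exist in the sequence and still act as context.
def masked_nll(log_probs, mask):
    """Average -log p over supervised positions only."""
    assert len(log_probs) == len(mask)
    supervised = [-lp for lp, m in zip(log_probs, mask) if m == 1]
    return sum(supervised) / max(1, len(supervised))

# Assistant-only supervision: system/user tokens get m=0, assistant tokens m=1.
log_probs = [-2.0, -1.5, -0.5, -0.25]   # toy per-position log-likelihoods
mask      = [0,    0,    1,    1]       # only the last two positions count
loss = masked_nll(log_probs, mask)
```

Note the normalization choice: averaging over supervised positions (rather than all positions) keeps the loss scale comparable across samples with very different mask densities.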
Why it changes gradients
Masking changes the loss weight at each position, namely:
\[ w_t = m_t \]
If \(m_t = 0\) at some position, that position still exists in the sequence and still serves as context to the model, but it contributes no gradient at all.
That means the same piece of text imposes a different learning responsibility as soon as the mask changes.
Example: one SFT dialogue sample
<system> You are an enterprise customer-service assistant
<user> Help me write a refund email
<assistant> Of course. Here is a concise, professional refund email:
...
In a typical assistant-only SFT setup:
- `system` tokens: serve as context, but do not enter the loss;
- `user` tokens: serve as context, but do not enter the loss;
- `assistant` tokens: do enter the loss.
In other words, the model uses the system and user content to predict the assistant output, but it is not punished for “failing to repeat the user.”
Why this matters so much
Because it effectively defines:
Who the model is actually responsible for.
- In CPT, almost all tokens participate in next-token prediction;
- In typical SFT, only assistant tokens are supervised;
- In some tool-trajectory training, you may supervise only tool calls and the final answer;
- In structured-output tasks, you may even compute loss only on specific JSON fields.
So Masking is not changing a “sample-format detail.” It is changing the supervision boundary.
Common failure modes
- mask misalignment, so user tokens are supervised too;
- template-concatenation errors, so the first assistant span is not counted in the loss;
- in multi-turn dialogue, some responses are supervised while others are silently missed;
- tool-call tokens and natural-language tokens use inconsistent supervision policies.
Masking changes which positions’ errors actually enter the gradient.
7.5 Rollout: changing data-generation topology
What rollout actually is
The core difference between Rollout and the previous two is that the training sample no longer comes entirely from a static dataset. It is generated by the current policy model itself.
Let the input prompt be \(x\), let the policy model be \(\pi_\theta\), and let the model-generated trajectory be:
\[ y = (y_1, y_2, \dots, y_T) \sim \pi_\theta(\cdot \mid x) \]
Then an environment, verifier, or reward model assigns a return to the full trajectory:
\[ R(x, y) \]
The training objective is to maximize expected reward:
\[ J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}[R(x,y)] \]
A typical policy-gradient form is:
\[ \nabla_\theta J(\theta) \approx \mathbb{E}\left[ \sum_{t=1}^{T} A_t\, \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \right] \]
where \(A_t\) can be understood as an advantage or reward-related weighting term.
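A minimal surrogate-loss sketch of this estimator: in an autograd framework, differentiating the scalar below with respect to the policy parameters yields the policy gradient. The numbers are toy values, and the per-step log-probabilities are assumed to come from the current policy's own rollout:

```python
# Sketch of the REINFORCE-style surrogate: L = -(sum_t A_t * log pi(y_t | x, y_<t)).
def pg_surrogate_loss(log_probs, advantages):
    """log_probs: per-token log pi_theta on a sampled trajectory;
    advantages: per-token A_t (reward-derived weights)."""
    assert len(log_probs) == len(advantages)
    return -sum(a * lp for lp, a in zip(log_probs, advantages))

# One sampled trajectory, scored after completion: a terminal reward of 0.5
# broadcast to every step (the simplest credit-assignment choice).
log_probs  = [-0.7, -0.2, -1.1]
advantages = [0.5, 0.5, 0.5]
loss = pg_surrogate_loss(log_probs, advantages)
```

The structural point survives the simplification: the tokens receiving gradient here were sampled from \(\pi_\theta\) itself, so the training distribution moves whenever the parameters do.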
Why it changes gradients
In supervised learning, the training distribution \(q\) usually comes from a static corpus. In rollout, the training sequence is sampled from the current policy \(\pi_\theta\). In other words, the sample distribution itself depends on the parameters:
\[ q = q_\theta \]
That is fundamentally different from the previous two topologies:
- Packing changes the context;
- Masking changes the supervised positions;
- Rollout changes the mechanism that generates the training samples.
Example: database query and explanation
Suppose the task is:
User: Find the top 10 products by GMV last month from the database and explain the results.
Under SFT, you would give the model a demonstration answer and have it imitate it.
Under rollout, the model must complete the full trajectory on its own:
- write the SQL;
- call the tool to query the database;
- read the returned results;
- write the explanation;
- receive a score from a verifier or reward function.
The reward may come from:
- whether the SQL is executable;
- whether the query results are correct;
- whether the final explanation is consistent with the returned data;
- whether the right tool was called.
At that point there is no fixed “standard answer token sequence” to imitate word by word. The model must first generate the trajectory itself, and only then can the gradient flow back along that trajectory.
Why it has a larger effect
Because rollout turns training from “fit a fixed dataset” into “optimize on your own behavior.” That causes several downstream consequences:
- the model must explore;
- supervision often arrives only after the whole sequence ends;
- credit assignment gets harder;
- reward noise flows back through the whole trajectory;
- once the current policy changes, the samples it sees in the future change too.
Rollout changes who generates the sample, and how the reward flows back through the trajectory into the parameters.
7.6 How gradients are changed
Readers naturally ask: are learning rate, batch size, optimizer, temperature, and sampling ratio not also very important?
Of course they are. But those variables are more like knobs for:
- how large the gradient is;
- how fast the update is;
- how much noise there is;
- how stable training is.
Packing, Masking, and Rollout are more like the mechanisms that decide:
- which samples the gradients come from;
- under what context the gradients are formed;
- which positions the gradients are assigned to;
- whether the samples are statically given or generated by the policy itself.
That is the difference:
- learning rate tunes update intensity;
- topology defines the gradient support set and source mechanism.
So the point of emphasizing these three topologies is not that they are the only important things. It is that they sit at the very top of the gradient-formation pipeline.
| Dimension | Packing | Masking | Rollout |
|---|---|---|---|
| Core question | How samples are arranged into the context window | Which positions participate in the loss | Who generated the samples |
| Mathematical object mainly changed | \(s_{\le t}\) | \(w_t\) | \(q\) |
| Requires exploration? | No | No | Yes |
| Is supervision immediately visible? | Yes | Yes | Often delayed |
| Typical setting | Pretraining, CPT | SFT | RL, agent training |
| Topology rewritten | Context topology | Supervision topology | Data-generation topology |
Packing determines the prefix under which a token is predicted, Masking determines which token errors enter the gradient, and Rollout determines who generates the training samples.
Looking back at the common training stages, the picture becomes clearer:
| Method | Typical data shape | What directly participates in the gradient | What is mainly learned |
|---|---|---|---|
| CPT | Raw documents, code, long text | Almost all tokens | Parameterized knowledge, domain distribution, document structure |
| SFT | Dialogue demonstrations, tool trajectories | Usually assistant tokens | Behavior format, instruction following, tool-use templates |
| DPO / ORPO | chosen / rejected preference pairs | Relative preference difference between two responses | Answer-quality tendencies, preference ranking |
| RL | prompt + rollout + reward | Tokens on policy-sampled trajectories | Search, exploration, long-horizon strategy improvement |
What matters behind this table is not the terminology. It is this:
The difference between training stages is not just “what the data looks like,” but where the gradients actually come from.
When training results do not match expectations, do not only ask “was the data correct?” Keep asking further: what context did the model see at those positions? Which positions actually participated in the loss? Did the samples come from a static corpus or from the current policy’s own behavior? Many problems that look like “the model learned bad habits” are not about a weak optimizer or an insufficiently advanced loss. They come from at least one wrong answer among Packing, Masking, and Rollout.
8 How to fit the whole thing into a training system: parallelism, restore, version binding, and auditability
This section answers a few key questions first:
- System-wise, how is CPT actually resumed from a pretrained checkpoint?
- In mid-training, what do data parallel, tensor / model parallel, and sequence parallel each solve?
- Why must tokenizer version, data version, and restore mode be managed together with the checkpoint?
- What kind of pipeline is actually rollback-capable, reproducible, and auditable?
Mid-training is a “reconfigured continuation of training,” not “take the old script, swap the data, and keep running.”
System implementation determines whether experiments can be reproduced, checkpoints can be rolled back, regressions can be localized, and online behavior can be traced. Many teams think they are discussing the model. What really separates runs in the end is restore mode, data version, the parallel stack, and logging discipline.
8.1 How to restore, and what to restore
CPT almost always starts from an existing pretrained checkpoint, but “restoring a checkpoint” has two meanings:
- Restore model weights only: the most common case when you continue training from an open-source model. The optimizer state is usually unavailable, so you must reinitialize the optimizer and scheduler;
- Restore full training state: if you continue training within your own pretraining pipeline, you can restore optimizer, scheduler, and RNG state too, which makes CPT more like “resume after interruption.”
Write the training state more formally and a restore becomes a choice over this state set:
\[ \mathcal{S} = \{\theta, \phi_{opt}, \psi_{sched}, \xi_{rng}, \tau_{tok}, \nu_{data}\} \]
where:
- \(\theta\): model weights;
- \(\phi_{opt}\): optimizer state;
- \(\psi_{sched}\): learning-rate scheduler state;
- \(\xi_{rng}\): random-number-generator state;
- \(\tau_{tok}\): tokenizer version;
- \(\nu_{data}\): data version and sampling configuration.
Then the two common restore modes can be written as:
\[ \mathcal{R}_{weights} = \{\theta\} \qquad \mathcal{R}_{full} = \{\theta, \phi_{opt}, \psi_{sched}, \xi_{rng}, \tau_{tok}, \nu_{data}\} \]
These two modes differ in engineering meaning. If you restore weights only, you must be much more careful with learning rate and warm-up, because the optimizer has almost no historical sense of the current parameters. If you restore full state, the run is more continuous, but you also need to confirm that the new distribution is actually compatible with the old optimizer state.
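The two restore modes above can be sketched as a selection over the state set \(\mathcal{S}\). This is a minimal illustration, not tied to any specific framework; the key names mirror the symbols in the formula, and everything outside the chosen mode is deliberately left for the caller to reinitialize.

```python
# Sketch of the two restore modes over the training state set S.
# Key names are illustrative and mirror the symbols above.

WEIGHTS_ONLY = {"theta"}                                   # R_weights
FULL_STATE = {"theta", "phi_opt", "psi_sched",
              "xi_rng", "tau_tok", "nu_data"}              # R_full

def restore(checkpoint: dict, mode: set) -> dict:
    """Return the restored state; anything outside `mode` must be reinitialized."""
    state = {}
    for key in FULL_STATE:
        if key in mode and key in checkpoint:
            state[key] = checkpoint[key]   # carried over from the checkpoint
        else:
            state[key] = None              # caller must reinitialize (fresh optimizer, etc.)
    return state
```

With `WEIGHTS_ONLY`, the returned state has `phi_opt` set to `None`, which makes the engineering consequence explicit: the optimizer and scheduler must be rebuilt from scratch, including a fresh warm-up.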
8.2 Operational differences between weights-only and full-state
This distinction is often underestimated in real projects. A weights-only resume and a full-state resume are not the same recipe.
- weights-only is closer to “take a finished model and start a new experiment on top of it”;
- full-state is closer to “keep running on the original training track.”
So when you restore only weights, engineering practice usually requires:
- a lower initial LR;
- a fresh warm-up;
- denser early evals;
- treating it as a new run, not “the second half of the old run.”
When you restore full state, what you need to watch more closely is:
- whether the momentum in the old optimizer still fits the new distribution;
- whether the scheduler will drive LR into a regime unsuitable for CPT;
- whether data version and tokenizer are kept strictly consistent.
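The "fresh warm-up plus conservative LR" requirement for a weights-only resume can be made concrete as a schedule. This is a minimal sketch; the peak LR, warm-up length, and floor here are placeholder values, not recommendations for any particular model.

```python
import math

def cpt_lr(step: int, total_steps: int, peak_lr: float = 1e-5,
           warmup_steps: int = 500, min_lr: float = 1e-6) -> float:
    """Conservative schedule for a weights-only resume:
    fresh linear warm-up, then cosine decay to a floor."""
    if step < warmup_steps:
        # linear warm-up from ~0 to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # cosine decay from peak_lr down to min_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

A full-state resume would typically skip the fresh warm-up and instead continue the restored scheduler, which is exactly why the two modes must not share one recipe.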
8.3 How to parallelize
For large models, CPT usually reuses the distributed-training stack from pretraining, only with smaller budgets, shorter runs, and denser evaluation. The common roles are:
- Data Parallel / FSDP / ZeRO: split sample batches across devices and shard parameters, gradients, or optimizer state;
- Tensor Parallel: split layer tensors across devices, suitable for very large models;
- Pipeline Parallel: split layers across devices, suitable for very deep models;
- Sequence Parallel: distribute pressure along the sequence dimension under long context or large batches;
- Activation Checkpointing: trade more recomputation for less memory.
For the details, see Chapter 2.
In mid-training, more is not automatically better. The right choice is jointly determined by model size, context length, and cluster topology. A common CPT engineering judgment is: because the runs are short and evals are dense, keep system complexity as low as possible, and do not add too many parallelism layers just to chase absolute throughput.
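That judgment, "add parallelism layers only when memory forces you to," can be sketched as a toy planning heuristic. The memory constant and thresholds below are rough illustrative assumptions (on the order of 16 GB of weights, gradients, and Adam state per billion parameters in mixed precision), not a sizing formula.

```python
def plan_parallelism(params_b: float, gpus: int, mem_per_gpu_gb: float = 80.0) -> dict:
    """Toy heuristic for a CPT parallelism plan. Thresholds are illustrative:
    assumes ~16 GB per billion params for weights + grads + optimizer state."""
    state_gb = params_b * 16
    plan = {"data_parallel": True, "fsdp": False, "tensor_parallel": 1}
    if state_gb > mem_per_gpu_gb:
        plan["fsdp"] = True                 # shard params/grads/optimizer state
    if state_gb / gpus > mem_per_gpu_gb:    # sharding across all GPUs still not enough
        plan["tensor_parallel"] = 2         # only then add an extra parallelism layer
    return plan
```

The point of the sketch is the ordering: plain data parallel first, sharding second, tensor parallel only when the sharded state still does not fit, which keeps system complexity as low as the short, eval-dense CPT runs allow.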
8.4 How to bind versions and guarantee reproducibility, auditability, and rollback
- Data version and tokenizer version must be bound to the checkpoint. Otherwise, when the model regresses, you cannot tell whether the data changed or the code changed.
- Checkpointing should be more frequent than in pretraining. CPT is not trying to optimize for long-horizon convergence. It is trying to find a point with a better tradeoff between domain gain and general regression.
- Evaluation must be synchronized with checkpoints. Without evals for adjacent checkpoints, U-shaped curves are hard to interpret.
- Keep one immutable base checkpoint. That way you can always return to a clean starting point instead of getting lost among multiple “half-adapted” branches.
- The I/O pipeline has to keep up. Many mid-training bottlenecks are not compute at all, but long-document reads, deduplication, packing, and data loading themselves.
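The binding rule in the first bullet can be enforced mechanically: store the identifiers together and derive a run ID from them, so a checkpoint can never silently drift away from its tokenizer or data version. The field names below are illustrative, not a schema from any particular tool.

```python
import hashlib
import json

def run_manifest(base_ckpt: str, tokenizer_version: str, data_version: str,
                 restore_mode: str, mixture: dict) -> dict:
    """Bind the identifiers that must travel with every CPT checkpoint.
    Field names are illustrative; the point is that they are stored together."""
    manifest = {
        "base_checkpoint": base_ckpt,
        "tokenizer_version": tokenizer_version,
        "data_version": data_version,
        "restore_mode": restore_mode,      # "weights_only" or "full_state"
        "mixture": mixture,                # e.g. {"domain": 0.7, "replay": 0.3}
    }
    # deterministic fingerprint: any change in any bound field changes the run_id
    blob = json.dumps(manifest, sort_keys=True).encode()
    manifest["run_id"] = hashlib.sha256(blob).hexdigest()[:12]
    return manifest
```

When a regression appears, comparing two manifests immediately tells you whether the data version, the tokenizer, or the restore mode changed between runs.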
9 CPT and compatibility with subsequent SFT / DPO / RL
This section answers a few key questions first:
- Why does mid-training affect later instruction tuning, RLHF, DPO, and Agentic RL instead of being just an isolated preprocessing step?
- What kind of CPT helps later alignment, and what kind of CPT becomes a drag?
- Why does over-specialization hurt later alignment training?
- Given a real product requirement, how do you choose among prompting → RAG → SFT → DPO → RL → CPT → distillation?
- Why does “first build a model that understands the domain better” still not mean “the product is already aligned”?
CPT changes the model’s starting distribution. Later alignment changes behavior on top of that starting point. If mid-training brings the model closer to the target-task distribution, SFT and RL often become easier. But if CPT pulls the model too narrow, too templated, and too far from general dialogue distributions, later alignment becomes harder instead.
9.1 Why CPT affects later SFT / RL
Post-training methods do not create ability out of thin air. They are more like reweighting or behavior shaping on top of abilities that already exist. If a model already has some prior over the target distribution (long context, code traces, tool text, or mathematical notation), then:
- SFT can shape those abilities into stable behavior more easily;
- preference optimization can more easily learn quality rankings among candidate answers that are all at least passable;
- RL / Agentic RL can get more useful reward signal from rollouts because the candidate solutions themselves are already more plausible.
Conversely, if the base model has almost no prior over the task distribution, asking RL to learn it directly often runs into sparse rewards, high exploration cost, and rollouts that are too poor to be useful. In other words, CPT can improve the learnability of later RL, but it cannot replace RL itself.
9.2 The two most common failure paths
How, exactly, can CPT make downstream alignment harder?
Failure path 1: over-specialization → dialogue degradation → rising SFT debt
If the domain distribution is too narrow and replay is too low, the model starts treating everything as a legal, medical, or code problem. The consequence is not just one benchmark falling. The whole chat distribution begins to degrade:
- ordinary user questions get answered in overly technical jargon;
- the output style looks like document continuation instead of assistant replies;
- structured output and tool templates get washed out by the style of domain documents.
From a systems perspective, that means a problem that should have required only “lightweight SFT” now becomes “first use SFT to pull the model back into a chat-capable state, then do the real product-behavior supervision.” That is SFT debt.
Failure path 2: representation drift → brittle rollout → unstable RL
Even when surface metrics do not look too bad, CPT can still make later RL more fragile through representation drift. Intuitively, if the model is forced by highly templated CPT data into a narrow subspace, RL rollouts run into several problems:
- intermediate reasoning steps become more rigid, shrinking the exploration space;
- there are fewer high-reward trajectories that a verifier can approve, so reward becomes sparser;
- small parameter updates are more likely to trigger style collapse or tool-use volatility.
If you want to write this “too far away” risk as a simplified quantity, you can measure it on a fixed chat distribution \(D_{chat}\):
\[ \mathrm{Drift}_{chat}(\theta_{cpt}) = \mathbb{E}_{x \sim D_{chat}} \big[ \mathrm{KL}(p_{\theta_{base}}(\cdot\mid x)\,\|\,p_{\theta_{cpt}}(\cdot\mid x)) \big] \]
It does not need to be the only metric. But it helps explain a principle: no matter how much the domain score rises, if it comes with large drift on the chat and alignment distributions, the downstream cost may end up higher instead.
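The drift quantity above can be estimated directly: for each probe position drawn from \(D_{chat}\), compare the base and CPT models' next-token distributions and average the KL. This is a minimal sketch over plain probability lists; in practice you would compute it from the two models' logits on a fixed probe set.

```python
import math

def token_kl(p_base, p_cpt, eps=1e-12):
    """KL(p_base || p_cpt) for one next-token distribution (lists of probabilities)."""
    return sum(p * math.log((p + eps) / (q + eps)) for p, q in zip(p_base, p_cpt))

def drift_chat(base_dists, cpt_dists):
    """Average per-position KL over a fixed chat probe set D_chat.
    Each element is a next-token distribution at one probe position."""
    kls = [token_kl(pb, pc) for pb, pc in zip(base_dists, cpt_dists)]
    return sum(kls) / len(kls)
```

Tracked across checkpoints alongside the domain metric, this gives the two axes of the tradeoff: domain gain on one, chat-distribution drift on the other.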
9.3 A practical decision order
Given a product requirement, a safer decision order is usually:
- Try prompting / system design first: if the ability is already there and simply is not triggered reliably, do not rush into training;
- When knowledge changes quickly, prefer RAG: facts kept in external memory are more maintainable than facts written into parameters;
- When behavior format is unstable, do SFT: output schema, role style, and tool templates belong here;
- When quality ranking is unstable, do DPO / RLHF;
- When you need search, long-horizon strategy, and verifier-driven optimization, do RL;
- Do CPT only when the real gap is domain distribution and parameterized knowledge;
- When quality is good enough but cost is too high, do distillation or compression last.
This order looks like a flowchart, but beneath it is really one cost principle: move the cheapest lever that is closest to the actual gap first.
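The decision order can be written down as a lookup, which makes the cost principle auditable: the gap label, not habit or hype, picks the lever. The gap labels below are hypothetical names for the diagnoses in the list above.

```python
# Hypothetical gap labels; the mapping encodes "the cheapest lever
# closest to the actual gap moves first".
LEVER_FOR_GAP = {
    "ability_not_triggered": "prompting / system design",
    "fast_changing_knowledge": "RAG",
    "unstable_output_format": "SFT",
    "unstable_quality_ranking": "DPO / RLHF",
    "long_horizon_search": "RL",
    "missing_domain_prior": "CPT",
    "quality_ok_cost_too_high": "distillation / compression",
}

def choose_lever(gap: str) -> str:
    """Map a diagnosed gap type to the cheapest matching lever."""
    return LEVER_FOR_GAP.get(gap, "diagnose further before training")
```

Note that an unrecognized gap falls through to "diagnose further," which is itself part of the principle: if you cannot name the gap, you are not ready to train.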
Seeing the running case through compatibility
For the internal code assistant, CPT may make the model more familiar with the repository, but it will not automatically teach it to “retrieve first, then explain, and finally output in the team’s patch template.” Those are still SFT’s job. On the other hand, if CPT makes the model start answering ordinary user questions in internal-runbook style, the cost of later alignment rises immediately.
10 Engineering cases
This section answers a few key questions first:
- When in-domain task scores rise and general capability falls, what are the most typical root causes?
- Why does tokenizer expansion often reveal itself only when offline metrics look fine but online behavior becomes odd?
- How can an aggressive learning rate turn a once-manageable CPT into a divergent run?
- Why does “more domain data” not always mean “the overall system got better”?
- Which failures look like small bugs at first but end up flowing directly into the gradient?
These are not abstract concepts. They are very common engineering stories in real mid-training projects. Each story uses the same structure: symptom → root cause → fix.
Case 1: legal tasks improved, but general QA collapsed
Symptom: legal retrieval and contract summarization improved clearly, but open-domain QA, MMLU, and general chat quality all dropped at the same time.
Root cause: the recipe was made almost entirely of legal documents, with replay set far too low, so the model was pulled off center by a narrow distribution in a short time. The team watched only the domain benchmark, and the general regression was discovered only after product canarying.
Fix: increase the proportion of general replay, lower the learning rate, put general and safety evals into regression gates, and change checkpoint selection from “highest domain score” to the Pareto-optimal point for “domain gain vs. general regression.”
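The last part of that fix, choosing the Pareto-optimal point instead of the highest domain score, is easy to implement once each checkpoint is tagged with both metrics. A minimal sketch, assuming each checkpoint carries a domain gain (higher is better) and a general regression (lower is better):

```python
def pareto_checkpoints(ckpts):
    """Keep checkpoints not dominated on (domain_gain up, general_regression down).
    `ckpts` is a list of (name, domain_gain, general_regression) tuples."""
    frontier = []
    for name, gain, reg in ckpts:
        dominated = any(
            g2 >= gain and r2 <= reg and (g2 > gain or r2 < reg)
            for _, g2, r2 in ckpts
        )
        if not dominated:
            frontier.append(name)
    return frontier
```

The team in this case would then pick among the frontier checkpoints using the regression gates, rather than letting the single domain benchmark decide.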
Case 2: after vocabulary expansion, offline loss looked fine, but online code generation got worse
Symptom: a batch of new code API tokens was added. Training loss looked normal, but online generation frequently produced strange half-identifiers and unstable completions.
Root cause: the new tokens entered the vocabulary, but related samples appeared too infrequently, so the embedding rows remained undertrained. At the same time, part of the offline evaluation still used the old tokenizer, creating inconsistency across training, evaluation, and deployment.
Fix: unify tokenizer versions, oversample samples containing the new tokens in stages, add a dedicated regression set containing API names and paths, and clear every old-tokenizer cache on the serving side.
Case 3: after medical CPT, the model understood terminology better but stopped talking to users naturally
Symptom: the model improved significantly on clinical terminology, entity recognition, and long-document medical summarization, but the multi-turn dialogue experience got worse. The answer style looked more like document continuation than assistant replies.
Root cause: the team treated CPT as the productization step and forgot that mid-training learns distributions, not interaction behavior. The model was pushed toward the style of medical documents, then never received enough SFT to pull it back into the chat distribution.
Fix: after CPT, add high-quality medical QA and summarization SFT, and evaluate dialogue behavior, refusal boundaries, and structured-output quality separately.
Case 4: training looked normal for the first few hundred million tokens, then suddenly diverged
Symptom: loss fell steadily at first, then at some later stage loss, gradient norm, and general evals all deteriorated together, and every later checkpoint became worse.
Root cause: the learning rate was too high, and one data subset contained broken concatenation and repeated templates, so the model took a large optimization hit on a small number of anomalous batches. Because checkpoint intervals were too wide, the team could not localize the turning point in time.
Fix: shorten checkpoint and eval intervals, lower the peak LR, add gradient clipping, reclean the anomalous subset, and add length, character-set, and template checks in the data-loading stage.
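The loader-side checks in that fix (length, character set, template repetition) can be cheap enough to run on every sample. A minimal sketch; the thresholds are illustrative assumptions and should be tuned per corpus.

```python
def sample_ok(text: str, max_len: int = 32768,
              max_repeat_ratio: float = 0.3) -> bool:
    """Cheap loader-side sanity checks: length, character set, repetition.
    Thresholds are illustrative and should be tuned per corpus."""
    if not text or len(text) > max_len:
        return False
    # character-set check: replacement/control chars suggest broken decoding
    bad = sum(1 for c in text
              if c == "\ufffd" or (ord(c) < 32 and c not in "\n\t"))
    if bad / len(text) > 0.01:
        return False
    # repetition check: one line dominating the sample suggests template spam
    lines = [l for l in text.splitlines() if l.strip()]
    if len(lines) > 10:
        top = max(lines.count(l) for l in set(lines))
        if top / len(lines) > max_repeat_ratio:
            return False
    return True
```

Filtering at load time means a broken subset degrades throughput slightly instead of flowing straight into the gradient, which is exactly the failure mode in this case.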
Case 5: domain tasks improved, but later RL became harder rather than easier
Symptom: after mid-training, loss on code and math corpora fell, but later verifier-based RL improved less than expected, and the rollout style also became more rigid.
Root cause: the CPT data was highly templated. The model learned strong local formatting preferences, but not a sufficiently diverse space of intermediate solutions, which narrowed the later exploration space.
Fix: increase data diversity, mix in more natural reasoning and tool text, keep enough replay during CPT, and before RL, use SFT first to restore more stable interaction behavior and intermediate-step formats.
Case 6: a packing bug made citations from one document “bleed” into another
Symptom: when summarizing, the legal model started carrying citation fragments from the previous contract into the next contract, as if it had “cross-document memory.”
Root cause: document packing was implemented as simple concatenation without sufficiently explicit EOS or attention constraints at boundaries, so the model treated cross-document continuation as a real statistical pattern.
Fix: add explicit boundaries, correct the attention mask, rebuild a cross-document-contamination regression set, and roll back to an uncontaminated checkpoint.
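The fix comes down to two pieces: an explicit EOS at every document boundary, and an attention rule that never lets a query look across that boundary. A minimal sketch of both, with `eos_id=0` as a placeholder token id:

```python
def pack_with_boundaries(docs, eos_id=0):
    """Concatenate token lists with an explicit EOS between documents and
    return a document-id vector for building a block-diagonal attention mask."""
    tokens, doc_ids = [], []
    for i, doc in enumerate(docs):
        tokens.extend(doc + [eos_id])       # explicit boundary after every doc
        doc_ids.extend([i] * (len(doc) + 1))
    return tokens, doc_ids

def allowed(q: int, k: int, doc_ids) -> bool:
    """Causal attention restricted to the same document: no cross-document bleed."""
    return k <= q and doc_ids[k] == doc_ids[q]
```

With simple concatenation, position `q` in document B could attend to document A's citations; under `allowed`, that key is masked out, so cross-document continuation never becomes a statistical pattern the model can learn.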
Case 7: same base weights, but external reproduction looked nothing like the internal run
Symptom: the team took the same base model and reran CPT, but the stability and convergence point differed sharply from the original internal run. They began to suspect “maybe the data is wrong.”
Root cause: the internal run restored the optimizer / scheduler / RNG full state, while the external reproduction restored only model weights. In reality, the two sides were not running the same recipe at all.
Fix: record restore mode explicitly, treat weights-only resume as a new experiment with retuned LR / warm-up, and version tokenizer, data mix, and optimizer state together in experiment metadata.
Case 8: the domain adaptation worked well, but refusal boundaries quietly weakened
Symptom: the domain model became more confident on professional questions, but also became more willing to answer questions it should not answer. Safety degradation was only discovered later in the process.
Root cause: replay was too low, so the general safety foundation was weakened. At the same time, the regression gates did not include a separate safety slice, and the team only monitored domain tasks and general QA.
Fix: raise replay, add safety evaluation as a hard gate, and reserve post-CPT alignment work instead of assuming that “better domain capability will automatically produce better behavior.”
11 Chapter summary
CPT should be treated as a constrained engineering change. The following principles deserve to be used as a working checklist:
- Decide first whether this is a knowledge gap, then decide whether CPT is warranted. Do not use continued pretraining to solve problems that should be solved by SFT, RAG, or prompting.
- Data distribution matters more than raw data volume. No matter how much domain data you have, if it is full of templates, duplicates, and dirty samples, all you do is amplify errors.
- Default to general replay. Only in a small number of cases where you are truly confident should you try an almost pure-domain recipe.
- Use a more conservative learning rate than in pretraining. In an already-formed representation space, violent updates usually do more harm than good.
- Packing is a throughput optimization, not a free lunch. Handle document boundaries badly and you will write nonexistent patterns into the parameters.
- Be restrained about tokenizer expansion. Only when fragmentation of high-frequency terms has become an obvious quality or cost bottleneck is it worth touching.
- Checkpoint, data version, and tokenizer version must be managed together. Otherwise regressions cannot be localized.
- Evaluation must look in both directions. Measure not only domain gain, but also general capability, safety, behavior, and downstream alignment readiness.
- CPT is not the end of productization. Many systems still need SFT, preference optimization, and inference-layer control after CPT.
- If a cheaper method works, do not start with training. For engineering teams, the lowest-cost effective solution always comes first.
- Treat restore mode as an experimental variable. weights-only and full-state resume require different LR / warm-up settings and different expectations for reproducibility.
- Give every run an explicit stop policy. Do not mistake “maybe it will recover if we wait a little longer” for a training strategy.
11.1 Question summary
- What is continued pretraining (CPT), and how is it different from pretraining and SFT?
CPT continues training on top of a pretrained base model with the same next-token objective, but the training distribution is replaced with data closer to the target domain. The main difference from pretraining is not the loss function. It is the data distribution, the training budget, and the stability requirements. The difference from SFT is that CPT learns parameterized knowledge and domain priors, while SFT learns behavior mappings and output format.
- When should you use CPT instead of prompt engineering, RAG, or SFT?
Use CPT when the problem is that terminology, entities, long-document structure, and the domain distribution itself are not inside the model. Use prompting first when the ability is already latent but not being triggered reliably. Use RAG when knowledge changes quickly and must stay traceable. Use SFT when the model knows the facts but cannot output them in the required way.
- The model’s perplexity is high on legal documents, but its general performance is good. Which layer should you fix first?
Prioritize CPT, because that means the model has not modeled the legal text distribution well. SFT can change the way it answers, but it usually cannot compensate for missing priors over term co-occurrence, citation format, and long-document structure.
- Why does "same objective, different distribution" create such a large practical difference?
Because the model learns conditional probability distributions. The objective stays the same, but when the sample distribution moves from general web text to regulations, clinical notes, or code repositories, the model re-estimates which tokens, structures, and entities are more common, thereby changing its parameterized knowledge and language priors.
- Why is general replay important?
Replay is the lowest-cost way to protect general capability during domain adaptation. It reduces catastrophic forgetting, softens the optimization shock from the new distribution, and preserves the general foundation that later SFT and alignment depend on.
- How should you design CPT’s regression suite and stop criteria?
At minimum, include four kinds of evaluation: domain, general, behavior, and safety. Stop criteria should not ask only “did the domain metric improve?” They should also include general / safety regression thresholds, a stagnation window for domain gain, and whether the gain per token is still worth the spend.
- What is the stability gap in CPT?
The stability gap is the phenomenon where, after the model starts training on a new distribution, domain metrics rise while general capability first drops noticeably. It is usually caused by too high a learning rate, too narrow a distribution, insufficient replay, and dirty data acting together. The right response is a more conservative LR, a steadier mixture, denser evaluation, and a clear rollback strategy.
- When should you expand the tokenizer?
Only when the target domain contains many high-frequency, high-value tokens that are severely fragmented—drug names, legal citations, code identifiers, or new-language scripts—and that fragmentation has clearly started to hurt context cost and quality. Otherwise, prefer learning the representation directly under the old tokenization through CPT.
- How should the embedding of a new token be initialized more safely?
A common stable choice is to initialize it from the mean or composition of the old sub-token embeddings that make up that token, instead of pure random initialization. That lets the new token start training from a point closer to its original semantic neighborhood and reduces early noise and instability.
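That initialization can be sketched in a few lines: average the embedding rows of the sub-tokens the new token replaces. This is an illustration over plain lists; a real implementation would operate on the model's embedding matrix after resizing it.

```python
def init_new_token_embedding(sub_token_ids, embedding_matrix):
    """Initialize a new token's embedding as the mean of the embeddings of the
    old sub-tokens it replaces, instead of drawing it at random."""
    dim = len(embedding_matrix[0])
    vec = [0.0] * dim
    for tid in sub_token_ids:
        row = embedding_matrix[tid]
        for j in range(dim):
            vec[j] += row[j] / len(sub_token_ids)
    return vec
```

For example, if a new token `internal_api_call` previously tokenized into three sub-tokens, its embedding starts at the centroid of those three rows, so early gradients refine a sensible point rather than dragging a random vector across the embedding space.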
- What is the operational difference between weights-only resume and full-state resume?
weights-only is closer to starting a new experiment on top of existing parameters, and usually needs a fresh warm-up and a more conservative LR. full-state is closer to continuing on the original training track, but depends more heavily on the compatibility between the old optimizer / scheduler state and the new distribution. The two should not be treated as the same recipe.
- What is the core difference among CPT, SFT, DPO, and RL?
The core difference is how the data is constructed and where the gradient comes from. CPT applies next-token learning to almost all tokens. SFT mainly supervises assistant tokens. DPO compares the relative preference between chosen and rejected answers. RL optimizes the expected reward of rollouts sampled from the policy.
- Why does mid-training affect downstream alignment and RL?
Because CPT changes the model’s initialization distribution. A model closer to the target domain makes later SFT, DPO, and RL easier to learn. But if CPT over-specializes and damages general dialogue or safety boundaries, downstream alignment becomes more expensive instead.
- How do you debug a model whose domain perplexity improved, but tool use and structured output got worse?
First determine whether the issue is behavior regression rather than knowledge regression: check whether CPT washed out the SFT templates, whether behavior evals are missing, and whether replay is too low. Then check tokenizer / schema compatibility, packing contamination, and whether CPT should be followed by additional SFT on tool trajectories.
- Given a product requirement, how do you decide among prompting → RAG → SFT → DPO → RL → CPT → distill?
I would first define the target behavior and constraints, then diagnose the gap type. Do not train if you do not have to. When knowledge needs external updating, prefer RAG. When the problem is behavior format, prefer SFT. When the problem is preference ranking, use DPO / RLHF. When the problem is long-horizon search and verifier-driven optimization, use RL. Do CPT only when the model truly lacks domain knowledge and distribution priors. Finally, if quality is good enough but cost is too high, do distillation and compression.
The hard part of mid-training has never been the phrase “keep training.” It is how to make continued training still controlled, rollback-capable, interpretable, and compatible with downstream systems. From this chapter’s point of view, a mature AI engineer should be able to make at least four judgments:
- when the problem is a knowledge gap rather than a behavior gap;
- how to do controlled domain adaptation through replay, mixture, packing, and gates;
- why tokenizer, checkpoint, and training topology affect real stability;
- why the value of CPT is not only in itself, but in how it leaves a better starting point for later SFT, DPO, RL, and deployment.
Once you can answer “how do I adapt a model to a new domain without breaking what it already knows?” with those judgments, you have actually understood CPT as an engineering system.
In the next chapter, we show how to turn a base model that has already finished mid-training into an assistant.