2. Pretraining

From data systems and data mixture to large-scale training.

1 Overview

Figure 1: Data funnel—from raw corpora to trainable tokens
Important

This chapter is not about whether a model can answer questions. It is about a lower-level question:

How, exactly, are the parameters of a large language model trained?

More precisely, this chapter answers five engineering questions:

  1. What data will you let the model see, and how do you ensure that the data is worth seeing?
  2. How do you choose the proportions of different data sources so the model learns the capability structure you actually want?
  3. Under a fixed compute budget, how should you balance parameter count, tokens, context length, and precision?
  4. How do you keep a training run that lasts weeks or months stable, non-divergent, monitorable, and recoverable?
  5. How do you know the model is actually getting stronger, rather than merely pushing loss down on some validation set?

Imagine a very real on-call scenario: the team launches a pretraining job scheduled to run for 18 days. The target is to train a dense decoder-only model with \(N=7\text{B}\) on \(D=10^{12}\) tokens using \(512\) H100s. Using the rough estimate \(C\approx 6ND\), the theoretical training compute is about \(4.2\times 10^{22}\) FLOPs. On day 7, training loss suddenly spikes to \(2.8\times\) its previous level. On day 14, validation perplexity keeps falling, but an internal tool-use probe starts regressing. At that point, the real question is no longer “What is next-token prediction?” It is: do you actually understand the system you are training?

Note
Running example throughout the chapter: a hypothetical 7B × 1T training project

To keep the discussion pinned to one concrete engineering object, this chapter repeatedly refers to a hypothetical case:

  • Model: 7B dense decoder-only Transformer
  • Data: \(1\)T nominal tokens; overall mixture is web and general documents \(68\%\), books \(12\%\), code \(10\%\), academic text \(6\%\), dialogue and forums \(4\%\)
  • Hardware: \(512\times\) H100, BF16 training, target MFU \(\eta\approx0.38\)
  • Training plan: effective global batch of about \(8\)M tokens/step, or roughly \(1.2\times10^5\) steps
  • Run events: one benchmark contamination scare, one loss spike, and one investigation into checkpoint resume semantic mismatch

This example does not model any specific company. It compresses the constraints nearly every large-model team runs into: the data pool, the budget, the distributed topology, the training dashboard, and the model’s downstream capabilities.

Tip
Questions worth revisiting throughout this chapter
  1. What is the difference between nominal token budget and effective token budget?
  2. How would you prove that a resumed run is semantically equivalent to an uninterrupted run?
  3. If GPU utilization is high but tokens/s is poor, what do you check first?

In Chapter 1, we treated an LLM as a next-token predictor: given a prefix, predict the next token. That definition explains what the model does at inference time. Pretraining explains something else: why such a predictor gradually acquires language ability, knowledge compression, transferability, and eventually the possibility of being shaped into an “assistant.”

Modern decoder-only LLM pretraining still revolves around one objective that is both extremely simple and extremely powerful: autoregressive language modeling. Let the token sequence be \(x_1,\dots,x_T\), and the model parameters be \(\theta\). The training objective is to minimize the negative log-likelihood of the true next token:

\[ \mathcal{L}_{\text{CLM}} = -\sum_{t=1}^{T}\log p_\theta(x_t \mid x_{<t}) \]

This is why pretraining is usually called self-supervised learning: the supervision signal is not manually labeled. It comes from the text itself. During training, with teacher forcing, the model sees the true prefix \(x_{<t}\) at position \(t\). At inference time, it sees the prefix it just generated. So pretraining optimizes conditional distribution fitting, not the product goal of “answering like a helpful assistant.” GPT-3 pushed this large-scale autoregressive route to a new scale and made “train a strong base model first, then shape it afterward” the dominant paradigm (Brown et al., 2020).
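
To make the objective concrete, here is a minimal PyTorch sketch of the shifted next-token loss under teacher forcing. The random logits stand in for a real model's output; the vocabulary size and sequence length are arbitrary placeholders.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: a batch of token ids and "model" logits.
vocab_size, T = 100, 8
tokens = torch.randint(0, vocab_size, (1, T))   # x_1 ... x_T
logits = torch.randn(1, T, vocab_size)          # one distribution per position

# Teacher forcing: the logits at position t predict token t+1,
# so inputs and targets are shifted by one.
pred = logits[:, :-1, :]                        # predicts x_2 ... x_T
target = tokens[:, 1:]                          # true next tokens

# L_CLM = -sum_t log p_theta(x_t | x_<t), here averaged per token.
loss = F.cross_entropy(pred.reshape(-1, vocab_size), target.reshape(-1))
print(loss.item())                              # ~log(vocab_size) for random logits
```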

One more point matters just as much: the output of pretraining is usually a base model, not an assistant.
Pretraining teaches the model what tends to appear in the world. Mid-training pushes the distribution toward specific domains. Post-training makes the model more usable, controllable, and consistent in interaction. Capability boundaries are mostly set during pretraining. The behavior interface is mostly shaped later.

From an engineering perspective, pretraining is not merely “feeding lots of text into a model.” A more accurate description is that it is a long pipeline from raw corpora to a base model, and every stage in that pipeline changes the result.

If you restate the pipeline above as a more operator-centric schematic, you get Figure 1.

This chapter therefore no longer follows the sequence “a bit of math first, then a bit of systems.” Instead, it proceeds in the order that better matches engineering decisions:

  1. Data pipelines and quality: what the model actually saw;
  2. Data mixture and distribution: in what proportions it saw those things;
  3. Compute and scaling: whether you can actually run the training within budget;
  4. Training recipe and monitoring: whether it will converge stably, and how to locate problems when it does not;
  5. Evaluation and downstream impact: how these pretraining decisions constrain mid-training, post-training, and the inference system.

From here on, the chapter follows three intertwined threads: data system → budget balancing → infrastructure. If you can see all three at once, then loss, PPL, MFU, checkpoints, and contamination audits stop looking like isolated dots. They become different observations of the same system.

By the end of this chapter, you should be able to redefine pretraining in one complete sentence:

Pretraining is the process of using a large, traceable, filterable, mixable token stream, under a fixed compute budget, to train a conditional language model that is as strong, as stable, and as generalizable as possible.

2 Data Pipelines and Quality

Important
Questions to hold in mind for this section
  1. Walk through a pretraining data pipeline: data sources → filtering → deduplication → tokenization → mixture.
  2. Why is “data quality > data quantity” the single most important principle in pretraining?
  3. What filters would you apply to Common Crawl before pretraining?
  4. Explain train–train deduplication vs train–test contamination. Why do both matter?
  5. How would you build a pipeline to detect and prevent benchmark contamination?
  6. You find that a model has memorized training snippets verbatim. How do you diagnose and fix it?
  7. What data governance practices matter in pretraining (licenses, PII, audit trails)?
  8. How do you handle PII at scale in training data?
  9. What are the common deduplication strategies (exact hashing, MinHash, embedding similarity)?
  10. You need to build a classifier to filter toxic content from pretraining corpora. What architecture would you use, and how do you handle false positives that remove valuable data?
  11. What role do “quality classifiers” play in a pretraining data pipeline—for example, classifiers trained on Wikipedia vs random web pages? How did Llama 3 handle this?

2.1 Data Is Not a Folder. It Is an Auditable System

The most common and most dangerous illusion in pretraining is to think of “data” as a static asset, as if you already had a few hundred GB or a few TB of text, and the only remaining step were to push it into a dataloader. In reality, pretraining data is closer to a continuously running data system. It has to answer at least four questions:

  1. Where did the text come from, and what are the license and usage boundaries?
  2. What filtering, modifications, and removals did the text go through before the model ever saw it?
  3. Does any of it overlap with validation sets, benchmarks, or sensitive data?
  4. If the model later shows memorization, bias, or capability gaps, can you trace the issue back to specific data buckets?

Good pretraining data engineering is therefore not just “collect more text.” It is about giving the data three properties: high quality, traceability, and tunability.

That is why “data quality > data quantity” ends up being one of the most important principles in pretraining. A model trained on massive noisy text can still learn to continue sequences. But it is more likely to learn template residue, SEO spam pages, near-duplicate content, navigation language, and formatting noise. One of CCNet’s core lessons is precisely this: web-scale corpora like Common Crawl are extremely valuable, but only if language identification, deduplication, and quality filtering extract the part that is actually worth keeping (Wenzek et al., 2019). The Pile shows the same point from another angle: data diversity and quality together shape the ceiling on cross-domain generalization (Gao et al., 2020).

2.2 A Typical Pretraining Data Pipeline

A workable pretraining data pipeline usually contains at least the following nine stages:

  1. Data collection and parsing
    Crawl raw corpora from web pages, books, papers, code repositories, forums, and documentation sites, then parse HTML, PDF, repository files, and other formats into a unified text representation.

  2. Text normalization
    Standardize Unicode, whitespace, line breaks, control characters, and encoding anomalies. Strip scripts, styles, cookie banners, navigation templates, and other non-content noise.

  3. Language identification and document-boundary repair
    Detect language, remove mixed-language or incorrectly encoded text, repair wrongly concatenated page fragments, and make sure the unit “document” itself is meaningful.

  4. Quality filtering
    Drop texts that are too short, too long, structurally abnormal, or abnormal in punctuation ratio, digit ratio, or repeated-line ratio. Add classifiers or scorers based on high-quality reference corpora.

  5. Safety, privacy, and compliance filtering
    Handle PII, license-restricted text, clearly sensitive content, high-risk sites, and sources that cannot legally be used for training.

  6. Deduplication
    First do exact deduplication, then near-duplicate deduplication, and if necessary add line-level or paragraph-level deduplication.

  7. Pre-tokenization audit
    Check tokenizer fragmentation rate, special-character handling, and length distribution on critical domains. If necessary, backtrack and revise normalization or tokenizer design.

  8. Bucketing, sampling, and mixture preparation
    Split the data into buckets by source, language, domain, and quality tier to prepare for later mixture design and annealing.

  9. Holdout construction and contamination audit
    Before final training, check overlap against benchmarks, validation sets, red-team sets, and internal probe sets.

None of these stages is optional “data hygiene.” They decide whether the model is spending expensive parameter capacity on learning knowledge structure, or wasting capacity fitting noise.

2.3 Nominal Token Budget vs Effective Token Budget

Pretraining plans often say something like: “This run will see \(D_{\text{nominal}}\) tokens.” But the number of tokens that actually deliver high-value learning signal to parameter updates is closer to this engineering approximation:

\[ D_{\text{eff}} \approx D_{\text{nominal}} (1-r_{\text{dup}}) (1-r_{\text{pad}}) \rho_{\text{parse}} \rho_{\text{quality}} \]

Here, \(r_{\text{dup}}\) is the duplicate or near-duplicate rate, \(r_{\text{pad}}\) is the packing / padding waste ratio, \(\rho_{\text{parse}}\) is the parse success rate, and \(\rho_{\text{quality}}\) is the fraction of high-quality, learnable tokens. It is not a law. It is an engineering approximation. But it is a useful one: planned token budget is not effective token budget. Training signal leaks away at every layer of the data system.

In the 7B running example, if nominal token budget is \(1\)T and the combined effect of deduplication, padding, parse failures, and low-quality text drives the effective coefficient down to \(0.72\), then the model’s true effective budget is only about \(7.2\times10^{11}\) tokens. For a training plan that already lives on a compute-optimal knife edge, that gap is enough to turn “barely enough data” into “clearly undertrained.”
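
A back-of-envelope calculator makes the leak visible. The rates below are the running example's illustrative values, not measurements from a real pipeline.

```python
def effective_tokens(d_nominal, r_dup, r_pad, rho_parse, rho_quality):
    """D_eff ~= D_nominal * (1 - r_dup) * (1 - r_pad) * rho_parse * rho_quality."""
    return d_nominal * (1 - r_dup) * (1 - r_pad) * rho_parse * rho_quality

# Hypothetical rates for the 7B x 1T running example.
d_eff = effective_tokens(1e12, r_dup=0.12, r_pad=0.05,
                         rho_parse=0.95, rho_quality=0.91)
print(f"{d_eff:.2e} effective tokens")  # ~7.2e11: the 1T plan quietly became ~0.72T
```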

2.4 A Filtering Strategy for Common Crawl

Common Crawl is one of the most common raw sources for pretraining because it is huge, broad, and fresh. It is also the canonical case of “high value, high cleaning cost.” For web data like Common Crawl, you should at minimum consider the following filtering dimensions:

| Filtering dimension | Typical method | Problem it solves |
| --- | --- | --- |
| HTML / template stripping | boilerplate removal, main-content extraction | Remove navigation bars, footers, ads, and script residue |
| Language identification | fastText-style language ID, paragraph-level fallback | Avoid mixed languages and misclassification |
| Length and structure rules | document length, sentence length, newline patterns, repeated-line ratio | Remove malformed concatenated pages, list pages, and template pages |
| Character statistics rules | punctuation ratio, digit ratio, abnormal-character ratio | Block gibberish pages, log pages, and SEO spam |
| Reference-corpus similarity | classify or score against Wikipedia / high-quality corpora | Prioritize documents that look more like citable text |
| Domain-specific filtering | custom extraction and rules for code/math/doc pages | Reduce false positives on code pages, tutorials, and forums |
| Safety and PII filtering | site rules, regex/NER, model classifiers | Remove high-risk personal information and sensitive samples |
| Deduplication | document-level, line-level, near-duplicate deduplication | Improve token efficiency and reduce memorization |

CCNet is a good example: it first extracts monolingual text from Common Crawl, then does language identification and deduplication, and finally uses similarity to high-quality reference corpora to improve the average quality of the retained text (Wenzek et al., 2019). The key is not any single rule. It is the combination of rules: heuristics remove noise with high recall; model-based classifiers sort the boundary cases more carefully.
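
To give a flavor of the rules in the table above, here is a minimal sketch of two cheap heuristics: a digit-ratio check and a repeated-line check. The thresholds are illustrative placeholders; real pipelines tune them per language and per source.

```python
def passes_heuristics(doc: str,
                      min_chars: int = 200,
                      max_digit_ratio: float = 0.3,
                      max_repeated_line_ratio: float = 0.3) -> bool:
    """Cheap, high-recall filters; all thresholds here are illustrative."""
    if len(doc) < min_chars:
        return False
    # Character statistics: pages dominated by digits are often logs or SEO spam.
    if sum(c.isdigit() for c in doc) / len(doc) > max_digit_ratio:
        return False
    # Repeated-line ratio: template and navigation pages repeat the same lines.
    lines = [l.strip() for l in doc.splitlines() if l.strip()]
    if lines and 1 - len(set(lines)) / len(lines) > max_repeated_line_ratio:
        return False
    return True

print(passes_heuristics("a genuinely varied sentence about many things " * 10))  # True
print(passes_heuristics("404 404 404\n" * 50))                                   # False
```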

2.5 Deduplication Is Not a Side Step. It Protects Core Capability

“Train–train deduplication” and “train–test contamination” are often lumped together, but they solve different problems.

Train–train deduplication

Repetition inside the training set causes at least three direct problems:

  • It wastes token budget: the model sees the same content again and again, which means expensive compute is being spent to repurchase the same information.
  • It increases memorization risk: lots of near-duplicate samples make it easier for the model to regurgitate rather than abstract.
  • It distorts learning priorities: high-frequency templates become overrepresented in the gradient and crowd out rarer but more valuable patterns.

Lee et al. showed this clearly: stronger deduplication on standard corpora not only reduces verbatim memorization, but can reach equivalent or better quality in fewer training steps, while also reducing overlap between training and validation data (Lee et al., 2022).

Train–test contamination

Contamination is different. If benchmark samples or highly similar variants enter the training set, evaluation scores become inflated. Contamination does not have to mean “the exact same question appears verbatim.” It can also look like:

  • The benchmark problem has been paraphrased while preserving the same factual structure.
  • Code snippets, solutions, or forum discussions about the benchmark end up in the training data.
  • The evaluation prompt differs only slightly in surface form from something in the training corpus.

So train–train deduplication protects generalization quality and token efficiency. Train–test contamination auditing protects evaluation credibility. You need both. Neither substitutes for the other.

Warning
Running example: a benchmark contamination scare caused by public solution mirrors

In the 7B running example, the team saw an anomalous jump in one code benchmark at around \(0.62\)T tokens—far larger than adjacent probes. The issue was not that the model had suddenly “figured it out.” The training mixture had absorbed public solution mirror sites, forum threads, and near-duplicate tutorial pages. The exact benchmark prompt did not appear word for word, but the solution fragments, test-case explanations, and mirrored discussions were enough to contaminate evaluation.

The correct response was not to log the score as a “stage win.” It was to do three things at once:

  1. build benchmark-family denylisting and near-duplicate auditing;
  2. downweight or remove solution sites, tutorial aggregators, and forum mirrors;
  3. require contamination audits on later checkpoints and add the contaminated sample family to continuous monitoring.

2.6 Common Deduplication Strategies: Exact Hashing, MinHash, and Semantic Similarity

Deduplication in pretraining is best understood as a cascade from cheap to expensive, from high precision to high recall, not as a single algorithm.
Let the normalized document be \(d\), and let the normalization operator be \(N(\cdot)\). Then the goal of the whole chain can be written as:

\[ \text{dedup}(d_i, d_j)= \mathbf{1}\!\left[\mathrm{sim}(N(d_i), N(d_j)) \ge \tau\right] \]

Here \(\mathrm{sim}(\cdot,\cdot)\) may mean exact equality, Jaccard similarity, segment-level coverage, or embedding cosine similarity. \(\tau\) is the threshold for that layer. In practice, you do not jump straight to the most expensive similarity metric. You filter in stages:

\[ C_{\text{exact}} \ll C_{\text{MinHash}} \ll C_{\text{segment}} \ll C_{\text{embed}} \]

Here \(C\) is the cost of comparing one pair of samples. In other words, use the cheapest method first to remove the easiest duplicates, then reserve expensive methods for high-risk, small-scale, ambiguous sets.

  1. Exact hashing
    Compute a hash directly on the normalized document:

    \[ h_i = H\!\left(N(d_i)\right) \]

    If

    \[ h_i = h_j \]

    then treat \(d_i,d_j\) as exact duplicates and keep only one copy. The key is not the hash function’s name. It is whether normalization is complete enough: if whitespace, casing, Unicode form, HTML residue, or line-break style have not been standardized, many documents that “look the same” will still appear different to the hash.
    The strength of this layer is that it is extremely fast, simple, and low-risk. The weakness is just as clear: it only catches

    \[ N(d_i)=N(d_j) \]

    strict equality. It misses near-duplicates like “same document, different title,” “same document plus disclaimer,” or “mirror site with light template edits.”

  2. Near-duplicate deduplication (shingling + MinHash/LSH)
    For each document, build a set of \(k\)-shingles:

    \[ S_k(d)=\{s_1,s_2,\dots,s_m\} \]

    where each \(s_t\) is a token n-gram or character n-gram of length \(k\). The near-duplicate rate between two documents is usually first expressed with Jaccard similarity:

    \[ J(d_i,d_j) = \frac{|S_k(d_i)\cap S_k(d_j)|}{|S_k(d_i)\cup S_k(d_j)|} \]

    Computing \(J\) for all document pairs is too expensive at web scale, so MinHash is used to build compact signatures. For the \(r\)-th random permutation \(\pi_r\), define:

    \[ m_r(d)=\min_{s\in S_k(d)} \pi_r(s) \]

    The key MinHash property is:

    \[ \Pr\!\big[m_r(d_i)=m_r(d_j)\big] = J(d_i,d_j) \]

    So if we use \(R\) independent hash functions to form a signature vector, Jaccard similarity can be approximated by the collision rate:

    \[ \hat J(d_i,d_j) = \frac{1}{R}\sum_{r=1}^{R}\mathbf{1}\!\big[m_r(d_i)=m_r(d_j)\big] \]

    To avoid comparing every document pair, this is usually combined with LSH (Locality-Sensitive Hashing). If \(R=br\) signatures are split into \(b\) bands with \(r\) rows per band, then the probability that a document pair becomes a candidate pair is approximately:

    \[ P_{\text{cand}}(J)=1-(1-J^r)^b \]

    This formula matters because it explains why LSH forms a “soft threshold”: when \(J\) falls below a certain region, candidate probability drops quickly; when \(J\) is high enough, candidate probability approaches 1 quickly. That is exactly why MinHash + LSH is the workhorse for near-duplicate removal in large web corpora: it gives a practical compromise between recalling highly similar duplicates and controlling pairwise comparison cost. A runnable sketch of this layer follows the list below.

  3. Line-level / segment-level deduplication
    Much high-repetition text is not “entire documents are nearly identical.” It is the same local paragraphs copied over and over. Tutorials, legal clauses, template pages, README files, FAQs, code comments, and site footers are common examples.
    In those cases, the right object is not the whole document but a set of segments or lines. Let a document be split into a set of segments:

    \[ P(d)=\{p_1,p_2,\dots,p_n\} \]

    Then one can define segment-level coverage, for example:

    \[ \mathrm{cover}(d_i \to d_j) = \frac{\sum_{p\in P(d_i)} |p|\cdot \mathbf{1}\!\left[\exists q\in P(d_j),\ \mathrm{sim}(p,q)\ge\tau_p\right]} {\sum_{p\in P(d_i)} |p|} \]

    Here \(|p|\) can be segment length in tokens, and \(\tau_p\) is a segment-level similarity threshold. The definition is directional: a high \(\mathrm{cover}(d_i \to d_j)\) means that a large fraction of \(d_i\) can be matched inside \(d_j\). That is particularly effective for detecting cases where “a long document is padded with large copied template blocks.”
    In real systems, segment-level deduplication is usually a two-stage process: first use local hashing or MinHash to recall candidate segments, then run more precise token-overlap comparisons on the candidates. Its value is that even when document-level Jaccard is not high, it can still detect cases like “40% of this page is copied.”

  4. Embedding-based similarity back-check
    The most expensive layer is usually semantic embeddings. Let an encoder \(f_\phi(\cdot)\) map a document into a vector space, then apply \(L_2\) normalization:

    \[ e(d)=\frac{f_\phi(d)}{\|f_\phi(d)\|_2} \]

    Then the semantic similarity between two documents can be written as cosine similarity:

    \[ s_{\cos}(d_i,d_j)=e(d_i)^\top e(d_j) \]

    If

    \[ s_{\cos}(d_i,d_j)\ge\tau_e \]

    then treat them as semantic neighbors and send them to a reranker or manual audit.
    The strength of this layer is that it can catch samples whose wording differs substantially but whose semantics are nearly identical. That makes it very useful for benchmark contamination audit, internal held-out problem audits, and leakage checks on sensitive documents. Its weakness is just as obvious: encoding and ANN search are expensive, and threshold choice is more fragile. It is easy to merge documents that are topically similar but should not be deduplicated. So embedding similarity is usually not the primary full-corpus deduplication engine. It is better used as a high-risk back-check layer.
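
To ground layers 1 and 2, here is a small self-contained sketch: exact duplicates via a hash of the normalized text, a MinHash signature whose collision rate estimates Jaccard similarity, and the LSH candidate-probability curve. The shingle size, the 128 salted hash functions, and the \(b=16\), \(r=8\) setting are illustrative assumptions, and Python's per-process `hash` is only suitable for a single-run demo.

```python
import hashlib

def normalize(doc: str) -> str:
    return " ".join(doc.lower().split())        # fold case and whitespace first

def exact_key(doc: str) -> str:
    return hashlib.sha256(normalize(doc).encode()).hexdigest()

def shingles(doc: str, k: int = 5) -> set[str]:
    toks = normalize(doc).split()
    return {" ".join(toks[i:i + k]) for i in range(max(len(toks) - k + 1, 1))}

def minhash_signature(s: set[str], num_hashes: int = 128) -> list[int]:
    # Simulate R random permutations by salting the hash; use stable hashes in prod.
    return [min(hash((r, x)) for x in s) for r in range(num_hashes)]

def jaccard_estimate(sig_a, sig_b) -> float:
    # Pr[min-hash collision] = J, so the collision rate estimates Jaccard.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def lsh_candidate_prob(j: float, b: int = 16, r: int = 8) -> float:
    return 1 - (1 - j**r) ** b                  # the LSH "soft threshold"

doc_a = " ".join(f"token{i}" for i in range(200))
doc_b = doc_a + " short mirror-site disclaimer"   # near-duplicate, not exact
print(exact_key(doc_a) == exact_key(doc_b))       # False: layer 1 misses it
sa = minhash_signature(shingles(doc_a))
sb = minhash_signature(shingles(doc_b))
print(round(jaccard_estimate(sa, sb), 2))         # ~0.98: layer 2 catches it
print([round(lsh_candidate_prob(j), 3) for j in (0.5, 0.7, 0.9)])
# [~0.06, ~0.61, ~1.0]: candidates snap in sharply as J crosses the threshold
```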

Seen together, these four layers make the engineering logic of deduplication much clearer:
exact hashing removes exact duplicates, MinHash removes large-scale near-duplicates, segment-level methods remove local copy pollution, and embedding back-checks audit high-risk semantic overlap.
A mature deduplication system does not try to make one similarity metric “smartest.” Under a fixed compute budget, it tries to maximize:

\[ \frac{\text{redundant tokens removed}}{\text{deduplication cost}} \]

while minimizing two kinds of error:
- false negatives: duplicates that should have been removed remain and keep polluting training;
- false positives: valuable rare text gets wrongly removed.

That is why deduplication is not a pure algorithm selection problem. It is a systems design problem shaped jointly by similarity definition, threshold choice, cost budget, and risk control.

2.7 How to Diagnose and Fix a Model That Memorizes Training Snippets Verbatim

If “memorization” remains only an abstract concern, it is hard to manage in engineering practice. A more useful view is to treat it as a diagnosable systems failure.

Start with how to detect it:

  • Run extraction tests on high-risk private text, license-sensitive text, and long unique strings;
  • Build canary checks for sentences, templates, and long spans that appear abnormally often in the training data;
  • For outputs that trigger complaints, do reverse retrieval to see whether a highly similar source document exists in the training corpus;
  • For specific source buckets, compute duplicate rate, long-common-substring rate, and memorization-prone patterns.

Then ask how to fix it:

  • Strengthen near-duplicate and line-level deduplication;
  • Reduce the mixture weight of highly repetitive source buckets;
  • Remove templated sites, aggregators, and mirrors;
  • Apply stronger exclusion and PII cleaning for private or sensitive corpora;
  • If necessary, roll back and retrain the contaminated segment of the model, rather than trying to patch the problem later with safety prompts.

A good engineering rule of thumb is this:
if a model can stably reproduce long passages verbatim, that usually does not mean the model is “too strong.” It means something is wrong in the data system.

2.8 Data Governance: Licenses, PII, and Audit Trails

The baseline for pretraining data governance is not “try to avoid problems.” It is to make every high-risk piece of data explainable, traceable, removable, and reviewable.

At least three governance mechanisms matter:

License governance

You need to know the source of each dataset, when it was crawled, what license applies, whether training is allowed, whether redistribution is allowed, and whether commercial use is allowed. Otherwise, the legal risk will catch up with you at exactly the worst time: when the model is ready to become a product.

PII governance

PII handling at scale cannot usually rely on one method alone. A more realistic setup is layered:

  • coarse filtering based on site and source: avoid obviously high-risk sources first;
  • coarse pattern matching: emails, phone numbers, ID numbers, addresses, and so on;
  • fine filtering using named entity recognition (NER) or lightweight classifiers: context-sensitive entities such as people, organizations, addresses, account information;
  • manual spot checks and sampled review on high-risk slices.

The danger is a system with low false negatives but high false positives: if you accidentally remove large volumes of medical, legal, or enterprise documents, data quality can collapse without you noticing. PII filtering must therefore be evaluated by bucketed recall and over-removal rate, not just one global metric.

Audit trails

A pretraining project that is publishable, reproducible, and operable should be able to answer:

  • Which source did this sample come from?
  • Which filters did it pass through?
  • Why was it kept, downweighted, or removed?
  • Which data bucket did it finally enter, and with what sampling weight?

In practice, the most useful approach is to keep the minimum lineage metadata that is still sufficient: source ID, crawl snapshot, license label, language label, quality score, deduplication status, PII/safety label, final bucket, and sampling weight. Without this trail, every later debugging effort becomes guesswork.

2.9 Model-Based Filtering and Governance: Quality Classifiers and Safety Classifiers Are the Same Kind of Lever

At this point it helps to place quality classifiers, toxicity filters, and PII detectors under the same engineering abstraction: they are all learned gates on the data pipeline. They do not replace rule systems, but after heuristic filtering they redraw the boundary of “which text deserves to enter the training distribution.” The difference lies in the objective: a quality classifier cares more about information density and learnability; a safety classifier cares more about risk exposure and the cost of false recall.

From the viewpoint of training objectives, these filters all change the same thing: the sample distribution that enters empirical risk in the first place. In other words, they are not “preprocessing plugins.” They are upstream parameters of the training distribution.

2.10 Quality Classifiers and Two-Stage Filtering

The purpose of a quality classifier is not to replace heuristic rules. It is to turn the question “does this text look like high-quality, learnable text?” into something trainable, rankable, and thresholdable.

One classic strategy is to distinguish high-quality reference corpora—Wikipedia, curated document collections—from random web text, and train a classifier or scorer so the pipeline can retain text that looks more like something worth citing. This idea is common in CCNet and many later pretraining pipelines (Wenzek et al., 2019).

In practice, a stable filtering architecture is often two-stage:

  1. Fast-screen stage: heuristic rules + lightweight models
    Use fast models such as fastText or DistilRoBERTa for large-scale scoring to preserve throughput.

  2. Reranking stage: stronger but more expensive models, or sampled manual review
    Only apply these to boundary cases, important domain buckets, or high-risk sources, so false positives stay under control.

The public Llama 3 report shows a modern version of this idea: Meta used model-based quality filtering, knowledge classifiers to identify information type, and downsampling for overrepresented webpage categories. They also used small models and Llama 2 annotation signals to build quality, code, and mathematical reasoning classifiers, which in turn helped shape a better data mixture (Grattafiori et al., 2024). That is why quality classifiers matter for more than “removing dirty data.” They directly participate in mixture design.
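
Below is a minimal sketch of the fast-screen stage, using scikit-learn in place of a fastText-style model. The four toy documents and labels are placeholders; a real pipeline trains on large samples of reference text versus random web pages and calibrates the threshold on held-out slices.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Toy "reference vs random web" training set (illustrative only).
docs = [
    "The mitochondrion is an organelle found in most eukaryotic cells.",
    "In 1905, Einstein published four papers that reshaped modern physics.",
    "CLICK HERE best cheap deals buy now limited offer!!!",
    "home | products | about us | contact | login | register",
]
labels = [1, 1, 0, 0]  # 1 = looks like citable reference text

scorer = make_pipeline(
    HashingVectorizer(n_features=2**18, ngram_range=(1, 2)),
    SGDClassifier(loss="log_loss"),  # logistic loss gives usable scores
)
scorer.fit(docs, labels)

# Rankable and thresholdable: keep docs above tau, send the boundary to rerank.
p = scorer.predict_proba(["Paris is the capital and largest city of France."])[0, 1]
print(f"quality score ~= {p:.2f}")
```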

2.11 When Building a Toxicity Filter, the Real Danger Is Not Missing a Few Dirty Samples. It Is Deleting Entire Classes of Valuable Text

If you need to build a toxicity filter for pretraining corpora, there are usually three realistic architectural choices:

  • lightweight text classifiers: high throughput, suitable for full-corpus scanning;
  • stronger encoder-based classifiers: more accurate, but more expensive;
  • layered systems: a lightweight model first, then a stronger model on high-risk or low-confidence samples.

The hard part is not which model family to pick. It is the shape of the error. Medical text, legal text, crime reporting, mental-health support, code comments, and public news often contain many “sensitive words” and many valuable facts at the same time. If your threshold is too strict, the model loses the ability to handle real-world complex text. If it is too loose, safety and product risk accumulate downstream.

That is why toxicity filters need slice-based evaluation: separately measure false positives on news, medicine, law, forums, code, fiction, and so on. Retain auditable samples and keep manually reviewing the edge cases. The maturity of pretraining governance shows up precisely in this kind of patience: refusing to worship a single overall accuracy number.

2.12 Section Takeaway

This section can be compressed into four lines:

  1. Pretraining data is not a pile of files. It is a traceable data system.
  2. Common Crawl is valuable, but only after strong cleaning, filtering, and deduplication.
  3. Internal deduplication protects generalization and token efficiency; train–test contamination auditing protects evaluation credibility.
  4. Quality classifiers, PII filtering, license governance, and audit trails are not compliance side quests. They are part of model quality.

Pretraining data is not a folder. It is a governed, deduplicated, auditable token production system. And once the high-quality pool is ready, the next question is no longer “Do we have data?” but “What weighted world will the model actually see?” That takes us to data mixture.

3 Data Mixture and Distribution

Important
Questions to hold in mind for this section
  1. Why does training data mixture matter? Give one example where adding more code data hurts dialogue quality.
  2. How would you design a data mixture for a general-purpose LLM?
  3. You add domain-specific data and domain performance goes up, but general benchmarks go down. What happened?
  4. What does “curriculum learning” mean in pretraining?
  5. How would you estimate the marginal value of adding a new data source to the mixture? Which proxy metrics can you track before running a full pretraining job?
Figure 2: Data mixture shapes capability structure

3.1 The Model Does Not See the World. It Sees a Weighted World

After cleaning, filtering, and deduplication, the next key decision in pretraining is not “do we start training?” It is “in what proportions do we train?” This step is often underestimated, but it directly shapes what the model becomes.

Suppose there are multiple data sources \(D_1, D_2, \dots, D_n\), with sampling weights \(w_1, w_2, \dots, w_n\). Then the probability of drawing from source \(D_i\) during training can be written as:

\[ p(D_i)=\frac{w_i}{\sum_j w_j} \]

The formula is simple, but it reveals one of the most important facts in pretraining:

What the model learns is not the world as it is. It is a weighted world.

Web text may dominate the real world, but if you let it dominate training, the model becomes more like an “internet autocomplete machine.” If you raise the share of books, papers, and formal documents, the model typically gets stronger long-form structure and more stable written expression. If you increase code and mathematical reasoning data, structured output and formal reasoning often improve, but ordinary conversational style can drift.

That is the value of multi-source high-quality corpora like The Pile: they show that data diversity is itself a form of capability design (Gao et al., 2020).

If you write mixture as part of the optimization objective, the engineering meaning becomes even clearer. Let \(\pi_i\) be the sampling probability of bucket \(i\), with \(\sum_i \pi_i = 1\). Then the overall empirical risk can be written as:

\[ \mathcal{L}(\theta) = \sum_{i=1}^{n} \pi_i \,\mathbb{E}_{x\sim D_i} \left[-\log p_\theta(x)\right] \]

The corresponding gradient is also just a weighted sum:

\[ \nabla_\theta \mathcal{L}(\theta) = \sum_{i=1}^{n} \pi_i \,\mathbb{E}_{x\sim D_i} \left[-\nabla_\theta \log p_\theta(x)\right] \]

That is why changing mixture is not “swapping a side dish.” It directly changes the expected gradient.
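
A tiny numeric illustration of that statement: the weights \(\pi_i\) multiply straight into the expected loss and gradient, so changing them changes what the optimizer chases. The per-bucket losses below are invented numbers for one hypothetical snapshot of training.

```python
import random

# Hypothetical mean loss (nats/token) per bucket at some training snapshot.
bucket_loss = {"web": 2.9, "books": 2.4, "code": 1.8,
               "academic": 2.6, "dialogue": 3.1}

def expected_loss(pi):
    # L(theta) = sum_i pi_i * E_{x~D_i}[-log p_theta(x)]
    return sum(pi[b] * bucket_loss[b] for b in pi)

base      = {"web": 0.68, "books": 0.12, "code": 0.10,
             "academic": 0.06, "dialogue": 0.04}
more_code = {"web": 0.48, "books": 0.12, "code": 0.30,
             "academic": 0.06, "dialogue": 0.04}
print(expected_loss(base), expected_loss(more_code))  # the objective itself moved

# The dataloader realizes pi by sampling buckets proportionally to weight.
print(random.choices(list(base), weights=list(base.values()), k=8))
```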

To compress “mixture shift → capability shift” into one memory card, use the table below:

| Mixture shift | Language fluency | Coding ability | Knowledge density | Style stability | Typical risk |
| --- | --- | --- | --- | --- | --- |
| More web text | Neutral to better | Neutral | Lower or more diffuse | Less stable | Noise, templates, SEO text |
| More books / formal documents | Better | Neutral | Better | More stable | Narrower coverage |
| More code | More technical style | Much better | Better structural knowledge | Weaker conversational stability | Dialogue style drift |
| More academic text | More formal | Helps math / paper summarization | Better | More stable but more bookish | Ordinary dialogue gets stiffer |
| More dialogue / forum text | Stronger interaction feel | Neutral | Depends on quality | More natural speech | Facts may be less reliable |

3.2 Why “More Code” Can Hurt Dialogue Quality

This point is easy to misunderstand. People often say: “But code is high-quality text. Why would more code hurt the model?”
The answer is not that “code is harmful.” The answer is that code changes the model’s prior over what kinds of continuations are common.

Imagine this case:

You are training a general-purpose dialogue model. The mix of web text, books, docs, forums, and code is roughly balanced. Later, to improve coding ability, you raise the code share from 8% to 30%, without increasing the total token budget. Three things can happen:

  1. Style transfer
    The model starts preferring short, tight, structured, list-like responses, sometimes even fenced code blocks, and ordinary Q&A begins to sound like fragments of technical documentation.

  2. Semantic bias
    When a user asks an open-ended question, the model is more likely to collapse toward “executable plans,” “pseudocode,” or “function interfaces,” instead of giving a human-friendly explanation.

  3. Capacity competition
    Under fixed parameter count and fixed token budget, more model capacity is spent fitting the code distribution. That means less capacity is available for the finer structure of ordinary natural language.

So the issue is not whether code can increase. It is that code should only increase with an explicit target mixture and a product goal. If your product is a coding assistant, that may be exactly right. If your product is customer support, education, or enterprise knowledge assistance, it may not be.

3.3 How to Design a Data Mixture for a General-Purpose LLM

For a general-purpose LLM, a more robust mixture design process is usually not “pick percentages from intuition.” It is closer to the following five-step method:

Step 1: define target capabilities before defining sources

Do you want better chat, code, math, retrieval comprehension, long-form summarization, or multilingual coverage? Different targets imply very different optimal mixtures.
Define the capability vector first, then map it to data buckets. Do not start from whatever data you happened to collect.

Step 2: bucket by function, not only by source

Source categories—web, books, code, papers—are useful, but not sufficient. It is more actionable to add functional buckets such as:

  • general knowledge and encyclopedic content
  • high-quality long-form and formal writing
  • code and technical documentation
  • mathematics and symbolic reasoning
  • multilingual text
  • QA / instructional text with rich task formats

The benefit is that when one capability underperforms, you can adjust weights more precisely than saying “add more web.”

Step 3: run mixture experiments on small models before betting the large-model budget

The Llama 3 report published a very useful pattern: compare different mixtures on smaller models using classification and scaling experiments, then extrapolate the result to larger models (Grattafiori et al., 2024). This is much safer, and much cheaper, than gambling a large-model run on one guessed mixture.

Step 4: validate by slices, not just total loss

You cannot judge a general-purpose model by one aggregate validation loss. At minimum, you want separate slices for:

  • general web and encyclopedic text
  • code
  • math / reasoning
  • long-form writing
  • multilingual text
  • safety / sensitive-domain text

Only then can you tell whether a new mixture is truly improving the whole model, or merely trading one capability for another.

Step 5: leave room for later continued pretraining

General pretraining is not the only place to close every domain gap. Many data sources that look tempting to stuff aggressively into pretraining are better handled later in continued pretraining, where the tradeoffs are easier to control.
A mature design does not try to do everything in pretraining. It knows which capabilities should be built early and which can move later.

If you instantiate that framework in the 7B running example, you get a more concrete design sketch:

| Data bucket | Share | Main benefit | Main risk |
| --- | --- | --- | --- |
| Web and general documents | 68% | Broad coverage, wide knowledge surface | High noise, many duplicates |
| Books | 12% | Long-form structure, written expression | Domain coverage bias |
| Code | 10% | Coding and structured reasoning | Dialogue style becomes more technical |
| Academic text | 6% | Knowledge density, formal expression | Tone gets stiffer |
| Dialogue and forums | 4% | Interaction tone, QA format | Fact quality is less stable |

The value of a table like this is not the exact percentages. It is that it forces you to map every unit of token budget to both capability gain and side effects.

3.4 When Domain Performance Rises and General Benchmarks Fall, It Usually Means Three Things

If you add domain data and find domain metrics rising while general metrics fall, that is rarely accidental. At least one of the following is probably happening:

1. Distribution shift

The model is spending more capacity and more training steps on a narrower domain distribution, so its fit to the general distribution slows down or gets squeezed.

2. The evaluation target changed

Your domain benchmark may now be much closer to the new data, while the general benchmark is farther away. So the model’s real product value may not have declined, but its adaptation to the old benchmark has.

3. The data quality or tokenization is unfriendly

Sometimes the problem is not “too much domain data.” It is “fragmented domain data.” If terms are split too finely by the tokenizer, or the documents are full of repeated templates, or formulas and tables are parsed poorly, then the actual high-value signal delivered by the domain corpus is lower than it looks.

Typical fixes include:

  • lowering the domain-data weight;
  • increasing the total token budget rather than only changing proportions;
  • adding general replay to preserve the base distribution;
  • moving domain strengthening from pretraining to continued pretraining.

The key judgment is this:
are you building a general base model, or a domain-skewed base model?
There is no mixture percentage that is “objectively correct” apart from product goals.

3.5 Curriculum Learning Is Not “Easy to Hard.” It Is Reallocating Budget Across Training Stages

Curriculum learning in pretraining is often described as “start simple, then move to harder data.” That description is too crude. A more useful engineering interpretation is:

Curriculum learning changes the data distribution, sequence length, or sample difficulty across training stages in order to improve convergence efficiency and final capability.

Three common curriculum patterns are:

Quality-first curriculum

In early training, use cleaner, higher-quality text so the model can quickly establish stable language statistics. Later, gradually add broader general text. The benefit is usually smoother, more predictable early loss.

Length curriculum

Start with shorter sequences to improve throughput. Once the model has learned the basic patterns, increase the context length. This is especially common in long-context training, because sequence length directly changes training cost.

Domain annealing

Late in training, raise the share of certain high-value buckets at a lower learning rate—for example, high-quality code, mathematical reasoning, or long-form documents—to get extra gains on those capabilities. Llama 3 explicitly reported annealing with small amounts of high-quality code and math data, while carefully excluding common benchmark training sets from the annealing pool to avoid evaluation distortion (Grattafiori et al., 2024).

The boundaries matter too:
curriculum learning is not a magic trick that always helps. If you design the curriculum badly, what you have really done is reallocate precious token budget at the wrong stage.

3.6 How to Estimate the Marginal Value of a New Data Source

A mature pretraining team does not throw a new data source directly into a major training run just because “it looks promising.” A safer approach is to estimate its marginal value first. A useful evaluation framework should include at least these proxy metrics:

1. Acceptance rate and net gain

After quality filtering, PII filtering, license screening, and deduplication, how many effective tokens remain?
If a source looks like 500B tokens on paper but only 20B remain after filtering, and those are highly repetitive, then its true marginal value is limited.

2. Unique information density after deduplication

Check the near-duplicate rate, long-common-substring ratio, and template fraction. Many sources that look huge on paper contribute very little genuinely new information.

3. Tokenizer friendliness

Measure fragmentation rate, average token length, and how key terms are segmented. This matters especially for code, law, medicine, and mixed table/text corpora.

4. Small-model proxy experiments

In a small-model or short-run setup, compare adding the source vs not adding it along these axes:

  • validation loss change;
  • domain-slice loss change;
  • small downstream benchmark change;
  • safety and memorization-risk change.

5. Annealing experiments

Take a mid-run checkpoint and do a short annealing run on the new source at low learning rate. Measure whether it produces cheap, repeatable gains. Llama 3 explicitly used annealing this way to judge the value of small high-quality domain data (Grattafiori et al., 2024).

The right question is not “How big is this source?” It is:

Under acceptable risk, does it contribute enough previously missing tokens that are closely aligned with the capability you want?

3.7 Section Takeaway

  1. Data mixture is capability design, not warehouse management.
  2. More of a given data type changes the model’s internal prior, so “more code” and “more math” are never unconditional goods.
  3. When domain data boosts domain metrics but hurts general metrics, it usually means distribution shift, capacity competition, or tokenizer / data-quality problems.
  4. The most reliable mixture design workflow is not intuition. It is small-model experiments, slice-based evaluation, and marginal-value estimation.

Once the mixture is set, the question shifts from “which world to train on” to “how many FLOPs to spend training that world.” That takes us to the balance among parameters, tokens, and compute budget.

4 Compute and Scaling

Important
Questions to hold in mind for this section
  1. What are the primary cost drivers in pretraining (parameters, context length, number of tokens, precision)?
  2. Explain the Chinchilla scaling law. What is the compute-optimal ratio between parameters and tokens?
  3. Roughly estimate: how many FLOPs are required to train a 70B model on 2T tokens?
  4. If model parameters double, how do FLOPs per token change approximately?
  5. Explain data parallelism, tensor parallelism, and pipeline parallelism. When is each used?
  6. What is ZeRO / FSDP, and how does it reduce memory?
  7. What is activation checkpointing, and what is the tradeoff?
  8. Explain mixed-precision training (BF16/FP16). Why does BF16 dominate in LLM training?
  9. What is the roofline model, and how does it help reason about GPU utilization?
  10. Estimate the minimum number of H100 GPUs and wall-clock time needed to pretrain a 7B model on 1T tokens. State your assumptions.
  11. What is sequence parallelism, and how does it work with tensor parallelism in long-context training?
  12. Explain the difference between Megatron-LM’s 3D parallelism and DeepSpeed ZeRO-3. When would you use each?
Figure 3: Compute triangle—the mutual constraints among N, D, and C

4.1 Why Pretraining Is Expensive: Four Main Knobs Turn at Once

Pretraining is expensive not because of one factor, but because four knobs move at once:

  1. parameter count \(N\): determines model capacity, and also the memory footprint of weights, gradients, and optimizer state;
  2. training token count \(D\): determines how much content the model will see;
  3. context length \(L\): directly affects attention cost, activation memory, and packing efficiency;
  4. numerical precision and training state: affects memory, bandwidth, stability, and effective throughput.

If you only look at parameter count, you will misread the budget. The difference between a 70B model and a 7B model is not just “10× more parameters.” It is also larger optimizer states, heavier communication pressure, harder parallel partitioning, and usually a larger token requirement. At the same time, once context length increases, attention cost and activation cost rise quickly.

From a systems perspective, what drives up the bill is not only theoretical FLOPs, but also:

  • synchronization and communication;
  • padding and packing waste;
  • dataloader and storage bandwidth;
  • checkpoint writes;
  • fault recovery and reruns;
  • conservative safety margins added to preserve stability.

So when discussing “pretraining cost,” you cannot ask only “How big is the model?” You have to ask: How far do you intend to train it, at what sequence length, for how long, and at what system efficiency?

4.2 A Good Enough Training FLOPs Estimator

For dense decoder-only Transformers, a widely used rough estimate of training FLOPs is:

\[ \text{FLOPs} \approx 6ND \]

where:

  • \(N\) is the number of parameters;
  • \(D\) is the total number of training tokens.

This is an extremely useful order-of-magnitude formula. It ignores many constants and systems overheads, but it is good enough for budget-level reasoning.

For example:

70B model, 2T tokens

\[ 6 \times 70\times10^9 \times 2\times10^{12} = 8.4\times10^{23} \ \text{FLOPs} \]

That is about \(8.4\times10^{23}\) FLOPs.

7B model, 1T tokens

\[ 6 \times 7\times10^9 \times 10^{12} = 4.2\times10^{22} \ \text{FLOPs} \]

That is about \(4.2\times10^{22}\) FLOPs.

One point needs emphasis: this is only the idealized training compute. It does not include communication, recomputation, checkpoints, idle time, or input-pipeline bottlenecks. Real wall-clock is usually longer than anything you get by back-solving directly from theoretical FLOPs.

A more operator-centric wall-clock estimate is usually written as:

\[ T_{\text{wall}} \approx \frac{6ND}{n_{\text{gpu}}\,P_{\text{peak}}\,\eta_{\text{MFU}}} \]

where \(n_{\text{gpu}}\) is the number of GPUs, \(P_{\text{peak}}\) is per-GPU peak throughput, and \(\eta_{\text{MFU}}\) is effective Model FLOP Utilization. If the actual tokens consumed per step are denoted by \(B^{\text{tok}}_{\text{global}}\), then the training step count can be roughly estimated by:

\[ S \approx \frac{D}{B^{\text{tok}}_{\text{global}}} \]

In the 7B running example, with \(D=10^{12}\) tokens and \(B^{\text{tok}}_{\text{global}}\approx 8.4\times10^6\) tokens/step, the total step count is about

\[ S \approx 1.19\times10^5 \text{ steps} \]

If you use \(512\) H100s, \(P_{\text{peak}}\approx 989\) TFLOPS (BF16 peak), and \(\eta_{\text{MFU}}\approx0.38\), then the idealized pure-compute wall-clock comes out to a few days. Add I/O, checkpoints, fault recovery, and padding losses, and real wall-clock usually stretches further.
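
Chaining the three estimates for the running example; the peak throughput and MFU are the hypothetical values quoted above, and everything downstream of \(6ND\) inherits its idealizations.

```python
def pretrain_budget(n_params, n_tokens, n_gpu, peak_flops, mfu, tokens_per_step):
    flops = 6 * n_params * n_tokens               # C ~= 6ND
    steps = n_tokens / tokens_per_step            # S ~= D / B_tok
    seconds = flops / (n_gpu * peak_flops * mfu)  # T_wall ~= C / (n * P * eta)
    return flops, steps, seconds / 86_400         # convert to days

flops, steps, days = pretrain_budget(
    n_params=7e9, n_tokens=1e12, n_gpu=512,
    peak_flops=989e12, mfu=0.38, tokens_per_step=8.4e6)
print(f"{flops:.1e} FLOPs, {steps:.2e} steps, {days:.1f} ideal compute-days")
# ~4.2e22 FLOPs, ~1.19e5 steps, ~2.5 days before I/O, checkpoints, and failures
```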

4.3 What Chinchilla Changed: Not “Bigger Is Better,” but “Do Not Underfeed Large Models”

Kaplan et al.’s scaling laws show that language-model loss tends to decrease smoothly as model size, data size, and compute increase (Kaplan et al., 2020). This gave the field a crucial piece of confidence: scaling up usually helps, and the gains are not random.

But Hoffmann et al. pointed out something equally important: many large models were actually undertrained—they had too many parameters and not enough tokens (Hoffmann et al., 2022). That is the central correction behind Chinchilla. It is not asking whether “more parameters are always better.” It is asking:

Under a fixed compute budget, how should parameter count and training tokens be balanced to maximize return?

It is like dating: time and energy are fixed budgets. You cannot just look for someone with elite specs. You also need enough time together, enough mutual understanding, enough shared experience. The best deal is not the highest standard in the abstract. It is the best balance, under limited investment, between how well matched the person is and how fully the relationship has time to develop.

In the original Chinchilla setting, one very common rule of thumb is:
the compute-optimal region is around 20 training tokens per parameter.

That number is famous, but what matters more is its boundary:

  • it is an empirical rule, not a law of nature;
  • it depends on data quality, optimizer choice, training recipe, and architectural details;
  • it is a strong starting point, not a guarantee that later models will match it exactly.

The Llama 3 report shows that once data quality, recipes, and target capabilities change, there can still be gains from training on more high-quality tokens even beyond the classical Chinchilla point (Grattafiori et al., 2024). So the mature engineering stance is not to worship one ratio. It is to use Chinchilla as a first-pass budgeting baseline.
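
As a first-pass budgeting sketch, the 20 tokens/param heuristic can be back-solved against \(C\approx 6ND\): with \(D=20N\), \(C=120N^2\). The function below encodes only that heuristic, not a fitted scaling law.

```python
def chinchilla_guess(flops_budget, tokens_per_param=20):
    # C = 6*N*D with D = k*N  =>  C = 6k*N^2  =>  N = sqrt(C / (6k))
    n = (flops_budget / (6 * tokens_per_param)) ** 0.5
    return n, tokens_per_param * n

n, d = chinchilla_guess(4.2e22)                  # the running example's budget
print(f"N ~= {n/1e9:.1f}B params, D ~= {d/1e12:.2f}T tokens")
# ~18.7B / ~0.37T: the 7B x 1T plan deliberately over-trains a smaller model
# on far more tokens than the classical compute-optimal point.
```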

4.4 What Happens When Parameters Double, or Context Length Doubles

For dense models, assuming roughly similar architectural proportions:

  • double the parameters: forward/backward FLOPs per token roughly double, and weights plus optimizer-state memory also rises roughly linearly;
  • double the training tokens: total training cost roughly doubles;
  • double the context length: attention cost rises significantly, activation memory grows faster, and long-sequence throughput drops clearly;
  • drop precision from FP32 to BF16/FP16: memory and bandwidth pressure go down, but numerical stability management becomes more important.

That is why pretraining planning is not a 2D table. It is a 4D tradeoff:
parameters, tokens, context length, and precision all matter at once.

4.5 Why One Machine Is Not Enough: Pretraining Naturally Becomes Distributed

Figure 4: Distributed pretraining topology

Once the model reaches the billion-parameter scale, a single machine quickly hits three walls:

  1. memory wall: parameters, gradients, optimizer state, and activations no longer fit;
  2. time wall: single-machine wall-clock becomes unacceptable;
  3. bandwidth wall: a single machine can no longer read, cache, and process data fast enough to sustain target throughput.

So the pretraining problem naturally shifts from “how do we write a loss function?” to “how do we split the model, split the state, and split the training itself?”

4.6 A Parallelism Map: DP, TP, PP, ZeRO/FSDP, and Sequence Parallelism

You can think of parallelism as: take one training problem and split it across different dimensions so multiple GPUs can work on it together.
Let global batch be \(B\), sequence length be \(L\), hidden dimension be \(H\), parameter count be \(P\), and device count be \(n\). Single-GPU training is usually limited by three things:

  • state memory: parameters, gradients, optimizer state, roughly on the same order as \(P\);
  • activation memory: intermediate tensors, often scaling roughly with \(B\times L\times H\);
  • communication cost: time spent synchronizing parameters, gradients, or activations across devices.

The five major kinds of parallelism are different ways to trade among these bottlenecks. A useful mnemonic is:

DP splits samples, TP splits matrices, PP splits layers, ZeRO splits state, SP splits sequences.

| Parallelism type | What it splits | Main problem it solves | Main cost |
| --- | --- | --- | --- |
| Data parallelism (DP) | Samples | Increase throughput | Gradient synchronization |
| Tensor parallelism (TP) | Matrices inside one layer | A single layer does not fit on one GPU | Frequent communication |
| Pipeline parallelism (PP) | Layer groups along depth | The full model does not fit on one GPU / one node | Pipeline bubbles, scheduling complexity |
| ZeRO / FSDP | Parameters, gradients, optimizer state | Reduce redundant memory | More communication, greater implementation complexity |
| Sequence parallelism (SP) | Activations / operations along the sequence dimension | Activation pressure in long-context training | Must be co-designed with TP |

Use one unified example:
suppose you need to train a 70B model on 8 GPUs, with global batch \(B=1024\) and context length \(L=8192\).
The difference among parallel strategies is not “which is more advanced.” It is which dimension you are cutting along.

Data parallelism (DP)

DP is the most intuitive: every GPU holds a full model replica, but each one sees different samples.
If there are \(n\) GPUs, each GPU processes

\[ B_{\text{local}}=\frac{B}{n} \]

samples. After computing a local gradient \(g_i\), you synchronize:

\[ g=\frac{1}{n}\sum_{i=1}^{n} g_i \]

It is like 8 chefs following the same recipe, each cooking a different batch, then averaging the seasoning at the end.
The upside is simplicity, stability, and easy throughput scaling. The downside is equally clear: every GPU must fit a full model replica. If the 70B model cannot fit on one GPU, adding more pure DP does not help. DP solves “too many samples,” not “too large a model.”
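
A single-process simulation of that averaging step makes the semantics visible without launching a real torch.distributed job. The replicas, data shards, and loss are toy stand-ins.

```python
import torch

n_workers, dim = 8, 4
w = torch.randn(dim)                                       # identical replica everywhere
shards = [torch.randn(16, dim) for _ in range(n_workers)]  # different samples per GPU

local_grads = []
for shard in shards:
    w_i = w.clone().requires_grad_(True)                   # full model copy per worker
    loss = (shard @ w_i).pow(2).mean()                     # toy per-worker loss
    loss.backward()
    local_grads.append(w_i.grad)

# Real DP's all-reduce computes exactly this mean of the local gradients.
g = torch.stack(local_grads).mean(dim=0)
print(g)
```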

Tensor parallelism (TP)

TP solves “a single layer does not fit.”
Take a linear layer

\[ Y=XW \]

If the output dimension is too large, you can split the weight matrix by columns:

\[ W=[W_1, W_2, \dots, W_n] \]

Then GPU \(i\) only computes its own slice:

\[ Y_i = XW_i \]

and the results are later concatenated or reduced.
Intuitively, DP is “different people solve the whole problem for different samples,” while TP is “several people solve different subparts of the same large problem.”
It is especially useful for the huge GEMMs inside attention and FFNs. The cost is that every layer communicates. So TP depends heavily on high-speed interconnects. If inter-GPU bandwidth is weak, the math kernels themselves may be fine while the whole system slows under communication.
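
The column split is easy to verify numerically. The sizes below are toy assumptions, with n playing the role of the TP group:

```python
import torch

torch.manual_seed(0)
B, H_in, H_out, n = 4, 8, 12, 3            # toy sizes; n simulated "GPUs"
X = torch.randn(B, H_in)
W = torch.randn(H_in, H_out)

# Column-parallel linear: each simulated device owns one column block of W
# and computes only its slice of the output.
W_shards = W.chunk(n, dim=1)               # n blocks of shape (H_in, H_out/n)
Y_shards = [X @ W_i for W_i in W_shards]

# The gather step: concatenate partial outputs along the feature dimension.
Y = torch.cat(Y_shards, dim=1)
print(torch.allclose(Y, X @ W))            # True: TP reproduces the full GEMM
```

In a real framework the concatenation (or the reduction, for row-parallel splits) is a collective over the interconnect, which is exactly why TP is bandwidth-hungry.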

Pipeline parallelism (PP)

PP solves “the whole model does not fit on one machine.” The idea is to cut the model along depth.
If the model has 80 layers and you split it into 4 stages, each stage handles about 20 layers. The first GPU group runs layers 1–20, the second runs 21–40, and so on.
It is like a factory pipeline: not everyone builds a whole car. One group installs the chassis, another installs the doors, another installs the electrical system.

The problem is that a pipeline is not automatically full. To keep later stages from waiting idly, you usually split one batch into several micro-batches. If the pipeline has \(p\) stages and the number of micro-batches is \(m\), the bubble ratio is approximately:

\[ \text{bubble} \approx \frac{p-1}{m+p-1} \]

So if there are too few micro-batches, the pipeline keeps stalling. If there are too many, scheduling complexity and activation pressure increase. The hard part of PP is not the concept. It is filling the pipeline well enough that the concept actually works.
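
The bubble formula is worth playing with before choosing a schedule:

```python
def bubble_ratio(p: int, m: int) -> float:
    """Approximate idle fraction of a pipeline with p stages and
    m micro-batches, per the formula above."""
    return (p - 1) / (m + p - 1)

# With 4 stages: 4 micro-batches waste ~43% of the step as bubbles,
# while 32 micro-batches bring that under 9%.
for m in (4, 8, 32):
    print(f"m={m:2d}  bubble={bubble_ratio(4, m):.3f}")
```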

ZeRO / FSDP

ZeRO / FSDP do not split samples, matrices, or layers. They split the training state itself.
The main waste in naive DP is that every GPU stores full parameters, full gradients, and full optimizer state. If we denote them by \(P,G,O\), then per-GPU state memory is approximately:

\[ M_{\text{naive}} \approx P+G+O \]

Under ideal sharding, each of the \(n\) GPUs only holds a shard, so per-GPU memory becomes roughly:

\[ M_{\text{shard}} \approx \frac{P+G+O}{n} \]

It is like 8 people each carrying a full toolbox, then switching to a system where each person carries only part of the tools and shares when necessary.
This drastically reduces redundant memory, but at the cost of more all-gathers, reduce-scatters, and more complicated semantics. FSDP is the main modern engineering realization of this idea: store in shards by default, gather temporarily when a given layer needs to run.
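
A back-of-the-envelope sketch for the 70B-on-8-GPUs example, assuming the common rule of thumb of about 16 bytes of state per parameter (BF16 params and grads, FP32 master weights, two Adam moments); the exact byte count depends on the stack:

```python
def per_gpu_state_gib(params_billions: float, n_gpus: int, sharded: bool) -> float:
    """Rough per-GPU training-state memory in GiB, assuming ~16 bytes per
    parameter (an assumption, not a universal constant)."""
    total_bytes = params_billions * 1e9 * 16
    if sharded:                      # ZeRO-3 / FSDP: shard P, G, O across GPUs
        total_bytes /= n_gpus
    return total_bytes / 2**30

print(per_gpu_state_gib(70, 8, sharded=False))  # ~1043 GiB: no single GPU fits this
print(per_gpu_state_gib(70, 8, sharded=True))   # ~130 GiB: 8x less, still heavy
```

Even fully sharded, the 70B example does not fit comfortably on 8 devices, which is why real deployments combine sharding with TP and PP rather than relying on any single technique.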

Sequence parallelism (SP)

When context length rises from 4K to 32K, 64K, or beyond, the exploding term is often not parameters but activations.
If the activation tensor is

\[ A\in\mathbb{R}^{B\times L\times H} \]

then the shape itself tells you what happens: as \(L\) grows, memory grows linearly with it. SP says: if TP has already split the width dimension, we can also split the length dimension and shard the sequence across multiple GPUs. If the sequence is evenly split across \(n\) GPUs, then per-GPU activation size drops approximately to:

\[ \frac{B\times L\times H}{n} \]

Intuitively, it is like taking one very long scroll and giving 8 people one section each, instead of making every person unroll the whole thing.
SP is often used together with TP because in long-context training, the expensive part is the coexistence of “huge matrices across width” and “huge activations across length.” TP mainly reduces per-layer operator pressure. SP mainly reduces long-sequence activation pressure.
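
To see why \(L\) is the dangerous axis, here is the residual-stream term alone for a 70B-ish shape; the hidden size of 8192 and the 80 layers are illustrative assumptions:

```python
def residual_act_gib(b_micro: int, L: int, H: int, layers: int, n_sp: int = 1) -> float:
    """BF16 size of just the per-layer residual-stream tensor, in GiB,
    sharded over n_sp sequence-parallel ranks. Real activation memory is
    several times larger (attention and MLP intermediates); this only
    shows how the B x L x H term scales with L and n_sp."""
    return b_micro * L * H * layers * 2 / n_sp / 2**30

print(residual_act_gib(1, 8192, 8192, 80))           # 10.0 GiB at 8K context
print(residual_act_gib(1, 65536, 8192, 80))          # 80.0 GiB at 64K: the wall
print(residual_act_gib(1, 65536, 8192, 80, n_sp=8))  # 10.0 GiB once SP splits L
```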

Put these five strategies together and the picture becomes clearer:
parallel training is not simply “add more GPUs.” It is deciding whether to cut along samples, matrices, layers, state, or sequence.

4.7 Megatron 3D Parallelism vs ZeRO-3/FSDP: Not Substitutes, but Different Layers of the Problem

When people first encounter these techniques, they often ask: “Which is better, Megatron 3D parallelism or ZeRO-3?”
The more accurate answer is: they do not solve exactly the same problem.

Megatron-LM’s “3D parallelism” usually means a combination of DP + TP + PP, which splits the computation graph and the model structure itself (Shoeybi et al., 2019).
ZeRO-3 / FSDP, by contrast, split the training state and remove redundant replicas (Rajbhandari et al., 2019).

So:

  • when even a single layer does not fit on one GPU, you usually need TP;
  • when the full model does not fit on one machine, PP often enters the design;
  • when full replicas are too wasteful, ZeRO-3/FSDP become extremely valuable;
  • in serious large-scale training, these are often used together rather than chosen one over the other.

4.8 Activation Checkpointing: Trade Compute for Memory

The core idea of activation checkpointing is simple: do not save every intermediate activation during the forward pass. Recompute part of them on demand during backpropagation. This can significantly reduce memory pressure, making larger batches, longer contexts, or larger models possible.

The tradeoff is clear:

  • upside: saves memory, often the thing that makes training possible at all;
  • downside: adds extra compute and lengthens step time.

This is the kind of engineering valve that does not make training cheaper, but often makes previously impossible configurations trainable.
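
In PyTorch the standard entry point is torch.utils.checkpoint. A minimal sketch on a toy stack (recent PyTorch assumed for the use_reentrant flag):

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# An 8-layer toy stack; with 2 segments, the forward pass keeps activations
# only at segment boundaries and recomputes the interior during backward.
model = torch.nn.Sequential(*[torch.nn.Linear(512, 512) for _ in range(8)])
x = torch.randn(4, 512, requires_grad=True)

y = checkpoint_sequential(model, 2, x, use_reentrant=False)
y.sum().backward()   # backward triggers the recompute, trading FLOPs for memory
```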

4.9 Mixed Precision: Why BF16 Became the Default Starting Point

LLM training is no longer done entirely in FP32. The reason is simple: memory and bandwidth are too expensive. Mixed-precision training uses 16-bit representations for most tensors, sharply reducing both memory footprint and transfer cost.

Between BF16 and FP16, modern large-model training usually prefers BF16 not because it is “more advanced,” but because it is easier to keep stable. BF16 preserves the same exponent range as FP32, so it is far less likely to overflow or underflow in large-scale training.
The engineering consequence: you can spend more attention on the training itself, and less on fighting numerical accidents.
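
A minimal sketch of the standard pattern in PyTorch, assuming a CUDA device (the layer size and learning rate are placeholders):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
x = torch.randn(8, 1024, device="cuda")

# Forward matmuls run in BF16; parameters and Adam state stay in FP32.
# Unlike FP16, no GradScaler is needed, because BF16 keeps FP32's exponent range.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()
loss.backward()
opt.step()
```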

4.10 Use the Roofline Model to See Whether the System Is Computing or Waiting

The roofline model provides an effective lens on bottlenecks: the attainable performance of a kernel or training workload is bounded by two roofs, the compute peak and the memory-bandwidth peak. Which roof binds depends on arithmetic intensity, the ratio of FLOPs performed to bytes moved. A workload with low arithmetic intensity is most likely memory-bandwidth bound; one with high arithmetic intensity can approach the compute peak.

This helps because it breaks down the vague claim “GPU utilization is low” into sharper questions:

  • Is the operator itself poorly saturated?
  • Is HBM bandwidth the bottleneck?
  • Is communication dragging the system down?
  • Or is the data pipeline simply not feeding the hardware?

Many training systems look busy at the GPU level while true MFU (Model FLOP Utilization) remains poor. The roofline view helps you tell whether the bottleneck is compute, memory, or communication.

Figure 5: The roofline model—identify the real bottleneck in a training system
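
As a sketch of where the knee sits, using vendor round numbers as assumptions (H100 SXM: roughly 989 TFLOPS dense BF16, 3.35 TB/s HBM3):

```python
# Roofline knee: the arithmetic intensity where compute and bandwidth
# ceilings cross. Above it, a kernel can be compute-bound; below, it waits on HBM.
knee = 989e12 / 3.35e12            # ~295 FLOP/byte (assumed hardware numbers)

# Arithmetic intensity of a BF16 GEMM (M,K)@(K,N), assuming no cache reuse:
M = K = N = 8192
flops = 2 * M * K * N
bytes_moved = 2 * (M * K + K * N + M * N)
ai = flops / bytes_moved
print(f"AI={ai:.0f} FLOP/byte, compute-bound={ai > knee}")   # ~2731, True
```

Large GEMMs clear the knee easily; it is the elementwise ops, attention with short sequences, and communication phases that sit below it, which is why whole-run MFU is always lower than any single matmul suggests.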

4.11 GPU Utilization Is High but Tokens/s Is Poor: Usually the Problem Is Not Kernels but Effective-Token Ratio

In the 7B running example, one dashboard showed GPU utilization staying above \(90\%\), yet real tokens/s lagged the plan by nearly \(28\%\). The root cause was not slow matmuls, but the combination of three things: poor packing granularity, dataloader jitter from uneven shards, and repeated reads caused by a data-cursor mismatch after resume. The more useful operator metric is not utilization alone, but:

\[ \rho_{\text{effective}} = \frac{\text{useful tokens/s}}{\text{loaded tokens/s}} \]

When \(\rho_{\text{effective}}\) falls significantly, the first things to inspect are usually:

  1. packing quality and padding waste;
  2. dataloader / storage bandwidth;
  3. shard assignment and worker load balance;
  4. repeated reads or stale sample replay.

In short, care about effective token budget and effective tokens/s.
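
Of the four, packing quality is the easiest to reason about offline. Below is a toy first-fit packer showing how the document-length distribution and the sequence length interact to produce padding waste; the corpus statistics are invented:

```python
import random

def packing_stats(doc_lens: list[int], seq_len: int) -> tuple[int, float]:
    """Greedy first-fit packing of documents into fixed-length sequences.
    Returns (sequences used, fraction of token slots wasted as padding).
    A toy model of why packing granularity moves rho_effective."""
    free = []                            # remaining space in each open sequence
    for d in doc_lens:
        for i, f in enumerate(free):
            if d <= f:
                free[i] -= d
                break
        else:
            free.append(seq_len - d)     # open a new sequence
    return len(free), sum(free) / (len(free) * seq_len)

random.seed(0)
docs = [random.randint(100, 3000) for _ in range(1000)]   # short-doc-heavy corpus
n_seqs, pad_frac = packing_stats(docs, 4096)
print(n_seqs, f"{pad_frac:.1%} of loaded token slots are padding")
```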

4.12 A Rough H100 Estimate: How Many GPUs, How Long, and How Much for a 7B Model on 1T Tokens

Use the same rough estimate as above.

Suppose you are training a 7B dense model on 1T total tokens. Further assume that one H100 can achieve about 350–450 TFLOPS of effective end-to-end training throughput (note: this is an engineering assumption, not hardware peak). Then:

  • 256 H100s: pure-compute time is about 4.2–5.4 days
  • 512 H100s: pure-compute time is about 2.1–2.7 days
  • 1024 H100s: pure-compute time is about 1.1–1.4 days

If you convert that into dollars, the key quantity is not "how many GPUs" but "total GPU-hours." These configurations all land roughly in the range of 26k–33k H100-hours. At a rough rate of $3–$4 / H100-hour, the pure GPU bill is about $80k–$130k. Notice that the bill barely depends on whether you use 256, 512, or 1024 GPUs, because total FLOPs stay the same; more GPUs mostly buy shorter wall-clock time.
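
The arithmetic fits in a few lines; the effective throughput and the hourly rate are the stated assumptions, not measurements:

```python
def pretrain_budget(n_params: float, n_tokens: float, n_gpus: int,
                    eff_flops_per_gpu: float = 400e12,
                    usd_per_gpu_hour: float = 3.5):
    """First-pass plan from C ~= 6*N*D. The effective per-GPU throughput
    and the hourly rate are assumptions to replace with your own numbers."""
    flops = 6 * n_params * n_tokens
    seconds = flops / (n_gpus * eff_flops_per_gpu)
    gpu_hours = seconds / 3600 * n_gpus        # invariant to n_gpus
    return seconds / 86400, gpu_hours, gpu_hours * usd_per_gpu_hour

for g in (256, 512, 1024):
    days, hours, usd = pretrain_budget(7e9, 1e12, g)
    print(f"{g:4d} GPUs: {days:4.1f} days, {hours/1e3:4.1f}k GPU-hours, ${usd/1e3:.0f}k")
```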

But that is still just “ideal pure-compute time.” Real wall-clock usually has to add:

  • data loading and packing;
  • communication and synchronization;
  • checkpoint writes;
  • fault recovery;
  • safety margin and efficiency losses;
  • plus storage, networking, and rerun overheads.

So when you rough out GPU count, wall-clock time, and training cost, you must write down the system-efficiency assumptions explicitly. Otherwise, every estimate will look more optimistic than reality.

4.13 Section Takeaway

  1. Pretraining cost is jointly determined by parameters, tokens, context length, and precision.
  2. \(6ND\) is a very good first-pass FLOPs estimate, but real wall-clock is always more expensive.
  3. The core of Chinchilla is not a magical constant. It is “do not underfeed large models.”
  4. Large-model training is distributed from the ground up; TP, PP, ZeRO/FSDP, and SP solve different layers of the bottleneck stack.
  5. When you do budgeting, calculate not only theoretical FLOPs, but real system efficiency.

Once the budget is defined, the next step is not writing fancier formulas. It is making those FLOPs land stably on a real cluster. That brings us to training recipes and monitoring.

5 Training Recipe and Monitoring

ImportantQuestions to hold in mind for this section
  1. What is the typical learning-rate schedule for LLM pretraining?
  2. Loss spikes by 3× mid-training and then slowly recovers. How do you debug it?
  3. What are the five most common failure modes in pretraining (instability, contamination, memorization, distribution shift, safety drift)?
  4. How do you set up probe sets to monitor quality during pretraining?
  5. Training loss keeps decreasing, but downstream evaluation stops improving. What is wrong?
  6. When do you decide to stop pretraining, and which signals matter?
  7. What is gradient-norm monitoring? If gradient norm suddenly spikes, what kind of training problem does it imply, and what should you do?

5.1 A Training Recipe Is Not a Hyperparameter List. It Is a Stability Boundary

At large scale, a training recipe does not merely “tune performance a bit.” It decides whether the run survives the first week, the second week, and the whole month. Every component—learning rate, global batch, warmup length, weight decay, numeric format, gradient clipping, checkpoint frequency—can change whether training diverges, whether loss oscillates for long periods, and whether the system can recover after a cluster incident.

That is why a pretraining recipe is better understood as a stability-boundary design problem. You are not trying to build “the prettiest config file.” You are trying to build a system that:

  • can run at scale;
  • reveals instability quickly;
  • can be rolled back after failure;
  • makes changes explainable;
  • can be reviewed after training ends.

5.2 The Typical Learning-Rate Schedule: Warm Up First, Decay Later

In modern LLM pretraining, the most common learning-rate schedule is still:

  1. warmup: rise smoothly from a small learning rate to a peak;
  2. main training phase: either stay stable for a while or begin decaying slowly right away;
  3. decay phase: usually cosine decay or linear decay until the end of training.

Why is warmup almost always necessary? Because early training combines three dangerous conditions:

  • parameters have not yet entered a stable region;
  • momentum statistics are not yet reliable;
  • large batches plus low precision make numerical oscillation more likely.

If you start with a large learning rate, loss spikes, exploding gradients, NaNs, and irreversible divergence tend to arrive very quickly.
That is why the learning-rate schedule is not just a convergence trick. In large-model training, it is first a risk-management tool.
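
A minimal warmup-plus-cosine schedule; the shape is the common one, and every constant below is an assumed placeholder (the step counts echo the running example's roughly 1.2e5 steps):

```python
import math

def lr_at(step: int, total_steps: int, warmup_steps: int,
          peak: float, floor: float) -> float:
    """Linear warmup to peak, then cosine decay to floor."""
    if step < warmup_steps:
        return peak * step / max(1, warmup_steps)
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * t))

for s in (0, 1_000, 2_000, 60_000, 120_000):
    print(s, f"{lr_at(s, 120_000, 2_000, 3e-4, 3e-5):.2e}")
```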

5.3 What a Modern Pretraining Recipe Usually Looks Like

Different teams choose different defaults, but a reasonably stable modern pretraining recipe usually includes the following components:

  • optimizer: AdamW remains the safest mainstream starting point. It decouples weight decay from gradient updates and behaves well in large-scale adaptive optimization (Loshchilov & Hutter, 2017).
  • numerical format: BF16 mixed precision, with key statistics and master weights maintained stably.
  • learning-rate schedule: short warmup + long decay, often parameterized by tokens rather than epochs.
  • gradient clipping: limits the damage from abnormal batches.
  • activation checkpointing: trades recompute for memory and increases the trainable scale.
  • global batch design: jointly determined by DP, gradient accumulation, and micro-batch size.
  • packing and data-order control: reduces padding so “training tokens” do not drift too far from “effective tokens.”
  • checkpoint and rollback mechanisms: make sure failures do not mean restarting from scratch.

The important point is that these components are not independent.
Change the global batch and the learning rate may no longer be right. Change the context length and both activation memory and throughput change. Change precision and the stability boundary moves too. A good recipe is never a bag of isolated switches. It is a coupled set of system parameters.

For global-norm clipping, a common formulation is:

\[ \tilde{g} = g \cdot \min\left(1, \frac{\tau}{\lVert g \rVert_2}\right) \]

where \(\tau\) is the clipping threshold. The formula is simple, but it exposes one core idea of training recipes: many “advanced tricks” are just ways to limit how far an abnormal step can perturb the optimization trajectory.
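
As a sanity check that the formula does what it claims, here is a spelled-out version; in practice you would call torch.nn.utils.clip_grad_norm_ rather than hand-rolling it:

```python
import torch

def clip_global_norm_(grads: list[torch.Tensor], tau: float) -> float:
    """In-place global-norm clipping per the formula above."""
    total = torch.sqrt(sum(g.pow(2).sum() for g in grads)).item()
    scale = min(1.0, tau / (total + 1e-12))
    for g in grads:
        g.mul_(scale)
    return total                     # pre-clip norm: worth logging every step

grads = [torch.randn(10) * 50, torch.randn(5) * 50]
print(clip_global_norm_(grads, tau=1.0))                 # large pre-clip norm
print(torch.sqrt(sum(g.pow(2).sum() for g in grads)))    # ~1.0 afterwards
```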

5.4 Operator View: What a Real Training Dashboard Must Show

If you translate the recipe into the things an on-call engineer actually watches every day, a decent dashboard should contain at least the following:

| Metric | Healthy signal | Red flag | Common root cause | First move |
|---|---|---|---|---|
| train loss | Smooth decline | Sudden spike / persistent oscillation | LR, bad batch, restore error | Align the timeline |
| val PPL | Steady decline or plateau | No improvement / divergence from train | Overfitting, contamination, mixture imbalance | Inspect slice validation |
| grad norm | Noisy but bounded | Sudden spike or periodic jitter | Overflow, extreme sample, LR too high | Stabilize first, replay next |
| tokens/s | Near target | Persistently below target | Poor packing, I/O starvation, communication | Check step breakdown |
| MFU | Stable in target band | High utilization but low MFU | Kernel / communication / input starvation | Compare against roofline |
| checkpoint age | Refreshes regularly | Too long without a checkpoint or abnormal restores | Storage jitter, write blocking | Inspect restore path |
| slice eval | Critical slices improve together | One domain keeps regressing | Mixture, tokenizer, contamination | Localize to bucket first |

In the 7B running example, a useful dashboard is not “one with many graphs.” It is one where these signals line up on the same timeline: did the loss spike happen right after a resume? Did tokens/s drop right after the packing policy changed? Without a shared timeline, a dashboard is just pretty noise.

Figure: Healthy vs abnormal training-curve comparison

5.5 Loss Spikes by 3× Mid-Training and Then Recovers. How Do You Debug It?

This is a classic and tricky pretraining failure. The spike may be temporary, but that does not mean it is safe to ignore. A more robust debugging path usually looks like this:

Step 1: align the timeline

First check whether anything in the system changed right when the spike happened:

  • did the learning-rate phase change?
  • did sequence length change?
  • did the mixture get reweighted?
  • did training resume from a checkpoint?
  • did nodes restart, communication glitch, or data shards switch?

Many failures are not “sudden.” They are latent effects of some state change that happened just before.

Step 2: inspect gradient norms and numerical logs

If the loss spike comes with a sharp gradient-norm spike, suspect first:

  • learning rate too high;
  • a bad batch;
  • precision overflow;
  • gradient-sync issues.

If gradient norm is normal but loss is abnormal, then look first at data content, label offset, packing bugs, or restore logic.

Step 3: replay the abnormal batch

If you can identify the batch at that step, replay it offline.
Many deep bugs only show up this way—for example, tokenizer mis-segmentation on a particular document type, a corrupted sample inside one shard, or very long samples that trigger attention-scale numerical issues.

Step 4: decide whether it is incidental or structural

  • one-off, then gone: could be a bad sample, transient node jitter, or a temporary communication issue;
  • periodic: usually tied to the learning-rate schedule, a specific data bucket, or checkpoint restore flow;
  • increasingly frequent: means the system is approaching the stability boundary and you should not keep gambling just because “it recovered.”

The most dangerous engineering mistake is to confuse “it recovered” with “nothing is wrong.”

5.6 Resume Semantics: How Do You Prove a Resumed Run Is Equivalent to an Uninterrupted One?

The “full state” of training at step \(t\) is never just the parameters \(\theta_t\). More completely, it is at least:

\[ \mathcal{S}_t = (\theta_t, m_t, v_t, \mathrm{rng}_t, \pi_t, \sigma_t) \]

where:

  • \(m_t, v_t\) are Adam’s first and second moments;
  • \(\mathrm{rng}_t\) is RNG state;
  • \(\pi_t\) represents the data cursor, shard order, packing state, and sample-order position;
  • \(\sigma_t\) represents scheduler state, loss-scaler state, and gradient-accumulation state.

So “resume succeeded” is not the same as “the weights file loaded.” A stricter notion of semantic equivalence is: under the same input shard and the same random state, the next several steps after resume should match the uninterrupted run in loss, gradient norm, and update direction up to floating-point-level differences.
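
A sketch of what saving \(\mathcal{S}_t\) might look like; names such as data_cursor and accum_step are hypothetical placeholders for whatever your dataloader and trainer actually track:

```python
import torch

def save_full_state(path, model, opt, sched, data_cursor, accum_step):
    """Persist the full training state S_t, not just the weights."""
    torch.save({
        "model": model.state_dict(),        # theta_t
        "optimizer": opt.state_dict(),      # m_t, v_t (Adam moments live here)
        "scheduler": sched.state_dict(),    # part of sigma_t
        "rng_cpu": torch.get_rng_state(),   # rng_t (also save CUDA/NumPy states)
        "data_cursor": data_cursor,         # pi_t: shard order, packing position
        "accum_step": accum_step,           # sigma_t: gradient-accumulation phase
    }, path)

# toy usage
m = torch.nn.Linear(4, 4)
o = torch.optim.AdamW(m.parameters(), lr=1e-3)
s = torch.optim.lr_scheduler.CosineAnnealingLR(o, T_max=100)
save_full_state("ckpt.pt", m, o, s,
                data_cursor={"shard": 3, "offset": 1024}, accum_step=0)
```

The equivalence test then follows directly: restore, run a few steps against the same shards and RNG states, and diff loss and gradient norm against the uninterrupted run.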

WarningRunning example: a subtle regression caused by incomplete restore

In the 7B running example, one checkpoint restore was initially judged “successful” because loss returned to roughly its prior level. But a few hundred steps later, gradient-norm noise widened significantly, and tokens/s fell slightly. The root cause turned out to be this: weights and optimizer state were restored, but the data cursor and gradient-accumulation state were not, so the semantic meaning of several later steps drifted.

The repair principle is simple: treat resume semantics as a first-class systems feature, not a side function of the training script.

5.7 The Five Most Common Failure Modes in Pretraining

It is more useful to classify failures by “which layer caused them” than by surface symptom. The most common five are:

| Failure mode | Typical symptom | Root-cause layer |
|---|---|---|
| Instability | Loss spikes, NaNs, exploding gradient norm | Optimization and numerics |
| Contamination | Benchmark jumps look too good, offline evaluation is distorted | Data and evaluation |
| Memorization | Verbatim long-span regurgitation, privacy leakage | Data and deduplication |
| Distribution shift | One domain improves while general ability regresses, or the reverse | Data mixture |
| Safety drift | More aggressive sensitive responses, moving refusal boundary | Data governance and downstream alignment interface |

The core idea behind this table is:
first decide whether the problem belongs to data, optimization, systems, or evaluation, then choose the debugging path.
Otherwise it is easy to spend a lot of time debugging the wrong layer.

5.8 A Probe Set Is Not a Small Benchmark. It Is an Early-Warning System

During training, one validation loss is not enough. You need a small, stable probe-set system to surface quality changes in different directions early.

A good probe set usually has five properties:

  1. small and cheap: can run frequently without slowing training;
  2. covers critical slices: at minimum general knowledge, code, math, long-context behavior, language/domain slices, safety boundaries;
  3. low contamination risk: ideally internally constructed, strictly deduplicated, and traceable;
  4. stable signal: do not rely too heavily on temperature-sensitive or score-volatile metrics;
  5. product relevance: not only what looks academically important, but also what actually matters to the product.

The best probe-set design does not try to cover everything. It covers the abilities you would most regret not noticing had regressed.

5.9 Training Loss Keeps Falling, but Downstream Scores Plateau: The Model May Be Learning Easy Things

This is a common misread in pretraining. If loss keeps decreasing, the model is still fitting the training distribution better. But if downstream scores do not move, those extra bits of fit may not be turning into the capabilities you care about.

Four common causes are:

  1. validation distribution and downstream task distribution do not match
    Better fit on “average web text” does not imply equal gains in code, math, long-form reasoning, or tool-use proxies.

  2. the new tokens do not contain much information density
    The model may be getting better at predicting easier tokens—template pages, repeated phrases, formal shells—instead of learning genuinely new knowledge patterns.

  3. tokenizer or context settings block the gains from propagating
    Some domains will not improve much even with more data if tokenization is too fragmented or long contexts are heavily truncated.

  4. the run has entered a diminishing-returns region
    This does not mean training must stop. It means “spend more” now requires a clearer reason.

So loss is never the answer itself. It only tells you the model is still learning. To know whether it is learning what you actually want, you need slice evaluation and probe sets.

5.10 Gradient Norm: The Cheapest Seismograph, and One of the Most Neglected

The value of gradient-norm monitoring is that it often exposes risk earlier than higher-level metrics do. A sudden spike in grad norm usually means one of the following:

  • the learning rate has moved beyond the current stability boundary;
  • mixed precision overflow or another numerical issue;
  • one abnormal batch is creating extreme gradients;
  • training state after resume is inconsistent;
  • a sudden data-distribution switch injected unusually hard samples.

A practical response sequence is usually “stabilize first, localize second”:

  1. trigger gradient clipping or an automatic safeguard;
  2. save and label the abnormal state;
  3. check whether the event lines up with LR, sequence length, or data-bucket switches;
  4. replay the abnormal batch;
  5. if necessary, roll back to a safe checkpoint and resume with smaller steps.

If you can only add one more training-monitoring chart, often the best choice is not a tenth fancy dashboard panel, but a reliable grad norm over time.
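
A minimal version of that chart's input, as a sketch:

```python
import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    """The seismograph: L2 norm over all parameter gradients. Log it every
    step and alert on spikes relative to a rolling median of recent steps."""
    sq = 0.0
    for p in model.parameters():
        if p.grad is not None:
            sq += p.grad.pow(2).sum().item()
    return sq ** 0.5

m = torch.nn.Linear(8, 8)
m(torch.randn(4, 8)).sum().backward()
print(global_grad_norm(m))
```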

5.11 When to Stop Pretraining: Not “Loss Stopped Falling,” but “Is More Training Still Worth It?”

Stopping pretraining is almost never a single-signal decision. A more realistic stopping rule usually combines:

  • whether the planned token budget has been spent;
  • whether validation loss has slowed materially;
  • whether the key slices have reached a plateau;
  • whether probe sets have stopped improving or started regressing;
  • whether the marginal cost of continued training exceeds the acceptable marginal gain;
  • whether the data is being reused too heavily.

In other words, stopping does not mean “the model cannot learn anymore.” It means making a more practical engineering judgment:

given the current budget, data, and goal, is more training still worth it?

5.12 Section Takeaway

  1. A training recipe is a stability boundary, not a bag of isolated hyperparameters.
  2. Warmup + decay is the default not because of habit, but because large-scale training needs risk control.
  3. Loss spikes, gradient-norm spikes, and downstream plateaus must all be localized inside the four-layer stack of data → optimization → systems → evaluation.
  4. A probe set does not replace full evaluation. Its job is to catch the regressions you most do not want to miss, early.
  5. Stopping pretraining is a cost–benefit decision, not a ritual of staring at one loss curve until it goes flat.
NoteA 60-second inspection order for AI engineers
  1. First check whether train loss / val PPL / grad norm become abnormal at the same time.
  2. Then check whether tokens/s / MFU / checkpoint age are abnormal too, to separate “optimization problems” from “systems problems.”
  3. Then inspect slice eval, to decide whether the regression is global or confined to a specific bucket like code, math, or long-form text.
  4. Only then decide whether to continue training, roll back, lower the learning rate, or freeze data and restore paths for postmortem.

A training recipe is the stability boundary. Monitoring tells you whether you are still inside it. Even if training looks stable, that does not mean the model is moving in the right product direction. So the next question is: what do these monitoring numbers actually tell you, and what do they not?

6 Evaluation and Downstream Impact

Figure 6: Training monitoring and downstream effect—loss ↓ ≠ capability ↑
ImportantQuestions to hold in mind for this section
  1. What does perplexity measure, and what are its limits?
  2. Why does lower perplexity not always mean better instruction following?
  3. How do pretraining choices constrain mid-training, SFT, and inference?
  4. If your base model is weak at code, should you fix it in pretraining or later?
  5. You pretrain two models with the same architecture but different data mixtures. Model A has lower perplexity, but Model B scores higher on downstream tasks. How do you explain that?

6.1 Perplexity Measures “Average Surprise,” Not “Product Completeness”

Let the validation set contain \(N\) target tokens, and let the true sequence be \(x_1,\dots,x_N\). At position \(i\), the model assigns the true token the conditional probability

\[ p_i = p_\theta(x_i \mid x_{<i}) \]

Then the average token loss, i.e. the average negative log-likelihood, is:

\[ \ell = -\frac{1}{N}\sum_{i=1}^{N}\log p_i \]

Perplexity (PPL) is just the exponential of that average loss:

\[ \mathrm{PPL} = \exp(\ell) \]

Put the two together and the form becomes clearer:

\[ \mathrm{PPL} = \exp\!\left( -\frac{1}{N}\sum_{i=1}^{N}\log p_i \right) = \left( \prod_{i=1}^{N}\frac{1}{p_i} \right)^{1/N} \]

So PPL is really the geometric mean of the inverse probabilities assigned to the true tokens.
If the model consistently puts high probability on the true next token, \(\ell\) is low and PPL is low. If it often guesses poorly and assigns small probability to the true token, \(\ell\) is high and PPL is high.

A vivid way to think about it is this:
PPL asks: at each step, how many plausible options does the model still seem to be hesitating among, on average?

For example, if across many positions the model assigns the true token probability about \(0.2\), then

\[ \ell \approx -\log 0.2,\qquad \mathrm{PPL}\approx 5 \]

You can roughly read that as: the model is choosing among about 5 equally plausible options on average.
If the average true-token probability is only \(0.05\), then

\[ \mathrm{PPL}\approx 20 \]

Now the model is much more “perplexed,” because it is effectively groping among about 20 candidates.
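
The two numeric examples above, in code form:

```python
import math

def ppl(true_token_probs: list[float]) -> float:
    """Perplexity as the geometric mean of inverse true-token probabilities."""
    nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(nll)

print(ppl([0.2] * 100))    # 5.0  -> "hesitating among ~5 options"
print(ppl([0.05] * 100))   # 20.0 -> "groping among ~20 candidates"
```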

You can think of it like filling in a line of verse.

When the line is “Looking back at the bleak place I came from — return: there is neither rain nor ___,” a decent model should assign high probability to “sunshine.”
A weaker model may hesitate among “sunshine,” “cold,” “sound,” “feeling,” and other continuations, or even produce one that does not fit.
The first has low surprise and low PPL. The second has high surprise and high PPL.

That is why PPL is indeed a very important base metric in pretraining. It is:

  • cheap;
  • stable;
  • easy to compute frequently;
  • directly aligned with the next-token prediction objective.

But its limitations need to be just as clear.

1. Perplexity depends on the tokenizer

Under different tokenizers, the same text becomes a different token sequence, so \(N\) changes and the meaning of each \(p_i\) changes. That means PPL is no longer directly comparable.
If one tokenizer splits a technical term more finely, the model may simply be solving more local, easier predictions. A change in PPL may reflect segmentation more than actual language-model quality.

2. Perplexity depends on the evaluation distribution

If you compute PPL on a web validation set, then it measures fit to the web-text distribution, not unified capability across all task distributions.
Lower PPL on web text does not imply stronger performance on code, math, legal documents, or real product requests.
More formally, PPL measures

\[ \mathbb{E}_{x\sim D_{\text{val}}}\big[-\log p_\theta(x)\big] \]

on one distribution, not “the totality of the capabilities you care about.”

3. Perplexity does not directly measure behavior

It does not directly measure:

  • instruction following;
  • tool use;
  • JSON-structured output;
  • safety boundaries;
  • verbosity, politeness, or stylistic consistency;
  • multi-turn interaction quality.

The reason is simple: PPL cares only about whether the true token got high probability. It does not care whether the resulting answer is a good answer in product terms.
A model can absolutely have lower validation PPL while being more verbose, worse at tool calls, or less aligned with the response style you want.

If you write pretraining evaluation in a form closer to product concerns, you can define the actual risk you care about as:

\[ R_{\text{prod}}(\theta) = \sum_{k=1}^{K} \alpha_k R_k(\theta), \qquad \sum_{k=1}^{K} \alpha_k = 1 \]

where \(R_k\) may correspond to code, long-form writing, tool use, factuality, safety, or some domain-specific task.
Validation PPL only measures negative log-likelihood on some development distribution. It does not directly minimize \(R_{\text{prod}}\).

So the right statement is not “PPL is useless.” It is that PPL is one of the most important thermometers during training, but it is not the full medical report.

6.2 Why Lower Perplexity Does Not Guarantee Better Instruction Following

The core reason is this:
pretraining optimizes continuation of real text, not obedience to user intent.

A base model can have lower PPL and still:

  • not know when to refuse;
  • not know what it means to “answer according to a JSON schema”;
  • not know when to call tools;
  • not know what kind of answer is actually helpful to a user;
  • prefer to continue like ordinary internet text.

Instruction following, preference alignment, safety boundaries, and tool behavior are mostly shaped later, through SFT, preference optimization, RL, or constrained decoding.

So lower PPL often means stronger underlying language modeling ability, but not automatically a better behavior interface. That distinction matters because it tells you where to fix the problem:

  • capability gaps should usually be fixed earlier;
  • behavior gaps can usually be fixed later.

6.3 A Concrete Case of “Lower Loss, Worse Product”

Here is a synthetic case that is nevertheless common in real engineering. Two runs use the same 7B architecture and the same total token budget, but different mixtures:

| Model | Dev-set PPL ↓ | Code pass@1 ↑ | Tool-use success ↑ | Instruction-following win rate ↑ | Main mixture change |
|---|---|---|---|---|---|
| Model A | 7.12 | 39.4 | 61.8 | 67.1 | Upweighted code + formal technical documents |
| Model B | 7.34 | 36.8 | 69.5 | 74.2 | More balanced mixture, with more dialogue and general documents |

Why does this counterintuitive pattern happen—lower PPL, worse product? Because A got better at fitting formal structure that is easier to predict in the dev distribution, while shifting dialogue behavior, tool-trigger boundaries, and natural instruction style into a narrower region. In other words, A fits that validation text better, but B is closer to the real product risk \(R_{\text{prod}}\).

That is why pretraining evaluation should retain all three of the following at once:

  1. base distribution metrics: loss / PPL;
  2. slice metrics: code, math, long-form text, multilinguality;
  3. product proxies: tool use, structured output, instruction following, safety boundaries.

6.4 How Pretraining Constrains Mid-Training, SFT, and Inference

Pretraining is not merely a prerequisite stage. It keeps constraining every later layer.

Tokenizer constraint

If the pretraining tokenizer splits code, medical terms, table symbols, or multilingual characters too finely, later continued pretraining, SFT, RAG, and inference cost will all inherit that problem.
That is not something post-training can easily repair.

Context-length and positional-scheme constraint

If pretraining only covered 8K contexts, then pushing to 128K later is not just an inference-systems problem. It also touches positional encoding, length extrapolation, KV cache behavior, and long-context data distribution.
It can be postponed, but the cost rises.

Data-mixture constraint

If the base model is weak at code, math, or long-form consistency, post-training can often only polish what is already there. It cannot conjure deep statistical competence from nowhere.

Architecture and systems constraint

Your architectural choices, parallelism strategy, precision, and training stack continue to shape:

  • the cost of mid-training;
  • how adapters are trained;
  • KV-memory usage at inference time;
  • how feasible long-context extension is;
  • how stable tool-call structured output will be.

In one sentence:

Pretraining determines how far you can still go later, how fast you can fix things, and how much it will cost.

6.5 If the Base Model Is Weak at Code, Fix It During Pretraining or Later?

This is a classic and practical judgment call. A useful rule of thumb is:

If the gap is in statistical competence and distribution coverage, fix it as early as possible

For example:

  • the model does not really understand code syntax;
  • it is unfamiliar with common APIs, libraries, and language idioms;
  • it collapses on long code contexts;
  • it cannot read mixed natural-language/code text well.

These problems suggest that the base model lacks coverage of the code distribution. They are best fixed in pretraining, or at least in continued pretraining, because the gap is in the underlying language model.

If the gap is in the behavior interface, it can move later

For example:

  • it can write code, but not emit it in the required tool format;
  • it can complete the function body, but explains it poorly;
  • it can code, but does not follow code-review templates;
  • it does not ask clarifying questions or run tests before answering when it should.

These are closer to instruction following, interaction style, and tool use. They can usually be fixed in SFT, preference optimization, or inference-time constraints.

The worst mistake is to misclassify a base capability gap as a post-training problem.
Then you end up trying to patch the interface on top of a weak substrate, and discover that no amount of behavioral tuning makes it feel natural.

6.6 Why Model A Has Lower Perplexity While Model B Has Better Downstream Tasks

When two models share the same architecture but use different data mixtures, and you see “A has lower PPL, B has stronger downstream performance,” the four most common explanations are:

1. Validation distribution and task distribution do not match

A is better on the distribution you used to compute PPL. B is better on the task distribution you actually care about.
That is not a contradiction. It is a mismatch in what you measured.

2. B’s mixture is closer to the target capability

For example, B may have better proportions for code, math, long-form text, or professional documentation. So even if its overall PPL is worse, its real task performance can still be stronger.

3. Tokenizer or sample-length differences changed learning speed

Some mixtures deliver stronger “effective learning signal” under the same token budget. Others are full of easier-to-predict but lower-information tokens.
The speed at which PPL drops is not always the speed at which capability grows.

4. A’s evaluation may be contaminated or overfit to the validation distribution

If A’s training data is closer to the validation set, a better-looking PPL is not surprising. But that does not mean it generalizes better.

So the mature engineering position is not “Should we trust PPL or benchmarks?” It is:

you must first confirm that you are comparing the same distribution, the same tokenizer, and the same target capability.

6.7 From Pretraining to Mid-Training and Post-Training

At this point, pretraining can be redefined very precisely:

Pretraining is the process of training a general-purpose conditional language model over a large-scale token distribution.

Its outputs are mainly two things:

  1. a capability substrate: language statistics, knowledge compression, pattern transfer, long-range dependency handling, structural understanding;
  2. a parameter initialization: a plastic starting point for mid-training, SFT, preference optimization, tool-use training, and inference-time constraints.

This is exactly why later chapters must discuss mid-training and post-training separately:

  • mid-training mainly addresses distribution coverage;
  • post-training mainly addresses the behavior interface.

If you blur the two together, decisions lose focus: capabilities that should have been fixed early get delayed, while behavior issues that should have been fixed later are forced into the base stage, which is both expensive and unstable.

6.8 Section Takeaway

  1. Perplexity is one of the most important base metrics in pretraining, but it is not the total score for product quality.
  2. Lower PPL does not guarantee better instruction following, because the two objectives are not the same.
  3. Pretraining continues to constrain tokenizer design, context length, architecture, base capability, and downstream training cost.
  4. Fix base capability gaps early; fix behavior and interface issues later.
  5. When two models share an architecture but the one with lower PPL performs worse downstream, it is usually a sign that distribution, goal, and evaluation are misaligned.

At the end of pretraining, what you get is not an “already finished assistant,” but a parameter set jointly shaped by the data system, budget constraints, training stability, and evaluation discipline.

7 Chapter Summary

Pretraining is not “feeding lots of text into a model.” It is the construction of a system that continually turns high-quality tokens into generalizable parameters under a fixed budget. If you remember only five judgments from this chapter, let them be these:

  1. The quality of the data system determines whether the model can actually approach its ceiling.
  2. Data mixture is not a warehousing problem. It is a capability-design problem.
  3. Pretraining budget planning must jointly balance parameters, tokens, context length, and precision.
  4. Large-model training is first a distributed-systems problem, and only then a matter of “tuning the optimizer nicely.”
  5. Loss and perplexity only answer whether the model keeps fitting the distribution; they cannot alone answer whether the model is becoming more suitable for a product.

LLM capability does not grow directly out of a loss function; it is jointly determined by the data system, the compute budget, distributed infrastructure, the training recipe, and evaluation discipline.

Will pretraining be replaced? Now that AutoML is already strong, will AI engineers still exist?

I think that question actually mixes together two different things.

The first is: who trains the model?
The second is: who is responsible for the model?

People often blur them together, which is why they conclude that if automation gets strong enough, humans simply disappear from the loop. That does not necessarily follow. Automation is best at replacing work where the objective is clearly specified, the feedback is already quantifiable, and nobody really has to take responsibility when something goes wrong. It will get better and better at running experiments, tuning parameters, running A/B tests, discarding bad configurations, and keeping good ones. As soon as a job looks enough like search, an agent will eventually outperform a human at it.

But pretraining is not pure search.

Pretraining is closer to shaping a mind. Not because it is mystical, but because it is expensive, partially irreversible, and cumulative in consequence. You are not tuning a local metric. You are deciding how a model will represent the world. How data enters, how loss is defined, which capabilities get rewarded, which biases get amplified, which errors harden after trillions of tokens—these are not questions answered by merely “running the experiments.” They are questions of judgment. And the core of judgment is not optimization. It is responsibility.

So I do not think “training a mind” disappears. If anything, it stays. It will remain high-bar, high-responsibility, and therefore high-leverage work. Because what is actually scarce has never been people who can run training jobs. It is people who know when to continue, when to stop, what is wrong with the model, and who are willing to take responsibility for the result.

What will get replaced is another class of ML work: training pipelines that do not require understanding domain adaptation, failure modes, or consequences. Those workflows matter, but they are more like assembly lines. Once the process is stable enough, it is only a matter of time before agents take over.

This is the usual shape of technological progress: machines first take the work that does not require taste, then force humans upward into the work that truly requires judgment. In the end, what remains is not “the person who operates the machine,” but “the person who decides what the machine should become.”

7.1 Recap Questions

1. What is pretraining?
Pretraining is the process of training a general-purpose conditional language model on massive unlabeled text using a self-supervised objective. It lets the model first learn the statistical structure of language and the world, then serve as the substrate for later mid-training and post-training.

2. Why can LLMs be trained without manually labeled data?
Because the training labels come from the text itself. Given a prefix, the true next token is the supervision signal, so next-token prediction is naturally a form of self-supervised learning.

3. What is the next-token prediction objective?
It means that at each position, the model predicts the next token from the preceding tokens and minimizes cross-entropy loss. Its core form is \(P(x_t\mid x_{<t})\).

4. Why is massive data necessary?
Because language and knowledge distributions are complex. If a model is to learn grammar, style, world knowledge, code structure, and reasoning patterns, it must encounter those phenomena repeatedly across sufficiently diverse corpora.

5. What does pretraining teach the model?
It teaches language fluency, concept co-occurrence, document structure, knowledge patterns, and a certain amount of reasoning structure. It does not automatically teach the model to behave like a productized assistant.

6. How is large-scale training data built?
Usually by collecting, parsing, normalizing, quality filtering, handling PII and licenses, deduplicating, tokenizing, sharding, and auditing for contamination before the data enters mixture design and the training system.

7. What is data mixture?
It is the weighted sampling distribution of different data sources during training. It directly shapes what the model becomes, so it is fundamentally capability design, not storage management.

8. Why does data quality matter more than raw data quantity?
Because parameter capacity and compute are expensive. High-noise, repetitive, template-heavy content wastes capacity and raises memorization and contamination risk.

9. What is the engineering meaning of scaling laws?
They tell you that increasing model size, data, and compute usually yields predictable gains, but with diminishing returns. So under a budget, the problem is to find a better balance, not just to push parameter count upward.

10. Why is compute budget a hard constraint?
Because training cost is jointly constrained by parameter count, training tokens, context length, precision, and system efficiency. Under a limited budget, the problem may not be that the model is “too small,” but that it has not been fed enough or run long enough.

11. What does a modern pretraining pipeline look like?
It starts from raw corpora, passes through parsing, filtering, deduplication, tokenization, sharding, data mixture, then enters distributed training, checkpointing, validation, and contamination audit, and finally outputs a base model.

12. How are trillions of tokens processed efficiently?
Usually by pretokenizing and writing high-throughput shards, reading them stably by rank during training, and combining packing, caching, and asynchronous prefetch so online tokenization and random I/O do not become bottlenecks.

13. Why is large-scale distributed infrastructure necessary?
Because a single machine usually cannot hold the model or the training state, and cannot finish training in acceptable time. DP, TP, PP, ZeRO/FSDP, and related strategies are needed to split memory and compute across devices.

14. How do you monitor training stability?
At minimum by watching training loss, validation PPL, gradient norm, tokens/s, MFU, data-input latency, and checkpoint health, while preserving the ability to replay abnormal batches, restores, and slice evaluations.

15. How does pretraining affect downstream alignment and inference?
It determines tokenizer cost, base capability, context-extension difficulty, later finetuning cost, and inference-system efficiency. Many later-stage problems actually originate in pretraining design.

The next chapter continues along the same line: once general pretraining ends, how do you use mid-training / continued pretraining (CPT) to push the model toward specific domains, task distributions, and sharper capability boundaries?

8 References

  1. Brown, T. B., et al. (2020). Language Models are Few-Shot Learners. NeurIPS. https://arxiv.org/abs/2005.14165
  2. Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. https://arxiv.org/abs/2001.08361
  3. Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models. https://arxiv.org/abs/2203.15556
  4. Raffel, C., et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR. https://arxiv.org/abs/1910.10683
  5. Wenzek, G., et al. (2019). CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data. https://arxiv.org/abs/1911.00359
  6. Gao, L., et al. (2020). The Pile: An 800GB Dataset of Diverse Text for Language Modeling. https://arxiv.org/abs/2101.00027
  7. Lee, K., et al. (2022). Deduplicating Training Data Makes Language Models Better. ACL. https://aclanthology.org/2022.acl-long.577/
  8. Loshchilov, I., & Hutter, F. (2017). Decoupled Weight Decay Regularization. https://arxiv.org/abs/1711.05101
  9. Shoeybi, M., et al. (2019). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. https://arxiv.org/abs/1909.08053
  10. Rajbhandari, S., et al. (2019). ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. https://arxiv.org/abs/1910.02054
  11. Grattafiori, A., et al. (2024). The Llama 3 Herd of Models. https://arxiv.org/abs/2407.21783