2. Pretraining

Definition

Data

  • Where to get data?
    • specific websites
    • keyword search on specific websites
    • search engines

Processing

  1. cleaning
  • url filter
  • content filter
  • langid (language identification)
  • dedup (see the hashing sketch after this section)
  • diversity
  • domain knowledge injection
  2. data distribution
  • domain data <= 15%
  • general : specific = 1:1
  • Chinese : English : code = 4:4:2
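A minimal sketch of the url-filter and dedup steps from the cleaning list above: drop documents from blocked domains, then drop exact duplicates by hashing normalized text. Real pipelines usually add fuzzy dedup (MinHash/LSH); the blocklist and sample documents here are hypothetical.

```python
import hashlib
import re

URL_BLOCKLIST = ("spam-site.example", "ads.example")   # hypothetical blocked domains

def keep_url(url: str) -> bool:
    return not any(bad in url for bad in URL_BLOCKLIST)

def normalize(text: str) -> str:
    # collapse whitespace and lowercase so trivial variants hash the same
    return re.sub(r"\s+", " ", text).strip().lower()

def clean(docs):
    """URL filter + exact dedup by MD5 of normalized text; keeps first occurrence."""
    seen = set()
    for doc in docs:
        if not keep_url(doc["url"]):
            continue
        h = hashlib.md5(normalize(doc["text"]).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            yield doc

docs = [
    {"url": "https://site.example/a", "text": "Hello   world"},
    {"url": "https://site.example/b", "text": "hello world"},   # near-duplicate, dropped
    {"url": "https://ads.example/x",  "text": "Buy now!"},       # blocked domain, dropped
]
print([d["url"] for d in clean(docs)])
```
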
Tokenizer

  • How to train a tokenizer? Get a machine with a lot of RAM and CPU, take a common dataset, then run BPE / BBPE (see the sketch after this list).
    • split digits into individual tokens
    • control the compression ratio
    • manually remove dirty data and sensitive tokens
    • alignment needs a lot of
  • How to expand the vocabulary?
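A minimal sketch of training a byte-level BPE (BBPE) tokenizer with the Hugging Face `tokenizers` library. It assumes a local `corpus.txt` as the common dataset; the digit-splitting pre-tokenizer implements the "split digits" note above, and the vocabulary size and special tokens are illustrative assumptions.

```python
from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tok = Tokenizer(BPE(unk_token="<unk>"))
tok.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),    # split numbers digit by digit
    pre_tokenizers.ByteLevel(add_prefix_space=False),  # byte-level BPE (BBPE)
])
trainer = BpeTrainer(vocab_size=64_000,
                     special_tokens=["<unk>", "<s>", "</s>"])
tok.train(files=["corpus.txt"], trainer=trainer)       # corpus.txt: the common dataset
tok.save("tokenizer.json")

# quick compression-ratio check: characters per token on held-out text
sample = "The quick brown fox jumps over 12345 lazy dogs."
print(len(sample) / len(tok.encode(sample).ids))
```
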
Architecture

  • LLaMA
  • RoPE + GQA + RMSNorm + SwiGLU

Training framework
  • megatron-lm
  • pai-megatron-patch

Training Strategies
  • batch_size balances convergence behaviour against compute cost.
  • Too large a batch size: each step consumes a lot of compute and data, and the loss can oscillate.
  • Too small a batch size: training is too slow.
  • What is the best batch_size, and why does it matter? Because it affects both convergence speed and how efficiently the compute budget is used; see the references below.
  • MiniCPM
  • phi
  • DeepSeekMath

Monitor
  1. channel_loss: track the loss separately per data channel (domain/source).
  2. loss_spike: are there loss spikes? Occasional spikes are not too bad, as long as the loss eventually converges back down.
  3. ppl: lower perplexity is better; it is fine if it rises briefly and then drops. Randomly sample ~200 examples from the training distribution as a fixed monitoring set for ppl (see the sketch below).
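A minimal monitoring sketch, assuming a Hugging Face causal LM and a small held-out sample per channel; the `monitor_sets` dict and its contents are hypothetical stand-ins for the ~200 sampled examples per channel. It reports per-channel loss as perplexity.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained("gpt2")

monitor_sets = {               # in practice: ~200 random samples per channel
    "zh":   ["你好，世界。"],
    "en":   ["Hello, world."],
    "code": ["def add(a, b):\n    return a + b"],
}

@torch.no_grad()
def channel_ppl(texts):
    """Mean token-level cross-entropy over a channel, reported as perplexity."""
    losses = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        losses.append(model(ids, labels=ids).loss.item())
    return math.exp(sum(losses) / len(losses))

for name, texts in monitor_sets.items():
    print(f"channel={name}  ppl={channel_ppl(texts):.2f}")
```
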

Scaling Law

  • How to do resource estimation? (e.g., with the common C ≈ 6·N·D FLOPs rule of thumb; a sketch follows)
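A back-of-the-envelope sketch using the commonly cited C ≈ 6·N·D approximation (training FLOPs ≈ 6 × parameters × tokens). The GPU throughput and MFU numbers are assumptions, not measurements.

```python
def training_days(n_params, n_tokens, n_gpus, flops_per_gpu=3.12e14, mfu=0.4):
    """Estimate wall-clock training days.

    flops_per_gpu: peak BF16 throughput (A100 ≈ 312 TFLOP/s).
    mfu: assumed model FLOPs utilization.
    """
    total_flops = 6 * n_params * n_tokens            # C ≈ 6·N·D
    sustained = n_gpus * flops_per_gpu * mfu         # effective cluster FLOP/s
    return total_flops / sustained / 86400           # seconds -> days

# example: 7B params on 2T tokens with 256 GPUs at 40% MFU
print(f"{training_days(7e9, 2e12, 256):.1f} days")
```
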

Mixed Precision Training

  • Why clip gradients? Gradient norms can blow up during low-precision training and cause loss spikes; clipping bounds the update (a sketch follows).
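A minimal sketch of a mixed-precision training step showing where gradient clipping fits: unscale the fp16 gradients, clip the global norm, then step. The model, data, and max_norm=1.0 are placeholders.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()                 # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()

def train_step(x, y):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                              # back to true gradient scale
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # bound the update
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

x = torch.randn(8, 1024, device="cuda")
y = torch.randn(8, 1024, device="cuda")
print(train_step(x, y))
```
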

Continued pretraining

  1. Domain Adaptation
  • generic:domain:instruct = 7:2:1
  2. Long-context Pretraining
  • extend the pretraining context length from 4096 to 16384
  • tune RoPE \(\theta\) (raise the base frequency; see the sketch after this list)
  • context-parallel training: split long sequences into segments across devices
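A small sketch of how raising the RoPE base \(\theta\) stretches positional wavelengths when extending the context from 4096 to 16384. The jump from 10000 to 500000 is an illustrative assumption, not a fixed recipe.

```python
import numpy as np

def rope_freqs(head_dim, theta):
    # inverse frequency for each pair of dimensions
    return 1.0 / (theta ** (np.arange(0, head_dim, 2) / head_dim))

short = rope_freqs(128, theta=10_000)    # original base used for 4k context
long_ = rope_freqs(128, theta=500_000)   # larger base assumed for 16k+ context

# With a larger theta every dimension rotates more slowly, so the angles seen
# at position 16384 stay closer to the range covered during 4k-context training.
print(2 * np.pi / short[-1], 2 * np.pi / long_[-1])  # wavelength of the slowest dim
```
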

Inference time

  1. Batch Inference
  • matrix-multiplication
  • attention: KV cache (a toy decoding sketch follows this list)
  • Where do the floating-point operations come from?
    • matrix multiplications
    • attention
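A toy single-head decoding loop illustrating the KV cache: keys and values of past tokens are stored so each new token attends over the cache instead of recomputing the whole prefix. Dimensions and weights are illustrative, not a real model.

```python
import numpy as np

d = 64
W_q, W_k, W_v = (np.random.randn(d, d) * 0.02 for _ in range(3))
k_cache, v_cache = [], []

def decode_step(x):
    """x: (d,) hidden state of the newest token."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    k_cache.append(k)                    # append instead of recomputing the prefix
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)          # (t,) attention scores over cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                   # attention output for this step

for _ in range(5):                       # decode 5 tokens
    out = decode_step(np.random.randn(d))
print(out.shape, len(k_cache))           # (64,) 5
```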
Latency
Evaluation
PPL
- you can only compare a model's perplexity against its own earlier checkpoints, not against other models.
benchmark
- github.com/open-compass/opencompass
- inject important information into the prompt
- check whether the model is forgetting it, from a probabilistic perspective
- ask the model to continue the prompt and measure similarity with ROUGE-L, BLEU, or BERTScore (see the sketch below)
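A small from-scratch ROUGE-L (LCS-based F1) for scoring how closely a model's continuation matches a reference, as described above; BLEU or BERTScore would be used the same way. Whitespace tokenization is a simplification.

```python
def rouge_l(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # classic longest-common-subsequence dynamic programme
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, h in enumerate(hyp, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == h else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l("the cat sat on the mat", "the cat lay on the mat"))  # ≈ 0.83
```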