2. Pretraining
Definition
Data
- Where to get data?
- specific websites
- keyword search on specific websites
- search engine.
Processing
- cleaning (see the filtering/dedup sketch after this list)
- url filter
- content filter
- langid
- dedup
- diversity
- domain knowledge injection
- data distribution
- domain data <= 15%
- general : specific = 1:1
- chinese : en : code = 4:4:2
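The cleaning steps above (url filter, content filter, langid, dedup) can be sketched as a minimal pipeline. This is illustrative only: the blocklist patterns, thresholds, and the toy language-ID heuristic are made up, and real pipelines use trained classifiers (e.g. fastText langid) and MinHash/LSH for near-duplicate removal.

```python
import hashlib
import re

BLOCKED_URL_PATTERNS = [r"casino", r"/ads?/"]        # illustrative blocklist
MIN_CHARS, MAX_SYMBOL_RATIO = 200, 0.3               # illustrative thresholds

def url_filter(url: str) -> bool:
    """Drop documents whose URL matches any blocked pattern."""
    return not any(re.search(p, url) for p in BLOCKED_URL_PATTERNS)

def content_filter(text: str) -> bool:
    """Drop documents that are too short or mostly symbols/boilerplate."""
    if len(text) < MIN_CHARS:
        return False
    symbols = sum(not (c.isalnum() or c.isspace()) for c in text)
    return symbols / len(text) <= MAX_SYMBOL_RATIO

def lang_id(text: str) -> str:
    """Toy language ID; real pipelines use a fastText/langid model instead."""
    cjk = sum("\u4e00" <= c <= "\u9fff" for c in text)
    return "zh" if cjk / max(len(text), 1) > 0.2 else "en"

def dedup(docs):
    """Exact dedup on a normalized hash; MinHash/LSH handles near-duplicates."""
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.md5(" ".join(doc["text"].split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

def process(docs):
    docs = [d for d in docs if url_filter(d["url"]) and content_filter(d["text"])]
    for d in docs:
        d["lang"] = lang_id(d["text"])
    return dedup(docs)
```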
### Tokenizer
- how to train a tokenizer? get a CPU machine with very large RAM, take a common/general-domain dataset, then run BPE / BBPE (see the sketch after this list).
- digit splitting (split numbers into individual digits)
- control the compression ratio (characters per token)
- manually remove dirty data and sensitive tokens
- alignment needs a lot of
- how to expand the vocab?
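A minimal sketch of tokenizer training with the HuggingFace `tokenizers` library, assuming byte-level BPE (BBPE) with the digit splitting noted above; the vocab size and special tokens are placeholder choices.

```python
# pip install tokenizers
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

def train_bbpe(corpus_files, vocab_size=100_000):
    """Byte-level BPE with individual-digit splitting."""
    tok = Tokenizer(models.BPE())
    tok.pre_tokenizer = pre_tokenizers.Sequence([
        pre_tokenizers.Digits(individual_digits=True),   # digit splitting
        pre_tokenizers.ByteLevel(add_prefix_space=False),
    ])
    tok.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["<pad>", "<bos>", "<eos>"],       # reserve control tokens up front
    )
    tok.train(corpus_files, trainer)                      # RAM-hungry on large corpora
    return tok

def compression_ratio(tok, texts):
    """Characters per token: a quick proxy for the compression rate to monitor."""
    chars = sum(len(t) for t in texts)
    tokens = sum(len(tok.encode(t).ids) for t in texts)
    return chars / max(tokens, 1)
```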
### Architecture
- LLaMA
- RoPE + GQA + RMSNorm + SwiGLU
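A rough PyTorch sketch of three of the components named above (RMSNorm, RoPE, SwiGLU). The shapes and the half-split RoPE layout follow one common convention and are illustrative, not LLaMA's exact code.

```python
import torch

class RMSNorm(torch.nn.Module):
    """RMSNorm: normalize by root-mean-square, no mean-centering (unlike LayerNorm)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * x * rms

def rope(x, theta=10_000.0):
    """Rotary position embedding (half-split layout) for q/k shaped (batch, seq, heads, head_dim)."""
    b, s, h, d = x.shape
    inv_freq = 1.0 / theta ** (torch.arange(0, d, 2, dtype=torch.float32) / d)
    ang = torch.outer(torch.arange(s, dtype=torch.float32), inv_freq)     # (seq, d/2)
    cos, sin = ang.cos()[None, :, None, :], ang.sin()[None, :, None, :]
    x1, x2 = x[..., : d // 2], x[..., d // 2:]
    return torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)

def swiglu(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: silu(x W_gate) * (x W_up), then project back down."""
    return (torch.nn.functional.silu(x @ w_gate) * (x @ w_up)) @ w_down
```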
### Training framework
- megatron-lm
- pai-megatron-patch

### Training Strategies
- batch_size trades off convergence behavior against compute efficiency (see the sketch after this list).
- too large a batch size: each update consumes too much compute and data, and training can oscillate.
- too small a batch size: training is too slow.
- how to pick the best batch_size, and why does it matter? it affects convergence speed and the balance of compute resources; see e.g.:
- MiniCPM
- phi
- DeepSeekMath
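A small sketch of how the effective (global) batch size is usually reached in practice via gradient accumulation, plus a common square-root learning-rate rescaling heuristic. The exact lr/batch-size relationship is what papers like those above fit empirically, so treat the formula as an assumption.

```python
def accumulation_steps(target_global_bs, micro_bs, data_parallel_size):
    """Gradient-accumulation steps needed to hit the target global batch size."""
    assert target_global_bs % (micro_bs * data_parallel_size) == 0
    return target_global_bs // (micro_bs * data_parallel_size)

def scaled_lr(base_lr, base_bs, new_bs):
    """Common square-root heuristic for rescaling lr when the batch size changes;
    treat it as a starting point, not a law."""
    return base_lr * (new_bs / base_bs) ** 0.5

# e.g. a global batch of 1024 sequences with micro-batch 2 on 64 data-parallel ranks
steps = accumulation_steps(target_global_bs=1024, micro_bs=2, data_parallel_size=64)  # -> 8
```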
### Monitor
- channel_loss: track loss separately per data channel (domain/source).
- loss_spike: watch for loss spikes; an occasional spike is not too bad as long as the loss recovers and keeps converging.
- ppl: the lower the perplexity, the better; it is fine if it rises temporarily and then drops. randomly sample ~200 examples from the training distribution as a fixed held-out set for monitoring PPL (see the sketch below).
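A monitoring sketch that computes per-channel loss and PPL on the fixed held-out sample, assuming a HuggingFace-style causal LM whose forward pass returns `.loss` when given `labels`; the `channel` field on each document is a placeholder for the data-source tag.

```python
import math
from collections import defaultdict

import torch

@torch.no_grad()
def monitor(model, tokenizer, held_out_docs, device="cuda"):
    """Per-channel loss and PPL on a fixed held-out sample (~200 docs drawn once)."""
    losses = defaultdict(list)
    for doc in held_out_docs:                       # doc: {"text": ..., "channel": "web"/"code"/...}
        ids = tokenizer(doc["text"], return_tensors="pt", truncation=True).input_ids.to(device)
        loss = model(ids, labels=ids).loss.item()   # mean next-token cross-entropy
        losses[doc["channel"]].append(loss)
    report = {}
    for channel, vals in losses.items():
        mean_loss = sum(vals) / len(vals)
        report[channel] = {"loss": mean_loss, "ppl": math.exp(mean_loss)}  # lower PPL is better
    return report
```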
Scaling Law
- How to do resource estimation?
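A back-of-the-envelope resource estimate using the standard C ≈ 6·N·D approximation; the A100 peak (312 TFLOPs bf16) is a real spec, but the 40% MFU and the 7B/2T example are illustrative numbers, not measurements.

```python
def training_flops(n_params, n_tokens):
    """Standard approximation: C ≈ 6 * N * D FLOPs for one pass over D tokens."""
    return 6 * n_params * n_tokens

def training_days(n_params, n_tokens, n_gpus, peak_flops_per_gpu=312e12, mfu=0.4):
    """Wall-clock estimate; 312 TFLOPs is A100 bf16 peak, MFU of 30-50% is typical."""
    flops = training_flops(n_params, n_tokens)
    return flops / (n_gpus * peak_flops_per_gpu * mfu) / 86_400

# e.g. a 7B model on 2T tokens with 512 A100s at 40% MFU -> roughly 15 days
print(round(training_days(7e9, 2e12, 512), 1), "days")
```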
Mixed Precision Training
- why clip gradients?
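A sketch of one fp16 mixed-precision training step with gradient clipping, using PyTorch AMP (`autocast` + `GradScaler`); the HuggingFace-style `model(..., labels=...).loss` call is an assumption about the model interface.

```python
import torch

def train_step(model, batch, optimizer, scaler, max_norm=1.0):
    """One fp16 mixed-precision step with gradient clipping. Clipping caps the
    global grad norm so an occasional bad batch / loss spike cannot blow up
    the update; the GradScaler rescales fp16 grads to avoid underflow
    (with bf16 the scaler is usually unnecessary)."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(batch["input_ids"], labels=batch["input_ids"]).loss
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                                  # back to true-scale grads
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    scaler.step(optimizer)                                      # skips the step if grads overflowed
    scaler.update()
    return loss.item()

scaler = torch.cuda.amp.GradScaler()
```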
Continued pretraining
- Domain Adaptation
- generic:domain:instruct = 7:2:1
- Long-context Pretraining
- extend the pretraining context length from 4096 to 16384
- tune the RoPE base \(\theta\) (see the sketch after this list)
- context parallel training: segment long sequences across devices.
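Two illustrative helpers for the long-context items above: an "NTK-aware"-style rescaling of the RoPE base \(\theta\) when going from 4096 to 16384 (one common heuristic, not the only recipe), and a naive context-parallel split of one long sequence into per-rank chunks.

```python
def ntk_scaled_theta(theta=10_000.0, old_ctx=4096, new_ctx=16_384, head_dim=128):
    """One common 'NTK-aware' heuristic for enlarging the RoPE base when extending
    the context window; exact recipes vary by paper/codebase."""
    scale = new_ctx / old_ctx
    return theta * scale ** (head_dim / (head_dim - 2))

def split_for_context_parallel(input_ids, cp_size):
    """Context parallelism: slice one long sequence into cp_size contiguous chunks,
    one per rank; attention then needs cross-rank communication (e.g. ring attention)."""
    chunk = len(input_ids) // cp_size
    return [input_ids[i * chunk:(i + 1) * chunk] for i in range(cp_size)]

print(round(ntk_scaled_theta()))   # ~40,890 with the defaults above
```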
Inference time
- Batch Inference
- matrix-multiplication
- attention: kv-cache (see the sketch after this list)
- where do the FLOPs come from?
- matrix multiplications
- attention
- Latency
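A KV-cache decoding sketch, assuming a HuggingFace-style causal LM that accepts `past_key_values` / `use_cache`. It shows why, after the prefill pass, each decode step only pays for one new token's matrix multiplications plus attention over the cached keys/values instead of recomputing the whole prompt.

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids, max_new_tokens=32):
    """Greedy decoding with a KV cache: prefill once, then feed only the newest
    token each step and reuse the cached K/V for all earlier positions."""
    out = model(input_ids, use_cache=True)          # prefill pass over the prompt
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(-1, keepdim=True)
    generated = [next_id]
    for _ in range(max_new_tokens - 1):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        generated.append(next_id)
    return torch.cat([input_ids] + generated, dim=-1)
```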
Evaluation
- PPL
- you can only compare a model against itself (earlier vs. later checkpoints with the same tokenizer); PPL is not comparable across different vocabularies.
- benchmark
- github.com/open-compass/opencompass
- inject important information into the prompt and check, from a probabilistic perspective, whether the model is forgetting it.
- ask the model to continue a prompt and use ROUGE-L, BLEU, or BERTScore to measure similarity to the reference continuation (see the sketch below).
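A self-contained sketch of that continuation check: feed the first part of a known document as the prompt, let the model continue, and score the continuation against the true remainder with a plain LCS-based ROUGE-L (BLEU and BERTScore would come from their respective libraries). The `generate` callable is a placeholder for whatever inference wrapper is in use.

```python
def rouge_l_f1(candidate_tokens, reference_tokens):
    """Plain LCS-based ROUGE-L F1 between two token lists."""
    m, n = len(candidate_tokens), len(reference_tokens)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            lcs[i + 1][j + 1] = (lcs[i][j] + 1 if candidate_tokens[i] == reference_tokens[j]
                                 else max(lcs[i][j + 1], lcs[i + 1][j]))
    if lcs[m][n] == 0:
        return 0.0
    p, r = lcs[m][n] / m, lcs[m][n] / n
    return 2 * p * r / (p + r)

def forgetting_check(generate, doc_text, prefix_ratio=0.5):
    """Prompt with the first half of a known document, continue, and score the
    continuation against the true second half; low scores over time suggest forgetting."""
    cut = int(len(doc_text) * prefix_ratio)
    prompt, reference = doc_text[:cut], doc_text[cut:]
    continuation = generate(prompt)          # any callable: prompt str -> continuation str
    return rouge_l_f1(continuation.split(), reference.split())
```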