01. Normalization

Why do deep networks need normalization? How does RMSNorm stabilize training?


🎯 Learning goals

After this module, you will be able to:

  • ✅ Understand the root cause of vanishing/exploding gradients
  • ✅ Understand how normalization stabilizes activation distributions
  • ✅ Distinguish RMSNorm vs LayerNorm
  • ✅ Distinguish Pre-LN vs Post-LN
  • ✅ Implement RMSNorm from scratch

📚 Learning path

1️⃣ Quick experience (10 min)

Run experiments first to build intuition:

bash
cd experiments

# Exp 1: See what happens without normalization
python exp1_gradient_vanishing.py

# Exp 2: Compare four configurations
python exp2_norm_comparison.py --quick

You will see:

  • No normalization: activation std quickly collapses to 0 (vanishing gradients)
  • Post-LN: unstable loss in early training
  • Pre-LN + RMSNorm: stable convergence ✅

2️⃣ Theory (30 min)

Read the teaching doc:

Core structure:

  • Why: why normalization is needed
  • What: what RMSNorm is
  • How: how to validate it

3️⃣ Implementation (15 min)

Read real code:

Key files:

  • model/model_minimind.py:95-105 - RMSNorm implementation
  • model/model_minimind.py:359-380 - usage inside TransformerBlock

4️⃣ Self-check (5 min)

Finish the quiz:


🔬 Experiment list

| Experiment | Purpose | Time | Data |
|---|---|---|---|
| exp1_gradient_vanishing.py | Show why normalization is necessary | 10 sec | synthetic |
| exp2_norm_comparison.py | Compare four configurations | 5 min | TinyShakespeare |
| exp3_prenorm_vs_postnorm.py | Pre-LN vs Post-LN | 8 min | TinyShakespeare |

Run all experiments

bash
cd experiments
bash run_all.sh

💡 Key points

1. Why normalization?

Deep net without normalization:
Layer 1: std = 1.00
Layer 2: std = 0.85
Layer 3: std = 0.62
Layer 4: std = 0.38
...
Layer 8: std = 0.016  ← gradients almost vanish!

Intuition: like adding a “pressure regulator” to a faucet—output stays stable even when input changes.
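To get a concrete feel for the decay above, here is a minimal sketch (assuming PyTorch; the depth, width, seed, and nonlinearity are illustrative and not what exp1_gradient_vanishing.py does). It stacks plain linear + tanh layers with no normalization and prints the activation std after each one — the exact numbers will differ from the illustration above, but the downward trend is the same:

python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 256
x = torch.randn(32, d)                        # batch of random activations, std ≈ 1
layers = [nn.Linear(d, d) for _ in range(8)]  # no normalization anywhere

for i, layer in enumerate(layers, start=1):
    x = torch.tanh(layer(x))                  # default init + saturating nonlinearity shrinks the signal
    print(f"Layer {i}: std = {x.std().item():.3f}")

With default initialization the std shrinks layer after layer; since gradients flow back through the same weights, they shrink too.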


2. RMSNorm vs LayerNorm

| Feature | LayerNorm | RMSNorm |
|---|---|---|
| Steps | subtract mean, then divide by std | divide by RMS only |
| Params | 2d (γ, β) | d (γ) |
| Speed | slower | 7–64% faster |
| Half-precision stability | weaker | better |

Conclusion: RMSNorm is simpler, faster, and more stable.
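As a reference for the comparison, here is a minimal RMSNorm sketch (assuming PyTorch; the class and argument names are illustrative, not copied from model/model_minimind.py). It computes y = x / sqrt(mean(x²) + ε) · γ — no mean subtraction and no β bias:

python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """y = x / sqrt(mean(x^2) + eps) * gamma — no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # gamma: the only learned parameter (d values)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # root mean square over the last (feature) dimension
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

Dropping the mean subtraction and the β bias is exactly where the parameter and speed savings in the table come from.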


3. Pre-LN vs Post-LN

python
# Post-LN (original Transformer): normalize *after* the residual add
x = LayerNorm(x + Attention(x))
x = LayerNorm(x + FFN(x))

# Pre-LN (modern): normalize *before* each sub-layer; the residual path stays untouched
x = x + Attention(Norm(x))
x = x + FFN(Norm(x))

Advantages of Pre-LN:

  • ✅ Cleaner residual path (gradients flow directly)
  • ✅ More stable for deep networks (>12 layers)
  • ✅ More tolerant to learning rate
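To make the Pre-LN data flow concrete, here is a sketch of a full block wired up with RMSNorm (assuming PyTorch; Attention and FFN are placeholders for whatever attention and feed-forward modules the model uses, not the MiniMind classes, and RMSNorm refers to the sketch in the previous section):

python
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LN: normalize the input of each sub-layer; the residual path is a pure identity."""
    def __init__(self, dim: int, attn: nn.Module, ffn: nn.Module):
        super().__init__()
        self.attn_norm = RMSNorm(dim)  # RMSNorm class from the sketch above
        self.ffn_norm = RMSNorm(dim)
        self.attn = attn               # any self-attention module mapping (B, T, dim) -> (B, T, dim)
        self.ffn = ffn                 # any position-wise feed-forward module with the same shape

    def forward(self, x):
        x = x + self.attn(self.attn_norm(x))  # normalize *before* attention, then add residual
        x = x + self.ffn(self.ffn_norm(x))    # normalize *before* FFN, then add residual
        return x

Because the residual additions are never normalized away, gradients can flow straight from the top of the network to the bottom, which is why Pre-LN tolerates more depth and larger learning rates.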

📊 Expected results

After running experiments, you should see under experiments/results/:

  1. gradient_vanishing.png

    • No norm: std decays
    • With norm: std stays stable
  2. norm_comparison.png

    • Four loss curves
    • NoNorm loss becomes NaN after ~500 steps
    • Pre-LN is most stable
  3. prenorm_vs_postnorm.png

    • Convergence comparison

✅ Completion check

After finishing, you should be able to:

Theory

  • [ ] Explain gradient vanishing in your own words
  • [ ] State the RMSNorm formula
  • [ ] Explain why Pre-LN is better than Post-LN

Practice

  • [ ] Implement RMSNorm from scratch
  • [ ] Draw the Pre-LN Transformer block flow
  • [ ] Debug normalization-related training issues

Intuition

  • [ ] Explain normalization with an analogy
  • [ ] Predict what happens without normalization
  • [ ] Explain why modern LLMs choose Pre-LN + RMSNorm

🔗 Resources

Papers

Blogs

Videos


🎓 Next step

After finishing this module, go to: 👉 02. Position Encoding

Learn how the Transformer encodes positional information and why RoPE works.

Built on MiniMind for learning and experiments