Normalization Teaching Notes

Understand why deep networks need normalization and how RMSNorm stabilizes training.


1. Why normalization?

Problem: deep networks are hard to train

When stacking many layers, activations can shrink rapidly. A simple example with 8 layers:

Problem 1: Vanishing activations

Layer 1: activation std = 1.04
Layer 2: activation std = 0.85
Layer 3: activation std = 0.62
Layer 4: activation std = 0.38
Layer 5: activation std = 0.21
...
Layer 8: activation std = 0.016  ← almost 0

As the signal shrinks, gradients vanish and the model stops learning.

Problem 2: Exploding activations

In other cases, activations grow too large, leading to NaN and unstable training.
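
Both failure modes are easy to reproduce. Here is a minimal sketch (PyTorch, with a hypothetical width of 512 and per-layer gain factors chosen purely for illustration):

python
import torch

torch.manual_seed(0)
d = 512                        # hypothetical hidden width
x = torch.randn(64, d)         # toy batch of activations

for gain in (0.9, 1.1):        # gain < 1 shrinks the signal, gain > 1 grows it
    h = x
    for layer in range(8):
        w = torch.randn(d, d) * gain / d ** 0.5   # scaled random weights
        h = h @ w
    print(f"gain={gain}: std after 8 layers = {h.std().item():.3f}")

With gain 0.9 the standard deviation decays roughly like 0.9^8 ≈ 0.43 of its starting value; with gain 1.1 it grows like 1.1^8 ≈ 2.1. Both effects compound quickly as depth increases.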


Intuition: normalization as a stabilizer

Key idea: normalization keeps the scale of activations stable across layers.

  • Without normalization: each layer amplifies or shrinks the signal unpredictably.
  • With normalization: each layer receives inputs with controlled magnitude.

Why this matters

Normalization addresses three critical issues:

  1. It stabilizes the activation scale so deep stacks remain trainable.
  2. It improves gradient flow, avoiding vanishing and exploding gradients.
  3. It makes training more robust to hyperparameters and initialization.
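
To see the stabilizing effect concretely, here is a minimal sketch (assuming PyTorch; the RMS-style helper is illustrative, not MiniMind's code) that re-runs a deliberately shrinking stack but renormalizes after every layer:

python
import torch

def rms_normalize(h, eps=1e-5):
    # rescale so the root-mean-square over the feature dim is ~1
    return h * torch.rsqrt(h.pow(2).mean(-1, keepdim=True) + eps)

torch.manual_seed(0)
d = 512
h = torch.randn(64, d)
for layer in range(1, 9):
    w = torch.randn(d, d) * 0.6 / d ** 0.5   # deliberately shrinking weights
    h = rms_normalize(h @ w)                 # renormalize every layer
    print(f"layer {layer}: std = {h.std().item():.3f}")

Without the rms_normalize call, the std would collapse toward 0 (a 0.6 gain per layer); with it, every layer sees inputs at a stable scale of about 1.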

2. What is RMSNorm?

Definition

RMSNorm stands for Root Mean Square Normalization.

It removes LayerNorm's mean subtraction and normalizes only by the RMS:

$$\mathrm{RMSNorm}(x) = \frac{x}{\mathrm{RMS}(x)} \odot \gamma$$

where

$$\mathrm{RMS}(x) = \sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2 + \epsilon}$$

Parameters:

  • x: input tensor with shape [batch_size, seq_len, hidden_dim]
  • d: hidden_dim
  • ϵ: small constant, e.g. 1e-5
  • γ: learnable scale parameter with shape [hidden_dim]
  • ⊙: element-wise multiplication
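
A quick worked example (d = 4, γ = 1, ignoring the tiny ϵ): for x = [1, 2, 3, 4],

$$\mathrm{RMS}(x) = \sqrt{\tfrac{1}{4}(1^2 + 2^2 + 3^2 + 4^2)} = \sqrt{7.5} \approx 2.74$$

so RMSNorm(x) = x / RMS(x) ≈ [0.37, 0.73, 1.10, 1.46], which γ then rescales element-wise.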

RMSNorm vs LayerNorm

| Feature | LayerNorm | RMSNorm |
| --- | --- | --- |
| Formula | (x − μ) / (σ + ϵ) | x / (RMS(x) + ϵ) |
| Steps | subtract mean, then divide by std | divide by RMS only |
| Parameters | 2d (weight + bias) | d (weight only) |
| Speed | slower | 7–64% faster |
| Half precision | less stable | more stable |
| Common usage | BERT, GPT-2 | Llama, GPT-3+, MiniMind |

Why RMSNorm is preferred in LLMs

  1. Simpler computation (no mean subtraction).
  2. Faster training due to fewer ops.
  3. More stable in half precision (BF16/FP16).

MiniMind uses Pre-LN + RMSNorm as the default configuration.


Implementation (reference)

python
import torch
from torch import nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable scale γ

    def _norm(self, x):
        # x / RMS(x), computed as x * 1/sqrt(mean(x²) + eps)
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x):
        # normalize in float32 for stability, then cast back to the input dtype
        return self.weight * self._norm(x.float()).type_as(x)

Notes:

  • rsqrt(x) computes 1/√x in a single op, faster than dividing by sqrt(x).
  • .pow(2).mean(-1, keepdim=True) averages the squares over hidden_dim, so its rsqrt is 1/RMS(x).
  • .float() upcasts to FP32, improving stability in half precision.
  • .type_as(x) casts the result back to the original dtype.
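
A quick usage check (the shapes are the hypothetical ones from the parameter list above):

python
import torch
from torch import nn

norm = RMSNorm(dim=512)
x = torch.randn(2, 16, 512)                  # [batch_size, seq_len, hidden_dim]
y = norm(x)

print(y.shape)                               # torch.Size([2, 16, 512])
print(y.pow(2).mean(-1).sqrt().mean())       # ≈ 1.0 per position (γ starts at 1)

# parameter count: d for RMSNorm vs 2d for LayerNorm
print(sum(p.numel() for p in norm.parameters()))               # 512
print(sum(p.numel() for p in nn.LayerNorm(512).parameters()))  # 1024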

3. Pre-LN vs Post-LN

Normalization can be placed before or after each sub-layer.

Post-LN (older)

python
x = x + Attention(x)
x = LayerNorm(x)
x = x + FFN(x)
x = LayerNorm(x)

Issues:

  • The norm sits on the residual path, so gradients must pass through it at every layer.
  • Deep networks (>12 layers) become unstable.
  • Training often requires smaller learning rates.

Pre-LN (modern)

python
x = x + Attention(Norm(x))
x = x + FFN(Norm(x))

Advantages:

  • Cleaner residual path: the identity branch bypasses the norm, so gradients flow directly to early layers (see the sketch below).
  • More stable for deep networks.
  • More tolerant of larger learning rates.
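
A toy way to see this (illustrative PyTorch, not one of the experiment scripts): stack the same sub-layers in both arrangements and compare the gradient that reaches the input.

python
import torch
from torch import nn

def input_grad_norm(pre_ln: bool, depth: int = 16, dim: int = 64) -> float:
    torch.manual_seed(0)
    blocks = nn.ModuleList(nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
                           for _ in range(depth))
    norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(depth))
    x = torch.randn(8, dim, requires_grad=True)
    h = x
    for blk, norm in zip(blocks, norms):
        if pre_ln:
            h = h + blk(norm(h))   # Pre-LN: identity path stays untouched
        else:
            h = norm(h + blk(h))   # Post-LN: norm sits on the residual path
    h.sum().backward()
    return x.grad.norm().item()

print("Post-LN input-grad norm:", input_grad_norm(pre_ln=False))
print("Pre-LN  input-grad norm:", input_grad_norm(pre_ln=True))

At 16 layers the Pre-LN stack typically delivers a noticeably larger gradient to the input, because the identity path is never renormalized.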

MiniMind choice: Pre-LN + RMSNorm.
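
Putting this together, a minimal runnable sketch of a Pre-LN + RMSNorm block (reusing the RMSNorm class from section 2; nn.MultiheadAttention is a generic stand-in, not MiniMind's actual attention):

python
import torch
from torch import nn

class PreLNBlock(nn.Module):
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.norm1 = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = RMSNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        h = self.norm1(x)                                  # normalize first
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around attention
        x = x + self.ffn(self.norm2(x))                    # residual around FFN
        return x

x = torch.randn(2, 16, 512)
print(PreLNBlock(512)(x).shape)   # torch.Size([2, 16, 512])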


4. How to verify (experiments)

Experiment 1: Gradient vanishing

Goal: show how normalization prevents vanishing activations.

bash
python experiments/exp1_gradient_vanishing.py

Expected:

  • No normalization: std shrinks to 0.016
  • With normalization: std stays around 1.0

Output: results/gradient_vanishing.png


Experiment 2: Compare configurations

Goal: compare NoNorm / Post-LN / Pre-LN / Pre-LN + RMSNorm.

bash
python experiments/exp2_norm_comparison.py
# quick mode
python experiments/exp2_norm_comparison.py --quick

Expected (summary):

| Config | Converges | NaN | Final loss | Stability |
| --- | --- | --- | --- | --- |
| NoNorm | ✗ | at ~500 steps | NaN | poor |
| Post-LN | ✓ | – | ~3.5 | sensitive (LR < 1e-4) |
| Pre-LN + LayerNorm | ✓ | – | ~2.8 | good |
| Pre-LN + RMSNorm | ✓ | – | ~2.7 | best |

Output: results/norm_comparison.png


Experiment 3: Pre-LN vs Post-LN

Goal: show Pre-LN is more stable in deeper networks.

bash
python experiments/exp3_prenorm_vs_postnorm.py

Expected:

  • 4 layers: both converge.
  • 8 layers: Pre-LN remains stable, Post-LN becomes unstable.

Output: results/prenorm_vs_postnorm.png


5. Key takeaways

  1. Normalization is essential to stabilize deep networks.
  2. RMSNorm is simpler and faster than LayerNorm while remaining stable.
  3. Pre-LN is the modern default because it stabilizes gradients.

MiniMind’s block (Pre-LN)

python
# Pre-LN Transformer Block
residual = x
x = self.norm1(x)          # 1) normalize first (Pre-LN)
x = self.attention(x)      # 2) compute the sub-layer
x = residual + x           # 3) add the untouched residual back

residual = x
x = self.norm2(x)          # same pattern for the feed-forward sub-layer
x = self.feedforward(x)
x = residual + x

Order: Norm → Compute → Residual.


6. Further reading


Code references

  • MiniMind: model/model_minimind.py:95-105 (RMSNorm)
  • MiniMind: model/model_minimind.py:359-380 (TransformerBlock)



✅ Self-check

Before moving on, make sure you can:

  • [ ] Explain vanishing/exploding gradients
  • [ ] Write the RMSNorm formula
  • [ ] Explain RMSNorm vs LayerNorm
  • [ ] Explain Pre-LN vs Post-LN
  • [ ] Sketch the Pre-LN Transformer block flow
  • [ ] Identify where RMSNorm appears in the model

Built on MiniMind for learning and experiments