Normalization Quiz
Check your understanding of core concepts in the Normalization module
🧠 Quiz (multiple choice)
What is the main cause of gradient vanishing in deep networks?
What is the key difference between RMSNorm and LayerNorm?
Why is Pre-LN more stable than Post-LN?
In MiniMind, how many RMSNorm layers does each TransformerBlock contain?
Why does RMSNorm use .float() and .type_as(x) in forward?
💬 Open questions
Q6: Debugging scenario
You train a 16-layer Transformer and the loss becomes NaN around step 100. Based on what you learned, list possible causes and fixes.
Suggested answer
Possible causes:
- Missing normalization → vanishing/exploding gradients
- Using Post-LN → unstable in deep stacks
- Learning rate set too high
- Poor initialization
- FP16 numerical instability
Fixes (top options):
Switch to Pre-LN + RMSNorm
```python
class TransformerBlock(nn.Module):
    def forward(self, x):
        x = x + self.attn(self.norm1(x))  # Pre-Norm
        x = x + self.ffn(self.norm2(x))   # Pre-Norm
        return x
```
Lower learning rate
- From 1e-3 down to 1e-4 or lower
- Add warmup (a minimal warmup schedule is sketched below)
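A minimal warmup sketch using PyTorch's LambdaLR scheduler; the model, optimizer, and warmup_steps below are placeholders, not MiniMind's actual training setup.

```python
import torch
import torch.nn as nn

# Placeholder model and optimizer; substitute your own training setup.
model = nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

warmup_steps = 1000  # placeholder value

def warmup_lambda(step):
    # Ramp the LR linearly from ~0 up to the base LR over warmup_steps, then hold it.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_lambda)

# Inside the training loop, call optimizer.step() followed by scheduler.step().
```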
Check initialization
- Use Kaiming or Xavier initialization
- Ensure activation std starts near 1.0 (a quick check is sketched below)
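A rough sanity check, sketched with a toy stack of linear layers rather than the MiniMind model: apply Xavier initialization and confirm the activation std stays near 1.0 as the signal passes through the stack.

```python
import torch
import torch.nn as nn

# Toy 16-layer stack of linear maps; nonlinearities and residuals change the exact numbers.
layers = nn.ModuleList([nn.Linear(512, 512, bias=False) for _ in range(16)])
for layer in layers:
    nn.init.xavier_normal_(layer.weight)  # keeps output variance close to input variance

x = torch.randn(32, 512)  # toy batch with std ~1.0
for i, layer in enumerate(layers):
    x = layer(x)
    print(f"layer {i:2d}: activation std = {x.std().item():.3f}")  # should hover near 1.0
```

For ReLU-style activations, nn.init.kaiming_normal_ is the usual alternative, since it compensates for the variance that ReLU removes.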
Use gradient clipping
```python
# Rescale gradients whose global norm exceeds 1.0 before the optimizer step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```
Prefer BF16 over FP16
```python
# BF16 keeps FP32's exponent range, so it overflows far less easily than FP16.
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    output = model(input)
```
Debug checklist:
- Log activation stats per layer (mean/std/max/min); a hook sketch follows this checklist
- Monitor for NaNs in gradients
- Compare with smaller configs
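The first two checklist items can be implemented with forward hooks and a gradient scan; this is a sketch with a toy stand-in model, so swap in your own Transformer.

```python
import torch
import torch.nn as nn

# Toy stand-in model; replace with your actual model.
model = nn.Sequential(nn.Linear(64, 64), nn.LayerNorm(64), nn.Linear(64, 64))

def log_activation_stats(name):
    def hook(module, inputs, output):
        # Print per-layer activation statistics (mean/std/max/min).
        print(f"{name}: mean={output.mean().item():.3f} std={output.std().item():.3f} "
              f"max={output.max().item():.3f} min={output.min().item():.3f}")
    return hook

for name, module in model.named_modules():
    if isinstance(module, (nn.Linear, nn.LayerNorm)):
        module.register_forward_hook(log_activation_stats(name))

# Run one step, then scan gradients for NaNs/Infs after backward.
loss = model(torch.randn(8, 64)).pow(2).mean()
loss.backward()
for name, param in model.named_parameters():
    if param.grad is not None and not torch.isfinite(param.grad).all():
        print(f"non-finite gradient in {name}")
```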
Q7: Explain the intuition
Explain in your own words why normalization stabilizes training, using a simple analogy.
Suggested answer
Example intuition:
- Without normalization: each layer acts like a filter that can amplify or shrink signals, so the signal scale quickly drifts.
- With normalization: each layer first normalizes the input to a stable scale, so gradients flow reliably.
Analogy:
- A water system without pressure control: higher floors get almost no water.
- A system with pressure regulators: every floor receives stable pressure.
Key takeaway: Normalization keeps signal scale stable, so gradients remain healthy and training converges.
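To make this concrete, here is a toy numerical sketch using random linear layers (not MiniMind's architecture): without normalization the activation scale drifts exponentially with depth, while an RMS-style rescale before each layer keeps it flat.

```python
import torch

torch.manual_seed(0)
dim, depth = 512, 16
# Deliberately mis-scaled random weights, so each layer amplifies the signal.
weights = [torch.randn(dim, dim) * 0.1 for _ in range(depth)]

def rms_norm(x, eps=1e-6):
    # Rescale each vector so its root-mean-square is ~1 (no learned gain, for simplicity).
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

x_plain = torch.randn(8, dim)
x_norm = x_plain.clone()
for i, w in enumerate(weights):
    x_plain = x_plain @ w          # no normalization: the scale compounds layer after layer
    x_norm = rms_norm(x_norm) @ w  # normalize first: the scale is reset at every layer
    print(f"layer {i:2d}: plain std = {x_plain.std().item():.3e} | "
          f"normalized std = {x_norm.std().item():.3f}")
```

In a real Pre-LN block the normalized branch is added back to a residual stream, which stabilizes things further; the point here is only that renormalizing before each transformation stops the drift.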
✅ Completion checklist
- [ ] Q1–Q5 correct: core concepts confirmed
- [ ] Q6 answered with 3+ causes and fixes: practical debugging
- [ ] Q7 answered with a clear analogy: intuitive understanding
If you are not fully confident, review teaching.md and rerun the experiments.
🎓 Next step
Continue to 02. Position Encoding.