04. FeedForward

Why expand–compress? What advantages does SwiGLU provide?


🎯 Learning goals

After this module, you will be able to:

  • ✅ Understand the expand–compress structure of FeedForward
  • ✅ Understand why a high-dimensional intermediate layer is needed
  • ✅ Understand the advantages of SwiGLU
  • ✅ Understand the division of labor between FeedForward and Attention
  • ✅ Implement SwiGLU FeedForward from scratch

📚 Learning path

1️⃣ Quick experience (10 min)

```bash
cd experiments

# Exp 1: FeedForward basics
python exp1_feedforward.py
```

🔬 Experiment list

| Experiment | Purpose | Time |
| --- | --- | --- |
| exp1_feedforward.py | Understand expand–compress and SwiGLU | 10 min |

💡 Key points

1. Core structure of FeedForward

$$\mathrm{FFN}(x) = W_2\,\sigma(W_1 x)$$

Standard FFN:

  • W1: hidden_size → intermediate_size (expand)
  • σ: activation (ReLU, GELU, ...)
  • W2: intermediate_size → hidden_size (compress)

MiniMind SwiGLU:

$$\mathrm{SwiGLU}(x) = W_{\text{down}}\big(\mathrm{SiLU}(W_{\text{gate}}\,x) \odot W_{\text{up}}\,x\big)$$
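As a quick illustration of both formulas, here is a minimal functional sketch in PyTorch. The sizes 768 and 2048 and the choice of GELU for σ are illustrative assumptions, not MiniMind's exact configuration:

```python
import torch
import torch.nn.functional as F

hidden_size, intermediate_size = 768, 2048          # illustrative sizes

# Weight matrices stored as (out_features, in_features), like nn.Linear
W1 = torch.randn(intermediate_size, hidden_size) * 0.02
W2 = torch.randn(hidden_size, intermediate_size) * 0.02
W_gate = torch.randn(intermediate_size, hidden_size) * 0.02
W_up = torch.randn(intermediate_size, hidden_size) * 0.02
W_down = torch.randn(hidden_size, intermediate_size) * 0.02

def ffn_standard(x):
    # FFN(x) = W2 · σ(W1 · x), with GELU standing in for σ
    return F.gelu(x @ W1.T) @ W2.T

def ffn_swiglu(x):
    # SwiGLU(x) = W_down · (SiLU(W_gate · x) ⊙ W_up · x)
    return (F.silu(x @ W_gate.T) * (x @ W_up.T)) @ W_down.T

x = torch.randn(2, 16, hidden_size)                 # (batch, seq_len, hidden)
print(ffn_standard(x).shape, ffn_swiglu(x).shape)   # both torch.Size([2, 16, 768])
```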

2. Why expand–compress?

Problem with direct transform:

  • 768 → 768: a single same-width map is essentially one linear transform (and stacked linear maps collapse into one), so expressiveness is limited
  • without a wider intermediate layer, the nonlinearity has too few units to fit complex functions

Advantage of expand–compress:

  • 768 → 2048 → 768: move into a higher-dimensional space
  • easier to separate patterns in high dimensions
  • compress back while preserving useful info

Analogy:

  • Cooking: ingredients → chopped/processed → plated
  • Photos: pixels → features → optimized pixels
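To make the capacity difference concrete, compare the weight counts of a direct 768 → 768 map with the 768 → 2048 → 768 expand–compress pair. Biases are ignored and 2048 is just an illustrative intermediate size:

```python
hidden, intermediate = 768, 2048

direct = hidden * hidden                     # single 768→768 matrix
expand_compress = hidden * intermediate * 2  # 768→2048 plus 2048→768

print(direct)           # 589824
print(expand_compress)  # 3145728  (~5.3x more weights, plus a 2048-wide nonlinearity)
```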

3. SwiGLU activation

Why not ReLU?

  • ReLU is simple, but it zeroes every negative activation and provides no input-dependent gating
  • GLU variants such as SwiGLU have been found to perform better in Transformer feed-forward layers

SwiGLU formula:

```python
gate = gate_proj(x)          # gate signal: hidden_size → intermediate_size
up = up_proj(x)              # value signal: hidden_size → intermediate_size
hidden = F.silu(gate) * up   # gating
output = down_proj(hidden)   # compress back to hidden_size
```

Three projections:

  • gate_proj: hidden_size → intermediate_size, produces the gate signal
  • up_proj: hidden_size → intermediate_size, produces the value signal
  • down_proj: intermediate_size → hidden_size, projects back to the model dimension
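Putting the three projections together, a minimal SwiGLU FeedForward module might look like the sketch below. It is written in the spirit of MiniMind's model/model_minimind.py, not copied from it; the sizes and the bias=False choice are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, hidden_size: int = 768, intermediate_size: int = 2048):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)  # gate signal
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)    # value signal
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)  # back to hidden

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gate the value path with SiLU(gate), then compress back to hidden_size
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

ffn = SwiGLUFeedForward()
x = torch.randn(2, 16, 768)      # (batch, seq_len, hidden)
print(ffn(x).shape)              # torch.Size([2, 16, 768])
```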

SiLU (Swish):

$$\mathrm{SiLU}(x) = x \cdot \sigma(x)$$
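A few sample values show how SiLU differs from ReLU: it is smooth and lets small negative inputs pass through scaled instead of cutting them to zero:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])
print(F.silu(x))   # tensor([-0.1423, -0.2689,  0.0000,  0.7311,  2.8577])
print(F.relu(x))   # tensor([0., 0., 0., 1., 3.])
```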

4. Division of labor: FeedForward vs Attention

| Component | Role | Analogy |
| --- | --- | --- |
| Attention | exchange information between tokens | meeting discussion |
| FeedForward | process each token independently | individual thinking |

Key difference:

  • Attention: seq_len × seq_len interactions
  • FeedForward: independent per position
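A quick way to see this in code: applying any position-wise FFN to one token at a time gives the same result as applying it to the whole sequence, because no information crosses positions. Attention would fail this check, since each output depends on the other tokens. The small standard FFN below is only a stand-in for the check:

```python
import torch
import torch.nn as nn

# Any position-wise FFN works here; a tiny standard FFN as a stand-in
ffn = nn.Sequential(nn.Linear(768, 2048), nn.GELU(), nn.Linear(2048, 768))

x = torch.randn(1, 16, 768)                        # (batch, seq_len, hidden)
full = ffn(x)                                      # whole sequence at once
per_token = torch.stack([ffn(x[:, t]) for t in range(x.size(1))], dim=1)
print(torch.allclose(full, per_token, atol=1e-5))  # True: each token is processed independently
```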

Transformer block flow:

```
x → RMSNorm → Attention → + → RMSNorm → FeedForward → +
              (info exchange)           (independent processing)
```
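The flow above is a pre-norm Transformer block. A schematic sketch, assuming the attention module, the FeedForward, and the two RMSNorm layers are already constructed elsewhere:

```python
def transformer_block(x, norm1, attn, norm2, ffn):
    # Pre-norm block: normalize, transform, then add the residual back
    x = x + attn(norm1(x))   # information exchange between tokens
    x = x + ffn(norm2(x))    # independent per-token processing
    return x
```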

📖 Docs


✅ Completion check

After finishing, you should be able to:

Theory

  • [ ] Explain why expand–compress is needed
  • [ ] Explain the three projections in SwiGLU
  • [ ] Explain SiLU activation
  • [ ] Explain the division of labor with Attention

Practice

  • [ ] Implement standard FFN from scratch
  • [ ] Implement SwiGLU from scratch
  • [ ] Compare different activations

🔗 Resources

Papers

Code reference

  • MiniMind: model/model_minimind.py:330-380

🎓 Next step

After finishing this module, go to: 👉 05. Residual Connection

Built on MiniMind for learning and experiments