
FeedForward Quiz

Answer the following questions to check your understanding.


Q1

Why does FeedForward use an "expand then compress" structure?

A
To reduce computation
B
To reduce parameters
C
To increase expressiveness in a higher-dimensional space
D
To speed up inference
Q2

How many projection matrices does SwiGLU use?

A
1
B
2
C
3
D
4
Q3

What is the formula for SiLU?

A
max(0, x)
B
x * sigmoid(x)
C
tanh(x)
D
1 / (1 + e^(-x))
Q4

What does the gating mechanism do in SwiGLU?

A
Speed up computation
B
Reduce parameters
C
Selectively control information flow
D
Prevent vanishing gradients
Q5

What is the main difference between FeedForward and Attention?

A
FFN has more parameters
B
FFN processes each position independently; Attention mixes positions
C
FFN is faster
D
FFN performs better
Q6

Why is SwiGLU often better than ReLU?

A
Faster computation
B
Fewer parameters
C
Gating adds richer non-linearity and smoother gradients
D
Simpler implementation
Q7

In MiniMind, intermediate_size is typically how many times hidden_size?

A
2×
B
4×
C
8×
D
16×

🎯 Comprehensive Questions

Q8: Practical scenario

If FeedForward outputs are always close to 0, what might be wrong and how would you debug it?

Reference answer:

Possible causes:

  1. Initialization issues:

    • weights initialized too small
    • outputs become tiny
  2. Gate signal issues:

    • gate_proj outputs are mostly strongly negative
    • SiLU(strongly negative) ≈ 0
    • so gate * up ≈ 0
  3. Vanishing gradients:

    • deep network, gradients don’t flow
    • weights stop updating
  4. Precision issues:

    • low precision (e.g., FP16)
    • small values underflow to 0

Debug steps:

python
# Debug sketch: assumes it runs with access to the FeedForward module (self),
# torch.nn.functional as F, and the module's output tensor `output`.
# 1. inspect intermediates
gate = self.gate_proj(x)
print(f"gate mean: {gate.mean()}, std: {gate.std()}")

silu_gate = F.silu(gate)
print(f"silu_gate mean: {silu_gate.mean()}, std: {silu_gate.std()}")

up = self.up_proj(x)
print(f"up mean: {up.mean()}, std: {up.std()}")

hidden = silu_gate * up
print(f"hidden mean: {hidden.mean()}, std: {hidden.std()}")

# 2. check weight initialization
for name, param in self.named_parameters():
    print(f"{name}: mean={param.mean():.6f}, std={param.std():.6f}")

# 3. check gradients
output.sum().backward()
for name, param in self.named_parameters():
    if param.grad is not None:
        print(f"{name} grad: {param.grad.abs().mean():.6f}")

Solutions:

  1. Adjust initialization:

    python
    # assumes `import torch.nn as nn`
    nn.init.xavier_uniform_(self.gate_proj.weight)
  2. Check RMSNorm:

    • ensure inputs are normalized
    • avoid extreme ranges
  3. Mixed precision:

    • use FP32 for critical ops
    • BF16 for the rest
    • a mixed-precision sketch follows below
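
As a concrete illustration of the mixed-precision suggestion, here is a minimal sketch using torch.autocast. The names model, optimizer, batch, targets, and loss_fn are placeholders for your own training objects, not MiniMind code:

python
import torch

# Mixed-precision training step (sketch). `model`, `optimizer`, `batch`,
# `targets`, and `loss_fn` are placeholders for your own objects.
def train_step(model, optimizer, batch, targets, loss_fn):
    optimizer.zero_grad()
    # Forward pass under autocast: matmuls run in BF16, while autocast keeps
    # numerically sensitive ops (e.g., normalization reductions) in FP32.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        output = model(batch)
        loss = loss_fn(output, targets)
    # Backward pass and parameter update run outside autocast in full precision.
    loss.backward()
    optimizer.step()
    return loss.item()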

Q9: Conceptual understanding

Why does FeedForward process each position independently instead of global interaction like Attention?

Reference answer:

Design rationale:

  1. Clear division of labor:

    • Attention already mixes information
    • FeedForward focuses on "deep processing"
    • avoids redundant functionality
  2. Compute efficiency:

    • global interaction: O(n²d)
    • independent processing: O(nd²)
    • when n > d, independent processing is cheaper
    • see the worked cost example after this list
  3. Parameter efficiency:

    • global interaction needs position-dependent parameters
    • independent processing can reuse parameters
  4. Theory:

    • Transformer design:
      • Attention = global routing
      • FFN = local feature transformation
    • similar to spatial conv + 1x1 conv in CNNs
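
To make the cost comparison in point 2 concrete, here is a small back-of-the-envelope sketch; the values of n and d are illustrative assumptions, not a specific MiniMind configuration:

python
# Back-of-the-envelope multiply counts (constants ignored).
# n and d are illustrative assumptions, not a specific MiniMind config.
n = 2048   # sequence length
d = 512    # hidden size

global_mixing = n * n * d      # every position interacts with every other: O(n^2 * d)
position_wise = n * d * d      # each position transformed independently: O(n * d^2)

print(f"global mixing:     {global_mixing:,} mults")
print(f"position-wise FFN: {position_wise:,} mults")
print(f"ratio: {global_mixing / position_wise:.1f}x (equal to n/d = {n / d:.1f})")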

Analogy:

  • meeting (Attention): exchange info
  • thinking (FFN): digest and organize
  • alternating works best

Empirical evidence:

  • papers show this division works well
  • all-FFN or all-Attention designs perform worse

Q10: Code understanding

Explain each step in the following code:

python
def forward(self, x):
    return self.down_proj(
        F.silu(self.gate_proj(x)) * self.up_proj(x)
    )
Reference answer:

Step-by-step:

python
def forward(self, x):
    # x: [batch, seq, hidden_dim]
    # e.g., [32, 512, 512]

    # Step 1: gate signal
    gate = self.gate_proj(x)
    # gate: [batch, seq, intermediate_dim]
    # e.g., [32, 512, 2048]

    # Step 2: SiLU activation
    gate_activated = F.silu(gate)
    # SiLU(x) = x * sigmoid(x)
    # smooth activation, preserves some negative info
    # gate_activated: [32, 512, 2048]

    # Step 3: value signal
    up = self.up_proj(x)
    # up: [32, 512, 2048]

    # Step 4: gating
    hidden = gate_activated * up
    # elementwise multiply
    # gate_activated controls the flow of up
    # hidden: [32, 512, 2048]

    # Step 5: compress back
    output = self.down_proj(hidden)
    # output: [32, 512, 512]

    return output

Key points:

  1. Two parallel paths: gate_proj and up_proj
  2. Gating: SiLU(gate) controls up
  3. Dimension flow: 512 → 2048 → 512
  4. No bias: all Linear layers use bias=False

Why this style?

  • concise: one-line formula
  • efficient: frameworks optimize this pattern
  • direct mapping to SwiGLU formula
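
Putting the pieces together, here is a self-contained sketch of a FeedForward module in this SwiGLU style; the class name and the 512/2048 sizes are illustrative assumptions rather than MiniMind's exact code:

python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Sketch of the gate/up/down pattern above; sizes are illustrative."""

    def __init__(self, hidden_size: int = 512, intermediate_size: int = 2048):
        super().__init__()
        # Two parallel expansion paths and one compression path, all bias-free.
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: SiLU(gate) acts as a soft gate on the up projection.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Quick shape check: [batch, seq, hidden] -> [batch, seq, hidden]
x = torch.randn(2, 16, 512)
print(SwiGLUFeedForward()(x).shape)  # torch.Size([2, 16, 512])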

✅ Completion check

After finishing all questions, check whether you can:

  • [ ] Get Q1–Q7 all correct: solid basics
  • [ ] Provide 2+ debugging steps in Q8: troubleshooting ability
  • [ ] Explain design rationale in Q9: architecture understanding
  • [ ] Explain code step-by-step in Q10: implementation clarity

If anything is unclear, return to teaching.md or rerun the experiments.


🎓 Advanced challenge

Want to go deeper? Try:

  1. Modify experiment code:

    • compare ReLU, GELU, SiLU activation distributions (see the starter sketch after this list)
    • visualize gating patterns
    • test different intermediate_size values
  2. Read papers:

    • e.g., "GLU Variants Improve Transformer" (Shazeer, 2020), which introduced SwiGLU

  3. Implement variants:

    • implement GeGLU (GELU gating)
    • implement standard FFN (compare results)
    • implement MoE FeedForward
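
A possible starting point for the first challenge: compare the output statistics of ReLU, GELU, and SiLU on the same random input. This is only a sketch, not the course's experiment code:

python
import torch
import torch.nn.functional as F

# Compare activation output statistics on the same random input.
# A starting point only; nothing here is the course's experiment code.
x = torch.randn(100_000)
activations = {
    "ReLU": F.relu(x),
    "GELU": F.gelu(x),
    "SiLU": F.silu(x),
}
for name, y in activations.items():
    frac_zero = (y == 0).float().mean().item()
    print(f"{name:5s} mean={y.mean():+.3f}  std={y.std():.3f}  "
          f"min={y.min():+.3f}  exactly zero: {frac_zero:.1%}")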

Next: continue to 05. Residual Connection.

Built on MiniMind for learning and experiments