
02. Position Encoding

Why do Transformers need position information? How does RoPE work?


🎯 Learning goals

After this module, you will be able to:

  • ✅ Understand Attention’s permutation invariance issue
  • ✅ Understand RoPE’s core idea (rotary encoding)
  • ✅ Understand why multi-frequency is needed
  • ✅ Understand how RoPE supports length extrapolation
  • ✅ Implement RoPE from scratch

📚 Learning path

1️⃣ Quick experience (10 min)

```bash
cd experiments

# Exp 1: prove Attention is permutation-invariant
python exp1_why_position.py

# Exp 2: RoPE basics
python exp2_rope_basics.py --quick
```

🔬 Experiment list

| Experiment | Purpose | Time | Data |
| --- | --- | --- | --- |
| `exp1_why_position.py` | Show why position encoding is needed | 30 sec | synthetic |
| `exp2_rope_basics.py` | RoPE core idea | 2 min | synthetic |
| `exp3_multi_frequency.py` | Multi-frequency mechanism | 2 min | synthetic |
| `exp4_rope_explained.py` | Full implementation (optional) | 5 min | synthetic |

💡 Key points

1. Why position encoding?

Problem: Attention is permutation-invariant: reordering the input tokens merely reorders the outputs, so no token's representation changes

```python
# Same tokens, different order, different meaning:
sentence_1 = ["I", "like", "you"]
sentence_2 = ["you", "like", "I"]
# Without position information, attention cannot tell these two apart.
```

Fix: attach position information to each token
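
To make this concrete, here is a minimal sketch (toy values, not one of the experiment scripts; `attention` is an illustrative single-head implementation) showing that permuting the input only permutes the output rows:

```python
import torch

torch.manual_seed(0)
x = torch.randn(3, 8)           # 3 tokens, hidden dim 8 (toy values)
perm = torch.tensor([2, 1, 0])  # "I like you" -> "you like I"

def attention(x):
    # Single-head self-attention with Q = K = V = x, no position encoding
    scores = x @ x.T / x.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ x

# Permuting the input only permutes the output rows: each token ends up
# with exactly the same representation, so attention cannot see order.
print(torch.allclose(attention(x)[perm], attention(x[perm])))  # True
```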


2. Core idea of RoPE

Rotary encoding: encode position via rotation matrices

  • position 0: rotate by 0
  • position 1: rotate by θ
  • position 2: rotate by 2θ
  • ...

Advantages:

  • ✅ Relative position emerges naturally (attention scores depend only on angle differences; see the sketch below)
  • ✅ Supports length extrapolation (train at 256 tokens, infer at 512+)
  • ✅ Efficient (apply rotation directly)
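
A minimal sketch of the idea in a single 2-D plane (`rotate` is a hypothetical helper for illustration, not MiniMind's API). Rotating a query by m·θ and a key by n·θ leaves only the offset (m − n)·θ in their dot product:

```python
import math
import torch

def rotate(vec2, pos, theta=0.1):
    # Rotate a 2-D vector by the angle pos * theta (RoPE in one frequency plane)
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return torch.tensor([[c, -s], [s, c]]) @ vec2

q = torch.tensor([1.0, 0.0])
k = torch.tensor([0.0, 1.0])

# The attention score depends only on the relative offset between positions:
print(rotate(q, 5) @ rotate(k, 3))    # positions (5, 3), offset 2
print(rotate(q, 12) @ rotate(k, 10))  # positions (12, 10), offset 2 -> same score
```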

3. Multi-frequency mechanism

Why multiple frequencies?

  • Low frequency (slow rotation): distinguishes far-apart positions (paragraph scale)
  • High frequency (fast rotation): resolves nearby positions (token scale)

Analogy:

  • Clock hands: second hand (high frequency), minute hand (mid), hour hand (low)
  • Combined, they pinpoint a time exactly; the sketch below shows RoPE's actual frequency spectrum
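
A minimal sketch of the standard RoPE frequency schedule, assuming the usual base of 10000 and a head dimension of 64 (both illustrative choices):

```python
import torch

dim, base = 64, 10000.0

# One frequency per 2-D plane: theta_i = base^(-2i / dim), i = 0 .. dim/2 - 1
inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)

# Wavelength in tokens: positions one full turn (2*pi) apart look identical
wavelengths = 2 * torch.pi / inv_freq
print(wavelengths[:2])   # "second hands": roughly 6-8 tokens per full turn
print(wavelengths[-2:])  # "hour hands": tens of thousands of tokens per turn
```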

✅ Completion check

After finishing, you should be able to:

Theory

  • [ ] Explain why Attention needs position information
  • [ ] State the RoPE formula
  • [ ] Explain the role of multi-frequency

Practice

  • [ ] Implement RoPE from scratch (see the sketch after this list)
  • [ ] Draw the RoPE rotation diagram
  • [ ] Explain how RoPE supports extrapolation
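
As a starting point for the practice items, here is a minimal end-to-end sketch using the interleaved-pair convention (a toy reference only; MiniMind's actual implementation in `model/model_minimind.py` may organize this differently):

```python
import torch

def apply_rope(x, base=10000.0):
    # x: (seq_len, dim) with dim even; dims are paired as (0,1), (2,3), ...
    seq_len, dim = x.shape
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)  # (dim/2,)
    angles = torch.arange(seq_len).float()[:, None] * inv_freq   # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                              # even / odd dims
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin  # 2-D rotation applied to each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(8, 64)
k = torch.randn(8, 64)
scores = apply_rope(q) @ apply_rope(k).T  # position-aware attention scores
print(scores.shape)  # torch.Size([8, 8])
```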

🔗 Resources

Paper

  • RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021), arXiv:2104.09864

Code reference

  • MiniMind: model/model_minimind.py:108-200

🎓 Next step

After finishing this module, go to: 👉 03. Attention

Built on MiniMind for learning and experiments