
Position Encoding Quiz

Answer the following questions to check your understanding.


Q1

Why does Attention need positional encoding?

A. To speed up computation
B. To reduce parameters
C. Because Attention is permutation-invariant and cannot distinguish order
D. To support longer sequences
Q2

What is the core idea of RoPE?

A. Assign a learnable vector to each position
B. Encode position by rotating vectors
C. Compute relative distance between tokens
D. Use sine functions to generate positional encodings
Q3

Why does RoPE use multiple frequencies?

A. To speed up computation
B. To reduce memory usage
C. High frequency encodes local position, low frequency encodes global position
D. To be compatible with different head_dim values
Q4

Where is RoPE applied in Attention?

A. Query only
B. Key only
C. Query and Key
D. Query, Key, and Value
Q5

What is the main advantage of RoPE over absolute positional embeddings?

A. Faster computation
B. Fewer parameters
C. Supports length extrapolation (train short, infer long)
D. Simpler implementation

🎯 Comprehensive Questions

Q6: Practical scenario

Assume you trained a RoPE model with max_seq_len=512. Now you need to process sequences of length 2048. What issues might occur, and how can you fix them?

Reference answer:

Possible issues:

  1. Performance drop:

    • Extrapolation is possible, but the model never saw such long sequences
    • Long-range attention patterns may be inaccurate
  2. Over-rotation:

    • Position 2000 rotates roughly 4× further than any position seen during training
    • High-frequency dimensions may wrap around many extra times and lose positional information (see the quick check after this list)
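
To make over-rotation concrete, here is a minimal sketch that assumes the standard RoPE frequency formula theta_i = base^(-2i/d) with base = 10000 and head_dim = 64 (illustrative values, not necessarily MiniMind's configuration):

```python
import torch

# Assumed standard RoPE frequencies: theta_i = base^(-2i/d)
head_dim, base = 64, 10000.0
inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

train_pos, infer_pos = 511, 2000                # last training position vs. new position
angles_train = train_pos * inv_freq             # per-dimension rotation angles at pos 511
angles_infer = infer_pos * inv_freq             # per-dimension rotation angles at pos 2000

print(infer_pos / train_pos)                    # ≈ 3.9: every angle grows by this factor
print(float(angles_infer[0] / (2 * torch.pi)))  # ≈ 318 full turns on the fastest dimension
print(float(angles_train[-1]), float(angles_infer[-1]))  # slowest dimension: ≈ 0.07 vs ≈ 0.27 rad,
                                                         # an angle range never reached in training
```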

Solutions:

  1. Use YaRN (recommended):

    ```python
    # enable in MiniMind
    config.inference_rope_scaling = True
    ```

    • rescales the rotation frequencies
    • smooths rotations over long sequences
  2. Position Interpolation:

    ```python
    # scale positions back into the training range
    pos_ids = pos_ids * (512 / 2048)
    ```

    • simple but effective (a fuller sketch follows after this list)
    • compresses the position range
  3. Continue training:

    • fine-tune with long-sequence data
    • adapt the model to long-range attention
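
The position-interpolation idea from solution 2, sketched end to end (this assumes the standard inv_freq formulation and is not MiniMind's actual inference_rope_scaling code path):

```python
import torch

def rope_angles(positions: torch.Tensor, head_dim: int = 64, base: float = 10000.0,
                scale: float = 1.0) -> torch.Tensor:
    """Rotation angles of shape [len(positions), head_dim // 2].

    scale < 1.0 implements position interpolation: positions are shrunk
    back into the range the model was trained on before angles are computed.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return (positions.float() * scale).unsqueeze(-1) * inv_freq  # outer product: [seq, dim/2]

positions = torch.arange(2048)
plain  = rope_angles(positions)                    # extrapolation: angles exceed the training range
interp = rope_angles(positions, scale=512 / 2048)  # interpolation: angles stay inside it

print(float(plain[-1, -1]), float(interp[-1, -1]))  # slowest dimension, with and without interpolation
```

YaRN refines this idea: instead of compressing every position by one uniform factor, it rescales different frequency bands by different amounts, which preserves more of the high-frequency (local) detail.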

Best practices:

  • use a larger rope_base during training (e.g., 1e6; see the comparison below)
  • enable YaRN in inference
  • validate with task-specific tests
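
Why a larger rope_base helps, in rough numbers (same assumed frequency formula as above; the values are illustrative, not MiniMind-specific):

```python
import math

def slowest_wavelength(base: float, head_dim: int = 64) -> float:
    """Wavelength in tokens of the slowest-rotating RoPE dimension."""
    lowest_freq = base ** (-(head_dim - 2) / head_dim)
    return 2 * math.pi / lowest_freq

print(slowest_wavelength(1e4))  # ≈ 47k tokens
print(slowest_wavelength(1e6))  # ≈ 4M tokens: low-frequency dimensions rotate far more slowly,
                                # leaving much more headroom before long positions look "unseen"
```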

Q7: Conceptual understanding

In your own words: why does the RoPE dot product depend only on relative position?

Reference answer:

Math intuition:

Assume:

  • Query at position m: $q_m = R(m\theta)\,q$ (rotated by angle $m\theta$)
  • Key at position n: $k_n = R(n\theta)\,k$ (rotated by angle $n\theta$)

Dot product:

$$q_m^\top k_n = [R(m\theta)\,q]^\top [R(n\theta)\,k]$$

Key property: the transpose of a rotation matrix is its inverse, so $R(m\theta)^\top = R(-m\theta)$, and successive rotations add their angles:

$$= q^\top R(m\theta)^\top R(n\theta)\,k = q^\top R((n-m)\theta)\,k$$

Conclusion:

  • the dot product depends only on $(n-m)\theta$
  • the absolute positions m and n cancel out

Intuitive analogy:

  • two people facing each other
  • no matter where they stand in the room
  • their relative angle stays the same

That’s why:

  • position pairs (5, 8) and (100, 103) receive the same positional contribution to their attention scores
  • the model learns “distance = 3” patterns regardless of absolute position (verified numerically in the sketch below)
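
A quick numerical check of this property, as a self-contained sketch with a single 2-D rotation (one frequency only; not MiniMind's RoPE code):

```python
import torch

def rotate(vec: torch.Tensor, angle: torch.Tensor) -> torch.Tensor:
    """Apply the 2-D rotation matrix R(angle) to vec."""
    c, s = torch.cos(angle), torch.sin(angle)
    rot = torch.stack([torch.stack([c, -s]), torch.stack([s, c])])
    return rot @ vec

torch.manual_seed(0)
q, k = torch.randn(2), torch.randn(2)   # an arbitrary query/key pair
theta = torch.tensor(0.1)               # one RoPE frequency

# Same relative distance (3), different absolute positions
score_a = rotate(q, 5 * theta) @ rotate(k, 8 * theta)
score_b = rotate(q, 100 * theta) @ rotate(k, 103 * theta)
print(torch.allclose(score_a, score_b, atol=1e-5))  # True: only n - m matters
```

Full RoPE applies the same rotation independently to each 2-D pair of head dimensions, each with its own frequency, so the conclusion carries over unchanged.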

✅ Completion check

After finishing all questions, check whether you can:

  • [ ] Get Q1–Q5 all correct: solid basics
  • [ ] Provide 2+ solutions in Q6: practical ability
  • [ ] Explain Q7 clearly: deep conceptual understanding

If anything is unclear, return to teaching.md or rerun the experiments.


🎓 Advanced challenge

Want to go deeper? Try:

  1. Modify experiment code:

    • implement a simple absolute positional encoding
    • compare length extrapolation vs RoPE
    • test different rope_base values
  2. Read papers:

    • RoFormer: Enhanced Transformer with Rotary Position Embedding (the original RoPE paper)
    • YaRN: Efficient Context Window Extension of Large Language Models
  3. Implement variants:

    • implement ALiBi (another positional encoding)
    • compare RoPE vs ALiBi

Next: go to 03. Attention to learn attention mechanisms.

Built on MiniMind for learning and experiments