
Masked self-attention combines two key ideas:

  1. Self-attention: the queries (Q), keys (K), and values (V) are all derived from the same sequence.
  2. Masked (causal) attention: each position is prevented from attending to future positions.

This combination is used primarily in autoregressive models, e.g. the Transformer decoder block. Unlike standard self-attention, which allows full bidirectional context, masked self-attention enforces causality: each position can attend only to earlier positions and to itself.

This is achieved by applying a causal mask during the attention calculation. The mask sets the attention scores for future positions to a very low value (typically negative infinity before the softmax), effectively zeroing them out in the final attention distribution.
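
As a minimal NumPy sketch of this masking step (the toy sequence length, random scores, and variable names here are illustrative assumptions):

```python
import numpy as np

seq_len = 4
rng = np.random.default_rng(0)

# Stand-in for the raw scaled dot-product scores (Q K^T / sqrt(d_k) would have this shape).
scores = rng.normal(size=(seq_len, seq_len))

# Causal mask: -inf strictly above the diagonal (future positions), 0 on and below it.
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

# Row-wise softmax; exp(-inf) = 0, so future positions receive exactly zero weight.
masked = scores + mask
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 3))  # upper triangle is all zeros
```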

$$\text{MaskedAttention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$

where $M$ is the causal mask with $M_{ij} = -\infty$ when $j > i$ and $M_{ij} = 0$ otherwise.
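
For example, for a sequence of length 4 (with 1-indexed positions) the causal mask is

$$M = \begin{pmatrix} 0 & -\infty & -\infty & -\infty \\ 0 & 0 & -\infty & -\infty \\ 0 & 0 & 0 & -\infty \\ 0 & 0 & 0 & 0 \end{pmatrix}$$

so, for instance, position 2 can attend to positions 1 and 2 but receives zero weight on positions 3 and 4 after the softmax.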

Since this is self-attention, the masking prevents each position from attending to future positions within the same sequence being processed.
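
Putting the pieces together, here is a self-contained single-head sketch in NumPy; the function name `masked_self_attention`, the projection shapes, and the toy inputs are illustrative assumptions rather than a reference implementation:

```python
import numpy as np

def masked_self_attention(x, w_q, w_k, w_v):
    """Single-head masked (causal) self-attention over one sequence.

    x: (seq_len, d_model) inputs; w_q, w_k, w_v: (d_model, d_k) projections.
    """
    # Q, K, V all come from the same sequence (the "self" in self-attention).
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)              # scaled dot-product scores

    # Causal mask M: -inf where j > i (future positions), 0 otherwise.
    seq_len = x.shape[0]
    mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
    scores = scores + mask

    # Row-wise softmax; masked positions end up with exactly zero weight.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy usage: perturbing a later token does not change earlier outputs.
rng = np.random.default_rng(0)
d_model, d_k, seq_len = 8, 4, 5
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = [rng.normal(size=(d_model, d_k)) for _ in range(3)]

out = masked_self_attention(x, w_q, w_k, w_v)
x_perturbed = x.copy()
x_perturbed[-1] += 10.0                          # change only the last token
out_perturbed = masked_self_attention(x_perturbed, w_q, w_k, w_v)

print(np.allclose(out[:-1], out_perturbed[:-1]))  # True: earlier positions unaffected
```

The final check illustrates the causality property: because each row of the attention weights is zero beyond its own position, changing a future token cannot affect the outputs at earlier positions.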