It is important to distinguish the causal mask from the padding mask:
- Causal Mask: Prevents attention to future positions within a sequence. Applied only in decoder self-attention. Shape is typically a lower-triangular matrix, so each position can attend only to itself and earlier positions. Purpose is to enforce the auto-regressive property during training and inference.
- Padding Mask: Prevents attention to padding tokens added to make sequences in a batch equal length. Can be applied in encoder self-attention, decoder self-attention, and decoder cross-attention whenever padding is present. Shape depends on the lengths of the sequences in the batch. Purpose is to ignore meaningless padding tokens during computation. (Both constructions are sketched after this list.)
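
As a minimal sketch of how the two masks differ in construction (PyTorch assumed; the tensor names, batch size, and sequence lengths below are illustrative, not from the original text):

```python
import torch

seq_len, batch_size = 5, 2

# Causal mask: lower-triangular, shape (seq_len, seq_len).
# True marks allowed positions; position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Padding mask: shape (batch_size, seq_len), derived from the real lengths
# in the batch. True marks real tokens, False marks padding.
lengths = torch.tensor([5, 3])  # hypothetical per-sequence lengths
padding_mask = torch.arange(seq_len)[None, :] < lengths[:, None]
```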
Both are often implemented by adding large negative values to the attention scores before the softmax, but they target different tokens for different reasons. Some libraries combine both masks automatically when both are provided.
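
Continuing the sketch above, one way to combine the two masks and apply them additively before the softmax (the attention scores here are random placeholders):

```python
# Placeholder attention logits of shape (batch_size, seq_len, seq_len).
scores = torch.randn(batch_size, seq_len, seq_len)

# Broadcast the two masks together:
# (1, seq_len, seq_len) AND (batch_size, 1, seq_len) -> (batch_size, seq_len, seq_len)
combined = causal_mask[None, :, :] & padding_mask[:, None, :]

# Disallowed positions receive -inf, so softmax assigns them zero weight.
masked_scores = scores.masked_fill(~combined, float("-inf"))
attn_weights = torch.softmax(masked_scores, dim=-1)
```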