Topics
Masked multi-head self-attention is an attention mechanism variant that combines:
- causal masking (from masked attention) to prevent information leaking from future positions
- multiple attention heads (from multi-head attention) so each head can learn a different kind of relationship
- self-attention, so each sequence element is processed relative to every other element in the same sequence
Used in decoder-only transformers such as GPT and in the decoder of the original encoder-decoder transformer; the per-head computation is written out below.
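In standard transformer notation (projections Q, K, V, per-head dimension d_k, output projection W^O; these symbols are not defined in the notes above and are assumed here), the masked attention for one head and the multi-head combination can be written as:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V,
\qquad
M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}
$$

$$
\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O,
\qquad
\mathrm{head}_i = \mathrm{Attention}(XW_i^Q, XW_i^K, XW_i^V)
$$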
Typical process (sketched in code after this list):
- For each head, compute query, key, and value matrices from the input
- Apply a causal mask so no position can attend to future positions
- Compute scaled dot-product attention per head
- Concatenate the outputs of all heads and apply a linear projection to obtain the layer's result
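A minimal sketch of these steps, assuming PyTorch as the framework; the class name, `d_model`, and `num_heads` are illustrative and not taken from the notes above:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedMultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One linear layer each for queries, keys, and values, plus the output projection
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, _ = x.shape

        # 1. Compute Q, K, V and split into heads: (batch, heads, seq_len, d_head)
        def split_heads(t):
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(x))
        k = split_heads(self.w_k(x))
        v = split_heads(self.w_v(x))

        # 2. Causal mask: True above the diagonal marks future positions to block
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )

        # 3. Scaled dot-product attention per head
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        scores = scores.masked_fill(mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        out = attn @ v  # (batch, heads, seq_len, d_head)

        # 4. Concatenate heads and apply the output projection
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.w_o(out)

# Usage: 2 sequences of length 5, model width 64, 8 heads
layer = MaskedMultiHeadSelfAttention(d_model=64, num_heads=8)
y = layer(torch.randn(2, 5, 64))
print(y.shape)  # torch.Size([2, 5, 64])
```

Because the mask sets future-position scores to negative infinity before the softmax, those positions receive zero attention weight, which is what prevents information leakage during autoregressive training.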