Topics
Let’s categorize attention mechanisms along five key axes:
- Input Scope: How much of the input we attend to (e.g., local, global)
- Score Computation: How attention scores are calculated (e.g., dot product, additive)
- Architecture/Application: How attention is applied (e.g., self-attention, cross-attention)
- Efficiency/Optimization: Techniques to reduce computational cost (e.g., sparse, flash)
- Special Properties: Unique constraints or behaviors (e.g., masked, hard)
Most modern attention mechanisms build on the query-key-value (QKV) framework, sketched in code after the list:
- Queries (Q): What we’re attending from
- Keys (K): What we’re attending to
- Values (V): What we retrieve
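
To make these roles concrete, here is a minimal NumPy sketch of scaled dot-product attention, softmax(QKᵀ/√d_k)·V, as used in the Transformer. The function name and toy shapes are illustrative choices, not from any particular library, and learned projections and batching are omitted.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled to keep logits stable.
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_queries, n_keys)
    # Softmax over the keys turns scores into weights that sum to 1 per query.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # (n_queries, n_keys)
    # Each output is a weighted average of the value vectors.
    return weights @ V                              # (n_queries, d_v)

# Toy self-attention example: 4 tokens, 8-dimensional vectors, Q = K = V.
X = np.random.default_rng(0).normal(size=(4, 8))
print(scaled_dot_product_attention(X, X, X).shape)  # (4, 8)
```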
Category | Type | Key Feature | Use Case |
---|---|---|---|
Input Scope | Local attention | Fixed window around each position | Convolution-style local processing |
Input Scope | Global attention | Considers the full sequence | Transformer |
Score Computation | Dot product | Raw similarity q · k | Transformer |
Score Computation | Scaled dot-product attention | q · k scaled by 1/√d_k | Transformer |
Score Computation | Multiplicative (Luong) attention | Multiplicative scoring with dot, general, and concat variants | RNN seq2seq |
Score Computation | Additive (Bahdanau) attention | Score from a small feed-forward network over q and k | Early seq2seq |
Architecture | Cross-attention | Queries from decoder, keys/values from encoder | Encoder-decoder architectures (e.g., Transformer) |
Architecture | Self-attention | Q, K, V from the same sequence | Transformer |
Architecture | Multi-head attention | Multiple parallel attention heads | Transformer |
Efficiency | Sparse attention | Attends to a subset of positions | Long sequences |
Efficiency | FlashAttention | Optimized computation and memory access | Speeding up Transformers |
Special Properties | Masked attention | Blocks attention to future positions | Autoregressive models |
Special Properties | Hard attention | Selects a single position | Interpretability |
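
As an illustration of the "Masked attention" row: a causal mask blocks each position from attending to later positions by setting those scores to -inf before the softmax. Below is a minimal NumPy sketch under the same assumptions as the earlier example (illustrative function name, single head, no learned projections).

```python
import numpy as np

def causal_self_attention(X):
    """Masked (causal) self-attention: position i attends only to positions <= i."""
    n, d_k = X.shape
    Q, K, V = X, X, X                                # self-attention: all from one sequence
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n)
    # Causal mask: -inf at future positions so softmax assigns them zero weight.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

X = np.random.default_rng(1).normal(size=(5, 8))
print(causal_self_attention(X).shape)                # (5, 8)
```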