Topics
Let’s categorize attention mechanisms along five key axes:
- Input Scope: How much of the input is attended to (e.g., local, global)
- Score Computation: How attention scores are calculated (e.g., dot product, additive)
- Architecture/Application: How attention is applied (e.g., self-attention, cross-attention)
- Efficiency/Optimization: Techniques to reduce computational cost (e.g., sparse, flash)
- Special Properties: Unique constraints or behaviors (e.g., masked, hard)
 
Most modern attention mechanisms build on the query-key-value (QKV) framework (see the sketch after this list):
- Queries (Q): What we’re attending from
- Keys (K): What we’re attending to
- Values (V): What we retrieve
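
As a concrete reference for the table below, here is a minimal sketch of scaled dot-product attention expressed in terms of Q, K, and V. It is written in NumPy with illustrative shapes; the optional `mask` argument is an assumption that is reused after the table.

```python
# A minimal NumPy sketch of scaled dot-product attention in the QKV framework.
# The shapes and the toy self-attention call below are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # how well each query matches each key
    if mask is not None:
        scores = scores + mask                # -inf entries block attention
    weights = softmax(scores, axis=-1)        # attention distribution over positions
    return weights @ V                        # weighted sum of the values

# Toy self-attention: Q, K, V all come from the same 4-token sequence (d = 8).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (4, 8)
```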
 
| Category | Type | Key Feature | Use Case |
|---|---|---|---|
| Input Scope | Local attention | Fixed window around each position | CNN-style local processing |
| Input Scope | Global attention | Full sequence considered | Transformer |
| Score Computation | Dot product | Unscaled query-key dot product | Transformer |
| Score Computation | Scaled dot product | Dot product divided by √d_k | Transformer |
| Score Computation | Multiplicative (Luong) | Scoring variants: dot, general, concat | RNN seq2seq |
| Score Computation | Additive (Bahdanau) | Small feed-forward network scores each query-key pair | Early seq2seq |
| Architecture | Cross-attention | Queries from decoder, keys/values from encoder | Encoder-decoder models (e.g., Transformer) |
| Architecture | Self-attention | Q, K, V from the same sequence | Transformer |
| Architecture | Multi-head attention | Multiple parallel attention heads | Transformer |
| Efficiency | Sparse attention | Attends to a subset of positions | Long sequences |
| Efficiency | Flash attention | IO-aware computation/memory optimization | Speeding up Transformers |
| Special Properties | Masked attention | Blocks future positions | Autoregressive models |
| Special Properties | Hard attention | Selects a single position | Interpretability |
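
To make the Input Scope and Special Properties rows concrete: global, local, and masked (causal) attention can all be expressed as different masks added to the score matrix before the softmax. The helper names and window size below are illustrative assumptions that plug into the sketch above.

```python
# Masks that plug into the scaled_dot_product_attention sketch above.
# Helper names and the default window size are illustrative assumptions.
import numpy as np

def causal_mask(seq_len):
    # Masked attention: position i may only attend to positions j <= i,
    # blocking the future for autoregressive models.
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def local_window_mask(seq_len, window=2):
    # Local attention: position i attends only within `window` steps of itself.
    idx = np.arange(seq_len)
    allowed = np.abs(idx[:, None] - idx[None, :]) <= window
    return np.where(allowed, 0.0, -np.inf)

# Global attention is simply mask=None (every position sees the full sequence),
# e.g. scaled_dot_product_attention(X, X, X, mask=causal_mask(len(X)))
print(causal_mask(4))
print(local_window_mask(6, window=1))
```

Sparse attention follows the same pattern, with a mask that keeps only a structured subset of positions, which is why it appears under Efficiency for long sequences.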