Topics

Let’s categorize attention mechanisms along five key axes:

  • Input Scope: How much of the input is attended to
  • Score Computation: How attention scores are calculated (e.g., dot product, additive)
  • Architecture/Application: How attention is applied (e.g., self-attention, cross-attention)
  • Efficiency/Optimization: Techniques to reduce computational cost (e.g., sparse, flash)
  • Special Properties: Unique constraints or behaviors (e.g., masked, hard)

Most modern attention mechanisms build on the query-key-value (QKV) framework:

  • Queries (Q): What we’re attending from
  • Keys (K): What we’re attending to
  • Values (V): What we retrieve
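
To make the QKV framework concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function names, toy shapes, and random inputs are illustrative assumptions, not part of any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    # Scores: similarity of each query with every key,
    # scaled by sqrt(d_k) to keep the softmax well-behaved.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)   # (n_q, n_k), each row sums to 1
    return weights @ V                   # weighted sum of value vectors

# Toy example: 4 positions, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Because each row of the attention weights sums to 1, the output for a query is a convex combination of the value vectors it attends to.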
| Category | Type | Key Feature | Use Case |
| --- | --- | --- | --- |
| Input Scope | Local attention | Fixed window around each position | CNN-style local processing |
| | Global attention | Considers the full sequence | Transformer |
| Score Computation | Dot product | Q·K similarity score | Transformer |
| | Scaled dot product attention | Q·K scaled by √d_k | Transformer |
| | Luong attention (multiplicative) | Multiplicative scoring variants | RNN seq2seq |
| | Additive (Bahdanau attention) | Feed-forward scoring of Q and K | Early seq2seq |
| Architecture | Cross-attention | Queries from decoder, keys/values from encoder | Encoder-decoder architectures (e.g., Transformer) |
| | Self-attention | Q, K, V from the same sequence | Transformer |
| | Multi-head attention | Multiple parallel attention heads | Transformer |
| Efficiency | Sparse attention | Attends to a subset of positions | Long sequences |
| | Flash attention | Optimized computation and memory access | Speeding up Transformers |
| Special Properties | Masked attention | Blocks future positions | Autoregressive models |
| | Hard attention | Selects a single position | Interpretability |
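
Several of these categories differ only in which entries of the score matrix are allowed before the softmax. As a rough sketch (the helper names and the additive-mask convention are assumptions, not from the table above), masked attention and local attention can both be expressed as additive masks on the scores:

```python
import numpy as np

def causal_mask(n):
    # Masked (causal) attention: position i may only attend to positions <= i.
    # Disallowed entries get -inf so the softmax assigns them zero weight.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    return np.where(future, -np.inf, 0.0)

def local_window_mask(n, window):
    # Local attention: each position attends only to neighbors within `window`.
    idx = np.arange(n)
    far = np.abs(idx[:, None] - idx[None, :]) > window
    return np.where(far, -np.inf, 0.0)

# Usage with the scaled_dot_product_attention sketch above:
#   scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])
#   weights = softmax(scores, axis=-1)
print(causal_mask(4))
print(local_window_mask(5, window=1))
```

Adding one of these masks to the scores before the softmax zeroes out the disallowed attention weights, which is how causal (masked) and windowed (local) attention are typically implemented in practice.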