Topics
Let’s categorize attention mechanisms along five key axes:
- Input Scope: How much of the input is attended to (e.g., local, global)
- Score Computation: How attention scores are calculated (e.g., dot product, additive)
- Architecture/Application: How attention is applied (e.g., self-attention, cross-attention)
- Efficiency/Optimization: Techniques to reduce computational cost (e.g., sparse, flash)
- Special Properties: Unique constraints or behaviors (e.g., masked, hard)
 
Most modern attention mechanisms build on the query-key-value (QKV) framework (see the sketch after this list):
- Queries (Q): What we’re attending from
- Keys (K): What we’re attending to
- Values (V): What we retrieve
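
As a concrete reference for the table below, here is a minimal sketch of scaled dot-product attention expressed in terms of Q, K, and V. It is written in NumPy with illustrative shapes; the optional `mask` argument is an assumption that is reused after the table.

```python
# A minimal NumPy sketch of scaled dot-product attention in the QKV framework.
# The shapes and the toy self-attention call below are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # how well each query matches each key
    if mask is not None:
        scores = scores + mask                # -inf entries block attention
    weights = softmax(scores, axis=-1)        # attention distribution over positions
    return weights @ V                        # weighted sum of the values

# Toy self-attention: Q, K, V all come from the same 4-token sequence (d = 8).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (4, 8)
```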
 
| Category | Type | Key Feature | Use Case |
|---|---|---|---|
| Input Scope | Local attention | Fixed window around each position | CNN-style local processing |
| Input Scope | Global attention | Full sequence considered | Transformer |
| Score Computation | Dot product | Unscaled query-key dot product | Transformer |
| Score Computation | Scaled dot product | Dot product divided by √d_k | Transformer |
| Score Computation | Multiplicative (Luong) | Scoring variants: dot, general, concat | RNN seq2seq |
| Score Computation | Additive (Bahdanau) | Small feed-forward network scores each query-key pair | Early seq2seq |
| Architecture | Cross-attention | Queries from decoder, keys/values from encoder | Encoder-decoder models (e.g., Transformer) |
| Architecture | Self-attention | Q, K, V from the same sequence | Transformer |
| Architecture | Multi-head attention | Multiple parallel attention heads | Transformer |
| Efficiency | Sparse attention | Attends to a subset of positions | Long sequences |
| Efficiency | Flash attention | IO-aware computation/memory optimization | Speeding up Transformers |
| Special Properties | Masked attention | Blocks future positions | Autoregressive models |
| Special Properties | Hard attention | Selects a single position | Interpretability |
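
To make the Input Scope and Special Properties rows concrete: global, local, and masked (causal) attention can all be expressed as different masks added to the score matrix before the softmax. The helper names and window size below are illustrative assumptions that plug into the sketch above.

```python
# Masks that plug into the scaled_dot_product_attention sketch above.
# Helper names and the default window size are illustrative assumptions.
import numpy as np

def causal_mask(seq_len):
    # Masked attention: position i may only attend to positions j <= i,
    # blocking the future for autoregressive models.
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def local_window_mask(seq_len, window=2):
    # Local attention: position i attends only within `window` steps of itself.
    idx = np.arange(seq_len)
    allowed = np.abs(idx[:, None] - idx[None, :]) <= window
    return np.where(allowed, 0.0, -np.inf)

# Global attention is simply mask=None (every position sees the full sequence),
# e.g. scaled_dot_product_attention(X, X, X, mask=causal_mask(len(X)))
print(causal_mask(4))
print(local_window_mask(6, window=1))
```

Sparse attention follows the same pattern, with a mask that keeps only a structured subset of positions, which is why it appears under Efficiency for long sequences.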