Topics

The choice between local attention and global attention is a fundamental trade-off, driven primarily by sequence length and the nature of the task.

Global attention provides the richest contextual information, but its quadratic cost makes it computationally prohibitive for long sequences. Local attention scales linearly and remains feasible for long inputs, but its fixed window limits its ability to capture long-range dependencies directly.
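As a rough illustration of this difference, the sketch below contrasts dense, all-pairs attention with a fixed-window variant in plain NumPy. The shapes, the window size w, and the function names are illustrative assumptions, not taken from any particular implementation.

```python
import numpy as np

def global_attention(q, k, v):
    # Every query attends to every key: the score matrix is n x n,
    # so compute and memory grow quadratically with sequence length n.
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def local_attention(q, k, v, w=4):
    # Each query attends only to keys within w positions on either side,
    # so per-query work is O(w) and total work is O(n * w).
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)
        scores = q[i] @ k[lo:hi].T / np.sqrt(d)       # at most 2w + 1 scores
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ v[lo:hi]
    return out

n, d = 16, 8
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, n, d))
print(global_attention(q, k, v).shape)   # (16, 8)
print(local_attention(q, k, v).shape)    # (16, 8)
```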

This trade-off motivates research into sparse attention mechanisms, which attempt to capture relevant long-range dependencies while maintaining sub-quadratic complexity.
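One widely used sparse pattern combines a local sliding window with a small set of globally attending positions (the Longformer-style pattern). The sketch below builds such a boolean attention mask; the window size and the choice of global positions are assumptions made for illustration.

```python
import numpy as np

def sparse_attention_mask(n, window=2, global_positions=(0,)):
    """Boolean mask where mask[i, j] is True if position i may attend to j."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True        # local window: O(n * window) entries
    for g in global_positions:
        mask[:, g] = True            # every position attends to the global token
        mask[g, :] = True            # the global token attends to every position
    return mask

mask = sparse_attention_mask(n=8, window=1, global_positions=(0,))
print(mask.astype(int))
# Non-zero entries stay O(n * (window + number of globals)) rather than O(n^2).
```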

| Feature | Local Attention (Fixed Window) | Global Attention |
| --- | --- | --- |
| Scope | Considers a small, fixed-size window (w) of the input sequence | Considers the entire input sequence (n) for every position |
| Computational Complexity | O(n · w) (linear in sequence length n) | O(n²) (quadratic in sequence length n) |
| Memory Complexity | O(n · w) (reduced memory footprint) | O(n²) (requires storing the full attention matrix) |
| Long-Range Dependency Capture | Limited; may miss dependencies outside the window | Effective; can capture dependencies across the whole sequence |
| Contextual Understanding | Localized; focuses on immediate surroundings | Global; broader understanding of the overall context |
| Typical Use Cases / Motivation | Long sequences, efficiency-critical tasks, data with strong locality | Tasks needing holistic understanding, shorter sequences |
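To make the complexity rows concrete, a back-of-the-envelope comparison of score-matrix sizes, using illustrative values n = 8192 and w = 256 (not from the source):

```python
# Example sizes of the attention-score matrix per head for the two schemes.
n, w = 8192, 256
global_entries = n * n            # every query scores every key
local_entries = n * (2 * w + 1)   # each query scores only its window
print(f"global: {global_entries:,} score entries")           # 67,108,864
print(f"local:  {local_entries:,} score entries")            # 4,202,496
print(f"reduction: {global_entries / local_entries:.1f}x")   # ~16.0x
```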