Topics
The choice between local attention and global attention is a fundamental trade-off driven by sequence length and task requirements.
Global attention provides the richest contextual information but is computationally prohibitive for long sequences. Local attention offers linear scalability, making long inputs feasible, but its fixed window limits the ability to capture long-range dependencies directly.
This trade-off motivates research into sparse attention mechanisms, which attempt to capture relevant long-range dependencies while maintaining sub-quadratic complexity.
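The sketch below contrasts the two in plain NumPy: the global version materializes the full (n, n) score matrix, while the local version only scores keys inside a fixed window of radius `w` around each query. Function names and the window parameter are illustrative, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_attention(q, k, v):
    # Every query attends to every key: O(n^2) scores and memory.
    scores = q @ k.T / np.sqrt(q.shape[-1])        # shape (n, n)
    return softmax(scores, axis=-1) @ v

def local_attention(q, k, v, w=4):
    # Each query attends only to keys within a fixed window of size 2*w + 1,
    # so work and memory grow as O(n * w) instead of O(n^2).
    n, d = q.shape
    out = np.empty_like(v)
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)
        scores = q[i] @ k[lo:hi].T / np.sqrt(d)    # shape (window,)
        out[i] = softmax(scores) @ v[lo:hi]
    return out

n, d = 16, 8
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, n, d))
print(global_attention(q, k, v).shape)   # (16, 8)
print(local_attention(q, k, v).shape)    # (16, 8)
```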
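As a rough illustration of the sparse idea, the snippet below builds a boolean attention mask that combines a local band with a few designated "global" positions that attend everywhere, loosely in the spirit of Longformer/BigBird. The function name, window size, and choice of global tokens are assumptions for illustration only.

```python
import numpy as np

def sparse_attention_mask(n, window=2, global_tokens=(0,)):
    # True entries mark allowed query->key attention pairs.
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True      # local band around each position
    for g in global_tokens:
        mask[g, :] = True          # global token attends to everything
        mask[:, g] = True          # everything attends to the global token
    return mask

print(sparse_attention_mask(8).astype(int))
```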
| Feature | Local Attention (Fixed Window) | Global Attention |
| --- | --- | --- |
| Scope | Each position attends to a small, fixed-size window (w) of the input sequence | Each position attends to the entire input sequence (n) |
| Computational Complexity | O(n · w) (linear in sequence length n) | O(n²) (quadratic in sequence length n) |
| Memory Complexity | O(n · w) (reduced memory footprint) | O(n²) (requires storing the full attention matrix) |
| Long-Range Dependency Capture | Limited; may miss dependencies outside the window | Effective; can capture dependencies across the whole sequence |
| Contextual Understanding | Localized; focuses on immediate surroundings | Global; broader understanding of overall context |
| Typical Use Cases/Motivation | Long sequences, efficiency-critical tasks, data with strong locality | Tasks requiring holistic understanding, shorter sequences |
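To make the complexity rows concrete, a quick back-of-envelope comparison (the sequence length and window size here are hypothetical):

```python
# Score-matrix sizes for n = 16,384 tokens with a window of w = 512.
n, w = 16_384, 512
print(f"global: {n * n:,} score entries")   # 268,435,456  -> O(n^2)
print(f"local:  {n * w:,} score entries")   # 8,388,608    -> O(n * w)
```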