Topics
The goal of linear attention (an algorithmic optimization) is different from implementation-level optimizations such as FlashAttention.
FlashAttention
- Hardware-aware algorithm optimizes standard attention mechanism
- Avoids explicit attention matrix materialization in High Bandwidth Memory (HBM)
- Uses techniques like tiling, recomputation within faster on-chip Static Random-Access Memory (SRAM)
- Effectively reduces the memory footprint to O(N) during execution
- Does not change the underlying computational complexity; still performs O(N²) computations (see the tiling sketch after this list)
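A minimal NumPy sketch of the tiling idea: scores are computed one query/key block at a time, and a running (online) softmax avoids ever materializing the full N × N attention matrix. The block size, single-head shapes, and function name are illustrative assumptions; the real kernel also manages SRAM placement and backward-pass recomputation, which this sketch omits.

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Softmax attention over (N, d) inputs using O(N) extra memory per query block."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    for qs in range(0, N, block):
        q = Q[qs:qs + block] * scale                # (bq, d) query tile
        acc = np.zeros((q.shape[0], d))             # weighted-value accumulator
        row_max = np.full(q.shape[0], -np.inf)      # running max per query row
        row_sum = np.zeros(q.shape[0])              # running softmax denominator
        for ks in range(0, N, block):
            s = q @ K[ks:ks + block].T              # (bq, bk) score tile only
            new_max = np.maximum(row_max, s.max(axis=1))
            # Rescale earlier partial results to the new running max for stability.
            correction = np.exp(row_max - new_max)
            p = np.exp(s - new_max[:, None])        # tile-local exponentiated scores
            acc = acc * correction[:, None] + p @ V[ks:ks + block]
            row_sum = row_sum * correction + p.sum(axis=1)
            row_max = new_max
        out[qs:qs + block] = acc / row_sum[:, None]
    return out

# Matches a naive quadratic reference on random inputs.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref)
```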
Linear Attention
- Aims to reduce both computational and memory complexity to O(N)
- Achieved by reformulating the attention calculation itself (see the sketch after this list)
- Relevant where even O(N²) compute becomes prohibitive
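A minimal NumPy sketch of the reformulation, assuming the common kernel-feature-map variant (e.g. φ(x) = elu(x) + 1, as in Katharopoulos et al.): softmax(QKᵀ)V is replaced by φ(Q)(φ(K)ᵀV) normalized by φ(Q)(φ(K)ᵀ1). Because the key sum is computed first, cost and memory grow linearly in sequence length N rather than quadratically. The function names are assumptions, and this non-causal version omits the cumulative-sum form needed for autoregressive decoding.

```python
import numpy as np

def elu_plus_one(x):
    """Positive feature map used by one popular linear-attention variant."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """O(N * d^2) attention: associativity lets us reduce over keys first."""
    q, k = elu_plus_one(Q), elu_plus_one(K)   # (N, d) feature-mapped queries/keys
    kv = k.T @ V                              # (d, d_v) key-value summary; no N x N matrix
    z = k.sum(axis=0)                         # (d,) normalizer summary
    return (q @ kv) / (q @ z)[:, None]        # (N, d_v), linear in N

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((1024, 64)) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # (1024, 64)
```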