Topics

The goal of linear attention (an algorithmic optimization) is different from that of implementation-level optimizations like FlashAttention.

FlashAttention

  • Hardware-aware algorithm optimizes standard attention mechanism
  • Avoids explicit attention matrix materialization in High Bandwidth Memory (HBM)
  • Uses techniques like tiling and recomputation within faster on-chip Static Random-Access Memory (SRAM)
  • Effectively reduces the memory footprint from O(N²) to O(N) during execution
  • Does not change the underlying computational complexity; still performs O(N²) computations (a simplified tiling sketch follows this list)
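
A minimal NumPy sketch of the tiled, online-softmax pattern that FlashAttention builds on: attention is computed one block of K/V at a time, so only an (N × block) slice of scores ever exists, never the full N × N matrix. The function name `tiled_softmax_attention` and the `block_size` default are illustrative assumptions; a real FlashAttention kernel fuses these steps into a single GPU kernel operating in SRAM and also handles masking and the backward pass.

```python
import numpy as np

def tiled_softmax_attention(Q, K, V, block_size=128):
    """Illustrative sketch only (not FlashAttention itself): numerically
    stable attention computed block-by-block over K/V, so the full
    N x N score matrix is never materialized at once."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((N, V.shape[1]))

    # Running statistics for the online softmax: per-row max and sum.
    row_max = np.full(N, -np.inf)
    row_sum = np.zeros(N)

    for start in range(0, N, block_size):
        end = min(start + block_size, N)
        Kb, Vb = K[start:end], V[start:end]

        # Scores for this tile only: shape (N, block) instead of (N, N).
        S = (Q @ Kb.T) * scale

        # Online softmax update: rescale previous accumulators whenever
        # a larger row maximum appears in the new tile.
        new_max = np.maximum(row_max, S.max(axis=1))
        correction = np.exp(row_max - new_max)
        P = np.exp(S - new_max[:, None])

        out = out * correction[:, None] + P @ Vb
        row_sum = row_sum * correction + P.sum(axis=1)
        row_max = new_max

    return out / row_sum[:, None]
```

Up to floating-point error, `tiled_softmax_attention(Q, K, V)` matches the result of materializing `softmax(Q @ K.T / sqrt(d)) @ V` directly; only the peak memory differs.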

Linear Attention

  • Aims to reduce both computational complexity and memory complexity from O(N²) to O(N)
  • Achieved by reformulating the attention calculation itself
  • Relevant where even O(N²) compute becomes prohibitive (see the kernelized sketch after this list)
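
A minimal NumPy sketch of one common linear-attention formulation: softmax(QKᵀ)V is replaced by φ(Q)(φ(K)ᵀV) with a positive feature map φ, in the style of kernelized attention (e.g. Katharopoulos et al., 2020). The function name, the elu(x) + 1 feature map, and the `eps` stabilizer are illustrative assumptions, not a specific library's API.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized (linear) attention sketch: reorder the matrix products
    so the N x N attention matrix is never formed."""
    def phi(x):
        # elu(x) + 1 keeps the features strictly positive.
        return np.where(x > 0, x + 1.0, np.exp(x))

    Qf, Kf = phi(Q), phi(K)            # shape (N, d)

    # Compute phi(K)^T V first: a (d, d_v) summary instead of (N, N) scores.
    KV = Kf.T @ V                      # shape (d, d_v)
    normalizer = Qf @ Kf.sum(axis=0)   # shape (N,)

    return (Qf @ KV) / (normalizer[:, None] + eps)
```

The key design choice is the order of multiplication: computing φ(K)ᵀV first produces a fixed-size (d × d_v) summary, so the cost grows linearly with sequence length N rather than quadratically.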