Topics

The standard attention mechanism faces a significant challenge: its computational and memory requirements scale quadratically, $O(N^2)$, with the length $N$ of the input sequence. This complexity is a major bottleneck, severely limiting the application of standard transformer models to tasks involving very long sequences.

Note

While both time and memory complexity scale as $O(N^2)$, the memory requirement is frequently cited as the more immediate practical bottleneck on modern hardware accelerators such as GPUs, which have limited high-bandwidth memory (HBM).
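To put the memory cost in perspective, here is a small back-of-the-envelope calculation in Python. The sequence length and 16-bit precision are illustrative assumptions, not figures from the text.

```python
# Illustrative calculation (assumed values): memory needed to materialize
# a single N x N attention matrix in 16-bit precision.
seq_len = 16_384          # example sequence length
bytes_per_element = 2     # fp16 / bf16
attn_matrix_bytes = seq_len ** 2 * bytes_per_element
print(f"{attn_matrix_bytes / 2**20:.0f} MiB per head")  # -> 512 MiB
```

At this (still moderate) sequence length, every attention head in every layer would need roughly half a gigabyte just to hold its attention matrix, which is why memory tends to dominate the practical limits.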

To overcome this critical limitation, a class of efficient attention mechanisms known as Linear Attention has emerged. These methods circumvent the quadratic bottleneck by approximating or reformulating the standard attention computation to achieve linear, $O(N)$, or near-linear time and memory complexity.

The fundamental strategy is to avoid explicitly computing and materializing the full $N \times N$ attention matrix. This is typically achieved by reordering the operations in the attention calculation, often by combining kernel feature maps with the associative property of matrix multiplication.
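To make the reordering concrete, here is a minimal NumPy sketch contrasting standard attention with a kernelized linear-attention variant. The feature map (ELU + 1), the shapes, and the function names are illustrative assumptions rather than any specific library's API.

```python
import numpy as np

def feature_map(x):
    # ELU(x) + 1: a positive feature map used in some linear-attention
    # formulations (an assumption here; other kernels are possible).
    return np.where(x > 0, x + 1.0, np.exp(x))

def standard_attention(Q, K, V):
    # Materializes the full N x N attention matrix: O(N^2) time and memory.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, eps=1e-6):
    # Reorders the computation via associativity:
    #   (phi(Q) phi(K)^T) V  ->  phi(Q) (phi(K)^T V)
    # The N x N matrix is never formed; cost scales linearly in N.
    Qf, Kf = feature_map(Q), feature_map(K)
    kv = Kf.T @ V                      # (d, d_v) summary of keys and values
    normalizer = Qf @ Kf.sum(axis=0)   # (N,) per-query normalization
    return (Qf @ kv) / (normalizer[:, None] + eps)

# Toy usage: shapes agree, but outputs differ because linear attention
# approximates (rather than reproduces) the softmax weighting.
rng = np.random.default_rng(0)
N, d = 128, 16
Q, K, V = rng.normal(size=(3, N, d))
print(standard_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```

The key design point is that `Kf.T @ V` compresses all keys and values into a small $d \times d_v$ summary, so each query interacts with a fixed-size state instead of with every other position.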