attention mechanism quadratic complexity

Topics

matrices

attention mechanism

The standard attention mechanism has $O (N^{2})$ complexity in the sequence length $N$ because:

Time Complexity: The matrix multiplication $Q K^{T}$ takes the dot product between every query vector $Q_{i}$ and every key vector $K_{j}$, producing an $N \times N$ matrix of similarity scores at a cost of $O (N^{2} d_{k})$ FLOPs. Multiplying the $N \times N$ attention weight matrix by the value matrix $V$ costs a further $O (N^{2} d_{v})$ FLOPs. The overall time complexity is therefore $O (N^{2} (d_{k} + d_{v}))$, which reduces to $O (N^{2})$ when $d_{k}$ and $d_{v}$ are fixed dimensions independent of $N$.
Memory Complexity: The $N \times N$ attention score matrix and the $N \times N$ attention weight matrix ( $\mathrm{softmax} (...)$ ) each require $O (N^{2})$ memory.
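
The two costs above can be made concrete with a minimal sketch in plain Python. It assumes the usual scaled dot-product form (the $1 / \sqrt{d_{k}}$ scaling is not stated in this note), and the function name `naive_attention` and the multiply-add counter `macs` are hypothetical names introduced only for illustration. The nested loop over queries and keys is where the $N \times N$ matrix comes from.

```python
import math
import random

def naive_attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors.

    Returns the output rows, the full N x N weight matrix, and the number
    of multiply-adds spent in the two matrix products.
    """
    n, d_k, d_v = len(Q), len(Q[0]), len(V[0])
    macs = 0
    weights = []
    for q in Q:                      # N queries ...
        row = []
        for k in K:                  # ... times N keys: N^2 dot products
            row.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
            macs += d_k
        m = max(row)                 # subtract the max for a stable softmax
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        weights.append([e / z for e in exps])
    out = []
    for w in weights:                # (N x N) weights times (N x d_v) values
        out.append([sum(w[j] * V[j][c] for j in range(n)) for c in range(d_v)])
        macs += n * d_v
    return out, weights, macs

def random_rows(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_k, d_v = 4, 3
for n in (8, 16, 32):
    Q, K, V = random_rows(rng, n, d_k), random_rows(rng, n, d_k), random_rows(rng, n, d_v)
    _, weights, macs = naive_attention(Q, K, V)
    print(n, len(weights) * len(weights[0]), macs)
```

With $d_{k}$ and $d_{v}$ held fixed, doubling $N$ quadruples both the number of stored weights and the multiply-add count, which equals $N^{2} (d_{k} + d_{v})$ in this sketch.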

This quadratic scaling severely restricts the maximum sequence length $N$ that can be practically processed. For instance, processing sequences longer than a few thousand tokens (as of writing) becomes prohibitively expensive in terms of both computation time and, often more critically, memory consumption.
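
To give a rough sense of the memory side, the sketch below computes the size of a single $N \times N$ matrix. It assumes 4-byte (32-bit float) entries, one attention head, and one sequence; these are illustrative assumptions rather than figures from this note, and `attention_matrix_bytes` is a hypothetical helper name. Real models multiply this by the number of heads and the batch size.

```python
def attention_matrix_bytes(n, bytes_per_element=4):
    """Bytes needed to hold one N x N matrix (scores or weights)."""
    return n * n * bytes_per_element

for n in (1024, 4096, 16384, 65536):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"N={n:>6}: {mib:>8.0f} MiB for one N x N matrix")
```

Under these assumptions the matrix grows from 4 MiB at $N = 1024$ to 16384 MiB (16 GiB) at $N = 65536$: a 64-fold increase in sequence length costs 4096 times the memory.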

linear attention math formulation (reduces complexity to $O (N)$ )

Altamash Khan
