Topics
Standard attention mechanism has $O(n^2)$ complexity because:
- Time Complexity: the matrix multiplication $QK^\top$ (a dot product between every query vector and every key vector) produces an $n \times n$ matrix of similarity scores, which takes $O(n^2 d_k)$ FLOPs. Multiplying the $n \times n$ attention weight matrix by the value matrix $V$ takes $O(n^2 d_v)$ time. Overall time complexity is $O(n^2)$, assuming $d_k$ and $d_v$ are fixed dimensions independent of $n$.
- Memory Complexity: the attention score matrix $QK^\top$ and the attention weight matrix $\mathrm{softmax}(QK^\top)$ are each $n \times n$, requiring $O(n^2)$ memory.
This quadratic scaling severely restricts the maximum sequence length that can be practically processed. For instance, processing sequences longer than a few thousand tokens (as of writing) becomes prohibitively expensive in terms of both computation time and, often more critically, memory consumption.
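A minimal NumPy sketch of where the quadratic cost comes from (shapes and names are illustrative, not a production implementation): the intermediate $n \times n$ weight matrix is materialized explicitly, which dominates both FLOPs and memory.

```python
import numpy as np

def attention(Q, K, V):
    """Standard scaled dot-product attention. Q, K: (n, d_k); V: (n, d_v)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (n, n) matrix: O(n^2 d_k) FLOPs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax over keys
    return weights @ V                               # (n, d_v): O(n^2 d_v) FLOPs

n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V)
# The (n, n) intermediate is the quadratic memory cost:
# n = 1024 -> ~1M floats; n = 32768 -> ~1B floats (~4 GB in float32).
```

Doubling $n$ quadruples both the score-matrix size and the FLOP count, which is why memory usually becomes the binding constraint first.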
Related
- linear attention math formulation (reduces complexity to $O(n)$)
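For context on that related note, a minimal sketch of the associativity trick linear attention relies on: replace the softmax with a feature map $\phi$, then compute $\phi(Q)\,(\phi(K)^\top V)$ instead of $(\phi(Q)\phi(K)^\top)\,V$, so no $n \times n$ matrix is ever formed. The ReLU-based `phi` below is a stand-in assumption; actual formulations vary by paper (e.g. $\mathrm{elu}+1$).

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Linear-attention sketch: phi is an assumed ReLU-like feature map."""
    Qp, Kp = phi(Q), phi(K)                        # (n, d) each
    KV = Kp.T @ V                                  # (d, d): O(n d^2), independent of n^2
    Z = Qp @ Kp.sum(axis=0, keepdims=True).T       # (n, 1) normalizer (row sums)
    return (Qp @ KV) / Z                           # (n, d_v), total O(n d^2)

n, d = 512, 64
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = linear_attention(Q, K, V)
```

Because matrix multiplication is associative, this produces the same result as the quadratic-order evaluation with the same feature map, but in $O(n d^2)$ time and $O(nd + d^2)$ memory.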