Topics

Standard attention mechanism has complexity because:

  1. Time Complexity: matrix multiplication (dot product between every query vector and every key vector ) results in matrix of similarity scores, which takes FLOPs. Multiplying attention weight matrix by value matrix takes time. Overall time complexity is , assuming ​ and ​ are fixed dimensions independent of

  2. Memory Complexity: The attention score matrix and attention weight matrix () require memory

This quadratic scaling severely restricts the maximum sequence length that can be practically processed. For instance, processing sequences longer than a few thousand tokens (as of writing) becomes prohibitively expensive in terms of both computation time and, often more critically, memory consumption.