Linear attention reduces the computational complexity of standard attention from $O(N^2)$ to $O(N)$ using kernelization and the associativity of matrix multiplication.
Standard scaled dot-product attention output for query $q_i$:

$$o_i = \frac{\sum_{j=1}^{N} \exp\left(q_i k_j^\top / \sqrt{d}\right) v_j}{\sum_{j=1}^{N} \exp\left(q_i k_j^\top / \sqrt{d}\right)}$$

$q_i, k_j$ are row vectors ($\mathbb{R}^{1 \times d}$), $v_j$ is a row vector ($\mathbb{R}^{1 \times d}$), and the output $o_i$ is a row vector ($\mathbb{R}^{1 \times d}$). $q_i k_j^\top$ is a scalar dot product.
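For reference, a minimal NumPy sketch of this per-query computation (function name and shapes are illustrative assumptions, not from the source):

```python
import numpy as np

def softmax_attention_single_query(q_i, K, V):
    # q_i: (d,) query; K: (N, d) keys; V: (N, d) values.
    d = q_i.shape[0]
    scores = K @ q_i / np.sqrt(d)             # (N,) scalar dot products q_i k_j^T / sqrt(d)
    weights = np.exp(scores - scores.max())   # softmax numerator (max-shifted for numerical stability)
    weights /= weights.sum()                  # normalize by the denominator
    return weights @ V                        # (d,) output row vector o_i
```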
We can generalize this (the dot-product-and-softmax combination) using a similarity function $\text{sim}(q_i, k_j)$; standard attention corresponds to $\text{sim}(q_i, k_j) = \exp(q_i k_j^\top / \sqrt{d})$:

$$o_i = \frac{\sum_{j=1}^{N} \text{sim}(q_i, k_j)\, v_j}{\sum_{j=1}^{N} \text{sim}(q_i, k_j)}$$
Linear attention approximates this similarity function using a kernel that decomposes via a feature map $\phi$:

$$\text{sim}(q_i, k_j) \approx \phi(q_i)\, \phi(k_j)^\top$$

$\phi$ maps a $1 \times d$ row vector to a $1 \times m$ row vector.
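As a concrete choice, one feature map used in the linear attention literature is $\phi(x) = \text{elu}(x) + 1$ (so $m = d$); a minimal sketch:

```python
import numpy as np

def phi(x):
    # Feature map phi(x) = elu(x) + 1, a common choice in the linear
    # attention literature. Outputs are strictly positive, which keeps
    # the normalization denominator positive. Here m = d.
    return np.where(x > 0, x + 1.0, np.exp(x))
```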
Substitute into the generalized attention:

$$o_i = \frac{\sum_{j=1}^{N} \phi(q_i)\, \phi(k_j)^\top v_j}{\sum_{j=1}^{N} \phi(q_i)\, \phi(k_j)^\top}$$
Using the linearity of the dot product and the associative property of matrix multiplication, $\phi(q_i)$ factors out of both sums:

$$o_i = \frac{\phi(q_i) \left( \sum_{j=1}^{N} \phi(k_j)^\top v_j \right)}{\phi(q_i) \left( \sum_{j=1}^{N} \phi(k_j)^\top \right)}$$
where:
- $\phi(q_i)$ is a $1 \times m$ row vector
- $\phi(k_j)^\top$ is an $m \times 1$ column vector
- $v_j$ is a $1 \times d$ row vector
- $\sum_{j=1}^{N} \phi(k_j)^\top v_j$ is an $m \times d$ matrix, also called the context-summary matrix
- $\sum_{j=1}^{N} \phi(k_j)^\top$ is an $m \times 1$ column vector (the key summary)
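A minimal per-query sketch of this factorization, assuming the `phi` from above (names are illustrative):

```python
import numpy as np

def linear_attention_single_query(q_i, K, V, phi):
    # q_i: (d,) query; K: (N, d) keys; V: (N, d) values;
    # phi: feature map applied row-wise, (..., d) -> (..., m).
    K_feat = phi(K)                       # (N, m)
    S = K_feat.T @ V                      # (m, d) context-summary matrix
    z = K_feat.sum(axis=0)                # (m,)  key summary
    q_feat = phi(q_i)                     # (m,)
    return (q_feat @ S) / (q_feat @ z)    # (d,) output row vector o_i
```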
Note
The terms $\sum_{j} \phi(k_j)^\top v_j$ (context summary) and $\sum_{j} \phi(k_j)^\top$ (key summary) do not depend on the query index $i$, so they can be pre-computed once and reused for every query.
Vectorized form (efficient), with $Q' = \phi(Q) \in \mathbb{R}^{N \times m}$, $K' = \phi(K) \in \mathbb{R}^{N \times m}$ ($\phi$ applied row-wise), and $V \in \mathbb{R}^{N \times d}$:
- Global key-value (context) summary: $S = K'^\top V \in \mathbb{R}^{m \times d}$
- Global key summary: $z = K'^\top \mathbf{1}_N \in \mathbb{R}^{m \times 1}$, where $\mathbf{1}_N$ is a column vector of ones
- Numerator: $Q' S \in \mathbb{R}^{N \times d}$
- Denominator (normalization term): $Q' z \in \mathbb{R}^{N \times 1}$
- Output: $O = (Q' S) \oslash (Q' z)$, where $\oslash$ is element-wise division and $Q' z$ is broadcast across the $d$ columns
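A NumPy sketch of the vectorized form under the same assumptions (illustrative names; `phi` as above):

```python
import numpy as np

def linear_attention(Q, K, V, phi):
    # Q, K: (N, d); V: (N, d); phi applied row-wise, (N, d) -> (N, m).
    Qf = phi(Q)                     # Q' = phi(Q), (N, m)
    Kf = phi(K)                     # K' = phi(K), (N, m)
    S = Kf.T @ V                    # K'^T V, (m, d) context summary
    z = Kf.sum(axis=0)              # K'^T 1_N, (m,) key summary
    num = Qf @ S                    # numerator, (N, d)
    den = Qf @ z                    # denominator, (N,)
    return num / den[:, None]       # element-wise division, broadcast over columns
```

Note that no $N \times N$ intermediate ever appears: the summaries $S$ and $z$ compress the keys and values before the queries touch them.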
Complexity Analysis
- Computing $Q' = \phi(Q)$ and $K' = \phi(K)$: $O(Nd)$ or $O(Nm)$, depending on $\phi$
- Computing $S = K'^\top V$: requires $O(Nmd)$ time
- Computing $z = K'^\top \mathbf{1}_N$: requires $O(Nm)$ time
- Computing the numerator $Q' S$: $O(Nmd)$
- Computing the denominator $Q' z$: $O(Nm)$
- Final element-wise division: $O(Nd)$
Time complexity: $O(Nmd)$. If $m$ and $d$ are constant with respect to $N$, the complexity is $O(N)$.
Space complexity: determined by the need to store the $m \times d$ summary $S$, which requires $O(md)$, i.e. $O(1)$ with respect to $N$ if $m$ and $d$ are constant.
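As a rough worked example (hypothetical sizes): with $N = 4096$ and $m = d = 64$, standard attention spends on the order of $N^2 d \approx 1.1 \times 10^9$ multiply-adds on the score matrix alone, while linear attention needs on the order of $N m d \approx 1.7 \times 10^7$, a reduction by a factor of roughly $N/m = 64$.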