Linear attention reduces the computational complexity of standard attention from $O(N^2)$ to $O(N)$ by kernelizing the softmax similarity and exploiting the associativity of matrix multiplication.

Standard scaled dot-product attention output for query $q_i$:

$$y_i = \sum_{j=1}^{N} \frac{\exp\!\left(q_i k_j^\top / \sqrt{d_k}\right)}{\sum_{l=1}^{N} \exp\!\left(q_i k_l^\top / \sqrt{d_k}\right)}\, v_j$$

$q_i, k_j$ are $1 \times d_k$ row vectors, $v_j$ is a $1 \times d_v$ row vector, and the output $y_i$ is a $1 \times d_v$ row vector. $q_i k_j^\top$ is a scalar dot product.
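A minimal NumPy sketch of this formula for all queries at once (the function and variable names are illustrative, not from the note):

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention, O(N^2) in sequence length N.

    Q, K: (N, d_k) with rows q_i, k_j; V: (N, d_v) with rows v_j.
    Returns Y: (N, d_v) whose row i is y_i.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (N, N): q_i k_j^T / sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over j
    return weights @ V                              # weighted sum of the v_j
```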

We can generalize the dot-product-and-softmax combination using a similarity function $\text{sim}(\cdot,\cdot)$:

$$y_i = \frac{\sum_{j=1}^{N} \text{sim}(q_i, k_j)\, v_j}{\sum_{j=1}^{N} \text{sim}(q_i, k_j)}$$

Choosing $\text{sim}(q, k) = \exp\!\left(q k^\top / \sqrt{d_k}\right)$ recovers standard softmax attention.

Linear attention approximates this similarity function with a kernel that decomposes via a feature map $\phi$:

$$\text{sim}(q, k) \approx \phi(q)\, \phi(k)^\top$$

$\phi$ maps a $1 \times d_k$ row vector to a $1 \times d_r$ row vector.
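The note does not fix a particular $\phi$; one common choice in the linear attention literature is $\phi(x) = \mathrm{elu}(x) + 1$ (Katharopoulos et al., 2020), which keeps features strictly positive so the normalization term never vanishes, with $d_r = d_k$. A sketch under that assumption:

```python
def elu_plus_one(x):
    """One common feature map phi: elu(x) + 1 (an assumption; the note above
    does not prescribe a specific phi). Strictly positive output keeps the
    denominator nonzero. Maps (..., d_k) -> (..., d_k), so d_r = d_k."""
    return np.where(x > 0, x + 1.0, np.exp(x))
```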

Substituting into the generalized attention:

$$y_i = \frac{\sum_{j=1}^{N} \phi(q_i)\, \phi(k_j)^\top\, v_j}{\sum_{j=1}^{N} \phi(q_i)\, \phi(k_j)^\top}$$

Using the linearity of the sum and the associative property of matrix multiplication, $\phi(q_i)$ can be factored out of both sums:

$$y_i = \frac{\phi(q_i) \left( \sum_{j=1}^{N} \phi(k_j)^\top v_j \right)}{\phi(q_i) \left( \sum_{j=1}^{N} \phi(k_j)^\top \right)}$$

where:

  • $\phi(q_i)$ is a $1 \times d_r$ row vector
  • $\phi(k_j)^\top$ is a $d_r \times 1$ column vector
  • $v_j$ is a $1 \times d_v$ row vector
  • The sum $S = \sum_{j=1}^{N} \phi(k_j)^\top v_j$ is a $d_r \times d_v$ matrix, also called the context summary matrix
  • The sum $z = \sum_{j=1}^{N} \phi(k_j)^\top$ is a $d_r \times 1$ column vector

Note

The terms $S = \sum_j \phi(k_j)^\top v_j$ (context summary) and $z = \sum_j \phi(k_j)^\top$ (key summary) do not depend on the query index $i$, so they can be precomputed once and reused for every query, as the sketch below demonstrates.
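A per-query sketch showing $S$ and $z$ computed once and shared across queries (names are illustrative):

```python
def linear_attention_summaries(K, V, phi):
    """Precompute the summaries once for the whole sequence."""
    Kp = phi(K)                  # (N, d_r): row j is phi(k_j)
    S = Kp.T @ V                 # (d_r, d_v) context summary: sum_j phi(k_j)^T v_j
    z = Kp.sum(axis=0)           # (d_r,)    key summary:      sum_j phi(k_j)^T
    return S, z

def attend_one_query(q, S, z, phi):
    """y_i = (phi(q_i) S) / (phi(q_i) z): only O(d_r * d_v) work per query."""
    qp = phi(q)                  # (d_r,)
    return (qp @ S) / (qp @ z)   # (d_v,): row-vector numerator / scalar denominator
```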

Vectorized form (efficient), with $Q' = \phi(Q) \in \mathbb{R}^{N \times d_r}$, $K' = \phi(K) \in \mathbb{R}^{N \times d_r}$, and $V \in \mathbb{R}^{N \times d_v}$:

  • Global key-value (context) summary: $S = K'^\top V \in \mathbb{R}^{d_r \times d_v}$
  • Global key summary: $z = K'^\top \mathbf{1}_N \in \mathbb{R}^{d_r \times 1}$, where $\mathbf{1}_N$ is an $N \times 1$ column vector of ones
  • Numerator: $Q' S \in \mathbb{R}^{N \times d_v}$
  • Denominator (normalization term): $Q' z \in \mathbb{R}^{N \times 1}$
  • Output: $Y = (Q' S) \oslash (Q' z)$, where $\oslash$ is element-wise division and $Q' z$ is broadcast across the $d_v$ columns
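Putting the vectorized form together as a NumPy sketch (again, the names and the default choice of $\phi$ are assumptions, not prescribed by the note):

```python
def linear_attention(Q, K, V, phi=elu_plus_one):
    """Vectorized linear attention: O(N d_r d_v) time, O(d_r d_v) extra space."""
    Qp, Kp = phi(Q), phi(K)             # Q', K': (N, d_r)
    S = Kp.T @ V                        # (d_r, d_v) global context summary K'^T V
    z = Kp.T @ np.ones(K.shape[0])      # (d_r,)    global key summary    K'^T 1_N
    num = Qp @ S                        # (N, d_v)  numerator Q' S
    den = Qp @ z                        # (N,)      normalization term Q' z
    return num / den[:, None]           # broadcasted element-wise division

# Usage with random inputs; note that no N x N matrix is ever materialized.
rng = np.random.default_rng(0)
N, d_k, d_v = 1024, 64, 64
Q = rng.standard_normal((N, d_k))
K = rng.standard_normal((N, d_k))
V = rng.standard_normal((N, d_v))
Y = linear_attention(Q, K, V)           # (N, d_v)
```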

Complexity Analysis

  • Computing $Q' = \phi(Q)$ and $K' = \phi(K)$: $O(N d_k)$ or $O(N d_r)$ depending on $\phi$
  • Computing $S = K'^\top V$: requires $O(N d_r d_v)$ time
  • Computing $z = K'^\top \mathbf{1}_N$: requires $O(N d_r)$ time
  • Computing the numerator $Q' S$: $O(N d_r d_v)$
  • Computing the denominator $Q' z$: $O(N d_r)$
  • Final element-wise division: $O(N d_v)$

Time complexity: $O(N d_r d_v)$ overall. If $d_r$ and $d_v$ are constant with respect to $N$, the complexity is $O(N)$.
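As an illustrative order-of-magnitude check (numbers chosen here, not from the note): with $N = 4096$ and $d_k = d_r = d_v = 64$, standard attention costs on the order of $N^2 d_k \approx 1.1 \times 10^9$ multiply-adds, while the linear form costs roughly $2 N d_r d_v \approx 3.4 \times 10^7$, about a $30\times$ reduction that grows linearly with $N$.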

Space complexity: Determined by the need to store $S$ ($d_r \times d_v$), which requires $O(d_r d_v)$ memory, i.e. $O(1)$ if $d_r$ and $d_v$ are constant with respect to $N$.