Linear attention reduces the computational complexity of standard attention from $O(N^2)$ to $O(N)$ using kernelization and the associativity of matrix multiplication.
Standard scaled dot-product attention output for query $q_i$:

$$o_i = \frac{\sum_{j=1}^{N} \exp\left(q_i k_j^\top / \sqrt{d}\right) v_j}{\sum_{j=1}^{N} \exp\left(q_i k_j^\top / \sqrt{d}\right)}$$

$q_i, k_j$ are row vectors ($\mathbb{R}^{1 \times d}$), $v_j$ is a row vector ($\mathbb{R}^{1 \times d}$), and the output $o_i$ is a row vector ($\mathbb{R}^{1 \times d}$). $q_i k_j^\top$ is a scalar dot product.
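For reference, a minimal NumPy sketch of this per-query computation (function name and shapes are illustrative assumptions, not from the source):

```python
import numpy as np

def softmax_attention_single_query(q_i, K, V):
    # q_i: (d,) query; K: (N, d) keys; V: (N, d) values.
    d = q_i.shape[0]
    scores = K @ q_i / np.sqrt(d)             # (N,) scalar dot products q_i k_j^T / sqrt(d)
    weights = np.exp(scores - scores.max())   # softmax numerator (max-shifted for numerical stability)
    weights /= weights.sum()                  # normalize by the denominator
    return weights @ V                        # (d,) output row vector o_i
```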
We can generalize this (the dot-product-and-softmax combination) using a similarity function $\text{sim}(q_i, k_j)$; standard attention corresponds to $\text{sim}(q_i, k_j) = \exp(q_i k_j^\top / \sqrt{d})$:

$$o_i = \frac{\sum_{j=1}^{N} \text{sim}(q_i, k_j)\, v_j}{\sum_{j=1}^{N} \text{sim}(q_i, k_j)}$$
Linear attention approximates this similarity function using a kernel that decomposes via a feature map $\phi$:

$$\text{sim}(q_i, k_j) \approx \phi(q_i)\, \phi(k_j)^\top$$

$\phi$ maps a $1 \times d$ row vector to a $1 \times m$ row vector.
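As a concrete choice, one feature map used in the linear attention literature is $\phi(x) = \text{elu}(x) + 1$ (so $m = d$); a minimal sketch:

```python
import numpy as np

def phi(x):
    # Feature map phi(x) = elu(x) + 1, a common choice in the linear
    # attention literature. Outputs are strictly positive, which keeps
    # the normalization denominator positive. Here m = d.
    return np.where(x > 0, x + 1.0, np.exp(x))
```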
Substitute into the generalized attention:

$$o_i = \frac{\sum_{j=1}^{N} \phi(q_i)\, \phi(k_j)^\top v_j}{\sum_{j=1}^{N} \phi(q_i)\, \phi(k_j)^\top}$$
Using the linearity of the dot product and the associative property of matrix multiplication, $\phi(q_i)$ factors out of both sums:

$$o_i = \frac{\phi(q_i) \left( \sum_{j=1}^{N} \phi(k_j)^\top v_j \right)}{\phi(q_i) \left( \sum_{j=1}^{N} \phi(k_j)^\top \right)}$$
where:
- $\phi(q_i)$ is a $1 \times m$ row vector
- $\phi(k_j)^\top$ is an $m \times 1$ column vector
- $v_j$ is a $1 \times d$ row vector
- $\sum_{j=1}^{N} \phi(k_j)^\top v_j$ is an $m \times d$ matrix, also called the context-summary matrix
- $\sum_{j=1}^{N} \phi(k_j)^\top$ is an $m \times 1$ column vector (the key summary)
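A minimal per-query sketch of this factorization, assuming the `phi` from above (names are illustrative):

```python
import numpy as np

def linear_attention_single_query(q_i, K, V, phi):
    # q_i: (d,) query; K: (N, d) keys; V: (N, d) values;
    # phi: feature map applied row-wise, (..., d) -> (..., m).
    K_feat = phi(K)                       # (N, m)
    S = K_feat.T @ V                      # (m, d) context-summary matrix
    z = K_feat.sum(axis=0)                # (m,)  key summary
    q_feat = phi(q_i)                     # (m,)
    return (q_feat @ S) / (q_feat @ z)    # (d,) output row vector o_i
```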
Note
The terms $\sum_{j} \phi(k_j)^\top v_j$ (context summary) and $\sum_{j} \phi(k_j)^\top$ (key summary) do not depend on the query index $i$, so they can be pre-computed once and reused for every query.
Vectorized form (efficient), with $Q' = \phi(Q) \in \mathbb{R}^{N \times m}$, $K' = \phi(K) \in \mathbb{R}^{N \times m}$ ($\phi$ applied row-wise), and $V \in \mathbb{R}^{N \times d}$:
- Global key-value (context) summary: $S = K'^\top V \in \mathbb{R}^{m \times d}$
- Global key summary: $z = K'^\top \mathbf{1}_N \in \mathbb{R}^{m \times 1}$, where $\mathbf{1}_N$ is a column vector of ones
- Numerator: $Q' S \in \mathbb{R}^{N \times d}$
- Denominator (normalization term): $Q' z \in \mathbb{R}^{N \times 1}$
- Output: $O = (Q' S) \oslash (Q' z)$, where $\oslash$ is element-wise division and $Q' z$ is broadcast across the $d$ columns
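A NumPy sketch of the vectorized form under the same assumptions (illustrative names; `phi` as above):

```python
import numpy as np

def linear_attention(Q, K, V, phi):
    # Q, K: (N, d); V: (N, d); phi applied row-wise, (N, d) -> (N, m).
    Qf = phi(Q)                     # Q' = phi(Q), (N, m)
    Kf = phi(K)                     # K' = phi(K), (N, m)
    S = Kf.T @ V                    # K'^T V, (m, d) context summary
    z = Kf.sum(axis=0)              # K'^T 1_N, (m,) key summary
    num = Qf @ S                    # numerator, (N, d)
    den = Qf @ z                    # denominator, (N,)
    return num / den[:, None]       # element-wise division, broadcast over columns
```

Note that no $N \times N$ intermediate ever appears: the summaries $S$ and $z$ compress the keys and values before the queries touch them.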
Complexity Analysis
- Computing $Q' = \phi(Q)$ and $K' = \phi(K)$: $O(Nd)$ or $O(Nm)$, depending on $\phi$
- Computing $S = K'^\top V$: requires $O(Nmd)$ time
- Computing $z = K'^\top \mathbf{1}_N$: requires $O(Nm)$ time
- Computing the numerator $Q' S$: $O(Nmd)$
- Computing the denominator $Q' z$: $O(Nm)$
- Final element-wise division: $O(Nd)$
Time complexity: $O(Nmd)$. If $m$ and $d$ are constant with respect to $N$, the complexity is $O(N)$.
Space complexity: determined by the need to store the $m \times d$ summary $S$, which requires $O(md)$, i.e. $O(1)$ with respect to $N$ if $m$ and $d$ are constant.
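As a rough worked example (hypothetical sizes): with $N = 4096$ and $m = d = 64$, standard attention spends on the order of $N^2 d \approx 1.1 \times 10^9$ multiply-adds on the score matrix alone, while linear attention needs on the order of $N m d \approx 1.7 \times 10^7$, a reduction by a factor of roughly $N/m = 64$.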