In linear attention, we approximate the Q-K similarity function using kernels that can be decomposed:
$$\mathrm{sim}(q, k) \approx \phi(q)^\top \phi(k)$$
where $\phi$ represents a feature map. We are basically saying that Q-K similarity can be expressed as a dot product in the new space obtained after transforming them by $\phi$.
Note
This isn't exactly the kernel trick, since we explicitly compute the transform $\phi$ rather than working with the kernel implicitly
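To make the decomposition concrete, here is a minimal NumPy sketch (the feature map $\phi(x) = \mathrm{elu}(x) + 1$ is an arbitrary illustrative choice): because the similarity factorizes, we can compute $\phi(Q)\,(\phi(K)^\top V)$ instead of $(\phi(Q)\phi(K)^\top)\,V$ and never materialize the $N \times N$ similarity matrix.

```python
import numpy as np

def phi(x):
    # Illustrative feature map: elu(x) + 1 (any positive map would do for this sketch).
    return np.where(x > 0, x + 1.0, np.exp(x))

N, d = 512, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Quadratic route: materializes the N x N similarity matrix phi(Q) phi(K)^T.
quadratic = (phi(Q) @ phi(K).T) @ V        # O(N^2 d) time, O(N^2) memory

# Linear route: the same result, regrouped by associativity.
linear = phi(Q) @ (phi(K).T @ V)           # O(N d^2) time, O(d^2) memory

assert np.allclose(quadratic, linear)
```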
The choice of the feature map $\phi$ is critical. It must satisfy two potentially conflicting requirements:
- Approximation Quality: $\phi(q)^\top \phi(k)$ should be a reasonably good approximation of the original similarity score, which is typically related to $\exp(q^\top k / \sqrt{d})$. If the approximation is poor, the performance of the resulting attention mechanism may degrade significantly (a quick check is sketched after this list)
- Computational Feasibility: The feature map must be efficiently computable, and its dimension should not be excessively large
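The tension between these two requirements is easy to see with random features: more random features give a better approximation of the exponential (softmax) kernel, but also a higher-dimensional, more expensive feature map. A rough check, using a simplified positive random-feature map (the exact construction in Performer differs in its details):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

def positive_random_features(x, omega):
    # Simplified positive random-feature map: phi(x) = exp(omega x - |x|^2 / 2) / sqrt(m).
    # Its dot products approximate exp(q . k) in expectation over omega ~ N(0, I).
    m = omega.shape[0]
    return np.exp(x @ omega.T - 0.5 * np.sum(x**2, axis=-1, keepdims=True)) / np.sqrt(m)

# Target: the exponential similarity used by softmax attention, exp(q . k / sqrt(d)).
q = rng.standard_normal((32, d)) / d**0.25   # pre-scaling folds in the 1/sqrt(d) factor
k = rng.standard_normal((32, d)) / d**0.25
exact = np.exp(q @ k.T)

for m in (16, 64, 256, 1024):
    omega = rng.standard_normal((m, d))
    approx = positive_random_features(q, omega) @ positive_random_features(k, omega).T
    rel_err = (np.abs(approx - exact) / exact).mean()
    print(f"m={m:5d}  mean relative error: {rel_err:.3f}")
```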
Common choices for $\phi$ include simple element-wise activations such as $\mathrm{elu}(x) + 1$ or ReLU, the identity function (leading to unnormalized linear attention), random features designed to approximate the Gaussian or softmax kernel (as in Performer), or polynomial features.
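Illustrative definitions of a few of these choices (exact parameterizations vary across papers; random features are sketched above):

```python
import numpy as np

def elu_plus_one(x):
    # phi(x) = elu(x) + 1: a simple, strictly positive feature map.
    return np.where(x > 0, x + 1.0, np.exp(x))

def relu(x):
    # phi(x) = max(x, 0): cheap, but can zero out the similarity entirely.
    return np.maximum(x, 0.0)

def identity(x):
    # phi(x) = x: plain dot-product similarity, i.e. unnormalized linear attention.
    return x

def poly2(x):
    # Degree-2 polynomial features: all pairwise products x_i * x_j, so that
    # phi(q) . phi(k) = (q . k)^2. Note the feature dimension grows from d to d^2.
    return np.einsum('...i,...j->...ij', x, x).reshape(*x.shape[:-1], -1)
```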
An important question is how to handle the normalization inherent in the softmax function. Standard softmax ensures the attention weights for each query sum to 1, acting as a probability distribution over the value vectors. In the linear attention formulation
$$\mathrm{Attn}(Q, K, V)_i = \frac{\sum_j \phi(q_i)^\top \phi(k_j)\, v_j}{\sum_j \phi(q_i)^\top \phi(k_j)},$$
the denominator is only an approximation of the softmax normalizer, and if it becomes very small it can lead to numerical instability. A few methods to address this:
- omit the normalization (denominator) altogether and use alternative normalization schemes (see the sketch after this list)
- design feature maps with specific properties (e.g., strictly positive outputs) that keep the denominator well behaved
- introduce gating mechanisms
- use different approximation strategies, such as the Nyström method for matrix approximation (as in Nyströmformer)
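As a rough illustration of the first option and of the simplest stabilization (an explicit floor on the denominator), assuming the same $\mathrm{elu}(x) + 1$ feature map as above; gating and Nyström-style approximations are not shown:

```python
import numpy as np

def phi(x):
    # Same illustrative feature map as above: elu(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6, use_denominator=True):
    Qf, Kf = phi(Q), phi(K)
    num = Qf @ (Kf.T @ V)                     # numerator: phi(Q) (phi(K)^T V)
    if not use_denominator:
        # Drop the softmax-style normalization entirely and let a subsequent
        # normalization layer (e.g. LayerNorm) absorb the output scale.
        return num
    # Keep the denominator but clamp it away from zero, so a query whose
    # similarities are all tiny cannot blow up the output.
    Z = Qf @ Kf.sum(axis=0)                   # approximate softmax normalizer, shape (N,)
    return num / np.maximum(Z, eps)[:, None]
```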