In linear attention, we approximate the Q-K similarity function using kernels that can be decomposed:
$$\mathrm{sim}(q, k) \approx \phi(q)^\top \phi(k)$$
where $\phi$ represents a feature map. We are basically saying that Q-K similarity can be expressed as a dot product in the new space obtained after transforming them by $\phi$.
Note
This isn't exactly the kernel trick, since we explicitly compute the transform $\phi$ rather than working with the kernel implicitly
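To make the decomposition concrete, here is a minimal NumPy sketch (the feature map $\phi(x) = \mathrm{elu}(x) + 1$ is an arbitrary illustrative choice): because the similarity factorizes, we can compute $\phi(Q)\,(\phi(K)^\top V)$ instead of $(\phi(Q)\phi(K)^\top)\,V$ and never materialize the $N \times N$ similarity matrix.

```python
import numpy as np

def phi(x):
    # Illustrative feature map: elu(x) + 1 (any positive map would do for this sketch).
    return np.where(x > 0, x + 1.0, np.exp(x))

N, d = 512, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Quadratic route: materializes the N x N similarity matrix phi(Q) phi(K)^T.
quadratic = (phi(Q) @ phi(K).T) @ V        # O(N^2 d) time, O(N^2) memory

# Linear route: the same result, regrouped by associativity.
linear = phi(Q) @ (phi(K).T @ V)           # O(N d^2) time, O(d^2) memory

assert np.allclose(quadratic, linear)
```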
The choice of the feature map $\phi$ is critical. It must satisfy two potentially conflicting requirements:
- Approximation Quality: $\phi(q)^\top \phi(k)$ should be a reasonably good approximation of the original similarity score, which is typically related to $\exp(q^\top k / \sqrt{d})$. If the approximation is poor, the performance of the resulting attention mechanism may degrade significantly (a quick check is sketched after this list)
- Computational Feasibility: The feature map must be efficiently computable, and its dimension should not be excessively large
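The tension between these two requirements is easy to see with random features: more random features give a better approximation of the exponential (softmax) kernel, but also a higher-dimensional, more expensive feature map. A rough check, using a simplified positive random-feature map (the exact construction in Performer differs in its details):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

def positive_random_features(x, omega):
    # Simplified positive random-feature map: phi(x) = exp(omega x - |x|^2 / 2) / sqrt(m).
    # Its dot products approximate exp(q . k) in expectation over omega ~ N(0, I).
    m = omega.shape[0]
    return np.exp(x @ omega.T - 0.5 * np.sum(x**2, axis=-1, keepdims=True)) / np.sqrt(m)

# Target: the exponential similarity used by softmax attention, exp(q . k / sqrt(d)).
q = rng.standard_normal((32, d)) / d**0.25   # pre-scaling folds in the 1/sqrt(d) factor
k = rng.standard_normal((32, d)) / d**0.25
exact = np.exp(q @ k.T)

for m in (16, 64, 256, 1024):
    omega = rng.standard_normal((m, d))
    approx = positive_random_features(q, omega) @ positive_random_features(k, omega).T
    rel_err = (np.abs(approx - exact) / exact).mean()
    print(f"m={m:5d}  mean relative error: {rel_err:.3f}")
```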
Common choices for $\phi$ include simple element-wise activations such as $\mathrm{elu}(x) + 1$ or ReLU, the identity function (leading to unnormalized linear attention), random features designed to approximate the Gaussian or softmax kernel (as in Performer), or polynomial features.
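Illustrative definitions of a few of these choices (exact parameterizations vary across papers; random features are sketched above):

```python
import numpy as np

def elu_plus_one(x):
    # phi(x) = elu(x) + 1: a simple, strictly positive feature map.
    return np.where(x > 0, x + 1.0, np.exp(x))

def relu(x):
    # phi(x) = max(x, 0): cheap, but can zero out the similarity entirely.
    return np.maximum(x, 0.0)

def identity(x):
    # phi(x) = x: plain dot-product similarity, i.e. unnormalized linear attention.
    return x

def poly2(x):
    # Degree-2 polynomial features: all pairwise products x_i * x_j, so that
    # phi(q) . phi(k) = (q . k)^2. Note the feature dimension grows from d to d^2.
    return np.einsum('...i,...j->...ij', x, x).reshape(*x.shape[:-1], -1)
```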
An important question is how to handle the normalization inherent in the softmax function. Standard softmax ensures the attention weights for each query sum to 1, acting as a probability distribution over the value vectors. In the linear attention formulation
$$\mathrm{Attn}(Q, K, V)_i = \frac{\sum_j \phi(q_i)^\top \phi(k_j)\, v_j}{\sum_j \phi(q_i)^\top \phi(k_j)},$$
the denominator is only an approximation of the softmax normalizer, and if it becomes very small it can lead to numerical instability. A few methods to address this:
- omit the normalization (denominator) altogether and use alternative normalization schemes (see the sketch after this list)
- design feature maps with specific properties (e.g., strictly positive outputs) that keep the denominator well behaved
- introduce gating mechanisms
- use different approximation strategies, such as the Nyström method for matrix approximation (as in Nyströmformer)
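As a rough illustration of the first option and of the simplest stabilization (an explicit floor on the denominator), assuming the same $\mathrm{elu}(x) + 1$ feature map as above; gating and Nyström-style approximations are not shown:

```python
import numpy as np

def phi(x):
    # Same illustrative feature map as above: elu(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6, use_denominator=True):
    Qf, Kf = phi(Q), phi(K)
    num = Qf @ (Kf.T @ V)                     # numerator: phi(Q) (phi(K)^T V)
    if not use_denominator:
        # Drop the softmax-style normalization entirely and let a subsequent
        # normalization layer (e.g. LayerNorm) absorb the output scale.
        return num
    # Keep the denominator but clamp it away from zero, so a query whose
    # similarities are all tiny cannot blow up the output.
    Z = Qf @ Kf.sum(axis=0)                   # approximate softmax normalizer, shape (N,)
    return num / np.maximum(Z, eps)[:, None]
```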