Topics
Sparse attention can also be related to the concept of hard vs. soft attention. While standard soft attention uses a continuous weighted average (typically via softmax), hard attention makes a discrete selection. Sparse attention can be viewed as soft attention in which many weights are exactly zero, or as a hybrid of the two. Normalization techniques such as sparsemax or α-entmax produce sparse attention weights differentiably, forcing some weights to be exactly zero and thereby bridging the gap between fully dense soft attention and non-differentiable hard attention.
Such methods can make attention outputs more interpretable and focused, and are useful when we expect the true attention pattern to be sparse (e.g. a task where only a few input tokens are genuinely relevant to each query).
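As a concrete illustration, here is a minimal NumPy sketch of sparsemax (Martins & Astudillo, 2016) compared against softmax on the same score vector; the function names and example scores are illustrative choices, not part of any particular library's API.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: projects scores onto the probability simplex,
    driving many output weights to exactly zero while remaining
    differentiable almost everywhere."""
    z = np.asarray(z, dtype=float)
    # Sort scores in descending order and take cumulative sums.
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    # Support: the largest k with 1 + k * z_(k) > sum of the top-k scores.
    support = 1 + k * z_sorted > cumsum
    k_z = k[support][-1]
    # Threshold tau chosen so the positive parts sum to 1.
    tau = (cumsum[support][-1] - 1) / k_z
    return np.maximum(z - tau, 0.0)

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

if __name__ == "__main__":
    scores = np.array([2.0, 1.2, 0.1, -1.0])   # hypothetical attention logits
    print("softmax:  ", np.round(softmax(scores), 3))    # every weight > 0
    print("sparsemax:", np.round(sparsemax(scores), 3))  # low-scoring weights exactly 0
```

On these scores, softmax assigns some probability mass to every position, while sparsemax returns roughly [0.9, 0.1, 0.0, 0.0]: the two low-scoring positions receive exactly zero weight, which is what makes the resulting attention distribution sparse and easier to interpret.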