Topics
Sparse attention can also be related to the concept of hard vs. soft attention. While standard soft attention uses a continuous weighted average (typically via softmax), hard attention makes a discrete selection. Sparse attention can be viewed as soft attention in which many weights are exactly zero, or as a hybrid of the two. Normalization techniques such as sparsemax or α-entmax produce sparse attention weights differentiably, forcing some weights to be exactly zero and thereby bridging the gap between fully dense soft attention and non-differentiable hard attention.
Such methods can make attention outputs more interpretable and focused, and are useful when we expect the true attention pattern to be sparse (e.g. a task where only a few input tokens are genuinely relevant to each query).
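As a concrete illustration, here is a minimal NumPy sketch of sparsemax (Martins & Astudillo, 2016) compared against softmax on the same score vector; the function names and example scores are illustrative choices, not part of any particular library's API.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: projects scores onto the probability simplex,
    driving many output weights to exactly zero while remaining
    differentiable almost everywhere."""
    z = np.asarray(z, dtype=float)
    # Sort scores in descending order and take cumulative sums.
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    # Support: the largest k with 1 + k * z_(k) > sum of the top-k scores.
    support = 1 + k * z_sorted > cumsum
    k_z = k[support][-1]
    # Threshold tau chosen so the positive parts sum to 1.
    tau = (cumsum[support][-1] - 1) / k_z
    return np.maximum(z - tau, 0.0)

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

if __name__ == "__main__":
    scores = np.array([2.0, 1.2, 0.1, -1.0])   # hypothetical attention logits
    print("softmax:  ", np.round(softmax(scores), 3))    # every weight > 0
    print("sparsemax:", np.round(sparsemax(scores), 3))  # low-scoring weights exactly 0
```

On these scores, softmax assigns some probability mass to every position, while sparsemax returns roughly [0.9, 0.1, 0.0, 0.0]: the two low-scoring positions receive exactly zero weight, which is what makes the resulting attention distribution sparse and easier to interpret.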