
Masked self-attention combines two key ideas:

  1. Self-attention: the queries (Q), keys (K), and values (V) are all derived from the same sequence.
  2. Masked (causal) attention: each position is prevented from attending to future positions.

This combination is used primarily in autoregressive models, e.g. the Transformer decoder block. Unlike standard self-attention, which allows full bidirectional context, masked self-attention enforces causality: each position can attend only to earlier positions and to itself.

This is achieved by applying a causal mask during the attention calculation. The mask sets the attention scores for future positions to a very low value (typically negative infinity before the softmax), effectively zeroing them out in the final attention distribution.
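
As a minimal NumPy sketch of this masking step (the toy sequence length, random scores, and variable names here are illustrative assumptions):

```python
import numpy as np

seq_len = 4
rng = np.random.default_rng(0)

# Stand-in for the raw scaled dot-product scores (Q K^T / sqrt(d_k) would have this shape).
scores = rng.normal(size=(seq_len, seq_len))

# Causal mask: -inf strictly above the diagonal (future positions), 0 on and below it.
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

# Row-wise softmax; exp(-inf) = 0, so future positions receive exactly zero weight.
masked = scores + mask
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 3))  # upper triangle is all zeros
```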

$$\text{MaskedAttention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$

where $M$ is the causal mask with $M_{ij} = -\infty$ when $j > i$ and $M_{ij} = 0$ otherwise.
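
For example, for a sequence of length 4 (with 1-indexed positions) the causal mask is

$$M = \begin{pmatrix} 0 & -\infty & -\infty & -\infty \\ 0 & 0 & -\infty & -\infty \\ 0 & 0 & 0 & -\infty \\ 0 & 0 & 0 & 0 \end{pmatrix}$$

so, for instance, position 2 can attend to positions 1 and 2 but receives zero weight on positions 3 and 4 after the softmax.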

Since this is self-attention, the masking prevents each position from attending to future positions within the same sequence being processed.
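
Putting the pieces together, here is a self-contained single-head sketch in NumPy; the function name `masked_self_attention`, the projection shapes, and the toy inputs are illustrative assumptions rather than a reference implementation:

```python
import numpy as np

def masked_self_attention(x, w_q, w_k, w_v):
    """Single-head masked (causal) self-attention over one sequence.

    x: (seq_len, d_model) inputs; w_q, w_k, w_v: (d_model, d_k) projections.
    """
    # Q, K, V all come from the same sequence (the "self" in self-attention).
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)              # scaled dot-product scores

    # Causal mask M: -inf where j > i (future positions), 0 otherwise.
    seq_len = x.shape[0]
    mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
    scores = scores + mask

    # Row-wise softmax; masked positions end up with exactly zero weight.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy usage: perturbing a later token does not change earlier outputs.
rng = np.random.default_rng(0)
d_model, d_k, seq_len = 8, 4, 5
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = [rng.normal(size=(d_model, d_k)) for _ in range(3)]

out = masked_self_attention(x, w_q, w_k, w_v)
x_perturbed = x.copy()
x_perturbed[-1] += 10.0                          # change only the last token
out_perturbed = masked_self_attention(x_perturbed, w_q, w_k, w_v)

print(np.allclose(out[:-1], out_perturbed[:-1]))  # True: earlier positions unaffected
```

The final check illustrates the causality property: because each row of the attention weights is zero beyond its own position, changing a future token cannot affect the outputs at earlier positions.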