Topics

Self-attention is a variant of the attention mechanism in which the queries, keys, and values (the query-key-value, QKV, framework) are all derived from the same input sequence. This contrasts with cross-attention, where queries come from one sequence and keys/values from another.

Core idea: Each element in a sequence attends to all other elements within that same sequence. This allows the model to capture dependencies between any two positions in the input, regardless of their distance.
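For concreteness, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The projection matrices w_q, w_k, w_v and the dimensions are illustrative assumptions, not tied to any particular model.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a single sequence x.

    x:             (seq_len, d_model) input sequence
    w_q, w_k, w_v: (d_model, d_k) projection matrices
    Returns:       (seq_len, d_k) contextualised representations.
    """
    q = x @ w_q          # queries, keys, and values all come from the same x
    k = x @ w_k
    v = x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v                               # weighted sum of values

# Toy usage: 4 tokens, model dimension 8, head dimension 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8): every position attends to every other position
```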

Examples

  • Transformer encoder: Allows every word to attend to every other word
  • BERT: Uses self-attention to learn contextual word embeddings
  • Vision Transformer (ViT): Applies self-attention across image patches

Self-attention, like standard dot-product attention in general, can be global or local. Global attention (the standard Transformer) means each position attends to all others. Local attention (e.g., Swin Transformer, Longformer) restricts attention to a neighborhood or window for efficiency.
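A simple way to picture local attention is as a mask over the attention scores. The sketch below builds a sliding-window mask in the spirit of Longformer-style local attention; the window size and sequence length are made-up values for illustration.

```python
import numpy as np

def local_attention_mask(seq_len, window):
    """Boolean mask: True where position i may attend to position j.

    Global attention corresponds to an all-True mask; here each position is
    restricted to a symmetric window of +/- `window` neighbours.
    """
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = local_attention_mask(seq_len=6, window=1)
print(mask.astype(int))
# Each row has at most 3 ones: the position itself and its two neighbours.
# In an attention layer, masked-out scores are set to -inf before the softmax,
# so only the windowed positions receive nonzero weight.
```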

Note

The output of a self-attention layer is typically a sequence of the same length as the input, where each element’s representation is now enriched with contextual information from other relevant elements in the sequence.
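As a quick check of this shape-preserving behavior, assuming PyTorch is available, nn.MultiheadAttention used in self-attention mode (the same tensor passed as queries, keys, and values) returns a sequence of the same length as the input; the dimensions below are arbitrary.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
x = torch.randn(2, 10, 16)        # (batch=2, seq_len=10, d_model=16)
out, weights = attn(x, x, x)      # queries = keys = values = x
print(out.shape)      # torch.Size([2, 10, 16]) -- same length as the input
print(weights.shape)  # torch.Size([2, 10, 10]) -- one weight per (query, key) pair
```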