Topics

Self-attention is a variant of the attention mechanism in which the queries, keys, and values (the query-key-value, QKV, framework) are all derived from the same input sequence. This contrasts with cross-attention, where queries come from one sequence and keys/values from another.

Core idea: Each element in a sequence attends to all other elements within that same sequence. This allows the model to capture dependencies between any two positions in the input, regardless of their distance.
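For concreteness, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The projection matrices w_q, w_k, w_v and the dimensions are illustrative assumptions, not tied to any particular model.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a single sequence x.

    x:             (seq_len, d_model) input sequence
    w_q, w_k, w_v: (d_model, d_k) projection matrices
    Returns:       (seq_len, d_k) contextualised representations.
    """
    q = x @ w_q          # queries, keys, and values all come from the same x
    k = x @ w_k
    v = x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v                               # weighted sum of values

# Toy usage: 4 tokens, model dimension 8, head dimension 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8): every position attends to every other position
```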

Examples

  • Transformer encoder: Allows every word to attend to every other word
  • BERT: Uses self-attention to learn contextual word embeddings
  • Vision Transformer (ViT): Applies self-attention across image patches

Self-attention, like standard dot-product attention in general, can be global or local. Global attention (the standard Transformer) means each position attends to all others. Local attention (e.g., Swin Transformer, Longformer) restricts attention to a neighborhood or window for efficiency.
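A simple way to picture local attention is as a mask over the attention scores. The sketch below builds a sliding-window mask in the spirit of Longformer-style local attention; the window size and sequence length are made-up values for illustration.

```python
import numpy as np

def local_attention_mask(seq_len, window):
    """Boolean mask: True where position i may attend to position j.

    Global attention corresponds to an all-True mask; here each position is
    restricted to a symmetric window of +/- `window` neighbours.
    """
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = local_attention_mask(seq_len=6, window=1)
print(mask.astype(int))
# Each row has at most 3 ones: the position itself and its two neighbours.
# In an attention layer, masked-out scores are set to -inf before the softmax,
# so only the windowed positions receive nonzero weight.
```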

Note

The output of a self-attention layer is typically a sequence of the same length as the input, where each element’s representation is now enriched with contextual information from other relevant elements in the sequence.
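As a quick check of this shape-preserving behavior, assuming PyTorch is available, nn.MultiheadAttention used in self-attention mode (the same tensor passed as queries, keys, and values) returns a sequence of the same length as the input; the dimensions below are arbitrary.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
x = torch.randn(2, 10, 16)        # (batch=2, seq_len=10, d_model=16)
out, weights = attn(x, x, x)      # queries = keys = values = x
print(out.shape)      # torch.Size([2, 10, 16]) -- same length as the input
print(weights.shape)  # torch.Size([2, 10, 10]) -- one weight per (query, key) pair
```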