
Masked multi-head self-attention is an attention mechanism variant that combines multi-head self-attention with a causal (look-ahead) mask, so each position can attend only to itself and to earlier positions in the sequence.

Used in decoder-only transformers such as GPT, and in the decoder of the original encoder-decoder transformer architecture.
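
For reference, each head's computation can be written as masked scaled dot-product attention. The notation below is a standard convention assumed here (not taken from these notes): Q_i, K_i, V_i are head i's query/key/value matrices, d_k the key dimension, and M the causal mask.

```latex
\mathrm{head}_i = \operatorname{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}} + M\right) V_i,
\qquad
M_{jk} =
\begin{cases}
0 & \text{if } k \le j \\
-\infty & \text{if } k > j
\end{cases}
```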

Typical process:

  1. For each head, compute the query, key, and value matrices
  2. Apply a causal mask to prevent attending to future positions
  3. Compute scaled dot-product attention for each head
  4. Concatenate all head outputs and apply a linear projection to obtain the layer's output
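
A minimal NumPy sketch of these four steps, under assumed shapes and names (the input x of shape (seq_len, d_model) and the weight matrices Wq, Wk, Wv, Wo are illustrative, not taken from these notes):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability; masked -inf entries become 0 after exp
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_multi_head_self_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """x: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model) learned projection weights."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # 1. Project the input to queries, keys, values and split into heads: (num_heads, seq_len, d_head)
    def split_heads(m):
        return m.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Q, K, V = (split_heads(x @ W) for W in (Wq, Wk, Wv))

    # 2. Causal mask: -inf above the diagonal blocks attention to future positions
    mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

    # 3. Scaled dot-product attention per head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head) + mask   # (num_heads, seq_len, seq_len)
    out = softmax(scores, axis=-1) @ V                           # (num_heads, seq_len, d_head)

    # 4. Concatenate heads and apply the output projection
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, num_heads, seq_len = 8, 2, 5
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))
    x = rng.standard_normal((seq_len, d_model))
    print(masked_multi_head_self_attention(x, Wq, Wk, Wv, Wo, num_heads).shape)  # (5, 8)
```

In practice these steps are handled by library layers (e.g., PyTorch's torch.nn.MultiheadAttention, which accepts an attention mask), but the four steps above are the core of the computation.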