Topics

In contrast to local attention, the global attention mechanism allows each query token to attend to every token in the input sequence.

Mechanism w.r.t. the query-key-value (QKV) framework:

  • The query at position i computes attention scores against all keys j = 1, …, n
  • The context vector at position i is a weighted sum of all value vectors (see the sketch after this list)
  • Examples: the original Bahdanau attention (computed over RNN encoder hidden states), Transformer self-attention
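
A minimal NumPy sketch of this computation for a single head; the function name, toy shapes, and dimension names below are illustrative assumptions, not from the original:

```python
import numpy as np

def global_attention(Q, K, V):
    """Full (global) attention: every query attends to every key.

    Q, K, V: arrays of shape (n_tokens, d) for a single head.
    Returns the context vectors, shape (n_tokens, d).
    """
    d = Q.shape[-1]
    # Attention scores between every query i and every key j: an (n, n) matrix.
    scores = Q @ K.T / np.sqrt(d)
    # Softmax over keys so each row of weights sums to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Context vector for each position: weighted sum of all value vectors.
    return weights @ V

# Toy usage: 5 tokens, 8-dimensional head.
rng = np.random.default_rng(0)
n, d = 5, 8
Q, K, V = rng.normal(size=(3, n, d))
context = global_attention(Q, K, V)
print(context.shape)  # (5, 8)
```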

Computational Complexity:

  • Calculating the attention scores takes O(n² · d) operations for sequence length n and head dimension d
  • Storing the n × n attention matrix takes O(n²) memory
  • Quadratic scaling makes it expensive for very long sequences (see the back-of-the-envelope example after this list)
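
For intuition, a quick back-of-the-envelope check of the quadratic memory cost; the sequence lengths and the fp32 assumption here are illustrative:

```python
# Memory for one fp32 attention matrix (n x n), ignoring batch and heads.
for n in (1_024, 8_192, 65_536):
    bytes_needed = n * n * 4  # 4 bytes per float32 score
    print(f"n={n:>6}: {bytes_needed / 2**20:,.0f} MiB")
# n=  1024: 4 MiB | n=  8192: 256 MiB | n= 65536: 16,384 MiB (16 GiB)
```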

Context Capture:

  • Strength lies in capturing long-range dependencies
  • Provides comprehensive global understanding
  • Every token attends to every other token
  • Beneficial for tasks needing holistic understanding, like sequence modeling

Note

Global attention tackled the fixed-length context vector bottleneck of earlier encoder-decoder models. It allowed the model to dynamically focus on relevant parts of the input sequence during translation or sequence generation, which led to improved performance, especially for longer sequences.