Topics
In contrast to local attention, the global attention mechanism allows each query token to attend to every token in the input sequence.
Mechanism w.r.t. the query-key-value (QKV) framework (see the sketch after this list):
- Query at position i computes attention scores against the keys at all positions j = 1, ..., n
- Context vector is a weighted sum of all value vectors
- Examples: original Bahdanau attention (which used an RNN encoder's hidden states as context), Transformer self-attention
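Below is a minimal NumPy sketch of this mechanism; the function name, weight matrices, and shapes are illustrative assumptions rather than any particular library's API.

```python
# Minimal sketch of global (full) self-attention in the QKV framework.
# Names and shapes are illustrative, not taken from any specific library.
import numpy as np

def global_self_attention(X, Wq, Wk, Wv):
    """X: (n, d_model) token embeddings; Wq/Wk/Wv: (d_model, d_k) projections."""
    Q = X @ Wq                        # one query per position
    K = X @ Wk                        # one key per position
    V = X @ Wv                        # one value per position
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n): every query scores every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all positions
    return weights @ V                # each output is a weighted sum of all values

# Toy usage: 5 tokens, d_model = 8, d_k = 4
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 4)) for _ in range(3))
out = global_self_attention(X, Wq, Wk, Wv)   # shape (5, 4)
```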
Computational Complexity:
- Calculating the attention scores takes O(n² · d) operations for sequence length n and key dimension d
- Storing the n × n attention matrix takes O(n²) memory
- Quadratic scaling makes it expensive for very long sequences
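As a rough illustration of this quadratic scaling, here is a back-of-the-envelope sketch (assuming fp32 scores, a single head, and a single sequence; the numbers are illustrative only):

```python
# Back-of-the-envelope cost of the n x n attention matrix at various sequence lengths.
for n in (1_024, 8_192, 65_536):
    entries = n * n            # one score per query-key pair (times d for the dot products)
    mem_bytes = n * n * 4      # fp32: 4 bytes per score
    print(f"n={n:>6}: ~{entries:.2e} score entries, ~{mem_bytes / 2**30:.2f} GiB for the matrix")
```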
Context Capture:
- Strength lies in capturing long-range dependencies
- Provides comprehensive global understanding
- Every token attends to every other token
- Beneficial for tasks needing holistic understanding, like sequence modeling
Note
Global attention tackled the fixed-length context-vector bottleneck of early encoder-decoder models. It allowed the model to dynamically focus on relevant parts of the input sequence during translation or sequence generation, which led to improved performance, especially on longer sequences.