Topics
Masked multi-head self-attention is an attention mechanism variant that combines:
- causal masking (from masked attention) to prevent information leaking from future positions
- multiple attention heads (from multi-head attention) so each head can learn a different kind of relationship
- self-attention, so each sequence element is processed relative to every other element in the same sequence
Used in decoder-only transformers such as GPT and in the decoder of the original encoder-decoder transformer; the per-head computation is written out below.
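In standard transformer notation (projections Q, K, V, per-head dimension d_k, output projection W^O; these symbols are not defined in the notes above and are assumed here), the masked attention for one head and the multi-head combination can be written as:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V,
\qquad
M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}
$$

$$
\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O,
\qquad
\mathrm{head}_i = \mathrm{Attention}(XW_i^Q, XW_i^K, XW_i^V)
$$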
Typical process (sketched in code after this list):
- For each head, compute query, key, and value matrices from the input
- Apply a causal mask so no position can attend to future positions
- Compute scaled dot-product attention per head
- Concatenate the outputs of all heads and apply a linear projection to obtain the layer's result
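A minimal sketch of these steps, assuming PyTorch as the framework; the class name, `d_model`, and `num_heads` are illustrative and not taken from the notes above:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedMultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One linear layer each for queries, keys, and values, plus the output projection
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, _ = x.shape

        # 1. Compute Q, K, V and split into heads: (batch, heads, seq_len, d_head)
        def split_heads(t):
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(x))
        k = split_heads(self.w_k(x))
        v = split_heads(self.w_v(x))

        # 2. Causal mask: True above the diagonal marks future positions to block
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )

        # 3. Scaled dot-product attention per head
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        scores = scores.masked_fill(mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        out = attn @ v  # (batch, heads, seq_len, d_head)

        # 4. Concatenate heads and apply the output projection
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.w_o(out)

# Usage: 2 sequences of length 5, model width 64, 8 heads
layer = MaskedMultiHeadSelfAttention(d_model=64, num_heads=8)
y = layer(torch.randn(2, 5, 64))
print(y.shape)  # torch.Size([2, 5, 64])
```

Because the mask sets future-position scores to negative infinity before the softmax, those positions receive zero attention weight, which is what prevents information leakage during autoregressive training.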