Topics
Multi-Head Attention (MHA) is a core component of the transformer model, applied in both self-attention and cross-attention. Instead of performing attention using full Q, K, V vectors, MHA splits the computation into multiple parallel “heads”.
Mechanism:
First, the input Q, K, V matrices (dimension $d_{model}$) are linearly projected $h$ times using different learned weight matrices:

$$Q_i = Q W_i^Q, \quad K_i = K W_i^K, \quad V_i = V W_i^V$$

for each head $i = 1, \dots, h$. This results in $h$ sets of projected queries ($Q_i$), keys ($K_i$), and values ($V_i$). The dimensions $d_k$ and $d_v$ are typically set to $d_{model}/h$.
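To make the projection step concrete, here is a minimal PyTorch sketch. The sizes $d_{model} = 512$ and $h = 8$ are illustrative assumptions, giving the common choice $d_k = d_v = d_{model}/h = 64$:

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions): d_model = 512, h = 8 heads, so d_k = d_v = 64
d_model, num_heads = 512, 8
d_k = d_v = d_model // num_heads

# One learned projection matrix per head for Q, K, and V
W_q = nn.ModuleList([nn.Linear(d_model, d_k, bias=False) for _ in range(num_heads)])
W_k = nn.ModuleList([nn.Linear(d_model, d_k, bias=False) for _ in range(num_heads)])
W_v = nn.ModuleList([nn.Linear(d_model, d_v, bias=False) for _ in range(num_heads)])

Q = K = V = torch.randn(2, 10, d_model)   # (batch, seq_len, d_model); self-attention case
Q_heads = [proj(Q) for proj in W_q]       # h tensors, each of shape (2, 10, d_k)
K_heads = [proj(K) for proj in W_k]       # h tensors, each of shape (2, 10, d_k)
V_heads = [proj(V) for proj in W_v]       # h tensors, each of shape (2, 10, d_v)
```

In practice the $h$ per-head matrices are usually fused into a single $d_{model} \times d_{model}$ projection and reshaped into heads, which is mathematically equivalent.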
Note
The key and query projections must share the same dimension ($d_k$) so their dot product is defined.
Next, the scaled dot-product attention function is applied independently and in parallel to each projected set:

$$\text{head}_i = \text{Attention}(Q_i, K_i, V_i) = \text{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i$$
Each $\text{head}_i$ captures attention information from a different projected subspace. Finally, the outputs of the $h$ heads (each of dimension $d_v$) are concatenated along the feature dimension. This concatenated output (dimension $h \cdot d_v$) is then passed through a final linear projection layer with a weight matrix $W^O$ to produce the final MHA output:

$$\text{MHA}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^O$$
This final projection integrates the information learned across the different heads and ensures the output dimension matches the model’s expected dimension.
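Putting the steps together, a minimal self-contained MHA sketch in PyTorch could look like the following. The class name, default sizes, and the fused projections are assumptions for illustration, not a reference implementation; `W_o` corresponds to the output projection $W^O$ above.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Minimal MHA sketch; names and defaults are illustrative assumptions."""

    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads              # per-head dimension (d_k = d_v)
        # Fused projections: equivalent to h separate d_model -> d_k matrices
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)  # final output projection W^O

    def forward(self, q, k, v, mask=None):
        B, T_q, _ = q.shape
        T_k = k.shape[1]
        # Project and split into heads: (B, h, T, d_k)
        Q = self.W_q(q).view(B, T_q, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(k).view(B, T_k, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(v).view(B, T_k, self.num_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention, applied to all heads in parallel
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)   # (B, h, T_q, T_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        heads = F.softmax(scores, dim=-1) @ V                    # (B, h, T_q, d_k)
        # Concatenate heads along the feature dimension and apply W^O
        concat = heads.transpose(1, 2).contiguous().view(B, T_q, -1)
        return self.W_o(concat)

# Usage: self-attention over a batch of 2 sequences of length 10
x = torch.randn(2, 10, 512)
out = MultiHeadAttention()(x, x, x)
print(out.shape)  # torch.Size([2, 10, 512])
```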
Pros:
- Allows the model to jointly attend to information from different representation subspaces and positions
- Multiple heads can specialize in capturing different types of dependencies (local vs. global, syntactic vs. semantic)
- Facilitates parallel computation across heads
Cons:
- Head redundancy (“attention collapse”): Multiple heads learn similar functions/patterns
- Not all heads contribute equally; some can be pruned with little impact on quality
- True specialization across all heads is not guaranteed
Note
Research explores methods for encouraging diversity among heads, e.g., regularization terms or dynamic head selection such as Mixture-of-Head attention.
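As one illustrative form of such a diversity regularizer (an assumed sketch, not tied to any specific published method), a penalty on the pairwise cosine similarity between per-head outputs can be added to the training loss:

```python
import torch
import torch.nn.functional as F

def head_diversity_penalty(heads: torch.Tensor) -> torch.Tensor:
    """heads: (batch, num_heads, seq_len, d_k). Returns the mean pairwise cosine
    similarity between flattened head outputs; adding it to the loss discourages
    heads from collapsing onto the same attention pattern."""
    B, H, T, D = heads.shape
    flat = F.normalize(heads.reshape(B, H, T * D), dim=-1)   # unit vector per head
    sim = flat @ flat.transpose(1, 2)                        # (B, H, H) cosine similarities
    off_diag = sim - torch.eye(H, device=heads.device)       # drop self-similarity
    return off_diag.abs().sum(dim=(1, 2)).mean() / (H * (H - 1))

# Usage (hypothetical): loss = task_loss + lambda_div * head_diversity_penalty(per_head_outputs)
```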