Topics
Core attention operation in transformer architectures. Computes weighted sum of values ($V$) based on query-key ($Q$, $K$) similarities:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Components:
- $Q$: What we’re looking for
- $K$: What’s available to attend to
- $V$: Actual content being weighted
- $d_k$: Key/query dimension (scaling factor)
Q, K, V Derivation: $Q = XW_Q$, $K = XW_K$, $V = XW_V$
- $X$: input representation
- $W_Q$, $W_K$, $W_V$: learned weight matrices
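A minimal sketch of this projection step, assuming a PyTorch input $X$ of shape (seq_len, d_model); the dimension values are illustrative, not from the source:

```python
import torch

seq_len, d_model, d_k = 8, 64, 64      # illustrative sizes

X = torch.randn(seq_len, d_model)      # input representation

# Learned weight matrices (random here; learned during training in practice)
W_Q = torch.randn(d_model, d_k)
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)

Q = X @ W_Q   # queries: what we're looking for
K = X @ W_K   # keys: what's available to attend to
V = X @ W_V   # values: actual content being weighted
```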
The process is the same as regular dot-product attention, but with added scaling for stability:
- Compute scores $QK^T$ (pairwise similarities)
- Scale by $\sqrt{d_k}$ (prevent gradient instability)
- Softmax → attention weights (probability distribution)
- Weighted sum: $\text{weights} \cdot V$
```python
import math
import torch.nn.functional as F

scores = (Q @ K.T) / math.sqrt(d_k)   # Score matrix (pairwise similarities)
weights = F.softmax(scores, dim=-1)   # Attention map (rows sum to 1)
output = weights @ V                  # Weighted sum of values
```
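Wrapped into a reusable function with a quick shape check (a sketch under the same single-head assumptions; the helper name and tensor sizes are illustrative, not a library API):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (L, L) score matrix
    weights = F.softmax(scores, dim=-1)                  # rows sum to 1
    return weights @ V                                   # (L, d_v) weighted sum

Q = torch.randn(8, 64)   # illustrative shapes: seq_len=8, d_k=64
K = torch.randn(8, 64)
V = torch.randn(8, 64)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)         # torch.Size([8, 64])
```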
Note
Why scaling matters: Without $\sqrt{d_k}$ scaling, dot products grow large as dimensionality increases → softmax gradients vanish. Scaling maintains stable gradients.
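A quick numeric illustration of this effect (values are illustrative): unscaled dot products of random $d_k$-dimensional vectors have variance roughly $d_k$, so for large $d_k$ the softmax saturates toward a one-hot distribution.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 512
q = torch.randn(1, d_k)
K = torch.randn(8, d_k)

raw = q @ K.T                    # unscaled scores, std ≈ sqrt(d_k)
scaled = raw / math.sqrt(d_k)    # scaled scores, std ≈ 1

print(F.softmax(raw, dim=-1))    # typically close to one-hot → tiny gradients
print(F.softmax(scaled, dim=-1)) # smoother distribution → stable gradients
```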
Efficiency: Pure matrix ops → highly parallelizable (GPU-friendly).
Drawbacks: Complexity: $O(L^2)$ with sequence length $L$.
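A small sketch of that quadratic growth (sizes are illustrative): the score/weight matrix is $L \times L$, so doubling the sequence length quadruples its size.

```python
import torch

d_k = 64
for L in (128, 256, 512):
    Q = torch.randn(L, d_k)
    K = torch.randn(L, d_k)
    scores = Q @ K.T
    print(L, tuple(scores.shape), scores.numel())  # element count grows as L**2
```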
Related
- multi-head attention (extension using multiple parallel attention heads)
- self-attention (Q, K, V from same input)
- cross-attention (Q from decoder, K, V from encoder)
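To make the last two entries concrete, a sketch (tensor names and shapes are illustrative; real models apply learned projections before the attention call): self-attention builds Q, K, V from one sequence, while cross-attention takes Q from decoder states and K, V from encoder output.

```python
import math
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    return F.softmax(Q @ K.T / math.sqrt(Q.size(-1)), dim=-1) @ V

enc = torch.randn(10, 64)   # encoder states
dec = torch.randn(6, 64)    # decoder states

self_out = attention(dec, dec, dec)    # self-attention: Q, K, V from same input
cross_out = attention(dec, enc, enc)   # cross-attention: Q from decoder; K, V from encoder
print(self_out.shape, cross_out.shape) # torch.Size([6, 64]) torch.Size([6, 64])
```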