Topics
Cross-attention is the mechanism that connects the encoder and decoder in standard transformer models; it sits in each decoder layer.
It allows the decoder to focus on relevant parts of the encoder-processed input sequence when generating each output token. Essential for tasks where the output depends on a separate input sequence, such as neural machine translation (NMT), summarization, and multimodal tasks (image captioning, VQA).
The defining characteristic is the source of the Q, K, V vectors (query-key-value, QKV, framework), sketched in code after this list:
- Queries (Q): from the decoder. Derived from the output of the preceding sub-layer (typically masked self-attention) in the decoder block. They represent the decoder’s current state and the information it needs from the source.
- Keys (K) and Values (V): from the encoder. Derived from the final output representation of the input sequence produced by the encoder stack.
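A minimal sketch of where Q, K, and V come from, assuming a PyTorch-style single-head setup; the projection layers (w_q, w_k, w_v), the model width, and the tensor shapes here are illustrative assumptions, not values from the text:

```python
import torch
import torch.nn as nn

d_model = 512  # assumed model width

# Hypothetical projection layers inside one decoder block
w_q = nn.Linear(d_model, d_model)  # applied to the decoder stream
w_k = nn.Linear(d_model, d_model)  # applied to the encoder output
w_v = nn.Linear(d_model, d_model)  # applied to the encoder output

# decoder_state: output of the decoder's masked self-attention sub-layer
# encoder_output: final representation produced by the encoder stack
decoder_state = torch.randn(1, 10, d_model)   # (batch, target_len, d_model)
encoder_output = torch.randn(1, 20, d_model)  # (batch, source_len, d_model)

Q = w_q(decoder_state)    # queries come from the decoder
K = w_k(encoder_output)   # keys come from the encoder
V = w_v(encoder_output)   # values come from the encoder
```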
The scaled dot-product attention computation then uses the decoder’s queries against the encoder’s keys and values:

$$\text{CrossAttention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
This calculation lets the decoder find relevant parts of the encoded input by comparing its queries (Q) with the encoder’s keys (K), then take a weighted mix of the encoder’s value vectors (V). It is a dynamic lookup: the decoder uses its queries to read from the encoder’s output “memory”.
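A rough single-head sketch of that lookup (no masking, no multi-head split); the `cross_attention` helper and the example shapes are assumptions for illustration, not a reference implementation:

```python
import math
import torch
import torch.nn.functional as F

def cross_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """softmax(Q K^T / sqrt(d_k)) V: score decoder queries against encoder keys, mix encoder values."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (batch, target_len, source_len)
    weights = F.softmax(scores, dim=-1)                # one distribution over source positions per query
    return weights @ V                                 # weighted mix of encoder value vectors

# Example shapes: 10 target tokens attend over 20 source tokens
out = cross_attention(torch.randn(1, 10, 64), torch.randn(1, 20, 64), torch.randn(1, 20, 64))
print(out.shape)  # torch.Size([1, 10, 64])
```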
```mermaid
graph LR
    subgraph "Encoder"
        E["Encoder Output<br>(K, V)"]
    end
    subgraph "Decoder"
        D["Decoder Query (Q)"]
        A["Attention Weights"]
        O["Output Features"]
    end
    D -->|"Compare with K"| A
    E -->|"Provide K, V"| A
    A -->|"Weighted sum"| O
```