Topics
Cross-attention is the mechanism that connects the encoder and decoder in standard transformer models; it sits in each decoder layer.
It allows the decoder to focus on relevant parts of the encoder-processed input sequence when generating each output token. Essential for tasks where the output depends on a separate input sequence, such as neural machine translation (NMT), summarization, and multimodal tasks (image captioning, VQA).
The defining characteristic is the source of the Q, K, V vectors (query-key-value, QKV, framework), sketched in code after this list:
- Queries (Q): from the decoder. Derived from the output of the preceding sub-layer (typically masked self-attention) in the decoder block. They represent the decoder’s current state and the information it needs from the source.
- Keys (K) and Values (V): from the encoder. Derived from the final output representation of the input sequence produced by the encoder stack.
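A minimal sketch of where Q, K, and V come from, assuming a PyTorch-style single-head setup; the projection layers (w_q, w_k, w_v), the model width, and the tensor shapes here are illustrative assumptions, not values from the text:

```python
import torch
import torch.nn as nn

d_model = 512  # assumed model width

# Hypothetical projection layers inside one decoder block
w_q = nn.Linear(d_model, d_model)  # applied to the decoder stream
w_k = nn.Linear(d_model, d_model)  # applied to the encoder output
w_v = nn.Linear(d_model, d_model)  # applied to the encoder output

# decoder_state: output of the decoder's masked self-attention sub-layer
# encoder_output: final representation produced by the encoder stack
decoder_state = torch.randn(1, 10, d_model)   # (batch, target_len, d_model)
encoder_output = torch.randn(1, 20, d_model)  # (batch, source_len, d_model)

Q = w_q(decoder_state)    # queries come from the decoder
K = w_k(encoder_output)   # keys come from the encoder
V = w_v(encoder_output)   # values come from the encoder
```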
The scaled dot-product attention computation then uses the decoder’s queries against the encoder’s keys and values:

$$\text{CrossAttention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
This calculation lets the decoder find relevant parts of the encoded input by comparing its queries (Q) with the encoder’s keys (K), then take a weighted mix of the encoder’s value vectors (V). It is a dynamic lookup: the decoder uses its queries to read from the encoder’s output “memory”.
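A rough single-head sketch of that lookup (no masking, no multi-head split); the `cross_attention` helper and the example shapes are assumptions for illustration, not a reference implementation:

```python
import math
import torch
import torch.nn.functional as F

def cross_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """softmax(Q K^T / sqrt(d_k)) V: score decoder queries against encoder keys, mix encoder values."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (batch, target_len, source_len)
    weights = F.softmax(scores, dim=-1)                # one distribution over source positions per query
    return weights @ V                                 # weighted mix of encoder value vectors

# Example shapes: 10 target tokens attend over 20 source tokens
out = cross_attention(torch.randn(1, 10, 64), torch.randn(1, 20, 64), torch.randn(1, 20, 64))
print(out.shape)  # torch.Size([1, 10, 64])
```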
```mermaid
graph LR
    subgraph "Encoder"
        E["Encoder Output<br>(K, V)"]
    end
    subgraph "Decoder"
        D["Decoder Query (Q)"]
        A["Attention Weights"]
        O["Output Features"]
    end
    D -->|"Compare with K"| A
    E -->|"Provide K, V"| A
    A -->|"Weighted sum"| O
```