Topics

Key-Query-Value (K-Q-V) framework generalizes the attention mechanism: it maps a query (Q) vector and a set of key (K) / value (V) vector pairs to an output.

  • Query represents what the model is looking for.
  • Keys represent the information available in the input elements.
  • Values represent the content to be retrieved.

Example

Searching papers on ArXiv: the query is what you type into the search box. Internally, ArXiv may organize papers by a set of predefined keys. It compares your query against those keys and returns the papers whose keys best correspond to the query. The values are simply the papers themselves in the database. (This is only an analogy, not an attempt to describe how the ArXiv system actually works.)

Output is calculated as a weighted sum of the value vectors. The weight for each value is determined by a compatibility function (usually a dot product) between the query and the corresponding key. This gives initial weights or “scores”. To normalize, a softmax is typically applied to these scores to obtain the final attention weights.
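A minimal NumPy sketch of this computation (the dimensions and the random toy data are illustrative assumptions, not from the source):

```python
import numpy as np

def softmax(x):
    # Shift by the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Toy sizes (assumptions): one query, m = 4 key/value pairs, dimension d = 3.
rng = np.random.default_rng(0)
q = rng.normal(size=(3,))    # query vector
K = rng.normal(size=(4, 3))  # one key per row
V = rng.normal(size=(4, 3))  # one value per row

scores = K @ q               # dot-product compatibility scores, shape (4,)
weights = softmax(scores)    # attention weights: nonnegative, sum to 1
output = weights @ V         # weighted sum of value rows, shape (3,)
```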

Simply consider the following: denote by $\mathcal{D} \stackrel{\mathrm{def}}{=} \{(\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_m, \mathbf{v}_m)\}$ a database of $m$ tuples of keys and values. Moreover, denote by $\mathbf{q}$ a query. Then we can define the attention over $\mathcal{D}$ as

$$\mathrm{Attention}(\mathbf{q}, \mathcal{D}) \stackrel{\mathrm{def}}{=} \sum_{i=1}^{m} \alpha(\mathbf{q}, \mathbf{k}_i)\, \mathbf{v}_i,$$

where $\alpha(\mathbf{q}, \mathbf{k}_i) \in \mathbb{R}$ are scalar attention weights. This is typically referred to as attention pooling.
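A common instantiation, matching the dot-product-plus-softmax recipe above (assumed here; the source leaves $\alpha$ generic), sets

$$\alpha(\mathbf{q}, \mathbf{k}_i) = \frac{\exp(\mathbf{q}^\top \mathbf{k}_i)}{\sum_{j=1}^{m} \exp(\mathbf{q}^\top \mathbf{k}_j)},$$

so the weights are nonnegative and sum to 1.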

K-Q-V framework was instrumental in developing self-attention (intra-attention). In self-attention, queries, keys, and values are all derived from the same input sequence. This allows models to capture internal dependencies within a sequence without relying on recurrence the way RNNs do.
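A minimal single-head self-attention sketch in NumPy (the projection matrices `W_q`, `W_k`, `W_v`, the dimensions, and the $\sqrt{d}$ scaling borrowed from the Transformer are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 5, 8                   # toy sequence length and model dimension
X = rng.normal(size=(n, d))   # one input sequence, one row per position

# Queries, keys, and values all come from the same input X,
# via (here randomly initialized) projection matrices.
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Every position attends to every position at once -- no recurrence.
# Dividing by sqrt(d) keeps the dot products in a reasonable range.
A = softmax(Q @ K.T / np.sqrt(d))  # (n, n) attention weights, rows sum to 1
out = A @ V                        # (n, d) output sequence
```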

Transition from RNN-specific attention mechanisms to the generalized K-Q-V framework and self-attention represents a fundamental shift in sequence modeling: a move from inherently sequential processing to architectures centered on parallelizable attention computations, such as the Transformer, which removed recurrence entirely and outperformed recurrent models.