Topics

Traditional encoder-decoder architecture for seq2seq modeling compresses entire input sequence into single fixed-length context vector $c$.

Encoder processes input sequence $x = (x_1, \dots, x_T)$, produces final hidden state $h_T$. This serves as context vector $c$:

$$h_t = f_{\text{enc}}(x_t, h_{t-1}), \qquad c = h_T$$

Decoder uses fixed vector $c$ as initial state or conditioning input at each time step $t$ to generate output $y_t$.
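
To make this concrete, here is a minimal NumPy sketch of an encoder that compresses any input sequence into the single fixed-length vector $c = h_T$. The tanh RNN cell, weight names, and dimensions are illustrative assumptions, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, T = 8, 16, 5                  # hypothetical input dim, hidden dim, length

W_xh = rng.normal(scale=0.1, size=(d_hid, d_in))
W_hh = rng.normal(scale=0.1, size=(d_hid, d_hid))
b_h = np.zeros(d_hid)

def encode(xs):
    """Run a vanilla tanh RNN over the sequence; return only the final state."""
    h = np.zeros(d_hid)
    for x_t in xs:                         # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h                               # c = h_T: one fixed-length vector

xs = rng.normal(size=(T, d_in))            # toy input sequence
c = encode(xs)
print(c.shape)                             # (16,) no matter how long the input is
```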

Decoder equations:

$$s_t = f_{\text{dec}}(s_{t-1}, y_{t-1}, c)$$
$$y_t = g(s_t)$$

$s_t$ is decoder hidden state at time $t$. $f_{\text{dec}}$ is decoder RNN cell. $g$ is output layer.
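
A matching sketch of one decoder step under the equations above; the tanh cell, additive conditioning on $c$, and the linear-softmax output layer are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d_hid, d_emb, vocab = 16, 8, 20            # hypothetical sizes

W_s = rng.normal(scale=0.1, size=(d_hid, d_hid))   # previous state s_{t-1}
W_y = rng.normal(scale=0.1, size=(d_hid, d_emb))   # previous output embedding y_{t-1}
W_c = rng.normal(scale=0.1, size=(d_hid, d_hid))   # fixed context c
W_o = rng.normal(scale=0.1, size=(vocab, d_hid))   # output layer g

def decoder_step(s_prev, y_prev_emb, c):
    """s_t = f_dec(s_{t-1}, y_{t-1}, c); y_t distribution = softmax(g(s_t))."""
    s_t = np.tanh(W_s @ s_prev + W_y @ y_prev_emb + W_c @ c)
    logits = W_o @ s_t
    probs = np.exp(logits - logits.max())
    return s_t, probs / probs.sum()

c = rng.normal(size=d_hid)                 # the single context vector from the encoder
s = np.zeros(d_hid)                        # could also be initialised from c
y_emb = np.zeros(d_emb)                    # embedding of a start-of-sequence token
for _ in range(3):
    s, p = decoder_step(s, y_emb, c)       # note: the same c is reused at every step
    y_emb = rng.normal(size=d_emb)         # stand-in for embedding of argmax(p)
```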

Problems with fixed-length context vector:

  • Information Loss: Mapping variable-length input to fixed vector limits information retention. Capacity of $c$ is limited, especially for long inputs
  • Difficulty with Long Sequences: Performance degrades as input length grows. Fixed $c$ struggles to encode all relevant information for long output sequence generation. Decoder relies on single compressed representation for all output steps
  • Inability to Focus: Fixed context vector provides global input summary. Does not allow decoder to focus on specific input parts most relevant for current output generation. All input parts treated equally during compression

The attention mechanism addresses this bottleneck. Introduced in models such as Bahdanau attention and Luong attention, and later made widespread by the Transformer. Allows decoder to compute a new context vector $c_t$ at each decoding step $t$.

$c_t$ is a weighted sum of the encoder hidden states $h_i$:

$$c_t = \sum_{i=1}^{T} \alpha_{t,i} \, h_i$$

Weights $\alpha_{t,i}$ are dynamically computed based on relevance of each encoder state $h_i$ to current decoder hidden state $s_t$, typically as a softmax over alignment scores. This allows decoder to focus on most pertinent input information for each output token, mitigating the information bottleneck.
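
A minimal sketch of this attention step; dot-product scoring is an assumed choice here (Bahdanau attention uses an additive MLP score, Luong attention offers dot/general/concat variants):

```python
import numpy as np

def attention_context(s_t, H):
    """c_t = sum_i alpha_{t,i} h_i, with alpha_t = softmax of relevance scores."""
    scores = H @ s_t                       # e_{t,i} = h_i . s_t (assumed dot-product score)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                   # attention weights over encoder states, sum to 1
    return alpha @ H, alpha                # weighted sum of encoder hidden states

rng = np.random.default_rng(2)
T, d_hid = 5, 16
H = rng.normal(size=(T, d_hid))            # encoder hidden states h_1 ... h_T (kept, not discarded)
s_t = rng.normal(size=d_hid)               # current decoder hidden state
c_t, alpha = attention_context(s_t, H)
print(alpha.round(2), c_t.shape)           # a distribution over input positions; c_t is (16,)
```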