The decoder is responsible for generating the output sequence based on the encoded inputs and the previously generated outputs.

Structure:

  • Stack of identical layers (typically 6-12)
  • Each layer contains (sketched in code below):
    • Masked multi-head self-attention over the outputs generated so far
    • Cross-attention over the encoder's outputs
    • A position-wise feed-forward network
    • Residual connections and layer normalization around each sub-layer
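
A minimal sketch of one such layer in PyTorch, assuming nn.MultiheadAttention and the post-norm ordering of the original Transformer; the class name DecoderLayer is illustrative, and the default dimensions (d_model=512, n_heads=8, d_ff=2048) follow the base model in "Attention Is All You Need":

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder layer: masked self-attention -> cross-attention -> FFN,
    each sub-layer wrapped in a residual connection plus layer norm."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tgt, memory, tgt_mask):
        # Masked self-attention over the decoder's own outputs so far
        x, _ = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)
        tgt = self.norm1(tgt + x)
        # Cross-attention: queries from the decoder, keys/values from the encoder
        x, _ = self.cross_attn(tgt, memory, memory)
        tgt = self.norm2(tgt + x)
        # Position-wise feed-forward network
        return self.norm3(tgt + self.ffn(tgt))
```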

Key features:

  • Auto-regressive generation: Produces the output sequence one element at a time, each step conditioned on all previous outputs.
  • Masked self-attention: Prevents attending to future positions, preserving the auto-regressive property (see the mask example after this list).
    • During training, the full target sequence is available, but masking those positions keeps the model from cheating by looking ahead.
  • Cross-attention: Lets the decoder focus on relevant parts of the encoded input.
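
A small sketch of how such a causal mask can be built in PyTorch; under nn.MultiheadAttention's boolean-mask convention, True marks a position a query may not attend to, so this is the kind of tgt_mask the layer sketch above expects:

```python
import torch

seq_len = 5
# Strict upper triangle is True: position i may attend only to positions <= i.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(mask)
# tensor([[False,  True,  True,  True,  True],
#         [False, False,  True,  True,  True],
#         [False, False, False,  True,  True],
#         [False, False, False, False,  True],
#         [False, False, False, False, False]])
```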

Example

In machine translation, translating "The cat sat on the mat" to French:

  • The decoder first outputs "Le" based on the encoded English input sentence
  • The previous outputs ("Le" so far) are embedded, and positional encodings are added
  • Masked self-attention focuses on the relevant outputs generated so far
  • Cross-attention focuses on the relevant encoded inputs (from the encoder block)
  • The decoder generates "chat", conditioned on all previous outputs ("Le") and the encoded inputs
  • Generation continues until an EOS (end-of-sentence) token is produced; a greedy decoding loop along these lines is sketched below
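
A hedged sketch of that greedy decoding loop; model.encode, model.decode, bos_id, and eos_id are hypothetical placeholders for a trained encoder-decoder model and its special-token ids, not any particular library's API:

```python
import torch

def greedy_decode(model, src, bos_id, eos_id, max_len=50):
    """Auto-regressive greedy decoding: feed back each predicted token."""
    memory = model.encode(src)         # encode the source sentence once
    ys = torch.tensor([[bos_id]])      # running output, starting from BOS
    for _ in range(max_len):
        # The decoder applies masked self-attention over ys and
        # cross-attention over the encoder memory.
        logits = model.decode(ys, memory)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)  # most likely next token
        ys = torch.cat([ys, next_id], dim=1)  # append "Le", then "chat", ...
        if next_id.item() == eos_id:   # stop at the end-of-sentence token
            break
    return ys
```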