
The vanilla Transformer architecture, introduced in the “Attention Is All You Need” paper, was designed to address the limitations of recurrent neural networks in seq2seq modelling, particularly for machine translation. The Transformer achieved state-of-the-art results on the WMT 2014 machine translation benchmarks while requiring significantly less training time than recurrent approaches. Its success stems from three key innovations:

  1. Complete replacement of recurrence with self-attention
  2. Multi-head attention, allowing different representation subspaces (see the sketch after this list)
  3. Positional encodings, enabling sequence order awareness
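
To make the multi-head idea concrete, here is a minimal PyTorch sketch; the dimensions, the toy input, and the use of torch.nn.MultiheadAttention are illustrative choices, not part of the original text. Each head attends within its own lower-dimensional slice of the model width.

```python
import torch
import torch.nn as nn

d_model, n_heads = 64, 8                 # illustrative sizes
x = torch.randn(2, 10, d_model)          # toy batch: 2 sequences of 10 tokens

# Each of the 8 heads projects queries/keys/values into its own
# d_model / n_heads = 8-dimensional subspace and attends there.
self_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads,
                                  batch_first=True)
out, weights = self_attn(x, x, x)        # query = key = value = x (self-attention)
print(out.shape)                         # torch.Size([2, 10, 64])
print(weights.shape)                     # torch.Size([2, 10, 10]), averaged over heads
```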

The attention mechanism weighs the influence of different parts of the input on each part of the output.
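
Concretely, this weighting is the scaled dot-product attention of the original paper, softmax(Q K^T / sqrt(d_k)) V. Below is a minimal sketch of that formula; the tensor sizes and random inputs are illustrative only.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # Each output row is a weighted average of the value vectors; the weights
    # come from a softmax over query-key similarities, scaled by sqrt(d_k).
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (L_q, L_k) similarities
    weights = torch.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ v, weights

q = torch.randn(10, 64)   # toy queries, keys, and values for 10 positions
k = torch.randn(10, 64)
v = torch.randn(10, 64)
out, w = scaled_dot_product_attention(q, k, v)
print(out.shape, w.shape)  # torch.Size([10, 64]) torch.Size([10, 10])
```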

The original paper designed the architecture as an encoder-decoder model, the standard setup for the machine translation task. The Transformer encoder block contains:

  • a multi-head self-attention sub-layer
  • a position-wise feed-forward network
  • a residual connection followed by layer normalization around each sub-layer
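
A minimal sketch of one encoder block under these assumptions (post-norm ordering as in the original paper; the base hyperparameters d_model=512, n_heads=8, d_ff=2048 follow the paper, while the class name EncoderBlock is a hypothetical choice for illustration):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    # One encoder block: self-attention and a position-wise feed-forward
    # network, each wrapped in a residual connection followed by layer norm.
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff),
                                 nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)        # self-attention sub-layer
        x = self.norm1(x + self.drop(attn_out))      # residual + layer norm
        x = self.norm2(x + self.drop(self.ffn(x)))   # feed-forward sub-layer
        return x

block = EncoderBlock()
print(block(torch.randn(2, 10, 512)).shape)          # torch.Size([2, 10, 512])
```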

The Transformer decoder block has similar components but adds:

  • masking in its self-attention, so each position attends only to earlier positions
  • an encoder-decoder (cross-)attention sub-layer whose keys and values come from the encoder output
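
A short sketch of these two decoder-specific pieces, assuming a boolean causal mask and torch.nn.MultiheadAttention; the tensors and dimensions are illustrative:

```python
import torch
import torch.nn as nn

d_model, n_heads, tgt_len = 512, 8, 10
tgt = torch.randn(2, tgt_len, d_model)     # decoder-side embeddings (toy data)
memory = torch.randn(2, 12, d_model)       # encoder output for a 12-token source

# Causal mask: True marks positions a query may NOT attend to,
# so position i only sees positions 0..i.
causal_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)

masked_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x, _ = masked_self_attn(tgt, tgt, tgt, attn_mask=causal_mask)  # masked self-attention
x, _ = cross_attn(x, memory, memory)  # queries from decoder, keys/values from encoder
print(x.shape)                        # torch.Size([2, 10, 512])
```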

Input processing involves:

  • tokenizing the text and mapping each token to a learned embedding
  • scaling the embeddings by sqrt(d_model)
  • adding positional encodings so the model can use sequence order
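
A sketch of this pipeline using the paper's sinusoidal positional encoding; the vocabulary size, the toy batch, and the helper name sinusoidal_positional_encoding are assumptions made for illustration:

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positional_encoding(max_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

d_model = 512
embed = nn.Embedding(1000, d_model)                   # toy vocabulary of 1000 ids
token_ids = torch.randint(0, 1000, (2, 10))           # 2 sequences of 10 token ids
x = embed(token_ids) * math.sqrt(d_model)             # embeddings scaled by sqrt(d_model)
x = x + sinusoidal_positional_encoding(10, d_model)   # add position information
print(x.shape)                                        # torch.Size([2, 10, 512])
```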

The architecture’s design choices enable several benefits:

  • parallel computation across sequence positions
  • constant path length between any two positions, versus a path that grows with distance in RNNs (see the sketch after this list)
  • direct modeling of long-range dependencies through attention
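
To illustrate the first two points, the toy comparison below contrasts an RNN, which must step through positions sequentially, with self-attention, which relates all positions in a single matrix product; the module choices and sizes are arbitrary:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 100, 64)   # toy batch: 2 sequences of 100 positions

# RNN: positions are processed one after another, so a signal between
# position 0 and position 99 passes through ~100 sequential steps.
rnn = nn.RNN(64, 64, batch_first=True)
rnn_out, _ = rnn(x)

# Self-attention: every pair of positions interacts in one matrix
# product, giving a constant (one-hop) path between any two positions.
attn = nn.MultiheadAttention(64, 8, batch_first=True)
attn_out, _ = attn(x, x, x)
print(rnn_out.shape, attn_out.shape)   # both torch.Size([2, 100, 64])
```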

The Transformer has a plethora of applications: machine translation, text summarization, named entity recognition (NER), question answering, text generation, chatbots, computer vision, and more.