
The vanilla Transformer architecture, introduced in the “Attention Is All You Need” paper, was designed to address the limitations of recurrent neural networks in seq2seq modelling, particularly for machine translation. The Transformer achieved state-of-the-art results on the WMT 2014 machine translation benchmarks while requiring significantly less training time than recurrent approaches. Its success stems from three key innovations:

  1. Complete replacement of recurrence with self-attention
  2. Multi-head attention, allowing different representation subspaces (see the sketch after this list)
  3. Positional encodings, enabling sequence order awareness
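
To make the multi-head idea concrete, here is a minimal PyTorch sketch; the dimensions, the toy input, and the use of torch.nn.MultiheadAttention are illustrative choices, not part of the original text. Each head attends within its own lower-dimensional slice of the model width.

```python
import torch
import torch.nn as nn

d_model, n_heads = 64, 8                 # illustrative sizes
x = torch.randn(2, 10, d_model)          # toy batch: 2 sequences of 10 tokens

# Each of the 8 heads projects queries/keys/values into its own
# d_model / n_heads = 8-dimensional subspace and attends there.
self_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads,
                                  batch_first=True)
out, weights = self_attn(x, x, x)        # query = key = value = x (self-attention)
print(out.shape)                         # torch.Size([2, 10, 64])
print(weights.shape)                     # torch.Size([2, 10, 10]), averaged over heads
```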

The attention mechanism weighs the influence of different parts of the input on each part of the output.
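
Concretely, this weighting is the scaled dot-product attention of the original paper, softmax(Q K^T / sqrt(d_k)) V. Below is a minimal sketch of that formula; the tensor sizes and random inputs are illustrative only.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # Each output row is a weighted average of the value vectors; the weights
    # come from a softmax over query-key similarities, scaled by sqrt(d_k).
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (L_q, L_k) similarities
    weights = torch.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ v, weights

q = torch.randn(10, 64)   # toy queries, keys, and values for 10 positions
k = torch.randn(10, 64)
v = torch.randn(10, 64)
out, w = scaled_dot_product_attention(q, k, v)
print(out.shape, w.shape)  # torch.Size([10, 64]) torch.Size([10, 10])
```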

The original paper designed the architecture as an encoder-decoder model, the standard setup for the machine translation task. The Transformer encoder block contains:

  • a multi-head self-attention sub-layer
  • a position-wise feed-forward network
  • a residual connection followed by layer normalization around each sub-layer
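
A minimal sketch of one encoder block under these assumptions (post-norm ordering as in the original paper; the base hyperparameters d_model=512, n_heads=8, d_ff=2048 follow the paper, while the class name EncoderBlock is a hypothetical choice for illustration):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    # One encoder block: self-attention and a position-wise feed-forward
    # network, each wrapped in a residual connection followed by layer norm.
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff),
                                 nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)        # self-attention sub-layer
        x = self.norm1(x + self.drop(attn_out))      # residual + layer norm
        x = self.norm2(x + self.drop(self.ffn(x)))   # feed-forward sub-layer
        return x

block = EncoderBlock()
print(block(torch.randn(2, 10, 512)).shape)          # torch.Size([2, 10, 512])
```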

The Transformer decoder block has similar components but adds:

  • masking in its self-attention, so each position attends only to earlier positions
  • an encoder-decoder (cross-)attention sub-layer whose keys and values come from the encoder output
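
A short sketch of these two decoder-specific pieces, assuming a boolean causal mask and torch.nn.MultiheadAttention; the tensors and dimensions are illustrative:

```python
import torch
import torch.nn as nn

d_model, n_heads, tgt_len = 512, 8, 10
tgt = torch.randn(2, tgt_len, d_model)     # decoder-side embeddings (toy data)
memory = torch.randn(2, 12, d_model)       # encoder output for a 12-token source

# Causal mask: True marks positions a query may NOT attend to,
# so position i only sees positions 0..i.
causal_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)

masked_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x, _ = masked_self_attn(tgt, tgt, tgt, attn_mask=causal_mask)  # masked self-attention
x, _ = cross_attn(x, memory, memory)  # queries from decoder, keys/values from encoder
print(x.shape)                        # torch.Size([2, 10, 512])
```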

Input processing involves:

  • tokenizing the text and mapping each token to a learned embedding
  • scaling the embeddings by sqrt(d_model)
  • adding positional encodings so the model can use sequence order
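
A sketch of this pipeline using the paper's sinusoidal positional encoding; the vocabulary size, the toy batch, and the helper name sinusoidal_positional_encoding are assumptions made for illustration:

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positional_encoding(max_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

d_model = 512
embed = nn.Embedding(1000, d_model)                   # toy vocabulary of 1000 ids
token_ids = torch.randint(0, 1000, (2, 10))           # 2 sequences of 10 token ids
x = embed(token_ids) * math.sqrt(d_model)             # embeddings scaled by sqrt(d_model)
x = x + sinusoidal_positional_encoding(10, d_model)   # add position information
print(x.shape)                                        # torch.Size([2, 10, 512])
```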

The architecture’s design choices enable several benefits:

  • parallel computation across sequence positions
  • constant path length between any two positions, versus a path that grows with distance in RNNs (see the sketch after this list)
  • direct modeling of long-range dependencies through attention
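
To illustrate the first two points, the toy comparison below contrasts an RNN, which must step through positions sequentially, with self-attention, which relates all positions in a single matrix product; the module choices and sizes are arbitrary:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 100, 64)   # toy batch: 2 sequences of 100 positions

# RNN: positions are processed one after another, so a signal between
# position 0 and position 99 passes through ~100 sequential steps.
rnn = nn.RNN(64, 64, batch_first=True)
rnn_out, _ = rnn(x)

# Self-attention: every pair of positions interacts in one matrix
# product, giving a constant (one-hop) path between any two positions.
attn = nn.MultiheadAttention(64, 8, batch_first=True)
attn_out, _ = attn(x, x, x)
print(rnn_out.shape, attn_out.shape)   # both torch.Size([2, 100, 64])
```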

The Transformer has a plethora of applications: machine translation, text summarization, named entity recognition (NER), question answering, text generation, chatbots, computer vision, and more.