Topics
Transformers use a network architecture that relies on the attention mechanism to weigh the influence of different parts of the input on each part of the output. The original paper ("Attention Is All You Need", Vaswani et al., 2017) designed the architecture as an encoder-decoder model built from the components below.
The key components are:
- Attention mechanism: enables the model to selectively focus on different parts of the input sequence; includes self-attention and multi-head attention (see the first sketch after this list).
- Parallelization: the model can process all elements of the sequence simultaneously, unlike recurrent models that handle tokens one at a time.
- Encoder-decoder architecture: the model was originally used for sequence-to-sequence tasks, e.g. machine translation, and has two blocks, an encoder and a decoder.
- Positional encoding: incorporated to account for the order of the input sequence, since the attention mechanism itself is order-agnostic (see the positional-encoding sketch below).
- Layer normalization: a normalization technique, applied around each sub-layer together with residual connections, that stabilizes activations and gradients during training (see the layer-norm sketch below).
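To make the attention bullet concrete, here is a minimal NumPy sketch of scaled dot-product attention, softmax(QKᵀ/√d_k)V, as in the original paper; the array shapes and dimensions are illustrative assumptions, not taken from any specific implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices of queries, keys, and values.
    d_k = Q.shape[-1]
    # Attention weights: how much each query position attends to each key position.
    scores = Q @ K.T / np.sqrt(d_k)        # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted average of the value vectors.
    return weights @ V                     # (seq_len, d_k)

# Toy self-attention: queries, keys, and values all come from the same sequence.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))            # 5 tokens, model dimension 8
out = scaled_dot_product_attention(x, x, x)
print(out.shape)                           # (5, 8)
```

The matrix products compute all positions at once, which is where the parallelization benefit over recurrent models comes from; multi-head attention runs several such attention operations in parallel on learned linear projections of Q, K, and V and concatenates the results.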
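The sinusoidal positional encoding from the original paper can also be written in a few lines; this sketch assumes the standard formulation PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), with an illustrative sequence length and model dimension.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # positions: (seq_len, 1); even dimension indices i: (d_model / 2,)
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)
    # Each pair of dimensions (2i, 2i+1) shares one frequency 1 / 10000^(2i / d_model).
    angles = pos / np.power(10000.0, i / d_model)   # (seq_len, d_model / 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# The encoding is simply added to the token embeddings before the first layer.
pe = sinusoidal_positional_encoding(seq_len=50, d_model=8)
print(pe.shape)   # (50, 8)
```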
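Finally, layer normalization normalizes each token's feature vector to zero mean and unit variance and then applies a learned scale and shift; a minimal sketch, with the scale and shift initialized to 1 and 0 for illustration:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the feature dimension of each token independently.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # gamma and beta are learned per-feature scale and shift parameters.
    return gamma * x_hat + beta

x = np.random.default_rng(1).standard_normal((5, 8))
out = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=-1).round(6))   # each token is approximately zero-mean
```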
Applications: machine translation, text summarization, named entity recognition (NER), question answering, text generation, chatbots, computer vision, and more.