The vanilla transformer consists of an encoder and a decoder block. Both rely on self-attention to process sequential data, which lets the model capture relationships between tokens anywhere in the sequence. The encoder's primary function is to encode the input sequence, transforming raw token embeddings into context-aware representations.
Structure:
- Stack of identical layers (typically 6-12)
- Each layer contains two sublayers, each wrapped in a residual connection (a minimal code sketch follows this list):
  - Multi-head self-attention sublayer
  - Position-wise feed-forward network
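As a minimal sketch of this structure, the snippet below stacks six encoder layers using PyTorch's built-in `nn.TransformerEncoderLayer` and `nn.TransformerEncoder`. The dimensions (d_model of 512, 8 heads) follow the original Transformer paper but are otherwise illustrative assumptions.

```python
import torch
import torch.nn as nn

# One encoder layer: multi-head self-attention + feed-forward network,
# each sublayer wrapped with a residual connection and layer normalization.
layer = nn.TransformerEncoderLayer(
    d_model=512,           # embedding size (illustrative)
    nhead=8,               # number of attention heads
    dim_feedforward=2048,  # hidden size of the feed-forward sublayer
    batch_first=True,
)

# A stack of 6 identical layers forms the encoder.
encoder = nn.TransformerEncoder(layer, num_layers=6)

x = torch.randn(2, 10, 512)   # (batch, sequence length, d_model)
out = encoder(x)              # same shape: (2, 10, 512)
```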
Key features:
- Parallel processing: Unlike RNNs, encoders process entire sequences simultaneously, improving efficiency.
- Context awareness: Self-attention allows each token to attend to all other tokens, capturing long-range dependencies.
- Positional encoding: Added to the input embeddings to retain sequence-order information (a sketch of the sinusoidal variant follows this list).
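As a sketch, the fixed sinusoidal positional encoding from the original Transformer paper can be computed as below; the sequence length and embedding size are illustrative.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal positional encodings (one row per position)."""
    position = torch.arange(seq_len).unsqueeze(1)   # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)    # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)    # odd dimensions use cosine
    return pe

# The encodings are added (not concatenated) to the token embeddings.
embeddings = torch.randn(10, 512)                   # 10 tokens, d_model = 512
x = embeddings + sinusoidal_positional_encoding(10, 512)
```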
Example
In machine translation, the encoder might process the sentence "The cat sat on the mat" through the following steps (a code sketch of the full pipeline follows the list):
- Embedding each word
- Applying positional encoding
- Using self-attention to understand relationships (e.g., “sat” relates to “cat”)
- Passing through feed-forward layers for further processing
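Putting these steps together, a hedged end-to-end sketch might look like the following. The toy vocabulary, token ids, and module sizes are illustrative assumptions (a real system would use a trained tokenizer and learned weights), not a production pipeline.

```python
import math
import torch
import torch.nn as nn

# Toy vocabulary and (lowercased) token ids for the example sentence.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
ids = torch.tensor([[vocab[w] for w in "the cat sat on the mat".split()]])  # (1, 6)

d_model, n_heads, n_layers = 512, 8, 6

# Step 1: embed each word.
embed = nn.Embedding(len(vocab), d_model)
x = embed(ids)                                     # (1, 6, 512)

# Step 2: add sinusoidal positional encodings (as sketched earlier).
pos = torch.arange(6).unsqueeze(1)
div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pe = torch.zeros(6, d_model)
pe[:, 0::2], pe[:, 1::2] = torch.sin(pos * div), torch.cos(pos * div)
x = x + pe

# Steps 3-4: self-attention and feed-forward sublayers inside the encoder stack.
layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
memory = encoder(x)                                # (1, 6, 512) context-rich representations
```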
The output of the encoder is a sequence of context-rich representations that can be consumed by a transformer decoder block or used for other downstream tasks.
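As a hypothetical illustration of one such downstream use, the encoder output could be mean-pooled over tokens and fed to a linear head; the tensor shape and the two-class task below are assumptions for the sketch.

```python
import torch
import torch.nn as nn

memory = torch.randn(1, 6, 512)          # stand-in for the encoder output above
classifier = nn.Linear(512, 2)           # hypothetical two-class downstream head
logits = classifier(memory.mean(dim=1))  # mean-pool over tokens -> (1, 2)
```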