
Vanilla transformers consist of two blocks: an encoder and a decoder. Both rely on self-attention to process sequential data, which lets the model capture relationships between tokens regardless of their distance in the sequence. The encoder's primary role is to transform raw input sequences into context-aware representations.
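
To make the self-attention step concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The sequence length, model dimension, and random weight matrices are illustrative assumptions, not values from the text:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over x of shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # project tokens to queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])       # every token scores every other token
    weights = softmax(scores, axis=-1)            # attention weights; each row sums to 1
    return weights @ v                            # context-aware representation per token

# Toy sizes: 6 tokens, model dimension 8 (hypothetical)
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)     # (6, 8): one vector per input token
```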

Structure:

[Figure: transformer encoder block structure]

Key features:

  • Parallel processing: Unlike RNNs, encoders process entire sequences simultaneously, improving efficiency.
  • Context awareness: Self-attention allows each token to attend to all other tokens, capturing long-range dependencies.
  • Positional encoding: Position information is added to the input embeddings so the model retains sequence order (a small sketch follows this list).
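
As a rough sketch of the positional-encoding idea, here is the fixed sinusoidal scheme from the original transformer paper in NumPy (an even d_model is assumed for simplicity):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sinusoidal positional encodings; assumes an even d_model."""
    positions = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                   # even embedding dimensions 2i
    angles = positions / np.power(10000.0, dims / d_model)     # pos / 10000^(2i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                               # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)                               # cosine on odd dimensions
    return pe

# The encoding is simply added to the token embeddings, e.g.:
# embeddings = embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```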

Example

In machine translation, the encoder might process the sentence "The cat sat on the mat" by:

  • Embedding each word
  • Applying positional encoding
  • Using self-attention to understand relationships (e.g., “sat” relates to “cat”)
  • Passing the result through feed-forward layers for further processing (these steps are sketched in code below)
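
Putting the four steps together, here is a minimal PyTorch sketch of an encoder pass over this sentence. The toy vocabulary, the learnable positional parameter, and the layer sizes are assumptions made for illustration; they are not part of the original text:

```python
import torch
import torch.nn as nn

# Toy vocabulary and token ids for "the cat sat on the mat" (hypothetical)
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
tokens = torch.tensor([[0, 1, 2, 3, 0, 4]])                  # batch of one sentence

d_model = 32
embed = nn.Embedding(len(vocab), d_model)                    # step 1: embed each word
pos = nn.Parameter(torch.zeros(tokens.shape[1], d_model))    # step 2: positional information
                                                             #         (a learnable parameter here, for brevity)
x = embed(tokens) + pos

# steps 3-4: self-attention plus feed-forward, bundled in standard encoder layers
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                   dim_feedforward=64, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

out = encoder(x)
print(out.shape)   # torch.Size([1, 6, 32]): one context-rich vector per token
```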

The output of the encoder is a series of context-rich representations that can be used by a transformer decoder block or for other downstream tasks.