transformer encoder block


A standard Transformer encoder block is the repeating unit of the Transformer’s encoder. It passes an input sequence through two main sub-layers, each followed by a residual connection and a layer normalization step. The encoder’s primary function is to turn input embeddings into context-aware representations. The structure of an encoder block is as follows:

1. Multi-head self-attention: The input to the encoder block first goes through a multi-head self-attention sub-layer. The queries, keys, and values are all derived from the output of the previous encoder layer (or, in the first layer, from the input embeddings with positional encoding added). This allows each position in the input sequence to attend to every position, capturing contextual dependencies. A padding mask is typically applied so that attention ignores padding tokens (which are added to bring the sequences in a batch to the same length).
2. Add & Norm: A residual connection is applied around the multi-head attention sub-layer, meaning the original input to the sub-layer is added to its output. This sum is then passed through layer normalization, i.e. $x_{out} = \mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$.
3. Position-wise FFN: The output of the normalization step is then fed into a position-wise feed-forward network sub-layer: two linear layers with a ReLU activation in between, which expand and then compress the dimension (512 → 2048 → 512) and are applied to each position independently.
4. Second Add & Norm: Same as step 2, applied around the FFN sub-layer. Its output is the final output of this encoder block.
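The four steps above can be sketched in NumPy. This is only an illustrative forward pass under stated assumptions, not a reference implementation: 8 heads, no batch dimension, no dropout, layer normalization without learned gain and bias, and randomly initialised weights. All function and parameter names (`encoder_block`, `Wq`, `W1`, ...) are made up for this sketch.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector to zero mean and unit variance.
    # Learned gain and bias are omitted here (an assumption for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, params, num_heads, pad_mask):
    # x: (seq_len, d_model); pad_mask: (seq_len,) bool, True for real tokens.
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Queries, keys and values are all projections of the same input x.
    q = split(x @ params["Wq"])
    k = split(x @ params["Wk"])
    v = split(x @ params["Wv"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    # Padding mask: padded key positions get a very negative score.
    scores = np.where(pad_mask[None, None, :], scores, -1e9)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ params["Wo"]

def feed_forward(x, params):
    # Position-wise FFN: expand, ReLU, compress (d_model -> d_ff -> d_model).
    hidden = np.maximum(0.0, x @ params["W1"] + params["b1"])
    return hidden @ params["W2"] + params["b2"]

def encoder_block(x, params, num_heads, pad_mask):
    # Steps 1 and 2: self-attention, then Add & Norm.
    x = layer_norm(x + multi_head_self_attention(x, params, num_heads, pad_mask))
    # Steps 3 and 4: position-wise FFN, then the second Add & Norm.
    return layer_norm(x + feed_forward(x, params))

def init_params(d_model, d_ff, rng, scale=0.02):
    shapes = {"Wq": (d_model, d_model), "Wk": (d_model, d_model),
              "Wv": (d_model, d_model), "Wo": (d_model, d_model),
              "W1": (d_model, d_ff), "W2": (d_ff, d_model)}
    params = {name: rng.normal(0.0, scale, shape) for name, shape in shapes.items()}
    params["b1"] = np.zeros(d_ff)
    params["b2"] = np.zeros(d_model)
    return params

rng = np.random.default_rng(0)
d_model, d_ff, num_heads, seq_len = 512, 2048, 8, 6
params = init_params(d_model, d_ff, rng)
x = rng.normal(size=(seq_len, d_model))       # stand-in for embeddings + positions
pad_mask = np.array([True, True, True, True, False, False])  # last two are padding
out = encoder_block(x, params, num_heads, pad_mask)
print(out.shape)  # (6, 512)
```

Because the output has the same shape as the input, the result can be passed straight into another call to `encoder_block` with a fresh set of parameters.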

The data flow can be summarized as: Input $\to$ Multi-Head Attention $\to$ Add & Norm $\to$ Position-wise FFN $\to$ Add & Norm $\to$ Output. Stacking multiple such encoder layers (typically 6 to 12) allows the model to learn increasingly complex representations of the input sequence.
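To get a feel for what stacking costs, the parameter count of one block can be worked out from the dimensions given above (512 and 2048). The sketch below assumes four 512 × 512 attention projections (query, key, value, output), bias terms on every linear layer, and a learned gain and bias in each layer normalization; real implementations may differ on any of these, and the helper name `encoder_block_params` is made up here.

```python
def encoder_block_params(d_model=512, d_ff=2048):
    # Attention: Q, K, V and output projections, each d_model x d_model plus bias.
    attention = 4 * (d_model * d_model + d_model)
    # FFN: d_model -> d_ff and d_ff -> d_model, each with a bias vector.
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms, each with a gain and a bias of size d_model.
    layer_norms = 2 * (2 * d_model)
    return attention + ffn + layer_norms

per_block = encoder_block_params()
print(per_block)  # 3152384
for num_layers in (6, 12):
    print(num_layers, num_layers * per_block)
# 6 18914304
# 12 37828608
```

Under these assumptions the FFN holds roughly two thirds of each block's parameters, and the total grows linearly with the number of stacked layers.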


Key features:

Input: token embeddings (for sequence models) plus positional encoding
Parallel processing: Unlike RNNs, encoders process all positions of a sequence simultaneously
All sub-layers maintain the same dimension, so the residual additions are well-defined
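As an illustration of the input feature, the sketch below builds an encoder input by adding a positional encoding to token embeddings. It assumes the fixed sinusoidal encoding; learned positional embeddings are a common alternative. The vocabulary size, token ids, and embedding table are hypothetical placeholders.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(same angle)
    positions = np.arange(seq_len)[:, None]      # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2), values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

vocab_size, d_model = 1000, 512                  # hypothetical sizes
rng = np.random.default_rng(0)
# Hypothetical embedding table; in a trained model these weights are learned.
embedding_table = rng.normal(0.0, 0.02, (vocab_size, d_model))
token_ids = np.array([5, 42, 7, 19, 5, 88])      # hypothetical ids; 5 appears twice

pe = sinusoidal_positional_encoding(len(token_ids), d_model)
encoder_input = embedding_table[token_ids] + pe
print(encoder_input.shape)  # (6, 512)
```

The same token id at positions 0 and 4 ends up with different input vectors, which is how the encoder can tell positions apart even though it processes all of them in parallel.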

Translation

Encodes "The cat sat on the mat" by:

Embedding the words and adding positional information

Computing attention scores to capture relationships between words (e.g. “sat” → “cat”)

Outputting contextual embeddings for the decoder or other downstream tasks
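A toy version of these steps, using a single attention head and random vectors in place of trained embeddings, might look like the following. Since nothing here is trained, the printed weights carry no linguistic meaning; the sketch only shows the mechanics of how the row for “sat” distributes attention over all six tokens. The dimension 16 and all variable names are arbitrary choices for this example.

```python
import numpy as np

tokens = ["The", "cat", "sat", "on", "the", "mat"]
d_model = 16                                   # small, arbitrary dimension
rng = np.random.default_rng(0)

# Stand-in for "embedding + positional info": one random vector per token.
x = rng.normal(size=(len(tokens), d_model))

# Random projection matrices; in a trained model these are learned.
Wq = rng.normal(size=(d_model, d_model))
Wk = rng.normal(size=(d_model, d_model))
Wv = rng.normal(size=(d_model, d_model))
q, k, v = x @ Wq, x @ Wk, x @ Wv

# Scaled dot-product attention scores, then a row-wise softmax.
scores = q @ k.T / np.sqrt(d_model)            # (6, 6)
scores -= scores.max(axis=-1, keepdims=True)   # for numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)

# Contextual embeddings: each row is a weighted mix of all value vectors.
contextual = weights @ v                       # (6, 16)

sat = tokens.index("sat")
for token, weight in zip(tokens, weights[sat]):
    print(f"sat -> {token}: {weight:.3f}")
```

Each row of `weights` sums to 1, so every contextual embedding is a weighted average of the value vectors of the whole sentence.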

transformer decoder block
BERT (encoder-only architecture)

Altamash Khan
