
Unlike recurrent neural networks (RNNs) that inherently process sequences in order, the transformer architecture processes all tokens in parallel. This parallelism offers significant computational advantages but means the model doesn’t have a built-in sense of the order of tokens in the sequence. Since the order of words (or other tokens) is crucial for meaning, we need a mechanism to explicitly encode the position of each token.

The original Transformer paper introduced sinusoidal positional encoding as a way to address this. This method uses sine and cosine functions of different frequencies to create a unique vector representation for each position in the sequence. The formulas for calculating the positional encoding $PE$ for a given position $pos$ and dimension index $i$ are:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

$$PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

where $pos$ is the position of the token in the sequence (starting from 0), $d_{\text{model}}$ is the dimensionality of the embedding vectors, and $i$ ranges from 0 to $d_{\text{model}}/2 - 1$. The different frequencies of the sine and cosine functions across the dimensions allow for a unique encoding of each position.
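To make the formulas concrete, here is a minimal NumPy sketch that builds the full positional-encoding matrix for a sequence; the function name and the small example shapes are my own choices for illustration, not from the original paper.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Build a (max_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(max_len)[:, np.newaxis]      # (max_len, 1): pos = 0 .. max_len - 1
    dims = np.arange(0, d_model, 2)[np.newaxis, :]     # (1, d_model/2): values 2i for i = 0 .. d_model/2 - 1
    angles = positions / np.power(10000.0, dims / d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

# A 4-position, 8-dimensional example; position 0 encodes as [0, 1, 0, 1, ...]
print(sinusoidal_positional_encoding(4, 8).round(3))
```

Because each dimension pair uses a different frequency, no two positions produce the same row of this matrix, which is exactly the uniqueness property described above.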

During the forward pass, the same position receives the same positional encoding regardless of which token occupies it. The positional encoding vector is added element-wise to the corresponding token embedding, allowing the model to jointly reason about both what a token is and where it appears in the sequence.
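A small sketch of that addition, reusing the `sinusoidal_positional_encoding` function from above; the embedding table, token IDs, and shapes here are made-up placeholders for illustration.

```python
rng = np.random.default_rng(0)
d_model = 8
token_embeddings = rng.normal(size=(100, d_model))   # stand-in for a learned embedding table
token_ids = np.array([5, 42, 7, 42])                 # token 42 appears at positions 1 and 3

x = token_embeddings[token_ids]                                  # (4, d_model): lookup by token id
x = x + sinusoidal_positional_encoding(len(token_ids), d_model)  # element-wise add of position info

# The two occurrences of token 42 now have different vectors, because positions 1 and 3
# received different encodings; conversely, position 1 would get the same encoding for any token.
```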

While sinusoidal positional encoding is a common choice, other methods exist: learned positional embeddings assign a separate trainable embedding vector to each position (much like token embeddings), while relative positional encoding schemes model the distance between pairs of tokens rather than their absolute positions.
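For contrast with the fixed sinusoidal table, here is a minimal PyTorch sketch of the learned-embedding variant; the class name and shapes are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Sketch of learned positional embeddings: one trainable vector per position."""
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.pos_embed = nn.Embedding(max_len, d_model)  # indexed by position, trained with the model

    def forward(self, token_vectors: torch.Tensor) -> torch.Tensor:
        # token_vectors: (batch, seq_len, d_model)
        seq_len = token_vectors.size(1)
        positions = torch.arange(seq_len, device=token_vectors.device)  # 0, 1, ..., seq_len - 1
        return token_vectors + self.pos_embed(positions)                # broadcasts over the batch dimension

# Usage example
layer = LearnedPositionalEmbedding(max_len=512, d_model=64)
out = layer(torch.zeros(2, 10, 64))   # shape (2, 10, 64)
```

Unlike the sinusoidal table, these vectors are learned from data, but the model cannot extrapolate to positions beyond `max_len` seen during training.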