Topics

Some history: before the advent of the transformer, the attention mechanism was primarily developed to enhance RNN-based encoder-decoder architectures, particularly for neural machine translation (NMT). Two influential approaches from this era are Bahdanau attention (additive attention, proposed by Bahdanau et al.) and Luong attention (multiplicative attention, explored by Luong et al.).

Their core difference lies in how the alignment scores are computed: a small feed-forward network (MLP) for Bahdanau's additive attention versus simpler multiplicative functions (dot product or bilinear) for Luong's; and in which decoder state is used for scoring (the previous state $s_{t-1}$ for Bahdanau vs. the current state $h_t$ for Luong).
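
To make the contrast concrete, here is a minimal PyTorch sketch of the two scoring strategies. The tensor names and sizes are illustrative only and are not taken from either paper's reference implementation.

```python
import torch
import torch.nn as nn

d = 256                               # hidden size (illustrative)
T = 10                                # source sequence length
dec_state = torch.randn(1, d)         # decoder state: s_{t-1} (Bahdanau) or h_t (Luong)
enc_outputs = torch.randn(T, d)       # encoder hidden states h_1..h_T

# --- Bahdanau (additive): score_i = v^T tanh(W_a s + U_a h_i) ---
W_a = nn.Linear(d, d, bias=False)
U_a = nn.Linear(d, d, bias=False)
v = nn.Linear(d, 1, bias=False)
additive_scores = v(torch.tanh(W_a(dec_state) + U_a(enc_outputs))).squeeze(-1)  # (T,)

# --- Luong (multiplicative): "dot" = s^T h_i, "general" = s^T W h_i ---
dot_scores = enc_outputs @ dec_state.squeeze(0)             # (T,)
W_g = nn.Linear(d, d, bias=False)
general_scores = enc_outputs @ W_g(dec_state).squeeze(0)    # (T,)

# Either set of scores is turned into attention weights the same way:
weights = torch.softmax(additive_scores, dim=0)             # (T,)
context = weights @ enc_outputs                             # (d,) context vector c_t
```

The additive path costs an extra matrix of parameters and a tanh per source position, while the dot-product path is a single matrix-vector product, which is the complexity difference summarized in the table below.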

Tip

For larger key dimensions, additive attention can outperform unscaled dot-product attention: dot products grow in magnitude with the dimension and push the softmax into regions with vanishingly small gradients. This is what motivated the $1/\sqrt{d_k}$ scaling factor introduced later in scaled dot-product attention (SDPA).
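
A quick numerical illustration of that effect (synthetic numbers; the key dimension of 512 is just an example):

```python
import torch

d_k = 512
q = torch.randn(d_k)
keys = torch.randn(10, d_k)

# Dot products of random d_k-dimensional unit-variance vectors have variance ~ d_k,
# so the unscaled scores saturate the softmax; dividing by sqrt(d_k) restores ~unit variance.
raw = keys @ q                     # unscaled scores, std ≈ sqrt(d_k) ≈ 22.6
scaled = raw / d_k ** 0.5          # SDPA-style scaling

print(raw.std(), scaled.std())                 # ~22.6 vs ~1.0
print(torch.softmax(raw, dim=0).max())         # often ~1.0: near one-hot, tiny gradients
print(torch.softmax(scaled, dim=0).max())      # softer, more informative distribution
```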

| Feature | Bahdanau (Additive) Attention | Luong (Multiplicative) Attention |
| --- | --- | --- |
| Primary Score Function Type | Feed-forward network (MLP) | Dot product / bilinear ("general") |
| Decoder State Used for Scoring | Previous state $s_{t-1}$ | Current state $h_t$ |
| Complexity | Higher computational cost, more parameters | Lower computational cost, fewer parameters |
| Key Performance Aspects | Models complex alignments well, good for long sequences | Simpler, faster, less prone to overfitting, flexible family of score functions |
| Typical Encoder | Bidirectional RNN | Unidirectional RNN (common variant) |
| Decoder Integration Point | Context vector $c_t$ feeds into the RNN that computes the current decoder state $s_t$ | Context vector $c_t$ combined with $h_t$ to form $\tilde{h}_t$ for prediction |
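
Spelled out as equations (following the papers' formulations, with Bahdanau's decoder state written $s$ and Luong's written $h_t$), the two pipelines are:

$$
\begin{aligned}
\textbf{Bahdanau:}\quad & e_{t,i} = v_a^\top \tanh\!\left(W_a s_{t-1} + U_a h_i\right), \quad
\alpha_{t,i} = \operatorname{softmax}_i(e_{t,i}), \quad
c_t = \sum_i \alpha_{t,i} h_i, \\
& s_t = f\!\left(s_{t-1}, y_{t-1}, c_t\right) \\[4pt]
\textbf{Luong:}\quad & e_{t,i} = h_t^\top \bar{h}_i \;(\text{dot}) \quad\text{or}\quad h_t^\top W_a \bar{h}_i \;(\text{general}), \\
& \tilde{h}_t = \tanh\!\left(W_c\,[\,c_t;\, h_t\,]\right), \quad
p(y_t \mid y_{<t}, x) = \operatorname{softmax}(W_s \tilde{h}_t)
\end{aligned}
$$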

The exploration of different scoring functions by Luong et al., particularly the dot product, foreshadowed the scaled dot-product attention that became central to the transformer architecture. While both Bahdanau and Luong attention significantly improved RNN capabilities, they remained bound by the sequential nature of RNNs.