Topics
Some history: before the advent of the transformer, attention mechanisms were primarily developed to enhance RNN-based encoder-decoder architectures, particularly for neural machine translation (NMT). Two influential approaches from this era are Bahdanau attention (additive attention, proposed by Bahdanau et al.) and Luong attention (multiplicative attention, explored by Luong et al.).
Their core differences lie in how the alignment scores are computed (an MLP for Bahdanau's additive attention versus simpler multiplicative functions for Luong) and in which decoder state is used for scoring (the previous hidden state for Bahdanau versus the current one for Luong).
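The two score-function families can be sketched in a few lines of NumPy. This is a minimal illustration rather than the exact parameterization from either paper: the matrices `W_a`, `U_a`, `v_a`, and `W` are hypothetical placeholders, and a single decoder state attends over all encoder states.

```python
import numpy as np

d, T = 8, 5                      # hidden size and number of encoder time steps
rng = np.random.default_rng(0)

s = rng.normal(size=(d,))        # decoder state (previous state for Bahdanau, current for Luong)
H = rng.normal(size=(T, d))      # encoder hidden states h_1 .. h_T

# Bahdanau (additive): score(s, h_j) = v_a^T tanh(W_a s + U_a h_j)
W_a = rng.normal(size=(d, d))
U_a = rng.normal(size=(d, d))
v_a = rng.normal(size=(d,))
additive_scores = np.tanh(s @ W_a.T + H @ U_a.T) @ v_a   # shape (T,)

# Luong (multiplicative): "dot" score(s, h_j) = s^T h_j, "general" score = s^T W h_j
dot_scores = H @ s                                        # shape (T,)
W = rng.normal(size=(d, d))
general_scores = H @ (W @ s)                              # shape (T,)

# Either way, the scores are softmaxed into alignment weights and used to build a context vector
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

alpha = softmax(additive_scores)   # alignment weights over encoder positions
context = alpha @ H                # context vector: weighted sum of encoder states
```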
Tip
For larger dimensions d_k, additive attention can outperform unscaled dot-product attention, which highlights the importance of the 1/√d_k scaling factor later introduced in scaled dot-product attention (SDPA).
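To see why scaling matters: for query and key vectors with zero-mean, unit-variance components, the variance of a raw dot product grows linearly with the dimension, so the softmax saturates and gradients shrink; dividing by √d_k keeps the scores at roughly unit variance. A quick numerical check (a sketch assuming standard-normal vectors, not tied to any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256, 1024):
    q = rng.normal(size=(10_000, d_k))
    k = rng.normal(size=(10_000, d_k))
    raw = np.sum(q * k, axis=1)          # unscaled dot-product scores
    scaled = raw / np.sqrt(d_k)          # SDPA-style scaling
    print(f"d_k={d_k:5d}  var(raw)={raw.var():8.1f}  var(scaled)={scaled.var():.2f}")
# var(raw) grows roughly linearly with d_k, while var(scaled) stays near 1,
# so the softmax does not saturate as the dimension grows.
```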
Feature | Bahdanau (Additive) Attention | Luong (Multiplicative) Attention |
---|---|---|
Primary Score Function Type | Feed-forward network (MLP) | Dot product / bilinear ("general") |
Decoder State Used for Scoring | Previous hidden state s_{t-1} | Current hidden state h_t |
Complexity | Higher computational cost, more parameters | Lower computational cost, fewer parameters |
Key Performance Aspects | Models complex alignments well, good for long sequences | Simpler, faster, less prone to overfitting, flexible score functions |
Typical Encoder | Bidirectional RNN | Unidirectional RNN (common variant) |
Decoder Integration Point (sketched below) | Context vector c_t influences the calculation of the next hidden state s_t | Context vector c_t is combined with h_t to form the attentional vector h̃_t used for prediction |
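To make the "Decoder Integration Point" row concrete, here is a rough per-time-step sketch of where the context vector enters each decoder. It is a simplified illustration, not the full models: `rnn_cell`, `score`, and the matrix `W_c` are hypothetical stand-ins, and embeddings, output projections, and input feeding are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 5
H = rng.normal(size=(T, d))                   # encoder hidden states

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def score(s, H):                              # stand-in alignment score (dot product)
    return H @ s

def rnn_cell(x, s):                           # stand-in recurrent cell (identity weights)
    W_x = np.eye(len(s), len(x))
    W_s = np.eye(len(s))
    return np.tanh(W_x @ x + W_s @ s)

# Bahdanau: c_t is computed from the *previous* state and fed into the RNN that produces s_t
def bahdanau_step(y_prev, s_prev, H):
    alpha = softmax(score(s_prev, H))
    c_t = alpha @ H
    s_t = rnn_cell(np.concatenate([y_prev, c_t]), s_prev)   # c_t influences the new state s_t
    return s_t                                               # s_t then drives the output layer

# Luong: the *current* state h_t is computed first, then combined with c_t for prediction
def luong_step(y_prev, h_prev, H, W_c):
    h_t = rnn_cell(y_prev, h_prev)
    alpha = softmax(score(h_t, H))
    c_t = alpha @ H
    h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))      # attentional vector h̃_t
    return h_t, h_tilde                                       # h̃_t drives the output layer

y0, s0 = rng.normal(size=(d,)), np.zeros(d)
W_c = rng.normal(size=(d, 2 * d))
print(bahdanau_step(y0, s0, H).shape)        # (8,)
print(luong_step(y0, s0, H, W_c)[1].shape)   # (8,)
```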
The exploration of different scoring functions by Luong et al., particularly the dot product, foreshadowed the scaled dot-product attention that became central to the transformer architecture. While both Bahdanau and Luong attention significantly improved RNN-based encoder-decoder models, they remained bound by the sequential nature of RNNs.