Topics

Some history: before the advent of the transformer, the attention mechanism was primarily developed to enhance RNN-based encoder-decoder architectures, particularly for neural machine translation (NMT). Two influential approaches from this era are Bahdanau attention (additive attention, proposed by Bahdanau et al.) and Luong attention (multiplicative attention, explored by Luong et al.).

Their core difference lies in how the alignment scores are computed: a small feed-forward network (MLP) for Bahdanau's additive attention versus simpler multiplicative functions (dot product or bilinear) for Luong's; and in which decoder state is used for scoring (the previous state $s_{t-1}$ for Bahdanau vs. the current state $h_t$ for Luong).
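
To make the contrast concrete, here is a minimal PyTorch sketch of the two scoring strategies. The tensor names and sizes are illustrative only and are not taken from either paper's reference implementation.

```python
import torch
import torch.nn as nn

d = 256                               # hidden size (illustrative)
T = 10                                # source sequence length
dec_state = torch.randn(1, d)         # decoder state: s_{t-1} (Bahdanau) or h_t (Luong)
enc_outputs = torch.randn(T, d)       # encoder hidden states h_1..h_T

# --- Bahdanau (additive): score_i = v^T tanh(W_a s + U_a h_i) ---
W_a = nn.Linear(d, d, bias=False)
U_a = nn.Linear(d, d, bias=False)
v = nn.Linear(d, 1, bias=False)
additive_scores = v(torch.tanh(W_a(dec_state) + U_a(enc_outputs))).squeeze(-1)  # (T,)

# --- Luong (multiplicative): "dot" = s^T h_i, "general" = s^T W h_i ---
dot_scores = enc_outputs @ dec_state.squeeze(0)             # (T,)
W_g = nn.Linear(d, d, bias=False)
general_scores = enc_outputs @ W_g(dec_state).squeeze(0)    # (T,)

# Either set of scores is turned into attention weights the same way:
weights = torch.softmax(additive_scores, dim=0)             # (T,)
context = weights @ enc_outputs                             # (d,) context vector c_t
```

The additive path costs an extra matrix of parameters and a tanh per source position, while the dot-product path is a single matrix-vector product, which is the complexity difference summarized in the table below.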

Tip

For larger key dimensions, additive attention can outperform unscaled dot-product attention: dot products grow in magnitude with the dimension and push the softmax into regions with vanishingly small gradients. This is what motivated the $1/\sqrt{d_k}$ scaling factor introduced later in scaled dot-product attention (SDPA).
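
A quick numerical illustration of that effect (synthetic numbers; the key dimension of 512 is just an example):

```python
import torch

d_k = 512
q = torch.randn(d_k)
keys = torch.randn(10, d_k)

# Dot products of random d_k-dimensional unit-variance vectors have variance ~ d_k,
# so the unscaled scores saturate the softmax; dividing by sqrt(d_k) restores ~unit variance.
raw = keys @ q                     # unscaled scores, std ≈ sqrt(d_k) ≈ 22.6
scaled = raw / d_k ** 0.5          # SDPA-style scaling

print(raw.std(), scaled.std())                 # ~22.6 vs ~1.0
print(torch.softmax(raw, dim=0).max())         # often ~1.0: near one-hot, tiny gradients
print(torch.softmax(scaled, dim=0).max())      # softer, more informative distribution
```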

| Feature | Bahdanau (Additive) Attention | Luong (Multiplicative) Attention |
| --- | --- | --- |
| Primary Score Function Type | Feed-forward network (MLP) | Dot product / bilinear ("general") |
| Decoder State Used for Scoring | Previous state $s_{t-1}$ | Current state $h_t$ |
| Complexity | Higher computational cost, more parameters | Lower computational cost, fewer parameters |
| Key Performance Aspects | Models complex alignments well, good for long sequences | Simpler, faster, less prone to overfitting, flexible family of score functions |
| Typical Encoder | Bidirectional RNN | Unidirectional RNN (common variant) |
| Decoder Integration Point | Context vector $c_t$ feeds into the RNN that computes the current decoder state $s_t$ | Context vector $c_t$ combined with $h_t$ to form $\tilde{h}_t$ for prediction |
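
Spelled out as equations (following the papers' formulations, with Bahdanau's decoder state written $s$ and Luong's written $h_t$), the two pipelines are:

$$
\begin{aligned}
\textbf{Bahdanau:}\quad & e_{t,i} = v_a^\top \tanh\!\left(W_a s_{t-1} + U_a h_i\right), \quad
\alpha_{t,i} = \operatorname{softmax}_i(e_{t,i}), \quad
c_t = \sum_i \alpha_{t,i} h_i, \\
& s_t = f\!\left(s_{t-1}, y_{t-1}, c_t\right) \\[4pt]
\textbf{Luong:}\quad & e_{t,i} = h_t^\top \bar{h}_i \;(\text{dot}) \quad\text{or}\quad h_t^\top W_a \bar{h}_i \;(\text{general}), \\
& \tilde{h}_t = \tanh\!\left(W_c\,[\,c_t;\, h_t\,]\right), \quad
p(y_t \mid y_{<t}, x) = \operatorname{softmax}(W_s \tilde{h}_t)
\end{aligned}
$$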

The exploration of different scoring functions by Luong et al., particularly the dot product, foreshadowed the scaled dot-product attention that became central to the transformer architecture. While both Bahdanau and Luong attention significantly improved RNN capabilities, they remained bound by the sequential nature of RNNs.