
In beam search with beam size $k$, we end up with $k$ candidate sequences, out of which we pick the one that maximizes the following score:

$$\text{score}(y) = \sum_{t=1}^{T} \log P(y_t \mid y_1, \dots, y_{t-1}, c)$$

This is simply the sum of log-likelihood values, with each term conditioned on the previous outputs and the context vector $c$.

We use log likelihoods for numerical stability: otherwise probabilities are multiplied at each step, and the product keeps getting smaller since each factor satisfies $0 < P(\cdot) \le 1$, eventually underflowing floating-point precision.
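To see the underflow concretely, here is a minimal sketch; the per-token probability and sequence length are made-up illustrative values:

```python
import math

p = 0.1    # an illustrative per-token probability
T = 400    # an illustrative (long) sequence length

# Multiplying raw probabilities underflows float64 to exactly 0.0:
product = p ** T
print(product)      # 0.0

# Summing log probabilities stays comfortably in range:
log_score = T * math.log(p)
print(log_score)    # -921.03...
```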

Additionally, since the log of a probability is negative and we want to maximize the total log probability, the model will naively prefer shorter sentences (fewer negative terms in the sum). To overcome this potential problem, we normalize by the number of words $T$ in the sequence hypothesis. Mathematically, the normalizing factor is $\frac{1}{T^{\alpha}}$ with $0 \le \alpha \le 1$, giving the final score

$$\text{score}(y) = \frac{1}{T^{\alpha}} \sum_{t=1}^{T} \log P(y_t \mid y_1, \dots, y_{t-1}, c)$$

A common choice in practice is $\alpha \approx 0.7$.
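As a concrete sketch of this scoring rule (the function name and hypothesis values below are my own, not from the original):

```python
def length_normalized_score(token_log_probs, alpha=0.7):
    """Length-normalized beam search score: sum(log P) / T**alpha.

    alpha=0 recovers the raw sum; alpha=1 averages per token;
    intermediate values (e.g. 0.7) interpolate between the two.
    """
    T = len(token_log_probs)
    return sum(token_log_probs) / (T ** alpha)

# Pick the best of the k completed hypotheses, each given as
# its list of per-token log probabilities (illustrative numbers).
hypotheses = [
    [-0.5, -0.5],               # short hypothesis, raw sum = -1.0
    [-0.3, -0.3, -0.3, -0.3],   # longer hypothesis, raw sum = -1.2
]
best = max(hypotheses, key=length_normalized_score)
print(best)   # the longer hypothesis wins once length is normalized
```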

Example
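
As an illustration (these numbers are my own, chosen to mirror the sketch above): suppose hypothesis A has two tokens, each with log probability $-0.5$, and hypothesis B has four tokens, each with log probability $-0.3$. The raw sums are $-1.0$ for A and $-1.2$ for B, so the shorter hypothesis A wins even though B is more probable per token. Normalizing with $\alpha = 1$ gives $-1.0/2 = -0.5$ for A and $-1.2/4 = -0.3$ for B, and B now wins.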

Since our log probs are negative, this normalization relatively penalizes shorter sentences, counteracting the length bias of the raw sum.