
In beam search with beam size $k$, we end up with $k$ candidate sequences, out of which we pick the one that maximizes the following score:

$$\text{score}(y) = \sum_{t=1}^{T} \log P(y_t \mid y_1, \dots, y_{t-1}, c)$$

This is simply the sum of log-likelihood values, with each term conditioned on the previous outputs and the context vector $c$.

We use log likelihoods for numerical stability: otherwise probabilities are multiplied at each step, and the product keeps getting smaller since each factor satisfies $0 < P(\cdot) \le 1$, eventually underflowing floating-point precision.
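To see the underflow concretely, here is a minimal sketch; the per-token probability and sequence length are made-up illustrative values:

```python
import math

p = 0.1    # an illustrative per-token probability
T = 400    # an illustrative (long) sequence length

# Multiplying raw probabilities underflows float64 to exactly 0.0:
product = p ** T
print(product)      # 0.0

# Summing log probabilities stays comfortably in range:
log_score = T * math.log(p)
print(log_score)    # -921.03...
```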

Additionally, since the log of a probability is negative and we want to maximize the total log probability, the model will naively prefer shorter sentences (fewer negative terms in the sum). To overcome this potential problem, we normalize by the number of words $T$ in the sequence hypothesis. Mathematically, the normalizing factor is $\frac{1}{T^{\alpha}}$ with $0 \le \alpha \le 1$, giving the final score

$$\text{score}(y) = \frac{1}{T^{\alpha}} \sum_{t=1}^{T} \log P(y_t \mid y_1, \dots, y_{t-1}, c)$$

A common choice in practice is $\alpha \approx 0.7$.
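As a concrete sketch of this scoring rule (the function name and hypothesis values below are my own, not from the original):

```python
def length_normalized_score(token_log_probs, alpha=0.7):
    """Length-normalized beam search score: sum(log P) / T**alpha.

    alpha=0 recovers the raw sum; alpha=1 averages per token;
    intermediate values (e.g. 0.7) interpolate between the two.
    """
    T = len(token_log_probs)
    return sum(token_log_probs) / (T ** alpha)

# Pick the best of the k completed hypotheses, each given as
# its list of per-token log probabilities (illustrative numbers).
hypotheses = [
    [-0.5, -0.5],               # short hypothesis, raw sum = -1.0
    [-0.3, -0.3, -0.3, -0.3],   # longer hypothesis, raw sum = -1.2
]
best = max(hypotheses, key=length_normalized_score)
print(best)   # the longer hypothesis wins once length is normalized
```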

Example
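
As an illustration (these numbers are my own, chosen to mirror the sketch above): suppose hypothesis A has two tokens, each with log probability $-0.5$, and hypothesis B has four tokens, each with log probability $-0.3$. The raw sums are $-1.0$ for A and $-1.2$ for B, so the shorter hypothesis A wins even though B is more probable per token. Normalizing with $\alpha = 1$ gives $-1.0/2 = -0.5$ for A and $-1.2/4 = -0.3$ for B, and B now wins.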

Since our log probs are negative, this normalization relatively penalizes shorter sentences, counteracting the length bias of the raw sum.