Let's take the case of machine translation using an encoder-decoder architecture, where the output vocabulary consists of five elements, $\mathcal{Y} = \{A, B, C, D, E\}$, with one of them representing the end-of-sequence token $\langle\text{eos}\rangle$.
In this example, the beam size is set to 2, and the maximum output sequence length is 3. At time step 1, the algorithm begins with an empty sequence and selects the two tokens with the highest conditional probabilities $P(y_1 \mid \mathbf{c})$, which are $A$ and $C$. Here, $\mathbf{c}$ represents the context, i.e., the encoded input.
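To make the first step concrete, here is a minimal Python sketch of the top-$k$ selection. The probability values are invented for illustration; in a real system they would come from the decoder's softmax over the vocabulary given $\mathbf{c}$:

```python
vocab = ["A", "B", "C", "D", "E"]

# Hypothetical decoder output P(y_1 | c) over the five-token vocabulary.
p_y1 = [0.50, 0.10, 0.30, 0.05, 0.05]

beam_size = 2
# Keep the beam_size tokens with the highest conditional probability.
ranked = sorted(zip(vocab, p_y1), key=lambda t: t[1], reverse=True)
beams = [([tok], p) for tok, p in ranked[:beam_size]]
print(beams)  # [(['A'], 0.5), (['C'], 0.3)]
```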
Moving to time step 2, the algorithm expands each of these initial tokens by considering all possible next tokens $y_2 \in \mathcal{Y}$. It computes the probabilities:

$$P(A, y_2 \mid \mathbf{c}) = P(A \mid \mathbf{c})\, P(y_2 \mid A, \mathbf{c}),$$

$$P(C, y_2 \mid \mathbf{c}) = P(C \mid \mathbf{c})\, P(y_2 \mid C, \mathbf{c}).$$
From these ten possibilities ($2$ beams $\times$ $5$ tokens), it selects the two sequences with the highest probabilities, shown in the diagram as $A, B$ and $C, E$.
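A sketch of this expansion step, continuing the toy example above (the conditional distributions are again assumptions made for illustration):

```python
vocab = ["A", "B", "C", "D", "E"]
beam_size = 2
beams = [(["A"], 0.5), (["C"], 0.3)]  # kept prefixes from time step 1

# Hypothetical P(y_2 | y_1, c) for each kept prefix, in vocab order.
p_y2_given = {
    "A": [0.05, 0.60, 0.10, 0.15, 0.10],
    "C": [0.10, 0.10, 0.05, 0.20, 0.55],
}

# Extend every beam by every token, forming all ten joint probabilities.
candidates = [
    (seq + [tok], p * p_y2_given[seq[-1]][i])
    for seq, p in beams
    for i, tok in enumerate(vocab)
]

# Keep only the two sequences with the highest joint probability.
beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
print(beams)  # top two: A,B (p = 0.30) and C,E (p = 0.165)
```

Note that each candidate's joint probability is the probability of its prefix multiplied by the conditional probability of the new token, which is exactly the chain-rule factorization above.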
At the final time step 3, the process repeats. For each of the two sequences from step 2, it computes:

$$P(A, B, y_3 \mid \mathbf{c}) = P(A, B \mid \mathbf{c})\, P(y_3 \mid A, B, \mathbf{c}),$$

$$P(C, E, y_3 \mid \mathbf{c}) = P(C, E \mid \mathbf{c})\, P(y_3 \mid C, E, \mathbf{c}).$$
Again, it selects the two highest-probability sequences, resulting in $A, B, D$ and $C, E, D$ as the final candidates, and chooses the one that maximizes the following score:

$$\frac{1}{L^\alpha} \log P(y_1, \ldots, y_L \mid \mathbf{c}) = \frac{1}{L^\alpha} \sum_{t'=1}^{L} \log P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c}),$$

where $L$ is the length of the candidate sequence and $\alpha$ (commonly set to about $0.75$) controls the strength of the length normalization.
The above score is just the log-likelihood of the sequence, with the additional normalization factor $L^\alpha$. Because every token contributes a negative log-probability term, an unnormalized sum would systematically favor shorter sequences; dividing by $L^\alpha$ corrects that bias.
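A small helper that scores final candidates this way; the per-step probabilities and $\alpha = 0.75$ are assumptions for the sake of the example:

```python
import math

def normalized_score(log_probs, alpha=0.75):
    """Length-normalized log-likelihood: sum of log P terms over L^alpha."""
    return sum(log_probs) / (len(log_probs) ** alpha)

# Hypothetical per-step log-probabilities for the two final candidates.
finalists = {
    ("A", "B", "D"): [math.log(0.5), math.log(0.6), math.log(0.4)],
    ("C", "E", "D"): [math.log(0.3), math.log(0.55), math.log(0.5)],
}

best = max(finalists, key=lambda seq: normalized_score(finalists[seq]))
print(best)  # ('A', 'B', 'D') under these made-up probabilities
```

Since both finalists here have the same length, the normalization does not change their ranking; it matters when beam search compares completed hypotheses of different lengths, for example when one sequence emits $\langle\text{eos}\rangle$ early.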