Topics

Several variations of the standard architecture exist and can be categorized as:

Model variations

Training variations

  • Use of the teacher forcing technique, where instead of feeding the decoder its own predicted output, we feed the ground-truth token as the input for the next timestep (a sketch follows)
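
A minimal sketch of teacher forcing in a toy GRU decoder loop; the module names, sizes, and the assumption that the start-of-sequence token has id 0 are illustrative only, not a specific implementation.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 100, 32, 64
embed = nn.Embedding(vocab_size, emb_dim)
cell = nn.GRUCell(emb_dim, hidden_dim)
proj = nn.Linear(hidden_dim, vocab_size)
loss_fn = nn.CrossEntropyLoss()

def decode(targets, hidden, teacher_forcing=True):
    """targets: (batch, T) ground-truth ids; hidden: (batch, hidden_dim) from the encoder."""
    batch, T = targets.shape
    inp = torch.zeros(batch, dtype=torch.long)   # assumed <sos> token id
    loss = 0.0
    for t in range(T):
        hidden = cell(embed(inp), hidden)
        logits = proj(hidden)
        loss = loss + loss_fn(logits, targets[:, t])
        if teacher_forcing:
            inp = targets[:, t]                  # feed the ground-truth token
        else:
            inp = logits.argmax(dim=-1)          # feed the model's own prediction
    return loss / T

targets = torch.randint(1, vocab_size, (4, 7))   # toy batch of target sequences
hidden = torch.zeros(4, hidden_dim)              # stand-in for the encoder's final state
loss = decode(targets, hidden, teacher_forcing=True)
```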

Inference variations

  • We can’t do teacher forcing during inference since there is no ground truth, so we have to feed the predicted output back to the decoder for the next timestep. But instead of taking the argmax over the output distribution at each step (vanilla greedy decoding), we can use another decoding strategy such as beam search, or sample from the distribution (sketched below)
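
A rough sketch of greedy decoding versus beam search over a generic per-step decoder function; the `step_fn(prev_token, state)` signature is an assumption here (it is taken to return a vector of log-probabilities as a torch tensor plus the new decoder state), not a particular library API.

```python
def greedy_decode(step_fn, state, sos_id, eos_id, max_len=20):
    tokens, tok = [], sos_id
    for _ in range(max_len):
        log_probs, state = step_fn(tok, state)   # (vocab_size,) log-probabilities
        tok = int(log_probs.argmax())            # greedy: always take the argmax
        if tok == eos_id:
            break
        tokens.append(tok)
    return tokens

def beam_search(step_fn, state, sos_id, eos_id, beam_size=3, max_len=20):
    # Each hypothesis is (cumulative log-prob, token sequence, decoder state).
    beams, finished = [(0.0, [sos_id], state)], []
    for _ in range(max_len):
        candidates = []
        for score, seq, st in beams:
            log_probs, new_st = step_fn(seq[-1], st)
            top_lp, top_ids = log_probs.topk(beam_size)
            for lp, tok in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((score + lp, seq + [tok], new_st))
        # Keep the best `beam_size` hypotheses; move ended ones to `finished`.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_size]:
            (finished if cand[1][-1] == eos_id else beams).append(cand)
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[0])[1]
```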

Condition on encoder output

  • Pass the encoder output through an MLP layer before connecting it to the decoder
  • At every decoder timestep, concatenate the encoder output with the previous decoder output and use this as the decoder “input” (both variants are sketched below)
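
A toy sketch of both conditioning variants, assuming a single fixed-size encoder summary vector `enc_out`; all module names and dimensions are illustrative, and the “previous decoder output” is approximated here by the embedding of the previously emitted token.

```python
import torch
import torch.nn as nn

emb_dim, enc_dim, hidden_dim, vocab_size = 32, 64, 64, 100
embed = nn.Embedding(vocab_size, emb_dim)

# Variant 1: pass the encoder output through an MLP before it reaches the
# decoder, e.g. to produce the decoder's initial hidden state.
enc_to_dec = nn.Sequential(nn.Linear(enc_dim, hidden_dim), nn.Tanh())

# Variant 2: concatenate the encoder output with the previous decoder output
# at every timestep and use the result as the decoder input.
cell = nn.GRUCell(emb_dim + enc_dim, hidden_dim)
proj = nn.Linear(hidden_dim, vocab_size)

def decode_step(prev_token, hidden, enc_out):
    inp = torch.cat([embed(prev_token), enc_out], dim=-1)  # (batch, emb_dim + enc_dim)
    hidden = cell(inp, hidden)
    return proj(hidden), hidden

enc_out = torch.randn(4, enc_dim)           # stand-in for the encoder's output
hidden = enc_to_dec(enc_out)                # variant 1: MLP-transformed init state
prev = torch.zeros(4, dtype=torch.long)     # assumed <sos> token id
logits, hidden = decode_step(prev, hidden, enc_out)  # variant 2 applied per step
```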

Output unit variations

  • Normally, the output distribution is over the target vocabulary, which can consist of words, subwords, or characters. For languages like Mandarin there is no explicit word boundary, so characters are used as the vocabulary. We can also use subwords (e.g. generated by the byte pair encoding algorithm), which can address sparsity problems when unseen words are encountered. Example: unfit is made up of un and fit. Even if unfit isn’t seen in the training data (and hence isn’t in the vocab), the model can still generate the subwords un and fit, since from other similar examples it learns that un is used as a negation prefix. A toy illustration follows.
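
A toy sketch of how a subword vocabulary covers an unseen word: “unfit” is not a vocabulary entry, yet a greedy longest-match segmentation still breaks it into known pieces. The vocabulary and the matcher here are illustrative only and are not the actual BPE merge procedure.

```python
subword_vocab = {"un", "fit", "happy", "do", "able", "ing"}

def segment(word, vocab):
    pieces, i = [], 0
    while i < len(word):
        # Take the longest vocabulary entry that matches at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append("<unk>")  # fall back when no piece matches
            i += 1
    return pieces

print(segment("unfit", subword_vocab))     # ['un', 'fit']
print(segment("undoable", subword_vocab))  # ['un', 'do', 'able']
```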