Topics
Several variations of the standard architecture exist and can be categorized as:
Model variations
Training variations
- Use of the teacher forcing technique: instead of feeding back the predicted output, feed the ground truth as input to the decoder at the next timestep (see the sketch below)
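A minimal sketch of a decoder loop with and without teacher forcing, assuming a hypothetical GRU-based decoder; names like `decoder_cell`, `embed`, and `out_proj` are illustrative, not from a specific codebase:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hid_dim = 1000, 32, 64
embed = nn.Embedding(vocab_size, emb_dim)
decoder_cell = nn.GRUCell(emb_dim, hid_dim)
out_proj = nn.Linear(hid_dim, vocab_size)

def decode(hidden, targets, teacher_forcing=True):
    """hidden: (batch, hid_dim) encoder summary; targets: (batch, T) ground-truth ids."""
    batch, T = targets.shape
    inp = torch.zeros(batch, dtype=torch.long)       # <sos> token id assumed to be 0
    logits = []
    for t in range(T):
        hidden = decoder_cell(embed(inp), hidden)    # one decoder timestep
        step_logits = out_proj(hidden)
        logits.append(step_logits)
        if teacher_forcing:
            inp = targets[:, t]                      # feed the ground-truth token
        else:
            inp = step_logits.argmax(dim=-1)         # feed the model's own prediction
    return torch.stack(logits, dim=1)                # (batch, T, vocab_size)
```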
Inference variations
- We can’t do teacher forcing during inference since we don’t have the ground truth, so we have to feed the predicted output to the decoder for the next timestep. But here, instead of taking the argmax over the output distribution (aka vanilla greedy decoding), we can apply some other decoding strategy such as beam search (see the sketch below)
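A minimal sketch of beam search at inference time, assuming a hypothetical `step_log_probs(prefix)` function that returns next-token log-probabilities; greedy decoding is the special case `beam_size == 1`, and the toy distribution below is made up for illustration:

```python
import math

def beam_search(step_log_probs, vocab, eos, beam_size=3, max_len=20):
    beams = [([], 0.0)]                                  # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:                   # finished hypotheses are kept as-is
                candidates.append((seq, score))
                continue
            log_probs = step_log_probs(seq)              # distribution over `vocab` for next token
            for tok, lp in zip(vocab, log_probs):
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]                                   # best-scoring hypothesis

# Toy usage with a fixed, made-up next-token distribution:
vocab = ["a", "b", "<eos>"]
toy = lambda seq: [math.log(p) for p in ([0.5, 0.3, 0.2] if len(seq) < 3 else [0.05, 0.05, 0.9])]
print(beam_search(toy, vocab, eos="<eos>"))
```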
Condition on encoder output
- Pass the encoder output through an MLP layer before connecting it to the decoder
- For every decoder timestep, concatenate the encoder output with the previous decoder output and use this as the “input” (see the sketch below)
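A minimal sketch of both ideas, assuming the encoder output is a single summary vector per example; the module names (`bridge`, `cell`, etc.) are illustrative assumptions:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, enc_dim, hid_dim = 1000, 32, 64, 64
embed = nn.Embedding(vocab_size, emb_dim)
bridge = nn.Linear(enc_dim, hid_dim)                 # MLP bridge: encoder output -> initial decoder state
cell = nn.GRUCell(emb_dim + enc_dim, hid_dim)        # input = [prev token embedding ; encoder output]
out_proj = nn.Linear(hid_dim, vocab_size)

def decode(enc_out, max_len=10):
    """enc_out: (batch, enc_dim) encoder summary vector."""
    batch = enc_out.size(0)
    hidden = torch.tanh(bridge(enc_out))             # initialize decoder state from encoder output
    tok = torch.zeros(batch, dtype=torch.long)       # <sos> token id assumed to be 0
    outputs = []
    for _ in range(max_len):
        inp = torch.cat([embed(tok), enc_out], dim=-1)   # concat encoder output at every timestep
        hidden = cell(inp, hidden)
        tok = out_proj(hidden).argmax(dim=-1)
        outputs.append(tok)
    return torch.stack(outputs, dim=1)
```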
Output unit variations
- Normally, the output distribution is over the target vocab. The vocab can be words, subwords, or characters. For languages like Mandarin, there is no word boundary as such, so characters are used as the vocab. We can also use subwords (e.g. generated by the byte pair encoding algorithm), which can address sparsity problems when unseen words are encountered. Example: unfit is made up of un and fit. Even if unfit isn’t seen in the training data (and hence not in the vocab), the model can generate the subwords un and fit, since from other similar examples it learns that un is used as a negation prefix (see the sketch below)
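A minimal sketch of applying learned BPE merges to segment a word into subwords; the merge list here is made up for illustration, not learned from real data:

```python
def bpe_segment(word, merges):
    pieces = list(word)                               # start from individual characters
    for a, b in merges:                               # apply merges in learned priority order
        i = 0
        while i < len(pieces) - 1:
            if pieces[i] == a and pieces[i + 1] == b:
                pieces[i:i + 2] = [a + b]             # merge the adjacent pair
            else:
                i += 1
    return pieces

merges = [("u", "n"), ("f", "i"), ("fi", "t")]        # hypothetical merges learned from a corpus
print(bpe_segment("unfit", merges))                   # ['un', 'fit'] even if "unfit" itself was never seen
```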