Topics
Several variations of the standard architecture exist and can be categorized as:
Model variations
Training variations
- Use of the teacher forcing technique, where instead of feeding the predicted output back in, we feed the ground truth as the decoder input for the next timestep (see the sketch below)
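A minimal sketch of this, assuming a PyTorch GRU decoder; TinyDecoder, its embed/rnn_cell/project layers, and all the sizes are illustrative assumptions, not part of these notes:

```python
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    """Illustrative decoder; names and sizes are made up for the sketch."""
    def __init__(self, vocab_size=1000, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn_cell = nn.GRUCell(emb_dim, hidden_dim)
        self.project = nn.Linear(hidden_dim, vocab_size)

    def forward(self, encoder_state, targets, teacher_forcing=True):
        # targets: (batch, T) ground-truth token ids, targets[:, 0] is <sos>
        hidden, inp, logits_per_step = encoder_state, targets[:, 0], []
        for t in range(1, targets.size(1)):
            hidden = self.rnn_cell(self.embed(inp), hidden)
            logits = self.project(hidden)               # (batch, vocab)
            logits_per_step.append(logits)
            # teacher forcing: feed the ground-truth token, not the model's prediction
            inp = targets[:, t] if teacher_forcing else logits.argmax(dim=-1)
        return torch.stack(logits_per_step, dim=1)      # (batch, T-1, vocab)
```

With teacher_forcing=True the loop ignores its own predictions entirely; flipping it to False gives the behaviour described under "Inference variations" below.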
Inference variations
- We can’t do teacher forcing during inference since we don’t have the ground truth, so we have to feed the predicted output to the decoder at the next timestep. But here, instead of taking the argmax over the output distribution (aka vanilla greedy decoding), we can apply some other decoding strategy such as beam search (see the sketch below)
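A rough sketch of greedy decoding vs beam search, where step_log_probs is a hypothetical stand-in for one decoder step that returns a token → log-probability mapping for the next position:

```python
import math  # not strictly needed here, just emphasizes scores are log probs

def greedy_decode(step_log_probs, sos, eos, max_len=20):
    """Pick the single most likely token at every step (vanilla greedy)."""
    seq = [sos]
    for _ in range(max_len):
        log_probs = step_log_probs(seq)          # dict: token -> log prob
        nxt = max(log_probs, key=log_probs.get)
        seq.append(nxt)
        if nxt == eos:
            break
    return seq

def beam_search_decode(step_log_probs, sos, eos, beam_size=3, max_len=20):
    """Keep the beam_size best partial sequences by cumulative log-probability."""
    beams = [([sos], 0.0)]                       # (sequence, cumulative log prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                   # finished hypothesis stays as-is
                candidates.append((seq, score))
                continue
            for tok, lp in step_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]                           # best-scoring complete sequence
```

Greedy commits to the single best token at each step, while beam search keeps several partial hypotheses alive and only picks the best-scoring one at the end.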
Condition on encoder output
- Pass the encoder output through an MLP layer before connecting it to the decoder
- For every decoder timestep, concatenate the encoder output with the previous decoder output and use this as the “input” (see the sketch below)
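A minimal sketch of these two ideas together, assuming PyTorch and made-up layer sizes; the encoder output is pushed through a small MLP and then concatenated with the previous decoder output to form the input of each decoder timestep:

```python
import torch
import torch.nn as nn

class ConditionedDecoderStep(nn.Module):
    """One decoder timestep conditioned on the encoder output (sizes are assumptions)."""
    def __init__(self, enc_dim=64, out_dim=32, hidden_dim=64):
        super().__init__()
        # MLP applied to the encoder output before it reaches the decoder
        self.enc_mlp = nn.Sequential(nn.Linear(enc_dim, enc_dim), nn.Tanh())
        # decoder input = [transformed encoder output ; previous decoder output]
        self.rnn_cell = nn.GRUCell(enc_dim + out_dim, hidden_dim)
        self.to_out = nn.Linear(hidden_dim, out_dim)

    def forward(self, enc_out, prev_dec_out, hidden):
        cond = self.enc_mlp(enc_out)                      # (batch, enc_dim)
        inp = torch.cat([cond, prev_dec_out], dim=-1)     # (batch, enc_dim + out_dim)
        hidden = self.rnn_cell(inp, hidden)
        return self.to_out(hidden), hidden                # new decoder output + state
```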
Output unit variations
- Normally, the output distribution is over the target vocab. The vocab can be words, subwords, or characters. For languages like Mandarin there is no word boundary as such, so use characters as the vocab. We can also use subwords (e.g. generated by the byte pair encoding algorithm), which can potentially address sparsity problems when unseen words are encountered. Example: “unfit” is made up of “un” and “fit”. Even if “unfit” isn’t seen in the training data (and hence not in the vocab), the model can generate the subwords “un” and “fit”, since from other similar examples it understands that “un” is used as a negation prefix (see the toy example below)
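A toy illustration of the subword idea with a hypothetical, hand-written merge table; a real BPE table would be learned from the training corpus:

```python
def bpe_encode(word, merges):
    """Apply learned BPE merges to a single word (toy example, not a trained model)."""
    symbols = list(word)                       # start from individual characters
    for a, b in merges:                        # merges are applied in learned order
        i, merged = 0, []
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)           # merge the adjacent pair into one subword
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge table; "unfit" itself was never seen, but "un" and "fit" were.
merges = [("u", "n"), ("f", "i"), ("fi", "t")]
print(bpe_encode("unfit", merges))             # ['un', 'fit']
```

Even though “unfit” never appears as a whole unit, it decomposes into subwords that are in the vocab, which is how the model can still produce it.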