Since MoEs are composed of many experts, their total number of parameters is much larger than the number of parameters active for any given token (the effective parameter count). During training or fine-tuning, this makes overfitting a challenge. Some ways to tackle this are via stronger regularization (a minimal sketch follows the list):
- Higher dropout within the experts
- Token dropping (e.g., using a lower capacity factor so that overflow tokens are skipped), which acts as a form of regularization
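
Below is a minimal PyTorch sketch of the first idea: applying a higher dropout rate inside the expert feed-forward blocks than in the rest of the model. This is not the implementation of any particular MoE codebase; the class names (`ExpertFFN`, `MoELayer`), the top-1 routing, and the example rates (0.4 inside experts vs. 0.1 elsewhere) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ExpertFFN(nn.Module):
    """A single expert: a feed-forward block with its own (higher) dropout rate."""

    def __init__(self, d_model: int, d_ff: int, expert_dropout: float = 0.4):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)
        # Dropout inside the expert is set higher than the dense-layer dropout
        # (e.g. 0.4 vs. 0.1); rates are illustrative assumptions.
        self.dropout = nn.Dropout(expert_dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(self.dropout(torch.relu(self.up(x))))


class MoELayer(nn.Module):
    """Minimal top-1 MoE layer whose experts use expert-internal dropout."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int,
                 dense_dropout: float = 0.1, expert_dropout: float = 0.4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            ExpertFFN(d_model, d_ff, expert_dropout) for _ in range(num_experts)
        )
        # Dropout outside the experts stays at the lower, "dense" rate.
        self.dense_dropout = nn.Dropout(dense_dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                        # (num_tokens, num_experts)
        expert_idx = logits.argmax(dim=-1)             # top-1 routing decision
        gate = torch.softmax(logits, dim=-1).gather(   # gate value of the chosen expert
            -1, expert_idx.unsqueeze(-1))
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])            # dispatch tokens to their expert
        return self.dense_dropout(out * gate)
```

The point of the sketch is only the asymmetry in dropout rates: the expert FFNs, which see fewer tokens each and are more prone to overfitting, are regularized more aggressively than the shared parts of the layer.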
For small tasks such as SuperGLUE CB, the MoE overfits substantially, while for larger tasks such as SuperGLUE ReCoRD, it performs well.