
Since MoEs comprise many experts, their total number of parameters is much larger than the number of parameters active for any given token. During training, and especially fine-tuning, this makes them prone to overfitting. Some ways to tackle this are stronger forms of regularization (a minimal sketch of both follows the list):

  • Higher dropout within the experts (expert dropout)
  • Token dropping, e.g. by limiting expert capacity, which can itself act as a regularizer
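As a rough illustration of these two knobs, here is a minimal PyTorch sketch, not taken from any particular paper or library: an expert FFN that uses a higher dropout rate than the rest of the model, and a top-1 router that drops tokens once an expert's capacity is exhausted. The module names, the `capacity_factor`, and the dropout value of 0.4 are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertFFN(nn.Module):
    """Expert feed-forward block with a higher ("expert") dropout rate."""

    def __init__(self, d_model: int, d_ff: int, expert_dropout: float = 0.4):
        super().__init__()
        self.wi = nn.Linear(d_model, d_ff)
        self.wo = nn.Linear(d_ff, d_model)
        # Higher dropout inside the expert than elsewhere in the model
        # acts as extra regularization (0.4 is an illustrative value).
        self.dropout = nn.Dropout(expert_dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.wo(self.dropout(F.relu(self.wi(x))))


class Top1MoELayer(nn.Module):
    """Toy top-1 MoE layer that drops tokens beyond each expert's capacity."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int,
                 capacity_factor: float = 1.0, expert_dropout: float = 0.4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            ExpertFFN(d_model, d_ff, expert_dropout) for _ in range(num_experts)
        )
        self.capacity_factor = capacity_factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model). Tokens routed to a full expert are
        # dropped (passed through unchanged), which also regularizes training.
        num_tokens = x.size(0)
        capacity = int(self.capacity_factor * num_tokens / len(self.experts))

        probs = F.softmax(self.router(x), dim=-1)
        gate, expert_idx = probs.max(dim=-1)  # top-1 routing

        out = x.clone()  # dropped tokens keep their input value
        for e, expert in enumerate(self.experts):
            token_ids = (expert_idx == e).nonzero(as_tuple=True)[0]
            token_ids = token_ids[:capacity]  # drop tokens beyond capacity
            if token_ids.numel() > 0:
                out[token_ids] = gate[token_ids, None] * expert(x[token_ids])
        return out
```

Lowering `capacity_factor` drops more tokens and raising `expert_dropout` regularizes each expert more aggressively; both are plausible starting points when a sparse model overfits a small fine-tuning set.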

For small tasks such as SuperGLUE CB, MoEs overfit heavily, while on larger tasks such as SuperGLUE ReCoRD they perform well.