Topics

MoE models are harder to fine-tune than dense models (training has the same issue): sparse models tend to overfit, so heavier regularization helps, for example higher dropout inside the experts.
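
A minimal sketch of one such knob, assuming toy layer sizes and an illustrative, elevated dropout rate applied only inside the expert feed-forward block:

```python
import torch.nn as nn

class ExpertMLP(nn.Module):
    """One expert of an MoE layer; sizes and dropout rate are placeholders."""
    def __init__(self, d_model=512, d_ff=2048, expert_dropout=0.4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(expert_dropout),  # heavier dropout inside the expert than elsewhere
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return self.net(x)
```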

By freezing only the MoE layers during fine-tuning, we can speed up training while largely preserving quality.
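
A minimal sketch of how this could look in PyTorch; the parameter-name keywords ("experts", "router", "gate") are assumptions and depend on how the actual model names its MoE modules:

```python
import torch.nn as nn

def freeze_moe_layers(model: nn.Module, moe_keywords=("experts", "router", "gate")):
    """Disable gradients for parameters that belong to MoE blocks."""
    frozen = 0
    for name, param in model.named_parameters():
        if any(key in name for key in moe_keywords):
            param.requires_grad = False  # this parameter is skipped by the optimizer
            frozen += param.numel()
    return frozen

# Usage: frozen = freeze_moe_layers(model); then fine-tune the remaining parameters as usual.
```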

A few additional tips:

Expert Parallelism

MoEs are tricky to parallelize because only a few experts are active for each token. With expert parallelism, the experts are distributed across workers, and each token is routed to the worker that holds its assigned expert.
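
A minimal single-process sketch of the routing step (the sizes and top-1 routing are illustrative assumptions): experts are partitioned evenly across workers, and tokens are grouped by the rank that owns their expert. In a real distributed setup, each rank would hold only its own experts, and these per-rank buffers would be exchanged with an all-to-all collective such as torch.distributed.all_to_all_single.

```python
import torch

# Assumed toy sizes for illustration.
num_experts, world_size, num_tokens, d_model = 8, 4, 16, 32
experts_per_rank = num_experts // world_size  # experts 0-1 on rank 0, 2-3 on rank 1, ...

tokens = torch.randn(num_tokens, d_model)
router_logits = torch.randn(num_tokens, num_experts)  # stand-in for a learned router
expert_ids = router_logits.argmax(dim=-1)             # top-1 expert per token
dest_rank = expert_ids // experts_per_rank            # worker that owns each token's expert

# Group tokens by destination rank: these are the send buffers that an
# all-to-all exchange would dispatch to the workers holding the experts.
send_buffers = [tokens[dest_rank == r] for r in range(world_size)]
for r, buf in enumerate(send_buffers):
    print(f"rank {r} receives {buf.shape[0]} tokens for its local experts")
```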