Fine-tuning MoEs
MoE models are difficult to fine-tune (and to train) because they overfit easily, so heavy regularization is typically used.
By freezing only the MoE layers during fine-tuning, we can speed up training while preserving quality.
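A minimal PyTorch sketch of this idea is below. It assumes MoE parameters can be identified by substrings such as `experts` or `router` in their names; the actual naming scheme depends on the model.

```python
# Sketch: freeze the MoE layers so only the non-MoE parameters are updated.
# The name filters ("experts", "router") are assumptions; adjust to the model.
import torch
import torch.nn as nn

def freeze_moe_layers(model: nn.Module) -> None:
    """Disable gradients for every parameter that lives inside an MoE block."""
    for name, param in model.named_parameters():
        if "experts" in name or "router" in name:
            param.requires_grad = False

# After freezing, pass only the trainable parameters to the optimizer, e.g.:
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4
# )
```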
A few additional tips:
- Use a smaller batch size
- Use a higher learning rate
- Instruction tuning works particularly well for MoEs
- Fine-tune on larger (or a greater number of) tasks
- Scaling the number of experts yields better sample efficiency, but gains diminish beyond 256 experts
- Training stability comes from expert capacity limits and load balancing (see the sketch after this list)
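The last point refers to the auxiliary load-balancing loss used in Switch Transformer-style routing. Below is a minimal sketch of such a loss; the tensor names and shapes are illustrative and not tied to any particular library.

```python
# Sketch: Switch Transformer-style auxiliary load-balancing loss.
# `router_logits` has shape (num_tokens, num_experts).
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor) -> torch.Tensor:
    num_tokens, num_experts = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)        # router probabilities
    expert_index = probs.argmax(dim=-1)             # top-1 expert per token
    # f_i: fraction of tokens dispatched to each expert
    dispatch_fraction = F.one_hot(expert_index, num_experts).float().mean(dim=0)
    # P_i: mean router probability assigned to each expert
    mean_prob = probs.mean(dim=0)
    # Minimized when both distributions are uniform (1 / num_experts each)
    return num_experts * torch.sum(dispatch_fraction * mean_prob)

# A well-balanced router over 8 experts gives a loss close to 1.0
loss = load_balancing_loss(torch.randn(1024, 8))
```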
Expert Parallelism
MoEs are tricky to parallelize because the experts add many parameters while each token only activates a few of them, so expert parallelism distributes the experts across workers and routes each token to the worker hosting its assigned expert.
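A conceptual sketch of the placement, assuming a simple contiguous expert-to-rank mapping; real systems (e.g. DeepSpeed-MoE, Megatron) exchange the actual token activations between workers with an all-to-all collective.

```python
# Sketch: expert parallelism as a mapping from experts to workers (ranks).
# Each rank hosts a disjoint slice of experts; tokens are bucketed by the
# rank that owns their assigned expert before the communication step.
NUM_EXPERTS = 8
WORLD_SIZE = 4                      # number of workers

def expert_to_rank(expert_id: int) -> int:
    # Contiguous partition: experts 0-1 -> rank 0, 2-3 -> rank 1, ...
    experts_per_rank = NUM_EXPERTS // WORLD_SIZE
    return expert_id // experts_per_rank

def group_tokens_by_rank(token_expert_ids: list[int]) -> dict[int, list[int]]:
    """Bucket token indices by the rank that hosts their assigned expert."""
    buckets: dict[int, list[int]] = {r: [] for r in range(WORLD_SIZE)}
    for token_idx, expert_id in enumerate(token_expert_ids):
        buckets[expert_to_rank(expert_id)].append(token_idx)
    return buckets

# Tokens assigned to experts 3, 0, 7, 2 are sent to ranks 1, 0, 3, 1:
print(group_tokens_by_rank([3, 0, 7, 2]))
```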