If most tokens are routed to just a few popular experts, training becomes inefficient. Without load balancing, some experts are overused while others are underutilized, which can lead to:
- Poor specialization (the same "popular" experts are used for everything)
- Inefficient resource usage (idle experts still occupy memory and compute)
- Expert collapse (some experts receive so few tokens that they are never properly trained)
To mitigate this, an auxiliary load-balancing loss is added that encourages the router to give all experts roughly equal importance (a minimal sketch of this loss, and of each technique below, follows the list). Other techniques include:
- Noisy top-k gating: add noise to the router logits so token-to-expert assignments vary during training
- Router z-loss: penalize large router logits to keep the routing softmax numerically stable
- Capacity limits: drop tokens that exceed an expert's fixed capacity
- Expert choice routing: instead of tokens choosing experts, experts choose their most relevant tokens, which balances load by construction
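
Here is a minimal PyTorch sketch of a Switch-Transformer-style auxiliary load-balancing loss. The function name and tensor shapes are my own; the loss itself is N · Σᵢ fᵢ · Pᵢ, where fᵢ is the fraction of tokens dispatched to expert i and Pᵢ is the mean router probability assigned to expert i:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Switch-style auxiliary loss: num_experts * sum_i f_i * P_i.

    router_logits: (num_tokens, num_experts) raw router scores.
    The loss is minimized (value 1.0) when routing is perfectly uniform.
    """
    probs = F.softmax(router_logits, dim=-1)  # (tokens, experts)
    top1 = probs.argmax(dim=-1)               # top-1 expert per token
    # f_i: fraction of tokens dispatched to each expert (not differentiable)
    f = torch.bincount(top1, minlength=num_experts).float() / top1.numel()
    # P_i: mean router probability per expert (this term carries the gradient)
    p = probs.mean(dim=0)
    return num_experts * torch.sum(f * p)
```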
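
Noisy top-k gating, in the spirit of Shazeer et al.'s sparsely-gated MoE, can be sketched as follows; `w_gate` and `w_noise` are hypothetical learned weight matrices introduced for illustration:

```python
import torch
import torch.nn.functional as F

def noisy_top_k_gating(x, w_gate, w_noise, k, training=True):
    """Perturb router logits with learned Gaussian noise before top-k.

    x:       (num_tokens, d_model) token representations
    w_gate:  (d_model, num_experts) clean routing weights
    w_noise: (d_model, num_experts) weights controlling per-expert noise scale
    """
    clean_logits = x @ w_gate
    if training:
        noise_std = F.softplus(x @ w_noise)  # keep the noise stddev positive
        logits = clean_logits + torch.randn_like(clean_logits) * noise_std
    else:
        logits = clean_logits  # no noise at inference time
    top_vals, top_idx = logits.topk(k, dim=-1)
    gates = F.softmax(top_vals, dim=-1)  # renormalize over the k selected experts
    return gates, top_idx
```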
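
The router z-loss (introduced in ST-MoE) penalizes the log-sum-exp of the router logits, discouraging them from growing large enough to destabilize training. A minimal version:

```python
import torch

def router_z_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """ST-MoE z-loss: mean of (logsumexp of logits)^2 over tokens."""
    z = torch.logsumexp(router_logits, dim=-1)  # (num_tokens,)
    return (z ** 2).mean()
```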
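
Capacity-based token dropping can be sketched as a mask that keeps only the first `capacity` tokens routed to each expert; dropped tokens simply pass through the layer's residual connection. A loop version for clarity (real implementations vectorize this):

```python
import torch

def keep_within_capacity(expert_indices: torch.Tensor,
                         num_experts: int, capacity: int) -> torch.Tensor:
    """Boolean mask: True for tokens an expert actually processes."""
    keep = torch.zeros_like(expert_indices, dtype=torch.bool)
    for e in range(num_experts):
        # token positions assigned to expert e, in sequence order
        positions = (expert_indices == e).nonzero(as_tuple=True)[0]
        keep[positions[:capacity]] = True  # overflow tokens stay False (dropped)
    return keep
```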
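
Finally, expert choice routing flips the selection: each expert picks its top-`capacity` tokens by affinity score, so every expert processes the same number of tokens by construction. A rough sketch (normalization details vary across implementations):

```python
import torch
import torch.nn.functional as F

def expert_choice(router_logits: torch.Tensor, capacity: int):
    """Each expert selects its `capacity` highest-affinity tokens.

    router_logits: (num_tokens, num_experts)
    Returns per-expert gate weights and the chosen token indices.
    """
    probs = F.softmax(router_logits, dim=-1)            # (tokens, experts)
    affinity = probs.t()                                # (experts, tokens)
    gates, token_idx = affinity.topk(capacity, dim=-1)  # experts pick tokens
    return gates, token_idx  # both (num_experts, capacity)
```

Note that under expert choice a given token may be selected by several experts, or by none at all, in which case it only flows through the residual path.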