
If all tokens are sent to just a few popular experts, training becomes inefficient. Without load balancing, some experts become overused while others are underutilized, which can lead to:

  • Poor specialization (the same “popular” experts are used for everything)
  • Inefficient resource usage
  • Expert collapse (some experts are never trained)

To mitigate this, an auxiliary load-balancing loss is added to the training objective to encourage the router to give all experts roughly equal importance (see the sketch after this list). Other techniques include:

  • Noisy Top-K gating: add noise to the router logits before the top-k selection so that expert assignments are explored more evenly
  • Router z-loss: penalize large router logits to keep the gating softmax numerically stable
  • Expert capacity: drop tokens routed to an expert beyond a fixed per-expert capacity
  • Expert Choice routing: instead of tokens choosing experts, experts choose their most relevant tokens
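
As a concrete example, here is a minimal PyTorch sketch of a Switch Transformer-style auxiliary balancing loss, assuming the router produces logits of shape `[num_tokens, num_experts]`; the function name and the `alpha` coefficient are illustrative:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """Switch Transformer-style auxiliary loss.

    Encourages both the fraction of tokens dispatched to each expert (f_i)
    and the mean router probability for each expert (P_i) to be uniform.
    Reaches its minimum (value alpha) when both are exactly uniform.
    """
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)              # [tokens, experts]
    # f_i: fraction of tokens whose top-1 choice is expert i
    top1 = probs.argmax(dim=-1)                           # [tokens]
    f = F.one_hot(top1, num_experts).float().mean(dim=0)  # [experts]
    # P_i: mean routing probability assigned to expert i
    p = probs.mean(dim=0)                                 # [experts]
    return alpha * num_experts * torch.sum(f * p)
```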
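
A sketch of noisy Top-K gating under the same assumptions; in the original formulation (Shazeer et al., 2017) the noise scale is learned per expert, while a fixed `noise_std` keeps this example short:

```python
import torch

def noisy_top_k_gating(logits: torch.Tensor, k: int, noise_std: float = 1.0,
                       training: bool = True):
    """Add Gaussian noise to the router logits before the top-k selection,
    so that borderline experts occasionally win and receive gradient."""
    if training:
        logits = logits + torch.randn_like(logits) * noise_std
    top_vals, top_idx = logits.topk(k, dim=-1)
    gates = torch.softmax(top_vals, dim=-1)  # renormalize over the chosen experts
    return gates, top_idx
```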
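
The router z-loss (introduced in the ST-MoE paper) is a one-liner on top of the same logits; the `coeff` value here is illustrative and is a tunable hyperparameter:

```python
import torch

def router_z_loss(router_logits: torch.Tensor, coeff: float = 1e-3) -> torch.Tensor:
    """Penalize large logits entering the router softmax. This keeps the
    gate numerically stable without changing which expert wins."""
    z = torch.logsumexp(router_logits, dim=-1)  # [num_tokens]
    return coeff * (z ** 2).mean()
```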
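
Finally, a sketch of Expert Choice routing: each expert selects its top-`capacity` tokens by affinity score, so every expert processes the same number of tokens by construction. In practice `capacity` is derived from the token count, the number of experts, and a capacity factor; a fixed argument is used here for brevity:

```python
import torch

def expert_choice_routing(router_logits: torch.Tensor, capacity: int):
    """Each expert (column) picks its `capacity` highest-affinity tokens.
    A token may be picked by several experts, or by none."""
    scores = torch.softmax(router_logits, dim=-1)    # affinity per token over experts
    # For each expert, take the top-`capacity` tokens along the token dimension
    gates, token_idx = scores.topk(capacity, dim=0)  # [capacity, num_experts]
    return gates, token_idx
```

Because the per-expert load is fixed by construction, this scheme needs no auxiliary balancing loss; the trade-off is that some tokens may be processed by no expert at all.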