
If all tokens are sent to just a few popular experts, training becomes inefficient. Without load balancing, some experts become overused while others are underutilized, which can lead to:

  • Poor specialization (the same “popular” experts are used for everything)
  • Inefficient resource usage
  • Expert collapse (some experts are never trained)

To mitigate this, an auxiliary load-balancing loss is added to the training objective to encourage the router to give all experts roughly equal importance (see the sketch after this list). Other techniques include:

  • Noisy Top-K gating: add noise to the router logits before the top-k selection so that expert assignments are explored more evenly
  • Router z-loss: penalize large router logits to keep the gating softmax numerically stable
  • Expert capacity: drop tokens routed to an expert beyond a fixed per-expert capacity
  • Expert Choice routing: instead of tokens choosing experts, experts choose their most relevant tokens
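
As a concrete example, here is a minimal PyTorch sketch of a Switch Transformer-style auxiliary balancing loss, assuming the router produces logits of shape `[num_tokens, num_experts]`; the function name and the `alpha` coefficient are illustrative:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """Switch Transformer-style auxiliary loss.

    Encourages both the fraction of tokens dispatched to each expert (f_i)
    and the mean router probability for each expert (P_i) to be uniform.
    Reaches its minimum (value alpha) when both are exactly uniform.
    """
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)              # [tokens, experts]
    # f_i: fraction of tokens whose top-1 choice is expert i
    top1 = probs.argmax(dim=-1)                           # [tokens]
    f = F.one_hot(top1, num_experts).float().mean(dim=0)  # [experts]
    # P_i: mean routing probability assigned to expert i
    p = probs.mean(dim=0)                                 # [experts]
    return alpha * num_experts * torch.sum(f * p)
```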
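
A sketch of noisy Top-K gating under the same assumptions; in the original formulation (Shazeer et al., 2017) the noise scale is learned per expert, while a fixed `noise_std` keeps this example short:

```python
import torch

def noisy_top_k_gating(logits: torch.Tensor, k: int, noise_std: float = 1.0,
                       training: bool = True):
    """Add Gaussian noise to the router logits before the top-k selection,
    so that borderline experts occasionally win and receive gradient."""
    if training:
        logits = logits + torch.randn_like(logits) * noise_std
    top_vals, top_idx = logits.topk(k, dim=-1)
    gates = torch.softmax(top_vals, dim=-1)  # renormalize over the chosen experts
    return gates, top_idx
```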
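
The router z-loss (introduced in the ST-MoE paper) is a one-liner on top of the same logits; the `coeff` value here is illustrative and is a tunable hyperparameter:

```python
import torch

def router_z_loss(router_logits: torch.Tensor, coeff: float = 1e-3) -> torch.Tensor:
    """Penalize large logits entering the router softmax. This keeps the
    gate numerically stable without changing which expert wins."""
    z = torch.logsumexp(router_logits, dim=-1)  # [num_tokens]
    return coeff * (z ** 2).mean()
```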
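
Finally, a sketch of Expert Choice routing: each expert selects its top-`capacity` tokens by affinity score, so every expert processes the same number of tokens by construction. In practice `capacity` is derived from the token count, the number of experts, and a capacity factor; a fixed argument is used here for brevity:

```python
import torch

def expert_choice_routing(router_logits: torch.Tensor, capacity: int):
    """Each expert (column) picks its `capacity` highest-affinity tokens.
    A token may be picked by several experts, or by none."""
    scores = torch.softmax(router_logits, dim=-1)    # affinity per token over experts
    # For each expert, take the top-`capacity` tokens along the token dimension
    gates, token_idx = scores.topk(capacity, dim=0)  # [capacity, num_experts]
    return gates, token_idx
```

Because the per-expert load is fixed by construction, this scheme needs no auxiliary balancing loss; the trade-off is that some tokens may be processed by no expert at all.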