
The expert capacity sets an upper limit on how many tokens each expert processes per batch, during both training and inference.

There are a couple of motivations for expert capacity:

  • Prevent bias: If the router overly favors one expert or a small set of experts (exploitation over exploration), the result is load imbalance and potential performance problems.
  • Fixing tensor shapes: We cannot know ahead of time how many tokens will go to each expert, so we fix the expert capacity (through the capacity factor) to ensure tensor shapes are known at compile time.

expert_capacity = int((tokens_per_batch / self.num_experts) * self.capacity_factor)

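If there are 6 tokens in a batch and we have 3 experts, the formula above gives a capacity of 2 tokens per expert when capacity_factor is 1.0. The capacity factor is a hyperparameter that gives each expert some additional buffer, which helps with load balancing in MoEs. With capacity_factor = 1.5, for example, the same batch gives a capacity of 3 tokens per expert.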
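
To make the arithmetic concrete, here is a minimal sketch of the capacity formula as a standalone function, plus one possible way to enforce the limit. The helper name apply_capacity and the choice to mark overflow tokens as None are assumptions for illustration; the text above does not say what happens to tokens that exceed an expert's capacity, and real MoE layers do this with tensor operations rather than a Python loop.

```python
def expert_capacity(tokens_per_batch, num_experts, capacity_factor=1.0):
    # Same formula as above, without the `self.` attributes.
    return int((tokens_per_batch / num_experts) * capacity_factor)


def apply_capacity(assignments, num_experts, capacity):
    """Hypothetical helper: enforce a per-expert token limit.

    `assignments[i]` is the expert index chosen for token i.
    Tokens are served in order; once an expert is full, later tokens
    routed to it are marked None (overflow). How overflow is handled
    in practice is an assumption here, not something the text specifies.
    """
    counts = [0] * num_experts
    kept = []
    for expert in assignments:
        if counts[expert] < capacity:
            counts[expert] += 1
            kept.append(expert)
        else:
            kept.append(None)
    return kept, counts


# 6 tokens, 3 experts; the router sends four of the tokens to expert 0.
assignments = [0, 0, 0, 1, 2, 0]

cap = expert_capacity(6, 3, capacity_factor=1.0)       # 2
kept, counts = apply_capacity(assignments, 3, cap)
print(cap, kept, counts)   # 2 [0, 0, None, 1, 2, None] [2, 1, 1]

cap = expert_capacity(6, 3, capacity_factor=1.5)       # 3
kept, counts = apply_capacity(assignments, 3, cap)
print(cap, kept, counts)   # 3 [0, 0, 0, 1, 2, None] [3, 1, 1]
```

With a capacity factor of 1.0, two of the four tokens routed to expert 0 overflow; raising the factor to 1.5 leaves room for one more. In both cases every expert sees at most a fixed number of tokens, which is what keeps the tensor shapes static.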