Topics
In MoEs, a learned gating network (G) is used to generate routing scores for all experts, followed by:
1. selecting the experts (Top-K or noisy Top-K)
2. normalizing the scores of those selected experts (softmax is commonly used)
3. combining the experts' outputs using these normalized weights (see the code sketch below)
Warning
Steps 1 and 2 aren't always followed exactly as written; implementations and their order vary.
The last step is mathematically represented as

$$y = \sum_{i=1}^{n} G(x)_i \, E_i(x),$$

where $n$ is the number of experts, $E_i(x)$ is the output of expert $i$, and $G(x)_i$ is the (normalized) gate score for that expert.
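To make the steps and the equation above concrete, here is a minimal PyTorch sketch of Top-K routing. The names (`moe_forward`, `gate_w`) are illustrative rather than from any particular library, and the per-token loop is kept for clarity; production implementations batch tokens by expert instead.

```python
import torch
import torch.nn.functional as F

def moe_forward(x, gate_w, experts, k=2):
    """x: (batch, dim) tokens; gate_w: (dim, n_experts) gate weights;
    experts: list of modules mapping (dim,) -> (dim,)."""
    scores = x @ gate_w                           # routing scores for all experts
    top_scores, top_idx = scores.topk(k, dim=-1)  # step 1: select the Top-K experts
    weights = F.softmax(top_scores, dim=-1)       # step 2: normalize the selected scores
    out = torch.zeros_like(x)
    for b in range(x.size(0)):                    # step 3: y = sum_i G(x)_i * E_i(x)
        for w, e in zip(weights[b], top_idx[b]):
            out[b] += w * experts[int(e)](x[b])
    return out

# Example: 8 linear experts, each token routed to its top 2
dim, n_experts = 16, 8
experts = [torch.nn.Linear(dim, dim) for _ in range(n_experts)]
gate_w = torch.randn(dim, n_experts)
y = moe_forward(torch.randn(4, dim), gate_w, experts, k=2)
```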
G is typically a simple network with weights $W_g$. One can choose to select only the top $k$ experts according to its scores.
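In the standard formulation from Shazeer et al. (2017), this simple softmax gate is

$$G_\sigma(x) = \mathrm{Softmax}(x \cdot W_g).$$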
Noisy Top-k Gating introduces some (tunable) noise and then keeps the top k values.
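Concretely, in Shazeer et al. (2017) the noise has a learned, per-expert scale $W_{noise}$, and discarded experts get a score of $-\infty$ so the softmax assigns them weight 0:

$$H(x)_i = (x \cdot W_g)_i + \mathrm{StandardNormal}() \cdot \mathrm{Softplus}\big((x \cdot W_{noise})_i\big)$$

$$\mathrm{KeepTopK}(v, k)_i = \begin{cases} v_i & \text{if } v_i \text{ is in the top } k \text{ elements of } v \\ -\infty & \text{otherwise} \end{cases}$$

$$G(x) = \mathrm{Softmax}\big(\mathrm{KeepTopK}(H(x), k)\big)$$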
Question
Why not just select the top expert?
The initial conjecture was that routing to more than one expert was needed for the gate to learn how to route to different experts, so at least two experts had to be picked.