Topics
In MoEs, a learned gating network (G) is used to generate routing scores for all experts, followed by:
1. selecting the experts (Top-K or noisy Top-K)
2. normalizing the scores of those selected experts (softmax is commonly used)
3. combining the experts' outputs using these normalized weights (see the code sketch below)
Warning
Steps 1 and 2 aren't always followed exactly as written; implementations and their order vary.
The last step is mathematically represented as

$$y = \sum_{i=1}^{n} G(x)_i \, E_i(x),$$

where $n$ is the number of experts, $E_i(x)$ is the output of expert $i$, and $G(x)_i$ is the (normalized) gate score for that expert.
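To make the steps and the equation above concrete, here is a minimal PyTorch sketch of Top-K routing. The names (`moe_forward`, `gate_w`) are illustrative rather than from any particular library, and the per-token loop is kept for clarity; production implementations batch tokens by expert instead.

```python
import torch
import torch.nn.functional as F

def moe_forward(x, gate_w, experts, k=2):
    """x: (batch, dim) tokens; gate_w: (dim, n_experts) gate weights;
    experts: list of modules mapping (dim,) -> (dim,)."""
    scores = x @ gate_w                           # routing scores for all experts
    top_scores, top_idx = scores.topk(k, dim=-1)  # step 1: select the Top-K experts
    weights = F.softmax(top_scores, dim=-1)       # step 2: normalize the selected scores
    out = torch.zeros_like(x)
    for b in range(x.size(0)):                    # step 3: y = sum_i G(x)_i * E_i(x)
        for w, e in zip(weights[b], top_idx[b]):
            out[b] += w * experts[int(e)](x[b])
    return out

# Example: 8 linear experts, each token routed to its top 2
dim, n_experts = 16, 8
experts = [torch.nn.Linear(dim, dim) for _ in range(n_experts)]
gate_w = torch.randn(dim, n_experts)
y = moe_forward(torch.randn(4, dim), gate_w, experts, k=2)
```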
G is typically a simple network with weights $W_g$. One can choose to select only the top $k$ experts according to its scores.
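In the standard formulation from Shazeer et al. (2017), this simple softmax gate is

$$G_\sigma(x) = \mathrm{Softmax}(x \cdot W_g).$$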
Noisy Top-k Gating introduces some (tunable) noise and then keeps the top k values.
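Concretely, in Shazeer et al. (2017) the noise has a learned, per-expert scale $W_{noise}$, and discarded experts get a score of $-\infty$ so the softmax assigns them weight 0:

$$H(x)_i = (x \cdot W_g)_i + \mathrm{StandardNormal}() \cdot \mathrm{Softplus}\big((x \cdot W_{noise})_i\big)$$

$$\mathrm{KeepTopK}(v, k)_i = \begin{cases} v_i & \text{if } v_i \text{ is in the top } k \text{ elements of } v \\ -\infty & \text{otherwise} \end{cases}$$

$$G(x) = \mathrm{Softmax}\big(\mathrm{KeepTopK}(H(x), k)\big)$$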
Question
Why not just select the top expert?
The initial conjecture was that routing to more than one expert was needed for the gate to learn how to route to different experts, so at least two experts had to be picked.