Pros
- Allows pretraining with far less compute, so a much larger model or dataset can be trained on the same compute budget as a dense model
- Faster inference than a dense model with the same total number of parameters, since only a few experts are active per token (see the routing sketch after this list)
- Useful in high-throughput scenarios with many machines/cores, since experts can be sharded across devices (expert parallelism)
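
To make the "faster inference" point concrete, here is a minimal top-k routing sketch in PyTorch. The `TinyMoELayer` class, its layer sizes, and the 8-expert/top-2 configuration are made up for illustration; the point is simply that each token passes through only `top_k` of the `n_experts` expert FFNs, so per-token compute is roughly `top_k / n_experts` of a dense layer holding the same total expert capacity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        logits = self.router(x)                                # (n_tokens, n_experts)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)  # pick top_k experts per token
        weights = F.softmax(weights, dim=-1)                   # normalize the selected scores
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                       # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = TinyMoELayer()
tokens = torch.randn(16, 64)
print(layer(tokens).shape)  # torch.Size([16, 64]); each token used only 2 of the 8 experts
```
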
Cons
- Even though the number of active parameters per token is lower, all experts still have to be loaded into memory, because we do not know in advance which experts a token will be routed to. This results in high VRAM usage (see the back-of-the-envelope sketch after this list)
- Fine-tuning MoEs is difficult; sparse models have historically been more prone to overfitting during fine-tuning than their dense counterparts
- Knowing what each expert learns isn't very useful in practice, since experts tend not to specialize in clean, human-interpretable topics
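
To see why memory stays high even though per-token compute drops, here is a back-of-the-envelope sketch. The layer sizes (d_model = 4096, d_ff = 14336, 8 experts, top-2 routing) are illustrative assumptions, not taken from any specific model:

```python
# Illustrative sizes only; a real model has many such layers plus attention, embeddings, etc.
d_model, d_ff = 4096, 14336
n_experts, top_k = 8, 2

expert_params = 2 * d_model * d_ff            # up- and down-projection of one FFN expert
resident_params = n_experts * expert_params   # every expert must sit in VRAM
active_params = top_k * expert_params         # but only top_k experts run per token

print(f"FFN params resident in VRAM per layer: {resident_params / 1e6:.0f}M")  # ~940M
print(f"FFN params used per token per layer:   {active_params / 1e6:.0f}M")    # ~235M
```

VRAM scales with the resident count, while per-token FLOPs scale with the active count; that gap is exactly what this con is pointing at.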