Topics
In the context of interpretability of experts in MoE models, we can observe:
- Non-transformer MoE architectures: experts are more interpretable, with clearer specialization
- Transformer-based MoE: expert roles are more diffuse and harder to characterize
Looking more closely at transformer architectures, the Switch Transformer MoE authors report:
- Encoder experts tend to specialize in shallow, surface-level concepts
- Decoder experts show less specialization
In a multilingual setup one might expect each expert to specialize in a particular language, but the opposite happens: because of token-level routing and the load-balancing objective used when training MoE models, no single expert ends up specializing in any given language; the sketch below illustrates why.
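To make the load-balancing pressure concrete, here is a minimal sketch assuming top-1 routing and a Switch-Transformer-style auxiliary loss (num_experts * sum_i f_i * P_i, where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability on expert i). NumPy and the function name `switch_aux_loss` are chosen for illustration, not taken from any particular implementation.

```python
import numpy as np

def switch_aux_loss(router_logits: np.ndarray) -> float:
    """Load-balancing auxiliary loss for top-1 (switch) routing.

    router_logits: [num_tokens, num_experts] raw router scores.
    Returns num_experts * sum_i(f_i * P_i). The loss is smallest when
    tokens are spread uniformly across experts.
    """
    num_tokens, num_experts = router_logits.shape
    # Softmax over experts for each token.
    probs = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # Top-1 routing: each token is dispatched to its highest-probability expert.
    assignments = probs.argmax(axis=-1)
    # f_i: fraction of tokens dispatched to expert i.
    f = np.bincount(assignments, minlength=num_experts) / num_tokens
    # P_i: mean router probability mass placed on expert i.
    P = probs.mean(axis=0)
    return float(num_experts * np.sum(f * P))

# Toy check: near-uniform routing vs. routing collapsed onto one expert.
rng = np.random.default_rng(0)
uniform_logits = rng.normal(scale=0.01, size=(1024, 8))
skewed_logits = uniform_logits.copy()
skewed_logits[:, 0] += 5.0  # push almost all tokens to expert 0
print(switch_aux_loss(uniform_logits))  # ~1.0 (balanced)
print(switch_aux_loss(skewed_logits))   # ~8.0 (collapsed onto one expert)
```

Because the loss grows sharply when tokens pile onto one expert, a router that sent all tokens of a given language to a single "language expert" would be penalized during training, which is consistent with the observation above that no expert specializes in one language.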