Topics

Looking at the interpretability of individual experts in Mixture-of-Experts (MoE) models, we can observe:

  • Non-transformer MoE: experts are more interpretable, with clearer specialization
  • Transformer MoE: specialization is more diffuse, and expert roles are harder to characterize

Diving deeper into transformer architectures specifically, the Switch Transformer MoE authors report (see the inspection sketch after the list):

  • Encoder experts specialize in shallow concepts
  • Decoder experts show less specialization
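
One rough way to probe this kind of specialization is to tally which tokens each expert actually receives. The sketch below assumes we have already logged, for each token passing through an MoE layer, the expert it was routed to; the helper name `expert_token_profiles` is illustrative, not from the source.

```python
from collections import Counter, defaultdict

def expert_token_profiles(tokens, expert_ids, top_k=5):
    """Tally the tokens most frequently routed to each expert.

    tokens     : list of token strings seen by one MoE layer
    expert_ids : list of the expert index each token was routed to
    Returns {expert_id: [(token, count), ...]} with the top_k tokens
    per expert. A peaked profile (e.g., mostly punctuation or digits)
    is the kind of shallow specialization reported for encoder experts;
    a flat profile matches the weaker specialization seen in decoders.
    """
    buckets = defaultdict(Counter)
    for tok, eid in zip(tokens, expert_ids):
        buckets[eid][tok] += 1
    return {eid: counts.most_common(top_k) for eid, counts in buckets.items()}
```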

In a multilingual setup one could imagine each expert specializing in a single language, but the opposite happens: because routing is done per token and load balancing pushes tokens to be spread evenly across experts, no single expert ends up specialized in any given language.
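
The load balancing referred to here is, in the Switch Transformer formulation, an auxiliary loss that pushes the router toward spreading tokens uniformly over experts. The NumPy sketch below reproduces that loss (the function name and the toy usage are illustrative, not from the source) and shows why a router that sent all tokens of one language to a single expert would be penalized.

```python
import numpy as np

def switch_load_balancing_loss(router_logits, num_experts):
    """Auxiliary load-balancing loss in the Switch Transformer style.

    router_logits: array of shape [num_tokens, num_experts] with raw router scores.
    The loss is N * sum_i f_i * P_i, where f_i is the fraction of tokens whose
    top-1 expert is i and P_i is the mean routing probability for expert i.
    It is minimized (value 1) when both are uniform, i.e. when no expert
    (and hence no language) monopolizes the routing.
    """
    # Softmax over experts: each token's routing probabilities.
    probs = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    # f_i: fraction of tokens dispatched (top-1) to expert i.
    top1 = probs.argmax(axis=-1)
    f = np.bincount(top1, minlength=num_experts) / len(top1)

    # P_i: mean routing probability mass assigned to expert i.
    P = probs.mean(axis=0)

    return num_experts * float(np.sum(f * P))

# Toy comparison: a router that funnels most tokens to expert 0 (as a
# hypothetical "one expert per language" routing would) is penalized
# relative to a balanced router.
rng = np.random.default_rng(0)
num_tokens, num_experts = 512, 8
skewed = rng.normal(size=(num_tokens, num_experts))
skewed[:, 0] += 3.0                      # most tokens prefer expert 0
balanced = rng.normal(size=(num_tokens, num_experts))
print(switch_load_balancing_loss(skewed, num_experts))    # noticeably > 1
print(switch_load_balancing_loss(balanced, num_experts))  # close to 1
```

Since this penalty is computed per batch, and batches in a multilingual setup typically mix tokens from many languages, training pressure works against any routing pattern in which one expert captures an entire language.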