Topics
In the context of interpretability of experts in MoE models, we can observe:
- Non-transformer MoE architectures: experts are more interpretable, with clearer specialization
- Transformer-based MoE: expert roles are more diffuse and harder to characterize
Looking more closely at transformer architectures, the Switch Transformer MoE authors report:
- Encoder experts tend to specialize in shallow, surface-level concepts
- Decoder experts show less specialization
In a multilingual setup one might expect each expert to specialize in a particular language, but the opposite happens: because of token-level routing and the load-balancing objective used when training MoE models, no single expert ends up specializing in any given language; the sketch below illustrates why.
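To make the load-balancing pressure concrete, here is a minimal sketch assuming top-1 routing and a Switch-Transformer-style auxiliary loss (num_experts * sum_i f_i * P_i, where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability on expert i). NumPy and the function name `switch_aux_loss` are chosen for illustration, not taken from any particular implementation.

```python
import numpy as np

def switch_aux_loss(router_logits: np.ndarray) -> float:
    """Load-balancing auxiliary loss for top-1 (switch) routing.

    router_logits: [num_tokens, num_experts] raw router scores.
    Returns num_experts * sum_i(f_i * P_i). The loss is smallest when
    tokens are spread uniformly across experts.
    """
    num_tokens, num_experts = router_logits.shape
    # Softmax over experts for each token.
    probs = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # Top-1 routing: each token is dispatched to its highest-probability expert.
    assignments = probs.argmax(axis=-1)
    # f_i: fraction of tokens dispatched to expert i.
    f = np.bincount(assignments, minlength=num_experts) / num_tokens
    # P_i: mean router probability mass placed on expert i.
    P = probs.mean(axis=0)
    return float(num_experts * np.sum(f * P))

# Toy check: near-uniform routing vs. routing collapsed onto one expert.
rng = np.random.default_rng(0)
uniform_logits = rng.normal(scale=0.01, size=(1024, 8))
skewed_logits = uniform_logits.copy()
skewed_logits[:, 0] += 5.0  # push almost all tokens to expert 0
print(switch_aux_loss(uniform_logits))  # ~1.0 (balanced)
print(switch_aux_loss(skewed_logits))   # ~8.0 (collapsed onto one expert)
```

Because the loss grows sharply when tokens pile onto one expert, a router that sent all tokens of a given language to a single "language expert" would be penalized during training, which is consistent with the observation above that no expert specializes in one language.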