DOT-MoE uses differentiable optimal transport to convert dense feed-forward layers into balanced sparse mixture-of-experts (MoE) models, reducing active parameters by 50% while retaining 90% of the original model's performance. Because the approach requires no manual expert design or routing heuristics, the conversion is trainable end to end, enabling scalable sparse inference for pretrained models with improved predictive fidelity.
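To make the idea of balanced routing through optimal transport concrete, here is a minimal sketch of one common way to build such an assignment: Sinkhorn iterations on an entropy-regularized transport plan between tokens and experts. The summary above does not say which solver DOT-MoE uses, so the function name `sinkhorn_balanced_routing`, the `temperature` and `n_iters` parameters, and the random router scores are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def sinkhorn_balanced_routing(scores, n_iters=50, temperature=1.0):
    """Hypothetical helper: soft, balanced token-to-expert assignment.

    scores: array of shape (n_tokens, n_experts) with router affinities.
    Returns a transport plan of the same shape whose rows sum to about 1
    (each token is fully assigned) and whose columns sum to
    n_tokens / n_experts (each expert receives an equal share).
    """
    n_tokens, n_experts = scores.shape

    # Gibbs kernel of the affinities; subtracting the max avoids overflow.
    logits = scores / temperature
    kernel = np.exp(logits - logits.max())

    row_target = np.ones(n_tokens)                         # unit mass per token
    col_target = np.full(n_experts, n_tokens / n_experts)  # equal load per expert

    u = np.ones(n_tokens)
    v = np.ones(n_experts)
    for _ in range(n_iters):
        # Alternately rescale rows and columns toward their target marginals.
        u = row_target / (kernel @ v)
        v = col_target / (kernel.T @ u)

    # Every step is a smooth function of `scores`, which is why the same
    # computation is differentiable when written in an autodiff framework.
    return u[:, None] * kernel * v[None, :]


# Usage with made-up router scores for 64 tokens and 4 experts.
rng = np.random.default_rng(0)
scores = rng.normal(size=(64, 4))
plan = sinkhorn_balanced_routing(scores)
expert_load = plan.sum(axis=0)  # mass routed to each expert
```

Since the loop ends on a column rescaling, the expert loads match the target of `n_tokens / n_experts` up to floating-point error, while the per-token row sums are only approximately 1 and tighten as `n_iters` grows. A sparse router could then keep only the largest entries in each row of `plan`; that step is left out here because the summary does not describe how DOT-MoE discretizes the plan.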
