Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient
This repo provides the inference code for running the models used in https://arxiv.org/abs/2502.05172 (accepted at ICML 2025).
The checkpoints are laid out as `./<dmodel>/<n_experts>/<n_att_heads>/<n_training_steps>.pt`.
Example code for loading a model from a path and running evaluations can be found in `./benchmark.py`.
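As a convenience, the checkpoint layout above can be turned into a small path-building helper. This is a minimal sketch, not part of the repo's code: the function name and the example hyperparameter values (512, 8, 8, 100000) are hypothetical, and only the directory structure is taken from this README.

```python
from pathlib import Path

def checkpoint_path(dmodel, n_experts, n_att_heads, n_training_steps, root="."):
    """Build a checkpoint path following the repo layout:
    ./<dmodel>/<n_experts>/<n_att_heads>/<n_training_steps>.pt
    (helper name and arguments are illustrative, not from the repo)
    """
    return (
        Path(root)
        / str(dmodel)
        / str(n_experts)
        / str(n_att_heads)
        / f"{n_training_steps}.pt"
    )

# Hypothetical values for illustration only:
path = checkpoint_path(512, 8, 8, 100000)
# A checkpoint at such a path would typically be loaded with
# torch.load(path, map_location="cpu"); see ./benchmark.py for
# the actual loading and evaluation code.
```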
```bibtex
@misc{ludziejewski2025jointmoescalinglaws,
  title={Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient},
  author={Jan Ludziejewski and Maciej Pióro and Jakub Krajewski and Maciej Stefaniak and Michał Krutul and Jan Małaśnicki and Marek Cygan and Piotr Sankowski and Kamil Adamczewski and Piotr Miłoś and Sebastian Jaszczur},
  year={2025},
  eprint={2502.05172},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.05172},
}
```