ZEDA-GLM-4.7-Flash-Dynamic

This repository contains a dynamic Mixture-of-Experts (MoE) model derived from GLM-4.7-Flash with the Zero-Expert Self-Distillation Adaptation (ZEDA) framework.

ZEDA transforms post-trained static MoE models into efficient dynamic ones by injecting parameter-free zero-output experts and performing two-stage self-distillation. This model eliminates over 50% of expert FLOPs at marginal accuracy loss, delivering approximately a 1.20x end-to-end inference speedup.
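To make the idea of a parameter-free zero-output expert concrete, here is a minimal single-token sketch in plain Python. It is an illustration only, not the ZEDA implementation: the function name `moe_forward`, the top-k selection rule, and the choice not to renormalize gate weights are all assumptions made for this example.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, experts, num_zero_experts, top_k):
    """Route one token through top_k of (real + zero-output) experts.

    `experts` holds the real experts as callables. Router slots with
    index >= len(experts) stand for zero-output experts: they have no
    parameters, add nothing to the output, and are never executed.
    Returns (output vector, number of real experts executed).
    Assumption: gate weights are NOT renormalized over the chosen slots.
    """
    num_real = len(experts)
    assert len(router_logits) == num_real + num_zero_experts
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    out = [0.0] * len(x)
    executed = 0
    for i in chosen:
        if i >= num_real:
            continue  # zero-output expert: skip the computation entirely
        y = experts[i](x)
        out = [o + probs[i] * v for o, v in zip(out, y)]
        executed += 1
    return out, executed

# Two toy "real" experts plus two zero-output slots, top-2 routing.
toy_experts = [lambda v: [2.0 * a for a in v], lambda v: [a + 1.0 for a in v]]
_, n_easy = moe_forward([1.0, 2.0], [0.0, 1.0, 3.0, 2.0], toy_experts, 2, 2)
_, n_hard = moe_forward([1.0, 2.0], [3.0, 2.0, 0.0, -1.0], toy_experts, 2, 2)
```

In this toy setup `n_easy` is 0 (both selected slots are zero-output experts) and `n_hard` is 2, which shows how the number of executed experts can vary per token while the router's top-k stays fixed.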

Paper: Post-Trained MoE Can Skip Half Experts via Self-Distillation
Repository: TsinghuaC3I/ZEDA on GitHub

Model Description

To stabilize the architectural conversion from static to dynamic MoE, ZEDA injects parameter-free zero-output experts into each MoE layer. It then adapts the augmented model through two-stage self-distillation, using the original MoE as a frozen teacher and applying a group-level balancing loss. On benchmarks spanning math, code, and instruction following, ZEDA outperforms strong dynamic MoE baselines while significantly reducing computational overhead.
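The self-distillation objective can be pictured as matching the student's next-token distribution to that of the frozen teacher. The sketch below computes a generic forward KL divergence for a single token in plain Python. The KL direction, the `temperature` parameter, and the function names are assumptions for illustration; the exact loss, the two-stage schedule, and the group-level balancing term are defined in the paper and are not reproduced here.

```python
import math

def log_softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]

def distill_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over one token's vocabulary distribution.

    The teacher logits come from the original (frozen) MoE, so in a real
    training loop no gradient would flow through them.
    """
    t_logp = log_softmax(teacher_logits, temperature)
    s_logp = log_softmax(student_logits, temperature)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))

# The loss is zero when the student reproduces the teacher exactly
# and positive otherwise.
same = distill_kl([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
diff = distill_kl([2.0, 0.5, -1.0], [0.0, 0.0, 0.0])
```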

Environment Setup

This model uses a modified architecture (Glm4MoeLitePlusPlusForCausalLM) and requires a specific environment, including modified versions of transformers, sglang, and slime. Please refer to the Getting Started section of the official GitHub repository for detailed installation instructions and Docker configurations.
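Before launching inference, it can help to confirm that a downloaded checkpoint actually declares the modified architecture, since a stock environment would not know this class. The helper below is a small sketch that assumes the checkpoint ships a Hugging Face style `config.json` with an `architectures` list; the function name `declares_architecture` and the sample config text are hypothetical.

```python
import json

def declares_architecture(config_text, class_name):
    """Return True if the config JSON lists `class_name` under "architectures"."""
    config = json.loads(config_text)
    return class_name in (config.get("architectures") or [])

# Hypothetical config contents, for illustration only.
sample_config = '{"architectures": ["Glm4MoeLitePlusPlusForCausalLM"]}'
needs_modified_stack = declares_architecture(
    sample_config, "Glm4MoeLitePlusPlusForCausalLM"
)
```

In practice, `config_text` would be read from the `config.json` file in the local model directory. If the check returns True, the modified packages described in the repository are needed to load the model.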

Citation

If you find this work helpful, please cite:

@misc{lv2026posttrainedmoeskiphalf,
      title={Post-Trained MoE Can Skip Half Experts via Self-Distillation}, 
      author={Xingtai Lv and Li Sheng and Kaiyan Zhang and Yichen You and Siyan Gao and Xueheng Luo and Yuxin Zuo and Yuchen Fan and Junlin Yang and Ganqu Cui and Bingning Wang and Fan Yang and Youbang Sun and Ning Ding and Bowen Zhou},
      year={2026},
      eprint={2605.18643},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.18643}, 
}