fix(modeling): add training-path MoE dispatch and KV cache API compat

#9

Fixes #8 (UnboundLocalError in DeepseekV3MoE.forward during training).

Three changes in modeling_deepseek.py:

  1. Add training-path MoE dispatch (DeepseekV3MoE.forward)
    The original code only had an inference branch,
    leaving y undefined during training (causing UnboundLocalError).
    Added a proper training branch using sort-based dispatch:

    • Expand each token top_k times, sort by expert ID
    • Single GPU->CPU sync for all expert boundaries
    • Call each expert on its contiguous slice, unsort, apply routing weights
  2. Remove in moe_infer
    Commented out so eval steps inside a training loop do not crash.

  3. KV cache API compatibility (get_usable_length -> get_seq_length)
    past_key_value.get_usable_length() and past_key_values.seen_tokens are
    deprecated in transformers >= 4.40. Replaced with get_seq_length().

This PR fixes the issue reported in #8.

Root cause: DeepseekV3MoE.forward only had an inference branch, so y was never assigned during training, causing UnboundLocalError when the shared-expert accumulation was reached.
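
The failure pattern can be reproduced without any model code. The sketch below is a simplified stand-in, not the actual forward method: the function name forward_like, the training flag, and the numeric values are assumptions chosen to mirror the described control flow (a variable assigned only on the inference path, then read unconditionally).

```python
def forward_like(training, n_shared_experts=1):
    """Simplified stand-in for the control flow described above."""
    if not training:
        y = 1.0  # routed-expert output, assigned only on the inference path
    if n_shared_experts is not None:
        # Reads y unconditionally; y is unbound when training is True.
        y = y + 0.5  # shared-expert accumulation
    return y


eval_result = forward_like(training=False)

try:
    forward_like(training=True)
    training_error = None
except UnboundLocalError as exc:
    training_error = exc
```

Adding a training branch that assigns y before the shared-expert step removes the error.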

Changes:

  1. Training-path MoE dispatch: Added a proper training branch using sort-based dispatch. Each token is expanded top_k times and the copies are sorted by expert ID so that each expert receives a contiguous slice. All expert boundaries are fetched with a single GPU→CPU sync (instead of one per expert), and the outputs are then unsorted and aggregated with the routing weights. Supports gradient flow.

  2. Remove in moe_infer: Commented out so eval steps inside a training loop do not accidentally crash.

  3. KV cache API compatibility: past_key_value.get_usable_length() and past_key_values.seen_tokens are deprecated in transformers >= 4.40. Replaced with get_seq_length().
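
The sort-based dispatch in change 1 can be sketched roughly as follows. This is an illustrative reconstruction of the steps listed above, not the code in this PR: the function name sorted_moe_dispatch, the tensor shapes, and the use of nn.Linear layers as stand-in experts are all assumptions.

```python
import torch
import torch.nn as nn


def sorted_moe_dispatch(x, topk_idx, topk_weight, experts):
    """x: (T, H); topk_idx and topk_weight: (T, K); experts: list of modules."""
    num_tokens, top_k = topk_idx.shape
    flat_idx = topk_idx.reshape(-1)  # (T*K,), row i belongs to token i // K

    # Sort the expanded copies by expert ID so each expert's rows are contiguous.
    _, order = torch.sort(flat_idx, stable=True)
    token_of_row = torch.div(order, top_k, rounding_mode="floor")
    sorted_x = x[token_of_row]  # (T*K, H)

    # Single device-to-host transfer for all expert boundaries.
    counts = torch.bincount(flat_idx, minlength=len(experts)).tolist()

    outputs = []
    start = 0
    for expert, count in zip(experts, counts):
        if count:  # skip experts that received no tokens
            outputs.append(expert(sorted_x[start:start + count]))
        start += count
    sorted_out = torch.cat(outputs, dim=0)

    # Undo the sort, then combine the K expert outputs per token.
    inverse = torch.argsort(order)
    out = sorted_out[inverse].view(num_tokens, top_k, -1)
    return (out * topk_weight.unsqueeze(-1)).sum(dim=1)


torch.manual_seed(0)
T, H, K, E = 6, 4, 2, 3
experts = [nn.Linear(H, H) for _ in range(E)]  # stand-in experts
x = torch.randn(T, H, requires_grad=True)
topk_idx = torch.stack([torch.randperm(E)[:K] for _ in range(T)])
topk_weight = torch.rand(T, K)
y = sorted_moe_dispatch(x, topk_idx, topk_weight, experts)
```

Because no step runs under torch.no_grad(), gradients reach both the inputs and the expert parameters. The Python loop still runs once per expert, but the only host synchronization is the single tolist() call.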

Validated with 100-step fine-tuning of Moonlight-16B-A3B using DeepSpeed ZeRO-2 + AutoEP + Muon optimizer; loss decreases correctly throughout.

