Qwen3-0.6B Deep MoE (FineWeb recovery, step 1500)

Sparse-upcycled Qwen3-0.6B with middle layer stacking (28 → 58 layers) and linear-ramp MoE (peak 25 experts, top-2 routing).
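
For readers unfamiliar with top-2 routing, the following is a minimal pure-Python sketch of the general technique, not this repo's implementation. It assumes a softmax over router logits followed by renormalizing the two selected gate weights; whether this checkpoint renormalizes is not stated here. The names top2_route and moe_forward are hypothetical.

```python
import math

def top2_route(router_logits):
    """Return [(expert_index, gate_weight), ...] for the two chosen experts."""
    # Softmax over all experts (subtract the max for numerical stability).
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Select the two highest-probability experts for this token.
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    # ASSUMPTION: renormalize so the two gate weights sum to 1.
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]

def moe_forward(x, router_logits, experts):
    """Weighted sum of the two selected experts' outputs for a scalar input."""
    return sum(w * experts[i](x) for i, w in top2_route(router_logits))
```

Only the two selected experts run for a given token, so per-token compute stays close to that of a dense layer even as the expert count grows.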

Training

  • Stage 1: Reasoning/chat mix on MoE-only weights (10k steps)
  • Stage 2: FineWeb-Edu continued pretraining, full fine-tune (this checkpoint is step 1500 of 8000)
  • Optimizer: Muon + AdamW
  • Seq len: 512
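
How parameters are divided between Muon and AdamW is not specified here. A common convention, sketched below as an assumption, gives 2D hidden weight matrices to Muon and everything else (embeddings, output head, norms, biases) to AdamW. The function name and the parameter names in the usage note are hypothetical.

```python
def split_param_groups(param_shapes):
    """Partition parameter names into (muon_names, adamw_names).

    param_shapes maps a parameter name to its shape tuple.
    ASSUMPTION: 2D matrices that are not embeddings or the LM head go to Muon;
    all remaining parameters go to AdamW.
    """
    muon, adamw = [], []
    for name, shape in param_shapes.items():
        is_matrix = len(shape) == 2
        is_embedding_or_head = "embed" in name or "lm_head" in name
        if is_matrix and not is_embedding_or_head:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw
```

For example, with illustrative names such as "embed_tokens.weight", "layers.0.self_attn.q_proj.weight", and "layers.0.input_layernorm.weight", only the q_proj matrix would land in the Muon group.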

Loading

Loading requires the custom upcycle loader from llm_from_scratch: use load_upcycled_qwen3() together with the moe_upcycle.json included in this repo.

from llm.model.qwen3_upcycle import load_upcycled_qwen3
model, meta = load_upcycled_qwen3("avewright/qwen3-0.6b-deep-moe-fineweb")

Model size: 7B params, BF16 (Safetensors)