MoE LLM Training Pipeline
Complete training pipeline for building a ~20B total / ~3.6B active Mixture-of-Experts language model from scratch, targeting code generation, reasoning, bash/Slurm operations, and internal library knowledge.
Architecture
Two architecture options based on production MoE models:
| Option | Based On | Total Params | Active/Token | Experts | Attention |
|---|---|---|---|---|---|
| A (Recommended) | GPT-OSS-20B | 21B | 3.6B | 32, top-4 | GQA + sliding/full |
| B | DeepSeek-V2-Lite | 15.7B | 2.4B | 64, top-6 | MLA + shared experts |
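The "Experts" column (for example, 32 experts with top-4 routing in Option A) means that each token is sent to only a few experts, which is why the active parameter count is much smaller than the total. The following is a minimal, dependency-free sketch of top-k routing for a single token. The function name `route_top_k` and the example logits are hypothetical, and it shows one common variant (softmax over the selected logits only); real implementations may normalize before selection and add load-balancing terms.

```python
import math

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Illustrative only: not taken from either reference model's code.
    """
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (subtract the max for stability).
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Option A shape: 32 experts, 4 active per token (logit values are made up).
logits = [0.0] * 32
logits[3], logits[7], logits[11], logits[20] = 2.0, 1.0, 3.0, 0.5
chosen = route_top_k(logits, k=4)
expert_ids = [i for i, _ in chosen]
gate_weights = [w for _, w in chosen]
```

The token's output would then be the gate-weighted sum of the four selected experts' outputs; the remaining 28 experts are not evaluated for that token.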
Hardware Requirements
- Training: 4 nodes × 4 H100 GPUs (16 GPUs, 1.28 TB VRAM)
- Inference: 1 node × 4 H100 GPUs (320 GB VRAM; the model fits in BF16)
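The VRAM totals above can be reproduced with simple arithmetic. This sketch assumes the 80 GB H100 variant (which is what 16 GPUs = 1.28 TB implies) and counts BF16 weights only at 2 bytes per parameter; KV cache, activations, and optimizer state are not included, so it is a lower bound rather than a sizing guarantee. The helper names are hypothetical.

```python
H100_GB = 80  # assumption: 80 GB H100, consistent with 16 GPUs = 1.28 TB

def total_vram_gb(nodes, gpus_per_node, gb_per_gpu=H100_GB):
    """Aggregate VRAM across all GPUs, in decimal GB."""
    return nodes * gpus_per_node * gb_per_gpu

def bf16_weights_gb(params_billion):
    """Weight memory in GB: BF16 stores 2 bytes per parameter."""
    return params_billion * 2

train_vram = total_vram_gb(4, 4)    # training cluster
infer_vram = total_vram_gb(1, 4)    # inference node
weights = bf16_weights_gb(21)       # Option A, 21B total parameters

print(f"training VRAM:  {train_vram} GB")
print(f"inference VRAM: {infer_vram} GB")
print(f"BF16 weights:   {weights} GB")
```

Under these assumptions the Option A weights occupy well under the 320 GB available on the inference node, which leaves headroom for the KV cache at long context lengths.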
Training Pipeline
Phase 0: Data Preparation (2-4 weeks)
└─ Crawl internal repos → Generate synthetic data → Filter/dedup → Tokenize
Phase 1: Pre-Training with Megatron-LM (4-8 weeks)
├─ 1a: General pre-training (4K context, all data)
├─ 1b: Code-heavy annealing (increased code ratio)
└─ 1c: Long-context extension (→ 131K via YaRN)
Phase 2: SFT with TRL (2-3 days)
└─ Instruction-following on code, Slurm, reasoning tasks
Phase 3: GRPO RL (3-5 days)
└─ Code execution rewards + math verification + Slurm validation
Phase 4: Evaluation
└─ HumanEval, MBPP, GSM8K, MATH, MMLU + custom benchmarks
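Phase 3 relies on GRPO, which scores several sampled completions per prompt and normalizes each reward against the rest of its group instead of using a learned value function. The sketch below shows only that normalization step. It is not the code in `scripts/rl_grpo_train.py`; the function name, the `eps` value, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled completions.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of four completions for one coding prompt:
# reward 1.0 if the generated code passed its tests, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# When every completion gets the same reward, there is no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Passing completions receive positive advantages and failing ones negative, so binary rewards such as code execution or math verification can be used directly.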
Repository Structure
├── configs/                          # Model architecture configs
│   ├── model_gptoss_20b.yaml         # GPT-OSS-20B style (recommended)
│   └── model_deepseek_v2lite.yaml    # DeepSeek-V2-Lite style (alternative)
├── scripts/                          # Training scripts
│   ├── pretrain_megatron.sh          # Megatron-LM pre-training launcher
│   ├── convert_to_hf.py              # Megatron → HuggingFace conversion
│   ├── sft_train.py                  # SFT with TRL
│   ├── rl_grpo_train.py              # GRPO reinforcement learning
│   └── setup_environment.sh          # Environment setup
├── data_curation/                    # Data preparation pipeline
│   ├── crawl_internal_repos.py       # Extract code from your repos
│   ├── generate_synthetic_data.py    # LLM-powered data generation
│   ├── filter_and_dedup.py           # Quality filtering + MinHash dedup
│   ├── prepare_slurm_data.py         # Slurm/bash data curation
│   └── tokenize_for_megatron.py      # Convert to Megatron binary format
├── slurm/                            # Slurm job scripts
│   ├── pretrain.sbatch               # Pre-training job (4 nodes)
│   ├── sft.sbatch                    # SFT job (1 node)
│   └── eval.sbatch                   # Evaluation job (1 node)
└── docs/                             # Documentation
    ├── MASTER_PLAN.md                # Complete training plan
    ├── ARCHITECTURE_GUIDE.md         # MoE architecture deep-dive
    └── DATA_GUIDE.md                 # Dataset and data curation guide
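`filter_and_dedup.py` is described as using MinHash for deduplication. As a rough illustration of the idea, and not the script's actual implementation, the sketch below builds a MinHash signature from word shingles using only the standard library. The shingle size, the number of hash functions, and all function names are assumptions; a production pipeline would typically add locality-sensitive hashing so that not every pair of documents has to be compared.

```python
import hashlib

def shingles(text, n=5):
    """Set of overlapping n-word windows; short texts become one shingle."""
    tokens = text.split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def minhash_signature(text, num_perm=64, n=5):
    """For each salted hash function, keep the minimum hash over all shingles."""
    encoded = [s.encode("utf-8") for s in shingles(text, n)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")  # a distinct salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "big")
            for s in encoded
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "srun launches a parallel job step inside an existing allocation"
doc_b = "the router selects four experts for every token in the batch"
same = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_a))
diff = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates, and only one of them kept.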
Quick Start
# 1. Set up environments
bash scripts/setup_environment.sh
# 2. Curate your internal data
python data_curation/crawl_internal_repos.py --repos /path/to/your/repos --output data/internal.jsonl
python data_curation/generate_synthetic_data.py --input data/internal.jsonl --output data/synthetic.jsonl
python data_curation/filter_and_dedup.py --input data/internal.jsonl data/synthetic.jsonl --output data/filtered.jsonl
python data_curation/tokenize_for_megatron.py --input data/filtered.jsonl --output-prefix data/tokenized/internal
# 3. Launch pre-training
sbatch slurm/pretrain.sbatch
# 4. Convert checkpoint and run SFT
python scripts/convert_to_hf.py --megatron-checkpoint checkpoints/iter_100000 --output-dir hf-model
sbatch slurm/sft.sbatch
# 5. Run RL
torchrun --nproc_per_node=4 scripts/rl_grpo_train.py --model-path sft-model --output-dir rl-model
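Between the data curation steps it can help to confirm that each JSONL file is well formed and to see how many records filtering removed. The helper below is a hypothetical addition, not part of the repository. It assumes only that each non-blank line is one JSON value and makes no assumption about field names.

```python
import json

def count_jsonl_records(path):
    """Return (valid, malformed) line counts for a JSONL file."""
    valid = malformed = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # ignore blank lines
            try:
                json.loads(line)
                valid += 1
            except json.JSONDecodeError:
                malformed += 1
    return valid, malformed

def report(paths):
    """Print one line per file that exists; missing files are reported, not raised."""
    for path in paths:
        try:
            valid, malformed = count_jsonl_records(path)
        except FileNotFoundError:
            print(f"{path}: not found")
            continue
        print(f"{path}: {valid} valid, {malformed} malformed")

if __name__ == "__main__":
    # Paths taken from the Quick Start commands.
    report(["data/internal.jsonl", "data/synthetic.jsonl", "data/filtered.jsonl"])
```

Comparing the combined count of `data/internal.jsonl` and `data/synthetic.jsonl` against `data/filtered.jsonl` gives a quick view of how aggressive the filter and dedup step was before tokenization.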
Key References
- GPT-OSS-120B / GPT-OSS-20B: Target architecture
- DeepSeek-V2-Lite: Alternative architecture
- Megatron-LM: Training framework
- FineWeb-Edu: General pre-training data
- The Stack v2: Code pre-training data