MoE LLM Training Pipeline

Complete training pipeline for building a ~20B total / ~3.6B active Mixture-of-Experts language model from scratch, targeting code generation, reasoning, bash/Slurm operations, and internal library knowledge.

Architecture

Two architecture options based on production MoE models:

Option            Based On          Total Params  Active Params/Token  Experts     Attention
A (Recommended)   GPT-OSS-20B       21B           3.6B                 32, top-4   GQA + sliding/full
B                 DeepSeek-V2-Lite  15.7B         2.4B                 64, top-6   MLA + shared experts
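
The "32, top-4" and "64, top-6" entries mean that a router scores every expert for each token and only the k best-scoring experts run. A minimal sketch of that selection in plain Python, assuming the softmax is taken over the selected logits only; the function name and the toy logits are illustrative, and production routers differ in where they apply the softmax and add load-balancing terms that are not shown here:

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their logits."""
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (subtract the max for stability).
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Option A shape: 32 experts, 4 active per token (toy logits, all distinct).
logits = [0.1 * ((7 * i) % 32) for i in range(32)]
routes = top_k_route(logits, k=4)
for expert_id, weight in routes:
    print(f"expert {expert_id}: weight {weight:.3f}")
```

Each token's output is then the weighted sum of the selected experts' outputs, which is why only a fraction of the total parameters is active per token.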

Hardware Requirements

Training: 4 nodes × 4 H100 GPUs (16 GPUs, 1.28 TB total VRAM)
Inference: 1 node × 4 H100 GPUs (320 GB total VRAM; the model weights fit in BF16)
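
The VRAM totals above can be reproduced with a few lines of arithmetic. This sketch assumes the 80 GB H100 variant (which matches the 1.28 TB and 320 GB figures) and counts weights only; KV cache and activations at inference, and gradients and optimizer state during training, come on top:

```python
GPU_MEMORY_GB = 80          # assumption: 80 GB H100 variant
BYTES_PER_BF16_PARAM = 2    # BF16 stores each parameter in 2 bytes

def cluster_vram_gb(nodes, gpus_per_node, gpu_memory_gb=GPU_MEMORY_GB):
    """Total GPU memory across the cluster, in GB."""
    return nodes * gpus_per_node * gpu_memory_gb

def bf16_weights_gb(total_params):
    """Memory needed for the weights alone in BF16, in GB."""
    return total_params * BYTES_PER_BF16_PARAM / 1e9

training_vram = cluster_vram_gb(nodes=4, gpus_per_node=4)   # 1280 GB = 1.28 TB
inference_vram = cluster_vram_gb(nodes=1, gpus_per_node=4)  # 320 GB
weights_a = bf16_weights_gb(21e9)                           # Option A: 42 GB
weights_b = bf16_weights_gb(15.7e9)                         # Option B: about 31.4 GB

print(training_vram, inference_vram, weights_a, weights_b)
```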

Training Pipeline

Phase 0: Data Preparation (2-4 weeks)
  └─ Crawl internal repos → Generate synthetic data → Filter/dedup → Tokenize

Phase 1: Pre-Training with Megatron-LM (4-8 weeks)
  ├─ 1a: General pre-training (4K context, all data)
  ├─ 1b: Code-heavy annealing (increased code ratio)
  └─ 1c: Long-context extension (→ 131K via YaRN)

Phase 2: SFT with TRL (2-3 days)
  └─ Instruction-following on code, Slurm, reasoning tasks

Phase 3: GRPO RL (3-5 days)
  └─ Code execution rewards + math verification + Slurm validation

Phase 4: Evaluation
  └─ HumanEval, MBPP, GSM8K, MATH, MMLU + custom benchmarks
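
Phase 3's GRPO step samples several completions per prompt and scores each one relative to the rest of its group. A minimal sketch of that group-relative advantage, assuming binary pass/fail rewards and a population standard deviation; the function name and reward values are illustrative, and scripts/rl_grpo_train.py may compute this differently:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for 4 completions of one prompt:
# 1.0 if the generated code passed its tests, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that beat the group average get a positive advantage and are reinforced; a group where every completion scores the same contributes no learning signal.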

Repository Structure

├── configs/                          # Model architecture configs
│   ├── model_gptoss_20b.yaml        # GPT-OSS-20B style (recommended)
│   └── model_deepseek_v2lite.yaml   # DeepSeek-V2-Lite style (alternative)
├── scripts/                          # Training scripts
│   ├── pretrain_megatron.sh         # Megatron-LM pre-training launcher
│   ├── convert_to_hf.py            # Megatron → HuggingFace conversion
│   ├── sft_train.py                # SFT with TRL
│   ├── rl_grpo_train.py            # GRPO reinforcement learning
│   └── setup_environment.sh        # Environment setup
├── data_curation/                    # Data preparation pipeline
│   ├── crawl_internal_repos.py      # Extract code from your repos
│   ├── generate_synthetic_data.py   # LLM-powered data generation
│   ├── filter_and_dedup.py         # Quality filtering + MinHash dedup
│   ├── prepare_slurm_data.py       # Slurm/bash data curation
│   └── tokenize_for_megatron.py    # Convert to Megatron binary format
├── slurm/                            # Slurm job scripts
│   ├── pretrain.sbatch              # Pre-training job (4 nodes)
│   ├── sft.sbatch                   # SFT job (1 node)
│   └── eval.sbatch                  # Evaluation job (1 node)
└── docs/                             # Documentation
    ├── MASTER_PLAN.md               # Complete training plan
    ├── ARCHITECTURE_GUIDE.md        # MoE architecture deep-dive
    └── DATA_GUIDE.md               # Dataset and data curation guide
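
MinHash deduplication, listed for filter_and_dedup.py above, estimates the Jaccard similarity of two documents from short signatures instead of comparing full texts. A minimal standard-library sketch of the idea; the function names, shingle size, and number of hash functions are assumptions chosen for illustration and are not taken from the repository's script:

```python
import hashlib

def shingles(text, n=5):
    """Set of overlapping n-word shingles (the whole text if it is shorter)."""
    tokens = text.split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def minhash_signature(text, num_perm=64, n=5):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    items = shingles(text, n)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big")
            for s in items))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "srun --nodes 4 --gpus-per-node 4 python train.py --config configs/model.yaml"
doc_b = "the quick brown fox jumps over the lazy dog near the quiet river bank"
same = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_a))
different = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
print(same, different)
```

Pairs whose estimated similarity exceeds a chosen threshold are treated as near-duplicates; at corpus scale the signatures are usually bucketed with locality-sensitive hashing rather than compared pairwise.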

Quick Start

# 1. Set up environments
bash scripts/setup_environment.sh

# 2. Curate your internal data
python data_curation/crawl_internal_repos.py --repos /path/to/your/repos --output data/internal.jsonl
python data_curation/generate_synthetic_data.py --input data/internal.jsonl --output data/synthetic.jsonl
python data_curation/filter_and_dedup.py --input data/internal.jsonl data/synthetic.jsonl --output data/filtered.jsonl
python data_curation/tokenize_for_megatron.py --input data/filtered.jsonl --output-prefix data/tokenized/internal

# 3. Launch pre-training
sbatch slurm/pretrain.sbatch

# 4. Convert checkpoint and run SFT
python scripts/convert_to_hf.py --megatron-checkpoint checkpoints/iter_100000 --output-dir hf-model
sbatch slurm/sft.sbatch

# 5. Run RL
torchrun --nproc_per_node=4 scripts/rl_grpo_train.py --model-path sft-model --output-dir rl-model

Key References

GPT-OSS-120B / GPT-OSS-20B — Target architecture
DeepSeek-V2-Lite — Alternative architecture
Megatron-LM — Training framework
FineWeb-Edu — General pre-training data
The Stack v2 — Code pre-training data
