MoE LLM Training Pipeline

Complete training pipeline for building a ~20B total / ~3.6B active Mixture-of-Experts language model from scratch, targeting code generation, reasoning, bash/Slurm operations, and internal library knowledge.

Architecture

Two architecture options based on production MoE models:

Option           Based On          Total Params  Active/Token  Experts    Attention
A (Recommended)  GPT-OSS-20B       21B           3.6B          32, top-4  GQA + sliding/full
B                DeepSeek-V2-Lite  15.7B         2.4B          64, top-6  MLA + shared experts
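
The "top-4" and "top-6" entries in the Experts column refer to how many experts a router activates per token. A minimal sketch of that selection step is shown here in plain Python. It assumes softmax normalization over the selected logits only; the function name `top_k_route` is hypothetical, and real implementations differ in where they normalize and how they balance load.

```python
import math

def top_k_route(logits, k):
    """Select the k highest-scoring experts and normalize their weights.

    logits: one router score per expert for a single token.
    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    if not 0 < k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest router logits, highest first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (subtract the max for stability).
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Example: 8 experts, 2 active per token.
routes = top_k_route([0.3, 1.7, -0.2, 0.9, 2.4, -1.1, 0.0, 0.5], k=2)
print(routes)
```

Only the selected experts run their feed-forward block for that token, which is why the active parameter count is much smaller than the total.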

Hardware Requirements

  • Training: 4 nodes × 4 H100 GPUs (16 GPUs, 1.28TB VRAM)
  • Inference: 1 node × 4 H100 GPUs (320GB VRAM, model fits in BF16)
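
The VRAM totals above follow from simple arithmetic, sketched below. The 80 GB per GPU figure is an assumption inferred from 16 GPUs totaling 1.28TB, and the 16 bytes per parameter used for training state is a commonly cited rule of thumb for mixed-precision Adam, not a measurement from this pipeline. Activations, KV cache, and communication buffers are not counted.

```python
GPU_VRAM_GB = 80  # assumption: 80 GB H100 variant, implied by 16 GPUs = 1.28TB

def cluster_vram_gb(nodes, gpus_per_node, vram_gb=GPU_VRAM_GB):
    """Total VRAM across a cluster, in GB."""
    return nodes * gpus_per_node * vram_gb

def state_gb(params_billion, bytes_per_param):
    """Memory for model state: billions of parameters times bytes per parameter gives GB."""
    return params_billion * bytes_per_param

print(cluster_vram_gb(4, 4))  # training cluster: 1280 GB
print(cluster_vram_gb(1, 4))  # inference node: 320 GB
print(state_gb(21, 2))        # BF16 weights only for a 21B model: 42 GB
print(state_gb(21, 16))       # rough mixed-precision Adam training state: 336 GB
```

Under these assumptions the BF16 weights take a small fraction of the inference node, while training state alone is a substantial share of the 16-GPU budget before activations are considered.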

Training Pipeline

Phase 0: Data Preparation (2-4 weeks)
  └─ Crawl internal repos → Generate synthetic data → Filter/dedup → Tokenize

Phase 1: Pre-Training with Megatron-LM (4-8 weeks)
  ├─ 1a: General pre-training (4K context, all data)
  ├─ 1b: Code-heavy annealing (increased code ratio)
  └─ 1c: Long-context extension (→ 131K via YaRN)

Phase 2: SFT with TRL (2-3 days)
  └─ Instruction-following on code, Slurm, reasoning tasks

Phase 3: GRPO RL (3-5 days)
  └─ Code execution rewards + math verification + Slurm validation

Phase 4: Evaluation
  └─ HumanEval, MBPP, GSM8K, MATH, MMLU + custom benchmarks
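
To make Phase 3 more concrete: GRPO samples a group of completions per prompt, scores each with a reward (here, code execution, math verification, or Slurm validation), and normalizes rewards within the group instead of training a separate value model. The sketch below shows only that normalization step. The function name and the use of the population standard deviation are assumptions for illustration; a library trainer such as TRL's `GRPOTrainer` computes this internally and may differ in detail.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's completions to zero mean, unit scale.

    rewards: one scalar reward per sampled completion of the same prompt.
    eps guards against division by zero when every completion scores the same.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions for one prompt: two pass the unit tests, two fail.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every completion in a group receives the same reward, all advantages are zero and that prompt contributes no gradient signal, which is one reason reward functions with some spread across a group are preferable.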

Repository Structure

├── configs/                         # Model architecture configs
│   ├── model_gptoss_20b.yaml        # GPT-OSS-20B style (recommended)
│   └── model_deepseek_v2lite.yaml   # DeepSeek-V2-Lite style (alternative)
├── scripts/                         # Training scripts
│   ├── pretrain_megatron.sh         # Megatron-LM pre-training launcher
│   ├── convert_to_hf.py             # Megatron → HuggingFace conversion
│   ├── sft_train.py                 # SFT with TRL
│   ├── rl_grpo_train.py             # GRPO reinforcement learning
│   └── setup_environment.sh         # Environment setup
├── data_curation/                   # Data preparation pipeline
│   ├── crawl_internal_repos.py      # Extract code from your repos
│   ├── generate_synthetic_data.py   # LLM-powered data generation
│   ├── filter_and_dedup.py          # Quality filtering + MinHash dedup
│   ├── prepare_slurm_data.py        # Slurm/bash data curation
│   └── tokenize_for_megatron.py     # Convert to Megatron binary format
├── slurm/                           # Slurm job scripts
│   ├── pretrain.sbatch              # Pre-training job (4 nodes)
│   ├── sft.sbatch                   # SFT job (1 node)
│   └── eval.sbatch                  # Evaluation job (1 node)
└── docs/                            # Documentation
    ├── MASTER_PLAN.md               # Complete training plan
    ├── ARCHITECTURE_GUIDE.md        # MoE architecture deep-dive
    └── DATA_GUIDE.md                # Dataset and data curation guide
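
The tree lists MinHash dedup as part of `filter_and_dedup.py`. As a rough illustration of the idea, and not the contents of that script, the sketch below estimates Jaccard similarity between two documents from MinHash signatures using only the standard library. All function names, the 5-token shingle size, and the 64-hash signature length are assumptions chosen for the example.

```python
import hashlib

def shingles(text, n=5):
    """Set of overlapping n-token windows; short texts yield one shingle."""
    tokens = text.split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def minhash_signature(text, num_perm=64, n=5):
    """For each of num_perm salted hash functions, keep the minimum hash over all shingles."""
    items = [s.encode("utf-8") for s in shingles(text, n)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a distinct salt acts as a distinct hash function
        signature.append(min(
            int.from_bytes(hashlib.blake2b(item, digest_size=8, salt=salt).digest(), "big")
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

doc_a = "def add(a, b): return a + b  # simple helper used in many internal repos"
doc_b = "def add(a, b): return a + b  # simple helper used in many internal projects"
print(estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))
```

A dedup pass would typically drop one document of any pair whose estimated similarity exceeds a chosen threshold. At corpus scale the pairwise comparison is usually replaced with locality-sensitive hashing over signature bands.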

Quick Start

# 1. Set up environments
bash scripts/setup_environment.sh

# 2. Curate your internal data
python data_curation/crawl_internal_repos.py --repos /path/to/your/repos --output data/internal.jsonl
python data_curation/generate_synthetic_data.py --input data/internal.jsonl --output data/synthetic.jsonl
python data_curation/filter_and_dedup.py --input data/internal.jsonl data/synthetic.jsonl --output data/filtered.jsonl
python data_curation/tokenize_for_megatron.py --input data/filtered.jsonl --output-prefix data/tokenized/internal

# 3. Launch pre-training
sbatch slurm/pretrain.sbatch

# 4. Convert checkpoint and run SFT
python scripts/convert_to_hf.py --megatron-checkpoint checkpoints/iter_100000 --output-dir hf-model
sbatch slurm/sft.sbatch

# 5. Run RL
torchrun --nproc_per_node=4 scripts/rl_grpo_train.py --model-path sft-model --output-dir rl-model
