LLaMA-MoE-v2-3.8B (1+1/7) SFT

[πŸ’» Code] | [πŸ“ƒ Technical Report]

LLaMA-MoE-v2 is a series of open-source Mixture-of-Experts (MoE) models based on LLaMA3. We build LLaMA-MoE-v2 in the following two steps:

  1. Partition LLaMA's FFN layers or attention layers into sparse experts and insert a top-K gate in front of each set of experts (a minimal sketch follows this list).
  2. Supervised fine-tune the constructed MoE models on open-source data with two-stage training.
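For intuition, here is a minimal sketch of step 1 in PyTorch: one dense FFN is sliced into equally sized experts, and a learned top-K gate routes each token to a weighted subset of them. The class and parameter names (`TopKMoEFFN`, `num_experts`, `top_k`) are hypothetical illustrations, not the repository's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoEFFN(nn.Module):
    """Hypothetical sketch: a dense FFN split into `num_experts` slices
    with a learned top-k router (not the repository's actual code)."""

    def __init__(self, hidden_size, intermediate_size, num_experts=8, top_k=2):
        super().__init__()
        # Each expert receives a 1/num_experts slice of the original FFN width,
        # so activating only top_k experts keeps the activated parameter count low.
        expert_dim = intermediate_size // num_experts
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, expert_dim, bias=False),
                nn.SiLU(),
                nn.Linear(expert_dim, hidden_size, bias=False),
            )
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):
        # x: (num_tokens, hidden_size), i.e. tokens already flattened.
        scores = F.softmax(self.gate(x), dim=-1)
        weights, expert_idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```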
| Model | #Activated Experts | #Experts | #Activated Params | SFT Model |
| :-- | :-: | :-: | :-: | :-: |
| LLaMA-MLP-MoE (2/8) | 2 | 8 | 3.8B | πŸ€— SFT |
| LLaMA-MLP-MoE (1+1/7) | 2 | 8 | 3.8B | πŸ€— SFT |
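The (1+1/7) row denotes the residual variant: one shared expert is always active and a top-1 gate selects one of the remaining 7 routed experts, so 2 experts are activated per token, matching the (2/8) top-2 variant. A hedged sketch of that forward pass, reusing the hypothetical `TopKMoEFFN` from above:

```python
import torch.nn as nn

class ResidualMoEFFN(nn.Module):
    """Hypothetical sketch of the (1+1/7) layout: one always-on shared
    expert plus one top-1 routed expert (2 activated experts per token)."""

    def __init__(self, shared_expert, routed_moe):
        super().__init__()
        self.shared_expert = shared_expert  # always active
        self.routed_moe = routed_moe        # e.g. TopKMoEFFN(..., num_experts=7, top_k=1)

    def forward(self, x):
        # Residual design: shared-expert output plus the routed-expert output.
        return self.shared_expert(x) + self.routed_moe(x)
```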

πŸš€ QuickStart

```python
# python>=3.10
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT checkpoint; trust_remote_code is required for the custom MoE modules.
model_dir = "llama-moe/LLaMA-MoE-v2-3_8B-residual-sft"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()
model.cuda()

# Wrap the prompt in the Llama-3 chat format expected by the SFT model.
input_text = "Could you recommend me some mystery novels?"
input_text = f"<|start_header_id|>user<|end_header_id|>\n\n{input_text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
inputs = tokenizer(input_text, return_tensors="pt")
input_ids = inputs["input_ids"].cuda()

pred = model.generate(input_ids, max_length=200, temperature=1.0, do_sample=True, use_cache=True)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
"""
I'd be delighted to recommend some mystery novels to you! Here are a few suggestions across various sub-genres:

**Classic Whodunit**

1. "And Then There Were None" by Agatha Christie - A timeless tale of ten strangers who are invited to an isolated island, only to be killed off one by one.
2. "The Murder on the Orient Express" by Agatha Christie - A classic whodunit set on a luxurious train traveling from Istanbul to Paris, where a famous author goes missing.
3. "The Devil in the White City" by Erik Larson - A non-fiction book that combines historical events with a mystery, exploring the 1893 World's Columbian Exposition in Chicago and the serial killer H.H. Holmes.

**Modern Whodunits**

1. "Gone Girl" by Gillian Flynn - A twisty, psychological thriller about a couple whose seemingly perfect ...
"""
```

πŸ“Š Performance

| Model | #Training Tokens | MMLU (5) | GSM8k (8) | HumanEval (pass@10) | IFEval | BoolQ (32) | SciQ | PIQA | ARC-c (25) | TruthfulQA | HellaSwag (10) |
| :-- | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| LLaMA3-8B | 15T | 67.2 | 76.5 | 71.4 | 76.5 | 83.0 | 93.2 | 78.5 | 61.9 | 51.7 | 78.8 |
| INCITE-3B | 1T | 25.1 | 2.1 | 6.92 | 30.1 | 66.5 | 94.7 | 74.4 | 40.2 | 36.4 | 65.6 |
| Sheared-LLaMA-2.7B | 50B | 28.2 | 1.9 | 3.2 | 28.8 | 67.6 | 75.8 | 41.1 | 47.6 | 71.2 | 39.0 |
| Gemma-2-2b | 2T | 53.0 | 26.3 | 46.1 | 34.9 | 72.3 | 75.8 | 67.5 | 52.6 | 50.8 | 69.0 |
| Salamandra-2b | 7.8T | 25.1 | 1.90 | 5.82 | 27.7 | 68.0 | 89.8 | 74.7 | 46.3 | 43.4 | 62.3 |
| SmolLM2-1.7B | 11T | 50.4 | 38.5 | 39.1 | 29.0 | 68.2 | 84.3 | 76.0 | 53.2 | 39.9 | 72.6 |
| OpenMoE-3B-9B | 1T | 26.5 | 1.36 | 1.01 | 31.2 | 61.7 | 68.4 | 65.7 | 33.3 | 40.5 | 56.5 |
| LLaMA-MoE-3B-7B | 200B | 28.2 | 4.62 | 12.0 | 28.1 | 68.1 | 88.8 | 77.9 | 44.0 | 33.3 | 73.2 |
| OLMoE-1B-7B | 1T | 53.8 | 40.9 | 40.5 | 35.5 | 80.9 | 94.9 | 80.1 | 55.6 | 43.3 | 79.6 |
| MLP-MoE (8top2) | 7B | 40.6 | 53.1 | 53.5 | 32.7 | 74.6 | 90.6 | 69.3 | 42.8 | 45.6 | 59.0 |
| MLP-MoE (8top2) | 8.4B | 41.0 | 59.6 | 57.1 | 31.7 | 74.5 | 90.2 | 69.5 | 43.3 | 46.9 | 58.1 |
| MLP-MoE (1+7top1) | 7B | 42.7 | 55.0 | 51.2 | 36.0 | 76.9 | 88.8 | 67.9 | 40.2 | 46.9 | 53.7 |

Parenthesized numbers in the headers are the few-shot example counts used per benchmark.
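Few-shot settings like these can plausibly be reproduced with EleutherAI's lm-evaluation-harness (`pip install lm-eval`); the sketch below is an assumption, not necessarily the harness or settings behind the reported numbers:

```python
# Hypothetical reproduction sketch using EleutherAI's lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=llama-moe/LLaMA-MoE-v2-3_8B-residual-sft,trust_remote_code=True,dtype=bfloat16",
    tasks=["mmlu"],
    num_fewshot=5,  # matches the "(5)" in the MMLU column header
)
print(results["results"]["mmlu"])
```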

πŸ“ƒ Citation

```bibtex
@misc{llama-moe-v2,
  title={LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training},
  author={Xiaoye Qu and Daize Dong and Xuyang Hu and Tong Zhu and Weigao Sun and Yu Cheng},
  year={2024},
  month={Nov},
  url={https://arxiv.org/abs/2411.15708}
}
```