LLaMA-MoE-v2-3.8B (1+1/7) SFT

[πŸ’» Code] | [πŸ“ƒ Technical Report]

LLaMA-MoE-v2 is a series of open-source Mixture-of-Experts (MoE) models based on LLaMA3. We build LLaMA-MoE-v2 in the following two steps:

  1. Partition LLaMA's FFN layers or attention layers into sparse experts and insert a top-K gate in front of each set of experts (a minimal sketch follows this list).
  2. Supervised fine-tune the constructed MoE models on open-source data with two-stage training.
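For intuition, here is a minimal sketch of step 1 in PyTorch: one dense FFN is sliced into equally sized experts, and a learned top-K gate routes each token to a weighted subset of them. The class and parameter names (`TopKMoEFFN`, `num_experts`, `top_k`) are hypothetical illustrations, not the repository's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoEFFN(nn.Module):
    """Hypothetical sketch: a dense FFN split into `num_experts` slices
    with a learned top-k router (not the repository's actual code)."""

    def __init__(self, hidden_size, intermediate_size, num_experts=8, top_k=2):
        super().__init__()
        # Each expert receives a 1/num_experts slice of the original FFN width,
        # so activating only top_k experts keeps the activated parameter count low.
        expert_dim = intermediate_size // num_experts
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, expert_dim, bias=False),
                nn.SiLU(),
                nn.Linear(expert_dim, hidden_size, bias=False),
            )
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):
        # x: (num_tokens, hidden_size), i.e. tokens already flattened.
        scores = F.softmax(self.gate(x), dim=-1)
        weights, expert_idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```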
| Model | #Activated Experts | #Experts | #Activated Params | SFT Model |
| :-- | :-: | :-: | :-: | :-: |
| LLaMA-MLP-MoE (2/8) | 2 | 8 | 3.8B | πŸ€— SFT |
| LLaMA-MLP-MoE (1+1/7) | 2 | 8 | 3.8B | πŸ€— SFT |
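The (1+1/7) row denotes the residual variant: one shared expert is always active and a top-1 gate selects one of the remaining 7 routed experts, so 2 experts are activated per token, matching the (2/8) top-2 variant. A hedged sketch of that forward pass, reusing the hypothetical `TopKMoEFFN` from above:

```python
import torch.nn as nn

class ResidualMoEFFN(nn.Module):
    """Hypothetical sketch of the (1+1/7) layout: one always-on shared
    expert plus one top-1 routed expert (2 activated experts per token)."""

    def __init__(self, shared_expert, routed_moe):
        super().__init__()
        self.shared_expert = shared_expert  # always active
        self.routed_moe = routed_moe        # e.g. TopKMoEFFN(..., num_experts=7, top_k=1)

    def forward(self, x):
        # Residual design: shared-expert output plus the routed-expert output.
        return self.shared_expert(x) + self.routed_moe(x)
```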

πŸš€ QuickStart

```python
# python>=3.10
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT checkpoint; trust_remote_code is required for the custom MoE modules.
model_dir = "llama-moe/LLaMA-MoE-v2-3_8B-residual-sft"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()
model.cuda()

# Wrap the prompt in the Llama-3 chat format expected by the SFT model.
input_text = "Could you recommend me some mystery novels?"
input_text = f"<|start_header_id|>user<|end_header_id|>\n\n{input_text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
inputs = tokenizer(input_text, return_tensors="pt")
input_ids = inputs["input_ids"].cuda()

pred = model.generate(input_ids, max_length=200, temperature=1.0, do_sample=True, use_cache=True)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
"""
I'd be delighted to recommend some mystery novels to you! Here are a few suggestions across various sub-genres:

**Classic Whodunit**

1. "And Then There Were None" by Agatha Christie - A timeless tale of ten strangers who are invited to an isolated island, only to be killed off one by one.
2. "The Murder on the Orient Express" by Agatha Christie - A classic whodunit set on a luxurious train traveling from Istanbul to Paris, where a famous author goes missing.
3. "The Devil in the White City" by Erik Larson - A non-fiction book that combines historical events with a mystery, exploring the 1893 World's Columbian Exposition in Chicago and the serial killer H.H. Holmes.

**Modern Whodunits**

1. "Gone Girl" by Gillian Flynn - A twisty, psychological thriller about a couple whose seemingly perfect ...
"""
```

πŸ“Š Performance

| Model | #Training Tokens | MMLU (5) | GSM8k (8) | HumanEval (pass@10) | IFEval | BoolQ (32) | SciQ | PIQA | ARC-c (25) | TruthfulQA | HellaSwag (10) |
| :-- | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| LLaMA3-8B | 15T | 67.2 | 76.5 | 71.4 | 76.5 | 83.0 | 93.2 | 78.5 | 61.9 | 51.7 | 78.8 |
| INCITE-3B | 1T | 25.1 | 2.1 | 6.92 | 30.1 | 66.5 | 94.7 | 74.4 | 40.2 | 36.4 | 65.6 |
| Sheared-LLaMA-2.7B | 50B | 28.2 | 1.9 | 3.2 | 28.8 | 67.6 | 75.8 | 41.1 | 47.6 | 71.2 | 39.0 |
| Gemma-2-2b | 2T | 53.0 | 26.3 | 46.1 | 34.9 | 72.3 | 75.8 | 67.5 | 52.6 | 50.8 | 69.0 |
| Salamandra-2b | 7.8T | 25.1 | 1.90 | 5.82 | 27.7 | 68.0 | 89.8 | 74.7 | 46.3 | 43.4 | 62.3 |
| SmolLM2-1.7B | 11T | 50.4 | 38.5 | 39.1 | 29.0 | 68.2 | 84.3 | 76.0 | 53.2 | 39.9 | 72.6 |
| OpenMoE-3B-9B | 1T | 26.5 | 1.36 | 1.01 | 31.2 | 61.7 | 68.4 | 65.7 | 33.3 | 40.5 | 56.5 |
| LLaMA-MoE-3B-7B | 200B | 28.2 | 4.62 | 12.0 | 28.1 | 68.1 | 88.8 | 77.9 | 44.0 | 33.3 | 73.2 |
| OLMoE-1B-7B | 1T | 53.8 | 40.9 | 40.5 | 35.5 | 80.9 | 94.9 | 80.1 | 55.6 | 43.3 | 79.6 |
| MLP-MoE (8top2) | 7B | 40.6 | 53.1 | 53.5 | 32.7 | 74.6 | 90.6 | 69.3 | 42.8 | 45.6 | 59.0 |
| MLP-MoE (8top2) | 8.4B | 41.0 | 59.6 | 57.1 | 31.7 | 74.5 | 90.2 | 69.5 | 43.3 | 46.9 | 58.1 |
| MLP-MoE (1+7top1) | 7B | 42.7 | 55.0 | 51.2 | 36.0 | 76.9 | 88.8 | 67.9 | 40.2 | 46.9 | 53.7 |

Parenthesized numbers in the headers are the few-shot example counts used per benchmark.
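Few-shot settings like these can plausibly be reproduced with EleutherAI's lm-evaluation-harness (`pip install lm-eval`); the sketch below is an assumption, not necessarily the harness or settings behind the reported numbers:

```python
# Hypothetical reproduction sketch using EleutherAI's lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=llama-moe/LLaMA-MoE-v2-3_8B-residual-sft,trust_remote_code=True,dtype=bfloat16",
    tasks=["mmlu"],
    num_fewshot=5,  # matches the "(5)" in the MMLU column header
)
print(results["results"]["mmlu"])
```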

πŸ“ƒ Citation

```bibtex
@misc{llama-moe-v2,
  title={LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training},
  author={Xiaoye Qu and Daize Dong and Xuyang Hu and Tong Zhu and Weigao Sun and Yu Cheng},
  year={2024},
  month={Nov},
  url={https://arxiv.org/abs/2411.15708}
}
```