
Storm-7B

Introduction

We released Storm-7B, the first open-source language model comparable to the GPT-4 series on the AlpacaEval 2.0 leaderboard, ranking 3rd in length-controlled win rate.

The recipe for this model is simple: 1) fine-tuning from Openchat-3.5-0106, and 2) applying iterative DPO training, a variant of DPO in which the language model iteratively learns from the preferences of the trained reward model (sketched below). We will release our technical report and code as soon as possible.
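
To make the recipe concrete, below is a minimal, illustrative PyTorch sketch of the DPO preference loss minimized in each iteration. It is not our released training code: the function name, the default beta, and the dummy tensors are placeholders, and in practice the log-probabilities come from the policy, a frozen reference policy, and response pairs ranked by the trained reward model.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Standard DPO objective: push the policy to prefer the chosen response
    # over the rejected one, measured relative to the reference policy.
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(logits).mean()

# Each iteration of iterative DPO (conceptually):
#   1. Sample several responses per prompt from the current policy.
#   2. Score them with the trained reward model; best/worst form a (chosen, rejected) pair.
#   3. Minimize dpo_loss on these pairs, then repeat with the updated policy.
if __name__ == "__main__":
    # Dummy sequence-level log-probabilities for a batch of two preference pairs.
    loss = dpo_loss(
        policy_chosen_logp=torch.tensor([-10.0, -12.0]),
        policy_rejected_logp=torch.tensor([-11.0, -13.5]),
        ref_chosen_logp=torch.tensor([-10.5, -12.5]),
        ref_rejected_logp=torch.tensor([-10.8, -13.0]),
    )
    print("Example DPO loss:", loss.item())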

A snapshot of the AlpacaEval 2.0 leaderboard (2024/4/28) is listed below:

Model                      LC Win Rate   Win Rate
GPT-4 Turbo (04/09)        55.0%         46.1%
GPT-4 Preview (11/06)      50.0%         50.0%
Storm-7B                   48.9%         52.5%
Nanbeige Plus Chat v0.1    44.5%         56.7%
Qwen1.5 110B Chat          43.9%         33.8%
Aligner 2B+Claude 3 Opus   41.8%         34.5%
Claude 3 Opus (02/29)      40.5%         29.1%
GPT-4                      38.1%         23.6%
openchat-3.5-0106          15.4%         10.1%

Please refer to the leaderboard webpage for up-to-date results.

We also conducted preliminary evaluations on other benchmarks and observed no significant degradation relative to the base model, openchat-3.5-0106.

Model               ARC     HellaSwag   MMLU    TruthfulQA   Winogrande   Avg.
Storm-7B            67.58   80.97       62.21   57.24        80.51        69.70
openchat-3.5-0106   66.38   83.00       63.47   52.55        81.06        69.29
internlm2-7b        58.02   81.24       65.24   48.73        83.82        67.41
gemma-7B            61.09   82.20       64.56   44.79        79.01        66.33
Yi-9B               61.18   78.82       70.06   42.45        77.51        66.00
Meta-Llama-3-8B     59.47   82.09       66.69   43.90        77.35        65.90
Mistral-7B-v0.1     59.98   83.31       64.16   42.15        78.37        65.59
Qwen-7b             51.37   78.47       59.84   47.79        72.69        62.03

Uses

Our model uses the same chat template as Openchat-3.5-0106. A sample code snippet for inference using our model is provided below.

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Load the model and tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("jieliu/Storm-7B").to(device)
tokenizer = AutoTokenizer.from_pretrained("jieliu/Storm-7B")
model.eval().requires_grad_(False)  # inference only: disable dropout and gradients

def generate_response(prompt):
    # Tokenize the formatted prompt and sample a completion.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    outputs = model.generate(
        input_ids,
        max_length=2048,
        do_sample=True,
        temperature=1.0,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    response_ids = outputs[0]
    response_text = tokenizer.decode(response_ids, skip_special_tokens=True)
    return response_text

# Wrap the user message in the Openchat-3.5-0106 ("GPT4 Correct") chat template.
prompt = "How does a telescope work?"
input_prompt = f"GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant:"
response_text = generate_response(input_prompt)
print("Response:", response_text)

Scripts

You can reproduce our results on AlpacaEval 2.0 using the script provided below.

git clone https://github.com/tatsu-lab/alpaca_eval.git
cd alpaca_eval
pip install -e .
export OPENAI_API_KEY=<your_api_key>
alpaca_eval evaluate_from_model --model_configs 'Storm-7B'

Limitations

Storm-7B is a quick demonstration that a language model, fine-tuned with AI feedback, can easily surpass or match state-of-the-art models, as assessed by the same AI feedback. However, this improvement on the automatic leaderboard may not necessarily indicate better alignment with human intentions. Our model therefore represents a critical, preliminary reevaluation of the RLAIF paradigm, questioning how much learning from and being evaluated by AI feedback aligns with actual human preferences.

Citation

@misc{liu2024storm,
    title = {Storm-7B},
    url = {},
    author = {Jie Liu and Zhanhui Zhou and Chao Yang and Han-Sen Zhong and Wanli Ouyang},
    month = {April},
    year = {2024}
}