
Storm-7B

Introduction

We released Storm-7B, the first open-source language model comparable to the GPT-4 series on the AlpacaEval 2.0 leaderboard, ranking 3rd in length-controlled win rate.

The recipe for this model is simple: 1) fine-tuning from Openchat-3.5-0106, and 2) applying iterative DPO training, a variant of DPO in which the language model iteratively learns from the preferences of the trained reward model (sketched below). We will release our technical report and code as soon as possible.
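
To make the recipe concrete, below is a minimal, illustrative PyTorch sketch of the DPO preference loss minimized in each iteration. It is not our released training code: the function name, the default beta, and the dummy tensors are placeholders, and in practice the log-probabilities come from the policy, a frozen reference policy, and response pairs ranked by the trained reward model.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Standard DPO objective: push the policy to prefer the chosen response
    # over the rejected one, measured relative to the reference policy.
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(logits).mean()

# Each iteration of iterative DPO (conceptually):
#   1. Sample several responses per prompt from the current policy.
#   2. Score them with the trained reward model; best/worst form a (chosen, rejected) pair.
#   3. Minimize dpo_loss on these pairs, then repeat with the updated policy.
if __name__ == "__main__":
    # Dummy sequence-level log-probabilities for a batch of two preference pairs.
    loss = dpo_loss(
        policy_chosen_logp=torch.tensor([-10.0, -12.0]),
        policy_rejected_logp=torch.tensor([-11.0, -13.5]),
        ref_chosen_logp=torch.tensor([-10.5, -12.5]),
        ref_rejected_logp=torch.tensor([-10.8, -13.0]),
    )
    print("Example DPO loss:", loss.item())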

A snapshot of the AlpacaEval 2.0 leaderboard (2024/4/28) is listed below:

Model                      LC Win Rate   Win Rate
GPT-4 Turbo (04/09)        55.0%         46.1%
GPT-4 Preview (11/06)      50.0%         50.0%
Storm-7B                   48.9%         52.5%
Nanbeige Plus Chat v0.1    44.5%         56.7%
Qwen1.5 110B Chat          43.9%         33.8%
Aligner 2B+Claude 3 Opus   41.8%         34.5%
Claude 3 Opus (02/29)      40.5%         29.1%
GPT-4                      38.1%         23.6%
openchat-3.5-0106          15.4%         10.1%

Please refer to the leaderboard webpage for up-to-date results.

We also conducted preliminary evaluations on other benchmarks and observed no significant degradation relative to the base model, openchat-3.5-0106.

Model               ARC     HellaSwag   MMLU    TruthfulQA   Winogrande   Avg.
Storm-7B            67.58   80.97       62.21   57.24        80.51        69.70
openchat-3.5-0106   66.38   83.00       63.47   52.55        81.06        69.29
internlm2-7b        58.02   81.24       65.24   48.73        83.82        67.41
gemma-7B            61.09   82.20       64.56   44.79        79.01        66.33
Yi-9B               61.18   78.82       70.06   42.45        77.51        66.00
Meta-Llama-3-8B     59.47   82.09       66.69   43.90        77.35        65.90
Mistral-7B-v0.1     59.98   83.31       64.16   42.15        78.37        65.59
Qwen-7b             51.37   78.47       59.84   47.79        72.69        62.03

Uses

Our model uses the same chat template as Openchat-3.5-0106. A sample code snippet for inference using our model is provided below.

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Load the model and tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("jieliu/Storm-7B").to(device)
tokenizer = AutoTokenizer.from_pretrained("jieliu/Storm-7B")
model.eval().requires_grad_(False)  # inference only: disable dropout and gradients

def generate_response(prompt):
    # Tokenize the formatted prompt and sample a completion.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    outputs = model.generate(
        input_ids,
        max_length=2048,
        do_sample=True,
        temperature=1.0,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    response_ids = outputs[0]
    response_text = tokenizer.decode(response_ids, skip_special_tokens=True)
    return response_text

# Wrap the user message in the Openchat-3.5-0106 ("GPT4 Correct") chat template.
prompt = "How does a telescope work?"
input_prompt = f"GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant:"
response_text = generate_response(input_prompt)
print("Response:", response_text)

Scripts

You can reproduce our results on AlpacaEval 2.0 using the script provided below.

git clone https://github.com/tatsu-lab/alpaca_eval.git
cd alpaca_eval
pip install -e .
export OPENAI_API_KEY=<your_api_key>
alpaca_eval evaluate_from_model --model_configs 'Storm-7B'

Limitations

Storm-7B is a quick demonstration that a language model, fine-tuned with AI feedback, can easily surpass or match state-of-the-art models, as assessed by the same AI feedback. However, this improvement on the automatic leaderboard may not necessarily indicate better alignment with human intentions. Our model therefore represents a critical, preliminary reevaluation of the RLAIF paradigm, questioning how much learning from and being evaluated by AI feedback aligns with actual human preferences.

Citation

@misc{liu2024storm,
    title = {Storm-7B},
    url = {},
    author = {Jie Liu and Zhanhui Zhou and Chao Yang and Han-Sen Zhong and Wanli Ouyang},
    month = {April},
    year = {2024}
}