---
license: apache-2.0
library_name: transformers
tags:
- storm
- mistral
- openchat
- RLAIF
- reward model
---
# Storm-7B
- **Developed by**: [Jie Liu](https://jieliu.site/) \\(^{*1,2}\\), [Zhanhui Zhou](https://scholar.google.com/citations?user=SbACfYQAAAAJ&hl=zh-CN) \\(^{*2}\\), [Jiaheng Liu](https://liujiaheng.github.io/) \\(^{2}\\), [Xingyuan Bu](https://scholar.google.com.hk/citations?user=cqYaRhUAAAAJ&hl=zh-CN) \\(^{2}\\), [Chao Yang](https://scholar.google.com/citations?user=5KRbHPMAAAAJ&hl=zh-CN) \\(^{2}\\), [Han-Sen Zhong](https://scholar.google.com.hk/citations?user=X_ZfX8sAAAAJ&hl=zh-CN) \\(^{\dag 2}\\), [Wanli Ouyang](https://wlouyang.github.io/) \\(^{1,2}\\).
- \\(^{1}\\)MMLab, The Chinese University of Hong Kong   \\(^{2}\\)Shanghai AI Laboratory
## Introduction
We released Storm-7B, the first open-source language model comparable to the GPT-4 series on the [AlpacaEval 2.0](https://tatsu-lab.github.io/alpaca_eval/) leaderboard.
Recent studies show that DPO benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO: improved response quality can lead to increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO) to penalize response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity.
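For intuition, here is a minimal sketch of what a length-regularized DPO objective can look like. This is illustrative only: the exact regularizer and hyperparameters used to train Storm-7B are described in our paper, and the `alpha`/`beta` names below are placeholders for the sketch, not the values we used.

```python
import torch.nn.functional as F

def ilr_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps,
                 chosen_lengths, rejected_lengths,
                 beta=0.1, alpha=0.01):
    """Sketch of a length-regularized DPO loss (illustrative, not the exact training code).

    The standard DPO margin (difference of policy/reference log-ratios) is
    offset by the length difference between the chosen and rejected
    responses, so the policy cannot win simply by generating longer answers.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # log pi(y_w|x) / pi_ref(y_w|x)
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # log pi(y_l|x) / pi_ref(y_l|x)
    margin = beta * (chosen_ratio - rejected_ratio)
    length_penalty = alpha * (chosen_lengths - rejected_lengths)
    return -F.logsigmoid(margin - length_penalty).mean()
```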
A snapshot of the AlpacaEval 2.0 leaderboard (Single Model, 2024/6/18) is listed below:
| **Model** | **LC Win Rate** | **Win Rate** |
| :----------------------: | :-------------: | :----------: |
| GPT-4 Turbo (04/09) | 55.0% | 46.1% |
| GPT-4 Preview (11/06) | 50.0% | 50.0% |
| **Storm-7B** | 48.9% | 52.5% |
| Nanbeige Plus Chat v0.1 | 44.5% | 56.7% |
| Qwen1.5 110B Chat | 43.9% | 33.8% |
| Aligner 2B+Claude 3 Opus | 41.8% | 34.5% |
| Claude 3 Opus (02/29) | 40.5% | 29.1% |
| GPT-4 | 38.1% | 23.6% |
| openchat-3.5-0106 | 15.4% | 10.1% |
Please refer to the [leaderboard webpage](https://tatsu-lab.github.io/alpaca_eval/) for up-to-date results.
We also conducted preliminary evaluations on other benchmarks and observed no significant degradation.
| Model | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | Avg. |
| ----------------- | ----- | --------- | ----- | ---------- | ---------- | ----- |
| **Storm-7B** | 67.58 | 80.97 | 62.21 | 57.24 | 80.51 | 69.70 |
| openchat-3.5-0106 | 66.38 | 83.00 | 63.47 | 52.55 | 81.06 | 69.29 |
| internlm2-7b | 58.02 | 81.24 | 65.24 | 48.73 | 83.82 | 67.41 |
| gemma-7B | 61.09 | 82.20 | 64.56 | 44.79 | 79.01 | 66.33 |
| Yi-9B | 61.18 | 78.82 | 70.06 | 42.45 | 77.51 | 66.00 |
| Meta-Llama-3-8B | 59.47 | 82.09 | 66.69 | 43.90 | 77.35 | 65.90 |
| Mistral-7B-v0.1 | 59.98 | 83.31 | 64.16 | 42.15 | 78.37 | 65.59 |
| Qwen-7b | 51.37 | 78.47 | 59.84 | 47.79 | 72.69 | 62.03 |
## Uses
Our model uses the same chat template as [Openchat-3.5-0106](https://huggingface.co/openchat/openchat-3.5-0106). A sample code snippet for inference using our model is provided below.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Load the model and tokenizer, then switch to inference mode.
model = AutoModelForCausalLM.from_pretrained("jieliu/Storm-7B").to(device)
tokenizer = AutoTokenizer.from_pretrained("jieliu/Storm-7B")
model.eval().requires_grad_(False)


def generate_response(prompt):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    outputs = model.generate(
        input_ids,
        max_length=2048,
        do_sample=True,
        temperature=1.0,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    response_ids = outputs[0]
    response_text = tokenizer.decode(response_ids, skip_special_tokens=True)
    return response_text


# Storm-7B expects the OpenChat "GPT4 Correct" conversation format.
prompt = "How does a telescope work?"
input_prompt = f"GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant:"
response_text = generate_response(input_prompt)
print("Response:", response_text)
```
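If the tokenizer shipped with the checkpoint includes the OpenChat chat template in its config, the same prompt can also be built with `apply_chat_template` instead of formatting the string by hand. This is a convenience sketch and assumes the template is present in the tokenizer config.

```python
# Optional: build the "GPT4 Correct" prompt via the tokenizer's chat template,
# assuming the checkpoint ships one in its tokenizer config.
messages = [{"role": "user", "content": "How does a telescope work?"}]
input_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(generate_response(input_prompt))
```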
## Scripts
You can reproduce our results on AlpacaEval 2.0 using the script provided below.
```bash
git clone https://github.com/tatsu-lab/alpaca_eval.git
cd alpaca_eval
pip install -e .
export OPENAI_API_KEY=<your_api_key>
alpaca_eval evaluate_from_model --model_configs 'Storm-7B'
```
## Limitations
Storm-7B is a quick demonstration that a language model fine-tuned with AI feedback can match or even surpass state-of-the-art models, as assessed by the same AI feedback. However, this improvement on the automatic leaderboard does not necessarily indicate better alignment with human intentions. Our model therefore represents a critical, preliminary reevaluation of the RLAIF paradigm, questioning how well learning from and being evaluated by AI feedback aligns with actual human preferences.
## Citation
```
@misc{liu2024storm,
  title  = {Storm-7B: An Empirical Study of Iterative Direct Preference Optimization},
  url    = {},
  author = {Jie Liu and Zhanhui Zhou and Chao Yang and Han-Sen Zhong and Wanli Ouyang},
  month  = {April},
  year   = {2024}
}
```