---
license: apache-2.0
library_name: transformers
tags:
- storm
- mistral
- openchat
- RLAIF
- reward model
---
# Storm-7B
- **Developed by**: [Jie Liu](https://jieliu.site/) \\(^{*1,2}\\), [Zhanhui Zhou](https://scholar.google.com/citations?user=SbACfYQAAAAJ&hl=zh-CN) \\(^{*2}\\), [Jiaheng Liu](https://liujiaheng.github.io/) \\(^{2}\\), [Xingyuan Bu](https://scholar.google.com.hk/citations?user=cqYaRhUAAAAJ&hl=zh-CN) \\(^{2}\\), [Chao Yang](https://scholar.google.com/citations?user=5KRbHPMAAAAJ&hl=zh-CN) \\(^{2}\\), [Han-Sen Zhong](https://scholar.google.com.hk/citations?user=X_ZfX8sAAAAJ&hl=zh-CN) \\(^{\dag 2}\\), [Wanli Ouyang](https://wlouyang.github.io/) \\(^{1,2}\\).
- \\(^{1}\\)MMLab, The Chinese University of Hong Kong   \\(^{2}\\)Shanghai AI Laboratory
## Introduction
We released Storm-7B, the first open-source language model comparable to the GPT-4 series on the [AlpacaEval 2.0](https://tatsu-lab.github.io/alpaca_eval/) leaderboard.
Recent studies show that DPO benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO: improved response quality can lead to increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO) to penalize response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity.
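For intuition, here is a minimal sketch of what a length-regularized DPO objective can look like. This is illustrative only: the exact regularizer and hyperparameters used to train Storm-7B are described in our paper, and the `alpha`/`beta` names below are placeholders for the sketch, not the values we used.

```python
import torch.nn.functional as F

def ilr_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps,
                 chosen_lengths, rejected_lengths,
                 beta=0.1, alpha=0.01):
    """Sketch of a length-regularized DPO loss (illustrative, not the exact training code).

    The standard DPO margin (difference of policy/reference log-ratios) is
    offset by the length difference between the chosen and rejected
    responses, so the policy cannot win simply by generating longer answers.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # log pi(y_w|x) / pi_ref(y_w|x)
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # log pi(y_l|x) / pi_ref(y_l|x)
    margin = beta * (chosen_ratio - rejected_ratio)
    length_penalty = alpha * (chosen_lengths - rejected_lengths)
    return -F.logsigmoid(margin - length_penalty).mean()
```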
A snapshot of the AlpacaEval 2.0 leaderboard (Single Model, 2024/6/18) is listed below:
| **Model** | **LC Win Rate** | **Win Rate** |
| :----------------------: | :-------------: | :----------: |
| GPT-4 Turbo (04/09) | 55.0% | 46.1% |
| GPT-4 Preview (11/06) | 50.0% | 50.0% |
| **Storm-7B** | 48.9% | 52.5% |
| Nanbeige Plus Chat v0.1 | 44.5% | 56.7% |
| Qwen1.5 110B Chat | 43.9% | 33.8% |
| Aligner 2B+Claude 3 Opus | 41.8% | 34.5% |
| Claude 3 Opus (02/29) | 40.5% | 29.1% |
| GPT-4 | 38.1% | 23.6% |
| openchat-3.5-0106 | 15.4% | 10.1% |
Please refer to the [leaderboard webpage](https://tatsu-lab.github.io/alpaca_eval/) for up-to-date results.
We also conducted preliminary evaluations on other benchmarks and observed no significant degradation.
| Model | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | Avg. |
| ----------------- | ----- | --------- | ----- | ---------- | ---------- | ----- |
| **Storm-7B** | 67.58 | 80.97 | 62.21 | 57.24 | 80.51 | 69.70 |
| openchat-3.5-0106 | 66.38 | 83.00 | 63.47 | 52.55 | 81.06 | 69.29 |
| internlm2-7b | 58.02 | 81.24 | 65.24 | 48.73 | 83.82 | 67.41 |
| gemma-7B | 61.09 | 82.20 | 64.56 | 44.79 | 79.01 | 66.33 |
| Yi-9B | 61.18 | 78.82 | 70.06 | 42.45 | 77.51 | 66.00 |
| Meta-Llama-3-8B | 59.47 | 82.09 | 66.69 | 43.90 | 77.35 | 65.90 |
| Mistral-7B-v0.1 | 59.98 | 83.31 | 64.16 | 42.15 | 78.37 | 65.59 |
| Qwen-7b | 51.37 | 78.47 | 59.84 | 47.79 | 72.69 | 62.03 |
## Uses
Our model uses the same chat template as [Openchat-3.5-0106](https://huggingface.co/openchat/openchat-3.5-0106). A sample code snippet for inference using our model is provided below.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Load the model and tokenizer, then switch to inference mode.
model = AutoModelForCausalLM.from_pretrained("jieliu/Storm-7B").to(device)
tokenizer = AutoTokenizer.from_pretrained("jieliu/Storm-7B")
model.eval().requires_grad_(False)


def generate_response(prompt):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    outputs = model.generate(
        input_ids,
        max_length=2048,
        do_sample=True,
        temperature=1.0,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    response_ids = outputs[0]
    response_text = tokenizer.decode(response_ids, skip_special_tokens=True)
    return response_text


# Storm-7B expects the OpenChat "GPT4 Correct" conversation format.
prompt = "How does a telescope work?"
input_prompt = f"GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant:"
response_text = generate_response(input_prompt)
print("Response:", response_text)
```
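If the tokenizer shipped with the checkpoint includes the OpenChat chat template in its config, the same prompt can also be built with `apply_chat_template` instead of formatting the string by hand. This is a convenience sketch and assumes the template is present in the tokenizer config.

```python
# Optional: build the "GPT4 Correct" prompt via the tokenizer's chat template,
# assuming the checkpoint ships one in its tokenizer config.
messages = [{"role": "user", "content": "How does a telescope work?"}]
input_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(generate_response(input_prompt))
```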
## Scripts
You can reproduce our results on AlpacaEval 2.0 using the script provided below.
```bash
git clone https://github.com/tatsu-lab/alpaca_eval.git
cd alpaca_eval
pip install -e .
export OPENAI_API_KEY=<your_api_key>
alpaca_eval evaluate_from_model --model_configs 'Storm-7B'
```
## Limitations
Storm-7B is a quick demonstration that a language model fine-tuned with AI feedback can match or even surpass state-of-the-art models, as assessed by the same AI feedback. However, this improvement on the automatic leaderboard does not necessarily indicate better alignment with human intentions. Our model therefore represents a critical, preliminary reevaluation of the RLAIF paradigm, questioning how well learning from and being evaluated by AI feedback aligns with actual human preferences.
## Citation
```
@misc{liu2024storm,
  title  = {Storm-7B: An Empirical Study of Iterative Direct Preference Optimization},
  url    = {},
  author = {Jie Liu and Zhanhui Zhou and Chao Yang and Han-Sen Zhong and Wanli Ouyang},
  month  = {April},
  year   = {2024}
}
```