
	---
	license: apache-2.0
	---

	### Storm-7B

	> Developed by: [Jie Liu](https://jieliu.site/)$^{1,2}$, [Zhanhui Zhou](https://scholar.google.com/citations?user=SbACfYQAAAAJ&hl=zh-CN)$^{2}$, [Chao Yang](https://scholar.google.com/citations?user=5KRbHPMAAAAJ&hl=zh-CN)$^{2}$, [Han-Sen Zhong](https://scholar.google.com.hk/citations?user=X_ZfX8sAAAAJ&hl=zh-CN)$^{2}$, and [Wanli Ouyang](https://wlouyang.github.io/)$^{1,2}$.
	>
	> $^{1}$MMLab, The Chinese University of Hong Kong $^{2}$Shanghai AI Laboratory

	#### Introduction

	We released Storm-7B, the first open-source language model comparable to the GPT-4 series on the [AlpacaEval 2.0](https://tatsu-lab.github.io/alpaca_eval/) leaderboard, ranking 3rd in length-controlled win rate.

	The recipe for this model is simple: 1) fine-tune from [Openchat-3.5-0106](https://huggingface.co/openchat/openchat-3.5-0106), and 2) apply iterative DPO training, a variant of DPO in which the language model iteratively learns from the preferences of a trained reward model. We will release our technical report and code as soon as possible.
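
Since the training code is not yet released, the following is only a minimal sketch of the two ingredients named above, assuming the standard DPO objective. `reward_fn` is a hypothetical stand-in for the trained reward model, and both the best-versus-worst pairing rule and `beta=0.1` are illustrative assumptions rather than the authors' settings.

```python
import math


def build_preference_pair(responses, reward_fn):
    """Return (chosen, rejected) from sampled responses.

    Assumption: the highest-scored response is "chosen" and the
    lowest-scored one is "rejected". `reward_fn` is hypothetical.
    """
    ranked = sorted(responses, key=reward_fn)
    return ranked[-1], ranked[0]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair of summed sequence log-probabilities."""
    # Implicit reward margin between the chosen and rejected responses.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)), computed in a numerically stable way.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

In an iterative setup, sampling, scoring, and DPO updates would be repeated with the updated policy. Whether the reference model is refreshed between rounds is a design choice that this README does not specify.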

	A snapshot of the AlpacaEval 2.0 leaderboard (2024/4/28) is listed below:

	| Model | LC Win Rate | Win Rate |
	| :----------------------: | :-------------: | :----------: |
	| GPT-4 Turbo (04/09) | 55.0% | 46.1% |
	| GPT-4 Preview (11/06) | 50.0% | 50.0% |
	| Storm-7B | 48.9% | 52.5% |
	| Nanbeige Plus Chat v0.1 | 44.5% | 56.7% |
	| Qwen1.5 110B Chat | 43.9% | 33.8% |
	| Aligner 2B+Claude 3 Opus | 41.8% | 34.5% |
	| Claude 3 Opus (02/29) | 40.5% | 29.1% |
	| GPT-4 | 38.1% | 23.6% |
	| openchat-3.5-0106 | 15.4% | 10.1% |

	Please refer to the [leaderboard webpage](https://tatsu-lab.github.io/alpaca_eval/) for up-to-date results.

	We also conducted preliminary evaluations on other benchmarks and observed no significant degradation.

	| Model | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | Avg. |
	| ----------------- | ----- | --------- | ----- | ---------- | ---------- | ----- |
	| Storm-7B | 67.58 | 80.97 | 62.21 | 57.24 | 80.51 | 69.70 |
	| openchat-3.5-0106 | 66.38 | 83.00 | 63.47 | 52.55 | 81.06 | 69.29 |
	| internlm2-7b | 58.02 | 81.24 | 65.24 | 48.73 | 83.82 | 67.41 |
	| gemma-7B | 61.09 | 82.20 | 64.56 | 44.79 | 79.01 | 66.33 |
	| Yi-9B | 61.18 | 78.82 | 70.06 | 42.45 | 77.51 | 66.00 |
	| Meta-Llama-3-8B | 59.47 | 82.09 | 66.69 | 43.90 | 77.35 | 65.90 |
	| Mistral-7B-v0.1 | 59.98 | 83.31 | 64.16 | 42.15 | 78.37 | 65.59 |
	| Qwen-7b | 51.37 | 78.47 | 59.84 | 47.79 | 72.69 | 62.03 |
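
The Avg. column appears to be the unweighted mean of the five benchmark scores. A quick check on two rows of the table, with the scores copied by hand:

```python
# Scores in table order: ARC, HellaSwag, MMLU, TruthfulQA, Winogrande.
scores = {
    "Storm-7B": [67.58, 80.97, 62.21, 57.24, 80.51],
    "openchat-3.5-0106": [66.38, 83.00, 63.47, 52.55, 81.06],
}

# Unweighted mean per model, rounded to two decimals like the table.
averages = {name: round(sum(vals) / len(vals), 2) for name, vals in scores.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
```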

	#### Uses

	Our model uses the same chat template as [Openchat-3.5-0106](https://huggingface.co/openchat/openchat-3.5-0106). A sample code snippet for inference using our model is provided below.

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	device = "cuda"

	# Load the model and tokenizer, then switch to inference mode.
	model = AutoModelForCausalLM.from_pretrained("jieliu/Storm-7B").to(device)
	tokenizer = AutoTokenizer.from_pretrained("jieliu/Storm-7B")
	model.eval().requires_grad_(False)

	def generate_response(prompt):
	    # The prompt must already be wrapped in the chat template.
	    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
	    outputs = model.generate(
	        input_ids,
	        max_length=2048,
	        do_sample=True,
	        temperature=1.0,
	        pad_token_id=tokenizer.pad_token_id,
	        eos_token_id=tokenizer.eos_token_id,
	    )
	    # outputs[0] holds the prompt tokens followed by the generated tokens.
	    response_ids = outputs[0]
	    response_text = tokenizer.decode(response_ids, skip_special_tokens=True)
	    return response_text

	prompt = "I'm trying to teach myself to have nicer handwriting. Can you help?"
	input_prompt = f"GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant:"
	response_text = generate_response(input_prompt)
	print("Response:", response_text)
	```
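
For conversations with more than one turn, a small helper can assemble the prompt string. This is a sketch rather than part of the released code: `build_openchat_prompt` is a hypothetical name, and the multi-turn layout is assumed to repeat the single-turn pattern shown above, so it is worth checking against the Openchat-3.5-0106 model card or the tokenizer's own chat template, if it ships one.

```python
END_OF_TURN = "<|end_of_turn|>"

# Role prefixes taken from the single-turn example in this README.
ROLE_PREFIX = {
    "user": "GPT4 Correct User",
    "assistant": "GPT4 Correct Assistant",
}


def build_openchat_prompt(messages):
    """Build a prompt from a list of (role, text) tuples.

    The conversation must end with a user turn so that the model is
    asked to produce the next assistant reply.
    """
    if not messages or messages[-1][0] != "user":
        raise ValueError("the last message must come from the user")
    turns = [f"{ROLE_PREFIX[role]}: {text}{END_OF_TURN}" for role, text in messages]
    # The trailing assistant prefix cues the model to answer.
    return "".join(turns) + f"{ROLE_PREFIX['assistant']}:"
```

Calling `build_openchat_prompt([("user", prompt)])` reproduces the `input_prompt` string used in the snippet above.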

	#### Limitations

	Storm-7B is a quick demonstration that a language model, fine-tuned with AI feedback, can easily surpass or match state-of-the-art models, as assessed by the same AI feedback. However, this improvement on the automatic leaderboard may not necessarily indicate better alignment with human intentions. Our model therefore represents a critical, preliminary reevaluation of the RLAIF paradigm, questioning how much learning from and being evaluated by AI feedback aligns with actual human preferences.

	#### Citation

	```
	@misc{liu2024storm,
	  title = {Storm-7B},
	  url = {},
	  author = {Jie Liu and Zhanhui Zhou and Chao Yang and Han-Sen Zhong and Wanli Ouyang},
	  month = {April},
	  year = {2024}
	}
	```