---
language:
- en
license: apache-2.0
library_name: transformers

---

# Model Card for Mistral-7B-ReMax-v0.1


The Mistral-7B-ReMax-v0.1 Large Language Model (LLM) is a Reinforcement Learning from Human Feedback (RLHF) fine-tuned version of [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2).

The fine-tuning algorithm is ReMax; algorithm details can be found in the [paper](https://arxiv.org/abs/2310.10505).

## Model Details

- Developed by: Ziniu Li
- License: apache-2.0
- Finetuned from [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)

## Uses

### Direct Use

The instruction format is the same as that of Mistral-7B-Instruct-v0.2: each user prompt must be surrounded by `[INST]` and `[/INST]` tokens.

```
text = "<s>[INST] What is your favourite condiment? [/INST]"
"Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s> "
"[INST] Do you have mayonnaise recipes? [/INST]"
```
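
In practice, the tokenizer's chat template produces this format automatically. Below is a minimal inference sketch with the `transformers` library; the repository id used here is an assumption, so substitute the actual model path. The sampling parameters match those listed under Training Hyperparameters.

```python
# Minimal inference sketch. The repo id below is an assumed placeholder;
# replace it with the actual model path.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ziniuli/Mistral-7B-ReMax-v0.1"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
]
# apply_chat_template wraps the prompt in [INST] ... [/INST] for Mistral models.
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

outputs = model.generate(
    inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9
)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```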


## Training Details

### Training Data

10k prompts from the [lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) dataset. Note that only the prompts are used; no responses from this dataset enter training, since the model generates its own responses during RLHF fine-tuning.

### Reward Model

The reward model is based on [UltraRM-13b](https://huggingface.co/openbmb/UltraRM-13b). 

**Important**: UltraRM-13b uses a different instruction template than Mistral-7B. To address this mismatch, we switch to UltraRM-13b's chat template when evaluating the reward score during RLHF fine-tuning, as sketched below.
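
The following is a minimal sketch of that re-templating step. The `Human:`/`Assistant:` dialogue format follows the UltraRM-13b model card (verify before use), and `reward_model` is a hypothetical handle to the loaded reward model; this is not the exact training code.

```python
import torch

def ultrarm_format(prompt: str, response: str) -> str:
    # Dialogue format per the UltraRM-13b model card (verify before use).
    # Assumes prompt/response are raw text without Mistral's [INST] markup.
    return f"Human: {prompt}\nAssistant: {response}"

@torch.no_grad()
def reward_score(reward_model, reward_tokenizer, prompt: str, response: str) -> float:
    text = ultrarm_format(prompt, response)
    inputs = reward_tokenizer(text, return_tensors="pt").to(reward_model.device)
    # UltraRM-13b ends in a regression head that emits one scalar per sequence.
    return reward_model(**inputs).item()
```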


### Training Procedure 

The training algorithm is ReMax; details are given in the [paper](https://arxiv.org/abs/2310.10505), and a reference implementation is available in the [repository](https://github.com/liziniu/ReMax).


### Training Hyperparameters

```
max_prompt_seq: 384
max_answer_seq: 384
learning_rate: 5e-7
num_train_epochs: 1
kl_penalty: full_step
kl_ctl: 0.1
reward_clip_value: 1.0
temperature for sampling: 0.7
top_p for sampling: 0.9
```
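
For intuition, the core of ReMax is a REINFORCE-style policy gradient that uses the reward of the model's own greedy (argmax) response as a variance-reducing baseline. Below is a minimal one-step sketch under simplifying assumptions: `policy.sample`, `policy.greedy`, `ref_policy.logprobs`, and `reward_fn` are hypothetical helpers, and the sequence-level KL term is a simplification of the `kl_penalty: full_step` setting above.

```python
def remax_step(policy, ref_policy, reward_fn, prompt, optimizer, kl_ctl=0.1):
    """One ReMax update (sketch; the helpers used here are hypothetical)."""
    # 1. Sample a response (with differentiable log-probs) and decode greedily
    #    to obtain the baseline response (no gradient needed there).
    response, logprobs = policy.sample(prompt)   # logprobs: per-token log pi(y_t | x, y_<t)
    baseline_response = policy.greedy(prompt)

    # 2. Score both with the reward model; the greedy reward is the baseline b(x).
    r = reward_fn(prompt, response)
    b = reward_fn(prompt, baseline_response)

    # 3. KL penalty against the reference (initial) policy, scaled by kl_ctl.
    ref_logprobs = ref_policy.logprobs(prompt, response)
    kl = (logprobs.detach() - ref_logprobs).sum()
    advantage = (r - b) - kl_ctl * kl            # treated as a constant below

    # 4. REINFORCE: ascend the advantage-weighted log-likelihood of the sample.
    loss = -advantage * logprobs.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```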


## Evaluation


### Chat1M 


Since we fine-tune Mistral-7B-Instruct-v0.2 on prompts from the lmsys-chat-1m dataset, we first evaluate response quality on this distribution. We randomly select 500 prompts from lmsys-chat-1m, let both models generate responses, and then ask GPT-4 to judge the binary win rate. Mistral-7B-ReMax-v0.1 wins 289 of the 500 comparisons, i.e., an average win rate of 57.8%.

| Model                    | Wins (out of 500) |
| ------------------------ | ----------------- |
| Mistral-7B-ReMax-v0.1    | 289               |
| Mistral-7B-Instruct-v0.2 | 211               |
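
A minimal sketch of such pairwise GPT-4 judging is shown below. The judge prompt wording is an assumption (the exact evaluation prompt is not specified here), and in practice one should also swap the A/B positions to control for position bias.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical judge prompt; not the exact wording used for this evaluation.
JUDGE_TEMPLATE = (
    "You are comparing two assistant responses to the same user prompt.\n\n"
    "Prompt: {prompt}\n\nResponse A: {a}\n\nResponse B: {b}\n\n"
    "Reply with a single letter, A or B, naming the better response."
)

def judge(prompt: str, a: str, b: str) -> str:
    reply = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_TEMPLATE.format(prompt=prompt, a=a, b=b)}],
    )
    return reply.choices[0].message.content.strip()  # "A" or "B"
```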

Samples of responses are provided below. 

```
==================Instruction================
你好 (Hello)
==================Mistral-7B-Instruct-v0.2================
你好!我是一位帮助您解答问题的机器人。如果您有任何问题,请告诉我,我会尽力回答。(Hello! I am a robot here to help answer your questions. If you have any questions, please tell me and I will do my best to answer.)

你好(Hello)是中文语言中的一个常用词语,常用于问候或表示亲和,意思是“你好,你很好”。("Nǐ hǎo" (Hello) is a common word in Chinese, often used as a greeting or to express friendliness; it roughly means "hello, you are well.")

在我帮助您之前,我想先问一下,您是否知道什么是机器学习?机器学习是人工智能领域中的一种技术,它允许计算机自动学习和改进其行为,以便从数据中提取出有价值的信息,并使计算机能够做出更智能的决策。(Before I help you, may I first ask whether you know what machine learning is? Machine learning is a technique in artificial intelligence that allows computers to automatically learn and improve their behavior, so as to extract valuable information from data and make smarter decisions.)

如果您已经熟悉机器学习,那么我可以帮助您深入探讨相关主题,如神经网络、支持向量机、决策树等。但是,如果您对机器学习不太熟悉,我可以简单地解释它的基本原理,以及它在现实生活中的应用。(If you are already familiar with machine learning, I can help you explore related topics in depth, such as neural networks, support vector machines, and decision trees. If not, I can briefly explain its basic principles and its real-life applications.)

请告诉我您的需求,我会尽力满足您的期望。(Please tell me what you need, and I will do my best to meet your expectations.)

再次问候您:你好!(Greetings once again: hello!)
==================Mistral-7B-ReMax-v0.1================
好的,欢迎使用我的帮助!(Hello, welcome to use my help!)

是的,我可以以中文回答你的问题,或者帮你完成一些简单的任务。(Yes, I can answer your questions in Chinese or help you with some simple tasks.)

请问有什么特定的问题需要我帮忙呢?(What specific question do you need my help with?)

或者,你可以告诉我什么方面我可以提供帮助。(Or you can tell me which area I can provide help in.)

希望我能帮到你!(I hope I can help you!)
```


### AlpacaEval

[AlpacaEval](https://github.com/tatsu-lab/alpaca_eval) provides 805 test questions on which the model has not been trained. The evaluation results are shown below:

|                          | AlpacaEval (win %) | AlpacaEval 2 (win %) | Response Length |
| ------------------------ | ------------------ | -------------------- | --------------- |
| Mistral-7B-Instruct-v0.2 | 92.78              | 14.72                | 1676            |
| Mistral-7B-ReMax-v0.1    | 94.40              | 16.00                | 1478            |
| LLaMa2 Chat 7B           | 71.37              | 4.96                 | 1479            |
| LLaMa2 Chat 70B          | 92.66              | 13.87                | 1790            |
| GPT-4                    | 95.28              | 23.58                | 2049            |
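
Such numbers can be reproduced by generating outputs for the 805 AlpacaEval instructions and scoring them with the `alpaca_eval` toolkit. A sketch of the output-file preparation follows; the dataset-loading call and JSON schema follow the alpaca_eval README (verify against current docs), and `generate` is a hypothetical helper wrapping the model's generation.

```python
import json
import datasets

# Load the 805 AlpacaEval instructions (per the alpaca_eval README).
eval_set = datasets.load_dataset("tatsu-lab/alpaca_eval", "alpaca_eval")["eval"]

outputs = []
for example in eval_set:
    response = generate(example["instruction"])  # hypothetical generation helper
    outputs.append({
        "instruction": example["instruction"],
        "output": response,
        "generator": "Mistral-7B-ReMax-v0.1",
    })

with open("model_outputs.json", "w") as f:
    json.dump(outputs, f, ensure_ascii=False, indent=2)

# Then score with, e.g.: alpaca_eval --model_outputs model_outputs.json
```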

Sample responses are provided below.

```
==================Instruction================
Marie is at the music store for a store day. She bought two $50 combos. Then, for $25 she bought two t-shirts for $5 each and five hoodies for $2 each. Then, she bought sneakers for $8 each. How much did she spend total?
==================Mistral-7B-Instruct-v0.2================
Marie bought 2 combos at $50 each, so that's $100. She bought 2 t-shirts at $5 each, which is $10. She bought 5 hoodies at $2 each, which is $10. The sneakers cost $8. So Marie spent a total of $118. ($100 + $10 + $8 = $118)
==================Mistral-7B-ReMax-v0.1================
Marie bought two combos at $50 each, so that's $100. She bought two t-shirts for $5 each, which totals $10. She bought five hoodies for $2 each, so that's $10. The sneakers cost $8.

So Marie spent $100 + $10 + $10 + $8 = $128 in total.
```

## Limitations

The Mistral-7B-ReMax-v0.1 model demonstrates that the ReMax optimization algorithm is effective. The model shares the limitations of Mistral-7B-Instruct-v0.2; in particular, it does not have any moderation mechanisms.

## Citation


**BibTeX:**


```
@article{li2023remax,
  title={ReMax: A simple, effective, and efficient method for aligning large language models},
  author={Li, Ziniu and Xu, Tian and Zhang, Yushun and Yu, Yang and Sun, Ruoyu and Luo, Zhi-Quan},
  journal={arXiv preprint arXiv:2310.10505},
  year={2023}
}
```