---
pipeline_tag: text-classification
license: other
language:
- en
- zh
tags:
- reward model
---
# InternLM
<div align="center">
<img src="https://github.com/InternLM/InternLM/assets/22529082/b9788105-8892-4398-8b47-b513a292378e" width="200"/>
<div> </div>
<div align="center">
<b><font size="5">InternLM Reward</font></b>
</div>
[💻Github Repo](https://github.com/InternLM/InternLM) • [🤔Reporting Issues](https://github.com/InternLM/InternLM/issues/new) • [📜Technical Report](https://arxiv.org/abs/2403.17297)
</div>
<p align="center">
👋 join us on <a href="https://discord.gg/xa29JuW87d" target="_blank">Discord</a> and <a href="https://github.com/InternLM/InternLM/assets/25839884/a6aad896-7232-4220-ac84-9e070c2633ce" target="_blank">WeChat</a>
</p>
## Introduction
**InternLM-Reward** is a reward model trained on top of InternLM2-Chat-SFT. It was trained on over 2.4 million preference samples, both human-annotated and AI-synthesized, and achieves strong performance while balancing helpfulness and harmlessness.
### Key Features:
- **Variety of Sizes Available**: Our open-sourced reward models are available in sizes of **1.8B, 7B, and 20B**, each demonstrating exceptional performance across various metrics.
- **Comprehensive Coverage of Preference**: Trained with **2.4 million** preference pairs derived from both human annotations and AI synthesis, covering diverse areas such as dialogue, writing, poetry, summarization, coding, and mathematics. It also maintains a balance between helpfulness and harmlessness.
- **Multilingual Support**: InternLM-Reward was trained on high-quality **English and Chinese** preference data, delivering robust performance in both languages.
This model was applied to the PPO training process of InternLM2-Chat. The reward model training techniques from the [InternLM2 Technical Report](https://arxiv.org/abs/2403.17297) have been open-sourced in XTuner. Try them out [here](https://github.com/InternLM/xtuner)!
## Performance Evaluation on RewardBench
| Models | Score | Chat | Chat Hard | Safety | Reasoning |
| --- | --- | --- | --- | --- | --- |
| InternLM-Reward-20B | 89.5 | 98.6 | 74.1 | 89.4 | 95.7 |
| InternLM-Reward-7B | 86.6 | 98.6 | 66.7 | 88.3 | 92.8 |
| InternLM-Reward-1.8B | 80.6 | 95.0 | 58.1 | 81.8 | 87.4 |
- The evaluation is conducted on the [RewardBench](https://github.com/allenai/reward-bench) dataset.
- For a fair comparison, conditional system prompts proposed in our technical report were not included during testing.
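
As a quick sanity check on the table, the Score column appears to match the unweighted mean of the four category columns to within rounding. This is only an observation about the numbers reported here, not a statement of how RewardBench defines its overall score. The `rows` dictionary simply copies the table, and `category_mean` is a helper introduced for this sketch.

```python
# Values copied from the table above: (Score, Chat, Chat Hard, Safety, Reasoning).
rows = {
    "InternLM-Reward-20B": (89.5, 98.6, 74.1, 89.4, 95.7),
    "InternLM-Reward-7B": (86.6, 98.6, 66.7, 88.3, 92.8),
    "InternLM-Reward-1.8B": (80.6, 95.0, 58.1, 81.8, 87.4),
}


def category_mean(categories):
    """Unweighted mean of the per-category accuracies."""
    return sum(categories) / len(categories)


for name, (score, *categories) in rows.items():
    mean = category_mean(categories)
    print(f"{name}: reported score = {score}, mean of categories = {mean:.3f}")
```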
## Demo Code
### Basic Usage
We provide some user-friendly APIs for you to use the model. Here is an example of how to use the model to get the reward score of a chat, compare two chats, or rank multiple chats.
```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "internlm/internlm-reward-7b",
    device_map="cuda",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-reward-7b", trust_remote_code=True)

chat_1 = [
    {"role": "user", "content": "Hello! What's your name?"},
    {"role": "assistant", "content": "My name is InternLM2! A helpful AI assistant. What can I do for you?"}
]
chat_2 = [
    {"role": "user", "content": "Hello! What's your name?"},
    {"role": "assistant", "content": "I have no idea."}
]

# get reward score for a single chat
score1 = model.get_score(tokenizer, chat_1)
score2 = model.get_score(tokenizer, chat_2)
print("score1: ", score1)
print("score2: ", score2)
# >>> score1: 0.767578125
# >>> score2: -2.22265625

# batch inference, get multiple scores at once
scores = model.get_scores(tokenizer, [chat_1, chat_2])
print("scores: ", scores)
# >>> scores: [0.767578125, -2.22265625]

# compare whether chat_1 is better than chat_2
compare_res = model.compare(tokenizer, chat_1, chat_2)
print("compare_res: ", compare_res)
# >>> compare_res: True

# rank multiple chats, it will return the ranking index of each chat
# the chat with the highest score will have ranking index as 0
rank_res = model.rank(tokenizer, [chat_1, chat_2])
print("rank_res: ", rank_res)  # lower index means higher score
# >>> rank_res: [0, 1]
```
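
To make the return value of `rank` concrete, the sketch below reproduces the convention described in the comments above (the highest-scoring chat gets ranking index 0) using plain Python on a list of scores. `rank_from_scores` is a hypothetical helper written for illustration. It is not part of the model's API, and the actual implementation may differ, for example in how ties are broken.

```python
from typing import List


def rank_from_scores(scores: List[float]) -> List[int]:
    """Return, for each score, its position when scores are sorted from high to low."""
    # Indices of the chats ordered from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, chat_index in enumerate(order):
        ranks[chat_index] = position
    return ranks


# Scores taken from the example output above.
print(rank_from_scores([0.767578125, -2.22265625]))  # [0, 1]
# Illustrative scores: the second chat scores highest, so it gets index 0.
print(rank_from_scores([-1.0, 3.5, 0.2]))  # [2, 0, 1]
```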
### Best of N Sampling
Here is an example of how to use the reward model to perform best of N sampling.
The code below demonstrates how to select the best response from the candidates generated by the language model.
```python
import torch
from transformers import AutoModel, AutoTokenizer

# prepare the llm model and tokenizer
llm = AutoModel.from_pretrained(
    "internlm/internlm2-chat-7b",
    device_map="cuda",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
llm_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)

# prepare the reward model and tokenizer
reward = AutoModel.from_pretrained(
    "internlm/internlm-reward-7b",
    device_map="cuda",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
reward_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-reward-7b", trust_remote_code=True)

# prepare the chat prompt
prompt = "Write an article about the artificial intelligence revolution."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = llm_tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = llm_tokenizer([text], return_tensors="pt").to("cuda")

# generate best of N candidates
num_candidates = 10  # N=10
candidates = []
outputs = llm.generate(
    **model_inputs,
    max_new_tokens=512,
    num_return_sequences=num_candidates,
    pad_token_id=llm_tokenizer.eos_token_id,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.8,
)
# keep only the newly generated tokens (drop the prompt)
outputs = outputs[:, model_inputs["input_ids"].shape[1]:]
for i in range(num_candidates):
    candidate = llm_tokenizer.decode(outputs[i], skip_special_tokens=True)
    candidates.append(messages + [{"role": "assistant", "content": candidate}])

rank_indices = reward.rank(reward_tokenizer, candidates)
sorted_candidates = sorted(zip(rank_indices, candidates), key=lambda x: x[0])

# print the ranked candidates
# for i, (rank_index, candidate) in enumerate(sorted_candidates):
#     print(f"------------Rank {i}------------: \n{candidate[-1]['content']}")

# print the best response
best_response = sorted_candidates[0][1][-1]['content']
print(best_response)
```
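
If only the single best candidate is needed, picking the highest-scoring one directly should give the same result as taking ranking index 0, assuming there are no tied scores. The sketch below assumes the scores have already been computed (for instance with `get_scores`, as in Basic Usage). `pick_best` is a hypothetical helper, and the candidate strings and scores are placeholders rather than real model output.

```python
from typing import Sequence, Tuple


def pick_best(candidates: Sequence[str], scores: Sequence[float]) -> Tuple[int, str]:
    """Return the index and text of the candidate with the highest reward score."""
    if not candidates or len(candidates) != len(scores):
        raise ValueError("candidates and scores must be non-empty and of equal length")
    # max() returns the first index on ties, so earlier candidates win ties.
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return best_index, candidates[best_index]


# Placeholder data standing in for N generated responses and their reward scores.
placeholder_candidates = ["draft A", "draft B", "draft C"]
placeholder_scores = [0.1, 1.7, -0.4]
print(pick_best(placeholder_candidates, placeholder_scores))  # (1, 'draft B')
```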
## Open Source License
The code is licensed under Apache-2.0, while model weights are fully open for academic research and also allow **free** commercial usage. To apply for a commercial license, please fill in the [application form (English)](https://wj.qq.com/s2/12727483/5dba/)/[申请表(中文)](https://wj.qq.com/s2/12725412/f7c1/). For other questions or collaborations, please contact <internlm@pjlab.org.cn>.
## Citation
```
@misc{cai2024internlm2,
title={InternLM2 Technical Report},
author={Zheng Cai and Maosong Cao and Haojiong Chen and Kai Chen and Keyu Chen and Xin Chen and Xun Chen and Zehui Chen and Zhi Chen and Pei Chu and Xiaoyi Dong and Haodong Duan and Qi Fan and Zhaoye Fei and Yang Gao and Jiaye Ge and Chenya Gu and Yuzhe Gu and Tao Gui and Aijia Guo and Qipeng Guo and Conghui He and Yingfan Hu and Ting Huang and Tao Jiang and Penglong Jiao and Zhenjiang Jin and Zhikai Lei and Jiaxing Li and Jingwen Li and Linyang Li and Shuaibin Li and Wei Li and Yining Li and Hongwei Liu and Jiangning Liu and Jiawei Hong and Kaiwen Liu and Kuikun Liu and Xiaoran Liu and Chengqi Lv and Haijun Lv and Kai Lv and Li Ma and Runyuan Ma and Zerun Ma and Wenchang Ning and Linke Ouyang and Jiantao Qiu and Yuan Qu and Fukai Shang and Yunfan Shao and Demin Song and Zifan Song and Zhihao Sui and Peng Sun and Yu Sun and Huanze Tang and Bin Wang and Guoteng Wang and Jiaqi Wang and Jiayu Wang and Rui Wang and Yudong Wang and Ziyi Wang and Xingjian Wei and Qizhen Weng and Fan Wu and Yingtong Xiong and Chao Xu and Ruiliang Xu and Hang Yan and Yirong Yan and Xiaogui Yang and Haochen Ye and Huaiyuan Ying and Jia Yu and Jing Yu and Yuhang Zang and Chuyu Zhang and Li Zhang and Pan Zhang and Peng Zhang and Ruijie Zhang and Shuo Zhang and Songyang Zhang and Wenjian Zhang and Wenwei Zhang and Xingcheng Zhang and Xinyue Zhang and Hui Zhao and Qian Zhao and Xiaomeng Zhao and Fengzhe Zhou and Zaida Zhou and Jingming Zhuo and Yicheng Zou and Xipeng Qiu and Yu Qiao and Dahua Lin},
year={2024},
eprint={2403.17297},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
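## Introduction
**InternLM-Reward** is a reward model trained on top of **InternLM2-Chat-SFT**. It was trained on more than 2.4 million human-annotated and AI-synthesized preference samples covering dialogue, writing, poetry, summarization, coding, mathematics, and other domains. It achieves strong performance while balancing helpfulness and safety preferences.
### Key Features of InternLM-Reward:
- **Variety of Sizes Available**: Our open-source reward models come in three sizes, 1.8B, 7B, and 20B, each demonstrating excellent performance.
- **Comprehensive Coverage of Preference**: The model was trained on 2.4 million preference samples from human annotation and AI synthesis, spanning dialogue, writing, poetry, summarization, coding, mathematics, and more, while keeping helpfulness and safety preferences in balance.
- **Multilingual Support**: InternLM-Reward was trained on high-quality **English and Chinese** preference data, ensuring robust performance in both languages.
This model was used in the PPO training process of InternLM2-Chat. The reward model training techniques proposed in our [technical report](https://arxiv.org/abs/2403.17297) have been released in XTuner. Try them out [here](https://github.com/InternLM/xtuner)!
## Performance Evaluation on RewardBench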
| Models | Score | Chat | Chat Hard | Safety | Reasoning |
| --- | --- | --- | --- | --- | --- |
| InternLM-Reward-20B | 89.5 | 98.6 | 74.1 | 89.4 | 95.7 |
| InternLM-Reward-7B | 86.6 | 98.6 | 66.7 | 88.3 | 92.8 |
| InternLM-Reward-1.8B | 80.6 | 95.0 | 58.1 | 81.8 | 87.4 |
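- The evaluation was conducted on the [RewardBench](https://github.com/allenai/reward-bench) dataset.
- For a fair comparison, the "conditional system prompts" proposed in our technical report were not used during testing.
## Demo Code
### Basic Usage
We provide some user-friendly APIs for using the model. The examples below show how to use InternLM-Reward to get the reward score of a chat, compare two chats, or rank multiple chats.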
```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "internlm/internlm-reward-7b",
    device_map="cuda",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-reward-7b", trust_remote_code=True)

chat_1 = [
    {"role": "user", "content": "Hello! What's your name?"},
    {"role": "assistant", "content": "My name is InternLM2! A helpful AI assistant. What can I do for you?"}
]
chat_2 = [
    {"role": "user", "content": "Hello! What's your name?"},
    {"role": "assistant", "content": "I have no idea."}
]

# get the reward score for a single chat
score1 = model.get_score(tokenizer, chat_1)
score2 = model.get_score(tokenizer, chat_2)
print("score1: ", score1)
print("score2: ", score2)
# >>> score1: 0.767578125
# >>> score2: -2.22265625

# batch inference, get multiple scores at once
scores = model.get_scores(tokenizer, [chat_1, chat_2])
print("scores: ", scores)
# >>> scores: [0.767578125, -2.22265625]

# compare whether chat_1 is better than chat_2
compare_res = model.compare(tokenizer, chat_1, chat_2)
print("compare_res: ", compare_res)
# >>> compare_res: True

# rank multiple chats; this returns the ranking index of each chat
# the chat with the highest score has ranking index 0
rank_res = model.rank(tokenizer, [chat_1, chat_2])
print("rank_res: ", rank_res)  # a lower ranking index means a higher score
# >>> rank_res: [0, 1]
```
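### Best of N Sampling
Here is an example of how to use InternLM-Reward to perform Best of N sampling.
The code below demonstrates how to select the best response from the candidates generated by the language model.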
```python
import torch
from transformers import AutoModel, AutoTokenizer

# prepare the language model and tokenizer
llm = AutoModel.from_pretrained(
    "internlm/internlm2-chat-7b",
    device_map="cuda",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
llm_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)

# prepare the reward model and tokenizer
reward = AutoModel.from_pretrained(
    "internlm/internlm-reward-7b",
    device_map="cuda",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
reward_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-reward-7b", trust_remote_code=True)

# prepare the prompt
prompt = "Write an article about the artificial intelligence revolution."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = llm_tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = llm_tokenizer([text], return_tensors="pt").to("cuda")

# generate N candidates
num_candidates = 10  # N=10
candidates = []
outputs = llm.generate(
    **model_inputs,
    max_new_tokens=512,
    num_return_sequences=num_candidates,
    pad_token_id=llm_tokenizer.eos_token_id,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.8,
)
# keep only the newly generated tokens (drop the prompt)
outputs = outputs[:, model_inputs["input_ids"].shape[1]:]
for i in range(num_candidates):
    candidate = llm_tokenizer.decode(outputs[i], skip_special_tokens=True)
    candidates.append(messages + [{"role": "assistant", "content": candidate}])

rank_indices = reward.rank(reward_tokenizer, candidates)
sorted_candidates = sorted(zip(rank_indices, candidates), key=lambda x: x[0])

# print the ranked candidates
# for i, (rank_index, candidate) in enumerate(sorted_candidates):
#     print(f"------------Rank {i}------------: \n{candidate[-1]['content']}")

# print the best response
best_response = sorted_candidates[0][1][-1]['content']
print(best_response)
```
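## Open Source License
The code in this repository is open-sourced under the Apache-2.0 license. The model weights are fully open for academic research, and a free commercial usage license can also be applied for ([application form](https://wj.qq.com/s2/12725412/f7c1/)). For other questions or collaborations, please contact <internlm@pjlab.org.cn>.
## Citation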
```
@misc{cai2024internlm2,
title={InternLM2 Technical Report},
author={Zheng Cai and Maosong Cao and Haojiong Chen and Kai Chen and Keyu Chen and Xin Chen and Xun Chen and Zehui Chen and Zhi Chen and Pei Chu and Xiaoyi Dong and Haodong Duan and Qi Fan and Zhaoye Fei and Yang Gao and Jiaye Ge and Chenya Gu and Yuzhe Gu and Tao Gui and Aijia Guo and Qipeng Guo and Conghui He and Yingfan Hu and Ting Huang and Tao Jiang and Penglong Jiao and Zhenjiang Jin and Zhikai Lei and Jiaxing Li and Jingwen Li and Linyang Li and Shuaibin Li and Wei Li and Yining Li and Hongwei Liu and Jiangning Liu and Jiawei Hong and Kaiwen Liu and Kuikun Liu and Xiaoran Liu and Chengqi Lv and Haijun Lv and Kai Lv and Li Ma and Runyuan Ma and Zerun Ma and Wenchang Ning and Linke Ouyang and Jiantao Qiu and Yuan Qu and Fukai Shang and Yunfan Shao and Demin Song and Zifan Song and Zhihao Sui and Peng Sun and Yu Sun and Huanze Tang and Bin Wang and Guoteng Wang and Jiaqi Wang and Jiayu Wang and Rui Wang and Yudong Wang and Ziyi Wang and Xingjian Wei and Qizhen Weng and Fan Wu and Yingtong Xiong and Chao Xu and Ruiliang Xu and Hang Yan and Yirong Yan and Xiaogui Yang and Haochen Ye and Huaiyuan Ying and Jia Yu and Jing Yu and Yuhang Zang and Chuyu Zhang and Li Zhang and Pan Zhang and Peng Zhang and Ruijie Zhang and Shuo Zhang and Songyang Zhang and Wenjian Zhang and Wenwei Zhang and Xingcheng Zhang and Xinyue Zhang and Hui Zhao and Qian Zhao and Xiaomeng Zhao and Fengzhe Zhou and Zaida Zhou and Jingming Zhuo and Yicheng Zou and Xipeng Qiu and Yu Qiao and Dahua Lin},
year={2024},
eprint={2403.17297},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```