InternLM

InternLM2-1.8B-Reward


Introduction

InternLM2-1.8B-Reward is a reward model built on InternLM2-Chat-1.8B-SFT. It was trained on more than 2.4 million preference samples, both human-annotated and AI-synthesized, and achieves strong performance while balancing helpfulness and harmlessness.

Key Features:

Variety of Sizes Available: Our open-sourced reward models are available in sizes of 1.8B, 7B, and 20B, each demonstrating exceptional performance across various metrics. We aim for these different-sized models to facilitate research on the scaling laws of reward models, providing valuable insights to the community.
Comprehensive Coverage of Preference: The model was trained on 2.4 million preference pairs derived from both human annotation and AI synthesis, covering diverse areas such as dialogue, writing, poetry, summarization, coding, and mathematics. It also maintains a balance between helpfulness and harmlessness.
Multilingual Support: InternLM2-Reward was trained on high-quality English and Chinese preference data, delivering robust performance in both languages.

This model was applied to the RLHF training process of InternLM2-Chat. The reward model training techniques from the InternLM2 Technical Report have been open-sourced in XTuner; try them out there!

Performance Evaluation on RewardBench

Models	Score	Chat	Chat Hard	Safety	Reasoning
InternLM2-20B-Reward	89.5	98.6	74.1	89.4	95.7
InternLM2-7B-Reward	86.6	98.6	66.7	88.3	92.8
InternLM2-1.8B-Reward	80.6	95.0	58.1	81.8	87.4

The evaluation is conducted on the RewardBench dataset.
For a fair comparison, conditional system prompts proposed in our technical report were not included during testing.
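
The scores in the table are accuracies: benchmarks of this kind check how often a reward model scores the preferred ("chosen") response above the "rejected" one. The sketch below illustrates that idea only, assuming the per-pair scores have already been computed; the function name `pairwise_accuracy` and the sample values are hypothetical and are not taken from RewardBench itself.

```python
def pairwise_accuracy(pairs):
    """Fraction of (chosen_score, rejected_score) pairs where the chosen
    response receives a strictly higher reward than the rejected one."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    wins = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return wins / len(pairs)


# Hypothetical reward scores for four preference pairs (illustrative values).
example_pairs = [(0.77, -2.22), (1.50, 0.30), (-0.40, 0.10), (2.00, 1.90)]
accuracy = pairwise_accuracy(example_pairs)
print(f"accuracy: {accuracy:.2f}")  # 3 of 4 pairs are ordered correctly -> 0.75
```

In practice the pair scores would come from the reward model itself, for example via the `get_score` API shown in the demo code.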

Demo Code

Basic Usage

We provide some user-friendly APIs for working with the model. The example below shows how to get the reward score of a chat, compare two chats, or rank multiple chats.

import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "internlm/internlm2-1_8b-reward", 
    device_map="cuda", 
    torch_dtype=torch.float16, 
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-1_8b-reward", trust_remote_code=True)

chat_1 = [
    {"role": "user", "content": "Hello! What's your name?"},
    {"role": "assistant", "content": "My name is InternLM2! A helpful AI assistant. What can I do for you?"}
]
chat_2 = [
    {"role": "user", "content": "Hello! What's your name?"}, 
    {"role": "assistant", "content": "I have no idea."}
]


# get reward score for a single chat
score1 = model.get_score(tokenizer, chat_1)
score2 = model.get_score(tokenizer, chat_2)
print("score1: ", score1)
print("score2: ", score2)
# >>> score1:  0.767578125
# >>> score2:  -2.22265625


# batch inference, get multiple scores at once
scores = model.get_scores(tokenizer, [chat_1, chat_2])
print("scores: ", scores)
# >>> scores:  [0.767578125, -2.22265625]


# compare whether chat_1 is better than chat_2
compare_res = model.compare(tokenizer, chat_1, chat_2)
print("compare_res: ", compare_res)
# >>> compare_res:  True


# rank multiple chats; this returns the ranking index of each chat
# the chat with the highest score has ranking index 0
rank_res = model.rank(tokenizer, [chat_1, chat_2])
print("rank_res: ", rank_res)  # lower index means higher score
# >>> rank_res:  [0, 1]
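
The scores above are unbounded scalars, so the gap between two scores is usually more informative than either value alone. One common way to read such a gap, assuming a Bradley-Terry style preference model (an assumption made for illustration, not something this card states about the model), is to pass the difference through a sigmoid. The helper name `preference_probability` is hypothetical.

```python
import math


def preference_probability(score_a, score_b):
    """Probability that response A is preferred over response B under a
    Bradley-Terry model: sigmoid(score_a - score_b).

    Assumption: the scores behave like Bradley-Terry logits. This is an
    illustrative reading, not a documented property of the model.
    """
    diff = score_a - score_b
    # Numerically stable sigmoid for both signs of diff.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    exp_diff = math.exp(diff)
    return exp_diff / (1.0 + exp_diff)


# Scores taken from the example output above.
p = preference_probability(0.767578125, -2.22265625)
print(f"P(chat_1 preferred over chat_2) = {p:.3f}")
```

Equal scores give a probability of exactly 0.5, and swapping the arguments gives the complementary probability.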

Best of N Sampling

Here is an example of how to use the reward model to perform best of N sampling. The code below demonstrates how to select the best response from the candidates generated by the language model.

import torch
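from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# prepare the llm model and tokenizer
# (generate() needs the language-modeling head, so load with AutoModelForCausalLM)
llm = AutoModelForCausalLM.from_pretrained(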
    "internlm/internlm2-chat-7b",
    device_map="cuda", 
    torch_dtype=torch.float16, 
    trust_remote_code=True,
)
llm_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)

# prepare the reward model and tokenizer
reward = AutoModel.from_pretrained(
    "internlm/internlm2-1_8b-reward",
    device_map="cuda", 
    torch_dtype=torch.float16, 
    trust_remote_code=True,
)
reward_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-1_8b-reward", trust_remote_code=True)

# prepare the chat prompt
prompt = "Write an article about the artificial intelligence revolution."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = llm_tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = llm_tokenizer([text], return_tensors="pt").to("cuda")

# generate best of N candidates
num_candidates = 10  # N=10
candidates = []

outputs = llm.generate(
    **model_inputs,
    max_new_tokens=512,
    num_return_sequences=num_candidates,
    pad_token_id=llm_tokenizer.eos_token_id,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.8,
)
outputs = outputs[:, model_inputs["input_ids"].shape[1]:]
for i in range(num_candidates):
    candidate = llm_tokenizer.decode(outputs[i], skip_special_tokens=True)
    candidates.append(messages + [{"role": "assistant", "content": candidate}])

rank_indices = reward.rank(reward_tokenizer, candidates)
sorted_candidates = sorted(zip(rank_indices, candidates), key=lambda x: x[0])

## print the ranked candidates
# for i, (rank_index, candidate) in enumerate(sorted_candidates):
#     print(f"------------Rank {i}------------: \n{candidate[-1]['content']}")

# print the best response
best_response = sorted_candidates[0][1][-1]['content']
print(best_response)
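
The selection step can also be illustrated without loading any model. The sketch below mirrors the ranking convention described in the comments above (rank 0 goes to the highest score) and then picks the best response. How `rank` is implemented inside the model's remote code is not shown in this card, so `rank_from_scores` and `best_of_n` are hypothetical helpers that only reproduce the described semantics.

```python
def rank_from_scores(scores):
    """Return the ranking index of each score; the highest score gets rank 0.

    Assumption: this mirrors the semantics described for `rank` above,
    not its actual implementation. Ties keep their original order.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, index in enumerate(order):
        ranks[index] = position
    return ranks


def best_of_n(responses, scores):
    """Pick the response whose score has rank 0."""
    if not responses or len(responses) != len(scores):
        raise ValueError("responses and scores must be non-empty and equal in length")
    ranks = rank_from_scores(scores)
    return responses[ranks.index(0)]


# Hypothetical candidates and reward scores (illustrative values).
responses = ["draft A", "draft B", "draft C"]
scores = [0.2, 1.5, -0.7]
print(rank_from_scores(scores))      # [1, 0, 2]
print(best_of_n(responses, scores))  # draft B
```

When only the single best candidate is needed, taking the argmax of the scores returned by `get_scores` gives the same result as sorting by rank.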

Open Source License

The code is licensed under Apache-2.0. The model weights are fully open for academic research and also allow free commercial use; to apply for a commercial license, please fill in the application form (English) / application form (Chinese). For other questions or collaborations, please contact internlm@pjlab.org.cn.

Citation

@misc{cai2024internlm2,
      title={InternLM2 Technical Report},
      author={Zheng Cai and Maosong Cao and Haojiong Chen and Kai Chen and Keyu Chen and Xin Chen and Xun Chen and Zehui Chen and Zhi Chen and Pei Chu and Xiaoyi Dong and Haodong Duan and Qi Fan and Zhaoye Fei and Yang Gao and Jiaye Ge and Chenya Gu and Yuzhe Gu and Tao Gui and Aijia Guo and Qipeng Guo and Conghui He and Yingfan Hu and Ting Huang and Tao Jiang and Penglong Jiao and Zhenjiang Jin and Zhikai Lei and Jiaxing Li and Jingwen Li and Linyang Li and Shuaibin Li and Wei Li and Yining Li and Hongwei Liu and Jiangning Liu and Jiawei Hong and Kaiwen Liu and Kuikun Liu and Xiaoran Liu and Chengqi Lv and Haijun Lv and Kai Lv and Li Ma and Runyuan Ma and Zerun Ma and Wenchang Ning and Linke Ouyang and Jiantao Qiu and Yuan Qu and Fukai Shang and Yunfan Shao and Demin Song and Zifan Song and Zhihao Sui and Peng Sun and Yu Sun and Huanze Tang and Bin Wang and Guoteng Wang and Jiaqi Wang and Jiayu Wang and Rui Wang and Yudong Wang and Ziyi Wang and Xingjian Wei and Qizhen Weng and Fan Wu and Yingtong Xiong and Chao Xu and Ruiliang Xu and Hang Yan and Yirong Yan and Xiaogui Yang and Haochen Ye and Huaiyuan Ying and Jia Yu and Jing Yu and Yuhang Zang and Chuyu Zhang and Li Zhang and Pan Zhang and Peng Zhang and Ruijie Zhang and Shuo Zhang and Songyang Zhang and Wenjian Zhang and Wenwei Zhang and Xingcheng Zhang and Xinyue Zhang and Hui Zhao and Qian Zhao and Xiaomeng Zhao and Fengzhe Zhou and Zaida Zhou and Jingming Zhuo and Yicheng Zou and Xipeng Qiu and Yu Qiao and Dahua Lin},
      year={2024},
      eprint={2403.17297},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

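Introduction

InternLM2-1.8B-Reward is a reward model trained from InternLM2-Chat-1.8B-SFT. It was trained on more than 2.4 million human-annotated and AI-synthesized preference samples covering areas such as dialogue, writing, poetry, summarization, coding, and mathematics. It achieves strong performance while balancing helpfulness and safety preferences.

Key features of InternLM2-Reward:

Variety of Sizes Available: Our open-sourced reward models come in three sizes, 1.8B, 7B, and 20B, each demonstrating excellent performance. We hope these different-sized models will facilitate community research on the scaling laws of reward models.
Comprehensive Coverage of Preference: The model was trained on 2.4 million preference samples from human annotation and AI synthesis, spanning dialogue, writing, poetry, summarization, coding, mathematics, and other areas, while maintaining a balance between helpfulness and safety preferences.
Multilingual Support: InternLM2-Reward was trained on high-quality English and Chinese preference data, ensuring robust performance in both languages.

This model was used in the PPO training process of InternLM2-Chat. The reward model training techniques proposed in our technical report have been released in XTuner. You are welcome to try them out there!

Performance Evaluation on RewardBench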

Models	Score	Chat	Chat Hard	Safety	Reasoning
InternLM2-20B-Reward	89.5	98.6	74.1	89.4	95.7
InternLM2-7B-Reward	86.6	98.6	66.7	88.3	92.8
InternLM2-1.8B-Reward	80.6	95.0	58.1	81.8	87.4

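The evaluation was conducted on the RewardBench dataset.
For a fair comparison, the "conditional system prompts" proposed in our technical report were not used during testing.

Demo Code

Basic Usage

We provide some user-friendly APIs for using the model. The examples below show how to use InternLM2-Reward to get the reward score of a chat, compare two chats, or rank multiple chats.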

import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "internlm/internlm2-1_8b-reward", 
    device_map="cuda", 
    torch_dtype=torch.float16, 
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-1_8b-reward", trust_remote_code=True)

chat_1 = [
    {"role": "user", "content": "Hello! What's your name?"},
    {"role": "assistant", "content": "My name is InternLM2! A helpful AI assistant. What can I do for you?"}
]
chat_2 = [
    {"role": "user", "content": "Hello! What's your name?"}, 
    {"role": "assistant", "content": "I have no idea."}
]


# get the reward score for a single chat
score1 = model.get_score(tokenizer, chat_1)
score2 = model.get_score(tokenizer, chat_2)
print("score1: ", score1)
print("score2: ", score2)
# >>> score1:  0.767578125
# >>> score2:  -2.22265625


# batch inference, get multiple scores at once
scores = model.get_scores(tokenizer, [chat_1, chat_2])
print("scores: ", scores)
# >>> scores:  [0.767578125, -2.22265625]


# check whether chat_1 is better than chat_2
compare_res = model.compare(tokenizer, chat_1, chat_2)
print("compare_res: ", compare_res)
# >>> compare_res:  True


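# rank multiple chats; this returns the ranking index of each chat
# the chat with the highest score has ranking index 0
rank_res = model.rank(tokenizer, [chat_1, chat_2])
print("rank_res: ", rank_res)  # a lower ranking index means a higher score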
# >>> rank_res:  [0, 1]

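Best of N Sampling

Below is an example of using InternLM2-Reward to perform Best of N sampling. The code shows how to select the best response from the candidates generated by the language model.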

import torch
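from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# prepare the language model and tokenizer
# (generate() needs the language-modeling head, so load with AutoModelForCausalLM)
llm = AutoModelForCausalLM.from_pretrained(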
    "internlm/internlm2-chat-7b",
    device_map="cuda", 
    torch_dtype=torch.float16, 
    trust_remote_code=True,
)
llm_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)

# prepare the reward model and tokenizer
reward = AutoModel.from_pretrained(
    "internlm/internlm2-1_8b-reward",
    device_map="cuda", 
    torch_dtype=torch.float16, 
    trust_remote_code=True,
)
reward_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-1_8b-reward", trust_remote_code=True)

# prepare the prompt
prompt = "Write an article about the artificial intelligence revolution."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = llm_tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = llm_tokenizer([text], return_tensors="pt").to("cuda")

# generate N candidates
num_candidates = 10  # N=10
candidates = []

outputs = llm.generate(
    **model_inputs,
    max_new_tokens=512,
    num_return_sequences=num_candidates,
    pad_token_id=llm_tokenizer.eos_token_id,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.8,
)
outputs = outputs[:, model_inputs["input_ids"].shape[1]:]


for i in range(num_candidates):
    candidate = llm_tokenizer.decode(outputs[i], skip_special_tokens=True)
    candidates.append(messages + [{"role": "assistant", "content": candidate}])

rank_indices = reward.rank(reward_tokenizer, candidates)
sorted_candidates = sorted(zip(rank_indices, candidates), key=lambda x: x[0])

## print the ranked candidates
# for i, (rank_index, candidate) in enumerate(sorted_candidates):
#     print(f"------------Rank {i}------------: \n{candidate[-1]['content']}")

# print the best response
best_response = sorted_candidates[0][1][-1]['content']
print(best_response)

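Open Source License

The code in this repository is open-sourced under the Apache-2.0 license. The model weights are fully open for academic research, and authorization for free commercial use is also available on application (application form). For other questions or collaborations, please contact internlm@pjlab.org.cn.

Citation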

@misc{cai2024internlm2,
      title={InternLM2 Technical Report},
      author={Zheng Cai and Maosong Cao and Haojiong Chen and Kai Chen and Keyu Chen and Xin Chen and Xun Chen and Zehui Chen and Zhi Chen and Pei Chu and Xiaoyi Dong and Haodong Duan and Qi Fan and Zhaoye Fei and Yang Gao and Jiaye Ge and Chenya Gu and Yuzhe Gu and Tao Gui and Aijia Guo and Qipeng Guo and Conghui He and Yingfan Hu and Ting Huang and Tao Jiang and Penglong Jiao and Zhenjiang Jin and Zhikai Lei and Jiaxing Li and Jingwen Li and Linyang Li and Shuaibin Li and Wei Li and Yining Li and Hongwei Liu and Jiangning Liu and Jiawei Hong and Kaiwen Liu and Kuikun Liu and Xiaoran Liu and Chengqi Lv and Haijun Lv and Kai Lv and Li Ma and Runyuan Ma and Zerun Ma and Wenchang Ning and Linke Ouyang and Jiantao Qiu and Yuan Qu and Fukai Shang and Yunfan Shao and Demin Song and Zifan Song and Zhihao Sui and Peng Sun and Yu Sun and Huanze Tang and Bin Wang and Guoteng Wang and Jiaqi Wang and Jiayu Wang and Rui Wang and Yudong Wang and Ziyi Wang and Xingjian Wei and Qizhen Weng and Fan Wu and Yingtong Xiong and Chao Xu and Ruiliang Xu and Hang Yan and Yirong Yan and Xiaogui Yang and Haochen Ye and Huaiyuan Ying and Jia Yu and Jing Yu and Yuhang Zang and Chuyu Zhang and Li Zhang and Pan Zhang and Peng Zhang and Ruijie Zhang and Shuo Zhang and Songyang Zhang and Wenjian Zhang and Wenwei Zhang and Xingcheng Zhang and Xinyue Zhang and Hui Zhao and Qian Zhao and Xiaomeng Zhao and Fengzhe Zhou and Zaida Zhou and Jingming Zhuo and Yicheng Zou and Xipeng Qiu and Yu Qiao and Dahua Lin},
      year={2024},
      eprint={2403.17297},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
