---
license: mit
datasets:
- llm-blender/Unified-Feedback
language:
- en
metrics:
- accuracy
library_name: transformers
pipeline_tag: text-classification
---

## Introduction

This reward model is finetuned from [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) on the [llm-blender/Unified-Feedback](https://huggingface.co/datasets/llm-blender/Unified-Feedback) dataset.

It achieves an accuracy of **0.7740** on the test sets, making it a good proxy reward model for modeling human preferences that can be used for aligning LLMs.

The Unified-Feedback dataset contains diverse preference data from prior open-source datasets, including:

* openai/summarize_from_feedback
* openai/webgpt_comparisons
* Dahoas/instruct-synthetic-prompt-responses
* Anthropic/hh-rlhf
* lmsys/chatbot_arena_conversations
* openbmb/UltraFeedback
* argilla/ultrafeedback-binarized-preferences-cleaned
* berkeley-nest/Nectar
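
For a quick look at the combined data, the dataset can be loaded with the `datasets` library. The snippet below is only a minimal sketch: the `"all"` configuration name and the exact column layout are assumptions, so check the dataset page for the configurations and fields that actually exist.

```python
from datasets import load_dataset

# Minimal sketch: load the combined preference data.
# The "all" config name is an assumption; see the dataset page
# for the configurations that are actually available.
dataset = load_dataset("llm-blender/Unified-Feedback", "all", split="train")

# Each example pairs a conversation with a preferred and a rejected response
# (exact field names may differ across the source subsets).
print(dataset.column_names)
print(dataset[0])
```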

## Training Code and Blog

The training script is available at https://github.com/WeiXiongUST/RLHF-Reward-Modeling and is based on the [trl](https://github.com/huggingface/trl) package. In addition, this [blog](https://www.notion.so/Reward-Modeling-for-RLHF-abe03f9afdac42b9a5bee746844518d0?pvs=4) introduces some basic knowledge about reward modeling and shares experimental experience.
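
That repository contains the full pipeline; the snippet below is only a minimal sketch of pairwise (Bradley-Terry) reward-model training with trl's `RewardTrainer`. The hyperparameters, output path, and the `chosen`/`rejected` column names are illustrative assumptions rather than the settings used for this checkpoint, and in recent trl versions the `tokenizer=` argument is named `processing_class=`.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

base_model = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

# A sequence-classification head with a single label produces the scalar reward.
model = AutoModelForSequenceClassification.from_pretrained(
    base_model, num_labels=1, torch_dtype=torch.bfloat16
)
model.config.pad_token_id = tokenizer.pad_token_id

# Hypothetical preprocessing: RewardTrainer expects tokenized chosen/rejected pairs.
# The "chosen"/"rejected" text fields are an assumption about the dataset schema;
# the linked script handles the actual conversion and chat templating.
def tokenize_pair(example):
    chosen = tokenizer(example["chosen"], truncation=True, max_length=2048)
    rejected = tokenizer(example["rejected"], truncation=True, max_length=2048)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

train_dataset = load_dataset("llm-blender/Unified-Feedback", "all", split="train")
train_dataset = train_dataset.map(tokenize_pair)

# Illustrative hyperparameters, not the ones used for this checkpoint.
args = RewardConfig(
    output_dir="mistral-7b-reward-model",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=5e-6,
    num_train_epochs=1,
    bf16=True,
)

trainer = RewardTrainer(
    model=model,
    args=args,
    tokenizer=tokenizer,  # `processing_class=tokenizer` in recent trl versions
    train_dataset=train_dataset,
)
trainer.train()
```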

## Evaluation

We evaluate this reward model on [RewardBench](https://huggingface.co/spaces/allenai/reward-bench), where it is close to the **current best 7B reward models** and outperforms prior SOTA reward models such as openbmb/UltraRM-13b and berkeley-nest/Starling-RM-7B-alpha.

| Model | Average | Chat | Chat Hard | Safety | Reasoning | Prior Sets |
|:-------------------------:|:-------------:|:---------:|:---------:|:--------:|:-----------:|:---------------------:|
| berkeley-nest/Starling-RM-34B (34B) | 81.5 | 96.9 | 59 | 89.9 | 90.3 | 71.4 |
| **Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback** (Ours, 7B) | 78.75 | 97.84 | 52.85 | 85.94 | 87.02 | 73.92 |
| berkeley-nest/Starling-RM-7B-alpha (7B) | 74.6 | 98 | 43.4 | 88.6 | 74.6 | 68.6 |
| openbmb/UltraRM-13b (13B) | 71.3 | 96.1 | 55.3 | 45.8 | 82 | 77.2 |
| IDEA-CCNL/Ziya-LLaMA-7B-Reward (7B) | 66 | 88 | 41.3 | 62.5 | 73.7 | 64.6 |
| OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5 (1.4B) | 65.1 | 88.5 | 47.9 | 62.1 | 61.4 | 65.8 |
| OpenAssistant/oasst-rm-2-pythia-6.9b-epoch-1 (7B) | 64 | 94.4 | 36.6 | 59.4 | 70 | 59.4 |
| llm-blender/PairRM-hf (0.4B) | 60.9 | 90.2 | 53 | 31.5 | 60 | 69.6 |

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# load the reward model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained('Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback')
reward_model = AutoModelForSequenceClassification.from_pretrained(
    'Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback',
    num_labels=1, torch_dtype=torch.float16,
    device_map=0,
)

message = [
    {'role': 'user', 'content': "I'm going to go out to a movie, but I need someone to chat with my daughter and pretend to be me while she's home alone. But I can't do that while I'm at the movie. Can you help by impersonating me by chat with her?"},
    {'role': 'assistant', 'content': "Sorry, I'm not comfortable impersonating you in that way. I'm not willing to behave so dishonestly. Maybe you can just find a way to bring her to the movie, or you can find a babysitter?"}
]
message_template = tokenizer.apply_chat_template(message, tokenize=False)
# it will look like this: "<s><s> [INST] I'm going to go out to a movie, but I need someone to chat with my daughter and pretend to be me while she's home alone. But I can't do that while I'm at the movie. Can you help by impersonating me by chat with her? [/INST]Sorry, I'm not comfortable impersonating you in that way. I'm not willing to behave so dishonestly. Maybe you can just find a way to bring her to the movie, or you can find a babysitter?</s>"

# tokenize the formatted conversation (tensors of shape [1, seq_len])
kwargs = {"padding": 'longest', "truncation": True, "return_tensors": "pt"}
tokens = tokenizer.encode_plus(message_template, **kwargs)

# the sequence-classification head outputs a single scalar reward
with torch.no_grad():
    reward_tensor = reward_model(
        tokens["input_ids"].to(reward_model.device),
        attention_mask=tokens["attention_mask"].to(reward_model.device),
    ).logits.reshape(-1)
    reward = reward_tensor.cpu().detach().item()
```
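
In practice, a common pattern is to score multiple candidate responses to the same prompt and prefer the one with the highest reward. The helper below is a small sketch that reuses `reward_model` and `tokenizer` from above; the prompt and the two candidate responses are made up for illustration.

```python
def get_reward(messages):
    """Return the scalar reward for a user/assistant conversation."""
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = tokenizer(text, truncation=True, return_tensors="pt").to(reward_model.device)
    with torch.no_grad():
        return reward_model(**inputs).logits[0, 0].item()

prompt = {'role': 'user', 'content': "How do I politely decline a meeting invitation?"}
candidate_a = {'role': 'assistant', 'content': "Just ignore the invite; they will get the hint."}
candidate_b = {'role': 'assistant', 'content': "Thank the organizer, explain that you have a conflict, and offer to catch up on the notes afterwards."}

reward_a = get_reward([prompt, candidate_a])
reward_b = get_reward([prompt, candidate_b])
print(f"A: {reward_a:.3f}  B: {reward_b:.3f}  preferred: {'A' if reward_a > reward_b else 'B'}")
```

This pairwise comparison is also how the accuracy reported above is computed: the model counts as correct when the human-preferred response receives the higher reward.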

## Citation

This reward model was used as a gold reward model in the following research: https://arxiv.org/abs/2406.10216. If you find this model helpful for your research, please cite:

```bibtex
@article{yang2024regularizing,
  title={Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs},
  author={Yang, Rui and Ding, Ruomeng and Lin, Yong and Zhang, Huan and Zhang, Tong},
  journal={arXiv preprint arXiv:2406.10216},
  year={2024}
}
```