	---
	license: mit
	datasets:
	- llm-blender/Unified-Feedback
	language:
	- en
	metrics:
	- accuracy
	library_name: transformers
	pipeline_tag: text-classification
	---

	## Introduction

	This reward model is fine-tuned from [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) on the [llm-blender/Unified-Feedback](https://huggingface.co/datasets/llm-blender/Unified-Feedback) dataset.
	It achieves an accuracy of 0.7740 on the test sets, making it a good proxy reward model for human preferences that can be used for aligning LLMs.
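
For reference, accuracy for a reward model is commonly measured as the fraction of preference pairs in which the chosen response receives a higher reward than the rejected one. The sketch below assumes that definition; the card does not spell out the exact evaluation procedure, and the function name `pairwise_accuracy` and the example rewards are hypothetical.

```python
from typing import Sequence, Tuple


def pairwise_accuracy(reward_pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of (chosen, rejected) reward pairs where chosen scores strictly higher.

    Assumption: this is the usual pairwise-ranking accuracy, not necessarily
    the exact metric used to obtain the number reported in this card.
    """
    if not reward_pairs:
        raise ValueError("reward_pairs must not be empty")
    correct = sum(1 for chosen, rejected in reward_pairs if chosen > rejected)
    return correct / len(reward_pairs)


# Hypothetical rewards for four preference pairs: (chosen, rejected)
example_pairs = [(1.2, -0.3), (0.4, 0.9), (2.0, 1.5), (-0.1, -0.8)]
print(pairwise_accuracy(example_pairs))  # 3 of the 4 pairs are ranked correctly
```

Ties are counted as errors here, which is a design choice for this sketch rather than something taken from the card.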

	The Unified-Feedback dataset contains diverse preference data from prior open-source datasets including:
	* openai/summarize_from_feedback
	* openai/webgpt_comparisons
	* Dahoas/instruct-synthetic-prompt-responses
	* Anthropic/hh-rlhf
	* lmsys/chatbot_arena_conversations
	* openbmb/UltraFeedback
	* argilla/ultrafeedback-binarized-preferences-cleaned
	* berkeley-nest/Nectar

	## Training Code and Blog

	Our training script has been merged into https://github.com/WeiXiongUST/RLHF-Reward-Modeling, which is based on the [trl](https://github.com/huggingface/trl) package. In addition, this [blog](https://www.notion.so/Reward-Modeling-for-RLHF-abe03f9afdac42b9a5bee746844518d0?pvs=4) introduces the basics of reward modeling and shares our experimental experience.


	## Evaluation
	We evaluate this reward model on the [reward model benchmark](https://huggingface.co/spaces/allenai/reward-bench). The results show that it is close to the current best 7B reward model and outperforms prior SOTA reward models such as openbmb/UltraRM-13b and berkeley-nest/Starling-RM-7B-alpha.

	| Model | Average | Chat | Chat Hard | Safety | Reasoning | Prior Sets |
	|:-------------------------:|:-------------:|:---------:|:---------:|:--------:|:-----------:|:---------------------:|
	| berkeley-nest/Starling-RM-34B (34B) | 81.5 | 96.9 | 59 | 89.9 | 90.3 | 71.4 |
	| Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback (Ours, 7B) | 78.75 | 97.84 | 52.85 | 85.94 | 87.02 | 73.92 |
	| berkeley-nest/Starling-RM-7B-alpha (7B) | 74.6 | 98 | 43.4 | 88.6 | 74.6 | 68.6 |
	| openbmb/UltraRM-13b (13B) | 71.3 | 96.1 | 55.3 | 45.8 | 82 | 77.2 |
	| IDEA-CCNL/Ziya-LLaMA-7B-Reward (7B) | 66 | 88 | 41.3 | 62.5 | 73.7 | 64.6 |
	| OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5 (1.4B) | 65.1 | 88.5 | 47.9 | 62.1 | 61.4 | 65.8 |
	| OpenAssistant/oasst-rm-2-pythia-6.9b-epoch-1 (7B) | 64 | 94.4 | 36.6 | 59.4 | 70 | 59.4 |
	| llm-blender/PairRM-hf (0.4B) | 60.9 | 90.2 | 53 | 31.5 | 60 | 69.6 |


	## Usage
	```
	import torch
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	# load model and tokenizer
	tokenizer = AutoTokenizer.from_pretrained('Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback')
	reward_model = AutoModelForSequenceClassification.from_pretrained(
	    'Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback',
	    num_labels=1, torch_dtype=torch.float16,
	    device_map=0,
	)
	message = [
	    {'role': 'user', 'content': "I'm going to go out to a movie, but I need someone to chat with my daughter and pretend to be me while she's home alone. But I can't do that while I'm at the movie. Can you help by impersonating me by chat with her?"},
	    {'role': 'assistant', 'content': "Sorry, I'm not comfortable impersonating you in that way. I'm not willing to behave so dishonestly. Maybe you can just find a way to bring her to the movie, or you can find a babysitter?"}
	]
	message_template = tokenizer.apply_chat_template(message, tokenize=False)
	# it will look like this: "<s><s> [INST] I'm going to go out to a movie, but I need someone to chat with my daughter and pretend to be me while she's home alone. But I can't do that while I'm at the movie. Can you help by impersonating me by chat with her? [/INST]Sorry, I'm not comfortable impersonating you in that way. I'm not willing to behave so dishonestly. Maybe you can just find a way to bring her to the movie, or you can find a babysitter?</s>"

	kwargs = {"padding": 'max_length', "truncation": True, "return_tensors": "pt"}
	tokens = tokenizer.encode_plus(message_template, **kwargs)

	with torch.no_grad():
	    # keep the batch dimension: the model expects inputs of shape (batch, seq_len)
	    reward_tensor = reward_model(tokens["input_ids"].to(reward_model.device), attention_mask=tokens["attention_mask"].to(reward_model.device)).logits.reshape(-1)
	reward = reward_tensor.cpu().detach().item()
	```
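
A single reward value is mostly useful for comparing responses to the same prompt. As a minimal sketch, assuming the common Bradley-Terry interpretation of reward differences (this card does not state the training objective explicitly), two scalar rewards can be turned into a preference probability. The function name `preference_probability` and the reward values are hypothetical.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Probability that response A is preferred over response B.

    Assumption: Bradley-Terry model, P(A > B) = sigmoid(reward_a - reward_b).
    The sigmoid is evaluated in a numerically stable way for large differences.
    """
    diff = reward_a - reward_b
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    exp_diff = math.exp(diff)
    return exp_diff / (1.0 + exp_diff)


# Hypothetical rewards for two responses to the same prompt
reward_a, reward_b = 1.5, -0.5
p_a_preferred = preference_probability(reward_a, reward_b)
print(f"P(A preferred over B) = {p_a_preferred:.4f}")
```

In practice, `reward_a` and `reward_b` would come from running the usage snippet once per candidate response.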


	## Citation
	This reward model is used as the gold reward model in the following research: https://arxiv.org/abs/2406.10216. If you find this model helpful for your research, please cite:
	```
	@article{yang2024regularizing,
	  title={Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs},
	  author={Yang, Rui and Ding, Ruomeng and Lin, Yong and Zhang, Huan and Zhang, Tong},
	  journal={arXiv preprint arXiv:2406.10216},
	  year={2024}
	}
	```