metadata

library_name: transformers
tags: []

Model Card for Model ID

Model Details

Model Description

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

Developed by: Haoxiang Wang
Model type: Sequence Classifier
Language(s) (NLP): English
License: Apache-2.0
Finetuned from model [optional]: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2

Model Sources [optional]

Repository: https://github.com/RLHFlow/directional-preference-alignment
Paper [ACL 2024]: Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards

How to Get Started with the Model

Use the code below to get started with the model.

The model has 10-dimensional output, corresponding to the following attributes from HelpSteer and UltraFeedback ['helpsteer-helpfulness', 'helpsteer-correctness', 'helpsteer-coherence', 'helpsteer-complexity', 'helpsteer-verbosity', 'ultrafeedback-overall_score', "ultrafeedback-instruction_following", "ultrafeedback-truthfulness", "ultrafeedback-honesty", "ultrafeedback-helpfulness"]

Here is a sample code that you can try

from transformers import AutoModelForSequenceClassification,AutoTokenizer
import torch
device = 'cuda'
path = "RLHFlow/RewardModel-Mistral-7B-for-DPA-v1"
rm = AutoModelForSequenceClassification.from_pretrained(path, trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(path) 

input_template = "[INST] You must read the following conversation carefully and rate the assistant's response from score 0-100 in these aspects: helpfulness, correctness, coherence, honesty, complexity, verbosity\n\nUser: {prompt}\n\nAssistant: {response} [/INST]"

# Use a sample from HelpSteer validation set
prompt = 'What are some synonyms for the word "beautiful"?'
response = "Nicely, Beautifully, Handsome, Stunning, Wonderful, Gorgeous, Pretty, Stunning, Elegant"

model_inputs = tokenizer(input_template.format(prompt=prompt, response=response), return_tensors="pt").to(device)
with torch.no_grad():
    score = rm(**model_inputs).logits.squeeze().cpu().float().numpy()

print(score)
# [68.99269  69.62718  76.23071  33.48785  35.853596 63.833366 55.58917 68.7175 59.552124 46.465595]

# Convert from our scale (0-100) to HelpSteer scale (0-4) 
helpsteer_rewards_pred = (score[:5]-10)/20
print(helpsteer_rewards_pred)
# [2.9496346 2.981359  3.3115356 1.1743925 1.2926798]
# The actual rewards from the HelpSteer dataset for this sample are [3,3,4,2,2]

Training

Citation

BibTeX: If you find this work useful to your research, please consider citing our paper

@inproceedings{wang2024arithmetic,
      title={Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards}, 
      author={Haoxiang Wang and Yong Lin and Wei Xiong and Rui Yang and Shizhe Diao and Shuang Qiu and Han Zhao and Tong Zhang},
      year={2024},
      booktitle={ACL},
}

Model Card Authors

Haoxiang Wang

Model Card Contact

hwang264@illinois.edu