Critique-out-Loud Reward Models (CLoud)

CLoud

Introduction

Critique-out-Loud reward models are reward models that can reason explicitly about the quality of an input through producing Chain-of-Thought like critiques of an input before predicting a reward. In classic reward model training, the reward model is trained as a reward head initialized on top of the base LLM. Without LM capabilities, classic reward models act as encoders and must predict rewards within a single forward pass through the model, meaning reasoning must happen implicitly. In contrast, CLoud reward models are trained to both produce explicit reasoning about quality and to score based on these critique reasoning traces. CLoud reward models lead to large gains for pairwise preference modeling on RewardBench, and also lead to large gains in win rate when used as the scoring model in Best-of-N sampling on ArenaHard.

Todo

Release models and inference examples
Post example training run logs
Add ArenaHard evaluation code
Add VLLM support for inference

Introduction
Todo
Table of Contents
Setup
Model Weights
Inference
Dataset
Training
- CLoud Training
- Classic Training
Evaluation
Citation

Setup

git clone https://github.com/zankner/CLoud
cd CLoud
pip install -e .

Optional: base docker image used during development mosaicml/pytorch:2.3.0_cu121-python3.11-ubuntu20.04

Model Weights

Base Model	RM Type	Hugging Face Repo
Llama3-8B	Classic	ankner/Llama3-8B-Classic-RM
Llama3-8B	CLoud	ankner/Llama3-8B-CLoud-RM
Llama3-70B	Classic	ankner/Llama3-70B-Classic-RM
Llama3-70B	CLoud	ankner/Llama3-70B-CLoud-RM

Inference

We provide a gradio demo which can be run as follows: gradio cloud/demo.py. By default this will demo ankner/Llama3-8B-CLoud-RM, but you can change the model loaded in the script.

If you want to perform inference on your own data, please refer to the following example:

from cloud.model import CLoudRewardModel
from transformers import AutoTokenizer

model_name = "ankner/Llama3-8B-Cloud-RM" # Replace with RM trained with this repo
model = CLoudRewardModel.from_pretrained(model_name, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

user_prompt = [
  "Write me a story", 
  "What is the capital of the moon?"
]
assistant_response = [
  "No I don't want to do that.", 
  "Since the moon is made out of cheese, the capital is mozzerella."
]

rewards, critiques = model.predict_reward(user_prompt, assistant_response, tokenizer)

for reward, critique in zip(rewards, critiques):
    print("Critique:")
    print(critique)
    print("Reward:")
    print(reward)
    print("=" * 100)

Dataset

We provide code to reconstruct the datasets used in the paper. There are two datasets to build for training, one with oracle critiques meant to simmulate human feedback and one with self-generated critiques. To build the oracle critique dataset run:

python cloud/data/build_official_ultra_llama.py --mode oracle

To build the self-generated critique dataset run:

python cloud/data/build_official_ultra_llama.py --mode self-gen --model-size {model-size}

where {model-size} is the size of the model you are using (e.g. 8b, 70b).

Build your own dataset from scratch

Build prompts - You can use any dataset you like as long as it has prompt and id columns. If you would like to build prompts from UltraFeedback and UltraInteract as we do in the paper run:
```
python cloud/data/build_ultra_prompts.py --save-name {name-to-save-as}
```

Build chosen / rejected responses

python cloud/data/build_judgements.py --gen-model {model-generating-responses} --judge-model {model-judging-responses} --base-dataset {path-to-prompt-dataset} --save-name {name-to-save-as}

The above command requires a hosted generating and judging model. To host the models using vllm run:

python -m vllm.entrypoints.openai.api_server --model {path-to-gen/judge-model} --dtype bfloat16 --tensor-parallel-size {num-gpus} --port {8000 for gen and 8001 for judge}

Build critiques
```
python cloud/data/generate_oracle_critiques.py --judge-model {model-generating-critiques} --base-dataset {path-to-responses-dataset} --save-name {name-to-save-as}
```
Again, this command assumes a hosted critique model. To host the critique model you can use the above vllm command (This time just use port 8000 for the judge model).

Training

Before training, you must run the setup script and build the datasets. The training configs are located in the cloud/train/configs/ folder. We have already set the optimal hyperparameters that we found for each model as reported in the paper. The only parameter that needs to be set is the variables.micro_batch_size parameter, in accordance with your GPU memory.

If you want to log the training runs, uncomment the loggers section in the config and fill in your wandb settings.

Checkpoints will be saved throughout training to the save_folder parameter, which is ckpts/${variables.run_name} by default. The final checkpoint will contain a folder hf where the huggingface model is saved.

Warning: The below training scripts for both CLoud and Classic prefill the dataset names to be the datasets we release. If you would like to train on your own dataset, you will need to follow the directions to build said dataset in the dataset section and change the variables.dataset_path parameter in the training configs.

CLoud Training

The first step is to finetune the base model to produce critiques:
```
composer -n {num_gpus} cloud/train/train.py cloud/train/configs/{model_size}_critique_sft.yaml
```
Replace {model_size} with the size of the model you are training (e.g. 8b, 70b).

(Optional if you want to use the self-generated data we release) After the critique SFT model is trained, you need to regenerate the dataset with the critiques. To do so, you first need to serve the critique SFT model. To do so locally using vllm run:

python -m vllm.entrypoints.openai.api_server --model {path-to-critique-sft-model} --dtype bfloat16 --tensor-parallel-size {num-gpus}

Then run the data building script:

python cloud/data/generate_self_critiques.py --model {path-to-critique-sft-model} --base-dataset {path-to-base-dataset} --upload-name {path-to-save-dataset}

After building the self-generated dataset, we can train the CLoud model:

composer -n {num_gpus} cloud/train/train.py cloud/train/configs/{model_size}_cloud.yaml

Classic Training

To train a classic reward model, you can use the following command:

composer -n {num_gpus} cloud/train/train.py cloud/train/configs/{model_size}_classic.yaml

Evaluation

To run evaluation for a given benchmark run the following command:

python cloud/eval/eval.py --model-path {path-to-model} --benchmark {benchmark-name}

Currently, we only support the RewardBench benchmark.

Citation

If you found our work useful please consider citing it:

@misc{ankner2024critiqueoutloudrewardmodels,
      title={Critique-out-Loud Reward Models}, 
      author={Zachary Ankner and Mansheej Paul and Brandon Cui and Jonathan D. Chang and Prithviraj Ammanabrolu},
      year={2024},
      eprint={2408.11791},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2408.11791}, 
}

ankner
/

Llama3-8B-CLoud-RM