Model Card for HINT-lab/mistral-7b-hermes-crm-skywork

PPO-M (PPO with Calibrated Reward Modeling) is an RLHF algorithm designed to mitigate verbalized overconfidence in RLHF-trained Large Language Models. We calibrate the reward modeling process by augmenting the binary pairwise ranking dataset with explicit confidence scores and encourage the reward model to align its confidence levels with response quality. Please refer to our preprint (Taming Overconfidence in LLMs: Reward Calibration in RLHF) and our repository for more details.
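For intuition only, here is a minimal sketch of how a binary preference pair could be augmented with explicit confidence scores so that the usual pairwise ranking loss also teaches calibration. The `with_confidence` template, the confidence values, and the pair construction are assumptions made for this example, not the exact recipe from the paper; see the preprint and repository for the real dataset construction.

```python
# Illustrative only: one way to turn a binary preference pair into
# confidence-calibration pairs. The template and confidence values are
# assumptions for this sketch, not the paper's exact construction.
def with_confidence(response: str, confidence: int) -> str:
    # Hypothetical template for appending a verbalized confidence score.
    return f"{response}\nConfidence: {confidence}/10."

def augment_pair(prompt: str, chosen: str, rejected: str,
                 high: int = 9, low: int = 2) -> list[dict]:
    """Build extra pairs, trained with the standard pairwise ranking loss,
    that reward high confidence on the better response and low confidence
    on the worse one."""
    return [
        # Chosen (higher-quality) response: high confidence should score higher.
        {"prompt": prompt,
         "chosen": with_confidence(chosen, high),
         "rejected": with_confidence(chosen, low)},
        # Rejected (lower-quality) response: low confidence should score higher.
        {"prompt": prompt,
         "chosen": with_confidence(rejected, low),
         "rejected": with_confidence(rejected, high)},
    ]
```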

Model Details

Model Description

We train a calibrated reward model from HINT-lab/mistral-7b-hermes-rm-skywork on our https://huggingface.co/datasets/HINT-lab/calibration_preference_mixture_final-v0.1 dataset.
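As a quick start, the following is a minimal scoring sketch. It assumes the checkpoint loads as a standard sequence-classification reward model with a single scalar output and that the tokenizer ships a chat template; adjust if the actual configuration differs.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "HINT-lab/mistral-7b-hermes-crm-skywork"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris. Confidence: 9/10."},
]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", return_dict=True
).to(model.device)

with torch.no_grad():
    # Interpret the single classification logit as the calibrated reward.
    reward = model(**inputs).logits[0].item()
print(f"reward: {reward:.4f}")
```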

Model size: 7.11B parameters · Tensor type: BF16 · Format: Safetensors