Model Card for Model ID

PPO-M (PPO with Calibrated Reward Modeling) is an RLHF algorithm to mitigate verbalized overconfidence in RLHF-trained Large Language Models. PPO-M calibrates the reward modeling process by augmenting the binary pairwise ranking dataset with explicit confidence scores, and encourages the reward model to align confidence levels with response quality. Please refer to our preprint (Taming Overconfidence in LLMs: Reward Calibration in RLHF) and repo for more details.

Model Details

Model Description

We train teknium/OpenHermes-2.5-Mistral-7B on our HINT-lab/prompt-collections-final-v0.3 with our calibrated reward model HINT-lab/mistral-7b-hermes-crm-skywork.

Developed by: Jixuan Leng, Chengsong Huang, Banghua Zhu, Jiaxin Huang
Finetuned from model: teknium/OpenHermes-2.5-Mistral-7B

Model Sources [optional]

Repository: Our repo
Paper: Taming Overconfidence in LLMs: Reward Calibration in RLHF

HINT-lab
/

mistral-7b-ppo-m-hermes

Model Card for Model ID

Model Details

Model Description

Model Sources [optional]

Model tree for HINT-lab/mistral-7b-ppo-m-hermes

Collection including HINT-lab/mistral-7b-ppo-m-hermes

Calibration