---
library_name: transformers
tags: []
---
# Model Card for PPO-M
<!-- Provide a quick summary of what the model is/does. -->
**PPO-M** (**PPO with Calibrated Reward Modeling**) is an RLHF algorithm designed to mitigate verbalized overconfidence in RLHF-trained large language models.
PPO-M calibrates the reward-modeling stage by augmenting the binary pairwise ranking dataset with explicit confidence scores, encouraging the
reward model to align its confidence with response quality.
Please refer to our preprint ([Taming Overconfidence in LLMs: Reward Calibration in RLHF](https://arxiv.org/abs/2410.09724)) and [repo](https://github.com/SeanLeng1/Reward-Calibration) for more details.
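
To make the idea concrete, the snippet below sketches one way a confidence-calibrated pairwise ranking objective could look. This is an illustrative sketch only, not the authors' implementation: the confidence template, the example confidence scores, and the combination of terms are assumptions; see the paper and repository for the exact objective and training setup.

```python
import torch.nn.functional as F

def with_confidence(response: str, score: int) -> str:
    # Hypothetical template: append a verbalized confidence statement to a response.
    return f"{response}\nConfidence: {score}/10."

def calibrated_ranking_loss(rm, prompt, chosen, rejected, high=9, low=2):
    """Pairwise Bradley-Terry loss plus two calibration terms that prefer high
    verbalized confidence on the chosen response and low verbalized confidence
    on the rejected one. `rm(prompt, response)` is assumed to return a scalar
    reward tensor (sketch only)."""
    # Standard preference term: the chosen response should outscore the rejected one.
    loss = -F.logsigmoid(rm(prompt, chosen) - rm(prompt, rejected))
    # Calibration: chosen response with high confidence should outscore it with low confidence.
    loss = loss - F.logsigmoid(
        rm(prompt, with_confidence(chosen, high)) - rm(prompt, with_confidence(chosen, low))
    )
    # Calibration: rejected response with low confidence should outscore it with high confidence.
    loss = loss - F.logsigmoid(
        rm(prompt, with_confidence(rejected, low)) - rm(prompt, with_confidence(rejected, high))
    )
    return loss
```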
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
We train [OpenRLHF/Llama-3-8b-sft-mixture](https://huggingface.co/OpenRLHF/Llama-3-8b-sft-mixture) with PPO-M on our prompt dataset [HINT-lab/prompt-collections-final-v0.3](https://huggingface.co/datasets/HINT-lab/prompt-collections-final-v0.3),
using our calibrated reward model [HINT-lab/llama3-8b-crm-final-v0.1](https://huggingface.co/HINT-lab/llama3-8b-crm-final-v0.1).
- **Developed by:** Jixuan Leng, Chengsong Huang, Banghua Zhu, Jiaxin Huang
- **Finetuned from model:** [OpenRLHF/Llama-3-8b-sft-mixture](https://huggingface.co/OpenRLHF/Llama-3-8b-sft-mixture)
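
Below is a minimal inference sketch with the `transformers` library. The repository id is a placeholder (replace it with this model's Hugging Face id), and the example assumes the tokenizer provides a chat template; the prompt is only an example.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<this-model-repo-id>"  # placeholder: replace with this model's Hugging Face id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "What is the capital of France? State your confidence from 0 to 10."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```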
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** [Our repo](https://github.com/SeanLeng1/Reward-Calibration)
- **Paper:** [Taming Overconfidence in LLMs: Reward Calibration in RLHF](https://arxiv.org/abs/2410.09724)