---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- Tulu3
- Smollm
- SLMs
- Small
- Huggingface
- Allenai
- Reward Model
- RLVR
- RM
- Reward
base_model:
- SultanR/SmolTulu-1.7b-Instruct
datasets:
- allenai/llama-3.1-tulu-3-8b-preference-mixture
pipeline_tag: text-classification
---
# SmolLM2 1.7b Reward Model for RLVR Through Tulu 3!
![SmolTulu Banner](smoltulubanner.png)
SmolTulu-1.7b-RM is the reward model used to initialize the value function for [SmolTulu-1.7b-Reinforced](https://huggingface.co/SultanR/SmolTulu-1.7b-Reinforced), which leverages [AllenAI's Tulu 3 post-training pipeline](https://arxiv.org/abs/2411.15124) for reinforcement learning with verifiable rewards (RLVR). This model was trained using the same preference datasets and methodology as outlined in the Tulu 3 paper, adapted for the smaller model size.
## Evaluation
Evaluation results comparing SmolTulu-1.7b-RM against the Tulu 3 8b reward model on standard reward model benchmarks (RB = RewardBench, UFB = UltraFeedback):
| Metric | SmolTulu-1.7b-RM | Tulu 3 8b RM |
|:-----------|:----------------:|:-------------:|
| RB Chat | *94.13* | **96.27** |
| RB Chat Hard | 43.64 | **55.92** |
| RB Safety | *75.54* | **84.05** |
| RB Reasoning | *68.01* | **76.50** |
| RB Average | *72.43* | **81.34** |
| UFB | *73.17* | **77.34** |
As expected, the 1.7B reward model scores below the larger 8B model on every benchmark. The gap is smallest on RB Chat (94.13 vs. 96.27) and largest on RB Chat Hard (43.64 vs. 55.92), so the model is most competitive at chat quality assessment.
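To make the comparison concrete, the per-benchmark gap can be computed directly from the table. The snippet below is only a small arithmetic sketch: the scores are copied from the table, and the variable names are illustrative.

```python
# Scores copied from the evaluation table: (SmolTulu-1.7b-RM, Tulu 3 8b RM).
scores = {
    "RB Chat": (94.13, 96.27),
    "RB Chat Hard": (43.64, 55.92),
    "RB Safety": (75.54, 84.05),
    "RB Reasoning": (68.01, 76.50),
    "RB Average": (72.43, 81.34),
    "UFB": (73.17, 77.34),
}

# Gap in points: 8B score minus 1.7B score, rounded to two decimals.
gaps = {metric: round(big - small, 2) for metric, (small, big) in scores.items()}

smallest_gap = min(gaps, key=gaps.get)
largest_gap = max(gaps, key=gaps.get)

for metric, gap in gaps.items():
    print(f"{metric:<13} gap: {gap:5.2f}")
print(f"smallest gap: {smallest_gap}, largest gap: {largest_gap}")
```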
## Usage
The reward model can be used with the transformers library:
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "SultanR/SmolTulu-1.7b-RM"
device = "cuda"  # use "cpu" if no GPU is available

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint).to(device)
model.eval()  # inference mode (disables dropout)

# Compute a scalar reward for a prompt/completion pair
def get_reward(prompt, completion):
    inputs = tokenizer(prompt + completion, return_tensors="pt").to(device)
    with torch.no_grad():  # no gradients needed for scoring
        reward = model(**inputs).logits[0].item()
    return reward
```
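A common use of a reward model is picking the best of several candidate completions. The helper below is a minimal sketch of that pattern, not part of the released code: `rank_completions` and `toy_reward` are hypothetical names, and `toy_reward` is a stand-in scorer so the example runs without downloading the checkpoint.

```python
from typing import Callable, List, Tuple


def rank_completions(
    prompt: str,
    completions: List[str],
    reward_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score each completion with reward_fn and return (completion, score) pairs, best first."""
    scored = [(completion, reward_fn(prompt, completion)) for completion in completions]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-in scorer (longer completion = higher score) so the
# sketch runs offline. Pass get_reward instead to score with the real model.
def toy_reward(prompt: str, completion: str) -> float:
    return float(len(completion))


ranked = rank_completions("What is 2 + 2?", ["4", "2 + 2 equals 4."], toy_reward)
best_completion, best_score = ranked[0]
print(best_completion, best_score)
```

Passing `get_reward` as `reward_fn` ranks the candidates with SmolTulu-1.7b-RM itself. Only the relative order of the scores is used here, so the absolute scale of the rewards does not matter.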
## Training Details
The reward model was trained with the following settings:
- Base model: SmolTulu-1.7b-Instruct
- Mixed precision: bfloat16
- Learning rate: 4e-5
- Effective batch size: 4
- Maximum sequence length: 2048 tokens
- Maximum prompt length: 2048 tokens
- Training epochs: 1
- Training data: Tulu 3 8B preference mixture
- Evaluation data: UltraFeedback (cleaned)
- Gradient checkpointing enabled
- DeepSpeed ZeRO-3 for memory optimization
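The settings above do not name the training objective. Reward models trained on chosen/rejected preference pairs are commonly fit with a Bradley-Terry pairwise loss, and the sketch below assumes that objective applies here. It is a plain-Python illustration with hypothetical function names, not the training code used for this model.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)) that avoids overflow for large |x|."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_preference_loss(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over a batch of preference pairs."""
    if len(chosen) != len(rejected) or len(chosen) == 0:
        raise ValueError("chosen and rejected must be non-empty and the same length")
    losses = [-log_sigmoid(c - r) for c, r in zip(chosen, rejected)]
    return sum(losses) / len(losses)


# Assumed example rewards for two preference pairs (illustrative values only).
loss = pairwise_preference_loss([1.5, 0.2], [0.5, 0.9])
print(f"pairwise loss: {loss:.4f}")
```

The loss shrinks as the chosen completion's reward exceeds the rejected one's, and equals log 2 when the two rewards are identical.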
## Citation
```
@misc{alrashed2024smoltuluhigherlearningrate,
title={SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs},
author={Sultan Alrashed},
year={2024},
eprint={2412.08347},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.08347},
}
```
The training methodology follows the Tulu 3 paper:
```
@article{lambert2024tulu3,
title={TÜLU 3: Pushing Frontiers in Open Language Model Post-Training},
author={Lambert, Nathan and Morrison, Jacob and Pyatkin, Valentina and others},
year={2024},
journal={arXiv preprint arXiv:2411.15124}
}
```