metadata
license: apache-2.0
language:
  - en
library_name: transformers
tags:
  - Tulu3
  - Smollm
  - SLMs
  - Small
  - Huggingface
  - Allenai
  - Reward Model
  - RLVR
  - RM
  - Reward
base_model:
  - SultanR/SmolTulu-1.7b-Instruct
datasets:
  - allenai/llama-3.1-tulu-3-8b-preference-mixture
pipeline_tag: text-classification

SmolLM2 1.7b Reward Model for RLVR Through Tulu 3!


SmolTulu-1.7b-RM is the reward model used to initialize the value function for SmolTulu-1.7b-Reinforced, which leverages AllenAI's Tulu 3 post-training pipeline for reinforcement learning with verifiable rewards (RLVR). This model was trained using the same preference datasets and methodology as outlined in the Tulu 3 paper, adapted for the smaller model size.

Evaluation

Evaluation results comparing SmolTulu-1.7b-RM against the Tulu 3 8b reward model on standard reward model benchmarks:

| Metric | SmolTulu-1.7b-RM | Tulu 3 8b RM |
|---|---|---|
| RB Chat | 94.13 | 96.27 |
| RB Chat Hard | 43.64 | 55.92 |
| RB Safety | 75.54 | 84.05 |
| RB Reasoning | 68.01 | 76.50 |
| RB Average | 72.43 | 81.34 |
| UFB | 73.17 | 77.34 |

As expected, the 1.7B reward model trails the larger 8B model on every metric. It stays closest on chat quality assessment (RB Chat: 94.13 vs. 96.27) and on UFB (73.17 vs. 77.34); the widest gap is on RB Chat Hard (43.64 vs. 55.92).
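
For context, pairwise preference benchmarks of this kind are typically scored as the percentage of (chosen, rejected) pairs in which the reward model gives the chosen response the higher score. The sketch below shows that metric under this assumption; the function name `pairwise_accuracy` and the sample rewards are hypothetical and are not taken from the evaluation above.

```python
from typing import Sequence


def pairwise_accuracy(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Percentage of pairs where the chosen response outscores the rejected one."""
    if len(chosen) != len(rejected):
        raise ValueError("chosen and rejected must have the same length")
    if not chosen:
        raise ValueError("at least one pair is required")
    wins = sum(c > r for c, r in zip(chosen, rejected))
    return 100.0 * wins / len(chosen)


# Made-up rewards for four preference pairs (illustration only).
chosen_rewards = [1.2, 0.3, -0.5, 2.0]
rejected_rewards = [0.4, 0.9, -1.1, 1.5]
print(pairwise_accuracy(chosen_rewards, rejected_rewards))
```

Under a metric of this shape, a scorer that ranks pairs at random would land near 50, which is a useful reference point when reading the per-category numbers.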

Usage

The reward model can be used with the transformers library:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "SultanR/SmolTulu-1.7b-RM"
device = "cuda"  # use "cpu" to run without a GPU

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint).to(device)
model.eval()  # disable dropout so scores are deterministic

# Compute a scalar reward for a completion given its prompt
def get_reward(prompt, completion):
    # Truncate to the 2,048-token limit used during training
    inputs = tokenizer(prompt + completion, return_tensors="pt", truncation=True, max_length=2048).to(device)
    with torch.no_grad():  # inference only; no gradients needed
        reward = model(**inputs).logits[0].item()
    return reward
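
One common use of a reward model is best-of-n selection: score several candidate completions for the same prompt and keep the highest-scoring one. The sketch below assumes a scoring callable with the same signature as `get_reward(prompt, completion)`. The names `rank_completions` and `length_scorer` are hypothetical, and the stand-in scorer exists only so the snippet runs without loading the model.

```python
from typing import Callable, List, Tuple


def rank_completions(
    prompt: str,
    completions: List[str],
    reward_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Return (completion, reward) pairs sorted from highest to lowest reward."""
    scored = [(c, reward_fn(prompt, c)) for c in completions]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def length_scorer(prompt: str, completion: str) -> float:
    # Stand-in scorer for illustration only; pass get_reward in real use.
    return float(len(completion))


ranked = rank_completions("What is 2 + 2?", ["4", "2 + 2 = 4", "five"], length_scorer)
best_completion, best_reward = ranked[0]
print(best_completion, best_reward)
```

In real use, replace `length_scorer` with `get_reward` so the ranking reflects the reward model's scores.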

Training Details

The reward model was trained using:

  • Learning rate: 3 × 10⁻⁶
  • Gradient norm threshold: 1.0
  • Learning rate schedule: Linear
  • Batch size (effective): 256
  • Max token length: 2,048
  • Number of epochs: 1
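
To make the schedule concrete, here is a sketch of a linear learning-rate decay from the peak of 3 × 10⁻⁶, together with the number of optimizer steps implied by an effective batch size of 256 over one epoch. The warmup length and the dataset size are not stated above, so the values used here (no warmup, a placeholder example count) are assumptions, and `linear_lr` and `steps_per_epoch` are hypothetical helper names.

```python
import math

PEAK_LR = 3e-6          # from the list above
EFFECTIVE_BATCH = 256   # from the list above


def linear_lr(step: int, total_steps: int, warmup_steps: int = 0) -> float:
    """Optional linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    remaining = total_steps - step
    return PEAK_LR * max(0.0, remaining / max(1, total_steps - warmup_steps))


def steps_per_epoch(num_examples: int) -> int:
    """Optimizer steps for one pass over the data at the effective batch size."""
    return math.ceil(num_examples / EFFECTIVE_BATCH)


# ASSUMPTION: 100_000 preference pairs is a placeholder, not the real dataset size.
total = steps_per_epoch(100_000)
print(total, linear_lr(0, total), linear_lr(total // 2, total), linear_lr(total, total))
```

With a single epoch, the schedule spans exactly one pass over the data, so the learning rate reaches zero as the last batch is consumed.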

Citation

@misc{alrashed2024smoltuluhigherlearningrate,
      title={SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs}, 
      author={Sultan Alrashed},
      year={2024},
      eprint={2412.08347},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.08347}, 
}

The training methodology follows the Tulu 3 paper:

@article{lambert2024tulu3,
  title={TÜLU 3: Pushing Frontiers in Open Language Model Post-Training},
  author={Lambert, Nathan and Morrison, Jacob and Pyatkin, Valentina and others},
  year={2024},
  journal={arXiv preprint arXiv:2411.15124}
}