Model Overview

RewardBench results:

Chat: 0.89
Chat Hard: 0.42
Safety: 0.68
Reasoning: 0.61

This model is a Bradley-Terry reward model, fine-tuned on the hendrydong/preference_700K dataset for 2 epochs with Llama-3.2-1B-Instruct as the base model. See the config for more details about the training hyperparameters.
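For reference, the preference data can be pulled straight from the Hugging Face Hub with the datasets library. The snippet below relies only on the dataset id given above; the exact column layout (preference datasets typically pair a preferred "chosen" response with a "rejected" one) should be read from the printed schema rather than assumed.

from datasets import load_dataset

# Load the preference dataset used for fine-tuning (dataset id taken from this card).
dataset = load_dataset("hendrydong/preference_700K")

# Print the available splits and their columns before assuming a schema.
print(dataset)
first_split = next(iter(dataset.values()))
print(first_split.column_names)
print(first_split[0])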


Fine-tuning was done with the fusion-bench CLI:

fusion_bench --config-name llama_full_finetune \
  fabric.loggers.name=llama_full_bradley_terry_rm \
  method=lm_finetune/bradley_terry_rm \
  method.dataloader_kwargs.batch_size=8 \
  method.accumulate_grad_batches=16 \
  method.lr_scheduler.min_lr=1e-7 \
  method.lr_scheduler.max_lr=5e-6 \
  method.lr_scheduler.warmup_steps=100 \
  method.optimizer.lr=0 \
  method.optimizer.weight_decay=0.001 \
  method.gradient_clip_val=1 \
  method.max_epochs=2 \
  method.checkpoint_save_interval=epoch \
  method.checkpoint_save_frequency=1 \
  modelpool=SeqenceClassificationModelPool/llama_preference700k

Training ran on 8 GPUs with a per-GPU batch size of 8 and gradient accumulation over 16 steps, so the effective batch size is 8 × 8 × 16 = 1024.
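The method=lm_finetune/bradley_terry_rm entry above selects a Bradley-Terry reward-modeling objective: the model assigns a scalar reward to the chosen and to the rejected response of each preference pair, and training maximizes the log-likelihood that the chosen response receives the higher reward, i.e. it minimizes -log sigmoid(r_chosen - r_rejected). A minimal PyTorch sketch of that loss (illustrative only, not the fusion-bench implementation):

import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry model: P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected).
    # The loss is the negative log-likelihood of the observed preference, averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy reward scores for a batch of 4 preference pairs.
chosen = torch.tensor([1.2, 0.3, -0.5, 2.0])
rejected = torch.tensor([0.4, -0.1, -0.2, 1.5])
print(bradley_terry_loss(chosen, rejected))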

Model size: 1.24B params
Tensor type: BF16
Format: Safetensors
Inference Examples
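This is a sequence-classification reward model, so it can be scored locally with transformers. The sketch below assumes the checkpoint loads as an AutoModelForSequenceClassification with a single reward logit (the usual layout for Bradley-Terry reward models) and that the tokenizer provides a chat template; the example conversation is purely illustrative.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "fusion-bench/Llama-3.2-1B-Instruct_Bradly-Terry-RM_Preference-700k"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# Score a single (prompt, response) conversation; higher scores mean "more preferred".
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")

with torch.no_grad():
    reward = model(input_ids).logits[0].item()
print(f"reward: {reward:.4f}")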
