dense_reward_trainer_final_opt__NumTrainEpochs2_SaveStrategiesepoch_reward_modeling_anthropic_hh

This model is a fine-tuned version of facebook/opt-1.3b for pairwise reward modeling; the checkpoint name points to the Anthropic HH (Helpful and Harmless) preference dataset, though the training script did not record it. It achieves the following results on the evaluation set (entries prefixed with "Train" are final training-set metrics):

  • Loss: 0.6907
  • Accuracy: 0.6825
  • Train Rewards/chosen: -1.8222
  • Train Rewards/rejected: -3.6005
  • Train Rewards/accuracies: 0.8138
  • Train Rewards/margins: 1.7783
  • Train Nll Loss: 2.4635
  • Train Logit Total Loss: 0.4241
  • Train Logit Loss: 0.4035
  • Rewards/chosen: -2.0106
  • Rewards/rejected: -3.0639
  • Rewards/accuracies: 0.6657
  • Rewards/margins: 1.0533
  • Nll Loss: 2.4906
  • Logit Total Loss: 0.6892
  • Logit Loss: 0.6710
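
Since the card provides no usage snippet, here is a minimal scoring sketch. It assumes the checkpoint carries a scalar sequence-classification head (num_labels=1), as TRL-style reward trainers produce; if the actual head differs, load the matching architecture instead. The prompt strings are illustrative.

```python
# Minimal sketch: score a chosen/rejected pair with this reward model.
# Assumption: the checkpoint has a scalar sequence-classification head
# (num_labels=1); adjust the architecture if the real head differs.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "cj453/dense_reward_trainer_final_opt__NumTrainEpochs2_SaveStrategiesepoch_reward_modeling_anthropic_hh"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
model.eval()

prompt = "Human: How do I bake bread?\n\nAssistant:"
chosen = prompt + " Mix flour, water, yeast, and salt, knead, proof, then bake at 230C."
rejected = prompt + " I have no idea."

scores = []
with torch.no_grad():
    for text in (chosen, rejected):
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        scores.append(model(**inputs).logits[0, 0].item())  # scalar reward

print(f"chosen: {scores[0]:.3f}  rejected: {scores[1]:.3f}")
print("prefers chosen" if scores[0] > scores[1] else "prefers rejected")
```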

Model description

Based on the checkpoint name and the metrics above, this is a pairwise reward model: facebook/opt-1.3b with a scalar scoring head, trained so that preferred (chosen) responses receive higher scores than rejected ones. No further description is provided by the author.

Intended uses & limitations

Not documented. Reward models of this kind are typically used to rank candidate responses or to supply reward signals in RLHF-style fine-tuning; any such use of this checkpoint should be treated as unvalidated.

Training and evaluation data

Not documented beyond the model name, which points to the Anthropic HH (anthropic_hh) human-preference pairs; the exact split and preprocessing are unknown.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1.41e-05
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 2
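
For reference, the list above maps one-to-one onto transformers.TrainingArguments. The sketch below is illustrative wiring, not the author's training script; output_dir is a hypothetical name, and save_strategy="epoch" is inferred from the model name.

```python
# Sketch: the hyperparameters above expressed as transformers.TrainingArguments.
# The surrounding training script is not published; treat this as illustrative.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="dense_reward_trainer_final_opt",  # hypothetical
    learning_rate=1.41e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,   # effective train batch size: 4 * 4 = 16
    num_train_epochs=2,
    lr_scheduler_type="linear",
    seed=42,
    save_strategy="epoch",           # inferred from "SaveStrategiesepoch" in the name
    adam_beta1=0.9,                  # Adam betas=(0.9, 0.999)
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```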

Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Nll Loss | Logit Total Loss | Logit Loss |
|:-------------:|:-----:|:----:|:---------------:|:--------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------:|:----------------:|:----------:|
| 0.7169 | 0.11 | 100 | 0.6921 | 0.5959 | -1.7367 | -1.8694 | 0.5855 | 0.1326 | 3.0057 | 0.6899 | 0.6665 |
| 0.7082 | 0.23 | 200 | 0.6978 | 0.5938 | -3.3995 | -3.5818 | 0.5802 | 0.1823 | 3.2073 | 0.6959 | 0.6706 |
| 0.6744 | 0.34 | 300 | 0.6681 | 0.6062 | -2.3751 | -2.7036 | 0.5956 | 0.3285 | 2.7061 | 0.6656 | 0.6450 |
| 0.6154 | 0.46 | 400 | 0.6490 | 0.6433 | -1.5136 | -1.9306 | 0.6310 | 0.4171 | 2.8065 | 0.6474 | 0.6256 |
| 0.6405 | 0.57 | 500 | 0.6573 | 0.6351 | -1.4041 | -1.8257 | 0.6226 | 0.4216 | 2.6995 | 0.6577 | 0.6371 |
| 0.6284 | 0.69 | 600 | 0.6448 | 0.6557 | -2.3215 | -2.7092 | 0.6440 | 0.3877 | 2.6968 | 0.6433 | 0.6225 |
| 0.6399 | 0.8 | 700 | 0.6454 | 0.6227 | -2.0755 | -2.4642 | 0.6125 | 0.3887 | 2.8089 | 0.6435 | 0.6217 |
| 0.669 | 0.91 | 800 | 0.6385 | 0.6474 | -1.7053 | -2.1240 | 0.6379 | 0.4187 | 2.6687 | 0.6350 | 0.6145 |
| 0.4788 | 1.03 | 900 | 0.6636 | 0.6577 | -2.1522 | -2.8529 | 0.6435 | 0.7007 | 2.5723 | 0.6620 | 0.6427 |
| 0.4529 | 1.14 | 1000 | 0.6938 | 0.6577 | -1.1456 | -2.0167 | 0.6488 | 0.8712 | 2.5628 | 0.6897 | 0.6708 |
| 0.4378 | 1.26 | 1100 | 0.7319 | 0.6536 | -1.4771 | -2.4829 | 0.6427 | 1.0058 | 2.5495 | 0.7282 | 0.7098 |
| 0.4496 | 1.37 | 1200 | 0.7034 | 0.6660 | -2.6046 | -3.5817 | 0.6524 | 0.9771 | 2.5483 | 0.7006 | 0.6819 |
| 0.3539 | 1.49 | 1300 | 0.7023 | 0.6598 | -2.2279 | -3.2122 | 0.6516 | 0.9842 | 2.5144 | 0.6963 | 0.6780 |
| 0.5494 | 1.6 | 1400 | 0.6784 | 0.6536 | -2.3300 | -3.3018 | 0.6435 | 0.9718 | 2.4946 | 0.6749 | 0.6565 |
| 0.4075 | 1.71 | 1500 | 0.6935 | 0.6948 | -0.9575 | -2.0411 | 0.6843 | 1.0836 | 2.4900 | 0.6884 | 0.6702 |
| 0.4789 | 1.83 | 1600 | 0.6941 | 0.6598 | -2.1270 | -3.1756 | 0.6496 | 1.0487 | 2.5026 | 0.6924 | 0.6741 |
| 0.4093 | 1.94 | 1700 | 0.6907 | 0.6825 | -2.0106 | -3.0639 | 0.6657 | 1.0533 | 2.4906 | 0.6892 | 0.6710 |
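
In this table, Rewards/accuracies is the fraction of evaluation pairs where the chosen response outscores the rejected one, and Rewards/margins is the mean score gap. The separate Nll Loss and Logit Total Loss columns suggest the trainer mixes a language-modeling NLL term into the pairwise ("logit") loss, though the mixing weight is not documented. Below is a minimal sketch of the metric bookkeeping, assuming the standard Bradley-Terry pairwise objective; the exact definitions in this "dense reward" trainer may differ.

```python
# Sketch: how the Rewards/* columns relate to per-pair scores, assuming the
# common pairwise objective -log(sigmoid(r_chosen - r_rejected)).
import torch
import torch.nn.functional as F

def pairwise_metrics(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> dict:
    """r_chosen / r_rejected: scalar rewards for a batch of preference pairs."""
    margins = r_chosen - r_rejected
    return {
        "logit_loss": (-F.logsigmoid(margins)).mean().item(),  # pairwise loss
        "rewards/chosen": r_chosen.mean().item(),
        "rewards/rejected": r_rejected.mean().item(),
        "rewards/margins": margins.mean().item(),
        "rewards/accuracies": (margins > 0).float().mean().item(),
    }

# Toy batch of three pairs, just to show the bookkeeping:
print(pairwise_metrics(torch.tensor([-1.8, -2.1, -0.9]),
                       torch.tensor([-3.2, -1.9, -2.4])))
```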

Framework versions

  • Transformers 4.37.2
  • Pytorch 2.4.0+cu121
  • Datasets 2.21.0
  • Tokenizers 0.15.2